Linux 6.13-rc3 many different panics in Xen PV dom0

* Linux 6.13-rc3 many different panics in Xen PV dom0
@ 2024-12-19 16:14 Marek Marczykowski-Górecki
  2024-12-20  1:48 ` Marek Marczykowski-Górecki
  2025-01-02 10:20 ` Jürgen Groß
  0 siblings, 2 replies; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-12-19 16:14 UTC
  To: xen-devel

Hi,

Linux 6.13-rc3 crashes on boot as a Xen PV dom0, as shown below, most of
the time. Sometimes (rarely) it manages to stay alive. Below I'm pasting
a few of the crashes that look distinctly different; if you follow the
links, you can find more of them. IMHO it looks like a memory corruption
bug somewhere. I also tested Linux 6.13-rc2 before, and it had a very
similar issue.

The traces below are all from nested virt (Xen inside KVM); tests with
Xen directly on the hardware are still in progress. But -rc2 failed all
of them too, so if it's the same issue, I guess they will look similar.

Who should I CC here? The failures are all over the place... linux-mm?

[    1.743728] ------------[ cut here ]------------
[    1.744911] WARNING: CPU: 0 PID: 105 at arch/x86/xen/multicalls.c:188 xen_mc_flush+0x226/0x4f0
[    1.746474] Modules linked in:
[    1.747093] CPU: 0 UID: 0 PID: 105 Comm: modprobe Not tainted 6.13.0-0.rc3.2.qubes.1.fc41.x86_64 #1
[    1.748722] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[    1.750634] RIP: e030:xen_mc_flush+0x226/0x4f0
[    1.751484] Code: c0 48 c1 e0 05 48 05 00 70 3c 81 e8 d4 98 0e 01 48 89 45 18 48 85 c0 0f 89 b6 fe ff ff 44 8b b3 e0 f2 01 00 41 bf 01 00 00 00 <0f> 0b 65 8b 0d bd 34 d4 7f 44 89 f2 44 89 fe 48 c7 c7 70 0f d5 81
[    1.754715] RSP: e02b:ffffc900404e7978 EFLAGS: 00010086
[    1.755688] RAX: fffffffffffffff0 RBX: ffff88817fe00000 RCX: ffff88817fe1f2f0
[    1.756971] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88817fe1faf0
[    1.758258] RBP: ffff88817fe1f2e0 R08: 0000000000000000 R09: ffff888101695dc0
[    1.759540] R10: 0000000000007ff0 R11: 00000000000000e4 R12: 0000000000000042
[    1.760829] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
[    1.762122] FS:  0000709a60dc5740(0000) GS:ffff88817fe00000(0000) knlGS:0000000000000000
[    1.763575] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.764633] CR2: 00007ffd77bc48c8 CR3: 0000000100a22000 CR4: 0000000000050660
[    1.765921] Call Trace:
[    1.766431]  <TASK>
[    1.766891]  ? show_trace_log_lvl+0x1b0/0x2f0
[    1.767726]  ? show_trace_log_lvl+0x1b0/0x2f0
[    1.768560]  ? xen_leave_lazy_mmu+0x15/0x60
[    1.769349]  ? xen_mc_flush+0x226/0x4f0
[    1.770093]  ? __warn.cold+0x93/0xf2
[    1.770795]  ? xen_mc_flush+0x226/0x4f0
[    1.771535]  ? report_bug+0xff/0x140
[    1.772236]  ? handle_bug+0x58/0x90
[    1.772924]  ? exc_invalid_op+0x17/0x70
[    1.773659]  ? asm_exc_invalid_op+0x1a/0x20
[    1.774448]  ? xen_mc_flush+0x226/0x4f0
[    1.775188]  ? xen_mc_flush+0x20c/0x4f0
[    1.775923]  ? xen_extend_mmu_update+0x4e/0xd0
[    1.776764]  xen_leave_lazy_mmu+0x15/0x60
[    1.777526]  set_ptes.constprop.0+0x1f/0x30
[    1.778322]  __text_poke+0x18c/0x4a0
[    1.779017]  ? __pfx_text_poke_memcpy+0x10/0x10
[    1.779877]  text_poke_copy_locked+0x63/0xa0
[    1.780696]  text_poke_copy+0x32/0x50
[    1.781408]  post_relocation+0xfd/0x190
[    1.782146]  load_module+0x480/0x810
[    1.782843]  init_module_from_file+0x86/0xc0
[    1.783661]  idempotent_init_module+0x115/0x310
[    1.784519]  __x64_sys_finit_module+0x65/0xc0
[    1.785351]  do_syscall_64+0x82/0x160
[    1.786064]  ? syscall_exit_to_user_mode+0x15/0x210
[    1.786975]  ? do_syscall_64+0x8e/0x160
[    1.787708]  ? xen_extend_mmu_update+0x4e/0xd0
[    1.788552]  ? xen_leave_lazy_mmu+0x15/0x60
[    1.789342]  ? set_ptes.isra.0+0x79/0x90
[    1.790100]  ? _raw_spin_unlock+0xe/0x30
[    1.790847]  ? do_anonymous_page+0x103/0x4a0
[    1.791664]  ? __handle_mm_fault+0x39a/0x6f0
[    1.796723]  ? do_syscall_64+0x8e/0x160
[    1.797457]  ? __count_memcg_events+0xc0/0x180
[    1.798310]  ? count_memcg_events.constprop.0+0x24/0x30
[    1.799274]  ? handle_mm_fault+0x20d/0x330
[    1.800055]  ? do_user_addr_fault+0x55a/0x7b0
[    1.800885]  ? exc_page_fault+0x83/0x180
[    1.801640]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    1.802598] RIP: 0033:0x709a60ebca5d
[    1.803300] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 83 73 0f 00 f7 d8 64 89 01 48
[    1.806707] RSP: 002b:00007ffd77bc7868 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    1.808096] RAX: ffffffffffffffda RBX: 0000644519919a00 RCX: 0000709a60ebca5d
[    1.809386] RDX: 0000000000000000 RSI: 00006444e143f715 RDI: 0000000000000000
[    1.810935] RBP: 00007ffd77bc7920 R08: 0000709a60fb4b20 R09: 0000000000000000
[    1.812182] R10: 0000644519919e50 R11: 0000000000000246 R12: 00006444e143f715
[    1.813422] R13: 0000000000040000 R14: 0000644519919c40 R15: 0000644519919470
[    1.814662]  </TASK>
[    1.815124] ---[ end trace 0000000000000000 ]---
[    1.815967] 1 of 1 multicall(s) failed: cpu 0
[    1.816769]   call  1: op=1 arg=[ffff88817fe1faf0] result=-16
[    1.817799] BUG: unable to handle page fault for address: 00006a042c1ac000
[    1.818988] #PF: supervisor write access in kernel mode
[    1.819930] #PF: error_code(0x0002) - not-present page
[    1.820847] PGD 100085067 P4D 100085067 PUD 100086067 PMD 100087067 PTE 0
[    1.822021] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
[    1.822900] CPU: 0 UID: 0 PID: 105 Comm: modprobe Tainted: G        W          6.13.0-0.rc3.2.qubes.1.fc41.x86_64 #1
[    1.824686] Tainted: [W]=WARN
[    1.825275] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[    1.827109] RIP: e030:memcpy+0xc/0x20
[    1.827799] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
[    1.830926] RSP: e02b:ffffc900404e79f0 EFLAGS: 00010006
[    1.831859] RAX: 00006a042c1ac000 RBX: 0000000000001000 RCX: 0000000000001000
[    1.833091] RDX: 0000000000001000 RSI: ffffc900400c5000 RDI: 00006a042c1ac000
[    1.834324] RBP: ffffffffc0401000 R08: 0000000000000000 R09: 3120206c6c616320
[    1.835559] R10: 0000000000007ff0 R11: 20206c6c61632020 R12: ffff88810008f380
[    1.836792] R13: 0000000000000000 R14: 0000000000001000 R15: ffff88810008d280
[    1.838029] FS:  0000709a60dc5740(0000) GS:ffff88817fe00000(0000) knlGS:0000000000000000
[    1.839461] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.840591] CR2: 00006a042c1ac000 CR3: 0000000100082000 CR4: 0000000000050660
[    1.841834] Call Trace:
[    1.842328]  <TASK>
[    1.842782]  ? show_trace_log_lvl+0x1b0/0x2f0
[    1.843588]  ? show_trace_log_lvl+0x1b0/0x2f0
[    1.844417]  ? __text_poke+0x250/0x4a0
[    1.845298]  ? __die_body.cold+0x8/0x12
[    1.846055]  ? page_fault_oops+0x146/0x160
[    1.846824]  ? exc_page_fault+0x7e/0x180
[    1.847584]  ? asm_exc_page_fault+0x26/0x30
[    1.848372]  ? memcpy+0xc/0x20
[    1.848977]  __text_poke+0x250/0x4a0
[    1.849647]  ? __pfx_text_poke_memcpy+0x10/0x10
[    1.850481]  text_poke_copy_locked+0x63/0xa0
[    1.851270]  text_poke_copy+0x32/0x50
[    1.851963]  post_relocation+0xfd/0x190
[    1.852675]  load_module+0x480/0x810
[    1.853353]  init_module_from_file+0x86/0xc0
[    1.854148]  idempotent_init_module+0x115/0x310
[    1.855022]  __x64_sys_finit_module+0x65/0xc0
[    1.855846]  do_syscall_64+0x82/0x160
[    1.856567]  ? syscall_exit_to_user_mode+0x15/0x210
[    1.857531]  ? do_syscall_64+0x8e/0x160
[    1.858247]  ? xen_extend_mmu_update+0x4e/0xd0
[    1.859062]  ? xen_leave_lazy_mmu+0x15/0x60
[    1.859830]  ? set_ptes.isra.0+0x79/0x90
[    1.860555]  ? _raw_spin_unlock+0xe/0x30
[    1.861277]  ? do_anonymous_page+0x103/0x4a0
[    1.862077]  ? __handle_mm_fault+0x39a/0x6f0
[    1.862873]  ? do_syscall_64+0x8e/0x160
[    1.863586]  ? __count_memcg_events+0xc0/0x180
[    1.864407]  ? count_memcg_events.constprop.0+0x24/0x30
[    1.865349]  ? handle_mm_fault+0x20d/0x330
[    1.866120]  ? do_user_addr_fault+0x55a/0x7b0
[    1.866947]  ? exc_page_fault+0x83/0x180
[    1.867674]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    1.868591] RIP: 0033:0x709a60ebca5d
[    1.869271] Code: Unable to access opcode bytes at 0x709a60ebca33.
[    1.870358] RSP: 002b:00007ffd77bc7868 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    1.871697] RAX: ffffffffffffffda RBX: 0000644519919a00 RCX: 0000709a60ebca5d
[    1.872938] RDX: 0000000000000000 RSI: 00006444e143f715 RDI: 0000000000000000
[    1.874178] RBP: 00007ffd77bc7920 R08: 0000709a60fb4b20 R09: 0000000000000000
[    1.875420] R10: 0000644519919e50 R11: 0000000000000246 R12: 00006444e143f715
[    1.876656] R13: 0000000000040000 R14: 0000644519919c40 R15: 0000644519919470
[    1.877893]  </TASK>
[    1.878348] Modules linked in:
[    1.878946] CR2: 00006a042c1ac000
[    1.879585] ---[ end trace 0000000000000000 ]---
[    1.880429] RIP: e030:memcpy+0xc/0x20
[    1.881112] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
[    1.884228] RSP: e02b:ffffc900404e79f0 EFLAGS: 00010006
[    1.885158] RAX: 00006a042c1ac000 RBX: 0000000000001000 RCX: 0000000000001000
[    1.886388] RDX: 0000000000001000 RSI: ffffc900400c5000 RDI: 00006a042c1ac000
[    1.887619] RBP: ffffffffc0401000 R08: 0000000000000000 R09: 3120206c6c616320
[    1.888852] R10: 0000000000007ff0 R11: 20206c6c61632020 R12: ffff88810008f380
[    1.890128] R13: 0000000000000000 R14: 0000000000001000 R15: ffff88810008d280
[    1.891377] FS:  0000709a60dc5740(0000) GS:ffff88817fe00000(0000) knlGS:0000000000000000
[    1.892788] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.893805] CR2: 00006a042c1ac000 CR3: 0000000100082000 CR4: 0000000000050660
[    1.895039] Kernel panic - not syncing: Fatal exception
[    1.895971] Kernel Offset: disabled

Full log:
https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
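
For reading the multicall failure line in the trace above
(`call  1: op=1 arg=[...] result=-16`), here is a minimal sketch. It assumes
the hypercall numbers from Xen's public ABI header and Linux-style errno
values, under which op 1 is `mmu_update` and -16 is `-EBUSY`. The helper
names `xen_hypercall_name()`, `xen_errno_name()` and `describe_multicall()`
are made up for illustration and cover only a handful of values.

```c
#include <stdio.h>

/*
 * Illustrative helpers (not kernel or Xen API) for reading a line such as
 *   "call  1: op=1 arg=[...] result=-16"
 * Only a few table entries are listed.
 */

/* Subset of the __HYPERVISOR_* numbers from Xen's public ABI. */
static const char *xen_hypercall_name(unsigned int op)
{
	switch (op) {
	case 1:  return "mmu_update";
	case 13: return "multicall";
	case 14: return "update_va_mapping";
	case 26: return "mmuext_op";
	default: return "unknown";
	}
}

/* Subset of errno values; Xen uses the Linux numbering for these. */
static const char *xen_errno_name(long result)
{
	switch (result) {
	case 0:   return "success";
	case -12: return "ENOMEM";
	case -14: return "EFAULT";
	case -16: return "EBUSY";
	case -22: return "EINVAL";
	default:  return "other";
	}
}

/* Format "<hypercall> -> <errno>" into buf; returns snprintf()'s result. */
static int describe_multicall(char *buf, size_t len,
			      unsigned int op, long result)
{
	return snprintf(buf, len, "%s -> %s",
			xen_hypercall_name(op), xen_errno_name(result));
}
```

Calling `describe_multicall(buf, sizeof(buf), 1, -16)` fills `buf` with
"mmu_update -> EBUSY". A failed page-table update would be consistent with
the `PTE 0` in the page walk for the address that memcpy() faults on right
afterwards, but that is only a reading of the log.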

Another failure looks like this:

[    1.813118] BUG: unable to handle page fault for address: ffffea6666666648
[    1.814401] #PF: supervisor read access in kernel mode
[    1.815428] #PF: error_code(0x0000) - not-present page
[    1.816472] PGD 7f7d1067 P4D 7f7d1067 PUD 0 
[    1.817286] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[    1.818157] CPU: 0 UID: 0 PID: 214 Comm: modprobe Not tainted 6.13.0-0.rc3.2.qubes.1.fc41.x86_64 #1
[    1.819864] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[    1.821711] RIP: e030:migration_entry_wait_on_locked+0x69/0x2e0
[    1.823058] Code: 00 48 c7 44 24 20 00 00 00 00 f3 48 ab e9 56 01 00 00 48 b8 ff ff ff ff ff 00 00 00 48 21 d0 48 c1 e0 06 48 03 05 0f 97 89 01 <48> 8b 48 08 49 89 c6 f6 c1 01 0f 85 39 02 00 00 0f 1f 44 00 00 48
[    1.826206] RSP: e02b:ffffc90041187940 EFLAGS: 00010282
[    1.827138] RAX: ffffea6666666640 RBX: ccccccccccccccc0 RCX: 0000000000000000
[    1.828377] RDX: 6401999999999999 RSI: ffffc90041187968 RDI: ffffc900411879a0
[    1.829631] RBP: 6401999999999999 R08: 0000000000000067 R09: ffffc90041187ad8
[    1.830885] R10: 0000000000000000 R11: 0000000000000000 R12: 0200000000000080
[    1.832143] R13: 0000000183689067 R14: 0000000000000af0 R15: ffff88810a507660
[    1.833504] FS:  0000000000000000(0000) GS:ffff88817fe00000(0000) knlGS:0000000000000000
[    1.835186] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.836379] CR2: ffffea6666666648 CR3: 0000000100a0e000 CR4: 0000000000050660
[    1.837763] Call Trace:
[    1.838316]  <TASK>
[    1.838784]  ? show_trace_log_lvl+0x1b0/0x2f0
[    1.839652]  ? show_trace_log_lvl+0x1b0/0x2f0
[    1.840513]  ? migration_entry_wait+0xf0/0x100
[    1.841366]  ? __die_body.cold+0x8/0x12
[    1.842100]  ? page_fault_oops+0x146/0x160
[    1.842887]  ? exc_page_fault+0x170/0x180
[    1.843671]  ? asm_exc_page_fault+0x26/0x30
[    1.844466]  ? migration_entry_wait_on_locked+0x69/0x2e0
[    1.845436]  ? __raw_callee_save_xen_pmd_val+0x15/0x30
[    1.846431]  migration_entry_wait+0xf0/0x100
[    1.847227]  do_swap_page+0x4a9/0xeb0
[    1.847940]  ? xen_pmd_val+0x35/0x70
[    1.848615]  ? __raw_callee_save_xen_pmd_val+0x15/0x30
[    1.853785]  ? __pfx_default_wake_function+0x10/0x10
[    1.854676]  __handle_mm_fault+0x39a/0x6f0
[    1.855426]  ? mt_find+0x213/0x570
[    1.856074]  handle_mm_fault+0x115/0x330
[    1.856816]  do_user_addr_fault+0x1ca/0x7b0
[    1.857576]  exc_page_fault+0x7e/0x180
[    1.858280]  asm_exc_page_fault+0x26/0x30
[    1.859018] RIP: e030:elf_load+0x20f/0x240
[    1.859773] Code: 39 d9 73 16 41 f6 c4 02 0f 84 1a ff ff ff 49 c7 c5 f2 ff ff ff e9 4d fe ff ff 0f 1f 00 b9 00 10 00 00 48 89 df 48 29 c1 31 c0 <f3> aa 0f 1f 00 0f 1f 00 48 85 c9 75 0d 48 8b 75 28 48 8b 4d 20 e9
[    1.862907] RSP: e02b:ffffc90041187d08 EFLAGS: 00010246
[    1.863841] RAX: 0000000000000000 RBX: 000070b32bcc0104 RCX: 0000000000000efc
[    1.865081] RDX: 000070b32bcc02e0 RSI: 0000000000000000 RDI: 000070b32bcc0104
[    1.866321] RBP: ffff888108b59470 R08: ffff88810a506668 R09: 0000000000000035
[    1.867568] R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000003
[    1.868809] R13: 000070b32bcbd000 R14: ffff88810a750000 R15: 0000000000000000
[    1.870056]  ? elf_load+0xa8/0x240
[    1.870707]  load_elf_interp.isra.0+0x1b5/0x330
[    1.871542]  load_elf_binary+0xa35/0xf30
[    1.872268]  search_binary_handler+0xd3/0x260
[    1.873083]  exec_binprm+0x54/0x180
[    1.873750]  bprm_execve.part.0+0x144/0x1e0
[    1.874516]  kernel_execve+0x112/0x140
[    1.875215]  call_usermodehelper_exec_async+0xd0/0x190
[    1.876135]  ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[    1.877167]  ret_from_fork+0x34/0x50
[    1.877847]  ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[    1.878871]  ret_from_fork_asm+0x1a/0x30
[    1.879601]  </TASK>
[    1.880056] Modules linked in:
[    1.880659] CR2: ffffea6666666648
[    1.881296] ---[ end trace 0000000000000000 ]---
[    1.882142] RIP: e030:migration_entry_wait_on_locked+0x69/0x2e0
[    1.883198] Code: 00 48 c7 44 24 20 00 00 00 00 f3 48 ab e9 56 01 00 00 48 b8 ff ff ff ff ff 00 00 00 48 21 d0 48 c1 e0 06 48 03 05 0f 97 89 01 <48> 8b 48 08 49 89 c6 f6 c1 01 0f 85 39 02 00 00 0f 1f 44 00 00 48
[    1.886323] RSP: e02b:ffffc90041187940 EFLAGS: 00010282
[    1.887258] RAX: ffffea6666666640 RBX: ccccccccccccccc0 RCX: 0000000000000000
[    1.888501] RDX: 6401999999999999 RSI: ffffc90041187968 RDI: ffffc900411879a0
[    1.889748] RBP: 6401999999999999 R08: 0000000000000067 R09: ffffc90041187ad8
[    1.890990] R10: 0000000000000000 R11: 0000000000000000 R12: 0200000000000080
[    1.892225] R13: 0000000183689067 R14: 0000000000000af0 R15: ffff88810a507660
[    1.893468] FS:  0000000000000000(0000) GS:ffff88817fe00000(0000) knlGS:0000000000000000
[    1.894881] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.895902] CR2: ffffea6666666648 CR3: 0000000100a0e000 CR4: 0000000000050660
[    1.897143] Kernel panic - not syncing: Fatal exception
[    1.898079] Kernel Offset: disabled

Full log:
https://openqa.qubes-os.org/tests/122881/logfile?filename=serial0.txt
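
The `#PF: error_code(...)` lines in the two oopses above can be decoded by
hand. A small sketch, assuming the architectural x86 page-fault error-code
bit layout; the macro, struct and function names are made up for
illustration and are not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error-code bits (low five only). */
#define PF_BIT_PRESENT (1u << 0) /* 0: not-present page, 1: protection fault */
#define PF_BIT_WRITE   (1u << 1) /* 0: read access, 1: write access */
#define PF_BIT_USER    (1u << 2) /* 0: supervisor mode, 1: user mode */
#define PF_BIT_RSVD    (1u << 3) /* reserved bit set in a paging entry */
#define PF_BIT_INSTR   (1u << 4) /* fault on instruction fetch */

struct pf_decoded {
	bool present;
	bool write;
	bool user;
	bool reserved_bit;
	bool instr_fetch;
};

/* Split an error code such as 0x0002 into its individual flags. */
static struct pf_decoded decode_pf_error_code(uint32_t code)
{
	struct pf_decoded d = {
		.present      = (code & PF_BIT_PRESENT) != 0,
		.write        = (code & PF_BIT_WRITE) != 0,
		.user         = (code & PF_BIT_USER) != 0,
		.reserved_bit = (code & PF_BIT_RSVD) != 0,
		.instr_fetch  = (code & PF_BIT_INSTR) != 0,
	};
	return d;
}
```

With this, 0x0002 (first oops) decodes to a supervisor write to a
not-present page, and 0x0000 (second oops) to a supervisor read of a
not-present page, matching the text the kernel printed next to each code.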

Or this:

[    1.672650] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[    1.674030] CPU: 1 UID: 0 PID: 107 Comm: cryptomgr_test Not tainted 6.13.0-0.rc3.2.qubes.1.fc41.x86_64 #1
[    1.676339] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[    1.678240] Call Trace:
[    1.678743]  <TASK>
[    1.679185]  ? __pfx_cryptomgr_test+0x10/0x10
[    1.679998]  dump_stack_lvl+0x5d/0x80
[    1.680690]  panic+0x155/0x30f
[    1.681283]  schedule_debug.isra.0.cold+0xc/0xc
[    1.682151]  __schedule+0x6f/0x600
[    1.682819]  ? __pfx_cryptomgr_test+0x10/0x10
[    1.683616]  do_task_dead+0x42/0x50
[    1.684279]  do_exit+0x331/0x4a0
[    1.684902]  kthread_exit+0x28/0x30
[    1.685565]  __module_put_and_kthread_exit+0x1a/0x20
[    1.686458]  cryptomgr_test+0x3f/0x40
[    1.687195]  kthread+0xd2/0x100
[    1.687813]  ? __pfx_kthread+0x10/0x10
[    1.688545]  ret_from_fork+0x34/0x50
[    1.689299]  ? __pfx_kthread+0x10/0x10
[    1.690150]  ret_from_fork_asm+0x1a/0x30
[    1.690938]  </TASK>
[    1.691417] Kernel Offset: disabled


Full log:
https://openqa.qubes-os.org/tests/122877/logfile?filename=serial0.txt



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2024-12-19 16:14 Linux 6.13-rc3 many different panics in Xen PV dom0 Marek Marczykowski-Górecki
@ 2024-12-20  1:48 ` Marek Marczykowski-Górecki
  2024-12-26 18:48   ` Marek Marczykowski-Górecki
  2025-01-02 10:20 ` Jürgen Groß
  1 sibling, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-12-20  1:48 UTC
  To: xen-devel

On Thu, Dec 19, 2024 at 05:14:52PM +0100, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> Linux 6.13-rc3 crashes on boot as a Xen PV dom0, as shown below, most of
> the time. Sometimes (rarely) it manages to stay alive. Below I'm pasting
> a few of the crashes that look distinctly different; if you follow the
> links, you can find more of them. IMHO it looks like a memory corruption
> bug somewhere. I also tested Linux 6.13-rc2 before, and it had a very
> similar issue.
> 
> The traces below are all from nested virt (Xen inside KVM); tests with
> Xen directly on the hardware are still in progress. But -rc2 failed all
> of them too, so if it's the same issue, I guess they will look similar.

Yes, on real hardware it crashes too.

I tried to enable KASAN, but that didn't work out:

(XEN) d0 has maximum 416 PIRQs
(XEN) *** Building a PV Dom0 ***
(XEN)  Xen  kernel: 64-bit, lsb
(XEN)  Dom0 kernel: 64-bit, lsb, paddr 0x200000 -> 0x7600000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000260000000->0000000268000000 (1005377 pages to be allocated)
(XEN)  Init. ramdisk: 000000027d741000->000000027ffff207
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff80200000->ffffffff87600000
(XEN)  Phys-Mach map: 0000008000000000->0000008000800000
(XEN)  Start info:    ffffffff87600000->ffffffff876004b8
(XEN)  Page tables:   ffffffff87601000->ffffffff87640000
(XEN)  Boot stack:    ffffffff87640000->ffffffff87641000
(XEN)  TOTAL:         ffffffff80000000->ffffffff87800000
(XEN)  ENTRY ADDRESS: ffffffff8615da50
(XEN) Dom0 has maximum 2 VCPUs
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 684kB init memory
(XEN) d0v0 Unhandled: vec 14, #PF[0002]
(XEN) Pagetable walk from fffffbfff0900fc6:
(XEN)  L4[0x1f7] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S: fault at ffff82d0402ebdec x86_64/entry.S#create_bounce_frame+0x14c/0x170
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.19.0  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff8614ff32>]
(XEN) RFLAGS: 0000000000000286   EM: 1   CONTEXT: pv guest (d0v0)
(XEN) rax: ffffffff860d8000   rbx: ffffffff87600000   rcx: 00000000c0000101
(XEN) rdx: 3be9e05ee5ed7ef7   rsi: ffffffff87600000   rdi: fffffbfff0900fc6
(XEN) rbp: ffffffff84807f48   rsp: ffffffff84807df0   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: dffffc0000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 0000000000340660
(XEN) cr3: 0000000267601000   cr2: fffffbfff0900fc6
(XEN) fsb: 0000000000000000   gsb: ffffffff860d8000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff84807df0:
(XEN)    00000000c0000101 0000000000000000 0000000000000002 ffffffff8614ff32
(XEN)    000000010000e030 0000000000010086 ffffffff84807e30 000000000000e02b
(XEN)    0000000041b58ab3 ffffffff845f8030 ffffffff8614fed0 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffffffff8615da6f 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
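
The faulting address `fffffbfff0900fc6` lies in the KASAN shadow region. Below
is a minimal sketch of the generic KASAN address translation on x86-64,
assuming the default `KASAN_SHADOW_OFFSET` of `0xdffffc0000000000` (the same
value is visible in `r12` above) and a scale shift of 3. The function names
mirror the kernel's `kasan_mem_to_shadow()`, but this is standalone
userspace code, not the kernel implementation.

```c
#include <stdint.h>

/* Assumed defaults for generic KASAN on x86-64 with 4-level paging. */
#define KASAN_SHADOW_SCALE_SHIFT 3
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL

/* One shadow byte describes 8 bytes of memory. */
static uint64_t kasan_mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* Inverse mapping; returns the start of the 8-byte granule. */
static uint64_t kasan_shadow_to_mem(uint64_t shadow)
{
	return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}
```

`kasan_shadow_to_mem(0xfffffbfff0900fc6)` evaluates to `0xffffffff84807e30`,
which also appears as a word in the second row of the guest stack dump, so
the unmapped shadow byte seems to describe the early boot stack. That would
point at KASAN-instrumented code running before the shadow is mapped, but
this is only a reading of the dump, not a confirmed diagnosis.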

> Who should I CC here? The failures are all over the place... linux-mm?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2024-12-20  1:48 ` Marek Marczykowski-Górecki
@ 2024-12-26 18:48   ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-12-26 18:48 UTC
  To: xen-devel

On Fri, Dec 20, 2024 at 02:48:52AM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Dec 19, 2024 at 05:14:52PM +0100, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > Linux 6.13-rc3 crashes on boot as a Xen PV dom0, as shown below, most of
> > the time. Sometimes (rarely) it manages to stay alive. Below I'm pasting
> > a few of the crashes that look distinctly different; if you follow the
> > links, you can find more of them. IMHO it looks like a memory corruption
> > bug somewhere. I also tested Linux 6.13-rc2 before, and it had a very
> > similar issue.
> > 
> > The traces below are all from nested virt (Xen inside KVM); tests with
> > Xen directly on the hardware are still in progress. But -rc2 failed all
> > of them too, so if it's the same issue, I guess they will look similar.
> 
> Yes, on real hardware it crashes too.

6.13-rc4 fails the same way.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2024-12-19 16:14 Linux 6.13-rc3 many different panics in Xen PV dom0 Marek Marczykowski-Górecki
  2024-12-20  1:48 ` Marek Marczykowski-Górecki
@ 2025-01-02 10:20 ` Jürgen Groß
  2025-01-02 11:30   ` Juergen Gross
  1 sibling, 1 reply; 15+ messages in thread
From: Jürgen Groß @ 2025-01-02 10:20 UTC
  To: Marek Marczykowski-Górecki, xen-devel

On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> Linux 6.13-rc3 crashes on boot as a Xen PV dom0, as shown below, most of
> the time. Sometimes (rarely) it manages to stay alive. Below I'm pasting
> a few of the crashes that look distinctly different; if you follow the
> links, you can find more of them. IMHO it looks like a memory corruption
> bug somewhere. I also tested Linux 6.13-rc2 before, and it had a very
> similar issue.

...

> 
> Full log:
> https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt

I can reproduce a crash with 6.13-rc5 PV dom0.

What is really interesting in the logs: most crashes seem to happen right
after a module has been loaded (in my reproducer it was right after loading
the first module).

I need to go through the 6.13 commits, but I think I remember having seen
a patch optimizing module loading by using large pages for mapping the
loaded modules. Maybe the case of no large pages being available isn't
handled properly.


Juergen

* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-02 10:20 ` Jürgen Groß
@ 2025-01-02 11:30   ` Juergen Gross
  2025-01-02 12:24     ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 15+ messages in thread
From: Juergen Gross @ 2025-01-02 11:30 UTC
  To: Marek Marczykowski-Górecki, xen-devel

On 02.01.25 11:20, Jürgen Groß wrote:
> On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
>> Hi,
>>
>> Linux 6.13-rc3 crashes on boot as a Xen PV dom0, as shown below, most of
>> the time. Sometimes (rarely) it manages to stay alive. Below I'm pasting
>> a few of the crashes that look distinctly different; if you follow the
>> links, you can find more of them. IMHO it looks like a memory corruption
>> bug somewhere. I also tested Linux 6.13-rc2 before, and it had a very
>> similar issue.
> 
> ...
> 
>>
>> Full log:
>> https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> 
> I can reproduce a crash with 6.13-rc5 PV dom0.
> 
> What is really interesting in the logs: most crashes seem to happen right
> after a module has been loaded (in my reproducer it was right after loading
> the first module).
> 
> I need to go through the 6.13 commits, but I think I remember having seen
> a patch optimizing module loading by using large pages for mapping the
> loaded modules. Maybe the case of no large pages being available isn't
> handled properly.

Seems I was right.

For me the following diff fixes the issue. Marek, can you please confirm
it fixes your crashes, too?

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index c6d29f283001..b5b7964b34b0 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1080,7 +1080,7 @@ struct execmem_info __init *execmem_arch_setup(void)

         start = MODULES_VADDR + offset;

-       if (IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX)) {
+       if (IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX) && cpu_feature_enabled(X86_FEATURE_PSE)) {
                 pgprot = PAGE_KERNEL_ROX;
                 flags = EXECMEM_KASAN_SHADOW | EXECMEM_ROX_CACHE;
         } else {
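
The effect of the one-line change can be modelled outside the kernel. This
is a sketch under stated assumptions: `pick_module_mapping()` and
`struct module_mapping` are hypothetical stand-ins, the two booleans stand
for `IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX)` and
`cpu_feature_enabled(X86_FEATURE_PSE)`, and the fallback branch (not visible
in the diff) is assumed to simply skip the ROX mapping and the ROX cache.

```c
#include <stdbool.h>

/* Hypothetical model of the choice made in execmem_arch_setup(). */
struct module_mapping {
	bool rox;       /* module text mapped with PAGE_KERNEL_ROX */
	bool rox_cache; /* EXECMEM_ROX_CACHE requested */
};

static struct module_mapping pick_module_mapping(bool arch_has_execmem_rox,
						 bool cpu_has_pse)
{
	struct module_mapping m = { .rox = false, .rox_cache = false };

	/* With the patch, both conditions must hold before ROX is chosen. */
	if (arch_has_execmem_rox && cpu_has_pse) {
		m.rox = true;
		m.rox_cache = true;
	}
	return m;
}
```

For a guest without large page support, `pick_module_mapping(true, false)`
leaves both fields false; before the patch the second condition was not
checked at all.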


Juergen

* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-02 11:30   ` Juergen Gross
@ 2025-01-02 12:24     ` Marek Marczykowski-Górecki
  2025-01-02 18:54       ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-01-02 12:24 UTC
  To: Juergen Gross; +Cc: xen-devel

On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> On 02.01.25 11:20, Jürgen Groß wrote:
> > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > > 
> > > It crashes on boot like below, most of the times. But sometimes (rarely)
> > > it manages to stay alive. Below I'm pasting few of the crashes that look
> > > distinctly different, if you follow the links, you can find more of
> > > them. IMHO it looks like some memory corruption bug somewhere. I tested
> > > also Linux 6.13-rc2 before, and it had very similar issue.
> > 
> > ...
> > 
> > > 
> > > Full log:
> > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > 
> > I can reproduce a crash with 6.13-rc5 PV dom0.
> > 
> > What is really interesting in the logs: most crashes seem to happen right
> > after a module being loaded (in my reproducer it was right after loading
> > the first module).
> > 
> > I need to go through the 6.13 commits, but I think I remember having seen
> > a patch optimizing module loading by using large pages for addressing the
> > loaded modules. Maybe the case of no large pages being available isn't
> > handled properly.
> 
> Seems I was right.
> 
> For me the following diff fixes the issue. Marek, can you please confirm
> it fixes your crashes, too?

Thanks for looking into it!
Will do. I've pushed it to
https://github.com/QubesOS/qubes-linux-kernel/pull/662; CI will build it
and then I'll post it to openQA.

> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index c6d29f283001..b5b7964b34b0 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -1080,7 +1080,7 @@ struct execmem_info __init *execmem_arch_setup(void)
> 
>         start = MODULES_VADDR + offset;
> 
> -       if (IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX)) {
> +       if (IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX) && cpu_feature_enabled(X86_FEATURE_PSE)) {
>                 pgprot = PAGE_KERNEL_ROX;
>                 flags = EXECMEM_KASAN_SHADOW | EXECMEM_ROX_CACHE;
>         } else {
> 
> 
> Juergen






-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-02 12:24     ` Marek Marczykowski-Górecki
@ 2025-01-02 18:54       ` Marek Marczykowski-Górecki
  2025-01-02 19:04         ` Andrew Cooper
  2025-01-02 19:17         ` Jürgen Groß
  0 siblings, 2 replies; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-01-02 18:54 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 7252 bytes --]

On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > On 02.01.25 11:20, Jürgen Groß wrote:
> > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > > 
> > > > It crashes on boot like below, most of the times. But sometimes (rarely)
> > > > it manages to stay alive. Below I'm pasting few of the crashes that look
> > > > distinctly different, if you follow the links, you can find more of
> > > > them. IMHO it looks like some memory corruption bug somewhere. I tested
> > > > also Linux 6.13-rc2 before, and it had very similar issue.
> > > 
> > > ...
> > > 
> > > > 
> > > > Full log:
> > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > 
> > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > 
> > > What is really interesting in the logs: most crashes seem to happen right
> > > after a module being loaded (in my reproducer it was right after loading
> > > the first module).
> > > 
> > > I need to go through the 6.13 commits, but I think I remember having seen
> > > a patch optimizing module loading by using large pages for addressing the
> > > loaded modules. Maybe the case of no large pages being available isn't
> > > handled properly.
> > 
> > Seems I was right.
> > 
> > For me the following diff fixes the issue. Marek, can you please confirm
> > it fixes your crashes, too?
> 
> Thanks for looking into it!
> Will do, I've pushed it to
> https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
> and then I'll post it to openQA.

It is much better!

Tests are still running, but I already see that many are green. There is
one issue (likely unrelated to this change): sys-usb (HVM domU with USB
controllers passed through) crashes on a system with a Raptor Lake CPU
(only there; others, including ADL and MTL, look fine):

[   75.770849] Bluetooth: Core ver 2.22
[   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
[   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
[   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
[   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
[   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   75.770943] RSP: 0000:ffffad644108fa40 EFLAGS: 00010246
[   75.770950] RAX: ffff93da8a149600 RBX: c9d2315bc82c3810 RCX: 0000000100000000
[   75.770958] RDX: 0000000000000001 RSI: ffff93da905e9180 RDI: ffff93da81404598
[   75.770967] RBP: ffffad644108fa58 R08: 0000000000000064 R09: 00000000000012ab
[   75.770975] R10: ffff93da81207000 R11: 0000000000000286 R12: ffffad644108fb00
[   75.770983] R13: ffffad644108fa68 R14: ffff93da9089b840 R15: ffff93da8c265100
[   75.770991] FS:  000078fa4cec4bc0(0000) GS:ffff93da97000000(0000) knlGS:0000000000000000
[   75.771000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.771007] CR2: 000074fa64aadc08 CR3: 00000000105d2006 CR4: 0000000000770ef0
[   75.771016] PKRU: 55555554
[   75.771019] Call Trace:
[   75.771024]  <TASK>
[   75.771028]  ? show_trace_log_lvl+0x1b0/0x2f0
[   75.771036]  ? show_trace_log_lvl+0x1b0/0x2f0
[   75.771042]  ? do_one_initcall+0x58/0x310
[   75.771048]  ? __die_body.cold+0x8/0x12
[   75.771053]  ? die_addr+0x3c/0x60
[   75.771059]  ? exc_general_protection+0x17d/0x400
[   75.771066]  ? asm_exc_general_protection+0x26/0x30
[   75.771074]  ? msft_monitor_device_del+0x93/0x170 [bluetooth]
[   75.771095]  ? bt_init+0x54/0x1d0 [bluetooth]
[   75.771114]  ? __pfx_bt_init+0x10/0x10 [bluetooth]
[   75.771131]  ? do_one_initcall+0x58/0x310
[   75.771137]  ? do_init_module+0x90/0x250
[   75.771142]  ? init_module_from_file+0x86/0xc0
[   75.771149]  ? idempotent_init_module+0x115/0x310
[   75.771156]  ? __x64_sys_finit_module+0x65/0xc0
[   75.771163]  ? do_syscall_64+0x82/0x160
[   75.771168]  ? backing_file_read_iter+0x156/0x1f0
[   75.771176]  ? ovl_read_iter+0x94/0xa0 [overlay]
[   75.771189]  ? __pfx_ovl_file_accessed+0x10/0x10 [overlay]
[   75.771199]  ? rseq_get_rseq_cs+0x1d/0x220
[   75.771205]  ? rseq_ip_fixup+0x8d/0x1d0
[   75.771210]  ? __seccomp_filter+0x303/0x520
[   75.771216]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[   75.771224]  ? syscall_exit_to_user_mode+0x10/0x210
[   75.771231]  ? do_syscall_64+0x8e/0x160
[   75.771236]  ? do_sys_openat2+0x9c/0xe0
[   75.771241]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[   75.771249]  ? syscall_exit_to_user_mode+0x10/0x210
[   75.771255]  ? do_syscall_64+0x8e/0x160
[   75.771260]  ? do_user_addr_fault+0x1ec/0x7b0
[   75.771267]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   75.771274]  </TASK>
[   75.771277] Modules linked in: bluetooth(+) rfkill snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject intel_rapl_msr intel_rapl_common nft_ct intel_uncore_frequency_common intel_pmc_core intel_vsec joydev nft_masq pmt_telemetry pmt_class nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni xhci_pci polyval_generic ghash_clmulni_intel xhci_hcd sha512_ssse3 sha256_ssse3 nf_tables sha1_ssse3 ehci_pci mei_me ehci_hcd pcspkr mei ata_generic pata_acpi i2c_piix4 i2c_smbus serio_raw xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse nfnetlink overlay xen_blkfront
[   75.771370] ---[ end trace 0000000000000000 ]---
[   75.771376] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
[   75.771397] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   75.771416] RSP: 0000:ffffad644108fa40 EFLAGS: 00010246
[   75.771422] RAX: ffff93da8a149600 RBX: c9d2315bc82c3810 RCX: 0000000100000000
[   75.771431] RDX: 0000000000000001 RSI: ffff93da905e9180 RDI: ffff93da81404598
[   75.771439] RBP: ffffad644108fa58 R08: 0000000000000064 R09: 00000000000012ab
[   75.771446] R10: ffff93da81207000 R11: 0000000000000286 R12: ffffad644108fb00
[   75.771454] R13: ffffad644108fa68 R14: ffff93da9089b840 R15: ffff93da8c265100
[   75.771463] FS:  000078fa4cec4bc0(0000) GS:ffff93da97000000(0000) knlGS:0000000000000000
[   75.771471] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.771477] CR2: 000074fa64aadc08 CR3: 00000000105d2006 CR4: 0000000000770ef0
[   75.771485] PKRU: 55555554
[   75.771488] Kernel panic - not syncing: Fatal exception
[   75.771519] Kernel Offset: 0x3b800000 from 0xffffffff80200000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Full log inside
https://openqa.qubes-os.org/tests/124736/file/usbvm-var_log.tar.gz
(log/xen/console/guest-sys-usb.log)

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-02 18:54       ` Marek Marczykowski-Górecki
@ 2025-01-02 19:04         ` Andrew Cooper
  2025-01-02 19:17         ` Jürgen Groß
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Cooper @ 2025-01-02 19:04 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Juergen Gross; +Cc: xen-devel

On 02/01/2025 6:54 pm, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
>> On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
>>> On 02.01.25 11:20, Jürgen Groß wrote:
>>>> On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
>>>>> Hi,
>>>>>
>>>>> It crashes on boot like below, most of the times. But sometimes (rarely)
>>>>> it manages to stay alive. Below I'm pasting few of the crashes that look
>>>>> distinctly different, if you follow the links, you can find more of
>>>>> them. IMHO it looks like some memory corruption bug somewhere. I tested
>>>>> also Linux 6.13-rc2 before, and it had very similar issue.
>>>> ...
>>>>
>>>>> Full log:
>>>>> https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
>>>> I can reproduce a crash with 6.13-rc5 PV dom0.
>>>>
>>>> What is really interesting in the logs: most crashes seem to happen right
>>>> after a module being loaded (in my reproducer it was right after loading
>>>> the first module).
>>>>
>>>> I need to go through the 6.13 commits, but I think I remember having seen
>>>> a patch optimizing module loading by using large pages for addressing the
>>>> loaded modules. Maybe the case of no large pages being available isn't
>>>> handled properly.
>>> Seems I was right.
>>>
>>> For me the following diff fixes the issue. Marek, can you please confirm
>>> it fixes your crashes, too?
>> Thanks for looking into it!
>> Will do, I've pushed it to
>> https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
>> and then I'll post it to openQA.
> It is much better!
>
> Tests are still running, but I already see that many are green. There is
> one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> controllers passed through) crashes on a system with Raptor Lake CPU
> (only, others, including ADL and MTL look fine):
>
> [   75.770849] Bluetooth: Core ver 2.22
> [   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> [   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> [   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> [   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> [   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

es sub 0x3ad(%rbx),%ecx

I highly doubt that's an instruction that the compiler really put out
for this function.

The preceding bytes are "shlb 0x21(%rbp)" which isn't completely
implausible, but the surrounding 0's very much are.

This looks very fishy: it looks like either DMA hitting .text or
module handling getting its regions wrong.

~Andrew
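
A small sketch cross-checking the decode above against the register dump
in the quoted oops: RBX plus the 0x3ad displacement should give the
faulting address, and that address should be non-canonical. The helper
names are hypothetical, and 48-bit virtual addressing (4-level paging) is
an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if addr is canonical under 48-bit virtual addressing, i.e.
 * bits 63..47 are all zeros or all ones (assumption: 4-level paging).
 */
static inline bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the upper 17 bits */
    return top == 0 || top == 0x1ffff;
}

/* Effective address of a disp32(%base) memory operand. */
static inline uint64_t effective_address(uint64_t base, int32_t disp)
{
    return base + (uint64_t)(int64_t)disp;
}
```

With RBX = 0xc9d2315bc82c3810 from the oops, effective_address(RBX, 0x3ad)
gives 0xc9d2315bc82c3bbd, the address reported in the general protection
fault, and is_canonical_48() rejects it. That is consistent with the CPU
really having executed the bytes shown in the Code: line.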



* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-02 18:54       ` Marek Marczykowski-Górecki
  2025-01-02 19:04         ` Andrew Cooper
@ 2025-01-02 19:17         ` Jürgen Groß
  2025-01-02 19:39           ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 15+ messages in thread
From: Jürgen Groß @ 2025-01-02 19:17 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: xen-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 7638 bytes --]

On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
>> On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
>>> On 02.01.25 11:20, Jürgen Groß wrote:
>>>> On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
>>>>> Hi,
>>>>>
>>>>> It crashes on boot like below, most of the times. But sometimes (rarely)
>>>>> it manages to stay alive. Below I'm pasting few of the crashes that look
>>>>> distinctly different, if you follow the links, you can find more of
>>>>> them. IMHO it looks like some memory corruption bug somewhere. I tested
>>>>> also Linux 6.13-rc2 before, and it had very similar issue.
>>>>
>>>> ...
>>>>
>>>>>
>>>>> Full log:
>>>>> https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
>>>>
>>>> I can reproduce a crash with 6.13-rc5 PV dom0.
>>>>
>>>> What is really interesting in the logs: most crashes seem to happen right
>>>> after a module being loaded (in my reproducer it was right after loading
>>>> the first module).
>>>>
>>>> I need to go through the 6.13 commits, but I think I remember having seen
>>>> a patch optimizing module loading by using large pages for addressing the
>>>> loaded modules. Maybe the case of no large pages being available isn't
>>>> handled properly.
>>>
>>> Seems I was right.
>>>
>>> For me the following diff fixes the issue. Marek, can you please confirm
>>> it fixes your crashes, too?
>>
>> Thanks for looking into it!
>> Will do, I've pushed it to
>> https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
>> and then I'll post it to openQA.
> 
> It is much better!
> 
> Tests are still running, but I already see that many are green.

So are you fine with me adding your "Tested-by:"?

> There is
> one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> controllers passed through) crashes on a system with Raptor Lake CPU
> (only, others, including ADL and MTL look fine):
> 
> [   75.770849] Bluetooth: Core ver 2.22
> [   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> [   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> [   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> [   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> [   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

This code looks suspicious. Large areas of binary 0 in a normal function?
And the code itself is nonsense, as it uses a memory access via ES:, which
doesn't make any sense in a 64-bit kernel.


Juergen


> [   75.770943] RSP: 0000:ffffad644108fa40 EFLAGS: 00010246
> [   75.770950] RAX: ffff93da8a149600 RBX: c9d2315bc82c3810 RCX: 0000000100000000
> [   75.770958] RDX: 0000000000000001 RSI: ffff93da905e9180 RDI: ffff93da81404598
> [   75.770967] RBP: ffffad644108fa58 R08: 0000000000000064 R09: 00000000000012ab
> [   75.770975] R10: ffff93da81207000 R11: 0000000000000286 R12: ffffad644108fb00
> [   75.770983] R13: ffffad644108fa68 R14: ffff93da9089b840 R15: ffff93da8c265100
> [   75.770991] FS:  000078fa4cec4bc0(0000) GS:ffff93da97000000(0000) knlGS:0000000000000000
> [   75.771000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   75.771007] CR2: 000074fa64aadc08 CR3: 00000000105d2006 CR4: 0000000000770ef0
> [   75.771016] PKRU: 55555554
> [   75.771019] Call Trace:
> [   75.771024]  <TASK>
> [   75.771028]  ? show_trace_log_lvl+0x1b0/0x2f0
> [   75.771036]  ? show_trace_log_lvl+0x1b0/0x2f0
> [   75.771042]  ? do_one_initcall+0x58/0x310
> [   75.771048]  ? __die_body.cold+0x8/0x12
> [   75.771053]  ? die_addr+0x3c/0x60
> [   75.771059]  ? exc_general_protection+0x17d/0x400
> [   75.771066]  ? asm_exc_general_protection+0x26/0x30
> [   75.771074]  ? msft_monitor_device_del+0x93/0x170 [bluetooth]
> [   75.771095]  ? bt_init+0x54/0x1d0 [bluetooth]
> [   75.771114]  ? __pfx_bt_init+0x10/0x10 [bluetooth]
> [   75.771131]  ? do_one_initcall+0x58/0x310
> [   75.771137]  ? do_init_module+0x90/0x250
> [   75.771142]  ? init_module_from_file+0x86/0xc0
> [   75.771149]  ? idempotent_init_module+0x115/0x310
> [   75.771156]  ? __x64_sys_finit_module+0x65/0xc0
> [   75.771163]  ? do_syscall_64+0x82/0x160
> [   75.771168]  ? backing_file_read_iter+0x156/0x1f0
> [   75.771176]  ? ovl_read_iter+0x94/0xa0 [overlay]
> [   75.771189]  ? __pfx_ovl_file_accessed+0x10/0x10 [overlay]
> [   75.771199]  ? rseq_get_rseq_cs+0x1d/0x220
> [   75.771205]  ? rseq_ip_fixup+0x8d/0x1d0
> [   75.771210]  ? __seccomp_filter+0x303/0x520
> [   75.771216]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
> [   75.771224]  ? syscall_exit_to_user_mode+0x10/0x210
> [   75.771231]  ? do_syscall_64+0x8e/0x160
> [   75.771236]  ? do_sys_openat2+0x9c/0xe0
> [   75.771241]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
> [   75.771249]  ? syscall_exit_to_user_mode+0x10/0x210
> [   75.771255]  ? do_syscall_64+0x8e/0x160
> [   75.771260]  ? do_user_addr_fault+0x1ec/0x7b0
> [   75.771267]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   75.771274]  </TASK>
> [   75.771277] Modules linked in: bluetooth(+) rfkill snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject intel_rapl_msr intel_rapl_common nft_ct intel_uncore_frequency_common intel_pmc_core intel_vsec joydev nft_masq pmt_telemetry pmt_class nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni xhci_pci polyval_generic ghash_clmulni_intel xhci_hcd sha512_ssse3 sha256_ssse3 nf_tables sha1_ssse3 ehci_pci mei_me ehci_hcd pcspkr mei ata_generic pata_acpi i2c_piix4 i2c_smbus serio_raw xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse nfnetlink overlay xen_blkfront
> [   75.771370] ---[ end trace 0000000000000000 ]---
> [   75.771376] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> [   75.771397] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [   75.771416] RSP: 0000:ffffad644108fa40 EFLAGS: 00010246
> [   75.771422] RAX: ffff93da8a149600 RBX: c9d2315bc82c3810 RCX: 0000000100000000
> [   75.771431] RDX: 0000000000000001 RSI: ffff93da905e9180 RDI: ffff93da81404598
> [   75.771439] RBP: ffffad644108fa58 R08: 0000000000000064 R09: 00000000000012ab
> [   75.771446] R10: ffff93da81207000 R11: 0000000000000286 R12: ffffad644108fb00
> [   75.771454] R13: ffffad644108fa68 R14: ffff93da9089b840 R15: ffff93da8c265100
> [   75.771463] FS:  000078fa4cec4bc0(0000) GS:ffff93da97000000(0000) knlGS:0000000000000000
> [   75.771471] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   75.771477] CR2: 000074fa64aadc08 CR3: 00000000105d2006 CR4: 0000000000770ef0
> [   75.771485] PKRU: 55555554
> [   75.771488] Kernel panic - not syncing: Fatal exception
> [   75.771519] Kernel Offset: 0x3b800000 from 0xffffffff80200000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> Full log inside
> https://openqa.qubes-os.org/tests/124736/file/usbvm-var_log.tar.gz
> (log/xen/console/guest-sys-usb.log)
> 



* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-02 19:17         ` Jürgen Groß
@ 2025-01-02 19:39           ` Marek Marczykowski-Górecki
  2025-01-03  0:18             ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-01-02 19:39 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 3611 bytes --]

On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> > > On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > > > On 02.01.25 11:20, Jürgen Groß wrote:
> > > > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > It crashes on boot like below, most of the times. But sometimes (rarely)
> > > > > > it manages to stay alive. Below I'm pasting few of the crashes that look
> > > > > > distinctly different, if you follow the links, you can find more of
> > > > > > them. IMHO it looks like some memory corruption bug somewhere. I tested
> > > > > > also Linux 6.13-rc2 before, and it had very similar issue.
> > > > > 
> > > > > ...
> > > > > 
> > > > > > 
> > > > > > Full log:
> > > > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > > > 
> > > > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > > > 
> > > > > What is really interesting in the logs: most crashes seem to happen right
> > > > > after a module being loaded (in my reproducer it was right after loading
> > > > > the first module).
> > > > > 
> > > > > I need to go through the 6.13 commits, but I think I remember having seen
> > > > > a patch optimizing module loading by using large pages for addressing the
> > > > > loaded modules. Maybe the case of no large pages being available isn't
> > > > > handled properly.
> > > > 
> > > > Seems I was right.
> > > > 
> > > > For me the following diff fixes the issue. Marek, can you please confirm
> > > > it fixes your crashes, too?
> > > 
> > > Thanks for looking into it!
> > > Will do, I've pushed it to
> > > https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
> > > and then I'll post it to openQA.
> > 
> > It is much better!
> > 
> > Tests are still running, but I already see that many are green.
> 
> So are you fine with me adding your "Tested-by:"?

Yes.

> > There is
> > one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> > controllers passed through) crashes on a system with Raptor Lake CPU
> > (only, others, including ADL and MTL look fine):
> > 
> > [   75.770849] Bluetooth: Core ver 2.22
> > [   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> > [   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> > [   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> > [   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> > [   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> This code is looking suspicious. Large areas of binary 0 in a normal function?
> And the code itself is nonsense, as it is using a memory access via ES:, which
> doesn't make any sense in 64-bit kernel.

Could it still be something related to the module layout in memory?
The crash doesn't seem to be 100% reliable: in at least one instance
sys-usb remained running (unfortunately I haven't collected the full
sys-usb console log from a successful test...).

I just checked again that this crash didn't happen with any 6.12 or 6.11
kernels.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-02 19:39           ` Marek Marczykowski-Górecki
@ 2025-01-03  0:18             ` Marek Marczykowski-Górecki
  2025-01-03  0:42               ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-01-03  0:18 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: xen-devel, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 11197 bytes --]

On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> > On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > > On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > > > > On 02.01.25 11:20, Jürgen Groß wrote:
> > > > > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > It crashes on boot like below, most of the times. But sometimes (rarely)
> > > > > > > it manages to stay alive. Below I'm pasting few of the crashes that look
> > > > > > > distinctly different, if you follow the links, you can find more of
> > > > > > > them. IMHO it looks like some memory corruption bug somewhere. I tested
> > > > > > > also Linux 6.13-rc2 before, and it had very similar issue.
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > > 
> > > > > > > Full log:
> > > > > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > > > > 
> > > > > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > > > > 
> > > > > > What is really interesting in the logs: most crashes seem to happen right
> > > > > > after a module being loaded (in my reproducer it was right after loading
> > > > > > the first module).
> > > > > > 
> > > > > > I need to go through the 6.13 commits, but I think I remember having seen
> > > > > > a patch optimizing module loading by using large pages for addressing the
> > > > > > loaded modules. Maybe the case of no large pages being available isn't
> > > > > > handled properly.
> > > > > 
> > > > > Seems I was right.
> > > > > 
> > > > > For me the following diff fixes the issue. Marek, can you please confirm
> > > > > it fixes your crashes, too?
> > > > 
> > > > Thanks for looking into it!
> > > > Will do, I've pushed it to
> > > > https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
> > > > and then I'll post it to openQA.
> > > 
> > > It is much better!
> > > 
> > > Tests are still running, but I already see that many are green.
> > 
> > So are you fine with me adding your "Tested-by:"?
> 
> Yes.
> 
> > > There is
> > > one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> > > controllers passed through) crashes on a system with Raptor Lake CPU
> > > (only, others, including ADL and MTL look fine):

Correction: it does happen on some others too. I just got the crash on the
ADL system, although it looks a bit different ("Corrupted page table at ..."):

sys-usb login: [2025-01-02 23:44:58] [    7.295556] Bluetooth: hci0: Waiting for firmware download to complete
[    7.296996] Bluetooth: hci0: Firmware loaded in 2882606 usecs
[    7.297276] Bluetooth: hci0: Waiting for device to boot
[    7.313074] Bluetooth: hci0: Device booted in 15473 usecs
[    7.318447] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-1040-0041.ddc
[    7.321060] Bluetooth: hci0: Applying Intel DDC parameters completed
[    7.322057] Bluetooth: hci0: No support for BT device in ACPI firmware
[    7.324037] Bluetooth: hci0: Firmware timestamp 2024.33 buildtype 1 build 81755
[    7.324085] Bluetooth: hci0: Firmware SHA1: 0xd028ffe4
[    7.327995] Bluetooth: hci0: Fseq status: Success (0x00)
[    7.328017] Bluetooth: hci0: Fseq executed: 00.00.02.41
[    7.328032] Bluetooth: hci0: Fseq BT Top: 00.00.02.41
[    7.396950] Bluetooth: MGMT ver 1.23
[    9.352650] kauditd_printk_skb: 82 callbacks suppressed
[    9.352655] audit: type=1131 audit(1735861500.506:81): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   15.808157] audit: type=1100 audit(1735861506.961:82): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[   15.808860] audit: type=1100 audit(1735861506.962:83): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[   15.814137] audit: type=1103 audit(1735861506.967:84): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[   15.814816] audit: type=1006 audit(1735861506.968:85): pid=867 uid=0 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=tty7 old-ses=4294967295 ses=1 res=1
[   15.815078] audit: type=1300 audit(1735861506.968:85): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe29c03a70 a2=4 a3=0 items=0 ppid=712 pid=867 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=tty7 ses=1 comm="qubes-gui-runus" exe="/usr/bin/qubes-gui-runuser" subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 key=(null)
[   15.815164] audit: type=1327 audit(1735861506.968:85): proctitle=2F7573722F62696E2F71756265732D6775692D72756E757365720075736572002F62696E2F7368002D6C002D630065786563202F7573722F62696E2F78696E6974202F6574632F5831312F78696E69742F78696E69747263202D2D202F7573722F6C69622F71756265732F71756265732D786F72672D77726170706572203A30
[   15.815420] audit: type=1103 audit(1735861506.969:86): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[   15.816039] audit: type=1006 audit(1735861506.969:87): pid=866 uid=0 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=2 res=1
[   15.817029] audit: type=1300 audit(1735861506.969:87): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe550c1c30 a2=4 a3=0 items=0 ppid=864 pid=866 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=2 comm="qrexec-agent" exe="/usr/lib/qubes/qrexec-agent" subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 key=(null)
[   15.817160] audit: type=1327 audit(1735861506.969:87): proctitle="/usr/lib/qubes/qrexec-agent"
[   16.111133] systemd-journald[366]: Time jumped backwards, rotating.
Bluetooth: RFCOMM TTY layer initialized
[   18.286026] Bluetooth: RFCOMM socket layer initialized
[   18.286035] Bluetooth: RFCOMM ver 1.11
[   18.469074] abrt-dump-journ: Corrupted page table at address 78c64b600010
[   18.469096] PGD 14980067 P4D 14980067 PUD 14981067 PMD 38c8047 PTE 243c8b48ffffff57
[   18.469117] Oops: Bad pagetable: 000d [#1] PREEMPT SMP NOPTI
[   18.469132] CPU: 1 UID: 0 PID: 657 Comm: abrt-dump-journ Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
[   18.469152] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
[   18.469165] RIP: 0033:0x78c64e1bc9a0
[   18.469177] Code: 86 f5 01 00 00 49 8b 7c 24 38 48 85 ff 0f 84 08 03 00 00 48 8d 0d 40 e6 ff ff ba 18 00 00 00 e8 46 c7 fa ff e9 d1 01 00 00 90 <0f> b6 50 10 38 96 c8 01 00 00 0f 85 63 fd ff ff 80 fa 02 0f 84 4c
[   18.469211] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[   18.469223] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[   18.469238] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[   18.469253] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[   18.469268] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[   18.469284] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[   18.469299] FS:  000078c64d675400 GS:  0000000000000000
[   18.469310] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore rfcomm bnep btusb btrtl btintel btbcm btmtk bluetooth rfkill nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_ct nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 joydev nf_tables intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 xhci_pci ehci_pci xhci_hcd ehci_hcd pcspkr i2c_piix4 i2c_smbus ata_generic pata_acpi serio_raw xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse nfnetlink overlay xen_blkfront
[   18.469484] ---[ end trace 0000000000000000 ]---
[   18.469495] RIP: 0033:0x78c64e1bc9a0
[   18.469504] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[   18.469516] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[   18.469531] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[   18.469547] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[   18.469562] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[   18.469577] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[   18.469593] FS:  000078c64d675400(0000) GS:ffff9de397100000(0000) knlGS:0000000000000000
[   18.469609] CS:  0033 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.469623] CR2: 000078c64b600010 CR3: 0000000000164004 CR4: 0000000000770ef0
[   18.469640] PKRU: 55555554
[   18.469646] Kernel panic - not syncing: Fatal exception
[   18.469706] Kernel Offset: 0x2ec00000 from 0xffffffff80200000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
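
As an aside, the error code and the PTE value in the "Corrupted page table"
oops above can be decoded by hand. The sketch below is only an illustration:
the helper names and the MAXPHYADDR default of 46 bits are assumptions, not
something taken from this thread or from the test machines. The bit positions
follow the x86-64 4-level paging format.

```python
# Page-fault error code bits on x86-64 (bit number, short name).
PF_ERROR_BITS = [
    (0, "P"),     # fault hit a present entry (not a missing page)
    (1, "W"),     # access was a write
    (2, "U"),     # access came from user mode
    (3, "RSVD"),  # a reserved bit was set in a paging-structure entry
    (4, "I"),     # access was an instruction fetch
    (5, "PK"),    # protection-key violation
]

# Flag bits of a 4 KiB page-table entry (bit number, short name).
PTE_FLAG_BITS = [
    (0, "P"), (1, "RW"), (2, "US"), (3, "PWT"), (4, "PCD"),
    (5, "A"), (6, "D"), (7, "PAT"), (8, "G"), (63, "NX"),
]

# Bits 12..51 hold the physical frame address.
PTE_ADDR_MASK = 0x000FFFFFFFFFF000


def decode_pf_error(code):
    """Return the names of the bits set in a page-fault error code."""
    return [name for bit, name in PF_ERROR_BITS if (code >> bit) & 1]


def decode_pte(pte, maxphyaddr=46):
    """Split a PTE into flags and frame address.

    maxphyaddr is an assumed value; the real one is CPU specific
    (reported by CPUID leaf 0x80000008).
    """
    flags = [name for bit, name in PTE_FLAG_BITS if (pte >> bit) & 1]
    frame = pte & PTE_ADDR_MASK
    return {
        "flags": flags,
        "frame": frame,
        # Address bits at or above MAXPHYADDR are reserved and must be zero.
        "reserved_addr_bits": (frame >> maxphyaddr) != 0,
    }


if __name__ == "__main__":
    print(decode_pf_error(0x000D))
    print(decode_pte(0x243C8B48FFFFFF57))
```

For the values in the log, the error code 000d has the P, U and RSVD bits set,
and the PTE has bits 50 and 51 set in its address field. On a CPU whose
MAXPHYADDR is smaller than that, those bits are reserved, which would fit the
reserved-bit fault being reported.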


> > > [   75.770849] Bluetooth: Core ver 2.22
> > > [   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> > > [   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> > > [   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> > > [   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> > > [   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 
> > This code is looking suspicious. Large areas of binary 0 in a normal function?
> > And the code itself is nonsense, as it is using a memory access via ES:, which
> > doesn't make any sense in 64-bit kernel.
> 
> Could it be still something related to modules layout in memory?
> It seems it's not 100% reliable crash, I see in at least one instance
> sys-usb remained running (unfortunately I don't have collected full
> sys-usb console log from successful test...).
> 
> I just checked again that this crash didn't happen with any 6.12 or 6.11
> kernels.
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
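
The observation quoted above (long runs of zero bytes, and a faulting byte that
is an ES: segment-override prefix) can be checked mechanically on any "Code:"
line. This is a minimal sketch, not an existing tool: the function names are
made up for illustration, and the sample line at the bottom is a shortened
stand-in rather than the full line from the log.

```python
# Legacy segment-override prefix bytes on x86.
SEGMENT_PREFIXES = {
    0x26: "ES", 0x2E: "CS", 0x36: "SS", 0x3E: "DS", 0x64: "FS", 0x65: "GS",
}


def parse_code_line(line):
    """Return (code_bytes, fault_index) for a kernel oops 'Code:' line.

    The byte at the faulting RIP is the one printed in <angle brackets>.
    """
    tokens = line.split("Code:", 1)[1].split()
    data = []
    fault_index = None
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            fault_index = i
            tok = tok.strip("<>")
        data.append(int(tok, 16))
    return bytes(data), fault_index


def summarize(line):
    """Collect the facts that make a 'Code:' dump look suspicious."""
    data, idx = parse_code_line(line)
    fault_byte = data[idx]
    return {
        "length": len(data),
        "zero_bytes": data.count(0),
        "fault_byte": fault_byte,
        # None when the faulting byte is not a segment-override prefix.
        "segment_prefix": SEGMENT_PREFIXES.get(fault_byte),
    }


if __name__ == "__main__":
    # Shortened sample in the same format as the oops above.
    sample = "[   75.770924] Code: 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00"
    print(summarize(sample))
```

A dump that is mostly zero bytes inside what should be module text suggests
the text was never written, or was overwritten, at that address.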



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-03  0:18             ` Marek Marczykowski-Górecki
@ 2025-01-03  0:42               ` Marek Marczykowski-Górecki
  2025-01-03  2:00                 ` Andrew Cooper
  0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-01-03  0:42 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: xen-devel, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 4652 bytes --]

On Fri, Jan 03, 2025 at 01:18:31AM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> > On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> > > On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > > > > > On 02.01.25 11:20, Jürgen Groß wrote:
> > > > > > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > It crashes on boot like below, most of the times. But sometimes (rarely)
> > > > > > > > it manages to stay alive. Below I'm pasting few of the crashes that look
> > > > > > > > distinctly different, if you follow the links, you can find more of
> > > > > > > > them. IMHO it looks like some memory corruption bug somewhere. I tested
> > > > > > > > also Linux 6.13-rc2 before, and it had very similar issue.
> > > > > > > 
> > > > > > > ...
> > > > > > > 
> > > > > > > > 
> > > > > > > > Full log:
> > > > > > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > > > > > 
> > > > > > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > > > > > 
> > > > > > > What is really interesting in the logs: most crashes seem to happen right
> > > > > > > after a module being loaded (in my reproducer it was right after loading
> > > > > > > the first module).
> > > > > > > 
> > > > > > > I need to go through the 6.13 commits, but I think I remember having seen
> > > > > > > a patch optimizing module loading by using large pages for addressing the
> > > > > > > loaded modules. Maybe the case of no large pages being available isn't
> > > > > > > handled properly.
> > > > > > 
> > > > > > Seems I was right.
> > > > > > 
> > > > > > For me the following diff fixes the issue. Marek, can you please confirm
> > > > > > it fixes your crashes, too?
> > > > > 
> > > > > Thanks for looking into it!
> > > > > Will do, I've pushed it to
> > > > > https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
> > > > > and then I'll post it to openQA.
> > > > 
> > > > It is much better!
> > > > 
> > > > Tests are still running, but I already see that many are green.
> > > 
> > > So are you fine with me adding your "Tested-by:"?
> > 
> > Yes.
> > 
> > > > There is
> > > > one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> > > > controllers passed through) crashes on a system with Raptor Lake CPU
> > > > (only, others, including ADL and MTL look fine):
> 
> Correction, it does happen on some others too, just got the crash on the ADL
> system, although looks a bit different ("Corrupted page table at ..."):

I've collected some more of them at https://github.com/QubesOS/qubes-issues/issues/9681

Should I start a new thread for this? On one hand, it's a different domain
type (HVM), but on the other hand, many of the crashes are around
loading modules too.

> > > > [   75.770849] Bluetooth: Core ver 2.22
> > > > [   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> > > > [   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> > > > [   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> > > > [   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> > > > [   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 
> > > This code is looking suspicious. Large areas of binary 0 in a normal function?
> > > And the code itself is nonsense, as it is using a memory access via ES:, which
> > > doesn't make any sense in 64-bit kernel.
> > 
> > Could it be still something related to modules layout in memory?
> > It seems it's not 100% reliable crash, I see in at least one instance
> > sys-usb remained running (unfortunately I don't have collected full
> > sys-usb console log from successful test...).
> > 
> > I just checked again that this crash didn't happen with any 6.12 or 6.11
> > kernels.
> > 
> > -- 
> > Best Regards,
> > Marek Marczykowski-Górecki
> > Invisible Things Lab
> 
> 
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



* Re: Linux 6.13-rc3 many different panics in Xen PV dom0
  2025-01-03  0:42               ` Marek Marczykowski-Górecki
@ 2025-01-03  2:00                 ` Andrew Cooper
  2025-01-03 18:09                   ` Linux 6.13-rc5 Xen HVM with PCI passthrough (USB controller) crash Marek Marczykowski-Górecki
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Cooper @ 2025-01-03  2:00 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Jürgen Groß; +Cc: xen-devel

On 03/01/2025 12:42 am, Marek Marczykowski-Górecki wrote:
> On Fri, Jan 03, 2025 at 01:18:31AM +0100, Marek Marczykowski-Górecki wrote:
>> On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
>>> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
>>>> On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
>>>>> On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
>>>>>> On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
>>>>>>> On 02.01.25 11:20, Jürgen Groß wrote:
>>>>>>>> On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It crashes on boot like below, most of the times. But sometimes (rarely)
>>>>>>>>> it manages to stay alive. Below I'm pasting few of the crashes that look
>>>>>>>>> distinctly different, if you follow the links, you can find more of
>>>>>>>>> them. IMHO it looks like some memory corruption bug somewhere. I tested
>>>>>>>>> also Linux 6.13-rc2 before, and it had very similar issue.
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> Full log:
>>>>>>>>> https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
>>>>>>>> I can reproduce a crash with 6.13-rc5 PV dom0.
>>>>>>>>
>>>>>>>> What is really interesting in the logs: most crashes seem to happen right
>>>>>>>> after a module being loaded (in my reproducer it was right after loading
>>>>>>>> the first module).
>>>>>>>>
>>>>>>>> I need to go through the 6.13 commits, but I think I remember having seen
>>>>>>>> a patch optimizing module loading by using large pages for addressing the
>>>>>>>> loaded modules. Maybe the case of no large pages being available isn't
>>>>>>>> handled properly.
>>>>>>> Seems I was right.
>>>>>>>
>>>>>>> For me the following diff fixes the issue. Marek, can you please confirm
>>>>>>> it fixes your crashes, too?
>>>>>> Thanks for looking into it!
>>>>>> Will do, I've pushed it to
>>>>>> https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
>>>>>> and then I'll post it to openQA.
>>>>> It is much better!
>>>>>
>>>>> Tests are still running, but I already see that many are green.
>>>> So are you fine with me adding your "Tested-by:"?
>>> Yes.
>>>
>>>>> There is
>>>>> one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
>>>>> controllers passed through) crashes on a system with Raptor Lake CPU
>>>>> (only, others, including ADL and MTL look fine):
>> Correction, it does happen on some others too, just got the crash on the ADL
>> system, although looks a bit different ("Corrupted page table at ..."):
> I've collected some more of them at https://github.com/QubesOS/qubes-issues/issues/9681
>
> Should I start a new thread for this? On one hand, it's a different domain
> type (HVM), but on the other hand, many of the crashes are around
> loading modules too.

https://lore.kernel.org/lkml/20241227072825.1288491-1-rppt@kernel.org/T/#t
looks relevant.  Probably worth following up.

~Andrew



* Re: Linux 6.13-rc5 Xen HVM with PCI passthrough (USB controller) crash
  2025-01-03  2:00                 ` Andrew Cooper
@ 2025-01-03 18:09                   ` Marek Marczykowski-Górecki
  2025-01-03 18:32                     ` Geert Uytterhoeven
  0 siblings, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-01-03 18:09 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jürgen Groß, xen-devel, Mike Rapoport, Luis Chamberlain,
	Andreas Larsson, Andy Lutomirski, Ard Biesheuvel, Arnd Bergmann,
	Borislav Petkov, Brian Cain, Catalin Marinas, Christophe Leroy,
	Christoph Hellwig, Dave Hansen, Dinh Nguyen, Geert Uytterhoeven,
	Guo Ren, Helge Deller, Huacai Chen, Ingo Molnar, Johannes Berg,
	John Paul Adrian Glaubitz, Kent Overstreet, Liam R. Howlett,
	Mark Rutland, Masami Hiramatsu, Matt Turner, Max Filippov,
	Michael Ellerman, Michal Simek, Oleg Nesterov, Palmer Dabbelt,
	Peter Zijlstra, Richard Weinberger, Russell King, Song Liu,
	Stafford Horne, Steven Rostedt, Suren Baghdasaryan,
	Thomas Bogendoerfer, Thomas Gleixner, Uladzislau Rezki,
	Vineet Gupta, Will Deacon, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 4245 bytes --]

On Fri, Jan 03, 2025 at 02:00:28AM +0000, Andrew Cooper wrote:
> On 03/01/2025 12:42 am, Marek Marczykowski-Górecki wrote:
> > On Fri, Jan 03, 2025 at 01:18:31AM +0100, Marek Marczykowski-Górecki wrote:
> >> On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> >>> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> >>>> On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> >>>>> There is
> >>>>> one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> >>>>> controllers passed through) crashes on a system with Raptor Lake CPU
> >>>>> (only, others, including ADL and MTL look fine):
> >> Correction, it does happen on some others too, just got the crash on the ADL
> >> system, although looks a bit different ("Corrupted page table at ..."):
> > I've collected some more of them at https://github.com/QubesOS/qubes-issues/issues/9681
> >
> > Should I start a new thread for this? On one hand, it's a different domain
> > type (HVM), but on the other hand, many of the crashes are around
> > loading modules too.
> 
> https://lore.kernel.org/lkml/20241227072825.1288491-1-rppt@kernel.org/T/#t
> looks relevant.  Probably worth following up.

As I replied there, I don't think so, as that series is not part of
6.13-rc5. But in the meantime, I bisected it and got this commit:

5185e7f9f3bd754ab60680814afd714e2673ef88 is the first bad commit
commit 5185e7f9f3bd754ab60680814afd714e2673ef88
Author: Mike Rapoport (Microsoft) <rppt@kernel.org>
Date:   Wed Oct 23 19:27:11 2024 +0300

    x86/module: enable ROX caches for module text on 64 bit
    
    Enable execmem's cache of PMD_SIZE'ed pages mapped as ROX for module text
    allocations on 64 bit.
    
    Link: https://lkml.kernel.org/r/20241023162711.2579610-9-rppt@kernel.org
    Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
    Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
    Tested-by: kdevops <kdevops@lists.linux.dev>
    Cc: Andreas Larsson <andreas@gaisler.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Ard Biesheuvel <ardb@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Borislav Petkov (AMD) <bp@alien8.de>
    Cc: Brian Cain <bcain@quicinc.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Dinh Nguyen <dinguyen@kernel.org>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Guo Ren <guoren@kernel.org>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Huacai Chen <chenhuacai@kernel.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Johannes Berg <johannes@sipsolutions.net>
    Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Max Filippov <jcmvbkbc@gmail.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Michal Simek <monstr@monstr.eu>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Palmer Dabbelt <palmer@dabbelt.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Song Liu <song@kernel.org>
    Cc: Stafford Horne <shorne@gmail.com>
    Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Cc: Vineet Gupta <vgupta@kernel.org>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

 arch/x86/Kconfig   |  1 +
 arch/x86/mm/init.c | 37 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 1 deletion(-)

I'm extending CC...

See initial quoted part for the issue description, and link to collected
crash messages.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



* Re: Linux 6.13-rc5 Xen HVM with PCI passthrough (USB controller) crash
  2025-01-03 18:09                   ` Linux 6.13-rc5 Xen HVM with PCI passthrough (USB controller) crash Marek Marczykowski-Górecki
@ 2025-01-03 18:32                     ` Geert Uytterhoeven
  0 siblings, 0 replies; 15+ messages in thread
From: Geert Uytterhoeven @ 2025-01-03 18:32 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Andrew Cooper, Jürgen Groß, xen-devel, Mike Rapoport,
	Luis Chamberlain, Andreas Larsson, Andy Lutomirski,
	Ard Biesheuvel, Arnd Bergmann, Borislav Petkov, Brian Cain,
	Catalin Marinas, Christophe Leroy, Christoph Hellwig, Dave Hansen,
	Dinh Nguyen, Guo Ren, Helge Deller, Huacai Chen, Ingo Molnar,
	Johannes Berg, John Paul Adrian Glaubitz, Kent Overstreet,
	Liam R. Howlett, Mark Rutland, Masami Hiramatsu, Matt Turner,
	Max Filippov, Michael Ellerman, Michal Simek, Oleg Nesterov,
	Palmer Dabbelt, Peter Zijlstra, Richard Weinberger, Russell King,
	Song Liu, Stafford Horne, Steven Rostedt, Suren Baghdasaryan,
	Thomas Bogendoerfer, Thomas Gleixner, Uladzislau Rezki,
	Vineet Gupta, Will Deacon, Andrew Morton

Hi Marek,

On Fri, Jan 3, 2025 at 7:10 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
> On Fri, Jan 03, 2025 at 02:00:28AM +0000, Andrew Cooper wrote:
> > On 03/01/2025 12:42 am, Marek Marczykowski-Górecki wrote:
> > > On Fri, Jan 03, 2025 at 01:18:31AM +0100, Marek Marczykowski-Górecki wrote:
> > >> On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> > >>> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> > >>>> On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > >>>>> There is
> > >>>>> one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> > >>>>> controllers passed through) crashes on a system with Raptor Lake CPU
> > >>>>> (only, others, including ADL and MTL look fine):
> > >> Correction, it does happen on some others too, just got the crash on the ADL
> > >> system, although looks a bit different ("Corrupted page table at ..."):
> > > I've collected some more of them at https://github.com/QubesOS/qubes-issues/issues/9681
> > >
> > > Should I start a new thread for this? On one hand, it's a different domain
> > > type (HVM), but on the other hand, many of the crashes are around
> > > loading modules too.
> >
> > https://lore.kernel.org/lkml/20241227072825.1288491-1-rppt@kernel.org/T/#t
> > looks relevant.  Probably worth following up.
>
> As I replied there, I don't think so, as that series is not part of
> 6.13-rc5. But in the meantime, I bisected it and got this commit:
>
> 5185e7f9f3bd754ab60680814afd714e2673ef88 is the first bad commit
> commit 5185e7f9f3bd754ab60680814afd714e2673ef88
> Author: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Date:   Wed Oct 23 19:27:11 2024 +0300
>
>     x86/module: enable ROX caches for module text on 64 bit
>
>     Enable execmem's cache of PMD_SIZE'ed pages mapped as ROX for module text
>     allocations on 64 bit.
>
>     Link: https://lkml.kernel.org/r/20241023162711.2579610-9-rppt@kernel.org
>     Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>     Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>     Tested-by: kdevops <kdevops@lists.linux.dev>
>     Cc: Andreas Larsson <andreas@gaisler.com>
>     Cc: Andy Lutomirski <luto@kernel.org>
>     Cc: Ard Biesheuvel <ardb@kernel.org>
>     Cc: Arnd Bergmann <arnd@arndb.de>
>     Cc: Borislav Petkov (AMD) <bp@alien8.de>
>     Cc: Brian Cain <bcain@quicinc.com>
>     Cc: Catalin Marinas <catalin.marinas@arm.com>
>     Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
>     Cc: Christoph Hellwig <hch@lst.de>
>     Cc: Dave Hansen <dave.hansen@linux.intel.com>
>     Cc: Dinh Nguyen <dinguyen@kernel.org>
>     Cc: Geert Uytterhoeven <geert@linux-m68k.org>
>     Cc: Guo Ren <guoren@kernel.org>
>     Cc: Helge Deller <deller@gmx.de>
>     Cc: Huacai Chen <chenhuacai@kernel.org>
>     Cc: Ingo Molnar <mingo@redhat.com>
>     Cc: Johannes Berg <johannes@sipsolutions.net>
>     Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
>     Cc: Kent Overstreet <kent.overstreet@linux.dev>
>     Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
>     Cc: Mark Rutland <mark.rutland@arm.com>
>     Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>     Cc: Matt Turner <mattst88@gmail.com>
>     Cc: Max Filippov <jcmvbkbc@gmail.com>
>     Cc: Michael Ellerman <mpe@ellerman.id.au>
>     Cc: Michal Simek <monstr@monstr.eu>
>     Cc: Oleg Nesterov <oleg@redhat.com>
>     Cc: Palmer Dabbelt <palmer@dabbelt.com>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Richard Weinberger <richard@nod.at>
>     Cc: Russell King <linux@armlinux.org.uk>
>     Cc: Song Liu <song@kernel.org>
>     Cc: Stafford Horne <shorne@gmail.com>
>     Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
>     Cc: Suren Baghdasaryan <surenb@google.com>
>     Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
>     Cc: Thomas Gleixner <tglx@linutronix.de>
>     Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>     Cc: Vineet Gupta <vgupta@kernel.org>
>     Cc: Will Deacon <will@kernel.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>
>  arch/x86/Kconfig   |  1 +
>  arch/x86/mm/init.c | 37 ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 37 insertions(+), 1 deletion(-)
>
> I'm extending CC...

Do you really think adding all non-Intel maintainers will help fix
an Intel-specific problem? Please do not do that.
Thanks!

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds



Thread overview: 15+ messages
2024-12-19 16:14 Linux 6.13-rc3 many different panics in Xen PV dom0 Marek Marczykowski-Górecki
2024-12-20  1:48 ` Marek Marczykowski-Górecki
2024-12-26 18:48   ` Marek Marczykowski-Górecki
2025-01-02 10:20 ` Jürgen Groß
2025-01-02 11:30   ` Juergen Gross
2025-01-02 12:24     ` Marek Marczykowski-Górecki
2025-01-02 18:54       ` Marek Marczykowski-Górecki
2025-01-02 19:04         ` Andrew Cooper
2025-01-02 19:17         ` Jürgen Groß
2025-01-02 19:39           ` Marek Marczykowski-Górecki
2025-01-03  0:18             ` Marek Marczykowski-Górecki
2025-01-03  0:42               ` Marek Marczykowski-Górecki
2025-01-03  2:00                 ` Andrew Cooper
2025-01-03 18:09                   ` Linux 6.13-rc5 Xen HVM with PCI passthrough (USB controller) crash Marek Marczykowski-Górecki
2025-01-03 18:32                     ` Geert Uytterhoeven
