[Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S
@ 2024-04-05 20:33 bugzilla-daemon
  2024-04-05 22:32 ` Sean Christopherson
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: bugzilla-daemon @ 2024-04-05 20:33 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218684

            Bug ID: 218684
           Summary: CPU soft lockups in KVM VMs on kernel 6.x after
                    switching hypervisor from C8S to C9S
           Product: Virtualization
           Version: unspecified
          Hardware: Intel
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: kvm
          Assignee: virtualization_kvm@kernel-bugs.osdl.org
          Reporter: frantisek@sumsal.cz
        Regression: No

Hey!

I'm currently in the middle of moving some of our hypervisors for upstream
systemd CI from CentOS Stream 8 to CentOS Stream 9 (as the former will go EOL
soon), and started hitting soft lockups on the guest machines (Arch Linux, both
with "stock" kernel and mainline one).

The hypervisors are AWS EC2 C5n Metal instances [0] running CentOS Stream,
which then run Arch Linux (KVM) VMs (using libvirt via Vagrant) - cpuinfo from
one of the guests is at [1].

The "production" hypervisors currently run CentOS Stream 8 (kernel
4.18.0-548.el8.x86_64) and everything is fine. However, after trying to upgrade
a couple of them to CentOS Stream 9 (kernel 5.14.0-432.el9.x86_64) the guests
started exhibiting frequent soft lockups when running just the systemd unit
test suite.

I played around with the kernel version inside the guest and it doesn't seem to
make a difference - I could reproduce the issue with kernels 6.6.24-1
(linux-lts), 6.8.2.arch2-1 (linux), and 6.9.0-rc2-1-mainline (linux-mainline).

One example with the mainline kernel on the guest:
[   58.800435] (sh)[5787]: Found cgroup2 on /sys/fs/cgroup/, full unified
hierarchy
[   58.800578] (sh)[5787]: Found cgroup2 on /sys/fs/cgroup/, full unified
hierarchy
[   75.709475] kernel: watchdog: BUG: soft lockup - CPU#30 stuck for 16s!
[swapper/30:0]
[   75.728228] kernel: Modules linked in: dummy rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace sunrpc netfs snd_seq_dummy snd_hrtimer
snd_seq snd_seq_device snd_timer snd soundcore intel_rapl_msr intel_rapl_common
intel_uncore_frequency_common isst_if_common nfit libnvdimm 8021q garp mrp stp
llc cbc encrypted_keys trusted asn1_encoder tee vfat fat kvm_intel kvm
crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul
ghash_clmulni_intel sha512_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl
cfg80211 rfkill iTCO_wdt joydev intel_pmc_bxt mousedev iTCO_vendor_support
i2c_i801 psmouse i2c_smbus pcspkr lpc_ich mac_hid fuse loop dm_mod nfnetlink
vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock
vmw_vmci qemu_fw_cfg ip_tables x_tables btrfs blake2b_generic libcrc32c
crc32c_generic xor raid6_pq serio_raw atkbd virtio_net libps2 net_failover
virtio_blk failover vivaldi_fmap virtio_balloon virtio_rng virtio_pci
crc32c_intel intel_agp sha256_ssse3 virtio_pci_legacy_dev i8042
[   75.796349] kernel:  xhci_pci intel_gtt virtio_pci_modern_dev
xhci_pci_renesas cirrus serio
[   75.796396] kernel: CPU: 30 PID: 0 Comm: swapper/30 Not tainted
6.9.0-rc2-1-mainline #1 4c361158dea5838a7441c4025165296045a48154
[   75.796413] kernel: Hardware name: Red Hat KVM/RHEL, BIOS
edk2-20240214-1.el9 02/14/2024
[   75.796414] kernel: RIP: 0010:pv_native_safe_halt+0xf/0x20
[   75.796421] kernel: Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 23 db 24 00 fb f4 <c3>
cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[   75.796423] kernel: RSP: 0018:ffffa945c01b7ed8 EFLAGS: 00000246
[   75.796428] kernel: RAX: 000000000000001e RBX: ffff921dc0b33000 RCX:
ffff921de2bf2908
[   75.796431] kernel: RDX: 000000000000001e RSI: 000000000000001e RDI:
00000000001737f4
[   75.796433] kernel: RBP: 000000000000001e R08: 0000000000000001 R09:
0000000000000000
[   75.796434] kernel: R10: 0000000000000001 R11: 0000000000000000 R12:
0000000000000000
[   75.796436] kernel: R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
[   75.796437] kernel: FS:  0000000000000000(0000) GS:ffff923cbe900000(0000)
knlGS:0000000000000000
[   75.796439] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.796441] kernel: CR2: 00007f08c013e000 CR3: 000000186ac20002 CR4:
0000000000770ef0
[   75.796444] kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   75.796445] kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   75.796446] kernel: PKRU: 55555554
[   75.796447] kernel: Call Trace:
[   75.796450] kernel:  <IRQ>
[   75.800549] kernel:  ? watchdog_timer_fn+0x1dd/0x260
[   75.800553] kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
[   75.800556] kernel:  ? __hrtimer_run_queues+0x10f/0x2a0
[   75.800560] kernel:  ? hrtimer_interrupt+0xfa/0x230
[   75.800563] kernel:  ? __sysvec_apic_timer_interrupt+0x55/0x150
[   75.800567] kernel:  ? sysvec_apic_timer_interrupt+0x6c/0x90
[   75.800569] kernel:  </IRQ>
[   75.800569] kernel:  <TASK>
[   75.800571] kernel:  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   75.800590] kernel:  ? pv_native_safe_halt+0xf/0x20
[   75.800593] kernel:  default_idle+0x9/0x20
[   75.800596] kernel:  default_idle_call+0x30/0x100
[   75.800598] kernel:  do_idle+0x1cb/0x210
[   75.800603] kernel:  cpu_startup_entry+0x29/0x30
[   75.800606] kernel:  start_secondary+0x11c/0x140
[   75.800610] kernel:  common_startup_64+0x13e/0x141
[   75.800616] kernel:  </TASK>
[   75.800619] kernel: watchdog: BUG: soft lockup - CPU#59 stuck for 16s!
[swapper/59:0]
[   75.802944] kernel: Modules linked in: dummy rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace sunrpc netfs snd_seq_dummy snd_hrtimer
snd_seq snd_seq_device snd_timer snd soundcore intel_rapl_msr intel_rapl_common
intel_uncore_frequency_common isst_if_common nfit libnvdimm 8021q garp mrp stp
llc cbc encrypted_keys trusted asn1_encoder tee vfat fat kvm_intel kvm
crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul
ghash_clmulni_intel sha512_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl
cfg80211 rfkill iTCO_wdt joydev intel_pmc_bxt mousedev iTCO_vendor_support
i2c_i801 psmouse i2c_smbus pcspkr lpc_ich mac_hid fuse loop dm_mod nfnetlink
vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock
vmw_vmci qemu_fw_cfg ip_tables x_tables btrfs blake2b_generic libcrc32c
crc32c_generic xor raid6_pq serio_raw atkbd virtio_net libps2 net_failover
virtio_blk failover vivaldi_fmap virtio_balloon virtio_rng virtio_pci
crc32c_intel intel_agp sha256_ssse3 virtio_pci_legacy_dev i8042
[   75.803034] kernel:  xhci_pci intel_gtt virtio_pci_modern_dev
xhci_pci_renesas cirrus serio
[   75.803043] kernel: CPU: 59 PID: 0 Comm: swapper/59 Tainted: G             L
    6.9.0-rc2-1-mainline #1 4c361158dea5838a7441c4025165296045a48154
[   75.803048] kernel: Hardware name: Red Hat KVM/RHEL, BIOS
edk2-20240214-1.el9 02/14/2024
[   75.803050] kernel: RIP: 0010:pv_native_safe_halt+0xf/0x20
[   75.803057] kernel: Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 23 db 24 00 fb f4 <c3>
cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[   75.803059] kernel: RSP: 0018:ffffa945c029fed8 EFLAGS: 00000252
[   75.803062] kernel: RAX: 000000000000003b RBX: ffff921dc0c78000 RCX:
ffff921e01edf5a8
[   75.803064] kernel: RDX: 000000000000003b RSI: 000000000000003b RDI:
0000000000121a9c
[   75.803065] kernel: RBP: 000000000000003b R08: 0000000000000001 R09:
0000000000000000
[   75.803067] kernel: R10: 0000000000000001 R11: 0000000000000000 R12:
0000000000000000
[   75.803069] kernel: R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
[   75.803070] kernel: FS:  0000000000000000(0000) GS:ffff923cbf780000(0000)
knlGS:0000000000000000
[   75.803072] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.803074] kernel: CR2: 00007f08c013e000 CR3: 000000186ac20005 CR4:
0000000000770ef0
[   75.803078] kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   75.803079] kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   75.803081] kernel: PKRU: 55555554
[   75.803082] kernel: Call Trace:
[   75.803085] kernel:  <IRQ>
[   75.803091] kernel:  ? watchdog_timer_fn+0x1dd/0x260
[   75.803096] kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
[   75.803098] kernel:  ? __hrtimer_run_queues+0x10f/0x2a0
[   75.803103] kernel:  ? hrtimer_interrupt+0xfa/0x230
[   75.803106] kernel:  ? __sysvec_apic_timer_interrupt+0x55/0x150
[   75.803111] kernel:  ? sysvec_apic_timer_interrupt+0x6c/0x90
[   75.803113] kernel:  </IRQ>
[   75.803114] kernel:  <TASK>
[   75.803116] kernel:  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   75.803122] kernel:  ? pv_native_safe_halt+0xf/0x20
[   75.803125] kernel:  default_idle+0x9/0x20
[   75.803129] kernel:  default_idle_call+0x30/0x100
[   75.803131] kernel:  do_idle+0x1cb/0x210
[   75.803137] kernel:  cpu_startup_entry+0x29/0x30
[   75.803139] kernel:  start_secondary+0x11c/0x140
[   75.803145] kernel:  common_startup_64+0x13e/0x141
[   75.803151] kernel:  </TASK>
...

The full stack trace is quite long, please see [2] which is a complete journal
from the machine boot.

I can reproduce the soft lockups quite reliably. However, I'm not really sure
who to blame here, since there's too many moving parts. But my first question
would be - is this an issue of the guest (Arch Linux) kernel, or is it/can it
be caused by the host kernel? I did some testing and if I use the _same_ image
on the C8S hypervisor (with all three guest kernel variations), the issue
disappears, but I'm still not convinced that this might be the host kernel
doings. I'm more than happy to debug this further, but currently I have no idea
where to start, so any hint would be greatly appreciated.

[0] https://aws.amazon.com/ec2/instance-types/c5/
[1]
https://mrc0mmand.fedorapeople.org/kernel_kvm_softlockup/arch-mainline-cpuinfo.log
[2]
https://mrc0mmand.fedorapeople.org/kernel_kvm_softlockup/arch-mainline-journal.log

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S
  2024-04-05 20:33 [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S bugzilla-daemon
@ 2024-04-05 22:32 ` Sean Christopherson
  2024-04-05 22:32 ` [Bug 218684] " bugzilla-daemon
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Sean Christopherson @ 2024-04-05 22:32 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: kvm

On Fri, Apr 05, 2024, bugzilla-daemon@kernel.org wrote:
> I'm currently in the middle of moving some of our hypervisors for upstream
> systemd CI from CentOS Stream 8 to CentOS Stream 9 (as the former will go EOL
> soon), and started hitting soft lockups on the guest machines (Arch Linux, both
> with "stock" kernel and mainline one).
> 
> The hypervisors are AWS EC2 C5n Metal instances [0] running CentOS Stream,
> which then run Arch Linux (KVM) VMs (using libvirt via Vagrant) - cpuinfo from
> one of the guests is at [1].
> 
> The "production" hypervisors currently run CentOS Stream 8 (kernel
> 4.18.0-548.el8.x86_64) and everything is fine. However, after trying to upgrade
> a couple of them to CentOS Stream 9 (kernel 5.14.0-432.el9.x86_64) the guests
> started exhibiting frequent soft lockups when running just the systemd unit
> test suite.

...

> [   75.796414] kernel: RIP: 0010:pv_native_safe_halt+0xf/0x20
v> [   75.796421] kernel: Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90
> 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 23 db 24 00 fb f4 <c3>

...

> [   75.796447] kernel: Call Trace:
> [   75.796450] kernel:  <IRQ>
> [   75.800549] kernel:  ? watchdog_timer_fn+0x1dd/0x260
> [   75.800553] kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
> [   75.800556] kernel:  ? __hrtimer_run_queues+0x10f/0x2a0
> [   75.800560] kernel:  ? hrtimer_interrupt+0xfa/0x230
> [   75.800563] kernel:  ? __sysvec_apic_timer_interrupt+0x55/0x150
> [   75.800567] kernel:  ? sysvec_apic_timer_interrupt+0x6c/0x90
> [   75.800569] kernel:  </IRQ>
> [   75.800569] kernel:  <TASK>
> [   75.800571] kernel:  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [   75.800590] kernel:  ? pv_native_safe_halt+0xf/0x20
> [   75.800593] kernel:  default_idle+0x9/0x20
> [   75.800596] kernel:  default_idle_call+0x30/0x100
> [   75.800598] kernel:  do_idle+0x1cb/0x210
> [   75.800603] kernel:  cpu_startup_entry+0x29/0x30
> [   75.800606] kernel:  start_secondary+0x11c/0x140
> [   75.800610] kernel:  common_startup_64+0x13e/0x141
> [   75.800616] kernel:  </TASK>

Hmm, the vCPU is stuck in the idle HLT loop, which suggests that the vCPU isn't
waking up when it should.  But it does obviously get the hrtimer interrupt, so
it's not completely hosed.

Are you able to test custom kernels?  If so, bisecting the host kernel is likely
the easiest way to figure out what's going on.  It might not be the _fastest_,
but it should be straightforward, and shouldn't require much KVM expertise, i.e.
won't require lengthy back-and-forth discussions if no one immediately spots a
bug.

And before bisecting, it'd be worth seeing if an upstream host kernel has the
same problem, e.g. if upstream works, it might be easier/faster to bisect to a
fix, than to bisect to a bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 218684] CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S
  2024-04-05 20:33 [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S bugzilla-daemon
  2024-04-05 22:32 ` Sean Christopherson
@ 2024-04-05 22:32 ` bugzilla-daemon
  2024-04-08 10:17 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2024-04-05 22:32 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218684

--- Comment #1 from Sean Christopherson (seanjc@google.com) ---
On Fri, Apr 05, 2024, bugzilla-daemon@kernel.org wrote:
> I'm currently in the middle of moving some of our hypervisors for upstream
> systemd CI from CentOS Stream 8 to CentOS Stream 9 (as the former will go EOL
> soon), and started hitting soft lockups on the guest machines (Arch Linux,
> both
> with "stock" kernel and mainline one).
> 
> The hypervisors are AWS EC2 C5n Metal instances [0] running CentOS Stream,
> which then run Arch Linux (KVM) VMs (using libvirt via Vagrant) - cpuinfo
> from
> one of the guests is at [1].
> 
> The "production" hypervisors currently run CentOS Stream 8 (kernel
> 4.18.0-548.el8.x86_64) and everything is fine. However, after trying to
> upgrade
> a couple of them to CentOS Stream 9 (kernel 5.14.0-432.el9.x86_64) the guests
> started exhibiting frequent soft lockups when running just the systemd unit
> test suite.

...

> [   75.796414] kernel: RIP: 0010:pv_native_safe_halt+0xf/0x20
v> [   75.796421] kernel: Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90
90
> 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 23 db 24 00 fb f4
> <c3>

...

> [   75.796447] kernel: Call Trace:
> [   75.796450] kernel:  <IRQ>
> [   75.800549] kernel:  ? watchdog_timer_fn+0x1dd/0x260
> [   75.800553] kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
> [   75.800556] kernel:  ? __hrtimer_run_queues+0x10f/0x2a0
> [   75.800560] kernel:  ? hrtimer_interrupt+0xfa/0x230
> [   75.800563] kernel:  ? __sysvec_apic_timer_interrupt+0x55/0x150
> [   75.800567] kernel:  ? sysvec_apic_timer_interrupt+0x6c/0x90
> [   75.800569] kernel:  </IRQ>
> [   75.800569] kernel:  <TASK>
> [   75.800571] kernel:  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [   75.800590] kernel:  ? pv_native_safe_halt+0xf/0x20
> [   75.800593] kernel:  default_idle+0x9/0x20
> [   75.800596] kernel:  default_idle_call+0x30/0x100
> [   75.800598] kernel:  do_idle+0x1cb/0x210
> [   75.800603] kernel:  cpu_startup_entry+0x29/0x30
> [   75.800606] kernel:  start_secondary+0x11c/0x140
> [   75.800610] kernel:  common_startup_64+0x13e/0x141
> [   75.800616] kernel:  </TASK>

Hmm, the vCPU is stuck in the idle HLT loop, which suggests that the vCPU isn't
waking up when it should.  But it does obviously get the hrtimer interrupt, so
it's not completely hosed.

Are you able to test custom kernels?  If so, bisecting the host kernel is
likely
the easiest way to figure out what's going on.  It might not be the _fastest_,
but it should be straightforward, and shouldn't require much KVM expertise,
i.e.
won't require lengthy back-and-forth discussions if no one immediately spots a
bug.

And before bisecting, it'd be worth seeing if an upstream host kernel has the
same problem, e.g. if upstream works, it might be easier/faster to bisect to a
fix, than to bisect to a bug.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 218684] CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S
  2024-04-05 20:33 [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S bugzilla-daemon
  2024-04-05 22:32 ` Sean Christopherson
  2024-04-05 22:32 ` [Bug 218684] " bugzilla-daemon
@ 2024-04-08 10:17 ` bugzilla-daemon
  2024-04-08 20:19 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2024-04-08 10:17 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218684

Frantisek Sumsal (frantisek@sumsal.cz) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |PATCH_ALREADY_AVAILABLE

--- Comment #2 from Frantisek Sumsal (frantisek@sumsal.cz) ---
(In reply to Sean Christopherson from comment #1)
> On Fri, Apr 05, 2024, bugzilla-daemon@kernel.org wrote:
<...snip...>
> 
> Hmm, the vCPU is stuck in the idle HLT loop, which suggests that the vCPU
> isn't
> waking up when it should.  But it does obviously get the hrtimer interrupt,
> so
> it's not completely hosed.
> 
> Are you able to test custom kernels?  If so, bisecting the host kernel is
> likely
> the easiest way to figure out what's going on.  It might not be the
> _fastest_,
> but it should be straightforward, and shouldn't require much KVM expertise,
> i.e.
> won't require lengthy back-and-forth discussions if no one immediately spots
> a
> bug.
> 
> And before bisecting, it'd be worth seeing if an upstream host kernel has the
> same problem, e.g. if upstream works, it might be easier/faster to bisect to
> a
> fix, than to bisect to a bug.

I did some tests over the weekend, and after installing the latest-ish mainline
kernel on the host (6.9.0-0.rc1.316.vanilla.fc40.x86_64, ignore the fc40 part,
I was just lazy and used [0] for a quick test) the soft lockups disappear
completely. I really should've tried this before filing an issue - I tried just
6.7.1-0.hs1.hsx.el9.x86_64 (from [1]) and that didn't help, so I mistakenly
assumed that it's not the host kernel who's at fault.

Also, with the mainline kernel on the host, I can now use the "stock" Arch
Linux kernel on the guest as well without any soft lockups.

Given the mainline kernel works as expected I'll go ahead and move this issue
to the RHEL downstream (and bisect the kernel to find out what's the fix).
Thanks a lot for nudging me into the right direction!

[0] https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
[1] https://sig.centos.org/hyperscale/

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 218684] CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S
  2024-04-05 20:33 [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S bugzilla-daemon
                   ` (2 preceding siblings ...)
  2024-04-08 10:17 ` bugzilla-daemon
@ 2024-04-08 20:19 ` bugzilla-daemon
  2024-04-10 10:44 ` bugzilla-daemon
  2024-05-13 11:22 ` bugzilla-daemon
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2024-04-08 20:19 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218684

Sean Christopherson (seanjc@google.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |seanjc@google.com

--- Comment #3 from Sean Christopherson (seanjc@google.com) ---
Given that 6.7 is still broken, my money is on commit d02c357e5bfa ("KVM:
x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing").  Ugh,
which I neglected to mark for stable.

Note, if that's indeed what's to blame, there's also a bug in the kernel's
preemption model logic that is contributing to the problems.
https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 218684] CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S
  2024-04-05 20:33 [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S bugzilla-daemon
                   ` (3 preceding siblings ...)
  2024-04-08 20:19 ` bugzilla-daemon
@ 2024-04-10 10:44 ` bugzilla-daemon
  2024-05-13 11:22 ` bugzilla-daemon
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2024-04-10 10:44 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218684

--- Comment #4 from Frantisek Sumsal (frantisek@sumsal.cz) ---
(In reply to Sean Christopherson from comment #3)
> Given that 6.7 is still broken, my money is on commit d02c357e5bfa ("KVM:
> x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing"). 
> Ugh, which I neglected to mark for stable.
> 
> Note, if that's indeed what's to blame, there's also a bug in the kernel's
> preemption model logic that is contributing to the problems.
> https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com

You're absolutely right. I took the C9S kernel RPM (5.14.0-436), slapped
d02c357e5bfa on top of it, and after running the same tests as in previous
cases all the soft lockups seem to be magically gone. Thanks a lot!

I'll run a couple more tests and if they pass I'll go ahead and sum this up in
a RHEL report, so the necessary patches make it to C9S/RHEL9.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug 218684] CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S
  2024-04-05 20:33 [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S bugzilla-daemon
                   ` (4 preceding siblings ...)
  2024-04-10 10:44 ` bugzilla-daemon
@ 2024-05-13 11:22 ` bugzilla-daemon
  5 siblings, 0 replies; 7+ messages in thread
From: bugzilla-daemon @ 2024-05-13 11:22 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218684

Igor Mammedov (imammedo@redhat.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |imammedo@redhat.com

--- Comment #5 from Igor Mammedov (imammedo@redhat.com) ---
(In reply to Frantisek Sumsal from comment #4)
> (In reply to Sean Christopherson from comment #3)
> > Given that 6.7 is still broken, my money is on commit d02c357e5bfa ("KVM:
> > x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing"). 
> > Ugh, which I neglected to mark for stable.
> > 
> > Note, if that's indeed what's to blame, there's also a bug in the kernel's
> > preemption model logic that is contributing to the problems.
> > https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com
> 
> You're absolutely right. I took the C9S kernel RPM (5.14.0-436), slapped
> d02c357e5bfa on top of it, and after running the same tests as in previous
> cases all the soft lockups seem to be magically gone. Thanks a lot!
> 
> I'll run a couple more tests and if they pass I'll go ahead and sum this up
> in a RHEL report, so the necessary patches make it to C9S/RHEL9.

Thanks for reporting it upstream, as for C9S it should be fixed in
kernel-5.14.0-444.el9
https://issues.redhat.com/browse/RHEL-17714

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-05-13 11:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-05 20:33 [Bug 218684] New: CPU soft lockups in KVM VMs on kernel 6.x after switching hypervisor from C8S to C9S bugzilla-daemon
2024-04-05 22:32 ` Sean Christopherson
2024-04-05 22:32 ` [Bug 218684] " bugzilla-daemon
2024-04-08 10:17 ` bugzilla-daemon
2024-04-08 20:19 ` bugzilla-daemon
2024-04-10 10:44 ` bugzilla-daemon
2024-05-13 11:22 ` bugzilla-daemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox