Linux PCI subsystem development
* Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)
@ 2023-12-28  2:03 Steven Haigh
  2023-12-28 13:18 ` Lukas Wunner
  0 siblings, 1 reply; 4+ messages in thread
From: Steven Haigh @ 2023-12-28  2:03 UTC
  To: linux-pci; +Cc: linux-kernel, f.ebner

Hi all,

I'm trying to summarise what I'm seeing - please feel free to contact me directly for any further information that I may 
have missed. I'm also not subscribed to either kernel.org mailing list, so please CC me in any replies.

History:
At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was mostly fixed in the following commit to release 
6.6.8:
	commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
	Author: Bjorn Helgaas <bhelgaas@google.com>
	Date:   Thu Dec 14 09:08:56 2023 -0600
	Revert "PCI: acpiphp: Reassign resources on bridge if necessary"

After this commit, the SCSI block device is hotplugged correctly, and a device node as /dev/sdX appears within the qemu VM.

New problem:

When the same SCSI block device is hot-unplugged, the QEMU KVM process will spin at 100% CPU usage. The guest shows no 
CPU being used via top, but the host will continue to spin in the KVM thread until the VM is rebooted.

Further information:

Guest: Fedora 39 with kernel 6.6.8 packages from:
           https://koji.fedoraproject.org/koji/buildinfo?buildID=2336239

Host: Proxmox 8.1.3 with kernel 6.5.11-7-pve

Messages when a drive is hot-plugged to the guest via:
           # qm set 104 -scsi1 /dev/sde

Dec 21 19:44:02 kernel: pci 0000:09:02.0: [1af4:1004] type 00 class 0x010000
Dec 21 19:44:02 kernel: pci 0000:09:02.0: reg 0x10: [io  0x0000-0x003f]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: reg 0x14: [mem 0x00000000-0x00000fff]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: BAR 4: assigned [mem 0xc080004000-0xc080007fff 64bit pref]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: BAR 1: assigned [mem 0xc1801000-0xc1801fff]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: BAR 0: assigned [io 0x6040-0x607f]
Dec 21 19:44:02 kernel: virtio-pci 0000:09:02.0: enabling device (0000 -> 0003)
Dec 21 19:44:02 kernel: scsi host7: Virtio SCSI HBA
Dec 21 19:44:02 kernel: scsi 7:0:0:1: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
Dec 21 19:44:02 kernel: sd 7:0:0:1: Power-on or device reset occurred
Dec 21 19:44:02 kernel: sd 7:0:0:1: Attached scsi generic sg1 type 0
Dec 21 19:44:02 kernel: sd 7:0:0:1: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] 3906994318 512-byte logical blocks: (2.00 TB/1.82 TiB)
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Write Protect is off
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Mode Sense: 63 00 00 08
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Attached SCSI disk

Device node is then available as /dev/sdb as expected.

Hot-unplugging the device in proxmox is done via:
	# /usr/sbin/qm set 104 --delete scsi1

where 104 is the VM ID within the proxmox host. I have been trying to trawl through the perl code for the `qm` util to 
see how that translates to a qemu command, but haven't nailed anything down yet. The code for the qm util is here:
	https://git.proxmox.com/?p=qemu-server.git;a=tree;h=refs/heads/master;hb=refs/heads/master

After the qm command is executed, the device node disappears correctly from the running VM, and the VM seems to operate 
as normal. The spinning within the KVM thread seems to affect only the host.

-- 
Steven Haigh

📧 netwiz@crc.id.au
💻 https://crc.id.au




* Re: Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)
  2023-12-28  2:03 Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest) Steven Haigh
@ 2023-12-28 13:18 ` Lukas Wunner
  2023-12-29  5:46   ` Steven Haigh
  0 siblings, 1 reply; 4+ messages in thread
From: Lukas Wunner @ 2023-12-28 13:18 UTC
  To: Steven Haigh; +Cc: linux-pci, linux-kernel, f.ebner

On Thu, Dec 28, 2023 at 01:03:10PM +1100, Steven Haigh wrote:
> At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was
> mostly fixed in the following commit to release 6.6.8:
> 	commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
> 	Author: Bjorn Helgaas <bhelgaas@google.com>
> 	Date:   Thu Dec 14 09:08:56 2023 -0600
> 	Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
> 
> After this commit, the SCSI block device is hotplugged correctly, and a device node as /dev/sdX appears within the qemu VM.
> 
> New problem:
> 
> When the same SCSI block device is hot-unplugged, the QEMU KVM process will
> spin at 100% CPU usage. The guest shows no CPU being used via top, but the
> host will continue to spin in the KVM thread until the VM is rebooted.

Find out the PID of the qemu process on the host, then cat /proc/$PID/stack
to see where the CPU time is spent.
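
Since the report mentions a single spinning KVM thread, it may help to dump
the kernel stack of every thread rather than only the main one. The following
is a minimal sketch, not a tested recipe: the function name dump_thread_stacks
and the PROC_ROOT override are hypothetical, and the pidfile path in the usage
comment is an assumption about where Proxmox keeps the QEMU PID for VM 104.
Reading the stack files normally requires root.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel stack of every thread of a process.
# PROC_ROOT can be overridden (an assumption made here for testing only);
# by default the real /proc is used.
dump_thread_stacks() {
    pid=$1
    proc_root=${PROC_ROOT:-/proc}
    for task in "$proc_root/$pid"/task/*; do
        # Skip the unexpanded glob if the process has already exited.
        [ -d "$task" ] || continue
        tid=${task##*/}
        name=$(cat "$task/comm" 2>/dev/null)
        echo "== thread $tid ($name) =="
        cat "$task/stack" 2>/dev/null || echo "(stack not readable; try as root)"
    done
}

# Usage (assumed pidfile location on a Proxmox host, VM ID 104):
#   dump_thread_stacks "$(cat /var/run/qemu-server/104.pid)"
```

Running top -H -p $PID first shows which thread ID is consuming the CPU, so
the matching section of the output can be picked out. A thread that spins
purely in userspace may show little or nothing in its kernel stack.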


* Re: Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)
  2023-12-28 13:18 ` Lukas Wunner
@ 2023-12-29  5:46   ` Steven Haigh
  2024-01-03  9:46     ` Fiona Ebner
  0 siblings, 1 reply; 4+ messages in thread
From: Steven Haigh @ 2023-12-29  5:46 UTC
  To: Lukas Wunner; +Cc: linux-pci, linux-kernel, f.ebner

On 29/12/23 00:18, Lukas Wunner wrote:
> On Thu, Dec 28, 2023 at 01:03:10PM +1100, Steven Haigh wrote:
>> At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was
>> mostly fixed in the following commit to release 6.6.8:
>> 	commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
>> 	Author: Bjorn Helgaas <bhelgaas@google.com>
>> 	Date:   Thu Dec 14 09:08:56 2023 -0600
>> 	Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
>>
>> After this commit, the SCSI block device is hotplugged correctly, and a device node as /dev/sdX appears within the qemu VM.
>>
>> New problem:
>>
>> When the same SCSI block device is hot-unplugged, the QEMU KVM process will
>> spin at 100% CPU usage. The guest shows no CPU being used via top, but the
>> host will continue to spin in the KVM thread until the VM is rebooted.
> 
> Find out the PID of the qemu process on the host, then cat /proc/$PID/stack
> to see where the CPU time is spent.

Thanks for the tip - I'll certainly do that.

Annoyingly, since I originally posted this report and then filed it again on the kernel.org lists in this thread, I have 
been unable to reproduce the problem. I have successfully done ~22 SCSI hotplug / remove cycles and none of them 
reproduced the issue.
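
For anyone wanting to repeat such plug / unplug cycles, a rough sketch of a
loop is below. It reuses the two qm commands from the original report (VM ID
104, slot scsi1, disk /dev/sde); the function name hotplug_cycles, the QM
override variable and the default 10 second delay are assumptions made for
illustration only.

```shell
#!/bin/sh
# Hypothetical helper: repeat a hotplug / hot-unplug cycle a number of times.
# QM defaults to the Proxmox qm utility; the override exists only so the loop
# can be dry-run with another command (an assumption, not a qm feature).
QM="${QM:-qm}"

hotplug_cycles() {
    vmid=$1      # e.g. 104
    slot=$2      # e.g. scsi1
    dev=$3       # e.g. /dev/sde
    count=$4     # number of cycles
    delay=${5:-10}   # seconds to wait after each step (assumed value)
    i=1
    while [ "$i" -le "$count" ]; do
        "$QM" set "$vmid" "-$slot" "$dev" || return 1
        sleep "$delay"
        "$QM" set "$vmid" --delete "$slot" || return 1
        sleep "$delay"
        echo "cycle $i done"
        i=$((i + 1))
    done
}

# Usage on the Proxmox host:
#   hotplug_cycles 104 scsi1 /dev/sde 22
```

Watching the QEMU process with top on the host between cycles would show
whether a thread starts spinning after an unplug.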

Kernel versions are still the same on both the Proxmox host and the Fedora guest - however, the qemu-kvm packages on 
the Proxmox host have been updated. The Proxmox host hasn't been rebooted in this time.

I wonder if the initial revert included in 6.6.8 fixed the main problem, and the later update to qemu-kvm packages on 
the proxmox host followed by the last reboot of the VM with the new KVM package sorted the second issue.

Seeing as I can no longer reproduce this reliably - whereas it was 100% reproducible prior, maybe I'm now chasing ghosts.

I'll still continue to monitor - as I normally do this SCSI hotplug ~3 times per week doing backups to different 
external HDDs - so if I do observe it again, I'll grab the stack and reply to this thread again with what I can find.

Until then, I don't want to waste other people's time also chasing ghosts :)

-- 
Steven Haigh

📧 netwiz@crc.id.au
💻 https://crc.id.au



* Re: Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)
  2023-12-29  5:46   ` Steven Haigh
@ 2024-01-03  9:46     ` Fiona Ebner
  0 siblings, 0 replies; 4+ messages in thread
From: Fiona Ebner @ 2024-01-03  9:46 UTC
  To: Steven Haigh, Lukas Wunner; +Cc: linux-pci, linux-kernel

Hi,

Am 29.12.23 um 06:46 schrieb Steven Haigh:
> On 29/12/23 00:18, Lukas Wunner wrote:
>> On Thu, Dec 28, 2023 at 01:03:10PM +1100, Steven Haigh wrote:
>>> At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was
>>> mostly fixed in the following commit to release 6.6.8:
>>>     commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
>>>     Author: Bjorn Helgaas <bhelgaas@google.com>
>>>     Date:   Thu Dec 14 09:08:56 2023 -0600
>>>     Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
>>>
>>> After this commit, the SCSI block device is hotplugged correctly, and
>>> a device node as /dev/sdX appears within the qemu VM.
>>>
>>> New problem:
>>>
>>> When the same SCSI block device is hot-unplugged, the QEMU KVM
>>> process will
>>> spin at 100% CPU usage. The guest shows no CPU being used via top,
>>> but the
>>> host will continue to spin in the KVM thread until the VM is rebooted.
>>
>> Find out the PID of the qemu process on the host, then cat
>> /proc/$PID/stack
>> to see where the CPU time is spent.
> 
> Thanks for the tip - I'll certainly do that.
> 
> Annoyingly, since I posted this report originally, then adding in a new
> report to the kernel.org lists in this, I have been unable to reproduce
> this problem. I have successfully done ~22 scsi hotplug / remove cycles
> and none resulted in reproducing the issue.
> 
> Kernel versions are still the same on both proxmox host and the Fedora
> guest - however I see an update on the host of the qemu-kvm packages in
> Proxmox. The proxmox host hasn't even been rebooted in this time.
> 
> I wonder if the initial revert included in 6.6.8 fixed the main problem,
> and the later update to qemu-kvm packages on the proxmox host followed
> by the last reboot of the VM with the new KVM package sorted the second
> issue.
> 
> Seeing as I can no longer reproduce this reliably - whereas it was 100%
> reproducible prior, maybe I'm now chasing ghosts.
> 

That sounds likely. Version pve-qemu-kvm=8.1.2-5 had a regression where
an IO thread in QEMU could start spinning after a drain (which happens
during hotplug on the QEMU side). It was introduced by an attempted fix
for a much rarer problem [0] and was reverted in pve-qemu-kvm=8.1.2-6
[1]. A proper fix is still being worked on [2].

[0]:
https://git.proxmox.com/?p=pve-qemu.git;a=commit;h=6b7c1815e1c89cb66ff48fbba6da69fe6d254630
[1]:
https://git.proxmox.com/?p=pve-qemu.git;a=commit;h=2a49e667bae33f2a5c6ba6b59a0cd26387f73a27
[2]: https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01900.html

Best Regards,
Fiona

> I'll still continue to monitor - as I normally do this SCSI hotplug ~3
> times per week doing backups to different external HDDs - so if I do
> observe it again, I'll grab the stack and reply to this thread again
> with what I can find.
> 
> Until then, I don't want to waste other peoples time also chasing ghosts :)
> 


