* dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13
From: Chris Murphy @ 2021-12-07 19:44 UTC
  To: kvm; +Cc: qemu-devel

cc: qemu-devel

Hi,

I'm trying to help progress a very troublesome and so far elusive bug
we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
VMs simultaneously, eventually they become unresponsive, as well as
new processes as we try to extract information from the host about
what's gone wrong.

Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
state where forking does not work correctly, breaking most things
https://bugzilla.redhat.com/show_bug.cgi?id=2009585

In subsequent testing, we used newer kernels with lockdep and other
debug stuff enabled, and managed to capture a hung task with a bunch
of locks listed, including kvm and qemu processes. But I can't parse
it.

5.15-rc7
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
5.15+
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939

If anyone can take a glance at those kernel messages, and/or give
hints how we can extract more information for debugging, it'd be
appreciated. Maybe all of that is normal and the actual problem isn't
in any of these traces.

Thanks,

-- 
Chris Murphy
* Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13
From: Sean Christopherson @ 2021-12-07 22:25 UTC
  To: Chris Murphy; +Cc: kvm, qemu-devel

On Tue, Dec 07, 2021, Chris Murphy wrote:
> cc: qemu-devel
>
> Hi,
>
> I'm trying to help progress a very troublesome and so far elusive bug
> we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
> VMs simultaneously, eventually they become unresponsive, as well as
> new processes as we try to extract information from the host about
> what's gone wrong.

Have you tried bisecting? IIUC, the issues showed up between v5.11 and v5.12.12,
bisecting should be relatively straightforward.

> Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> state where forking does not work correctly, breaking most things
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585
>
> In subsequent testing, we used newer kernels with lockdep and other
> debug stuff enabled, and managed to capture a hung task with a bunch
> of locks listed, including kvm and qemu processes. But I can't parse
> it.
>
> 5.15-rc7
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
> 5.15+
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
>
> If anyone can take a glance at those kernel messages, and/or give
> hints how we can extract more information for debugging, it'd be
> appreciated. Maybe all of that is normal and the actual problem isn't
> in any of these traces.

All the instances of

  (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]

are uninteresting and expected, that's just each vCPU task taking its associated
vcpu->mutex, likely for KVM_RUN.

At a glance, the XFS stuff looks far more interesting/suspect.
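Since the vcpu->mutex entries are described above as expected, one way to make the remaining held-lock output easier to read might be to filter them out before looking at the rest. The following is only a minimal sketch, not something posted in the thread. It assumes each held-lock entry sits on a single line and contains both "(&vcpu->mutex)" and "kvm_vcpu_ioctl+"; the function name drop_expected_vcpu_locks is hypothetical.

```python
import re

# Matches the held-lock entries called out as expected in the reply above.
# Assumption: one entry per line, containing both the lock class
# "(&vcpu->mutex)" and the acquiring function "kvm_vcpu_ioctl+<offset>".
EXPECTED_VCPU_LOCK = re.compile(r"\(&vcpu->mutex\).*\bkvm_vcpu_ioctl\+")


def drop_expected_vcpu_locks(log_lines):
    """Return the log lines that are not expected vcpu->mutex entries.

    Hypothetical helper: it only hides lines, it does not interpret them.
    """
    return [line for line in log_lines if not EXPECTED_VCPU_LOCK.search(line)]
```

Feeding it the lines of a saved copy of one of the attachments would leave everything except those entries, so whatever else is in the trace (for example the XFS-related locks mentioned above) is not buried among the per-vCPU ones.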
* Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13
From: Chris Murphy @ 2021-12-08 17:09 UTC
  To: Sean Christopherson; +Cc: Chris Murphy, qemu-devel, kvm

On Tue, Dec 7, 2021 at 5:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 07, 2021, Chris Murphy wrote:
> > cc: qemu-devel
> >
> > Hi,
> >
> > I'm trying to help progress a very troublesome and so far elusive bug
> > we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
> > VMs simultaneously, eventually they become unresponsive, as well as
> > new processes as we try to extract information from the host about
> > what's gone wrong.
>
> Have you tried bisecting? IIUC, the issues showed up between v5.11 and v5.12.12,
> bisecting should be relatively straightforward.

We haven't tried bisecting. Due to limited access since it's a
production machine, and limited resources for those who have that
access, I think the chance of bisecting is low, but I've asked.

We could do something of a faux-bisect by running already built kernels
in Fedora infrastructure. We could start by running x.y.0 kernels to
see when it first appeared, then once hitting the problem, start
testing rc1, rc2, ... in that series. We also have approximately daily
git builds in between those rc's. That might be enough to deduce a
culprit, but I'm not sure. At the least this would get us a ~1-3 day
window within two rc's for bisecting.

>
> > Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> > state where forking does not work correctly, breaking most things
> > https://bugzilla.redhat.com/show_bug.cgi?id=2009585
> >
> > In subsequent testing, we used newer kernels with lockdep and other
> > debug stuff enabled, and managed to capture a hung task with a bunch
> > of locks listed, including kvm and qemu processes. But I can't parse
> > it.
> >
> > 5.15-rc7
> > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
> > 5.15+
> > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
> >
> > If anyone can take a glance at those kernel messages, and/or give
> > hints how we can extract more information for debugging, it'd be
> > appreciated. Maybe all of that is normal and the actual problem isn't
> > in any of these traces.
>
> All the instances of
>
> (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]
>
> are uninteresting and expected, that's just each vCPU task taking its associated
> vcpu->mutex, likely for KVM_RUN.
>
> At a glance, the XFS stuff looks far more interesting/suspect.

Thanks for the reply.

-- 
Chris Murphy
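The faux-bisect over already built kernels described above amounts to a binary search over a chronologically ordered list of builds. What follows is a rough sketch of that idea only, not a tool anyone in the thread has written. The names first_bad_build and is_bad are hypothetical, and the sketch assumes that every build before the regression is good and every build from it onward is bad. Because the hang only shows up "eventually", a "good" verdict is only as trustworthy as the length of the test run behind it.

```python
def first_bad_build(builds, is_bad):
    """Return the earliest bad build, or None if none is found.

    builds: prebuilt kernels in chronological order (for example x.y.0
            releases, then rc1, rc2, ..., then daily git builds).
    is_bad: hypothetical callable that boots a build, runs the workload,
            and returns True if the hang was reproduced.
    Assumption: once a build is bad, all later builds are bad too.
    """
    if not builds or not is_bad(builds[-1]):
        return None  # nothing to search, or the newest build looks fine

    lo, hi = 0, len(builds) - 1  # invariant: builds[hi] is known bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid        # regression is at mid or earlier
        else:
            lo = mid + 1    # regression is after mid
    return builds[lo]
```

The same function could be applied in stages, first to the x.y.0 releases, then to the rc's of the first bad series, then to the daily builds between the two rc's that bracket the problem. Each stage needs roughly log2(N) test runs for N builds, which matters when every run ties up a production machine.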