* Regression in v4.14.94 by "x86,kvm: move qemu/guest FPU switching out to vcpu_run"
@ 2019-01-28 19:25 Thomas Lindroth
2019-01-28 19:53 ` Sean Christopherson
2019-01-28 20:14 ` Sasha Levin
0 siblings, 2 replies; 4+ messages in thread
From: Thomas Lindroth @ 2019-01-28 19:25 UTC
To: kvm; +Cc: stable
I run a qemu/kvm VM with Debian and I've started getting segfaults and failing checksums on
downloaded files. The failures are nondeterministic and similar to the failures you get with
bad RAM. I tried to diagnose the problem with various testing tools and found that
"stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors,
usually within 60 sec:
stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
Nothing relevant has changed recently in the VM but the host kernel was upgraded from
4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There
is only one kvm related change in that range so I tried to revert that one.
By reverting commit 4124a4cff344abbf8187775eb643d9827830e715
"x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce
the stress-ng error and I have no segfault or other problems with the guest.
The commit was originally introduced in v4.15-rc3 (Nov 14 2017) and was only recently
backported to 4.14. The other stable kernels before 4.14 didn't get any backport so it looks
like a broken 4.14 backport. That backport also causes problems for other people:
https://bugzilla.kernel.org/show_bug.cgi?id=202419
I've rebooted between the different kernels and rebooted the VM enough to be reasonably sure
that commit is the problem. Stress-ng never lasts more than 10 min with that commit but works
for hours without it.
To reproduce, create a qemu/kvm VM running Debian stretch, install stress-ng
version 0.07.16, and run "stress-ng --verify --cpu 1".
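As a rough sketch of automating the check above, the snippet below runs the same stress-ng invocation inside the guest and reports whether it exited cleanly. The helper names, the 600-second default run time, and the reliance on a non-zero exit status for verification failures are assumptions made for illustration; they are not part of the original report.

```python
import subprocess


def stress_ng_argv(timeout_s=600):
    """Build the command from the report, plus a run-time limit (assumed)."""
    return ["stress-ng", "--verify", "--cpu", "1", "--timeout", f"{timeout_s}s"]


def stress_ng_passes(timeout_s=600, run=subprocess.run):
    """Return True if stress-ng exits with status 0.

    Assumption: a failed --verify check makes stress-ng exit non-zero.
    `run` is injectable so the helper can be exercised without stress-ng.
    """
    return run(stress_ng_argv(timeout_s)).returncode == 0
```

Calling stress_ng_passes() in the guest on a good and a suspect host kernel gives a simple pass/fail signal per boot. Since the failures are not deterministic, a single pass should not be read as proof that the host kernel is fine.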
Here is the qemu-3.1.0 commandline generated by libvirt:
/usr/bin/qemu-system-x86_64 -name guest=debian,debug-threads=on -S -object
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-debian/master-key.aes
-machine pc-i440fx-2.4,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m 2048
-realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
0473ded4-d417-4b0e-a4f5-36ba5a2cd675 -no-user-config -nodefaults -chardev
socket,id=charmonitor,fd=21,server,nowait -mon chardev=charmonitor,id=monitor,mode=control
-rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown
-global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on
-device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device
ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device
ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device
ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -drive
if=none,id=drive-ide0-0-1,readonly=on -device
ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1,bootindex=2 -drive
file=/mnt/gemini.61rn.3T/Backups/debian.raw,format=raw,if=none,id=drive-virtio-disk0 -device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=23,id=hostnet0 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=00:11:22:33:44:55,bus=pci.0,addr=0x3 -spice
port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device
VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device AC97,id=sound0,bus=pci.0,addr=0x7
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -object
rng-random,id=objrng0,filename=/dev/random -device
virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -sandbox
on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
My host kernel .config is big so I put it in a paste: http://sprunge.us/u7YNBt
* Re: Regression in v4.14.94 by "x86,kvm: move qemu/guest FPU switching out to vcpu_run"
From: Sean Christopherson @ 2019-01-28 19:53 UTC
To: Thomas Lindroth; +Cc: kvm, stable
On Mon, Jan 28, 2019 at 08:25:20PM +0100, Thomas Lindroth wrote:
> I run a qemu/kvm VM with Debian and I've started getting segfaults and failing checksums on
> downloaded files. The failures are nondeterministic and similar to the failures you get with
> bad RAM. I tried to diagnose the problem with various testing tools and found that
> "stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors,
> usually within 60 sec:
>
> stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
> stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
>
> Nothing relevant has changed recently in the VM but the host kernel was upgraded from
> 4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There
> is only one kvm related change in that range so I tried to revert that one.
>
> By reverting commit 4124a4cff344abbf8187775eb643d9827830e715
> "x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce
> the stress-ng error and I have no segfault or other problems with the guest.
This is the second report of this issue:
https://bugzilla.kernel.org/show_bug.cgi?id=202419
Upon inspection, the commit in question is obviously buggy:
kvm_arch_vcpu_ioctl_run() doubles up on kvm_{load,put}_guest_fpu().
The ordering of the mainline commits
f775b13eedee ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
and
5663d8f9bbe4 ("kvm: x86: fix WARN due to uninitialized guest FPU state")
was reversed when they were backported to 4.14. Commit 5663d8f9bbe4 even explicitly
notes that it fixes f775b13eedee. I'll send a patch.
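To illustrate why doubling up on a save/restore pair can be harmful, here is a toy model in Python. It is only a sketch under stated assumptions: the class, the field names, and the simple nested call order are hypothetical, and it is not the kernel code or a reconstruction of the exact failure path in 4.14. It assumes that each "load" unconditionally saves whatever is live as userspace state and each "put" unconditionally saves whatever is live as guest state.

```python
class ToyVcpu:
    """Toy stand-in for FPU switching; not the kernel's data structures."""

    def __init__(self, user_state, guest_state):
        self.regs = user_state        # state currently live in the "registers"
        self.user_fpu = None          # save area for userspace (QEMU) state
        self.guest_fpu = guest_state  # save area for guest state

    def load_guest_fpu(self):
        # Save the live state as "user", then make the guest state live.
        self.user_fpu = self.regs
        self.regs = self.guest_fpu

    def put_guest_fpu(self):
        # Save the live state as "guest", then make the user state live.
        self.guest_fpu = self.regs
        self.regs = self.user_fpu


def run_once(vcpu, guest_work, doubled):
    """Run one guest entry/exit, optionally with doubled load/put calls."""
    vcpu.load_guest_fpu()
    if doubled:
        vcpu.load_guest_fpu()  # second load overwrites the saved user state
    vcpu.regs = guest_work(vcpu.regs)
    vcpu.put_guest_fpu()
    if doubled:
        vcpu.put_guest_fpu()   # second put overwrites the saved guest state
```

In this model, a single load/put pair leaves the guest save area holding the guest's updated value and restores the userspace value to the registers. With the calls doubled, the second load replaces the saved userspace state with guest state, and the second put replaces the guest's updated value with stale data, so both sides end up with the wrong contents.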
[...]
* Re: Regression in v4.14.94 by "x86,kvm: move qemu/guest FPU switching out to vcpu_run"
From: Sasha Levin @ 2019-01-28 20:14 UTC
To: Thomas Lindroth; +Cc: kvm, stable
On Mon, Jan 28, 2019 at 08:25:20PM +0100, Thomas Lindroth wrote:
>I run a qemu/kvm VM with Debian and I've started getting segfaults and failing checksums on
>downloaded files. The failures are nondeterministic and similar to the failures you get with
>bad RAM. I tried to diagnose the problem with various testing tools and found that
>"stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors,
>usually within 60 sec:
>
> stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
> stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
>
>Nothing relevant has changed recently in the VM but the host kernel was upgraded from
>4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There
>is only one kvm related change in that range so I tried to revert that one.
>
>By reverting commit 4124a4cff344abbf8187775eb643d9827830e715
>"x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce
>the stress-ng error and I have no segfault or other problems with the guest.
[...]
Interesting, thank you for the report.
Could you confirm whether this issue reproduces on a newer kernel that
has that patch (4.19.18 for example)?
* Re: Regression in v4.14.94 by "x86,kvm: move qemu/guest FPU switching out to vcpu_run"
From: Sean Christopherson @ 2019-01-28 20:20 UTC
To: Sasha Levin; +Cc: Thomas Lindroth, kvm, stable
On Mon, Jan 28, 2019 at 03:14:53PM -0500, Sasha Levin wrote:
> On Mon, Jan 28, 2019 at 08:25:20PM +0100, Thomas Lindroth wrote:
> >I run a qemu/kvm VM with Debian and I've started getting segfaults and failing checksums on
> >downloaded files. The failures are nondeterministic and similar to the failures you get with
> >bad RAM. I tried to diagnose the problem with various testing tools and found that
> >"stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors,
> >usually within 60 sec:
> >
> > stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
> > stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
> >
> >Nothing relevant has changed recently in the VM but the host kernel was upgraded from
> >4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There
> >is only one kvm related change in that range so I tried to revert that one.
> >
> >By reverting commit 4124a4cff344abbf8187775eb643d9827830e715
> >"x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce
> >the stress-ng error and I have no segfault or other problems with the guest.
[...]
> Interesting, thank you for the report.
>
> Could you confirm whether this issue reproduces on a newer kernel that
> has that patch (4.19.18 for example)?
The bug is specific to 4.14: two dependent commits were applied in the
wrong order, which introduced the bug. I have a patch and am in the
process of typing up the changelog.