From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sasha Levin
Subject: Re: Regression in v4.14.94 by "x86,kvm: move qemu/guest FPU switching out to vcpu_run"
Date: Mon, 28 Jan 2019 15:14:53 -0500
Message-ID: <20190128201453.GM3973@sasha-vm>
References: <457d0666-1951-1b7c-f7e8-18c67763e6c3@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Cc: kvm@vger.kernel.org, stable@vger.kernel.org
To: Thomas Lindroth
Return-path:
Content-Disposition: inline
In-Reply-To: <457d0666-1951-1b7c-f7e8-18c67763e6c3@gmail.com>
Sender: stable-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On Mon, Jan 28, 2019 at 08:25:20PM +0100, Thomas Lindroth wrote:
>I run a qemu/kvm VM with Debian and I've started getting segfaults and failing checksums on
>downloaded files. The failures are nondeterministic and similar to the failures you get with
>bad RAM. I tried to diagnose the problem with various testing tools and found that
>"stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors,
>usually within 60 sec:
>
> stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
> stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
>
>Nothing relevant has changed recently in the VM but the host kernel was upgraded from
>4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There
>is only one KVM-related change in that range so I tried to revert that one.
>
>By reverting commit 4124a4cff344abbf8187775eb643d9827830e715
>"x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce
>the stress-ng error and I have no segfault or other problems with the guest.
>
>The commit was originally introduced in v4.15-rc3 (Nov 14 2017) and was only recently
>backported to 4.14. The other stable kernels before 4.14 didn't get any backport so it looks
>like a broken 4.14 backport. That backport also causes problems for other people.
>https://bugzilla.kernel.org/show_bug.cgi?id=202419
>
>I've rebooted between the different kernels and rebooted the VM enough to be reasonably sure
>that commit is the problem. Stress-ng never lasts more than 10 min with that commit but works
>for hours without it.
>
>Steps to reproduce would be to create a qemu/kvm VM with debian stretch, install stress-ng
>version 0.07.16 and run "stress-ng --verify --cpu 1".
>
>Here is the qemu-3.1.0 commandline generated by libvirt:
>/usr/bin/qemu-system-x86_64 -name guest=debian,debug-threads=on -S -object
>secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-debian/master-key.aes
>-machine pc-i440fx-2.4,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m 2048
>-realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
>0473ded4-d417-4b0e-a4f5-36ba5a2cd675 -no-user-config -nodefaults -chardev
>socket,id=charmonitor,fd=21,server,nowait -mon chardev=charmonitor,id=monitor,mode=control
>-rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown
>-global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on
>-device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device
>ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device
>ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device
>ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -drive
>if=none,id=drive-ide0-0-1,readonly=on -device
>ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1,bootindex=2 -drive
>file=/mnt/gemini.61rn.3T/Backups/debian.raw,format=raw,if=none,id=drive-virtio-disk0 -device
>virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
>-netdev tap,fd=23,id=hostnet0 -device
>virtio-net-pci,netdev=hostnet0,id=net0,mac=00:11:22:33:44:55,bus=pci.0,addr=0x3 -spice
>port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device
>VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device
>AC97,id=sound0,bus=pci.0,addr=0x7
>-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -object
>rng-random,id=objrng0,filename=/dev/random -device
>virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -sandbox
>on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
>
>My host kernel .config is big so I put it in a paste: http://sprunge.us/u7YNBt

Interesting, thank you for the report. Could you confirm whether this
issue reproduces on a newer kernel that has that patch (4.19.18 for
example)?

--
Thanks,
Sasha