From mboxrd@z Thu Jan 1 00:00:00 1970
From: Liran Alon
Subject: Re: VMs freezing when host is running 4.14
Date: Thu, 23 Nov 2017 18:18:07 +0200
Message-ID: <5A16F4BF.9060306@ORACLE.COM>
References: <20171121161821.b6k3hdl3wgia5f5q@torres.zugschlus.de> <20171122093945.5afa2di2g7qhf4eb@torres.zugschlus.de> <20171122155208.wdcmosxfpsjbwcrm@torres.zugschlus.de> <20171122164312.GA21279@flask> <20171123152024.7xsc7lesv2qyujng@torres.zugschlus.de> <20171123155946.GC21184@flask>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Cc: LKML , "KVM-ML (kvm@vger.kernel.org)" , Wanpeng Li
To: =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , Marc Haber
Return-path:
In-Reply-To: <20171123155946.GC21184@flask>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 23/11/17 17:59, Radim Krčmář wrote:
> 2017-11-23 16:20+0100, Marc Haber:
>> On Wed, Nov 22, 2017 at 05:43:13PM +0100, Radim Krčmář wrote:
>>> 2017-11-22 16:52+0100, Marc Haber:
>>>> On Wed, Nov 22, 2017 at 04:04:42PM +0100, 王金浦 wrote:
>>>>> So all guest kernels are 4.14, or also other older kernel?
>>>>
>>>> Guest kernels are also 4.14, but the issue disappears when the host is
>>>> downgraded to an older kernel. I therefore reckoned that the guest
>>>> kernel doesn't matter, but that was before I saw the trace in the log.
>>>
>>> The two most suspicious patches since 4.13 (which I assume works) are
>>>
>>> 664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
>>> injected")
>>
>> That one does not revert cleanly, the line in question seems to have
>> been removed a bit later.
>>
>> Reject is:
>> 141 [24/5001]mh@fan:~/linux/git/linux ((v4.14.1) %) $ cat arch/x86/kvm/vmx.c.rej
>> --- arch/x86/kvm/vmx.c
>> +++ arch/x86/kvm/vmx.c
>> @@ -2516,7 +2516,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
>>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>>   unsigned nr = vcpu->arch.exception.nr;
>>   bool has_error_code = vcpu->arch.exception.has_error_code;
>> - bool reinject = vcpu->arch.exception.injected;
>> + bool reinject = vcpu->arch.exception.reinject;
>>   u32 error_code = vcpu->arch.exception.error_code;
>>   u32 intr_info = nr | INTR_INFO_VALID_MASK;
>
> This line one can be deleted as reinject isn't used in the function.
>
> Btw. there have been already many fixes from Liran Alon for that patch
> and your case could be the one addressed in
> https://www.spinics.net/lists/kvm/msg159158.html
>
> The patch is incorrect, but you might be able to see only its benefits.

Actually I would first attempt to check this patch of mine:
https://www.spinics.net/lists/kvm/msg159062.html

It fixes a bug of an L2 exception accidentally being delivered into L1.

Regards,
-Liran

>
>>> and
>>>
>>> 9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
>>> and "Page Ready" exceptions simultaneously")
>>>
>>> please try reverting them to see if it helps,
>>
>> That one reverted cleanly. I am now running the new kernel on the
>> affected machine, and I think that a second machine has joined the
>> market of being affected.
>
> That one had much lower chances of being the culprit.
>
>> Would this matter on the host only or on the guests as well?
>
> Only on the host.
>
> Thanks.
>