From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kiszka
Subject: Re: [PATCH 1/1 V6] qemu-kvm: fix improper nmi emulation
Date: Wed, 19 Oct 2011 12:57:05 +0200
Message-ID: <4E9EAD01.50605@siemens.com>
References: <20110913093835.GB4265@localhost.localdomain>
 <20110914093441.e2bb305c.kamezawa.hiroyu@jp.fujitsu.com>
 <4E705BC3.5000508@cn.fujitsu.com>
 <20110915164704.9cacd407.kamezawa.hiroyu@jp.fujitsu.com>
 <4E71B28F.7030201@cn.fujitsu.com>
 <4E72F3BA.2000603@jp.fujitsu.com>
 <4E73200A.7040908@jp.fujitsu.com>
 <4E76C6AA.9080403@cn.fujitsu.com>
 <4E7B04DC.1030407@cn.fujitsu.com>
 <4E7B4B8F.507@siemens.com>
 <4E7C51E4.2000503@cn.fujitsu.com>
 <4E7F3585.40108@redhat.com>
 <4E7F635E.6080009@web.de>
 <4E8035F9.9080908@redhat.com>
 <4E928B54.1070707@cn.fujitsu.com>
 <4E92958E.9000509@web.de>
 <4E9476E2.1070804@cn.fujitsu.com>
 <4E948842.4030406@web.de>
 <4E978827.6070008@cn.fujitsu.com>
 <4E97CE42.9020102@web.de>
 <4E97D85C.7070107@cn.fujitsu.com>
 <4E97DB62.9020605@web.de>
 <4E97FAC7.6080007@cn.fujitsu.com>
 <4E9AA657.1050503@redhat.com>
 <4E9BF821.2070805@cn.fujitsu.com>
 <4E9BFA40.5070806@redhat.com>
 <4E9C5106.5070506@cn.fujitsu.com>
 <4E9DD659.1050005@web.de>
 <4E9E6F3D.1060802@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: Avi Kivity, Kenji Kaneshige, KAMEZAWA Hiroyuki,
 "qemu-devel@nongnu.org", "kvm@vger.kernel.org"
To: Lai Jiangshan
Return-path:
Received: from thoth.sbs.de ([192.35.17.2]:26964 "EHLO thoth.sbs.de"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1755756Ab1JSK5Y (ORCPT ); Wed, 19 Oct 2011 06:57:24 -0400
In-Reply-To: <4E9E6F3D.1060802@cn.fujitsu.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 2011-10-19 08:33, Lai Jiangshan wrote:
> On 10/19/2011 03:41 AM, Jan Kiszka wrote:
>> On 2011-10-17 18:00, Lai Jiangshan wrote:
>>> On 10/17/2011 05:49 PM, Avi Kivity wrote:
>>>> On 10/17/2011 11:40 AM, Lai Jiangshan wrote:
>>>>>>>
>>>>>>
>>>>>> LINT1 may have been programmed as a level-triggered interrupt instead
>>>>>> of edge triggered (NMI or interrupt).  We can use the ioctl argument for
>>>>>> the level (and pressing the NMI button needs to pulse the level to 1 and
>>>>>> back to 0).
>>>>>>
>>>>>
>>>>> Hi, Avi, Jan,
>>>>>
>>>>> Which approach do you prefer?
>>>>> I need to know the result before wasting too much time respinning
>>>>> the approach.
>>>>
>>>> Yes, sorry about the slow and sometimes conflicting feedback.
>>>>
>>>>> 1) Fix KVM_NMI emulation approach (which is the v3 patchset)
>>>>>    - It directly fixes the problem and matches the
>>>>>      real hardware more closely, but it changes KVM_NMI behavior.
>>>>>    - Requires both a kernel-side and a userspace-side fix.
>>>>>
>>>>> 2) Get the LAPIC state from the kernel irqchip, and inject NMI if it is
>>>>>    allowed (which is the v4 patchset)
>>>>>    - Simple, doesn't change any kernel behavior.
>>>>>    - Only needs the userspace-side fix.
>>>>>
>>>>> 3) Add KVM_SET_LINT1 approach (which is the v5 patchset)
>>>>>    - Doesn't change the kernel's KVM_NMI behavior.
>>>>>    - Much more complex.
>>>>>    - Requires both a kernel-side and a userspace-side fix.
>>>>>    - The userspace side must also handle the !KVM_SET_LINT1
>>>>>      condition, so it uses all of approach 2)'s code. This means
>>>>>      this approach equals approach 2) + the KVM_SET_LINT1 ioctl.
>>>>>
>>>>> This is an urgent bug for us; we need to settle it soon.
>>>>
>>>> While (1) is simple, it overloads a single ioctl with two meanings,
>>>> that's not so good.
>>>>
>>>> Whether we do (1) or (3), we need (2) as well, for older kernels.
>>>>
>>>> So I recommend first focusing on (2) and merging it, then doing (3).
>>>>
>>>> (note an additional issue with 3 is whether to make it a vm or vcpu
>>>> ioctl - we've been assuming vcpu ioctl but it's not necessarily the best
>>>> choice).
>>>>
>>>
>>> This is approach 2).
>>> It only changes the userspace side; the kernel side is not touched.
>>> It is changed from the previous v4 patch and fixes the problems found by Jan.
>>> ----------------------------
>>>
>>> From: Lai Jiangshan
>>>
>>> Currently, an NMI interrupt is blindly sent to all the vCPUs when an NMI
>>> button event happens. This doesn't properly emulate real hardware, on
>>> which an NMI button event triggers LINT1. Because of this, NMI is sent to
>>> the processor even when LINT1 is masked in the LVT. For example, this
>>> causes the problem that kdump initiated by NMI sometimes doesn't work
>>> on KVM, because kdump assumes NMI is masked on CPUs other than CPU0.
>>>
>>> With this patch, an inject-nmi request is handled as follows.
>>>
>>> - When the in-kernel irqchip is disabled, deliver LINT1 instead of an NMI
>>>   interrupt.
>>> - When the in-kernel irqchip is enabled, get the in-kernel LAPIC state
>>>   and test APIC_LVT_MASKED; if LINT1 is unmasked and its delivery mode
>>>   is NMI, deliver the NMI directly. (Suggested by Jan Kiszka)
>>>
>>> Changed from old version:
>>>   re-implemented following Jan's suggestion.
>>>   fixed the race found by Jan.
>>>
>>> Signed-off-by: Lai Jiangshan
>>> Reported-by: Kenji Kaneshige
>>> ---
>>>  hw/apic.c |   33 +++++++++++++++++++++++++++++++++
>>>  hw/apic.h |    1 +
>>>  monitor.c |    6 +++++-
>>>  3 files changed, 39 insertions(+), 1 deletions(-)
>>> diff --git a/hw/apic.c b/hw/apic.c
>>> index 69d6ac5..922796a 100644
>>> --- a/hw/apic.c
>>> +++ b/hw/apic.c
>>> @@ -205,6 +205,39 @@ void apic_deliver_pic_intr(DeviceState *d, int level)
>>>      }
>>>  }
>>>
>>> +static inline uint32_t kapic_reg(struct kvm_lapic_state *kapic, int reg_id);
>>> +
>>> +static void kvm_irqchip_deliver_nmi(void *p)
>>> +{
>>> +    APICState *s = p;
>>> +    struct kvm_lapic_state klapic;
>>> +    uint32_t lvt;
>>> +
>>> +    kvm_get_lapic(s->cpu_env, &klapic);
>>> +    lvt = kapic_reg(&klapic, 0x32 + APIC_LVT_LINT1);
>>> +
>>> +    if (lvt & APIC_LVT_MASKED) {
>>> +        return;
>>> +    }
>>> +
>>> +    if (((lvt >> 8) & 7) != APIC_DM_NMI) {
>>> +        return;
>>> +    }
>>> +
>>> +    kvm_vcpu_ioctl(s->cpu_env, KVM_NMI);
>>> +}
>>> +
>>> +void apic_deliver_nmi(DeviceState *d)
>>> +{
>>> +    APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
>>> +
>>> +    if (kvm_irqchip_in_kernel()) {
>>> +        run_on_cpu(s->cpu_env, kvm_irqchip_deliver_nmi, s);
>>> +    } else {
>>> +        apic_local_deliver(s, APIC_LVT_LINT1);
>>> +    }
>>> +}
>>> +
>>>  #define foreach_apic(apic, deliver_bitmask, code) \
>>>  {\
>>>      int __i, __j, __mask;\
>>> diff --git a/hw/apic.h b/hw/apic.h
>>> index c857d52..3a4be0a 100644
>>> --- a/hw/apic.h
>>> +++ b/hw/apic.h
>>> @@ -10,6 +10,7 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
>>>                       uint8_t trigger_mode);
>>>  int apic_accept_pic_intr(DeviceState *s);
>>>  void apic_deliver_pic_intr(DeviceState *s, int level);
>>> +void apic_deliver_nmi(DeviceState *d);
>>>  int apic_get_interrupt(DeviceState *s);
>>>  void apic_reset_irq_delivered(void);
>>>  int apic_get_irq_delivered(void);
>>> diff --git a/monitor.c b/monitor.c
>>> index cb485bf..0b81f17 100644
>>> --- a/monitor.c
>>> +++ b/monitor.c
>>> @@ -2616,7 +2616,11 @@ static int do_inject_nmi(Monitor *mon, const QDict *qdict, QObject **ret_data)
>>>      CPUState *env;
>>>
>>>      for (env = first_cpu; env != NULL; env = env->next_cpu) {
>>> -        cpu_interrupt(env, CPU_INTERRUPT_NMI);
>>> +        if (!env->apic_state) {
>>> +            cpu_interrupt(env, CPU_INTERRUPT_NMI);
>>> +        } else {
>>> +            apic_deliver_nmi(env->apic_state);
>>> +        }
>>>      }
>>>
>>>      return 0;
>>
>> Looks OK to me.
>>
>> Please don't forget to bake a qemu-only patch for those bits that apply
>> to upstream as well (i.e. the user space APIC path).
>>
>> Jan
>>
>
> I did forget it.
> Did you mean we need to add "#ifdef KVM_CAP_IRQCHIP" back?

No. I meant basically your patch minus the kvm_irqchip_in_kernel() code
paths, applicable against current qemu.git. Those paths will be re-added
(slightly differently) when upstream gains that support. I'm working on
a basic version and will incorporate the logic if your qemu patch is
already available.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
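
For readers following the quoted commit message, the LINT1 test it describes can be sketched in isolation. This is only an illustration, not code from the patch or from QEMU: the macro and function names below are hypothetical, and the bit positions assume the architectural LVT entry layout (mask flag in bit 16, delivery mode in bits 8..10, NMI encoded as 100b), which is what the patch's `lvt & APIC_LVT_MASKED` and `(lvt >> 8) & 7` checks appear to rely on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed LVT entry layout; names are illustrative, not QEMU's. */
#define LVT_MASKED    (1u << 16)  /* bit 16: entry is masked */
#define LVT_DM_SHIFT  8           /* bits 8..10: delivery mode */
#define LVT_DM_MASK   0x7u
#define LVT_DM_NMI    0x4u        /* delivery mode 100b = NMI */
#define LVT_DM_EXTINT 0x7u        /* delivery mode 111b = ExtINT */

/*
 * Hypothetical helper: decide whether an NMI-button event should be
 * injected as an NMI on a CPU, given that CPU's raw LVT LINT1 value.
 */
bool lint1_delivers_nmi(uint32_t lvt_lint1)
{
    if (lvt_lint1 & LVT_MASKED) {
        return false;  /* LINT1 masked: the event must be dropped */
    }
    /* Unmasked: only inject if LINT1 is programmed for NMI delivery. */
    return ((lvt_lint1 >> LVT_DM_SHIFT) & LVT_DM_MASK) == LVT_DM_NMI;
}

/* Small self-check covering the three cases discussed in the thread. */
void lint1_selfcheck(void)
{
    /* Unmasked, delivery mode NMI: inject. */
    assert(lint1_delivers_nmi(LVT_DM_NMI << LVT_DM_SHIFT));
    /* Masked, delivery mode NMI (e.g. a non-boot CPU): drop. */
    assert(!lint1_delivers_nmi(LVT_MASKED | (LVT_DM_NMI << LVT_DM_SHIFT)));
    /* Unmasked, but programmed as ExtINT rather than NMI: drop. */
    assert(!lint1_delivers_nmi(LVT_DM_EXTINT << LVT_DM_SHIFT));
}
```

The masked case is the one that matters for the kdump scenario in the commit message: a CPU whose LINT1 is masked should not see the NMI at all, which is why the mask test comes before the delivery-mode test.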