From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [RFC][PATCH]: kdump on KVM
Date: Sun, 06 Jun 2010 11:01:23 +0300
Message-ID: <4C0B55D3.2080100@redhat.com>
References: <20100604201124.GD2831@localhost.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org, Marcelo Tosatti
To: Chris Lalancette
In-Reply-To: <20100604201124.GD2831@localhost.localdomain>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 06/04/2010 11:11 PM, Chris Lalancette wrote:
> Hello,
>      I've gone back to look at the problem with kdump while running as a KVM
> guest.  I'll preface this discussion by noting that all of the problems I
> describe below only happen when the HPET emulation is shut off.  However, since
> there still seem to be some lingering issues with the HPET emulation, running
> without the HPET is a reasonable mode of operation, and one in which it would
> be nice to have kexec/kdump working.
>      The basic problem with getting kdump working is that currently KVM
> delivers the timer interrupts only to vcpu 0, so kexecing onto another
> processor will never see timer interrupts.
> In the KVM code, there are a
> number of places where we make incorrect assumptions regarding the timer and
> which vcpu to deliver it to:
>
> 1)  In the i8254.c code when the hrtimer representing the PIT expires.  In
> this case, when we get the callback, we kick only the BSP.
>
> 2)  In the i8254.c code when a vcpu is migrated from one processor to another.
> In this case we only migrate the PIT timer if the vcpu to be migrated is the
> BSP.
>
> 3)  In the lapic code when deciding whether to accept a PIC interrupt, we only
> accept interrupts on the BSP.
>
> 4)  In the irq_comm.c code when calling kvm_irq_delivery_to_apic().  The
> problem here is that we don't take into account the fact that an LAPIC might
> be disabled when trying to deliver an interrupt in DM_LOWEST mode.  Further,
> on a kexec, the processor that we are kexec'ing *to* gets its APIC ID
> re-written to the BSP APIC ID.  What it means in the end is that we are
> currently still matching against the BSP even though vcpu 1 (where the kexec
> is happening) would match if we let it.
>
> The attached patch takes care of 1), 3), and 4), and works in my testing (it
> needs to be cleaned up a bit to not be so inefficient, but it should work).
> However, problem 2) is pretty sticky.  The reason we are currently migrating
> the PIT timer around with the BSP is pretty well explained in commit
> 2f5997140f22f68f6390c49941150d3fa8a95cb7.  With my new patch, though, we are
> no longer guaranteeing that we are going to inject onto CPU 0.  I think we
> can do something where when the hrtimer expires, we figure out which
> vcpu will get the timer interrupt and IPI to that physical processor to cause
> the VMEXIT.  Unfortunately it's racy because the expiration of the hrtimer is
> de-coupled from the setting of the IRQ for the interrupt.

It's only racy because of (perhaps not premature) optimization.
Logically, the flow of events is as follows:

1. the hrtimer fires
2. irq0 is toggled (perhaps via a workqueue since the hrtimer is in
   interrupt context)
3. the PIC and IOAPIC state machines run and decide which local
   APIC(s), if any, need to be notified
4. the local APICs are IPIed

Perhaps we should start by deoptimizing the code to work like the
above.  In light of the below, I'm not sure the race actually exists.

> That means that
> the hrtimer could expire, we could choose vcpu 2 (say), IPI to cause a
> VMEXIT, but by the time it goes to VMENTER the guest has changed something in
> the (IO)APIC(s) so that the set_irq logic chooses vcpu 3 to do the injection.
> This would result in a delayed injection that 2f5997 is trying to avoid.

Not sure that's the race here.  Between the vmexit and vmenter we have
the following trace:

  kvm_inject_pending_timer_irqs()
    kvm_inject_pit_timer_irqs()
      __inject_pit_timer_intr()
        kvm_set_irq()

So, even if we end up on the wrong vcpu, we will IPI the right one
later.

> Taking a step back, it seems to me that something along the lines of my
> previous patchset (where we do set_irq directly from the hrtimer callback) is
> the right way to go.  We would still need to IPI to the appropriate physical
> cpu to cause a VMEXIT on the cpu we care about, but we would avoid the race I
> describe in the previous paragraph.

I agree.  But we also want the vcpu that accepts the interrupt to pull
the hrtimer after it (like it does now) to minimize cross-cpu talk.

> Unfortunately that patchset is also much
> more risky.

I'd like a minimal, backportable patchset first, then bigger changes
later.  We have at least two choices: an irq-safe pit and ioapic, or
using a workqueue to match impedances.  The latter is a lot safer
since we already do it for assigned devices.
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index 3c4f8f6..bc40143 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -231,7 +231,13 @@ int pit_has_pending_timer(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_pit *pit = vcpu->kvm->arch.vpit;
>
> -	if (pit && kvm_vcpu_is_bsp(vcpu) && pit->pit_state.irq_ack)
> +	/* We only want to inject a single IRQ for each PIT timer that goes
> +	 * off.  If 2 vcpus happen to come in here at the same time, we may
> +	 * erroneously tell one of them that it has a timer pending.  That's
> +	 * OK, though; in kvm_inject_pit_timer_irqs we will make sure to only
> +	 * let one of them inject.
> +	 */
> +	if (pit && pit->pit_state.irq_ack)
>  		return atomic_read(&pit->pit_state.pit_timer.pending);
>  	return 0;
>  }
> @@ -277,6 +283,51 @@ static struct kvm_timer_ops kpit_ops = {
>  	.is_periodic = kpit_is_periodic,
>  };
>
> +static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
> +{
> +	int restart_timer = 0;
> +	struct kvm_vcpu *vcpu, *this;
> +	int i;
> +	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
> +
> +	if (!ktimer->vcpu)
> +		return HRTIMER_NORESTART;
> +
> +	this = NULL;
> +	kvm_for_each_vcpu(i, vcpu, ktimer->kvm) {
> +		if (kvm_lapic_enabled(vcpu)) {
> +			this = vcpu;
> +			break;
> +		}
> +	}
> +	if (this == NULL)
> +		this = ktimer->kvm->bsp_vcpu;

If lapics are active, 'this' is chosen as the lowest numbered enabled
lapic.

> +
> +	/*
> +	 * There is a race window between reading and incrementing, but we do
> +	 * not care about potentially losing timer events in the !reinject
> +	 * case anyway.
> +	 */
> +	if (ktimer->reinject || !atomic_read(&ktimer->pending)) {
> +		atomic_inc(&ktimer->pending);
> +		/* FIXME: this code should not know anything about vcpus */
> +		set_bit(KVM_REQ_PENDING_TIMER, &this->requests);

The comment is correct...
Possible fix:

- audit the code to make sure that it will work regardless of which
  vcpu it is tied to
- install an irq ack notifier for the pit interrupt
- pull the pit hrtimer towards the vcpu that last acked it

> +	}
> +
> +	if (waitqueue_active(&this->wq))
> +		wake_up_interruptible(&this->wq);

Need an IPI if the vcpu is in guest mode and we set
KVM_REQ_PENDING_TIMER earlier.

> +
> +	if (ktimer->t_ops->is_periodic(ktimer)) {
> +		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
> +		restart_timer = 1;
> +	}
> +
> +	if (restart_timer)
> +		return HRTIMER_RESTART;
> +	else
> +		return HRTIMER_NORESTART;

Unrelated: as soon as ->pending starts to accumulate, we should return
HRTIMER_NORESTART and calculate pending from the period and current
time.  This will reduce hrtimers if the system is overloaded, or if the
guest masked the pit and uses some other time source.

> @@ -1106,13 +1112,11 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>  	u32 lvt0 = apic_get_reg(vcpu->arch.apic, APIC_LVT0);
>  	int r = 0;
>
> -	if (kvm_vcpu_is_bsp(vcpu)) {
> -		if (!apic_hw_enabled(vcpu->arch.apic))
> -			r = 1;
> -		if ((lvt0 & APIC_LVT_MASKED) == 0 &&
> -		    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
> -			r = 1;
> -	}
> +	if (!apic_hw_enabled(vcpu->arch.apic))
> +		r = 1;
> +	if ((lvt0 & APIC_LVT_MASKED) == 0 &&
> +	    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
> +		r = 1;
>  	return r;
>  }

This is the crux, yes?  I think that if we pull the vcpu calculation
out of pit_timer_fn into an ack notifier, we can end up with the
original kvm_timer_fn.
>
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 9fd5b3e..c492bae 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -98,7 +98,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  			if (r < 0)
>  				r = 0;
>  			r += kvm_apic_set_irq(vcpu, irq);
> -		} else {
> +		} else if (kvm_lapic_enabled(vcpu)) {
>  			if (!lowest)
>  				lowest = vcpu;
>  			else if (kvm_apic_compare_prio(vcpu, lowest) < 0)

Unrelated fix?

The patch is large for such a ticklish subject.  I'd be more than happy
to remove the vcpu affinity optimization, fix the code, and reinstate
the optimization later, since it will make the patch(es) much easier to
review.

-- 
error compiling committee.c: too many arguments to function