From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Lalancette
Subject: Re: [PATCH 0/5]: Fix kdump under KVM
Date: Thu, 29 Oct 2009 09:34:46 +0100
Message-ID: <4AE953A6.1000204@redhat.com>
References: <1256661667-9298-1-git-send-email-clalance@redhat.com>
 <4AE81617.6060406@redhat.com>
 <4AE81959.2000801@redhat.com>
 <4AE81D87.1030709@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Avi Kivity
Return-path: 
Received: from mx1.redhat.com ([209.132.183.28]:16258 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750921AbZJ2Ien
 (ORCPT ); Thu, 29 Oct 2009 04:34:43 -0400
Received: from int-mx01.intmail.prod.int.phx2.redhat.com
 (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11]) by mx1.redhat.com
 (8.13.8/8.13.8) with ESMTP id n9T8Ymet004506
 for ; Thu, 29 Oct 2009 04:34:48 -0400
In-Reply-To: <4AE81D87.1030709@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: 

Avi Kivity wrote:
> On 10/28/2009 12:13 PM, Chris Lalancette wrote:
>>> The kick from i8254 code is pretty bad, as you mention.  I forget why it
>>> is needed at all - shouldn't kvm_set_irq() end up kicking the correct
>>>
>> As I understand it, that's not quite how it works.  From what I can see,
>> what happens is that the i8254 is programmed as an hrtimer.  When the
>> hrtimer expires, we get a callback in kvm_timer_fn (or pit_timer_fn, in my
>> new code).  That code is running in interrupt context, however, so we can't
>> directly call "set_irq" at that point.  Instead, we update the "pending"
>> variable and defer the work until later.  That "later" is when we next do a
>> vcpu_run, at which point we check the "pending" variable and, if it is set,
>> inject the interrupt.
>>
>> The problem is that if the vcpu(s) are idle when the hrtimer expires and we
>> don't kick them, no vcpu will wake up, and hence none of them will ever run
>> "set_irq" to get it injected into the guest.
>>
>
> Oh yes, I remember now.
>
>> If you have other ideas on how we might accomplish this, I'd definitely be
>> interested in hearing them.
>>
>
> We could schedule work to run in thread context.  Prior to the user
> return notifier work, this would cause a reloading of a bunch of MSRs
> and would thus be quite expensive, but now switching to kernel threads
> and back is pretty cheap.
>
> So we could try schedule_work().  My only worry is that it uses
> wake_up() instead of wake_up_sync(), so it could introduce delay.  I'd
> much prefer a variant that schedules immediately.
>
> An alternative is to make irq injection work from irq context.  It's not
> difficult to do technically - convert the mutexes to spin_lock_irqsave() -
> I'm just worried about long irqoff times with large guests (the ioapic
> needs to iterate over all vcpus sometimes).  This is useful for other
> cases too - device assignment interrupt injection, irqfd for vhost/vbus.
> So maybe we should go down this path, and worry about reducing irqoff
> times later, when we bump KVM_MAX_VCPUS.
>

I'm starting to take a look at this.  While this may be a generically
useful cleanup, it occurs to me that even if we call "set_irq" from the
hrtimer callback, we still have to do the "kick_vcpus" type of thing.
Otherwise, we still have the problem that if all of the vcpus are idle,
none of them will be doing vcpu_run anytime soon to actually inject the
interrupt into the guest.  Or did I misunderstand what you are proposing?

-- 
Chris Lalancette