From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Gregory Haskins" Subject: Re: irqdevice INTR example Date: Thu, 12 Apr 2007 09:43:10 -0400 Message-ID: <461DFF1C.BA47.005A.0@novell.com> References: <461D7702.BA47.005A.0@novell.com> <461DE791.1040707@qumranet.com> <461DE5C9.BA47.005A.0@novell.com> <461E2AD5.7070905@qumranet.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=__Part795E2C7E.0__=" Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org To: "Avi Kivity" Return-path: In-Reply-To: <461E2AD5.7070905-atKUWr5tajBWk0Htik3J/w@public.gmane.org> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: kvm-devel-bounces-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Errors-To: kvm-devel-bounces-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org List-Id: kvm.vger.kernel.org --=__Part795E2C7E.0__= Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Content-Disposition: inline I have attached a new version of the patch which eliminates the condition variable (if only by name, anyway ;) >>> On Thu, Apr 12, 2007 at 8:49 AM, in message <461E2AD5.7070905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>, Avi Kivity wrote: > > > Actually, I am in favor of having well- defined synchronization > primitives. The only issues I see with condition variables are: > > - there are many variants of mutexes in the kernel (mutex, spinlock, > spinlock with irqs disabled, ...), so many condvar variants are needed. Yeah, I tried to address that with lock_ops, but I agree. > - they're very close to waitqueues, so perhaps there's not much > justification Agreed they are very close. I won't go into my opinion of waitqueues vs cond-vars in this forum ;) > > I'm no synchronization primitive expert, though. My comment had nothing > to do with my opinion on condition variables or your implementation > thereof. kvm is simply not the place to introduce them. No problem. I think your argument is a good one. > > I referred to the comment. Maybe just "the hardware"? Done > > Ah, ok -- I misunderstood the whole thing. The way to avoid the race is > to disable interrupts before entering the guest. This way the IPI is > delayed until you enter guest mode: > > irq_disable(); > > spin_lock(); > vcpu- >guest_mode = 1; > check_whether_an_irq_is_pending_and_if_so_inject_it(); > spin_unlock(); > > asm ("vm_entry_sequence"); > > vcpu- >guest_mode = 0; > > irq_enable(); // currently in the vm entry sequence > > // recheck here? > > If the interrupt happens before the spin_lock(), we'll get a non- ipi > wakeup and then see it in check_whether(). If it happens after it we'll > get an IPI which will be ignored until we're snugly in guest mode. When I first read this I thought "whoa! you want to disable interrupts during the whole time we are in GUEST mode?" But then it dawned on me what you are suggesting: Interrupts would be re-enabled after the context switch because we re-load the processor state? Cool! The thing I can't wrap my head around is what happens when the guest has IF=0 and and external interrupt comes in? Would we still exit? But that aside, a problem that I see is that (IIUC) IPIs use NMIs not EXT-INTs. Assuming that is right, I suppose we might be able to do a similar trick except we also disable NMIs first (is this possible/recommended/forbidden?). > > In general I find it useful to pretend there are many userspaces being > written for kvm, otherwise we get locked into qemu's current mode of > operation. Sounds reasonable. I will start to do the same. > >> This makes sense now that I think about it because something like hlt should > cause a suspension of CPU activity until further notice. > > An alternative is to handle hlt in the kernel in a place where we're > ready for the IPI wakeup. The downside to that is that we have to be > prepared for external wakeup sources as well (signal, poll, aio... messy). Hmmm...interesting. I wonder if there are advantages that make this worth exploring. Oh well, I will back burner these thoughts until the SMP/PV/APIC stuff is sorted out. > > It's best not to use signals internally. Qemu relies on them and we > have to support them, but in kernel we should use existing kernel > mechanisms. But Avi, this was *your* idea to use signals ;) But in all seriousness, I don't know if I have a choice. I have two requirements which constrain me: 1) I need an IPI to cause a VMEXIT 2) I need to support 2.6.16, strongly preferable as a loadable module. According to my research (which is undoubtedly not 100% definitive), 2.6.16 or even the newer kernels do not export most of the IPI facilities. send_sig() happens to be exported and it happens to invoke a reschedule_IPI under the hood, so its convenient. If there is another way to get access to the IPI facility without playing games with signals, I am all ears. But until then I don't know what else to do. If I had the luxury of modifying the kernel source, we could just export what we needed and be done with it. > > I was interested in how - >pending() and - >read_vector() and the raise > callback interact, but got distracted by the, err, uniqueness of the > signal thing. You still haven't weighed in here ;) Hopefully you have a clearer picture of what I was trying to do now at least. Whether you agree or not is another matter. Regards, -Greg --=__Part795E2C7E.0__= Content-Type: text/plain; name="preemptible-cpu-2.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="preemptible-cpu-2.patch" KVM: Preemptible VCPU From: <> This adds support for interrupting an executing CPU Signed-off-by: Gregory Haskins --- drivers/kvm/kvm.h | 11 ++++++++++ drivers/kvm/kvm_main.c | 54 ++++++++++++++++++++++++++++++++++++++++++++= ---- drivers/kvm/svm.c | 35 +++++++++++++++++++++++++++++++ drivers/kvm/vmx.c | 35 +++++++++++++++++++++++++++++++ 4 files changed, 130 insertions(+), 5 deletions(-) diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index 58966d9..70d1bb9 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -271,6 +271,16 @@ void kvm_io_bus_register_dev(struct kvm_io_bus *bus, =20 #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long) =20 +/* + * structure for maintaining info for interrupting an executing VCPU + */ +struct kvm_vcpu_irq { + spinlock_t lock; + wait_queue_head_t wq; + struct task_struct *task; + int pending; +}; + struct kvm_vcpu { struct kvm *kvm; union { @@ -284,6 +294,7 @@ struct kvm_vcpu { struct kvm_run *run; int interrupt_window_open; struct kvm_irqdevice irq_dev; + struct kvm_vcpu_irq irq; unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() = */ unsigned long rip; /* needs vcpu_load_rsp_rip() */ =20 diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 7e00412..1cf4060 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -299,6 +299,11 @@ static struct kvm *kvm_create_vm(void) struct kvm_vcpu *vcpu =3D &kvm->vcpus[i]; =20 mutex_init(&vcpu->mutex); + + memset(&vcpu->irq, 0, sizeof(vcpu->irq)); + spin_lock_init(&vcpu->irq.lock); + init_waitqueue_head(&vcpu->irq.wq); + vcpu->cpu =3D -1; vcpu->kvm =3D kvm; vcpu->mmu.root_hpa =3D INVALID_PAGE; @@ -2320,13 +2325,52 @@ static void kvm_vcpu_intr(struct kvm_irqsink = *this, * Our irq device is requesting to interrupt the vcpu. If it is * currently running, we should inject a host IPI to force a = VMEXIT=20 */ -=09 + struct kvm_vcpu *vcpu =3D (struct kvm_vcpu*)this->private; + /* - * FIXME: Implement this or the CPU wont notice the interrupt = until - * the next natural VMEXIT. Note that this is how the system - * has always worked, so nothing is broken here. This is a future - * enhancement + * HACK ALERT! + * + * We want to send a virtual interrupt signal to the task that = owns + * the guest. However, the signal will only force a VMEXIT (via + * a reschedule IPI) if the task is currently in GUEST mode. = There + * is a race condition between the time that we mark the vcpu as + * running and the time the system actually enter guest mode. = Since + * there doesnt appear to be any way to help with this situation = from + * the hardware, we are forced to wait to make sure the guest=20 + * actually gets interrupted in a reasonable amount of time. If = it + * does not, we assume that the IPI failed because it was too = early + * and must try again until it does. + * + * This condvar/spinlock/timeout/retry eliminate the race in a = safe + * manner, at the expense of making the INTR delivery synchronous */ + spin_lock(&vcpu->irq.lock); +=09 + if (vcpu->irq.task) { + struct timespec tmo =3D { + .tv_sec =3D 0, + .tv_nsec =3D 100000 /* 100us */ + }; + + BUG_ON(vcpu->irq.task =3D=3D current); + =09 + while (vcpu->irq.task) { + DEFINE_WAIT(__wait);=09 + + send_sig(SIGSTOP, vcpu->irq.task, 0);=09 + + prepare_to_wait(&vcpu->irq.wq, &__wait,=20 + TASK_UNINTERRUPTIBLE); + spin_unlock(&vcpu->irq.lock); + schedule_timeout(timespec_to_jiffies(&tmo)); + spin_lock(&vcpu->irq.lock); + finish_wait(&vcpu->irq.wq, &__wait); + } + =09 + vcpu->irq.pending =3D 1; + } +=09 + spin_unlock(&vcpu->irq.lock); } =20 static void kvm_vcpu_irqsink_init(struct kvm_vcpu *vcpu) diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index e59a548..41765bd 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -1463,9 +1463,25 @@ static int svm_vcpu_run(struct kvm_vcpu *vcpu, = struct kvm_run *kvm_run) int r; =20 again: + spin_lock(&vcpu->irq.lock); + + /* + * Setting vcpu->task signals to outsiders that the VMCS is=20 + * effectively in GUEST mode, and therefore must be signalled + * to transition the task back to HOST mode if any new interrupts + * arrive. + */ + vcpu->irq.task =3D current; + + /* + * We also must inject interrupts (if any) while the irq_lock + * is held + */ if (!vcpu->mmio_read_completed) do_interrupt_requests(vcpu, kvm_run); =20 + spin_unlock(&vcpu->irq.lock); + clgi(); =20 pre_svm_run(vcpu); @@ -1617,6 +1633,25 @@ again: reload_tss(vcpu); =20 /* + * Signal that we have transitioned back to host mode=20 + */ + spin_lock(&vcpu->irq.lock); + + vcpu->irq.task =3D NULL; + wake_up(&vcpu->irq.wq); + + /* + * If irqpending is asserted someone undoubtedly has sent us a = SIGSTOP + * signal. Counter it with a SIGCONT + */ + if(vcpu->irq.pending) { + send_sig(SIGCONT, current, 0); + vcpu->irq.pending =3D 0; + } + + spin_unlock(&vcpu->irq.lock); + + /* * Profile KVM exit RIPs: */ if (unlikely(prof_on =3D=3D KVM_PROFILING)) diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c index a0fdf02..1d5ce85 100644 --- a/drivers/kvm/vmx.c +++ b/drivers/kvm/vmx.c @@ -1748,9 +1748,25 @@ again: vmcs_writel(HOST_GS_BASE, segment_base(gs_sel)); #endif =20 + spin_lock(&vcpu->irq.lock); + + /* + * Setting vcpu->task signals to outsiders that the VMCS is=20 + * effectively in GUEST mode, and therefore must be signalled + * to transition the task back to HOST mode if any new interrupts + * arrive. + */ + vcpu->irq.task =3D current; + + /* + * We also must inject interrupts (if any) while the irq_lock + * is held + */ if (!vcpu->mmio_read_completed) do_interrupt_requests(vcpu, kvm_run); =20 + spin_unlock(&vcpu->irq.lock); + if (vcpu->guest_debug.enabled) kvm_guest_debug_pre(vcpu); =20 @@ -1911,6 +1927,25 @@ again: =20 asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS)); =20 + /* + * Signal that we have transitioned back to host mode=20 + */ + spin_lock(&vcpu->irq.lock); + + vcpu->irq.task =3D NULL; + wake_up(&vcpu->irq.wq); + + /* + * If irqpending is asserted someone undoubtedly has sent us a = SIGSTOP + * signal. Counter it with a SIGCONT + */ + if(vcpu->irq.pending) { + send_sig(SIGCONT, current, 0); + vcpu->irq.pending =3D 0; + } + + spin_unlock(&vcpu->irq.lock); + if (fail) { kvm_run->exit_reason =3D KVM_EXIT_FAIL_ENTRY; kvm_run->fail_entry.hardware_entry_failure_reason --=__Part795E2C7E.0__= Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV --=__Part795E2C7E.0__= Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ kvm-devel mailing list kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org https://lists.sourceforge.net/lists/listinfo/kvm-devel --=__Part795E2C7E.0__=--