From mboxrd@z Thu Jan  1 00:00:00 1970
From: Radim =?utf-8?B?S3LEjW3DocWZ?= <rkrcmar@redhat.com>
Subject: Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple
 interrupts by user writes
Date: Fri, 10 Oct 2014 14:50:38 +0200
Message-ID: <20141010125034.GA17347@potion.brq.redhat.com>
References: <1412287806-16016-1-git-send-email-namit@cs.technion.ac.il>
 <1412287806-16016-6-git-send-email-namit@cs.technion.ac.il>
 <20141006205737.GC2722@potion.brq.redhat.com>
 <4E3FA8A7-6CEF-4077-AD91-9AAE1AF86FEF@gmail.com>
 <20141008100619.GA20422@potion.brq.redhat.com>
 <99CA15AA-DDC1-4E95-A168-7AA640C5A3B9@gmail.com>
 <5437AAB0.4060908@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Nadav Amit <nadav.amit@gmail.com>,
	Nadav Amit <namit@cs.technion.ac.il>, joro@8bytes.org,
	kvm@vger.kernel.org
To: Paolo Bonzini <pbonzini@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:6024 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753400AbaJJMuw (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 10 Oct 2014 08:50:52 -0400
Content-Disposition: inline
In-Reply-To: <5437AAB0.4060908@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

2014-10-10 11:45+0200, Paolo Bonzini:
> Il 10/10/2014 03:55, Nadav Amit ha scritto:
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index b8345dd..51428dd 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lap=
ic *apic)
> >>  		if (likely(tscdeadline > guest_tsc)) {
> >>  			ns =3D (tscdeadline - guest_tsc) * 1000000ULL;
> >>  			do_div(ns, this_tsc_khz);
> >> +			hrtimer_start(&apic->lapic_timer.timer,
> >> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >> +		} else {
> >> +			atomic_inc(&ktimer->pending);
> >> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
> >>  		}
> >> -		hrtimer_start(&apic->lapic_timer.timer,
> >> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> >>
> >>  		local_irq_restore(flags);
> >>  	}
> >> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kv=
m_vcpu *vcpu, u64 data)
> >>  		return;
> >>
> >>  	hrtimer_cancel(&apic->lapic_timer.timer);
> >> -	/* Inject here so clearing tscdeadline won't override new value =
*/
> >> -	if (apic_has_pending_timer(vcpu))
> >> -		kvm_inject_apic_timer_irqs(vcpu);
> >>  	apic->lapic_timer.tscdeadline =3D data;
> >>  	start_apic_timer(apic);
> >> }
> >=20
> > Perhaps I am missing something, but I don=E2=80=99t see how it solv=
es the problem I encountered.
> > Recall the scenario:
> > 1. A TSC deadline timer interrupt is pending.
> > 2. TSC deadline was still not cleared (which happens during vcpu_ru=
n).
> > 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadli=
ne msr.
> >=20
> > It appears that in such scenario the guest would still get spurious
> > interrupt for no reason, as ktimer->pending may already be increase=
d in
> > apic_timer_fn.
>=20
> If ktimer->pending =3D=3D 2 you still get only one interrupt, not two=
=2E  Am I
> misunderstanding your objection?

(I don't see it either, the patch removed kvm_inject_apic_timer_irqs(),
 so we inject as if the MSR write didn't happen.)

> > Second, I think that the solution I proposed would perform better.
> > Currently, there are many unnecessary cancellations and setups of t=
he
> > timer. This solution does not resolve this problem.
>=20
> I think it does.  You do not get an hrtimer_start if tscdeadline <=3D
> guest_tsc.  To avoid useless cancels, either check hrtimer_is_enqueue=
d
> before calling hrtimer_cancel, or go straight to the source and avoid
> taking the lock in the easy cases:

I think the useless cancels were when the timer is set to the same valu=
e
in the future.

Which makes sense to optimize, but I wasn't sure how it would affect th=
e
guest if the TSC offset was be changing or TSC itself wasn't stable ...
so I chose not to do it.

> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 1c2fe7de2842..6ce725007424 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *tim=
er)
>  {
>  	struct hrtimer_clock_base *base;
>  	unsigned long flags;
> -	int ret =3D -1;
> +	unsigned long state =3D timer->state;
> +	int ret;
> +
> +	if (state & HRTIMER_STATE_ENQUEUED)
> +		return 0;

It doesn't try very hard to cancel it ;)

> +	if (state & HRTIMER_STATE_CALLBACK)
> +		return -1;
>=20
>  	base =3D lock_hrtimer_base(timer, &flags);
>=20
> +	ret =3D -1;
>  	if (!hrtimer_callback_running(timer))
>  		ret =3D remove_hrtimer(timer, base);
>=20
> > Last, I think that having less interrupts on deadline changes is no=
t
> > completely according to the SDM which says: "If software disarms th=
e
> > timer or postpones the deadline, race conditions may result in the
> > delivery of a spurious timer interrupt.=E2=80=9D It never says inte=
rrupts may
> > be lost if you reprogram the deadline before you check it expired.

Yeah, too bad it wasn't written in a formally defined language ...
They could be just reserving design space for concurrent implementation=
,
not making race conditions mandatory.

Our race windows are much bigger -- a disarm that is hundreds of cycles
in the future can easily be hit after a msr write vm-exit, but it would
be very unlikely to get an interrupt on bare metal.

And thinking about the meaning of postpone or disarm -- software doesn'=
t
want the old interrupt anymore, so it would just add useless work.

(And I liked the code better without it, but it'd be ok if we fixed all
 the cases.)

> But the case when you rewrite the same value to the MSR is neither
> disarming nor postponing.  You would be getting two interrupts for th=
e
> same event.  That is why I agree with Radim that checking host_initia=
ted
> is wrong.

Current implementation also injects two interrupts when the timer is se=
t
to a lower nonzero value, which should give us just one under both
interpretations.