From: Marcelo Tosatti
Subject: Re: [patch 2/3] KVM: x86: add option to advance tscdeadline hrtimer expiration
Date: Wed, 17 Dec 2014 15:41:39 -0200
Message-ID: <20141217174139.GA31721@amt.cnet>
In-Reply-To: <20141217145805.GA29368@potion.brq.redhat.com>
To: Radim Krcmar
Cc: kvm@vger.kernel.org, Luiz Capitulino, Rik van Riel, Paolo Bonzini

On Wed, Dec 17, 2014 at 03:58:13PM +0100, Radim Krcmar wrote:
> 2014-12-16 09:08-0500, Marcelo Tosatti:
> > For the hrtimer which emulates the tscdeadline timer in the guest,
> > add an option to advance expiration, and busy spin on VM-entry waiting
> > for the actual expiration time to elapse.
> >
> > This allows achieving low latencies in cyclictest (or any scenario
> > which requires strict timing regarding timer expiration).
> >
> > Reduces average cyclictest latency from 12us to 8us
> > on Core i5 desktop.
> >
> > Note: this option requires tuning to find the appropriate value
> > for a particular hardware/guest combination.
> > One method is to measure the
> > average delay between apic_timer_fn and VM-entry.
> > Another method is to start with 1000ns, and increase the value
> > in say 500ns increments until avg cyclictest numbers stop decreasing.
> >
> > Signed-off-by: Marcelo Tosatti
>
> Reviewed-by: Radim Krčmář
>
> > +++ kvm/arch/x86/kvm/lapic.c
> > @@ -1087,11 +1089,64 @@ static void apic_timer_expired(struct kv
> [...]
> > +/*
> > + * On APICv, this test will cause a busy wait
> > + * during a higher-priority task.
> > + */
>
> (A bit confusing ... this test doesn't busy wait.)
>
> > +
> > +static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> [...]
> > +void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > +{
> [...]
> > +	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> > +	apic->lapic_timer.expired_tscdeadline = 0;
> > +	guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
> > +
> > +	while (guest_tsc < tsc_deadline) {
> > +		int delay = min(tsc_deadline - guest_tsc, 1000ULL);
>
> Why break the __delay() loop into smaller parts?

So that interrupts can be handled, in case this code ever moves
outside the IRQ-protected region.

> > +		__delay(delay);
>
> (Does not have to call delay_tsc, but I guess it won't change.)
>
> > +		guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
> > +	}
> > }
>
> Btw. simple automatic delta tuning had worse results?

Haven't tried automatic tuning.

So what happens in a realtime environment is this: you execute a fixed
number of instructions from interrupt handling all the way to VM-entry.
Well, almost fixed: what is fixed is the number of apic_timer_fn plus
KVM instructions; you can also execute host scheduler and timekeeping
processing.

In practice, the time it takes to execute that instruction sequence
follows a bell-shaped distribution around the average (the right side
is slightly heavier due to host scheduler and timekeeping processing).
You want to advance the timer by the rightmost bucket: that way you
guarantee the lowest possible latencies (which is the interest here).

That said, I don't see an advantage in automatic tuning for the use case
this targets.