* gettimeofday "slow" in RHEL4 guests
@ 2008-11-24 17:47 David S. Ahern
2008-11-25 4:41 ` David S. Ahern
0 siblings, 1 reply; 15+ messages in thread
From: David S. Ahern @ 2008-11-24 17:47 UTC (permalink / raw)
To: kvm-devel
I noticed that gettimeofday in RHEL4.6 guests is taking much longer than
with RHEL3.8 guests. I wrote a simple program (see below) to call
gettimeofday in a loop 1,000,000 times and then used time to measure how
long it took.
For the RHEL3.8 guest:
time -p ./timeofday_bench
real 0.99
user 0.12
sys 0.24
For the RHEL4.6 guest with the default clock source (pmtmr):
time -p ./timeofday_bench
real 15.65
user 0.18
sys 15.46
and RHEL4.6 guest with PIT as the clock source (clock=pit kernel parameter):
time -p ./timeofday_bench
real 13.67
user 0.21
sys 13.45
So, basically gettimeofday() takes about 50 times as long on a RHEL4 guest.
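That works out to roughly 15.65 s / 10^6 calls, or about 15.7 microseconds per
gettimeofday() on the RHEL4.6 guest, versus about 1 microsecond per call on RHEL3.8.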
Host is a DL380G5, 2 dual-core Xeon 5140 processors, 4 GB of RAM. It's
running kvm.git tree as of 11/18/08 with kvm-75 userspace. Guest in both
RHEL3 and RHEL4 cases has 4 vcpus, 3.5GB of RAM.
david
----------
timeofday_bench.c:
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    int rc = 0, n;
    struct timeval tv;
    int iter = 1000000; /* number of times to call gettimeofday */

    if (argc > 1)
        iter = atoi(argv[1]);

    if (iter == 0) {
        fprintf(stderr, "invalid number of iterations\n");
        return 1;
    }

    printf("starting.... ");
    for (n = 0; n < iter; ++n) {
        if (gettimeofday(&tv, NULL) != 0) {
            fprintf(stderr, "\ngettimeofday failed\n");
            rc = 1;
            break;
        }
    }

    if (!rc)
        printf("done\n");

    return rc;
}
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-24 17:47 gettimeofday "slow" in RHEL4 guests David S. Ahern
@ 2008-11-25 4:41 ` David S. Ahern
2008-11-25 10:14 ` Andi Kleen
2008-11-25 17:20 ` Hollis Blanchard
0 siblings, 2 replies; 15+ messages in thread
From: David S. Ahern @ 2008-11-25 4:41 UTC (permalink / raw)
To: kvm-devel; +Cc: Marcelo Tosatti, Glauber de Oliveira Costa, Avi Kivity
Some more data on this overhead.
RHEL3 (which is based on the 2.4.21 kernel) gets microsecond resolution
by reading the TSC. Reading the TSC from within a guest is very fast on kvm.
RHEL4 (which is based on the 2.6.9 kernel) allows multiple time sources:
pmtmr (the ACPI power management timer, which is the default), pit, hpet
and tsc. The pmtmr and pit both do ioport reads to get microsecond
resolution (see read_pmtmr and get_offset_pit, respectively). With the
tsc as the timer source gettimeofday is *very* lightweight, but time
drifts badly and ntpd cannot acquire a sync. I believe someone is working
on HPET support for guests, and I know from bare-metal performance that
it is a much lighter-weight time source, but with RHEL4 the HPET breaks
the ability to use the RTC. So I'm running out of options for reliable
and lightweight time sources.
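To make the cost difference concrete, here is a minimal sketch (mine, not from
either kernel) contrasting the two read paths. The PM-timer port below is only a
placeholder -- the real address comes from the ACPI FADT -- and the ioport read
needs iopl/ioperm on an x86 Linux box:

#include <stdio.h>
#include <stdint.h>
#include <sys/io.h>     /* inl(), iopl(); x86 Linux, run as root */

#define PMTMR_PORT 0x408    /* placeholder; the real port comes from the ACPI FADT */

static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    /* A plain RDTSC stays inside the guest -- no vmexit on kvm. */
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static inline uint32_t read_pmtmr(void)
{
    /* Every ioport read of the ACPI PM timer is emulated by the host,
     * i.e. one vmexit per gettimeofday() on the pmtmr clock source. */
    return inl(PMTMR_PORT) & 0xffffff;   /* 24-bit counter */
}

int main(void)
{
    if (iopl(3) != 0) {
        perror("iopl");
        return 1;
    }
    printf("tsc: %llu  pmtmr: %u\n",
           (unsigned long long)read_tsc(), read_pmtmr());
    return 0;
}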
Any chance the pit or pmtmr options can be optimized a bit?
thanks,
david
PS. yes, I did try the userspace pit and its performance is worse than
the in-kernel PIT.
David S. Ahern wrote:
> I noticed that gettimeofday in RHEL4.6 guests is taking much longer than
> with RHEL3.8 guests. I wrote a simple program (see below) to call
> gettimeofday in a loop 1,000,000 times and then used time to measure how
> long it took.
>
>
> For the RHEL3.8 guest:
> time -p ./timeofday_bench
> real 0.99
> user 0.12
> sys 0.24
>
> For the RHEL4.6 guest with the default clock source (pmtmr):
> time -p ./timeofday_bench
> real 15.65
> user 0.18
> sys 15.46
>
> and RHEL4.6 guest with PIT as the clock source (clock=pit kernel parameter):
> time -p ./timeofday_bench
> real 13.67
> user 0.21
> sys 13.45
>
> So, basically gettimeofday() takes about 50 times as long on a RHEL4 guest.
>
> Host is a DL380G5, 2 dual-core Xeon 5140 processors, 4 GB of RAM. It's
> running kvm.git tree as of 11/18/08 with kvm-75 userspace. Guest in both
> RHEL3 and RHEL4 cases has 4 vcpus, 3.5GB of RAM.
>
> david
>
> ----------
>
> timeofday_bench.c:
>
> #include <sys/time.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>     int rc = 0, n;
>     struct timeval tv;
>     int iter = 1000000; /* number of times to call gettimeofday */
>
>     if (argc > 1)
>         iter = atoi(argv[1]);
>
>     if (iter == 0) {
>         fprintf(stderr, "invalid number of iterations\n");
>         return 1;
>     }
>
>     printf("starting.... ");
>     for (n = 0; n < iter; ++n) {
>         if (gettimeofday(&tv, NULL) != 0) {
>             fprintf(stderr, "\ngettimeofday failed\n");
>             rc = 1;
>             break;
>         }
>     }
>
>     if (!rc)
>         printf("done\n");
>
>     return rc;
> }
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 4:41 ` David S. Ahern
@ 2008-11-25 10:14 ` Andi Kleen
2008-11-25 11:17 ` Alexander Graf
2008-11-25 17:20 ` Hollis Blanchard
1 sibling, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2008-11-25 10:14 UTC (permalink / raw)
To: David S. Ahern
Cc: kvm-devel, Marcelo Tosatti, Glauber de Oliveira Costa, Avi Kivity
"David S. Ahern" <daahern@cisco.com> writes:
>
> Any chance the pit or pmtmr options can be optimized a bit?
They both require vmexits and will never be really fast. Same
for HPET. If you want fast gtod you really need to make the TSC work.
-Andi
--
ak@linux.intel.com
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 10:14 ` Andi Kleen
@ 2008-11-25 11:17 ` Alexander Graf
2008-11-25 11:48 ` Andi Kleen
0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2008-11-25 11:17 UTC (permalink / raw)
To: Andi Kleen
Cc: David S. Ahern, kvm-devel, Marcelo Tosatti,
Glauber de Oliveira Costa, Avi Kivity
On 25.11.2008, at 11:14, Andi Kleen <andi@firstfloor.org> wrote:
> "David S. Ahern" <daahern@cisco.com> writes:
>>
>> Any chance the pit or pmtmr options can be optimized a bit?
>
> They both will require vmexits and never be really fast. Same
> for HPET. If you want fast gtod you really need to make TSC work.
Why does hpet need to be slow? Can't you just 1:1 pass through one of
the hpet timers if you only have a limited number of VMs?
If done cleverly this might even work if #hpet > #cpu.
Alex
>
>
> -Andi
>
> --
> ak@linux.intel.com
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 11:17 ` Alexander Graf
@ 2008-11-25 11:48 ` Andi Kleen
2008-11-25 12:13 ` Alexander Graf
0 siblings, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2008-11-25 11:48 UTC (permalink / raw)
To: Alexander Graf
Cc: Andi Kleen, David S. Ahern, kvm-devel, Marcelo Tosatti,
Glauber de Oliveira Costa, Avi Kivity
> Why does hpet need to be slow? Can't you just 1:1 pass through one of
> the hpet timers if you only have a limited amount of vms?
HPET is not a truly virtualizable device: all the counters sit
in one block and cannot really be mapped to different guests.
Also, most systems have very few counters, and Linux typically needs
at least two (the system timer and /dev/hpet).
> If done cleverly this might even work if #hpet > #cpu.
Sure with a device model, but that needs vmexits.
-Andi
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 11:48 ` Andi Kleen
@ 2008-11-25 12:13 ` Alexander Graf
2008-11-25 12:52 ` Andi Kleen
0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2008-11-25 12:13 UTC (permalink / raw)
To: Andi Kleen
Cc: David S. Ahern, kvm-devel, Marcelo Tosatti,
Glauber de Oliveira Costa, Avi Kivity
On 25.11.2008, at 12:48, Andi Kleen wrote:
>> Why does hpet need to be slow? Can't you just 1:1 pass through one of
>> the hpet timers if you only have a limited amount of vms?
>
> HPET is not a truly virtualizable device, it's all the counters
> in one block that cannot be really mapped to different people.
Right, you'd have to remap stuff on non-page boundaries which is
probably pretty hard to do. Otherwise you wouldn't gain anything,
since you'd still have exits to reprogram the hpet.
> Also most systems have very little counters and Linux typically
> needs two
> at least (system timer and /dev/hpet)
Well, IIRC 3 is a pretty normal count, and blocking /dev/hpet while the
hpet is in use shouldn't be a problem.
But yeah - the remapping of HPET timers to virtual HPET timers sounds
pretty tough. I wonder if one could overcome that with a little
hardware support though ...
Alex
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 12:13 ` Alexander Graf
@ 2008-11-25 12:52 ` Andi Kleen
2008-12-28 18:38 ` Marcelo Tosatti
0 siblings, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2008-11-25 12:52 UTC (permalink / raw)
To: Alexander Graf
Cc: Andi Kleen, David S. Ahern, kvm-devel, Marcelo Tosatti,
Glauber de Oliveira Costa, Avi Kivity
> But yeah - the remapping of HPET timers to virtual HPET timers sounds
> pretty tough. I wonder if one could overcome that with a little
> hardware support though ...
For gettimeofday, better to make the TSC work. Even in the best case (no
virtualization) it is much faster than the HPET because it sits in the CPU,
while the HPET is far away on the external south bridge.
For other HPET usages (interval timers etc.), which are less
performance critical, I suppose vmexits are
not a serious problem, so a standard software device model should work.
-Andi
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 4:41 ` David S. Ahern
2008-11-25 10:14 ` Andi Kleen
@ 2008-11-25 17:20 ` Hollis Blanchard
2008-11-25 19:09 ` David S. Ahern
1 sibling, 1 reply; 15+ messages in thread
From: Hollis Blanchard @ 2008-11-25 17:20 UTC (permalink / raw)
To: David S. Ahern
Cc: kvm-devel, Marcelo Tosatti, Glauber de Oliveira Costa, Avi Kivity
On Mon, 2008-11-24 at 21:41 -0700, David S. Ahern wrote:
>
> RHEL3 (which is based on the 2.4.21 kernel) gets microsecond
> resolutions
> by reading the TSC. Reading the TSC from within a guest is very fast
> on kvm.
>
> RHEL4 (which is based on the 2.6.9 kernel) allows multiple time
> sources:
> pmtmr (ACPI power management timer which is the default), pit, hpet
> and TSC.
>
> The pmtmr and pit both do ioport reads to get microsecond resolutions
> (see read_pmtmr and get_offset_pit, respectively). For the tsc as the
> timer source gettimeofday is *very* lightweight, but time drifts very
> badly and ntpd cannot acquire a sync.
Why aren't you seeing severe time drift when using RHEL3 guests with the
TSC time source?
--
Hollis Blanchard
IBM Linux Technology Center
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 17:20 ` Hollis Blanchard
@ 2008-11-25 19:09 ` David S. Ahern
0 siblings, 0 replies; 15+ messages in thread
From: David S. Ahern @ 2008-11-25 19:09 UTC (permalink / raw)
To: Hollis Blanchard
Cc: kvm-devel, Marcelo Tosatti, Glauber de Oliveira Costa, Avi Kivity
Hollis Blanchard wrote:
> On Mon, 2008-11-24 at 21:41 -0700, David S. Ahern wrote:
>> RHEL3 (which is based on the 2.4.21 kernel) gets microsecond
>> resolutions
>> by reading the TSC. Reading the TSC from within a guest is very fast
>> on kvm.
>>
>> RHEL4 (which is based on the 2.6.9 kernel) allows multiple time
>> sources:
>> pmtmr (ACPI power management timer which is the default), pit, hpet
>> and TSC.
>>
>> The pmtmr and pit both do ioport reads to get microsecond resolutions
>> (see read_pmtmr and get_offset_pit, respectively). For the tsc as the
>> timer source gettimeofday is *very* lightweight, but time drifts very
>> badly and ntpd cannot acquire a sync.
>
> Why aren't you seeing severe time drift when using RHEL3 guests with the
> TSC time source?
>
With RHEL3 it's a PIT time source, and the PIT counter is only read on
interrupts. For gettimeofday requests only the TSC is read; the
algorithm for microsecond resolution uses the PIT count and its TSC
timestamp from the last interrupt.
In RHEL4, the PIT counter is read on every gettimeofday request when it
is the timer source. That's the cause of the extra overhead and,
consequently, the worse performance.
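Roughly, the RHEL3-style fast path looks like this (a simplified, compile-only
sketch with made-up names, not the actual 2.4 source):

#include <stdint.h>

#define USEC_PER_TICK 10000ULL          /* 100 Hz tick, typical of 2.4 kernels */

static inline uint64_t read_tsc(void)   /* same rdtsc wrapper as in the earlier sketch */
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t last_tick_tsc;      /* TSC sampled in the tick handler */
static uint64_t last_tick_usec;     /* wall-clock microseconds at that tick */
static uint64_t usec_per_cycle_q32; /* (usec << 32) per TSC cycle, precomputed */

/* Timer tick: the only place the PIT hardware is touched. */
static void timer_tick(void)
{
    last_tick_tsc   = read_tsc();
    last_tick_usec += USEC_PER_TICK;
}

/* gettimeofday() fast path: one rdtsc, no ioport access, no vmexit.
 * The RHEL4 pit/pmtmr clocks re-read the hardware counter here instead. */
static uint64_t gtod_usec(void)
{
    return last_tick_usec
         + (((read_tsc() - last_tick_tsc) * usec_per_cycle_q32) >> 32);
}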
david
* Re: gettimeofday "slow" in RHEL4 guests
2008-11-25 12:52 ` Andi Kleen
@ 2008-12-28 18:38 ` Marcelo Tosatti
2008-12-29 12:37 ` Yang, Sheng
2008-12-29 13:11 ` Avi Kivity
0 siblings, 2 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2008-12-28 18:38 UTC (permalink / raw)
To: Yang, Sheng
Cc: Alexander Graf, David S. Ahern, kvm-devel,
Glauber de Oliveira Costa, Avi Kivity, Gleb Natapov, Dor Laor
On Tue, Nov 25, 2008 at 01:52:59PM +0100, Andi Kleen wrote:
> > But yeah - the remapping of HPET timers to virtual HPET timers sounds
> > pretty tough. I wonder if one could overcome that with a little
> > hardware support though ...
>
> For gettimeofday better make TSC work. Even in the best case (no
> virtualization) it is much faster than HPET because it sits in the CPU,
> while HPET is far away on the external south bridge.
The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
The algorithm uses the latched PIT count to measure the delay between
interrupt generation and handling, and adds that value, on the next
interrupt, to the TSC delta.
Sheng investigated this problem in the discussions before in-kernel PIT
was merged:
http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg13873.html
The algorithm overcompensates for lost ticks, and the guest time runs
faster than the host's.
There are two issues:
1) A bug in the in-kernel PIT which miscalculates the count value.
2) For the case where more than one interrupt is lost, and later
reinjected, the value read from PIT count is meaningless for the purpose
of the tsc algorithm. The count is interpreted as the delay until the
next interrupt, which is not the case with reinjection.
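For reference, here is a rough, compile-only sketch of that compensation path as
described above (names and units are mine, not the 2.6.9 source):

#define PIT_TICK_RATE 1193182UL             /* i8254 input clock, Hz */
#define HZ            100
#define LATCH         (PIT_TICK_RATE / HZ)  /* PIT counts per 10 ms tick */

static unsigned long delay_at_last_tick;    /* latency measured at the previous tick, in PIT counts */

/* Runs in the timer interrupt.  'latched_count' is the PIT counter value
 * latched on entry; 'tsc_delta' is the TSC delta since the previous
 * interrupt, already converted to PIT counts for comparison. */
static unsigned long lost_ticks(unsigned int latched_count,
                                unsigned long tsc_delta)
{
    /* The delay measured at the last interrupt is added to the TSC
     * delta; anything beyond one LATCH period counts as lost ticks. */
    unsigned long elapsed = tsc_delta + delay_at_last_tick;

    /* The PIT counts down from LATCH, so LATCH - count is how long the
     * interrupt sat before being handled.  When KVM reinjects a backlog
     * of ticks, this value no longer measures latency, the sum above
     * overshoots, and guest time runs fast. */
    delay_at_last_tick = LATCH - latched_count;

    return elapsed > LATCH ? elapsed / LATCH - 1 : 0;
}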
As Sheng mentioned in the thread above, Xen pulls back the TSC value
when reinjecting interrupts. VMware ESX has a notion of "virtual TSC",
which I believe is similar in this context.
For KVM I believe the best immediate solution (for now) is to provide an
option to disable reinjection, behaving similarly to real hardware. The
advantage is simplicity compared to virtualizing the time sources.
The QEMU PIT emulation has a limit on the rate of interrupt reinjection;
perhaps something similar should be investigated in the future.
The following patch (which contains the bugfix for issue 1 and disables
reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
What I'm proposing is to condition reinjection with an option
(-kvm-pit-no-reinject or something).
Comments or better ideas?
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index e665d1c..608af7b 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
if (!atomic_inc_and_test(&pt->pending))
set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
+ if (atomic_read(&pt->pending) > 1)
+ atomic_set(&pt->pending, 1);
+
if (vcpu0 && waitqueue_active(&vcpu0->wq))
wake_up_interruptible(&vcpu0->wq);
hrtimer_add_expires_ns(&pt->timer, pt->period);
pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
if (pt->period)
- ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
+ ps->channels[0].count_load_time = ktime_get();
return (pt->period == 0 ? 0 : 1);
}
* Re: gettimeofday "slow" in RHEL4 guests
2008-12-28 18:38 ` Marcelo Tosatti
@ 2008-12-29 12:37 ` Yang, Sheng
2008-12-29 13:11 ` Avi Kivity
1 sibling, 0 replies; 15+ messages in thread
From: Yang, Sheng @ 2008-12-29 12:37 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Alexander Graf, David S. Ahern, kvm-devel,
Glauber de Oliveira Costa, Avi Kivity, Gleb Natapov, Dor Laor
On Monday 29 December 2008 02:38:07 Marcelo Tosatti wrote:
> On Tue, Nov 25, 2008 at 01:52:59PM +0100, Andi Kleen wrote:
> > > But yeah - the remapping of HPET timers to virtual HPET timers sounds
> > > pretty tough. I wonder if one could overcome that with a little
> > > hardware support though ...
> >
> > For gettimeofday better make TSC work. Even in the best case (no
> > virtualization) it is much faster than HPET because it sits in the CPU,
> > while HPET is far away on the external south bridge.
>
> The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
> The algorithm uses the PIT count (latched) to measure the delay between
> interrupt generation and handling, and sums that value, on the next
> interrupt, to the TSC delta.
>
> Sheng investigated this problem in the discussions before in-kernel PIT
> was merged:
>
> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg13873.html
>
> The algorithm overcompensates for lost ticks and the guest time runs
> faster than the hosts.
>
> There are two issues:
>
> 1) A bug in the in-kernel PIT which miscalculates the count value.
>
> 2) For the case where more than one interrupt is lost, and later
> reinjected, the value read from PIT count is meaningless for the purpose
> of the tsc algorithm. The count is interpreted as the delay until the
> next interrupt, which is not the case with reinjection.
>
> As Sheng mentioned in the thread above, Xen pulls back the TSC value
> when reinjecting interrupts. VMWare ESX has a notion of "virtual TSC",
> which I believe is similar in this context.
>
> For KVM I believe the best immediate solution (for now) is to provide an
> option to disable reinjection, behaving similarly to real hardware. The
> advantage is simplicity compared to virtualizing the time sources.
>
> The QEMU PIT emulation has a limit on the rate of interrupt reinjection,
> perhaps something similar should be investigated in the future.
>
> The following patch (which contains the bugfix for 1) and disabled
> reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
> What I'm proposing is to condition reinjection with an option
> (-kvm-pit-no-reinject or something).
I agree that it should go with a userspace option to disable reinjection, as
it's hard to overcome the problem of delayed interrupt injection...
--
regards
Yang, Sheng
> Comments or better ideas?
>
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index e665d1c..608af7b 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
> if (!atomic_inc_and_test(&pt->pending))
> set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
>
> + if (atomic_read(&pt->pending) > 1)
> + atomic_set(&pt->pending, 1);
> +
> if (vcpu0 && waitqueue_active(&vcpu0->wq))
> wake_up_interruptible(&vcpu0->wq);
>
> hrtimer_add_expires_ns(&pt->timer, pt->period);
> pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
> if (pt->period)
> - ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
> + ps->channels[0].count_load_time = ktime_get();
>
> return (pt->period == 0 ? 0 : 1);
> }
* Re: gettimeofday "slow" in RHEL4 guests
2008-12-28 18:38 ` Marcelo Tosatti
2008-12-29 12:37 ` Yang, Sheng
@ 2008-12-29 13:11 ` Avi Kivity
2008-12-29 16:12 ` Dor Laor
1 sibling, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2008-12-29 13:11 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Yang, Sheng, Alexander Graf, David S. Ahern, kvm-devel,
Glauber de Oliveira Costa, Gleb Natapov, Dor Laor
Marcelo Tosatti wrote:
> The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
> The algorithm uses the PIT count (latched) to measure the delay between
> interrupt generation and handling, and sums that value, on the next
> interrupt, to the TSC delta.
>
> Sheng investigated this problem in the discussions before in-kernel PIT
> was merged:
>
> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg13873.html
>
> The algorithm overcompensates for lost ticks and the guest time runs
> faster than the hosts.
>
> There are two issues:
>
> 1) A bug in the in-kernel PIT which miscalculates the count value.
>
> 2) For the case where more than one interrupt is lost, and later
> reinjected, the value read from PIT count is meaningless for the purpose
> of the tsc algorithm. The count is interpreted as the delay until the
> next interrupt, which is not the case with reinjection.
>
> As Sheng mentioned in the thread above, Xen pulls back the TSC value
> when reinjecting interrupts. VMWare ESX has a notion of "virtual TSC",
> which I believe is similar in this context.
>
> For KVM I believe the best immediate solution (for now) is to provide an
> option to disable reinjection, behaving similarly to real hardware. The
> advantage is simplicity compared to virtualizing the time sources.
>
> The QEMU PIT emulation has a limit on the rate of interrupt reinjection,
> perhaps something similar should be investigated in the future.
>
> The following patch (which contains the bugfix for 1) and disabled
> reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
> What I'm proposing is to condition reinjection with an option
> (-kvm-pit-no-reinject or something).
>
> Comments or better ideas?
>
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index e665d1c..608af7b 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
> if (!atomic_inc_and_test(&pt->pending))
> set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
>
> + if (atomic_read(&pt->pending) > 1)
> + atomic_set(&pt->pending, 1);
> +
>
Replace the atomic_inc() with atomic_set(, 1) instead? One less test,
and more importantly, the logic is less scattered around the source.
> if (vcpu0 && waitqueue_active(&vcpu0->wq))
> wake_up_interruptible(&vcpu0->wq);
>
> hrtimer_add_expires_ns(&pt->timer, pt->period);
> pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
> if (pt->period)
> - ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
> + ps->channels[0].count_load_time = ktime_get();
>
> return (pt->period == 0 ? 0 : 1);
> }
>
I don't like the idea of punting to the user, but it looks like we don't
have a choice. Hopefully vendors will port kvmclock to these kernels
and release them as updates -- time simply doesn't work well with
virtualization, especially for Linux guests.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: gettimeofday "slow" in RHEL4 guests
2008-12-29 13:11 ` Avi Kivity
@ 2008-12-29 16:12 ` Dor Laor
2008-12-29 16:27 ` Avi Kivity
2008-12-29 16:29 ` Avi Kivity
0 siblings, 2 replies; 15+ messages in thread
From: Dor Laor @ 2008-12-29 16:12 UTC (permalink / raw)
To: Avi Kivity
Cc: Marcelo Tosatti, Yang, Sheng, Alexander Graf, David S. Ahern,
kvm-devel, Glauber de Oliveira Costa, Gleb Natapov
Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
>> The algorithm uses the PIT count (latched) to measure the delay between
>> interrupt generation and handling, and sums that value, on the next
>> interrupt, to the TSC delta.
>>
>> Sheng investigated this problem in the discussions before in-kernel PIT
>> was merged:
>>
>> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg13873.html
>>
>>
>> The algorithm overcompensates for lost ticks and the guest time runs
>> faster than the hosts.
>>
>> There are two issues:
>>
>> 1) A bug in the in-kernel PIT which miscalculates the count value.
>>
>> 2) For the case where more than one interrupt is lost, and later
>> reinjected, the value read from PIT count is meaningless for the purpose
>> of the tsc algorithm. The count is interpreted as the delay until the
>> next interrupt, which is not the case with reinjection.
>>
>> As Sheng mentioned in the thread above, Xen pulls back the TSC value
>> when reinjecting interrupts. VMWare ESX has a notion of "virtual TSC",
>> which I believe is similar in this context.
>>
>> For KVM I believe the best immediate solution (for now) is to provide an
>> option to disable reinjection, behaving similarly to real hardware. The
>> advantage is simplicity compared to virtualizing the time sources.
>>
>> The QEMU PIT emulation has a limit on the rate of interrupt reinjection,
>> perhaps something similar should be investigated in the future.
>>
>> The following patch (which contains the bugfix for 1) and disabled
>> reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
>> What I'm proposing is to condition reinjection with an option
>> (-kvm-pit-no-reinject or something).
>>
>> Comments or better ideas?
>>
>>
>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>> index e665d1c..608af7b 100644
>> --- a/arch/x86/kvm/i8254.c
>> +++ b/arch/x86/kvm/i8254.c
>> @@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state
>> *ps)
>> if (!atomic_inc_and_test(&pt->pending))
>> set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
>>
>> + if (atomic_read(&pt->pending) > 1)
>> + atomic_set(&pt->pending, 1);
>> +
>>
>
> Replace the atomic_inc() with atomic_set(, 1) instead? One less test,
> and more important, the logic is scattered less around the source.
But having only a pending bit instead of a counter will cause kvm to
drop pit irqs in rare high-load situations.
The disable-reinjection option is better.
>
>> if (vcpu0 && waitqueue_active(&vcpu0->wq))
>> wake_up_interruptible(&vcpu0->wq);
>>
>> hrtimer_add_expires_ns(&pt->timer, pt->period);
>> pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
>> if (pt->period)
>> - ps->channels[0].count_load_time =
>> hrtimer_get_expires(&pt->timer);
>> + ps->channels[0].count_load_time = ktime_get();
>>
>> return (pt->period == 0 ? 0 : 1);
>> }
>>
>
> I don't like the idea of punting to the user but looks like we don't
> have a choice. Hopefully vendors will port kvmclock to these kernels
> and release them as updates -- time simply doesn't work well with
> virtualization, especially Linux guests.
>
Except for these 'tsc compensate' guests, in what situations does
the guest write its tsc?
If this is the only case, we can disable reinjection once we trap tsc writes.
* Re: gettimeofday "slow" in RHEL4 guests
2008-12-29 16:12 ` Dor Laor
@ 2008-12-29 16:27 ` Avi Kivity
2008-12-29 16:29 ` Avi Kivity
1 sibling, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2008-12-29 16:27 UTC (permalink / raw)
To: dlaor
Cc: Marcelo Tosatti, Yang, Sheng, Alexander Graf, David S. Ahern,
kvm-devel, Glauber de Oliveira Costa, Gleb Natapov
Dor Laor wrote:
>>>
>>> + if (atomic_read(&pt->pending) > 1)
>>> + atomic_set(&pt->pending, 1);
>>> +
>>>
>>
>> Replace the atomic_inc() with atomic_set(, 1) instead? One less test,
>> and more important, the logic is scattered less around the source.
> But having only a pending bit instead of a counter will cause kvm to
> drop pit irqs on rare high load situations.
> The disable reinjection option is better.
Both variants disable reinjection. Forcing a counter to 1 every time it
exceeds 1 is equivalent to maintaining a bit.
In both variants, there is a missing 'if (disable_reinjection)' (Marcelo
mentioned this in the original message).
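For what it's worth, a guess at the shape of that missing conditional (not an
actual patch; "reinject" is an assumed flag that the proposed
-kvm-pit-no-reinject option would control):

	if (!ps->pit_timer.reinject && atomic_read(&pt->pending) > 1)
		atomic_set(&pt->pending, 1);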
> Except for these 'tsc compensate' guest, what are the occasions where
> the guest writes his tsc?
> If this is the only case we can disable reinjection once we trap tsc
> writes.
I don't think these guests write to the tsc. Rather, they read the tsc
and the pit counters and try to correlate. And fail.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.