Forgot to add something in my last response: another time-based oddity I'm
seeing in multi-processor guests is the microseconds value changing as the
process is moved between the vcpus. The attached code exemplifies what I
mean (a minimal sketch of it is appended at the end of this message). In a
RHEL3 VM with 2 vcpus, start the program with an argument of 990000 (to get
a wakeup every ~1 sec). Once started, lock it to a vcpu. You'll see nice,
consistent output like:

1217351975.261974
1217351976.262292
1217351977.262608
1217351978.262929
1217351979.263243
1217351980.263563
1217351981.263940

Then switch the affinity to the other vcpu. The microseconds value jumps:

1217351982.796132
1217351983.797411
1217351984.797719
1217351985.798041
1217351986.798368
1217351987.798788
1217351988.799025

Toggling the affinity, or letting the process roam between the 2 processors,
causes the microseconds value to jump. This means that data logged using the
microseconds value will show time jumping back and forth. As I understand
it, the root cause is the TSC-based adjustment to the value returned by
gettimeofday(), so the fact that the values jump as the process toggles
between vcpus means the 2 vcpus see different TSC counts (a sketch to check
this directly is also appended below). Is there any way to make the counts
coherent as processes roam vcpus?

david

David S. Ahern wrote:
>
> Marcelo Tosatti wrote:
>> On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
>>> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
>>> short of it is that all of them keep time quite well with 1 vcpu. In
>>> the case of RHEL3 and RHEL4, time is stable for *both* the
>>> uniprocessor and smp kernels, again with only 1 vcpu (there's no
>>> up/smp distinction in the kernels for RHEL5).
>>>
>>> As soon as the number of vcpus is >1, time drifts systematically, with
>>> the guest *leading* the host. I see this on unloaded guests and hosts
>>> (i.e., cpu usage on the host ~<5%). The drift averages around
>>> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of
>>> real wall time).
>>>
>>> This is very reproducible. All I am doing is installing stock RHEL3.8,
>>> 4.4 and 5.2, i386 versions, starting them, and watching the drift with
>>> no time servers. In all of these recent cases the results are for the
>>> in-kernel PIT.
>>
>> David,
>>
>> You mentioned earlier problems with ntpd syncing the guest time? Can
>> you provide more details?
>>
>
> It would lose sync often, and 'ntpq -c pe' would show a '*' indicative
> of a sync when in fact time in the guest was off by 5-10 seconds. It may
> very well be a side effect of the drift due to repeated injection of
> timer interrupts / lost interrupts.
>
> With your PIT injection patches:
>
> 1. For a stock RHEL4.4 guest, ntpd synchronized quickly and saw no need
> to adjust time after the initial startup tweak of 1.004620 sec by
> ntpdate. After 40 hours it has maintained time very well with no
> adjustments. Of course the guest is relatively idle -- it is only
> keeping time.
>
> 2. For a stock RHEL3.8 guest, I cannot get ntpd to do anything. This
> guest is running on the same host as the RHEL4 guest and using the same
> time server. This guest has been around for a few weeks and has been
> subjected to various tests -- like running with the no-kernel-pit and
> -tdf options. In light of 3. below I'll re-create this guest and see if
> the problem goes away.
>
> 3. For a RHEL3.8 guest running a Cisco product, ntpd was able to
> synchronize just fine. We are running ntpd with different arguments;
> however, using the same syntax on the stock RHEL3 guest did not help.
>
> As far as time updates go, over 21+ hours of uptime there have been 20
> time resets -- adjustments ranging from -1.01 seconds to +0.75 seconds.
> This is a remarkable improvement. Before this PIT patch set I was seeing
> time resets of 3-5 seconds every 15 minutes. This is a 2 vcpu guest
> running a modest load (disk + network) that pushes cpu usage to ~25%.
> The point being that the guest is keeping time reasonably well while
> doing something useful. :-)
>
> I am planning to install 4 vcpu guests for both RHEL3 and RHEL4 today,
> again with modest loads, to see how it holds up.
>
>> I find it _necessary_ to use the RR scheduling policy for any Linux
>> guest running at a static 1000Hz (no dynticks), otherwise timer
>> interrupts will invariably be missed. And reinjection plus lost tick
>> adjustment is always problematic (it will drift either way, depending
>> on which version of Linux). With the standard batch scheduling policy,
>> _idle_ guests can wait up to 6/7 ms to run in my testing (thus 6/7
>> lost timer events). Which also means latency can be horrible.
>>
>
> Noted. I'd prefer not to start priority escalations, but if it's
> needed....
>
> What about the RHEL4.7 kernel running at 250 HZ? I understand that with
> 4.7 you can pass a command-line divider option to run the clock at a
> slower rate. In the past I've recompiled RHEL4 kernels to run at 250 HZ,
> which was a trade-off between too fast (overhead of timer interrupts)
> and too slow (need for better scheduling latency).
>
>
> david
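
[For reference, since the attachment is not included inline here: a minimal
sketch of a program with the behavior described above -- sleep for argv[1]
microseconds, then print gettimeofday() as seconds.microseconds. The actual
attached code may differ.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char *argv[])
    {
            struct timeval tv;
            unsigned long usecs;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <sleep-usecs>\n", argv[0]);
                    return 1;
            }
            usecs = strtoul(argv[1], NULL, 0);

            for (;;) {
                    usleep(usecs);  /* e.g. 990000 for ~1 sec wakeups */
                    gettimeofday(&tv, NULL);
                    printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
                    fflush(stdout);
            }
            return 0;
    }

Pin it to a vcpu with, e.g., 'taskset -p 0x1 <pid>', then switch the mask
to 0x2 to reproduce the jump.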
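
[A second sketch, untested, assuming i386/x86_64 and a glibc with
sched_setaffinity()/cpu_set_t: read the TSC while pinned to each vcpu to
see the per-vcpu offset directly. With coherent TSCs the delta is roughly
the time between the two reads; a huge or negative delta means the vcpus'
counts disagree.]

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>

    static unsigned long long rdtsc(void)
    {
            unsigned int lo, hi;
            __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
            return ((unsigned long long)hi << 32) | lo;
    }

    static unsigned long long tsc_on_cpu(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set) < 0) {
                    perror("sched_setaffinity");
                    return 0;
            }
            sched_yield();  /* make sure we are running on the target cpu */
            return rdtsc();
    }

    int main(void)
    {
            unsigned long long t0 = tsc_on_cpu(0);
            unsigned long long t1 = tsc_on_cpu(1);

            printf("cpu0: %llu\ncpu1: %llu\ndelta: %lld\n",
                   t0, t1, (long long)(t1 - t0));
            return 0;
    }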