Forgot to add something in my last response: another time-based oddity I'm
seeing in multi-processor guests is the microseconds value changing as the
process is moved between the vcpus. The attached code exemplifies what I
mean (a minimal sketch of it is appended at the end of this message). In a
RHEL3 VM with 2 vcpus, start the program with an argument of 990000 (to get
a wakeup every ~1 sec). Once started, lock it to a vcpu. You'll see nice,
consistent output like:

1217351975.261974
1217351976.262292
1217351977.262608
1217351978.262929
1217351979.263243
1217351980.263563
1217351981.263940

Then switch the affinity to the other vcpu. The microseconds value jumps:

1217351982.796132
1217351983.797411
1217351984.797719
1217351985.798041
1217351986.798368
1217351987.798788
1217351988.799025

Toggling the affinity, or letting the process roam between the 2 processors,
causes the microseconds value to jump. This means that data logged using the
microseconds value will show time jumping back and forth. As I understand
it, the root cause is the TSC-based adjustment to the value returned by
gettimeofday(), so the fact that the values jump as the process toggles
between vcpus means the 2 vcpus see different TSC counts (a sketch to check
this directly is also appended below). Is there any way to make the counts
coherent as processes roam vcpus?

david

David S. Ahern wrote:
>
> Marcelo Tosatti wrote:
>> On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
>>> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
>>> short of it is that all of them keep time quite well with 1 vcpu. In
>>> the case of RHEL3 and RHEL4, time is stable for *both* the
>>> uniprocessor and smp kernels, again with only 1 vcpu (there's no
>>> up/smp distinction in the kernels for RHEL5).
>>>
>>> As soon as the number of vcpus is >1, time drifts systematically, with
>>> the guest *leading* the host. I see this on unloaded guests and hosts
>>> (i.e., cpu usage on the host ~<5%). The drift averages around
>>> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of
>>> real wall time).
>>>
>>> This is very reproducible. All I am doing is installing stock RHEL3.8,
>>> 4.4 and 5.2, i386 versions, starting them, and watching the drift with
>>> no time servers. In all of these recent cases the results are for the
>>> in-kernel PIT.
>>
>> David,
>>
>> You mentioned earlier problems with ntpd syncing the guest time? Can
>> you provide more details?
>>
>
> It would lose sync often, and 'ntpq -c pe' would show a '*' indicative
> of a sync when in fact time in the guest was off by 5-10 seconds. It may
> very well be a side effect of the drift due to repeated injection of
> timer interrupts / lost interrupts.
>
> With your PIT injection patches:
>
> 1. For a stock RHEL4.4 guest, ntpd synchronized quickly and saw no need
> to adjust time after the initial startup tweak of 1.004620 sec by
> ntpdate. After 40 hours it has maintained time very well with no
> adjustments. Of course the guest is relatively idle -- it is only
> keeping time.
>
> 2. For a stock RHEL3.8 guest, I cannot get ntpd to do anything. This
> guest is running on the same host as the RHEL4 guest and using the same
> time server. This guest has been around for a few weeks and has been
> subjected to various tests -- like running with the no-kernel-pit and
> -tdf options. In light of 3. below I'll re-create this guest and see if
> the problem goes away.
>
> 3. For a RHEL3.8 guest running a Cisco product, ntpd was able to
> synchronize just fine. We are running ntpd with different arguments;
> however, using the same syntax on the stock RHEL3 guest did not help.
>
> As far as time updates go, over 21+ hours of uptime there have been 20
> time resets -- adjustments ranging from -1.01 seconds to +0.75 seconds.
> This is a remarkable improvement. Before this PIT patch set I was seeing
> time resets of 3-5 seconds every 15 minutes. This is a 2 vcpu guest
> running a modest load (disk + network) that pushes cpu usage to ~25%.
> The point being that the guest is keeping time reasonably well while
> doing something useful. :-)
>
> I am planning to install 4 vcpu guests for both RHEL3 and RHEL4 today,
> again with modest loads, to see how it holds up.
>
>> I find it _necessary_ to use the RR scheduling policy for any Linux
>> guest running at a static 1000Hz (no dynticks), otherwise timer
>> interrupts will invariably be missed. And reinjection plus lost tick
>> adjustment is always problematic (it will drift either way, depending
>> on which version of Linux). With the standard batch scheduling policy,
>> _idle_ guests can wait up to 6/7 ms to run in my testing (thus 6/7
>> lost timer events). Which also means latency can be horrible.
>>
>
> Noted. I'd prefer not to start priority escalations, but if it's
> needed....
>
> What about the RHEL4.7 kernel running at 250 HZ? I understand that with
> 4.7 you can pass a command-line divider option to run the clock at a
> slower rate. In the past I've recompiled RHEL4 kernels to run at 250 HZ,
> which was a trade-off between too fast (overhead of timer interrupts)
> and too slow (need for better scheduling latency).
>
>
> david
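
[For reference, since the attachment is not included inline here: a minimal
sketch of a program with the behavior described above -- sleep for argv[1]
microseconds, then print gettimeofday() as seconds.microseconds. The actual
attached code may differ.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char *argv[])
    {
            struct timeval tv;
            unsigned long usecs;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <sleep-usecs>\n", argv[0]);
                    return 1;
            }
            usecs = strtoul(argv[1], NULL, 0);

            for (;;) {
                    usleep(usecs);  /* e.g. 990000 for ~1 sec wakeups */
                    gettimeofday(&tv, NULL);
                    printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
                    fflush(stdout);
            }
            return 0;
    }

Pin it to a vcpu with, e.g., 'taskset -p 0x1 <pid>', then switch the mask
to 0x2 to reproduce the jump.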
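
[A second sketch, untested, assuming i386/x86_64 and a glibc with
sched_setaffinity()/cpu_set_t: read the TSC while pinned to each vcpu to
see the per-vcpu offset directly. With coherent TSCs the delta is roughly
the time between the two reads; a huge or negative delta means the vcpus'
counts disagree.]

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>

    static unsigned long long rdtsc(void)
    {
            unsigned int lo, hi;
            __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
            return ((unsigned long long)hi << 32) | lo;
    }

    static unsigned long long tsc_on_cpu(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set) < 0) {
                    perror("sched_setaffinity");
                    return 0;
            }
            sched_yield();  /* make sure we are running on the target cpu */
            return rdtsc();
    }

    int main(void)
    {
            unsigned long long t0 = tsc_on_cpu(0);
            unsigned long long t1 = tsc_on_cpu(1);

            printf("cpu0: %llu\ncpu1: %llu\ndelta: %lld\n",
                   t0, t1, (long long)(t1 - t0));
            return 0;
    }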