From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Dan Magenheimer"
Subject: Xen system skew MUCH worse than tsc skew (was RE: RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
Date: Thu, 10 Jul 2008 16:42:38 -0600
Message-ID: <20080710164238562.00000003744@djm-pc>
Reply-To: "dan.magenheimer@oracle.com"
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Keir Fraser, "Xen-Devel (E-mail)"
Cc: Dave Winchell
List-Id: xen-devel@lists.xenproject.org

> > 7) CONJECTURE: Result of natural skews between platform
> > timer and tsc, plus jitter. Unfixable.
> >
> > Possible, untested, not sure how.
>
> I ended up suspecting this on one of the test platforms I
> originally did the Xen-system-time implementation on. It was an
> old AMD white box iirc. On that system, TSC and platform time
> seemed to have a significant and inexplicable jitter at around
> 1Hz. The jitter was 100s of ppm, which was totally unexpected for
> what should be crystal-based oscillators. And the test code was
> simple enough that it was hard to suspect that either (I think I
> was just dumping the counters every second or two after reading
> them as close together as I could).

Is this the code in read_clocks() in keyhandler.c? If so, I just
did an experiment there with some interesting results:

I modified that code to record the "max dif" and then executed it
>10000 times. The result shows maxdif ~11usec which corresponds
with my earlier measurements.

Next, I replaced the calls to NOW() in read_clocks() and
read_clocks_slave() with rdtscll(). Guess what? The result is a
maxdif of 11000 "ticks" but now on a 3GHz clock, which is about
3.3usec.
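
For reference, the bookkeeping and unit conversion behind these numbers can be sketched in plain C. This is a standalone illustration, not the code in keyhandler.c: record_dif() and ticks_to_ns() are hypothetical helper names, and the 3GHz rate is assumed to be exact. Under that assumption 11000 ticks comes out nearer 3.67usec and 10000 ticks nearer 3.33usec, so the tick count and the usec figure above are best read as approximate.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from keyhandler.c): remember the largest
 * absolute difference between two clock readings seen so far.
 */
static void record_dif(uint64_t *max_dif, uint64_t a, uint64_t b)
{
    uint64_t dif = (a > b) ? a - b : b - a;

    if (dif > *max_dif)
        *max_dif = dif;
}

/*
 * Hypothetical helper: convert a tick count to nanoseconds for a
 * nominal clock rate in Hz. The intermediate product overflows for
 * tick counts above roughly 1.8e10, which is fine for the small
 * deltas discussed here. The result is truncated, not rounded.
 */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t hz)
{
    return ticks * 1000000000ULL / hz;
}
```

Calling record_dif(&max, a, b) after each pair of readings and then converting the final max with ticks_to_ns(max, 3000000000ULL) gives the worst-case difference in nanoseconds.
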
Next, I disabled interrupts in read_clocks_slave() around the while
loop plus the rdtscll() to ensure I'm not accidentally counting
any interrupts. Now I'm seeing maxdif<330nsec (>6000 measurements).

Next, I went back to NOW(), but with interrupts disabled as above.
So far maxdif is about 10.7usec (>6000 measurements).

SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!

Looks to me like there's still something algorithmically wrong
and it's not just natural skew and jitter. Maybe some corner case
in the scale-delta code? Also, should interrupts be turned off
during the calibration part of init_pit_and_calibrate_tsc()
(leaving them on might cause different scaling factors for each CPU)?
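
To make the scaling-factor question concrete, here is a minimal sketch of fixed-point tick-to-nanosecond scaling, assuming the general shape "shift, then multiply by a 32-bit fraction". It is not a copy of Xen's scale-delta code: struct scale_sketch, make_scale() and scale_ticks() are hypothetical names, and the 10 ppm calibration mismatch used in the note below is an invented input, not a measurement.

```c
#include <stdint.h>

/*
 * Hypothetical per-CPU scale: ns = ((ticks << shift) * mul_frac) >> 32,
 * where mul_frac is a 0.32 fixed-point fraction (nanoseconds per tick).
 */
struct scale_sketch {
    int shift;
    uint32_t mul_frac;
};

/*
 * Build a scale from a calibrated tick rate in Hz. This sketch assumes
 * hz > 1e9 so that nanoseconds-per-tick is below 1.0 and fits in the
 * 32-bit fraction without any shift.
 */
static struct scale_sketch make_scale(uint64_t hz)
{
    struct scale_sketch s;

    s.shift = 0;
    s.mul_frac = (uint32_t)((1000000000ULL << 32) / hz);
    return s;
}

/* Scale a tick delta to nanoseconds using the fixed-point factor. */
static uint64_t scale_ticks(uint64_t ticks, const struct scale_sketch *s)
{
    uint64_t hi, lo;

    if (s->shift < 0)
        ticks >>= -s->shift;
    else
        ticks <<= s->shift;

    /* Split the 64x32 multiply so the intermediate fits in 64 bits. */
    hi = ticks >> 32;
    lo = ticks & 0xffffffffULL;
    return hi * s->mul_frac + ((lo * s->mul_frac) >> 32);
}
```

With one CPU calibrated at 3000000000 Hz and another at 3000030000 Hz (10 ppm apart), scaling the same 3000000000-tick delta gives results roughly 10usec apart. So if the delta being scaled covers about a second, even a small per-CPU calibration difference would show up at the scale of the skew reported above. Whether that is what actually happens here is untested.
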