* RE: RE: rdtsc strangeness on upstream kernel
2010-07-20 0:44 ` Jeremy Fitzhardinge
@ 2010-07-23 16:47 ` Dan Magenheimer
0 siblings, 0 replies; 8+ messages in thread
From: Dan Magenheimer @ 2010-07-23 16:47 UTC (permalink / raw)
To: Jeremy Fitzhardinge, Andrew Jones; +Cc: xen-devel
[-- Attachment #1: Type: text/plain, Size: 3591 bytes --]
> On 07/19/2010 05:42 PM, Dan Magenheimer wrote:
> > Yes, CONFIG_XEN_DEBUG_FS is on. I dug up some instrumentation
> > I used in the hypervisor last fall which should identify the
> > kernel locations of the rdtsc's. I'll plug that in and
> > post the results tomorrow.
>
> They're probably all in the same place, so you'll need to get some
> calling stack to make sense of it.
>
> J
Sorry this took a while. I've been hacking, instrumenting, and
measuring, and the results are not very good.
I'm cc'ing Andrew Jones since things look worse on stock RHEL6b2.
First, I've measured on both PV 2.6.35-rc4 ("upstream") and PV
2.6.32-37 ("rhel6b2"). I've attached the patch for rhel6b2 that
I used for measurement; the same code fragments were used for
upstream.
Rhel6b2 had HZ=1000 and vcpus=2. Upstream had HZ=250 and vcpus=1.
Both had the "xen_sched_clock" call changed to (an instrumented copy of)
xen_clocksource_read.
The workload is simply "main() {while(1); }" run in background.
On vcpus=2, I ran two in background.
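(For completeness, the load generator is just this spin loop, one
instance per busy vcpu; the "spin.c" name and the gcc command line
below are only illustrative, e.g. "gcc -o spin spin.c; ./spin &".)
    /* spin.c: pure CPU burner; never sleeps, never makes a syscall */
    int main(void)
    {
            while (1)
                    ;
            return 0;
    }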
Both had CONFIG_XEN_DEBUG_FS set, but the workload should encounter
few or no kernel spinlocks, and the measurements confirmed this.
Rhel6b2 had HIGH_RES_TIMERS and SCHED_HRTICK set; they were not
set on upstream... don't know if this matters.
BOTTOM LINE: A cpu-loaded vcpus=2 HZ=1000 rhel6b2 guest had close to
20000 rdtsc/sec. Unloaded, the same guest had about 600 rdtsc/sec.
An HZ=250 upstream guest had about 4000 rdtsc/sec, regardless of
whether it was loaded or not.
Breakdown: I instrumented several of the primary sources of calls
to native_read_tsc. No rdtsc instructions were executed other than
the one in native_read_tsc. (A small number of rdtsc sites are
binary translated into calls to native_read_tsc.)
The most frequent users of rdtsc:
(A) sched_clock via xen_clocksource_read->pvclock_clocksource_read->
pvclock_nsec_offset->native_read_tsc
(B) clock->read via xen_clocksource_read
(C) set_next_event via xen_timerop_clockevent
I measured (running the workload overnight) how frequently each
of these paths resulted in a call to native_read_tsc. Each of
these functions goes through this path exactly once, so this
is also a frequency (per second) count for the callers.
ALL  4672..18689
(A)  2149..8428   (~8*HZ)
(B)  1350..9456   (~(4*HZ+?)*vcpus)
(C)   524..2102   (~2*HZ)
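As a back-of-the-envelope check (assuming the low end of each range is
the upstream guest, HZ=250/vcpus=1, and the high end is the rhel6b2
guest, HZ=1000/vcpus=2):
(A) 8*HZ:        8*250=2000,   8*1000=8000     vs. measured 2149 / 8428
(B) ~4*HZ*vcpus: 4*250*1=1000, 4*1000*2=8000   vs. measured 1350 / 9456
(C) 2*HZ:        2*250=500,    2*1000=2000     vs. measured 524 / 2102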
The measurement code outputs frequency every 2**20 samples
and I did a cursory review to verify that the frequency
samples were consistent. There is some small variance,
which accounts for the fact that the numbers above (which
were from one sample) don't quite add up. So please
consider them "trend indicators" rather than exact measurements.
I also verified the total rdtsc usage counts against the
"xm debug-key s" output from the Xen console.
Conclusions/questions:
1) Frequency of rdtsc use is much higher than expected, especially
when HZ=1000. Sadly, the divider= kernel parameter may still
be necessary for RHEL6b2 on older machines where Xen must do
rdtsc emulation... and for any guest that has migrated (or will
migrate?)... unless this high frequency identifies a fixable bug.
2) There is some fix in rhel6b2 that hides this when vcpus are inactive.
Should this go upstream so idle guests don't consume so much CPU?
3) Why is sched_clock called at 8*HZ?!? Why is clock->read called at
4*HZ (on every processor)?!? Why is set_next_event called
at 2*HZ?!? Even with hardware rdtsc, this seems like a potential
(though admittedly small) performance issue.
[-- Attachment #2: linux-2.6.32-37-tsc.patch --]
[-- Type: application/octet-stream, Size: 8489 bytes --]
diff -Napur linux-2.6.32.x86_64/arch/x86/kernel/rtc.c linux-2.6.32.x86_64-tsc/arch/x86/kernel/rtc.c
--- linux-2.6.32.x86_64/arch/x86/kernel/rtc.c 2009-12-02 20:51:21.000000000 -0700
+++ linux-2.6.32.x86_64-tsc/arch/x86/kernel/rtc.c 2010-07-22 10:24:36.000000000 -0600
@@ -193,6 +193,15 @@ void read_persistent_clock(struct timesp
unsigned long long native_read_tsc(void)
{
+#define DJM_TSC_FREQ
+#ifdef DJM_TSC_FREQ
+#define INTERVAL (1024L*1024UL)
+{
+static unsigned long count=0;
+if (!(++count & (INTERVAL-1)))
+printk(KERN_ERR "native_read_tsc: count=%lu\n", count);
+}
+#endif
return __native_read_tsc();
}
EXPORT_SYMBOL(native_read_tsc);
diff -Napur linux-2.6.32.x86_64/arch/x86/xen/enlighten.c linux-2.6.32.x86_64-tsc/arch/x86/xen/enlighten.c
--- linux-2.6.32.x86_64/arch/x86/xen/enlighten.c 2010-07-07 11:02:01.000000000 -0600
+++ linux-2.6.32.x86_64-tsc/arch/x86/xen/enlighten.c 2010-07-22 15:56:11.000000000 -0600
@@ -955,7 +955,8 @@ static const struct pv_init_ops xen_init
};
static const struct pv_time_ops xen_time_ops __initdata = {
- .sched_clock = xen_sched_clock,
+// .sched_clock = xen_sched_clock,
+ .sched_clock = xen_schedclocksource_read,
};
static const struct pv_cpu_ops xen_cpu_ops __initdata = {
diff -Napur linux-2.6.32.x86_64/arch/x86/xen/time.c linux-2.6.32.x86_64-tsc/arch/x86/xen/time.c
--- linux-2.6.32.x86_64/arch/x86/xen/time.c 2010-07-07 11:01:46.000000000 -0600
+++ linux-2.6.32.x86_64-tsc/arch/x86/xen/time.c 2010-07-22 10:29:00.000000000 -0600
@@ -206,7 +206,83 @@ cycle_t xen_clocksource_read(void)
{
struct pvclock_vcpu_time_info *src;
cycle_t ret;
+#define DJM_TSC_FREQ
+#ifdef DJM_TSC_FREQ
+#define CPU_HZ (2993UL*1024UL*1024UL)
+#define INTERVAL (1024L*1024UL)
+{
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "xen_clocksource_read(xen/time.c): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+}
+#endif
+
+ src = &get_cpu_var(xen_vcpu)->time;
+ ret = pvclock_clocksource_read(src);
+ put_cpu_var(xen_vcpu);
+ return ret;
+}
+
+cycle_t xen_clocksource_read2(void)
+{
+ struct pvclock_vcpu_time_info *src;
+ cycle_t ret;
+
+#ifdef DJM_TSC_FREQ
+{
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "xen_clocksource_READ2(xen/time.c)(CALLED ONLY VIA clock->read): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+}
+#endif
+ src = &get_cpu_var(xen_vcpu)->time;
+ ret = pvclock_clocksource_read(src);
+ put_cpu_var(xen_vcpu);
+ return ret;
+}
+cycle_t xen_clocksource_read3(void)
+{
+ struct pvclock_vcpu_time_info *src;
+ cycle_t ret;
+
+#ifdef DJM_TSC_FREQ
+{
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "xen_clocksource_READ3(xen/time.c)(CALLED ONLY VIA set_next_event): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+}
+#endif
+ src = &get_cpu_var(xen_vcpu)->time;
+ ret = pvclock_clocksource_read(src);
+ put_cpu_var(xen_vcpu);
+ return ret;
+}
+
+cycle_t xen_schedclocksource_read(void)
+{
+ struct pvclock_vcpu_time_info *src;
+ cycle_t ret;
+
+#ifdef DJM_TSC_FREQ
+{
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "xen_SCHEDclocksource_READ(xen/time.c)(CALLED ONLY VIA sched_clock): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+}
+#endif
src = &get_cpu_var(xen_vcpu)->time;
ret = pvclock_clocksource_read(src);
put_cpu_var(xen_vcpu);
@@ -215,7 +291,7 @@ cycle_t xen_clocksource_read(void)
static cycle_t xen_clocksource_get_cycles(struct clocksource *cs)
{
- return xen_clocksource_read();
+ return xen_clocksource_read2();
}
static void xen_read_wallclock(struct timespec *ts)
@@ -286,7 +362,7 @@ static struct clocksource xen_clocksourc
*/
static s64 get_abs_timeout(unsigned long delta)
{
- return xen_clocksource_read() + delta;
+ return xen_clocksource_read3() + delta;
}
static void xen_timerop_set_mode(enum clock_event_mode mode,
diff -Napur linux-2.6.32.x86_64/arch/x86/xen/xen-ops.h linux-2.6.32.x86_64-tsc/arch/x86/xen/xen-ops.h
--- linux-2.6.32.x86_64/arch/x86/xen/xen-ops.h 2010-07-07 11:01:46.000000000 -0600
+++ linux-2.6.32.x86_64-tsc/arch/x86/xen/xen-ops.h 2010-07-22 10:36:26.000000000 -0600
@@ -45,6 +45,7 @@ void xen_setup_timer(int cpu);
void xen_setup_runstate_info(int cpu);
void xen_teardown_timer(int cpu);
cycle_t xen_clocksource_read(void);
+cycle_t xen_schedclocksource_read(void);
void xen_setup_cpu_clockevents(void);
unsigned long xen_tsc_khz(void);
void __init xen_time_init(void);
diff -Napur linux-2.6.32.x86_64/kernel/sched_clock.c linux-2.6.32.x86_64-tsc/kernel/sched_clock.c
--- linux-2.6.32.x86_64/kernel/sched_clock.c 2010-07-07 11:01:57.000000000 -0600
+++ linux-2.6.32.x86_64-tsc/kernel/sched_clock.c 2010-07-22 10:34:51.000000000 -0600
@@ -105,8 +105,21 @@ static u64 sched_clock_local(struct sche
{
u64 now, clock, old_clock, min_clock, max_clock;
s64 delta;
+#define DJM_TSC_FREQ
+#ifdef DJM_TSC_FREQ
+static unsigned long count=0,last,new;
+#define CPU_HZ (2993UL*1024UL*1024UL)
+#define INTERVAL (1024L*1024UL)
+#endif
again:
+#ifdef DJM_TSC_FREQ
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "sched_clock_local: count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+#endif
now = sched_clock();
delta = now - scd->tick_raw;
if (unlikely(delta < 0))
@@ -177,7 +190,20 @@ u64 sched_clock_cpu(int cpu)
WARN_ON_ONCE(!irqs_disabled());
if (sched_clock_stable)
+#ifdef DJM_TSC_FREQ
+{
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "sched_clock_cpu (sched_clock_stable): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+#endif
return sched_clock();
+#ifdef DJM_TSC_FREQ
+}
+#endif
+
if (unlikely(!sched_clock_running))
return 0ull;
@@ -206,6 +232,16 @@ void sched_clock_tick(void)
WARN_ON_ONCE(!irqs_disabled());
scd = this_scd();
+#ifdef DJM_TSC_FREQ
+{
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "sched_clock_tick (TWO): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last=new;
+}
+}
+#endif
now_gtod = ktime_to_ns(ktime_get());
now = sched_clock();
diff -Napur linux-2.6.32.x86_64/kernel/time/timekeeping.c linux-2.6.32.x86_64-tsc/kernel/time/timekeeping.c
--- linux-2.6.32.x86_64/kernel/time/timekeeping.c 2010-07-07 11:01:55.000000000 -0600
+++ linux-2.6.32.x86_64-tsc/kernel/time/timekeeping.c 2010-07-22 10:57:37.000000000 -0600
@@ -122,6 +122,19 @@ static inline s64 timekeeping_get_ns_raw
/* read clocksource: */
clock = timekeeper.clock;
+#define DJM_TSC_FREQ
+#ifdef DJM_TSC_FREQ
+#define CPU_HZ (2993UL*1024UL*1024UL)
+#define INTERVAL (1024L*1024UL)
+{
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "timekeeping_get_ns_raw(timekeeping.c)(CALLS clock->read): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+}
+#endif
cycle_now = clock->read(clock);
/* calculate the delta since the last update_wall_time: */
@@ -226,6 +239,14 @@ void getnstimeofday(struct timespec *ts)
WARN_ON(timekeeping_suspended);
do {
+#ifdef DJM_TSC_FREQ
+static unsigned long count=0,last,new;
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "getnstimeofday(timekeeping.c)(CALLS clock->read via timekeeping_get_ns): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+#endif
seq = read_seqbegin(&xtime_lock);
*ts = xtime;
@@ -276,10 +297,20 @@ void ktime_get_ts(struct timespec *ts)
struct timespec tomono;
unsigned int seq;
s64 nsecs;
+#ifdef DJM_TSC_FREQ
+static unsigned long count=0,last,new;
+#endif
WARN_ON(timekeeping_suspended);
do {
+#ifdef DJM_TSC_FREQ
+if (!count) rdtscll(last);
+if (!(++count & (INTERVAL-1))) {
+printk(KERN_ERR "ktime_get_ts(timekeeping.c)(CALLS getnstimeofday): count=%lu freq=%lu\n", count, (INTERVAL*CPU_HZ)/(rdtscll(new)-last));
+last = new;
+}
+#endif
seq = read_seqbegin(&xtime_lock);
*ts = xtime;
tomono = wall_to_monotonic;
[-- Attachment #3: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel