Re: [PATCH 1/2] cpu steal time accounting

* Re: [PATCH 1/2] cpu steal time accounting
       [not found] <43FCCB2C.5000408@vmware.com>
@ 2006-02-22 23:58 ` Dan Hecht
  2006-02-23  8:40   ` Keir Fraser
  2006-02-23  8:48   ` Keir Fraser
  0 siblings, 2 replies; 20+ messages in thread
From: Dan Hecht @ 2006-02-22 23:58 UTC (permalink / raw)
  To: Keir.Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2937 bytes --]

> On 22 Feb 2006, at 17:11, Rik van Riel wrote:
> 
>> If the domain is unrunnable, surely there won't be a
>> process on the virtual cpu that is runnable?  Or am
>> I overlooking something here?
> 
> Oh, I see, this is dealt with inside account_steal_time(). No problem then.
> 

But, the guest will not be able to partition this time correctly
between the idle time stat and the steal time stat.

When the vcpu enters its idle loop and blocks, at some point it will 
become ready to run again.  But, it will not necessarily run right away. 
Instead, it will sit in the hypervisor's runqueue for some amount of
time.  While in the runqueue, this time is "stolen", since the vcpu 
wants to run but isn't (at this point it is involuntarily waiting). 
Under a heavily overcommitted system, the amount of time in the runqueue 
following the block may be nontrivial.

But, the guest (the code in account_steal_time()) cannot determine the 
time at which the vcpu was requeued into the hypervisor's runqueue.  It 
will account all of this time toward the "idle time" stat, rather than 
partitioning it between "idle time" and "steal time".

To solve this, it may be best to have the hypervisor interface expose 
per-vcpu stolen time directly, rather than vcpu_time.  Then the guest 
does not need to try to guess whether to charge (system_time - 
vcpu_time) against idle or steal.

>>>  1. What if a guest gets preempted for lots of short time periods (less
>>> than a jiffy). Then some arbitrary time in the future is preempted for
>>> long enough to activate your stolen-time logic. Won't you end up
>>> incorrectly accounting the accumulated short time periods?
>>
>> This is true.  I'm not sure we'd want to get the vcpu info
>> at every timer interrupt though, that could end up being
>> too expensive...
> 
> Having to call down to Xen to get that information is unfortunate. 
> Perhaps we can export it in shared_info, or have the guest register a 
> virtual address it would like the info written to.
> 

If you do change how this information is passed through the interface, 
then maybe this would be a good time to define the interface to export 
"stolen time" computed by the hypervisor, rather than "vcpu time".

On a topic relating to these patches, the Linux scheduler also uses the 
routine sched_clock() to calculate the run-time (and sleep-time) of 
processes.  With the current code, these calculations will include 
stolen time in the total run-time of a process.  Perhaps sched_clock() 
should be a clock that does not advance when time is stolen by the 
hypervisor?

We've been thinking about these issues also.  I've attached a document 
that describes our current thoughts.  The document describes the portion 
of the VMI that deals with time and describes the changes to Linux to 
accommodate VMI Time (Rik, this is an updated draft of the document I 
sent you before).  Comments are welcome.

Thanks,
Dan

[-- Attachment #2: vmi_time_spec.txt --]
[-- Type: text/plain, Size: 23039 bytes --]

VMI - Time Interface;   VMware, Inc.

In a virtualized environment, virtual machines (VM) will time share the
system with each other and with other processes running on the host
system.  Therefore, a VM's virtual cpus (vcpus) will be executing on the
host's physical cpus (pcpus) for only some portion of time.  This
section of the VMI exposes a paravirtual view of time to the guest
operating systems so that they may operate more effectively in a virtual
environment.  The interface also provides a way for the vcpus to set
alarms in this paravirtual view of time.

-------------------------------------------------------------------------

Time Domains:

a) Wallclock Time:

Wallclock time exposed to the VM through this interface indicates the
number of nanoseconds since epoch, 1970-01-01T00:00:00Z (ISO 8601 date
format).  If the host's wallclock time changes (say, when an error in
the host's clock is corrected), so does the wallclock time as viewed
through this interface.

b) Real Time:

Another view of time accessible through this interface is real time.
Real time always progresses except for when the VM is stopped or
suspended.  Real time is presented to the guest as a counter which
increments at a constant rate defined (and presented) by the hypervisor.
All the vcpus of a VM share the same real time counter.

The unit of the counter is called "cycles".  The unit and initial value
(corresponding to the time the VM enters para-virtual mode) are chosen
by the hypervisor so that the real time counter will not roll over in any
practical length of time.  It is expected that the frequency (cycles per
second) is chosen such that this clock provides a "high-resolution" view
of time.  The unit can only change when the VM (re)enters paravirtual
mode.

c) Stolen time and Available time:

A vcpu is always in one of three states: running, halted, or ready.  The
vcpu is in the 'running' state if it is executing.  When the vcpu
executes the VMI_Halt interface, the vcpu enters the 'halted' state and
remains halted until there is some work pending for the vcpu (e.g. an
alarm expires, host I/O completes on behalf of virtual I/O).  At this
point, the vcpu enters the 'ready' state (waiting for the hypervisor to
reschedule it).  Finally, at any time when the vcpu is in neither the
'running' state nor the 'halted' state, it is in the 'ready' state.

For example, consider the following sequence of events, with times given
in real time:

(Example 1)

At 0 ms, VCPU executing guest code.
At 1 ms, VCPU requests virtual I/O.
At 2 ms, Host performs I/O for virtual I/O.
At 3 ms, VCPU executes VMI_Halt.
At 4 ms, Host completes I/O for virtual I/O request.
At 5 ms, VCPU begins executing guest code, vectoring to the interrupt 
         handler for the device initiating the virtual I/O.
At 6 ms, VCPU preempted by hypervisor.
At 9 ms, VCPU begins executing guest code.

From 0 ms to 3 ms, VCPU is in the 'running' state.  At 3 ms, VCPU enters
the 'halted' state and remains in this state until the 4 ms mark.  From
4 ms to 5 ms, the VCPU is in the 'ready' state.  At 5 ms, the VCPU
re-enters the 'running' state until it is preempted by the hypervisor at
the 6 ms mark.  From 6 ms to 9 ms, VCPU is again in the 'ready' state,
and finally 'running' again after 9 ms.

Stolen time is defined per vcpu to progress at the rate of real time
when the vcpu is in the 'ready' state, and does not progress otherwise.
Available time is defined per vcpu to progress at the rate of real time
when the vcpu is in the 'running' and 'halted' states, and does not
progress when the vcpu is in the 'ready' state.

So, for the above example, the following table indicates these time
values for the vcpu at each ms boundary:

Real time    Stolen time    Available time
 0            0              0
 1            0              1
 2            0              2
 3            0              3
 4            0              4
 5            1              4
 6            1              5
 7            2              5
 8            3              5
 9            4              5
10            4              6

Notice that at any point:
   real_time == stolen_time + available_time

Stolen time and available time are also presented as counters in
"cycles" units.  The initial value of the stolen time counter is 0.
This implies the initial value of the available time counter is the same
as the real time counter.
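
As a rough illustration of these definitions, the following C sketch
replays Example 1 and derives both counters from the vcpu state in each
1 ms interval.  The names used here (vcpu_state, example1_state,
counters_at) are made up for the illustration and are not part of the
VMI, and the unit is ms rather than cycles for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: these names are not part of the VMI. */
enum vcpu_state { VCPU_RUNNING, VCPU_HALTED, VCPU_READY };

struct counters {
    uint64_t stolen;     /* advances only while the vcpu is 'ready' */
    uint64_t available;  /* advances while 'running' or 'halted' */
};

/* State of the vcpu during the interval [ms, ms + 1) in Example 1. */
static enum vcpu_state example1_state(uint64_t ms)
{
    if (ms < 3) return VCPU_RUNNING;  /* 0..3: executing guest code */
    if (ms < 4) return VCPU_HALTED;   /* 3..4: after VMI_Halt */
    if (ms < 5) return VCPU_READY;    /* 4..5: work pending, not yet run */
    if (ms < 6) return VCPU_RUNNING;  /* 5..6: interrupt handler runs */
    if (ms < 9) return VCPU_READY;    /* 6..9: preempted by the hypervisor */
    return VCPU_RUNNING;              /* 9.. : running again */
}

/* Counter values at the 'real_ms' boundary, starting from zero at 0 ms. */
static struct counters counters_at(uint64_t real_ms)
{
    struct counters c = { 0, 0 };
    for (uint64_t ms = 0; ms < real_ms; ms++) {
        if (example1_state(ms) == VCPU_READY)
            c.stolen++;
        else
            c.available++;
    }
    /* The invariant stated in the text. */
    assert(c.stolen + c.available == real_ms);
    return c;
}
```

Calling counters_at() for each boundary from 0 to 10 should give the
rows of the table above.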

Alarms:

Alarms can be set (armed) against the real time counter or the available
time counter. Alarms can be programmed to expire once (one-shot) or on a
regular period (periodic).  They are armed by indicating an absolute
counter value expiry, and in the case of a periodic alarm, a non-zero
relative period counter value.  [TBD: The method of wiring the alarms to
an interrupt vector is dependent upon the virtual interrupt controller
portion of the interface.  Currently, the alarms may be wired as if they
are attached to IRQ0 or the vector in the local APIC LVTT.  This way,
the alarms can be used as drop in replacements for the PIT or local APIC
timer.]

Alarms are per-vcpu mechanisms.  An alarm set by vcpu0 will fire only on
vcpu0, while an alarm set by vcpu1 will only fire on vcpu1.  If an alarm
is set relative to available time, its expiry is a value relative to the
available time counter of the vcpu that set it.

The interface includes a method to cancel (disarm) an alarm.  On each
vcpu, one alarm can be set against each of the two counters (real time
and available time).  A vcpu in the 'halted' state becomes 'ready' when
the counter of any of its alarms reaches that alarm's expiry.

An alarm "fires" by signaling the virtual interrupt controller.  An
alarm will fire as soon as possible after the counter value is greater
than or equal to the alarm's current expiry.  However, an alarm can fire
only when its vcpu is in the 'running' state.

If the alarm is periodic, a sequence of expiry values,

 E(i) = e0 + p * i ,  i = 0, 1, 2, 3, ...

where 'e0' is the expiry specified when setting the alarm and 'p' is the
period of the alarm, is used to arm the alarm.  Initially, E(0) is used
as the expiry.  When the alarm fires, the next expiry value in the
sequence that is greater than the current value of the counter is used
as the alarm's new expiry.

One-shot alarms have only one expiry.  When a one-shot alarm fires, it
is automatically disarmed.

Suppose an alarm is set relative to real time with expiry at the 3 ms
mark and a period of 2 ms.  It will expire on these real time marks: 3,
5, 7, 9.  Note that even if the alarm does not fire during the 5 ms to 7
ms interval, the alarm can fire at most once during the 7 ms to 9 ms
interval (unless, of course, it is reprogrammed).

If an alarm is set relative to available time with expiry at the 1 ms
mark (in available time) and with a period of 2 ms, then it will expire
on these available time marks: 1, 3, 5.  In the scenario described in
example 1, those available time values correspond to these values in
real time: 1, 3, 6.
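
The re-arming rule for periodic alarms can be sketched in C as below.
This is only an illustration of the E(i) sequence; next_expiry is a
hypothetical helper, not part of the interface, and counter overflow is
ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration only; not part of the VMI.  Given the initial expiry e0 and
 * period p of a periodic alarm, return the smallest E(i) = e0 + p * i that
 * is strictly greater than 'now' (the counter value when the alarm fires).
 */
static uint64_t next_expiry(uint64_t e0, uint64_t p, uint64_t now)
{
    assert(p != 0);                    /* a zero period means one-shot */
    if (now < e0)
        return e0;                     /* E(0) is still in the future */
    uint64_t i = (now - e0) / p + 1;   /* first index with E(i) > now */
    return e0 + p * i;
}
```

With e0 = 3 and p = 2, an alarm that fires at counter value 3 is
re-armed for 5, while one that fires late at 6 skips ahead to 7 rather
than queuing up the missed expiry.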

-------------------------------------------------------------------------

Interface Details:

/* The cycle counters. */

#define VMI_CYCLES_REAL        0
#define VMI_CYCLES_AVAILABLE   1
#define VMI_CYCLES_STOLEN      2

/* Predefined rate of the wallclock. */

#define VMI_WALLCLOCK_HZ       1000000000

/* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */

#define VMI_ALARM_COUNTER_MASK 0x000000ff

#define VMI_ALARM_WIRED_IRQ0   0x00000000
#define VMI_ALARM_WIRED_LVTT   0x00010000

#define VMI_ALARM_IS_ONESHOT   0x00000000
#define VMI_ALARM_IS_PERIODIC  0x00000100

/* The time types. */

typedef unsigned char VMI_BOOL; /* FALSE==0, TRUE==!FALSE. */
typedef uint64        VMI_CYCLES;
typedef uint64        VMI_NANOSECS;
typedef sint64        VMI_NANOSECSDIFF;

/* The time functions. */

VMI_NANOSECS VMI_GetWallclockTime(void);
VMI_BOOL     VMI_WallclockUpdated(void);

VMI_CYCLES   VMI_GetCycleFrequency(void);
VMI_CYCLES   VMI_GetCycleCounter(uint32 whichCounter);

void         VMI_SetAlarm(uint32 flags, VMI_CYCLES expiry, VMI_CYCLES period);
VMI_BOOL     VMI_CancelAlarm(uint32 flags);

VMI_GetWallclockTime returns the current wallclock time as the number of
nanoseconds since the epoch.  Nanosecond resolution along with the
64-bit unsigned type provide over 580 years from epoch until rollover.
The wallclock time is relative to the host's wallclock time.

VMI_WallclockUpdated returns TRUE if the wallclock time has changed
relative to the real cycle counter since the previous time that
VMI_WallclockUpdated was polled.  For example, while a VM is suspended,
the real cycle counter will halt, but wallclock time will continue to
advance.  Upon resuming the VM, the first call to VMI_WallclockUpdated
will return TRUE.

VMI_GetCycleFrequency returns the number of cycles in one second.  This
value can be used by the guest to convert between cycles and other time
units.
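
A minimal sketch of such a conversion, assuming the guest has already
obtained the frequency from VMI_GetCycleFrequency.  The helper name
cycles_to_ns and the use of <stdint.h> types are choices made for this
example only.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Illustration only.  Convert a cycle count to nanoseconds given the
 * frequency (cycles per second) reported by the hypervisor.  Splitting
 * into whole seconds and a remainder keeps the intermediate product
 * below 2^64 as long as hz is below about 1.8e10.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    assert(hz != 0);
    uint64_t secs = cycles / hz;
    uint64_t rem  = cycles % hz;       /* rem < hz */
    return secs * NSEC_PER_SEC + rem * NSEC_PER_SEC / hz;
}
```

The split avoids multiplying the full cycle count by 10^9, which would
overflow 64 bits after only a few seconds at GHz rates.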

VMI_GetCycleCounter returns the current value, in cycles units, of the
counter corresponding to 'whichCounter' if it is one of VMI_CYCLES_REAL,
VMI_CYCLES_AVAILABLE or VMI_CYCLES_STOLEN.  VMI_GetCycleCounter returns
0 for any other value of 'whichCounter'.

VMI_SetAlarm is used to arm the vcpu's alarms.  The 'flags' parameter is
used to specify which counter's alarm is being set (VMI_CYCLES_REAL or
VMI_CYCLES_AVAILABLE), how to deliver the alarm to the vcpu
(VMI_ALARM_WIRED_IRQ0 or VMI_ALARM_WIRED_LVTT), and the mode
(VMI_ALARM_IS_ONESHOT or VMI_ALARM_IS_PERIODIC).  If the alarm is set
against the VMI_CYCLES_STOLEN counter or an undefined counter number, the
call is a nop.  The 'expiry' parameter indicates the expiry of the
alarm, and for periodic alarms, the 'period' parameter indicates the
period of the alarm.  If the value of 'period' is zero, the alarm is
armed as a one-shot alarm regardless of the mode specified by 'flags'.
Finally, a call to VMI_SetAlarm for an alarm that is already armed is
equivalent to first calling VMI_CancelAlarm and then calling
VMI_SetAlarm, except that the value returned by VMI_CancelAlarm is not
accessible.

VMI_CancelAlarm is used to disarm an alarm.  The 'flags' parameter
indicates which alarm to cancel (VMI_CYCLES_REAL or
VMI_CYCLES_AVAILABLE).  The return value indicates whether or not the
cancel succeeded.  A return value of FALSE indicates that the alarm was
already disarmed either because a) the alarm was never set or b) it was
a one-shot alarm and has already fired (though perhaps not yet delivered
to the guest).  TRUE indicates that the alarm was armed and either a)
the alarm was one-shot and has not yet fired (and will no longer fire
until it is rearmed) or b) the alarm was periodic.
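
Since the exact format of 'flags' is still TBD, the following is only a
guess at how a guest might compose and decode it, assuming the three
fields are simply OR-ed together.  The constants are copied from the
listing above; the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from the interface listing. */
#define VMI_CYCLES_REAL        0
#define VMI_CYCLES_AVAILABLE   1
#define VMI_CYCLES_STOLEN      2
#define VMI_ALARM_COUNTER_MASK 0x000000ff
#define VMI_ALARM_WIRED_IRQ0   0x00000000
#define VMI_ALARM_WIRED_LVTT   0x00010000
#define VMI_ALARM_IS_ONESHOT   0x00000000
#define VMI_ALARM_IS_PERIODIC  0x00000100

/* Hypothetical helper: assumes the fields are OR-ed into one word. */
static uint32_t alarm_flags(uint32_t counter, uint32_t wiring, uint32_t mode)
{
    assert((counter & ~(uint32_t)VMI_ALARM_COUNTER_MASK) == 0);
    return counter | wiring | mode;
}

/* Extract the counter number from a flags word. */
static uint32_t alarm_counter(uint32_t flags)
{
    return flags & VMI_ALARM_COUNTER_MASK;
}

/* Per the text, only the real and available counters accept alarms. */
static int alarm_counter_is_valid(uint32_t flags)
{
    uint32_t c = alarm_counter(flags);
    return c == VMI_CYCLES_REAL || c == VMI_CYCLES_AVAILABLE;
}
```

Under this assumption, a periodic alarm on the available time counter
delivered through the LVTT vector would use the flags value 0x00010101.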

-------------------------------------------------------------------------

Further discussion regarding the proposed interface:

1) Mechanism to atomically read the set of time values.

The interface does not provide a way for a vcpu to atomically read the
entire set of time values { wallclock time, real time counter, available
time counter, stolen time counter }.  While it seems that such a
mechanism might be "nice" to have in theory, it seems unnecessary in
practice.  Indeed, real hardware rarely provides this functionality.

One nice side effect of having this feature is that the explicit stolen
time counter (or available time counter) can be dropped entirely from
the interface, since its value can be inferred from the real time
counter and available time counter (or stolen time counter).

Two potential proposals that provide this feature are:

a) "Struct" return value style interface:

typedef struct VMITimeValues {
   VMI_NANOSECS wallTime;
   VMI_CYCLES   realCounter;
   VMI_CYCLES   availableCounter;
} VMITimeValues;

VMITimeValues VMI_GetTimes(void);
/* or, similarly: */
void VMI_GetTimes(VMITimeValues *t);

Pros: Easy to understand interface. Implementation can provide atomicity
by returning all time values as a complete set.

Cons: Struct return value (indirect return) less efficient, especially
when caller only needs to use one of the time values.  More difficult to
extend interface and maintain compatibility.

b) "Caching" mechanism added to current interface.

#define VMI_CYCLES_REAL        0
#define VMI_CYCLES_AVAILABLE   1

#define VMI_CYCLES_CACHED      0x80

VMI_NANOSECS VMI_GetWallclockTime(uint32 flags);
VMI_CYCLES   VMI_GetCycleCounter(uint32 flags);

When the VMI_CYCLES_CACHED bit of the 'flags' parameter is not set, the
routines atomically cache the entire set of time values into the "time
cache".  The requested time value is returned.

When the VMI_CYCLES_CACHED bit of the 'flags' parameter is set, the
routine returns the requested time value from the time cache.

Using this interface, along with guest provided synchronization, the
entire set of time values can be retrieved atomically.

Pros: Efficient calling convention, especially when only one time value
is desired.  Easier to extend the interface and maintain compatibility.

Cons: Interface is perhaps more difficult to understand and use
correctly.  Requires guest synchronization to ensure the cached values
remain consistent between VMI calls.
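
To make proposal (b) more concrete, here is a small C mock of the
caching behavior together with a guest-side read sequence.  The
mock_* names and the hypervisor state variables are stand-ins invented
for this sketch; only the VMI_CYCLES_* constants come from the proposal.

```c
#include <assert.h>
#include <stdint.h>

#define VMI_CYCLES_REAL        0
#define VMI_CYCLES_AVAILABLE   1
#define VMI_CYCLES_CACHED      0x80

/* Stand-in for hypervisor state; a real guest would not see this. */
static uint64_t mock_real, mock_available;
/* The "time cache" refreshed by every uncached read. */
static uint64_t cache_real, cache_available;

/* Mock of the proposed call: an uncached read refreshes the whole cache. */
static uint64_t mock_get_cycle_counter(uint32_t flags)
{
    uint32_t which = flags & ~(uint32_t)VMI_CYCLES_CACHED;
    assert(which == VMI_CYCLES_REAL || which == VMI_CYCLES_AVAILABLE);
    if (!(flags & VMI_CYCLES_CACHED)) {
        cache_real = mock_real;
        cache_available = mock_available;
    }
    return which == VMI_CYCLES_REAL ? cache_real : cache_available;
}

/*
 * Guest side: one uncached read followed by a cached read yields a
 * consistent pair, from which stolen time can be inferred.  The caller is
 * assumed to hold whatever lock protects the time cache.
 */
static uint64_t read_stolen_consistent(void)
{
    uint64_t real  = mock_get_cycle_counter(VMI_CYCLES_REAL);
    uint64_t avail = mock_get_cycle_counter(VMI_CYCLES_AVAILABLE |
                                            VMI_CYCLES_CACHED);
    return real - avail;
}
```

The second read deliberately sets VMI_CYCLES_CACHED so that both values
come from the same snapshot, even if the counters have advanced between
the two calls.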

2) Mechanism to read the available time counter and stolen time counter
   for another vcpu.

VCPU0 cannot read VCPU1's per-vcpu counters, and vice versa.  This
feature is not required by Linux.  Might there be some guest that
requires this ability?

The VMI_GetCycleCounter interface could be extended to index by vcpu ID
in order to provide this functionality.

3) Notification mechanism for relative change between wallclock time and
   real cycle counter.

It is expected that the guest will usually read the wallclock time at
boot, and then apply an adjustment to this time using the real cycle
counter to calculate the current wallclock time.  However, the wallclock
time and real cycle counter may become out of sync for a variety of
reasons.  For example, when a VM is suspended, the real cycle counter
will stop advancing while the wallclock time continues to advance.
Additionally, when the host wallclock is corrected the real cycle
counter is not affected.  Therefore, it is convenient for the guest to
receive notification that the wallclock time has changed relative to the
real cycle counter.  Currently, a polling interface is provided for
this: VMI_WallclockUpdated.

Alternatively, an interrupt interface may be provided.  An interrupt
would be sent to the guest to notify it of a change to the wallclock
time relative to the real cycle counter.
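
One way a guest could keep track of wallclock time along these lines is
sketched below in C.  The struct and function names are hypothetical,
and the values passed in are assumed to come from VMI_GetWallclockTime,
VMI_GetCycleCounter(VMI_CYCLES_REAL), VMI_GetCycleFrequency and
VMI_WallclockUpdated respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: a wallclock reading paired with the real counter. */
struct wall_base {
    uint64_t wall_ns;      /* wallclock time sampled at the base point */
    uint64_t real_cycles;  /* real cycle counter sampled at the same point */
};

/* Extrapolate the current wallclock time from the real cycle counter. */
static uint64_t wall_now_ns(const struct wall_base *b, uint64_t real_now,
                            uint64_t hz)
{
    assert(hz != 0 && real_now >= b->real_cycles);
    uint64_t delta = real_now - b->real_cycles;
    return b->wall_ns + (delta / hz) * 1000000000ULL
                      + (delta % hz) * 1000000000ULL / hz;
}

/* Re-sample both values once the hypervisor reports a relative change. */
static void wall_resync(struct wall_base *b, int wallclock_updated,
                        uint64_t wall_ns, uint64_t real_now)
{
    if (wallclock_updated) {
        b->wall_ns = wall_ns;
        b->real_cycles = real_now;
    }
}
```

With the polling interface, wall_resync would be called from some
periodic path; with an interrupt interface, it would move into the
interrupt handler.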

-------------------------------------------------------------------------

Use of VMI Time in Linux:

To paravirtualize time in the Linux kernel, we introduce a new timer
device module, timer_vmi.c (the VMI Time module), which parallels the
existing modules, e.g. arch/i386/kernel/timers/{timer_pit.c,
timer_tsc.c, timer_cyclone.c, timer_hpet.c, timer_pm.c}.  The
timer_vmi.c module implements the low-level Linux kernel time routines
using the VMI Time interface.

The new VMI Time module is used only when a hypervisor is detected at
boot time.  Otherwise, we fall back to one of the traditional modules
listed above.  This provides "transparency" -- the same kernel binary
can run with the VMI Time device or with one of the traditional time
devices.

The Linux xtime variable is initialized to the wallclock time exposed by
the VMI.

The VMI Time module contains a timer interrupt that is driven by a
periodic VMI alarm set against the available time counter.  The
interrupt calls the do_timer() routine every time the real time counter
advances cycles_per_second/HZ cycles (call this quantity
cycles_per_jiffy), as detected by any vcpu.  do_timer() updates the
jiffies count and xtime count.

The timer interrupt additionally calls the update_process_times()
routine for every time the available time counter for the current vcpu
advances cycles_per_jiffy cycles.  update_process_times() accounts user
time stats for the process and vcpu, accounts system time stats for the
process and cpu, runs the soft timers for the cpu, performs some RCU
work, calls the scheduler back via scheduler_tick() and runs posix
timers for the current task.  The scheduler_tick() callback decrements
the current process' timeslice and makes scheduling decisions.  The
profile_tick() routine is also called at each cycles_per_jiffy increment
of the available time counter on each vcpu.

Finally, the timer interrupt accounts the difference between the real
time counter and the available time counter (essentially the stolen time
counter) to the steal cpustat.
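
The tick bookkeeping described in the last three paragraphs might be
structured roughly as follows.  This is a single-vcpu sketch under our
own naming (tick_state, tick_work, timer_interrupt); it only computes
how many times each kernel routine would be called and leaves out the
cross-vcpu coordination needed for do_timer().

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: counter values already accounted for on this vcpu. */
struct tick_state {
    uint64_t cycles_per_jiffy;  /* cycles_per_second / HZ */
    uint64_t last_real;
    uint64_t last_available;
    uint64_t last_stolen;
};

struct tick_work {
    uint64_t real_ticks;        /* times to call do_timer() */
    uint64_t available_ticks;   /* times to call update_process_times() */
    uint64_t stolen_ticks;      /* jiffies to charge to the steal cpustat */
};

/* Whole jiffies elapsed since 'last'; the remainder stays pending. */
static uint64_t consume(uint64_t *last, uint64_t now, uint64_t cpj)
{
    uint64_t ticks = (now - *last) / cpj;
    *last += ticks * cpj;
    return ticks;
}

/* Work to do for one timer interrupt, given fresh counter readings. */
static struct tick_work timer_interrupt(struct tick_state *s,
                                        uint64_t real_now,
                                        uint64_t available_now)
{
    struct tick_work w;
    assert(s->cycles_per_jiffy != 0 && real_now >= available_now);

    w.real_ticks = consume(&s->last_real, real_now, s->cycles_per_jiffy);
    w.available_ticks = consume(&s->last_available, available_now,
                                s->cycles_per_jiffy);
    /* Real time that was not available was stolen. */
    w.stolen_ticks = consume(&s->last_stolen, real_now - available_now,
                             s->cycles_per_jiffy);
    return w;
}
```

Because the state only advances by whole jiffies, the result does not
depend on how often the interrupt actually arrives.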

The sched_clock() scheduler callback is also modified to calculate time
using the available time counter.  The effect of this is to compute a
process's sleep_avg using available time.  sleep_avg is used primarily to
compute the "effective priority" of a task.

Additionally, the i386 timer modules are called back via the active
timer_opts struct to get additional time information.  arch/i386's
do_gettimeofday() and do_settimeofday() use the get_offset() callback to
offset the time from xtime.  The VMI Time module implements this
callback using the real time counter.  The monotonic clock callback is
also implemented using the real time counter.

The __delay() routine is implemented via a timer_opt callback in the VMI
Time module.  When running under a hypervisor, delays are not necessary
when communicating with virtual devices.  These delays are nops when
running under a hypervisor.  However, the smpboot.c bootup sequence does
require delays, and these delays are implemented using the real time
counter.

The effect of these changes is that Linux's real time variables and
timers continue to run on real time.  These include jiffies, xtime,
ITIMER_REAL timers, SIGALRM timers, and posix-timers.  Additionally, the
notion of process times remains consistent in the paravirtual
environment as compared with native.  Since process time accounting is
done using available time, time that is stolen from the vcpu is not
accounted toward process times.  This keeps the cpustats faithful and
virtual timers consistent (ITIMER_VIRTUAL, SIGVTALRM, ITIMER_PROF,
SIGPROF).  Additionally, the scheduler essentially runs in available
time, and so time stolen from the vcpu does not steal time from
processes' timeslices.  Finally, with process time accounting performed using
available time and with steal time accounting, the view of time (for
example, using the utilities 'top', 'time', etc.) from within
paravirtualized Linux is consistent with time viewed on the host and
from within other paravirtualized guests on the same host.

Some noteworthy characteristics of the VMI Time module are:

1) Cycle counter frequency is indicated by the VMI rather than needing
to be calculated by comparing the counter rate to a known interrupt
rate.

On native hardware, the TSC frequency is calculated by comparing the
rate at which the TSC advances with the (known) rate of a timer
interrupt.  This method for determining a cycle counter frequency is
unreliable on virtual hardware since the vcpu may be preempted by the
hypervisor during this calculation.

Instead, the VMI Time device determines the frequency of the VMI cycle
counters by querying the VMI.

2) The timer_irq_works() algorithm is problematic in a virtual
environment for various reasons, and is avoided.  When running on a
hypervisor, the kernel can assume that the alarm will be wired up
correctly to the interrupt controller and this logic is avoided
altogether.

3) Rather than keeping kernel time state (e.g. jiffies, xtime, time
slices) up to date by counting interrupts, the state is updated based on
the values of the VMI cycle counters.

Therefore, a vcpu only needs to receive timer interrupts while it is
running.  This leads to better scaling of the number of virtual machines
that can be run on a single host.

4) As a consequence of #3, the interrupt rate of the alarm can be
dropped lower than HZ.  It may be beneficial for the guest to request a
lower alarm rate to decrease the overhead of delivering virtual
interrupts.  With #3, this is possible without changing the HZ constant
(which is important in order to allow for transparency with native
hardware using the normal HZ rate).

5) System wide time state (e.g. jiffies, xtime) is updated by all vcpus.

With the native x86 timer interrupts, only the boot processor's PIT
interrupt updates jiffies and xtime.  With the VMI Time device, jiffies
and xtime are updated by all the vcpus of a VM.  This is important since
the vcpus may be independently scheduled.  Also, it allows for a simpler
NO_IDLE_HZ implementation.

6) The xtime counter is kept up with the wallclock time.

7) The jiffies counter is kept up with the real time cycle counter.

8) Scheduler time slices and sched_clock() are kept up with the
available time cycle counter.

When using the VMI Time device as the time source, the scheduler runs in
available time.  Otherwise, processes would be charged for time stolen
from the vcpu (time in which the vcpu isn't even running).  The
scheduler_tick() interface is called only for every tick of available
time.  Also, sched_clock() is implemented using the available time
counter so that the scheduler's calculation of a process's run-time does
not charge stolen time.

9) Stolen time is computed by the hypervisor, not the Linux guest.

Only the hypervisor can accurately compute stolen time.  On some
hypervisors, stolen time is computed by the guest by subtracting the
amount of time a vcpu has run on a pcpu from the amount of real time
that has passed.  Then, the routine account_steal_time() is used to
charge this time difference to

 a) steal time, if the current process is not the idle process, or
 b) idle time, if the current process is the idle process.

The problem with that calculation is that when a vcpu's halt ends, it
will not necessarily run immediately.  Instead, it will be transitioned
from the 'halted' state to the 'ready' state (i.e. added to the
hypervisor's runqueue) until the hypervisor chooses to run it.  The time
the vcpu is 'ready' is not idle time, but instead stolen time.  But,
implementations using account_steal_time() will account all of this time
to idle time, rather than partitioning the time such that the time spent
in the 'halted' state is idle time and the time spent in the 'ready'
state is stolen time.

Only the hypervisor can determine when the transition from 'halted' to
'ready' occurred (due to work becoming pending for the vcpu).  So, the
VMI Time device queries the stolen time from the VMI rather than using
the faulty account_steal_time() algorithm.
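
The difference between the two approaches can be shown with a simplified
C model.  This is not the kernel's actual account_steal_time() code (for
instance, iowait is ignored); heuristic_split and hypervisor_split are
names invented for the comparison.

```c
#include <assert.h>
#include <stdint.h>

struct split { uint64_t idle; uint64_t steal; };

/*
 * Simplified model of the guest-side heuristic: the whole gap between
 * real time and time actually run is charged to idle if the idle task is
 * current, otherwise to steal.
 */
static struct split heuristic_split(uint64_t not_running, int current_is_idle)
{
    struct split s = { 0, 0 };
    if (current_is_idle)
        s.idle = not_running;
    else
        s.steal = not_running;
    return s;
}

/* With a hypervisor-computed stolen counter the gap can be partitioned. */
static struct split hypervisor_split(uint64_t not_running, uint64_t stolen)
{
    struct split s;
    assert(stolen <= not_running);
    s.steal = stolen;
    s.idle = not_running - stolen;   /* time actually spent 'halted' */
    return s;
}
```

For the 3 ms to 5 ms window of Example 1 (1 ms halted, then 1 ms ready),
the heuristic charges 2 ms to idle, while the hypervisor-based split
charges 1 ms to idle and 1 ms to steal.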

10) A NO_IDLE_HZ implementation is provided.

When a vcpu enters its idle loop, it disables its periodic alarm and
sets up a one shot alarm for the next time event.  That way, it does not
become ready to run just to service the periodic alarm
interrupt. Instead, it can remain halted until there is some real work
pending for it.  This allows the hypervisor to use the physical
resources more effectively since idle vcpus will have lower overhead.

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-22 23:58 ` [PATCH 1/2] cpu steal time accounting Dan Hecht
@ 2006-02-23  8:40   ` Keir Fraser
  2006-02-23 18:17     ` Dan Hecht
  2006-02-23  8:48   ` Keir Fraser
  1 sibling, 1 reply; 20+ messages in thread
From: Keir Fraser @ 2006-02-23  8:40 UTC (permalink / raw)
  To: Dan Hecht; +Cc: xen-devel

On 22 Feb 2006, at 23:58, Dan Hecht wrote:

> To solve this, it may be best to have the hypervisor interface expose 
> per-vcpu stolen time directly, rather than vcpu_time.  Then the guest 
> does not need to try to guess whether to charge (system_time - 
> vcpu_time) against idle or steal.

Yes, the distinction between stolen and available time does make sense
(although I'm not sure 'available' is a great name) otherwise you can't 
account for wakeup latencies. account_steal_time() would need to be 
modified in Linux, though, as we would not need its dodgy heuristic for 
deciding whether to account to stolen time or iowait/idle.

  -- Keir

* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-23  8:40   ` Keir Fraser
@ 2006-02-23 18:17     ` Dan Hecht
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Hecht @ 2006-02-23 18:17 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:
> 
> On 22 Feb 2006, at 23:58, Dan Hecht wrote:
> 
>> To solve this, it may be best to have the hypervisor interface expose 
>> per-vcpu stolen time directly, rather than vcpu_time.  Then the guest 
>> does not need to try to guess whether to charge (system_time - 
>> vcpu_time) against idle or steal.
> 
> Yes, the distinction between stolen and available time does make sense
> (although I'm not sure 'available' is a great name) 

The term "available" came from looking at it from the perspective of the 
vcpu, rather than the hypervisor.  To the vcpu, the time that it's 
running or halted is, in a sense, "available" to it (even though, as an
optimization, the hypervisor might use the pcpu to do something else
when the vcpu is halted).  But, anytime the hypervisor forces the vcpu 
to wait involuntarily, the time is no longer "available" to it, but stolen.

Said another way, on native hardware, stolen time is zero.  All time is 
"available" to the OS.  Though it might choose to halt for some of this 
time, the time is still "available".

> otherwise you can't 
> account for wakeup latencies. account_steal_time() would need to be 
> modified in Linux, though, as we would not need its dodgy heuristic for 
> deciding whether to account to stolen time or iowait/idle.
> 

Exactly.  We slightly refactor the account_steal_time() interface to 
have an interface that bypasses the heuristic.

Dan

* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-22 23:58 ` [PATCH 1/2] cpu steal time accounting Dan Hecht
  2006-02-23  8:40   ` Keir Fraser
@ 2006-02-23  8:48   ` Keir Fraser
  2006-02-23 10:04     ` Keir Fraser
  2006-02-23 13:18     ` Rik van Riel
  1 sibling, 2 replies; 20+ messages in thread
From: Keir Fraser @ 2006-02-23  8:48 UTC (permalink / raw)
  To: Dan Hecht; +Cc: xen-devel


On 22 Feb 2006, at 23:58, Dan Hecht wrote:

> The interface does not provide a way for a vcpu to atomically read the
> entire set of time values { wallclock time, real time counter, 
> available
> time counter, stolen time counter }.  While it seems that such a
> mechanism might be "nice" to have in theory, it seems unnecessary in
> practice.  Indeed, real hardware rarely provides this functionality.
>
> One nice side effect of having this feature is that the explicit stolen
> time counter (or available time counter) can be dropped entirely from
> the interface, since its value can be inferred from the real time
> counter and available time counter (or stolen time counter).

I don't understand the last paragraph here. It's not true that, for 
example,
  available_time = real_time - stolen_time
right?

  -- Keir

* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-23  8:48   ` Keir Fraser
@ 2006-02-23 10:04     ` Keir Fraser
  2006-02-23 19:43       ` Dan Hecht
  2006-02-23 13:18     ` Rik van Riel
  1 sibling, 1 reply; 20+ messages in thread
From: Keir Fraser @ 2006-02-23 10:04 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Dan Hecht


On 23 Feb 2006, at 08:48, Keir Fraser wrote:

>> One nice side effect of having this feature is that the explicit 
>> stolen
>> time counter (or available time counter) can be dropped entirely from
>> the interface, since its value can be inferred from the real time
>> counter and available time counter (or stolen time counter).
>
> I don't understand the last paragraph here. It's not true that, for 
> example,
>  available_time = real_time - stolen_time
> right?

Ah, okay, I see that in fact it is. :-)

Why not just have a halted_time instead? I think that's what we'd go 
for in Xen.

  -- Keir

* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-23 10:04     ` Keir Fraser
@ 2006-02-23 19:43       ` Dan Hecht
  0 siblings, 0 replies; 20+ messages in thread
From: Dan Hecht @ 2006-02-23 19:43 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:
> 
> On 23 Feb 2006, at 08:48, Keir Fraser wrote:
> 
>>> One nice side effect of having this feature is that the explicit stolen
>>> time counter (or available time counter) can be dropped entirely from
>>> the interface, since its value can be inferred from the real time
>>> counter and available time counter (or stolen time counter).
>>
>> I don't understand the last paragraph here. It's not true that, for 
>> example,
>>  available_time = real_time - stolen_time
>> right?
> 
> Ah, okay, I see that in fact it is. :-)
>

Yeah, by definition, available_time is (real_time - stolen_time).

> Why not just have a halted_time instead? I think that's what we'd go for 
> in Xen.
> 

I assume you mean have halted_time in addition to vcpu_time, since you'd 
still need vcpu_time to determine stolen_time for the case where a 
running vcpu is made to involuntarily wait.

Essentially, by adding halted_time, the Xen and VMI interfaces would be 
very similar in this regard.  We'd have:

xen_system_time                 <==> vmi_real_time
xen_vcpu_time + xen_halted_time <==> vmi_available_time
xen_stolen_time                 <==> vmi_stolen_time

The vmi does not further partition vmi_available_time into vcpu_time 
and halted_time because the guest can do this partitioning correctly 
itself, if it chooses to do so.  It can be done with:

halt_start = vmi_available_time counter;
Halt;
When the vcpu starts running again,
   halt_end = vmi_available_time counter;

During this interval no vcpu_time accrues, and halted_time is 
(halt_end - halt_start).  When executing outside this region, any 
vmi_available_time that passes is vcpu_time.
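
As a rough illustration of the guest-side split described above, here is 
a minimal C sketch.  It assumes the guest can sample a monotonic 
available-time counter in nanoseconds; struct avail_split, 
avail_update(), halt_enter() and halt_exit() are hypothetical names used 
only for this example, not part of the VMI or Xen interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu bookkeeping kept entirely inside the guest. */
struct avail_split {
    uint64_t vcpu_ns;   /* available time spent executing */
    uint64_t halted_ns; /* available time spent halted */
    uint64_t last_ns;   /* available-time counter at the last update */
    int halted;         /* nonzero while inside the halt region */
};

/* Charge the available time since the last update to the current state. */
void avail_update(struct avail_split *s, uint64_t avail_now_ns)
{
    assert(avail_now_ns >= s->last_ns); /* counter assumed monotonic */
    uint64_t delta = avail_now_ns - s->last_ns;

    if (s->halted)
        s->halted_ns += delta;
    else
        s->vcpu_ns += delta;
    s->last_ns = avail_now_ns;
}

/* Call just before halting: corresponds to taking halt_start. */
void halt_enter(struct avail_split *s, uint64_t avail_now_ns)
{
    avail_update(s, avail_now_ns);
    s->halted = 1;
}

/* Call when the vcpu runs again: corresponds to taking halt_end. */
void halt_exit(struct avail_split *s, uint64_t avail_now_ns)
{
    avail_update(s, avail_now_ns);
    s->halted = 0;
}
```

Because both buckets are fed from the same counter, vcpu_ns + halted_ns 
always equals the available time observed so far, and stolen time never 
leaks into either bucket.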

So, rather than potentially complicating the interface, the vmi leaves 
the partitioning of vmi_available_time into vcpu_time and halted_time up 
to the guest.  Besides, perhaps there are other ways the guest may want 
to partition vmi_available_time other than into vcpu_time/halted_time, 
so why not leave this up to the guest OS?

Also, unless halted_time/vcpu_time is defined very carefully and 
precisely, having it as part of the interface can become confusing in 
the case the hypervisor wants to implement "halt" using a busy wait, or 
when the paravirtualized kernel is run on native hardware.  In these 
cases, the vcpu is still hogging a pcpu, so it might be unclear whether 
to consider that time vcpu_time or halted_time.

If vcpu_time is defined to be time in which a pcpu is dedicated to the 
vcpu (even if the vcpu executed the "halt" interface and is busy 
waiting), then halted_time would be defined to be time in which no pcpu 
is dedicated to the vcpu but the vcpu is not involuntarily waiting (i.e. 
the remaining time that is not stolen).  But, why expose this hypervisor 
implementation detail through the interface?

On the other hand, if vcpu_time is defined to be the time in which a 
pcpu is dedicated to the vcpu *and* the vcpu is not halted, then 
halted_time is defined to be the time the vcpu is halted (no matter how 
the hypervisor implements the halt -- a pcpu may still be dedicated to 
the vcpu).  But, in this case, why not leave the partitioning of 
available_time into vcpu_time/halted_time up to the guest OS?

Just trying to say that partitioning available_time into vcpu_time and 
halted_time may just add confusion and make the interface more 
complicated without making the interface any more powerful.

Dan


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-23  8:48   ` Keir Fraser
  2006-02-23 10:04     ` Keir Fraser
@ 2006-02-23 13:18     ` Rik van Riel
  1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2006-02-23 13:18 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Dan Hecht

On Thu, 23 Feb 2006, Keir Fraser wrote:

> I don't understand the last paragraph here. It's not true that, for example,
>  available_time = real_time - stolen_time
> right?

I think that is true.  Not sure why it wouldn't be...

-- 
All Rights Reversed


* RE: [PATCH 1/2] cpu steal time accounting
@ 2006-02-21 14:06 Tian, Kevin
  0 siblings, 0 replies; 20+ messages in thread
From: Tian, Kevin @ 2006-02-21 14:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

>From: Rik van Riel [mailto:riel@redhat.com]
>Sent: 21 February 2006 20:56
>
>On Tue, 21 Feb 2006, Tian, Kevin wrote:
>
>> Why do you need to add a new VCPUOP that does the same thing as
>> DOM0_GETVCPUINFO?
>
>Because the dom0_ops only work for dom0 and reworking that
>function to allow non-privileged domains to get info just
>on themselves would end up being a way uglier patch.
>

I see your point now. Since the physical processor id is also exported
in your patch, will this start a trend of allowing non-privileged
domains to query more physical context information about themselves,
such as GETDOMAININFO, GETVCPUCONTEXT, etc.?

For example, a guest may use the gap between max_pages and tot_pages to
decide whether to eagerly add free pages as caches. It could also treat
that information as an indicator of some kind of tight resource
contention. If that is the usage model, maybe we should consider
dropping the dom0_ prefix from these operations and making them common.

Thanks,
Kevin


* RE: [PATCH 1/2] cpu steal time accounting
@ 2006-02-21  9:05 Tian, Kevin
  2006-02-21 12:56 ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: Tian, Kevin @ 2006-02-21  9:05 UTC (permalink / raw)
  To: Rik van Riel, xen-devel

Why do you need to add a new VCPUOP that does the same thing as DOM0_GETVCPUINFO?

Thanks,
Kevin

>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com
>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Rik van Riel
>Sent: 21 February 2006 8:51
>To: xen-devel@lists.xensource.com
>Subject: [Xen-devel] [PATCH 1/2] cpu steal time accounting
>
>Allow guest domains to get information from the hypervisor on how much
>cpu time their virtual cpus have used.  This is needed to estimate the
>cpu steal time.
>
>Signed-off-by: Rik van Riel <riel@redhat.com>
>
>--- xen/include/public/vcpu.h.steal	2006-02-07 18:01:41.000000000 -0500
>+++ xen/include/public/vcpu.h	2006-02-17 13:51:45.000000000 -0500
>@@ -51,6 +51,14 @@
> /* Returns 1 if the given VCPU is up. */
> #define VCPUOP_is_up                3
>
>+/*
>+ * Get information on how much CPU time this VCPU has used, etc...
>+ *
>+ * @extra_arg == pointer to an empty dom0_getvcpuinfo_t, the "OUT" variables
>+ *               of which filled in with scheduler info.
>+ */
>+#define VCPUOP_cpu_info             4
>+
> #endif /* __XEN_PUBLIC_VCPU_H__ */
>
> /*
>--- xen/common/domain.c.steal	2006-02-07 18:01:40.000000000 -0500
>+++ xen/common/domain.c	2006-02-17 13:52:44.000000000 -0500
>@@ -451,8 +451,24 @@
>     case VCPUOP_is_up:
>         rc = !test_bit(_VCPUF_down, &v->vcpu_flags);
>         break;
>+
>+    case VCPUOP_cpu_info:
>+	{
>+	    struct dom0_getvcpuinfo vi = { 0, };
>+	    vi.online = !test_bit(_VCPUF_down, &v->vcpu_flags);
>+	    vi.blocked = test_bit(_VCPUF_blocked, &v->vcpu_flags);
>+	    vi.running  = test_bit(_VCPUF_running, &v->vcpu_flags);
>+	    vi.cpu_time = v->cpu_time;
>+	    vi.cpu = v->processor;
>+	    rc = 0;
>+
>+	    if ( copy_to_user(arg, &vi, sizeof(dom0_getvcpuinfo_t)) )
>+		rc = -EFAULT;
>+	    break;
>+	}
>     }
>
>+
>     return rc;
> }
>
>


* RE: [PATCH 1/2] cpu steal time accounting
  2006-02-21  9:05 Tian, Kevin
@ 2006-02-21 12:56 ` Rik van Riel
  0 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2006-02-21 12:56 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: xen-devel

On Tue, 21 Feb 2006, Tian, Kevin wrote:

> Why do you need to add a new VCPUOP that does the same thing as 
> DOM0_GETVCPUINFO?

Because the dom0_ops only work for dom0 and reworking that
function to allow non-privileged domains to get info just
on themselves would end up being a way uglier patch.

-- 
All Rights Reversed


* [PATCH 1/2] cpu steal time accounting
@ 2006-02-21  0:51 Rik van Riel
  2006-02-21 17:42 ` Keir Fraser
  0 siblings, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2006-02-21  0:51 UTC (permalink / raw)
  To: xen-devel

Allow guest domains to get information from the hypervisor on how much
cpu time their virtual cpus have used.  This is needed to estimate the
cpu steal time.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- xen/include/public/vcpu.h.steal	2006-02-07 18:01:41.000000000 -0500
+++ xen/include/public/vcpu.h	2006-02-17 13:51:45.000000000 -0500
@@ -51,6 +51,14 @@
 /* Returns 1 if the given VCPU is up. */
 #define VCPUOP_is_up                3
 
+/*
+ * Get information on how much CPU time this VCPU has used, etc...
+ *
+ * @extra_arg == pointer to an empty dom0_getvcpuinfo_t, the "OUT" variables
+ *               of which are filled in with scheduler info.
+ */
+#define VCPUOP_cpu_info             4
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*
--- xen/common/domain.c.steal	2006-02-07 18:01:40.000000000 -0500
+++ xen/common/domain.c	2006-02-17 13:52:44.000000000 -0500
@@ -451,8 +451,24 @@
     case VCPUOP_is_up:
         rc = !test_bit(_VCPUF_down, &v->vcpu_flags);
         break;
+
+    case VCPUOP_cpu_info:
+	{
+	    struct dom0_getvcpuinfo vi = { 0, };
+	    vi.online = !test_bit(_VCPUF_down, &v->vcpu_flags);
+	    vi.blocked = test_bit(_VCPUF_blocked, &v->vcpu_flags);
+	    vi.running  = test_bit(_VCPUF_running, &v->vcpu_flags);
+	    vi.cpu_time = v->cpu_time;
+	    vi.cpu = v->processor;
+	    rc = 0;
+
+	    if ( copy_to_user(arg, &vi, sizeof(dom0_getvcpuinfo_t)) )
+		rc = -EFAULT;
+	    break;
+	}
     }
 
+
     return rc;
 }


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-21  0:51 Rik van Riel
@ 2006-02-21 17:42 ` Keir Fraser
  2006-02-21 19:32   ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: Keir Fraser @ 2006-02-21 17:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

On 21 Feb 2006, at 00:51, Rik van Riel wrote:

> Allow guest domains to get information from the hypervisor on how much
> cpu time their virtual cpus have used.  This is needed to estimate the
> cpu steal time.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Probably we'll kill off the dom0_op instead (or at least rename the 
info structure and leave the old dom0_op just as a legacy placeholder 
for a while).

Looking at the other patch, I think I'm missing higher level context 
regarding what this patch is about. I grepped around for 
account_steal_time() -- looks like it's currently used by s390, but as 
part of a rather bigger patch that also calls 
account_user_time/account_system_time.

Is there an lkml thread I should read to get up to speed on this? Should 
your patch be using those other functions in this account_foo_time api? 
What functionality do we currently miss by not targeting that api?

  thanks,
  Keir


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-21 17:42 ` Keir Fraser
@ 2006-02-21 19:32   ` Rik van Riel
  2006-02-21 21:24     ` Diwaker Gupta
  2006-02-22  9:00     ` Keir Fraser
  0 siblings, 2 replies; 20+ messages in thread
From: Rik van Riel @ 2006-02-21 19:32 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

On Tue, 21 Feb 2006, Keir Fraser wrote:

> Looking at the other patch, I think I'm missing higher level context 
> regarding what this patch is about. I grepped around for 
> account_steal_time() -- looks like it's currently used by s390, but as 
> part of a rather bigger patch that also calls 
> account_user_time/account_system_time.

Basically steal time is the amount of time when:
1) we had a runnable task, but
2) it was not running, because our vcpu was scheduled
   away by the hypervisor

Thus, steal time measures how much a particular workload
inside a virtual machine is impacted by contention on the
cpu between different virtual machines.

It also makes sure that time the vcpu itself was not running
is not erroneously accounted to the currently running process,
which matters when a user is trying to determine how much CPU
time a particular task needs.
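
To make the definition above concrete, here is a minimal C sketch of how 
an interval could be classified.  It assumes the two facts (whether the 
vcpu was actually running, and whether the guest had a runnable task) 
are already known for the interval; classify_interval(), 
account_interval() and the enum names are hypothetical and are not the 
kernel's accounting API.

```c
#include <assert.h>
#include <stdint.h>

enum time_class { TIME_RUN, TIME_STEAL, TIME_IDLE, TIME_CLASS_COUNT };

/*
 * With no runnable task the guest was not waiting for the CPU, so the
 * interval is idle.  With a runnable task it is either run time or, if
 * the hypervisor had scheduled the vcpu away, steal time.
 */
enum time_class classify_interval(int vcpu_running, int has_runnable_task)
{
    if (!has_runnable_task)
        return TIME_IDLE;
    return vcpu_running ? TIME_RUN : TIME_STEAL;
}

/* Add an interval of ns nanoseconds to the matching bucket. */
void account_interval(uint64_t totals_ns[TIME_CLASS_COUNT], uint64_t ns,
                      int vcpu_running, int has_runnable_task)
{
    enum time_class c = classify_interval(vcpu_running, has_runnable_task);

    assert(c < TIME_CLASS_COUNT);
    totals_ns[c] += ns;
}
```

Only the steal bucket reflects contention between virtual machines; the 
run bucket is what should be charged to the current process.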

-- 
All Rights Reversed


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-21 19:32   ` Rik van Riel
@ 2006-02-21 21:24     ` Diwaker Gupta
  2006-02-22  9:00     ` Keir Fraser
  1 sibling, 0 replies; 20+ messages in thread
From: Diwaker Gupta @ 2006-02-21 21:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

> Basically steal time is the amount of time when:
> 1) we had a runnable task, but
> 2) it was not running, because our vcpu was scheduled
>    away by the hypervisor

Just FYI: XenMon provides a similar metric which we call Waiting Time
-- the time a domain was runnable but not running. Of course, yours is
a different use case, since you want to query the stolen time from
within the guest.

Diwaker
--
Web/Blog/Gallery: http://floatingsun.net/blog


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-21 19:32   ` Rik van Riel
  2006-02-21 21:24     ` Diwaker Gupta
@ 2006-02-22  9:00     ` Keir Fraser
  2006-02-22 14:27       ` Rik van Riel
  1 sibling, 1 reply; 20+ messages in thread
From: Keir Fraser @ 2006-02-22  9:00 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

On 21 Feb 2006, at 19:32, Rik van Riel wrote:

> Basically steal time is the amount of time when:
> 1) we had a runnable task, but
> 2) it was not running, because our vcpu was scheduled
>    away by the hypervisor
>
> Thus, steal time measures how much a particular workload
> inside a virtual machine is impacted by contention on the
> cpu between different virtual machines.
>
> It also makes sure that time the vcpu itself was not running
> is not erroneously accounted to the currently running process,
> which matters when a user is trying to determine how much CPU
> time a particular task needs.

Is accounting user/system time an unnecessary extra? I guess we already 
do it by sampling at tick granularity anyway?

Should 'steal time' include blocked time when the guest had no work to 
execute?

Also, given the logic currently only triggers when the guest detects it 
'missed a tick', would it be good enough simply to account 
#missed_ticks as steal time? It would certainly be a lot simpler to 
implement, and you end up dividing everything down to tick granularity 
anyway. :-)

  -- Keir


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-22  9:00     ` Keir Fraser
@ 2006-02-22 14:27       ` Rik van Riel
  2006-02-22 17:08         ` Keir Fraser
  0 siblings, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2006-02-22 14:27 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

On Wed, 22 Feb 2006, Keir Fraser wrote:

> Is accounting user/system time an unnecessary extra? I guess we already 
> do it by sampling at tick granularity anyway?
> 
> Should 'steal time' include blocked time when the guest had no work to 
> execute?

No, this is idle time.  If the guest had no work to do,
it wasn't suffering from contention for the CPU.

> Also, given the logic currently only triggers when the guest detects it
> 'missed a tick', would it be good enough simply to account #missed_ticks as
> steal time. It would certainly be a lot simpler to implement, and you end up
> dividing everything down to tick granularity anyway. :-)

Not good enough if the hypervisor ends up scheduling
guests on a granularity finer than the guest's own
timer ticks.

The reason for only checking steal time when we miss
a tick is that I don't want to run the (expensive?)
steal time logic on every timer interrupt.

-- 
All Rights Reversed


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-22 14:27       ` Rik van Riel
@ 2006-02-22 17:08         ` Keir Fraser
  2006-02-22 17:11           ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: Keir Fraser @ 2006-02-22 17:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

On 22 Feb 2006, at 14:27, Rik van Riel wrote:

>
>> Is accounting user/system time an unnecessary extra? I guess we 
>> already
>> do it by sampling at tick granularity anyway?
>>
>> Should 'steal time' include blocked time when the guest had no work to
>> execute?
>
> No, this is idle time.  If the guest had no work to do,
> it wasn't suffering from contention of the CPU.

But the 'vcpu_time' you read out of Xen excludes time spent 
blocked/unrunnable. Won't you end up accounting that as if it were 
involuntary preemption? Also:
  1. What if a guest gets preempted for lots of short time periods (less 
than a jiffy). Then at some arbitrary time in the future it is preempted 
for long enough to activate your stolen-time logic. Won't you end up 
incorrectly accounting the accumulated short time periods?
  2. Is the Xen provided 'vcpu_time', divided down into jiffies, even 
comparable with the kstats that you sum? What about accumulated 
rounding errors in 'vcpu_time' and the kstats causing relative drift 
between them over time?

  -- Keir


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-22 17:08         ` Keir Fraser
@ 2006-02-22 17:11           ` Rik van Riel
  2006-02-22 17:45             ` Keir Fraser
  0 siblings, 1 reply; 20+ messages in thread
From: Rik van Riel @ 2006-02-22 17:11 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

On Wed, 22 Feb 2006, Keir Fraser wrote:

> But the 'vcpu_time' you read out of Xen excludes time spent
> blocked/unrunnable. Won't you end up accounting that as if it were
> involuntary preemption? Also:

If the domain is unrunnable, surely there won't be a
process on the virtual cpu that is runnable?  Or am
I overlooking something here?

>  1. What if a guest gets preempted for lots of short time periods (less 
> than a jiffy). Then some arbitrary time in the future is preempted for 
> long enough to activate your stolen-time logic. Won't you end up 
> incorrectly accounting the accumulated short time periods?

This is true.  I'm not sure we'd want to get the vcpu info
at every timer interrupt though, that could end up being
too expensive...

>  2. Is the Xen provided 'vcpu_time', divided down into jiffies, even 
> comparable with the kstats that you sum? What about accumulated rounding 
> errors in 'vcpu_time' and the kstats causing relative drift between them 
> over time?

In the tests I ran the steal time seemed to work out quite
well with what I expected it to be, watching /proc/stat from
inside the guest and xentop from dom0 simultaneously.

The rounding errors happen occasionally (I added printks to
the if statements catching them), but not all that often...

-- 
All Rights Reversed


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-22 17:11           ` Rik van Riel
@ 2006-02-22 17:45             ` Keir Fraser
  2006-02-24 19:04               ` Rik van Riel
  0 siblings, 1 reply; 20+ messages in thread
From: Keir Fraser @ 2006-02-22 17:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: xen-devel

On 22 Feb 2006, at 17:11, Rik van Riel wrote:

> If the domain is unrunnable, surely there won't be a
> process on the virtual cpu that is runnable?  Or am
> I overlooking something here?

Oh, I see, this is dealt with inside account_steal_time(). No problem 
then.

>>  1. What if a guest gets preempted for lots of short time periods 
>> (less
>> than a jiffy). Then some arbitrary time in the future is preempted for
>> long enough to activate your stolen-time logic. Won't you end up
>> incorrectly accounting the accumulated short time periods?
>
> This is true.  I'm not sure we'd want to get the vcpu info
> at every timer interrupt though, that could end up being
> too expensive...

Having to call down to Xen to get that information is unfortunate. 
Perhaps we can export it in shared_info, or have the guest register a 
virtual address it would like the info written to.

> In the tests I ran the steal time seemed to work out quite
> well with what I expected it to be, watching /proc/stat from
> inside the guest and xentop from dom0 simultaneously.
>
> The rounding errors happen occasionally (I added printks to
> the if statements catching them), but not all that often...

I think the calculation of delta stolen time would be clearer as:
  ((system_time - prev_system_time) - (vcpu_time - prev_vcpu_time)) / 
NS_PER_TICK
where system_time/vcpu_time become the prev_system_time/prev_vcpu_time 
the next time your logic is triggered.

It has another advantage that it does not subtract quantities that can 
slowly relatively drift over days/weeks.
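
A minimal C sketch of the delta calculation suggested above, under 
stated assumptions: both counters are in nanoseconds and monotonic, and 
HZ_ASSUMED, NS_PER_TICK_ASSUMED, struct steal_state and steal_ticks() 
are hypothetical names for this example only (the real tick rate and 
state layout come from the kernel).

```c
#include <assert.h>
#include <stdint.h>

#define HZ_ASSUMED 100 /* assumed tick rate for the example */
#define NS_PER_TICK_ASSUMED (1000000000LL / HZ_ASSUMED)

/* Values saved from the previous time the logic was triggered. */
struct steal_state {
    uint64_t prev_system_ns;
    uint64_t prev_vcpu_ns;
};

/*
 * Return the whole ticks of time the vcpu did not run since the last
 * call, then remember the current samples for the next call.  Only
 * deltas are subtracted, so slow drift between the two counters does
 * not accumulate.  The sub-tick remainder is dropped here.
 */
int64_t steal_ticks(struct steal_state *st, uint64_t system_ns,
                    uint64_t vcpu_ns)
{
    assert(system_ns >= st->prev_system_ns);
    assert(vcpu_ns >= st->prev_vcpu_ns);

    int64_t d_system = (int64_t)(system_ns - st->prev_system_ns);
    int64_t d_vcpu = (int64_t)(vcpu_ns - st->prev_vcpu_ns);
    int64_t stolen_ns = d_system - d_vcpu;

    st->prev_system_ns = system_ns;
    st->prev_vcpu_ns = vcpu_ns;

    if (stolen_ns <= 0)
        return 0;
    return stolen_ns / NS_PER_TICK_ASSUMED;
}
```

Note that this difference by itself cannot tell time spent blocked from 
time spent waiting in the hypervisor's runqueue; that distinction still 
has to be made by the caller.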

  -- Keir


* Re: [PATCH 1/2] cpu steal time accounting
  2006-02-22 17:45             ` Keir Fraser
@ 2006-02-24 19:04               ` Rik van Riel
  0 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2006-02-24 19:04 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

On Wed, 22 Feb 2006, Keir Fraser wrote:

> I think the calculation of delta stolen time would be clearer as:
>  ((system_time - prev_system_time) - (vcpu_time - prev_vcpu_time)) /
> NS_PER_TICK

The "(system_time - prev_system_time)" above is equivalent
to the delta_cpu variable.

I agree on saving the prev_vcpu_time though, it makes the code
quite a bit nicer.  I hope you like this patch ;)


Signed-off-by: Rik van Riel <riel@redhat.com>


--- linux-2.6.15.i686/arch/i386/kernel/time-xen.c.steal	2006-02-17 16:44:40.000000000 -0500
+++ linux-2.6.15.i686/arch/i386/kernel/time-xen.c	2006-02-24 13:53:08.000000000 -0500
@@ -48,6 +48,7 @@
 #include <linux/mca.h>
 #include <linux/sysctl.h>
 #include <linux/percpu.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/io.h>
 #include <asm/smp.h>
@@ -77,6 +78,7 @@
 #include <asm/arch_hooks.h>
 
 #include <xen/evtchn.h>
+#include <xen/interface/vcpu.h>
 
 #if defined (__i386__)
 #include <asm/i8259.h>
@@ -125,6 +127,9 @@ static u32 shadow_tv_version;
 static u64 processed_system_time;   /* System time (ns) at last processing. */
 static DEFINE_PER_CPU(u64, processed_system_time);
 
+/* Keep track of how much time our vcpu used, for steal time calculation. */
+static DEFINE_PER_CPU(u64, prev_vcpu_time);
+
 /* Must be signed, as it's compared with s64 quantities which can be -ve. */
 #define NS_PER_TICK (1000000000LL/HZ)
 
@@ -624,7 +629,32 @@ irqreturn_t timer_interrupt(int irq, voi
          * Local CPU jiffy work. No need to hold xtime_lock, and I'm not sure
          * if there is risk of deadlock if we do (since update_process_times
          * may do scheduler rebalancing work and thus acquire runqueue locks).
+	 *
+	 * If we have not run for a while, chances are this vcpu got scheduled
+	 * away.  Try to estimate how much time was stolen.
          */
+	if (delta_cpu > (s64)(2 * NS_PER_TICK)) {
+		dom0_getvcpuinfo_t vcpu = { 0, };
+		s64 steal;
+		u64 dvcpu;
+
+		if (HYPERVISOR_vcpu_op(VCPUOP_cpu_info, cpu, &vcpu) == 0) {
+			dvcpu = vcpu.cpu_time - per_cpu(prev_vcpu_time, cpu);
+			per_cpu(prev_vcpu_time, cpu) = vcpu.cpu_time;
+			steal = delta_cpu - (s64)dvcpu;
+
+			if (steal > 0) {
+				/* do_div modifies the variable in place. */
+				do_div(steal, NS_PER_TICK);
+
+				delta_cpu -= steal * NS_PER_TICK;
+				per_cpu(processed_system_time, cpu) +=
+							steal * NS_PER_TICK;
+				account_steal_time(current, (cputime_t)steal);
+			}
+		}
+	}
+
 	while (delta_cpu >= NS_PER_TICK) {
 		delta_cpu -= NS_PER_TICK;
 		per_cpu(processed_system_time, cpu) += NS_PER_TICK;
--- linux-2.6.15.i686/include/xen/interface/vcpu.h.steal	2006-02-17 16:14:17.000000000 -0500
+++ linux-2.6.15.i686/include/xen/interface/vcpu.h	2006-02-17 16:14:52.000000000 -0500
@@ -51,6 +51,14 @@
 /* Returns 1 if the given VCPU is up. */
 #define VCPUOP_is_up                3
 
+/*
+ * Get information on how much CPU time this VCPU has used, etc...
+ *
+ * @extra_arg == pointer to an empty dom0_getvcpuinfo_t, the "OUT" variables
+ *               of which are filled in with scheduler info.
+ */
+#define VCPUOP_cpu_info             4
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*


