> On 22 Feb 2006, at 17:11, Rik van Riel wrote:
> 
>> If the domain is unrunnable, surely there won't be a
>> process on the virtual cpu that is runnable?  Or am
>> I overlooking something here?
> 
> Oh, I see, this is dealt with inside account_steal_time(). No problem then.
> 

But the guest will not be able to partition this time correctly
between the idle time stat and the steal time stat.

When the vcpu enters its idle loop and blocks, at some point it will
become ready to run again.  But it will not necessarily run right away.
Instead, it will sit in the hypervisor's runqueue for some amount of
time.  Time spent in the runqueue is "stolen", since the vcpu wants to
run but isn't running (at this point it is waiting involuntarily).
On a heavily overcommitted system, the time spent in the runqueue
following the block may be nontrivial.

But, the guest (the code in account_steal_time()) cannot determine the 
time at which the vcpu was requeued into the hypervisor's runqueue.  It 
will account all of this time toward the "idle time" stat, rather than 
partitioning it between "idle time" and "steal time".

To solve this, it may be best to have the hypervisor interface expose 
per-vcpu stolen time directly, rather than vcpu_time.  Then the guest 
does not need to try to guess whether to charge (system_time - 
vcpu_time) against idle or steal.
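
As a rough illustration of the difference, here is a user-space sketch
in C of the two accounting schemes for a single interval.  The struct
and function names are made up for this example and are not part of
Xen, VMI, or Linux, and the rule that the whole gap is charged to idle
when the vcpu blocked from its idle loop is only my reading of the
behaviour described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-interval sample; not an actual Xen or Linux structure. */
struct interval {
    uint64_t system_ns;  /* wall-clock time elapsed in the interval */
    uint64_t vcpu_ns;    /* time the vcpu actually ran */
    uint64_t stolen_ns;  /* time spent runnable but waiting in the runqueue */
};

struct charges {
    uint64_t idle_ns;
    uint64_t steal_ns;
};

/*
 * Guest sees only vcpu_time: the whole gap (system_time - vcpu_time) is
 * charged to one bucket, chosen by whether the vcpu was in its idle loop.
 */
static struct charges charge_from_vcpu_time(const struct interval *iv,
                                            int was_idle)
{
    struct charges c = { 0, 0 };
    uint64_t gap;

    assert(iv->vcpu_ns <= iv->system_ns);
    gap = iv->system_ns - iv->vcpu_ns;
    if (was_idle)
        c.idle_ns = gap;   /* runqueue wait is misfiled as idle */
    else
        c.steal_ns = gap;
    return c;
}

/*
 * Hypervisor exposes stolen time directly: the guest charges exactly that
 * amount to steal and the remainder of the gap to idle.
 */
static struct charges charge_from_stolen(const struct interval *iv)
{
    struct charges c;
    uint64_t gap;

    assert(iv->vcpu_ns <= iv->system_ns);
    gap = iv->system_ns - iv->vcpu_ns;
    assert(iv->stolen_ns <= gap);
    c.steal_ns = iv->stolen_ns;
    c.idle_ns = gap - iv->stolen_ns;
    return c;
}
```

For an interval of 10 ms in which the vcpu ran for 2 ms and waited 3 ms
in the runqueue after blocking from idle, the first function reports
8 ms idle and 0 ms steal, while the second reports 5 ms idle and 3 ms
steal.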

>>>  1. What if a guest gets preempted for lots of short time periods (less
>>> than a jiffy). Then some arbitrary time in the future is preempted for
>>> long enough to activate your stolen-time logic. Won't you end up
>>> incorrectly accounting the accumulated short time periods?
>>
>> This is true.  I'm not sure we'd want to get the vcpu info
>> at every timer interrupt though, that could end up being
>> too expensive...
> 
> Having to call down to Xen to get that information is unfortunate. 
> Perhaps we can export it in shared_info, or have the guest register a 
> virtual address it would like the info written to.
> 

If you do change how this information is passed through the interface, 
then maybe this would be a good time to define the interface to export 
"stolen time" computed by the hypervisor, rather than "vcpu time".

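One possible shape for such an interface, sketched in user-space C11.
The record layout, the field names, and the version-counter protocol
(odd while an update is in progress) are assumptions made for
illustration; they are not the actual shared_info or VMI definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-vcpu record in memory shared with the guest. */
struct stolen_record {
    atomic_uint version;         /* odd while the hypervisor is updating */
    _Atomic uint64_t stolen_ns;  /* cumulative time runnable but not running */
    _Atomic uint64_t system_ns;  /* system time at which stolen_ns was sampled */
};

struct stolen_snapshot {
    uint64_t stolen_ns;
    uint64_t system_ns;
};

/* Hypervisor side: publish a new, mutually consistent pair of values. */
static void hv_publish(struct stolen_record *r,
                       uint64_t stolen_ns, uint64_t system_ns)
{
    assert((atomic_load(&r->version) & 1u) == 0);
    atomic_fetch_add(&r->version, 1);        /* version becomes odd */
    atomic_store(&r->stolen_ns, stolen_ns);
    atomic_store(&r->system_ns, system_ns);
    atomic_fetch_add(&r->version, 1);        /* version is even again */
}

/* Guest side: read without a hypercall, retrying if an update raced us. */
static struct stolen_snapshot guest_read(struct stolen_record *r)
{
    struct stolen_snapshot s;
    unsigned int before, after;

    do {
        before = atomic_load(&r->version);
        s.stolen_ns = atomic_load(&r->stolen_ns);
        s.system_ns = atomic_load(&r->system_ns);
        after = atomic_load(&r->version);
    } while ((before & 1u) || before != after);
    return s;
}
```

Because stolen_ns is cumulative in this sketch, the guest can sample it
at whatever rate it likes and take differences, so preemptions shorter
than a jiffy are not lost between samples.
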
On a related topic, the Linux scheduler also uses the
routine sched_clock() to calculate the run-time (and sleep-time) of
processes.  With the current code, these calculations will include 
stolen time in the total run-time of a process.  Perhaps sched_clock() 
should be a clock that does not advance when time is stolen by the 
hypervisor?
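
A minimal sketch of such a clock, again in user-space C.  It assumes
the hypervisor exposes a cumulative stolen-time counter measured from
the same origin as the raw clock; the function names and parameters
are hypothetical, and this is not the kernel's sched_clock().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical clock that stands still while time is stolen: subtract the
 * cumulative stolen time from the raw monotonic time.
 */
static uint64_t sched_clock_sketch(uint64_t raw_ns, uint64_t stolen_total_ns)
{
    assert(stolen_total_ns <= raw_ns);
    return raw_ns - stolen_total_ns;
}

/* Run-time of a process between two readings of the clock above. */
static uint64_t run_time_ns(uint64_t start_clock_ns, uint64_t end_clock_ns)
{
    assert(start_clock_ns <= end_clock_ns);
    return end_clock_ns - start_clock_ns;
}
```

If a process is switched in at raw time 1 ms with nothing stolen so
far, and switched out at raw time 5 ms after 3 ms have been stolen, it
is charged 1 ms of run-time rather than 4 ms.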

We've been thinking about these issues also.  I've attached a document 
that describes our current thoughts.  The document describes the portion 
of the VMI that deals with time and describes the changes to Linux to 
accommodate VMI Time (Rik, this is an updated draft of the document I 
sent you before).  Comments are welcome.

Thanks,
Dan