[BUG] Linux process vruntime accounting in Xen

* [BUG] Linux process vruntime accounting in Xen
@ 2016-05-15  0:25 Tony S
  2016-05-16 11:37 ` Dario Faggioli
  0 siblings, 1 reply; 12+ messages in thread
From: Tony S @ 2016-05-15  0:25 UTC (permalink / raw)
  To: xen-devel

In virtualized environments, we sometimes need to limit the CPU
resources available to a virtual machine (VM). For example, in Xen we use

$ xl sched-credit -d 1 -c 50

to cap dom 1 at half of one physical CPU core. When a VM's CPU
resources are capped, processes inside the VM suffer from a vruntime
accounting problem. Here I report my findings about the Linux process
scheduler in this scenario.

------------Description------------
Linux CFS relies on delta_exec to charge vruntime to processes. The
variable delta_exec is the difference between the time a process starts
running on a CPU and the time it stops. This works well on a physical
machine. In a virtual machine with capped CPU resources, however, some
processes can be charged an inaccurate vruntime.

For example, suppose we have a VM with one vCPU that is capped at 50%
of a physical CPU. If process A inside the VM starts running and the
VM then exhausts its CPU allocation, the VM is paused. In the next
round, when the VM receives a new CPU allocation and starts running
again, process A stops running and is put back on the runqueue. The
delta_exec of process A is then its real execution time plus the time
its VM was paused. That makes the vruntime of process A much larger
than it should be, so process A is not scheduled again until the
vruntimes of the other processes catch up with it, which can take a
long time.
---------------------------------------
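
To put rough numbers on the example above, here is a small user-space
sketch in C. It is not kernel code: the helper names and the sample
durations (2 ms of real execution, 15 ms of VM pause) are assumptions
chosen only for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the time charged to a process that was on the
 * CPU when its VM was descheduled.  All values are in nanoseconds.
 * The guest only sees "now - exec_start", so the pause is included.
 */
static uint64_t charged_delta_ns(uint64_t ran_ns, uint64_t vm_paused_ns)
{
    return ran_ns + vm_paused_ns;
}

/* How many times larger the charge is than the time actually executed. */
static double overcharge_factor(uint64_t ran_ns, uint64_t vm_paused_ns)
{
    return (double)charged_delta_ns(ran_ns, vm_paused_ns) / (double)ran_ns;
}
```

With these assumed figures, a process that really ran for 2 ms is
charged 17 ms, i.e. 8.5 times its actual execution time.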

------------Analysis----------------
When a process stops running and is about to be put back on the
runqueue, update_curr() is executed.
[src/kernel/sched/fair.c]

static void update_curr(struct cfs_rq *cfs_rq)
{
    ... ...
    delta_exec = now - curr->exec_start;
    ... ...
    curr->exec_start = now;
    ... ...
    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    curr->vruntime += calc_delta_fair(delta_exec, curr);
    update_min_vruntime(cfs_rq);
    ... ...
}

"now" --> the current time
"exec_start" --> the time at which the current process was put on the
CPU (or at which it was last accounted, since update_curr() resets it)
"delta_exec" --> the time elapsed between the process starting and
stopping running on the CPU

If a process starts running before its VM is paused and stops running
after its VM is unpaused, delta_exec includes the time the VM was
paused, which is large compared to the real execution time of the
process.
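
The accounting steps quoted from update_curr() can be modelled in user
space as follows. This is a simplified sketch under stated assumptions:
the toy_* names, the struct layout, and the explicit "now" argument are
made up for illustration, and the real kernel uses its runqueue clock
and a precomputed inverse weight rather than a plain division.

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL /* load weight of a nice-0 task in CFS */

/* Simplified stand-in for the kernel's sched_entity (illustration only). */
struct toy_entity {
    uint64_t exec_start;       /* timestamp of the last accounting update */
    uint64_t sum_exec_runtime; /* total time charged so far, in ns */
    uint64_t vruntime;         /* weighted virtual runtime, in ns */
    uint64_t weight;           /* load weight; 1024 for nice 0 */
};

/* Scale a delta by NICE_0_LOAD / weight: heavier tasks age more slowly. */
static uint64_t toy_calc_delta_fair(uint64_t delta, const struct toy_entity *se)
{
    return delta * NICE_0_LOAD / se->weight;
}

/* Same steps as the quoted update_curr(), driven by an explicit clock. */
static void toy_update_curr(struct toy_entity *curr, uint64_t now)
{
    uint64_t delta_exec = now - curr->exec_start;

    curr->exec_start = now;
    curr->sum_exec_runtime += delta_exec;
    curr->vruntime += toy_calc_delta_fair(delta_exec, curr);
}
```

Calling toy_update_curr() with a "now" that lies after a VM pause
charges the pause to the task: nothing in the computation can tell
stolen time apart from executed time.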

This issue can seriously harm the performance of the victim process.
If the process is I/O-bound, its throughput and latency suffer. If
the process is CPU-bound, its vruntime becomes "unfair" compared to
other processes under CFS.

Because the CPU resources of some VM types in the cloud are capped as
described above (for example, Amazon EC2 t2.small instances), I
suspect this also harms the performance of public cloud instances.
---------------------------------------

My test environment is as follows: hypervisor (Xen 4.5.0), Dom 0 (Linux
3.18.21), Dom U (Linux 3.18.21). I also tested the longterm kernel
Linux 3.18.30 and the latest longterm kernel, Linux 4.4.7. All of
these kernels have this issue.

Please confirm this bug. Thanks.

-- 
Tony. S
Ph. D student of University of Colorado, Colorado Springs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-15  0:25 [BUG] Linux process vruntime accounting in Xen Tony S
@ 2016-05-16 11:37 ` Dario Faggioli
  2016-05-16 21:38   ` Tony S
  0 siblings, 1 reply; 12+ messages in thread
From: Dario Faggioli @ 2016-05-16 11:37 UTC (permalink / raw)
  To: Tony S, xen-devel
  Cc: George Dunlap, Juergen Gross, Boris Ostrovsky, David Vrabel,
	Matt Fleming


[Adding George again, and a few Linux/Xen folks]

On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
> In virtualized environments, sometimes we need to limit the CPU
> resources to a virtual machine(VM). For example in Xen, we use
> $ xl sched-credit -d 1 -c 50
> 
> to limit the CPU resource of dom 1 as half of
> one physical CPU core. If the VM CPU resource is capped, the process
> inside the VM will have a vruntime accounting problem. Here, I report
> my findings about Linux process scheduler under the above scenario.
> 
Thanks for this other report as well. :-)

All you say makes sense to me, and I will think about it. I'm not sure
about one thing, though...

> ------------Description------------
> Linux CFS relies on delta_exec to charge the vruntime of processes.
> The variable delta_exec is the difference of a process starts and
> stops running on a CPU. This works well in physical machine. However,
> in virtual machine under capped resources, some processes might be
> accounted with inaccurate vruntime.
> 
> For example, suppose we have a VM which has one vCPU and is capped to
> have as much as 50% of a physical CPU. When process A inside the VM
> starts running and the CPU resource of that VM runs out, the VM will
> be paused. Next round when the VM is allocated new CPU resource and
> starts running again, process A stops running and is put back to the
> runqueue. The delta_exec of process A is accounted as its "real
> execution time" plus the paused time of its VM. That will make the
> vruntime of process A much larger than it should be and process A
> would not be scheduled again for a long time until the vruntimes of
> other
> processes catch it.
> ---------------------------------------
> 
> 
> ------------Analysis----------------
> When a process stops running and is going to put back to the
> runqueue,
> update_curr() will be executed.
> [src/kernel/sched/fair.c]
> 
> static void update_curr(struct cfs_rq *cfs_rq)
> {
>     ... ...
>     delta_exec = now - curr->exec_start;
>     ... ...
>     curr->exec_start = now;
>     ... ...
>     curr->sum_exec_runtime += delta_exec;
>     schedstat_add(cfs_rq, exec_clock, delta_exec);
>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>     update_min_vruntime(cfs_rq);
>     ... ...
> }
> 
> "now" --> the right now time
> "exec_start" --> the time when the current process is put on the CPU
> "delta_exec" --> the time difference of a process between it starts
> and stops running on the CPU
> 
> When a process starts running before its VM is paused and the process
> stops running after its VM is unpaused, the delta_exec will include
> the VM suspend time which is pretty large compared to the real
> execution time of a process.
> 
... but would that also apply to a VM that is not scheduled for a
while --just because of pCPU contention, not because it was paused?

Isn't there anything in place in Xen or Linux (the latter being better
suited to something like this, IMHO) to compensate for that?

I have to admit I have never really checked myself; maybe George or
our Linux people know more?

> This issue will make a great performance harm to the victim process.
> If the process is an I/O-bound workload, its throughput and latency
> will be influenced. If the process is a CPU-bound workload, this
> issue
> will make its vruntime "unfair" compared to other processes under
> CFS.
> 
> Because the CPU resource of some type VMs in the cloud are limited as
> the above describes(like Amazon EC2 t2.small instance), I doubt that
> will also harm the performance of public cloud instances.
> ---------------------------------------
> 
> 
> My test environment is as follows: Hypervisor(Xen 4.5.0), Dom 0(Linux
> 3.18.21), Dom U(Linux 3.18.21). I also test longterm version Linux
> 3.18.30 and the latest longterm version, Linux 4.4.7. Those kernels
> all have this issue.
> 
> Please confirm this bug. Thanks.
> 
> 
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-16 11:37 ` Dario Faggioli
@ 2016-05-16 21:38   ` Tony S
  2016-05-16 22:33     ` Boris Ostrovsky
  2016-05-16 22:33     ` Tony S
  0 siblings, 2 replies; 12+ messages in thread
From: Tony S @ 2016-05-16 21:38 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: George Dunlap, Juergen Gross, Boris Ostrovsky, David Vrabel,
	Matt Fleming

On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> [Adding George again, and a few Linux/Xen folks]
>
> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>> In virtualized environments, sometimes we need to limit the CPU
>> resources to a virtual machine(VM). For example in Xen, we use
>> $ xl sched-credit -d 1 -c 50
>>
>> to limit the CPU resource of dom 1 as half of
>> one physical CPU core. If the VM CPU resource is capped, the process
>> inside the VM will have a vruntime accounting problem. Here, I report
>> my findings about Linux process scheduler under the above scenario.
>>
> Thanks for this other report as well. :-)
>
> All you say makes sense to me, and I will think about it. I'm not sure
> about one thing, though...
>

Hi Dario,

Thank you for your reply.


>> ------------Description------------
>> Linux CFS relies on delta_exec to charge the vruntime of processes.
>> The variable delta_exec is the difference of a process starts and
>> stops running on a CPU. This works well in physical machine. However,
>> in virtual machine under capped resources, some processes might be
>> accounted with inaccurate vruntime.
>>
>> For example, suppose we have a VM which has one vCPU and is capped to
>> have as much as 50% of a physical CPU. When process A inside the VM
>> starts running and the CPU resource of that VM runs out, the VM will
>> be paused. Next round when the VM is allocated new CPU resource and
>> starts running again, process A stops running and is put back to the
>> runqueue. The delta_exec of process A is accounted as its "real
>> execution time" plus the paused time of its VM. That will make the
>> vruntime of process A much larger than it should be and process A
>> would not be scheduled again for a long time until the vruntimes of
>> other
>> processes catch it.
>> ---------------------------------------
>>
>>
>> ------------Analysis----------------
>> When a process stops running and is going to put back to the
>> runqueue,
>> update_curr() will be executed.
>> [src/kernel/sched/fair.c]
>>
>> static void update_curr(struct cfs_rq *cfs_rq)
>> {
>>     ... ...
>>     delta_exec = now - curr->exec_start;
>>     ... ...
>>     curr->exec_start = now;
>>     ... ...
>>     curr->sum_exec_runtime += delta_exec;
>>     schedstat_add(cfs_rq, exec_clock, delta_exec);
>>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>>     update_min_vruntime(cfs_rq);
>>     ... ...
>> }
>>
>> "now" --> the right now time
>> "exec_start" --> the time when the current process is put on the CPU
>> "delta_exec" --> the time difference of a process between it starts
>> and stops running on the CPU
>>
>> When a process starts running before its VM is paused and the process
>> stops running after its VM is unpaused, the delta_exec will include
>> the VM suspend time which is pretty large compared to the real
>> execution time of a process.
>>
> ... but would that also apply to a VM that is not scheduled --just
> because of pCPU contention, not because it was paused-- for a few time?
>

Thanks for your suggestion. Today I checked whether this issue also
exists with pCPU sharing. Unfortunately, it does: the issue occurs not
only in the capping case but also in the pCPU sharing case.

In both cases, the process vruntime accounting in the guest OS shows a
"vruntime jump", which can give the victim process poor and
unpredictable performance.

In the cloud, from my point of view, a VM runs in one of three scenarios:
1. dedicated hardware (in this case, VM = physical machine);
2. part of dedicated hardware (using capping, like an Amazon EC2 t2.small instance);
3. hardware shared with other VMs.

Both case #2 and case #3 are affected by the issue I described.


> Isn't there anything in place in Xen or Linux (the latter being better
> suitable for something like this, IMHO) to compensate for that?
>

No, I do not think so. I think this is a bug in the Linux kernel under
virtualization (with Xen as the VMM platform).

> I have to admit I haven't really ever checked myself, maybe either
> George or our Linux people do know more?

The underlying issue is that process execution time (e.g., delta_exec)
should not be calculated in a virtualized environment the same way it
is in a physical environment.

Here are two possible ways to fix it:

1) Use the changes in vcpu->runstate.time (running/runnable/blocked/offline)
to determine how long the process on this VCPU has actually been
running, instead of just "delta_exec = now - exec_start";

2) Build another clock inside the guest OS that records the exact
time that the VCPU runs, and base all vruntime calculations on this
clock instead of the hypervisor clock/time (real clock).
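
Option 1 could look roughly like the sketch below. It is a user-space
model, not a patch: the toy_* names are hypothetical, the struct is a
cut-down stand-in for Xen's vcpu_runstate_info, and treating runnable
plus offline time as "stolen" is an assumption about which states
should be excluded.

```c
#include <stdint.h>

/* Runstate indices, numbered as in Xen's public vcpu.h. */
enum {
    RUNSTATE_running  = 0,
    RUNSTATE_runnable = 1,
    RUNSTATE_blocked  = 2,
    RUNSTATE_offline  = 3
};

/* Cut-down copy of the per-vCPU time counters (illustration only). */
struct toy_runstate {
    uint64_t time[4]; /* ns spent in each runstate */
};

/* Assumption: time spent runnable or offline was taken from the vCPU. */
static uint64_t toy_steal_ns(const struct toy_runstate *rs)
{
    return rs->time[RUNSTATE_runnable] + rs->time[RUNSTATE_offline];
}

/* delta_exec with the steal accumulated since exec_start removed. */
static uint64_t toy_delta_exec(uint64_t now, uint64_t exec_start,
                               uint64_t steal_now, uint64_t steal_at_start)
{
    uint64_t delta = now - exec_start;
    uint64_t stolen = steal_now - steal_at_start;

    /* The two clocks are sampled separately, so clamp to stay >= 0. */
    if (stolen > delta)
        stolen = delta;
    return delta - stolen;
}
```

For a task that was on the CPU for 17 ms of wall-clock time, 15 ms of
which were stolen, toy_delta_exec() yields 2 ms instead of 17 ms.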

Thanks.

>
>> This issue will make a great performance harm to the victim process.
>> If the process is an I/O-bound workload, its throughput and latency
>> will be influenced. If the process is a CPU-bound workload, this
>> issue
>> will make its vruntime "unfair" compared to other processes under
>> CFS.
>>
>> Because the CPU resource of some type VMs in the cloud are limited as
>> the above describes(like Amazon EC2 t2.small instance), I doubt that
>> will also harm the performance of public cloud instances.
>> ---------------------------------------
>>
>>
>> My test environment is as follows: Hypervisor(Xen 4.5.0), Dom 0(Linux
>> 3.18.21), Dom U(Linux 3.18.21). I also test longterm version Linux
>> 3.18.30 and the latest longterm version, Linux 4.4.7. Those kernels
>> all have this issue.
>>
>> Please confirm this bug. Thanks.
>>
>>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

--
Tony. S
Ph. D student of University of Colorado, Colorado Springs

* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-16 21:38   ` Tony S
@ 2016-05-16 22:33     ` Boris Ostrovsky
  2016-05-17  9:33       ` George Dunlap
  2016-05-16 22:33     ` Tony S
  1 sibling, 1 reply; 12+ messages in thread
From: Boris Ostrovsky @ 2016-05-16 22:33 UTC (permalink / raw)
  To: Tony S, Dario Faggioli, xen-devel
  Cc: George Dunlap, Juergen Gross, David Vrabel, Matt Fleming

On 05/16/2016 05:38 PM, Tony S wrote:
> The issue behind it is that the process execution calculation(e.g.,
> delta_exec) in virtualized environment should not be calculated as it
> did in physical environment.
>
> Here are two solutions to fix it:
>
> 1) Based on the vcpu->runstate.time(running/runnable/block/offline)
> changes, to determine how much time the process on this VCPU is
> running, instead of just "delta_exec = now - exec_start";
>
> 2) Build another clock inside the guest OS which records the exact
> time that the VCPU runs. All vruntime calculation is based on this
> clock, instead of hypervisor clock/time(real clock).

Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting process
times. KVM uses it but Xen doesn't.
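
The general idea behind steal-aware time accounting can be sketched as
a second clock that advances only by non-stolen time, which is close in
spirit to option 2 quoted above. The struct, field, and function names
below are assumptions for illustration; this is a user-space model and
does not reproduce the kernel's implementation.

```c
#include <stdint.h>

/* Toy run queue with a wall clock and a steal-adjusted task clock. */
struct toy_rq {
    uint64_t clock;           /* last wall-clock sample, ns */
    uint64_t clock_task;      /* advances only by non-stolen time, ns */
    uint64_t prev_steal_time; /* steal already accounted for, ns */
};

/*
 * Advance both clocks to wall time `now`, given the cumulative steal
 * time `steal_total` reported by the hypervisor.
 */
static void toy_update_rq_clock(struct toy_rq *rq, uint64_t now,
                                uint64_t steal_total)
{
    uint64_t delta = now - rq->clock;
    uint64_t steal = steal_total - rq->prev_steal_time;

    /* Clamp so the task clock never runs backwards; any excess steal
     * stays pending and is consumed by a later update. */
    if (steal > delta)
        steal = delta;

    rq->prev_steal_time += steal;
    rq->clock = now;
    rq->clock_task += delta - steal;
}
```

If delta_exec is computed from clock_task instead of clock, a pause of
the vCPU no longer inflates the vruntime of the task that happened to
be current.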

-boris


* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-16 21:38   ` Tony S
  2016-05-16 22:33     ` Boris Ostrovsky
@ 2016-05-16 22:33     ` Tony S
  1 sibling, 0 replies; 12+ messages in thread
From: Tony S @ 2016-05-16 22:33 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: George Dunlap, Juergen Gross, Boris Ostrovsky, David Vrabel,
	Matt Fleming

On Mon, May 16, 2016 at 3:38 PM, Tony S <suokunstar@gmail.com> wrote:
> On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
>> [Adding George again, and a few Linux/Xen folks]
>>
>> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>>> In virtualized environments, sometimes we need to limit the CPU
>>> resources to a virtual machine(VM). For example in Xen, we use
>>> $ xl sched-credit -d 1 -c 50
>>>
>>> to limit the CPU resource of dom 1 as half of
>>> one physical CPU core. If the VM CPU resource is capped, the process
>>> inside the VM will have a vruntime accounting problem. Here, I report
>>> my findings about Linux process scheduler under the above scenario.
>>>
>> Thanks for this other report as well. :-)
>>
>> All you say makes sense to me, and I will think about it. I'm not sure
>> about one thing, though...
>>
>
> Hi Dario,
>
> Thank you for your reply.
>
>
>>> ------------Description------------
>>> Linux CFS relies on delta_exec to charge the vruntime of processes.
>>> The variable delta_exec is the difference of a process starts and
>>> stops running on a CPU. This works well in physical machine. However,
>>> in virtual machine under capped resources, some processes might be
>>> accounted with inaccurate vruntime.
>>>
>>> For example, suppose we have a VM which has one vCPU and is capped to
>>> have as much as 50% of a physical CPU. When process A inside the VM
>>> starts running and the CPU resource of that VM runs out, the VM will
>>> be paused. Next round when the VM is allocated new CPU resource and
>>> starts running again, process A stops running and is put back to the
>>> runqueue. The delta_exec of process A is accounted as its "real
>>> execution time" plus the paused time of its VM. That will make the
>>> vruntime of process A much larger than it should be and process A
>>> would not be scheduled again for a long time until the vruntimes of
>>> other
>>> processes catch it.
>>> ---------------------------------------
>>>
>>>
>>> ------------Analysis----------------
>>> When a process stops running and is going to put back to the
>>> runqueue,
>>> update_curr() will be executed.
>>> [src/kernel/sched/fair.c]
>>>
>>> static void update_curr(struct cfs_rq *cfs_rq)
>>> {
>>>     ... ...
>>>     delta_exec = now - curr->exec_start;
>>>     ... ...
>>>     curr->exec_start = now;
>>>     ... ...
>>>     curr->sum_exec_runtime += delta_exec;
>>>     schedstat_add(cfs_rq, exec_clock, delta_exec);
>>>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>>>     update_min_vruntime(cfs_rq);
>>>     ... ...
>>> }
>>>
>>> "now" --> the right now time
>>> "exec_start" --> the time when the current process is put on the CPU
>>> "delta_exec" --> the time difference of a process between it starts
>>> and stops running on the CPU
>>>
>>> When a process starts running before its VM is paused and the process
>>> stops running after its VM is unpaused, the delta_exec will include
>>> the VM suspend time which is pretty large compared to the real
>>> execution time of a process.
>>>
>> ... but would that also apply to a VM that is not scheduled --just
>> because of pCPU contention, not because it was paused-- for a few time?
>>
>
> Thanks for your suggestion. I have tried to see whether this issue
> exists on pCPU sharing today. Unfortunately, I found this issue was
> there, not only for capping case, but also for pCPU sharing case.
>
> In the above both cases, the process vruntime accounting in guest OS
> has "vruntime jump", which might cause that victim process to have
> poor and unpredictable performance.
>
> In the cloud, from my point of view, the VM exists in three scenarios:
> 1, dedicated hardware(in this case, VM = Physical Machine);
> 2, part of dedicated hardware(using capping, like Amazon EC2 T2.small instance);
> 3, sharing with other VMs on the same hardware;
>
> Both case#2 and case#3 will be influenced due to the issue I mentioned.
>
>
>> Isn't there anything in place in Xen or Linux (the latter being better
>> suitable for something like this, IMHO) to compensate for that?
>>
>
> No. I do not think so. I think this is a bug in Linux kernel under
> virtualization(vmm platform is Xen).
>
>> I have to admit I haven't really ever checked myself, maybe either
>> George or our Linux people do know more?
>
> The issue behind it is that the process execution calculation(e.g.,
> delta_exec) in virtualized environment should not be calculated as it
> did in physical environment.
>
> Here are two solutions to fix it:
>
> 1) Based on the vcpu->runstate.time(running/runnable/block/offline)
> changes, to determine how much time the process on this VCPU is
> running, instead of just "delta_exec = now - exec_start";
>
> 2) Build another clock inside the guest OS which records the exact
> time that the VCPU runs. All vruntime calculation is based on this
> clock, instead of hypervisor clock/time(real clock).
>
> Thanks.
>

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-KVM_guest_timing_management-Steal_time_accounting.html

This is what Red Hat did in KVM to address the steal time accounting
issue in the guest OS. I hope Xen can fix this issue in the future.


>>
>>> This issue will make a great performance harm to the victim process.
>>> If the process is an I/O-bound workload, its throughput and latency
>>> will be influenced. If the process is a CPU-bound workload, this
>>> issue
>>> will make its vruntime "unfair" compared to other processes under
>>> CFS.
>>>
>>> Because the CPU resource of some type VMs in the cloud are limited as
>>> the above describes(like Amazon EC2 t2.small instance), I doubt that
>>> will also harm the performance of public cloud instances.
>>> ---------------------------------------
>>>
>>>
>>> My test environment is as follows: Hypervisor(Xen 4.5.0), Dom 0(Linux
>>> 3.18.21), Dom U(Linux 3.18.21). I also test longterm version Linux
>>> 3.18.30 and the latest longterm version, Linux 4.4.7. Those kernels
>>> all have this issue.
>>>
>>> Please confirm this bug. Thanks.
>>>
>>>
>> --
>> <<This happens because I choose it to happen!>> (Raistlin Majere)
>> -----------------------------------------------------------------
>> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>>
>
> --
> Tony. S
> Ph. D student of University of Colorado, Colorado Springs


--
Tony. S
Ph. D student of University of Colorado, Colorado Springs

* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-16 22:33     ` Boris Ostrovsky
@ 2016-05-17  9:33       ` George Dunlap
  2016-05-17  9:45         ` Juergen Gross
  2016-05-18 12:24         ` Juergen Gross
  0 siblings, 2 replies; 12+ messages in thread
From: George Dunlap @ 2016-05-17  9:33 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Juergen Gross, Matt Fleming, Dario Faggioli,
	xen-devel@lists.xen.org, David Vrabel, Tony S

On Mon, May 16, 2016 at 11:33 PM, Boris Ostrovsky
<boris.ostrovsky@oracle.com> wrote:
> On 05/16/2016 05:38 PM, Tony S wrote:
>> The issue behind it is that the process execution calculation(e.g.,
>> delta_exec) in virtualized environment should not be calculated as it
>> did in physical environment.
>>
>> Here are two solutions to fix it:
>>
>> 1) Based on the vcpu->runstate.time(running/runnable/block/offline)
>> changes, to determine how much time the process on this VCPU is
>> running, instead of just "delta_exec = now - exec_start";
>>
>> 2) Build another clock inside the guest OS which records the exact
>> time that the VCPU runs. All vruntime calculation is based on this
>> clock, instead of hypervisor clock/time(real clock).
>
> Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting process
> times. KVM uses it but Xen doesn't.

Is someone on the Linux side going to put this on their to-do list then? :-)

 -George

* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-17  9:33       ` George Dunlap
@ 2016-05-17  9:45         ` Juergen Gross
  2016-05-18 12:24         ` Juergen Gross
  1 sibling, 0 replies; 12+ messages in thread
From: Juergen Gross @ 2016-05-17  9:45 UTC (permalink / raw)
  To: George Dunlap, Boris Ostrovsky
  Cc: Tony S, Matt Fleming, Dario Faggioli, David Vrabel,
	xen-devel@lists.xen.org

On 17/05/16 11:33, George Dunlap wrote:
> On Mon, May 16, 2016 at 11:33 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 05/16/2016 05:38 PM, Tony S wrote:
>>> The issue behind it is that the process execution calculation(e.g.,
>>> delta_exec) in virtualized environment should not be calculated as it
>>> did in physical environment.
>>>
>>> Here are two solutions to fix it:
>>>
>>> 1) Based on the vcpu->runstate.time(running/runnable/block/offline)
>>> changes, to determine how much time the process on this VCPU is
>>> running, instead of just "delta_exec = now - exec_start";
>>>
>>> 2) Build another clock inside the guest OS which records the exact
>>> time that the VCPU runs. All vruntime calculation is based on this
>>> clock, instead of hypervisor clock/time(real clock).
>>
>> Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting process
>> times. KVM uses it but Xen doesn't.
> 
> Is someone on the Linux side going to put this on their to-do list then? :-)

I'll have a try.


Juergen


* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-17  9:33       ` George Dunlap
  2016-05-17  9:45         ` Juergen Gross
@ 2016-05-18 12:24         ` Juergen Gross
  2016-05-18 14:57           ` Dario Faggioli
  1 sibling, 1 reply; 12+ messages in thread
From: Juergen Gross @ 2016-05-18 12:24 UTC (permalink / raw)
  To: George Dunlap, Boris Ostrovsky
  Cc: Tony S, Matt Fleming, Dario Faggioli, David Vrabel,
	xen-devel@lists.xen.org

On 17/05/16 11:33, George Dunlap wrote:
> On Mon, May 16, 2016 at 11:33 PM, Boris Ostrovsky
> <boris.ostrovsky@oracle.com> wrote:
>> On 05/16/2016 05:38 PM, Tony S wrote:
>>> The issue behind it is that the process execution calculation(e.g.,
>>> delta_exec) in virtualized environment should not be calculated as it
>>> did in physical environment.
>>>
>>> Here are two solutions to fix it:
>>>
>>> 1) Based on the vcpu->runstate.time(running/runnable/block/offline)
>>> changes, to determine how much time the process on this VCPU is
>>> running, instead of just "delta_exec = now - exec_start";
>>>
>>> 2) Build another clock inside the guest OS which records the exact
>>> time that the VCPU runs. All vruntime calculation is based on this
>>> clock, instead of hypervisor clock/time(real clock).
>>
>> Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting process
>> times. KVM uses it but Xen doesn't.
> 
> Is someone on the Linux side going to put this on their to-do list then? :-)

Patch sent.

Support already existed for arm. What is missing is support for
paravirt_steal_rq_enabled, which requires being able to read the stolen
time of another cpu. This can't work today, as accessing another cpu's
vcpu_runstate_info isn't possible without risking inconsistent data.
I plan to add support for this, too, but this will require adding
another hypercall to map a modified vcpu_runstate_info containing an
indicator for an ongoing update of the data.
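
One way such an update indicator could be consumed by a reader is
sketched below in user-space C11. Everything here is hypothetical: the
struct layout, the choice of the top bit of state_entry_time as the
flag, and the toy_* names are assumptions for illustration, not the
interface that the planned hypercall would define. The sketch also
assumes that every update publishes a new state_entry_time value.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TOY_UPDATE_FLAG (1ULL << 63) /* hypothetical "update in progress" bit */

/* Hypothetical shared area: one writer, readers on other CPUs. */
struct toy_shared_runstate {
    _Atomic uint64_t state_entry_time; /* top bit doubles as the flag */
    _Atomic uint64_t time[4];
};

struct toy_snapshot {
    uint64_t state_entry_time;
    uint64_t time[4];
};

/* Writer: raise the flag, update the counters, then publish the new
 * entry time with the flag cleared. */
static void toy_write(struct toy_shared_runstate *s, uint64_t entry_time,
                      const uint64_t t[4])
{
    atomic_store(&s->state_entry_time,
                 atomic_load(&s->state_entry_time) | TOY_UPDATE_FLAG);
    for (int i = 0; i < 4; i++)
        atomic_store(&s->time[i], t[i]);
    atomic_store(&s->state_entry_time, entry_time & ~TOY_UPDATE_FLAG);
}

/* Reader: retry until a copy was taken with no update in flight. */
static struct toy_snapshot toy_read(struct toy_shared_runstate *s)
{
    struct toy_snapshot snap;
    uint64_t before, after;

    do {
        before = atomic_load(&s->state_entry_time);
        for (int i = 0; i < 4; i++)
            snap.time[i] = atomic_load(&s->time[i]);
        after = atomic_load(&s->state_entry_time);
    } while ((before & TOY_UPDATE_FLAG) || before != after);

    snap.state_entry_time = after;
    return snap;
}
```

The reader discards any copy taken while the flag was set or while the
entry time changed underneath it, so the counters it returns belong to
a single update.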

Juergen


* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-18 12:24         ` Juergen Gross
@ 2016-05-18 14:57           ` Dario Faggioli
  2016-05-18 16:09             ` Tony S
  0 siblings, 1 reply; 12+ messages in thread
From: Dario Faggioli @ 2016-05-18 14:57 UTC (permalink / raw)
  To: Juergen Gross, George Dunlap, Boris Ostrovsky
  Cc: Tony S, Matt Fleming, David Vrabel, xen-devel@lists.xen.org


On Wed, 2016-05-18 at 14:24 +0200, Juergen Gross wrote:
> On 17/05/16 11:33, George Dunlap wrote:
> > > Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting
> > > process
> > > times. KVM uses it but Xen doesn't.
> > Is someone on the Linux side going to put this on their to-do list
> > then? :-)
>
> Patch sent.
> 
Yep, seen it, thanks.

> Support was already existing for arm. 
>
Yes!! I remember Stefano talking about introducing it, and that was
also why I thought we had already had it on x86 for a long time.

Well, anyway... :-)

> What is missing is support for
> paravirt_steal_rq_enabled which requires to be able to read the
> stolen
> time of another cpu. This can't work today as accessing another cpu's
> vcpu_runstate_info isn't possible without risking inconsistent data.
> I plan to add support for this, too, but this will require adding
> another hypercall to map a modified vcpu_runstate_info containing an
> indicator for an ongoing update of the data.
> 
Understood.

So, Tony, are you up for trying your workload again with this patch
applied to Linux?

Most likely, it _won't_ fix all the problems you're seeing, but I'm
curious to see if it helps.

Thanks again and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-18 14:57           ` Dario Faggioli
@ 2016-05-18 16:09             ` Tony S
  2016-05-18 16:14               ` Juergen Gross
  0 siblings, 1 reply; 12+ messages in thread
From: Tony S @ 2016-05-18 16:09 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel@lists.xen.org
  Cc: Juergen Gross, George Dunlap, Boris Ostrovsky, David Vrabel,
	Matt Fleming

On Wed, May 18, 2016 at 8:57 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Wed, 2016-05-18 at 14:24 +0200, Juergen Gross wrote:
>> On 17/05/16 11:33, George Dunlap wrote:
>> > > Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting
>> > > process
>> > > times. KVM uses it but Xen doesn't.
>> > Is someone on the Linux side going to put this on their to-do list
>> > then? :-)
>>
>> Patch sent.
>>
> Yep, seen it, thanks.
>
>> Support was already existing for arm.
>>
> Yes!! I remember Stefano talking about introducing it, and that was
> also why I thought we had it already since long time on x86.
>
> Well, anyway... :-)
>
>> What is missing is support for
>> paravirt_steal_rq_enabled which requires to be able to read the
>> stolen
>> time of another cpu. This can't work today as accessing another cpu's
>> vcpu_runstate_info isn't possible without risking inconsistent data.
>> I plan to add support for this, too, but this will require adding
>> another hypercall to map a modified vcpu_runstate_info containing an
>> indicator for an ongoing update of the data.
>>
> Understood.
>
> So, Tony, up for trying again your workload with this patch applied to
> Linux?
>
> Most likely, it _won't_ fix all the problems you're seeing, but I'm
> curious to see if it helps.

Hi Dario,

I did not see the patch. Could you please send it to me? I will try to
test it later.

Best
Tony

>
> Thanks again and Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

-- 
Tony. S
Ph. D student of University of Colorado, Colorado Springs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-18 16:09             ` Tony S
@ 2016-05-18 16:14               ` Juergen Gross
  2016-05-20 12:50                 ` Juergen Gross
  0 siblings, 1 reply; 12+ messages in thread
From: Juergen Gross @ 2016-05-18 16:14 UTC (permalink / raw)
  To: Tony S, Dario Faggioli, xen-devel@lists.xen.org
  Cc: George Dunlap, Matt Fleming, Boris Ostrovsky, David Vrabel

[-- Attachment #1: Type: text/plain, Size: 1516 bytes --]

On 18/05/16 18:09, Tony S wrote:
> On Wed, May 18, 2016 at 8:57 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
>> On Wed, 2016-05-18 at 14:24 +0200, Juergen Gross wrote:
>>> On 17/05/16 11:33, George Dunlap wrote:
>>>>> Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting
>>>>> process
>>>>> times. KVM uses it but Xen doesn't.
>>>> Is someone on the Linux side going to put this on their to-do list
>>>> then? :-)
>>>
>>> Patch sent.
>>>
>> Yep, seen it, thanks.
>>
>>> Support was already existing for arm.
>>>
>> Yes!! I remember Stefano talking about introducing it, and that was
>> also why I thought we had it already since long time on x86.
>>
>> Well, anyway... :-)
>>
>>> What is missing is support for
>>> paravirt_steal_rq_enabled which requires to be able to read the
>>> stolen
>>> time of another cpu. This can't work today as accessing another cpu's
>>> vcpu_runstate_info isn't possible without risking inconsistent data.
>>> I plan to add support for this, too, but this will require adding
>>> another hypercall to map a modified vcpu_runstate_info containing an
>>> indicator for an ongoing update of the data.
>>>
>> Understood.
>>
>> So, Tony, up for trying again your workload with this patch applied to
>> Linux?
>>
>> Most likely, it _won't_ fix all the problems you're seeing, but I'm
>> curious to see if it helps.
> 
> Hi Dario,
> 
> I did not see the patch. Can you please send me the patch and I will
> try to test it later.

Here is an updated version.
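
For anyone following along without the kernel source at hand, here is a
rough user-space sketch of what a steal clock is meant to provide. It is
not code from the patch. The names runstate_model, steal_tracker and
charge_delta() are made up for illustration, and the clamping in
charge_delta() is only one plausible way a consumer could use the value.
The one thing taken from the patch is the sum itself: cumulative
nanoseconds spent runnable plus offline.

```c
#include <assert.h>
#include <stdint.h>

/* Indices follow Xen's RUNSTATE_* numbering (running, runnable, blocked, offline). */
enum { RS_RUNNING = 0, RS_RUNNABLE = 1, RS_BLOCKED = 2, RS_OFFLINE = 3 };

/* Illustrative stand-in for the time[] part of vcpu_runstate_info. */
struct runstate_model {
	uint64_t time[4];	/* cumulative ns spent in each state */
};

/* Same sum as xen_steal_clock() in the patch: runnable + offline. */
static uint64_t steal_clock(const struct runstate_model *rs)
{
	return rs->time[RS_RUNNABLE] + rs->time[RS_OFFLINE];
}

/* Hypothetical consumer state: the steal clock value already accounted for. */
struct steal_tracker {
	uint64_t prev_steal;
};

/*
 * Given a wall-clock delta in ns, return the part that should be charged
 * to the task. Stolen time is clamped to the delta; whatever does not fit
 * stays pending and is picked up by the next call.
 */
static uint64_t charge_delta(struct steal_tracker *t,
			     const struct runstate_model *rs,
			     uint64_t wall_delta)
{
	uint64_t stolen = steal_clock(rs) - t->prev_steal;

	if (stolen > wall_delta)
		stolen = wall_delta;

	t->prev_steal += stolen;
	return wall_delta - stolen;
}
```

The point of the sketch is that the steal clock is cumulative, so a
consumer only ever looks at differences between two readings.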


Juergen


[-- Attachment #2: v2-0001-xen-add-steal_clock-support-on-x86.patch --]
[-- Type: text/x-patch, Size: 6864 bytes --]

From d4a6e2217adfa8715237738b67a8989528d59cae Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 17 May 2016 14:03:02 +0200
Subject: [PATCH v2] xen: add steal_clock support on x86

With CONFIG_PARAVIRT_TIME_ACCOUNTING the kernel is capable of accounting
for time a thread wasn't able to run due to hypervisor scheduling.
Add support in Xen arch independent time handling for this feature by
moving it out of the arm arch into drivers/xen.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/arm/xen/enlighten.c    | 17 ++---------------
 arch/x86/Kconfig            |  2 +-
 arch/x86/xen/time.c         | 44 ++------------------------------------------
 drivers/xen/time.c          | 19 +++++++++++++++++++
 include/linux/kernel_stat.h |  1 -
 include/xen/xen-ops.h       |  1 +
 kernel/sched/cputime.c      | 10 ----------
 7 files changed, 25 insertions(+), 69 deletions(-)

diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 75cd734..9163b94 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -84,19 +84,6 @@ int xen_unmap_domain_gfn_range(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL_GPL(xen_unmap_domain_gfn_range);
 
-static unsigned long long xen_stolen_accounting(int cpu)
-{
-	struct vcpu_runstate_info state;
-
-	BUG_ON(cpu != smp_processor_id());
-
-	xen_get_runstate_snapshot(&state);
-
-	WARN_ON(state.state != RUNSTATE_running);
-
-	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
-}
-
 static void xen_read_wallclock(struct timespec64 *ts)
 {
 	u32 version;
@@ -355,8 +342,8 @@ static int __init xen_guest_init(void)
 
 	register_cpu_notifier(&xen_cpu_notifier);
 
-	pv_time_ops.steal_clock = xen_stolen_accounting;
-	static_key_slow_inc(&paravirt_steal_enabled);
+	xen_time_setup_guest();
+
 	if (xen_initial_domain())
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7bb1574..3be1fee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -752,7 +752,7 @@ source "arch/x86/lguest/Kconfig"
 config PARAVIRT_TIME_ACCOUNTING
 	bool "Paravirtual steal time accounting"
 	depends on PARAVIRT
-	default n
+	default y if XEN
 	---help---
 	  Select this option to enable fine granularity task steal time
 	  accounting. Time spent executing other tasks in parallel with
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a0a4e55..6be31df 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -11,8 +11,6 @@
 #include <linux/interrupt.h>
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
-#include <linux/kernel_stat.h>
-#include <linux/math64.h>
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/pvclock_gtod.h>
@@ -31,44 +29,6 @@
 
 /* Xen may fire a timer up to this many ns early */
 #define TIMER_SLOP	100000
-#define NS_PER_TICK	(1000000000LL / HZ)
-
-/* snapshots of runstate info */
-static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate_snapshot);
-
-/* unused ns of stolen time */
-static DEFINE_PER_CPU(u64, xen_residual_stolen);
-
-static void do_stolen_accounting(void)
-{
-	struct vcpu_runstate_info state;
-	struct vcpu_runstate_info *snap;
-	s64 runnable, offline, stolen;
-	cputime_t ticks;
-
-	xen_get_runstate_snapshot(&state);
-
-	WARN_ON(state.state != RUNSTATE_running);
-
-	snap = this_cpu_ptr(&xen_runstate_snapshot);
-
-	/* work out how much time the VCPU has not been runn*ing*  */
-	runnable = state.time[RUNSTATE_runnable] - snap->time[RUNSTATE_runnable];
-	offline = state.time[RUNSTATE_offline] - snap->time[RUNSTATE_offline];
-
-	*snap = state;
-
-	/* Add the appropriate number of ticks of stolen time,
-	   including any left-overs from last time. */
-	stolen = runnable + offline + __this_cpu_read(xen_residual_stolen);
-
-	if (stolen < 0)
-		stolen = 0;
-
-	ticks = iter_div_u64_rem(stolen, NS_PER_TICK, &stolen);
-	__this_cpu_write(xen_residual_stolen, stolen);
-	account_steal_ticks(ticks);
-}
 
 /* Get the TSC speed from Xen */
 static unsigned long xen_tsc_khz(void)
@@ -335,8 +295,6 @@ static irqreturn_t xen_timer_interrupt(int irq, void *dev_id)
 		ret = IRQ_HANDLED;
 	}
 
-	do_stolen_accounting();
-
 	return ret;
 }
 
@@ -431,6 +389,8 @@ static void __init xen_time_init(void)
 	xen_setup_timer(cpu);
 	xen_setup_cpu_clockevents();
 
+	xen_time_setup_guest();
+
 	if (xen_initial_domain())
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 7107842..6648a78 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -75,6 +75,15 @@ bool xen_vcpu_stolen(int vcpu)
 	return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
+static u64 xen_steal_clock(int cpu)
+{
+	struct vcpu_runstate_info state;
+
+	BUG_ON(cpu != smp_processor_id());
+	xen_get_runstate_snapshot(&state);
+	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
+}
+
 void xen_setup_runstate_info(int cpu)
 {
 	struct vcpu_register_runstate_memory_area area;
@@ -86,3 +95,13 @@ void xen_setup_runstate_info(int cpu)
 		BUG();
 }
 
+void __init xen_time_setup_guest(void)
+{
+	pv_time_ops.steal_clock = xen_steal_clock;
+
+	static_key_slow_inc(&paravirt_steal_enabled);
+	/*
+	 * We can't set paravirt_steal_rq_enabled as this would require the
+	 * capability to read another cpu's runstate info.
+	 */
+}
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 25a822f..44fda64 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -92,7 +92,6 @@ static inline void account_process_tick(struct task_struct *tsk, int user)
 extern void account_process_tick(struct task_struct *, int user);
 #endif
 
-extern void account_steal_ticks(unsigned long ticks);
 extern void account_idle_ticks(unsigned long ticks);
 
 #endif /* _LINUX_KERNEL_STAT_H */
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 86abe07..5ce51c2 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -21,6 +21,7 @@ void xen_resume_notifier_unregister(struct notifier_block *nb);
 
 bool xen_vcpu_stolen(int vcpu);
 void xen_setup_runstate_info(int cpu);
+void __init xen_time_setup_guest(void);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 
 int xen_setup_shutdown_event(void);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..8c4c6dc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -490,16 +490,6 @@ void account_process_tick(struct task_struct *p, int user_tick)
 }
 
 /*
- * Account multiple ticks of steal time.
- * @p: the process from which the cpu time has been stolen
- * @ticks: number of stolen ticks
- */
-void account_steal_ticks(unsigned long ticks)
-{
-	account_steal_time(jiffies_to_cputime(ticks));
-}
-
-/*
  * Account multiple ticks of idle time.
  * @ticks: number of stolen ticks
  */
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG] Linux process vruntime accounting in Xen
  2016-05-18 16:14               ` Juergen Gross
@ 2016-05-20 12:50                 ` Juergen Gross
  0 siblings, 0 replies; 12+ messages in thread
From: Juergen Gross @ 2016-05-20 12:50 UTC (permalink / raw)
  To: Tony S
  Cc: George Dunlap, Dario Faggioli, xen-devel@lists.xen.org,
	Matt Fleming, David Vrabel, Boris Ostrovsky

[-- Attachment #1: Type: text/plain, Size: 1951 bytes --]

On 18/05/16 18:14, Juergen Gross wrote:
> On 18/05/16 18:09, Tony S wrote:
>> On Wed, May 18, 2016 at 8:57 AM, Dario Faggioli
>> <dario.faggioli@citrix.com> wrote:
>>> On Wed, 2016-05-18 at 14:24 +0200, Juergen Gross wrote:
>>>> On 17/05/16 11:33, George Dunlap wrote:
>>>>>> Looks like CONFIG_PARAVIRT_TIME_ACCOUNTING is used for adjusting
>>>>>> process
>>>>>> times. KVM uses it but Xen doesn't.
>>>>> Is someone on the Linux side going to put this on their to-do list
>>>>> then? :-)
>>>>
>>>> Patch sent.
>>>>
>>> Yep, seen it, thanks.
>>>
>>>> Support was already existing for arm.
>>>>
>>> Yes!! I remember Stefano talking about introducing it, and that was
>>> also why I thought we had it already since long time on x86.
>>>
>>> Well, anyway... :-)
>>>
>>>> What is missing is support for
>>>> paravirt_steal_rq_enabled which requires to be able to read the
>>>> stolen
>>>> time of another cpu. This can't work today as accessing another cpu's
>>>> vcpu_runstate_info isn't possible without risking inconsistent data.
>>>> I plan to add support for this, too, but this will require adding
>>>> another hypercall to map a modified vcpu_runstate_info containing an
>>>> indicator for an ongoing update of the data.
>>>>
>>> Understood.
>>>
>>> So, Tony, up for trying again your workload with this patch applied to
>>> Linux?
>>>
>>> Most likely, it _won't_ fix all the problems you're seeing, but I'm
>>> curious to see if it helps.
>>
>> Hi Dario,
>>
>> I did not see the patch. Can you please send me the patch and I will
>> try to test it later.
> 
> Here is an updated version.

Tony, would you be interested in testing a complete solution?

This would require using a Xen 4.7 hypervisor with some patches applied
and some patches for the Linux kernel (I've done some basic tests with
kernel 4.6). I've attached the patches in case you want to try them. :-)
You should set CONFIG_PARAVIRT_TIME_ACCOUNTING=y in the kernel .config.


Juergen


[-- Attachment #2: linux-patch-01 --]
[-- Type: text/plain, Size: 7189 bytes --]

From 689b4ba8c13be73ed51e485a7f7baea593d0ce6e Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 17 May 2016 14:03:02 +0200
Subject: [PATCH v4] xen: add steal_clock support on x86

The pv_time_ops structure contains a function pointer for the
"steal_clock" functionality used only by KVM and Xen on ARM. Xen on x86
uses its own mechanism to account for the "stolen" time a thread wasn't
able to run due to hypervisor scheduling.

Add support in Xen arch independent time handling for this feature by
moving it out of the arm arch into drivers/xen and remove the x86 Xen
hack.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
V4: minor adjustments as requested by Stefano Stabellini (remove
    no longer needed #include, remove __init from header)
V3: add #include <asm/paravirt.h> to avoid build error on arm
V2: remove the x86 do_stolen_accounting() hack
---
 arch/arm/xen/enlighten.c    | 18 ++----------------
 arch/x86/xen/time.c         | 44 ++------------------------------------------
 drivers/xen/time.c          | 20 ++++++++++++++++++++
 include/linux/kernel_stat.h |  1 -
 include/xen/xen-ops.h       |  1 +
 kernel/sched/cputime.c      | 10 ----------
 6 files changed, 25 insertions(+), 69 deletions(-)

diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 75cd734..71db30c 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -12,7 +12,6 @@
 #include <xen/page.h>
 #include <xen/interface/sched.h>
 #include <xen/xen-ops.h>
-#include <asm/paravirt.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
 #include <asm/system_misc.h>
@@ -84,19 +83,6 @@ int xen_unmap_domain_gfn_range(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL_GPL(xen_unmap_domain_gfn_range);
 
-static unsigned long long xen_stolen_accounting(int cpu)
-{
-	struct vcpu_runstate_info state;
-
-	BUG_ON(cpu != smp_processor_id());
-
-	xen_get_runstate_snapshot(&state);
-
-	WARN_ON(state.state != RUNSTATE_running);
-
-	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
-}
-
 static void xen_read_wallclock(struct timespec64 *ts)
 {
 	u32 version;
@@ -355,8 +341,8 @@ static int __init xen_guest_init(void)
 
 	register_cpu_notifier(&xen_cpu_notifier);
 
-	pv_time_ops.steal_clock = xen_stolen_accounting;
-	static_key_slow_inc(&paravirt_steal_enabled);
+	xen_time_setup_guest();
+
 	if (xen_initial_domain())
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a0a4e55..6be31df 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -11,8 +11,6 @@
 #include <linux/interrupt.h>
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
-#include <linux/kernel_stat.h>
-#include <linux/math64.h>
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/pvclock_gtod.h>
@@ -31,44 +29,6 @@
 
 /* Xen may fire a timer up to this many ns early */
 #define TIMER_SLOP	100000
-#define NS_PER_TICK	(1000000000LL / HZ)
-
-/* snapshots of runstate info */
-static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate_snapshot);
-
-/* unused ns of stolen time */
-static DEFINE_PER_CPU(u64, xen_residual_stolen);
-
-static void do_stolen_accounting(void)
-{
-	struct vcpu_runstate_info state;
-	struct vcpu_runstate_info *snap;
-	s64 runnable, offline, stolen;
-	cputime_t ticks;
-
-	xen_get_runstate_snapshot(&state);
-
-	WARN_ON(state.state != RUNSTATE_running);
-
-	snap = this_cpu_ptr(&xen_runstate_snapshot);
-
-	/* work out how much time the VCPU has not been runn*ing*  */
-	runnable = state.time[RUNSTATE_runnable] - snap->time[RUNSTATE_runnable];
-	offline = state.time[RUNSTATE_offline] - snap->time[RUNSTATE_offline];
-
-	*snap = state;
-
-	/* Add the appropriate number of ticks of stolen time,
-	   including any left-overs from last time. */
-	stolen = runnable + offline + __this_cpu_read(xen_residual_stolen);
-
-	if (stolen < 0)
-		stolen = 0;
-
-	ticks = iter_div_u64_rem(stolen, NS_PER_TICK, &stolen);
-	__this_cpu_write(xen_residual_stolen, stolen);
-	account_steal_ticks(ticks);
-}
 
 /* Get the TSC speed from Xen */
 static unsigned long xen_tsc_khz(void)
@@ -335,8 +295,6 @@ static irqreturn_t xen_timer_interrupt(int irq, void *dev_id)
 		ret = IRQ_HANDLED;
 	}
 
-	do_stolen_accounting();
-
 	return ret;
 }
 
@@ -431,6 +389,8 @@ static void __init xen_time_init(void)
 	xen_setup_timer(cpu);
 	xen_setup_cpu_clockevents();
 
+	xen_time_setup_guest();
+
 	if (xen_initial_domain())
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 7107842..2257b66 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -6,6 +6,7 @@
 #include <linux/math64.h>
 #include <linux/gfp.h>
 
+#include <asm/paravirt.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
 
@@ -75,6 +76,15 @@ bool xen_vcpu_stolen(int vcpu)
 	return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
+static u64 xen_steal_clock(int cpu)
+{
+	struct vcpu_runstate_info state;
+
+	BUG_ON(cpu != smp_processor_id());
+	xen_get_runstate_snapshot(&state);
+	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
+}
+
 void xen_setup_runstate_info(int cpu)
 {
 	struct vcpu_register_runstate_memory_area area;
@@ -86,3 +96,13 @@ void xen_setup_runstate_info(int cpu)
 		BUG();
 }
 
+void __init xen_time_setup_guest(void)
+{
+	pv_time_ops.steal_clock = xen_steal_clock;
+
+	static_key_slow_inc(&paravirt_steal_enabled);
+	/*
+	 * We can't set paravirt_steal_rq_enabled as this would require the
+	 * capability to read another cpu's runstate info.
+	 */
+}
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 25a822f..44fda64 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -92,7 +92,6 @@ static inline void account_process_tick(struct task_struct *tsk, int user)
 extern void account_process_tick(struct task_struct *, int user);
 #endif
 
-extern void account_steal_ticks(unsigned long ticks);
 extern void account_idle_ticks(unsigned long ticks);
 
 #endif /* _LINUX_KERNEL_STAT_H */
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 86abe07..77bf9d1 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -21,6 +21,7 @@ void xen_resume_notifier_unregister(struct notifier_block *nb);
 
 bool xen_vcpu_stolen(int vcpu);
 void xen_setup_runstate_info(int cpu);
+void xen_time_setup_guest(void);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 
 int xen_setup_shutdown_event(void);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..8c4c6dc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -490,16 +490,6 @@ void account_process_tick(struct task_struct *p, int user_tick)
 }
 
 /*
- * Account multiple ticks of steal time.
- * @p: the process from which the cpu time has been stolen
- * @ticks: number of stolen ticks
- */
-void account_steal_ticks(unsigned long ticks)
-{
-	account_steal_time(jiffies_to_cputime(ticks));
-}
-
-/*
  * Account multiple ticks of idle time.
  * @ticks: number of stolen ticks
  */
-- 
2.6.6


[-- Attachment #3: linux-patch-02 --]
[-- Type: text/plain, Size: 2564 bytes --]

From 4073bb301aed18981ec69c3cf5f0df4fae567d7c Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Fri, 20 May 2016 09:32:30 +0200
Subject: [PATCH 1/3] xen: update xen headers

Update some Xen headers to be able to use new functionality.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 include/xen/interface/vcpu.h | 24 +++++++++++++++---------
 include/xen/interface/xen.h  | 17 ++++++++++++++++-
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
index b05288c..98188c8 100644
--- a/include/xen/interface/vcpu.h
+++ b/include/xen/interface/vcpu.h
@@ -75,15 +75,21 @@
  */
 #define VCPUOP_get_runstate_info	 4
 struct vcpu_runstate_info {
-		/* VCPU's current state (RUNSTATE_*). */
-		int		 state;
-		/* When was current state entered (system time, ns)? */
-		uint64_t state_entry_time;
-		/*
-		 * Time spent in each RUNSTATE_* (ns). The sum of these times is
-		 * guaranteed not to drift from system time.
-		 */
-		uint64_t time[4];
+	/* VCPU's current state (RUNSTATE_*). */
+	int		 state;
+	/* When was current state entered (system time, ns)? */
+	uint64_t state_entry_time;
+	/*
+	 * Update indicator set in state_entry_time:
+	 * When activated via VMASST_TYPE_runstate_update_flag, set during
+	 * updates in guest memory mapped copy of vcpu_runstate_info.
+	 */
+#define XEN_RUNSTATE_UPDATE	(1ULL << 63)
+	/*
+	 * Time spent in each RUNSTATE_* (ns). The sum of these times is
+	 * guaranteed not to drift from system time.
+	 */
+	uint64_t time[4];
 };
 DEFINE_GUEST_HANDLE_STRUCT(vcpu_runstate_info);
 
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index d133112..1b0d189 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -413,7 +413,22 @@ DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
 /* x86/PAE guests: support PDPTs above 4GB. */
 #define VMASST_TYPE_pae_extended_cr3     3
 
-#define MAX_VMASST_TYPE 3
+/*
+ * x86 guests: Sane behaviour for virtual iopl
+ *  - virtual iopl updated from do_iret() hypercalls.
+ *  - virtual iopl reported in bounce frames.
+ *  - guest kernels assumed to be level 0 for the purpose of iopl checks.
+ */
+#define VMASST_TYPE_architectural_iopl   4
+
+/*
+ * All guests: activate update indicator in vcpu_runstate_info
+ * Enable setting the XEN_RUNSTATE_UPDATE flag in guest memory mapped
+ * vcpu_runstate_info during updates of the runstate information.
+ */
+#define VMASST_TYPE_runstate_update_flag 5
+
+#define MAX_VMASST_TYPE 5
 
 #ifndef __ASSEMBLY__
 
-- 
2.6.6


[-- Attachment #4: linux-patch-03 --]
[-- Type: text/plain, Size: 2247 bytes --]

From ab457b88c03a66c6051ac022b51bc5c218f48842 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Fri, 20 May 2016 12:08:21 +0200
Subject: [PATCH 2/3] arm/xen: add support for vm_assist hypercall

Add support for the Xen HYPERVISOR_vm_assist hypercall.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/arm/include/asm/xen/hypercall.h | 1 +
 arch/arm/xen/enlighten.c             | 1 +
 arch/arm/xen/hypercall.S             | 1 +
 arch/arm64/xen/hypercall.S           | 1 +
 4 files changed, 4 insertions(+)

diff --git a/arch/arm/include/asm/xen/hypercall.h b/arch/arm/include/asm/xen/hypercall.h
index b6b962d..9d874db 100644
--- a/arch/arm/include/asm/xen/hypercall.h
+++ b/arch/arm/include/asm/xen/hypercall.h
@@ -52,6 +52,7 @@ int HYPERVISOR_memory_op(unsigned int cmd, void *arg);
 int HYPERVISOR_physdev_op(int cmd, void *arg);
 int HYPERVISOR_vcpu_op(int cmd, int vcpuid, void *extra_args);
 int HYPERVISOR_tmem_op(void *arg);
+int HYPERVISOR_vm_assist(unsigned int cmd, unsigned int type);
 int HYPERVISOR_platform_op_raw(void *arg);
 static inline int HYPERVISOR_platform_op(struct xen_platform_op *op)
 {
diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 71db30c..0f3aa12 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -389,4 +389,5 @@ EXPORT_SYMBOL_GPL(HYPERVISOR_vcpu_op);
 EXPORT_SYMBOL_GPL(HYPERVISOR_tmem_op);
 EXPORT_SYMBOL_GPL(HYPERVISOR_platform_op);
 EXPORT_SYMBOL_GPL(HYPERVISOR_multicall);
+EXPORT_SYMBOL_GPL(HYPERVISOR_vm_assist);
 EXPORT_SYMBOL_GPL(privcmd_call);
diff --git a/arch/arm/xen/hypercall.S b/arch/arm/xen/hypercall.S
index 9a36f4f..a648dfc 100644
--- a/arch/arm/xen/hypercall.S
+++ b/arch/arm/xen/hypercall.S
@@ -91,6 +91,7 @@ HYPERCALL3(vcpu_op);
 HYPERCALL1(tmem_op);
 HYPERCALL1(platform_op_raw);
 HYPERCALL2(multicall);
+HYPERCALL2(vm_assist);
 
 ENTRY(privcmd_call)
 	stmdb sp!, {r4}
diff --git a/arch/arm64/xen/hypercall.S b/arch/arm64/xen/hypercall.S
index 70df80e..329c802 100644
--- a/arch/arm64/xen/hypercall.S
+++ b/arch/arm64/xen/hypercall.S
@@ -82,6 +82,7 @@ HYPERCALL3(vcpu_op);
 HYPERCALL1(tmem_op);
 HYPERCALL1(platform_op_raw);
 HYPERCALL2(multicall);
+HYPERCALL2(vm_assist);
 
 ENTRY(privcmd_call)
 	mov x16, x0
-- 
2.6.6


[-- Attachment #5: linux-patch-04 --]
[-- Type: text/plain, Size: 3143 bytes --]

From f27da1aba6c9c92add4f88b4dcec517e5e321caa Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Fri, 20 May 2016 12:25:58 +0200
Subject: [PATCH 3/3] xen: support runqueue steal time on xen

Up to now reading the stolen time of a remote cpu was not possible in a
performant way under Xen. This made support of runqueue steal time via
paravirt_steal_rq_enabled impossible.

With the addition of an appropriate hypervisor interface this is now
possible, so add the support.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 drivers/xen/time.c | 42 +++++++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 2257b66..04b6cb7 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,9 @@
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
+/* runstate info of remote cpu accessible */
+static bool xen_runstate_remote;
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -47,27 +50,31 @@ static u64 get64(const u64 *p)
 	return ret;
 }
 
-/*
- * Runstate accounting
- */
-void xen_get_runstate_snapshot(struct vcpu_runstate_info *res)
+static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
+                                          unsigned cpu)
 {
 	u64 state_time;
 	struct vcpu_runstate_info *state;
 
 	BUG_ON(preemptible());
 
-	state = this_cpu_ptr(&xen_runstate);
+	state = per_cpu_ptr(&xen_runstate, cpu);
 
-	/*
-	 * The runstate info is always updated by the hypervisor on
-	 * the current CPU, so there's no need to use anything
-	 * stronger than a compiler barrier when fetching it.
-	 */
 	do {
 		state_time = get64(&state->state_entry_time);
+		rmb();
 		*res = READ_ONCE(*state);
-	} while (get64(&state->state_entry_time) != state_time);
+		rmb();
+	} while (get64(&state->state_entry_time) != state_time ||
+		 (state_time & XEN_RUNSTATE_UPDATE));
+}
+
+/*
+ * Runstate accounting
+ */
+void xen_get_runstate_snapshot(struct vcpu_runstate_info *res)
+{
+	xen_get_runstate_snapshot_cpu(res, smp_processor_id());
 }
 
 /* return true when a vcpu could run but has no real cpu to run on */
@@ -80,8 +87,8 @@ static u64 xen_steal_clock(int cpu)
 {
 	struct vcpu_runstate_info state;
 
-	BUG_ON(cpu != smp_processor_id());
-	xen_get_runstate_snapshot(&state);
+	BUG_ON(!xen_runstate_remote && cpu != smp_processor_id());
+	xen_get_runstate_snapshot_cpu(&state, cpu);
 	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
 }
 
@@ -98,11 +105,12 @@ void xen_setup_runstate_info(int cpu)
 
 void __init xen_time_setup_guest(void)
 {
+	xen_runstate_remote = !HYPERVISOR_vm_assist(VMASST_CMD_enable,
+					VMASST_TYPE_runstate_update_flag);
+
 	pv_time_ops.steal_clock = xen_steal_clock;
 
 	static_key_slow_inc(&paravirt_steal_enabled);
-	/*
-	 * We can't set paravirt_steal_rq_enabled as this would require the
-	 * capability to read another cpu's runstate info.
-	 */
+	if (xen_runstate_remote)
+		static_key_slow_inc(&paravirt_steal_rq_enabled);
 }
-- 
2.6.6


[-- Attachment #6: xen-patch-01 --]
[-- Type: text/plain, Size: 2417 bytes --]

From 39c8fea3b4fdcf634e4858748f77fc4ee2ddc4cd Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Thu, 19 May 2016 18:19:42 +0200
Subject: [PATCH 1/2] xen/arm: add support for vm_assist hypercall

Up to now the vm_assist hypercall hasn't been supported on ARM, as
there are only x86 specific features to switch. Add support of
vm_assist on ARM for future use.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/arm/traps.c         | 1 +
 xen/common/domain.c          | 2 --
 xen/common/kernel.c          | 2 --
 xen/include/asm-arm/config.h | 2 ++
 4 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 1828ea1..ccc6351 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -1284,6 +1284,7 @@ static arm_hypercall_t arm_hypercall_table[] = {
     HYPERCALL(multicall, 2),
     HYPERCALL(platform_op, 1),
     HYPERCALL_ARM(vcpu_op, 3),
+    HYPERCALL(vm_assist, 2),
 };
 
 #ifndef NDEBUG
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 45273d4..0afb1ee 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1408,7 +1408,6 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
     return rc;
 }
 
-#ifdef VM_ASSIST_VALID
 long vm_assist(struct domain *p, unsigned int cmd, unsigned int type,
                unsigned long valid)
 {
@@ -1427,7 +1426,6 @@ long vm_assist(struct domain *p, unsigned int cmd, unsigned int type,
 
     return -ENOSYS;
 }
-#endif
 
 struct pirq *pirq_get_info(struct domain *d, int pirq)
 {
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 1a6823a..74b6e1f 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -441,12 +441,10 @@ DO(nmi_op)(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     return rc;
 }
 
-#ifdef VM_ASSIST_VALID
 DO(vm_assist)(unsigned int cmd, unsigned int type)
 {
     return vm_assist(current->domain, cmd, type, VM_ASSIST_VALID);
 }
-#endif
 
 DO(ni_hypercall)(void)
 {
diff --git a/xen/include/asm-arm/config.h b/xen/include/asm-arm/config.h
index 2d11b62..563f49b 100644
--- a/xen/include/asm-arm/config.h
+++ b/xen/include/asm-arm/config.h
@@ -199,6 +199,8 @@ extern unsigned long frametable_virt_end;
 #define watchdog_disable() ((void)0)
 #define watchdog_enable()  ((void)0)
 
+#define VM_ASSIST_VALID          (0)
+
 #endif /* __ARM_CONFIG_H__ */
 /*
  * Local variables:
-- 
2.6.6


[-- Attachment #7: xen-patch-02 --]
[-- Type: text/plain, Size: 6980 bytes --]

From f1b30599d330fcb4086b92125cc36cfc1c95fe63 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Fri, 20 May 2016 06:32:12 +0200
Subject: [PATCH 2/2] xen: add update indicator to vcpu_runstate_info

In order to support reading another vcpu's mapped vcpu_runstate_info,
the guest needs an indicator that an update of the runstate
information is in progress.

Add the possibility to activate setting this indicator in the highest
bit of state_entry_time via a vm_assist hypercall. When activated, the
update indicator is set before the runstate information is modified in
guest memory and cleared again after the modification is done. As
state_entry_time is guaranteed to be different after each update, the
guest can detect any update (either already in progress or occurring
while it reads the runstate data) by comparing state_entry_time before
and after reading the runstate data: if the values differ or the update
indicator was set, the data might be inconsistent and should be reread.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/arm/domain.c        | 22 ++++++++++++++++++++++
 xen/arch/x86/domain.c        | 31 +++++++++++++++++++++++++++++++
 xen/include/asm-arm/config.h |  2 +-
 xen/include/asm-x86/config.h |  1 +
 xen/include/public/vcpu.h    |  6 ++++++
 xen/include/public/xen.h     |  7 +++++++
 6 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 1365b4a..91f256b 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -239,10 +239,32 @@ static void ctxt_switch_to(struct vcpu *n)
 /* Update per-VCPU guest runstate shared memory area (if registered). */
 static void update_runstate_area(struct vcpu *v)
 {
+    bool_t update_flag;
+    void __user *guest_handle = NULL;
+    unsigned off = 0;
+
     if ( guest_handle_is_null(runstate_guest(v)) )
         return;
 
+    update_flag = VM_ASSIST(v->domain, runstate_update_flag);
+    if ( update_flag )
+    {
+        off = offsetof(struct vcpu_runstate_info, state_entry_time) + 7;
+        guest_handle = v->runstate_guest.p;
+        guest_handle += off;
+        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+        __raw_copy_to_guest(guest_handle, (void *)&v->runstate + off, 1);
+        wmb();
+    }
+
     __copy_to_guest(runstate_guest(v), &v->runstate, 1);
+
+    if ( update_flag )
+    {
+        v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        wmb();
+        __raw_copy_to_guest(guest_handle, (void *)&v->runstate + off, 1);
+    }
 }
 
 static void schedule_tail(struct vcpu *prev)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 5af2cc5..dfaee5d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1925,13 +1925,37 @@ static void paravirt_ctxt_switch_to(struct vcpu *v)
 bool_t update_runstate_area(struct vcpu *v)
 {
     bool_t rc;
+    bool_t update_flag;
     smap_check_policy_t smap_policy;
+    void __user *guest_handle = NULL;
+    unsigned off = 0;
 
     if ( guest_handle_is_null(runstate_guest(v)) )
         return 1;
 
+    update_flag = VM_ASSIST(v->domain, runstate_update_flag);
+
     smap_policy = smap_policy_change(v, SMAP_CHECK_ENABLED);
 
+    if ( update_flag )
+    {
+        off = offsetof(struct vcpu_runstate_info, state_entry_time) + 7;
+        if ( has_32bit_shinfo(v->domain) )
+        {
+            guest_handle = v->runstate_guest.compat.p;
+            guest_handle += offsetof(struct compat_vcpu_runstate_info,
+                                     state_entry_time) + 7;
+        }
+        else
+        {
+            guest_handle = v->runstate_guest.native.p;
+            guest_handle += off;
+        }
+        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+        __raw_copy_to_guest(guest_handle, (void *)&v->runstate + off, 1);
+        wmb();
+    }
+
     if ( has_32bit_shinfo(v->domain) )
     {
         struct compat_vcpu_runstate_info info;
@@ -1944,6 +1968,13 @@ bool_t update_runstate_area(struct vcpu *v)
         rc = __copy_to_guest(runstate_guest(v), &v->runstate, 1) !=
              sizeof(v->runstate);
 
+    if ( update_flag )
+    {
+        v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        wmb();
+        __raw_copy_to_guest(guest_handle, (void *)&v->runstate + off, 1);
+    }
+
     smap_policy_change(v, smap_policy);
 
     return rc;
diff --git a/xen/include/asm-arm/config.h b/xen/include/asm-arm/config.h
index 563f49b..ce3edc2 100644
--- a/xen/include/asm-arm/config.h
+++ b/xen/include/asm-arm/config.h
@@ -199,7 +199,7 @@ extern unsigned long frametable_virt_end;
 #define watchdog_disable() ((void)0)
 #define watchdog_enable()  ((void)0)
 
-#define VM_ASSIST_VALID          (0)
+#define VM_ASSIST_VALID          (1UL << VMASST_TYPE_runstate_update_flag)
 
 #endif /* __ARM_CONFIG_H__ */
 /*
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index c10129d..6fd84e7 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -332,6 +332,7 @@ extern unsigned long xen_phys_start;
                                   (1UL << VMASST_TYPE_writable_pagetables) | \
                                   (1UL << VMASST_TYPE_pae_extended_cr3)    | \
                                   (1UL << VMASST_TYPE_architectural_iopl)  | \
+                                  (1UL << VMASST_TYPE_runstate_update_flag)| \
                                   (1UL << VMASST_TYPE_m2p_strict))
 #define VM_ASSIST_VALID          NATIVE_VM_ASSIST_VALID
 #define COMPAT_VM_ASSIST_VALID   (NATIVE_VM_ASSIST_VALID & \
diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h
index 692b87a..2aa230d 100644
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -84,6 +84,12 @@ struct vcpu_runstate_info {
     /* When was current state entered (system time, ns)? */
     uint64_t state_entry_time;
     /*
+     * Update indicator set in state_entry_time:
+     * When activated via VMASST_TYPE_runstate_update_flag, set during
+     * updates in guest memory mapped copy of vcpu_runstate_info.
+     */
+#define XEN_RUNSTATE_UPDATE          (1ULL << 63)
+    /*
      * Time spent in each RUNSTATE_* (ns). The sum of these times is
      * guaranteed not to drift from system time.
      */
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 37bbb22..b9e5e0f 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -510,6 +510,13 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
 #define VMASST_TYPE_architectural_iopl   4
 
 /*
+ * All guests: activate update indicator in vcpu_runstate_info
+ * Enable setting the XEN_RUNSTATE_UPDATE flag in guest memory mapped
+ * vcpu_runstate_info during updates of the runstate information.
+ */
+#define VMASST_TYPE_runstate_update_flag 5
+
+/*
  * x86/64 guests: strictly hide M2P from user mode.
  * This allows the guest to control respective hypervisor behavior:
  * - when not set, L4 tables get created with the respective slot blank,
-- 
2.6.6
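
The commit message of patch 2/2 describes how a guest is expected to
detect a concurrent update: read state_entry_time, copy the runstate
data, read state_entry_time again, and retry if the two values differ or
the update indicator is set. The following is a minimal guest-side
sketch of that retry loop, not code taken from the patch or from any
guest kernel. The names runstate_info, runstate_try_snapshot,
runstate_snapshot and read_barrier are hypothetical and used only for
illustration; the struct is assumed to mirror the layout of the public
struct vcpu_runstate_info, and the compiler fence is only a stand-in for
the read barrier a real guest would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Update indicator in the highest bit of state_entry_time (from the patch). */
#define XEN_RUNSTATE_UPDATE (1ULL << 63)

/* Assumed to mirror the public struct vcpu_runstate_info. */
struct runstate_info {
    int      state;             /* current RUNSTATE_* value */
    uint64_t state_entry_time;  /* system time (ns) the state was entered */
    uint64_t time[4];           /* time spent in each RUNSTATE_* (ns) */
};

/* Stand-in for the guest's read barrier (assumption of this sketch). */
#define read_barrier() __atomic_thread_fence(__ATOMIC_ACQUIRE)

/*
 * Make one attempt to copy the shared area into *out.
 * Returns 1 if the copy is consistent, 0 if an update was in progress
 * or happened while the data was being read.
 */
static int runstate_try_snapshot(const volatile struct runstate_info *shared,
                                 struct runstate_info *out)
{
    uint64_t before, after;
    size_t i;

    before = shared->state_entry_time;
    read_barrier();

    out->state = shared->state;
    out->state_entry_time = shared->state_entry_time;
    for (i = 0; i < 4; i++)
        out->time[i] = shared->time[i];

    read_barrier();
    after = shared->state_entry_time;

    /* Flag set: update in progress. Values differ: update happened. */
    if ((before & XEN_RUNSTATE_UPDATE) || before != after)
        return 0;

    return 1;
}

/* Retry until a consistent snapshot has been obtained. */
static void runstate_snapshot(const volatile struct runstate_info *shared,
                              struct runstate_info *out)
{
    assert(shared != NULL && out != NULL);

    while (!runstate_try_snapshot(shared, out))
        ; /* the hypervisor clears the flag once its update is done */
}
```

A guest would call runstate_snapshot() on the mapped area of the vcpu it
wants to inspect. Because the retry only triggers while the hypervisor
is writing the area, the loop is expected to terminate after very few
iterations. Without VMASST_TYPE_runstate_update_flag enabled, the flag
is never set and only the before/after comparison applies.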



end of thread, other threads:[~2016-05-20 12:50 UTC | newest]

Thread overview: 12+ messages
2016-05-15  0:25 [BUG] Linux process vruntime accounting in Xen Tony S
2016-05-16 11:37 ` Dario Faggioli
2016-05-16 21:38   ` Tony S
2016-05-16 22:33     ` Boris Ostrovsky
2016-05-17  9:33       ` George Dunlap
2016-05-17  9:45         ` Juergen Gross
2016-05-18 12:24         ` Juergen Gross
2016-05-18 14:57           ` Dario Faggioli
2016-05-18 16:09             ` Tony S
2016-05-18 16:14               ` Juergen Gross
2016-05-20 12:50                 ` Juergen Gross
2016-05-16 22:33     ` Tony S
