* [PATCH RFC v3 0/6] x86/idle: add halt poll support
@ 2017-11-13 8:16 Quan Xu
0 siblings, 0 replies; 4+ messages in thread
From: Quan Xu @ 2017-11-13 8:16 UTC (permalink / raw)
To: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
xen-devel
Cc: Yang Zhang
From: Yang Zhang <yang.zhang.wz@gmail.com>
Some latency-intensive workloads see an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM; the largest cost I have seen
is in the idle path.
This series introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll,
we avoid going through the heavy-overhead path.
Here is the data we get when running the contextswitch benchmark to
measure latency (lower is better):
1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
3402.9 ns/ctxsw -- 199.8 %CPU
2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
halt_poll_threshold=10000 -- 1151.4 ns/ctxsw -- 200.1 %CPU
halt_poll_threshold=20000 -- 1149.7 ns/ctxsw -- 199.9 %CPU
halt_poll_threshold=30000 -- 1151.0 ns/ctxsw -- 199.9 %CPU
halt_poll_threshold=40000 -- 1155.4 ns/ctxsw -- 199.3 %CPU
halt_poll_threshold=50000 -- 1161.0 ns/ctxsw -- 200.0 %CPU
halt_poll_threshold=100000 -- 1163.8 ns/ctxsw -- 200.4 %CPU
halt_poll_threshold=300000 -- 1159.4 ns/ctxsw -- 201.9 %CPU
halt_poll_threshold=500000 -- 1163.5 ns/ctxsw -- 205.5 %CPU
3. w/ kvm dynamic poll:
halt_poll_ns=10000 -- 3470.5 ns/ctxsw -- 199.6 %CPU
halt_poll_ns=20000 -- 3273.0 ns/ctxsw -- 199.7 %CPU
halt_poll_ns=30000 -- 3628.7 ns/ctxsw -- 199.4 %CPU
halt_poll_ns=40000 -- 2280.6 ns/ctxsw -- 199.5 %CPU
halt_poll_ns=50000 -- 3200.3 ns/ctxsw -- 199.7 %CPU
halt_poll_ns=100000 -- 2186.6 ns/ctxsw -- 199.6 %CPU
halt_poll_ns=300000 -- 3178.7 ns/ctxsw -- 199.6 %CPU
halt_poll_ns=500000 -- 3505.4 ns/ctxsw -- 199.7 %CPU
4. w/ patch and w/ kvm dynamic poll:
halt_poll_ns=10000 & halt_poll_threshold=10000 -- 1155.5 ns/ctxsw -- 199.8 %CPU
halt_poll_ns=10000 & halt_poll_threshold=20000 -- 1165.6 ns/ctxsw -- 199.8 %CPU
halt_poll_ns=10000 & halt_poll_threshold=30000 -- 1161.1 ns/ctxsw -- 200.0 %CPU
halt_poll_ns=20000 & halt_poll_threshold=10000 -- 1158.1 ns/ctxsw -- 199.8 %CPU
halt_poll_ns=20000 & halt_poll_threshold=20000 -- 1161.0 ns/ctxsw -- 199.7 %CPU
halt_poll_ns=20000 & halt_poll_threshold=30000 -- 1163.7 ns/ctxsw -- 199.9 %CPU
halt_poll_ns=30000 & halt_poll_threshold=10000 -- 1158.7 ns/ctxsw -- 199.7 %CPU
halt_poll_ns=30000 & halt_poll_threshold=20000 -- 1153.8 ns/ctxsw -- 199.8 %CPU
halt_poll_ns=30000 & halt_poll_threshold=30000 -- 1155.1 ns/ctxsw -- 199.8 %CPU
5. idle=poll
3957.57 ns/ctxsw -- 999.4 %CPU
Here is the data we get when running the netperf benchmark:
1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
29031.6 bit/s -- 76.1 %CPU
2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
halt_poll_threshold=10000 -- 29021.7 bit/s -- 105.1 %CPU
halt_poll_threshold=20000 -- 33463.5 bit/s -- 128.2 %CPU
halt_poll_threshold=30000 -- 34436.4 bit/s -- 127.8 %CPU
halt_poll_threshold=40000 -- 35563.3 bit/s -- 129.6 %CPU
halt_poll_threshold=50000 -- 35787.7 bit/s -- 129.4 %CPU
halt_poll_threshold=100000 -- 35477.7 bit/s -- 130.0 %CPU
halt_poll_threshold=300000 -- 35730.0 bit/s -- 132.4 %CPU
halt_poll_threshold=500000 -- 34978.4 bit/s -- 134.2 %CPU
3. w/ kvm dynamic poll:
halt_poll_ns=10000 -- 28849.8 bit/s -- 75.2 %CPU
halt_poll_ns=20000 -- 29004.8 bit/s -- 76.1 %CPU
halt_poll_ns=30000 -- 35662.0 bit/s -- 199.7 %CPU
halt_poll_ns=40000 -- 35874.8 bit/s -- 187.5 %CPU
halt_poll_ns=50000 -- 35603.1 bit/s -- 199.8 %CPU
halt_poll_ns=100000 -- 35588.8 bit/s -- 200.0 %CPU
halt_poll_ns=300000 -- 35912.4 bit/s -- 200.0 %CPU
halt_poll_ns=500000 -- 35735.6 bit/s -- 200.0 %CPU
4. w/ patch and w/ kvm dynamic poll:
halt_poll_ns=10000 & halt_poll_threshold=10000 -- 29427.9 bit/s -- 107.8 %CPU
halt_poll_ns=10000 & halt_poll_threshold=20000 -- 33048.4 bit/s -- 128.1 %CPU
halt_poll_ns=10000 & halt_poll_threshold=30000 -- 35129.8 bit/s -- 129.1 %CPU
halt_poll_ns=20000 & halt_poll_threshold=10000 -- 31091.3 bit/s -- 130.3 %CPU
halt_poll_ns=20000 & halt_poll_threshold=20000 -- 33587.9 bit/s -- 128.9 %CPU
halt_poll_ns=20000 & halt_poll_threshold=30000 -- 35532.9 bit/s -- 129.1 %CPU
halt_poll_ns=30000 & halt_poll_threshold=10000 -- 35633.1 bit/s -- 199.4 %CPU
halt_poll_ns=30000 & halt_poll_threshold=20000 -- 42225.3 bit/s -- 198.7 %CPU
halt_poll_ns=30000 & halt_poll_threshold=30000 -- 42210.7 bit/s -- 200.3 %CPU
5. idle=poll
37081.7 bit/s -- 998.1 %CPU
---
V2 -> V3:
- move the poll update into arch/. In v3, the poll update is based on the
duration of the last idle loop, which runs from tick_nohz_idle_enter to
tick_nohz_idle_exit, and we try our best not to interfere with the
scheduler/idle code. (This seems not to follow Peter's v2 comment;
however, we had a f2f discussion about it in Prague.)
- enhance the patch descriptions.
- enhance the Documentation and sysctls.
- tested with the IRQ_TIMINGS related code, which does not seem to work so far.
V1 -> V2:
- integrate the smart halt poll into paravirt code
- use idle_stamp instead of check_poll
- since it is hard to know whether the vcpu is the only task on the pcpu,
we don't consider that case in this series. (We may improve it in the future.)
---
Quan Xu (4):
x86/paravirt: Add pv_idle_ops to paravirt ops
KVM guest: register kvm_idle_poll for pv_idle_ops
Documentation: Add three sysctls for smart idle poll
tick: get duration of the last idle loop
Yang Zhang (2):
sched/idle: Add a generic poll before enter real idle path
KVM guest: introduce smart idle poll algorithm
Documentation/sysctl/kernel.txt | 35 ++++++++++++++++
arch/x86/include/asm/paravirt.h | 5 ++
arch/x86/include/asm/paravirt_types.h | 6 +++
arch/x86/kernel/kvm.c | 73 +++++++++++++++++++++++++++++++++
arch/x86/kernel/paravirt.c | 10 +++++
arch/x86/kernel/process.c | 7 +++
include/linux/kernel.h | 6 +++
include/linux/tick.h | 2 +
kernel/sched/idle.c | 2 +
kernel/sysctl.c | 34 +++++++++++++++
kernel/time/tick-sched.c | 11 +++++
kernel/time/tick-sched.h | 3 +
12 files changed, 194 insertions(+), 0 deletions(-)
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
* [PATCH RFC v3 0/6] x86/idle: add halt poll support
@ 2017-11-13 10:05 Quan Xu
2017-11-15 21:31 ` Konrad Rzeszutek Wilk
[not found] ` <20171115213131.GB21113@char.us.oracle.com>
0 siblings, 2 replies; 4+ messages in thread
From: Quan Xu @ 2017-11-13 10:05 UTC (permalink / raw)
To: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
xen-devel
Cc: Yang Zhang
[Repost of the cover letter above.]
* Re: [PATCH RFC v3 0/6] x86/idle: add halt poll support
2017-11-13 10:05 Quan Xu
@ 2017-11-15 21:31 ` Konrad Rzeszutek Wilk
[not found] ` <20171115213131.GB21113@char.us.oracle.com>
1 sibling, 0 replies; 4+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-11-15 21:31 UTC (permalink / raw)
To: Quan Xu
Cc: Yang Zhang, kvm, linux-doc, x86, linux-kernel, virtualization,
linux-fsdevel, xen-devel
On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote:
> From: Yang Zhang <yang.zhang.wz@gmail.com>
>
> Some latency-intensive workload have seen obviously performance
> drop when running inside VM. The main reason is that the overhead
> is amplified when running inside VM. The most cost I have seen is
> inside idle path.
Meaning a VMEXIT b/c it is a 'halt' operation? And then going
back into the guest (VMRESUME) takes time. And hence your latency gets
all whacked b/c of this?
So if I understand - you want to use your _full_ timeslice (of the guest)
without ever (or as rarely as possible) going into the hypervisor?
Which means, in effect, you don't care about power-saving or CPUfreq
savings, you just want to eat the full CPU for a snack?
>
> This patch introduces a new mechanism to poll for a while before
> entering idle state. If schedule is needed during poll, then we
> don't need to goes through the heavy overhead path.
Schedule of what? The guest or the host?
* Re: [PATCH RFC v3 0/6] x86/idle: add halt poll support
[not found] ` <20171115213131.GB21113@char.us.oracle.com>
@ 2017-11-20 7:18 ` Quan Xu
0 siblings, 0 replies; 4+ messages in thread
From: Quan Xu @ 2017-11-20 7:18 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Quan Xu
Cc: Yang Zhang, kvm, linux-doc, x86, linux-kernel, virtualization,
linux-fsdevel, xen-devel
On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote:
>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>
>> Some latency-intensive workload have seen obviously performance
>> drop when running inside VM. The main reason is that the overhead
>> is amplified when running inside VM. The most cost I have seen is
>> inside idle path.
> Meaning an VMEXIT b/c it is an 'halt' operation ? And then going
> back in guest (VMRESUME) takes time. And hence your latency gets
> all whacked b/c of this?
Konrad, I can't follow 'b/c' here... sorry.
> So if I understand - you want to use your _full_ timeslice (of the guest)
> without ever (or as much as possible) to go in the hypervisor?
as much as possible.
> Which means in effect you don't care about power-saving or CPUfreq
> savings, you just want to eat the full CPU for snack?
Actually, we do care about power saving. The poll duration is
self-tuning; otherwise it would be almost the same as
'idle=poll'. Also, we always report the CPU usage alongside the
netperf/ctxsw benchmark results. We got much better performance
with only a limited increase in CPU usage.
>> This patch introduces a new mechanism to poll for a while before
>> entering idle state. If schedule is needed during poll, then we
>> don't need to goes through the heavy overhead path.
> Schedule of what? The guest or the host?
A reschedule by the guest scheduler;
it is the guest.
Quan
Alibaba Cloud
>
>