[PATCH 0/2] x86: improve remote CPU wakeup

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] x86: improve remote CPU wakeup
@ 2014-09-11  9:36 Jan Beulich
  2014-09-11  9:40 ` [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Jan Beulich @ 2014-09-11  9:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Campbell, Keir Fraser, Ian Jackson, Tim Deegan

Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed (average 1.1ms).
Considering that Windows broadcasts IPIs from its timer interrupt,
which at least at certain times can run at 1kHz, it is clear that this
can't result in good guest behavior. In fact, on said hardware guests
with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
gets started.

The two patches bring down the average to 250us on said hardware.

1: x86: suppress event check IPI to MWAITing CPUs
2: x86/HVM: batch vCPU wakeups

Signed-off-by: Jan Beulich <jbeulich@suse.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs
  2014-09-11  9:36 [PATCH 0/2] x86: improve remote CPU wakeup Jan Beulich
@ 2014-09-11  9:40 ` Jan Beulich
  2014-09-11 10:02   ` Andrew Cooper
  2014-09-11  9:40 ` [PATCH 2/2] x86/HVM: batch vCPU wakeups Jan Beulich
  2014-09-18 10:59 ` [PATCH 0/2] x86: improve remote CPU wakeup Tim Deegan
  2 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2014-09-11  9:40 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Campbell, Keir Fraser, Ian Jackson, Tim Deegan

[-- Attachment #1: Type: text/plain, Size: 3740 bytes --]

Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed (average 1.1ms).
Considering that Windows broadcasts IPIs from its timer interrupt,
which at least at certain times can run at 1kHz, it is clear that this
can't result in good guest behavior. In fact, on said hardware guests
with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
gets started.

Recognizing that writes to softirq_pending() already have the effect of
waking remote CPUs from MWAITing (due to being co-located on the same
cache line with mwait_wakeup()), we can avoid sending IPIs to CPUs we
know are in a (deep) C-state entered via MWAIT.

With this, average broadcast times for a 64-vCPU guest went down to a
measured maximum of 255us (which is still quite a lot).

One aspect worth noting is that cpumask_raise_softirq() gets brought in
sync here with cpu_raise_softirq() in that now both don't attempt to
raise a self-IPI on the processing CPU.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -330,6 +330,16 @@ void cpuidle_wakeup_mwait(cpumask_t *mas
     cpumask_andnot(mask, mask, &target);
 }
 
+bool_t arch_skip_send_event_check(unsigned int cpu)
+{
+    /*
+     * This relies on softirq_pending() and mwait_wakeup() to access data
+     * on the same cache line.
+     */
+    smp_mb();
+    return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+}
+
 void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
 {
     unsigned int cpu = smp_processor_id();
@@ -349,7 +359,7 @@ void mwait_idle_with_hints(unsigned int 
      * Timer deadline passing is the event on which we will be woken via
      * cpuidle_mwait_wakeup. So check it now that the location is armed.
      */
-    if ( expires > NOW() || expires == 0 )
+    if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
     {
         cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
         __mwait(eax, ecx);
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -70,12 +70,14 @@ void open_softirq(int nr, softirq_handle
 
 void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
 {
-    int cpu;
+    unsigned int cpu, this_cpu = smp_processor_id();
     cpumask_t send_mask;
 
     cpumask_clear(&send_mask);
     for_each_cpu(cpu, mask)
-        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
+        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
+             cpu != this_cpu &&
+             !arch_skip_send_event_check(cpu) )
             cpumask_set_cpu(cpu, &send_mask);
 
     smp_send_event_check_mask(&send_mask);
@@ -84,7 +86,8 @@ void cpumask_raise_softirq(const cpumask
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
     if ( !test_and_set_bit(nr, &softirq_pending(cpu))
-         && (cpu != smp_processor_id()) )
+         && (cpu != smp_processor_id())
+         && !arch_skip_send_event_check(cpu) )
         smp_send_event_check_cpu(cpu);
 }
 
--- a/xen/include/asm-arm/softirq.h
+++ b/xen/include/asm-arm/softirq.h
@@ -3,6 +3,8 @@
 
 #define NR_ARCH_SOFTIRQS       0
 
+#define arch_skip_send_event_check(cpu) 0
+
 #endif /* __ASM_SOFTIRQ_H__ */
 /*
  * Local variables:
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -9,4 +9,6 @@
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
 #define NR_ARCH_SOFTIRQS       5
 
+bool_t arch_skip_send_event_check(unsigned int cpu);
+
 #endif /* __ASM_SOFTIRQ_H__ */




[-- Attachment #2: x86-suppress-event-check-IPI.patch --]
[-- Type: text/plain, Size: 3784 bytes --]

x86: suppress event check IPI to MWAITing CPUs

Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed (average 1.1ms).
Considering that Windows broadcasts IPIs from its timer interrupt,
which at least at certain times can run at 1kHz, it is clear that this
can't result in good guest behavior. In fact, on said hardware guests
with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
gets started.

Recognizing that writes to softirq_pending() already have the effect of
waking remote CPUs from MWAITing (due to being co-located on the same
cache line with mwait_wakeup()), we can avoid sending IPIs to CPUs we
know are in a (deep) C-state entered via MWAIT.

With this, average broadcast times for a 64-vCPU guest went down to a
measured maximum of 255us (which is still quite a lot).

One aspect worth noting is that cpumask_raise_softirq() gets brought in
sync here with cpu_raise_softirq() in that now both don't attempt to
raise a self-IPI on the processing CPU.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -330,6 +330,16 @@ void cpuidle_wakeup_mwait(cpumask_t *mas
     cpumask_andnot(mask, mask, &target);
 }
 
+bool_t arch_skip_send_event_check(unsigned int cpu)
+{
+    /*
+     * This relies on softirq_pending() and mwait_wakeup() to access data
+     * on the same cache line.
+     */
+    smp_mb();
+    return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+}
+
 void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
 {
     unsigned int cpu = smp_processor_id();
@@ -349,7 +359,7 @@ void mwait_idle_with_hints(unsigned int 
      * Timer deadline passing is the event on which we will be woken via
      * cpuidle_mwait_wakeup. So check it now that the location is armed.
      */
-    if ( expires > NOW() || expires == 0 )
+    if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
     {
         cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
         __mwait(eax, ecx);
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -70,12 +70,14 @@ void open_softirq(int nr, softirq_handle
 
 void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
 {
-    int cpu;
+    unsigned int cpu, this_cpu = smp_processor_id();
     cpumask_t send_mask;
 
     cpumask_clear(&send_mask);
     for_each_cpu(cpu, mask)
-        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
+        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
+             cpu != this_cpu &&
+             !arch_skip_send_event_check(cpu) )
             cpumask_set_cpu(cpu, &send_mask);
 
     smp_send_event_check_mask(&send_mask);
@@ -84,7 +86,8 @@ void cpumask_raise_softirq(const cpumask
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
     if ( !test_and_set_bit(nr, &softirq_pending(cpu))
-         && (cpu != smp_processor_id()) )
+         && (cpu != smp_processor_id())
+         && !arch_skip_send_event_check(cpu) )
         smp_send_event_check_cpu(cpu);
 }
 
--- a/xen/include/asm-arm/softirq.h
+++ b/xen/include/asm-arm/softirq.h
@@ -3,6 +3,8 @@
 
 #define NR_ARCH_SOFTIRQS       0
 
+#define arch_skip_send_event_check(cpu) 0
+
 #endif /* __ASM_SOFTIRQ_H__ */
 /*
  * Local variables:
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -9,4 +9,6 @@
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
 #define NR_ARCH_SOFTIRQS       5
 
+bool_t arch_skip_send_event_check(unsigned int cpu);
+
 #endif /* __ASM_SOFTIRQ_H__ */

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs
  2014-09-11  9:40 ` [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs Jan Beulich
@ 2014-09-11 10:02   ` Andrew Cooper
  2014-09-11 10:07     ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Cooper @ 2014-09-11 10:02 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Ian Campbell, Ian Jackson, Keir Fraser, Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 4388 bytes --]

On 11/09/14 10:40, Jan Beulich wrote:
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed (average 1.1ms).
> Considering that Windows broadcasts IPIs from its timer interrupt,
> which at least at certain times can run at 1kHz, it is clear that this
> can't result in good guest behavior. In fact, on said hardware guests
> with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
> gets started.
>
> Recognizing that writes to softirq_pending() already have the effect of
> waking remote CPUs from MWAITing (due to being co-located on the same
> cache line with mwait_wakeup()), we can avoid sending IPIs to CPUs we
> know are in a (deep) C-state entered via MWAIT.
>
> With this, average broadcast times for a 64-vCPU guest went down to a
> measured maximum of 255us (which is still quite a lot).
>
> One aspect worth noting is that cpumask_raise_softirq() gets brought in
> sync here with cpu_raise_softirq() in that now both don't attempt to
> raise a self-IPI on the processing CPU.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -330,6 +330,16 @@ void cpuidle_wakeup_mwait(cpumask_t *mas
>      cpumask_andnot(mask, mask, &target);
>  }
>  
> +bool_t arch_skip_send_event_check(unsigned int cpu)
> +{
> +    /*
> +     * This relies on softirq_pending() and mwait_wakeup() to access data
> +     * on the same cache line.
> +     */
> +    smp_mb();
> +    return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
> +}
> +
>  void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>  {
>      unsigned int cpu = smp_processor_id();
> @@ -349,7 +359,7 @@ void mwait_idle_with_hints(unsigned int 
>       * Timer deadline passing is the event on which we will be woken via
>       * cpuidle_mwait_wakeup. So check it now that the location is armed.
>       */
> -    if ( expires > NOW() || expires == 0 )
> +    if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
>      {
>          cpumask_set_cpu(cpu, &cpuidle_mwait_flags);

Don't you need a smp_wmb() or better here for the results of
cpumask_set_cpu() to be guaranteed to be externally visible?  mwait does
not appear to be a serialising instruction, and doesn't appear to have
any ordering guarantees in the manual.

~Andrew

>          __mwait(eax, ecx);
> --- a/xen/common/softirq.c
> +++ b/xen/common/softirq.c
> @@ -70,12 +70,14 @@ void open_softirq(int nr, softirq_handle
>  
>  void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
>  {
> -    int cpu;
> +    unsigned int cpu, this_cpu = smp_processor_id();
>      cpumask_t send_mask;
>  
>      cpumask_clear(&send_mask);
>      for_each_cpu(cpu, mask)
> -        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) )
> +        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
> +             cpu != this_cpu &&
> +             !arch_skip_send_event_check(cpu) )
>              cpumask_set_cpu(cpu, &send_mask);
>  
>      smp_send_event_check_mask(&send_mask);
> @@ -84,7 +86,8 @@ void cpumask_raise_softirq(const cpumask
>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>  {
>      if ( !test_and_set_bit(nr, &softirq_pending(cpu))
> -         && (cpu != smp_processor_id()) )
> +         && (cpu != smp_processor_id())
> +         && !arch_skip_send_event_check(cpu) )
>          smp_send_event_check_cpu(cpu);
>  }
>  
> --- a/xen/include/asm-arm/softirq.h
> +++ b/xen/include/asm-arm/softirq.h
> @@ -3,6 +3,8 @@
>  
>  #define NR_ARCH_SOFTIRQS       0
>  
> +#define arch_skip_send_event_check(cpu) 0
> +
>  #endif /* __ASM_SOFTIRQ_H__ */
>  /*
>   * Local variables:
> --- a/xen/include/asm-x86/softirq.h
> +++ b/xen/include/asm-x86/softirq.h
> @@ -9,4 +9,6 @@
>  #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
>  #define NR_ARCH_SOFTIRQS       5
>  
> +bool_t arch_skip_send_event_check(unsigned int cpu);
> +
>  #endif /* __ASM_SOFTIRQ_H__ */
>
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


[-- Attachment #1.2: Type: text/html, Size: 5033 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs
  2014-09-11 10:02   ` Andrew Cooper
@ 2014-09-11 10:07     ` Jan Beulich
  2014-09-11 10:09       ` Andrew Cooper
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2014-09-11 10:07 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Ian Campbell, xen-devel, Keir Fraser, IanJackson, Tim Deegan

>>> On 11.09.14 at 12:02, <andrew.cooper3@citrix.com> wrote:
> On 11/09/14 10:40, Jan Beulich wrote:
>> @@ -349,7 +359,7 @@ void mwait_idle_with_hints(unsigned int 
>>       * Timer deadline passing is the event on which we will be woken via
>>       * cpuidle_mwait_wakeup. So check it now that the location is armed.
>>       */
>> -    if ( expires > NOW() || expires == 0 )
>> +    if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
>>      {
>>          cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
> 
> Don't you need a smp_wmb() or better here for the results of
> cpumask_set_cpu() to be guaranteed to be externally visible?  mwait does
> not appear to be a serialising instruction, and doesn't appear to have
> any ordering guarantees in the manual.

I think cpumask_set_cpu() using a LOCKed r-m-w instruction should
provide all the needed ordering.

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs
  2014-09-11 10:07     ` Jan Beulich
@ 2014-09-11 10:09       ` Andrew Cooper
  2014-09-11 10:26         ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Cooper @ 2014-09-11 10:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Campbell, xen-devel, Keir Fraser, IanJackson, Tim Deegan

On 11/09/14 11:07, Jan Beulich wrote:
>>>> On 11.09.14 at 12:02, <andrew.cooper3@citrix.com> wrote:
>> On 11/09/14 10:40, Jan Beulich wrote:
>>> @@ -349,7 +359,7 @@ void mwait_idle_with_hints(unsigned int 
>>>       * Timer deadline passing is the event on which we will be woken via
>>>       * cpuidle_mwait_wakeup. So check it now that the location is armed.
>>>       */
>>> -    if ( expires > NOW() || expires == 0 )
>>> +    if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
>>>      {
>>>          cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
>> Don't you need a smp_wmb() or better here for the results of
>> cpumask_set_cpu() to be guaranteed to be externally visible?  mwait does
>> not appear to be a serialising instruction, and doesn't appear to have
>> any ordering guarantees in the manual.
> I think cpumask_set_cpu() using a LOCKed r-m-w instruction should
> provide all the needed ordering.
>
> Jan
>

Ah yes, and you haven't gotten around to changing the cpumask_*
functions wrt atomicity yet.

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs
  2014-09-11 10:09       ` Andrew Cooper
@ 2014-09-11 10:26         ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2014-09-11 10:26 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Ian Campbell, xen-devel, KeirFraser, IanJackson, Tim Deegan

>>> On 11.09.14 at 12:09, <andrew.cooper3@citrix.com> wrote:
> On 11/09/14 11:07, Jan Beulich wrote:
>>>>> On 11.09.14 at 12:02, <andrew.cooper3@citrix.com> wrote:
>>> On 11/09/14 10:40, Jan Beulich wrote:
>>>> @@ -349,7 +359,7 @@ void mwait_idle_with_hints(unsigned int 
>>>>       * Timer deadline passing is the event on which we will be woken via
>>>>       * cpuidle_mwait_wakeup. So check it now that the location is armed.
>>>>       */
>>>> -    if ( expires > NOW() || expires == 0 )
>>>> +    if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
>>>>      {
>>>>          cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
>>> Don't you need a smp_wmb() or better here for the results of
>>> cpumask_set_cpu() to be guaranteed to be externally visible?  mwait does
>>> not appear to be a serialising instruction, and doesn't appear to have
>>> any ordering guarantees in the manual.
>> I think cpumask_set_cpu() using a LOCKed r-m-w instruction should
>> provide all the needed ordering.
> 
> Ah yes, and you haven't gotten around to changing the cpumask_*
> functions wrt atomicity yet.

And this clearly is one that needs to continue using the LOCKed
accesses.

> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2] x86/HVM: batch vCPU wakeups
  2014-09-11  9:36 [PATCH 0/2] x86: improve remote CPU wakeup Jan Beulich
  2014-09-11  9:40 ` [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs Jan Beulich
@ 2014-09-11  9:40 ` Jan Beulich
  2014-09-11 10:48   ` Andrew Cooper
  2014-09-18 10:59 ` [PATCH 0/2] x86: improve remote CPU wakeup Tim Deegan
  2 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2014-09-11  9:40 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Campbell, Keir Fraser, Ian Jackson, Tim Deegan

[-- Attachment #1: Type: text/plain, Size: 6497 bytes --]

Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed (average 1.1ms).
Considering that Windows broadcasts IPIs from its timer interrupt,
which at least at certain times can run at 1kHz, it is clear that this
can't result in good guest behavior. In fact, on said hardware guests
with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
gets started.

This isn't just helping to reduce the number of ICR writes when the
host APICs run in clustered mode, it also reduces them by suppressing
the sends altogether when - by the time
cpu_raise_softirq_batch_finish() is reached - the remote CPU already
managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
update of softirq_pending(cpu), being on the monitored cache line -
should make the remote CPU wake up ahead of the ICR being sent,
allowing the wait-for-ICR-idle latencies to be reduced (perhaps to a
large part due to overlapping the wakeups of multiple CPUs).

With this alone (i.e. without the IPI avoidance patch in place),
average broadcast times for a 64-vCPU guest went down to a measured
maximum of 310us. With that other patch in place, improvements aren't
as clear anymore (short term averages only went down from 255us to
250us, which clearly is within the error range of the measurements),
but longer term an improvement of the averages is still visible.
Depending on hardware, long term maxima were observed to go down quite
a bit (on aforementioned hardware), while they were seen to go up
again on a (single core) Nehalem (where instead the improvement on the
average values was more visible).

Of course this necessarily increases the latencies for the remote
CPU wakeup at least slightly. To weigh between the effects, the
condition to enable batching in vlapic_ipi() may need further tuning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -409,6 +409,26 @@ void vlapic_handle_EOI_induced_exit(stru
     hvm_dpci_msi_eoi(current->domain, vector);
 }
 
+static bool_t is_multicast_dest(struct vlapic *vlapic, unsigned int short_hand,
+                                uint32_t dest, bool_t dest_mode)
+{
+    if ( vlapic_domain(vlapic)->max_vcpus <= 2 )
+        return 0;
+
+    if ( short_hand )
+        return short_hand != APIC_DEST_SELF;
+
+    if ( vlapic_x2apic_mode(vlapic) )
+        return dest_mode ? hweight16(dest) > 1 : dest == 0xffffffff;
+
+    if ( dest_mode )
+        return hweight8(dest &
+                        GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic,
+                                                            APIC_DFR))) > 1;
+
+    return dest == 0xff;
+}
+
 void vlapic_ipi(
     struct vlapic *vlapic, uint32_t icr_low, uint32_t icr_high)
 {
@@ -447,12 +467,18 @@ void vlapic_ipi(
 
     default: {
         struct vcpu *v;
+        bool_t batch = is_multicast_dest(vlapic, short_hand, dest, dest_mode);
+
+        if ( batch )
+            cpu_raise_softirq_batch_begin();
         for_each_vcpu ( vlapic_domain(vlapic), v )
         {
             if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
                                    short_hand, dest, dest_mode) )
                 vlapic_accept_irq(v, icr_low);
         }
+        if ( batch )
+            cpu_raise_softirq_batch_finish();
         break;
     }
     }
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
 
 static softirq_handler softirq_handlers[NR_SOFTIRQS];
 
+static DEFINE_PER_CPU(cpumask_t, batch_mask);
+static DEFINE_PER_CPU(unsigned int, batching);
+
 static void __do_softirq(unsigned long ignore_mask)
 {
     unsigned int i, cpu;
@@ -71,24 +74,58 @@ void open_softirq(int nr, softirq_handle
 void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
 {
     unsigned int cpu, this_cpu = smp_processor_id();
-    cpumask_t send_mask;
+    cpumask_t send_mask, *raise_mask;
+
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
+    {
+        cpumask_clear(&send_mask);
+        raise_mask = &send_mask;
+    }
+    else
+        raise_mask = &per_cpu(batch_mask, this_cpu);
 
-    cpumask_clear(&send_mask);
     for_each_cpu(cpu, mask)
         if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
              cpu != this_cpu &&
              !arch_skip_send_event_check(cpu) )
-            cpumask_set_cpu(cpu, &send_mask);
+            cpumask_set_cpu(cpu, raise_mask);
 
-    smp_send_event_check_mask(&send_mask);
+    if ( raise_mask == &send_mask )
+        smp_send_event_check_mask(raise_mask);
 }
 
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
-    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
-         && (cpu != smp_processor_id())
-         && !arch_skip_send_event_check(cpu) )
+    unsigned int this_cpu = smp_processor_id();
+
+    if ( test_and_set_bit(nr, &softirq_pending(cpu))
+         || (cpu == this_cpu)
+         || arch_skip_send_event_check(cpu) )
+        return;
+
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
         smp_send_event_check_cpu(cpu);
+    else
+        set_bit(nr, &per_cpu(batch_mask, this_cpu));
+}
+
+void cpu_raise_softirq_batch_begin(void)
+{
+    ++this_cpu(batching);
+}
+
+void cpu_raise_softirq_batch_finish(void)
+{
+    unsigned int cpu, this_cpu = smp_processor_id();
+    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
+
+    ASSERT(per_cpu(batching, this_cpu));
+    for_each_cpu ( cpu, mask )
+        if ( !softirq_pending(cpu) )
+            cpumask_clear_cpu(cpu, mask);
+    smp_send_event_check_mask(mask);
+    cpumask_clear(mask);
+    --per_cpu(batching, this_cpu);
 }
 
 void raise_softirq(unsigned int nr)
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
 void raise_softirq(unsigned int nr);
 
+void cpu_raise_softirq_batch_begin(void);
+void cpu_raise_softirq_batch_finish(void);
+
 /*
  * Process pending softirqs on this CPU. This should be called periodically
  * when performing work that prevents softirqs from running in a timely manner.



[-- Attachment #2: raise-softirq-batch.patch --]
[-- Type: text/plain, Size: 6524 bytes --]

x86/HVM: batch vCPU wakeups

Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
especially when many of the remote pCPU-s are in deep C-states. For
64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
accumulated times of over 2ms were observed (average 1.1ms).
Considering that Windows broadcasts IPIs from its timer interrupt,
which at least at certain times can run at 1kHz, it is clear that this
can't result in good guest behavior. In fact, on said hardware guests
with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
gets started.

This isn't just helping to reduce the number of ICR writes when the
host APICs run in clustered mode, it also reduces them by suppressing
the sends altogether when - by the time
cpu_raise_softirq_batch_finish() is reached - the remote CPU already
managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
update of softirq_pending(cpu), being on the monitored cache line -
should make the remote CPU wake up ahead of the ICR being sent,
allowing the wait-for-ICR-idle latencies to be reduced (perhaps to a
large part due to overlapping the wakeups of multiple CPUs).

With this alone (i.e. without the IPI avoidance patch in place),
average broadcast times for a 64-vCPU guest went down to a measured
maximum of 310us. With that other patch in place, improvements aren't
as clear anymore (short term averages only went down from 255us to
250us, which clearly is within the error range of the measurements),
but longer term an improvement of the averages is still visible.
Depending on hardware, long term maxima were observed to go down quite
a bit (on aforementioned hardware), while they were seen to go up
again on a (single core) Nehalem (where instead the improvement on the
average values was more visible).

Of course this necessarily increases the latencies for the remote
CPU wakeup at least slightly. To weigh between the effects, the
condition to enable batching in vlapic_ipi() may need further tuning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -409,6 +409,26 @@ void vlapic_handle_EOI_induced_exit(stru
     hvm_dpci_msi_eoi(current->domain, vector);
 }
 
+static bool_t is_multicast_dest(struct vlapic *vlapic, unsigned int short_hand,
+                                uint32_t dest, bool_t dest_mode)
+{
+    if ( vlapic_domain(vlapic)->max_vcpus <= 2 )
+        return 0;
+
+    if ( short_hand )
+        return short_hand != APIC_DEST_SELF;
+
+    if ( vlapic_x2apic_mode(vlapic) )
+        return dest_mode ? hweight16(dest) > 1 : dest == 0xffffffff;
+
+    if ( dest_mode )
+        return hweight8(dest &
+                        GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic,
+                                                            APIC_DFR))) > 1;
+
+    return dest == 0xff;
+}
+
 void vlapic_ipi(
     struct vlapic *vlapic, uint32_t icr_low, uint32_t icr_high)
 {
@@ -447,12 +467,18 @@ void vlapic_ipi(
 
     default: {
         struct vcpu *v;
+        bool_t batch = is_multicast_dest(vlapic, short_hand, dest, dest_mode);
+
+        if ( batch )
+            cpu_raise_softirq_batch_begin();
         for_each_vcpu ( vlapic_domain(vlapic), v )
         {
             if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
                                    short_hand, dest, dest_mode) )
                 vlapic_accept_irq(v, icr_low);
         }
+        if ( batch )
+            cpu_raise_softirq_batch_finish();
         break;
     }
     }
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
 
 static softirq_handler softirq_handlers[NR_SOFTIRQS];
 
+static DEFINE_PER_CPU(cpumask_t, batch_mask);
+static DEFINE_PER_CPU(unsigned int, batching);
+
 static void __do_softirq(unsigned long ignore_mask)
 {
     unsigned int i, cpu;
@@ -71,24 +74,58 @@ void open_softirq(int nr, softirq_handle
 void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
 {
     unsigned int cpu, this_cpu = smp_processor_id();
-    cpumask_t send_mask;
+    cpumask_t send_mask, *raise_mask;
+
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
+    {
+        cpumask_clear(&send_mask);
+        raise_mask = &send_mask;
+    }
+    else
+        raise_mask = &per_cpu(batch_mask, this_cpu);
 
-    cpumask_clear(&send_mask);
     for_each_cpu(cpu, mask)
         if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
              cpu != this_cpu &&
              !arch_skip_send_event_check(cpu) )
-            cpumask_set_cpu(cpu, &send_mask);
+            cpumask_set_cpu(cpu, raise_mask);
 
-    smp_send_event_check_mask(&send_mask);
+    if ( raise_mask == &send_mask )
+        smp_send_event_check_mask(raise_mask);
 }
 
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
 {
-    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
-         && (cpu != smp_processor_id())
-         && !arch_skip_send_event_check(cpu) )
+    unsigned int this_cpu = smp_processor_id();
+
+    if ( test_and_set_bit(nr, &softirq_pending(cpu))
+         || (cpu == this_cpu)
+         || arch_skip_send_event_check(cpu) )
+        return;
+
+    if ( !per_cpu(batching, this_cpu) || in_irq() )
         smp_send_event_check_cpu(cpu);
+    else
+        set_bit(nr, &per_cpu(batch_mask, this_cpu));
+}
+
+void cpu_raise_softirq_batch_begin(void)
+{
+    ++this_cpu(batching);
+}
+
+void cpu_raise_softirq_batch_finish(void)
+{
+    unsigned int cpu, this_cpu = smp_processor_id();
+    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
+
+    ASSERT(per_cpu(batching, this_cpu));
+    for_each_cpu ( cpu, mask )
+        if ( !softirq_pending(cpu) )
+            cpumask_clear_cpu(cpu, mask);
+    smp_send_event_check_mask(mask);
+    cpumask_clear(mask);
+    --per_cpu(batching, this_cpu);
 }
 
 void raise_softirq(unsigned int nr)
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
 void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
 void raise_softirq(unsigned int nr);
 
+void cpu_raise_softirq_batch_begin(void);
+void cpu_raise_softirq_batch_finish(void);
+
 /*
  * Process pending softirqs on this CPU. This should be called periodically
  * when performing work that prevents softirqs from running in a timely manner.

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86/HVM: batch vCPU wakeups
  2014-09-11  9:40 ` [PATCH 2/2] x86/HVM: batch vCPU wakeups Jan Beulich
@ 2014-09-11 10:48   ` Andrew Cooper
  2014-09-11 11:03     ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Cooper @ 2014-09-11 10:48 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Ian Campbell, Ian Jackson, Keir Fraser, Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 7322 bytes --]

On 11/09/14 10:40, Jan Beulich wrote:
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed (average 1.1ms).
> Considering that Windows broadcasts IPIs from its timer interrupt,
> which at least at certain times can run at 1kHz, it is clear that this
> can't result in good guest behavior. In fact, on said hardware guests
> with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
> gets started.
>
> This isn't just helping to reduce the number of ICR writes when the
> host APICs run in clustered mode, it also reduces them by suppressing
> the sends altogether when - by the time
> cpu_raise_softirq_batch_finish() is reached - the remote CPU already
> managed to handle the softirq. Plus - when using MONITOR/MWAIT - the
> update of softirq_pending(cpu), being on the monitored cache line -
> should make the remote CPU wake up ahead of the ICR being sent,
> allowing the wait-for-ICR-idle latencies to be reduced (perhaps to a
> large part due to overlapping the wakeups of multiple CPUs).
>
> With this alone (i.e. without the IPI avoidance patch in place),
> average broadcast times for a 64-vCPU guest went down to a measured
> maximum of 310us. With that other patch in place, improvements aren't
> as clear anymore (short term averages only went down from 255us to
> 250us, which clearly is within the error range of the measurements),
> but longer term an improvement of the averages is still visible.
> Depending on hardware, long term maxima were observed to go down quite
> a bit (on aforementioned hardware), while they were seen to go up
> again on a (single core) Nehalem (where instead the improvement on the
> average values was more visible).
>
> Of course this necessarily increases the latencies for the remote
> CPU wakeup at least slightly. To weigh between the effects, the
> condition to enable batching in vlapic_ipi() may need further tuning.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/hvm/vlapic.c
> +++ b/xen/arch/x86/hvm/vlapic.c
> @@ -409,6 +409,26 @@ void vlapic_handle_EOI_induced_exit(stru
>      hvm_dpci_msi_eoi(current->domain, vector);
>  }
>  
> +static bool_t is_multicast_dest(struct vlapic *vlapic, unsigned int short_hand,
> +                                uint32_t dest, bool_t dest_mode)
> +{
> +    if ( vlapic_domain(vlapic)->max_vcpus <= 2 )
> +        return 0;
> +
> +    if ( short_hand )
> +        return short_hand != APIC_DEST_SELF;
> +
> +    if ( vlapic_x2apic_mode(vlapic) )
> +        return dest_mode ? hweight16(dest) > 1 : dest == 0xffffffff;
> +
> +    if ( dest_mode )
> +        return hweight8(dest &
> +                        GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic,
> +                                                            APIC_DFR))) > 1;
> +
> +    return dest == 0xff;
> +}

Much more readable!

> +
>  void vlapic_ipi(
>      struct vlapic *vlapic, uint32_t icr_low, uint32_t icr_high)
>  {
> @@ -447,12 +467,18 @@ void vlapic_ipi(
>  
>      default: {
>          struct vcpu *v;
> +        bool_t batch = is_multicast_dest(vlapic, short_hand, dest, dest_mode);
> +
> +        if ( batch )
> +            cpu_raise_softirq_batch_begin();
>          for_each_vcpu ( vlapic_domain(vlapic), v )
>          {
>              if ( vlapic_match_dest(vcpu_vlapic(v), vlapic,
>                                     short_hand, dest, dest_mode) )
>                  vlapic_accept_irq(v, icr_low);
>          }
> +        if ( batch )
> +            cpu_raise_softirq_batch_finish();
>          break;
>      }
>      }
> --- a/xen/common/softirq.c
> +++ b/xen/common/softirq.c
> @@ -23,6 +23,9 @@ irq_cpustat_t irq_stat[NR_CPUS];
>  
>  static softirq_handler softirq_handlers[NR_SOFTIRQS];
>  
> +static DEFINE_PER_CPU(cpumask_t, batch_mask);
> +static DEFINE_PER_CPU(unsigned int, batching);
> +
>  static void __do_softirq(unsigned long ignore_mask)
>  {
>      unsigned int i, cpu;
> @@ -71,24 +74,58 @@ void open_softirq(int nr, softirq_handle
>  void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
>  {
>      unsigned int cpu, this_cpu = smp_processor_id();
> -    cpumask_t send_mask;
> +    cpumask_t send_mask, *raise_mask;
> +
> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
> +    {
> +        cpumask_clear(&send_mask);
> +        raise_mask = &send_mask;
> +    }
> +    else
> +        raise_mask = &per_cpu(batch_mask, this_cpu);
>  
> -    cpumask_clear(&send_mask);
>      for_each_cpu(cpu, mask)
>          if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
>               cpu != this_cpu &&
>               !arch_skip_send_event_check(cpu) )
> -            cpumask_set_cpu(cpu, &send_mask);
> +            cpumask_set_cpu(cpu, raise_mask);
>  
> -    smp_send_event_check_mask(&send_mask);
> +    if ( raise_mask == &send_mask )
> +        smp_send_event_check_mask(raise_mask);
>  }
>  
>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>  {
> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
> -         && (cpu != smp_processor_id())
> -         && !arch_skip_send_event_check(cpu) )
> +    unsigned int this_cpu = smp_processor_id();
> +
> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
> +         || (cpu == this_cpu)
> +         || arch_skip_send_event_check(cpu) )
> +        return;
> +
> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
>          smp_send_event_check_cpu(cpu);
> +    else
> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));

Under what circumstances would it be sensible to batch calls to
cpu_raise_softirq()?

All of the current callers are singleshot events, and their use in a
batched period would only be as a result of a timer interrupt, which
bypasses the batching.

~Andrew

> +}
> +
> +void cpu_raise_softirq_batch_begin(void)
> +{
> +    ++this_cpu(batching);
> +}
> +
> +void cpu_raise_softirq_batch_finish(void)
> +{
> +    unsigned int cpu, this_cpu = smp_processor_id();
> +    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
> +
> +    ASSERT(per_cpu(batching, this_cpu));
> +    for_each_cpu ( cpu, mask )
> +        if ( !softirq_pending(cpu) )
> +            cpumask_clear_cpu(cpu, mask);
> +    smp_send_event_check_mask(mask);
> +    cpumask_clear(mask);
> +    --per_cpu(batching, this_cpu);
>  }
>  
>  void raise_softirq(unsigned int nr)
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -30,6 +30,9 @@ void cpumask_raise_softirq(const cpumask
>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr);
>  void raise_softirq(unsigned int nr);
>  
> +void cpu_raise_softirq_batch_begin(void);
> +void cpu_raise_softirq_batch_finish(void);
> +
>  /*
>   * Process pending softirqs on this CPU. This should be called periodically
>   * when performing work that prevents softirqs from running in a timely manner.
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


[-- Attachment #1.2: Type: text/html, Size: 7922 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86/HVM: batch vCPU wakeups
  2014-09-11 10:48   ` Andrew Cooper
@ 2014-09-11 11:03     ` Jan Beulich
  2014-09-11 11:11       ` Andrew Cooper
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2014-09-11 11:03 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Ian Campbell, xen-devel, Keir Fraser, Ian Jackson, Tim Deegan

>>> On 11.09.14 at 12:48, <andrew.cooper3@citrix.com> wrote:
> On 11/09/14 10:40, Jan Beulich wrote:
>>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>>  {
>> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
>> -         && (cpu != smp_processor_id())
>> -         && !arch_skip_send_event_check(cpu) )
>> +    unsigned int this_cpu = smp_processor_id();
>> +
>> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
>> +         || (cpu == this_cpu)
>> +         || arch_skip_send_event_check(cpu) )
>> +        return;
>> +
>> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
>>          smp_send_event_check_cpu(cpu);
>> +    else
>> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));
> 
> Under what circumstances would it be sensible to batch calls to
> cpu_raise_softirq()?
> 
> All of the current callers are singleshot events, and their use in a
> batched period would only be as a result of a timer interrupt, which
> bypasses the batching.

You shouldn't be looking at the immediate callers of
cpu_raise_softirq(), but at those much higher up the stack.
Rooted at vlapic_ipi(), depending on the scheduler you might
end up in credit1's __runq_tickle() (calling cpumask_raise_softirq())
or credit2's runq_tickle() (calling cpu_raise_softirq()).

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] x86/HVM: batch vCPU wakeups
  2014-09-11 11:03     ` Jan Beulich
@ 2014-09-11 11:11       ` Andrew Cooper
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Cooper @ 2014-09-11 11:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Campbell, xen-devel, Keir Fraser, Ian Jackson, Tim Deegan

On 11/09/14 12:03, Jan Beulich wrote:
>>>> On 11.09.14 at 12:48, <andrew.cooper3@citrix.com> wrote:
>> On 11/09/14 10:40, Jan Beulich wrote:
>>>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>>>  {
>>> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
>>> -         && (cpu != smp_processor_id())
>>> -         && !arch_skip_send_event_check(cpu) )
>>> +    unsigned int this_cpu = smp_processor_id();
>>> +
>>> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
>>> +         || (cpu == this_cpu)
>>> +         || arch_skip_send_event_check(cpu) )
>>> +        return;
>>> +
>>> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
>>>          smp_send_event_check_cpu(cpu);
>>> +    else
>>> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));
>> Under what circumstances would it be sensible to batch calls to
>> cpu_raise_softirq()?
>>
>> All of the current callers are singleshot events, and their use in a
>> batched period would only be as a result of a timer interrupt, which
>> bypasses the batching.
> You shouldn't be looking at the immediate callers of
> cpu_raise_softirq(), but at those much higher up the stack.
> Rooted at vlapic_ipi(), depending on the scheduler you might
> end up in credit1's __runq_tickle() (calling cpumask_raise_softirq())
> or credit2's runq_tickle() (calling cpu_raise_softirq()).
>
> Jan
>

Ah true, which is valid to batch.

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] x86: improve remote CPU wakeup
  2014-09-11  9:36 [PATCH 0/2] x86: improve remote CPU wakeup Jan Beulich
  2014-09-11  9:40 ` [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs Jan Beulich
  2014-09-11  9:40 ` [PATCH 2/2] x86/HVM: batch vCPU wakeups Jan Beulich
@ 2014-09-18 10:59 ` Tim Deegan
  2 siblings, 0 replies; 11+ messages in thread
From: Tim Deegan @ 2014-09-18 10:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Campbell, xen-devel, Keir Fraser, Ian Jackson

At 10:36 +0100 on 11 Sep (1410428168), Jan Beulich wrote:
> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
> especially when many of the remote pCPU-s are in deep C-states. For
> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
> accumulated times of over 2ms were observed (average 1.1ms).
> Considering that Windows broadcasts IPIs from its timer interrupt,
> which at least at certain times can run at 1kHz, it is clear that this
> can't result in good guest behavior. In fact, on said hardware guests
> with significantly beyond 40 vCPU-s simply hung when e.g. ServerManager
> gets started.
> 
> The two patches bring down the average to 250us on said hardware.
> 
> 1: x86: suppress event check IPI to MWAITing CPUs
> 2: x86/HVM: batch vCPU wakeups
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Tim Deegan <tim@xen.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-09-18 10:59 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-11  9:36 [PATCH 0/2] x86: improve remote CPU wakeup Jan Beulich
2014-09-11  9:40 ` [PATCH 1/2] x86: suppress event check IPI to MWAITing CPUs Jan Beulich
2014-09-11 10:02   ` Andrew Cooper
2014-09-11 10:07     ` Jan Beulich
2014-09-11 10:09       ` Andrew Cooper
2014-09-11 10:26         ` Jan Beulich
2014-09-11  9:40 ` [PATCH 2/2] x86/HVM: batch vCPU wakeups Jan Beulich
2014-09-11 10:48   ` Andrew Cooper
2014-09-11 11:03     ` Jan Beulich
2014-09-11 11:11       ` Andrew Cooper
2014-09-18 10:59 ` [PATCH 0/2] x86: improve remote CPU wakeup Tim Deegan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).