Re: [PATCH, RFC] x86/HVM: batch vCPU wakeups

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Tim Deegan <tim@xen.org>, Jan Beulich <JBeulich@suse.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	Keir Fraser <keir@xen.org>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>
Subject: Re: [PATCH, RFC] x86/HVM: batch vCPU wakeups
Date: Tue, 09 Sep 2014 23:37:23 +0100	[thread overview]
Message-ID: <540F8123.7060000@citrix.com> (raw)
In-Reply-To: <20140909212925.GC82414@deinos.phlegethon.org>

On 09/09/2014 22:29, Tim Deegan wrote:
> At 09:33 +0100 on 09 Sep (1410251617), Jan Beulich wrote:
>> Mass wakeups (via vlapic_ipi()) can take enormous amounts of time,
>> especially when many of the remote pCPU-s are in deep C-states. For
>> 64-vCPU Windows Server 2012 R2 guests on Ivybridge hardware,
>> accumulated times of over 2ms were observed. Considering that Windows
>> broadcasts IPIs from its timer interrupt
> Bleargh. :(
>
>> RFC for two reasons:
>> 1) I started to consider elimination of the event-check IPIs in a more
>>    general way when MONITOR/MWAIT is in use: As long as the remote CPU
>>    is known to be MWAITing (or in the process of resuming after MWAIT),
>>    the write of softirq_pending(cpu) ought to be sufficient to wake it.
>>    This would yield the patch here pointless on MONITOR/MWAIT capable
>>    hardware.
> That's a promising line.   I guess the number of 64-way systems that
> don't support MONITOR/MWAIT is pretty small. :)
>
>> 2) The condition when to enable batching in vlapic_ipi() is already
>>    rather complex, but it is nevertheless unclear whether it would
>>    benefit from further tuning (as mentioned above).
> The test you chose seems plausible enough to me.  I suppose we care
> enough about single-destination IPIs on idle hardware that we need
> this check at all; but I doubt that IPIs to exactly two destinations,
> for example, are worth worrying about.
>
> The implementation looks OK in general; a few nits below.
>
>> --- a/xen/arch/x86/hvm/vlapic.c
>> +++ b/xen/arch/x86/hvm/vlapic.c
>> @@ -447,12 +447,30 @@ void vlapic_ipi(
>>  
>>      default: {
>>          struct vcpu *v;
>> -        for_each_vcpu ( vlapic_domain(vlapic), v )
>> +        const struct domain *d = vlapic_domain(vlapic);
>> +        bool_t batch = d->max_vcpus > 2 &&
>> +                       (short_hand
>> +                        ? short_hand != APIC_DEST_SELF
>> +                        : vlapic_x2apic_mode(vlapic)
>> +                          ? dest_mode ? hweight16(dest) > 1
>> +                                      : dest == 0xffffffff
>> +                          : dest_mode
>> +                            ? hweight8(dest &
>> +                                       GET_xAPIC_DEST_FIELD(
>> +                                           vlapic_get_reg(vlapic,
>> +                                                          APIC_DFR))) > 1
>> +                            : dest == 0xff);
> Yoiks.  Could you extract some of this into a helper function, say
> is_unicast_dest(), just for readability?
>
>>  void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
>>  {
>> -    if ( !test_and_set_bit(nr, &softirq_pending(cpu))
>> -         && (cpu != smp_processor_id()) )
>> +    unsigned int this_cpu = smp_processor_id();
> The name clashes with the this_cpu() macro (obviously OK for compiling
> but harder to read) and can be dropped since...
>
>> +
>> +    if ( test_and_set_bit(nr, &softirq_pending(cpu))
>> +         || (cpu == this_cpu) )
>> +        return;
>> +
>> +    if ( !per_cpu(batching, this_cpu) || in_irq() )
> this_cpu(batching) is the preferred way of addressing this on Xen
> anyway - though I guess maybe it's _slightly_ less efficient to
> recalculate get_cpu_info() here than to indirect through the
> per_cpu_offsets[]?  I haven't looked at the asm.

It depends whether the compiler decides to do any stack manipulation. 
Certainly with the older implementation, if gcc leaves the stack pointer
alone, it will coalesce calls to get_cpu_info(), but will recall it if
any pushes/pops happen between basic blocks.  I believe my newer
implementation is coalesced across small stack changes, but it still
recalculated more than I would expect in some of the later functions. 
(Guess who spent a long time looking at generated asm while working on
that patch)

>
>>          smp_send_event_check_cpu(cpu);
>> +    else
>> +        set_bit(nr, &per_cpu(batch_mask, this_cpu));
>> +}
>> +
>> +void cpu_raise_softirq_batch_begin(void)
>> +{
>> +    ++per_cpu(batching, smp_processor_id());
> This one should definitely be using this_cpu().

Agreed in this case...

>
>> +}
>> +
>> +void cpu_raise_softirq_batch_finish(void)
>> +{
>> +    unsigned int cpu, this_cpu = smp_processor_id();
>> +    cpumask_t *mask = &per_cpu(batch_mask, this_cpu);
> Again, this_cpu()?

...But disagree here.  Multiple uses of this_cpu($FOO) cannot be
coalesced due to RELOC_HIDE() deliberately preventing optimisation.  For
multiple uses, pulling it out by pointer to start with results in rather
more efficient code.

~Andrew

>
>> +    ASSERT(per_cpu(batching, this_cpu));
>> +    for_each_cpu ( cpu, mask )
>> +        if ( !softirq_pending(cpu) )
>> +            cpumask_clear_cpu(cpu, mask);
>> +    smp_send_event_check_mask(mask);
>> +    cpumask_clear(mask);
>> +    --per_cpu(batching, this_cpu);
>>  }
> One other thing that occurs to me is that we might do the batching in
> the caller (by gathering a mask of pcpus in the vlapic code) but that
> sounds messy.  And it's not like softirq is so particularly
> latency-sensitive that we'd want to avoid this much overhead in the
> common case.  So on second thoughts this way is better. :)
>
> Cheers,
>
> Tim.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

next prev parent reply	other threads:[~2014-09-09 22:37 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-09  8:33 [PATCH, RFC] x86/HVM: batch vCPU wakeups Jan Beulich
2014-09-09  9:33 ` Pasi Kärkkäinen
2014-09-09  9:54   ` Jan Beulich
2014-09-09 21:29 ` Tim Deegan
2014-09-09 22:37   ` Andrew Cooper [this message]
2014-09-10 10:29     ` Tim Deegan
2014-09-10 10:37       ` Andrew Cooper
2014-09-10 11:58         ` Jan Beulich
2014-09-10  9:28   ` Jan Beulich
2014-09-10 13:10   ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=540F8123.7060000@citrix.com \
    --to=andrew.cooper3@citrix.com \
    --cc=Ian.Campbell@eu.citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=keir@xen.org \
    --cc=tim@xen.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.