Re: [v4 16/17] vmx: Add some scheduler hooks for VT-d posted interrupts

From: Dario Faggioli <dario.faggioli@citrix.com>
To: "Wu, Feng" <feng.wu@intel.com>
Cc: "Tian, Kevin" <kevin.tian@intel.com>, Keir Fraser <keir@xen.org>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [v4 16/17] vmx: Add some scheduler hooks for VT-d posted interrupts
Date: Thu, 30 Jul 2015 20:26:37 +0200
Message-ID: <1438280797.16912.86.camel@citrix.com>
In-Reply-To: <E959C4978C3B6342920538CF579893F00265D3F2@SHSMSX104.ccr.corp.intel.com>


On Thu, 2015-07-30 at 02:04 +0000, Wu, Feng wrote:
> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> > > --- a/xen/arch/x86/domain.c
> > > +++ b/xen/arch/x86/domain.c
> > > @@ -1550,9 +1550,19 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
> > >
> > >      set_current(next);
> > >
> > > +    /*
> > > +     * We need to update posted interrupt descriptor for each context switch,
> > > +     * hence cannot use the lazy context switch for this.
> > > +     */
> > >
> > Perhaps it's me, but I don't get the comment. Why do you mention "the
> > lazy context switch"? We can't use it "for this", as opposed to what
> > other circumstance where we can use it?
> 
> Oh, maybe I shouldn't use the word here, what I want to say here is
> __context_switch() isn't called in each context switch, such as,
> non-idle vcpu -> idle vcpu, so we need to call prev->arch.pi_ctxt_switch_from
> explicitly instead of in __context_switch().
> 
Ok, I see what you mean now, and it's probably correct, as 'lazy context
switch' is, in this context, exactly that (i.e., not actually context
switching if next is the idle vcpu).

It's just that the term is used, in the literature, in different places
to mean (slightly) different things, and there is no close reference to
it (like in the function), so I still see a bit of room for potential
confusion.

In the end, as you wish. If it were me, I'd add a few words to specify
things better, something very similar to what you've put in this email,
e.g.:

"When switching from non-idle to idle, we only do a lazy context switch.
However, in order for posted interrupts (if available and enabled) to
work properly, we at least need to update the descriptors."

Or some better English form of it. :-)

But that's certainly not critical, and I'll be ok with whatever the
other maintainers agree on.

> > >      if ( (per_cpu(curr_vcpu, cpu) == next) ||
> > >           (is_idle_vcpu(next) && cpu_online(cpu)) )
> > >      {
> > > +        if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
> > >
> > Same as above.
> > 
> > > +            next->arch.pi_ctxt_switch_to(next);
> > > +
> > >          local_irq_enable();
> > >
> > Another thing: if prev == next --and let's call such vcpu pp-- you go
> > through both:
> > 
> >     pp->arch.pi_ctxt_switch_from(pp);
> >     pp->arch.pi_ctxt_switch_to(pp);
> 
> In my understanding, if the scheduler chooses the same vcpu to run, it
> will return early in schedule() as below:
> 
> static void schedule(void)
> {
>     ....
> 
>     /* get policy-specific decision on scheduling... */
>     sched = this_cpu(scheduler);
>     next_slice = sched->do_schedule(sched, now, tasklet_work_scheduled);
> 
>     next = next_slice.task;
> 
>     sd->curr = next;
> 
>     if ( next_slice.time >= 0 ) /* -ve means no limit */
>         set_timer(&sd->s_timer, now + next_slice.time);
> 
>     if ( unlikely(prev == next) )
>     {
>         pcpu_schedule_unlock_irq(lock, cpu);
>         trace_continue_running(next);
>         return continue_running(prev);
>     }
> 
>     ....
> 
> }
> 
> If this is that case, when we get context_switch(), the prev and next are
> different. Do I miss something?
> 
That looks correct. Still, there are checks like '(prev!=next)' around
in context_switch(), for both x86 and ARM... weird. I shall have a
deeper look...

In any case, as far as this hunk is concerned, the
'(per_cpu(curr_vcpu,cpu)==next)' is there to deal with the case where we
went from vcpu v to idle, and we're now going from idle to v again,
which is something you want to intercept.

So, at least for now, ignore my comments about it. I'll let you know if
I find something interesting that you should take into account.
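
To make the v -> idle -> v case concrete, here is a toy model in plain
C. It is not Xen code: toy_vcpu, toy_context_switch() and the counters
are made up for illustration, the cpu_online() condition is left out,
and the place where the non-lazy path calls the "switch to" hook is an
assumption, since that part of the patch is not quoted above. It only
shows that the PI hooks fire on every switch, while the full state
switch is skipped on the lazy paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT Xen code: every name below is illustrative. */
struct toy_vcpu {
    bool is_idle;
    int pi_from_calls;  /* how often the "switch from" PI hook ran */
    int pi_to_calls;    /* how often the "switch to" PI hook ran */
};

static struct toy_vcpu *curr_vcpu;  /* vcpu whose state is loaded on the pCPU */
static int full_switches;           /* how often state was really swapped */

/* Stands in for __context_switch(): save old state, load new state. */
static void toy_full_switch(struct toy_vcpu *next)
{
    full_switches++;
    curr_vcpu = next;
}

static void toy_context_switch(struct toy_vcpu *prev, struct toy_vcpu *next)
{
    assert(prev != next);  /* the scheduler returns early in that case */

    /* The PI descriptor must be updated on every switch away from a vcpu. */
    if (!prev->is_idle)
        prev->pi_from_calls++;

    if (curr_vcpu == next || next->is_idle) {
        /* Lazy path: the state of curr_vcpu simply stays loaded. */
        if (!next->is_idle)
            next->pi_to_calls++;
        return;
    }

    toy_full_switch(next);
    next->pi_to_calls++;   /* assumed placement on the non-lazy path */
}

/* Runs v -> idle -> v and reports how many full switches were needed. */
static int toy_run_v_idle_v(struct toy_vcpu *v, struct toy_vcpu *idle)
{
    curr_vcpu = v;                 /* v is running, its state is loaded */
    full_switches = 0;
    toy_context_switch(v, idle);   /* lazy: v's state stays loaded */
    toy_context_switch(idle, v);   /* curr_vcpu == v, so lazy again */
    return full_switches;
}
```

With v running, toy_run_v_idle_v(&v, &idle) returns 0 (no full switch
at all), while each of v's two hook counters goes up by one.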

> > > --- a/xen/common/schedule.c
> > > +++ b/xen/common/schedule.c
> > > @@ -381,6 +381,8 @@ void vcpu_wake(struct vcpu *v)
> > >      unsigned long flags;
> > >      spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
> > >
> > > +    arch_vcpu_wake(v);
> > > +
> > So, in the draft you sent a few days back, this was called at the end of
> > vcpu_wake(), right before releasing the lock. Now it's at the beginning,
> > before the scheduler's wakeup routine has a chance to run.
> > 
> > IMO, it feels more natural for it to be at the bottom (i.e., generic
> > stuff first, arch specific stuff afterwards), and, after a quick
> > inspection, I don't think I see anything preventing things from being
> > that way.
> > 
> > However, I recall you mentioning having issues with such draft, which
> > are now resolved with this version. 
> 
> The long latency issue mentioned previously is caused by another reason.
> Originally I called the ' pi_ctxt_switch_from ' and ' pi_ctxt_switch_to ' in
> __context_switch(), however, this function is not called for each context
> switch, as I described above, after fixing this, the performance issue
> disappeared.
> 
I see, thanks for explaining this.

> > Since this is one of the differences
> > between the two, was it the cause of the issues you were seeing? If yes,
> > can you elaborate on how and why?
> > 
> > In the end, I'm not too opposed to the hook being at the beginning
> > rather than at the end, but there has to be a reason, which may well end
> > up better be stated in a comment...
> 
> Here is the reason I put arch_vcpu_wake() ahead of vcpu_wake():
> arch_vcpu_wake() does some prerequisites for a vCPU which is about
> to run, such as, setting SN again, changing NV field back to
> ' posted_intr_vector ', which should be finished before the vCPU is
> actually scheduled to run. However, if we put arch_vcpu_wake() later
> in vcpu_wake() right before ' vcpu_schedule_unlock_irqrestore', after
> the 'wake' hook get finished, the vcpu can run at any time (maybe in
> another pCPU since the current pCPU is protected by the lock), if
> this can happen, it is incorrect. Does my understanding make sense?
> 
It's safe in any case. In fact, the spinlock will prevent both the
vcpu's processor from scheduling and any other processor from stealing
the waking vcpu from the runqueue to run it.

That's actually why I wanted to double check why you changed the
position of the hook (wrt the draft), as it felt weird that the issue
was in there. :-)
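
The safety argument can be modelled with a toy, single-threaded
simulation in plain C. None of it is Xen code: toy_runqueue,
toy_try_steal() and toy_vcpu_wake() are hypothetical names, the "lock"
is a plain flag, and the attempt from another pCPU is simulated by a
direct call placed inside the critical section. The sketch only
illustrates that, wherever the arch hook sits relative to the
scheduler's wake step, nobody can pick up the vcpu before the lock is
dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT Xen code: every name below is an illustrative assumption. */
struct toy_vcpu {
    bool on_runqueue;  /* set by the scheduler-specific wake step */
    bool pi_prepared;  /* set by the arch hook (stands in for the SN/NV updates) */
};

struct toy_runqueue {
    bool locked;              /* stands in for the scheduler spinlock */
    struct toy_vcpu *waking;  /* at most one queued vcpu, for brevity */
};

/* What another pCPU would do: take a vcpu, but only if the lock is free. */
static struct toy_vcpu *toy_try_steal(struct toy_runqueue *rq)
{
    struct toy_vcpu *v;

    if (rq->locked)
        return NULL;          /* a real pCPU would spin here instead */

    v = rq->waking;
    rq->waking = NULL;
    if (v != NULL)
        v->on_runqueue = false;
    return v;
}

/*
 * Wake path with the arch hook either before or after the scheduler step.
 * Returns whatever the simulated other pCPU managed to grab while the
 * lock was still held.
 */
static struct toy_vcpu *toy_vcpu_wake(struct toy_runqueue *rq,
                                      struct toy_vcpu *v, bool hook_first)
{
    struct toy_vcpu *stolen_early;

    assert(!rq->locked);
    rq->locked = true;

    if (hook_first)
        v->pi_prepared = true;

    /* Stand-in for the scheduler-specific wake step. */
    rq->waking = v;
    v->on_runqueue = true;

    /* Simulated attempt from another pCPU, inside the critical section. */
    stolen_early = toy_try_steal(rq);

    if (!hook_first)
        v->pi_prepared = true;

    rq->locked = false;
    return stolen_early;
}
```

In both orderings toy_vcpu_wake() returns NULL, and a later
toy_try_steal() gets a vcpu whose pi_prepared flag is already set.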

So, now that we know that safety is not an issue, where should we put
the hook?

Having it before SCHED_OP(wake) may make people think that arch specific
code is (or could, at some point, be) somehow influencing the scheduler
specific wakeup code, which is not (and should not become, if possible)
the case.

However, I kind of like the fact that the spinlock is released as soon
as possible after the call to SCHED_OP(wake). That makes it more likely
that the processors we may have sent IPIs to, during the scheduler
specific wakeup code, find the spinlock free. So, looking at things
from this angle, it would be better to avoid putting stuff in between
SCHED_OP(wake) and vcpu_schedule_unlock().

So, all in all, I'd say leave it on top, where it is in this patch. Of
course, if others have opinions, I'm all ears. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

