* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
@ 2015-09-21 5:07 Wu, Feng
2015-09-21 9:45 ` George Dunlap
0 siblings, 1 reply; 27+ messages in thread
From: Wu, Feng @ 2015-09-21 5:07 UTC (permalink / raw)
To: George Dunlap, Dario Faggioli
Cc: Tian, Kevin, Wu, Feng, Andrew Cooper, xen-devel@lists.xen.org,
Jan Beulich, Keir Fraser
> -----Original Message-----
> From: Wu, Feng
> Sent: Thursday, September 17, 2015 2:16 PM
> To: George Dunlap; Jan Beulich
> Cc: Tian, Kevin; Keir Fraser; Andrew Cooper; Dario Faggioli;
> xen-devel@lists.xen.org; Wu, Feng
> Subject: RE: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for
> VT-d posted interrupts
>
>
>
> > -----Original Message-----
> > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George Dunlap
> > Sent: Thursday, September 17, 2015 12:57 AM
> > To: Jan Beulich
> > Cc: Wu, Feng; Tian, Kevin; Keir Fraser; Andrew Cooper; Dario Faggioli;
> > xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for
> > VT-d posted interrupts
> >
> > On Mon, Sep 7, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > >>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
> > >> --- a/xen/arch/x86/domain.c
> > >> +++ b/xen/arch/x86/domain.c
> > >> @@ -1573,6 +1573,22 @@ static void __context_switch(void)
> > >> per_cpu(curr_vcpu, cpu) = n;
> > >> }
> > >>
> > >> +static inline void pi_ctxt_switch_from(struct vcpu *prev)
> > >> +{
> > >> + /*
> > >> + * When switching from non-idle to idle, we only do a lazy context switch.
> > >> + * However, in order for posted interrupt (if available and enabled) to
> > >> + * work properly, we at least need to update the descriptors.
> > >> + */
> > >> + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
> > >> + prev->arch.pi_ctxt_switch_from(prev);
> > >> +}
> > >> +
> > >> +static inline void pi_ctxt_switch_to(struct vcpu *next)
> > >> +{
> > >> + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
> > >> + next->arch.pi_ctxt_switch_to(next);
> > >> +}
> > >>
> > >> void context_switch(struct vcpu *prev, struct vcpu *next)
> > >> {
> > >> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
> > >>
> > >> set_current(next);
> > >>
> > >> + pi_ctxt_switch_from(prev);
> > >> +
> > >> if ( (per_cpu(curr_vcpu, cpu) == next) ||
> > >> (is_idle_domain(nextd) && cpu_online(cpu)) )
> > >> {
> > >> + pi_ctxt_switch_to(next);
> > >> local_irq_enable();
> > >
> > > This placement, if really intended that way, needs explanation (in a
> > > comment) and perhaps even renaming of the involved symbols, as
> > > looking at it from a general perspective it seems wrong (with
> > > pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
> > > want this only when switching back to what got switched out lazily
> > > before, i.e. this would be not something to take place on an arbitrary
> > > context switch). As to possible alternative names - maybe make the
> > > hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
> >
> > Why on earth is this more clear than what he had before?
> >
> > In the first call, he's not "preparing" anything -- he's actually
> > switching the PI context out for prev. And in the second call, he's
> > not "cancelling" anything -- he's actually switching the PI context in
> > for next. The names you suggest are actively confusing, not helpful.
> >
> > But before talking about how to make things more clear, one side
> > question -- do we need to actually call pi_ctxt_switch_to() in
> > __context_switch()?
> >
> > The only other place __context_switch() is called is
> > from __sync_local_execstate(). But the only reason that needs to be
> > called is because sometimes we *don't* call __context_switch(), and so
> > there are things on the cpu that aren't copied into the vcpu struct.
>
> Thanks for the comments!
>
> From my understanding, __sync_local_execstate() can only get called
> in the following two cases:
> #1) this_cpu(curr_vcpu) == current, in this case, __context_switch() is
> not called.
> #2) this_cpu(curr_vcpu) != current, and current == idle_vcpu, that means
> we just switched from a non-idle vCPU to idle vCPU, so here we need to
> call __context_switch() to copy things to the original vcpu struct.
>
> Please correct me if the above understanding is wrong or incomplete?
Hi George / Dario,
Could you please confirm the above understanding is correct? (In fact, it is
related to lazy context switch, right?) If so, I can continue with the
pi_context_switch() way George suggested.
Thanks,
Feng
>
> I think calling pi_ctxt_switch_to() in __context_switch() is needed when
> we are switching to a non-idle vCPU (we need to change the PI state of the
> target vCPU), and the call is not needed when switching to an idle vCPU.
> So if the above understanding is correct, I think your suggestion below
> is really good; it makes things clearer.
>
> >
> > That doesn't apply to the PI state -- for one, nothing is copied from
> > the processor; and for two, pi_ctxt_switch_from() is called
> > unconditionally anyway.
> >
> > Would it make more sense to call pi_context_switch(prev, next) just
> > after "set_current"?
>
> I think it is a good point.
>
> Thanks,
> Feng
>
> >
> > (Keeping in mind I totally may have missed something...)
> >
> > -George
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-21 5:07 [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts Wu, Feng
@ 2015-09-21 9:45 ` George Dunlap
2015-09-21 12:07 ` Wu, Feng
0 siblings, 1 reply; 27+ messages in thread
From: George Dunlap @ 2015-09-21 9:45 UTC (permalink / raw)
To: Wu, Feng, George Dunlap, Dario Faggioli
Cc: Andrew Cooper, Tian, Kevin, Keir Fraser, Jan Beulich,
xen-devel@lists.xen.org
On 09/21/2015 06:07 AM, Wu, Feng wrote:
>
>
>> -----Original Message-----
>> From: Wu, Feng
>> Sent: Thursday, September 17, 2015 2:16 PM
>> To: George Dunlap; Jan Beulich
>> Cc: Tian, Kevin; Keir Fraser; Andrew Cooper; Dario Faggioli;
>> xen-devel@lists.xen.org; Wu, Feng
>> Subject: RE: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for
>> VT-d posted interrupts
>>
>>
>>
>>> -----Original Message-----
>>> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George Dunlap
>>> Sent: Thursday, September 17, 2015 12:57 AM
>>> To: Jan Beulich
>>> Cc: Wu, Feng; Tian, Kevin; Keir Fraser; Andrew Cooper; Dario Faggioli;
>>> xen-devel@lists.xen.org
>>> Subject: Re: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for
>>> VT-d posted interrupts
>>>
>>> On Mon, Sep 7, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>>>>> --- a/xen/arch/x86/domain.c
>>>>> +++ b/xen/arch/x86/domain.c
>>>>> @@ -1573,6 +1573,22 @@ static void __context_switch(void)
>>>>> per_cpu(curr_vcpu, cpu) = n;
>>>>> }
>>>>>
>>>>> +static inline void pi_ctxt_switch_from(struct vcpu *prev)
>>>>> +{
>>>>> + /*
>>>>> + * When switching from non-idle to idle, we only do a lazy context switch.
>>>>> + * However, in order for posted interrupt (if available and enabled) to
>>>>> + * work properly, we at least need to update the descriptors.
>>>>> + */
>>>>> + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
>>>>> + prev->arch.pi_ctxt_switch_from(prev);
>>>>> +}
>>>>> +
>>>>> +static inline void pi_ctxt_switch_to(struct vcpu *next)
>>>>> +{
>>>>> + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
>>>>> + next->arch.pi_ctxt_switch_to(next);
>>>>> +}
>>>>>
>>>>> void context_switch(struct vcpu *prev, struct vcpu *next)
>>>>> {
>>>>> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>>>
>>>>> set_current(next);
>>>>>
>>>>> + pi_ctxt_switch_from(prev);
>>>>> +
>>>>> if ( (per_cpu(curr_vcpu, cpu) == next) ||
>>>>> (is_idle_domain(nextd) && cpu_online(cpu)) )
>>>>> {
>>>>> + pi_ctxt_switch_to(next);
>>>>> local_irq_enable();
>>>>
>>>> This placement, if really intended that way, needs explanation (in a
>>>> comment) and perhaps even renaming of the involved symbols, as
>>>> looking at it from a general perspective it seems wrong (with
>>>> pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
>>>> want this only when switching back to what got switched out lazily
>>>> before, i.e. this would be not something to take place on an arbitrary
>>>> context switch). As to possible alternative names - maybe make the
>>>> hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
>>>
>>> Why on earth is this more clear than what he had before?
>>>
>>> In the first call, he's not "preparing" anything -- he's actually
>>> switching the PI context out for prev. And in the second call, he's
>>> not "cancelling" anything -- he's actually switching the PI context in
>>> for next. The names you suggest are actively confusing, not helpful.
>>>
>>> But before talking about how to make things more clear, one side
>>> question -- do we need to actually call pi_ctxt_switch_to() in
>>> __context_switch()?
>>>
>>> The only other place __context_switch() is called is
>>> from __sync_local_execstate(). But the only reason that needs to be
>>> called is because sometimes we *don't* call __context_switch(), and so
>>> there are things on the cpu that aren't copied into the vcpu struct.
>>
>> Thanks for the comments!
>>
>> From my understanding, __sync_local_execstate() can only get called
>> in the following two cases:
>> #1) this_cpu(curr_vcpu) == current, in this case, __context_switch() is
>> not called.
>> #2) this_cpu(curr_vcpu) != current, and current == idle_vcpu, that means
>> we just switched from a non-idle vCPU to idle vCPU, so here we need to
>> call __context_switch() to copy things to the original vcpu struct.
>>
>> Please correct me if the above understanding is wrong or incomplete?
>
> Hi George / Dario,
>
> Could you please confirm the above understanding is correct? (In fact, it is
> related to lazy context switch, right?) If so, I can continue with the
> pi_context_switch() way George suggested.
Yes, that's the general idea. Normally, you can access the registers of
a non-running vcpu from the vcpu struct. But if we've done a lazy
context switch, that's not true -- so to access those registers properly
we need to go through and do the full context switch *on that pcpu*.
Since we need to do the full context switch for PI every time, there
should never be any "local" state which needs to be synced.
I think at this point you should probably just give it a try. :-)
-George
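For reference, the pi_context_switch(prev, next) shape suggested earlier in the thread could look roughly like this (untested pseudocode sketch, not a real patch; the combined hook name and placement are only what the discussion implies):

```
/* Pseudocode sketch: a single pi_context_switch(prev, next) hook called
 * unconditionally just after set_current(), replacing the split
 * pi_ctxt_switch_from()/pi_ctxt_switch_to() pair in the v6 patch. */
void context_switch(struct vcpu *prev, struct vcpu *next)
{
    ...
    set_current(next);

    pi_context_switch(prev, next);   /* handles both directions */

    if ( (per_cpu(curr_vcpu, cpu) == next) ||
         (is_idle_domain(nextd) && cpu_online(cpu)) )
    ...
}
```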
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-21 9:45 ` George Dunlap
@ 2015-09-21 12:07 ` Wu, Feng
0 siblings, 0 replies; 27+ messages in thread
From: Wu, Feng @ 2015-09-21 12:07 UTC (permalink / raw)
To: George Dunlap, George Dunlap, Dario Faggioli
Cc: Tian, Kevin, Wu, Feng, Andrew Cooper, xen-devel@lists.xen.org,
Jan Beulich, Keir Fraser
> >> Thanks for the comments!
> >>
> >> From my understanding, __sync_local_execstate() can only get called
> >> in the following two cases:
> >> #1) this_cpu(curr_vcpu) == current, in this case, __context_switch() is
> >> not called.
> >> #2) this_cpu(curr_vcpu) != current, and current == idle_vcpu, that means
> >> we just switched from a non-idle vCPU to idle vCPU, so here we need to
> >> call __context_switch() to copy things to the original vcpu struct.
> >>
> >> Please correct me if the above understanding is wrong or incomplete?
> >
> > Hi George / Dario,
> >
> > Could you please confirm the above understanding is correct? (In fact, it is
> > related to lazy context switch, right?) If so, I can continue with the
> > pi_context_switch() way George suggested.
>
> Yes, that's the general idea. Normally, you can access the registers of
> a non-running vcpu from the vcpu struct. But if we've done a lazy
> context switch, that's not true -- so to access those registers properly
> we need to go through and do the full context switch *on that pcpu*.
Thanks for the clarification!
Thanks,
Feng
* [PATCH v6 00/18] Add VT-d Posted-Interrupts support
@ 2015-08-25 1:57 Feng Wu
2015-08-25 1:57 ` [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts Feng Wu
0 siblings, 1 reply; 27+ messages in thread
From: Feng Wu @ 2015-08-25 1:57 UTC (permalink / raw)
To: xen-devel; +Cc: Feng Wu
VT-d Posted-Interrupts is an enhancement to the CPU-side posted-interrupt
support. With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention while the guest is running in non-root mode.
You can find the VT-d Posted-Interrupts spec at the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
Feng Wu (18):
VT-d Posted-interrupt (PI) design
Add cmpxchg16b support for x86-64
iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
vt-d: VT-d Posted-Interrupts feature detection
vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
vmx: Add some helper functions for Posted-Interrupts
vmx: Initialize VT-d Posted-Interrupts Descriptor
vmx: Suppress posting interrupts when 'SN' is set
VT-d: Remove pointless casts
vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
vt-d: Add API to update IRTE when VT-d PI is used
x86: move some APIC related macros to apicdef.h
Update IRTE according to guest interrupt config changes
vmx: posted-interrupt handling when vCPU is blocked
vmx: Properly handle notification event when vCPU is running
vmx: Add some scheduler hooks for VT-d posted interrupts
VT-d: Dump the posted format IRTE
Add a command line parameter for VT-d posted-interrupts
docs/misc/vtd-pi.txt | 332 +++++++++++++++++++++++++++++++++
docs/misc/xen-command-line.markdown | 9 +-
xen/arch/x86/domain.c | 19 ++
xen/arch/x86/hvm/vlapic.c | 5 -
xen/arch/x86/hvm/vmx/vmcs.c | 21 +++
xen/arch/x86/hvm/vmx/vmx.c | 289 +++++++++++++++++++++++++++-
xen/common/schedule.c | 2 +
xen/drivers/passthrough/io.c | 125 ++++++++++++-
xen/drivers/passthrough/iommu.c | 16 +-
xen/drivers/passthrough/vtd/intremap.c | 199 +++++++++++++++-----
xen/drivers/passthrough/vtd/iommu.c | 17 +-
xen/drivers/passthrough/vtd/iommu.h | 50 +++--
xen/drivers/passthrough/vtd/utils.c | 59 ++++--
xen/include/asm-arm/domain.h | 2 +
xen/include/asm-x86/apicdef.h | 4 +
xen/include/asm-x86/domain.h | 3 +
xen/include/asm-x86/hvm/hvm.h | 2 +
xen/include/asm-x86/hvm/vmx/vmcs.h | 26 ++-
xen/include/asm-x86/hvm/vmx/vmx.h | 28 +++
xen/include/asm-x86/iommu.h | 2 +
xen/include/asm-x86/x86_64/system.h | 28 +++
xen/include/xen/iommu.h | 2 +-
22 files changed, 1155 insertions(+), 85 deletions(-)
create mode 100644 docs/misc/vtd-pi.txt
--
2.1.0
* [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-08-25 1:57 [PATCH v6 00/18] Add VT-d Posted-Interrupts support Feng Wu
@ 2015-08-25 1:57 ` Feng Wu
2015-09-07 12:54 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Feng Wu @ 2015-08-25 1:57 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, Jan Beulich, Feng Wu
This patch adds the following arch hooks in scheduler:
- vmx_pre_ctx_switch_pi():
It is called before context switch; we update the posted
interrupt descriptor when the vCPU is preempted, goes to sleep,
or is blocked.
- vmx_post_ctx_switch_pi()
It is called after context switch; we update the posted
interrupt descriptor when the vCPU is going to run.
- arch_vcpu_wake_prepare()
It will be called when waking up the vCPU; we update
the posted interrupt descriptor when the vCPU is unblocked.
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Dario Faggioli <dario.faggioli@citrix.com>
Suggested-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
---
v6:
- Add two static inline functions for pi context switch
- Fix typos
v5:
- Rename arch_vcpu_wake to arch_vcpu_wake_prepare
- Make arch_vcpu_wake_prepare() inline for ARM
- Merge the ARM dummy hooks together
- Changes to some code comments
- Leave 'pi_ctxt_switch_from' and 'pi_ctxt_switch_to' NULL if
PI is disabled or the vCPU is not in HVM
- Coding style
v4:
- Newly added
xen/arch/x86/domain.c | 19 +++++
xen/arch/x86/hvm/vmx/vmx.c | 147 +++++++++++++++++++++++++++++++++++++
xen/common/schedule.c | 2 +
xen/include/asm-arm/domain.h | 2 +
xen/include/asm-x86/domain.h | 3 +
xen/include/asm-x86/hvm/hvm.h | 2 +
xen/include/asm-x86/hvm/vmx/vmcs.h | 8 ++
7 files changed, 183 insertions(+)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..443986e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1573,6 +1573,22 @@ static void __context_switch(void)
per_cpu(curr_vcpu, cpu) = n;
}
+static inline void pi_ctxt_switch_from(struct vcpu *prev)
+{
+ /*
+ * When switching from non-idle to idle, we only do a lazy context switch.
+ * However, in order for posted interrupt (if available and enabled) to
+ * work properly, we at least need to update the descriptors.
+ */
+ if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
+ prev->arch.pi_ctxt_switch_from(prev);
+}
+
+static inline void pi_ctxt_switch_to(struct vcpu *next)
+{
+ if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
+ next->arch.pi_ctxt_switch_to(next);
+}
void context_switch(struct vcpu *prev, struct vcpu *next)
{
@@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
set_current(next);
+ pi_ctxt_switch_from(prev);
+
if ( (per_cpu(curr_vcpu, cpu) == next) ||
(is_idle_domain(nextd) && cpu_online(cpu)) )
{
+ pi_ctxt_switch_to(next);
local_irq_enable();
}
else
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5167fae..889ede3 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -67,6 +67,8 @@ enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };
static void vmx_ctxt_switch_from(struct vcpu *v);
static void vmx_ctxt_switch_to(struct vcpu *v);
+static void vmx_pre_ctx_switch_pi(struct vcpu *v);
+static void vmx_post_ctx_switch_pi(struct vcpu *v);
static int vmx_alloc_vlapic_mapping(struct domain *d);
static void vmx_free_vlapic_mapping(struct domain *d);
@@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
+ v->arch.hvm_vmx.pi_block_cpu = -1;
+
+ spin_lock_init(&v->arch.hvm_vmx.pi_lock);
+
v->arch.schedule_tail = vmx_do_resume;
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
+ if ( iommu_intpost && is_hvm_vcpu(v) )
+ {
+ v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
+ v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
+ }
+
if ( (rc = vmx_create_vmcs(v)) != 0 )
{
dprintk(XENLOG_WARNING,
@@ -718,6 +730,140 @@ static void vmx_fpu_leave(struct vcpu *v)
}
}
+void arch_vcpu_wake_prepare(struct vcpu *v)
+{
+ unsigned long gflags;
+
+ if ( !iommu_intpost || !is_hvm_vcpu(v) || !has_arch_pdevs(v->domain) )
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( likely(vcpu_runnable(v)) ||
+ !test_bit(_VPF_blocked, &v->pause_flags) )
+ {
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ unsigned long flags;
+
+ /*
+ * We don't need to send notification event to a non-running
+ * vcpu, the interrupt information will be delivered to it before
+ * VM-ENTRY when the vcpu is scheduled to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ /*
+ * Set 'NV' field back to posted_intr_vector, so the
+ * Posted-Interrupts can be delivered to the vCPU by
+ * VT-d HW after it is scheduled to run.
+ */
+ write_atomic((uint8_t*)&pi_desc->nv, posted_intr_vector);
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+ spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ }
+ }
+
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
+}
+
+static void vmx_pre_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ struct pi_desc old, new;
+ unsigned long flags, gflags;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
+ {
+ /*
+ * The vCPU has been preempted or went to sleep. We don't need to send
+ * notification event to a non-running vcpu, the interrupt information
+ * will be delivered to it before VM-ENTRY when the vcpu is scheduled
+ * to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ }
+ else if ( test_bit(_VPF_blocked, &v->pause_flags) )
+ {
+ /*
+ * The vCPU is blocking, we need to add it to one of the per pCPU lists.
+ * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
+ * the per-CPU list, we also save it to posted-interrupt descriptor and
+ * make it the destination of the wake-up notification event.
+ */
+ v->arch.hvm_vmx.pi_block_cpu = v->processor;
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+ &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
+ spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted for it. */
+ if ( pi_test_on(&old) )
+ {
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
+ vcpu_unblock(v);
+ return;
+ }
+
+ /*
+ * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+ * so when external interrupts from assigned devices happen,
+ * the wakeup notification event will go to
+ * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
+ * we can find the vCPU in the right list to wake up.
+ */
+ if ( x2apic_enabled )
+ new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
+ else
+ new.ndst = MASK_INSR(cpu_physical_id(
+ v->arch.hvm_vmx.pi_block_cpu),
+ PI_xAPIC_NDST_MASK);
+ pi_clear_sn(&new);
+ new.nv = pi_wakeup_vector;
+ } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+ != old.control );
+ }
+
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
+}
+
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
+
static void vmx_ctxt_switch_from(struct vcpu *v)
{
/*
@@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
vmx_restore_guest_msrs(v);
vmx_restore_dr(v);
+ vmx_post_ctx_switch_pi(v);
}
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 3eefed7..bc49098 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -412,6 +412,8 @@ void vcpu_wake(struct vcpu *v)
unsigned long flags;
spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
+ arch_vcpu_wake_prepare(v);
+
if ( likely(vcpu_runnable(v)) )
{
if ( v->runstate.state >= RUNSTATE_blocked )
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index 56aa208..cffe2c6 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -301,6 +301,8 @@ static inline register_t vcpuid_to_vaffinity(unsigned int vcpuid)
return vaff;
}
+static inline void arch_vcpu_wake_prepare(struct vcpu *v) {}
+
#endif /* __ASM_DOMAIN_H__ */
/*
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 0fce09e..979210a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -481,6 +481,9 @@ struct arch_vcpu
void (*ctxt_switch_from) (struct vcpu *);
void (*ctxt_switch_to) (struct vcpu *);
+ void (*pi_ctxt_switch_from) (struct vcpu *);
+ void (*pi_ctxt_switch_to) (struct vcpu *);
+
struct vpmu_struct vpmu;
/* Virtual Machine Extensions */
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 3cac64f..95f5357 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -545,6 +545,8 @@ static inline bool_t hvm_altp2m_supported(void)
return hvm_funcs.altp2m_supported;
}
+void arch_vcpu_wake_prepare(struct vcpu *v);
+
#ifndef NDEBUG
/* Permit use of the Forced Emulation Prefix in HVM guests */
extern bool_t opt_hvm_fep;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 9a986d0..209fb39 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -164,6 +164,14 @@ struct arch_vmx_struct {
struct list_head pi_blocked_vcpu_list;
struct list_head pi_vcpu_on_set_list;
+
+ /*
+ * Before the vCPU is blocked, it is added to the per-cpu blocked list
+ * of 'pi_block_cpu'; the VT-d engine can then send the wakeup notification
+ * event to 'pi_block_cpu' and wake up the related vCPU.
+ */
+ int pi_block_cpu;
+ spinlock_t pi_lock;
};
int vmx_create_vmcs(struct vcpu *v);
--
2.1.0
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-08-25 1:57 ` [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts Feng Wu
@ 2015-09-07 12:54 ` Jan Beulich
2015-09-09 8:56 ` Wu, Feng
2015-09-16 16:56 ` George Dunlap
0 siblings, 2 replies; 27+ messages in thread
From: Jan Beulich @ 2015-09-07 12:54 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel
>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1573,6 +1573,22 @@ static void __context_switch(void)
> per_cpu(curr_vcpu, cpu) = n;
> }
>
> +static inline void pi_ctxt_switch_from(struct vcpu *prev)
> +{
> + /*
> + * When switching from non-idle to idle, we only do a lazy context switch.
> + * However, in order for posted interrupt (if available and enabled) to
> + * work properly, we at least need to update the descriptors.
> + */
> + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
> + prev->arch.pi_ctxt_switch_from(prev);
> +}
> +
> +static inline void pi_ctxt_switch_to(struct vcpu *next)
> +{
> + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
> + next->arch.pi_ctxt_switch_to(next);
> +}
>
> void context_switch(struct vcpu *prev, struct vcpu *next)
> {
> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>
> set_current(next);
>
> + pi_ctxt_switch_from(prev);
> +
> if ( (per_cpu(curr_vcpu, cpu) == next) ||
> (is_idle_domain(nextd) && cpu_online(cpu)) )
> {
> + pi_ctxt_switch_to(next);
> local_irq_enable();
This placement, if really intended that way, needs explanation (in a
comment) and perhaps even renaming of the involved symbols, as
looking at it from a general perspective it seems wrong (with
pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
want this only when switching back to what got switched out lazily
before, i.e. this would be not something to take place on an arbitrary
context switch). As to possible alternative names - maybe make the
hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
> @@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
>
> + v->arch.hvm_vmx.pi_block_cpu = -1;
> +
> + spin_lock_init(&v->arch.hvm_vmx.pi_lock);
> +
> v->arch.schedule_tail = vmx_do_resume;
> v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
> v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
>
> + if ( iommu_intpost && is_hvm_vcpu(v) )
> + {
> + v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
> + v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
> + }
Why conditional upon is_hvm_vcpu()?
> @@ -718,6 +730,140 @@ static void vmx_fpu_leave(struct vcpu *v)
> }
> }
>
> +void arch_vcpu_wake_prepare(struct vcpu *v)
A non-static function with this name should not live in vmx.c.
> +{
> + unsigned long gflags;
Just "flags" please.
> + if ( !iommu_intpost || !is_hvm_vcpu(v) || !has_arch_pdevs(v->domain) )
DYM !has_hvm_container_vcpu()?
> + return;
> +
> + spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
> +
> + if ( likely(vcpu_runnable(v)) ||
> + !test_bit(_VPF_blocked, &v->pause_flags) )
_VPF_blocked set implies !vcpu_runnable(), i.e. afaict just the
latter check would suffice if this is really what you mean.
> + {
> + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> + unsigned long flags;
You don't need this - you already know IRQs are off for the lock
acquire below.
> + /*
> + * We don't need to send notification event to a non-running
> + * vcpu, the interrupt information will be delivered to it before
> + * VM-ENTRY when the vcpu is scheduled to run next time.
> + */
> + pi_set_sn(pi_desc);
> +
> + /*
> + * Set 'NV' field back to posted_intr_vector, so the
> + * Posted-Interrupts can be delivered to the vCPU by
> + * VT-d HW after it is scheduled to run.
> + */
> + write_atomic((uint8_t*)&pi_desc->nv, posted_intr_vector);
Bogus cast.
> +
> + /*
> + * Delete the vCPU from the related block list
> + * if we are resuming from blocked state.
> + */
> + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
> + {
> + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> + v->arch.hvm_vmx.pi_block_cpu), flags);
> + list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
> +static void vmx_pre_ctx_switch_pi(struct vcpu *v)
> +{
> + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> + struct pi_desc old, new;
Please limit the scope of these two variables.
> + unsigned long flags, gflags;
See above.
> + if ( !has_arch_pdevs(v->domain) )
> + return;
> +
> + spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
> +
> + if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
See above.
> + {
> + /*
> + * The vCPU has been preempted or went to sleep. We don't need to send
> + * notification event to a non-running vcpu, the interrupt information
> + * will be delivered to it before VM-ENTRY when the vcpu is scheduled
> + * to run next time.
> + */
> + pi_set_sn(pi_desc);
> +
> + }
> + else if ( test_bit(_VPF_blocked, &v->pause_flags) )
The condition here is redundant with the if() one above.
> + {
> + /*
> + * The vCPU is blocking, we need to add it to one of the per pCPU lists.
> + * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
> + * the per-CPU list, we also save it to posted-interrupt descriptor and
> + * make it the destination of the wake-up notification event.
> + */
> + v->arch.hvm_vmx.pi_block_cpu = v->processor;
> + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> + v->arch.hvm_vmx.pi_block_cpu), flags);
> + list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> + &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> + spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> + v->arch.hvm_vmx.pi_block_cpu), flags);
> +
> + do {
> + old.control = new.control = pi_desc->control;
> +
> + /* Should not block the vCPU if an interrupt was posted for it. */
> + if ( pi_test_on(&old) )
> + {
> + spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
> + vcpu_unblock(v);
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe? And if really needed to cover something not dealt with
elsewhere, wouldn't this need to happen _after_ having switched
the notification mechanism below?
> + return;
> + }
> +
> + /*
> + * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
> > + * so when external interrupts from assigned devices happen,
> > + * wakeup notification event will go to
> + * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
> + * we can find the vCPU in the right list to wake up.
> + */
> + if ( x2apic_enabled )
> + new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
> + else
> + new.ndst = MASK_INSR(cpu_physical_id(
> + v->arch.hvm_vmx.pi_block_cpu),
> + PI_xAPIC_NDST_MASK);
Indentation is screwed up here. Perhaps it would help if you latched
> cpu_physical_id() into a local variable and used it on both branches?
> + pi_clear_sn(&new);
> + new.nv = pi_wakeup_vector;
> + } while ( cmpxchg(&pi_desc->control, old.control, new.control)
> + != old.control );
Operators belong on the earlier of the split up lines.
> +static void vmx_post_ctx_switch_pi(struct vcpu *v)
> +{
> + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> + if ( !has_arch_pdevs(v->domain) )
> + return;
> +
> + if ( x2apic_enabled )
> + write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> + else
> + write_atomic(&pi_desc->ndst,
> + MASK_INSR(cpu_physical_id(v->processor),
> + PI_xAPIC_NDST_MASK));
> +
> + pi_clear_sn(pi_desc);
> +}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
> @@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
>
> vmx_restore_guest_msrs(v);
> vmx_restore_dr(v);
> + vmx_post_ctx_switch_pi(v);
> }
Ah, here is the other invocation! But no, this shouldn't be done this
way. This really belongs into vendor independent code, even if the
indirect call overhead is slightly higher.
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -164,6 +164,14 @@ struct arch_vmx_struct {
>
> struct list_head pi_blocked_vcpu_list;
> struct list_head pi_vcpu_on_set_list;
> +
> + /*
> + * Before vCPU is blocked, it is added to the global per-cpu list
> + * of 'pi_block_cpu', then VT-d engine can send wakeup notification
> + * event to 'pi_block_cpu' and wakeup the related vCPU.
> + */
> + int pi_block_cpu;
I can see that using int is in line with storing -1 into the field. But
generally CPU numbers should be unsigned. Please make it so,
using NR_CPUS or UINT_MAX in place of the -1.
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-07 12:54 ` Jan Beulich
@ 2015-09-09 8:56 ` Wu, Feng
2015-09-09 10:26 ` Jan Beulich
2015-09-16 16:56 ` George Dunlap
1 sibling, 1 reply; 27+ messages in thread
From: Wu, Feng @ 2015-09-09 8:56 UTC (permalink / raw)
To: Jan Beulich
Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, September 07, 2015 8:55 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -1573,6 +1573,22 @@ static void __context_switch(void)
> > per_cpu(curr_vcpu, cpu) = n;
> > }
> >
> > +static inline void pi_ctxt_switch_from(struct vcpu *prev)
> > +{
> > + /*
> > + * When switching from non-idle to idle, we only do a lazy context
> switch.
> > + * However, in order for posted interrupt (if available and enabled) to
> > + * work properly, we at least need to update the descriptors.
> > + */
> > + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
> > + prev->arch.pi_ctxt_switch_from(prev);
> > +}
> > +
> > +static inline void pi_ctxt_switch_to(struct vcpu *next)
> > +{
> > + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
> > + next->arch.pi_ctxt_switch_to(next);
> > +}
> >
> > void context_switch(struct vcpu *prev, struct vcpu *next)
> > {
> > @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct
> vcpu *next)
> >
> > set_current(next);
> >
> > + pi_ctxt_switch_from(prev);
> > +
> > if ( (per_cpu(curr_vcpu, cpu) == next) ||
> > (is_idle_domain(nextd) && cpu_online(cpu)) )
> > {
> > + pi_ctxt_switch_to(next);
> > local_irq_enable();
>
> This placement, if really intended that way, needs explanation (in a
> comment) and perhaps even renaming of the involved symbols, as
I think maybe removing the static wrapper function can make things
clearer. What is your opinion?
> looking at it from a general perspective it seems wrong (with
> pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
> want this only when switching back to what got switched out lazily
> before, i.e. this would be not something to take place on an arbitrary
> context switch). As to possible alternative names - maybe make the
> hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
>
> > @@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> > INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
> >
> > + v->arch.hvm_vmx.pi_block_cpu = -1;
> > +
> > + spin_lock_init(&v->arch.hvm_vmx.pi_lock);
> > +
> > v->arch.schedule_tail = vmx_do_resume;
> > v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
> > v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
> >
> > + if ( iommu_intpost && is_hvm_vcpu(v) )
> > + {
> > + v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
> > + v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
> > + }
>
> Why conditional upon is_hvm_vcpu()?
I only think we need to add these hooks for PV guests, right?
Or you mean I should use has_hvm_container_vcpu() instead?
>
> > @@ -718,6 +730,140 @@ static void vmx_fpu_leave(struct vcpu *v)
> > }
> > }
> >
> > +void arch_vcpu_wake_prepare(struct vcpu *v)
>
> A non-static function with this name should not live in vmx.c.
>
> > +{
> > + unsigned long gflags;
>
> Just "flags" please.
>
> > + if ( !iommu_intpost || !is_hvm_vcpu(v)
> || !has_arch_pdevs(v->domain) )
>
> DYM !has_hvm_container_vcpu()?
>
> > + return;
> > +
> > + spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
> > +
> > + if ( likely(vcpu_runnable(v)) ||
> > + !test_bit(_VPF_blocked, &v->pause_flags) )
>
> _VPF_blocked set implies !vcpu_runnable(), i.e. afaict just the
> latter check would suffice if this is really what you mean.
>
> > + {
> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > + unsigned long flags;
>
> You don't need this - you already know IRQs are off for the lock
> acquire below.
>
> > + /*
> > + * We don't need to send notification event to a non-running
> > + * vcpu, the interrupt information will be delivered to it before
> > + * VM-ENTRY when the vcpu is scheduled to run next time.
> > + */
> > + pi_set_sn(pi_desc);
> > +
> > + /*
> > + * Set 'NV' field back to posted_intr_vector, so the
> > + * Posted-Interrupts can be delivered to the vCPU by
> > + * VT-d HW after it is scheduled to run.
> > + */
> > + write_atomic((uint8_t*)&pi_desc->nv, posted_intr_vector);
>
> Bogus cast.
>
> > +
> > + /*
> > + * Delete the vCPU from the related block list
> > + * if we are resuming from blocked state.
> > + */
> > + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
> > + {
> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > + v->arch.hvm_vmx.pi_block_cpu),
> flags);
> > + list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>
> Shouldn't you set .pi_block_cpu back to -1 along with removing
> the vCPU from the list? Which then ought to allow using just
> list_del() here?
Here is the story about using list_del_init() instead of list_del() (the
same reason holds for using list_del_init() in pi_wakeup_interrupt()).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
places where the vCPU is removed from the list:
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()
That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me: since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(); someone else may be using
it at the same time.
And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only needed because .pi_block_cpu starts out as -1 at vCPU
initialisation.
If you have any good suggestions about this, I will be all ears.
>
> > +static void vmx_pre_ctx_switch_pi(struct vcpu *v)
> > +{
> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > + struct pi_desc old, new;
>
> Please limit the scope of these two variables.
>
> > + unsigned long flags, gflags;
>
> See above.
>
> > + if ( !has_arch_pdevs(v->domain) )
> > + return;
> > +
> > + spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
> > +
> > + if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
>
> See above.
>
> > + {
> > + /*
> > + * The vCPU has been preempted or went to sleep. We don't need
> to send
> > + * notification event to a non-running vcpu, the interrupt
> information
> > + * will be delivered to it before VM-ENTRY when the vcpu is
> scheduled
> > + * to run next time.
> > + */
> > + pi_set_sn(pi_desc);
> > +
> > + }
> > + else if ( test_bit(_VPF_blocked, &v->pause_flags) )
>
> The condition here is redundant with the if() one above.
>
> > + {
> > + /*
> > + * The vCPU is blocking, we need to add it to one of the per pCPU
> lists.
> > + * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and
> use it for
> > + * the per-CPU list, we also save it to posted-interrupt descriptor
> and
> > + * make it the destination of the wake-up notification event.
> > + */
> > + v->arch.hvm_vmx.pi_block_cpu = v->processor;
> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > + v->arch.hvm_vmx.pi_block_cpu), flags);
> > + list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> > + &per_cpu(pi_blocked_vcpu,
> v->arch.hvm_vmx.pi_block_cpu));
> > + spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> > + v->arch.hvm_vmx.pi_block_cpu), flags);
> > +
> > + do {
> > + old.control = new.control = pi_desc->control;
> > +
> > + /* Should not block the vCPU if an interrupt was posted for it.
> */
> > + if ( pi_test_on(&old) )
> > + {
> > + spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
> gflags);
> > + vcpu_unblock(v);
>
> Calling vcpu_unblock() in the middle of context_switch()? Why? And
> is this safe?
I cannot see anything unsafe so far, can some scheduler maintainer
help to confirm it? Dario? George?
> And if really needed to cover something not dealt with
> elsewhere, wouldn't this need to happen _after_ having switched
> the notification mechanism below?
If ON is set, we don't need to block the vCPU hence no need to change the
notification vector to the wakeup one, which is used when vCPU is blocked.
>
> > + return;
> > + }
> > +
> > + /*
> > + * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
> > + * so when external interrupts from assigned devices
> happen,
> > + * wakeup notification event will go to
> > + * v->arch.hvm_vmx.pi_block_cpu, then in
> pi_wakeup_interrupt()
> > + * we can find the vCPU in the right list to wake up.
> > + */
> > + if ( x2apic_enabled )
> > + new.ndst =
> cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
> > + else
> > + new.ndst = MASK_INSR(cpu_physical_id(
> > + v->arch.hvm_vmx.pi_block_cpu),
> > + PI_xAPIC_NDST_MASK);
>
> Indentation is screwed up here. Perhaps it would help if you latched
> cpu_physical_id() into a local variable and used it on both branches?
>
> > + pi_clear_sn(&new);
> > + new.nv = pi_wakeup_vector;
> > + } while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > + != old.control );
>
> Operators belong on the earlier of the split up lines.
>
> > +static void vmx_post_ctx_switch_pi(struct vcpu *v)
> > +{
> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > + if ( !has_arch_pdevs(v->domain) )
> > + return;
> > +
> > + if ( x2apic_enabled )
> > + write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> > + else
> > + write_atomic(&pi_desc->ndst,
> > + MASK_INSR(cpu_physical_id(v->processor),
> > + PI_xAPIC_NDST_MASK));
> > +
> > + pi_clear_sn(pi_desc);
> > +}
>
> So you alter where notifications go, but not via which vector? How
> is the vCPU going to get removed from the blocked list then?
vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
in that case, the vector has been set properly and it has been removed
from the block list if it was blocked before.
>
> > @@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
> >
> > vmx_restore_guest_msrs(v);
> > vmx_restore_dr(v);
> > + vmx_post_ctx_switch_pi(v);
> > }
>
> Ah, here is the other invocation! But no, this shouldn't be done this
> way. This really belongs into vendor independent code, even if the
> indirect call overhead is slightly higher.
Why do you say this is vendor independent? What is your suggestion?
Thanks,
Feng
>
> > --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> > @@ -164,6 +164,14 @@ struct arch_vmx_struct {
> >
> > struct list_head pi_blocked_vcpu_list;
> > struct list_head pi_vcpu_on_set_list;
> > +
> > + /*
> > + * Before vCPU is blocked, it is added to the global per-cpu list
> > + * of 'pi_block_cpu', then VT-d engine can send wakeup notification
> > + * event to 'pi_block_cpu' and wakeup the related vCPU.
> > + */
> > + int pi_block_cpu;
>
> I can see that using int is in line with storing -1 into the field. But
> generally CPU numbers should be unsigned. Please make it so,
> using NR_CPUS or UINT_MAX in place of the -1.
>
> Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-09 8:56 ` Wu, Feng
@ 2015-09-09 10:26 ` Jan Beulich
2015-09-10 2:07 ` Wu, Feng
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-09-09 10:26 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org
>>> On 09.09.15 at 10:56, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Monday, September 07, 2015 8:55 PM
>> >>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>> > @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct
>> vcpu *next)
>> >
>> > set_current(next);
>> >
>> > + pi_ctxt_switch_from(prev);
>> > +
>> > if ( (per_cpu(curr_vcpu, cpu) == next) ||
>> > (is_idle_domain(nextd) && cpu_online(cpu)) )
>> > {
>> > + pi_ctxt_switch_to(next);
>> > local_irq_enable();
>>
>> This placement, if really intended that way, needs explanation (in a
>> comment) and perhaps even renaming of the involved symbols, as
>
> I think maybe removing the static wrapper function can make things
> clearer. What is your opinion?
Considering that there's going to be a second use of the
switch-to variant, I think the helpers are okay to be kept (but
I wouldn't insist on that), just that they need to be named in
a away making clear what their purpose is.
>> > @@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>> > INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>> > INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
>> >
>> > + v->arch.hvm_vmx.pi_block_cpu = -1;
>> > +
>> > + spin_lock_init(&v->arch.hvm_vmx.pi_lock);
>> > +
>> > v->arch.schedule_tail = vmx_do_resume;
>> > v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
>> > v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
>> >
>> > + if ( iommu_intpost && is_hvm_vcpu(v) )
>> > + {
>> > + v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
>> > + v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
>> > + }
>>
>> Why conditional upon is_hvm_vcpu()?
>
> I only think we need to add these hooks for PV guests, right?
"... we don't need to ..."?
> Or you mean I should use has_hvm_container_vcpu() instead?
Exactly.
>> > +
>> > + /*
>> > + * Delete the vCPU from the related block list
>> > + * if we are resuming from blocked state.
>> > + */
>> > + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
>> > + {
>> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
>> > + v->arch.hvm_vmx.pi_block_cpu),
>> flags);
>> > + list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>>
>> Shouldn't you set .pi_block_cpu back to -1 along with removing
>> the vCPU from the list? Which then ought to allow using just
>> list_del() here?
>
> Here is the story about using list_del_init() instead of list_del(), (the
> same reason for using list_del_init() in pi_wakeup_interrupt() ).
> If we use list_del(), that means we need to set .pi_block_cpu to
> -1 after removing the vCPU from the block list, there are two
> places where the vCPU is removed from the list:
> - arch_vcpu_wake_prepare()
> - pi_wakeup_interrupt()
>
> That means we also need to set .pi_block_cpu to -1 in
> pi_wakeup_interrupt(), which seems problematic to me, since
> .pi_block_cpu is used to get the per-CPU spin lock, it is not safe
> to change it in pi_wakeup_interrupt(), maybe someone is using
> it at the same time.
>
> And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
> is only used when .pi_block_cpu is set to -1 at the initial time.
>
> If you have any good suggestions about this, I will be all ears.
Latching pi_block_cpu into a local variable would take care of that
part of the problem. Of course you then may also need to check
that while waiting to acquire the lock pi_block_cpu didn't change.
And if that absolutely can't be made work correctly, these
apparently strange uses of list_del_init() would each need a
justifying comment.
>> > + do {
>> > + old.control = new.control = pi_desc->control;
>> > +
>> > + /* Should not block the vCPU if an interrupt was posted for it.
>> */
>> > + if ( pi_test_on(&old) )
>> > + {
>> > + spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
>> gflags);
>> > + vcpu_unblock(v);
>>
>> Calling vcpu_unblock() in the middle of context_switch()? Why? And
>> is this safe?
>
> I cannot see anything unsafe so far, can some scheduler maintainer
> help to confirm it? Dario? George?
But first and foremost you ought to answer the "why".
>> And if really needed to cover something not dealt with
>> elsewhere, wouldn't this need to happen _after_ having switched
>> the notification mechanism below?
>
> If ON is set, we don't need to block the vCPU hence no need to change the
> notification vector to the wakeup one, which is used when vCPU is blocked.
>
>>
>> > + return;
Oh, right, I somehow managed to ignore the "return" here.
>> > +static void vmx_post_ctx_switch_pi(struct vcpu *v)
>> > +{
>> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> > +
>> > + if ( !has_arch_pdevs(v->domain) )
>> > + return;
>> > +
>> > + if ( x2apic_enabled )
>> > + write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
>> > + else
>> > + write_atomic(&pi_desc->ndst,
>> > + MASK_INSR(cpu_physical_id(v->processor),
>> > + PI_xAPIC_NDST_MASK));
>> > +
>> > + pi_clear_sn(pi_desc);
>> > +}
>>
>> So you alter where notifications go, but not via which vector? How
>> is the vCPU going to get removed from the blocked list then?
>
> vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
> in that case, the vector has been set properly and it has been removed
> from the block list if it was blocked before.
So why do you do two updates then (first [elsewhere] to set the vector
and then here to set the destination)?
>> > @@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
>> >
>> > vmx_restore_guest_msrs(v);
>> > vmx_restore_dr(v);
>> > + vmx_post_ctx_switch_pi(v);
>> > }
>>
>> Ah, here is the other invocation! But no, this shouldn't be done this
>> way. This really belongs into vendor independent code, even if the
>> indirect call overhead is slightly higher.
>
> Why do you say this is vendor independent? What is your suggestion?
This needs to become a second invocation of pi_ctxt_switch_to()
from the context switch code.
Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-09 10:26 ` Jan Beulich
@ 2015-09-10 2:07 ` Wu, Feng
2015-09-10 8:27 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Wu, Feng @ 2015-09-10 2:07 UTC (permalink / raw)
To: Jan Beulich
Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, September 09, 2015 6:27 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >>> On 09.09.15 at 10:56, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Monday, September 07, 2015 8:55 PM
> >> >>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
> >> > +
> >> > + /*
> >> > + * Delete the vCPU from the related block list
> >> > + * if we are resuming from blocked state.
> >> > + */
> >> > + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
> >> > + {
> >> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> >> > + v->arch.hvm_vmx.pi_block_cpu),
> >> flags);
> >> > + list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> >>
> >> Shouldn't you set .pi_block_cpu back to -1 along with removing
> >> the vCPU from the list? Which then ought to allow using just
> >> list_del() here?
> >
> > Here is the story about using list_del_init() instead of list_del(), (the
> > same reason for using list_del_init() in pi_wakeup_interrupt() ).
> > If we use list_del(), that means we need to set .pi_block_cpu to
> > -1 after removing the vCPU from the block list, there are two
> > places where the vCPU is removed from the list:
> > - arch_vcpu_wake_prepare()
> > - pi_wakeup_interrupt()
> >
> > That means we also need to set .pi_block_cpu to -1 in
> > pi_wakeup_interrupt(), which seems problematic to me, since
> > .pi_block_cpu is used to get the per-CPU spin lock, it is not safe
> > to change it in pi_wakeup_interrupt(), maybe someone is using
> > it at the same time.
> >
> > And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
> > is only used when .pi_block_cpu is set to -1 at the initial time.
> >
> > If you have any good suggestions about this, I will be all ears.
>
> Latching pi_block_cpu into a local variable would take care of that
> part of the problem. Of course you then may also need to check
> that while waiting to acquire the lock pi_block_cpu didn't change.
Thanks for the suggestion! But I think it is not easy to "check
that while waiting to acquire the lock pi_block_cpu didn't change".
Let's take the following pseudo code as an example:
    int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;

    ......

    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
                      local_pi_block_cpu), flags);
    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
                           local_pi_block_cpu), flags);
If .pi_block_cpu is changed to -1 by someone else before we acquire
the lock, that means the vCPU has already been removed from the list,
and calling list_del() here again would be problematic. I think there
might be other corner cases like this, so I suggest adding some
comments for list_del_init() as you mentioned below.
> And if that absolutely can't be made work correctly, these
> apparently strange uses of list_del_init() would each need a
> justifying comment.
>
> >> > + do {
> >> > + old.control = new.control = pi_desc->control;
> >> > +
> >> > + /* Should not block the vCPU if an interrupt was posted for
> it.
> >> */
> >> > + if ( pi_test_on(&old) )
> >> > + {
> >> > + spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
> >> gflags);
> >> > + vcpu_unblock(v);
> >>
> >> Calling vcpu_unblock() in the middle of context_switch()? Why? And
> >> is this safe?
> >
> > I cannot see anything unsafe so far, can some scheduler maintainer
> > help to confirm it? Dario? George?
>
> But first and foremost you ought to answer the "why".
I think if you consider this solution unsafe, you need to point out where
the problem is, not just ask "is this safe?"; I don't think that is an
effective comment.
Anyway, I went through the main path of the code as below, and I really
don't find anything unsafe here.
context_switch() -->
    pi_ctxt_switch_from() -->
        vmx_pre_ctx_switch_pi() -->
            vcpu_unblock() -->
                vcpu_wake() -->
                    SCHED_OP(VCPU2OP(v), wake, v) -->
                        csched_vcpu_wake() -->
                            __runq_insert()
                            __runq_tickle()
If you still think it is unsafe, pointing out the exact place would be welcome!
>
> >> And if really needed to cover something not dealt with
> >> elsewhere, wouldn't this need to happen _after_ having switched
> >> the notification mechanism below?
> >
> > If ON is set, we don't need to block the vCPU hence no need to change the
> > notification vector to the wakeup one, which is used when vCPU is blocked.
> >
> >>
> >> > + return;
>
> Oh, right, I somehow managed to ignore the "return" here.
>
> >> > +static void vmx_post_ctx_switch_pi(struct vcpu *v)
> >> > +{
> >> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> >> > +
> >> > + if ( !has_arch_pdevs(v->domain) )
> >> > + return;
> >> > +
> >> > + if ( x2apic_enabled )
> >> > + write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> >> > + else
> >> > + write_atomic(&pi_desc->ndst,
> >> > + MASK_INSR(cpu_physical_id(v->processor),
> >> > + PI_xAPIC_NDST_MASK));
> >> > +
> >> > + pi_clear_sn(pi_desc);
> >> > +}
> >>
> >> So you alter where notifications go, but not via which vector? How
> >> is the vCPU going to get removed from the blocked list then?
> >
> > vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
> > in that case, the vector has been set properly and it has been removed
> > from the block list if it was blocked before.
>
> So why do you do two updates then (first [elsewhere] to set the vector
> and then here to set the destination)?
When the vCPU is unblocked and then removed from the blocking list, we
need to change the vector to the normal notification vector, since the
wakeup vector is only used when the vCPU is blocked and hence in the
blocked list. And this is the place where we know which pCPU the vCPU
will be scheduled on, so we change the destination field here.
Thanks,
Feng
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 2:07 ` Wu, Feng
@ 2015-09-10 8:27 ` Jan Beulich
2015-09-10 8:59 ` Wu, Feng
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-09-10 8:27 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org
>>> On 10.09.15 at 04:07, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, September 09, 2015 6:27 PM
>> >>> On 09.09.15 at 10:56, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Monday, September 07, 2015 8:55 PM
>> >> >>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>> >> > +
>> >> > + /*
>> >> > + * Delete the vCPU from the related block list
>> >> > + * if we are resuming from blocked state.
>> >> > + */
>> >> > + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
>> >> > + {
>> >> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
>> >> > + v->arch.hvm_vmx.pi_block_cpu),
>> >> flags);
>> >> > + list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>> >>
>> >> Shouldn't you set .pi_block_cpu back to -1 along with removing
>> >> the vCPU from the list? Which then ought to allow using just
>> >> list_del() here?
>> >
>> > Here is the story about using list_del_init() instead of list_del(), (the
>> > same reason for using list_del_init() in pi_wakeup_interrupt() ).
>> > If we use list_del(), that means we need to set .pi_block_cpu to
>> > -1 after removing the vCPU from the block list, there are two
>> > places where the vCPU is removed from the list:
>> > - arch_vcpu_wake_prepare()
>> > - pi_wakeup_interrupt()
>> >
>> > That means we also need to set .pi_block_cpu to -1 in
>> > pi_wakeup_interrupt(), which seems problematic to me, since
>> > .pi_block_cpu is used to get the per-CPU spin lock, it is not safe
>> > to change it in pi_wakeup_interrupt(), maybe someone is using
>> > it at the same time.
>> >
>> > And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
>> > is only used when .pi_block_cpu is set to -1 at the initial time.
>> >
>> > If you have any good suggestions about this, I will be all ears.
>>
>> Latching pi_block_cpu into a local variable would take care of that
>> part of the problem. Of course you then may also need to check
>> that while waiting to acquire the lock pi_block_cpu didn't change.
>
> Thanks for the suggestion! But I think it is not easy to "check
> "that while waiting to acquire the lock pi_block_cpu didn't change".
> Let's take the following pseudo code as an example:
>
> int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
>
> ......
>
> spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> local_pi_block_cpu), flags);
> list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> local_pi_block_cpu), flags);
>
> If .pi_block_cpu is changed to -1 by others during calling list_del()
> above, that means the vCPU is removed by others, then calling
> list_del() here again would be problematic. I think there might be
> other corner cases for this. So I suggest adding some comments
> for list_del_init() as you mentioned below.
That's why I said "check that while waiting to acquire the lock
pi_block_cpu didn't change." The code you present does not do
this.
>> >> > + do {
>> >> > + old.control = new.control = pi_desc->control;
>> >> > +
>> >> > + /* Should not block the vCPU if an interrupt was posted for
>> it.
>> >> */
>> >> > + if ( pi_test_on(&old) )
>> >> > + {
>> >> > + spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
>> >> gflags);
>> >> > + vcpu_unblock(v);
>> >>
>> >> Calling vcpu_unblock() in the middle of context_switch()? Why? And
>> >> is this safe?
>> >
>> > I cannot see anything unsafe so far, can some scheduler maintainer
>> > help to confirm it? Dario? George?
>>
>> But first and foremost you ought to answer the "why".
>
> I think if you think this solution is unsafe, you need point out where it
> is, not
> just ask "is this safe ?", I don't think this is an effective comments.
It is my general understanding that people wanting code to be
included have to prove their code is okay, not reviewers to prove
the code is broken.
> Anyway, I go through the main path of the code as below, I really don't find
> anything unsafe here.
>
> context_switch() -->
> pi_ctxt_switch_from() -->
> vmx_pre_ctx_switch_pi() -->
> vcpu_unblock() -->
> vcpu_wake() -->
> SCHED_OP(VCPU2OP(v), wake, v) -->
> csched_vcpu_wake() -->
> __runq_insert()
> __runq_tickle()
>
> If you continue to think it is unsafe, pointing out the place will be
> welcome!
Once again - I didn't say it's unsafe (and hence can't point at a
particular place). What bothers me is that vcpu_unblock() affects
scheduler state, and altering scheduler state from inside the
context switch path looks problematic at the first glance. I'd be
happy if I was told there is no such problem, but be aware that
this would have to include latent ones: Even if right now things
are safe, having such feedback have the potential of becoming
an actual problem with a later scheduler change is already an
issue, as finding the problem is then likely going to be rather hard
(and I suspect it's not going to be you to debug this).
>> >> > +static void vmx_post_ctx_switch_pi(struct vcpu *v)
>> >> > +{
>> >> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> >> > +
>> >> > + if ( !has_arch_pdevs(v->domain) )
>> >> > + return;
>> >> > +
>> >> > + if ( x2apic_enabled )
>> >> > + write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
>> >> > + else
>> >> > + write_atomic(&pi_desc->ndst,
>> >> > + MASK_INSR(cpu_physical_id(v->processor),
>> >> > + PI_xAPIC_NDST_MASK));
>> >> > +
>> >> > + pi_clear_sn(pi_desc);
>> >> > +}
>> >>
>> >> So you alter where notifications go, but not via which vector? How
>> >> is the vCPU going to get removed from the blocked list then?
>> >
>> > vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
>> > in that case, the vector has been set properly and it has been removed
>> > from the block list if it was blocked before.
>>
>> So why do you do two updates then (first [elsewhere] to set the vector
>> and then here to set the destination)?
>
> When the vCPU is unblocked and then removed from the blocking list, we
> need to change the vector to the normal notification vector, since the
> wakeup vector is only used when the vCPU is blocked and hence in the
> blocked list. And here is the place we can decide which pCPU the vCPU
> will be scheduled on, so we change the destination field here.
Right, that's what I understood. But isn't the state things are in
between these two updates inconsistent? I.e. - where does the
notification go if one arrives in this window? Particularly - if it went
to the right place, why would this second update be needed at all?
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 8:27 ` Jan Beulich
@ 2015-09-10 8:59 ` Wu, Feng
2015-09-10 9:26 ` Jan Beulich
2015-09-16 17:08 ` George Dunlap
0 siblings, 2 replies; 27+ messages in thread
From: Wu, Feng @ 2015-09-10 8:59 UTC (permalink / raw)
To: Jan Beulich
Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 10, 2015 4:28 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >>> On 10.09.15 at 04:07, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, September 09, 2015 6:27 PM
> >> >>> On 09.09.15 at 10:56, <feng.wu@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Monday, September 07, 2015 8:55 PM
> >> >> >>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
> >> >> > +
> >> >> > + /*
> >> >> > + * Delete the vCPU from the related block list
> >> >> > + * if we are resuming from blocked state.
> >> >> > + */
> >> >> > + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
> >> >> > + {
> >> >> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> >> >> > + v->arch.hvm_vmx.pi_block_cpu),
> >> >> flags);
> >> >> > + list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> >> >>
> >> >> Shouldn't you set .pi_block_cpu back to -1 along with removing
> >> >> the vCPU from the list? Which then ought to allow using just
> >> >> list_del() here?
> >> >
> >> > Here is the story about using list_del_init() instead of list_del(), (the
> >> > same reason for using list_del_init() in pi_wakeup_interrupt() ).
> >> > If we use list_del(), that means we need to set .pi_block_cpu to
> >> > -1 after removing the vCPU from the block list, there are two
> >> > places where the vCPU is removed from the list:
> >> > - arch_vcpu_wake_prepare()
> >> > - pi_wakeup_interrupt()
> >> >
> >> > That means we also need to set .pi_block_cpu to -1 in
> >> > pi_wakeup_interrupt(), which seems problematic to me, since
> >> > .pi_block_cpu is used to get the per-CPU spin lock, it is not safe
> >> > to change it in pi_wakeup_interrupt(), maybe someone is using
> >> > it at the same time.
> >> >
> >> > And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
> >> > is only used when .pi_block_cpu is set to -1 at the initial time.
> >> >
> >> > If you have any good suggestions about this, I will be all ears.
> >>
> >> Latching pi_block_cpu into a local variable would take care of that
> >> part of the problem. Of course you then may also need to check
> >> that while waiting to acquire the lock pi_block_cpu didn't change.
> >
> > Thanks for the suggestion! But I think it is not easy to "check
> > "that while waiting to acquire the lock pi_block_cpu didn't change".
> > Let's take the following pseudo code as an example:
> >
> > int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> >
> > ......
> >
> > spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > local_pi_block_cpu), flags);
> > list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> > local_pi_block_cpu), flags);
> >
> > If .pi_block_cpu is changed to -1 by others during calling list_del()
> > above, that means the vCPU is removed by others, then calling
> > list_del() here again would be problematic. I think there might be
> > other corner cases for this. So I suggest adding some comments
> > for list_del_init() as you mentioned below.
>
> That's why I said "check that while waiting to acquire the lock
> pi_block_cpu didn't change." The code you present does not do
> this.
I didn't say we need to implement it exactly as the above code; I just
listed it here to show that it is hard to do.
First, how can we check, while waiting to acquire the lock, that
.pi_block_cpu didn't change?
Secondly, even if we can check it, what should we do if .pi_block_cpu
has changed after acquiring the lock, as I mentioned above?
>
> >> >> > + do {
> >> >> > + old.control = new.control = pi_desc->control;
> >> >> > +
> >> >> > + /* Should not block the vCPU if an interrupt was posted
> for
> >> it.
> >> >> */
> >> >> > + if ( pi_test_on(&old) )
> >> >> > + {
> >> >> > +
> spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
> >> >> gflags);
> >> >> > + vcpu_unblock(v);
> >> >>
> >> >> Calling vcpu_unblock() in the middle of context_switch()? Why? And
> >> >> is this safe?
> >> >
> >> > I cannot see anything unsafe so far, can some scheduler maintainer
> >> > help to confirm it? Dario? George?
> >>
> >> But first and foremost you ought to answer the "why".
> >
> > I think if you believe this solution is unsafe, you need to point out
> > where it is, not just ask "is this safe?"; I don't think that is an
> > effective comment.
>
> It is my general understanding that people wanting code to be
> included have to prove their code is okay, not reviewers to prove
> the code is broken.
>
> > Anyway, I go through the main path of the code as below, I really don't find
> > anything unsafe here.
> >
> > context_switch() -->
> > pi_ctxt_switch_from() -->
> > vmx_pre_ctx_switch_pi() -->
> > vcpu_unblock() -->
> > vcpu_wake() -->
> > SCHED_OP(VCPU2OP(v), wake, v) -->
> > csched_vcpu_wake() -->
> > __runq_insert()
> > __runq_tickle()
> >
> > If you continue to think it is unsafe, pointing out the place will be
> > welcome!
>
> Once again - I didn't say it's unsafe (and hence can't point at a
> particular place). What bothers me is that vcpu_unblock() affects
> scheduler state, and altering scheduler state from inside the
> context switch path looks problematic at the first glance. I'd be
> happy if I was told there is no such problem, but be aware that
> this would have to include latent ones: Even if right now things
> are safe, having such feedback have the potential of becoming
> an actual problem with a later scheduler change is already an
> issue, as finding the problem is then likely going to be rather hard
> (and I suspect it's not going to be you to debug this).
What I can say is that after investigation I don't find anything bad
about this vcpu_unblock(), and I tested it in my experiments. That is
why I'd like to get some input from the scheduler experts.
>
> >> >> > +static void vmx_post_ctx_switch_pi(struct vcpu *v)
> >> >> > +{
> >> >> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> >> >> > +
> >> >> > + if ( !has_arch_pdevs(v->domain) )
> >> >> > + return;
> >> >> > +
> >> >> > + if ( x2apic_enabled )
> >> >> > + write_atomic(&pi_desc->ndst,
> cpu_physical_id(v->processor));
> >> >> > + else
> >> >> > + write_atomic(&pi_desc->ndst,
> >> >> > + MASK_INSR(cpu_physical_id(v->processor),
> >> >> > + PI_xAPIC_NDST_MASK));
> >> >> > +
> >> >> > + pi_clear_sn(pi_desc);
> >> >> > +}
> >> >>
> >> >> So you alter where notifications go, but not via which vector? How
> >> >> is the vCPU going to get removed from the blocked list then?
> >> >
> >> > vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
> >> > in that case, the vector has been set properly and it has been removed
> >> > from the block list if it was blocked before.
> >>
> >> So why do you do two updates then (first [elsewhere] to set the vector
> >> and then here to set the destination)?
> >
> > When the vCPU is unblocked and then removed from the blocking list, we
> > need to change the vector to the normal notification vector, since the
> > wakeup vector is only used when the vCPU is blocked and hence in the
> > blocked list. And here is the place we can decide which pCPU the vCPU
> > will be scheduled on, so we change the destination field here.
>
> Right, that's what I understood. But isn't the state things are in
> between these two updates inconsistent? I.e. - where does the
> notification go if one arrives in this window?
Before arriving here, SN is already set, so no notification event is
sent and hence no notification arrives in this window.
> Particularly - if it went
> to the right place, why would this second update be needed at all?
So we would need to read ndst from the PI descriptor, compare it with
cpu_physical_id(v->processor), and update it only if they differ? I
don't think this check is worthwhile, since it would still need at
least a read operation and an if check, and then maybe a write
operation anyway.
Thanks,
Feng
>
> Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 8:59 ` Wu, Feng
@ 2015-09-10 9:26 ` Jan Beulich
2015-09-10 9:41 ` Wu, Feng
2015-09-16 8:56 ` Wu, Feng
2015-09-16 17:08 ` George Dunlap
1 sibling, 2 replies; 27+ messages in thread
From: Jan Beulich @ 2015-09-10 9:26 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org
>>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, September 10, 2015 4:28 PM
>> >>> On 10.09.15 at 04:07, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Wednesday, September 09, 2015 6:27 PM
>> >> >>> On 09.09.15 at 10:56, <feng.wu@intel.com> wrote:
>> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> Sent: Monday, September 07, 2015 8:55 PM
>> >> >> >>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>> >> >> > +
>> >> >> > + /*
>> >> >> > + * Delete the vCPU from the related block list
>> >> >> > + * if we are resuming from blocked state.
>> >> >> > + */
>> >> >> > + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
>> >> >> > + {
>> >> >> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
>> >> >> > + v->arch.hvm_vmx.pi_block_cpu),
>> >> >> flags);
>> >> >> > + list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>> >> >>
>> >> >> Shouldn't you set .pi_block_cpu back to -1 along with removing
>> >> >> the vCPU from the list? Which then ought to allow using just
>> >> >> list_del() here?
>> >> >
>> >> > Here is the story about using list_del_init() instead of list_del(), (the
>> >> > same reason for using list_del_init() in pi_wakeup_interrupt() ).
>> >> > If we use list_del(), that means we need to set .pi_block_cpu to
>> >> > -1 after removing the vCPU from the block list, there are two
>> >> > places where the vCPU is removed from the list:
>> >> > - arch_vcpu_wake_prepare()
>> >> > - pi_wakeup_interrupt()
>> >> >
>> >> > That means we also need to set .pi_block_cpu to -1 in
>> >> > pi_wakeup_interrupt(), which seems problematic to me, since
>> >> > .pi_block_cpu is used to get the per-CPU spin lock, it is not safe
>> >> > to change it in pi_wakeup_interrupt(), maybe someone is using
>> >> > it at the same time.
>> >> >
>> >> > And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
>> >> > is only used when .pi_block_cpu is set to -1 at the initial time.
>> >> >
>> >> > If you have any good suggestions about this, I will be all ears.
>> >>
>> >> Latching pi_block_cpu into a local variable would take care of that
>> >> part of the problem. Of course you then may also need to check
>> >> that while waiting to acquire the lock pi_block_cpu didn't change.
>> >
>> > Thanks for the suggestion! But I think it is not easy to "check
>> > "that while waiting to acquire the lock pi_block_cpu didn't change".
>> > Let's take the following pseudo code as an example:
>> >
>> > int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
>> >
>> > ......
>> >
>> > spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
>> > local_pi_block_cpu), flags);
>> > list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>> > spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
>> > local_pi_block_cpu), flags);
>> >
>> > If .pi_block_cpu is changed to -1 by others during calling list_del()
>> > above, that means the vCPU is removed by others, then calling
>> > list_del() here again would be problematic. I think there might be
>> > other corner cases for this. So I suggest adding some comments
>> > for list_del_init() as you mentioned below.
>>
>> That's why I said "check that while waiting to acquire the lock
>> pi_block_cpu didn't change." The code you present does not do
>> this.
>
> I didn't say we need to implement it exactly as the above code; I just
> listed it here to show that it is hard to do.
> First, how can we check, while waiting to acquire the lock, that
> .pi_block_cpu didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
> Secondly, even if we can check it, what should we do if .pi_block_cpu
> is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)

restart:
    local_pi_block_cpu = ...;
    bail-if-invalid (e.g. -1 in current model)
    spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
    if ( local_pi_block_cpu != actual_pi_block_cpu )
    {
        spin_unlock_irqrestore(&per_cpu(, local_pi_block_cpu), flags);
        goto restart;
    }
    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
    spin_unlock_irqrestore(&per_cpu(, local_pi_block_cpu), flags);
>> > Anyway, I go through the main path of the code as below, I really don't find
>> > anything unsafe here.
>> >
>> > context_switch() -->
>> > pi_ctxt_switch_from() -->
>> > vmx_pre_ctx_switch_pi() -->
>> > vcpu_unblock() -->
>> > vcpu_wake() -->
>> > SCHED_OP(VCPU2OP(v), wake, v) -->
>> > csched_vcpu_wake() -->
>> > __runq_insert()
>> > __runq_tickle()
>> >
>> > If you continue to think it is unsafe, pointing out the place will be
>> > welcome!
>>
>> Once again - I didn't say it's unsafe (and hence can't point at a
>> particular place). What bothers me is that vcpu_unblock() affects
>> scheduler state, and altering scheduler state from inside the
>> context switch path looks problematic at the first glance. I'd be
>> happy if I was told there is no such problem, but be aware that
>> this would have to include latent ones: Even if right now things
>> are safe, having such feedback have the potential of becoming
>> an actual problem with a later scheduler change is already an
>> issue, as finding the problem is then likely going to be rather hard
>> (and I suspect it's not going to be you to debug this).
>
> What I can say is that after investigation I don't find anything bad
> about this vcpu_unblock(), and I tested it in my experiments. That is
> why I'd like to get some input from the scheduler experts.
Agreed - I'm also awaiting their input.
>> >> >> > +static void vmx_post_ctx_switch_pi(struct vcpu *v)
>> >> >> > +{
>> >> >> > + struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> >> >> > +
>> >> >> > + if ( !has_arch_pdevs(v->domain) )
>> >> >> > + return;
>> >> >> > +
>> >> >> > + if ( x2apic_enabled )
>> >> >> > + write_atomic(&pi_desc->ndst,
>> cpu_physical_id(v->processor));
>> >> >> > + else
>> >> >> > + write_atomic(&pi_desc->ndst,
>> >> >> > + MASK_INSR(cpu_physical_id(v->processor),
>> >> >> > + PI_xAPIC_NDST_MASK));
>> >> >> > +
>> >> >> > + pi_clear_sn(pi_desc);
>> >> >> > +}
>> >> >>
>> >> >> So you alter where notifications go, but not via which vector? How
>> >> >> is the vCPU going to get removed from the blocked list then?
>> >> >
>> >> > vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
>> >> > in that case, the vector has been set properly and it has been removed
>> >> > from the block list if it was blocked before.
>> >>
>> >> So why do you do two updates then (first [elsewhere] to set the vector
>> >> and then here to set the destination)?
>> >
>> > When the vCPU is unblocked and then removed from the blocking list, we
>> > need to change the vector to the normal notification vector, since the
>> > wakeup vector is only used when the vCPU is blocked and hence in the
>> > blocked list. And here is the place we can decide which pCPU the vCPU
>> > will be scheduled on, so we change the destination field here.
>>
>> Right, that's what I understood. But isn't the state things are in
>> between these two updates inconsistent? I.e. - where does the
>> notification go if one arrives in this window?
>
> Before arriving here, SN is already set, so no notification event is
> sent and hence no notification arrives in this window.
Ah, okay.
Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 9:26 ` Jan Beulich
@ 2015-09-10 9:41 ` Wu, Feng
2015-09-10 10:01 ` Jan Beulich
2015-09-16 8:56 ` Wu, Feng
1 sibling, 1 reply; 27+ messages in thread
From: Wu, Feng @ 2015-09-10 9:41 UTC (permalink / raw)
To: Jan Beulich
Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 10, 2015 5:26 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, September 10, 2015 4:28 PM
> >> >>> On 10.09.15 at 04:07, <feng.wu@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Wednesday, September 09, 2015 6:27 PM
> >> >> >>> On 09.09.15 at 10:56, <feng.wu@intel.com> wrote:
> >> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> Sent: Monday, September 07, 2015 8:55 PM
> >> >> >> >>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
> >> >> >> > +
> >> >> >> > + /*
> >> >> >> > + * Delete the vCPU from the related block list
> >> >> >> > + * if we are resuming from blocked state.
> >> >> >> > + */
> >> >> >> > + if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
> >> >> >> > + {
> >> >> >> > + spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> >> >> >> > +
> v->arch.hvm_vmx.pi_block_cpu),
> >> >> >> flags);
> >> >> >> > +
> list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> >> >> >>
> >> >> >> Shouldn't you set .pi_block_cpu back to -1 along with removing
> >> >> >> the vCPU from the list? Which then ought to allow using just
> >> >> >> list_del() here?
> >> >> >
> >> >> > Here is the story about using list_del_init() instead of list_del(), (the
> >> >> > same reason for using list_del_init() in pi_wakeup_interrupt() ).
> >> >> > If we use list_del(), that means we need to set .pi_block_cpu to
> >> >> > -1 after removing the vCPU from the block list, there are two
> >> >> > places where the vCPU is removed from the list:
> >> >> > - arch_vcpu_wake_prepare()
> >> >> > - pi_wakeup_interrupt()
> >> >> >
> >> >> > That means we also need to set .pi_block_cpu to -1 in
> >> >> > pi_wakeup_interrupt(), which seems problematic to me, since
> >> >> > .pi_block_cpu is used to get the per-CPU spin lock, it is not safe
> >> >> > to change it in pi_wakeup_interrupt(), maybe someone is using
> >> >> > it at the same time.
> >> >> >
> >> >> > And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
> >> >> > is only used when .pi_block_cpu is set to -1 at the initial time.
> >> >> >
> >> >> > If you have any good suggestions about this, I will be all ears.
> >> >>
> >> >> Latching pi_block_cpu into a local variable would take care of that
> >> >> part of the problem. Of course you then may also need to check
> >> >> that while waiting to acquire the lock pi_block_cpu didn't change.
> >> >
> >> > Thanks for the suggestion! But I think it is not easy to "check
> >> > "that while waiting to acquire the lock pi_block_cpu didn't change".
> >> > Let's take the following pseudo code as an example:
> >> >
> >> > int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> >> >
> >> > ......
> >> >
> >> > spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> >> > local_pi_block_cpu), flags);
> >> > list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> >> > spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> >> > local_pi_block_cpu), flags);
> >> >
> >> > If .pi_block_cpu is changed to -1 by others during calling list_del()
> >> > above, that means the vCPU is removed by others, then calling
> >> > list_del() here again would be problematic. I think there might be
> >> > other corner cases for this. So I suggest adding some comments
> >> > for list_del_init() as you mentioned below.
> >>
> >> That's why I said "check that while waiting to acquire the lock
> >> pi_block_cpu didn't change." The code you present does not do
> >> this.
> >
> > I didn't see we need implement it as the above code, I just list it
> > here the show it is hard to do that.
> > First, how to check it while waiting to acquire the lock .pi_block_cpu
> > didn't change?
>
> Note the difference between "check while waiting" and "check that
> while waiting": The former is indeed hard to implement, while the
> latter is pretty straightforward (and we do so elsewhere).
>
> > Secondly, even if we can check it, what should we do if .pi_block_cpu
> > is changed after acquiring the lock as I mentioned above?
>
> Drop the lock and start over. I.e. (taking your pseudo code)
>
> restart:
> local_pi_block_cpu = ...;
> bail-if-invalid (e.g. -1 in current model)
> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
> if(local_pi_block_cpu != actual_pi_block_cpu) {
> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
> goto restart;
> }
Thanks a lot for showing me this pseudo code! My concern is that if
.pi_block_cpu is changed to -1 at this point, it doesn't work:
.pi_block_cpu being -1 here means the vCPU has been removed from
the blocking list by someone else, so we cannot delete it again via
list_del() here.
BTW, I cannot see any performance overhead for list_del_init()
compared to list_del().
list_del():
static inline void list_del(struct list_head *entry)
{
ASSERT(entry->next->prev == entry);
ASSERT(entry->prev->next == entry);
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}
list_del_init():
static inline void list_del_init(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
INIT_LIST_HEAD(entry);
}
Thanks,
Feng
> list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
>
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 9:41 ` Wu, Feng
@ 2015-09-10 10:01 ` Jan Beulich
2015-09-10 12:34 ` Wu, Feng
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-09-10 10:01 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org
>>> On 10.09.15 at 11:41, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, September 10, 2015 5:26 PM
>> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
>> > First, how to check it while waiting to acquire the lock .pi_block_cpu
>> > didn't change?
>>
>> Note the difference between "check while waiting" and "check that
>> while waiting": The former is indeed hard to implement, while the
>> latter is pretty straightforward (and we do so elsewhere).
>>
>> > Secondly, even if we can check it, what should we do if .pi_block_cpu
>> > is changed after acquiring the lock as I mentioned above?
>>
>> Drop the lock and start over. I.e. (taking your pseudo code)
>>
>> restart:
>> local_pi_block_cpu = ...;
>> bail-if-invalid (e.g. -1 in current model)
>> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
>> if(local_pi_block_cpu != actual_pi_block_cpu) {
>> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
>> goto restart;
>> }
>
> Thanks a lot for showing me this pseudo code! My concern is if
> .pi_block_vcpu is changed to -1 at this point, it doesn't work.
> .pi_block_vcpu being -1 here means the vCPU is remove from
> the blocking list by others, then we cannot delete it again via
> list_del() here.
Did you miss the "bail-if-invalid" above?
> BTW, I cannot see performance overhead for list_del_init()
> compared to list_del().
>
> list_del():
> static inline void list_del(struct list_head *entry)
> {
> ASSERT(entry->next->prev == entry);
> ASSERT(entry->prev->next == entry);
> __list_del(entry->prev, entry->next);
> entry->next = LIST_POISON1;
> entry->prev = LIST_POISON2;
> }
>
> list_del_init():
> static inline void list_del_init(struct list_head *entry)
> {
> __list_del(entry->prev, entry->next);
> INIT_LIST_HEAD(entry);
> }
Well, yes, both do two stores (I forgot about the poisoning), but
arguably the poisoning could become a debug-build-only thing. I.e.
it is an implementation detail that the number of stores currently
is the same. From an abstract perspective one should still prefer
list_del() when the re-init isn't really needed. And in the specific
case here asking you to use list_del() makes sure the code ends
up not even trying the deletion when not needed.
Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 10:01 ` Jan Beulich
@ 2015-09-10 12:34 ` Wu, Feng
2015-09-10 12:44 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Wu, Feng @ 2015-09-10 12:34 UTC (permalink / raw)
To: Jan Beulich
Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 10, 2015 6:01 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >>> On 10.09.15 at 11:41, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, September 10, 2015 5:26 PM
> >> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
> >> > First, how to check it while waiting to acquire the lock .pi_block_cpu
> >> > didn't change?
> >>
> >> Note the difference between "check while waiting" and "check that
> >> while waiting": The former is indeed hard to implement, while the
> >> latter is pretty straightforward (and we do so elsewhere).
> >>
> >> > Secondly, even if we can check it, what should we do if .pi_block_cpu
> >> > is changed after acquiring the lock as I mentioned above?
> >>
> >> Drop the lock and start over. I.e. (taking your pseudo code)
> >>
> >> restart:
> >> local_pi_block_cpu = ...;
> >> bail-if-invalid (e.g. -1 in current model)
> >> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
> >> if(local_pi_block_cpu != actual_pi_block_cpu) {
> >> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
> >> goto restart;
> >> }
> >
> > Thanks a lot for showing me this pseudo code! My concern is that if
> > .pi_block_cpu is changed to -1 at this point, it doesn't work:
> > .pi_block_cpu being -1 here means the vCPU has been removed from
> > the blocking list by someone else, so we cannot delete it again via
> > list_del() here.
>
> Did you miss the "bail-if-invalid" above?
I am sorry, am I missing something here? If .pi_block_cpu becomes
-1 at this point (after the above 'if' statement has completed with
local_pi_block_cpu == actual_pi_block_cpu), how can the "bail-if-invalid"
above help?
Thanks,
Feng
>
> > BTW, I cannot see performance overhead for list_del_init()
> > compared to list_del().
> >
> > list_del():
> > static inline void list_del(struct list_head *entry)
> > {
> > ASSERT(entry->next->prev == entry);
> > ASSERT(entry->prev->next == entry);
> > __list_del(entry->prev, entry->next);
> > entry->next = LIST_POISON1;
> > entry->prev = LIST_POISON2;
> > }
> >
> > list_del_init():
> > static inline void list_del_init(struct list_head *entry)
> > {
> > __list_del(entry->prev, entry->next);
> > INIT_LIST_HEAD(entry);
> > }
>
> Well, yes, both do two stores (I forgot about the poisoning), but
> arguably the poisoning could become a debug-build-only thing. I.e.
> it is an implementation detail that the number of stores currently
> is the same. From an abstract perspective one should still prefer
> list_del() when the re-init isn't really needed. And in the specific
> case here asking you to use list_del() makes sure the code ends
> up not even trying the deletion when not needed.
>
> Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 12:34 ` Wu, Feng
@ 2015-09-10 12:44 ` Jan Beulich
2015-09-10 12:58 ` Wu, Feng
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-09-10 12:44 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org
>>> On 10.09.15 at 14:34, <feng.wu@intel.com> wrote:
>
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, September 10, 2015 6:01 PM
>> To: Wu, Feng
>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
>> xen-devel@lists.xen.org; Keir Fraser
>> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
>> interrupts
>>
>> >>> On 10.09.15 at 11:41, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Thursday, September 10, 2015 5:26 PM
>> >> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
>> >> > First, how to check it while waiting to acquire the lock .pi_block_cpu
>> >> > didn't change?
>> >>
>> >> Note the difference between "check while waiting" and "check that
>> >> while waiting": The former is indeed hard to implement, while the
>> >> latter is pretty straightforward (and we do so elsewhere).
>> >>
>> >> > Secondly, even if we can check it, what should we do if .pi_block_cpu
>> >> > is changed after acquiring the lock as I mentioned above?
>> >>
>> >> Drop the lock and start over. I.e. (taking your pseudo code)
>> >>
>> >> restart:
>> >> local_pi_block_cpu = ...;
>> >> bail-if-invalid (e.g. -1 in current model)
>> >> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
>> >> if(local_pi_block_cpu != actual_pi_block_cpu) {
>> >> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
>> >> goto restart;
>> >> }
>> >
>> > Thanks a lot for showing me this pseudo code! My concern is if
>> > .pi_block_vcpu is changed to -1 at this point, it doesn't work.
>> > .pi_block_vcpu being -1 here means the vCPU is remove from
>> > the blocking list by others, then we cannot delete it again via
>> > list_del() here.
>>
>> Did you miss the "bail-if-invalid" above?
>
> I am sorry, do I miss something here? If .pi_block_cpu becomes
> -1 here (after the above 'if' statement is finished with
> local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
> above help?
The (obvious I thought) implication is that all assignments to
pi_block_cpu (along with all list manipulations) now need to happen
with the lock held.
Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 12:44 ` Jan Beulich
@ 2015-09-10 12:58 ` Wu, Feng
2015-09-10 13:15 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Wu, Feng @ 2015-09-10 12:58 UTC (permalink / raw)
To: Jan Beulich
Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 10, 2015 8:45 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >>> On 10.09.15 at 14:34, <feng.wu@intel.com> wrote:
>
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, September 10, 2015 6:01 PM
> >> To: Wu, Feng
> >> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> >> xen-devel@lists.xen.org; Keir Fraser
> >> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
> posted
> >> interrupts
> >>
> >> >>> On 10.09.15 at 11:41, <feng.wu@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Thursday, September 10, 2015 5:26 PM
> >> >> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
> >> >> > First, how to check it while waiting to acquire the lock .pi_block_cpu
> >> >> > didn't change?
> >> >>
> >> >> Note the difference between "check while waiting" and "check that
> >> >> while waiting": The former is indeed hard to implement, while the
> >> >> latter is pretty straightforward (and we do so elsewhere).
> >> >>
> >> >> > Secondly, even if we can check it, what should we do if .pi_block_cpu
> >> >> > is changed after acquiring the lock as I mentioned above?
> >> >>
> >> >> Drop the lock and start over. I.e. (taking your pseudo code)
> >> >>
> >> >> restart:
> >> >> local_pi_block_cpu = ...;
> >> >> bail-if-invalid (e.g. -1 in current model)
> >> >> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
> >> >> if(local_pi_block_cpu != actual_pi_block_cpu) {
> >> >> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
> >> >> goto restart;
> >> >> }
> >> >
> >> > Thanks a lot for showing me this pseudo code! My concern is if
> >> > .pi_block_vcpu is changed to -1 at this point, it doesn't work.
> >> > .pi_block_vcpu being -1 here means the vCPU is remove from
> >> > the blocking list by others, then we cannot delete it again via
> >> > list_del() here.
> >>
> >> Did you miss the "bail-if-invalid" above?
> >
> > I am sorry, do I miss something here? If .pi_block_cpu becomes
> > -1 here (after the above 'if' statement is finished with
> > local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
> > above help?
>
> The (obvious I thought) implication is that all assignments to
> pi_block_cpu (along with all list manipulations) now need to happen
> with the lock held.
If all assignments to pi_block_cpu are done with the lock held, I don't think
we need the above check; we can safely use .pi_block_cpu, right?
Thanks,
Feng
>
> Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 12:58 ` Wu, Feng
@ 2015-09-10 13:15 ` Jan Beulich
2015-09-10 13:27 ` Wu, Feng
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-09-10 13:15 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org
>>> On 10.09.15 at 14:58, <feng.wu@intel.com> wrote:
>
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, September 10, 2015 8:45 PM
>> To: Wu, Feng
>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
>> xen-devel@lists.xen.org; Keir Fraser
>> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
>> interrupts
>>
>> >>> On 10.09.15 at 14:34, <feng.wu@intel.com> wrote:
>>
>> >
>> >> -----Original Message-----
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Thursday, September 10, 2015 6:01 PM
>> >> To: Wu, Feng
>> >> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
>> >> xen-devel@lists.xen.org; Keir Fraser
>> >> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
>> posted
>> >> interrupts
>> >>
>> >> >>> On 10.09.15 at 11:41, <feng.wu@intel.com> wrote:
>> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> Sent: Thursday, September 10, 2015 5:26 PM
>> >> >> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
>> >> >> > First, how to check it while waiting to acquire the lock .pi_block_cpu
>> >> >> > didn't change?
>> >> >>
>> >> >> Note the difference between "check while waiting" and "check that
>> >> >> while waiting": The former is indeed hard to implement, while the
>> >> >> latter is pretty straightforward (and we do so elsewhere).
>> >> >>
>> >> >> > Secondly, even if we can check it, what should we do if .pi_block_cpu
>> >> >> > is changed after acquiring the lock as I mentioned above?
>> >> >>
>> >> >> Drop the lock and start over. I.e. (taking your pseudo code)
>> >> >>
>> >> >> restart:
>> >> >> local_pi_block_cpu = ...;
>> >> >> bail-if-invalid (e.g. -1 in current model)
>> >> >> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
>> >> >> if(local_pi_block_cpu != actual_pi_block_cpu) {
>> >> >> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
>> >> >> goto restart;
>> >> >> }
>> >> >
>> >> > Thanks a lot for showing me this pseudo code! My concern is if
>> >> > .pi_block_vcpu is changed to -1 at this point, it doesn't work.
>> >> > .pi_block_vcpu being -1 here means the vCPU is remove from
>> >> > the blocking list by others, then we cannot delete it again via
>> >> > list_del() here.
>> >>
>> >> Did you miss the "bail-if-invalid" above?
>> >
>> > I am sorry, do I miss something here? If .pi_block_cpu becomes
>> > -1 here (after the above 'if' statement is finished with
>> > local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
>> > above help?
>>
>> The (obvious I thought) implication is that all assignments to
>> pi_block_cpu (along with all list manipulations) now need to happen
>> with the lock held.
>
> If all the assignment to pi_block_cpu is with one lock held, I don't think
> we need to above checkout, we can safely use .pi_block_cpu, right?
No. In order to use it you need to make sure it's valid (or else
there's no valid lock to acquire). And once you have acquired the
lock, you have to check whether it changed / became invalid.
Plus the up-front check allows us to avoid acquiring the lock in the
first place when the field is invalid anyway.
Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 13:15 ` Jan Beulich
@ 2015-09-10 13:27 ` Wu, Feng
2015-09-10 14:01 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Wu, Feng @ 2015-09-10 13:27 UTC (permalink / raw)
To: Jan Beulich
Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 10, 2015 9:15 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >>> On 10.09.15 at 14:58, <feng.wu@intel.com> wrote:
>
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, September 10, 2015 8:45 PM
> >> To: Wu, Feng
> >> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> >> xen-devel@lists.xen.org; Keir Fraser
> >> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
> posted
> >> interrupts
> >>
> >> >>> On 10.09.15 at 14:34, <feng.wu@intel.com> wrote:
> >>
> >> >
> >> >> -----Original Message-----
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Thursday, September 10, 2015 6:01 PM
> >> >> To: Wu, Feng
> >> >> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> >> >> xen-devel@lists.xen.org; Keir Fraser
> >> >> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
> >> posted
> >> >> interrupts
> >> >>
> >> >> >>> On 10.09.15 at 11:41, <feng.wu@intel.com> wrote:
> >> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> Sent: Thursday, September 10, 2015 5:26 PM
> >> >> >> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
> >> >> >> > First, how to check it while waiting to acquire the lock .pi_block_cpu
> >> >> >> > didn't change?
> >> >> >>
> >> >> >> Note the difference between "check while waiting" and "check that
> >> >> >> while waiting": The former is indeed hard to implement, while the
> >> >> >> latter is pretty straightforward (and we do so elsewhere).
> >> >> >>
> >> >> >> > Secondly, even if we can check it, what should we do if .pi_block_cpu
> >> >> >> > is changed after acquiring the lock as I mentioned above?
> >> >> >>
> >> >> >> Drop the lock and start over. I.e. (taking your pseudo code)
> >> >> >>
> >> >> >> restart:
> >> >> >> local_pi_block_cpu = ...;
> >> >> >> bail-if-invalid (e.g. -1 in current model)
> >> >> >> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
> >> >> >> if(local_pi_block_cpu != actual_pi_block_cpu) {
> >> >> >> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu),
> flags);
> >> >> >> goto restart;
> >> >> >> }
> >> >> >
> >> >> > Thanks a lot for showing me this pseudo code! My concern is if
> >> >> > .pi_block_vcpu is changed to -1 at this point, it doesn't work.
> >> >> > .pi_block_vcpu being -1 here means the vCPU is remove from
> >> >> > the blocking list by others, then we cannot delete it again via
> >> >> > list_del() here.
> >> >>
> >> >> Did you miss the "bail-if-invalid" above?
> >> >
> >> > I am sorry, do I miss something here? If .pi_block_cpu becomes
> >> > -1 here (after the above 'if' statement is finished with
> >> > local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
> >> > above help?
> >>
> >> The (obvious I thought) implication is that all assignments to
> >> pi_block_cpu (along with all list manipulations) now need to happen
> >> with the lock held.
> >
> > If all the assignment to pi_block_cpu is with one lock held, I don't think
> > we need to above checkout, we can safely use .pi_block_cpu, right?
>
> No. In order to use it you need to make sure it's valid (or else
> there's no valid lock to acquire). And once you acquired the
> lock, you have to check whether changed / became invalid.
If all assignments to .pi_block_cpu are done with one lock held,
how can it get changed / become invalid once I have acquired
the lock?
Thanks,
Feng
> Plus the up front check allows to avoid acquiring the lock in the
> first place when the field is invalid anyway.
>
> Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 13:27 ` Wu, Feng
@ 2015-09-10 14:01 ` Jan Beulich
0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2015-09-10 14:01 UTC (permalink / raw)
To: Feng Wu
Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
Dario Faggioli, xen-devel@lists.xen.org
>>> On 10.09.15 at 15:27, <feng.wu@intel.com> wrote:
>
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, September 10, 2015 9:15 PM
>> To: Wu, Feng
>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
>> xen-devel@lists.xen.org; Keir Fraser
>> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
>> interrupts
>>
>> >>> On 10.09.15 at 14:58, <feng.wu@intel.com> wrote:
>>
>> >
>> >> -----Original Message-----
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Thursday, September 10, 2015 8:45 PM
>> >> To: Wu, Feng
>> >> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
>> >> xen-devel@lists.xen.org; Keir Fraser
>> >> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
>> posted
>> >> interrupts
>> >>
>> >> >>> On 10.09.15 at 14:34, <feng.wu@intel.com> wrote:
>> >>
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> Sent: Thursday, September 10, 2015 6:01 PM
>> >> >> To: Wu, Feng
>> >> >> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
>> >> >> xen-devel@lists.xen.org; Keir Fraser
>> >> >> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
>> >> posted
>> >> >> interrupts
>> >> >>
>> >> >> >>> On 10.09.15 at 11:41, <feng.wu@intel.com> wrote:
>> >> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> >> Sent: Thursday, September 10, 2015 5:26 PM
>> >> >> >> >>> On 10.09.15 at 10:59, <feng.wu@intel.com> wrote:
>> >> >> >> > First, how to check it while waiting to acquire the lock .pi_block_cpu
>> >> >> >> > didn't change?
>> >> >> >>
>> >> >> >> Note the difference between "check while waiting" and "check that
>> >> >> >> while waiting": The former is indeed hard to implement, while the
>> >> >> >> latter is pretty straightforward (and we do so elsewhere).
>> >> >> >>
>> >> >> >> > Secondly, even if we can check it, what should we do if .pi_block_cpu
>> >> >> >> > is changed after acquiring the lock as I mentioned above?
>> >> >> >>
>> >> >> >> Drop the lock and start over. I.e. (taking your pseudo code)
>> >> >> >>
>> >> >> >> restart:
>> >> >> >> local_pi_block_cpu = ...;
>> >> >> >> bail-if-invalid (e.g. -1 in current model)
>> >> >> >> spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
>> >> >> >> if(local_pi_block_cpu != actual_pi_block_cpu) {
>> >> >> >> spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu),
>> flags);
>> >> >> >> goto restart;
>> >> >> >> }
>> >> >> >
>> >> >> > Thanks a lot for showing me this pseudo code! My concern is if
>> >> >> > .pi_block_vcpu is changed to -1 at this point, it doesn't work.
>> >> >> > .pi_block_vcpu being -1 here means the vCPU is remove from
>> >> >> > the blocking list by others, then we cannot delete it again via
>> >> >> > list_del() here.
>> >> >>
>> >> >> Did you miss the "bail-if-invalid" above?
>> >> >
>> >> > I am sorry, do I miss something here? If .pi_block_cpu becomes
>> >> > -1 here (after the above 'if' statement is finished with
>> >> > local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
>> >> > above help?
>> >>
>> >> The (obvious I thought) implication is that all assignments to
>> >> pi_block_cpu (along with all list manipulations) now need to happen
>> >> with the lock held.
>> >
>> > If all the assignment to pi_block_cpu is with one lock held, I don't think
>> > we need to above checkout, we can safely use .pi_block_cpu, right?
>>
>> No. In order to use it you need to make sure it's valid (or else
>> there's no valid lock to acquire). And once you acquired the
>> lock, you have to check whether changed / became invalid.
>
> If all the assignment to .pi_block_cpu is with one lock held,
> how can it get changed / become invalid then once I acquired
> the lock?
Again - if pi_block_cpu is invalid, which lock do you want to acquire?
Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 9:26 ` Jan Beulich
2015-09-10 9:41 ` Wu, Feng
@ 2015-09-16 8:56 ` Wu, Feng
1 sibling, 0 replies; 27+ messages in thread
From: Wu, Feng @ 2015-09-16 8:56 UTC (permalink / raw)
To: Dario Faggioli, George Dunlap, Jan Beulich (JBeulich@suse.com)
Cc: Andrew Cooper, Tian, Kevin, Keir Fraser, Wu, Feng,
xen-devel@lists.xen.org
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 10, 2015 5:26 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
>
> >> > Anyway, I go through the main path of the code as below, I really don't find
> >> > anything unsafe here.
> >> >
> >> > context_switch() -->
> >> > pi_ctxt_switch_from() -->
> >> > vmx_pre_ctx_switch_pi() -->
> >> > vcpu_unblock() -->
> >> > vcpu_wake() -->
> >> > SCHED_OP(VCPU2OP(v), wake, v) -->
> >> > csched_vcpu_wake() -->
> >> > __runq_insert()
> >> > __runq_tickle()
> >> >
> >> > If you continue to think it is unsafe, pointing out the place will be
> >> > welcome!
> >>
> >> Once again - I didn't say it's unsafe (and hence can't point at a
> >> particular place). What bothers me is that vcpu_unblock() affects
> >> scheduler state, and altering scheduler state from inside the
> >> context switch path looks problematic at the first glance. I'd be
> >> happy if I was told there is no such problem, but be aware that
> >> this would have to include latent ones: Even if right now things
> >> are safe, having such feedback have the potential of becoming
> >> an actual problem with a later scheduler change is already an
> >> issue, as finding the problem is then likely going to be rather hard
> >> (and I suspect it's not going to be you to debug this).
> >
> > What I can say is that after investigation, I don't find anything bad
> > for this vcpu_unblock(), I tested it in my experiment. So that is why
> > I'd like to ask some ideas from scheduler expects.
>
> Agreed - I'm also awaiting their input.
Hi Dario & George, could you please give your thoughts about this?
Your input is very important for us!
Thanks,
Feng
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-10 8:59 ` Wu, Feng
2015-09-10 9:26 ` Jan Beulich
@ 2015-09-16 17:08 ` George Dunlap
2015-09-17 6:26 ` Wu, Feng
1 sibling, 1 reply; 27+ messages in thread
From: George Dunlap @ 2015-09-16 17:08 UTC (permalink / raw)
To: Wu, Feng
Cc: Tian, Kevin, Keir Fraser, Andrew Cooper, Dario Faggioli,
xen-devel@lists.xen.org, Jan Beulich
On Thu, Sep 10, 2015 at 9:59 AM, Wu, Feng <feng.wu@intel.com> wrote:
>> >> >> Calling vcpu_unblock() in the middle of context_switch()? Why? And
>> >> >> is this safe?
>> >> >
>> >> > I cannot see anything unsafe so far, can some scheduler maintainer
>> >> > help to confirm it? Dario? George?
>> >>
>> >> But first and foremost you ought to answer the "why".
>> >
>> > I think if you think this solution is unsafe, you need point out where it
>> > is, not
>> > just ask "is this safe ?", I don't think this is an effective comments.
>>
>> It is my general understanding that people wanting code to be
>> included have to prove their code is okay, not reviewers to prove
>> the code is broken.
>>
>> > Anyway, I go through the main path of the code as below, I really don't find
>> > anything unsafe here.
>> >
>> > context_switch() -->
>> > pi_ctxt_switch_from() -->
>> > vmx_pre_ctx_switch_pi() -->
>> > vcpu_unblock() -->
>> > vcpu_wake() -->
>> > SCHED_OP(VCPU2OP(v), wake, v) -->
>> > csched_vcpu_wake() -->
>> > __runq_insert()
>> > __runq_tickle()
>> >
>> > If you continue to think it is unsafe, pointing out the place will be
>> > welcome!
>>
>> Once again - I didn't say it's unsafe (and hence can't point at a
>> particular place). What bothers me is that vcpu_unblock() affects
>> scheduler state, and altering scheduler state from inside the
>> context switch path looks problematic at the first glance. I'd be
>> happy if I was told there is no such problem, but be aware that
>> this would have to include latent ones: Even if right now things
>> are safe, having such feedback have the potential of becoming
>> an actual problem with a later scheduler change is already an
>> issue, as finding the problem is then likely going to be rather hard
>> (and I suspect it's not going to be you to debug this).
>
> What I can say is that after investigation, I don't find anything bad
> for this vcpu_unblock(), I tested it in my experiment. So that is why
> I'd like to ask some ideas from scheduler expects.
You still didn't answer Jan's question as to why you chose to call it there.
As to why Jan is suspicious, the hypervisor code is incredibly
complicated, and even hypervisor maintainers are only mortals. :-) One
way we deal with the complication is by trying to restrict the ways
different code interacts, so that people reading, writing, and
maintaining the code can make simplifying assumptions to make it
easier to grasp / harder to make mistakes.
People reading or working on vcpu_wake() code expect it to be called
from interrupt contexts, and also expect it to be called from normal
hypercalls. They *don't* expect it to be called from in the middle of
a context switch, where "current", the registers, and all those things
are all up in the air. It's possible that the code *already*
implicitly assumes it's not called from context switch (i.e., that v
== current means certain state), and even if not, it's possible
someone will make that assumption in the future, causing a very
hard-to-detect bug.
Now I haven't looked at the code in detail yet; it may be that the
only way to make PI work is to make this call. If that's the case we
need to carefully examine assumptions, and carefully comment the code
to make sure people working on the scheduling code continue to think
carefully about their assumptions in the future. But if we can get
away without doing that, it will make things much easier.
But it's certainly reasonable to expect the maintainers to offer you
an alternate solution if your proposed solution is unacceptable. :-)
Let me take a look and see what I think.
-George
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-16 17:08 ` George Dunlap
@ 2015-09-17 6:26 ` Wu, Feng
0 siblings, 0 replies; 27+ messages in thread
From: Wu, Feng @ 2015-09-17 6:26 UTC (permalink / raw)
To: George Dunlap
Cc: Tian, Kevin, Keir Fraser, Andrew Cooper, Dario Faggioli,
xen-devel@lists.xen.org, Jan Beulich, Wu, Feng
> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
> Dunlap
> Sent: Thursday, September 17, 2015 1:08 AM
> To: Wu, Feng
> Cc: Jan Beulich; Tian, Kevin; Keir Fraser; Andrew Cooper; Dario Faggioli;
> xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for
> VT-d posted interrupts
>
> On Thu, Sep 10, 2015 at 9:59 AM, Wu, Feng <feng.wu@intel.com> wrote:
> >> >> >> Calling vcpu_unblock() in the middle of context_switch()? Why? And
> >> >> >> is this safe?
> >> >> >
> >> >> > I cannot see anything unsafe so far, can some scheduler maintainer
> >> >> > help to confirm it? Dario? George?
> >> >>
> >> >> But first and foremost you ought to answer the "why".
> >> >
> >> > I think if you think this solution is unsafe, you need point out where it
> >> > is, not
> >> > just ask "is this safe ?", I don't think this is an effective comments.
> >>
> >> It is my general understanding that people wanting code to be
> >> included have to prove their code is okay, not reviewers to prove
> >> the code is broken.
> >>
> >> > Anyway, I go through the main path of the code as below, I really don't find
> >> > anything unsafe here.
> >> >
> >> > context_switch() -->
> >> > pi_ctxt_switch_from() -->
> >> > vmx_pre_ctx_switch_pi() -->
> >> > vcpu_unblock() -->
> >> > vcpu_wake() -->
> >> > SCHED_OP(VCPU2OP(v), wake, v) -->
> >> > csched_vcpu_wake() -->
> >> > __runq_insert()
> >> > __runq_tickle()
> >> >
> >> > If you continue to think it is unsafe, pointing out the place will be
> >> > welcome!
> >>
> >> Once again - I didn't say it's unsafe (and hence can't point at a
> >> particular place). What bothers me is that vcpu_unblock() affects
> >> scheduler state, and altering scheduler state from inside the
> >> context switch path looks problematic at the first glance. I'd be
> >> happy if I was told there is no such problem, but be aware that
> >> this would have to include latent ones: Even if right now things
> >> are safe, having such feedback have the potential of becoming
> >> an actual problem with a later scheduler change is already an
> >> issue, as finding the problem is then likely going to be rather hard
> >> (and I suspect it's not going to be you to debug this).
> >
> > What I can say is that after investigation, I don't find anything bad
> > for this vcpu_unblock(), I tested it in my experiment. So that is why
> > I'd like to ask some ideas from scheduler expects.
>
> You still didn't answer Jan's question as to why you chose to call it there.
The reason I need to call vcpu_unblock() here is that we are about
to block the vCPU and put it on the blocking list if everything goes okay.
However, if the 'ON' bit gets set while we are updating the posted-interrupt
descriptor, that means a new interrupt has arrived for this vCPU, so we
shouldn't block it; I call vcpu_unblock() to handle that case.
>
> As to why Jan is suspicious, the hypervisor code is incredibly
> complicated, and even hypervisor maintainers are only mortals. :-) One
> way we deal with the complication is by trying to restrict the ways
> different code interacts, so that people reading, writing, and
> maintaining the code can make simplifying assumptions to make it
> easier to grasp / harder to make mistakes.
>
> People reading or working on vcpu_wake() code expect it to be called
> from interrupt contexts, and also expect it to be called from normal
> hypercalls. They *don't* expect it to be called from in the middle of
> a context switch, where "current", the registers, and all those things
> are all up in the air. It's possible that the code *already*
> implicitly assumes it's not called from context switch (i.e., that v
> == current means certain state), and even if not, it's possible
> someone will make that assumption in the future, causing a very
> hard-to-detect bug.
>
> Now I haven't looked at the code in detail yet; it may be that the
> only way to make PI work is to make this call. If that's the case we
> need to carefully examine assumptions, and carefully comment the code
> to make sure people working on the scheduling code continue to think
> carefully about their assumptions in the future. But if we can get
> away without doing that, it will make things much easier.
>
> But it's certainly reasonable to expect the maintainers to offer you
> an alternate solution if your proposed solution is unacceptable. :-)
> Let me take a look and see what I think.
I really appreciate your detailed explanation of this; it is reasonable,
and thank you for your comments!
Thanks,
Feng
>
> -George
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-07 12:54 ` Jan Beulich
2015-09-09 8:56 ` Wu, Feng
@ 2015-09-16 16:56 ` George Dunlap
2015-09-17 6:15 ` Wu, Feng
2015-09-21 8:23 ` Jan Beulich
1 sibling, 2 replies; 27+ messages in thread
From: George Dunlap @ 2015-09-16 16:56 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Andrew Cooper, Dario Faggioli,
xen-devel@lists.xen.org, Feng Wu
On Mon, Sep 7, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1573,6 +1573,22 @@ static void __context_switch(void)
>> per_cpu(curr_vcpu, cpu) = n;
>> }
>>
>> +static inline void pi_ctxt_switch_from(struct vcpu *prev)
>> +{
>> + /*
>> + * When switching from non-idle to idle, we only do a lazy context switch.
>> + * However, in order for posted interrupt (if available and enabled) to
>> + * work properly, we at least need to update the descriptors.
>> + */
>> + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
>> + prev->arch.pi_ctxt_switch_from(prev);
>> +}
>> +
>> +static inline void pi_ctxt_switch_to(struct vcpu *next)
>> +{
>> + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
>> + next->arch.pi_ctxt_switch_to(next);
>> +}
>>
>> void context_switch(struct vcpu *prev, struct vcpu *next)
>> {
>> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>
>> set_current(next);
>>
>> + pi_ctxt_switch_from(prev);
>> +
>> if ( (per_cpu(curr_vcpu, cpu) == next) ||
>> (is_idle_domain(nextd) && cpu_online(cpu)) )
>> {
>> + pi_ctxt_switch_to(next);
>> local_irq_enable();
>
> This placement, if really intended that way, needs explanation (in a
> comment) and perhaps even renaming of the involved symbols, as
> looking at it from a general perspective it seems wrong (with
> pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
> want this only when switching back to what got switched out lazily
> before, i.e. this would be not something to take place on an arbitrary
> context switch). As to possible alternative names - maybe make the
> hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
Why on earth is this more clear than what he had before?
In the first call, he's not "preparing" anything -- he's actually
switching the PI context out for prev. And in the second call, he's
not "cancelling" anything -- he's actually switching the PI context in
for next. The names you suggest are actively confusing, not helpful.
But before talking about how to make things more clear, one side
question -- do we need to actually call pi_ctxt_switch_to() in
__context_switch()?
The only other place __context_switch() is called is
from __sync_local_execstate(). But the only reason that needs to be
called is because sometimes we *don't* call __context_switch(), and so
there are things on the cpu that aren't copied into the vcpu struct.
That doesn't apply to the PI state -- for one, nothing is copied from
the processor; and for two, pi_ctxt_switch_from() is called
unconditionally anyway.
Would it make more sense to call pi_context_switch(prev, next) just
after "set_current"?
(Keeping in mind I totally may have missed something...)
-George
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-16 16:56 ` George Dunlap
@ 2015-09-17 6:15 ` Wu, Feng
2015-09-21 8:23 ` Jan Beulich
1 sibling, 0 replies; 27+ messages in thread
From: Wu, Feng @ 2015-09-17 6:15 UTC (permalink / raw)
To: George Dunlap, Jan Beulich
Cc: Tian, Kevin, Keir Fraser, Andrew Cooper, Dario Faggioli,
xen-devel@lists.xen.org, Wu, Feng
> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
> Dunlap
> Sent: Thursday, September 17, 2015 12:57 AM
> To: Jan Beulich
> Cc: Wu, Feng; Tian, Kevin; Keir Fraser; Andrew Cooper; Dario Faggioli;
> xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for
> VT-d posted interrupts
>
> On Mon, Sep 7, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
> >> --- a/xen/arch/x86/domain.c
> >> +++ b/xen/arch/x86/domain.c
> >> @@ -1573,6 +1573,22 @@ static void __context_switch(void)
> >> per_cpu(curr_vcpu, cpu) = n;
> >> }
> >>
> >> +static inline void pi_ctxt_switch_from(struct vcpu *prev)
> >> +{
> >> + /*
> >> + * When switching from non-idle to idle, we only do a lazy context
> switch.
> >> + * However, in order for posted interrupt (if available and enabled) to
> >> + * work properly, we at least need to update the descriptors.
> >> + */
> >> + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
> >> + prev->arch.pi_ctxt_switch_from(prev);
> >> +}
> >> +
> >> +static inline void pi_ctxt_switch_to(struct vcpu *next)
> >> +{
> >> + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
> >> + next->arch.pi_ctxt_switch_to(next);
> >> +}
> >>
> >> void context_switch(struct vcpu *prev, struct vcpu *next)
> >> {
> >> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct
> vcpu *next)
> >>
> >> set_current(next);
> >>
> >> + pi_ctxt_switch_from(prev);
> >> +
> >> if ( (per_cpu(curr_vcpu, cpu) == next) ||
> >> (is_idle_domain(nextd) && cpu_online(cpu)) )
> >> {
> >> + pi_ctxt_switch_to(next);
> >> local_irq_enable();
> >
> > This placement, if really intended that way, needs explanation (in a
> > comment) and perhaps even renaming of the involved symbols, as
> > looking at it from a general perspective it seems wrong (with
> > pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
> > want this only when switching back to what got switched out lazily
> > before, i.e. this would be not something to take place on an arbitrary
> > context switch). As to possible alternative names - maybe make the
> > hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
>
> Why on earth is this more clear than what he had before?
>
> In the first call, he's not "preparing" anything -- he's actually
> switching the PI context out for prev. And in the second call, he's
> not "cancelling" anything -- he's actually switching the PI context in
> for next. The names you suggest are actively confusing, not helpful.
>
> But before talking about how to make things more clear, one side
> question -- do we need to actually call pi_ctxt_switch_to() in
> __context_switch()?
>
> The only other place __context_switch() is called is
> from __sync_local_execstate(). But the only reason that needs to be
> called is because sometimes we *don't* call __context_switch(), and so
> there are things on the cpu that aren't copied into the vcpu struct.
Thanks for the comments!
From my understanding, __sync_local_execstate() can only get called
in the following two cases:
#1) this_cpu(curr_vcpu) == current; in this case, __context_switch() is
not called.
#2) this_cpu(curr_vcpu) != current, and current == idle_vcpu; that means
we just switched from a non-idle vCPU to the idle vCPU, so here we need to
call __context_switch() to copy things back to the original vcpu struct.
Please correct me if the above understanding is wrong or incomplete.
I think calling pi_ctxt_switch_to() in __context_switch() is needed when
we are switching to a non-idle vCPU (we need to change the PI state of the
target vCPU), and the call is not needed when switching to the idle vCPU.
So if the above understanding is correct, I think your suggestion below
is really good; it makes things clearer.
>
> That doesn't apply to the PI state -- for one, nothing is copied from
> the processor; and for two, pi_ctxt_switch_from() is called
> unconditionally anyway.
>
> Would it make more sense to call pi_context_switch(prev, next) just
> after "set_current"?
I think it is a good point.
Thanks,
Feng
>
> (Keeping in mind I totally may have missed something...)
>
> -George
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-16 16:56 ` George Dunlap
2015-09-17 6:15 ` Wu, Feng
@ 2015-09-21 8:23 ` Jan Beulich
2015-09-21 9:28 ` George Dunlap
1 sibling, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2015-09-21 8:23 UTC (permalink / raw)
To: George Dunlap
Cc: Kevin Tian, Keir Fraser, Andrew Cooper, Dario Faggioli,
xen-devel@lists.xen.org, Feng Wu
>>> On 16.09.15 at 18:56, <George.Dunlap@eu.citrix.com> wrote:
> On Mon, Sep 7, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -1573,6 +1573,22 @@ static void __context_switch(void)
>>> per_cpu(curr_vcpu, cpu) = n;
>>> }
>>>
>>> +static inline void pi_ctxt_switch_from(struct vcpu *prev)
>>> +{
>>> + /*
>>> + * When switching from non-idle to idle, we only do a lazy context switch.
>>> + * However, in order for posted interrupt (if available and enabled) to
>>> + * work properly, we at least need to update the descriptors.
>>> + */
>>> + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
>>> + prev->arch.pi_ctxt_switch_from(prev);
>>> +}
>>> +
>>> +static inline void pi_ctxt_switch_to(struct vcpu *next)
>>> +{
>>> + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
>>> + next->arch.pi_ctxt_switch_to(next);
>>> +}
>>>
>>> void context_switch(struct vcpu *prev, struct vcpu *next)
>>> {
>>> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>
>>> set_current(next);
>>>
>>> + pi_ctxt_switch_from(prev);
>>> +
>>> if ( (per_cpu(curr_vcpu, cpu) == next) ||
>>> (is_idle_domain(nextd) && cpu_online(cpu)) )
>>> {
>>> + pi_ctxt_switch_to(next);
>>> local_irq_enable();
>>
>> This placement, if really intended that way, needs explanation (in a
>> comment) and perhaps even renaming of the involved symbols, as
>> looking at it from a general perspective it seems wrong (with
>> pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
>> want this only when switching back to what got switched out lazily
>> before, i.e. this would be not something to take place on an arbitrary
>> context switch). As to possible alternative names - maybe make the
>> hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
>
> Why on earth is this more clear than what he had before?
>
> In the first call, he's not "preparing" anything -- he's actually
> switching the PI context out for prev. And in the second call, he's
> not "cancelling" anything -- he's actually switching the PI context in
> for next. The names you suggest are actively confusing, not helpful.
While I think later discussion on this thread moved in a good direction,
I still think I should reply here (even if late): To me, the use of
pi_ctxt_switch_to() in the patch fragment still seen above is very
much the cancellation of the immediately preceding pi_ctxt_switch_from(),
as it's the "we don't want to do anything else" path that it gets put
into.
Jan
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-21 8:23 ` Jan Beulich
@ 2015-09-21 9:28 ` George Dunlap
2015-09-21 11:56 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: George Dunlap @ 2015-09-21 9:28 UTC (permalink / raw)
To: Jan Beulich, George Dunlap
Cc: Kevin Tian, Keir Fraser, Andrew Cooper, Dario Faggioli,
xen-devel@lists.xen.org, Feng Wu
On 09/21/2015 09:23 AM, Jan Beulich wrote:
>>>> On 16.09.15 at 18:56, <George.Dunlap@eu.citrix.com> wrote:
>> On Mon, Sep 7, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>>>> --- a/xen/arch/x86/domain.c
>>>> +++ b/xen/arch/x86/domain.c
>>>> @@ -1573,6 +1573,22 @@ static void __context_switch(void)
>>>> per_cpu(curr_vcpu, cpu) = n;
>>>> }
>>>>
>>>> +static inline void pi_ctxt_switch_from(struct vcpu *prev)
>>>> +{
>>>> + /*
>>>> + * When switching from non-idle to idle, we only do a lazy context switch.
>>>> + * However, in order for posted interrupt (if available and enabled) to
>>>> + * work properly, we at least need to update the descriptors.
>>>> + */
>>>> + if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
>>>> + prev->arch.pi_ctxt_switch_from(prev);
>>>> +}
>>>> +
>>>> +static inline void pi_ctxt_switch_to(struct vcpu *next)
>>>> +{
>>>> + if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
>>>> + next->arch.pi_ctxt_switch_to(next);
>>>> +}
>>>>
>>>> void context_switch(struct vcpu *prev, struct vcpu *next)
>>>> {
>>>> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>>
>>>> set_current(next);
>>>>
>>>> + pi_ctxt_switch_from(prev);
>>>> +
>>>> if ( (per_cpu(curr_vcpu, cpu) == next) ||
>>>> (is_idle_domain(nextd) && cpu_online(cpu)) )
>>>> {
>>>> + pi_ctxt_switch_to(next);
>>>> local_irq_enable();
>>>
>>> This placement, if really intended that way, needs explanation (in a
>>> comment) and perhaps even renaming of the involved symbols, as
>>> looking at it from a general perspective it seems wrong (with
>>> pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
>>> want this only when switching back to what got switched out lazily
>>> before, i.e. this would be not something to take place on an arbitrary
>>> context switch). As to possible alternative names - maybe make the
>>> hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
>>
>> Why on earth is this more clear than what he had before?
>>
>> In the first call, he's not "preparing" anything -- he's actually
>> switching the PI context out for prev. And in the second call, he's
>> not "cancelling" anything -- he's actually switching the PI context in
>> for next. The names you suggest are actively confusing, not helpful.
>
> While I think later discussion on this thread moved in a good direction,
> I still think I should reply here (even if late): To me, the use of
> pi_ctxt_switch_to() in the patch fragment still seen above is very
> much the cancellation of the immediately preceding pi_ctxt_switch_from(),
> as it's the "we don't want to do anything else" path that it gets put
> into.
Either we have different understandings about what the code does, or I
don't understand what you're saying here.
The codepath in question will only be called if we're switching *into*
or *out of* the "lazy context switch" -- i.e., switching from a vcpu to
the idle vcpu, but not saving or restoring state.
For the "switch into" case:
* prev will be a domain vcpu, next will be the idle vcpu
* pi_ctxt_switch_from(prev) will either change NV to 'pi_wake' or set SN
for prev's PI vectors (depending on whether prev is blocked or offline)
* pi_ctxt_switch_to(next) will do nothing, since next is the idle vcpu
For the "switching out" case:
* prev will be the idle vcpu, next will be a domain vcpu
* pi_ctxt_switch_from(prev) will do nothing, since prev is the idle vcpu
* pi_ctxt_switch_to(next) will clear SN and change the vector to
'posted_intr'
So there is no situation in which pi_ctxt_switch_to() "cancels" the
immediately preceding pi_ctxt_switch_from(). Either the preceding
from() did something, in which case to() does nothing; or the preceding
from() did nothing, in which case to() does something.
-George
* Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
2015-09-21 9:28 ` George Dunlap
@ 2015-09-21 11:56 ` Jan Beulich
0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2015-09-21 11:56 UTC (permalink / raw)
To: George Dunlap, George Dunlap
Cc: Kevin Tian, Keir Fraser, Andrew Cooper, Dario Faggioli,
xen-devel@lists.xen.org, Feng Wu
>>> On 21.09.15 at 11:28, <george.dunlap@citrix.com> wrote:
> On 09/21/2015 09:23 AM, Jan Beulich wrote:
>>>>> On 16.09.15 at 18:56, <George.Dunlap@eu.citrix.com> wrote:
>>> On Mon, Sep 7, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 25.08.15 at 03:57, <feng.wu@intel.com> wrote:
>>>>> @@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>>>
>>>>> set_current(next);
>>>>>
>>>>> + pi_ctxt_switch_from(prev);
>>>>> +
>>>>> if ( (per_cpu(curr_vcpu, cpu) == next) ||
>>>>> (is_idle_domain(nextd) && cpu_online(cpu)) )
>>>>> {
>>>>> + pi_ctxt_switch_to(next);
>>>>> local_irq_enable();
>>>>
>>>> This placement, if really intended that way, needs explanation (in a
>>>> comment) and perhaps even renaming of the involved symbols, as
>>>> looking at it from a general perspective it seems wrong (with
>>>> pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
>>>> want this only when switching back to what got switched out lazily
>>>> before, i.e. this would be not something to take place on an arbitrary
>>>> context switch). As to possible alternative names - maybe make the
>>>> hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
>>>
>>> Why on earth is this more clear than what he had before?
>>>
>>> In the first call, he's not "preparing" anything -- he's actually
>>> switching the PI context out for prev. And in the second call, he's
>>> not "cancelling" anything -- he's actually switching the PI context in
>>> for next. The names you suggest are actively confusing, not helpful.
>>
>> While I think later discussion on this thread moved in a good direction,
>> I still think I should reply here (even if late): To me, the use of
>> pi_ctxt_switch_to() in the patch fragment still seen above is very
>> much the cancellation of the immediately preceding pi_ctxt_switch_from(),
>> as it's the "we don't want to do anything else" path that it gets put
>> into.
>
> Either we have different understandings about what the code does, or I
> don't understand what you're saying here.
>
> The codepath in question will only be called if we're switching *into*
> or *out of* the "lazy context switch" -- i.e., switching from a vcpu to
> the idle vcpu, but not saving or restoring state.
Oh, I'm sorry - you're right.
Jan
end of thread, other threads:[~2015-09-21 12:07 UTC | newest]
Thread overview: 27+ messages
2015-09-21 5:07 [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts Wu, Feng
2015-09-21 9:45 ` George Dunlap
2015-09-21 12:07 ` Wu, Feng
-- strict thread matches above, loose matches on Subject: below --
2015-08-25 1:57 [PATCH v6 00/18] Add VT-d Posted-Interrupts support Feng Wu
2015-08-25 1:57 ` [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts Feng Wu
2015-09-07 12:54 ` Jan Beulich
2015-09-09 8:56 ` Wu, Feng
2015-09-09 10:26 ` Jan Beulich
2015-09-10 2:07 ` Wu, Feng
2015-09-10 8:27 ` Jan Beulich
2015-09-10 8:59 ` Wu, Feng
2015-09-10 9:26 ` Jan Beulich
2015-09-10 9:41 ` Wu, Feng
2015-09-10 10:01 ` Jan Beulich
2015-09-10 12:34 ` Wu, Feng
2015-09-10 12:44 ` Jan Beulich
2015-09-10 12:58 ` Wu, Feng
2015-09-10 13:15 ` Jan Beulich
2015-09-10 13:27 ` Wu, Feng
2015-09-10 14:01 ` Jan Beulich
2015-09-16 8:56 ` Wu, Feng
2015-09-16 17:08 ` George Dunlap
2015-09-17 6:26 ` Wu, Feng
2015-09-16 16:56 ` George Dunlap
2015-09-17 6:15 ` Wu, Feng
2015-09-21 8:23 ` Jan Beulich
2015-09-21 9:28 ` George Dunlap
2015-09-21 11:56 ` Jan Beulich