VT-d Posted-interrupt (PI) design for XEN

* VT-d Posted-interrupt (PI) design for XEN
@ 2015-03-04 13:30 Wu, Feng
  2015-03-04 15:19 ` Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Wu, Feng @ 2015-03-04 13:30 UTC (permalink / raw)
  To: xen-devel@lists.xen.org; +Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, Jan Beulich

VT-d Posted-interrupt (PI) design for XEN

Background
==========
As virtualization matures, device assignment is in ever greater demand.
However, today when a VM is running with assigned devices (such as a NIC),
external interrupt handling for the assigned devices always needs VMM
intervention.

VT-d Posted-interrupt is an enhanced method of handling interrupts
in a virtualized environment. Interrupt posting is the process by
which an interrupt request is recorded in a memory-resident
posted-interrupt-descriptor structure by the root-complex, followed by
an optional notification event issued to the CPU complex.

With VT-d Posted-interrupt we get the following advantages:
- Direct delivery of external interrupts to running vCPUs without VMM
intervention
- Decreased interrupt migration complexity. On vCPU migration, software
can atomically co-migrate all interrupts targeting the migrating vCPU.

Posted-interrupt Introduction
========================
There are two components to the Posted-interrupt architecture:
Processor Support and Root-Complex Support

- Processor Support
Posted-interrupt processing is a feature by which a processor processes
the virtual interrupts by recording them as pending on the virtual-APIC
page.

Posted-interrupt processing is enabled by setting the "process posted
interrupts" VM-execution control. The processing is performed in response
to the arrival of an interrupt with the posted-interrupt notification vector.
In response to such an interrupt, the processor processes virtual interrupts
recorded in a data structure called a posted-interrupt descriptor.

For more information about APICv and CPU-side Posted-interrupt, please refer
to Chapter 29, in particular Section 29.6, of the Intel SDM:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

- Root-Complex Support
Interrupt posting is the process by which an interrupt request (from IOAPIC
or MSI/MSIx capable sources) is recorded in a memory-resident
posted-interrupt-descriptor structure by the root-complex, followed by
an optional notification event issued to the CPU complex. The interrupt
request arriving at the root-complex carries the identity of the interrupt
request source and a 'remapping-index'. The remapping-index is used to
look up an entry in the memory-resident interrupt-remap-table. Unlike
with interrupt-remapping, the interrupt-remap-table-entry for a posted-
interrupt specifies a virtual-vector and a pointer to the posted-interrupt
descriptor. The virtual-vector specifies the vector of the interrupt to be
recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
hosts storage for the virtual-vectors and contains the attributes of the
notification event (interrupt) to be issued to the CPU complex to inform
CPU/software about pending interrupts recorded in the posted-interrupt
descriptor.

For more information about VT-d PI, please refer to:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

Design Overview
==============
In this design, we will cover the following items:
1. Add a variable to control whether VT-d posted-interrupt is enabled.
2. VT-d PI feature detection.
3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
4. Extend IRTE structure to support VT-d PI.
5. Introduce a new global vector which is used for waking up the HLT'ed vCPU.
6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
7. Update posted-interrupt descriptor during vCPU scheduling (when the
vCPU's state transitions among RUNSTATE_running / RUNSTATE_blocked /
RUNSTATE_runnable / RUNSTATE_offline).
8. New Xen boot command-line option that lets the user control the VT-d PI feature.
9. Multicast/broadcast and lowest priority interrupts consideration.

Implementation details
===================
- New variable to control VT-d PI
Like the variable 'iommu_intremap' for interrupt remapping, it is straightforward
to add a new one, 'iommu_intpost', for posted-interrupt. 'iommu_intpost' is set
only when interrupt remapping and VT-d posted-interrupt are both enabled.

- VT-d PI feature detection.
Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.
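
A minimal sketch of how the detection and the 'iommu_intpost' decision could
fit together. The helper names (cap_intr_post, intpost_enabled) and the way
the user option is passed in are assumptions made for illustration, not
existing Xen code; only the bit position comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 59 of the VT-d Capability Register reports posted-interrupt support. */
#define CAP_PI_SHIFT 59

/* Hypothetical helper: does this capability register value report VT-d PI? */
static inline bool cap_intr_post(uint64_t cap)
{
    return (cap >> CAP_PI_SHIFT) & 1;
}

/*
 * Hypothetical policy helper: posting is enabled only when interrupt
 * remapping is enabled, the user has not turned posting off, and the
 * hardware reports the capability.
 */
static inline bool intpost_enabled(uint64_t cap, bool intremap, bool user_opt)
{
    return intremap && user_opt && cap_intr_post(cap);
}
```

With this shape, disabling interrupt remapping disables posting as well,
whatever the hardware reports.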

- Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
Here is the new structure for posted-interrupt descriptor:

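struct pi_desc {
    DECLARE_BITMAP(pir, NR_VECTORS);  /* Posted Interrupt Requests, one bit per vector */
    union {
        struct {
            u64 on     : 1,   /* Outstanding Notification */
                sn     : 1,   /* Suppress Notification */
                rsvd_1 : 13,
                ndm    : 1,
                nv     : 8,   /* Notification Vector */
                rsvd_2 : 8,
                ndst   : 32;  /* Notification Destination (physical APIC ID) */
        };
        u64 control;
    };
    u32 rsvd[6];
} __attribute__ ((aligned (64)));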
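
As a sanity check on the layout, the structure can be compiled with
compile-time assertions on its size and alignment. This is only a sketch:
the typedefs, NR_VECTORS and DECLARE_BITMAP below are stand-ins for what the
hypervisor headers would provide, and the bit positions of the fields within
'control' depend on the compiler's bit-field layout (GCC or Clang on a
little-endian target is assumed).

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for definitions the hypervisor would normally provide. */
typedef uint32_t u32;
typedef uint64_t u64;
#define NR_VECTORS 256
#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define DECLARE_BITMAP(name, bits) \
    unsigned long name[((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG]

struct pi_desc {
    DECLARE_BITMAP(pir, NR_VECTORS);   /* 256 bits = 32 bytes */
    union {
        struct {
            u64 on:1, sn:1, rsvd_1:13, ndm:1, nv:8, rsvd_2:8, ndst:32;
        };
        u64 control;                   /* the same 64 bits as one word */
    };
    u32 rsvd[6];                       /* pads the descriptor to 64 bytes */
} __attribute__((aligned(64)));

/* 32 bytes of PIR + 8 bytes of control + 24 reserved bytes = 64 bytes. */
_Static_assert(sizeof(struct pi_desc) == 64, "descriptor must be 64 bytes");
_Static_assert(_Alignof(struct pi_desc) == 64, "descriptor must be 64-byte aligned");
_Static_assert(offsetof(struct pi_desc, control) == 32, "control follows PIR");
```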

- Extend IRTE structure to support VT-d PI.
Here is the new structure for IRTE:
/* interrupt remap entry */
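struct iremap_entry {
  union {
    u64 lo_val;
    struct {
        u64 p       : 1,
            fpd     : 1,
            dm      : 1,
            rh      : 1,
            tm      : 1,
            dlm     : 3,
            avail   : 4,
            res_1   : 4,
            vector  : 8,
            res_2   : 8,
            dst     : 32;
    } lo;                     /* remapped format */
    struct {
        u64 p       : 1,
            fpd     : 1,
            res_1   : 6,
            avail   : 4,
            res_2   : 2,
            urg     : 1,      /* URG: the interrupt is urgent */
            pst     : 1,
            vector  : 8,      /* Virtual Vector: guest vector of the interrupt */
            res_3   : 14,
            pda_l   : 26;     /* Descriptor Address, low part */
    } lo_intpost;             /* posted format */
  };
  union {
    u64 hi_val;
    struct {
        u64 sid     : 16,
            sq      : 2,
            svt     : 2,
            res_1   : 44;
    } hi;                     /* remapped format */
    struct {
        u64 sid     : 16,
            sq      : 2,
            svt     : 2,
            res_1   : 12,
            pda_h   : 32;     /* Descriptor Address, high part */
    } hi_intpost;             /* posted format */
  };
};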
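
The descriptor address is split across 'pda_l' and 'pda_h'. A sketch of the
split, assuming 'pda_l' carries address bits 31:6 and 'pda_h' carries bits
63:32, which is what the field widths (26 + 32 bits) and the 64-byte
alignment of the descriptor suggest. The helper names are hypothetical.

```c
#include <stdint.h>

#define PDA_LOW_SHIFT 6    /* descriptor is 64-byte aligned: bits 5:0 are zero */
#define PDA_LOW_BITS  26   /* width of pda_l */

/* Hypothetical container for the two IRTE address fields. */
struct pda_fields {
    uint32_t pda_l;        /* address bits 31:6  */
    uint32_t pda_h;        /* address bits 63:32 */
};

/* Split a 64-byte-aligned descriptor address into the two fields. */
static inline struct pda_fields pda_split(uint64_t addr)
{
    struct pda_fields f;

    f.pda_l = (uint32_t)((addr >> PDA_LOW_SHIFT) & ((1u << PDA_LOW_BITS) - 1));
    f.pda_h = (uint32_t)(addr >> 32);
    return f;
}

/* Rebuild the descriptor address from the two fields. */
static inline uint64_t pda_join(struct pda_fields f)
{
    return ((uint64_t)f.pda_h << 32) | ((uint64_t)f.pda_l << PDA_LOW_SHIFT);
}
```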

- Introduce a new global vector which is used to wake up the HLT'ed vCPU.
Currently, there is a global vector 'posted_intr_vector', which is used as the
global notification vector for all vCPUs in the system. This vector is stored in
the VMCS; the CPU treats it as a special vector and uses it to notify the related
pCPU when an interrupt is recorded in the posted-interrupt descriptor.

With VT-d PI, the VT-d engine can issue a notification event when an
assigned device issues an interrupt. We need to add a new global vector to
wake up the HLT'ed vCPU; please refer to the following scenario for the
usage of this new global vector:

1. vCPU0 is running on pCPU0
2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
3. An external interrupt from an assigned device occurs for vCPU0. If we
still use 'posted_intr_vector' as the notification vector for vCPU0, the
notification event for vCPU0 (the event will go to pCPU0) will be consumed
by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
again since the wakeup event for it is always consumed by other vCPUs
incorrectly. So we need to introduce another global vector, named
'pi_wakeup_vector', to wake up the HLT'ed vCPU.

- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
After VT-d PI is introduced, the format of IRTE is changed as follows:
	Descriptor Address: the address of the posted-interrupt descriptor
	Virtual Vector: the guest vector of the interrupt
	URG: indicates if the interrupt is urgent
	Other fields continue to have the same meaning

'Descriptor Address' identifies the destination vCPU of this interrupt, since
each vCPU has a dedicated posted-interrupt descriptor.

'Virtual Vector' gives the guest vector of the interrupt.

When the guest changes the configuration of an interrupt, such as its
CPU affinity or vector, we need to update the associated IRTE accordingly.

- Update posted-interrupt descriptor during vCPU scheduling
The basic idea here is:
1. When vCPU's state is RUNSTATE_running,
        - Set 'NV' to 'posted_intr_vector'.
        - Clear 'SN' to accept posted-interrupts.
        - Set 'NDST' to the pCPU on which the vCPU will be running.
2. When vCPU's state is RUNSTATE_blocked,
        - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
          related vCPU when a posted-interrupt happens for it.
          Please refer to the above section about the new global vector.
        - Clear 'SN' to accept posted-interrupts
3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
        - Set 'SN' to suppress non-urgent interrupts
          (Currently, we only support non-urgent interrupts).
          When the vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
          there is no need to accept posted-interrupt notification events:
          since we don't change the behavior of the scheduler when the
          interrupt occurs, we still need to wait for the next scheduling
          of the vCPU. When external interrupts from assigned devices
          occur, the interrupts are recorded in PIR, and will be synced
          to IRR before VM-Entry.
        - Set 'NV' to 'posted_intr_vector'.

- New boot command-line option for Xen, which lets the user control the VT-d PI feature.
Like 'intremap' for interrupt remapping, we add a new boot command-line option
'intpost' for posted-interrupts.

- Multicast/broadcast and lowest priority interrupts consideration
With VT-d PI, the destination vCPU information of an external interrupt
from assigned devices is stored in the IRTE, which leads to the following
design considerations:
1. Multicast/broadcast interrupts cannot be posted.
2. For lowest-priority interrupts, new Intel CPUs/chipsets/root-complexes
(starting from Nehalem) ignore the TPR value, and instead support two other
ways (configurable by BIOS) to handle lowest-priority interrupts:
	A) Round robin: In this method, the chipset simply delivers lowest priority
interrupts in a round-robin manner across all the available logical CPUs. While
this provides good load balancing, this was not the best thing to do always as
interrupts from the same device (like NIC) will start running on all the CPUs
thrashing caches and taking locks. This led to the next scheme.
	B) Vector hashing: In this method, hardware would apply a hash function
on the vector value in the interrupt request, and use that hash to pick a logical
CPU to route the lowest priority interrupt. This way, a given vector always goes
to the same logical CPU, avoiding the thrashing problem above.
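
To make the difference between the two schemes concrete, here is a small
sketch. The hash used by real hardware is not described in this thread; the
modulo below is purely a placeholder, and both function names are
hypothetical.

```c
#include <stdint.h>

/* Round robin: each lowest-priority interrupt goes to the next CPU in turn. */
static unsigned int rr_next;

static unsigned int pick_round_robin(unsigned int nr_cpus)
{
    unsigned int cpu = rr_next % nr_cpus;

    rr_next = cpu + 1;
    return cpu;
}

/*
 * Vector hashing: the destination depends only on the vector, so a given
 * vector always lands on the same CPU. The modulo is a placeholder hash.
 */
static unsigned int pick_vector_hash(uint8_t vector, unsigned int nr_cpus)
{
    return vector % nr_cpus;
}
```

With four CPUs, repeated interrupts with the same vector rotate through all
four CPUs under round robin, but always land on the same CPU under vector
hashing.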

So the gist of the above is that lowest-priority interrupts have never been delivered as
"lowest priority" in physical hardware.

For the KVM enabling work of VT-d PI, we divide this into two stages:
Stage 1: Only support single-CPU lowest-priority interrupts (configured via
/proc/irq or irqbalance). This is simple and clear.
Stage 2: After all the patches are merged, I will add the vector hashing support
for lowest-priority on VT-d PI.

On the Xen side, what is your opinion about supporting lowest-priority interrupts
for VT-d PI?

================================

Any comments about this design are highly appreciated!

Thanks,
Feng

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-04 13:30 VT-d Posted-interrupt (PI) design for XEN Wu, Feng
@ 2015-03-04 15:19 ` Jan Beulich
  2015-03-05  5:04   ` Wu, Feng
  2015-03-04 18:48 ` Andrew Cooper
  2015-03-10  2:22 ` Tian, Kevin
  2 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2015-03-04 15:19 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Z Zhang, Kevin Tian, xen-devel@lists.xen.org

>>> On 04.03.15 at 14:30, <feng.wu@intel.com> wrote:
> - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> Currently, there is a global vector 'posted_intr_vector', which is used as 
> the
> global notification vector for all vCPUs in the system. This vector is 
> stored in
> VMCS and CPU considers it as a special vector, uses it to notify the related
> pCPU when an interrupt is recorded in the posted-interrupt descriptor.
> 
> After having VT-d PI, VT-d engine can issue notification event when the
> assigned devices issue interrupts. We need add a new global vector to
> wakeup the HLT'ed vCPU, please refer to the following scenario for the
> usage of this new global vector:
> 
> 1. vCPU0 is running on pCPU0
> 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
> 3. An external interrupt from an assigned device occurs for vCPU0, if we
> still use 'posted_intr_vector' as the notification vector for vCPU0, the
> notification event for vCPU0 (the event will go to pCPU1) will be consumed
> by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> again since the wakeup event for it is always consumed by other vCPUs
> incorrectly. So we need introduce another global vector, naming 
> 'pi_wakeup_vector'
> to wake up the HTL'ed vCPU.

I'm afraid you describe a particular scenario here, but I don't see
how this is related to the introduction of another global vector:
Either the current (global) vector is sufficient, or another global
vector also can't solve your problem. I'm sure I'm missing something
here, so please be explicit.

> - Update posted-interrupt descriptor during vCPU scheduling
> The basic idea here is:
> 1. When vCPU's state is RUNSTATE_running,
>         - Set 'NV' to 'posted_intr_vector'.
>         - Clear 'SN' to accept posted-interrupts.
>         - Set 'NDST' to the pCPU on which the vCPU will be running.
>[...]

This is pretty hard to read without knowing what the abbreviations
actually stand for, and suggesting to hunt for them in the spec isn't
very reader friendly either. Please explain these fields, at the very
least by way of comments on the structure fields presented earlier.

> On Xen side, what is your opinion about support lowest-priority interrupts
> for VT-d PI?

I certainly think (as with every other virtualized piece of hardware)
that hardware behavior should be emulated as closely as possible.
I.e. yes, we should have it eventually. As to the two stage approach
mentioned for KVM - I've grown reservations against Intel people
making promises towards future implementation of something, i.e.
I'm kind of hesitant to agree to such an implementation model. Yet
you're to contribute the patches, and I'm surely not planning to veto
a stage-1-only implementation as it would be an improvement anyway.

Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-04 15:19 ` Jan Beulich
@ 2015-03-05  5:04   ` Wu, Feng
  2015-03-05  7:12     ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Wu, Feng @ 2015-03-05  5:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, xen-devel@lists.xen.org

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, March 04, 2015 11:19 PM
> To: Wu, Feng
> Cc: Tian, Kevin; Zhang, Yang Z; xen-devel@lists.xen.org
> Subject: Re: VT-d Posted-interrupt (PI) design for XEN
> 
> >>> On 04.03.15 at 14:30, <feng.wu@intel.com> wrote:
> > - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> > Currently, there is a global vector 'posted_intr_vector', which is used as
> > the
> > global notification vector for all vCPUs in the system. This vector is
> > stored in
> > VMCS and CPU considers it as a special vector, uses it to notify the related
> > pCPU when an interrupt is recorded in the posted-interrupt descriptor.
> >
> > After having VT-d PI, VT-d engine can issue notification event when the
> > assigned devices issue interrupts. We need add a new global vector to
> > wakeup the HLT'ed vCPU, please refer to the following scenario for the
> > usage of this new global vector:
> >
> > 1. vCPU0 is running on pCPU0
> > 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
> > 3. An external interrupt from an assigned device occurs for vCPU0, if we
> > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > notification event for vCPU0 (the event will go to pCPU1) will be consumed
> > by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> > again since the wakeup event for it is always consumed by other vCPUs
> > incorrectly. So we need introduce another global vector, naming
> > 'pi_wakeup_vector'
> > to wake up the HTL'ed vCPU.
> 
> I'm afraid you describe a particular scenario here, but I don't see
> how this is related to the introduction of another global vector:
> Either the current (global) vector is sufficient, or another global
> vector also can't solve your problem. I'm sure I'm missing something
> here, so please be explicit.
> 

In fact, the new global vector is used for the above scenario. Let me
explain this a bit more:

With VT-d PI, when an external interrupt from an assigned device happens,
here is the hardware processing flow:

1. The interrupt happens.
2. Find the associated IRTE.
3. Find the destination vCPU from the IRTE (via the posted-interrupt descriptor address).
4. Record the interrupt (stored in the IRTE as 'virtual vector') in the PIR field of the
posted-interrupt descriptor.
5. If needed (please refer to the VT-d spec for the conditions under which a notification
event is issued), issue a notification event to the destination CPU, which is stored in the
posted-interrupt descriptor as 'NDST'.
Back to the above scenario:
1. vCPU0 is running on pCPU0, and the 'NDST' field of vCPU0's posted-interrupt descriptor is pCPU0.
2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0.
3. An external interrupt from an assigned device happens; the destination of this interrupt is
determined by the above flow (IRTE --> posted-interrupt descriptor address/vCPU --> notification event to 'NDST').
If this external interrupt is for vCPU0, the notification event will be delivered to pCPU0, since the 'NDST' field
of vCPU0's posted-interrupt descriptor is pCPU0. If we use the current (global) vector for the notification event
for vCPU0 in this case, then, because the current global vector (notification vector) is special to the CPU,
vCPU1 will consume it while it is running on pCPU0, so we fail to wake up the HLT'ed vCPU0.

Please refer to Section 29.6 in the Intel SDM about how the CPU handles this particular vector:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

After introducing a new global vector named 'pi_wakeup_vector', before a vCPU is HLT'ed, we set
the 'NV' field (Notification Vector) in the vCPU's posted-interrupt descriptor to 'pi_wakeup_vector'.
This is a normal vector to the CPU, and the CPU does nothing special for it (unlike the current global vector).
In the handler of this vector, we can wake up the HLT'ed vCPU.

> > - Update posted-interrupt descriptor during vCPU scheduling
> > The basic idea here is:
> > 1. When vCPU's state is RUNSTATE_running,
> >         - Set 'NV' to 'posted_intr_vector'.
> >         - Clear 'SN' to accept posted-interrupts.
> >         - Set 'NDST' to the pCPU on which the vCPU will be running.
> >[...]
> 
> This is pretty hard to read without knowing what the abbreviations
> actually stand for, and suggesting to hunt for them in the spec isn't
> very reader friendly either. Please explain these fields, at the very
> least by way of comments on the structure fields presented earlier.
> 

There are some changes to IRTE and posted-interrupt descriptor after
VT-d PI is introduced:
IRTE:
- Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
- Virtual Vector: the guest vector of the interrupt
- URG: indicates if the interrupt is urgent

Posted-interrupt descriptor:
The Posted Interrupt Descriptor hosts the following fields:
- Posted Interrupt Request (PIR): Provides storage for posting (recording) interrupts (one bit
per vector, for up to 256 vectors).
- Outstanding Notification (ON): Indicates if there is a notification event outstanding (not
processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
the notification event (processor or software) resets it as part of posted interrupt processing.
- Suppress Notification (SN): Indicates if a notification event is to be suppressed (not
generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
URG=0).
- Notification Vector (NV): Specifies the vector for the notification event (interrupt).
- Notification Destination (NDST): Specifies the physical APIC-ID of the destination logical
processor for the notification event.

> > On Xen side, what is your opinion about support lowest-priority interrupts
> > for VT-d PI?
> 
> I certainly think (as with every other virtualized piece of hardware)
> that hardware behavior should be emulated as closely as possible.
> I.e. yes, we should have it eventually. As to the two stage approach
> mentioned for KVM - I've grown reservations against Intel people
> making promises towards future implementation of something, i.e.
> I'm kind of hesitant to agree to such an implementation model. Yet
> you're to contribute the patches, and I'm surely not planning to veto
> a stage-1-only implementation as it would be an improvement anyway.
> 

Well, I am okay with doing a full implementation for lowest-priority. KVM people
tend to do simple things in the first stage of hardware enabling; if you don't
like doing it this way, I will skip stage 1 above and implement the full solution
directly on the Xen side.

Thanks,
Feng

> Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05  5:04   ` Wu, Feng
@ 2015-03-05  7:12     ` Jan Beulich
  2015-03-05  8:29       ` Wu, Feng
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2015-03-05  7:12 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Z Zhang, Kevin Tian, xen-devel@lists.xen.org

>>> On 05.03.15 at 06:04, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, March 04, 2015 11:19 PM
>> >>> On 04.03.15 at 14:30, <feng.wu@intel.com> wrote:
>> > - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
>> > Currently, there is a global vector 'posted_intr_vector', which is used as
>> > the
>> > global notification vector for all vCPUs in the system. This vector is
>> > stored in
>> > VMCS and CPU considers it as a special vector, uses it to notify the related
>> > pCPU when an interrupt is recorded in the posted-interrupt descriptor.
>> >
>> > After having VT-d PI, VT-d engine can issue notification event when the
>> > assigned devices issue interrupts. We need add a new global vector to
>> > wakeup the HLT'ed vCPU, please refer to the following scenario for the
>> > usage of this new global vector:
>> >
>> > 1. vCPU0 is running on pCPU0
>> > 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
>> > 3. An external interrupt from an assigned device occurs for vCPU0, if we
>> > still use 'posted_intr_vector' as the notification vector for vCPU0, the
>> > notification event for vCPU0 (the event will go to pCPU1) will be consumed
>> > by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
>> > again since the wakeup event for it is always consumed by other vCPUs
>> > incorrectly. So we need introduce another global vector, naming
>> > 'pi_wakeup_vector'
>> > to wake up the HTL'ed vCPU.
>> 
>> I'm afraid you describe a particular scenario here, but I don't see
>> how this is related to the introduction of another global vector:
>> Either the current (global) vector is sufficient, or another global
>> vector also can't solve your problem. I'm sure I'm missing something
>> here, so please be explicit.
>> 
> 
> In fact, the new global vector is used for the above scenario. Let me
> explain this a bit more:
> 
> After having VT-d PI, when an external interrupt from an assigned device 
> happens,
> here is the hardware processing flow:
> 
> 1. Interrupts happen.
> 2. Find the associated IRTE.
> 3. Find the destination vCPU from IRTE (from Posted-interrupt descriptor 
> address)
> 4. Sync the interrupt (stored in IRTE as 'virtual vector') to PIRR fields in 
> Posted-interrupt descriptor.
> 5. If needed (Please refer to the VT-d Spec about the condition of issuing 
> Notification Event),
> issue notification event to the destination CPU which is store in 
> posted-interrupt descriptor as 'NDST'
> 
> Back to the above scenario:
> 1. vCPU0 is running in pCPU0, and the 'NDST' filed of vCPU0's 
> posted-interrupt descriptor is pCPU0
> 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0.
> 3. An external interrupt from an assigned device happens, the destination of 
> this interrupt will be
> determined as above flow (IRTE --> posted-interrupt descriptor address/vCPU --> 
> notification event to 'NDST'),
> If this external interrupt is for vCPU0, the notification event will be 
> delivered to pCPU0 since the 'NDST' field
> of vCPU0's posted-interrupt descriptor is pCPU0. if we use the current 
> (global) vector for the notification event
> for vCPU0 in the above case, since the current global vector (notification 
> vector) is a particular vector to CPU,
> vCPU1 will consume it while vCPU1 is currently running on pCPU0, so we 
> failed to wake up the HLT'ed vCPU0.
> 
> please refer to Section 29.6 in the Intel SDM about how CPU handles this 
> particular vector:
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-ar 
> chitectures-software-developer-manual-325462.pdf
> 
> After introducing a new global vector naming 'pi_wakeup_vector', before vCPU 
> is being HLT'ed, we set
> The 'NV' filed (Notification Vector) in the vCPU's posted-interrupt 
> descriptor to 'pi_wakeup_vector', and
> this is a normal vector to CPU and CPU will not do special things for it 
> (different from the current global vector).
> In the handler of this vector, we can wake up the HLT'ed vCPU.

So suppose you have more than one vCPU which most recently ran on
pCPU0 - how will the handler for the new vector know which of the
vCPU-s it should kick? And if it can know, why couldn't the handler for
posted_intr_vector know as well (i.e. after introducing a specific
handler for it in place of the currently used event_check_interrupt)?
(One of the reasons I'm asking, i.e. apart from wanting to
understand the model, is the limited amount of vectors we have.)

Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05  7:12     ` Jan Beulich
@ 2015-03-05  8:29       ` Wu, Feng
  2015-03-05  8:52         ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Wu, Feng @ 2015-03-05  8:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, xen-devel@lists.xen.org



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 05, 2015 3:13 PM
> To: Wu, Feng
> Cc: Tian, Kevin; Zhang, Yang Z; xen-devel@lists.xen.org
> Subject: RE: VT-d Posted-interrupt (PI) design for XEN
> 
> >>> On 05.03.15 at 06:04, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, March 04, 2015 11:19 PM
> >> >>> On 04.03.15 at 14:30, <feng.wu@intel.com> wrote:
> >> > - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> >> > Currently, there is a global vector 'posted_intr_vector', which is used as
> >> > the
> >> > global notification vector for all vCPUs in the system. This vector is
> >> > stored in
> >> > VMCS and CPU considers it as a special vector, uses it to notify the related
> >> > pCPU when an interrupt is recorded in the posted-interrupt descriptor.
> >> >
> >> > After having VT-d PI, VT-d engine can issue notification event when the
> >> > assigned devices issue interrupts. We need add a new global vector to
> >> > wakeup the HLT'ed vCPU, please refer to the following scenario for the
> >> > usage of this new global vector:
> >> >
> >> > 1. vCPU0 is running on pCPU0
> >> > 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
> >> > 3. An external interrupt from an assigned device occurs for vCPU0, if we
> >> > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> >> > notification event for vCPU0 (the event will go to pCPU1) will be consumed
> >> > by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> >> > again since the wakeup event for it is always consumed by other vCPUs
> >> > incorrectly. So we need introduce another global vector, naming
> >> > 'pi_wakeup_vector'
> >> > to wake up the HTL'ed vCPU.
> >>
> >> I'm afraid you describe a particular scenario here, but I don't see
> >> how this is related to the introduction of another global vector:
> >> Either the current (global) vector is sufficient, or another global
> >> vector also can't solve your problem. I'm sure I'm missing something
> >> here, so please be explicit.
> >>
> >
> > In fact, the new global vector is used for the above scenario. Let me
> > explain this a bit more:
> >
> > After having VT-d PI, when an external interrupt from an assigned device
> > happens,
> > here is the hardware processing flow:
> >
> > 1. Interrupts happen.
> > 2. Find the associated IRTE.
> > 3. Find the destination vCPU from IRTE (from Posted-interrupt descriptor
> > address)
> > 4. Sync the interrupt (stored in IRTE as 'virtual vector') to PIRR fields in
> > Posted-interrupt descriptor.
> > 5. If needed (Please refer to the VT-d Spec about the condition of issuing
> > Notification Event),
> > issue notification event to the destination CPU which is store in
> > posted-interrupt descriptor as 'NDST'
> >
> > Back to the above scenario:
> > 1. vCPU0 is running in pCPU0, and the 'NDST' filed of vCPU0's
> > posted-interrupt descriptor is pCPU0
> > 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0.
> > 3. An external interrupt from an assigned device happens, the destination of
> > this interrupt will be
> > determined as above flow (IRTE --> posted-interrupt descriptor address/vCPU
> -->
> > notification event to 'NDST'),
> > If this external interrupt is for vCPU0, the notification event will be
> > delivered to pCPU0 since the 'NDST' field
> > of vCPU0's posted-interrupt descriptor is pCPU0. if we use the current
> > (global) vector for the notification event
> > for vCPU0 in the above case, since the current global vector (notification
> > vector) is a particular vector to CPU,
> > vCPU1 will consume it while vCPU1 is currently running on pCPU0, so we
> > failed to wake up the HLT'ed vCPU0.
> >
> > please refer to Section 29.6 in the Intel SDM about how CPU handles this
> > particular vector:
> >
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6
> 4-ia-32-ar
> > chitectures-software-developer-manual-325462.pdf
> >
> > After introducing a new global vector naming 'pi_wakeup_vector', before
> vCPU
> > is being HLT'ed, we set
> > The 'NV' filed (Notification Vector) in the vCPU's posted-interrupt
> > descriptor to 'pi_wakeup_vector', and
> > this is a normal vector to CPU and CPU will not do special things for it
> > (different from the current global vector).
> > In the handler of this vector, we can wake up the HLT'ed vCPU.
> 
> So suppose you have more than on vCPU which most recently ran on
> pCPU0 - how will the handler for the new vector know which of the
> vCPU-s it should kick? 

Oh, sorry, I thought I had described how to wake up the HLT'ed vCPU in this design,
but it seems I missed it. Here it is.

1. Define a per-CPU list 'blocked_vcpu_on_cpu_lock', which stores the vCPUs
blocked on that pCPU.
2. When the vCPU's state changes to RUNSTATE_blocked, insert the vCPU
into the per-CPU list belonging to the pCPU it was running on.
3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.

In the handler of 'pi_wakeup_vector', we do:
1. Get the physical CPU.
2. Iterate over the list 'blocked_vcpu_on_cpu_lock' of the current pCPU; if 'ON' is set
for a vCPU on the list, we unblock that vCPU.
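
The list handling and the wakeup handler described above can be sketched
roughly as follows. This is only a simplified model, not the actual patch:
the names 'struct vcpu', 'blocked_list', 'pi_block', 'pi_unblock' and
'pi_wakeup_handler' are hypothetical, the 'ON' bit is modelled as a plain
boolean, and the locking a real per-CPU list would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PCPUS 4              /* hypothetical number of physical CPUs */

/* Hypothetical, simplified vCPU: only the fields this sketch needs. */
struct vcpu {
    bool on;                    /* models the 'ON' bit of the PI descriptor */
    bool blocked;               /* true while in RUNSTATE_blocked */
    struct vcpu *next;          /* link in the per-pCPU blocked list */
};

/* One list head per pCPU; a real implementation would also need a lock. */
static struct vcpu *blocked_list[NR_PCPUS];

/* The vCPU blocks: queue it on the pCPU it was running on. */
static void pi_block(struct vcpu *v, unsigned int pcpu)
{
    assert(pcpu < NR_PCPUS);
    v->blocked = true;
    v->next = blocked_list[pcpu];
    blocked_list[pcpu] = v;
}

/* The vCPU is unblocked: unlink it from that pCPU's list. */
static void pi_unblock(struct vcpu *v, unsigned int pcpu)
{
    struct vcpu **pp;

    assert(pcpu < NR_PCPUS);
    for (pp = &blocked_list[pcpu]; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == v) {
            *pp = v->next;
            v->next = NULL;
            break;
        }
    }
    v->blocked = false;
}

/*
 * Handler for 'pi_wakeup_vector' on 'pcpu': unblock every queued vCPU
 * whose 'ON' bit is set. Returns the number of vCPUs woken.
 */
static unsigned int pi_wakeup_handler(unsigned int pcpu)
{
    unsigned int woken = 0;
    struct vcpu *v, *next;

    assert(pcpu < NR_PCPUS);
    for (v = blocked_list[pcpu]; v != NULL; v = next) {
        next = v->next;         /* saved because pi_unblock() unlinks v */
        if (v->on) {
            pi_unblock(v, pcpu);
            woken++;
        }
    }
    return woken;
}
```

With two vCPUs blocked on the same pCPU and 'ON' set for only one of them,
the handler wakes just that one and leaves the other on the list.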

> And if it can know, why couldn't the handler for
> posted_intr_vector not know either (i.e. after introducing a specific
> handler for it in place of the currently used event_check_interrupt)?

Coming back to the above scenario: vCPU1 is running on pCPU0 while vCPU0
is blocked, and we still use posted_intr_vector for the blocked vCPU0. If vCPU1
is running in non-root mode and external interrupts happen for it, the notification
event will be handled by CPU hardware (in non-root mode) automatically,
so we cannot get any control in the handler for posted_intr_vector.

Thanks,
Feng

> (One of the reasons I'm asking, i.e. apart from wanting to
> understand the model, is the limited amount of vectors we have.)
> 
> Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05  8:29       ` Wu, Feng
@ 2015-03-05  8:52         ` Jan Beulich
  2015-03-05  9:07           ` Wu, Feng
  2015-03-05 12:02           ` Tim Deegan
  0 siblings, 2 replies; 22+ messages in thread
From: Jan Beulich @ 2015-03-05  8:52 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Z Zhang, Kevin Tian, xen-devel@lists.xen.org

>>> On 05.03.15 at 09:29, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, March 05, 2015 3:13 PM
>> And if it can know, why couldn't the handler for
>> posted_intr_vector not know either (i.e. after introducing a specific
>> handler for it in place of the currently used event_check_interrupt)?
> 
> Come back to the above scenario, vCPU1 is running on pCPU0 while vCPU0
> is blocked, if we still use posted_intr_vector for the blocked vCPU0. If 
> vCPU1
> is running in non-root mode and external interrupts happen for it, the 
> notification
> event will be handled by CPU hardware (in non-root mode) automatically,
> then we cannot get any control in the handler for posted_intr_vector.

And how would this be different with your separate new vector? I
feel I'm missing something, but I'm afraid I have to rely on you to
point out what it is. Just again - please explain what it is you need
two global vectors for that can't be done with one.

Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05  8:52         ` Jan Beulich
@ 2015-03-05  9:07           ` Wu, Feng
  2015-03-05 10:14             ` Jan Beulich
  2015-03-05 12:02           ` Tim Deegan
  1 sibling, 1 reply; 22+ messages in thread
From: Wu, Feng @ 2015-03-05  9:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, xen-devel@lists.xen.org

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 05, 2015 4:52 PM
> To: Wu, Feng
> Cc: Tian, Kevin; Zhang, Yang Z; xen-devel@lists.xen.org
> Subject: RE: VT-d Posted-interrupt (PI) design for XEN
> 
> >>> On 05.03.15 at 09:29, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, March 05, 2015 3:13 PM
> >> And if it can know, why couldn't the handler for
> >> posted_intr_vector not know either (i.e. after introducing a specific
> >> handler for it in place of the currently used event_check_interrupt)?
> >
> > Come back to the above scenario, vCPU1 is running on pCPU0 while vCPU0
> > is blocked, if we still use posted_intr_vector for the blocked vCPU0. If
> > vCPU1
> > is running in non-root mode and external interrupts happen for it, the
> > notification
> > event will be handled by CPU hardware (in non-root mode) automatically,
> > then we cannot get any control in the handler for posted_intr_vector.
> 
> And how would this be different with your separate new vector? I
> feel I'm missing something, but I'm afraid I have to rely on you to
> point out what it is. Just again - please explain what it is you need
> two global vectors for that can't be done with one.

Still using the above scenario: vCPU1 is running in non-root mode
and external interrupts happen for vCPU0 (which is HLT'ed).

If we use 'posted_intr_vector' for vCPU0, and 'posted_intr_vector' is also
used for other vCPUs, including vCPU1, the VT-d engine will issue the notification
event using this global vector, and this SPECIAL vector will be handled
this way: (from Section 29.6 in the Intel SDM:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf)

1. The local APIC is acknowledged; this provides the processor core with an interrupt vector, called here the
physical vector.
2. If the physical vector equals the posted-interrupt notification vector, the logical processor continues to the next
step. Otherwise, a VM exit occurs as it would normally due to an external interrupt; the vector is saved in the
VM-exit interruption-information field.
3. The processor clears the outstanding-notification bit in the posted-interrupt descriptor. This is done atomically
so as to leave the remainder of the descriptor unmodified (e.g., with a locked AND operation).
4. The processor writes zero to the EOI register in the local APIC; this dismisses the interrupt with the posted-interrupt
notification vector from the local APIC.
5. The logical processor performs a logical-OR of PIR into VIRR and clears PIR. No other agent can read or write a
PIR bit (or group of bits) between the time it is read (to determine what to OR into VIRR) and when it is cleared.
6. The logical processor sets RVI to be the maximum of the old value of RVI and the highest index of all bits that
were set in PIR; if no bit was set in PIR, RVI is left unmodified.
7. The logical processor evaluates pending virtual interrupts as described in Section 29.2.1.

This is handled entirely by CPU hardware, so we cannot get control in the handler for posted_intr_vector.

OTOH, if we use 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the notification event using this new vector.
Since this new vector is not a SPECIAL one to the CPU, it is just a normal vector: the CPU simply receives a normal
external interrupt (causing a VM exit, as in step 2 above), and we get control in the handler of this new vector.
In this case, the hypervisor can do something in it, such as waking up the HLT'ed vCPU.
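
To make the difference between the two vectors concrete, here is a rough
software model of steps 2, 3, 5 and 6 above. It is only an illustration of
what the hardware does, not hypervisor code: the vector numbers, the
structure layouts and the function name 'consumed_by_hardware' are all
assumptions made for this sketch, and step 4 (the EOI write) is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIR_WORDS 8                 /* 256 vectors / 32 bits per word */
#define POSTED_INTR_VECTOR 0xf2u    /* hypothetical vector number */
#define PI_WAKEUP_VECTOR   0xf1u    /* hypothetical vector number */

/* Simplified posted-interrupt descriptor of the vCPU in non-root mode. */
struct pi_desc {
    uint32_t pir[PIR_WORDS];        /* posted-interrupt requests */
    bool on;                        /* outstanding-notification bit */
};

/* Simplified virtual-APIC state of that same vCPU. */
struct vapic_state {
    uint32_t virr[PIR_WORDS];       /* virtual interrupt-request register */
    unsigned int rvi;               /* requesting virtual interrupt */
};

/*
 * Models an external interrupt arriving while a vCPU runs in non-root
 * mode. Returns true if the CPU consumes it silently (no VM exit), or
 * false if a VM exit occurs and the hypervisor's handler gets to run.
 */
static bool consumed_by_hardware(unsigned int physical_vector,
                                 struct pi_desc *pd, struct vapic_state *va)
{
    unsigned int w, b, highest = 0;
    bool any = false;

    assert(pd != NULL && va != NULL);

    /* Step 2: any other vector causes a normal VM exit. */
    if (physical_vector != POSTED_INTR_VECTOR)
        return false;

    /* Step 3: clear the outstanding-notification bit. */
    pd->on = false;

    /* Step 5: OR PIR into VIRR and clear PIR. */
    for (w = 0; w < PIR_WORDS; w++) {
        uint32_t bits = pd->pir[w];

        pd->pir[w] = 0;
        va->virr[w] |= bits;
        for (b = 0; b < 32; b++) {
            if (bits & (UINT32_C(1) << b)) {
                highest = w * 32 + b;   /* ascending scan keeps the maximum */
                any = true;
            }
        }
    }

    /* Step 6: RVI becomes the maximum of its old value and that index. */
    if (any && highest > va->rvi)
        va->rvi = highest;

    return true;
}
```

Note that only the descriptor of the vCPU currently running on that pCPU
(vCPU1 in the scenario) is touched on the consumed path, which is why the
blocked vCPU0 would never be woken if it shared 'posted_intr_vector'.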

Hope this can clarify your confusion.

Thanks,
Feng

> 
> Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05  9:07           ` Wu, Feng
@ 2015-03-05 10:14             ` Jan Beulich
  2015-03-06  2:01               ` Wu, Feng
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2015-03-05 10:14 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Z Zhang, Kevin Tian, xen-devel@lists.xen.org

>>> On 05.03.15 at 10:07, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, March 05, 2015 4:52 PM
>> And how would this be different with your separate new vector? I
>> feel I'm missing something, but I'm afraid I have to rely on you to
>> point out what it is. Just again - please explain what it is you need
>> two global vectors for that can't be done with one.
> 
> Stilling using the above scenario, if vCPU1 is running in non-root mode
> and external interrupts happen for vCPU0 (who is HLT'ed).
> 
> If using 'posted_intr_vector' for vCPU0 and 'posted_intr_vector' is also
> used for other vCPUs, including vCPU1. VT-d engine will issue notification
> event using this global vector, and this SPECIAL vector will be handled
> this way: (from Section 29.6 in the Intel SDM:
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-ar 
> chitectures-software-developer-manual-325462.pdf)
> 
> 1. The local APIC is acknowledged; this provides the processor core with an 
> interrupt vector, called here the
> physical vector.
> 2. If the physical vector equals the posted-interrupt notification vector, 
> the logical processor continues to the next
> step. Otherwise, a VM exit occurs as it would normally due to an external 
> interrupt; the vector is saved in the
> VM-exit interruption-information field.
> 3. The processor clears the outstanding-notification bit in the 
> posted-interrupt descriptor. This is done atomically
> so as to leave the remainder of the descriptor unmodified (e.g., with a 
> locked AND operation).
> 4. The processor writes zero to the EOI register in the local APIC; this 
> dismisses the interrupt with the postedinterrupt
> notification vector from the local APIC.
> 5. The logical processor performs a logical-OR of PIR into VIRR and clears 
> PIR. No other agent can read or write a
> PIR bit (or group of bits) between the time it is read (to determine what to 
> OR into VIRR) and when it is cleared.
> 6. The logical processor sets RVI to be the maximum of the old value of RVI 
> and the highest index of all bits that
> were set in PIR; if no bit was set in PIR, RVI is left unmodified.
> 7. The logical processor evaluates pending virtual interrupts as described 
> in Section 29.2.1.
> 
> This is totally handled by CPU hardware, so we cannot get control in the 
> handler for posted_intr_vector.
> 
> OTOH, if using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue 
> notification event using this new vector,
> Since this new vector is not a SPECIAL one to CPU, it is just a normal 
> vector. To cpu, it just receives an normal
> external interrupt, then we can get control in the handler of this new 
> vector. In this case, hypervisor can
> do something in it, such as wakeup the HLT'ed vCPU.
> 
> Hope this can clarify your confusion.

Thanks, yes - it is this "vector-is-special-to-CPU" that makes a second
vector necessary. Please make sure this is being properly explained in
the description and/or code comments of the patches to come (of
course without need to quote the SDM, but a reference to the
respective section may be useful).

Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05 10:14             ` Jan Beulich
@ 2015-03-06  2:01               ` Wu, Feng
  0 siblings, 0 replies; 22+ messages in thread
From: Wu, Feng @ 2015-03-06  2:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, xen-devel@lists.xen.org



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 05, 2015 6:15 PM
> To: Wu, Feng
> Cc: Tian, Kevin; Zhang, Yang Z; xen-devel@lists.xen.org
> Subject: RE: VT-d Posted-interrupt (PI) design for XEN
> 
> >>> On 05.03.15 at 10:07, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, March 05, 2015 4:52 PM
> >> And how would this be different with your separate new vector? I
> >> feel I'm missing something, but I'm afraid I have to rely on you to
> >> point out what it is. Just again - please explain what it is you need
> >> two global vectors for that can't be done with one.
> >
> > Stilling using the above scenario, if vCPU1 is running in non-root mode
> > and external interrupts happen for vCPU0 (who is HLT'ed).
> >
> > If using 'posted_intr_vector' for vCPU0 and 'posted_intr_vector' is also
> > used for other vCPUs, including vCPU1. VT-d engine will issue notification
> > event using this global vector, and this SPECIAL vector will be handled
> > this way: (from Section 29.6 in the Intel SDM:
> >
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6
> 4-ia-32-ar
> > chitectures-software-developer-manual-325462.pdf)
> >
> > 1. The local APIC is acknowledged; this provides the processor core with an
> > interrupt vector, called here the
> > physical vector.
> > 2. If the physical vector equals the posted-interrupt notification vector,
> > the logical processor continues to the next
> > step. Otherwise, a VM exit occurs as it would normally due to an external
> > interrupt; the vector is saved in the
> > VM-exit interruption-information field.
> > 3. The processor clears the outstanding-notification bit in the
> > posted-interrupt descriptor. This is done atomically
> > so as to leave the remainder of the descriptor unmodified (e.g., with a
> > locked AND operation).
> > 4. The processor writes zero to the EOI register in the local APIC; this
> > dismisses the interrupt with the postedinterrupt
> > notification vector from the local APIC.
> > 5. The logical processor performs a logical-OR of PIR into VIRR and clears
> > PIR. No other agent can read or write a
> > PIR bit (or group of bits) between the time it is read (to determine what to
> > OR into VIRR) and when it is cleared.
> > 6. The logical processor sets RVI to be the maximum of the old value of RVI
> > and the highest index of all bits that
> > were set in PIR; if no bit was set in PIR, RVI is left unmodified.
> > 7. The logical processor evaluates pending virtual interrupts as described
> > in Section 29.2.1.
> >
> > This is totally handled by CPU hardware, so we cannot get control in the
> > handler for posted_intr_vector.
> >
> > OTOH, if using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue
> > notification event using this new vector,
> > Since this new vector is not a SPECIAL one to CPU, it is just a normal
> > vector. To cpu, it just receives an normal
> > external interrupt, then we can get control in the handler of this new
> > vector. In this case, hypervisor can
> > do something in it, such as wakeup the HLT'ed vCPU.
> >
> > Hope this can clarify your confusion.
> 
> Thanks, yes - it is this "vector-is-special-to-CPU" that makes a second
> vector necessary. Please make sure this is being properly explained in
> the description and/or code comments of the patches to come (of
> course without need to quote the SDM, but a reference to the
> respective section may be useful).

Sure, I will add the description later!

So things are a little clearer now; could you please take some time to
review this design again and give more comments? Thanks a lot!!

Thanks,
Feng

> 
> Jan

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05  8:52         ` Jan Beulich
  2015-03-05  9:07           ` Wu, Feng
@ 2015-03-05 12:02           ` Tim Deegan
  2015-03-06  2:07             ` Wu, Feng
  1 sibling, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2015-03-05 12:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Yang Z Zhang, Kevin Tian, Feng Wu, xen-devel@lists.xen.org

Hi,

At 08:52 +0000 on 05 Mar (1425541947), Jan Beulich wrote:
> >>> On 05.03.15 at 09:29, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, March 05, 2015 3:13 PM
> >> And if it can know, why couldn't the handler for
> >> posted_intr_vector not know either (i.e. after introducing a specific
> >> handler for it in place of the currently used event_check_interrupt)?
> > 
> > Come back to the above scenario, vCPU1 is running on pCPU0 while vCPU0
> > is blocked, if we still use posted_intr_vector for the blocked vCPU0. If 
> > vCPU1
> > is running in non-root mode and external interrupts happen for it, the 
> > notification
> > event will be handled by CPU hardware (in non-root mode) automatically,
> > then we cannot get any control in the handler for posted_intr_vector.
> 
> And how would this be different with your separate new vector? I
> feel I'm missing something, but I'm afraid I have to rely on you to
> point out what it is. Just again - please explain what it is you need
> two global vectors for that can't be done with one.

I think the relevant detail is that the posted_intr_vector is consumed
by the CPU's posted-interrupt logic and doesn't cause an exit to Xen.

But I don't understand why we would need a new global vector for
RUNSTATE_blocked rather than suppressing the posted interrupts as you
suggest for RUNSTATE_runnable.  (Or conversely why not use the new
global vector for RUNSTATE_runnable too?)

Cheers,

Tim.

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-05 12:02           ` Tim Deegan
@ 2015-03-06  2:07             ` Wu, Feng
  2015-03-06  9:44               ` Tim Deegan
  0 siblings, 1 reply; 22+ messages in thread
From: Wu, Feng @ 2015-03-06  2:07 UTC (permalink / raw)
  To: Tim Deegan, Jan Beulich
  Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, xen-devel@lists.xen.org



> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, March 05, 2015 8:03 PM
> To: Jan Beulich
> Cc: Wu, Feng; Zhang, Yang Z; Tian, Kevin; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> 
> Hi,
> 
> At 08:52 +0000 on 05 Mar (1425541947), Jan Beulich wrote:
> > >>> On 05.03.15 at 09:29, <feng.wu@intel.com> wrote:
> > >> From: Jan Beulich [mailto:JBeulich@suse.com]
> > >> Sent: Thursday, March 05, 2015 3:13 PM
> > >> And if it can know, why couldn't the handler for
> > >> posted_intr_vector not know either (i.e. after introducing a specific
> > >> handler for it in place of the currently used event_check_interrupt)?
> > >
> > > Come back to the above scenario, vCPU1 is running on pCPU0 while vCPU0
> > > is blocked, if we still use posted_intr_vector for the blocked vCPU0. If
> > > vCPU1
> > > is running in non-root mode and external interrupts happen for it, the
> > > notification
> > > event will be handled by CPU hardware (in non-root mode) automatically,
> > > then we cannot get any control in the handler for posted_intr_vector.
> >
> > And how would this be different with your separate new vector? I
> > feel I'm missing something, but I'm afraid I have to rely on you to
> > point out what it is. Just again - please explain what it is you need
> > two global vectors for that can't be done with one.
> 
> I think the relevant detail is that the posted_intr_vector is consumed
> by the CPU's posted-interrupt logic and doesn't cause an exit to Xen.
> 

Exactly!

> But I don't understand why we would need a new global vector for
> RUNSTATE_blocked rather than suppressing the posted interrupts as you
> suggest for RUNSTATE_runnable.  (Or conversely why not use the new
> global vector for RUNSTATE_runnable too?)

If we suppress the posted interrupts while the vCPU is blocked, it cannot
be unblocked by external interrupts, which is not correct.

Thanks,
Feng

> 
> Cheers,
> 
> Tim.

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-06  2:07             ` Wu, Feng
@ 2015-03-06  9:44               ` Tim Deegan
  2015-03-09  2:03                 ` Wu, Feng
  0 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2015-03-06  9:44 UTC (permalink / raw)
  To: Wu, Feng; +Cc: Zhang, Yang Z, Tian, Kevin, Jan Beulich, xen-devel@lists.xen.org

At 02:07 +0000 on 06 Mar (1425604054), Wu, Feng wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > But I don't understand why we would need a new global vector for
> > RUNSTATE_blocked rather than suppressing the posted interrupts as you
> > suggest for RUNSTATE_runnable.  (Or conversely why not use the new
> > global vector for RUNSTATE_runnable too?)
> 
> If we suppress the posted-interrupts when vCPU is blocked, it cannot
> be unblocked by the external interrupts, this is not correct.

OK, I don't understand at all now. :)  When the posted interrupt is
suppressed, what happens to the interrupt?  If it's just dropped, then
we can't use that for _any_ cases.  If it goes through the old path,
via the vlapic, that should be enough to wake any HLT'ed vcpu.  It
sounds like it might be a little slower, but not necessarily once
you've had to add a new list of potentially-HLT'd-and-wakeable vcpus,
especially with many idle vcpus.

Tim.

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-06  9:44               ` Tim Deegan
@ 2015-03-09  2:03                 ` Wu, Feng
  2015-03-09 10:33                   ` Tim Deegan
  0 siblings, 1 reply; 22+ messages in thread
From: Wu, Feng @ 2015-03-09  2:03 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, Jan Beulich,
	xen-devel@lists.xen.org



> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Friday, March 06, 2015 5:44 PM
> To: Wu, Feng
> Cc: Jan Beulich; Zhang, Yang Z; Tian, Kevin; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> 
> At 02:07 +0000 on 06 Mar (1425604054), Wu, Feng wrote:
> > > From: Tim Deegan [mailto:tim@xen.org]
> > > But I don't understand why we would need a new global vector for
> > > RUNSTATE_blocked rather than suppressing the posted interrupts as you
> > > suggest for RUNSTATE_runnable.  (Or conversely why not use the new
> > > global vector for RUNSTATE_runnable too?)
> >
> > If we suppress the posted-interrupts when vCPU is blocked, it cannot
> > be unblocked by the external interrupts, this is not correct.
> 
> OK, I don't understand at all now. :)  When the posted interrupt is
> suppressed, what happens to the interrupt? 

When the posted interrupt is suppressed, the VT-d engine will not issue
notification events.

> If it's just dropped, then we can't use that for _any_ cases. 

We can suppress the posted interrupt when the vCPU is waiting in the runqueue
(the vCPU is in the RUNSTATE_runnable state). There is no need to send a notification
event when the vCPU is in this state: when an interrupt happens, the interrupt
information is not _dropped_; instead, it is stored in PIR, and this will
be synced to vIRR before VM-Entry.
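
As a rough model of this behaviour (a sketch only; the structure layout and
the names 'vtd_post_interrupt' and 'sync_pir_to_virr' are made up for
illustration, and the urgent (URG) handling in the VT-d spec is ignored):
the interrupt is always recorded in PIR, and only the notification event
depends on 'SN' and 'ON'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIR_WORDS 8             /* 256 vectors / 32 bits per word */

/* Simplified posted-interrupt descriptor (field layout is illustrative). */
struct pi_desc {
    uint32_t pir[PIR_WORDS];    /* posted-interrupt requests */
    bool on;                    /* outstanding notification */
    bool sn;                    /* suppress notification */
};

/*
 * Models the VT-d engine posting 'vector' into the descriptor.
 * Returns true if a notification event would be sent to 'NDST'.
 */
static bool vtd_post_interrupt(struct pi_desc *pd, uint8_t vector)
{
    assert(pd != NULL);

    /* The interrupt is always recorded, even when SN is set. */
    pd->pir[vector / 32] |= UINT32_C(1) << (vector % 32);

    /* No notification if one is already outstanding or if suppressed. */
    if (pd->on || pd->sn)
        return false;

    pd->on = true;
    return true;
}

/* Models the software sync of PIR into vIRR done before VM entry. */
static void sync_pir_to_virr(struct pi_desc *pd, uint32_t virr[PIR_WORDS])
{
    unsigned int w;

    assert(pd != NULL && virr != NULL);
    for (w = 0; w < PIR_WORDS; w++) {
        virr[w] |= pd->pir[w];
        pd->pir[w] = 0;
    }
}
```

So with 'SN' set for a runnable vCPU, nothing is lost: the pending bit sits
in PIR until the vCPU is scheduled and the sync runs before VM entry.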

> If it goes through the old path,
> via the vlapic, that should be enough to wake any HLT'ed vcpu.  It
> sounds like it might be a little slower, but not necessarily once
> you've had to add a new list of potentially-HLT'd-and-wakeable vcpus,
> especially with many idle vcpus.


When posted interrupts are used, how would we go through the old path?

Thanks,
Feng

> 
> Tim.

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-09  2:03                 ` Wu, Feng
@ 2015-03-09 10:33                   ` Tim Deegan
  2015-03-09 11:45                     ` Andrew Cooper
  0 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2015-03-09 10:33 UTC (permalink / raw)
  To: Wu, Feng; +Cc: Zhang, Yang Z, Tian, Kevin, Jan Beulich, xen-devel@lists.xen.org

At 02:03 +0000 on 09 Mar (1425863009), Wu, Feng wrote:
> 
> 
> > -----Original Message-----
> > From: Tim Deegan [mailto:tim@xen.org]
> > Sent: Friday, March 06, 2015 5:44 PM
> > To: Wu, Feng
> > Cc: Jan Beulich; Zhang, Yang Z; Tian, Kevin; xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> > 
> > At 02:07 +0000 on 06 Mar (1425604054), Wu, Feng wrote:
> > > > From: Tim Deegan [mailto:tim@xen.org]
> > > > But I don't understand why we would need a new global vector for
> > > > RUNSTATE_blocked rather than suppressing the posted interrupts as you
> > > > suggest for RUNSTATE_runnable.  (Or conversely why not use the new
> > > > global vector for RUNSTATE_runnable too?)
> > >
> > > If we suppress the posted-interrupts when vCPU is blocked, it cannot
> > > be unblocked by the external interrupts, this is not correct.
> > 
> > OK, I don't understand at all now. :)  When the posted interrupt is
> > suppressed, what happens to the interrupt? 
> 
> When the posted interrupt is suppressed, VT-d engine will not issue
> notification events.
> 
> > If it's just dropped, then we can't use that for _any_ cases. 
> 
> We can suppress the posted-interrupt when vCPU is waiting in the runqueue
> (vCPU is in RUNSTATE_runnable state), it is not needed to send notification
> event when vCPU is in this state, since when interrupt happens, the interrupt
> information are not _dropped_, instead, they are stored in PIR, and this will
> be synced to vIRR before VM-Entry.

So you think you can use the same system for RUNSTATE_runnable as
RUNSTATE_blocked?  That seems like a good idea. 

I'll leave the details (e.g. single global vector + queue vs any other
way to wake the vcpu) to people who know the x86 irq code better than
I do. :)

Thanks for the clarification.

Tim.

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-09 10:33                   ` Tim Deegan
@ 2015-03-09 11:45                     ` Andrew Cooper
  2015-03-10  2:01                       ` Tian, Kevin
  2015-03-16  5:07                       ` Wu, Feng
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Cooper @ 2015-03-09 11:45 UTC (permalink / raw)
  To: Tim Deegan, Wu, Feng
  Cc: Zhang, Yang Z, Tian, Kevin, Jan Beulich, xen-devel@lists.xen.org

On 09/03/15 10:33, Tim Deegan wrote:
> At 02:03 +0000 on 09 Mar (1425863009), Wu, Feng wrote:
>>
>>> -----Original Message-----
>>> From: Tim Deegan [mailto:tim@xen.org]
>>> Sent: Friday, March 06, 2015 5:44 PM
>>> To: Wu, Feng
>>> Cc: Jan Beulich; Zhang, Yang Z; Tian, Kevin; xen-devel@lists.xen.org
>>> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
>>>
>>> At 02:07 +0000 on 06 Mar (1425604054), Wu, Feng wrote:
>>>>> From: Tim Deegan [mailto:tim@xen.org]
>>>>> But I don't understand why we would need a new global vector for
>>>>> RUNSTATE_blocked rather than suppressing the posted interrupts as you
>>>>> suggest for RUNSTATE_runnable.  (Or conversely why not use the new
>>>>> global vector for RUNSTATE_runnable too?)
>>>> If we suppress the posted-interrupts when vCPU is blocked, it cannot
>>>> be unblocked by the external interrupts, this is not correct.
>>> OK, I don't understand at all now. :)  When the posted interrupt is
>>> suppressed, what happens to the interrupt? 
>> When the posted interrupt is suppressed, VT-d engine will not issue
>> notification events.
>>
>>> If it's just dropped, then we can't use that for _any_ cases. 
>> We can suppress the posted-interrupt when vCPU is waiting in the runqueue
>> (vCPU is in RUNSTATE_runnable state), it is not needed to send notification
>> event when vCPU is in this state, since when interrupt happens, the interrupt
>> information are not _dropped_, instead, they are stored in PIR, and this will
>> be synced to vIRR before VM-Entry.
> So you think you can use the same system for RUNSTATE_runnable as
> RUNSTATE_blocked?  That seems like a good idea. 
>
> I'll leave the details (e.g. single global vector + queue vs any other
> way to wake the vcpu) to people who know the x86 irq code better than
> I do. :)

From my reading of the relevant section in the VT-d spec, to the best of my
understanding:

We only need the second vector if Xen wishes to be informed that an
interrupt has been queued for a vcpu.  The spec suggests that, for one
usecase, this information should affect scheduling decisions.

If we do not wish to make scheduling alterations based on interrupt
delivery, the extra vector can be ignored.

If we do wish to make scheduling alterations, we will need to be able to
uniquely identify a vcpu from a vector, which will involve allocating
one vector per vcpu.


If my understanding is correct, I would suggest that Xen opt for not
getting notifications.  Interrupting one guest to indicate that another
vcpu has been interrupted scales progressively worse with the number of
running VMs, and there are existing usecases which have already
exhausted the x86 vector space completely.

It might be sensible to have the option available as a per-domain opt-in
option.  A usecase such as device driver domain could easily want to
deal with its interrupts ahead of running the domains it is servicing.

~Andrew

* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-09 11:45                     ` Andrew Cooper
@ 2015-03-10  2:01                       ` Tian, Kevin
  2015-03-16  4:03                         ` Wu, Feng
  2015-03-16  5:07                       ` Wu, Feng
  1 sibling, 1 reply; 22+ messages in thread
From: Tian, Kevin @ 2015-03-10  2:01 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan, Wu, Feng
  Cc: Zhang, Yang Z, Jan Beulich, xen-devel@lists.xen.org

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Monday, March 09, 2015 7:46 PM
> 
> On 09/03/15 10:33, Tim Deegan wrote:
> > At 02:03 +0000 on 09 Mar (1425863009), Wu, Feng wrote:
> >>
> >>> -----Original Message-----
> >>> From: Tim Deegan [mailto:tim@xen.org]
> >>> Sent: Friday, March 06, 2015 5:44 PM
> >>> To: Wu, Feng
> >>> Cc: Jan Beulich; Zhang, Yang Z; Tian, Kevin; xen-devel@lists.xen.org
> >>> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> >>>
> >>> At 02:07 +0000 on 06 Mar (1425604054), Wu, Feng wrote:
> >>>>> From: Tim Deegan [mailto:tim@xen.org]
> >>>>> But I don't understand why we would need a new global vector for
> >>>>> RUNSTATE_blocked rather than suppressing the posted interrupts as you
> >>>>> suggest for RUNSTATE_runnable.  (Or conversely why not use the new
> >>>>> global vector for RUNSTATE_runnable too?)
> >>>> If we suppress the posted-interrupts when vCPU is blocked, it cannot
> >>>> be unblocked by the external interrupts, this is not correct.
> >>> OK, I don't understand at all now. :)  When the posted interrupt is
> >>> suppressed, what happens to the interrupt?
> >> When the posted interrupt is suppressed, VT-d engine will not issue
> >> notification events.
> >>
> >>> If it's just dropped, then we can't use that for _any_ cases.
> >> We can suppress the posted-interrupt when vCPU is waiting in the runqueue
> >> (vCPU is in RUNSTATE_runnable state), it is not needed to send notification
> >> event when vCPU is in this state, since when interrupt happens, the
> interrupt
> >> information are not _dropped_, instead, they are stored in PIR, and this will
> >> be synced to vIRR before VM-Entry.
> > So you think you can use the same system for RUNSTATE_runnable as
> > RUNSTATE_blocked?  That seems like a good idea.
> >
> > I'll leave the details (e.g. single global vector + queue vs any other
> > way to wake the vcpu) to people who know the x86 irq code better than
> > I do. :)
> 
> From my reading the relevant section in the VT-d spec, to the best of my
> understanding:
> 
> We only need the second vector if Xen wishes to be informed that an
> interrupt has been queued for a vcpu.  The spec suggests that, for one
> usecase, this information should affect scheduling decisions.
> 
> If we do not wish to make scheduling alterations based on interrupt
> delivery, the extra vector can be ignored.
> 
> If we do wish to make scheduling alterations, we will need to be able to
> uniquely identify a vcpu from a vector, which will involve allocating
> one vector per vcpu.
> 
> 
> If my understanding is correct, I would suggest that Xen opt for not
> getting notifications.  Interrupting one guest to indicate that another
> vcpu has been interrupted scales progressively worse with the number of
> running VMs, and there are existing usecases which have already
> exhausted the x86 vector space completely.
> 
> It might be sensible to have the option available as a per-domain opt-in
> option.  A usecase such as device driver domain could easily want to
> deal with its interrupts ahead of running the domains it is servicing.
> 

IMO we don't need such an option. A blocked vCPU may never be woken up
if a virtual interrupt notification is lost, and if you look at the earlier
reply to Jan, it is not necessary to have one vector per vCPU. It's just
a global vector: when it is sent to a specific pCPU, the handler
walks through the blocked vCPUs on that pCPU to decide which ones should
be woken up. So only one new vector is required.
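
As an illustration only, the single-vector scheme could look roughly like
the sketch below. It assumes a per-pCPU list of blocked vCPUs and uses a
plain flag in place of the descriptor's ON bit; the names struct vcpu,
blocked_list, pi_block() and pi_wakeup_interrupt() are hypothetical and
are not taken from the Xen tree. Locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PCPUS 4   /* hypothetical number of physical CPUs */

/* Hypothetical, simplified vCPU: only what the wakeup path needs. */
struct vcpu {
    int id;
    bool on;            /* stands in for the descriptor's ON bit */
    bool blocked;
    struct vcpu *next;  /* link in the per-pCPU blocked list */
};

/* One list of blocked vCPUs per physical CPU (assumption of this sketch). */
static struct vcpu *blocked_list[NR_PCPUS];

/* Called when a vCPU blocks: remember which pCPU will get its wakeup event. */
static void pi_block(struct vcpu *v, int pcpu)
{
    assert(pcpu >= 0 && pcpu < NR_PCPUS);
    v->blocked = true;
    v->next = blocked_list[pcpu];
    blocked_list[pcpu] = v;
}

/*
 * Handler for the single global wakeup vector arriving on 'pcpu'.
 * Walks the blocked vCPUs of that pCPU and wakes those with a pending
 * notification. Returns the number of vCPUs woken.
 */
static int pi_wakeup_interrupt(int pcpu)
{
    int woken = 0;
    struct vcpu **pp = &blocked_list[pcpu];

    while (*pp != NULL) {
        struct vcpu *v = *pp;

        if (v->on) {
            *pp = v->next;      /* unlink from the blocked list */
            v->next = NULL;
            v->blocked = false; /* a real hypervisor would unblock it here */
            woken++;
        } else {
            pp = &v->next;
        }
    }
    return woken;
}
```

The point of the sketch is that the vector itself carries no vCPU
identity; the target pCPU plus the list walk is what identifies the vCPU.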

In Feng's design, the notification may be disabled in one scenario,
i.e. when the vCPU is in the runnable state. That works as long as real-time
is not considered, since we know a runnable vCPU is already unblocked. Later,
when real-time is considered, this notification will be required too.
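
The per-runstate descriptor updates from Feng's design (quoted elsewhere
in this thread) could be sketched as follows. This is a minimal sketch
under stated assumptions: struct pi_ctrl is a hypothetical stand-in for
the SN/NV/NDST fields of the descriptor, the two vector numbers are
made-up values, and real code would update the descriptor atomically.

```c
#include <assert.h>
#include <stdint.h>

enum runstate {
    RUNSTATE_running,
    RUNSTATE_runnable,
    RUNSTATE_blocked,
    RUNSTATE_offline
};

/* Hypothetical vector numbers, chosen only for this example. */
#define POSTED_INTR_VECTOR 0xf2
#define PI_WAKEUP_VECTOR   0xf1

/* Hypothetical stand-in for the SN, NV and NDST descriptor fields. */
struct pi_ctrl {
    unsigned int sn;   /* 1 = suppress notification events */
    uint8_t nv;        /* notification vector */
    uint32_t ndst;     /* notification destination (pCPU) */
};

/* Apply the design's rules when a vCPU enters runstate 's'. */
static void pi_update(struct pi_ctrl *c, enum runstate s, uint32_t pcpu)
{
    assert(c);
    switch (s) {
    case RUNSTATE_running:
        c->nv = POSTED_INTR_VECTOR;
        c->sn = 0;               /* accept posted-interrupt notifications */
        c->ndst = pcpu;          /* pCPU the vCPU will run on */
        break;
    case RUNSTATE_blocked:
        c->nv = PI_WAKEUP_VECTOR; /* so the blocked vCPU can be woken */
        c->sn = 0;
        break;
    case RUNSTATE_runnable:
    case RUNSTATE_offline:
        c->sn = 1;               /* interrupts still land in PIR */
        c->nv = POSTED_INTR_VECTOR;
        break;
    }
}
```

If the runnable case later needs notifications for real-time scheduling,
only the last branch would have to change.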

Thanks
Kevin


* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-10  2:01                       ` Tian, Kevin
@ 2015-03-16  4:03                         ` Wu, Feng
  0 siblings, 0 replies; 22+ messages in thread
From: Wu, Feng @ 2015-03-16  4:03 UTC (permalink / raw)
  To: Tian, Kevin, Andrew Cooper, Tim Deegan
  Cc: Zhang, Yang Z, Wu, Feng, Jan Beulich, xen-devel@lists.xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Tuesday, March 10, 2015 10:01 AM
> To: Andrew Cooper; Tim Deegan; Wu, Feng
> Cc: Zhang, Yang Z; Jan Beulich; xen-devel@lists.xen.org
> Subject: RE: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> 
> > From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> > Sent: Monday, March 09, 2015 7:46 PM
> >
> > On 09/03/15 10:33, Tim Deegan wrote:
> > > At 02:03 +0000 on 09 Mar (1425863009), Wu, Feng wrote:
> > >>
> > >>> -----Original Message-----
> > >>> From: Tim Deegan [mailto:tim@xen.org]
> > >>> Sent: Friday, March 06, 2015 5:44 PM
> > >>> To: Wu, Feng
> > >>> Cc: Jan Beulich; Zhang, Yang Z; Tian, Kevin; xen-devel@lists.xen.org
> > >>> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> > >>>
> > >>> At 02:07 +0000 on 06 Mar (1425604054), Wu, Feng wrote:
> > >>>>> From: Tim Deegan [mailto:tim@xen.org]
> > >>>>> But I don't understand why we would need a new global vector for
> > >>>>> RUNSTATE_blocked rather than suppressing the posted interrupts as you
> > >>>>> suggest for RUNSTATE_runnable.  (Or conversely why not use the new
> > >>>>> global vector for RUNSTATE_runnable too?)
> > >>>> If we suppress the posted-interrupts when vCPU is blocked, it cannot
> > >>>> be unblocked by the external interrupts, this is not correct.
> > >>> OK, I don't understand at all now. :)  When the posted interrupt is
> > >>> suppressed, what happens to the interrupt?
> > >> When the posted interrupt is suppressed, VT-d engine will not issue
> > >> notification events.
> > >>
> > >>> If it's just dropped, then we can't use that for _any_ cases.
> > >> We can suppress the posted-interrupt when vCPU is waiting in the runqueue
> > >> (vCPU is in RUNSTATE_runnable state), it is not needed to send notification
> > >> event when vCPU is in this state, since when interrupt happens, the interrupt
> > >> information are not _dropped_, instead, they are stored in PIR, and this will
> > >> be synced to vIRR before VM-Entry.
> > > So you think you can use the same system for RUNSTATE_runnable as
> > > RUNSTATE_blocked?  That seems like a good idea.
> > >
> > > I'll leave the details (e.g. single global vector + queue vs any other
> > > way to wake the vcpu) to people who know the x86 irq code better than
> > > I do. :)
> >
> > From my reading the relevant section in the VT-d spec, to the best of my
> > understanding:
> >
> > We only need the second vector if Xen wishes to be informed that an
> > interrupt has been queued for a vcpu.  The spec suggests that, for one
> > usecase, this information should affect scheduling decisions.
> >
> > If we do not wish to make scheduling alterations based on interrupt
> > delivery, the extra vector can be ignored.
> >
> > If we do wish to make scheduling alterations, we will need to be able to
> > uniquely identify a vcpu from a vector, which will involve allocating
> > one vector per vcpu.
> >
> >
> > If my understanding is correct, I would suggest that Xen opt for not
> > getting notifications.  Interrupting one guest to indicate that another
> > vcpu has been interrupted scales progressively worse with the number of
> > running VMs, and there are existing usecases which have already
> > exhausted the x86 vector space completely.
> >
> > It might be sensible to have the option available as a per-domain opt-in
> > option.  A usecase such as device driver domain could easily want to
> > deal with its interrupts ahead of running the domains it is servicing.
> >
> 
> IMO we don't need such opt. An blocked VCPU may not be woken up
> when losing a virtual interrupt notification, and if you look at earlier
> reply to Jan it's not necessarily to have one-vector-per-vcpu. It's just
> a global vector, which when sent to a specific pcpu, the handler will
> walk through blocked vcpus on that pcpu to decide which one should
> be woken up. So only one new vector is required.
> 
> from Feng's design, the notification may be disabled in one scenario,
> i.e. when vcpu is in runnable state. That works if real-time is not
> considered since we know runnable vcpu is already unblocked. Later
> when considering real-time, this notification will be required too.

Thanks for your clarification, Kevin!

Thanks,
Feng

> 
> Thanks
> Kevin


* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-09 11:45                     ` Andrew Cooper
  2015-03-10  2:01                       ` Tian, Kevin
@ 2015-03-16  5:07                       ` Wu, Feng
  1 sibling, 0 replies; 22+ messages in thread
From: Wu, Feng @ 2015-03-16  5:07 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan
  Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, Jan Beulich,
	xen-devel@lists.xen.org



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Monday, March 09, 2015 7:46 PM
> To: Tim Deegan; Wu, Feng
> Cc: Zhang, Yang Z; Tian, Kevin; Jan Beulich; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> 
> On 09/03/15 10:33, Tim Deegan wrote:
> > At 02:03 +0000 on 09 Mar (1425863009), Wu, Feng wrote:
> >>
> >>> -----Original Message-----
> >>> From: Tim Deegan [mailto:tim@xen.org]
> >>> Sent: Friday, March 06, 2015 5:44 PM
> >>> To: Wu, Feng
> >>> Cc: Jan Beulich; Zhang, Yang Z; Tian, Kevin; xen-devel@lists.xen.org
> >>> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> >>>
> >>> At 02:07 +0000 on 06 Mar (1425604054), Wu, Feng wrote:
> >>>>> From: Tim Deegan [mailto:tim@xen.org]
> >>>>> But I don't understand why we would need a new global vector for
> >>>>> RUNSTATE_blocked rather than suppressing the posted interrupts as you
> >>>>> suggest for RUNSTATE_runnable.  (Or conversely why not use the new
> >>>>> global vector for RUNSTATE_runnable too?)
> >>>> If we suppress the posted-interrupts when vCPU is blocked, it cannot
> >>>> be unblocked by the external interrupts, this is not correct.
> >>> OK, I don't understand at all now. :)  When the posted interrupt is
> >>> suppressed, what happens to the interrupt?
> >> When the posted interrupt is suppressed, VT-d engine will not issue
> >> notification events.
> >>
> >>> If it's just dropped, then we can't use that for _any_ cases.
> >> We can suppress the posted-interrupt when vCPU is waiting in the runqueue
> >> (vCPU is in RUNSTATE_runnable state), it is not needed to send notification
> >> event when vCPU is in this state, since when interrupt happens, the interrupt
> >> information are not _dropped_, instead, they are stored in PIR, and this will
> >> be synced to vIRR before VM-Entry.
> > So you think you can use the same system for RUNSTATE_runnable as
> > RUNSTATE_blocked?  That seems like a good idea.
> >
> > I'll leave the details (e.g. single global vector + queue vs any other
> > way to wake the vcpu) to people who know the x86 irq code better than
> > I do. :)
> 
> From my reading the relevant section in the VT-d spec, to the best of my
> understanding:
> 
> We only need the second vector if Xen wishes to be informed that an
> interrupt has been queued for a vcpu.  The spec suggests that, for one
> usecase, this information should affect scheduling decisions.
> 
> If we do not wish to make scheduling alterations based on interrupt
> delivery, the extra vector can be ignored.

As I mentioned in a previous mail in this thread, the second vector is used to
wake up a blocked vCPU when an external interrupt arrives for that vCPU.

Thanks,
Feng

> 
> If we do wish to make scheduling alterations, we will need to be able to
> uniquely identify a vcpu from a vector, which will involve allocating
> one vector per vcpu.
> 
> 
> If my understanding is correct, I would suggest that Xen opt for not
> getting notifications.  Interrupting one guest to indicate that another
> vcpu has been interrupted scales progressively worse with the number of
> running VMs, and there are existing usecases which have already
> exhausted the x86 vector space completely.
> 
> It might be sensible to have the option available as a per-domain opt-in
> option.  A usecase such as device driver domain could easily want to
> deal with its interrupts ahead of running the domains it is servicing.
> 
> ~Andrew


* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-04 13:30 VT-d Posted-interrupt (PI) design for XEN Wu, Feng
  2015-03-04 15:19 ` Jan Beulich
@ 2015-03-04 18:48 ` Andrew Cooper
  2015-03-05  5:28   ` Wu, Feng
  2015-03-10  2:22 ` Tian, Kevin
  2 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2015-03-04 18:48 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org; +Cc: Zhang, Yang Z, Tian, Kevin, Jan Beulich

On 04/03/15 13:30, Wu, Feng wrote:
> VT-d Posted-interrupt (PI) design for XEN

Thank you very much for this!

>
> Background
> ==========
> With the development of virtualization, there are more and more device
> assignment requirements. However, today when a VM is running with
> assigned devices (such as, NIC), external interrupt handling for the assigned
> devices always needs VMM intervention.
>
> VT-d Posted-interrupt is a more enhanced method to handle interrupts
> in the virtualization environment. Interrupt posting is the process by
> which an interrupt request is recorded in a memory-resident
> posted-interrupt-descriptor structure by the root-complex, followed by
> an optional notification event issued to the CPU complex.
>
> With VT-d Posted-interrupt we can get the following advantages:
> - Directly delivery of external interrupts to running vCPUs without VMM
> intervention
> - Decease the interrupt migration complexity. On vCPU migration, software
> can atomically co-migrate all interrupts targeting the migrating vCPU.

I presume you mean "Decrease" ?

"Decease" means something quite different.

>
>
> Posted-interrupt Introduction
> ========================
> There are two components to the Posted-interrupt architecture:
> Processor Support and Root-Complex Support
>
> - Processor Support
> Posted-interrupt processing is a feature by which a processor processes
> the virtual interrupts by recording them as pending on the virtual-APIC
> page.
>
> Posted-interrupt processing is enabled by setting the "process posted
> interrupts" VM-execution control. The processing is performed in response
> to the arrival of an interrupt with the posted-interrupt notification vector.
> In response to such an interrupt, the processor processes virtual interrupts
> recorded in a data structure called a posted-interrupt descriptor.
>
> More information about APICv and CPU-side Posted-interrupt, please refer
> to Chapter 29, and Section 29.6 in the Intel SDM:
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
>
> - Root-Complex Support
> Interrupt posting is the process by which an interrupt request (from IOAPIC
> or MSI/MSIx capable sources) is recorded in a memory-resident
> posted-interrupt-descriptor structure by the root-complex, followed by
> an optional notification event issued to the CPU complex. The interrupt
> request arriving at the root-complex carry the identity of the interrupt
> request source and a 'remapping-index'. The remapping-index is used to
> look-up an entry from the memory-resident interrupt-remap-table. Unlike
> with interrupt-remapping, the interrupt-remap-table-entry for a posted-
> interrupt, specifies a virtual-vector and a pointer to the posted-interrupt
> descriptor. The virtual-vector specifies the vector of the interrupt to be
> recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
> hosts storage for the virtual-vectors and contains the attributes of the
> notification event (interrupt) to be issued to the CPU complex to inform
> CPU/software about pending interrupts recorded in the posted-interrupt
> descriptor.
>
> More information about VT-d PI, please refer to
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>
>
> Design Overview
> ==============
> In this design, we will cover the following items:
> 1. Add a variant to control whether enable VT-d posted-interrupt or not.
> 2. VT-d PI feature detection.
> 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> 4. Extend IRTE structure to support VT-d PI.
> 5. Introduce a new global vector which is used for waking up the HLT'ed vCPU.
> 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
> 7. Update posted-interrupt descriptor during vCPU scheduling (when the state
> of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/
> RUNSTATE_runnable / RUNSTATE_offline).
> 8. New boot command line for Xen, which controls VT-d PI feature by user.
> 9. Multicast/broadcast and lowest priority interrupts consideration.
>
>
> Implementation details
> ===================
> - New variant to control VT-d PI

I know what you are trying to say, but "New variant" does not express
what you mean.

"A new control relating to VT-d PI" perhaps?

> Like variant 'iommu_intremap' for interrupt remapping, it is very straightforward
> to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is set
> only when interrupt remapping and VT-d posted-interrupt are both enabled.

I would avoid mixing names such as PI and intpost.  If anything, it
should be "iommu_postint" to keep the naming consistent.  (Here and
elsewhere).

>
> - VT-d PI feature detection.
> Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.
>
> - Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> Here is the new structure for posted-interrupt descriptor:
>
> struct pi_desc {
>      DECLARE_BITMAP(pir, NR_VECTORS);
>      union {
>         struct
>         {
>         u64 on     : 1,
>             sn     : 1,
>             rsvd_1 : 13,
>             ndm    : 1,
>             nv     : 8,
>             rsvd_2 : 8,
>             ndst   : 32;
>         };
>         u64 control;
>     };
>     u32 rsvd[6];
>  } __attribute__ ((aligned (64)));

Is there a pending update to the system programming guide?  According to
325384.pdf, only the Outstanding Notification bit is defined, and all others
are reserved for software use.

I however noticed that these fields match up with the description of a
posted interrupt descriptor in the VT-d spec.  Are they supposed to be
the same structure in memory used by both the cpu and root complex, or
independent structures which happen to look very similar?

>
> - Extend IRTE structure to support VT-d PI.
> Here is the new structure for IRTE:
> /* interrupt remap entry */
> struct iremap_entry {
>   union {
>     u64 lo_val;
>     struct {
>         u64 p       : 1,
>             fpd     : 1,
>             dm      : 1,
>             rh      : 1,
>             tm      : 1,
>             dlm     : 3,
>             avail   : 4,
>             res_1   : 4,
>             vector  : 8,
>             res_2   : 8,
>             dst     : 32;
>     }lo;
>     struct {
>         u64 p       : 1,
>             fpd     : 1,
>             res_1   : 6,
>             avail   : 4,
>             res_2   : 2,
>             urg     : 1,
>             pst     : 1,
>             vector  : 8,
>             res_3   : 14,
>             pda_l   : 26;
>     }lo_intpost;
>   };
>   union {
>     u64 hi_val;
>     struct {
>         u64 sid     : 16,
>             sq      : 2,
>             svt     : 2,
>             res_1   : 44;
>     }hi;
>     struct {
>         u64 sid     : 16,
>             sq      : 2,
>             svt     : 2,
>             res_1   : 12,
>             pda_h   : 32;
>     }hi_intpost;
>   };
> };

None of the bitfields contain the IM field (bit 15) which is stated as
the qualification between the two interpretations of the IRTE.

Also, I feel that the structure would be better laid out as:

struct iremap_entry {
    union {
        struct { u64 lo, hi; };
        struct { <bitfields> } norm; /* names subject to improvement */
        struct { <bitfields> } post;
    };
};

This avoids duplicating the lo and hi u64s in sub-unions.  (It will
involve some refactoring of the existing code.)
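
For illustration, a compilable version of that layout might look like the
sketch below. The bitfields are the ones from the proposal, with bit 15
named 'im' in both views; the helper irte_set_posted() is hypothetical.
The sketch assumes C11 and a compiler that allocates bit-fields from the
least significant bit (as GCC and Clang do on x86-64), and it assumes my
reading of the posted format is right: pda_l holds descriptor address
bits 31:6 and pda_h holds bits 63:32.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

struct iremap_entry {
    union {
        struct { u64 lo, hi; };          /* raw 128-bit view */
        struct {                         /* remapped format */
            u64 p:1, fpd:1, dm:1, rh:1, tm:1, dlm:3, avail:4,
                res_1:3, im:1, vector:8, res_2:8, dst:32;
            u64 sid:16, sq:2, svt:2, res_3:44;
        } norm;
        struct {                         /* posted format */
            u64 p:1, fpd:1, res_1:6, avail:4, res_2:2, urg:1, im:1,
                vector:8, res_3:14, pda_l:26;
            u64 sid:16, sq:2, svt:2, res_4:12, pda_h:32;
        } post;
    };
};

static_assert(sizeof(struct iremap_entry) == 16, "IRTE must be 128 bits");

/*
 * Hypothetical helper: build a posted-format entry for virtual vector
 * 'vec' pointing at a 64-byte aligned descriptor at address 'pda'.
 */
static void irte_set_posted(struct iremap_entry *e, u64 pda, unsigned int vec)
{
    assert((pda & 0x3f) == 0);           /* descriptor is 64-byte aligned */
    e->lo = 0;
    e->hi = 0;
    e->post.p = 1;
    e->post.im = 1;                      /* select the posted interpretation */
    e->post.vector = vec;
    e->post.pda_l = (pda >> 6) & 0x3ffffff;
    e->post.pda_h = pda >> 32;
}
```

With a single top-level union, the raw lo/hi words and both bitfield
views alias the same 16 bytes, so there is no need for separate lo_val
and hi_val members.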

>
> - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> Currently, there is a global vector 'posted_intr_vector', which is used as the
> global notification vector for all vCPUs in the system. This vector is stored in
> VMCS and CPU considers it as a special vector, uses it to notify the related
> pCPU when an interrupt is recorded in the posted-interrupt descriptor.
>
> After having VT-d PI, VT-d engine can issue notification event when the
> assigned devices issue interrupts. We need add a new global vector to
> wakeup the HLT'ed vCPU, please refer to the following scenario for the
> usage of this new global vector:
>
> 1. vCPU0 is running on pCPU0
> 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0

I don't understand what you are trying to express with this scenario. 
vCPU0 cannot be running on pCPU0 and also halted with vCPU1 running on
pCPU0.

A vCPU is either running, in which case it has an associated pCPU, or it
is not running and has no specific pCPU affiliation.

~Andrew

> 3. An external interrupt from an assigned device occurs for vCPU0, if we
> still use 'posted_intr_vector' as the notification vector for vCPU0, the
> notification event for vCPU0 (the event will go to pCPU1) will be consumed
> by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> again since the wakeup event for it is always consumed by other vCPUs
> incorrectly. So we need introduce another global vector, naming 'pi_wakeup_vector'
> to wake up the HTL'ed vCPU.
>
> - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
> After VT-d PI is introduced, the format of IRTE is changed as follows:
> 	Descriptor Address: the address of the posted-interrupt descriptor
> 	Virtual Vector: the guest vector of the interrupt
> 	URG: indicates if the interrupt is urgent
> 	Other fields continue to have the same meaning
>
> 'Descriptor Address' tells the destination vCPU of this interrupt, since
> each vCPU has a dedicated posted-interrupt descriptor.
>
> 'Virtual Vector' tells the guest vector of the interrupt.
>
> When guest changes the configuration of the interrupts, such as, the
> cpu affinity, or the vector, we need to update the associated IRTE accordingly.
>
> - Update posted-interrupt descriptor during vCPU scheduling
> The basic idea here is:
> 1. When vCPU's state is RUNSTATE_running,
>         - Set 'NV' to 'posted_intr_vector'.
>         - Clear 'SN' to accept posted-interrupts.
>         - Set 'NDST' to the pCPU on which the vCPU will be running.
> 2. When vCPU's state is RUNSTATE_blocked,
>         - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the
>           related vCPU when posted-interrupt happens for it.
>           Please refer to the above section about the new global vector.
>         - Clear 'SN' to accept posted-interrupts
> 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
>         - Set 'SN' to suppress non-urgent interrupts
>           (Current, we only support non-urgent interrupts)
>          When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
>          It is not needed to accept posted-interrupt notification event,
>          since we don't change the behavior of scheduler when the interrupt
>          occurs, we still need wait the next scheduling of the vCPU.
>          When external interrupts from assigned devices occur, the interrupts
>          are recorded in PIR, and will be synced to IRR before VM-Entry.
>         - Set 'NV' to 'posted_intr_vector'.
>
> - New boot command line for Xen, which controls VT-d PI feature by user.
> Like 'intremap' for interrupt remapping, we add a new boot command line
> 'intpost' for posted-interrupts.
>
> - Multicast/broadcast and lowest priority interrupts consideration
> With VT-d PI, the destination vCPU information of an external interrupt
> from assigned devices is stored in IRTE, this makes the following
> consideration of the design:
> 1. Multicast/broadcast interrupts cannot be posted.
> 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> (starting from Nehalem) ignore TPR value, and instead supported two other
> ways (configurable by BIOS) on how the handle lowest priority interrupts:
> 	A) Round robin: In this method, the chipset simply delivers lowest priority
> interrupts in a round-robin manner across all the available logical CPUs. While
> this provides good load balancing, this was not the best thing to do always as
> interrupts from the same device (like NIC) will start running on all the CPUs
> thrashing caches and taking locks. This led to the next scheme.
> 	B) Vector hashing: In this method, hardware would apply a hash function
> on the vector value in the interrupt request, and use that hash to pick a logical
> CPU to route the lowest priority interrupt. This way, a given vector always goes
> to the same logical CPU, avoiding the thrashing problem above.
>
> So, gist of above is that, lowest priority interrupts has never been delivered as
> "lowest priority" in physical hardware. 
>
> For KVM enabling work of VT-d PI, we divide this into two stage:
> Stage 1: Only support single-CPU lowest-priority interrupts (configured via
> /proc/irq or irqbalance). This is simple and clear.
> Stage 2: After all the patches are merged, I will add the vector hashing support
> for lowest-priority on VT-d PI.
>
> On Xen side, what is your opinion about support lowest-priority interrupts
> for VT-d PI?
>
> ================================
>
> Any comments about this design are highly appreciated!
>
> Thanks,
> Feng
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-04 18:48 ` Andrew Cooper
@ 2015-03-05  5:28   ` Wu, Feng
  0 siblings, 0 replies; 22+ messages in thread
From: Wu, Feng @ 2015-03-05  5:28 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, Tian, Kevin, Wu, Feng, Jan Beulich



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, March 05, 2015 2:48 AM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: Zhang, Yang Z; Tian, Kevin; Jan Beulich
> Subject: Re: [Xen-devel] VT-d Posted-interrupt (PI) design for XEN
> 
> On 04/03/15 13:30, Wu, Feng wrote:
> > VT-d Posted-interrupt (PI) design for XEN
> 
> Thankyou very much for this!
> 
> >
> > Background
> > ==========
> > With the development of virtualization, there are more and more device
> > assignment requirements. However, today when a VM is running with
> > assigned devices (such as, NIC), external interrupt handling for the assigned
> > devices always needs VMM intervention.
> >
> > VT-d Posted-interrupt is a more enhanced method to handle interrupts
> > in the virtualization environment. Interrupt posting is the process by
> > which an interrupt request is recorded in a memory-resident
> > posted-interrupt-descriptor structure by the root-complex, followed by
> > an optional notification event issued to the CPU complex.
> >
> > With VT-d Posted-interrupt we can get the following advantages:
> > - Directly delivery of external interrupts to running vCPUs without VMM
> > intervention
> > - Decease the interrupt migration complexity. On vCPU migration, software
> > can atomically co-migrate all interrupts targeting the migrating vCPU.
> 
> I presume you mean "Decrease" ?

Yes!

> 
> "Decease" means something quite different.

Sorry for the typo. 

> 
> >
> >
> > Posted-interrupt Introduction
> > ========================
> > There are two components to the Posted-interrupt architecture:
> > Processor Support and Root-Complex Support
> >
> > - Processor Support
> > Posted-interrupt processing is a feature by which a processor processes
> > the virtual interrupts by recording them as pending on the virtual-APIC
> > page.
> >
> > Posted-interrupt processing is enabled by setting the "process posted
> > interrupts" VM-execution control. The processing is performed in response
> > to the arrival of an interrupt with the posted-interrupt notification vector.
> > In response to such an interrupt, the processor processes virtual interrupts
> > recorded in a data structure called a posted-interrupt descriptor.
> >
> > More information about APICv and CPU-side Posted-interrupt, please refer
> > to Chapter 29, and Section 29.6 in the Intel SDM:
> >
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6
> 4-ia-32-architectures-software-developer-manual-325462.pdf
> >
> > - Root-Complex Support
> > Interrupt posting is the process by which an interrupt request (from IOAPIC
> > or MSI/MSIx capable sources) is recorded in a memory-resident
> > posted-interrupt-descriptor structure by the root-complex, followed by
> > an optional notification event issued to the CPU complex. The interrupt
> > request arriving at the root-complex carry the identity of the interrupt
> > request source and a 'remapping-index'. The remapping-index is used to
> > look-up an entry from the memory-resident interrupt-remap-table. Unlike
> > with interrupt-remapping, the interrupt-remap-table-entry for a posted-
> > interrupt, specifies a virtual-vector and a pointer to the posted-interrupt
> > descriptor. The virtual-vector specifies the vector of the interrupt to be
> > recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
> > hosts storage for the virtual-vectors and contains the attributes of the
> > notification event (interrupt) to be issued to the CPU complex to inform
> > CPU/software about pending interrupts recorded in the posted-interrupt
> > descriptor.
> >
> > More information about VT-d PI, please refer to
> >
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
> y/vt-directed-io-spec.html
> >
> >
> > Design Overview
> > ==============
> > In this design, we will cover the following items:
> > 1. Add a variant to control whether enable VT-d posted-interrupt or not.
> > 2. VT-d PI feature detection.
> > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> > 4. Extend IRTE structure to support VT-d PI.
> > 5. Introduce a new global vector which is used for waking up the HLT'ed vCPU.
> > 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
> > 7. Update posted-interrupt descriptor during vCPU scheduling (when the state
> > of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/
> > RUNSTATE_runnable / RUNSTATE_offline).
> > 8. New boot command line for Xen, which controls VT-d PI feature by user.
> > 9. Multicast/broadcast and lowest priority interrupts consideration.
> >
> >
> > Implementation details
> > ===================
> > - New variant to control VT-d PI
> 
> I know what you are trying to say, but "New variant" does not express
> what you mean.
> 
> "A new control relating to VT-d PI" perhaps?
> 
> > Like variant 'iommu_intremap' for interrupt remapping, it is very straightforward
> > to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is set
> > only when interrupt remapping and VT-d posted-interrupt are both enabled.
> 
> I would avoid mixing names such as PI and intpost.  If anything, it
> should be "iommu_postint" to keep the naming consistent.  (Here and
> elsewhere).
> 

My original idea was that 'iommu_intpost' is consistent with 'iommu_intremap';
we could also call this feature 'interrupt posting', just like 'interrupt
remapping'. But I think your suggestion is also good.


> >
> > - VT-d PI feature detection.
> > Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.
> >
> > - Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> > Here is the new structure for posted-interrupt descriptor:
> >
> > struct pi_desc {
> >      DECLARE_BITMAP(pir, NR_VECTORS);
> >      union {
> >         struct
> >         {
> >         u64 on     : 1,
> >             sn     : 1,
> >             rsvd_1 : 13,
> >             ndm    : 1,
> >             nv     : 8,
> >             rsvd_2 : 8,
> >             ndst   : 32;
> >         };
> >         u64 control;
> >     };
> >     u32 rsvd[6];
> >  } __attribute__ ((aligned (64)));
> 
> Is there a pending update to the system programming guide?  According to
> 325384.pdf, only the Oustanding Notification is defined, and all others
> are reserved for software use.
> 
> I however noticed that these fields match up with the description of a
> posted interrupt descriptor in the VT-d spec.  Are they supposed to be
> the same structure in memory used by both the cpu and root complex, or
> independent structures which happen to look very similar?

The posted-interrupt descriptor format in 325384.pdf is the one from before
VT-d PI was introduced. With VT-d PI, the structure is extended to
the format defined in the VT-d spec above.
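
To make the extended layout concrete, here is a small self-contained
sketch of the descriptor from the proposal, with compile-time checks on
its size and on where the control word sits. The Xen macro DECLARE_BITMAP
is replaced by a plain array, and the helper pi_post() is hypothetical;
it is also not atomic, which real code would need since hardware updates
the descriptor concurrently. Bit-field order is assumed to be
least-significant-bit first, as with GCC and Clang on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_VECTORS 256

struct pi_desc {
    uint64_t pir[NR_VECTORS / 64];   /* posted-interrupt requests, 256 bits */
    union {
        struct {
            uint64_t on     : 1,     /* outstanding notification */
                     sn     : 1,     /* suppress notification */
                     rsvd_1 : 13,
                     ndm    : 1,     /* notification destination mode */
                     nv     : 8,     /* notification vector */
                     rsvd_2 : 8,
                     ndst   : 32;    /* notification destination */
        };
        uint64_t control;
    };
    uint32_t rsvd[6];
} __attribute__((aligned(64)));

static_assert(sizeof(struct pi_desc) == 64, "descriptor is one cache line");
static_assert(offsetof(struct pi_desc, control) == 32, "control follows PIR");

/* Hypothetical, non-atomic helper: record 'vec' as pending and set ON. */
static void pi_post(struct pi_desc *d, unsigned int vec)
{
    assert(vec < NR_VECTORS);
    d->pir[vec / 64] |= 1ULL << (vec % 64);
    d->on = 1;
}
```

The CPU-side format only defines the ON bit; the remaining fields are
the ones the VT-d format adds, so both sides can share one structure.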

> 
> >
> > - Extend IRTE structure to support VT-d PI.
> > Here is the new structure for IRTE:
> > /* interrupt remap entry */
> > struct iremap_entry {
> >   union {
> >     u64 lo_val;
> >     struct {
> >         u64 p       : 1,
> >             fpd     : 1,
> >             dm      : 1,
> >             rh      : 1,
> >             tm      : 1,
> >             dlm     : 3,
> >             avail   : 4,
> >             res_1   : 4,
> >             vector  : 8,
> >             res_2   : 8,
> >             dst     : 32;
> >     }lo;
> >     struct {
> >         u64 p       : 1,
> >             fpd     : 1,
> >             res_1   : 6,
> >             avail   : 4,
> >             res_2   : 2,
> >             urg     : 1,
> >             pst     : 1,
> >             vector  : 8,
> >             res_3   : 14,
> >             pda_l   : 26;
> >     }lo_intpost;
> >   };
> >   union {
> >     u64 hi_val;
> >     struct {
> >         u64 sid     : 16,
> >             sq      : 2,
> >             svt     : 2,
> >             res_1   : 44;
> >     }hi;
> >     struct {
> >         u64 sid     : 16,
> >             sq      : 2,
> >             svt     : 2,
> >             res_1   : 12,
> >             pda_h   : 32;
> >     }hi_intpost;
> >   };
> > };
> 
> None of the bitfields contain the IM field (bit 15) which is stated as
> the qualification between the two interpretations of the IRTE.

Oh, I defined this according to an old version of the VT-d PI spec. 'pst' is
in fact the 'IM' bit in the latest spec. I will change this.

> 
> Also, I feel that the structure would be better layed out as:
> 
> struct iremap_entry {
>     union {
>         struct { u64 lo, hi; };
>         struct { <bitfields> } norm; (names subject to improvement)
>         struct { <bitfields> } post;
>     };
> };
> 
> Which does not duplicate the lo and hi u64s in sub-unions.  (This will
> involve some refactoring of the existing code.)

This is a good suggestion. I thought about this before as well, but it needs
some changes to the existing code. I need to think more about whether it is
worth it.
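
For illustration only, here is a minimal sketch of what the flattened layout
could look like with the 'pst' field renamed to 'im' at bit 15. The names
'iremap_entry_sketch', 'remap', 'post' and 'irte_set_posted' are made up for
this example, and it assumes GCC/Clang on x86-64, where bit-fields are
allocated starting from bit 0. A real update of a live IRTE would also have
to be atomic, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdint.h>

/* One union holding the raw words and both IRTE interpretations. */
struct iremap_entry_sketch {
    union {
        struct { uint64_t lo, hi; };           /* raw 128-bit entry */
        struct {                               /* remapped format   */
            uint64_t p:1, fpd:1, dm:1, rh:1, tm:1, dlm:3,
                     avail:4, res_1:3, im:1, vector:8, res_2:8, dst:32;
            uint64_t sid:16, sq:2, svt:2, res_3:44;
        } remap;
        struct {                               /* posted format     */
            uint64_t p:1, fpd:1, res_1:6, avail:4, res_2:2, urg:1, im:1,
                     vector:8, res_3:14, pda_l:26;
            uint64_t sid:16, sq:2, svt:2, res_4:12, pda_h:32;
        } post;
    };
};

/*
 * Hypothetical helper: switch an entry to posted format.
 * pi_desc_addr is the address of the 64-byte aligned posted-interrupt
 * descriptor; bits 31:6 go to pda_l and bits 63:32 go to pda_h.
 */
static void irte_set_posted(struct iremap_entry_sketch *e,
                            uint64_t pi_desc_addr, uint8_t guest_vector)
{
    assert((pi_desc_addr & 63) == 0);          /* descriptor alignment */
    e->post.im = 1;                            /* bit 15 selects posted format */
    e->post.vector = guest_vector;
    e->post.pda_l = (pi_desc_addr >> 6) & 0x3ffffff;
    e->post.pda_h = pi_desc_addr >> 32;
}
```

Because 'im' sits at the same bit in both views, code can test e->remap.im
or e->post.im before deciding which interpretation to use.
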

> 
> >
> > - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> > Currently, there is a global vector 'posted_intr_vector', which is used as the
> > global notification vector for all vCPUs in the system. This vector is stored in
> > VMCS and CPU considers it as a special vector, uses it to notify the related
> > pCPU when an interrupt is recorded in the posted-interrupt descriptor.
> >
> > After having VT-d PI, VT-d engine can issue notification event when the
> > assigned devices issue interrupts. We need add a new global vector to
> > wakeup the HLT'ed vCPU, please refer to the following scenario for the
> > usage of this new global vector:
> >
> > 1. vCPU0 is running on pCPU0
> > 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
> 
> I don't understand what you are trying to express with this scenario.
> vCPU0 cannot be running on pCPU0 and also halted with vCPU1 running on
> pCPU0.
> 
> A vCPU is either running, in which case it has an associated pCPU, or it
> is not running and has no specific pCPU affiliation.
> 

Here I just want to show why and when we need the extra global vector.
Please see more explanation about this in the reply to Jan!

Thanks for all the comments!

Thanks,
Feng

> ~Andrew
> 
> > 3. An external interrupt from an assigned device occurs for vCPU0, if we
> > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > notification event for vCPU0 (the event will go to pCPU1) will be consumed
> > by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> > again since the wakeup event for it is always consumed by other vCPUs
> > incorrectly. So we need introduce another global vector, naming
> > 'pi_wakeup_vector' to wake up the HTL'ed vCPU.
> >
> > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > configuration).
> > After VT-d PI is introduced, the format of IRTE is changed as follows:
> > 	Descriptor Address: the address of the posted-interrupt descriptor
> > 	Virtual Vector: the guest vector of the interrupt
> > 	URG: indicates if the interrupt is urgent
> > 	Other fields continue to have the same meaning
> >
> > 'Descriptor Address' tells the destination vCPU of this interrupt, since
> > each vCPU has a dedicated posted-interrupt descriptor.
> >
> > 'Virtual Vector' tells the guest vector of the interrupt.
> >
> > When guest changes the configuration of the interrupts, such as, the
> > cpu affinity, or the vector, we need to update the associated IRTE accordingly.
> >
> > - Update posted-interrupt descriptor during vCPU scheduling
> > The basic idea here is:
> > 1. When vCPU's state is RUNSTATE_running,
> >         - Set 'NV' to 'posted_intr_vector'.
> >         - Clear 'SN' to accept posted-interrupts.
> >         - Set 'NDST' to the pCPU on which the vCPU will be running.
> > 2. When vCPU's state is RUNSTATE_blocked,
> >         - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the
> >           related vCPU when posted-interrupt happens for it.
> >           Please refer to the above section about the new global vector.
> >         - Clear 'SN' to accept posted-interrupts
> > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> >         - Set 'SN' to suppress non-urgent interrupts
> >           (Current, we only support non-urgent interrupts)
> >          When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
> >          It is not needed to accept posted-interrupt notification event,
> >          since we don't change the behavior of scheduler when the interrupt
> >          occurs, we still need wait the next scheduling of the vCPU.
> >          When external interrupts from assigned devices occur, the interrupts
> >          are recorded in PIR, and will be synced to IRR before VM-Entry.
> >         - Set 'NV' to 'posted_intr_vector'.
> >
> > - New boot command line for Xen, which controls VT-d PI feature by user.
> > Like 'intremap' for interrupt remapping, we add a new boot command line
> > 'intpost' for posted-interrupts.
> >
> > - Multicast/broadcast and lowest priority interrupts consideration
> > With VT-d PI, the destination vCPU information of an external interrupt
> > from assigned devices is stored in IRTE, this makes the following
> > consideration of the design:
> > 1. Multicast/broadcast interrupts cannot be posted.
> > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > (starting from Nehalem) ignore TPR value, and instead supported two other
> > ways (configurable by BIOS) on how the handle lowest priority interrupts:
> > 	A) Round robin: In this method, the chipset simply delivers lowest priority
> > interrupts in a round-robin manner across all the available logical CPUs. While
> > this provides good load balancing, this was not the best thing to do always as
> > interrupts from the same device (like NIC) will start running on all the CPUs
> > thrashing caches and taking locks. This led to the next scheme.
> > 	B) Vector hashing: In this method, hardware would apply a hash function
> > on the vector value in the interrupt request, and use that hash to pick a
> > logical CPU to route the lowest priority interrupt. This way, a given vector
> > always goes to the same logical CPU, avoiding the thrashing problem above.
> >
> > So, gist of above is that, lowest priority interrupts has never been delivered
> > as "lowest priority" in physical hardware.
> >
> > For KVM enabling work of VT-d PI, we divide this into two stage:
> > Stage 1: Only support single-CPU lowest-priority interrupts (configured via
> > /proc/irq or irqbalance). This is simple and clear.
> > Stage 2: After all the patches are merged, I will add the vector hashing
> > support for lowest-priority on VT-d PI.
> >
> > On Xen side, what is your opinion about support lowest-priority interrupts
> > for VT-d PI?
> >
> > ================================
> >
> > Any comments about this design are highly appreciated!
> >
> > Thanks,
> > Feng
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> 


* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-04 13:30 VT-d Posted-interrupt (PI) design for XEN Wu, Feng
  2015-03-04 15:19 ` Jan Beulich
  2015-03-04 18:48 ` Andrew Cooper
@ 2015-03-10  2:22 ` Tian, Kevin
  2015-03-16  4:03   ` Wu, Feng
  2 siblings, 1 reply; 22+ messages in thread
From: Tian, Kevin @ 2015-03-10  2:22 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org; +Cc: Zhang, Yang Z, Jan Beulich

> From: Wu, Feng
> Sent: Wednesday, March 04, 2015 9:30 PM
> 
> VT-d Posted-interrupt (PI) design for XEN
> 
> Background
> ==========
> With the development of virtualization, there are more and more device
> assignment requirements. However, today when a VM is running with
> assigned devices (such as, NIC), external interrupt handling for the assigned
> devices always needs VMM intervention.
> 
> VT-d Posted-interrupt is a more enhanced method to handle interrupts
> in the virtualization environment. Interrupt posting is the process by
> which an interrupt request is recorded in a memory-resident
> posted-interrupt-descriptor structure by the root-complex, followed by
> an optional notification event issued to the CPU complex.
> 
> With VT-d Posted-interrupt we can get the following advantages:
> - Directly delivery of external interrupts to running vCPUs without VMM
> intervention

"Directly" -> "Direct"

> - Decease the interrupt migration complexity. On vCPU migration, software
> can atomically co-migrate all interrupts targeting the migrating vCPU.

Could you elaborate on this benefit? I didn't see any discussion of migration
anywhere in the proposal.

> 
> 
> Posted-interrupt Introduction
> ========================
> There are two components to the Posted-interrupt architecture:
> Processor Support and Root-Complex Support
> 
> - Processor Support
> Posted-interrupt processing is a feature by which a processor processes
> the virtual interrupts by recording them as pending on the virtual-APIC
> page.
> 
> Posted-interrupt processing is enabled by setting the "process posted
> interrupts" VM-execution control. The processing is performed in response
> to the arrival of an interrupt with the posted-interrupt notification vector.
> In response to such an interrupt, the processor processes virtual interrupts
> recorded in a data structure called a posted-interrupt descriptor.
> 
> More information about APICv and CPU-side Posted-interrupt, please refer
> to Chapter 29, and Section 29.6 in the Intel SDM:
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> 
> - Root-Complex Support
> Interrupt posting is the process by which an interrupt request (from IOAPIC
> or MSI/MSIx capable sources) is recorded in a memory-resident
> posted-interrupt-descriptor structure by the root-complex, followed by
> an optional notification event issued to the CPU complex. The interrupt
> request arriving at the root-complex carry the identity of the interrupt
> request source and a 'remapping-index'. The remapping-index is used to
> look-up an entry from the memory-resident interrupt-remap-table. Unlike
> with interrupt-remapping, the interrupt-remap-table-entry for a posted-
> interrupt, specifies a virtual-vector and a pointer to the posted-interrupt
> descriptor. The virtual-vector specifies the vector of the interrupt to be
> recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
> hosts storage for the virtual-vectors and contains the attributes of the
> notification event (interrupt) to be issued to the CPU complex to inform
> CPU/software about pending interrupts recorded in the posted-interrupt
> descriptor.
> 
> More information about VT-d PI, please refer to
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> 
> 
> Design Overview
> ==============
> In this design, we will cover the following items:
> 1. Add a variant to control whether enable VT-d posted-interrupt or not.
> 2. VT-d PI feature detection.
> 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> 4. Extend IRTE structure to support VT-d PI.
> 5. Introduce a new global vector which is used for waking up the HLT'ed vCPU.

HLT'ed -> blocked

> 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> configuration).
> 7. Update posted-interrupt descriptor during vCPU scheduling (when the state
> of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/
> RUNSTATE_runnable / RUNSTATE_offline).
> 8. New boot command line for Xen, which controls VT-d PI feature by user.
> 9. Multicast/broadcast and lowest priority interrupts consideration.
> 

Add a step on the notification handler, as you described in another mail.

> 
> Implementation details
> ===================
> - New variant to control VT-d PI
> Like variant 'iommu_intremap' for interrupt remapping, it is very
> straightforward
> to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is set
> only when interrupt remapping and VT-d posted-interrupt are both enabled.
> 
> - VT-d PI feature detection.
> Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt
> support.
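
As a rough illustration of the two items above, the check could look like the
sketch below. The helper names are invented for this example, and requiring
every VT-d unit to report the capability is an assumption, not something the
proposal states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 59 of the VT-d Capability Register reports posted-interrupt support. */
#define CAP_PI_SHIFT 59

/* Hypothetical helper: does this unit's capability value advertise PI? */
static bool cap_supports_pi(uint64_t cap)
{
    return (cap >> CAP_PI_SHIFT) & 1;
}

/*
 * Hypothetical policy for the 'iommu_intpost' flag: usable only when
 * interrupt remapping is enabled and (assumption) all units report PI.
 */
static bool intpost_usable(bool intremap_enabled,
                           const uint64_t *caps, unsigned int nr_units)
{
    if (!intremap_enabled || nr_units == 0)
        return false;
    for (unsigned int i = 0; i < nr_units; i++)
        if (!cap_supports_pi(caps[i]))
            return false;
    return true;
}
```
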
> 
> - Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> Here is the new structure for posted-interrupt descriptor:
> 
> struct pi_desc {
>      DECLARE_BITMAP(pir, NR_VECTORS);
>      union {
>         struct
>         {
>         u64 on     : 1,
>             sn     : 1,
>             rsvd_1 : 13,
>             ndm    : 1,
>             nv     : 8,
>             rsvd_2 : 8,
>             ndst   : 32;
>         };
>         u64 control;
>     };
>     u32 rsvd[6];
>  } __attribute__ ((aligned (64)));
> 
> - Extend IRTE structure to support VT-d PI.
> Here is the new structure for IRTE:
> /* interrupt remap entry */
> struct iremap_entry {
>   union {
>     u64 lo_val;
>     struct {
>         u64 p       : 1,
>             fpd     : 1,
>             dm      : 1,
>             rh      : 1,
>             tm      : 1,
>             dlm     : 3,
>             avail   : 4,
>             res_1   : 4,
>             vector  : 8,
>             res_2   : 8,
>             dst     : 32;
>     }lo;
>     struct {
>         u64 p       : 1,
>             fpd     : 1,
>             res_1   : 6,
>             avail   : 4,
>             res_2   : 2,
>             urg     : 1,
>             pst     : 1,
>             vector  : 8,
>             res_3   : 14,
>             pda_l   : 26;
>     }lo_intpost;
>   };
>   union {
>     u64 hi_val;
>     struct {
>         u64 sid     : 16,
>             sq      : 2,
>             svt     : 2,
>             res_1   : 44;
>     }hi;
>     struct {
>         u64 sid     : 16,
>             sq      : 2,
>             svt     : 2,
>             res_1   : 12,
>             pda_h   : 32;
>     }hi_intpost;
>   };
> };
> 
> - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> Currently, there is a global vector 'posted_intr_vector', which is used as the
> global notification vector for all vCPUs in the system. This vector is stored in
> VMCS and CPU considers it as a special vector, uses it to notify the related
> pCPU when an interrupt is recorded in the posted-interrupt descriptor.
> 
> After having VT-d PI, VT-d engine can issue notification event when the
> assigned devices issue interrupts. We need add a new global vector to
> wakeup the HLT'ed vCPU, please refer to the following scenario for the
> usage of this new global vector:
> 
> 1. vCPU0 is running on pCPU0
> 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
> 3. An external interrupt from an assigned device occurs for vCPU0, if we
> still use 'posted_intr_vector' as the notification vector for vCPU0, the
> notification event for vCPU0 (the event will go to pCPU1) will be consumed
> by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> again since the wakeup event for it is always consumed by other vCPUs
> incorrectly. So we need introduce another global vector, naming
> 'pi_wakeup_vector'
> to wake up the HTL'ed vCPU.

Update the above example with the notification handler design.
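
To make the request concrete, here is one possible shape for such a handler.
This is only a guess for discussion: the handler design itself is in another
mail, and the per-pCPU list of blocked vCPUs, the struct and the function name
below are all assumptions. The idea is that the handler for 'pi_wakeup_vector'
looks at the blocked vCPUs and wakes those whose descriptor has 'ON' set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PI_ON (1ULL << 0)   /* 'on' is bit 0 of the descriptor control word */

/* Stand-in for the vCPU state that matters here (hypothetical). */
struct vcpu_sketch {
    uint64_t pi_control;    /* control word of this vCPU's PI descriptor */
    bool blocked;
};

/*
 * Hypothetical handler body for 'pi_wakeup_vector': scan the vCPUs that
 * blocked on this pCPU and unblock every one with an outstanding
 * notification. Returns how many vCPUs were woken.
 */
static size_t pi_wakeup_handler_sketch(struct vcpu_sketch *blocked_list[],
                                       size_t nr)
{
    size_t woken = 0;

    for (size_t i = 0; i < nr; i++) {
        struct vcpu_sketch *v = blocked_list[i];

        if (v->blocked && (v->pi_control & PI_ON)) {
            v->blocked = false;   /* stand-in for unblocking the vCPU */
            woken++;
        }
    }
    return woken;
}
```
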

> 
> - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> configuration).
> After VT-d PI is introduced, the format of IRTE is changed as follows:
> 	Descriptor Address: the address of the posted-interrupt descriptor
> 	Virtual Vector: the guest vector of the interrupt
> 	URG: indicates if the interrupt is urgent
> 	Other fields continue to have the same meaning
> 
> 'Descriptor Address' tells the destination vCPU of this interrupt, since
> each vCPU has a dedicated posted-interrupt descriptor.
> 
> 'Virtual Vector' tells the guest vector of the interrupt.
> 
> When guest changes the configuration of the interrupts, such as, the
> cpu affinity, or the vector, we need to update the associated IRTE accordingly.
> 
> - Update posted-interrupt descriptor during vCPU scheduling
> The basic idea here is:
> 1. When vCPU's state is RUNSTATE_running,
>         - Set 'NV' to 'posted_intr_vector'.
>         - Clear 'SN' to accept posted-interrupts.
>         - Set 'NDST' to the pCPU on which the vCPU will be running.
> 2. When vCPU's state is RUNSTATE_blocked,
>         - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the
>           related vCPU when posted-interrupt happens for it.
>           Please refer to the above section about the new global vector.
>         - Clear 'SN' to accept posted-interrupts
> 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
>         - Set 'SN' to suppress non-urgent interrupts
>           (Current, we only support non-urgent interrupts)
>          When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
>          It is not needed to accept posted-interrupt notification event,
>          since we don't change the behavior of scheduler when the interrupt
>          occurs, we still need wait the next scheduling of the vCPU.
>          When external interrupts from assigned devices occur, the
> interrupts
>          are recorded in PIR, and will be synced to IRR before VM-Entry.
>         - Set 'NV' to 'posted_intr_vector'.

Would it be safer to use 'pi_wakeup_vector' here, if that is the right one to
use in the future when we consider real-time scheduling?

> 
> - New boot command line for Xen, which controls VT-d PI feature by user.
> Like 'intremap' for interrupt remapping, we add a new boot command line
> 'intpost' for posted-interrupts.
> 
> - Multicast/broadcast and lowest priority interrupts consideration
> With VT-d PI, the destination vCPU information of an external interrupt
> from assigned devices is stored in IRTE, this makes the following
> consideration of the design:
> 1. Multicast/broadcast interrupts cannot be posted.
> 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> (starting from Nehalem) ignore TPR value, and instead supported two other
> ways (configurable by BIOS) on how the handle lowest priority interrupts:
> 	A) Round robin: In this method, the chipset simply delivers lowest priority
> interrupts in a round-robin manner across all the available logical CPUs. While
> this provides good load balancing, this was not the best thing to do always as
> interrupts from the same device (like NIC) will start running on all the CPUs
> thrashing caches and taking locks. This led to the next scheme.
> 	B) Vector hashing: In this method, hardware would apply a hash function
> on the vector value in the interrupt request, and use that hash to pick a logical
> CPU to route the lowest priority interrupt. This way, a given vector always goes
> to the same logical CPU, avoiding the thrashing problem above.
> 
> So, gist of above is that, lowest priority interrupts has never been delivered as
> "lowest priority" in physical hardware.
> 
> For KVM enabling work of VT-d PI, we divide this into two stage:
> Stage 1: Only support single-CPU lowest-priority interrupts (configured via
> /proc/irq or irqbalance). This is simple and clear.
> Stage 2: After all the patches are merged, I will add the vector hashing support
> for lowest-priority on VT-d PI.
> 
> On Xen side, what is your opinion about support lowest-priority interrupts
> for VT-d PI?

I'm not sure how important supporting vector hashing is here. Can't we do the
same thing in software when setting NDST in fixed delivery mode?
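
A minimal sketch of what doing it in software might mean: hash the guest
vector over the set of candidate destination vCPUs and post to the one that
comes out. The function name, the 64-bit mask and the modulo hash are
assumptions for illustration; real hardware may use a different hash.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical software "vector hashing": dest_mask has one bit per
 * candidate vCPU of a lowest-priority interrupt. Pick the
 * (vector % candidates)-th set bit, so a given vector always maps to
 * the same vCPU. Returns the vCPU index, or -1 if the mask is empty.
 */
static int pick_dest_vcpu(uint64_t dest_mask, uint8_t vector)
{
    unsigned int candidates = 0;

    for (uint64_t m = dest_mask; m != 0; m &= m - 1)
        candidates++;                 /* count set bits */

    if (candidates == 0)
        return -1;

    unsigned int k = vector % candidates;

    for (int i = 0; i < 64; i++) {
        if (dest_mask & (1ULL << i)) {
            if (k == 0)
                return i;
            k--;
        }
    }
    return -1;                        /* not reached for a non-empty mask */
}
```

The chosen vCPU's descriptor address would then go into the IRTE, the same
as for a fixed-delivery interrupt.
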

> 
> ================================
> 
> Any comments about this design are highly appreciated!

Could you send an updated version based on all comments so far?

Thanks,
Kevin


* Re: VT-d Posted-interrupt (PI) design for XEN
  2015-03-10  2:22 ` Tian, Kevin
@ 2015-03-16  4:03   ` Wu, Feng
  0 siblings, 0 replies; 22+ messages in thread
From: Wu, Feng @ 2015-03-16  4:03 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org; +Cc: Zhang, Yang Z, Wu, Feng, Jan Beulich



> -----Original Message-----
> From: Tian, Kevin
> Sent: Tuesday, March 10, 2015 10:22 AM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: Jan Beulich; Zhang, Yang Z
> Subject: RE: VT-d Posted-interrupt (PI) design for XEN
> 
> > From: Wu, Feng
> > Sent: Wednesday, March 04, 2015 9:30 PM
> >
> > VT-d Posted-interrupt (PI) design for XEN
> >
> > Background
> > ==========
> > With the development of virtualization, there are more and more device
> > assignment requirements. However, today when a VM is running with
> > assigned devices (such as, NIC), external interrupt handling for the assigned
> > devices always needs VMM intervention.
> >
> > VT-d Posted-interrupt is a more enhanced method to handle interrupts
> > in the virtualization environment. Interrupt posting is the process by
> > which an interrupt request is recorded in a memory-resident
> > posted-interrupt-descriptor structure by the root-complex, followed by
> > an optional notification event issued to the CPU complex.
> >
> > With VT-d Posted-interrupt we can get the following advantages:
> > - Directly delivery of external interrupts to running vCPUs without VMM
> > intervention
> 
> "Directly" -> "Direct"
> 
> > - Decease the interrupt migration complexity. On vCPU migration, software
> > can atomically co-migrate all interrupts targeting the migrating vCPU.
> 
> Could you elaborate on this benefit? I didn't see any discussion of migration
> anywhere in the proposal.
> 
> >
> >
> > Posted-interrupt Introduction
> > ========================
> > There are two components to the Posted-interrupt architecture:
> > Processor Support and Root-Complex Support
> >
> > - Processor Support
> > Posted-interrupt processing is a feature by which a processor processes
> > the virtual interrupts by recording them as pending on the virtual-APIC
> > page.
> >
> > Posted-interrupt processing is enabled by setting the "process posted
> > interrupts" VM-execution control. The processing is performed in response
> > to the arrival of an interrupt with the posted-interrupt notification vector.
> > In response to such an interrupt, the processor processes virtual interrupts
> > recorded in a data structure called a posted-interrupt descriptor.
> >
> > More information about APICv and CPU-side Posted-interrupt, please refer
> > to Chapter 29, and Section 29.6 in the Intel SDM:
> > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> >
> > - Root-Complex Support
> > Interrupt posting is the process by which an interrupt request (from IOAPIC
> > or MSI/MSIx capable sources) is recorded in a memory-resident
> > posted-interrupt-descriptor structure by the root-complex, followed by
> > an optional notification event issued to the CPU complex. The interrupt
> > request arriving at the root-complex carry the identity of the interrupt
> > request source and a 'remapping-index'. The remapping-index is used to
> > look-up an entry from the memory-resident interrupt-remap-table. Unlike
> > with interrupt-remapping, the interrupt-remap-table-entry for a posted-
> > interrupt, specifies a virtual-vector and a pointer to the posted-interrupt
> > descriptor. The virtual-vector specifies the vector of the interrupt to be
> > recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
> > hosts storage for the virtual-vectors and contains the attributes of the
> > notification event (interrupt) to be issued to the CPU complex to inform
> > CPU/software about pending interrupts recorded in the posted-interrupt
> > descriptor.
> >
> > More information about VT-d PI, please refer to
> > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> >
> >
> > Design Overview
> > ==============
> > In this design, we will cover the following items:
> > 1. Add a variant to control whether enable VT-d posted-interrupt or not.
> > 2. VT-d PI feature detection.
> > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> > 4. Extend IRTE structure to support VT-d PI.
> > 5. Introduce a new global vector which is used for waking up the HLT'ed vCPU.
> 
> HLT'ed -> blocked
> 
> > 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > configuration).
> > 7. Update posted-interrupt descriptor during vCPU scheduling (when the state
> > of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/
> > RUNSTATE_runnable / RUNSTATE_offline).
> > 8. New boot command line for Xen, which controls VT-d PI feature by user.
> > 9. Multicast/broadcast and lowest priority interrupts consideration.
> >
> 
> Add a step on the notification handler, as you described in another mail.
> 
> >
> > Implementation details
> > ===================
> > - New variant to control VT-d PI
> > Like variant 'iommu_intremap' for interrupt remapping, it is very
> > straightforward
> > to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is set
> > only when interrupt remapping and VT-d posted-interrupt are both enabled.
> >
> > - VT-d PI feature detection.
> > Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt
> > support.
> >
> > - Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
> > Here is the new structure for posted-interrupt descriptor:
> >
> > struct pi_desc {
> >      DECLARE_BITMAP(pir, NR_VECTORS);
> >      union {
> >         struct
> >         {
> >         u64 on     : 1,
> >             sn     : 1,
> >             rsvd_1 : 13,
> >             ndm    : 1,
> >             nv     : 8,
> >             rsvd_2 : 8,
> >             ndst   : 32;
> >         };
> >         u64 control;
> >     };
> >     u32 rsvd[6];
> >  } __attribute__ ((aligned (64)));
> >
> > - Extend IRTE structure to support VT-d PI.
> > Here is the new structure for IRTE:
> > /* interrupt remap entry */
> > struct iremap_entry {
> >   union {
> >     u64 lo_val;
> >     struct {
> >         u64 p       : 1,
> >             fpd     : 1,
> >             dm      : 1,
> >             rh      : 1,
> >             tm      : 1,
> >             dlm     : 3,
> >             avail   : 4,
> >             res_1   : 4,
> >             vector  : 8,
> >             res_2   : 8,
> >             dst     : 32;
> >     }lo;
> >     struct {
> >         u64 p       : 1,
> >             fpd     : 1,
> >             res_1   : 6,
> >             avail   : 4,
> >             res_2   : 2,
> >             urg     : 1,
> >             pst     : 1,
> >             vector  : 8,
> >             res_3   : 14,
> >             pda_l   : 26;
> >     }lo_intpost;
> >   };
> >   union {
> >     u64 hi_val;
> >     struct {
> >         u64 sid     : 16,
> >             sq      : 2,
> >             svt     : 2,
> >             res_1   : 44;
> >     }hi;
> >     struct {
> >         u64 sid     : 16,
> >             sq      : 2,
> >             svt     : 2,
> >             res_1   : 12,
> >             pda_h   : 32;
> >     }hi_intpost;
> >   };
> > };
> >
> > - Introduce a new global vector which is used to wake up the HLT'ed vCPU.
> > Currently, there is a global vector 'posted_intr_vector', which is used as the
> > global notification vector for all vCPUs in the system. This vector is stored in
> > VMCS and CPU considers it as a special vector, uses it to notify the related
> > pCPU when an interrupt is recorded in the posted-interrupt descriptor.
> >
> > After having VT-d PI, VT-d engine can issue notification event when the
> > assigned devices issue interrupts. We need add a new global vector to
> > wakeup the HLT'ed vCPU, please refer to the following scenario for the
> > usage of this new global vector:
> >
> > 1. vCPU0 is running on pCPU0
> > 2. vCPU0 is HLT'ed and vCPU1 is currently running on pCPU0
> > 3. An external interrupt from an assigned device occurs for vCPU0, if we
> > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > notification event for vCPU0 (the event will go to pCPU1) will be consumed
> > by vCPU1 incorrectly. The worst case is that vCPU0 will never be woken up
> > again since the wakeup event for it is always consumed by other vCPUs
> > incorrectly. So we need introduce another global vector, naming
> > 'pi_wakeup_vector'
> > to wake up the HTL'ed vCPU.
> 
> Update the above example with the notification handler design.
> 
> >
> > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx
> > configuration).
> > After VT-d PI is introduced, the format of IRTE is changed as follows:
> > 	Descriptor Address: the address of the posted-interrupt descriptor
> > 	Virtual Vector: the guest vector of the interrupt
> > 	URG: indicates if the interrupt is urgent
> > 	Other fields continue to have the same meaning
> >
> > 'Descriptor Address' tells the destination vCPU of this interrupt, since
> > each vCPU has a dedicated posted-interrupt descriptor.
> >
> > 'Virtual Vector' tells the guest vector of the interrupt.
> >
> > When guest changes the configuration of the interrupts, such as, the
> > cpu affinity, or the vector, we need to update the associated IRTE accordingly.
> >
> > - Update posted-interrupt descriptor during vCPU scheduling
> > The basic idea here is:
> > 1. When vCPU's state is RUNSTATE_running,
> >         - Set 'NV' to 'posted_intr_vector'.
> >         - Clear 'SN' to accept posted-interrupts.
> >         - Set 'NDST' to the pCPU on which the vCPU will be running.
> > 2. When vCPU's state is RUNSTATE_blocked,
> >         - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the
> >           related vCPU when posted-interrupt happens for it.
> >           Please refer to the above section about the new global vector.
> >         - Clear 'SN' to accept posted-interrupts
> > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> >         - Set 'SN' to suppress non-urgent interrupts
> >           (Current, we only support non-urgent interrupts)
> >          When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
> >          It is not needed to accept posted-interrupt notification event,
> >          since we don't change the behavior of scheduler when the
> interrupt
> >          occurs, we still need wait the next scheduling of the vCPU.
> >          When external interrupts from assigned devices occur, the
> > interrupts
> >          are recorded in PIR, and will be synced to IRR before VM-Entry.
> >         - Set 'NV' to 'posted_intr_vector'.
> 
> Would it be safer to use 'pi_wakeup_vector' here, if that is the right one to
> use in the future when we consider real-time scheduling?
>

Since we don't consider the real-time case now, would it be better to keep
'NV' set to 'posted_intr_vector' for now, and revisit this together with the
other changes when real-time cases are supported?
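
For reference, the three cases quoted above can be sketched as a pure
function on the descriptor's 64-bit control word. The vector values and all
names here are placeholders, not the real ones. In real code the update would
have to be done atomically, since the hardware can set 'ON' in the same word
at any time.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions follow the pi_desc layout quoted earlier in this thread. */
#define PI_SN          (1ULL << 1)
#define PI_NV_SHIFT    16
#define PI_NV_MASK     (0xffULL << PI_NV_SHIFT)
#define PI_NDST_SHIFT  32
#define PI_NDST_MASK   (0xffffffffULL << PI_NDST_SHIFT)

/* Placeholder vector numbers, chosen only for this example. */
enum { POSTED_INTR_VECTOR = 0xf2, PI_WAKEUP_VECTOR = 0xf1 };

/* Local stand-ins for the RUNSTATE_* values. */
enum runstate_sketch { RS_running, RS_runnable, RS_blocked, RS_offline };

static uint64_t set_nv(uint64_t control, uint8_t nv)
{
    return (control & ~PI_NV_MASK) | ((uint64_t)nv << PI_NV_SHIFT);
}

/* Return the control word to use once the vCPU enters runstate 'rs'. */
static uint64_t pi_control_for_runstate(uint64_t control,
                                        enum runstate_sketch rs,
                                        uint32_t dest_pcpu)
{
    switch (rs) {
    case RS_running:            /* case 1: deliver directly */
        control = set_nv(control, POSTED_INTR_VECTOR);
        control &= ~PI_SN;
        control = (control & ~PI_NDST_MASK) |
                  ((uint64_t)dest_pcpu << PI_NDST_SHIFT);
        break;
    case RS_blocked:            /* case 2: notification must wake the vCPU */
        control = set_nv(control, PI_WAKEUP_VECTOR);
        control &= ~PI_SN;
        break;
    case RS_runnable:           /* case 3: suppress notifications */
    case RS_offline:
        control |= PI_SN;
        control = set_nv(control, POSTED_INTR_VECTOR);
        break;
    }
    return control;
}
```

With this shape, Kevin's suggestion would only change the vector passed to
set_nv() in the last case.
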


> >
> > - New boot command line for Xen, which controls VT-d PI feature by user.
> > Like 'intremap' for interrupt remapping, we add a new boot command line
> > 'intpost' for posted-interrupts.
> >
> > - Multicast/broadcast and lowest priority interrupts consideration
> > With VT-d PI, the destination vCPU information of an external interrupt
> > from assigned devices is stored in IRTE, this makes the following
> > consideration of the design:
> > 1. Multicast/broadcast interrupts cannot be posted.
> > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > (starting from Nehalem) ignore TPR value, and instead supported two other
> > ways (configurable by BIOS) on how the handle lowest priority interrupts:
> > 	A) Round robin: In this method, the chipset simply delivers lowest priority
> > interrupts in a round-robin manner across all the available logical CPUs. While
> > this provides good load balancing, this was not the best thing to do always as
> > interrupts from the same device (like NIC) will start running on all the CPUs
> > thrashing caches and taking locks. This led to the next scheme.
> > 	B) Vector hashing: In this method, hardware would apply a hash function
> > on the vector value in the interrupt request, and use that hash to pick a logical
> > > CPU to route the lowest priority interrupt. This way, a given vector always
> > > goes to the same logical CPU, avoiding the thrashing problem above.
> >
> > So, gist of above is that, lowest priority interrupts has never been delivered as
> > "lowest priority" in physical hardware.
> >
> > For the KVM enabling work of VT-d PI, we divide this into two stages:
> > Stage 1: Only support single-CPU lowest-priority interrupts (configured via
> > /proc/irq or irqbalance). This is simple and clear.
> > Stage 2: After all the patches are merged, I will add vector hashing support
> > for lowest-priority interrupts on VT-d PI.
> >
> > On the Xen side, what is your opinion about supporting lowest-priority interrupts
> > for VT-d PI?
> 
> I'm not sure how important supporting vector hashing is here. Can we do the same
> thing in software when setting NDST in fixed delivery mode?

I am not clear about this. Here we need to find a way to support lowest-priority interrupts.
Could you please elaborate a bit more? Thanks!

> 
> >
> > ================================
> >
> > Any comments about this design are highly appreciated!
> 
> Could you send an updated version based on all comments so far?

Sure!

Thanks,
Feng

> 
> Thanks,
> Kevin


Thread overview: 22+ messages
2015-03-04 13:30 VT-d Posted-interrupt (PI) design for XEN Wu, Feng
2015-03-04 15:19 ` Jan Beulich
2015-03-05  5:04   ` Wu, Feng
2015-03-05  7:12     ` Jan Beulich
2015-03-05  8:29       ` Wu, Feng
2015-03-05  8:52         ` Jan Beulich
2015-03-05  9:07           ` Wu, Feng
2015-03-05 10:14             ` Jan Beulich
2015-03-06  2:01               ` Wu, Feng
2015-03-05 12:02           ` Tim Deegan
2015-03-06  2:07             ` Wu, Feng
2015-03-06  9:44               ` Tim Deegan
2015-03-09  2:03                 ` Wu, Feng
2015-03-09 10:33                   ` Tim Deegan
2015-03-09 11:45                     ` Andrew Cooper
2015-03-10  2:01                       ` Tian, Kevin
2015-03-16  4:03                         ` Wu, Feng
2015-03-16  5:07                       ` Wu, Feng
2015-03-04 18:48 ` Andrew Cooper
2015-03-05  5:28   ` Wu, Feng
2015-03-10  2:22 ` Tian, Kevin
2015-03-16  4:03   ` Wu, Feng
