* (v2) VT-d Posted-interrupt (PI) design for XEN
@ 2015-03-18 12:44 Wu, Feng
2015-03-18 16:09 ` Konrad Rzeszutek Wilk
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Wu, Feng @ 2015-03-18 12:44 UTC (permalink / raw)
To: xen-devel@lists.xen.org
Cc: Zhang, Yang Z, Wu, Feng, Tian, Kevin, Keir Fraser (keir@xen.org),
Jan Beulich (JBeulich@suse.com)
VT-d Posted-interrupt (PI) design for XEN
Background
==========
With the development of virtualization, there are more and more device
assignment requirements. However, today when a VM is running with
assigned devices (such as a NIC), external interrupt handling for the assigned
devices always needs VMM intervention.
VT-d Posted-interrupt is an enhanced method to handle interrupts
in the virtualization environment. Interrupt posting is the process by
which an interrupt request is recorded in a memory-resident
posted-interrupt-descriptor structure by the root-complex, followed by
an optional notification event issued to the CPU complex.
With VT-d Posted-interrupt we can get the following advantages:
- Direct delivery of external interrupts to running vCPUs without VMM
intervention
- Decrease the interrupt migration complexity. On vCPU migration, software
can atomically co-migrate all interrupts targeting the migrating vCPU. For
virtual machines with assigned devices, migrating a vCPU across pCPUs
either incurs the overhead of forwarding interrupts in software (e.g. via
VMM-generated IPIs), or the complexity of independently migrating each interrupt
targeting the vCPU to the new pCPU. However, after enabling VT-d PI, the
destination vCPU of an external interrupt from assigned devices is stored in
the IRTE (i.e. the Posted-interrupt Descriptor Address). When the vCPU is migrated
to another pCPU, we set the new pCPU in the 'NDST' field of the Posted-interrupt
descriptor, which makes the interrupt migration automatic.
Posted-interrupt Introduction
========================
There are two components to the Posted-interrupt architecture:
Processor Support and Root-Complex Support
- Processor Support
Posted-interrupt processing is a feature by which a processor processes
the virtual interrupts by recording them as pending on the virtual-APIC
page.
Posted-interrupt processing is enabled by setting the "process posted
interrupts" VM-execution control. The processing is performed in response
to the arrival of an interrupt with the posted-interrupt notification vector.
In response to such an interrupt, the processor processes virtual interrupts
recorded in a data structure called a posted-interrupt descriptor.
For more information about APICv and CPU-side Posted-interrupt, please refer
to Chapter 29, in particular Section 29.6, of the Intel SDM:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
- Root-Complex Support
Interrupt posting is the process by which an interrupt request (from IOAPIC
or MSI/MSIx capable sources) is recorded in a memory-resident
posted-interrupt-descriptor structure by the root-complex, followed by
an optional notification event issued to the CPU complex. The interrupt
request arriving at the root-complex carries the identity of the interrupt
request source and a 'remapping-index'. The remapping-index is used to
look up an entry in the memory-resident interrupt-remap-table. Unlike
with interrupt-remapping, the interrupt-remap-table-entry for a
posted-interrupt specifies a virtual-vector and a pointer to the posted-interrupt
descriptor. The virtual-vector specifies the vector of the interrupt to be
recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
hosts storage for the virtual-vectors and contains the attributes of the
notification event (interrupt) to be issued to the CPU complex to inform
CPU/software about pending interrupts recorded in the posted-interrupt
descriptor.
For more information about VT-d PI, please refer to
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
Important Definitions
==================
There are some changes to IRTE and posted-interrupt descriptor after
VT-d PI is introduced:
IRTE:
Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
Virtual Vector: the guest vector of the interrupt
URG: indicates if the interrupt is urgent
Posted-interrupt descriptor:
The Posted Interrupt Descriptor hosts the following fields:
Posted Interrupt Request (PIR): Provide storage for posting (recording) interrupts (one bit
per vector, for up to 256 vectors).
Outstanding Notification (ON): Indicate if there is a notification event outstanding (not
processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
the notification event (processor or software) resets it as part of posted interrupt processing.
Suppress Notification (SN): Indicate if a notification event is to be suppressed (not
generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
URG=0).
Notification Vector (NV): Specify the vector for notification event (interrupt).
Notification Destination (NDST): Specify the physical APIC-ID of the destination logical
processor for the notification event.
Design Overview
==============
In this design, we will cover the following items:
1. Add a variable to control whether VT-d posted-interrupt is enabled.
2. VT-d PI feature detection.
3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
4. Extend IRTE structure to support VT-d PI.
5. Introduce a new global vector which is used for waking up the blocked vCPU.
6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
7. Update posted-interrupt descriptor during vCPU scheduling (when the state
of the vCPU transitions among RUNSTATE_running / RUNSTATE_blocked /
RUNSTATE_runnable / RUNSTATE_offline).
8. How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
9. New boot command line option for Xen, which lets the user control the VT-d PI feature.
10. Multicast/broadcast and lowest priority interrupts consideration.
Implementation details
===================
- New variable to control VT-d PI
Like the variable 'iommu_intremap' for interrupt remapping, it is straightforward
to add a new one, 'iommu_intpost', for posted-interrupt. 'iommu_intpost' is set
only when interrupt remapping and VT-d posted-interrupt are both enabled.
- VT-d PI feature detection.
Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.
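
The two checks above (capability bit 59, and 'iommu_intpost' depending on interrupt remapping) could be sketched roughly as below. This is only an illustration: the macro and function names are hypothetical, not existing Xen identifiers, and how the capability register value is actually read is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 59 of the VT-d Capability Register reports posted-interrupt support.
 * CAP_INTPOST_BIT is a hypothetical name used only in this sketch. */
#define CAP_INTPOST_BIT 59

/* Returns true when the given capability register value advertises VT-d PI. */
static bool cap_has_intpost(uint64_t cap)
{
    return (cap >> CAP_INTPOST_BIT) & 1;
}

/*
 * Decide the final value of 'iommu_intpost': it stays set only when the user
 * asked for it, interrupt remapping is enabled, and the hardware reports PI.
 * (Hypothetical helper; the real decision logic may differ.)
 */
static bool intpost_usable(bool intpost_requested, bool intremap_enabled,
                           uint64_t cap)
{
    return intpost_requested && intremap_enabled && cap_has_intpost(cap);
}
```

For example, intpost_usable(true, false, cap) is false even on PI-capable hardware, since posting relies on interrupt remapping being enabled.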
- Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
Here is the new structure for posted-interrupt descriptor:
struct pi_desc {
    DECLARE_BITMAP(pir, NR_VECTORS);
    union {
        struct
        {
            u64 on     : 1,
                sn     : 1,
                rsvd_1 : 13,
                ndm    : 1,
                nv     : 8,
                rsvd_2 : 8,
                ndst   : 32;
        };
        u64 control;
    };
    u32 rsvd[6];
} __attribute__ ((aligned (64)));
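
To make the roles of PIR, ON and SN more concrete, here is a small software model of what happens when one interrupt is posted. It is a sketch under assumptions: it uses explicit masks on a plain 64-bit control word instead of bitfields, it only models non-urgent interrupts, and it is not atomic, whereas the real descriptor is updated concurrently by hardware. The type and function names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the layout above: ON is bit 0, SN is bit 1. */
#define PI_ON (1ULL << 0)
#define PI_SN (1ULL << 1)

/* Simplified stand-in for struct pi_desc (hypothetical name). */
struct pi_desc_sketch {
    uint32_t pir[8];   /* 256 bits, one per vector */
    uint64_t control;  /* ON, SN, NV, NDST */
    uint32_t rsvd[6];
};

/* Record a virtual vector as pending in PIR. */
static void pi_post(struct pi_desc_sketch *pi, uint8_t vector)
{
    pi->pir[vector / 32] |= 1u << (vector % 32);
}

/*
 * Model of the notification decision for a non-urgent interrupt:
 * no event if SN is set or a notification is already outstanding;
 * otherwise ON goes from 0 to 1 and an event is generated.
 */
static bool pi_should_notify(struct pi_desc_sketch *pi)
{
    if (pi->control & (PI_SN | PI_ON))
        return false;
    pi->control |= PI_ON;
    return true;
}
```

Posting vector 0x41 sets bit 1 of pir[2]; the first call to pi_should_notify() then returns true and later calls return false until ON is cleared again.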
- Extend IRTE structure to support VT-d PI.
Here is the new structure for IRTE:
/* interrupt remap entry */
struct iremap_entry {
    union {
        u64 lo_val;
        struct {
            u64 p       : 1,
                fpd     : 1,
                dm      : 1,
                rh      : 1,
                tm      : 1,
                dlm     : 3,
                avail   : 4,
                res_1   : 4,
                vector  : 8,
                res_2   : 8,
                dst     : 32;
        } lo;
        struct {
            u64 p       : 1,
                fpd     : 1,
                res_1   : 6,
                avail   : 4,
                res_2   : 2,
                urg     : 1,
                im      : 1,
                vector  : 8,
                res_3   : 14,
                pda_l   : 26;
        } lo_intpost;
    };
    union {
        u64 hi_val;
        struct {
            u64 sid     : 16,
                sq      : 2,
                svt     : 2,
                res_1   : 44;
        } hi;
        struct {
            u64 sid     : 16,
                sq      : 2,
                svt     : 2,
                res_1   : 12,
                pda_h   : 32;
        } hi_intpost;
    };
};
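
Since the posted-interrupt descriptor is 64-byte aligned, its address does not need its low 6 bits stored: with the field widths above, 'pda_l' (26 bits) would hold address bits 31:6 and 'pda_h' (32 bits) address bits 63:32. A minimal sketch of that split, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* The two address pieces stored in an IRTE in posted format. */
struct pda_parts {
    uint32_t pda_l;  /* address bits 31:6 (26 bits) */
    uint32_t pda_h;  /* address bits 63:32 */
};

/* Split a 64-byte-aligned descriptor address into pda_l / pda_h. */
static struct pda_parts split_pda(uint64_t desc_addr)
{
    struct pda_parts p;

    assert((desc_addr & 0x3f) == 0);  /* descriptor must be 64-byte aligned */
    p.pda_l = (uint32_t)((desc_addr >> 6) & 0x3ffffff);
    p.pda_h = (uint32_t)(desc_addr >> 32);
    return p;
}

/* Rebuild the descriptor address from the two pieces. */
static uint64_t join_pda(struct pda_parts p)
{
    return ((uint64_t)p.pda_h << 32) | ((uint64_t)p.pda_l << 6);
}
```

When the guest reprograms an MSI so that the interrupt targets a different vCPU, only these two fields (plus the virtual vector) would need to change to point at the new vCPU's descriptor.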
- Introduce a new global vector which is used to wake up the blocked vCPU.
Currently, there is a global vector 'posted_intr_vector', which is used as the
global notification vector for all vCPUs in the system. This vector is stored in
the VMCS; the CPU treats it as a _special_ vector, and it is used to notify the
related pCPU when an interrupt is recorded in the posted-interrupt descriptor.
The CPU handles this existing global vector in a _special_ way compared to
normal vectors; please refer to Section 29.6 of the Intel SDM
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
for more information about how the CPU handles it.
With VT-d PI, the VT-d engine can issue a notification event when an
assigned device issues an interrupt. We need to add a new global vector to
wake up the blocked vCPU; please refer to the later section of this design for
how this new global vector is used.
- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
After VT-d PI is introduced, the format of IRTE is changed as follows:
Descriptor Address: the address of the posted-interrupt descriptor
Virtual Vector: the guest vector of the interrupt
URG: indicates if the interrupt is urgent
Other fields continue to have the same meaning
'Descriptor Address' tells the destination vCPU of this interrupt, since
each vCPU has a dedicated posted-interrupt descriptor.
'Virtual Vector' tells the guest vector of the interrupt.
When the guest changes the interrupt configuration, such as the
CPU affinity or the vector, we need to update the associated IRTE accordingly.
- Update posted-interrupt descriptor during vCPU scheduling
The basic idea here is:
1. When vCPU's state is RUNSTATE_running,
   - Set 'NV' to 'posted_intr_vector'.
   - Clear 'SN' to accept posted-interrupts.
   - Set 'NDST' to the pCPU on which the vCPU will be running.
2. When vCPU's state is RUNSTATE_blocked,
   - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
     related vCPU when a posted-interrupt happens for it.
     Please refer to the above section about the new global vector.
   - Clear 'SN' to accept posted-interrupts.
3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
   - Set 'SN' to suppress non-urgent interrupts.
     (Currently, we only support non-urgent interrupts.)
     When the vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
     there is no need to accept posted-interrupt notification events:
     since we don't change the behavior of the scheduler when the interrupt
     occurs, we still need to wait for the next scheduling of the vCPU.
     When external interrupts from assigned devices occur, the interrupts
     are recorded in PIR, and will be synced to IRR before VM-Entry.
   - Set 'NV' to 'posted_intr_vector'.
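
The three cases above can be summarised as a function from the new runstate to the new descriptor control word. The sketch below is only a model: the masks follow the 'pi_desc' layout given earlier (SN bit 1, NV bits 23:16, NDST bits 63:32), the function name is hypothetical, and a real implementation would presumably need an atomic update (e.g. a cmpxchg loop), because hardware can set 'ON' in the same word at any time.

```c
#include <stdint.h>

enum runstate {
    RUNSTATE_running,
    RUNSTATE_runnable,
    RUNSTATE_blocked,
    RUNSTATE_offline
};

#define PI_SN         (1ULL << 1)
#define PI_NV_SHIFT   16
#define PI_NV_MASK    (0xffULL << PI_NV_SHIFT)
#define PI_NDST_SHIFT 32
#define PI_NDST_MASK  (0xffffffffULL << PI_NDST_SHIFT)

/* Replace the NV field of a control word. */
static uint64_t set_nv(uint64_t control, uint8_t vector)
{
    return (control & ~PI_NV_MASK) | ((uint64_t)vector << PI_NV_SHIFT);
}

/*
 * Compute the control word for a vCPU entering 'state'.
 * 'dest_apic_id' is only used for RUNSTATE_running (the pCPU the vCPU
 * is about to run on); other states leave NDST unchanged.
 */
static uint64_t pi_control_for_state(uint64_t control, enum runstate state,
                                     uint8_t posted_intr_vector,
                                     uint8_t pi_wakeup_vector,
                                     uint32_t dest_apic_id)
{
    switch (state) {
    case RUNSTATE_running:
        control = set_nv(control, posted_intr_vector);
        control &= ~PI_SN;
        control = (control & ~PI_NDST_MASK) |
                  ((uint64_t)dest_apic_id << PI_NDST_SHIFT);
        break;
    case RUNSTATE_blocked:
        control = set_nv(control, pi_wakeup_vector);
        control &= ~PI_SN;
        break;
    case RUNSTATE_runnable:
    case RUNSTATE_offline:
        control |= PI_SN;
        control = set_nv(control, posted_intr_vector);
        break;
    }
    return control;
}
```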
- How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
Here is the scenario for the usage of the new global vector:
1. vCPU0 is running on pCPU0.
2. vCPU0 is blocked and vCPU1 is currently running on pCPU0.
3. An external interrupt from an assigned device occurs for vCPU0. If we
still use 'posted_intr_vector' as the notification vector for vCPU0, the
notification event for vCPU0 (the event will go to pCPU0) will be consumed
by vCPU1 incorrectly (remember this is a special vector to the CPU). The worst
case is that vCPU0 will never be woken up again, since the wakeup event
for it is always consumed by other vCPUs incorrectly. So we need to introduce
another global vector, named 'pi_wakeup_vector', to wake up the blocked vCPU.
After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the notification
event using this new vector. This new vector is not a SPECIAL one to the CPU;
it is just a normal vector. The CPU simply receives a normal external interrupt,
so we get control in the handler of this new vector. There, the hypervisor
can do whatever is needed, such as waking up the blocked vCPU.
Here is what we do for the blocked vCPU:
1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stores the blocked
vCPUs on the pCPU.
2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
into the per-cpu list belonging to the pCPU it was running on.
3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
In the handler of 'pi_wakeup_vector', we do:
1. Get the physical CPU.
2. Iterate the list 'blocked_vcpu_on_cpu' of the current pCPU; if 'ON' is set,
we unblock the associated vCPU.
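
A rough model of the list handling and the handler loop described above is shown below. It is a sketch, not the proposed Xen code: it uses a plain singly linked list for one pCPU instead of a real per-cpu list, omits all locking, and the type and function names are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PI_ON (1ULL << 0)

/* Minimal stand-in for a vCPU (hypothetical type). */
struct vcpu_sketch {
    uint64_t pi_control;       /* control word of its PI descriptor */
    bool blocked;
    struct vcpu_sketch *next;  /* link in the pCPU's blocked list */
};

/* Step 2: on entering RUNSTATE_blocked, add the vCPU to the pCPU's list. */
static void vcpu_block(struct vcpu_sketch **head, struct vcpu_sketch *v)
{
    v->blocked = true;
    v->next = *head;
    *head = v;
}

/*
 * Handler for 'pi_wakeup_vector' on one pCPU: walk the blocked list and
 * unblock (and unlink) every vCPU whose descriptor has ON set.
 * Returns the number of vCPUs woken.
 */
static unsigned int pi_wakeup_handler(struct vcpu_sketch **head)
{
    unsigned int woken = 0;
    struct vcpu_sketch **pp = head;

    while (*pp != NULL) {
        struct vcpu_sketch *v = *pp;

        if (v->pi_control & PI_ON) {
            *pp = v->next;     /* remove from the blocked list */
            v->next = NULL;
            v->blocked = false;
            woken++;
        } else {
            pp = &v->next;
        }
    }
    return woken;
}
```

The handler has to scan the whole list because the wakeup vector is global: the interrupt itself does not say which blocked vCPU on this pCPU had an interrupt posted, only the 'ON' bits do.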
- New boot command line option for Xen, which lets the user control the VT-d PI feature.
Like 'intremap' for interrupt remapping, we add a new boot command line
option 'intpost' for posted-interrupts.
- Multicast/broadcast and lowest priority interrupts consideration.
With VT-d PI, the destination vCPU information of an external interrupt
from assigned devices is stored in the IRTE, which leads to the following
design considerations:
1. Multicast/broadcast interrupts cannot be posted.
2. For lowest-priority interrupts, newer Intel CPUs/chipsets/root-complexes
(starting from Nehalem) ignore the TPR value, and instead support two other
ways (configurable by BIOS) to handle lowest-priority interrupts:
A) Round robin: In this method, the chipset simply delivers lowest-priority
interrupts in a round-robin manner across all the available logical CPUs. While
this provides good load balancing, it is not always the best thing to do, as
interrupts from the same device (like a NIC) will start running on all the CPUs,
thrashing caches and taking locks. This led to the next scheme.
B) Vector hashing: In this method, hardware applies a hash function
to the vector value in the interrupt request, and uses that hash to pick a logical
CPU to route the lowest-priority interrupt to. This way, a given vector always goes
to the same logical CPU, avoiding the thrashing problem above.
So the gist of the above is that lowest-priority interrupts have never been delivered as
"lowest priority" in physical hardware.
I will emulate vector hashing for posted-interrupts in Xen.
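
As an illustration of what emulating vector hashing could look like, the sketch below picks the destination among the candidate vCPUs by taking the guest vector modulo the number of candidates. The hash itself is an assumption made for this example; the design does not specify which function is used, and the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Pick the destination vCPU for a lowest-priority interrupt.
 * 'candidates' holds the vCPU ids matching the interrupt's destination,
 * 'nr' is how many there are. Returns -1 if there is no candidate.
 * The "vector % nr" hash is only an assumed placeholder.
 */
static int pick_dest_vcpu(uint8_t vector, const unsigned int *candidates,
                          unsigned int nr)
{
    if (nr == 0)
        return -1;
    return (int)candidates[vector % nr];
}
```

Whatever hash is chosen, the property that matters is the one described for scheme B: a given vector always maps to the same vCPU as long as the candidate set is unchanged, so its IRTE can point at a single posted-interrupt descriptor.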
================================
Any comments about this design are highly appreciated!
Thanks,
Feng
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
2015-03-18 12:44 (v2) VT-d Posted-interrupt (PI) design for XEN Wu, Feng
@ 2015-03-18 16:09 ` Konrad Rzeszutek Wilk
2015-03-19 2:37 ` Zhang, Yang Z
2015-03-19 3:03 ` Wu, Feng
2015-03-19 9:56 ` Jan Beulich
2015-03-25 5:10 ` Wu, Feng
2 siblings, 2 replies; 16+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-18 16:09 UTC (permalink / raw)
To: Wu, Feng
Cc: Zhang, Yang Z, Tian, Kevin, Keir Fraser (keir@xen.org),
Jan Beulich (JBeulich@suse.com), xen-devel@lists.xen.org
On Wed, Mar 18, 2015 at 12:44:21PM +0000, Wu, Feng wrote:
> VT-d Posted-interrupt (PI) design for XEN
> [...]
> With VT-d Posted-interrupt we can get the following advantages:
> - Direct delivery of external interrupts to running vCPUs without VMM
> intervention
I haven't dug deep into what Xen has currently - but I would assume that
this is exactly what we have now in Xen?
Hm, actually we seem to be still invoking the hypervisor on the
interrupts - except that if we need to dispatch it to another CPU
using a normal vector to do so - which would still cause the
hypervisor to be invoked? Or does it actually go straight into the
guest?
So what kind of support do we currently have in Xen for posted
interrupts? Could you add a bit about this in the background please?
> [...]
> There are some changes to IRTE and posted-interrupt descriptor after
> VT-d PI is introduced:
s/is/was/
> [...]
> 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
stuff? Perhaps features?
> [...]
> - Introduce a new global vector which is used to wake up the blocked vCPU.
>
> Currently, there is a global vector 'posted_intr_vector', which is used as the
s/Currently/In Xen 4.6 and earlier/
> [...]
> After having VT-d PI, VT-d engine can issue notification event when the
> assigned devices issue interrupts. We need add a new global vector to
> wakeup the blocked vCPU, please refer to later section in this design for
> how to use this new global vector.
Ah, so this is what Xen has right now - and the changes that this design
outlines here deal with blocked guests.
> [...]
> since we don't change the behavior of scheduler when the interrupt
> occurs, we still need wait the next scheduling of the vCPU.
still need to wait for the next..
> [...]
> Like 'intremap' for interrupt remapping, we add a new boot command line
> 'intpost' for posted-interrupts.
Earlier you mentioned "iommu_intpost" ?
> [...]
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
2015-03-18 16:09 ` Konrad Rzeszutek Wilk
@ 2015-03-19 2:37 ` Zhang, Yang Z
2015-03-19 3:03 ` Wu, Feng
1 sibling, 0 replies; 16+ messages in thread
From: Zhang, Yang Z @ 2015-03-19 2:37 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Wu, Feng
Cc: Tian, Kevin, Keir Fraser (keir@xen.org),
Jan Beulich (JBeulich@suse.com), xen-devel@lists.xen.org
Konrad Rzeszutek Wilk wrote on 2015-03-19:
> On Wed, Mar 18, 2015 at 12:44:21PM +0000, Wu, Feng wrote:
>> VT-d Posted-interrupt (PI) design for XEN
>>
>> Background
>> ==========
>> With the development of virtualization, there are more and more
>> device assignment requirements. However, today when a VM is running
>> with assigned devices (such as, NIC), external interrupt handling
>> for the assigned devices always needs VMM intervention.
>>
>> VT-d Posted-interrupt is a more enhanced method to handle interrupts
>> in the virtualization environment. Interrupt posting is the process
>> by which an interrupt request is recorded in a memory-resident
>> posted-interrupt-descriptor structure by the root-complex, followed
>> by an optional notification event issued to the CPU complex.
>>
>> With VT-d Posted-interrupt we can get the following advantages:
>> - Direct delivery of external interrupts to running vCPUs without
>> VMM intervention
>
> I hadn't digged deep in what Xen has currently - but I would assume
> that this is exactly what we have now in Xen?
>
> Hm, actually we seem to be still invoking the hypervisor on the
> interrupts -except that if we need to dispatch it to another CPU using
> an normal vector to do so - which would still cause the hypervisor to
> be invoked? Or does it actually go straight in the guest?
>
> So what kind of support do we currently have in Xen from posted interrupt?
> Could you add a bit about this in the background please?
All virtual interrupts are delivered through CPU-side posted interrupts,
regardless of whether VT-d side PI is supported. The difference is:
Without VT-d side PI support, for an interrupt from an assigned device, we
deliver it to another CPU (different from the CPU on which the target vCPU is
running) and then use PI to deliver it, to eliminate the VM exit.
With VT-d side PI support, the interrupt can be delivered to the guest
directly, without any other CPU's involvement and without a VM exit.
Is it clear?
Best regards,
Yang
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
2015-03-18 16:09 ` Konrad Rzeszutek Wilk
2015-03-19 2:37 ` Zhang, Yang Z
@ 2015-03-19 3:03 ` Wu, Feng
2015-03-19 19:11 ` Konrad Rzeszutek Wilk
1 sibling, 1 reply; 16+ messages in thread
From: Wu, Feng @ 2015-03-19 3:03 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Tian, Kevin, Wu, Feng, xen-devel@lists.xen.org,
Jan Beulich (JBeulich@suse.com), Zhang, Yang Z,
Keir Fraser (keir@xen.org)
Thanks for the comments!
> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Thursday, March 19, 2015 12:10 AM
> To: Wu, Feng
> Cc: xen-devel@lists.xen.org; Zhang, Yang Z; Tian, Kevin; Keir Fraser
> (keir@xen.org); Jan Beulich (JBeulich@suse.com)
> Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN
>
> On Wed, Mar 18, 2015 at 12:44:21PM +0000, Wu, Feng wrote:
> > VT-d Posted-interrupt (PI) design for XEN
> >
> > Background
> > ==========
> > With the development of virtualization, there are more and more device
> > assignment requirements. However, today when a VM is running with
> > assigned devices (such as, NIC), external interrupt handling for the assigned
> > devices always needs VMM intervention.
> >
> > VT-d Posted-interrupt is a more enhanced method to handle interrupts
> > in the virtualization environment. Interrupt posting is the process by
> > which an interrupt request is recorded in a memory-resident
> > posted-interrupt-descriptor structure by the root-complex, followed by
> > an optional notification event issued to the CPU complex.
> >
> > With VT-d Posted-interrupt we can get the following advantages:
> > - Direct delivery of external interrupts to running vCPUs without VMM
> > intervention
>
> I hadn't digged deep in what Xen has currently - but I would assume that
> this is exactly what we have now in Xen?
Here is what Xen currently does for external interrupts from assigned devices:

When a VM is running and an external interrupt from an assigned device
occurs for it, a VM-Exit happens, then:

vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)

softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()

dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
vmsi_inj_irq() --> vlapic_set_irq()

vlapic_set_irq() does the following things:
1. If CPU-side posted-interrupt is supported (I think it has been supported since
Xen 4.3 or Xen 4.4; sorry, I don't quite remember the exact version), call
vmx_deliver_posted_intr() to deliver the virtual interrupt via the
posted-interrupt infrastructure.
2. Else, if CPU-side posted-interrupt is not supported, set the related vIRR in
the vLAPIC page and call vcpu_kick() to kick the related vCPU. Before VM-Entry,
vmx_intr_assist() will help to inject the interrupt into the guest.

However, after VT-d PI is supported, when a guest is running in non-root mode and
an external interrupt from an assigned device occurs for it, _no_ VM-Exit is
needed; the guest can handle this entirely in non-root mode, thus avoiding all of
the above code flow.

>
> Hm, actually we seem to be still invoking the hypervisor on the
> interrupts -except that if we need to dispatch it to another CPU
> using an normal vector to do so - which would still cause the
> hypervisor to be invoked? Or does it actually go straight in the
> guest?
>

As I mentioned above, if the guest is running, we don't need to invoke the hypervisor.

> So what kind of support do we currently have in Xen from posted
> interrupt? Could you add a bit about this in the background please?

Good suggestion.

Currently, Xen only supports the CPU-side posted-interrupt.
As I mentioned above, function vlapic_set_irq() can use this to deliver
virtual interrupts. Basically, there are several methods to deliver virtual
interrupts to guests:
- Event delivery before VM-Entry via __vmx_inject_exception(); this is the oldest way.
- After APICv was enabled, we had hardware support for virtual interrupt delivery:
virtual interrupts are stored in the virtual LAPIC page and, after VM-Entry, guests
can evaluate these virtual interrupts and handle them in non-root mode.
- As an enhancement to APICv, CPU-side posted-interrupt was introduced. As described
above, with this feature we don't need to kick the vCPU; the virtual interrupts are
delivered directly to it.

For APICv and CPU-side posted-interrupt, please refer to Chapter 29 and
Section 29.6 in the Intel SDM:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

>
> > - Decrease the interrupt migration complexity. On vCPU migration, software
> > can atomically co-migrate all interrupts targeting the migrating vCPU. For
> > virtual machines with assigned devices, migrating a vCPU across pCPUs
> > either incur the overhead of forwarding interrupts in software (e.g. via VMM
> > generated IPIS), or complexity to independently migrate each interrupt targeting
> > the vCPU to the new pCPU. However, after enabling VT-d PI, the destination vCPU
> > of an external interrupt from assigned devices is stored in the IRTE (i.e.
> > Posted-interrupt Descriptor Address), when vCPU is migrated to another pCPU,
> > we will set this new pCPU in the 'NDST' filed of Posted-interrupt descriptor, this
> > make the interrupt migration automatic.
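The migration step quoted above (only 'NDST' in the descriptor changes, the IRTE stays untouched) can be sketched roughly as follows. This is a minimal sketch, not code from Xen: struct pi_control, pi_set_ndst() and pi_get_ndst() are hypothetical names, the bit positions are taken from the 'pi_desc' control word quoted later in this thread, and C11 atomics stand in for whatever cmpxchg primitive the hypervisor would really use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Bit positions follow the 'pi_desc' control word quoted in this thread. */
#define PI_ON         (UINT64_C(1) << 0)
#define PI_NV_SHIFT   16
#define PI_NDST_SHIFT 32

/* Hypothetical stand-in for the 64-bit control word of a PI descriptor. */
struct pi_control {
    _Atomic uint64_t word;
};

/*
 * Retarget notification events to a new pCPU (given by its APIC ID).
 * A compare-and-swap loop is used so that an ON bit set concurrently
 * (by hardware, in the real case) is not overwritten.
 */
static void pi_set_ndst(struct pi_control *pc, uint32_t dest_apic_id)
{
    uint64_t old, updated;

    assert(pc);
    old = atomic_load(&pc->word);
    do {
        /* Keep ON, SN, NDM and NV (low 32 bits); replace NDST only. */
        updated = (old & UINT64_C(0xffffffff)) |
                  ((uint64_t)dest_apic_id << PI_NDST_SHIFT);
    } while (!atomic_compare_exchange_weak(&pc->word, &old, updated));
}

static uint32_t pi_get_ndst(struct pi_control *pc)
{
    return (uint32_t)(atomic_load(&pc->word) >> PI_NDST_SHIFT);
}
```

Every IRTE that points at this descriptor follows the vCPU automatically, because only the descriptor is rewritten.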
> > > > > > Posted-interrupt Introduction > > ======================== > > There are two components to the Posted-interrupt architecture: > > Processor Support and Root-Complex Support > > > > - Processor Support > > Posted-interrupt processing is a feature by which a processor processes > > the virtual interrupts by recording them as pending on the virtual-APIC > > page. > > > > Posted-interrupt processing is enabled by setting the "process posted > > interrupts" VM-execution control. The processing is performed in response > > to the arrival of an interrupt with the posted-interrupt notification vector. > > In response to such an interrupt, the processor processes virtual interrupts > > recorded in a data structure called a posted-interrupt descriptor. > > > > More information about APICv and CPU-side Posted-interrupt, please refer > > to Chapter 29, and Section 29.6 in the Intel SDM: > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6 > 4-ia-32-architectures-software-developer-manual-325462.pdf > > > > - Root-Complex Support > > Interrupt posting is the process by which an interrupt request (from IOAPIC > > or MSI/MSIx capable sources) is recorded in a memory-resident > > posted-interrupt-descriptor structure by the root-complex, followed by > > an optional notification event issued to the CPU complex. The interrupt > > request arriving at the root-complex carry the identity of the interrupt > > request source and a 'remapping-index'. The remapping-index is used to > > look-up an entry from the memory-resident interrupt-remap-table. Unlike > > with interrupt-remapping, the interrupt-remap-table-entry for a posted- > > interrupt, specifies a virtual-vector and a pointer to the posted-interrupt > > descriptor. The virtual-vector specifies the vector of the interrupt to be > > recorded in the posted-interrupt descriptor. 
The posted-interrupt descriptor > > hosts storage for the virtual-vectors and contains the attributes of the > > notification event (interrupt) to be issued to the CPU complex to inform > > CPU/software about pending interrupts recorded in the posted-interrupt > > descriptor. > > > > More information about VT-d PI, please refer to > > > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog > y/vt-directed-io-spec.html > > > > Important Definitions > > ================== > > There are some changes to IRTE and posted-interrupt descriptor after > > VT-d PI is introduced: > > s/is/was/ > > > IRTE: > > Posted-interrupt Descriptor Address: the address of the posted-interrupt > descriptor > > Virtual Vector: the guest vector of the interrupt > > URG: indicates if the interrupt is urgent > > > > Posted-interrupt descriptor: > > The Posted Interrupt Descriptor hosts the following fields: > > Posted Interrupt Request (PIR): Provide storage for posting (recording) > interrupts (one bit > > per vector, for up to 256 vectors). > > > > Outstanding Notification (ON): Indicate if there is a notification event > outstanding (not > > processed by processor or software) for this Posted Interrupt Descriptor. > When this field is 0, > > hardware modifies it from 0 to 1 when generating a notification event, and > the entity receiving > > the notification event (processor or software) resets it as part of posted > interrupt processing. > > > > Suppress Notification (SN): Indicate if a notification event is to be suppressed > (not > > generated) for non-urgent interrupt requests (interrupts processed through > an IRTE with > > URG=0). > > > > Notification Vector (NV): Specify the vector for notification event (interrupt). > > > > Notification Destination (NDST): Specify the physical APIC-ID of the > destination logical > > processor for the notification event. > > > > Design Overview > > ============== > > In this design, we will cover the following items: > > 1. 
Add a variable to control whether enable VT-d posted-interrupt or not. > > 2. VT-d PI feature detection. > > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff. > > stuff? Perhaps features? > > 4. Extend IRTE structure to support VT-d PI. > > 5. Introduce a new global vector which is used for waking up the blocked > vCPU. > > 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx > configuration). > > 7. Update posted-interrupt descriptor during vCPU scheduling (when the > state > > of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/ > > RUNSTATE_runnable / RUNSTATE_offline). > > 8. How to wakeup blocked vCPU when an interrupt is posted for it (wakeup > notification handler). > > 9. New boot command line for Xen, which controls VT-d PI feature by user. > > 10. Multicast/broadcast and lowest priority interrupts consideration. > > > > > > Implementation details > > =================== > > - New variable to control VT-d PI > > > > Like variable 'iommu_intremap' for interrupt remapping, it is very > straightforward > > to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is > set > > only when interrupt remapping and VT-d posted-interrupt are both enabled. > > > > - VT-d PI feature detection. > > Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt > support. > > > > - Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff. > > Here is the new structure for posted-interrupt descriptor: > > > > struct pi_desc { > > DECLARE_BITMAP(pir, NR_VECTORS); > > union { > > struct > > { > > u64 on : 1, > > sn : 1, > > rsvd_1 : 13, > > ndm : 1, > > nv : 8, > > rsvd_2 : 8, > > ndst : 32; > > }; > > u64 control; > > }; > > u32 rsvd[6]; > > } __attribute__ ((aligned (64))); > > > > - Extend IRTE structure to support VT-d PI. 
> > > > Here is the new structure for IRTE: > > /* interrupt remap entry */ > > struct iremap_entry { > > union { > > u64 lo_val; > > struct { > > u64 p : 1, > > fpd : 1, > > dm : 1, > > rh : 1, > > tm : 1, > > dlm : 3, > > avail : 4, > > res_1 : 4, > > vector : 8, > > res_2 : 8, > > dst : 32; > > }lo; > > struct { > > u64 p : 1, > > fpd : 1, > > res_1 : 6, > > avail : 4, > > res_2 : 2, > > urg : 1, > > im : 1, > > vector : 8, > > res_3 : 14, > > pda_l : 26; > > }lo_intpost; > > }; > > union { > > u64 hi_val; > > struct { > > u64 sid : 16, > > sq : 2, > > svt : 2, > > res_1 : 44; > > }hi; > > struct { > > u64 sid : 16, > > sq : 2, > > svt : 2, > > res_1 : 12, > > pda_h : 32; > > }hi_intpost; > > }; > > }; > > > > - Introduce a new global vector which is used to wake up the blocked vCPU. > > > > Currently, there is a global vector 'posted_intr_vector', which is used as the > > s/Currently/In Xen 4.6 and earlier/ > > global notification vector for all vCPUs in the system. This vector is stored in > > VMCS and CPU considers it as a _special_ vector, uses it to notify the related > > pCPU when an interrupt is recorded in the posted-interrupt descriptor. > > > > This existing global vector is a _special_ vector to CPU, CPU handle it in a > > _special_ way compared to normal vectors, please refer to 29.6 in Intel SDM > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6 > 4-ia-32-architectures-software-developer-manual-325462.pdf > > for more information about how CPU handles it. > > > > After having VT-d PI, VT-d engine can issue notification event when the > > assigned devices issue interrupts. We need add a new global vector to > > wakeup the blocked vCPU, please refer to later section in this design for > > how to use this new global vector. > > Ah, so this is what Xen has right now - and the changes that this design > outlines are here deal with an blocked guests. No, this is what I add for enabling VT-d PI. 
We discussed a lot about this new global vector and its usage scenario after posting version 1 of this design. Do you have any question about this? > > > > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx > configuration). > > After VT-d PI is introduced, the format of IRTE is changed as follows: > > Descriptor Address: the address of the posted-interrupt descriptor > > Virtual Vector: the guest vector of the interrupt > > URG: indicates if the interrupt is urgent > > Other fields continue to have the same meaning > > > > 'Descriptor Address' tells the destination vCPU of this interrupt, since > > each vCPU has a dedicated posted-interrupt descriptor. > > > > 'Virtual Vector' tells the guest vector of the interrupt. > > > > When guest changes the configuration of the interrupts, such as, the > > cpu affinity, or the vector, we need to update the associated IRTE accordingly. > > > > - Update posted-interrupt descriptor during vCPU scheduling > > > > The basic idea here is: > > 1. When vCPU's state is RUNSTATE_running, > > - Set 'NV' to 'posted_intr_vector'. > > - Clear 'SN' to accept posted-interrupts. > > - Set 'NDST' to the pCPU on which the vCPU will be running. > > 2. When vCPU's state is RUNSTATE_blocked, > > - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the > > related vCPU when posted-interrupt happens for it. > > Please refer to the above section about the new global vector. > > - Clear 'SN' to accept posted-interrupts > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline, > > - Set 'SN' to suppress non-urgent interrupts > > (Current, we only support non-urgent interrupts) > > When vCPU is in RUNSTATE_runnable or RUNSTATE_offline, > > It is not needed to accept posted-interrupt notification event, > > since we don't change the behavior of scheduler when the > interrupt > > occurs, we still need wait the next scheduling of the vCPU. > > still need to wait for the next.. 
> > When external interrupts from assigned devices occur, the interrupts
> > are recorded in PIR, and will be synced to IRR before VM-Entry.
> > - Set 'NV' to 'posted_intr_vector'.
> >
> > - How to wakeup blocked vCPU when an interrupt is posted for it (wakeup
> > notification handler).
> >
> > Here is the scenario for the usage of the new global vector:
> >
> > 1. vCPU0 is running on pCPU0
> > 2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
> > 3. An external interrupt from an assigned device occurs for vCPU0. If we
> > still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > notification event for vCPU0 (the event will go to pCPU0) will be consumed
> > by vCPU1 incorrectly (remember this is a special vector to CPU). The worst
> > case is that vCPU0 will never be woken up again since the wakeup event
> > for it is always consumed by other vCPUs incorrectly. So we need to introduce
> > another global vector, named 'pi_wakeup_vector', to wake up the blocked vCPU.
> >
> > After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the
> > notification event using this new vector. Since this new vector is not a
> > SPECIAL one to the CPU, it is just a normal vector. The CPU just receives a
> > normal external interrupt, so we get control in the handler of this new
> > vector. In this case, the hypervisor can do something in it, such as waking
> > up the blocked vCPU.
> >
> > Here is what we do for the blocked vCPU:
> > 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stores the blocked
> > vCPUs on the pCPU.
> > 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
> > into the per-cpu list belonging to the pCPU it was running on.
> > 3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
> >
> > In the handler of 'pi_wakeup_vector', we do:
> > 1. Get the physical CPU.
> > 2. Iterate the list 'blocked_vcpu_on_cpu' of the current pCPU; if 'ON' is set,
> > we unblock the associated vCPU.
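The blocked-vCPU bookkeeping and the handler steps above can be sketched in C roughly as follows. This is only a minimal sketch, not code from Xen: struct vcpu here is a simplified stand-in, vcpu_block_on(), vcpu_unblock_from() and pi_wakeup_handler() are hypothetical names, NR_PCPUS is an arbitrary value, and the locking a real per-pCPU list would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PCPUS 4                  /* illustrative value only */
#define PI_ON    (UINT64_C(1) << 0) /* 'ON' bit of the control word */

/* Hypothetical, simplified vCPU: only what the sketch needs. */
struct vcpu {
    uint64_t pi_control;            /* control word of its PI descriptor */
    bool blocked;
    struct vcpu *next_blocked;      /* link in the per-pCPU blocked list */
};

/* One list head per pCPU, standing in for 'blocked_vcpu_on_cpu'. */
static struct vcpu *blocked_vcpu_on_cpu[NR_PCPUS];

/* Step 2: on RUNSTATE_blocked, queue the vCPU on the pCPU it ran on. */
static void vcpu_block_on(struct vcpu *v, unsigned int pcpu)
{
    assert(pcpu < NR_PCPUS);
    v->blocked = true;
    v->next_blocked = blocked_vcpu_on_cpu[pcpu];
    blocked_vcpu_on_cpu[pcpu] = v;
}

/* Step 3: on unblock, unlink the vCPU from that pCPU's list. */
static void vcpu_unblock_from(struct vcpu *v, unsigned int pcpu)
{
    struct vcpu **pp;

    assert(pcpu < NR_PCPUS);
    pp = &blocked_vcpu_on_cpu[pcpu];
    while (*pp && *pp != v)
        pp = &(*pp)->next_blocked;
    if (*pp) {
        *pp = v->next_blocked;
        v->next_blocked = NULL;
    }
    v->blocked = false;
}

/* Body of the 'pi_wakeup_vector' handler; returns how many vCPUs it woke. */
static unsigned int pi_wakeup_handler(unsigned int pcpu)
{
    unsigned int woken = 0;
    struct vcpu *v;

    assert(pcpu < NR_PCPUS);
    v = blocked_vcpu_on_cpu[pcpu];
    while (v) {
        struct vcpu *next = v->next_blocked; /* v may be unlinked below */

        if (v->pi_control & PI_ON) {
            vcpu_unblock_from(v, pcpu);
            woken++;
        }
        v = next;
    }
    return woken;
}
```

Only vCPUs whose descriptor has 'ON' set are woken; the others stay on the list until a later notification arrives for them.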
> > > > - New boot command line for Xen, which controls VT-d PI feature by user. > > > > Like 'intremap' for interrupt remapping, we add a new boot command line > > 'intpost' for posted-interrupts. > > Earlier you mentioned "iommu_intpost" ? 'intpost' is a Xen command line parameter, while 'iommu_intpost' is a variable In the Code, just like 'intremap' and 'iommu_intremap'. Thanks, Feng > > > > > - Multicast/broadcast and lowest priority interrupts consideration. > > > > With VT-d PI, the destination vCPU information of an external interrupt > > from assigned devices is stored in IRTE, this makes the following > > consideration of the design: > > 1. Multicast/broadcast interrupts cannot be posted. > > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex > > (starting from Nehalem) ignore TPR value, and instead supported two other > > ways (configurable by BIOS) on how the handle lowest priority interrupts: > > A) Round robin: In this method, the chipset simply delivers lowest priority > > interrupts in a round-robin manner across all the available logical CPUs. While > > this provides good load balancing, this was not the best thing to do always as > > interrupts from the same device (like NIC) will start running on all the CPUs > > thrashing caches and taking locks. This led to the next scheme. > > B) Vector hashing: In this method, hardware would apply a hash function > > on the vector value in the interrupt request, and use that hash to pick a > logical > > CPU to route the lowest priority interrupt. This way, a given vector always > goes > > to the same logical CPU, avoiding the thrashing problem above. > > > > So, gist of above is that, lowest priority interrupts has never been delivered > as > > "lowest priority" in physical hardware. > > > > I will emulate vector hashing for posted-interrupt for XEN. > > > > ================================ > > > > Any comments about this design are highly appreciated! 
> >
> > Thanks,
> > Feng
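The vector-hashing emulation mentioned at the end of the design could look something like the sketch below. The design does not say which hash function will be used, so the plain modulo here is purely an assumption, and pick_dest_by_vector() and its parameters are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick one destination out of 'nr_dests' candidate vCPU ids for a
 * lowest-priority interrupt. ASSUMPTION: a plain modulo stands in for the
 * hash function, which the design leaves unspecified.
 */
static unsigned int pick_dest_by_vector(uint8_t vector,
                                        const unsigned int *dest_vcpu_ids,
                                        size_t nr_dests)
{
    assert(dest_vcpu_ids != NULL && nr_dests > 0);
    return dest_vcpu_ids[vector % nr_dests];
}
```

Whatever hash is chosen, the property that matters is the one described in the design: a given vector always maps to the same destination, so a device's interrupts do not bounce between vCPUs.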
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN 2015-03-19 3:03 ` Wu, Feng @ 2015-03-19 19:11 ` Konrad Rzeszutek Wilk 2015-03-23 8:04 ` Wu, Feng 0 siblings, 1 reply; 16+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-03-19 19:11 UTC (permalink / raw) To: Wu, Feng Cc: Zhang, Yang Z, Tian, Kevin, Keir Fraser (keir@xen.org), Jan Beulich (JBeulich@suse.com), xen-devel@lists.xen.org On Thu, Mar 19, 2015 at 03:03:55AM +0000, Wu, Feng wrote: > Thanks for the comments! > > > -----Original Message----- > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] > > Sent: Thursday, March 19, 2015 12:10 AM > > To: Wu, Feng > > Cc: xen-devel@lists.xen.org; Zhang, Yang Z; Tian, Kevin; Keir Fraser > > (keir@xen.org); Jan Beulich (JBeulich@suse.com) > > Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN > > > > On Wed, Mar 18, 2015 at 12:44:21PM +0000, Wu, Feng wrote: > > > VT-d Posted-interrupt (PI) design for XEN > > > > > > Background > > > ========== > > > With the development of virtualization, there are more and more device > > > assignment requirements. However, today when a VM is running with > > > assigned devices (such as, NIC), external interrupt handling for the assigned > > > devices always needs VMM intervention. > > > > > > VT-d Posted-interrupt is a more enhanced method to handle interrupts > > > in the virtualization environment. Interrupt posting is the process by > > > which an interrupt request is recorded in a memory-resident > > > posted-interrupt-descriptor structure by the root-complex, followed by > > > an optional notification event issued to the CPU complex. > > > > > > With VT-d Posted-interrupt we can get the following advantages: > > > - Direct delivery of external interrupts to running vCPUs without VMM > > > intervention > > > > > > I hadn't digged deep in what Xen has currently - but I would assume that > > this is exactly what we have now in Xen? 
>
> Here is what Xen currently does for external interrupts from assigned devices:
>
> When a VM is running and an external interrupts from an assigned devices occurs
> for it. VM-EXIT happens, then:
>
> vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
> raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
>
> softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
>
> dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
> vmsi_inj_irq() --> vlapic_set_irq()

<nods> It would be fantastic to put this in the design document, to help people
make sure that their expectations are in line.

>
> vlapic_set_irq() does the following things:
> 1. If CPU-side posted-interrupt is supported (I think it is supported from Xen 4.3, or Xen 4.4,
> sorry, not quite remember the exact version), call vmx_deliver_posted_intr() to deliver
> the virtual interrupt via posted-interrupt infrastructure.

The benefit is that if an interrupt arrives on the pCPU running VCPU0 but is
meant for VCPU1, we can inject it into VCPU1 without making VCPU1 do a VMEXIT.
However, if we pin the vCPUs, CPU-side posted interrupts do not help: we
still have to process the interrupt in the Xen hypervisor.

> 2. Else If CPU-side posted-interrupt is not supported, set the related vIRR in vLAPIC
> page and call vcpu_kick() to kick the related vCPU. Before VM-Entry, vmx_intr_assist()
> will help to inject the interrupt to guests.
>
> However, after VT-d PI is supported, when a guest is running in non-root and an
> external interrupt from an assigned device occurs for it. _no_ VM-Exit is needed,
> the guest can handle this totally in non-root mode, thus avoiding all the above
> code flow.

<nods> However, it does require Linux PVHVM guests not to use the vector
callback mechanism, or rather, not to use the event mechanism. What this needs
in order to work on the Linux side is for the PCIe device to use the
'baremetal' mechanism to set up MSIs (program the IOAPIC, etc).
It would be worth mentioning this in the document too. > > > > > Hm, actually we seem to be still invoking the hypervisor on the > > interrupts -except that if we need to dispatch it to another CPU > > using an normal vector to do so - which would still cause the > > hypervisor to be invoked? Or does it actually go straight in the > > guest? > > > > Like what I mentioned above, If the guest is running, we don't need invoke hypervisor. > > > So what kind of support do we currently have in Xen from posted > > interrupt? Could you add a bit about this in the background please? > > Good suggestion. > > Currently, Xen only supports the CPU-side posted-interrupt. Like what I mentioned above, > function vlapic_set_irq() can use this to deliver virtual interrupts, basically there are several > methods to deliver virtual interrupts to guests: > - Event delivery before VM-Entry via __vmx_inject_exception(), this is the oldest way. > - After APICv was enabled, we had hardware support for virtual interrupt delivery, virtual > interrupts are stored in virtual LAPIC page, after VM-Entry, guests can evaluate these > virtual interrupt and handle them in non-root mode. > - As an enhancement to APICv, CPU-side posted-interrupt was introduced, like above comments, > with this new feature, we don't need to kick the vCPU and deliver the virtual interrupts > direct to it. > > About APICv and CPU-side Posted-interrupt, please refer to Chapter 29, and Section 29.6 in the Intel SDM: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf > > > > > > - Decrease the interrupt migration complexity. On vCPU migration, software > > > can atomically co-migrate all interrupts targeting the migrating vCPU. For > > > virtual machines with assigned devices, migrating a vCPU across pCPUs > > > either incur the overhead of forwarding interrupts in software (e.g. 
via VMM > > > generated IPIS), or complexity to independently migrate each interrupt > > targeting > > > the vCPU to the new pCPU. However, after enabling VT-d PI, the destination > > vCPU > > > of an external interrupt from assigned devices is stored in the IRTE (i.e. > > > Posted-interrupt Descriptor Address), when vCPU is migrated to another > > pCPU, > > > we will set this new pCPU in the 'NDST' filed of Posted-interrupt descriptor, > > this > > > make the interrupt migration automatic. > > > > > > > > > Posted-interrupt Introduction > > > ======================== > > > There are two components to the Posted-interrupt architecture: > > > Processor Support and Root-Complex Support > > > > > > - Processor Support > > > Posted-interrupt processing is a feature by which a processor processes > > > the virtual interrupts by recording them as pending on the virtual-APIC > > > page. > > > > > > Posted-interrupt processing is enabled by setting the "process posted > > > interrupts" VM-execution control. The processing is performed in response > > > to the arrival of an interrupt with the posted-interrupt notification vector. > > > In response to such an interrupt, the processor processes virtual interrupts > > > recorded in a data structure called a posted-interrupt descriptor. > > > > > > More information about APICv and CPU-side Posted-interrupt, please refer > > > to Chapter 29, and Section 29.6 in the Intel SDM: > > > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6 > > 4-ia-32-architectures-software-developer-manual-325462.pdf > > > > > > - Root-Complex Support > > > Interrupt posting is the process by which an interrupt request (from IOAPIC > > > or MSI/MSIx capable sources) is recorded in a memory-resident > > > posted-interrupt-descriptor structure by the root-complex, followed by > > > an optional notification event issued to the CPU complex. 
The interrupt > > > request arriving at the root-complex carry the identity of the interrupt > > > request source and a 'remapping-index'. The remapping-index is used to > > > look-up an entry from the memory-resident interrupt-remap-table. Unlike > > > with interrupt-remapping, the interrupt-remap-table-entry for a posted- > > > interrupt, specifies a virtual-vector and a pointer to the posted-interrupt > > > descriptor. The virtual-vector specifies the vector of the interrupt to be > > > recorded in the posted-interrupt descriptor. The posted-interrupt descriptor > > > hosts storage for the virtual-vectors and contains the attributes of the > > > notification event (interrupt) to be issued to the CPU complex to inform > > > CPU/software about pending interrupts recorded in the posted-interrupt > > > descriptor. > > > > > > More information about VT-d PI, please refer to > > > > > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog > > y/vt-directed-io-spec.html > > > > > > Important Definitions > > > ================== > > > There are some changes to IRTE and posted-interrupt descriptor after > > > VT-d PI is introduced: > > > > s/is/was/ > > > > > IRTE: > > > Posted-interrupt Descriptor Address: the address of the posted-interrupt > > descriptor > > > Virtual Vector: the guest vector of the interrupt > > > URG: indicates if the interrupt is urgent > > > > > > Posted-interrupt descriptor: > > > The Posted Interrupt Descriptor hosts the following fields: > > > Posted Interrupt Request (PIR): Provide storage for posting (recording) > > interrupts (one bit > > > per vector, for up to 256 vectors). > > > > > > Outstanding Notification (ON): Indicate if there is a notification event > > outstanding (not > > > processed by processor or software) for this Posted Interrupt Descriptor. 
> > When this field is 0, > > > hardware modifies it from 0 to 1 when generating a notification event, and > > the entity receiving > > > the notification event (processor or software) resets it as part of posted > > interrupt processing. > > > > > > Suppress Notification (SN): Indicate if a notification event is to be suppressed > > (not > > > generated) for non-urgent interrupt requests (interrupts processed through > > an IRTE with > > > URG=0). > > > > > > Notification Vector (NV): Specify the vector for notification event (interrupt). > > > > > > Notification Destination (NDST): Specify the physical APIC-ID of the > > destination logical > > > processor for the notification event. > > > > > > Design Overview > > > ============== > > > In this design, we will cover the following items: > > > 1. Add a variable to control whether enable VT-d posted-interrupt or not. > > > 2. VT-d PI feature detection. > > > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff. > > > > stuff? Perhaps features? > > > 4. Extend IRTE structure to support VT-d PI. > > > 5. Introduce a new global vector which is used for waking up the blocked > > vCPU. > > > 6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx > > configuration). > > > 7. Update posted-interrupt descriptor during vCPU scheduling (when the > > state > > > of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/ > > > RUNSTATE_runnable / RUNSTATE_offline). > > > 8. How to wakeup blocked vCPU when an interrupt is posted for it (wakeup > > notification handler). > > > 9. New boot command line for Xen, which controls VT-d PI feature by user. > > > 10. Multicast/broadcast and lowest priority interrupts consideration. 
> > > > > > > > > Implementation details > > > =================== > > > - New variable to control VT-d PI > > > > > > Like variable 'iommu_intremap' for interrupt remapping, it is very > > straightforward > > > to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is > > set > > > only when interrupt remapping and VT-d posted-interrupt are both enabled. > > > > > > - VT-d PI feature detection. > > > Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt > > support. > > > > > > - Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff. > > > Here is the new structure for posted-interrupt descriptor: > > > > > > struct pi_desc { > > > DECLARE_BITMAP(pir, NR_VECTORS); > > > union { > > > struct > > > { > > > u64 on : 1, > > > sn : 1, > > > rsvd_1 : 13, > > > ndm : 1, > > > nv : 8, > > > rsvd_2 : 8, > > > ndst : 32; > > > }; > > > u64 control; > > > }; > > > u32 rsvd[6]; > > > } __attribute__ ((aligned (64))); > > > > > > - Extend IRTE structure to support VT-d PI. > > > > > > Here is the new structure for IRTE: > > > /* interrupt remap entry */ > > > struct iremap_entry { > > > union { > > > u64 lo_val; > > > struct { > > > u64 p : 1, > > > fpd : 1, > > > dm : 1, > > > rh : 1, > > > tm : 1, > > > dlm : 3, > > > avail : 4, > > > res_1 : 4, > > > vector : 8, > > > res_2 : 8, > > > dst : 32; > > > }lo; > > > struct { > > > u64 p : 1, > > > fpd : 1, > > > res_1 : 6, > > > avail : 4, > > > res_2 : 2, > > > urg : 1, > > > im : 1, > > > vector : 8, > > > res_3 : 14, > > > pda_l : 26; > > > }lo_intpost; > > > }; > > > union { > > > u64 hi_val; > > > struct { > > > u64 sid : 16, > > > sq : 2, > > > svt : 2, > > > res_1 : 44; > > > }hi; > > > struct { > > > u64 sid : 16, > > > sq : 2, > > > svt : 2, > > > res_1 : 12, > > > pda_h : 32; > > > }hi_intpost; > > > }; > > > }; > > > > > > - Introduce a new global vector which is used to wake up the blocked vCPU. 
> > > > > > Currently, there is a global vector 'posted_intr_vector', which is used as the > > > > s/Currently/In Xen 4.6 and earlier/ > > > global notification vector for all vCPUs in the system. This vector is stored in > > > VMCS and CPU considers it as a _special_ vector, uses it to notify the related > > > pCPU when an interrupt is recorded in the posted-interrupt descriptor. > > > > > > This existing global vector is a _special_ vector to CPU, CPU handle it in a > > > _special_ way compared to normal vectors, please refer to 29.6 in Intel SDM > > > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6 > > 4-ia-32-architectures-software-developer-manual-325462.pdf > > > for more information about how CPU handles it. > > > > > > After having VT-d PI, VT-d engine can issue notification event when the > > > assigned devices issue interrupts. We need add a new global vector to > > > wakeup the blocked vCPU, please refer to later section in this design for > > > how to use this new global vector. > > > > Ah, so this is what Xen has right now - and the changes that this design > > outlines are here deal with an blocked guests. > > No, this is what I add for enabling VT-d PI. We discussed a lot about this > new global vector and its usage scenario after posting version 1 of this > design. Do you have any question about this? No, you clarified it in your answers to my questions! thank you. > > > > > > > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx > > configuration). > > > After VT-d PI is introduced, the format of IRTE is changed as follows: > > > Descriptor Address: the address of the posted-interrupt descriptor > > > Virtual Vector: the guest vector of the interrupt > > > URG: indicates if the interrupt is urgent > > > Other fields continue to have the same meaning > > > > > > 'Descriptor Address' tells the destination vCPU of this interrupt, since > > > each vCPU has a dedicated posted-interrupt descriptor. 
> > > > > > 'Virtual Vector' tells the guest vector of the interrupt. > > > > > > When guest changes the configuration of the interrupts, such as, the > > > cpu affinity, or the vector, we need to update the associated IRTE accordingly. > > > > > > - Update posted-interrupt descriptor during vCPU scheduling > > > > > > The basic idea here is: > > > 1. When vCPU's state is RUNSTATE_running, > > > - Set 'NV' to 'posted_intr_vector'. > > > - Clear 'SN' to accept posted-interrupts. > > > - Set 'NDST' to the pCPU on which the vCPU will be running. > > > 2. When vCPU's state is RUNSTATE_blocked, > > > - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the > > > related vCPU when posted-interrupt happens for it. > > > Please refer to the above section about the new global vector. > > > - Clear 'SN' to accept posted-interrupts > > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline, > > > - Set 'SN' to suppress non-urgent interrupts > > > (Current, we only support non-urgent interrupts) > > > When vCPU is in RUNSTATE_runnable or RUNSTATE_offline, > > > It is not needed to accept posted-interrupt notification event, > > > since we don't change the behavior of scheduler when the > > interrupt > > > occurs, we still need wait the next scheduling of the vCPU. > > > > still need to wait for the next.. > > > When external interrupts from assigned devices occur, the > > interrupts > > > are recorded in PIR, and will be synced to IRR before VM-Entry. > > > - Set 'NV' to 'posted_intr_vector'. > > > > > > - How to wakeup blocked vCPU when an interrupt is posted for it (wakeup > > notification handler). > > > > > > Here is the scenario for the usage of the new global vector: > > > > > > 1. vCPU0 is running on pCPU0 > > > 2. vCPU0 is blocked and vCPU1 is currently running on pCPU0 > > > 3. 
An external interrupt from an assigned device occurs for vCPU0, if we > > > still use 'posted_intr_vector' as the notification vector for vCPU0, the > > > notification event for vCPU0 (the event will go to pCPU1) will be consumed > > > by vCPU1 incorrectly (remember this is a special vector to CPU). The worst > > > case is that vCPU0 will never be woken up again since the wakeup event > > > for it is always consumed by other vCPUs incorrectly. So we need introduce > > > another global vector, naming 'pi_wakeup_vector' to wake up the blocked > > vCPU. > > > > > > After using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue notification > > > event using this new vector. Since this new vector is not a SPECIAL one to > > CPU, > > > it is just a normal vector. To cpu, it just receives an normal external interrupt, > > > then we can get control in the handler of this new vector. In this case, > > hypervisor > > > can do something in it, such as wakeup the blocked vCPU. > > > > > > Here are what we do for the blocked vCPU: > > > 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stored the blocked > > > vCPU on the pCPU. > > > 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU > > > to the per-cpu list belonging to the pCPU it was running. > > > 3. When the vCPU is unblocked, remove the vCPU from the related pCPU list. > > > > > > In the handler of 'pi_wakeup_vector', we do: > > > 1. Get the physical CPU. > > > 2. Iterate the list 'blocked_vcpu_on_cpu' of the current pCPU, if 'ON' is set, > > > we unblock the associated vCPU. > > > > > > - New boot command line for Xen, which controls VT-d PI feature by user. > > > > > > Like 'intremap' for interrupt remapping, we add a new boot command line > > > 'intpost' for posted-interrupts. > > > > Earlier you mentioned "iommu_intpost" ? > > 'intpost' is a Xen command line parameter, while 'iommu_intpost' is a variable > In the Code, just like 'intremap' and 'iommu_intremap'. 

Why not piggyback on 'iommu'? It might be worth explaining why you chose
a new name instead of adding a new option to 'iommu'.

>
> Thanks,
> Feng
>
> > >
> > > - Multicast/broadcast and lowest priority interrupts consideration.
> > >
> > > With VT-d PI, the destination vCPU information of an external interrupt
> > > from assigned devices is stored in IRTE, this makes the following
> > > consideration of the design:
> > > 1. Multicast/broadcast interrupts cannot be posted.
> > > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > > (starting from Nehalem) ignore TPR value, and instead support two other
> > > ways (configurable by BIOS) to handle lowest priority interrupts:
> > > A) Round robin: In this method, the chipset simply delivers lowest priority
> > > interrupts in a round-robin manner across all the available logical CPUs. While
> > > this provides good load balancing, this was not the best thing to do always as
> > > interrupts from the same device (like NIC) will start running on all the CPUs
> > > thrashing caches and taking locks. This led to the next scheme.
> > > B) Vector hashing: In this method, hardware would apply a hash function
> > > on the vector value in the interrupt request, and use that hash to pick a logical
> > > CPU to route the lowest priority interrupt. This way, a given vector always goes
> > > to the same logical CPU, avoiding the thrashing problem above.
> > >
> > > So, gist of above is that, lowest priority interrupts have never been delivered as
> > > "lowest priority" in physical hardware.
> > >
> > > I will emulate vector hashing for posted-interrupt for XEN.
> > >
> > > ================================
> > >
> > > Any comments about this design are highly appreciated!
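
The vector hashing idea quoted above (pick one destination for a
lowest-priority interrupt from a hash of its vector) can be sketched as
follows. This is only an illustration under stated assumptions: the helper
name pick_dest_by_vector_hash() is hypothetical, and the hash used here
(vector modulo the number of candidate vCPUs) is an assumption, since the
design does not say which hash function would be emulated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose one destination out of the 'n_dest' vCPUs
 * that a lowest-priority interrupt could be delivered to.
 *
 * Assumption: the hash is simply "vector modulo number of candidates".
 * The only property the design relies on is that a given vector always
 * maps to the same destination.
 */
static unsigned int pick_dest_by_vector_hash(const unsigned int *dest_vcpu_ids,
                                             size_t n_dest, uint8_t vector)
{
    assert(dest_vcpu_ids != NULL && n_dest > 0);
    return dest_vcpu_ids[vector % n_dest];
}
```

With three candidate vCPUs {2, 5, 7}, vector 0x30 (48) selects index 0 and
so vCPU 2 on every delivery, which is the "same vector, same CPU" behaviour
that avoids the cache thrashing of round robin.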
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN 2015-03-19 19:11 ` Konrad Rzeszutek Wilk @ 2015-03-23 8:04 ` Wu, Feng 0 siblings, 0 replies; 16+ messages in thread From: Wu, Feng @ 2015-03-23 8:04 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Tian, Kevin, Wu, Feng, xen-devel@lists.xen.org, Jan Beulich (JBeulich@suse.com), Zhang, Yang Z, Keir Fraser (keir@xen.org) > -----Original Message----- > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] > Sent: Friday, March 20, 2015 3:12 AM > To: Wu, Feng > Cc: xen-devel@lists.xen.org; Zhang, Yang Z; Tian, Kevin; Keir Fraser > (keir@xen.org); Jan Beulich (JBeulich@suse.com) > Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN > > On Thu, Mar 19, 2015 at 03:03:55AM +0000, Wu, Feng wrote: > > Thanks for the comments! > > > > > -----Original Message----- > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] > > > Sent: Thursday, March 19, 2015 12:10 AM > > > To: Wu, Feng > > > Cc: xen-devel@lists.xen.org; Zhang, Yang Z; Tian, Kevin; Keir Fraser > > > (keir@xen.org); Jan Beulich (JBeulich@suse.com) > > > Subject: Re: [Xen-devel] (v2) VT-d Posted-interrupt (PI) design for XEN > > > > > > On Wed, Mar 18, 2015 at 12:44:21PM +0000, Wu, Feng wrote: > > > > VT-d Posted-interrupt (PI) design for XEN > > > > > > > > Background > > > > ========== > > > > With the development of virtualization, there are more and more device > > > > assignment requirements. However, today when a VM is running with > > > > assigned devices (such as, NIC), external interrupt handling for the > assigned > > > > devices always needs VMM intervention. > > > > > > > > VT-d Posted-interrupt is a more enhanced method to handle interrupts > > > > in the virtualization environment. 
Interrupt posting is the process by > > > > which an interrupt request is recorded in a memory-resident > > > > posted-interrupt-descriptor structure by the root-complex, followed by > > > > an optional notification event issued to the CPU complex. > > > > > > > > With VT-d Posted-interrupt we can get the following advantages: > > > > - Direct delivery of external interrupts to running vCPUs without VMM > > > > intervention > > > > > > > > > I hadn't digged deep in what Xen has currently - but I would assume that > > > this is exactly what we have now in Xen? > > > > Here is what Xen currently does for external interrupts from assigned > devices: > > > > When a VM is running and an external interrupts from an assigned devices > occurs > > for it. VM-EXIT happens, then: > > > > vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() --> > > raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ) > > > > softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq() > > > > dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() > --> > > vmsi_inj_irq() --> vlapic_set_irq() > > <nods> This would be fantastic to put in the design document to help > people make sure that their expectations are in line. Sure! > > > > > vlapic_set_irq() does the following things: > > 1. If CPU-side posted-interrupt is supported (I think it is supported from Xen > 4.3, or Xen 4.4, > > sorry, not quite remember the exact version), call vmx_deliver_posted_intr() > to deliver > > the virtual interrupt via posted-interrupt infrastructure. > > The benefit is that if an interrupt comes for VCPU0 instead of > VCPU1 we can inject the interrupt in the VCPU1 without having it > do an VMEXIT. > > However if we pin the vCPUs, then CPU-side posted interrupt do not > help - we still have to process the interrupt in Xen hypervisor. > > > 2. 
Else If CPU-side posted-interrupt is not supported, set the related vIRR in vLAPIC
> > page and call vcpu_kick() to kick the related vCPU. Before VM-Entry, vmx_intr_assist()
> > will help to inject the interrupt to guests.
> >
> > However, after VT-d PI is supported, when a guest is running in non-root and an
> > external interrupt from an assigned device occurs for it, _no_ VM-Exit is needed;
> > the guest can handle this totally in non-root mode, thus avoiding all the above
> > code flow.
>
> <nods> However it does require for Linux PVHVM guests to not use the
> vector callback mechanism - or rather - not use the event mechanism.
>
> What you require for this to work on the Linux side is for the PCIe
> device to use the 'baremetal' mechanism to setup MSIs (program the
> IOAPIC, etc). It would be worth mentioning this in the document too.

Thanks for the suggestion. In fact, there is some information about this in
the design doc; please refer to the section "Update IRTE when guest modifies
the interrupt configuration (MSI/MSIx configuration)". When a guest updates
its MSI/MSIx information, Xen gets control and updates the guest interrupt
information in the related IRTE.

>
> >
> > >
> > > Hm, actually we seem to be still invoking the hypervisor on the
> > > interrupts -except that if we need to dispatch it to another CPU
> > > using an normal vector to do so - which would still cause the
> > > hypervisor to be invoked? Or does it actually go straight in the
> > > guest?
> > >
> >
> > Like what I mentioned above, If the guest is running, we don't need invoke hypervisor.
> >
> > > So what kind of support do we currently have in Xen from posted
> > > interrupt? Could you add a bit about this in the background please?
> >
> > Good suggestion.
> >
> > Currently, Xen only supports the CPU-side posted-interrupt.
Like what I > mentioned above, > > function vlapic_set_irq() can use this to deliver virtual interrupts, basically > there are several > > methods to deliver virtual interrupts to guests: > > - Event delivery before VM-Entry via __vmx_inject_exception(), this is the > oldest way. > > - After APICv was enabled, we had hardware support for virtual interrupt > delivery, virtual > > interrupts are stored in virtual LAPIC page, after VM-Entry, guests can > evaluate these > > virtual interrupt and handle them in non-root mode. > > - As an enhancement to APICv, CPU-side posted-interrupt was introduced, > like above comments, > > with this new feature, we don't need to kick the vCPU and deliver the virtual > interrupts > > direct to it. > > > > About APICv and CPU-side Posted-interrupt, please refer to Chapter 29, and > Section 29.6 in the Intel SDM: > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6 > 4-ia-32-architectures-software-developer-manual-325462.pdf > > > > > > > > > - Decrease the interrupt migration complexity. On vCPU migration, > software > > > > can atomically co-migrate all interrupts targeting the migrating vCPU. For > > > > virtual machines with assigned devices, migrating a vCPU across pCPUs > > > > either incur the overhead of forwarding interrupts in software (e.g. via > VMM > > > > generated IPIS), or complexity to independently migrate each interrupt > > > targeting > > > > the vCPU to the new pCPU. However, after enabling VT-d PI, the > destination > > > vCPU > > > > of an external interrupt from assigned devices is stored in the IRTE (i.e. > > > > Posted-interrupt Descriptor Address), when vCPU is migrated to another > > > pCPU, > > > > we will set this new pCPU in the 'NDST' filed of Posted-interrupt > descriptor, > > > this > > > > make the interrupt migration automatic. 
> > > > > > > > > > > > Posted-interrupt Introduction > > > > ======================== > > > > There are two components to the Posted-interrupt architecture: > > > > Processor Support and Root-Complex Support > > > > > > > > - Processor Support > > > > Posted-interrupt processing is a feature by which a processor processes > > > > the virtual interrupts by recording them as pending on the virtual-APIC > > > > page. > > > > > > > > Posted-interrupt processing is enabled by setting the "process posted > > > > interrupts" VM-execution control. The processing is performed in > response > > > > to the arrival of an interrupt with the posted-interrupt notification vector. > > > > In response to such an interrupt, the processor processes virtual > interrupts > > > > recorded in a data structure called a posted-interrupt descriptor. > > > > > > > > More information about APICv and CPU-side Posted-interrupt, please > refer > > > > to Chapter 29, and Section 29.6 in the Intel SDM: > > > > > > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6 > > > 4-ia-32-architectures-software-developer-manual-325462.pdf > > > > > > > > - Root-Complex Support > > > > Interrupt posting is the process by which an interrupt request (from > IOAPIC > > > > or MSI/MSIx capable sources) is recorded in a memory-resident > > > > posted-interrupt-descriptor structure by the root-complex, followed by > > > > an optional notification event issued to the CPU complex. The interrupt > > > > request arriving at the root-complex carry the identity of the interrupt > > > > request source and a 'remapping-index'. The remapping-index is used to > > > > look-up an entry from the memory-resident interrupt-remap-table. Unlike > > > > with interrupt-remapping, the interrupt-remap-table-entry for a posted- > > > > interrupt, specifies a virtual-vector and a pointer to the posted-interrupt > > > > descriptor. 
The virtual-vector specifies the vector of the interrupt to be > > > > recorded in the posted-interrupt descriptor. The posted-interrupt > descriptor > > > > hosts storage for the virtual-vectors and contains the attributes of the > > > > notification event (interrupt) to be issued to the CPU complex to inform > > > > CPU/software about pending interrupts recorded in the posted-interrupt > > > > descriptor. > > > > > > > > More information about VT-d PI, please refer to > > > > > > > > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog > > > y/vt-directed-io-spec.html > > > > > > > > Important Definitions > > > > ================== > > > > There are some changes to IRTE and posted-interrupt descriptor after > > > > VT-d PI is introduced: > > > > > > s/is/was/ > > > > > > > IRTE: > > > > Posted-interrupt Descriptor Address: the address of the posted-interrupt > > > descriptor > > > > Virtual Vector: the guest vector of the interrupt > > > > URG: indicates if the interrupt is urgent > > > > > > > > Posted-interrupt descriptor: > > > > The Posted Interrupt Descriptor hosts the following fields: > > > > Posted Interrupt Request (PIR): Provide storage for posting (recording) > > > interrupts (one bit > > > > per vector, for up to 256 vectors). > > > > > > > > Outstanding Notification (ON): Indicate if there is a notification event > > > outstanding (not > > > > processed by processor or software) for this Posted Interrupt Descriptor. > > > When this field is 0, > > > > hardware modifies it from 0 to 1 when generating a notification event, > and > > > the entity receiving > > > > the notification event (processor or software) resets it as part of posted > > > interrupt processing. > > > > > > > > Suppress Notification (SN): Indicate if a notification event is to be > suppressed > > > (not > > > > generated) for non-urgent interrupt requests (interrupts processed > through > > > an IRTE with > > > > URG=0). 
> > > > > > > > Notification Vector (NV): Specify the vector for notification event > (interrupt). > > > > > > > > Notification Destination (NDST): Specify the physical APIC-ID of the > > > destination logical > > > > processor for the notification event. > > > > > > > > Design Overview > > > > ============== > > > > In this design, we will cover the following items: > > > > 1. Add a variable to control whether enable VT-d posted-interrupt or not. > > > > 2. VT-d PI feature detection. > > > > 3. Extend posted-interrupt descriptor structure to cover VT-d PI specific > stuff. > > > > > > stuff? Perhaps features? > > > > 4. Extend IRTE structure to support VT-d PI. > > > > 5. Introduce a new global vector which is used for waking up the blocked > > > vCPU. > > > > 6. Update IRTE when guest modifies the interrupt configuration > (MSI/MSIx > > > configuration). > > > > 7. Update posted-interrupt descriptor during vCPU scheduling (when the > > > state > > > > of the vCPU is transmitted among RUNSTATE_running / > RUNSTATE_blocked/ > > > > RUNSTATE_runnable / RUNSTATE_offline). > > > > 8. How to wakeup blocked vCPU when an interrupt is posted for it > (wakeup > > > notification handler). > > > > 9. New boot command line for Xen, which controls VT-d PI feature by user. > > > > 10. Multicast/broadcast and lowest priority interrupts consideration. > > > > > > > > > > > > Implementation details > > > > =================== > > > > - New variable to control VT-d PI > > > > > > > > Like variable 'iommu_intremap' for interrupt remapping, it is very > > > straightforward > > > > to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is > > > set > > > > only when interrupt remapping and VT-d posted-interrupt are both > enabled. > > > > > > > > - VT-d PI feature detection. > > > > Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt > > > support. > > > > > > > > - Extend posted-interrupt descriptor structure to cover VT-d PI specific > stuff. 
> > > > Here is the new structure for posted-interrupt descriptor: > > > > > > > > struct pi_desc { > > > > DECLARE_BITMAP(pir, NR_VECTORS); > > > > union { > > > > struct > > > > { > > > > u64 on : 1, > > > > sn : 1, > > > > rsvd_1 : 13, > > > > ndm : 1, > > > > nv : 8, > > > > rsvd_2 : 8, > > > > ndst : 32; > > > > }; > > > > u64 control; > > > > }; > > > > u32 rsvd[6]; > > > > } __attribute__ ((aligned (64))); > > > > > > > > - Extend IRTE structure to support VT-d PI. > > > > > > > > Here is the new structure for IRTE: > > > > /* interrupt remap entry */ > > > > struct iremap_entry { > > > > union { > > > > u64 lo_val; > > > > struct { > > > > u64 p : 1, > > > > fpd : 1, > > > > dm : 1, > > > > rh : 1, > > > > tm : 1, > > > > dlm : 3, > > > > avail : 4, > > > > res_1 : 4, > > > > vector : 8, > > > > res_2 : 8, > > > > dst : 32; > > > > }lo; > > > > struct { > > > > u64 p : 1, > > > > fpd : 1, > > > > res_1 : 6, > > > > avail : 4, > > > > res_2 : 2, > > > > urg : 1, > > > > im : 1, > > > > vector : 8, > > > > res_3 : 14, > > > > pda_l : 26; > > > > }lo_intpost; > > > > }; > > > > union { > > > > u64 hi_val; > > > > struct { > > > > u64 sid : 16, > > > > sq : 2, > > > > svt : 2, > > > > res_1 : 44; > > > > }hi; > > > > struct { > > > > u64 sid : 16, > > > > sq : 2, > > > > svt : 2, > > > > res_1 : 12, > > > > pda_h : 32; > > > > }hi_intpost; > > > > }; > > > > }; > > > > > > > > - Introduce a new global vector which is used to wake up the blocked > vCPU. > > > > > > > > Currently, there is a global vector 'posted_intr_vector', which is used as > the > > > > > > s/Currently/In Xen 4.6 and earlier/ > > > > global notification vector for all vCPUs in the system. This vector is stored > in > > > > VMCS and CPU considers it as a _special_ vector, uses it to notify the > related > > > > pCPU when an interrupt is recorded in the posted-interrupt descriptor. 
> > > > > > > > This existing global vector is a _special_ vector to CPU, CPU handle it in a > > > > _special_ way compared to normal vectors, please refer to 29.6 in Intel > SDM > > > > > > > > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/6 > > > 4-ia-32-architectures-software-developer-manual-325462.pdf > > > > for more information about how CPU handles it. > > > > > > > > After having VT-d PI, VT-d engine can issue notification event when the > > > > assigned devices issue interrupts. We need add a new global vector to > > > > wakeup the blocked vCPU, please refer to later section in this design for > > > > how to use this new global vector. > > > > > > Ah, so this is what Xen has right now - and the changes that this design > > > outlines are here deal with an blocked guests. > > > > No, this is what I add for enabling VT-d PI. We discussed a lot about this > > new global vector and its usage scenario after posting version 1 of this > > design. Do you have any question about this? > > No, you clarified it in your answers to my questions! thank you. > > > > > > > > > > - Update IRTE when guest modifies the interrupt configuration (MSI/MSIx > > > configuration). > > > > After VT-d PI is introduced, the format of IRTE is changed as follows: > > > > Descriptor Address: the address of the posted-interrupt descriptor > > > > Virtual Vector: the guest vector of the interrupt > > > > URG: indicates if the interrupt is urgent > > > > Other fields continue to have the same meaning > > > > > > > > 'Descriptor Address' tells the destination vCPU of this interrupt, since > > > > each vCPU has a dedicated posted-interrupt descriptor. > > > > > > > > 'Virtual Vector' tells the guest vector of the interrupt. > > > > > > > > When guest changes the configuration of the interrupts, such as, the > > > > cpu affinity, or the vector, we need to update the associated IRTE > accordingly. 
> > > > > > > > - Update posted-interrupt descriptor during vCPU scheduling > > > > > > > > The basic idea here is: > > > > 1. When vCPU's state is RUNSTATE_running, > > > > - Set 'NV' to 'posted_intr_vector'. > > > > - Clear 'SN' to accept posted-interrupts. > > > > - Set 'NDST' to the pCPU on which the vCPU will be running. > > > > 2. When vCPU's state is RUNSTATE_blocked, > > > > - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the > > > > related vCPU when posted-interrupt happens for it. > > > > Please refer to the above section about the new global > vector. > > > > - Clear 'SN' to accept posted-interrupts > > > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline, > > > > - Set 'SN' to suppress non-urgent interrupts > > > > (Current, we only support non-urgent interrupts) > > > > When vCPU is in RUNSTATE_runnable or RUNSTATE_offline, > > > > It is not needed to accept posted-interrupt notification event, > > > > since we don't change the behavior of scheduler when the > > > interrupt > > > > occurs, we still need wait the next scheduling of the vCPU. > > > > > > still need to wait for the next.. > > > > When external interrupts from assigned devices occur, the > > > interrupts > > > > are recorded in PIR, and will be synced to IRR before > VM-Entry. > > > > - Set 'NV' to 'posted_intr_vector'. > > > > > > > > - How to wakeup blocked vCPU when an interrupt is posted for it (wakeup > > > notification handler). > > > > > > > > Here is the scenario for the usage of the new global vector: > > > > > > > > 1. vCPU0 is running on pCPU0 > > > > 2. vCPU0 is blocked and vCPU1 is currently running on pCPU0 > > > > 3. An external interrupt from an assigned device occurs for vCPU0, if we > > > > still use 'posted_intr_vector' as the notification vector for vCPU0, the > > > > notification event for vCPU0 (the event will go to pCPU1) will be consumed > > > > by vCPU1 incorrectly (remember this is a special vector to CPU). 
The > worst > > > > case is that vCPU0 will never be woken up again since the wakeup event > > > > for it is always consumed by other vCPUs incorrectly. So we need > introduce > > > > another global vector, naming 'pi_wakeup_vector' to wake up the blocked > > > vCPU. > > > > > > > > After using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue > notification > > > > event using this new vector. Since this new vector is not a SPECIAL one to > > > CPU, > > > > it is just a normal vector. To cpu, it just receives an normal external > interrupt, > > > > then we can get control in the handler of this new vector. In this case, > > > hypervisor > > > > can do something in it, such as wakeup the blocked vCPU. > > > > > > > > Here are what we do for the blocked vCPU: > > > > 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stored the blocked > > > > vCPU on the pCPU. > > > > 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the > vCPU > > > > to the per-cpu list belonging to the pCPU it was running. > > > > 3. When the vCPU is unblocked, remove the vCPU from the related pCPU > list. > > > > > > > > In the handler of 'pi_wakeup_vector', we do: > > > > 1. Get the physical CPU. > > > > 2. Iterate the list 'blocked_vcpu_on_cpu' of the current pCPU, if 'ON' is > set, > > > > we unblock the associated vCPU. > > > > > > > > - New boot command line for Xen, which controls VT-d PI feature by user. > > > > > > > > Like 'intremap' for interrupt remapping, we add a new boot command line > > > > 'intpost' for posted-interrupts. > > > > > > Earlier you mentioned "iommu_intpost" ? > > > > 'intpost' is a Xen command line parameter, while 'iommu_intpost' is a variable > > In the Code, just like 'intremap' and 'iommu_intremap'. > > Why not piggyback on 'iommu' ? It might be worth mentioning the > reasoning why you choose a new name instead of adding new options for > the 'iommu'. Oh, sorry, there is a mistake in my previous description. 
In fact, 'intpost' is an option of the 'iommu' command line, just like
'intremap'.

Thanks,
Feng

>
> >
> > Thanks,
> > Feng
> >
> > >
> > > >
> > > > - Multicast/broadcast and lowest priority interrupts consideration.
> > > >
> > > > With VT-d PI, the destination vCPU information of an external interrupt
> > > > from assigned devices is stored in IRTE, this makes the following
> > > > consideration of the design:
> > > > 1. Multicast/broadcast interrupts cannot be posted.
> > > > 2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
> > > > (starting from Nehalem) ignore TPR value, and instead support two other
> > > > ways (configurable by BIOS) to handle lowest priority interrupts:
> > > > A) Round robin: In this method, the chipset simply delivers lowest priority
> > > > interrupts in a round-robin manner across all the available logical CPUs. While
> > > > this provides good load balancing, this was not the best thing to do always as
> > > > interrupts from the same device (like NIC) will start running on all the CPUs
> > > > thrashing caches and taking locks. This led to the next scheme.
> > > > B) Vector hashing: In this method, hardware would apply a hash function
> > > > on the vector value in the interrupt request, and use that hash to pick a logical
> > > > CPU to route the lowest priority interrupt. This way, a given vector always goes
> > > > to the same logical CPU, avoiding the thrashing problem above.
> > > >
> > > > So, gist of above is that, lowest priority interrupts have never been delivered as
> > > > "lowest priority" in physical hardware.
> > > >
> > > > I will emulate vector hashing for posted-interrupt for XEN.
> > > >
> > > > ================================
> > > >
> > > > Any comments about this design are highly appreciated!
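
The three scheduling rules quoted above ("Update posted-interrupt descriptor
during vCPU scheduling") can be sketched as a pure function on the 64-bit
control word of the descriptor. This is a sketch under stated assumptions,
not Xen code: the bit positions follow the bit-field order of the quoted
'struct pi_desc' (SN at bit 1, NV at bits 16-23, NDST at bits 32-63), the
two vector numbers and the enum are placeholders, and a real implementation
would have to update the word atomically because hardware can set 'ON'
concurrently.

```c
#include <stdint.h>

/* Bit layout taken from the quoted 'struct pi_desc' control word. */
#define PI_SN          (1ULL << 1)
#define PI_NV_SHIFT    16
#define PI_NV_MASK     (0xffULL << PI_NV_SHIFT)
#define PI_NDST_SHIFT  32
#define PI_NDST_MASK   (0xffffffffULL << PI_NDST_SHIFT)

/* Placeholder vector numbers (assumption: the real ones are set at run time). */
#define POSTED_INTR_VECTOR 0xf2
#define PI_WAKEUP_VECTOR   0xf1

/* Stand-ins for RUNSTATE_running / _runnable / _blocked / _offline. */
enum runstate { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE };

static uint64_t pi_set_nv(uint64_t control, uint8_t nv)
{
    return (control & ~PI_NV_MASK) | ((uint64_t)nv << PI_NV_SHIFT);
}

/*
 * Return the new control word for a vCPU entering runstate 'rs'.
 * 'ndst' is the destination value for the target pCPU and is only
 * used for RS_RUNNING.
 */
static uint64_t pi_control_for_runstate(uint64_t control, enum runstate rs,
                                        uint32_t ndst)
{
    switch (rs) {
    case RS_RUNNING:
        /* Rule 1: normal notification vector, accept interrupts, new NDST. */
        control = pi_set_nv(control, POSTED_INTR_VECTOR);
        control &= ~PI_SN;
        control = (control & ~PI_NDST_MASK) |
                  ((uint64_t)ndst << PI_NDST_SHIFT);
        break;
    case RS_BLOCKED:
        /* Rule 2: wakeup vector, keep accepting interrupts. */
        control = pi_set_nv(control, PI_WAKEUP_VECTOR);
        control &= ~PI_SN;
        break;
    case RS_RUNNABLE:
    case RS_OFFLINE:
        /* Rule 3: suppress notifications; pending bits stay in PIR. */
        control |= PI_SN;
        control = pi_set_nv(control, POSTED_INTR_VECTOR);
        break;
    }
    return control;
}
```

Note that the blocked case leaves NDST untouched, so the wakeup notification
is delivered to the pCPU the vCPU last ran on.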
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
2015-03-18 12:44 (v2) VT-d Posted-interrupt (PI) design for XEN Wu, Feng
2015-03-18 16:09 ` Konrad Rzeszutek Wilk
@ 2015-03-19 9:56 ` Jan Beulich
2015-03-23 8:14 ` Wu, Feng
2015-03-25 5:10 ` Wu, Feng
2 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2015-03-19 9:56 UTC (permalink / raw)
To: Feng Wu
Cc: Yang Z Zhang, Kevin Tian, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

>>> On 18.03.15 at 13:44, <feng.wu@intel.com> wrote:
> Here are what we do for the blocked vCPU:
> 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stored the blocked
> vCPU on the pCPU.
> 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
> to the per-cpu list belonging to the pCPU it was running.
> 3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.

And this works transparently not only with the generic scheduler
code moving the vCPU to another pCPU, but also with some of the
individual scheduler implementations doing such re-assignments?

Jan
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
2015-03-19 9:56 ` Jan Beulich
@ 2015-03-23 8:14 ` Wu, Feng
2015-03-23 8:26 ` Jan Beulich
0 siblings, 1 reply; 16+ messages in thread
From: Wu, Feng @ 2015-03-23 8:14 UTC (permalink / raw)
To: Jan Beulich
Cc: Zhang, Yang Z, Wu, Feng, Tian, Kevin, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 19, 2015 5:57 PM
> To: Wu, Feng
> Cc: Tian, Kevin; Zhang, Yang Z; xen-devel@lists.xen.org; Keir Fraser
> (keir@xen.org)
> Subject: Re: (v2) VT-d Posted-interrupt (PI) design for XEN
>
> >>> On 18.03.15 at 13:44, <feng.wu@intel.com> wrote:
> > Here are what we do for the blocked vCPU:
> > 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stored the blocked
> > vCPU on the pCPU.
> > 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
> > to the per-cpu list belonging to the pCPU it was running.
> > 3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
>
> And this works transparently not only with the generic scheduler
> code moving the vCPU to another pCPU, but also with some of the
> individual scheduler implementations doing such re-assignments?
>

I cannot quite understand this, could you please elaborate a bit more. Thanks a lot!

Thanks,
Feng

> Jan
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
2015-03-23 8:14 ` Wu, Feng
@ 2015-03-23 8:26 ` Jan Beulich
2015-03-23 8:49 ` Wu, Feng
0 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2015-03-23 8:26 UTC (permalink / raw)
To: Feng Wu
Cc: Yang Z Zhang, Kevin Tian, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

>>> On 23.03.15 at 09:14, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, March 19, 2015 5:57 PM
>> >>> On 18.03.15 at 13:44, <feng.wu@intel.com> wrote:
>> > Here are what we do for the blocked vCPU:
>> > 1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stored the blocked
>> > vCPU on the pCPU.
>> > 2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
>> > to the per-cpu list belonging to the pCPU it was running.
>> > 3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
>>
>> And this works transparently not only with the generic scheduler
>> code moving the vCPU to another pCPU, but also with some of the
>> individual scheduler implementations doing such re-assignments?
>
> I cannot quite understand this, could you please elaborate a bit more.

There are multiple places where v->processor can get changed for a
particular vCPU, and obviously all of these need to be taken care of.
Yet a change like the one to come here would normally not be
expected to touch specific schedulers' code, and hence suitably
abstracting this may need some extra thought.

Jan
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
From: Wu, Feng @ 2015-03-23 8:49 UTC
To: Jan Beulich
Cc: Zhang, Yang Z, Wu, Feng, Tian, Kevin, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, March 23, 2015 4:26 PM
>
> >>> On 23.03.15 at 09:14, <feng.wu@intel.com> wrote:
> > I cannot quite understand this, could you please elaborate a bit more.
>
> There are multiple places where v->processor can get changed for a
> particular vCPU, and obviously all of these need to be taken care of.
> Yet a change like the one to come here would normally not be
> expected to touch specific schedulers' code, and hence suitably
> abstracting this may need some extra thought.
>
> Jan

Why do we need to care about the places where v->processor gets changed?
My idea about this is:

Before the vCPU is blocked, we can get v->processor and save the vCPU to
this per-CPU list. Besides that, v->processor is the destination of the
notification event (it is stored in the Posted-interrupt descriptor). So
when a wakeup notification event happens for this vCPU, it goes to pCPU
v->processor; then, in the wakeup notification event handler, we can find
the list via smp_processor_id(), and hence find the right vCPU to wake up.

Am I missing something here? Thanks a lot!

Thanks,
Feng

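The scheme described in the message above can be sketched in plain C. This is a toy user-space model under stated assumptions, not Xen code: `NR_PCPUS`, the `struct vcpu` fields, and the functions `vcpu_block`, `post_interrupt` and `pi_wakeup_handler` are all hypothetical names chosen for illustration. Only `blocked_vcpu_on_cpu`, 'NDST', 'ON' and `v->processor` come from the thread. The `this_cpu` argument stands in for what `smp_processor_id()` would return inside the real handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PCPUS 4            /* hypothetical pCPU count for this model */

/* Toy model of a vCPU; field names are illustrative, not Xen's. */
struct vcpu {
    int processor;            /* pCPU the vCPU last ran on (v->processor) */
    int ndst;                 /* models 'NDST' in the PI descriptor */
    bool on;                  /* models 'ON' in the PI descriptor */
    bool blocked;
    struct vcpu *next;        /* link in the per-pCPU blocked list */
};

/* One list head per pCPU, standing in for 'blocked_vcpu_on_cpu'. */
static struct vcpu *blocked_vcpu_on_cpu[NR_PCPUS];

/* Called when the vCPU enters RUNSTATE_blocked. */
static void vcpu_block(struct vcpu *v)
{
    assert(v->processor >= 0 && v->processor < NR_PCPUS);
    v->blocked = true;
    v->ndst = v->processor;   /* the wakeup event will target this pCPU */
    v->next = blocked_vcpu_on_cpu[v->processor];
    blocked_vcpu_on_cpu[v->processor] = v;
}

/* Models the hardware posting an interrupt: sets ON and returns the
 * pCPU that receives the wakeup notification. */
static int post_interrupt(struct vcpu *v)
{
    v->on = true;
    return v->ndst;
}

/* Wakeup handler running on this_cpu: unlink and unblock every vCPU on
 * the local list whose ON bit is set. Returns the number woken. ON is
 * left set here; what clears it is outside this sketch. */
static int pi_wakeup_handler(int this_cpu)
{
    int woken = 0;
    struct vcpu **pp = &blocked_vcpu_on_cpu[this_cpu];

    while (*pp != NULL) {
        struct vcpu *v = *pp;
        if (v->on) {
            *pp = v->next;    /* unlink from the blocked list */
            v->next = NULL;
            v->blocked = false;
            woken++;
        } else {
            pp = &v->next;
        }
    }
    return woken;
}
```

Because the list is chosen from `v->processor` at block time and 'NDST' is set to the same value, the handler only ever needs to look at the list of the pCPU it runs on. The model is single-threaded and deliberately ignores locking.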
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
From: Jan Beulich @ 2015-03-23 9:07 UTC
To: Feng Wu
Cc: Yang Z Zhang, Kevin Tian, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

>>> On 23.03.15 at 09:49, <feng.wu@intel.com> wrote:
> Why do we need care about the places where v->processor gets changed,
> my idea about this is:
>
> Before vCPU is blocked, we can get v->processor, and save the vCPU to
> this per-CPU list. Besides that v->processor is the destination of the
> notification
> event (it is stored in Posted-interrupt descriptor). So when wakeup
> notification event happens form this vCPU, it goes to pCPU v->processor,
> then in the wakeup notification event handler, we can find the list via
> smp_processor_id(), hence find the right vCPU to wake up.
>
> Do I miss something here?

Perhaps you don't, and perhaps I implied things I shouldn't have
implied: When v->processor changes, it would look to me that the
respective vCPU then ends up on the wrong list. If that's not a
problem - fine. Using per-CPU lists, however, would seem to make
it desirable to access those lists without lock, yet that can't work
when the list may get accessed from other than the owning CPU.

Jan

* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
From: Wu, Feng @ 2015-03-23 9:19 UTC
To: Jan Beulich
Cc: Zhang, Yang Z, Wu, Feng, Tian, Kevin, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, March 23, 2015 5:08 PM
>
> >>> On 23.03.15 at 09:49, <feng.wu@intel.com> wrote:
> > Do I miss something here?
>
> Perhaps you don't, and perhaps I implied things I shouldn't have
> implied: When v->processor changes, it would look to me that the
> respective vCPU then ends up on the wrong list. If that's not a
> problem - fine.

Yes, the vCPU is not moved to another list when v->processor is changed.

> Using per-CPU lists, however, would seem to make
> it desirable to access those lists without lock, yet that can't work
> when the list may get accessed from other than the owning CPU.

Yes, it needs to be accessed from other CPUs.

Thanks,
Feng

> Jan

* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
From: Tian, Kevin @ 2015-03-24 3:06 UTC
To: Wu, Feng, Jan Beulich
Cc: Zhang, Yang Z, Keir Fraser (keir@xen.org), xen-devel@lists.xen.org

> From: Wu, Feng
> Sent: Monday, March 23, 2015 5:19 PM
>
> > Perhaps you don't, and perhaps I implied things I shouldn't have
> > implied: When v->processor changes, it would look to me that the
> > respective vCPU then ends up on the wrong list. If that's not a
> > problem - fine.
>
> Yes, vCPU is not changed to another list when v->processor is changed.
>
> > Using per-CPU lists, however, would seem to make
> > it desirable to access those lists without lock, yet that can't work
> > when the list may get accessed from other than the owning CPU.
>
> Yes, it need to be accessed in other CPUs.
>

Then you do need some lock mechanism to avoid a race condition. Possibly
it's not a big problem if all the operations around this per-cpu list are
associated with scheduling logic, so we might leverage the scheduling lock
for the purpose.

Thanks
Kevin

* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
From: Wu, Feng @ 2015-03-24 3:19 UTC
To: Tian, Kevin, Jan Beulich
Cc: Zhang, Yang Z, Wu, Feng, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

> -----Original Message-----
> From: Tian, Kevin
> Sent: Tuesday, March 24, 2015 11:06 AM
>
> > Yes, it need to be accessed in other CPUs.
> >
>
> Then you do need some lock mechanism to avoid race condition. Possibly
> it's not a big problem if all the operations around this per-cpu list are
> associated with scheduling logic so we might leverage the scheduling lock
> for the purpose.
>

Thanks for the comments, Kevin! Yes, we need a spin lock to protect the
list. Besides the scheduling logic, we also need to access the list in the
wakeup notification event handler, which runs in interrupt context.

Thanks,
Feng

> Thanks
> Kevin

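The locking point raised in the exchange above can be illustrated with a small C sketch. It rests on assumptions that are not in the thread: the field `pi_block_cpu`, the type `struct pi_blocked_list`, and all function names are hypothetical, and a C11 `atomic_flag` stands in for a hypervisor spin lock. The idea shown is that the vCPU records which pCPU's list it joined, so that removal takes the right list's lock even if `v->processor` has changed in the meantime and the caller runs on a different CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PCPUS 4            /* hypothetical pCPU count for this model */

struct vcpu {
    int processor;            /* may change while the vCPU is blocked */
    int pi_block_cpu;         /* hypothetical: list the vCPU is on, -1 if none */
    struct vcpu *next;
};

struct pi_blocked_list {
    atomic_flag lock;         /* stand-in for a hypervisor spin lock */
    struct vcpu *head;
};

static struct pi_blocked_list pi_blocked[NR_PCPUS];

static void pi_lists_init(void)
{
    for (int i = 0; i < NR_PCPUS; i++) {
        atomic_flag_clear(&pi_blocked[i].lock);
        pi_blocked[i].head = NULL;
    }
}

static void list_lock(struct pi_blocked_list *l)
{
    /* Assumption: a real hypervisor would also disable interrupts here,
     * because the wakeup handler takes this lock in interrupt context. */
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ;                     /* spin until the flag was previously clear */
}

static void list_unlock(struct pi_blocked_list *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Insert the vCPU on the list of the pCPU it is blocking on. */
static void pi_block(struct vcpu *v)
{
    int cpu = v->processor;
    assert(cpu >= 0 && cpu < NR_PCPUS);
    struct pi_blocked_list *l = &pi_blocked[cpu];

    list_lock(l);
    v->next = l->head;
    l->head = v;
    v->pi_block_cpu = cpu;    /* remember the list, not v->processor */
    list_unlock(l);
}

/* Remove the vCPU from whichever list it was put on; callable from any
 * CPU. Returns false if the vCPU was not on a list. */
static bool pi_unblock(struct vcpu *v)
{
    int cpu = v->pi_block_cpu;
    if (cpu < 0)
        return false;
    struct pi_blocked_list *l = &pi_blocked[cpu];

    list_lock(l);
    for (struct vcpu **pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == v) {
            *pp = v->next;
            break;
        }
    }
    v->next = NULL;
    v->pi_block_cpu = -1;
    list_unlock(l);
    return true;
}

/* Count the entries on one pCPU's list (helper for inspection). */
static int pi_list_len(int cpu)
{
    int n = 0;
    struct pi_blocked_list *l = &pi_blocked[cpu];

    list_lock(l);
    for (struct vcpu *v = l->head; v != NULL; v = v->next)
        n++;
    list_unlock(l);
    return n;
}
```

Whether a dedicated per-list lock or the existing scheduling lock is the better choice is exactly the open question in the thread; the sketch takes the dedicated-lock route only because it is easier to show in isolation.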
* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
From: Wu, Feng @ 2015-03-25 5:10 UTC
To: xen-devel@lists.xen.org, Jan Beulich (JBeulich@suse.com)
Cc: Zhang, Yang Z, Wu, Feng, Tian, Kevin, Keir Fraser (keir@xen.org)

Hi Jan & other maintainers,

Do you think it is good for you guys to continue the review if I send out
a RFC patch for this feature?

Thanks,
Feng

> -----Original Message-----
> From: Wu, Feng
> Sent: Wednesday, March 18, 2015 8:44 PM
> To: xen-devel@lists.xen.org
> Cc: Keir Fraser (keir@xen.org); Jan Beulich (JBeulich@suse.com); Tian, Kevin;
> Zhang, Yang Z; Wu, Feng
> Subject: (v2) VT-d Posted-interrupt (PI) design for XEN

* Re: (v2) VT-d Posted-interrupt (PI) design for XEN
From: Jan Beulich @ 2015-03-25 10:33 UTC
To: Feng Wu
Cc: Yang Z Zhang, Kevin Tian, Keir Fraser (keir@xen.org),
xen-devel@lists.xen.org

>>> On 25.03.15 at 06:10, <feng.wu@intel.com> wrote:
> Do you think it is good for you guys to continue the review if I send out
> a RFC patch for this feature?

Yes, I think this would make sense.

Jan