* Enabling VT-d PI by default @ 2017-04-11 0:59 Chao Gao 2017-04-11 8:21 ` Jan Beulich 0 siblings, 1 reply; 15+ messages in thread From: Chao Gao @ 2017-04-11 0:59 UTC (permalink / raw) To: xen-devel, Jan Beulich Cc: George Dunlap, Andrew Cooper, Kevin Tian, Dario Faggioli Hello, Jan. As you know, with VT-d PI enabled, hardware can directly deliver external interrupts to a guest without any VMM intervention. This reduces overall interrupt latency to the guest and reduces the overhead otherwise incurred by the VMM for virtualizing interrupts. In my mind, it's an important feature for interrupt virtualization. But the VT-d PI feature is disabled by default on Xen because of some corner cases and bugs. Based on Feng's work, we have fixed those corner cases related to VT-d PI. Do you think it is time to enable VT-d PI by default? If not, could you list your concerns so that we can resolve them? Thanks Chao _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Enabling VT-d PI by default 2017-04-11 0:59 Enabling VT-d PI by default Chao Gao @ 2017-04-11 8:21 ` Jan Beulich 2017-04-16 20:13 ` Chao Gao 0 siblings, 1 reply; 15+ messages in thread From: Jan Beulich @ 2017-04-11 8:21 UTC (permalink / raw) To: Chao Gao Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel >>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote: > As you know, with VT-d PI enabled, hardware can directly deliver external > interrupts to guest without any VMM intervention. It will reduces overall > interrupt latency to guest and reduces overheads otherwise incurred by the > VMM for virtualizing interrupts. In my mind, it's an important feature to > interrupt virtualization. > > But VT-d PI feature is disabled by default on Xen for some corner > cases and bugs. Based on Feng's work, we have fixed those corner > cases related to VT-d PI. Do you think it is a time to enable VT-d PI by > default. If no, could you list your concerns so that we can resolve them? I don't recall you addressing the main issue (blocked vCPU-s list length; see the comment next to the iommu_intpost definition). Jan
* Re: Enabling VT-d PI by default 2017-04-11 8:21 ` Jan Beulich @ 2017-04-16 20:13 ` Chao Gao 2017-04-18 6:24 ` Tian, Kevin 2017-04-18 8:13 ` Jan Beulich 0 siblings, 2 replies; 15+ messages in thread From: Chao Gao @ 2017-04-16 20:13 UTC (permalink / raw) To: Jan Beulich Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote: >>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote: >> As you know, with VT-d PI enabled, hardware can directly deliver external >> interrupts to guest without any VMM intervention. It will reduces overall >> interrupt latency to guest and reduces overheads otherwise incurred by the >> VMM for virtualizing interrupts. In my mind, it's an important feature to >> interrupt virtualization. >> >> But VT-d PI feature is disabled by default on Xen for some corner >> cases and bugs. Based on Feng's work, we have fixed those corner >> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by >> default. If no, could you list your concerns so that we can resolve them? > >I don't recall you addressing the main issue (blocked vCPU-s list >length; see the comment next to the iommu_intpost definition). > Indeed. I have gone through the discussion that took place in April 2016 [1, 2]. [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-interrupt%20core%20logic%20handling;#422661 [2] https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20of%20the%20list%20depends;#422567. First of all, I admit this is an issue in extreme cases and we should come up with a solution. The problem we are facing is: There is a per-cpu list used to track all the blocked vCPUs on a pCPU. When a wakeup interrupt arrives, the interrupt handler traverses the list to wake the vCPUs whose pi_desc indicates an interrupt has been posted.
There is no policy restricting the size of the list, so in some extreme cases the list can grow long enough to cause problems (the most obvious being interrupt latency).

The theoretical maximum number of entries in the list is 4M, as one host can have 32k domains and every domain can have 128 vCPUs. If all the vCPUs are blocked on one list, the list reaches its theoretical maximum.

The root cause of this issue, I think, is that the wakeup interrupt vector is shared by all the vCPUs on one pCPU. Lacking enough information (such as which device sent the interrupt or which IRTE translated it), there is no effective way to identify the interrupt's destination vCPU except traversing this list. Right? So we can only mitigate this issue by decreasing or limiting the maximum number of entries in one list.

Several methods we can take to mitigate this issue:

1. According to your discussions, evenly distributing all the blocked vCPUs among all pCPUs can mitigate this issue. This approach avoids having all vCPUs blocked on one list. It can decrease the maximum number of entries in one list by a factor of N (N is the number of pCPUs).

2. Don't put blocked vCPUs that won't be woken by the wakeup interrupt into the per-cpu list. Currently, we put the blocked vCPUs belonging to domains that have assigned devices into the list. But if a blocked vCPU of such a domain is not the destination of any posted-format IRTE, it needn't be added to the per-cpu list. That vCPU will be woken by IPIs or other virtual interrupts. In this way, we can decrease the number of entries in the per-cpu list.

3. As we do in struct irq_guest_action_t, we could limit the maximum number of entries supported in the list. With this approach, during domain creation, we calculate the available entries and compare them with the domain's vCPU count to decide whether the domain can use VT-d PI. This method imposes a strict limit on the maximum number of entries in one list. But it may affect vCPU hotplug.
According to your intuition, which methods are feasible and acceptable? I will attempt to mitigate this issue per your advice. Thanks Chao
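To make the problem in the message above concrete, here is a minimal C sketch of a per-pCPU blocked list and a wakeup handler that has to walk all of it. Everything here is an assumption for illustration: the type and function names (pi_desc_sketch, vcpu_sketch, pcpu_sketch, pi_block, pi_wakeup, max_list_entries) are hypothetical, and the structures are simplified stand-ins rather than Xen's real code, which also needs locking that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not Xen's real structures. */
struct pi_desc_sketch {
    bool on;                    /* set when an interrupt has been posted */
};

struct vcpu_sketch {
    struct pi_desc_sketch pi;
    bool blocked;
    struct vcpu_sketch *next;   /* link in the per-pCPU blocked list */
};

struct pcpu_sketch {
    struct vcpu_sketch *blocked_head;
};

/* Put a vCPU on the pCPU's blocked list when it blocks. */
void pi_block(struct pcpu_sketch *p, struct vcpu_sketch *v)
{
    v->blocked = true;
    v->next = p->blocked_head;
    p->blocked_head = v;
}

/*
 * Wakeup handler: the shared vector does not say which vCPU was targeted,
 * so every entry is visited. Entries with a posted interrupt are unlinked
 * and marked runnable. Returns the number of entries visited, which is the
 * cost the thread is worried about; *woken receives the number woken.
 */
size_t pi_wakeup(struct pcpu_sketch *p, size_t *woken)
{
    size_t visited = 0;
    struct vcpu_sketch **link = &p->blocked_head;

    *woken = 0;
    while (*link != NULL) {
        struct vcpu_sketch *v = *link;

        visited++;
        if (v->pi.on) {
            *link = v->next;    /* unlink; 'link' already points at successor */
            v->next = NULL;
            v->blocked = false;
            (*woken)++;
        } else {
            link = &v->next;
        }
    }
    return visited;
}

/*
 * Worst-case entries per list if all vCPUs block and are spread evenly
 * over npcpus lists (method 1); npcpus == 1 gives the single-list bound.
 */
unsigned long max_list_entries(unsigned long domains,
                               unsigned long vcpus_per_domain,
                               unsigned long npcpus)
{
    assert(npcpus > 0);
    return (domains * vcpus_per_domain + npcpus - 1) / npcpus;
}
```

With the figures from the message (32768 domains of 128 vCPUs each), max_list_entries gives 4194304 entries for a single list and 65536 per list when spread evenly over 64 pCPUs. The 64-pCPU host is only an example value.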
* Re: Enabling VT-d PI by default 2017-04-16 20:13 ` Chao Gao @ 2017-04-18 6:24 ` Tian, Kevin 2017-04-17 23:57 ` Chao Gao 2017-04-26 17:11 ` George Dunlap 2017-04-18 8:13 ` Jan Beulich 1 sibling, 2 replies; 15+ messages in thread From: Tian, Kevin @ 2017-04-18 6:24 UTC (permalink / raw) To: Gao, Chao, Jan Beulich Cc: George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel@lists.xen.org > From: Gao, Chao > Sent: Monday, April 17, 2017 4:14 AM > > On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote: > >>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote: > >> As you know, with VT-d PI enabled, hardware can directly deliver external > >> interrupts to guest without any VMM intervention. It will reduces overall > >> interrupt latency to guest and reduces overheads otherwise incurred by > the > >> VMM for virtualizing interrupts. In my mind, it's an important feature to > >> interrupt virtualization. > >> > >> But VT-d PI feature is disabled by default on Xen for some corner > >> cases and bugs. Based on Feng's work, we have fixed those corner > >> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by > >> default. If no, could you list your concerns so that we can resolve them? > > > >I don't recall you addressing the main issue (blocked vCPU-s list > >length; see the comment next to the iommu_intpost definition). > > > > Indeed. I have gone through the discussion happened in April 2016[1, 2]. > [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted- > interrupt%20core%20logic%20handling;#422661 > [2] > https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o > f%20the%20list%20depends;#422567. > > First of all, I admit this is an issue in extreme case and we should > come up with a solution. > > The problem we are facing is: > There is a per-cpu list used to maintain all the blocked vCPU on a > pCPU. 
When a wakeup interrupt comes, the interrupt handler travels > the list to wake the vCPUs whose pi_desc indicates an interrupt has > been posted. There is no policy to restrict the size of the list such > that in some extreme case, the list can be too long to cause some > issues (the most obvious issue is about interrupt latency). > > The theoretical max number of entry in the list is 4M as one host can > have 32k domains and every domain can have 128vCPU. If all the vCPUs > are blocked in one list, the list gets its theoretical maximum. > > The root cause of this issue, I think, is that the wakeup interrupt > vector is shared by all the vCPUs on one pCPU. Lacking of enough > information (such as which device sends or which IRTE translates this > interrupt), there is no effective method to distinguish the > interrupt's destination vCPU except traveling this list. Right? So we > only can mitigate this issue through decreasing or limiting the > entry's maximum in one list. > > Several methods we can take to mitigate this issue: > 1. According to your discussions, evenly distributing all the blocked > vCPUs among all pCPUs can mitigate this issue. With this approach, all > vCPUs are blocked in one list can be avoided. It can decrease the > entry's maximum in one list by N times (N is the number of pCPU). > > 2. Don't put the blocked vCPUs which won't be woken by the wakeup > interrupt into the per-cpu list. Currently, we put the blocked vCPUs > belong to domains who have assigned devices into the list. But if one > blocked vCPU of such domain is not a destination of every posted > format IRTE, it needn't be added to the per-cpu list. The blocked vCPU > will be woken by IPIs or other virtual interrupts. From this aspect, we > can decrease the entries in the per-cpu list. > > 3. Like what we do in struct irq_guest_action_t, can we limit the > maximum of entry we support in the list. 
With this approach, during > domain creation, we calculate the available entries and compare with > the domain's vCPU number to decide whether the domain can use VT-d PI. VT-d PI is global instead of per-domain. I guess you actually mean failing device assignment operation if counting new domain's #VCPUs exceeds the limitation. > This method will pose a strict restriction to the maximum of entry in > one list. But it may affect vCPU hotplug. > > According to your intuition, which methods are feasible and > acceptable? I will attempt to mitigate this issue per your advices. > My understanding is that we need them all. #1 is the baseline, with #2/#3 as further optimization. :-) Thanks Kevin
* Re: Enabling VT-d PI by default 2017-04-18 6:24 ` Tian, Kevin @ 2017-04-17 23:57 ` Chao Gao 2017-04-26 17:11 ` George Dunlap 1 sibling, 0 replies; 15+ messages in thread From: Chao Gao @ 2017-04-17 23:57 UTC (permalink / raw) To: Tian, Kevin Cc: George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel@lists.xen.org, Jan Beulich On Tue, Apr 18, 2017 at 02:24:05PM +0800, Tian, Kevin wrote: >> From: Gao, Chao >> Sent: Monday, April 17, 2017 4:14 AM >> >> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote: >> >>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote: >> 3. Like what we do in struct irq_guest_action_t, can we limit the >> maximum of entry we support in the list. With this approach, during >> domain creation, we calculate the available entries and compare with >> the domain's vCPU number to decide whether the domain can use VT-d PI. > >VT-d PI is global instead of per-domain. I guess you actually mean >failing device assignment operation if counting new domain's #VCPUs >exceeds the limitation. I mostly agree. But I think device assignment can still be allowed in that case. We would just prevent the newly created domain from using VT-d PI. > >> This method will pose a strict restriction to the maximum of entry in >> one list. But it may affect vCPU hotplug. >> >> According to your intuition, which methods are feasible and >> acceptable? I will attempt to mitigate this issue per your advices. >> > >My understanding is that we need them all. #1 is the baseline, >with #2/#3 as further optimization. :-) Thanks for your input. I will have a try. Thanks Chao
* Re: Enabling VT-d PI by default 2017-04-18 6:24 ` Tian, Kevin 2017-04-17 23:57 ` Chao Gao @ 2017-04-26 17:11 ` George Dunlap 2017-04-27 7:08 ` Jan Beulich 1 sibling, 1 reply; 15+ messages in thread From: George Dunlap @ 2017-04-26 17:11 UTC (permalink / raw) To: Tian, Kevin, Gao, Chao, Jan Beulich Cc: George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel@lists.xen.org On 18/04/17 07:24, Tian, Kevin wrote: >> From: Gao, Chao >> Sent: Monday, April 17, 2017 4:14 AM >> >> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote: >>>>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote: >>>> As you know, with VT-d PI enabled, hardware can directly deliver external >>>> interrupts to guest without any VMM intervention. It will reduces overall >>>> interrupt latency to guest and reduces overheads otherwise incurred by >> the >>>> VMM for virtualizing interrupts. In my mind, it's an important feature to >>>> interrupt virtualization. >>>> >>>> But VT-d PI feature is disabled by default on Xen for some corner >>>> cases and bugs. Based on Feng's work, we have fixed those corner >>>> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by >>>> default. If no, could you list your concerns so that we can resolve them? >>> >>> I don't recall you addressing the main issue (blocked vCPU-s list >>> length; see the comment next to the iommu_intpost definition). >>> >> >> Indeed. I have gone through the discussion happened in April 2016[1, 2]. >> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted- >> interrupt%20core%20logic%20handling;#422661 >> [2] >> https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o >> f%20the%20list%20depends;#422567. >> >> First of all, I admit this is an issue in extreme case and we should >> come up with a solution. >> >> The problem we are facing is: >> There is a per-cpu list used to maintain all the blocked vCPU on a >> pCPU. 
When a wakeup interrupt comes, the interrupt handler travels >> the list to wake the vCPUs whose pi_desc indicates an interrupt has >> been posted. There is no policy to restrict the size of the list such >> that in some extreme case, the list can be too long to cause some >> issues (the most obvious issue is about interrupt latency). >> >> The theoretical max number of entry in the list is 4M as one host can >> have 32k domains and every domain can have 128vCPU. If all the vCPUs >> are blocked in one list, the list gets its theoretical maximum. >> >> The root cause of this issue, I think, is that the wakeup interrupt >> vector is shared by all the vCPUs on one pCPU. Lacking of enough >> information (such as which device sends or which IRTE translates this >> interrupt), there is no effective method to distinguish the >> interrupt's destination vCPU except traveling this list. Right? So we >> only can mitigate this issue through decreasing or limiting the >> entry's maximum in one list. >> >> Several methods we can take to mitigate this issue: >> 1. According to your discussions, evenly distributing all the blocked >> vCPUs among all pCPUs can mitigate this issue. With this approach, all >> vCPUs are blocked in one list can be avoided. It can decrease the >> entry's maximum in one list by N times (N is the number of pCPU). >> >> 2. Don't put the blocked vCPUs which won't be woken by the wakeup >> interrupt into the per-cpu list. Currently, we put the blocked vCPUs >> belong to domains who have assigned devices into the list. But if one >> blocked vCPU of such domain is not a destination of every posted >> format IRTE, it needn't be added to the per-cpu list. The blocked vCPU >> will be woken by IPIs or other virtual interrupts. From this aspect, we >> can decrease the entries in the per-cpu list. >> >> 3. Like what we do in struct irq_guest_action_t, can we limit the >> maximum of entry we support in the list. 
With this approach, during >> domain creation, we calculate the available entries and compare with >> the domain's vCPU number to decide whether the domain can use VT-d PI. > > VT-d PI is global instead of per-domain. I guess you actually mean > failing device assignment operation if counting new domain's #VCPUs > exceeds the limitation. > >> This method will pose a strict restriction to the maximum of entry in >> one list. But it may affect vCPU hotplug. >> >> According to your intuition, which methods are feasible and >> acceptable? I will attempt to mitigate this issue per your advices. >> > > My understanding is that we need them all. #1 is the baseline, > with #2/#3 as further optimization. :-) Actually, regarding #2, is that the case? If we do reference counting (as in patches 3 and 4 of Chao Gao's recent series), then we are guaranteed never to have more vcpus on any given wakeup list than there are machine IRQs on the system. Are we ever going to have a system with so many IRQs that going through such a list would be problematic? -George
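The reference-counting idea mentioned above can be sketched roughly as follows in C. This is only an illustration under stated assumptions: it has not been checked against the patches referred to, and the names (vcpu_ref_sketch, irte_bind, irte_unbind, vcpu_block, block_all) are hypothetical. Each vCPU counts the posted-format IRTEs that target it, and a blocking vCPU is queued on a wakeup list only when that count is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU state; not taken from the actual patch series. */
struct vcpu_ref_sketch {
    unsigned int pi_irte_refs;  /* posted-format IRTEs targeting this vCPU */
    bool on_wakeup_list;
};

/* A posted-format IRTE is (re)programmed to target this vCPU. */
void irte_bind(struct vcpu_ref_sketch *v)
{
    v->pi_irte_refs++;
}

/* An IRTE stops targeting this vCPU. */
void irte_unbind(struct vcpu_ref_sketch *v)
{
    assert(v->pi_irte_refs > 0);
    v->pi_irte_refs--;
}

/* Called when the vCPU blocks; returns whether it was queued for wakeup. */
bool vcpu_block(struct vcpu_ref_sketch *v)
{
    v->on_wakeup_list = (v->pi_irte_refs > 0);
    return v->on_wakeup_list;
}

/* Block all n vCPUs and count how many end up on a wakeup list. */
unsigned int block_all(struct vcpu_ref_sketch *vcpus, unsigned int n)
{
    unsigned int queued = 0;

    for (unsigned int i = 0; i < n; i++)
        if (vcpu_block(&vcpus[i]))
            queued++;
    return queued;
}
```

Since a vCPU is queued only if at least one IRTE references it, the number of queued vCPUs in this model can never exceed the number of bound IRTEs, which is the bound described in the message.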
* Re: Enabling VT-d PI by default 2017-04-26 17:11 ` George Dunlap @ 2017-04-27 7:08 ` Jan Beulich 2017-05-12 11:05 ` Andrew Cooper 0 siblings, 1 reply; 15+ messages in thread From: Jan Beulich @ 2017-04-27 7:08 UTC (permalink / raw) To: George Dunlap Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel@lists.xen.org, Chao Gao >>> On 26.04.17 at 19:11, <george.dunlap@citrix.com> wrote: > On 18/04/17 07:24, Tian, Kevin wrote: >>> From: Gao, Chao >>> Sent: Monday, April 17, 2017 4:14 AM >>> >>> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote: >>>>>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote: >>>>> As you know, with VT-d PI enabled, hardware can directly deliver external >>>>> interrupts to guest without any VMM intervention. It will reduces overall >>>>> interrupt latency to guest and reduces overheads otherwise incurred by >>> the >>>>> VMM for virtualizing interrupts. In my mind, it's an important feature to >>>>> interrupt virtualization. >>>>> >>>>> But VT-d PI feature is disabled by default on Xen for some corner >>>>> cases and bugs. Based on Feng's work, we have fixed those corner >>>>> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by >>>>> default. If no, could you list your concerns so that we can resolve them? >>>> >>>> I don't recall you addressing the main issue (blocked vCPU-s list >>>> length; see the comment next to the iommu_intpost definition). >>>> >>> >>> Indeed. I have gone through the discussion happened in April 2016[1, 2]. >>> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted- >>> interrupt%20core%20logic%20handling;#422661 >>> [2] >>> https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o >>> f%20the%20list%20depends;#422567. >>> >>> First of all, I admit this is an issue in extreme case and we should >>> come up with a solution. 
>>> >>> The problem we are facing is: >>> There is a per-cpu list used to maintain all the blocked vCPU on a >>> pCPU. When a wakeup interrupt comes, the interrupt handler travels >>> the list to wake the vCPUs whose pi_desc indicates an interrupt has >>> been posted. There is no policy to restrict the size of the list such >>> that in some extreme case, the list can be too long to cause some >>> issues (the most obvious issue is about interrupt latency). >>> >>> The theoretical max number of entry in the list is 4M as one host can >>> have 32k domains and every domain can have 128vCPU. If all the vCPUs >>> are blocked in one list, the list gets its theoretical maximum. >>> >>> The root cause of this issue, I think, is that the wakeup interrupt >>> vector is shared by all the vCPUs on one pCPU. Lacking of enough >>> information (such as which device sends or which IRTE translates this >>> interrupt), there is no effective method to distinguish the >>> interrupt's destination vCPU except traveling this list. Right? So we >>> only can mitigate this issue through decreasing or limiting the >>> entry's maximum in one list. >>> >>> Several methods we can take to mitigate this issue: >>> 1. According to your discussions, evenly distributing all the blocked >>> vCPUs among all pCPUs can mitigate this issue. With this approach, all >>> vCPUs are blocked in one list can be avoided. It can decrease the >>> entry's maximum in one list by N times (N is the number of pCPU). >>> >>> 2. Don't put the blocked vCPUs which won't be woken by the wakeup >>> interrupt into the per-cpu list. Currently, we put the blocked vCPUs >>> belong to domains who have assigned devices into the list. But if one >>> blocked vCPU of such domain is not a destination of every posted >>> format IRTE, it needn't be added to the per-cpu list. The blocked vCPU >>> will be woken by IPIs or other virtual interrupts. From this aspect, we >>> can decrease the entries in the per-cpu list. >>> >>> 3. 
Like what we do in struct irq_guest_action_t, can we limit the >>> maximum of entry we support in the list. With this approach, during >>> domain creation, we calculate the available entries and compare with >>> the domain's vCPU number to decide whether the domain can use VT-d PI. >> >> VT-d PI is global instead of per-domain. I guess you actually mean >> failing device assignment operation if counting new domain's #VCPUs >> exceeds the limitation. >> >>> This method will pose a strict restriction to the maximum of entry in >>> one list. But it may affect vCPU hotplug. >>> >>> According to your intuition, which methods are feasible and >>> acceptable? I will attempt to mitigate this issue per your advices. >>> >> >> My understanding is that we need them all. #1 is the baseline, >> with #2/#3 as further optimization. :-) > > Actually, regarding #2, is that the case? > > If we do reference counting (as in patches 3 and 4 of Chao Gao's recent > series), then we are guaranteed never to have more vcpus on any given > wakeup list than there are machine IRQs on the system. Are we ever > going to have a system with so many IRQs that going through such a list > would be problematic? I'm afraid this is not impossible, considering that people have already run into the interrupt vector limitation coming from there only being about 200 vectors per CPU (and there not being, in physical mode, any sharing of vectors between multiple CPUs, iirc). Devices using namely MSI-X can use an awful lot of vectors. Perhaps Andrew remembers numbers observed on actual systems here... Jan
* Re: Enabling VT-d PI by default 2017-04-27 7:08 ` Jan Beulich @ 2017-05-12 11:05 ` Andrew Cooper 2017-05-15 10:27 ` George Dunlap 0 siblings, 1 reply; 15+ messages in thread From: Andrew Cooper @ 2017-05-12 11:05 UTC (permalink / raw) To: Jan Beulich, George Dunlap Cc: Kevin Tian, George Dunlap, Dario Faggioli, xen-devel@lists.xen.org, Chao Gao On 27/04/17 08:08, Jan Beulich wrote: >>>> On 26.04.17 at 19:11, <george.dunlap@citrix.com> wrote: >> On 18/04/17 07:24, Tian, Kevin wrote: >>>> From: Gao, Chao >>>> Sent: Monday, April 17, 2017 4:14 AM >>>> >>>> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote: >>>>>>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote: >>>>>> As you know, with VT-d PI enabled, hardware can directly deliver external >>>>>> interrupts to guest without any VMM intervention. It will reduces overall >>>>>> interrupt latency to guest and reduces overheads otherwise incurred by >>>> the >>>>>> VMM for virtualizing interrupts. In my mind, it's an important feature to >>>>>> interrupt virtualization. >>>>>> >>>>>> But VT-d PI feature is disabled by default on Xen for some corner >>>>>> cases and bugs. Based on Feng's work, we have fixed those corner >>>>>> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by >>>>>> default. If no, could you list your concerns so that we can resolve them? >>>>> I don't recall you addressing the main issue (blocked vCPU-s list >>>>> length; see the comment next to the iommu_intpost definition). >>>>> >>>> Indeed. I have gone through the discussion happened in April 2016[1, 2]. >>>> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted- >>>> interrupt%20core%20logic%20handling;#422661 >>>> [2] >>>> https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o >>>> f%20the%20list%20depends;#422567. >>>> >>>> First of all, I admit this is an issue in extreme case and we should >>>> come up with a solution. 
>>>> >>>> The problem we are facing is: >>>> There is a per-cpu list used to maintain all the blocked vCPU on a >>>> pCPU. When a wakeup interrupt comes, the interrupt handler travels >>>> the list to wake the vCPUs whose pi_desc indicates an interrupt has >>>> been posted. There is no policy to restrict the size of the list such >>>> that in some extreme case, the list can be too long to cause some >>>> issues (the most obvious issue is about interrupt latency). >>>> >>>> The theoretical max number of entry in the list is 4M as one host can >>>> have 32k domains and every domain can have 128vCPU. If all the vCPUs >>>> are blocked in one list, the list gets its theoretical maximum. >>>> >>>> The root cause of this issue, I think, is that the wakeup interrupt >>>> vector is shared by all the vCPUs on one pCPU. Lacking of enough >>>> information (such as which device sends or which IRTE translates this >>>> interrupt), there is no effective method to distinguish the >>>> interrupt's destination vCPU except traveling this list. Right? So we >>>> only can mitigate this issue through decreasing or limiting the >>>> entry's maximum in one list. >>>> >>>> Several methods we can take to mitigate this issue: >>>> 1. According to your discussions, evenly distributing all the blocked >>>> vCPUs among all pCPUs can mitigate this issue. With this approach, all >>>> vCPUs are blocked in one list can be avoided. It can decrease the >>>> entry's maximum in one list by N times (N is the number of pCPU). >>>> >>>> 2. Don't put the blocked vCPUs which won't be woken by the wakeup >>>> interrupt into the per-cpu list. Currently, we put the blocked vCPUs >>>> belong to domains who have assigned devices into the list. But if one >>>> blocked vCPU of such domain is not a destination of every posted >>>> format IRTE, it needn't be added to the per-cpu list. The blocked vCPU >>>> will be woken by IPIs or other virtual interrupts. 
From this aspect, we >>>> can decrease the entries in the per-cpu list. >>>> >>>> 3. Like what we do in struct irq_guest_action_t, can we limit the >>>> maximum of entry we support in the list. With this approach, during >>>> domain creation, we calculate the available entries and compare with >>>> the domain's vCPU number to decide whether the domain can use VT-d PI. >>> VT-d PI is global instead of per-domain. I guess you actually mean >>> failing device assignment operation if counting new domain's #VCPUs >>> exceeds the limitation. >>> >>>> This method will pose a strict restriction to the maximum of entry in >>>> one list. But it may affect vCPU hotplug. >>>> >>>> According to your intuition, which methods are feasible and >>>> acceptable? I will attempt to mitigate this issue per your advices. >>>> >>> My understanding is that we need them all. #1 is the baseline, >>> with #2/#3 as further optimization. :-) >> Actually, regarding #2, is that the case? >> >> If we do reference counting (as in patches 3 and 4 of Chao Gao's recent >> series), then we are guaranteed never to have more vcpus on any given >> wakeup list than there are machine IRQs on the system. Are we ever >> going to have a system with so many IRQs that going through such a list >> would be problematic? > I'm afraid this is not impossible, considering that people have already > run into the interrupt vector limitation coming from there only being > about 200 vectors per CPU (and there not being, in physical mode, > any sharing of vectors between multiple CPUs, iirc). Devices using > namely MSI-X can use an awful lot of vectors. Perhaps Andrew > remembers numbers observed on actual systems here... Citrix NetScaler SDX boxes have more MSI-X interrupts than fit in the cumulative IDTs of a top-end dual-socket Xeon server system. Some of the device drivers are purposefully modelled to use fewer interrupts than they otherwise would want to.
Using PI is the proper solution long term, because doing so would remove any need to allocate IDT vectors for the interrupts; the IOMMU could be programmed to dump device vectors straight into the PI block without them ever going through Xen's IDT. However, fixing that requires rewriting Xen's interrupt remapping handling so it doesn't rewrite the cpu/vector in every interrupt source, and only rewrites the interrupt remapping table. ~Andrew
* Re: Enabling VT-d PI by default 2017-05-12 11:05 ` Andrew Cooper @ 2017-05-15 10:27 ` George Dunlap 2017-05-15 13:35 ` Andrew Cooper 0 siblings, 1 reply; 15+ messages in thread From: George Dunlap @ 2017-05-15 10:27 UTC (permalink / raw) To: Andrew Cooper Cc: Kevin Tian, Chao Gao, Dario Faggioli, Jan Beulich, xen-devel@lists.xen.org On Fri, May 12, 2017 at 12:05 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > Citrix Netscalar SDX boxes have more MSI-X interrupts than fit in the > cumulative IDTs of a top end dual-socket Xeon server systems. Some of > the device drivers are purposefully modelled to use fewer interrupts > than they otherwise would want to. > > Using PI is the proper solution longterm, because doing so would remove > any need to allocate IDT vectors for the interrupts; the IOMMU could be > programmed to dump device vectors straight into the PI block without > them ever going through Xen's IDT. I wouldn't necessarily call that a "proper" solution. With PI, instead of an interrupt telling you exactly which VM to wake up and/or which routine you need to run, you have to search through (potentially) thousands of entries to see which vcpu the interrupt you received wanted to wake up; and you need to do that on every single interrupt. (Obviously it does have the advantage that if the vcpu happens to be running, Xen doesn't get an interrupt at all.) -George
* Re: Enabling VT-d PI by default 2017-05-15 10:27 ` George Dunlap @ 2017-05-15 13:35 ` Andrew Cooper 2017-05-15 14:32 ` George Dunlap 0 siblings, 1 reply; 15+ messages in thread From: Andrew Cooper @ 2017-05-15 13:35 UTC (permalink / raw) To: George Dunlap Cc: Kevin Tian, Chao Gao, Dario Faggioli, Jan Beulich, xen-devel@lists.xen.org On 15/05/17 11:27, George Dunlap wrote: > On Fri, May 12, 2017 at 12:05 PM, Andrew Cooper > <andrew.cooper3@citrix.com> wrote: >> Citrix Netscalar SDX boxes have more MSI-X interrupts than fit in the >> cumulative IDTs of a top end dual-socket Xeon server systems. Some of >> the device drivers are purposefully modelled to use fewer interrupts >> than they otherwise would want to. >> >> Using PI is the proper solution longterm, because doing so would remove >> any need to allocate IDT vectors for the interrupts; the IOMMU could be >> programmed to dump device vectors straight into the PI block without >> them ever going through Xen's IDT. > I wouldn't necessarily call that a "proper" solution. With PI, instead > of an interrupt telling you exactly which VM to wake up and/or which > routine you need to run, instead you have to search through > (potentially) thousands of entries to see which vcpu the interrupt you > received wanted to wake up; and you need to do that on every single > interrupt. (Obviously it does have the advantage that if the vcpu > happens to be running Xen doesn't get an interrupt at all.) Having spoken to the PI architects, this is not how the technology was designed to be used. On systems with this number of in-flight interrupts, trying to track "who got what interrupt" for priority boosting purposes is a waste of time, as we spend ages taking vmexits to process interrupt notifications for out-of-context vcpus. 
The way the PI architects envisaged the technology being used is that Suppress Notification is set at all points other than executing in non-root mode for the vcpu in question (there is a small race window around clearing SN on vmentry), and that the scheduler uses Outstanding Notification on each of the PI blocks when it rebalances credit to see which vcpus have had interrupts in the last 30ms.

This current behaviour of leaving SN clear until an interrupt arrives is devastating for performance, especially in combination with the 3-step mechanism Xen uses to rewrite the interrupt source information, which pretty much guarantees that interrupts arrive on the wrong pcpu (unless strict pinning is in effect).

~Andrew
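For readers following the thread, the usage model described in the message above can be sketched as a toy state machine. This is only an illustration under stated assumptions: `struct vcpu_pi`, `pi_descheduled()`, `pi_vmentry()`, `pi_post()` and `sched_collect_pending()` are hypothetical names, not Xen or hardware interfaces; the two booleans merely stand in for the Suppress Notification bit and for "an interrupt was posted since the last scheduler pass"; and the race window around vmentry mentioned above is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one vcpu's posted-interrupt state (hypothetical names). */
struct vcpu_pi {
    bool suppress; /* models the Suppress Notification bit */
    bool pending;  /* models "an interrupt was posted since the last pass" */
};

/* The vcpu stops executing in non-root mode: suppress notifications. */
static void pi_descheduled(struct vcpu_pi *v)
{
    v->suppress = true;
}

/* The vcpu is about to enter non-root mode: allow notifications again. */
static void pi_vmentry(struct vcpu_pi *v)
{
    v->suppress = false;
}

/*
 * Model of a device interrupt being posted to this vcpu.
 * Returns true if a notification event would be generated,
 * false if it is suppressed and only recorded as pending.
 */
static bool pi_post(struct vcpu_pi *v)
{
    v->pending = true;
    return !v->suppress;
}

/*
 * Periodic scheduler pass: record which vcpus had interrupts since the
 * previous pass, clear the flags, and return how many such vcpus there were.
 */
static size_t sched_collect_pending(struct vcpu_pi *vcpus, size_t n,
                                    bool *had_irq)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        had_irq[i] = vcpus[i].pending;
        if (vcpus[i].pending) {
            vcpus[i].pending = false;
            count++;
        }
    }
    return count;
}
```

In this model an interrupt for a descheduled vcpu causes no notification at all and is only noticed at the next periodic pass, which is exactly the latency trade-off the rest of the thread argues about.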
* Re: Enabling VT-d PI by default 2017-05-15 13:35 ` Andrew Cooper @ 2017-05-15 14:32 ` George Dunlap 2017-05-16 11:52 ` Dario Faggioli 0 siblings, 1 reply; 15+ messages in thread From: George Dunlap @ 2017-05-15 14:32 UTC (permalink / raw) To: Andrew Cooper Cc: Kevin Tian, xen-devel@lists.xen.org, Dario Faggioli, Jan Beulich, Chao Gao

On Mon, May 15, 2017 at 2:35 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 15/05/17 11:27, George Dunlap wrote:
>> On Fri, May 12, 2017 at 12:05 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> Citrix NetScaler SDX boxes have more MSI-X interrupts than fit in the cumulative IDTs of a top-end dual-socket Xeon server system. Some of the device drivers are purposefully modelled to use fewer interrupts than they otherwise would want to.
>>>
>>> Using PI is the proper solution long term, because doing so would remove any need to allocate IDT vectors for the interrupts; the IOMMU could be programmed to dump device vectors straight into the PI block without them ever going through Xen's IDT.
>> I wouldn't necessarily call that a "proper" solution. With PI, instead of an interrupt telling you exactly which VM to wake up and/or which routine you need to run, you have to search through (potentially) thousands of entries to see which vcpu the interrupt you received wanted to wake up; and you need to do that on every single interrupt. (Obviously it does have the advantage that if the vcpu happens to be running, Xen doesn't get an interrupt at all.)
>
> Having spoken to the PI architects, this is not how the technology was designed to be used.
>
> On systems with this number of in-flight interrupts, trying to track "who got what interrupt" for priority boosting purposes is a waste of time, as we spend ages taking vmexits to process interrupt notifications for out-of-context vcpus.
>
> The way the PI architects envisaged the technology being used is that Suppress Notification is set at all points other than executing in non-root mode for the vcpu in question (there is a small race window around clearing SN on vmentry), and that the scheduler uses Outstanding Notification on each of the PI blocks when it rebalances credit to see which vcpus have had interrupts in the last 30ms.

It sounds like they may have made the mistake that the Credit1 designers made, in analyzing only a system that was overloaded, and one where all workloads were identical, as opposed to analyzing a system that was at least sometimes partially loaded, and where workloads were very different.

You're right that if you weren't going to preempt the currently running vcpu anyway, there's no need for Xen to get the interrupt. But it should be obvious that on a system that's idle (even for a relatively short amount of time) we want to get the interrupt and wake up the appropriate vcpu immediately.

It should also be obvious that in a mixed workload, where one vcpu is doing tons of computation and another is mainly handling interrupts quickly and going to sleep again, we would want Xen to check at regular intervals whether it should run the vcpu that's mostly handling interrupts. We generally wouldn't want to delay waking up the lower-priority vcpu longer than 1ms.

In both cases, waiting 30ms to see if we should wake somebody up is far too long.

-George
* Re: Enabling VT-d PI by default 2017-05-15 14:32 ` George Dunlap @ 2017-05-16 11:52 ` Dario Faggioli 0 siblings, 0 replies; 15+ messages in thread From: Dario Faggioli @ 2017-05-16 11:52 UTC (permalink / raw) To: George Dunlap, Andrew Cooper Cc: Kevin Tian, xen-devel@lists.xen.org, Jan Beulich, Chao Gao

On Mon, 2017-05-15 at 15:32 +0100, George Dunlap wrote:
> On Mon, May 15, 2017 at 2:35 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > On systems with this number of in-flight interrupts, trying to track "who got what interrupt" for priority boosting purposes is a waste of time, as we spend ages taking vmexits to process interrupt notifications for out-of-context vcpus.
> >
> > The way the PI architects envisaged the technology being used is that Suppress Notification is set at all points other than executing in non-root mode for the vcpu in question (there is a small race window around clearing SN on vmentry), and that the scheduler uses Outstanding Notification on each of the PI blocks when it rebalances credit to see which vcpus have had interrupts in the last 30ms.
>
> It sounds like they may have made the mistake that the Credit1 designers made, in analyzing only a system that was overloaded; and one where all workloads were identical, as opposed to analyzing a system that was at least sometimes partially loaded, and where workloads were very different.
>
Totally agree. Also, I'm not sure I follow why PI architects would be basing hardware design on specific characteristics of a particular Xen scheduler. E.g., in Linux --which I'd think they also had in mind when envisioning uses of the technology-- there is no such thing as a 30ms timeslice, nor credit redistribution.
And AFAICU, what you seem to suggest (not notifying an interrupt / not waking anyone up at the time it happens) means there must be some kind of list_for_each_vcpu() anyway, for checking which vCPUs have pending notifications. Hence the problem we're discussing here would just be moved between subsystems, rather than going away.

And, finally, I don't get what you mean when you say that we're trying to use PI "for priority boosting purposes". I don't think we do that.

FTR, I've quickly checked how this is done in Linux, and the solution pushed there looks really similar to the one that has been pushed to Xen as well. E.g., there too, the handler scans the blocked vCPUs list:

http://elixir.free-electrons.com/linux/latest/source/arch/x86/kvm/vmx.c#L6464

> In both cases, waiting 30ms to see if we should wake somebody up is far too long.
>
Absolutely!

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
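The blocked-vCPU list scan mentioned in the message above can be illustrated with a minimal sketch. It is not the Xen or KVM handler: `struct blocked_vcpu`, `struct pcpu_blocked_list`, `blocked_list_add()` and `pi_wakeup_scan()` are hypothetical names, the `on` flag only models the Outstanding Notification bit, and locking is omitted. The point it tries to make is that every wakeup notification costs a walk over the whole per-pCPU list, which is why the list length matters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One blocked vcpu on a per-pCPU list (hypothetical model). */
struct blocked_vcpu {
    int id;
    bool on;     /* models the Outstanding Notification bit */
    bool woken;  /* set once the scan has "unblocked" this vcpu */
    struct blocked_vcpu *next;
};

struct pcpu_blocked_list {
    struct blocked_vcpu *head;
};

/* Put a vcpu that is about to block at the front of the list. */
static void blocked_list_add(struct pcpu_blocked_list *l,
                             struct blocked_vcpu *v)
{
    v->next = l->head;
    l->head = v;
}

/*
 * Model of the wakeup notification handler: walk every entry, and unlink
 * and wake the vcpus whose ON flag is set. *scanned receives the number of
 * entries examined, i.e. the per-interrupt cost. Returns the number woken.
 */
static size_t pi_wakeup_scan(struct pcpu_blocked_list *l, size_t *scanned)
{
    struct blocked_vcpu **pp = &l->head;
    size_t woken = 0;

    *scanned = 0;
    while (*pp != NULL) {
        struct blocked_vcpu *v = *pp;

        (*scanned)++;
        if (v->on) {
            *pp = v->next;   /* unlink from the blocked list */
            v->next = NULL;
            v->woken = true; /* stands in for unblocking the vcpu */
            woken++;
        } else {
            pp = &v->next;
        }
    }
    return woken;
}
```

With N blocked vcpus on one pCPU's list, a single notification examines all N entries even if only one vcpu needs waking, so the cost per interrupt grows linearly with the list length.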
* Re: Enabling VT-d PI by default 2017-04-16 20:13 ` Chao Gao 2017-04-18 6:24 ` Tian, Kevin @ 2017-04-18 8:13 ` Jan Beulich 2017-04-18 3:41 ` Chao Gao 1 sibling, 1 reply; 15+ messages in thread From: Jan Beulich @ 2017-04-18 8:13 UTC (permalink / raw) To: Chao Gao Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel

>>> On 16.04.17 at 22:13, <chao.gao@intel.com> wrote:
> 3. Like what we do in struct irq_guest_action_t, can we limit the maximum number of entries we support in the list? With this approach, during domain creation, we calculate the available entries and compare them with the domain's vCPU count to decide whether the domain can use VT-d PI. This method imposes a strict limit on the maximum number of entries in one list, but it may affect vCPU hotplug.

I don't view this as really suitable - irq_guest_action is quite different, as one can reasonably place expectations on how many devices may share an interrupt line. If someone really hit this boundary, (s)he could likely re-configure their system by moving expansion cards between slots. Neither of these is comparable with the PI situation, as it looks to me.

Furthermore, whether a guest would be able to start / use PI would be quite hard for an admin to tell, it seems, again as opposed to the case with the shared interrupt lines.

Jan
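The limit-based proposal quoted in the message above (option 3) could look roughly like the following admission check. This is a sketch under assumptions, not anything from the Xen tree: `struct pi_entry_budget` and `pi_reserve()` are hypothetical, and a single global budget is assumed, whereas the proposal does not say whether the limit would be global or per list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical budget of blocked-list entries (assumed to be global). */
struct pi_entry_budget {
    unsigned int max_entries; /* hard cap on list entries */
    unsigned int reserved;    /* entries already promised to domains */
};

/*
 * Try to reserve one entry per vcpu. Returns true if the reservation fits
 * within the cap (the domain may use PI), false otherwise. The same check
 * would have to run again when vcpus are hotplugged.
 * Invariant: reserved never exceeds max_entries.
 */
static bool pi_reserve(struct pi_entry_budget *b, unsigned int nr_vcpus)
{
    if (nr_vcpus > b->max_entries - b->reserved)
        return false;
    b->reserved += nr_vcpus;
    return true;
}
```

The sketch also shows the two objections raised in the thread: a later hotplug request can fail once the budget is used up, and whether a given domain gets PI depends on what was started before it, which is hard for an admin to predict.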
* Re: Enabling VT-d PI by default 2017-04-18 8:13 ` Jan Beulich @ 2017-04-18 3:41 ` Chao Gao 2017-04-18 10:52 ` Jan Beulich 0 siblings, 1 reply; 15+ messages in thread From: Chao Gao @ 2017-04-18 3:41 UTC (permalink / raw) To: Jan Beulich Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel

On Tue, Apr 18, 2017 at 02:13:36AM -0600, Jan Beulich wrote:
>>>> On 16.04.17 at 22:13, <chao.gao@intel.com> wrote:
>> 3. Like what we do in struct irq_guest_action_t, can we limit the maximum number of entries we support in the list? With this approach, during domain creation, we calculate the available entries and compare them with the domain's vCPU count to decide whether the domain can use VT-d PI. This method imposes a strict limit on the maximum number of entries in one list, but it may affect vCPU hotplug.
>
>I don't view this as really suitable - irq_guest_action is quite different, as one can reasonably place expectations on how many devices may share an interrupt line. If someone really hit this boundary, (s)he could likely re-configure their system by moving expansion cards between slots. Neither of these is comparable with the PI situation, as it looks to me.
>
>Furthermore, whether a guest would be able to start / use PI would be quite hard for an admin to tell, it seems, again as opposed to the case with the shared interrupt lines.
>

Indeed. It would annoy the admin. What's your opinion on the first and second methods? Do you think we need such a policy to restrict the number of entries in the list even with the first two methods?

Thanks
Chao
* Re: Enabling VT-d PI by default 2017-04-18 3:41 ` Chao Gao @ 2017-04-18 10:52 ` Jan Beulich 0 siblings, 0 replies; 15+ messages in thread From: Jan Beulich @ 2017-04-18 10:52 UTC (permalink / raw) To: Chao Gao Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel

>>> On 18.04.17 at 05:41, <chao.gao@intel.com> wrote:
> On Tue, Apr 18, 2017 at 02:13:36AM -0600, Jan Beulich wrote:
>>>>> On 16.04.17 at 22:13, <chao.gao@intel.com> wrote:
>>> 3. Like what we do in struct irq_guest_action_t, can we limit the maximum number of entries we support in the list? With this approach, during domain creation, we calculate the available entries and compare them with the domain's vCPU count to decide whether the domain can use VT-d PI. This method imposes a strict limit on the maximum number of entries in one list, but it may affect vCPU hotplug.
>>
>>I don't view this as really suitable - irq_guest_action is quite different, as one can reasonably place expectations on how many devices may share an interrupt line. If someone really hit this boundary, (s)he could likely re-configure their system by moving expansion cards between slots. Neither of these is comparable with the PI situation, as it looks to me.
>>
>>Furthermore, whether a guest would be able to start / use PI would be quite hard for an admin to tell, it seems, again as opposed to the case with the shared interrupt lines.
>
> Indeed. It would annoy the admin. What's your opinion on the first and second methods? Do you think we need such a policy to restrict the number of entries in the list even with the first two methods?

Well, I'm in agreement with Kevin that all reasonable approaches should be made use of here, so I'd like to defer a decision on a forced limit until we see what effects can be achieved by the other two methods.
Jan