From: Chao Gao <chao.gao@intel.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
George Dunlap <George.Dunlap@eu.citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
George Dunlap <george.dunlap@citrix.com>,
	xen-devel@lists.xen.org, Jun Nakajima <jun.nakajima@intel.com>
Subject: Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
Date: Tue, 9 May 2017 00:38:37 +0800
Message-ID: <20170508163836.GA16351@bdw.sh.intel.com>
In-Reply-To: <59104ADD020000780015796D@prv-mh.provo.novell.com>
On Mon, May 08, 2017 at 02:39:25AM -0600, Jan Beulich wrote:
>>>> On 08.05.17 at 18:15, <chao.gao@intel.com> wrote:
>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>> I compared the maximum number of entries in one list (#entry) and the
>>>>>>> number of events of adding an entry to a PI blocking list (#event),
>>>>>>> with and without the latter three patches. Here is the result:
>>>>>>> ----------------------------------------------------
>>>>>>> |      Items      | Maximum of #entry |   #event   |
>>>>>>> ----------------------------------------------------
>>>>>>> | W/ the patches  |         6         |    22740   |
>>>>>>> ----------------------------------------------------
>>>>>>> | W/O the patches |        128        |    46481   |
>>>>>>> ----------------------------------------------------
>>>>>>
>>>>>> Any chance you could trace how long the list traversal took? It would
>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>> we're talking about.
>>>>>
>>>>> Hi.
>>>>>
>>>>> I made a simple test to measure the time consumed by the list traversal.
>>>>> Apply the patch below and create one HVM guest with 128 vCPUs and a
>>>>> passthrough 40 NIC. All guest vCPUs are pinned to one pCPU. Collect data
>>>>> with 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode it with
>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>> is in the range of 1750 cycles to 39330 cycles. The physical CPU's
>>>>> frequency is 1795.788MHz, so the time consumed is in the range of 1us
>>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>>> 2900 vCPUs can be added to the list.
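
For reference, the arithmetic just above can be reproduced with a small
standalone program (only a sketch; the 0.5ms budget is the assumed tolerance
mentioned there, and the other constants come from the measurement quoted):

#include <stdio.h>

int main(void)
{
    double freq_mhz = 1795.788;    /* measured CPU frequency (MHz) */
    double worst_cycles = 39330.0; /* worst-case traversal of ~128 entries */
    double budget_us = 500.0;      /* assumed 0.5ms tolerance */

    double worst_us = worst_cycles / freq_mhz; /* ~21.9us */
    double per_entry_us = worst_us / 128.0;    /* ~0.17us per list entry */

    printf("worst-case traversal: %.1f us\n", worst_us);
    printf("entries fitting in budget: %.0f\n", budget_us / per_entry_us);

    return 0;
}

Running this prints roughly 21.9us for the worst-case traversal and about
2900 entries for the 0.5ms budget, matching the figures above.
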
>>>>
>>>> Great, thanks Chao Gao, that's useful.
>>>
>>>Looks like Chao Gao has been dropped ...
>>>
>>>> I'm not sure a fixed latency --
>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>> the cpu for nearly a full second. But I'm having trouble formulating a
>>>> good limit scenario.
>>>>
>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>> traffic will be the least of your performance problems I should think.
>>>>
>>>> -George
>>>>
>>>> * Waiting for Jan to contradict me on this one. :-)
>>>
>>>22us would certainly be fine, if this was the worst case scenario.
>>>I'm not sure the value measured for 128 list entries can be easily
>>>scaled to several thousands of them, due to cache and/or NUMA
>>>effects. I continue to think that we primarily need theoretical
>>>proof of an upper boundary on list length being enforced, rather
>>>than any measurements or randomized balancing. And just to be
>>>clear - if someone overloads their system, I do not see a need to
>>>have a guaranteed maximum list traversal latency here. All I ask
>>>for is that list traversal time scales with total vCPU count divided
>>>by pCPU count.
>>
>> Thanks, Jan & George.
>>
>> It is now much clearer to me what I should do as the next step.
>>
>> In my understanding, we should distribute the wakeup interrupts like
>> this:
>> 1. By default, distribute it to the local pCPU ('local' means the pCPU
>> the vCPU is running on) to make the common case fast.
>> 2. When the list grows to a point where we think traversing it may consume
>> too much time, still distribute the wakeup interrupt to the local pCPU,
>> ignoring the case where the admin intentionally overloads the system.
>> 3. When the list length reaches the theoretical average maximum (i.e.,
>> maximal vCPU count divided by pCPU count), distribute the wakeup interrupt
>> to another underutilized pCPU.
>>
>> But I am confused: if we don't care that someone overloads their system,
>> why do we need stage #3? If we do care, I have no idea how to meet Jan's
>> request that the list traversal time scale with total vCPU count divided
>> by pCPU count. Or will we reach stage #3 before stage #2?
>
>The thing is that imo point 2 is too fuzzy to be of any use, i.e. 3 should
>take effect immediately. We don't mean to ignore any admin decisions
>here; it is just that if they overload their systems, the net effect of 3
>may still not be good enough to provide smooth behavior. But that's
>then a result of them overloading their systems in the first place. IOW,
>you should try to evenly distribute vCPU-s as soon as their count on
>a given pCPU exceeds the calculated average.
Very helpful and reasonable. Thank you, Jan.
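
To double-check my understanding, the selection of the destination pCPU when
a vCPU is about to be added to a blocking list would look roughly like the
sketch below (only pseudo-ish C; per_cpu_list_len(), total_vcpus() and
online_pcpus() are placeholder helpers, not existing interfaces, and CPU IDs
are assumed contiguous for simplicity):

/*
 * Prefer the local pCPU, but fall back to the least-loaded pCPU once the
 * local blocking list exceeds the computed average.
 */
static unsigned int pi_pick_dest_cpu(unsigned int local_cpu)
{
    /* Average list length if vCPUs were spread evenly (rounded up). */
    unsigned int avg = (total_vcpus() + online_pcpus() - 1) / online_pcpus();
    unsigned int cpu, best = local_cpu;

    /* Common case: the local list is still below the average. */
    if ( per_cpu_list_len(local_cpu) < avg )
        return local_cpu;

    /* Otherwise pick the online pCPU with the shortest blocking list. */
    for ( cpu = 0; cpu < online_pcpus(); cpu++ )
        if ( per_cpu_list_len(cpu) < per_cpu_list_len(best) )
            best = cpu;

    return best;
}

Is that roughly what you have in mind? A real patch would of course need
proper locking around the per-pCPU list lengths.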