From: Chao Gao <chao.gao@intel.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	George Dunlap <george.dunlap@citrix.com>,
	xen-devel@lists.xen.org, Jun Nakajima <jun.nakajima@intel.com>
Subject: Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
Date: Tue, 9 May 2017 01:37:12 +0800	[thread overview]
Message-ID: <20170508173710.GA22160@bdw.sh.intel.com> (raw)
In-Reply-To: <5910557F02000078001579B5@prv-mh.provo.novell.com>

On Mon, May 08, 2017 at 03:24:47AM -0600, Jan Beulich wrote:
>(Chao Gao got lost from the recipients list again; re-adding)
>
>>>> On 08.05.17 at 11:13, <george.dunlap@citrix.com> wrote:
>> On 08/05/17 17:15, Chao Gao wrote:
>>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>>> I compared the maximum number of entries in one list (#entry) and the
>>>>>>>> number of times an entry was added to a PI blocking list (#event), with
>>>>>>>> and without the latter three patches. Here is the result:
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>>> we're talking about.
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>>> Apply the patch below and create one HVM guest with 128 vCPUs and a
>>>>>> passthrough 40G NIC. All guest vCPUs are pinned to one pCPU. Collect data
>>>>>> with 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode it with
>>>>>> xentrace_format. When the list length is about 128, the traversal time is
>>>>>> in the range of 1750 to 39330 cycles. The physical CPU's frequency is
>>>>>> 1795.788MHz, so the time consumed is in the range of 1us to 22us. If 0.5ms
>>>>>> is the upper bound the system can tolerate, at most ~2900 vCPUs can be
>>>>>> added to the list.
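
For reference, the last step above is plain arithmetic on the quoted figures.
A trivial standalone C program (purely illustrative, not Xen code; the 0.5ms
budget and the 39330-cycle worst case are simply the numbers quoted above)
works it out like this:

/*
 * Back-of-the-envelope check of the numbers above: not Xen code, just the
 * conversion from measured TSC cycles to a rough bound on list length.
 */
#include <stdio.h>

int main(void)
{
    const double tsc_mhz      = 1795.788; /* measured pCPU frequency (MHz) */
    const double worst_cycles = 39330.0;  /* worst traversal of ~128 entries */
    const double budget_us    = 500.0;    /* assumed tolerable upper bound */

    double worst_us     = worst_cycles / tsc_mhz;  /* ~21.9 us  */
    double us_per_entry = worst_us / 128.0;        /* ~0.17 us  */
    double max_entries  = budget_us / us_per_entry;

    printf("worst traversal: %.1f us\n", worst_us);
    printf("per entry: %.3f us\n", us_per_entry);
    printf("entries fitting in %.0f us: ~%.0f\n", budget_us, max_entries);
    return 0;
}

Compiled with a plain cc, this prints roughly 21.9us for the worst-case
traversal, ~0.17us per entry, and about 2900 (2922) entries within the 500us
budget, matching the estimate above.
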
>>>>>
>>>>> Great, thanks Chao Gao, that's useful.
>>>>
>>>> Looks like Chao Gao has been dropped ...
>>>>
>>>>>  I'm not sure a fixed latency --
>>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>>> good limit scenario.
>>>>>
>>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>>> traffic will be the least of your performance problems I should think.
>>>>>
>>>>>  -George
>>>>>
>>>>> * Waiting for Jan to contradict me on this one. :-)
>>>>
>>>> 22us would certainly be fine, if this was the worst case scenario.
>>>> I'm not sure the value measured for 128 list entries can be easily
>>>> scaled to several thousand of them, due to cache and/or NUMA
>>>> effects. I continue to think that we primarily need theoretical
>>>> proof of an upper boundary on list length being enforced, rather
>>>> than any measurements or randomized balancing. And just to be
>>>> clear - if someone overloads their system, I do not see a need to
>>>> have a guaranteed maximum list traversal latency here. All I ask
>>>> for is that list traversal time scales with total vCPU count divided
>>>> by pCPU count.
>>> 
>>> Thanks, Jan & George.
>>> 
>>> The next step is now much clearer to me.
>>> 
>>> In my understanding, we should distribute the wakeup interrupts like
>>> this:
>>> 1. By default, deliver the interrupt to the local pCPU ('local' meaning
>>> the pCPU the vCPU is running on) to make the common case fast.
>>> 2. Even when the list grows to a point where traversing it may consume
>>> too much time, still deliver the wakeup interrupt to the local pCPU,
>>> i.e. ignore the case where the admin intentionally overloads the system.
>>> 3. When the list length reaches the theoretical average maximum (the
>>> maximal vCPU count divided by the pCPU count), deliver the wakeup
>>> interrupt to another, underutilized pCPU.
>> 
>> By "maximal vCPU count" do you mean, "total number of active vcpus on
>> the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
>> domains * 512 vcpus each or something)?
>
>The former.

Ok. Actually I meant the latter. But now, I realize I was wrong.

>
>> What about saying that the limit of vcpus for any given pcpu will be:
>>  (v_tot / p_tot) + K
>> where v_tot is the total number of vcpus on the system, p_tot is the
>> total number of pcpus in the system, and K is a fixed number (such as
>> 128) such that 1) the additional time walking the list is minimal, and
>> 2) in the common case we should never come close to reaching that number?
>> 
>> Then the algorithm for choosing which pcpu to have the interrupt
>> delivered to would be:
>>  1. Set p = current_pcpu
>>  2. If len(list(p)) < v_tot / p_tot + K, choose p
>>  3. Otherwise, choose another p and goto 2
>> 
>> The "choose another p" could be random / pseudorandom selection, or it
>> could be some other mechanism (rotate, look for pcpus nearby on the
>> topology, choose the lowest one, &c).  But as long as we check the
>> length before assigning it, it should satisfy Jan.

Very clear and helpful. Otherwise, I might have needed to spend several
months to reach this solution. Thanks, George. :)
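
To double-check my understanding, below is a rough, untested standalone
sketch of that check. Everything in it is illustrative rather than existing
Xen code: K = 128, the plain array standing in for the per-pCPU blocking-list
lengths, and the rotate-to-the-next-pCPU fallback (just one of the fallback
options you list).

/*
 * Standalone sketch of the selection rule above (not Xen code; the limit
 * K and the array-based "list lengths" are stand-ins for illustration).
 */
#include <stdio.h>

#define NR_PCPUS 4
#define K        128

static unsigned int list_len[NR_PCPUS]; /* per-pCPU PI blocking list lengths */
static unsigned int v_tot, p_tot = NR_PCPUS;

/* Pick a pCPU for the wakeup interrupt: prefer 'cur', else scan for one
 * whose list is still under the limit. */
static unsigned int pick_pcpu(unsigned int cur)
{
    unsigned int limit = v_tot / p_tot + K;
    unsigned int p;

    if ( list_len[cur] < limit )
        return cur;                 /* common case: stay local */

    for ( p = (cur + 1) % p_tot; p != cur; p = (p + 1) % p_tot )
        if ( list_len[p] < limit )
            return p;               /* first under-loaded pCPU found */

    return cur;                     /* everyone is at the limit; stay local */
}

int main(void)
{
    v_tot = 260;
    list_len[0] = 200;              /* local pCPU already over 260/4 + 128 = 193 */
    list_len[1] = 10;
    printf("chosen pCPU: %u\n", pick_pcpu(0));   /* prints 1 */
    return 0;
}

The fallback loop could of course be replaced by a random pick or a
topology-aware search, as you suggest; the important part is that a pCPU is
only chosen after its list length has been checked against v_tot / p_tot + K.
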

