From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: About vcpu wakeup and runq tickling in credit
Date: Thu, 15 Nov 2012 12:18:29 +0000
Message-ID: <50A4DD95.5020107@eu.citrix.com>
References: <1350999260.5064.56.camel@Solace> <5086B4DF.6060701@eu.citrix.com> <1352981447.5351.51.camel@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <1352981447.5351.51.camel@Solace>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: Keir Fraser, Jan Beulich, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 15/11/12 12:10, Dario Faggioli wrote:
> On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
>>> As for possible solutions, I think that, for instance, tickling all
>>> the CPUs in both v_W's and v_C's affinity masks could solve this,
>>> but that would also potentially increase the overhead (by asking
>>> _a_lot_ of CPUs to reschedule), and again, it's hard to say if/when
>>> it's worth it...
>> Well, in my code, opt_tickle_idle_one is on by default, which means
>> only one other cpu will be woken up.  If there were an easy way to
>> make it wake up a CPU in v_C's affinity as well (supposing that there
>> was no overlap), that would probably be a win.
>>
>> Of course, that's only necessary if:
>> * v_C is lower priority than v_W
>> * There are no idlers that intersect both v_C's and v_W's affinity
>>   masks.
>>
>> It's probably a good idea, though, to try to set up a scenario where
>> this might be an issue and see how often it actually happens.
>>
> Ok, I think I managed to reproduce this. Look at the following trace,
> considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has
> no affinity at all (its vcpus can run everywhere):
>
>  166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
> ]166.853945884 ---|-|-------x-| d51v1 28004(2:8:4) 2 [ 0 7 ]
> .
> ]166.853986385 ---|-|-------x-| d51v1 2800e(2:8:e) 2 [ 33 4bf97be ]
> ]166.853986522 ---|-|-------x-| d51v1 2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
> ]166.853986636 ---|-|-------x-| d51v1 2800a(2:8:a) 4 [ 33 1 0 7 ]
> .
>  166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
>  166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
> .
> .
> .
> ]166.854195353 ---|-|-------x-| d0v7 28006(2:8:6) 2 [ 0 7 ]
> ]166.854196484 ---|-|-------x-| d0v7 2800e(2:8:e) 2 [ 0 33530 ]
> ]166.854196584 ---|-|-------x-| d0v7 2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
> ]166.854196691 ---|-|-------x-| d0v7 2800a(2:8:a) 4 [ 0 7 33 1 ]
>  166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
>  166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running
>
> So, if I'm not reading the trace wrong, when d0v7 wakes up (very first
> event) it preempts d51v1. Now, even though almost all of pcpus 8-15
> are idle, none of them gets tickled and comes to pick d51v1 up, which
> then has to wait in the runq until d0v7 goes back to sleep.
>
> I suspect this could be because, at d0v7 wakeup time, we try to tickle
> some pcpu which is in d0v7's affinity, but not in d51v1's (as in the
> second 'if() {}' block in __runq_tickle() we only care about
> new->vcpu->cpu_affinity, and in this case, new is d0v7).
>
> I know, looking at the timestamps it doesn't look like a big deal in
> this case, and I'm still working on producing numbers that can better
> show whether or not this is a real problem.
>
> Anyway, and independently of the results of these tests, why do I care
> so much?
>
> Well, if you substitute the concept of "vcpu-affinity" with
> "node-affinity" above (which is what I am doing in my NUMA-aware
> scheduling patches) you'll see why this is bothering me quite a bit.
> In fact, in that case, waking up a random pcpu with which d0v7 has
> node-affinity, while d51v1 has not, would cause d51v1 to be pulled by
> that cpu (since node-affinity is only a preference)!
>
> So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks
> at pcpu 13's runq for work to steal it does not find anything suitable
> and gives up, leaving d51v1 in the runq even if there are idle pcpus
> on which it could run, which is already bad.
> In the node-affinity case, pcpu 3 will actually manage to steal d51v1
> and run it, even if there are idle pcpus with which it has
> node-affinity, thus defeating most of the benefits of the whole
> NUMA-aware scheduling thing (at least for some workloads).

Maybe what we should do is base the wake-up on who is likely to end up
running on the current cpu: i.e., if "current" is likely to be
pre-empted, look at idlers based on "current"'s mask; if "new" is
likely to be put on the queue, look at idlers based on "new"'s mask.

What do you think?

 -George
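
P.S. Just to make the "whose mask do we search" idea a bit more
concrete, here is a rough, untested toy model.  This is not the actual
__runq_tickle() code: the function and variable names, the
plain-integer priorities and the cpumask-as-uint64_t representation are
all invented purely for illustration.

/*
 * Toy model of the proposed tickling rule -- NOT the real scheduler
 * code.  Everything here is simplified for the sake of the example.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t cpumask_t;        /* bit N set => pcpu N is in the mask   */

struct vcpu {
    const char *name;
    int         prio;              /* higher value == higher priority      */
    cpumask_t   affinity;          /* pcpus this vcpu is allowed to run on */
};

/*
 * Choose the affinity mask to intersect with the idlers when picking a
 * pcpu to tickle: whichever vcpu will be left waiting on the runqueue
 * is the one whose mask matters.
 */
static cpumask_t pick_tickle_mask(const struct vcpu *cur,
                                  const struct vcpu *new)
{
    if ( new->prio > cur->prio )
        return cur->affinity;      /* new preempts: cur stays queued */
    else
        return new->affinity;      /* new goes onto the runqueue     */
}

int main(void)
{
    /*
     * The scenario from the trace: d0v7 (no affinity, higher priority
     * on wakeup) preempts d51v1 on pcpu 13; d51v1 is pinned to pcpus
     * 8-15, and pcpus 8-12 and 14-15 are assumed to be idle.
     */
    struct vcpu d51v1 = { "d51v1", 1, 0xff00ULL };   /* pcpus 8-15 */
    struct vcpu d0v7  = { "d0v7",  2, ~0ULL };       /* all pcpus  */
    cpumask_t idlers  = 0xdf00ULL;                   /* 8-12,14-15 */

    cpumask_t mask = pick_tickle_mask(&d51v1, &d0v7);

    /*
     * The search mask is now d51v1's affinity, so the tickle candidates
     * are pcpus 8-12 and 14-15 -- one of them can come and pick d51v1
     * up -- rather than whatever happens to be idle in d0v7's full mask.
     */
    printf("%s preempts %s, tickle candidates = %#llx\n",
           d0v7.name, d51v1.name,
           (unsigned long long)(idlers & mask));

    return 0;
}

Presumably the same choice of mask would carry over to your
node-affinity masks as well, since it is the vcpu left waiting whose
preferences we want the tickled pcpu to honour.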