From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: [Patch 2 of 2]: PV-domain SMP performance Linux-part
Date: Fri, 16 Jan 2009 09:41:39 -0800
Message-ID: <4970C6D3.2080206@goop.org>
References: <C59609B4.21084%keir.fraser@eu.citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <C59609B4.21084%keir.fraser@eu.citrix.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Keir Fraser <keir.fraser@eu.citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>, Juergen Gross <juergen.gross@fujitsu-siemens.com>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
List-Id: xen-devel@lists.xenproject.org

Keir Fraser wrote:
> On 16/01/2009 09:36, "Juergen Gross" <juergen.gross@fujitsu-siemens.com>
> wrote:
>
>   
>>> The approach taken in Linux is not merely 'yield on spinlock' by the way, it
>>> is 'block on event channel on spinlock' essentially turning a contended
>>> spinlock into a sleeping mutex. I think that is quite different behaviour
>>> from merely yielding, and expecting the scheduler to do something sensible
>>> with your yield request.
>>>       
>> Could you explain this a little bit more in detail, please?
>>     
>
> Jeremy Fitzhardinge did the implementation for Linux, so I'm cc'ing him in
> case he remembers more details than me.
>
> Basically each CPU allocates itself an IPI event channel at start of day.
> When a CPU attempts to acquire a spinlock it spins a short while (perhaps a
> few microseconds?) and then adds itself to a bitmap stored in the lock
> structure (I think, or it might be a linked list of sleepers?). It then
> calls SCHEDOP_poll listing its IPI evtchn as its wakeup requirement. When
> the lock holder releases the lock it checks for sleepers and if it sees one
> then it pings one of them (or is it all of them?) on its event channel, thus
> waking it to take the lock.
>   

Yes, that's more or less right.  Each lock has a count of how many cpus 
are waiting for the lock; if it's non-zero on unlock, the unlocker kicks 
all the waiting cpus via IPI.  There's a per-cpu variable of "lock I am 
waiting for"; the kicker looks at each cpu's entry and kicks it if it's 
waiting for the lock being unlocked.
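
As a rough illustration of that unlock-side scan, here is a minimal 
single-threaded sketch in plain user-space C.  It is not the kernel 
code: NR_CPUS, struct sketch_lock, waiting_for[], kicked[] and both 
function names are made up for the example, the IPI is replaced by 
setting a flag, and the atomics and memory barriers a real 
implementation needs are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4   /* hypothetical fixed CPU count for this sketch */

struct sketch_lock {
    unsigned char locked;   /* byte lock: 0 = free, 1 = held */
    unsigned int  waiters;  /* how many CPUs are blocked on this lock */
};

/* Per-CPU "lock I am waiting for"; NULL when the CPU is not blocked. */
static struct sketch_lock *waiting_for[NR_CPUS];

/* Stand-in for the IPI: records which CPUs have been kicked. */
static bool kicked[NR_CPUS];

/* Called by a CPU just before it blocks on a contended lock. */
static void note_waiting(int cpu, struct sketch_lock *l)
{
    waiting_for[cpu] = l;
    l->waiters++;
}

/* Release the lock and kick every CPU recorded as waiting for it.
 * Returns the number of CPUs kicked. */
static int sketch_unlock(struct sketch_lock *l)
{
    int n = 0;

    l->locked = 0;
    if (l->waiters == 0)
        return 0;           /* fast path: nobody is blocked */

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (waiting_for[cpu] == l) {
            kicked[cpu] = true;   /* a real kicker would send an IPI here */
            n++;
        }
    }
    return n;
}
```

The waiter count only serves to keep the uncontended unlock cheap; the 
per-cpu scan is what decides who actually gets kicked.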

The locking side does the expected "spin for a while, then block on 
timeout".  The timeout is settable if you have the appropriate debugfs 
option enabled (which also produces quite a lot of detailed stats about 
locking behaviour).  The IPI is never delivered as an event BTW; the 
locker uses the event poll hypercall to block until the event is pending 
(this hypercall had some performance problems until relatively recent 
versions of Xen; I'm not sure which release versions have the fix).
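
The shape of that slow path can be sketched roughly as follows, again 
in user-space C11 rather than kernel code.  Everything here is an 
assumption for illustration: struct byte_lock, try_lock(), lock_slow() 
and the block_fn callback are invented names, the timeout is a plain 
iteration count, and fake_block() merely stands in for the event poll 
hypercall by pretending the holder released the lock while we slept.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct byte_lock {
    atomic_uchar locked;    /* 0 = free, 1 = held */
};

/* One acquisition attempt: atomically set the byte, succeed if it was 0. */
static bool try_lock(struct byte_lock *l)
{
    return atomic_exchange(&l->locked, 1) == 0;
}

/* Callback that blocks the caller until it is kicked. */
typedef void (*block_fn)(struct byte_lock *l);

/* Spin for up to `timeout` attempts; if the lock is still held, block,
 * then start spinning again once woken.  Returns how many times the
 * caller had to block before acquiring the lock. */
static unsigned int lock_slow(struct byte_lock *l, unsigned int timeout,
                              block_fn block_until_kicked)
{
    unsigned int blocks = 0;

    for (;;) {
        for (unsigned int i = 0; i < timeout; i++) {
            if (try_lock(l))
                return blocks;
        }
        block_until_kicked(l);  /* real code would poll the event channel */
        blocks++;
    }
}

/* Test stand-in for blocking: simulates the holder unlocking and
 * kicking us, so the next spin attempt succeeds. */
static void fake_block(struct byte_lock *l)
{
    atomic_store(&l->locked, 0);
}
```

Note that a woken cpu is not handed the lock; it just goes back to 
spinning and may have to block again if someone else got there first.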

The lock itself is a simple byte spinlock, with no fairness guarantees; 
I'm assuming (hoping) that the pathological cases that ticket locks were 
introduced to solve will be mitigated by the timeout/blocking path 
(and/or be less likely in a virtual environment anyway).

I measured a small performance improvement within the domain with this 
patch (kernbench-type workload), but an overall 10% reduction in 
system-wide CPU use with multiple competing domains.

    J