From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Andi Kleen <ak@suse.de>,
linux-kernel@vger.kernel.org, Andrew Morton <akpm@osdl.org>,
Arjan van de Ven <arjanv@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Alan Cox <alan@lxorguk.ukuu.org.uk>,
Christoph Hellwig <hch@infradead.org>,
David Howells <dhowells@redhat.com>,
Alexander Viro <viro@ftp.linux.org.uk>,
Oleg Nesterov <oleg@tv-sign.ru>
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 16:50:10 +0100 [thread overview]
Message-ID: <20051219155010.GA7790@elte.hu> (raw)
In-Reply-To: <Pine.LNX.4.64.0512182214400.4827@g5.osdl.org>
* Linus Torvalds <torvalds@osdl.org> wrote:
> On Mon, 19 Dec 2005, Andi Kleen wrote:
> >
> > Do you have an idea where this big difference comes from? It doesn't look
> > like it's from the fast path, which is essentially the same. Do the mutexes have
> > that much better scheduling behaviour than semaphores? It is a bit hard to
> > believe.
>
> Are Ingo's mutex'es perhaps not trying to be fair?
>
> The normal mutexes try to make sure that if a process comes in and
> gets stuck on a mutex, and then another CPU releases the mutex and
> immediately tries to grab it again, the other CPU will _not_ get it.
the VFS creat+unlink performance advantage i measured on SMP systems is
a pure 'struct mutex' vs. stock 'struct semaphore' measurement. I
measured the same kernel with a single .config option difference
(DEBUG_MUTEX_FULL to get the 'mutex' variant, vs. DEBUG_MUTEX_PARTIAL to
get the 'semaphore' variant), the systems were completely idle, and the
results are statistically stable.
fact is, we don't implement fairness in the upstream struct semaphore
implementation(s) either - it would be extremely bad for performance (as
you point out too).
both stock semaphores and generic mutexes are unfair in the following
sense: if there is a waiter 'in flight' (but it has not grabbed the lock
yet), a 'lucky bastard' may jump the queue and grab the lock without
having waited at all. So comparing semaphores to generic mutexes is an
apples-to-apples comparison.
in fact, generic mutexes are _more_ fair than struct semaphore in their
wait logic. In the stock semaphore implementation, when a waiter is
woken up, it will retry the lock, and if it fails, it goes back to the
_tail_ of the queue again - waiting one full cycle again.
in the generic mutex implementation, the first waiter is woken up, and
it will retry the lock - but no other waiters are woken up until this
waiter has done its round. (a mutex implementation can do this, because
it knows that there can only be one owner, while a counting semaphore
has to be prepared for the possibility of multiple concurrent up()s, and
for the count going above 1.)
and this is also the crucial difference that gives mutexes the
performance edge in contended workloads (so this should also answer
Andi's question): stock semaphores easily end up creating a 'thundering
herd' scenario, if a 'fast lucky bastard' releases the lock - and wakes
up _yet another_ waiter. The end result is that, given a high enough
'semaphore use frequency', our wake-one logic in semaphores totally
falls apart and we end up having all waiters racing for the runqueues,
and racing for the lock itself. This causes much higher CPU utilization,
wasted resources, and slower total performance.
there is one more wakeup related optimization in generic mutexes: the
waiter->woken flag. It is set when the waiter is 'set into flight', and
subsequent wakeups first check this flag. Since new waiters are added to
the end of the wait-list, this waiter's data cachelines stay clean, and
the ->woken flag is nicely shared amongst CPUs. Doing a blind
wake_up_process() would at a minimum dirty the remote runqueue's lock
cacheline.
> That's a huge performance disadvantage, but it's done on purpose,
> because otherwise you end up in a situation where the semaphore
> release code did wake up the waiter, but before the waiter actually
> had time to grab it (it has to go through the IPI and scheduling
> logic), the same CPU just grabbed it again.
>
> The original semaphores were unfair, and it works really well most of
> the time. But then it really sucks in some nasty cases.
>
> The numbers make me suspect that Ingo's mutexes are unfair too, but
> I've not looked at the code yet.
yes, they are unfair - just like stock semaphores.
[ Note that the generic rt-mutexes in the -rt tree are of course fair to
higher-prio tasks, and this fairness is implemented via
priority-queueing and priority inheritance: the highest prio (RT) task
gets the lock, and if a lower-prio task is holding the lock, it is
boosted up until the end of the critical section, at which point it
hands over the lock to the highprio task.
implementing any other method to achieve fairness would result in
really bad real-world performance. The -rt tree's performance is quite
close to the vanilla kernel (considering that it's a fully preemptible
kernel), and we couldn't do that without the mutex implementation. ]
generic mutexes, as present in this patchqueue, do not include priority
queueing and priority inheritance, so they are 'plain unfair', just like
stock semaphores are.
> NOTE! I'm not a huge fan of fairness per se. I think unfair is often
> ok, and the performance advantages are certainly real. It may well be
> that the cases that caused problems before are now done differently
> (eg we switched the VM semaphore over to an rwsem), and that we can
> have an unfair and fast mutex for those cases where we don't care.
>
> I just suspect the comparison isn't fair.
the comparison is 100% totally fair, because both generic mutexes, and
struct semaphores are 100% totally unfair :-) There's no difference in
the level of unfairness either: 'lucky bastards' are allowed to steal
the lock in both implementations.
we have only one (limited) notion of fairness in Linux synchronization
primitives: rwsems prefer waiting writers, and then block subsequent
readers. Note that rwsems are still reader-reader and writer-writer
unfair, hence the -rt tree replaces the rwsem implementation too, with
generic mutexes and PI.
Ingo