From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Andi Kleen <ak@suse.de>,
linux-kernel@vger.kernel.org, Andrew Morton <akpm@osdl.org>,
Arjan van de Ven <arjanv@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Alan Cox <alan@lxorguk.ukuu.org.uk>,
Christoph Hellwig <hch@infradead.org>,
David Howells <dhowells@redhat.com>,
Alexander Viro <viro@ftp.linux.org.uk>,
Oleg Nesterov <oleg@tv-sign.ru>
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 16:50:10 +0100 [thread overview]
Message-ID: <20051219155010.GA7790@elte.hu> (raw)
In-Reply-To: <Pine.LNX.4.64.0512182214400.4827@g5.osdl.org>
* Linus Torvalds <torvalds@osdl.org> wrote:
> On Mon, 19 Dec 2005, Andi Kleen wrote:
> >
> > Do you have an idea where this big difference comes from? It doesn't look
> > like it's from the fast path, which is essentially the same. Do the mutexes have
> > that much better scheduling behaviour than semaphores? It is a bit hard to
> > believe.
>
> Are Ingo's mutex'es perhaps not trying to be fair?
>
> The normal mutexes try to make sure that if a process comes in and
> gets stuck on a mutex, and then another CPU releases the mutex and
> immediately tries to grab it again, the other CPU will _not_ get it.
the VFS creat+unlink performance advantage i measured on SMP systems is
a pure 'struct mutex' vs. stock 'struct semaphore' measurement. I
measured the same kernel with a single .config option difference
(DEBUG_MUTEX_FULL to get the 'mutex' variant, vs. DEBUG_MUTEX_PARTIAL to
get the 'semaphore' variant), the systems were completely idle, and the
results are statistically stable.
fact is, we don't implement fairness in the upstream struct semaphore
implementation(s) either - it would be extremely bad for performance (as
you point out too).
both stock semaphores and generic mutexes are unfair in the following
sense: if there is a waiter 'in flight' (but it has not grabbed the lock
yet), a 'lucky bastard' may jump the queue and grab the lock without
having waited at all. So comparing semaphores to generic mutexes is an
apples-to-apples comparison.
in fact, generic mutexes are _more_ fair than struct semaphore in their
wait logic. In the stock semaphore implementation, when a waiter is
woken up, it will retry the lock, and if it fails, it goes back to the
_tail_ of the queue again - waiting one full cycle again.
in the generic mutex implementation, the first waiter is woken up, and
it will retry the lock - but no other waiters are woken up until this
waiter has done its round. (a mutex implementation can do this, because
it knows that there can only be one owner, while a counting semaphore
has to be prepared for the possibility of multiple concurrent up()s, and
for the count going above 1.)
and this is also the crucial difference that gives mutexes the
performance edge in contended workloads (so this should also answer
Andi's question): stock semaphores easily end up creating a 'thundering
herd' scenario, if a 'fast lucky bastard' releases the lock - and wakes
up _yet another_ waiter. The end result is that, given a high enough
'semaphore use frequency', our wake-one logic in semaphores totally
falls apart and we end up having all waiters racing for the runqueues,
and racing for the lock itself. This causes much higher CPU utilization,
wasted resources, and slower total performance.
there is one more wakeup related optimization in generic mutexes: the
waiter->woken flag. It is set when the waiter is 'set into flight', and
subsequent wakeups first check this flag. Since new waiters are added to
the end of the wait-list, this waiter's data cachelines stay clean, and
the ->woken flag is nicely shared amongst CPUs. Doing a blind
wake_up_process() would at a minimum dirty the remote runqueue's lock
cacheline.
> That's a huge performance disadvantage, but it's done on purpose,
> because otherwise you end up in a situation where the semaphore
> release code did wake up the waiter, but before the waiter actually
> had time to grab it (it has to go through the IPI and scheduling
> logic), the same CPU just grabbed it again.
>
> The original semaphores were unfair, and it works really well most of
> the time. But then it really sucks in some nasty cases.
>
> The numbers make me suspect that Ingo's mutexes are unfair too, but
> I've not looked at the code yet.
yes, they are unfair - just like stock semaphores.
[ Note that the generic rt-mutexes in the -rt tree are of course fair to
higher-prio tasks, and this fairness is implemented via
priority-queueing and priority inheritance: the highest prio (RT) task
gets the lock, and if a lower-prio task is holding the lock, it is
boosted up until the end of the critical section, at which point it
hands over the lock to the highprio task.
implementing any other method to achieve fairness would result in
really bad real-world performance. The -rt tree's performance is quite
close to the vanilla kernel (considering that it's a fully preemptible
kernel), and we couldn't do that without the mutex implementation. ]
generic mutexes, as present in this patchqueue, do not include priority
queueing and priority inheritance, so they are 'plain unfair', just like
stock semaphores are.
> NOTE! I'm not a huge fan of fairness per se. I think unfair is often
> ok, and the performance advantages are certainly real. It may well be
> that the cases that caused problems before are now done differently
> (eg we switched the VM semaphore over to an rwsem), and that we can
> have an unfair and fast mutex for those cases where we don't care.
>
> I just suspect the comparison isn't fair.
the comparison is 100% totally fair, because both generic mutexes, and
struct semaphores are 100% totally unfair :-) There's no difference in
the level of unfairness either: 'lucky bastards' are allowed to steal
the lock in both implementations.
we have only one (limited) notion of fairness in Linux synchronization
primitives: rwsems prefer waiting writers, and then block subsequent
readers. Note that rwsems are still reader-reader and writer-writer
unfair, hence the -rt tree replaces the rwsem implementation too, with
generic mutexes and PI.
Ingo