* [patch 0/9] mutex subsystem, -V4 @ 2005-12-22 11:41 Ingo Molnar 2005-12-22 11:53 ` Christoph Hellwig 2005-12-22 11:54 ` Andrew Morton 0 siblings, 2 replies; 62+ messages in thread From: Ingo Molnar @ 2005-12-22 11:41 UTC (permalink / raw) To: lkml Cc: Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Christoph Hellwig, Andi Kleen, Russell King this is the -V4 of the mutex subsystem patch-queue. It consists of the following patches: add-atomic-xchg.patch add-atomic-call-func-i386.patch add-atomic-call-func-x86_64.patch add-atomic-call-wrappers-rest.patch mutex-core.patch mutex-switch-arm-to-xchg.patch mutex-debug.patch mutex-debug-more.patch xfs-mutex-namespace-collision-fix.patch the patches are against Linus' latest GIT tree, and they should work fine on every Linux architecture. Changes since -V3: - implemented an atomic_xchg() based mutex implementation. It integrated pretty nicely into the generic code, and most of the code is still shared. - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines this then the generic mutex code will switch to the atomic_xchg() implementation. This should be conceptually equivalent to the variant Nicolas Pitre posted - Nicolas, could you check out this one? It's much easier to provide this in the generic implementation, and the code ends up looking cleaner. - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for architectures to override the generic code anymore, with the introduction of __ARCH_WANT_XCHG_BASED_ATOMICS. - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS. - ARM buildfix: move the new atomic primitives to the correct place. (Nicolas Pitre) - optimization: unlock the mutex outside of the spinlock (suggested by Nicolas Pitre) - removed leftover arch_semaphore reference from the XFS patch. 
(noticed by Arjan van de Ven) - better document the fact that mutex_trylock() follows spin_trylock() semantics, not sem_trylock() semantics. - cleanup: got rid of the MUTEX_LOCKLESS_FASTPATH define. (in -V3 this became equivalent to DEBUG_MUTEXES, via the elimination of the __HAVE_ARCH_CMPXCHG dependency on the fastpath.) - further simplified and unified mutex_trylock(). - DocBook entries, and more documentation of internals. - dropped the spinlock-debug fix, Linus merged it. Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 11:41 [patch 0/9] mutex subsystem, -V4 Ingo Molnar @ 2005-12-22 11:53 ` Christoph Hellwig 2005-12-22 12:45 ` Ingo Molnar 2005-12-22 15:34 ` Nicolas Pitre 2005-12-22 11:54 ` Andrew Morton 1 sibling, 2 replies; 62+ messages in thread From: Christoph Hellwig @ 2005-12-22 11:53 UTC (permalink / raw) To: Ingo Molnar Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Christoph Hellwig, Andi Kleen, Russell King On Thu, Dec 22, 2005 at 12:41:47PM +0100, Ingo Molnar wrote: > this is the -V4 of the mutex subsystem patch-queue. It consists of the > following patches: > > add-atomic-xchg.patch > add-atomic-call-func-i386.patch > add-atomic-call-func-x86_64.patch > add-atomic-call-wrappers-rest.patch > mutex-core.patch > mutex-switch-arm-to-xchg.patch > mutex-debug.patch > mutex-debug-more.patch > xfs-mutex-namespace-collision-fix.patch > > the patches are against Linus' latest GIT tree, and they should work > fine on every Linux architecture. > > Changes since -V3: > > - implemented an atomic_xchg() based mutex implementation. It integrated > pretty nicely into the generic code, and most of the code is still > shared. > > - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines > this then the generic mutex code will switch to the atomic_xchg() > implementation. > > This should be conceptually equivalent to the variant Nicolas Pitre > posted - Nicolas, could you check out this one? It's much easier to > provide this in the generic implementation, and the code ends up > looking cleaner. > > - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for > architectures to override the generic code anymore, with the > introduction of __ARCH_WANT_XCHG_BASED_ATOMICS. > > - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS. I must admit I really really hate __ARCH_ stuff if we can avoid it. 
An <asm/mutex.h> that usually includes two asm-generic variants is probably a much better choice. Actually just having asm/mutex.h implement the fastpath per-arch and getting rid of all the oddball atomic.h additions would be even better. While this means we need per-arch code it also means the code is a lot easier to understand, and we don't add odd public APIs. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 11:53 ` Christoph Hellwig @ 2005-12-22 12:45 ` Ingo Molnar 2005-12-22 15:34 ` Nicolas Pitre 1 sibling, 0 replies; 62+ messages in thread From: Ingo Molnar @ 2005-12-22 12:45 UTC (permalink / raw) To: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen, Russell King * Christoph Hellwig <hch@infradead.org> wrote: > > Changes since -V3: > > > > - imlemented an atomic_xchg() based mutex implementation. It integrated > > pretty nicely into the generic code, and most of the code is still > > shared. > > > > - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines > > this then the generic mutex code will switch to the atomic_xchg() > > implementation. > > > > This should be conceptually equivalent to the variant Nicolas Pitre > > posted - Nicolas, could you check out this one? It's much easier to > > provide this in the generic implementation, and the code ends up > > looking cleaner. > > > > - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for > > architectures to override the generic code anymore, with the > > introduction of __ARCH_WANT_XCHG_BASED_ATOMICS. > > > > - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS. > > I must admit I really really hat __ARCH_ stuff if we can avoid it. An > <asm/mutex.h> that usually includes two asm-generic variants is > probably a much better choice. agreed. In my tree i've changed it to CONFIG_MUTEX_XCHG_ALGORITHM, which is selected by ARM in its Kconfig. Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 11:53 ` Christoph Hellwig 2005-12-22 12:45 ` Ingo Molnar @ 2005-12-22 15:34 ` Nicolas Pitre 2005-12-22 15:40 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 62+ messages in thread From: Nicolas Pitre @ 2005-12-22 15:34 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen, Russell King On Thu, 22 Dec 2005, Christoph Hellwig wrote: > On Thu, Dec 22, 2005 at 12:41:47PM +0100, Ingo Molnar wrote: > > this is the -V4 of the mutex subsystem patch-queue. It consists of the > > following patches: > > > > add-atomic-xchg.patch > > add-atomic-call-func-i386.patch > > add-atomic-call-func-x86_64.patch > > add-atomic-call-wrappers-rest.patch > > mutex-core.patch > > mutex-switch-arm-to-xchg.patch > > mutex-debug.patch > > mutex-debug-more.patch > > xfs-mutex-namespace-collision-fix.patch > > > > the patches are against Linus' latest GIT tree, and they should work > > fine on every Linux architecture. > > > > Changes since -V3: > > > > - imlemented an atomic_xchg() based mutex implementation. It integrated > > pretty nicely into the generic code, and most of the code is still > > shared. > > > > - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines > > this then the generic mutex code will switch to the atomic_xchg() > > implementation. > > > > This should be conceptually equivalent to the variant Nicolas Pitre > > posted - Nicolas, could you check out this one? It's much easier to > > provide this in the generic implementation, and the code ends up > > looking cleaner. > > > > - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for > > architectures to override the generic code anymore, with the > > introduction of __ARCH_WANT_XCHG_BASED_ATOMICS. > > > > - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS. 
> > I must admit I really really hate __ARCH_ stuff if we can avoid it. > An <asm/mutex.h> that usually includes two asm-generic variants is probably > a much better choice. > > Actually just having asm/mutex.h implement the fastpath per-arch and get > rid of all the oddball atomic.h additions would be even better. While > this means we need per-arch code it also means the code is a lot easier > to understand, and we don't add odd public APIs. I'm with Christoph here. Please preserve my arch_mutex_fast_lock/arch_mutex_fast_unlock helpers. I did it that way because the most important thing they bring is flexibility where it is needed, i.e. in architecture-specific implementations. And done that way the architecture-specific part is well abstracted with the minimum semantics, allowing flexibility in the implementation. I insist on that because, even if ARM currently relies on the atomic swap behavior, on ARMv6 at least this can be improved even further, but a special implementation which is neither a fully qualified atomic decrement nor an atomic swap is needed. That's why I insist that you should keep my arch_mutex_fast_lock and friends (rename them if you wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely. Nicolas ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 15:34 ` Nicolas Pitre @ 2005-12-22 15:40 ` Ingo Molnar 2005-12-22 16:32 ` Nicolas Pitre 2005-12-22 17:20 ` Christoph Hellwig 2005-12-22 17:33 ` Steven Rostedt 2 siblings, 1 reply; 62+ messages in thread From: Ingo Molnar @ 2005-12-22 15:40 UTC (permalink / raw) To: Nicolas Pitre Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen, Russell King * Nicolas Pitre <nico@cam.org> wrote: > [...] on ARMv6 at least this can be improved even further, but a > special implementation which is neither a fully qualified atomic > decrement nor an atomic swap is needed. [...] i'm curious, how would this ARMv6 solution look like, and what would be the advantages over the atomic swap based variant? Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 15:40 ` Ingo Molnar @ 2005-12-22 16:32 ` Nicolas Pitre 2005-12-22 16:44 ` Ingo Molnar 0 siblings, 1 reply; 62+ messages in thread From: Nicolas Pitre @ 2005-12-22 16:32 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen, Russell King On Thu, 22 Dec 2005, Ingo Molnar wrote: > > * Nicolas Pitre <nico@cam.org> wrote: > > > [...] on ARMv6 at least this can be improved even further, but a > > special implementation which is neither a fully qualified atomic > > decrement nor an atomic swap is needed. [...] > > i'm curious, how would this ARMv6 solution look like, and what would be > the advantages over the atomic swap based variant? On ARMv6 (which can be SMP) the atomic swap instruction is much more costly than on former ARM versions. It however has ll/sc instructions which allow it to implement a true atomic decrement, and the lock fast path would look like: __mutex_lock: 1: ldrex r1, [r0] sub r1, r1, #1 strex r2, r1, [r0] cmp r2, #0 bne 1b cmp r1, #0 moveq pc, lr b __mutex_lock_failed With my ARMv6 implementation of arch_mutex_fast_lock() then it would become: __mutex_lock: ldrex r1, [r0] sub r1, r1, #1 strex r2, r1, [r0] orrs r0, r2, r1 moveq pc, lr b __mutex_lock_failed This code sequence is not possible with any of the standard atomic primitives. And the above would work even for the arch_mutex_fast_lock_retval() used for mutex_lock_interruptible. Giving complete freedom to the architecture in implementing those could benefit architectures where disabling preemption while performing the lock instead of attempting any true atomic operation would be cheaper (Linus argued about that previously). And if we are UP with preemption disabled then the lock could possibly be simplified even further if someone dares to. 
But those implementation issues don't belong in the common core code at all IMHO. Nicolas ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 16:32 ` Nicolas Pitre @ 2005-12-22 16:44 ` Ingo Molnar 2005-12-22 16:58 ` Russell King 2005-12-22 16:58 ` Nicolas Pitre 0 siblings, 2 replies; 62+ messages in thread From: Ingo Molnar @ 2005-12-22 16:44 UTC (permalink / raw) To: Nicolas Pitre Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen, Russell King * Nicolas Pitre <nico@cam.org> wrote: > > i'm curious, how would this ARMv6 solution look like, and what would be > > the advantages over the atomic swap based variant? > > On ARMv6 (which can be SMP) the atomic swap instruction is much more > costly than on former ARM versions. It however has ll/sc instructions > which allows it to implement a true atomic decrement, and the lock > fast path would look like: [...] but couldnt you implement atomic_dec_return() with the ll/sc instructions? Something like: repeat: ldrex r1, [r0] sub r1, r1, #1 strex r2, r1, [r0] orrs r0, r2, r1 jneq repeat (shot-in-the-dark guess at ARMv6 assembly) hm? Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 16:44 ` Ingo Molnar @ 2005-12-22 16:58 ` Russell King 2005-12-22 21:04 ` Ingo Molnar 2005-12-22 16:58 ` Nicolas Pitre 1 sibling, 1 reply; 62+ messages in thread From: Russell King @ 2005-12-22 16:58 UTC (permalink / raw) To: Ingo Molnar Cc: Nicolas Pitre, Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen On Thu, Dec 22, 2005 at 05:44:15PM +0100, Ingo Molnar wrote: > * Nicolas Pitre <nico@cam.org> wrote: > > > i'm curious, how would this ARMv6 solution look like, and what would be > > > the advantages over the atomic swap based variant? > > > > On ARMv6 (which can be SMP) the atomic swap instruction is much more > > costly than on former ARM versions. It however has ll/sc instructions > > which allows it to implement a true atomic decrement, and the lock > > fast path would look like: [...] > > but couldnt you implement atomic_dec_return() with the ll/sc > instructions? Something like: > > repeat: > ldrex r1, [r0] > sub r1, r1, #1 > strex r2, r1, [r0] > orrs r0, r2, r1 > jneq repeat > > (shot-in-the-dark guess at ARMv6 assembly) atomic_dec_return() would be: 1: ldrex r1, [r0] sub r1, r1, #1 strex r2, r1, [r0] teq r2, #0 bne 1b @ result in r1 But that's not really the main point Nico's making. Yes, on ARMv6 there is little difference. However, ARMv6 is _not_ mainstream yet. The previous generation which do not have this is currently mainstream. When it comes down to it, unlike x86 land where new CPU features are taken up very quickly, the take up of new features on ARM CPUs is a lot slower - it's a matter of years not a matter of months. Therefore, we can expect ARMv5 architecture CPUs to be mainstream at least for the next year or two, and these are the ones which we should optimise for. Nico's point still stands though - and I'd like to ask a more direct question. 
There is an efficient implementation for ARMv5 CPUs which it appears we're being denied the ability to use. Given that this has been argued for using clear technical arguments over the last two days, I have yet to see any explanation of why you're choosing to ignore it. Could you please explain why? -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 16:58 ` Russell King @ 2005-12-22 21:04 ` Ingo Molnar 2005-12-22 21:26 ` Russell King 2005-12-22 21:27 ` Nicolas Pitre 0 siblings, 2 replies; 62+ messages in thread From: Ingo Molnar @ 2005-12-22 21:04 UTC (permalink / raw) To: Nicolas Pitre, Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen * Russell King <rmk+lkml@arm.linux.org.uk> wrote: > > but couldnt you implement atomic_dec_return() with the ll/sc > > instructions? Something like: > > > > repeat: > > ldrex r1, [r0] > > sub r1, r1, #1 > > strex r2, r1, [r0] > > orrs r0, r2, r1 > > jneq repeat > > > > (shot-in-the-dark guess at ARMv6 assembly) > > atomic_dec_return() would be: > > 1: ldrex r1, [r0] > sub r1, r1, #1 > strex r2, r1, [r0] > teq r2, #0 > bne 1b > @ result in r1 > > But that's not really the main point Nico's making. Yes, on ARMv6 > there is little difference. However, ARMv6 is _not_ mainstream yet. > The previous generation which do not have this is currently > mainstream. i think there is some miscommunication here. I am actually on your side, and i spent all my day enabling the best mutex variant on both the v5 and v6 version of your CPU. What we were doing was to discuss the details of getting there. So dont shoot the messenger, ok? This is Nico's point i replied to: > > [...] on ARMv6 at least this can be improved even further, but a > > special implementation which is neither a fully qualified atomic > > decrement nor an atomic swap is needed. [...] and this was my reply: > i'm curious, how would this ARMv6 solution look like, and what would > be the advantages over the atomic swap based variant? to which Nico replied with an ll/sc implementation, which very much looked like atomic_dec_return(). That's all i said. Nobody is trying to shut out anything from anywhere, and certainly not me. 
I am _enabling_ your stuff. I'm just trying to find the most maintainable and still most flexible solution. > Nico's point still stands though - and I'd like to ask a more direct > question. There is an efficient implementation for ARMv5 CPUs which > it appears we're being denied the ability to use. do you realize that i've enabled this efficient implementation via CONFIG_MUTEX_XCHG_ALGORITHM? Do you realize that you _dont_ have to use extremely heavy IRQ-disabling ops on ARM to enable mutexes? Do you realize that what we are down to now are 1-2 instructions of differences? The discussion was not about ARMv5 at all. The discussion was about the claim that 'ARMv6 is somehow magical that it needs its own stuff'. No it isnt magical, it's a sane CPU that can implement a sane atomic_dec_return(), or even better, a sane atomic_*_call_if_*() primitive. Or whatever primitive we end up having - i dont mind if it's called __mutex_whatever, as long as it has a _well defined_ meaning. what i _DONT_ want is some over-opaque per-arch thing that will again escalate into the same situation as semaphores: 23 different implementations nobody is able to change at once, nobody is able to add features or debugging to, and by today there is probably no single person on this planet who knows all 23 of them to begin with. if you sense me trying to avoid something then that's it: ambiguities in the arch-level implementation, because they end up stifling the generic code. Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 21:04 ` Ingo Molnar @ 2005-12-22 21:26 ` Russell King 2005-12-22 21:27 ` Nicolas Pitre 1 sibling, 0 replies; 62+ messages in thread From: Russell King @ 2005-12-22 21:26 UTC (permalink / raw) To: Ingo Molnar Cc: Nicolas Pitre, Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen On Thu, Dec 22, 2005 at 10:04:46PM +0100, Ingo Molnar wrote: > i think there is some miscommunication here. Quite possibly. TBH there's so much waffle generated on the mutex stuff over the last few days that I've probably lost track of where everyone is on it. Sorry. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 21:04 ` Ingo Molnar 2005-12-22 21:26 ` Russell King @ 2005-12-22 21:27 ` Nicolas Pitre 2005-12-22 21:37 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre ` (2 more replies) 1 sibling, 3 replies; 62+ messages in thread From: Nicolas Pitre @ 2005-12-22 21:27 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen On Thu, 22 Dec 2005, Ingo Molnar wrote: > The discussion was not about ARMv5 at all. The discussion was about the > claim that 'ARMv6 is somehow magical that it needs its own stuff'. No it > isnt magical, it's a sane CPU that can implement a sane > atomic_dec_return(), or even better, a sane atomic_*_call_if_*() > primitive. Or whatever primitive we end up having - i dont mind if it's > called __mutex_whatever, as long as it has a _well defined_ meaning. I'd like to point out that, while atomic_dec_call_if_* is really nice on i386, it is probably only good for i386 since no other architecture will be able to provide a better implementation than what can be done with atomic_dec_return() anyway. Yet that IMHO overloaded atomic_dec_call_if_* stuff appears in core code. But like for i386, those other architectures might be able to do some other tricks to achieve the same needed semantics. My ARMv6 is one of them (and no it doesn't strictly follow the semantics of atomic_dec_call_if_*). Are you willing to add more #if defined(CONFIG_MUTEX_FOO_ALGO) in the core code as time goes by? I hope not. 
> what i _DONT_ want is some over-opaque per-arch thing that will again > escalate into the same situation as semaphores: 23 different > implementations nobody is able to change at once, nobody is able to add > features or debugging to, and by today there is probably no single > person on this planet who knows all 23 of them to begin with. I agree completely with you. But what I don't want is core code needlessly restricting architecture-specific implementations for short and well defined fast paths. Maybe that's what we must agree upon: a well defined fast path helper for mutexes? Nicolas ^ permalink raw reply [flat|nested] 62+ messages in thread
* [patch 1/2] mutex subsystem: basic per arch fast path primitives 2005-12-22 21:27 ` Nicolas Pitre @ 2005-12-22 21:37 ` Nicolas Pitre 2005-12-22 21:53 ` Christoph Hellwig 2005-12-22 21:40 ` [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines Nicolas Pitre 2005-12-22 21:54 ` [patch 0/9] mutex subsystem, -V4 Ingo Molnar 2 siblings, 1 reply; 62+ messages in thread From: Nicolas Pitre @ 2005-12-22 21:37 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen On Thu, 22 Dec 2005, Nicolas Pitre wrote: > On Thu, 22 Dec 2005, Ingo Molnar wrote: > > > what i _DONT_ want is some over-opaque per-arch thing that will again > > escallate into the same situation as semaphores: 23 different > > implementations nobody is able to change at once, nobody is able to add > > features or debugging to, and by today there's probably is no single > > person on this planet who knows all 23 of them to begin with. > > I agree completely with you. But what I don't want is core code > needlessly restricting architecture specific implementation for short > and well defined fast paths. Maybe that's where we must agree upon: a > well defined fast path helper for mutexes? OK, let's look at actual code please. Do you have anything against this and the following patches? This patch adds fast path mutex primitives for all architectures. It replaces your atomic_*_call_if_* patches that didn't seem to please everybody, mutex usage notwithstanding. Comments? 
Index: linux-2.6/include/asm-alpha/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-alpha/mutex.h @@ -0,0 +1 @@ +#include <asm-generic/mutex.h> Index: linux-2.6/include/asm-arm/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-arm/mutex.h @@ -0,0 +1 @@ +#include <asm-generic/mutex.h> Index: linux-2.6/include/asm-arm26/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-arm26/mutex.h @@ -0,0 +1 @@ +#include <asm-generic/mutex.h> Index: linux-2.6/include/asm-cris/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-cris/mutex.h @@ -0,0 +1 @@ +#include <asm-generic/mutex.h> Index: linux-2.6/include/asm-frv/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-frv/mutex.h @@ -0,0 +1 @@ +#include <asm-generic/mutex.h> Index: linux-2.6/include/asm-generic/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-generic/mutex.h @@ -0,0 +1,77 @@ +/* + * asm-generic/mutex.h + * + * Generic wrappers for architecture specific low-level mutex + * fast path locking and unlocking. Each architecture is welcome + * to provide optimized versions for those. + */ + +#ifndef _ASM_GENERIC_MUTEX_H +#define _ASM_GENERIC_MUTEX_H + +#include <asm/atomic.h> + +/** + * __mutex_fastpath_lock - lock mutex and call function if already locked + * @v: pointer of type atomic_t + * @contention_fn: function to call if v was already locked + * + * Atomically locks @v and calls a function if @v was already locked. + * When @v == 1 it is unlocked, <= 0 means locked. 
+ */ +#define __mutex_fastpath_lock(v, contention_fn) \ +do { \ + if (atomic_xchg(v, 0) != 1) \ + contention_fn(v); \ +} while (0) + +/** + * __mutex_fastpath_lock_retval - lock mutex and call function if already locked + * @v: pointer of type atomic_t + * @contention_fn: function to call if v was already locked + * + * Atomically locks @v and calls a function if @v was already locked. + * When @v == 1 it is unlocked, <= 0 means locked. + * + * It also returns 0 if the lock was successful (@v wasn't locked), otherwise + * it returns what @contention_fn returned. + */ +#define __mutex_fastpath_lock_retval(v, contention_fn) \ +({ \ + int __retval = 0; \ + if (atomic_xchg(v, 0) != 1) \ + __retval = contention_fn(v); \ + __retval; \ +}) + +/** + * __mutex_fastpath_unlock - unlock and call function if contended + * @v: pointer of type atomic_t + * @contention_fn: function to call if v was contended + * + * Atomically unlocks @v and calls a function if @v was contended. + * When @v == 1 it is unlocked, 0 it is locked, any negative value means + * locked with contention. + * + * If @v was contended, its new value is implementation defined (it can be + * either locked or unlocked). The contention function (slow path) should + * use __mutex_slowpath_needs_unlock() to determine if the mutex needs + * manual unlocking. + */ +#define __mutex_fastpath_unlock(v, contention_fn) \ +do { \ + if (atomic_xchg(v, 1) != 0) \ + contention_fn(v); \ +} while (0) + +/** + * __mutex_slowpath_needs_unlock - tell the state of a contended unlock + * + * This is meant to be called from any contention function passed to + * __mutex_fastpath_unlock(). It tells the contention (slow path) function + * whether or not it has to unlock the mutex manually after unlocking a + * contended lock was attempted. 
+ */ +#define __mutex_slowpath_needs_unlock() (0) /* xchg always unlocks */ + +#endif Index: linux-2.6/include/asm-h8300/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-h8300/mutex.h @@ -0,0 +1 @@ +#include <asm-generic/mutex.h> Index: linux-2.6/include/asm-i386/mutex.h =================================================================== --- /dev/null +++ linux-2.6/include/asm-i386/mutex.h @@ -0,0 +1,70 @@ +/* + * asm-i386/mutex.h + * + * Optimized i386 low-level mutex fast path locking and unlocking. + */ + +#ifndef _ASM_MUTEX_H +#define _ASM_MUTEX_H + +#include <asm/atomic.h> + +/* + * Please look into asm-generic/mutex.h for a description of semantics + * expected for those. + */ + +#define __mutex_fastpath_lock(v, contention_fn) \ +do { \ + fastcall void (*__tmp)(atomic_t *) = contention_fn; \ + unsigned int dummy; \ + \ + (void)__tmp; \ + typecheck(atomic_t *, v); \ + \ + __asm__ __volatile__( \ + LOCK "decl (%%eax)\n" \ + "js 2f\n" \ + "1:\n" \ + LOCK_SECTION_START("") \ + "2: call "#contention_fn"\n\t" \ + "jmp 1b\n" \ + LOCK_SECTION_END \ + :"=a"(dummy) \ + :"a" (v) \ + :"memory", "ecx", "edx"); \ +} while (0) + +#define __mutex_fastpath_lock_retval(v, contention_fn) \ +({ \ + int __retval = 0; \ + if (unlikely(atomic_dec_return(v) < 0)) \ + __retval = contention_fn(v); \ + __retval; \ +}) + +#define __mutex_fastpath_unlock(v, contention_fn) \ +do { \ + fastcall void (*__tmp)(atomic_t *) = contention_fn; \ + unsigned int dummy; \ + \ + (void)__tmp; \ + typecheck(atomic_t *, v); \ + \ + __asm__ __volatile__( \ + LOCK "incl (%%eax)\n" \ + "jle 2f\n" \ + "1:\n" \ + LOCK_SECTION_START("") \ + "2: call "#contention_fn"\n\t" \ + "jmp 1b\n" \ + LOCK_SECTION_END \ + :"=a" (dummy) \ + :"a" (v) \ + :"memory", "ecx", "edx"); \ +} while (0) + +/* In the contended case, the unlock doesn't actually unlock */ +#define __mutex_slowpath_needs_unlock() (1) + +#endif Index: linux-2.6/include/asm-ia64/mutex.h 
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ia64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m32r/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m32r/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m68k/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m68k/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m68knommu/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m68knommu/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-mips/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-mips/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-parisc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-parisc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-powerpc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-powerpc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-ppc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ppc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-ppc64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ppc64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-s390/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-s390/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sh/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sh/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sh64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sh64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sparc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sparc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sparc64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sparc64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-um/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-um/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-v850/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-v850/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -0,0 +1,72 @@
+/*
+ * asm-x86_64/mutex.h
+ *
+ * Optimized x86_64 low-level mutex fast path locking and unlocking.
+ */
+
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+
+#include <asm/atomic.h>
+
+/*
+ * Please look into asm-generic/mutex.h for a description of semantics
+ * expected for those.
+ */
+
+#define __mutex_fastpath_lock(v, contention_fn)			\
+do {								\
+	fastcall void (*__tmp)(atomic_t *) = contention_fn;	\
+	unsigned long dummy;					\
+								\
+	(void)__tmp;						\
+	typecheck(atomic_t *, v);				\
+								\
+	__asm__ __volatile__(					\
+		LOCK "decl (%%rdi)\n"				\
+		"js 2f\n"					\
+		"1:\n"						\
+		LOCK_SECTION_START("")				\
+		"2: call "#contention_fn"\n\t"			\
+		"jmp 1b\n"					\
+		LOCK_SECTION_END				\
+		:"=D" (dummy)					\
+		:"D" (v)					\
+		:"rax", "rsi", "rdx", "rcx",			\
+		 "r8", "r9", "r10", "r11", "memory");		\
+} while (0)
+
+#define __mutex_fastpath_lock_retval(v, contention_fn)		\
+({								\
+	int __retval = 0;					\
+	if (unlikely(atomic_dec_return(v) < 0))			\
+		__retval = contention_fn(v);			\
+	__retval;						\
+})
+
+#define __mutex_fastpath_unlock(v, contention_fn)		\
+do {								\
+	fastcall void (*__tmp)(atomic_t *) = contention_fn;	\
+	unsigned long dummy;					\
+								\
+	(void)__tmp;						\
+	typecheck(atomic_t *, v);				\
+								\
+	__asm__ __volatile__(					\
+		LOCK "incl (%%rdi)\n"				\
+		"jle 2f\n"					\
+		"1:\n"						\
+		LOCK_SECTION_START("")				\
+		"2: call "#contention_fn"\n\t"			\
+		"jmp 1b\n"					\
+		LOCK_SECTION_END				\
+		:"=D" (dummy)					\
+		:"D" (v)					\
+		:"rax", "rsi", "rdx", "rcx",			\
+		 "r8", "r9", "r10", "r11", "memory");		\
+} while (0)
+
+/* In the contended case, the unlock doesn't actually unlock */
+#define __mutex_slowpath_needs_unlock()	(1)
+
+#endif
Index: linux-2.6/include/asm-xtensa/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-xtensa/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 1/2] mutex subsystem: basic per arch fast path primitives
  2005-12-22 21:37 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
@ 2005-12-22 21:53 ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2005-12-22 21:53 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
    David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen

> OK, let's look at actual code please.  Do you have anything against this
> and the following patches?
>
> This patch adds fast path mutex primitives for all architectures.  It
> replaces your atomic_*_call_if_* patches that didn't seem to please
> everybody, mutex usage notwithstanding.

Thanks Nico, this is exactly what I wanted to see.  One question about
the naming of the arch helpers: do we really need the fastpath in
there, or just __mutex_*?

^ permalink raw reply	[flat|nested] 62+ messages in thread
* [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines
  2005-12-22 21:27 ` Nicolas Pitre
  2005-12-22 21:37 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
@ 2005-12-22 21:40 ` Nicolas Pitre
  2005-12-22 21:54 ` [patch 0/9] mutex subsystem, -V4 Ingo Molnar
  2 siblings, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 21:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
    David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen

This switches the core over to the architecture defined fast path locking
primitives.  It also adds __mutex_lock_interruptible_noinline and friends.

Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -266,13 +266,8 @@ __mutex_lock_interruptible_nonatomic(str
  */
 int fastcall mutex_trylock(struct mutex *lock)
 {
-#ifdef __HAVE_ARCH_CMPXCHG
 	if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
 		return 1;
-#else
-	if (atomic_dec_return(&lock->count) == 0)
-		return 1;
-#endif
 	return 0;
 }

@@ -316,13 +311,14 @@ static inline void __mutex_unlock_nonato
 	 * Waiters take care of themselves and stay in flight until
 	 * necessary.
 	 *
-	 * (in the xchg based implementation the fastpath has set the
+	 * (depending on the implementation the fastpath has set the
 	 * count to 1 already, so we must not set it here, because we
 	 * dont own the lock anymore. In the debug case we must set
 	 * the lock inside the spinlock.)
	 */
-#if !defined(CONFIG_MUTEX_XCHG_ALGORITHM) && !defined(CONFIG_DEBUG_MUTEXES)
-	atomic_set(&lock->count, 1);
+#if !defined(CONFIG_DEBUG_MUTEXES)
+	if (__mutex_slowpath_needs_unlock())
+		atomic_set(&lock->count, 1);
 #endif
 	spin_lock_mutex(&lock->wait_lock);
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -349,21 +345,12 @@ static inline void __mutex_unlock_nonato
 static __sched void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count));

 /*
- * Some architectures do not have fast dec_and_test atomic primitives,
- * for them we are providing an atomic_xchg() based mutex implementation,
- * if they enable CONFIG_MUTEX_XCHG_ALGORITHM.
- *
  * The locking fastpath is the 1->0 transition from 'unlocked' into
  * 'locked' state:
  */
 static inline void __mutex_lock_atomic(struct mutex *lock)
 {
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
-	if (unlikely(atomic_xchg(&lock->count, 0) != 1))
-		__mutex_lock_noinline(&lock->count);
-#else
-	atomic_dec_call_if_negative(&lock->count, __mutex_lock_noinline);
-#endif
+	__mutex_fastpath_lock(&lock->count, __mutex_lock_noinline);
 }

 /*
@@ -383,16 +370,23 @@ static inline void __mutex_lock(struct m
 	__mutex_lock_atomic(lock);
 }

+static __sched int FASTCALL(__mutex_lock_interruptible_noinline(atomic_t *lock_count));
+
+static int __mutex_lock_interruptible_atomic(struct mutex *lock)
+{
+	return __mutex_fastpath_lock_retval(&lock->count, __mutex_lock_interruptible_noinline);
+}
+
+static int fastcall __sched __mutex_lock_interruptible_noinline(atomic_t *lock_count)
+{
+	struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+	return __mutex_lock_interruptible_nonatomic(lock);
+}
+
 static inline int __mutex_lock_interruptible(struct mutex *lock)
 {
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
-	if (unlikely(atomic_xchg(&lock->count, 0) != 1))
-		return __mutex_lock_interruptible_nonatomic(lock);
-#else
-	if (unlikely(atomic_dec_return(&lock->count) < 0))
-		return __mutex_lock_interruptible_nonatomic(lock);
-#endif
-	return 0;
+	return __mutex_lock_interruptible_atomic(lock);
 }
 static void __sched FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count));

@@ -403,12 +397,7 @@ static void __sched FASTCALL(__mutex_unl
  */
 static inline void __mutex_unlock_atomic(struct mutex *lock)
 {
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
-	if (unlikely(atomic_xchg(&lock->count, 1) != 0))
-		__mutex_unlock_noinline(&lock->count);
-#else
-	atomic_inc_call_if_nonpositive(&lock->count, __mutex_unlock_noinline);
-#endif
+	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_noinline);
 }

 static void fastcall __sched __mutex_unlock_noinline(atomic_t *lock_count)
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -12,6 +12,7 @@
  */
 #include <linux/list.h>
 #include <linux/spinlock_types.h>
+#include <asm/mutex.h>
 #include <asm/atomic.h>

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 21:27 ` Nicolas Pitre 2005-12-22 21:37 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre 2005-12-22 21:40 ` [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines Nicolas Pitre @ 2005-12-22 21:54 ` Ingo Molnar 2 siblings, 0 replies; 62+ messages in thread From: Ingo Molnar @ 2005-12-22 21:54 UTC (permalink / raw) To: Nicolas Pitre Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt, Andi Kleen * Nicolas Pitre <nico@cam.org> wrote: > I'd like to point out that, while atomic_dec_call_if_* is really nice > on i386, it is probably only good for i386 since no other architecture > will be able to provide a better implementation than what can be done > with atomic_dec_return() anyway. Yet that IMHO overloaded > atomic_dec_call_if_* stuff appears in core code. i'd have no problem with going to atomic_dec_return() on i386 too. atomic_dec_call_if_*() is just working around a gcc limitation: there's no way to pass a condition from inline assembly into C code, it has to go over a register. But the difference is small, just 1 extra 'sete' instruction. So x86 would be just fine with atomic_dec_return() too. anyway, it seems like everyone would like to hack a few instructions from the fastpath, so we might as well go with your approach. As long as the state is well-defined (and it has to be well-defined because it all hits the common slowpath), and the function names are descriptive of what happens, it's fine with me. Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 16:44 ` Ingo Molnar
  2005-12-22 16:58 ` Russell King
@ 2005-12-22 16:58 ` Nicolas Pitre
  1 sibling, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 16:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
    David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
    Andi Kleen, Russell King

On Thu, 22 Dec 2005, Ingo Molnar wrote:

>
> * Nicolas Pitre <nico@cam.org> wrote:
>
> > > i'm curious, how would this ARMv6 solution look like, and what would be
> > > the advantages over the atomic swap based variant?
> >
> > On ARMv6 (which can be SMP) the atomic swap instruction is much more
> > costly than on former ARM versions.  It however has ll/sc instructions
> > which allow it to implement a true atomic decrement, and the lock
> > fast path would look like: [...]
>
> but couldnt you implement atomic_dec_return() with the ll/sc
> instructions?  Something like:

NO.  My first example was atomic_dec_return based.  The second is
lighter and fulfills the semantics of arch_mutex_fast_lock() but is not
a common atomic primitive.

Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 15:34 ` Nicolas Pitre
  2005-12-22 15:40 ` Ingo Molnar
@ 2005-12-22 17:20 ` Christoph Hellwig
  2005-12-22 17:33 ` Steven Rostedt
  2 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2005-12-22 17:20 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
    David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
    Andi Kleen, Russell King

On Thu, Dec 22, 2005 at 10:34:18AM -0500, Nicolas Pitre wrote:
> I'm with Christoph here.  Please preserve my
> arch_mutex_fast_lock/arch_mutex_fast_unlock helpers.  I did it that way
> because the most important thing they bring is flexibility where it is
> needed i.e. in architecture specific implementations.  And done that way
> the architecture specific part is well abstracted with the minimum
> semantics allowing flexibility in the implementation.
>
> I insist on that because, even if ARM currently relies on the atomic
> swap behavior, on ARMv6 at least this can be improved even further, but
> a special implementation which is neither a fully qualified atomic
> decrement nor an atomic swap is needed.  That's why I insist that you
> should keep my arch_mutex_fast_lock and friends (rename them if you
> wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.

I think one of us should do a new version based on that scheme and
without all the new atomic helpers, then we can compare it against the
current version.  I'll try to once I get some time.

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 15:34 ` Nicolas Pitre
  2005-12-22 15:40 ` Ingo Molnar
  2005-12-22 17:20 ` Christoph Hellwig
@ 2005-12-22 17:33 ` Steven Rostedt
  2005-12-22 18:24 ` Nicolas Pitre
  2 siblings, 1 reply; 62+ messages in thread
From: Steven Rostedt @ 2005-12-22 17:33 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
    David Howells, Alan Cox, Benjamin LaHaise, Andi Kleen, Russell King

On Thu, 2005-12-22 at 10:34 -0500, Nicolas Pitre wrote:
> > Actually just having asm/mutex.h implement the fastpath per-arch and
> > get rid of all the oddball atomic.h additions would be even better.
> > While this means we need per-arch code it also means the code is a lot
> > easier understandable, and we don't add odd public APIs.
>
> I'm with Christoph here.  Please preserve my
> arch_mutex_fast_lock/arch_mutex_fast_unlock helpers.  I did it that way
> because the most important thing they bring is flexibility where it is
> needed i.e. in architecture specific implementations.  And done that way
> the architecture specific part is well abstracted with the minimum
> semantics allowing flexibility in the implementation.
>
> I insist on that because, even if ARM currently relies on the atomic
> swap behavior, on ARMv6 at least this can be improved even further, but
> a special implementation which is neither a fully qualified atomic
> decrement nor an atomic swap is needed.  That's why I insist that you
> should keep my arch_mutex_fast_lock and friends (rename them if you
> wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.

Not sure how well this is accepted, but would it be acceptable to have
mutex_lock and friends covered with the (weak) attribute?

ie.

  void fastcall __sched __attribute__((weak)) mutex_lock(struct mutex *lock)

Then let the archs override them if they wish?
You would just need to make an extra slow path mutex visible to the
archs.

-- Steve

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 17:33 ` Steven Rostedt
@ 2005-12-22 18:24 ` Nicolas Pitre
  0 siblings, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 18:24 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
    David Howells, Alan Cox, Benjamin LaHaise, Andi Kleen, Russell King

On Thu, 22 Dec 2005, Steven Rostedt wrote:

> On Thu, 2005-12-22 at 10:34 -0500, Nicolas Pitre wrote:
>
> > > Actually just having asm/mutex.h implement the fastpath per-arch and
> > > get rid of all the oddball atomic.h additions would be even better.
> > > While this means we need per-arch code it also means the code is a lot
> > > easier understandable, and we don't add odd public APIs.
> >
> > I'm with Christoph here.  Please preserve my
> > arch_mutex_fast_lock/arch_mutex_fast_unlock helpers.  I did it that way
> > because the most important thing they bring is flexibility where it is
> > needed i.e. in architecture specific implementations.  And done that way
> > the architecture specific part is well abstracted with the minimum
> > semantics allowing flexibility in the implementation.
> >
> > I insist on that because, even if ARM currently relies on the atomic
> > swap behavior, on ARMv6 at least this can be improved even further, but
> > a special implementation which is neither a fully qualified atomic
> > decrement nor an atomic swap is needed.  That's why I insist that you
> > should keep my arch_mutex_fast_lock and friends (rename them if you
> > wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.
>
> Not sure how well this is accepted, but would it be acceptable to have
> mutex_lock and friends covered with the (weak) attribute?
>
> ie.
>
>   void fastcall __sched __attribute__((weak)) mutex_lock(struct mutex *lock)
>
> Then let the archs override them if they wish?
Actually, my next request would be for architectures to inline the fast
path if they see benefit.  On ARM, given the tight fast path allowed by
mutexes, it is indeed beneficial to inline them (the last patch of my
latest series does just that).  So weak symbols are not really useful in
that case.

Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 11:41 [patch 0/9] mutex subsystem, -V4 Ingo Molnar
  2005-12-22 11:53 ` Christoph Hellwig
@ 2005-12-22 11:54 ` Andrew Morton
  2005-12-22 12:20 ` Ingo Molnar
  1 sibling, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2005-12-22 11:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
    alan, bcrl, rostedt, hch, ak, rmk+lkml

Ingo Molnar <mingo@elte.hu> wrote:
>
> this is the -V4 of the mutex subsystem patch-queue.
>

I've only been following this with half an eye, with the apparently
erroneous expectation that future versions of the patchset would come
with some explanation of why on earth we'd want to merge all this stuff
into the kernel.

We've seen some benchmarks which indicate that there are performance
gains to be had if there's a lot of lock contention, but isn't it the
case that this is because of the semi-extra wake_up() in down()?  Has
anyone tried to redo existing semaphores so they no longer have that
disadvantage?

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 11:54 ` Andrew Morton
@ 2005-12-22 12:20 ` Ingo Molnar
  2005-12-22 13:07 ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 12:20 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
    alan, bcrl, rostedt, hch, ak, rmk+lkml

* Andrew Morton <akpm@osdl.org> wrote:

> I've only been following this with half an eye, with the apparently
> erroneous expectation that future versions of the patchset would come
> with some explanation of why on earth we'd want to merge all this
> stuff into the kernel.

in my initial announcement i listed 10 good reasons to do so, and they
are still true:

  http://people.redhat.com/mingo/generic-mutex-subsystem/mutex-announce.txt

[...] But firstly, i'd like to answer the most important question:

   "Why the hell do we need a new mutex subsystem, and what's wrong
    with semaphores??"

This question is clearly nagging most of the doubters, so i'm listing
my answers first, before fully explaining the patchset.  (For more
implementational details, see the subsequent sections.)

here are the top 10 reasons why i think the generic mutex code should
be considered for upstream integration:

 - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes,
   'struct mutex' is 16 bytes.  A smaller structure size means less RAM
   footprint, and better CPU-cache utilization.

 - tighter code.  On x86 i get the following .text sizes when
   switching all mutex-alike semaphores in the kernel to the mutex
   subsystem:

        text    data     bss     dec     hex filename
     3280380  868188  396860 4545428  455b94 vmlinux-semaphore
     3255329  865296  396732 4517357  44eded vmlinux-mutex

   that's 25051 bytes of code saved, or a 0.76% win - off the hottest
   codepaths of the kernel.  (The .data savings are 2892 bytes, or
   0.33%.)  Smaller code means better icache footprint, which is one of
   the major optimization goals in the Linux kernel currently.
 - the mutex subsystem is faster and has superior scalability for
   contended workloads.  On an 8-way x86 system, running a mutex-based
   kernel and testing creat+unlink+close (of separate, per-task files)
   in /tmp with 16 parallel tasks, the average number of ops/sec is:

   Semaphores:                    Mutexes:

   $ ./test-mutex V 16 10         $ ./test-mutex V 16 10
   8 CPUs, running 16 tasks.      8 CPUs, running 16 tasks.
   checking VFS performance.      checking VFS performance.
   avg loops/sec:      34713      avg loops/sec:      84153
   CPU utilization:      63%      CPU utilization:      22%

   i.e. in this workload, the mutex based kernel was 2.4 times faster
   than the semaphore based kernel, _and_ it also had 2.8 times less
   CPU utilization.  (In terms of 'ops per CPU cycle', the semaphore
   kernel performed 551 ops/sec per 1% of CPU time used, while the
   mutex kernel performed 3825 ops/sec per 1% of CPU time used - it was
   6.9 times more efficient.)

   the scalability difference is visible even on a 2-way P4 HT box:

   Semaphores:                    Mutexes:

   $ ./test-mutex V 16 10         $ ./test-mutex V 16 10
   4 CPUs, running 16 tasks.      8 CPUs, running 16 tasks.
   checking VFS performance.      checking VFS performance.
   avg loops/sec:     127659      avg loops/sec:     181082
   CPU utilization:      100%     CPU utilization:      34%

   (the straight performance advantage of mutexes is 41%, the per-cycle
   efficiency of mutexes is 4.1 times better.)

 - there are no fastpath tradeoffs, the mutex fastpath is just as tight
   as the semaphore fastpath.  On x86, the locking fastpath is 2
   instructions:

     c0377ccb <mutex_lock>:
     c0377ccb:       f0 ff 08                lock decl (%eax)
     c0377cce:       78 0e                   js     c0377cde <.text.lock.mutex>
     c0377cd0:       c3                      ret

   the unlocking fastpath is equally tight:

     c0377cd1 <mutex_unlock>:
     c0377cd1:       f0 ff 00                lock incl (%eax)
     c0377cd4:       7e 0f                   jle    c0377ce5 <.text.lock.mutex+0x7>
     c0377cd6:       c3                      ret

 - the per-call-site inlining cost of the slowpath is cheaper and
   smaller than that of semaphores, by one instruction, because the
   mutex trampoline code does not do a "lea %0,%%eax" that the
   semaphore code does before calling __down_failed.
   The mutex subsystem uses out of line code currently so this makes
   only a small difference in .text size, but in case we want to inline
   mutexes, they will be cheaper than semaphores.

 - No wholesale or dangerous migration path.  The migration to mutexes
   is fundamentally opt-in, safe and easy: multiple type-based and
   .config based migration helpers are provided to make the migration
   to mutexes easy.  Migration is as finegrained as it gets, so
   robustness of the kernel or out-of-tree code should not be
   jeopardized at any stage.  The migration helpers can be eliminated
   once migration is completed, once the kernel has been separated into
   'mutex users' and 'semaphore users'.  Out-of-tree code automatically
   defaults to semaphore semantics, mutexes are not forced upon anyone,
   at any stage of the migration.

 - 'struct mutex' semantics are well-defined and are enforced if
   CONFIG_DEBUG_MUTEXES is turned on.  Semaphores on the other hand
   have virtually no debugging code or instrumentation.  The mutex
   subsystem checks and enforces the following rules:

   * - only one task can hold the mutex at a time
   * - only the owner can unlock the mutex
   * - multiple unlocks are not permitted
   * - recursive locking is not permitted
   * - a mutex object must be initialized via the API
   * - a mutex object must not be initialized via memset or copying
   * - task may not exit with mutex held
   * - memory areas where held locks reside must not be freed

   furthermore, there are also convenience features in the debugging
   code:

   * - uses symbolic names of mutexes, whenever they are printed in debug output
   * - point-of-acquire tracking, symbolic lookup of function names
   * - list of all locks held in the system, printout of them
   * - owner tracking
   * - detects self-recursing locks and prints out all relevant info
   * - detects multi-task circular deadlocks and prints out all affected
   *   locks and tasks (and only those tasks)

   we have extensive experience with the mutex debugging code in the
   -rt kernel, and it eases the debugging of mutex-related bugs
   considerably.  A handful of upstream bugs were found as well this
   way, and were contributed back to the vanilla kernel.  We do believe
   that improved debugging code is an important tool in improving the
   fast-paced upstream kernel's quality.

   a side-effect of the strict semantics is that mutexes are much
   easier to analyze on a static level.  E.g. Sparse could check the
   correctness of mutex users, further improving the kernel's quality.
   Also, the highest-level security and reliability validation
   techniques (and certification processes) involve static code
   analysis.

 - kernel/mutex.c is generic, and has minimal per-arch needs.  No new
   primitives have to be implemented to support spinlock-based generic
   mutexes.  Only 2 new atomic primitives have to be implemented for an
   architecture to support optimized, lockless generic mutexes.  In
   contrast, to implement semaphores on a new architecture, hundreds of
   lines of nontrivial (often assembly) code have to be written and
   debugged.

 - kernel/mutex.c is highly hackable.  New locking features can be
   implemented in C, and they carry over to every architecture.
   Extensive self-consistency debugging checks of the mutex
   implementation are done if CONFIG_DEBUG_MUTEXES is turned on.  I do
   think that hackability is the most important property of kernel
   code.

 - the generic mutex subsystem is also one more step towards enabling
   the fully preemptable -rt kernel.  Ok, this shouldnt bother the
   upstream kernel too much at the moment, but it's a personal driving
   force for me nevertheless ;-)

(NOTE: i consciously did not list 'Priority Inheritance' amongst the
reasons, because priority inheritance for blocking kernel locks would
be a misguided reason at best, and flat out wrong at worst.)

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 12:20 ` Ingo Molnar
@ 2005-12-22 13:07 ` Andrew Morton
  2005-12-22 13:23 ` Arjan van de Ven
  ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-22 13:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
    alan, bcrl, rostedt, hch, ak, rmk+lkml

Ingo Molnar <mingo@elte.hu> wrote:
>
> ...
>
> here are the top 10 reasons why i think the generic mutex code should
> be considered for upstream integration:

Appropriate for the changelog, please.

> - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes,
>   'struct mutex' is 16 bytes.  A smaller structure size means less RAM
>   footprint, and better CPU-cache utilization.

Because of the .sleepers thing.  Perhaps a revamped semaphore wouldn't
need that field?

> - tighter code.  On x86 i get the following .text sizes when
>   switching all mutex-alike semaphores in the kernel to the mutex
>   subsystem:
>
>        text    data     bss     dec     hex filename
>     3280380  868188  396860 4545428  455b94 vmlinux-semaphore
>     3255329  865296  396732 4517357  44eded vmlinux-mutex
>
>   that's 25051 bytes of code saved, or a 0.76% win - off the hottest
>   codepaths of the kernel.  (The .data savings are 2892 bytes, or
>   0.33%.)  Smaller code means better icache footprint, which is one of
>   the major optimization goals in the Linux kernel currently.

Why is the mutex-using kernel any smaller?  IOW: from where do these
savings come?

> - the mutex subsystem is faster and has superior scalability for
>   contended workloads.  On an 8-way x86 system, running a mutex-based
>   kernel and testing creat+unlink+close (of separate, per-task files)
>   in /tmp with 16 parallel tasks, the average number of ops/sec is:
>
>   Semaphores:                    Mutexes:
>
>   $ ./test-mutex V 16 10         $ ./test-mutex V 16 10
>   8 CPUs, running 16 tasks.      8 CPUs, running 16 tasks.
>   checking VFS performance.      checking VFS performance.
>   avg loops/sec:      34713      avg loops/sec:      84153
>   CPU utilization:      63%      CPU utilization:      22%
>
>   i.e. in this workload, the mutex based kernel was 2.4 times faster
>   than the semaphore based kernel, _and_ it also had 2.8 times less
>   CPU utilization.  (In terms of 'ops per CPU cycle', the semaphore
>   kernel performed 551 ops/sec per 1% of CPU time used, while the
>   mutex kernel performed 3825 ops/sec per 1% of CPU time used - it was
>   6.9 times more efficient.)
>
>   the scalability difference is visible even on a 2-way P4 HT box:
>
>   Semaphores:                    Mutexes:
>
>   $ ./test-mutex V 16 10         $ ./test-mutex V 16 10
>   4 CPUs, running 16 tasks.      8 CPUs, running 16 tasks.
>   checking VFS performance.      checking VFS performance.
>   avg loops/sec:     127659      avg loops/sec:     181082
>   CPU utilization:      100%     CPU utilization:      34%
>
>   (the straight performance advantage of mutexes is 41%, the per-cycle
>   efficiency of mutexes is 4.1 times better.)

Why is the mutex-using kernel more scalable?  Can semaphores be tuned
to offer the same scalability improvements?

> ...
>
> - the per-call-site inlining cost of the slowpath is cheaper and
>   smaller than that of semaphores, by one instruction, because the
>   mutex trampoline code does not do a "lea %0,%%eax" that the
>   semaphore code does before calling __down_failed.  The mutex
>   subsystem uses out of line code currently so this makes only a small
>   difference in .text size, but in case we want to inline mutexes,
>   they will be cheaper than semaphores.

Cannot the semaphore code be improved in the same manner?

> - No wholesale or dangerous migration path.  The migration to mutexes
>   is fundamentally opt-in, safe and easy: multiple type-based and
>   .config based migration helpers are provided to make the migration
>   to mutexes easy.  Migration is as finegrained as it gets, so
>   robustness of the kernel or out-of-tree code should not be
>   jeopardized at any stage.
>   The migration helpers can be eliminated once migration is completed,
>   once the kernel has been separated into 'mutex users' and 'semaphore
>   users'.  Out-of-tree code automatically defaults to semaphore
>   semantics, mutexes are not forced upon anyone, at any stage of the
>   migration.

IOW: a complete PITA, sorry.

> - 'struct mutex' semantics are well-defined and are enforced if
>   CONFIG_DEBUG_MUTEXES is turned on.  Semaphores on the other hand
>   have virtually no debugging code or instrumentation.  The mutex
>   subsystem checks and enforces the following rules:
>
>   * - only one task can hold the mutex at a time
>   * - only the owner can unlock the mutex
>   * - multiple unlocks are not permitted
>   * - recursive locking is not permitted
>   * - a mutex object must be initialized via the API
>   * - a mutex object must not be initialized via memset or copying
>   * - task may not exit with mutex held
>   * - memory areas where held locks reside must not be freed

Pretty much all of that could be added to semaphores-when-used-as-mutexes.
Without introducing a whole new locking mechanism.

>   furthermore, there are also convenience features in the debugging
>   code:
>
>   * - uses symbolic names of mutexes, whenever they are printed in debug output
>   * - point-of-acquire tracking, symbolic lookup of function names
>   * - list of all locks held in the system, printout of them
>   * - owner tracking
>   * - detects self-recursing locks and prints out all relevant info
>   * - detects multi-task circular deadlocks and prints out all affected
>   *   locks and tasks (and only those tasks)

Ditto, I expect.

> - the generic mutex subsystem is also one more step towards enabling
>   the fully preemptable -rt kernel.  Ok, this shouldnt bother the
>   upstream kernel too much at the moment, but it's a personal driving
>   force for me nevertheless ;-)

Actually that's the only compelling reason I've yet seen, sorry.
What is it about these mutexes which is useful to RT and why cannot that feature be provided by semaphores? Ingo, there appear to be quite a few straw-man arguments here. You're comparing a subsystem (semaphores) which obviously could do with a lot of fixing and enhancing with something new which has had a lot of recent feature work put into it. I'd prefer to see mutexes compared with semaphores after you've put as much work into improving semaphores as you have into developing mutexes. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 13:07 ` Andrew Morton @ 2005-12-22 13:23 ` Arjan van de Ven 2005-12-22 13:44 ` Andrew Morton 2005-12-22 15:19 ` Nicolas Pitre 2005-12-22 21:43 ` Paul Mackerras 2 siblings, 1 reply; 62+ messages in thread From: Arjan van de Ven @ 2005-12-22 13:23 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml > > * - only one task can hold the mutex at a time > > * - only the owner can unlock the mutex > > * - multiple unlocks are not permitted > > * - recursive locking is not permitted > > * - a mutex object must be initialized via the API > > * - a mutex object must not be initialized via memset or copying > > * - task may not exit with mutex held > > * - memory areas where held locks reside must not be freed > > Pretty much all of that could be added to semaphores-when-used-as-mutexes. > Without introducing a whole new locking mechanism. this is basically in direct conflict with what Linus has been asking for.. "leave semaphore semantics alone". I agree with Linus that if it has very different rules, it should have a different name. > Ingo, there appear to be quite a few straw-man arguments here. You're > comparing a subsystem (semaphores) which obviously could do with a lot of > fixing and enhancing with something new which has had a lot of recent > feature work out into it. > > I'd prefer to see mutexes compared with semaphores after you've put as much > work into improving semaphores as you have into developing mutexes. semaphores have had a lot of work for the last... 10 years. To me that is a sign that maybe they're close to their limit already. You keep saying 10 times "so please enhance semaphores to do this". 
Semaphores have far more complex rules for the slowpath (mutex semantics are simple because the number of wakers is always at most one, and if you hold the lock, you KNOW nobody else can do another release under you). Adding dual rules or other complexity to it doesn't sound too compelling to me actually; they're highly tricky things already (just look at the i386 ones.. that extra wakeup was there to plug a hole (at least that is my empirical conclusion based on "we remove it, it hangs" behavior)). Having 2 sets of behaviors for the same primitive also sounds not good to me to be honest, that's bound to explode stuff left and right all the time. I realize you're not looking forward to a gradual conversion; yet Linus says he doesn't want a wholesale change either but wants it gradual. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 13:23 ` Arjan van de Ven @ 2005-12-22 13:44 ` Andrew Morton 2005-12-22 14:11 ` Alan Cox ` (2 more replies) 0 siblings, 3 replies; 62+ messages in thread From: Andrew Morton @ 2005-12-22 13:44 UTC (permalink / raw) To: Arjan van de Ven Cc: mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml Arjan van de Ven <arjan@infradead.org> wrote: > > > > > * - only one task can hold the mutex at a time > > > * - only the owner can unlock the mutex > > > * - multiple unlocks are not permitted > > > * - recursive locking is not permitted > > > * - a mutex object must be initialized via the API > > > * - a mutex object must not be initialized via memset or copying > > > * - task may not exit with mutex held > > > * - memory areas where held locks reside must not be freed > > > > Pretty much all of that could be added to semaphores-when-used-as-mutexes. > > Without introducing a whole new locking mechanism. > > this is basically in direct conflict with what Linus has been asking > for.. "leave semaphore semantics alone". No, it's mostly just adding new debug stuff. And we'd need to add additional annotations to those few semaphores which are used as non-mutexes to support that new debug stuff. That's if the new debugging stuff is actually useful. I'm not sure it is, really - I can't think of any kernel bugs which these features would have exposed. Although I think Ingo found some, but I don't recall what they were. > > > Ingo, there appear to be quite a few straw-man arguments here. You're > > comparing a subsystem (semaphores) which obviously could do with a lot of > > fixing and enhancing with something new which has had a lot of recent > > feature work out into it. > > > > I'd prefer to see mutexes compared with semaphores after you've put as much > > work into improving semaphores as you have into developing mutexes. > > semaphores have had a lot of work for the last... 
10 years. To me that > is a sign that maybe they're close to their limit already. No they haven't - they're basically unchanged from four years ago (at least). > You keep saying 10 times "so please enhance semaphores to do this". Well I think diligence requires that we be able to demonstrate why it's not possible. It's plain dumb for us to justify a fancy-pants new system based on features which we could have added to the old one, no? > Semaphores have far more complex rules for the slowpath (mutex semantics > are simple because the number of wakers is always at most one, and if > you hold the lock, you KNOW nobody else can do another release under > you). Adding dual rules or other complexity to it doesn't sound too > compelling to me actually; they're highly tricky things already (just > look at the i386 ones.. that extra wakeup was there to plug a hole (at > least that is my empirical conclusion based on "we remove it it hangs" > behavior). > > Having 2 sets of behaviors for the same primitive also sounds not good > to me to be honest, that's bound to explode stuff left and right all the > time. There's no need for two sets of behaviour. What I'm saying is that we could add similar debugging features to semaphores (if useful). Yes, we would have to tell the kernel at the source level to defeat that debugging if a particular lock isn't being used as a mutex. That's rather less intrusive than adding a whole new type of lock. But I'd question the value even of doing that, given the general historical non-bugginess of existing semaphore users. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 13:44 ` Andrew Morton @ 2005-12-22 14:11 ` Alan Cox 2005-12-22 23:30 ` Andrew Morton 2005-12-22 15:46 ` Thomas Gleixner 2005-12-22 17:17 ` Christoph Hellwig 2 siblings, 1 reply; 62+ messages in thread From: Alan Cox @ 2005-12-22 14:11 UTC (permalink / raw) To: Andrew Morton Cc: Arjan van de Ven, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, hch, ak, rmk+lkml On Iau, 2005-12-22 at 05:44 -0800, Andrew Morton wrote: > > semaphores have had a lot of work for the last... 10 years. To me that > > is a sign that maybe they're close to their limit already. > > No they haven't - they're basically unchanged from four years ago (at > least). Point still holds > It's plain dumb for us to justify a fancy-pants new system based on > features which we could have added to the old one, no? The old one does one job well. Mutex wakeups are very different in behaviour because of the single waker rule and the fact you can do stuff like affine wakeups. You could make it one function but it would inevitably end up full of if(mutex) { } so you'd make it slower, destabilize both and not get some of the winnings like code size. Oh and of course you no doubt break the semaphores while doing it, while at least this separate way as Linus suggested you don't break the existing functionality. > There's no need for two sets of behaviour. The fundamental reason for mutexes in terms of performance and scalability requires the different wake up behaviours. The performance numbers are pretty compelling. I was against the original proposal but now that the gradual changeover is proposed I'm for it. Alan ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 14:11 ` Alan Cox @ 2005-12-22 23:30 ` Andrew Morton 2005-12-22 23:34 ` Christoph Hellwig 0 siblings, 1 reply; 62+ messages in thread From: Andrew Morton @ 2005-12-22 23:30 UTC (permalink / raw) To: Alan Cox Cc: arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, hch, ak, rmk+lkml Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > > On Iau, 2005-12-22 at 05:44 -0800, Andrew Morton wrote: > > > semaphores have had a lot of work for the last... 10 years. To me that > > > is a sign that maybe they're close to their limit already. > > > > No they haven't - they're basically unchanged from four years ago (at > > least). > > Point still holds No it does not. Ingo's work has shown us two things: a) semaphores use more kernel text than they should and b) semaphores are less efficient than they could be. Fine. Let's update the semaphore implementation to fix those things. Nobody has addressed this code in several years. If we conclusively cannot fix these things then that's the time to start looking at implementing new locking mechanisms. > > It's plain dumb for us to justify a fancy-pants new system based on > > features which we could have added to the old one, no? > > The old one does one job well. Mutex wakeups are very different in > behaviour because of the single waker rule and the fact you can do stuff > like affine wakeups. > > You could make it one function but it would inevitably end up full of > > if(mutex) { > } We'd only need such constructs if we were trying to add all the mutex runtime debugging features to semaphores. I don't think that's likely to be very useful so fine, let's not do that. > Oh and of course you no doubt break the semaphores while doing it, while > at least this seperate way as Linus suggested you don't break the > existing functionality. Linus would not be averse to a patch which optimises semaphores. > > There's no need for two sets of behaviour. 
> > The fundamental reason for mutexes in terms of performance and > scalability requires the different wake up behaviours. The performance > numbers are pretty compelling. This is where I have a bad feeling that I'm missing some vital clue. An efficient semaphore implementation will wake a single process per up(). Just like mutexes. So why should mutexes have any performance advantage? ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 23:30 ` Andrew Morton @ 2005-12-22 23:34 ` Christoph Hellwig [not found] ` <32801.10.10.10.28.1135295357.squirrel@linux1> ` (2 more replies) 0 siblings, 3 replies; 62+ messages in thread From: Christoph Hellwig @ 2005-12-22 23:34 UTC (permalink / raw) To: Andrew Morton Cc: Alan Cox, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, hch, ak, rmk+lkml On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote: > No it does not. > > Ingo's work has shown us two things: > > a) semaphores use more kernel text than they should and > > b) semaphores are less efficient than they could be. > > Fine. Let's update the semaphore implementation to fix those things. > Nobody has addressed this code in several years. If we conclusively cannot > fix these things then that's the time to start looking at implementing new > locking mechanisms. c) semaphores are total overkill for 99 percent of the users. Remember this thing about optimizing for the common case? Pretty much everywhere we do want mutex semantics. So let's have a proper primitive exactly for that, and we can keep the current semaphore implementation (with a much simpler implementation) for that handful of users in the kernel that really want a counting semaphore. I really don't get why you hate mutex primitives so much. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 [not found] ` <32801.10.10.10.28.1135295357.squirrel@linux1> @ 2005-12-22 23:49 ` Sean 2005-12-22 23:53 ` Randy.Dunlap 0 siblings, 1 reply; 62+ messages in thread From: Sean @ 2005-12-22 23:49 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Alan Cox, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, hch, ak, rmk+lkml On Thu, December 22, 2005 6:34 pm, Christoph Hellwig said: > On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote: >> No it does not. >> >> Ingo's work has shown us two things: >> >> a) semaphores use more kernel text than they should and >> >> b) semaphores are less efficient than they could be. >> >> Fine. Let's update the semaphore implementation to fix those things. >> Nobody has addressed this code in several years. If we conclusively >> cannot >> fix these things then that's the time to start looking at implementing >> new >> locking mechanisms. > > c) semaphores are total overkill for 99% percent of the users. Remember > this thing about optimizing for the common case? > > Pretty much everywhere we do want mutex semantic. So let's have a proper > primitive exactly for that, and we can keep the current semaphore > implementation (with a much simpler implementation) for that handfull of > users in the kernel that really want a counting semaphore. > > I really don't get why you hate mutex primitives so much. > Yes it's hard to figure. It seems to be deeper than just hating mutex primitives, he hates the timer core updates that come from Ingo too; this may be a general dislike for all things -rt. Sean ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 23:49 ` Sean @ 2005-12-22 23:53 ` Randy.Dunlap [not found] ` <50572.10.10.10.28.1135296023.squirrel@linux1> 2005-12-23 0:00 ` Steven Rostedt 0 siblings, 2 replies; 62+ messages in thread From: Randy.Dunlap @ 2005-12-22 23:53 UTC (permalink / raw) To: Sean Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml On Thu, 22 Dec 2005, Sean wrote: > On Thu, December 22, 2005 6:34 pm, Christoph Hellwig said: > > On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote: > >> No it does not. > >> > >> Ingo's work has shown us two things: > >> > >> a) semaphores use more kernel text than they should and > >> > >> b) semaphores are less efficient than they could be. > >> > >> Fine. Let's update the semaphore implementation to fix those things. > >> Nobody has addressed this code in several years. If we conclusively > >> cannot > >> fix these things then that's the time to start looking at implementing > >> new > >> locking mechanisms. > > > > c) semaphores are total overkill for 99% percent of the users. Remember > > this thing about optimizing for the common case? > > > > Pretty much everywhere we do want mutex semantic. So let's have a proper > > primitive exactly for that, and we can keep the current semaphore > > implementation (with a much simpler implementation) for that handfull of > > users in the kernel that really want a counting semaphore. > > > > I really don't get why you hate mutex primitives so much. > > > > Yes it's hard to figure. It seems to be deeper than just hating mutex > primitives, he hates the timer core updates that come from Ingo too; this > may be a general dislike for all things -rt. 
Andrew can surely answer that, but it could be something as simple as wanting to build a more stable kernel (one without so much churn), so that things have time to mature and improve without breaking so many other things... This (current) is a hectic development cycle style. -- ~Randy ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 [not found] ` <50572.10.10.10.28.1135296023.squirrel@linux1> @ 2005-12-23 0:00 ` Sean 0 siblings, 0 replies; 62+ messages in thread From: Sean @ 2005-12-23 0:00 UTC (permalink / raw) To: Randy.Dunlap Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml On Thu, December 22, 2005 6:53 pm, Randy.Dunlap said: > Andrew can surely answer that, but it could be something as > simple as wanting to build a more stable kernel (one without > so much churn), so that things have time to mature and > improve without breaking so many other things... > > This (current) is a hectic development cycle style. Sure, it's probably that simple. Just seems like the technical arguments clearly support adding a mutex. Sean ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 23:53 ` Randy.Dunlap [not found] ` <50572.10.10.10.28.1135296023.squirrel@linux1> @ 2005-12-23 0:00 ` Steven Rostedt 1 sibling, 0 replies; 62+ messages in thread From: Steven Rostedt @ 2005-12-23 0:00 UTC (permalink / raw) To: Randy.Dunlap Cc: Sean, Christoph Hellwig, Andrew Morton, Alan Cox, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, ak, rmk+lkml On Thu, 22 Dec 2005, Randy.Dunlap wrote: > > Andrew can surely answer that, but it could be something as > simple as wanting to build a more stable kernel (one without > so much churn), so that things have time to mature and > improve without breaking so many other things... Not to mention that when this gets into the kernel, Andrew's the one that will have to deal with a ton of "switch semaphores to mutexes" patches that will slowly update the entire kernel. These patches will probably continue to come in for a few years. ;) IIRC, that was one of his rational reasons why he didn't want the timer name change. -- Steve ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 [not found] ` <20051222221311.2f6056ec.akpm@osdl.org> @ 2005-12-23 14:24 ` Nicolas Pitre 2005-12-23 14:51 ` Andrew Morton 0 siblings, 1 reply; 62+ messages in thread From: Nicolas Pitre @ 2005-12-23 14:24 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Hellwig, alan, arjan, mingo, lkml, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml On Thu, 22 Dec 2005, Andrew Morton wrote: > Christoph Hellwig <hch@infradead.org> wrote: > > I really don't get why you hate mutex primitives so much. > > I've just spelled out in considerable detail why this work is premature. > How can you not "get" it? Why do I have to keep saying the same thing in > different ways? It's really quite simple. > > > So here is permutation #4: > > If we can optimise semaphores for speed and space, the mutexes are > *unneeded*. How can't you get the fact that semaphores could _never_ be as simple as mutexes? This is a theoretical impossibility, which maybe turns out not to be so true on x86, but which is damn true on ARM where the fast path (the common case of a mutex) is significantly more efficient. Semaphores _require_ an atomic decrement, mutexes do not. On some architectures that makes a huge difference. > And I think we _should_ optimise semaphores for speed and space. Don't you? No one disagrees with that. > If we can do that, there is no point at all in adding a new lock type which > has no speed advantage and no space advantage and which has less > functionality than semaphores. The very point is that mutexes will always have a speed advantage by nature. > And, repeating myself yet again: if we can demonstrate that it is not > feasible to optimise semaphores to the same performance and space efficiency > of mutexes then (and only then) we have a reason for adding mutexes. I spent the whole week making that demonstration repeatedly. Why are you ignoring me? Nicolas ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-23 14:24 ` Nicolas Pitre @ 2005-12-23 14:51 ` Andrew Morton 2005-12-23 14:57 ` Russell King ` (2 more replies) 0 siblings, 3 replies; 62+ messages in thread From: Andrew Morton @ 2005-12-23 14:51 UTC (permalink / raw) To: Nicolas Pitre Cc: hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml Nicolas Pitre <nico@cam.org> wrote: > > How can't you get the fact that semaphores could _never_ be as simple as > mutexes? This is a theoritical impossibility, which maybe turns out not > to be so true on x86, but which is damn true on ARM where the fast path > (the common case of a mutex) is significantly more efficient. > I did notice your comments. I'll grant that mutexes will save some tens of fastpath cycles on one minor architecture. Sorry, but that doesn't seem very important. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-23 14:51 ` Andrew Morton @ 2005-12-23 14:57 ` Russell King 2005-12-23 15:04 ` Xavier Bestel 2005-12-23 15:00 ` Steven Rostedt 2006-01-03 17:54 ` Abhijit Bhopatkar 2 siblings, 1 reply; 62+ messages in thread From: Russell King @ 2005-12-23 14:57 UTC (permalink / raw) To: Andrew Morton Cc: Nicolas Pitre, hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote: > Nicolas Pitre <nico@cam.org> wrote: > > How can't you get the fact that semaphores could _never_ be as simple as > > mutexes? This is a theoritical impossibility, which maybe turns out not > > to be so true on x86, but which is damn true on ARM where the fast path > > (the common case of a mutex) is significantly more efficient. > > I did notice your comments. I'll grant that mutexes will save some tens of > fastpath cycles on one minor architecture. Sorry, but that doesn't seem > very important. Wow. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-23 14:57 ` Russell King @ 2005-12-23 15:04 ` Xavier Bestel 2005-12-23 15:27 ` Andrew Morton 0 siblings, 1 reply; 62+ messages in thread From: Xavier Bestel @ 2005-12-23 15:04 UTC (permalink / raw) To: Russell King Cc: Andrew Morton, Nicolas Pitre, hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak On Fri, 2005-12-23 at 15:57, Russell King wrote: > On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote: > > Nicolas Pitre <nico@cam.org> wrote: > > > How can't you get the fact that semaphores could _never_ be as simple as > > > mutexes? This is a theoritical impossibility, which maybe turns out not > > > to be so true on x86, but which is damn true on ARM where the fast path > > > (the common case of a mutex) is significantly more efficient. > > > > I did notice your comments. I'll grant that mutexes will save some tens of > > fastpath cycles on one minor architecture. Sorry, but that doesn't seem > > very important. > > Wow. Yes, wow. Andrew doesn't seem aware of embedded linux people, for whom cycles are important and ARM is king. Xav ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-23 15:04 ` Xavier Bestel @ 2005-12-23 15:27 ` Andrew Morton 0 siblings, 0 replies; 62+ messages in thread From: Andrew Morton @ 2005-12-23 15:27 UTC (permalink / raw) To: Xavier Bestel Cc: rmk+lkml, nico, hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak Xavier Bestel <xavier.bestel@free.fr> wrote: > > On Fri, 2005-12-23 at 15:57, Russell King wrote: > > On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote: > > > Nicolas Pitre <nico@cam.org> wrote: > > > > How can't you get the fact that semaphores could _never_ be as simple as > > > > mutexes? This is a theoritical impossibility, which maybe turns out not > > > > to be so true on x86, but which is damn true on ARM where the fast path > > > > (the common case of a mutex) is significantly more efficient. > > > > > > I did notice your comments. I'll grant that mutexes will save some tens of > > > fastpath cycles on one minor architecture. Sorry, but that doesn't seem > > > very important. > > > > Wow. > > Yes, wow. Andrew doesn't seem aware of embedded linux people, for whom > cycles are important and ARM is king. > Please, spare me the rhetoric. Adding a new locking primitive is a big deal. We need good reasons for doing it. And no, I don't think a few cycles on ARM is good enough. I'd be very surprised if it was measurable - the busiest semaphore is probably i_sem and when it's taken we're also doing heavy filesystem operations which would swamp any benefit. And if we're going to churn i_sem then we perhaps should turn it into an rwsem so we can perform concurrent lookups (at least). We do disk I/O with that thing held. Look, I'm not wildly against mutexes - it's not exactly a large amount of code. I just think they're fairly pointless and I regret that we seem to be diving into adding yet another locking primitive without having fully investigated improving the existing one. 
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-23 14:51 ` Andrew Morton 2005-12-23 14:57 ` Russell King @ 2005-12-23 15:00 ` Steven Rostedt 2006-01-03 17:54 ` Abhijit Bhopatkar 2 siblings, 0 replies; 62+ messages in thread From: Steven Rostedt @ 2005-12-23 15:00 UTC (permalink / raw) To: Andrew Morton Cc: Nicolas Pitre, hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, ak, rmk+lkml On Fri, 23 Dec 2005, Andrew Morton wrote: > Nicolas Pitre <nico@cam.org> wrote: > > > > How can't you get the fact that semaphores could _never_ be as simple as > > mutexes? This is a theoritical impossibility, which maybe turns out not > > to be so true on x86, but which is damn true on ARM where the fast path > > (the common case of a mutex) is significantly more efficient. > > > > I did notice your comments. I'll grant that mutexes will save some tens of > fastpath cycles on one minor architecture. Sorry, but that doesn't seem > very important. > "minor architecture"? Granted, I don't know of any ARM desktops or servers, but there's a large number of ARM devices out in the real world. Or are we giving up on Linux being an embedded OS? -- Steve ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-23 14:51 ` Andrew Morton 2005-12-23 14:57 ` Russell King 2005-12-23 15:00 ` Steven Rostedt @ 2006-01-03 17:54 ` Abhijit Bhopatkar 2 siblings, 0 replies; 62+ messages in thread From: Abhijit Bhopatkar @ 2006-01-03 17:54 UTC (permalink / raw) To: Andrew Morton Cc: Nicolas Pitre, hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml > > How can't you get the fact that semaphores could _never_ be as simple as > > mutexes? This is a theoritical impossibility, which maybe turns out not > > to be so true on x86, but which is damn true on ARM where the fast path > > (the common case of a mutex) is significantly more efficient. > > > > I did notice your comments. I'll grant that mutexes will save some tens of > fastpath cycles on one minor architecture. Sorry, but that doesn't seem > very important. Heh !! i can't find words so i will just spell the emotion.... COMPLAIN HARD ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 23:34 ` Christoph Hellwig [not found] ` <32801.10.10.10.28.1135295357.squirrel@linux1> [not found] ` <20051222221311.2f6056ec.akpm@osdl.org> @ 2005-12-25 16:08 ` Roman Zippel 2005-12-25 22:54 ` Ingo Molnar 2005-12-25 23:04 ` Andrew Morton 2 siblings, 2 replies; 62+ messages in thread From: Roman Zippel @ 2005-12-25 16:08 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Alan Cox, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml Hi, On Friday 23 December 2005 00:34, Christoph Hellwig wrote: > c) semaphores are total overkill for 99% percent of the users. Remember > this thing about optimizing for the common case? Semaphores are not that different from mutexes. What makes me suspicious is the large difference in the test results: that either means something is wrong with the test or something is wrong with the semaphores. From reading the discussion I still don't really know why the improvements to mutexes can't be applied to semaphores. I have also hardly seen any discussion about why semaphores are the way they are. Linus did suspect there is a wakeup bug in the semaphore, but there was no conclusive followup to that. IMO there are still too many questions open, so I can understand Andrew. We may only cover up the problem instead of fixing it. I understand that mutexes have advantages, but if we compare them to semaphores it should be a fair comparison, otherwise people start to think semaphores are something bad. The majority of the discussion has been about microoptimisation, but on some archs non-debug mutexes and semaphores may very well be the same thing. bye, Roman ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-25 16:08 ` Roman Zippel @ 2005-12-25 22:54 ` Ingo Molnar 2005-12-26 21:49 ` Roman Zippel 2005-12-25 23:04 ` Andrew Morton 1 sibling, 1 reply; 62+ messages in thread From: Ingo Molnar @ 2005-12-25 22:54 UTC (permalink / raw) To: Roman Zippel Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml * Roman Zippel <zippel@linux-m68k.org> wrote: > > c) semaphores are total overkill for 99% percent of the users. Remember > > this thing about optimizing for the common case? > > [...] I also haven't hardly seen any discussion about why semaphores > the way they are. Linus did suspect there is a wakeup bug in the > semaphore, but there was no conclusive followup to that. no conclusive follow-up because ... they are too complex for people to answer such questions off the cuff? Something so frequently used in trivial ways should have the complexity of that typical use, not the complexity of the theoretical use. There is no problem with semaphores, other than that they are not being used as semaphores all that often. for which i think there is a rather simple practical reason: if i want to control a counted resource within the kernel, it is rarely the simplest solution to use a semaphore for it, because a semaphore cannot be used to protect data structures in the 'resource is available' stage [i.e. when the semaphore count is above zero]. It does decrement the counter atomically, but that is just half of the job i have to do. to control (allocate/free) the resource i _have to_ add some other locking mechanism anyway in most cases (a spinlock most likely, to protect the internal list and other internal state) - at which point it's simpler and faster to simply add a counter and a waitqueue to those existing internal variables, than to add a separate locking object around (or within) the whole construct. 
semaphores would be nice in theory, if there was a way to attach the 'decrement counter atomically' operation to another set of atomic ops, like list_del() or list_add(), making the whole thing transactional. [this would also be a wholly new API, so it only applies to semaphores as a concept, not our actual semaphore incarnation] So i see the theoretical beauty of semaphores, but in practice, CPUs force us to work with much simpler constructs. there are some exceptions: e.g. when the resource is _nothing else_ but a count. Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-25 22:54 ` Ingo Molnar @ 2005-12-26 21:49 ` Roman Zippel 0 siblings, 0 replies; 62+ messages in thread From: Roman Zippel @ 2005-12-26 21:49 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml Hi, On Sunday 25 December 2005 23:54, Ingo Molnar wrote: > > [...] I also have hardly seen any discussion about why semaphores > > are the way they are. Linus did suspect there is a wakeup bug in the > > semaphore, but there was no conclusive followup to that. > > no conclusive follow-up because ... they are too complex for people to > answer such questions off the cuff? Something so frequently used in > trivial ways should have the complexity of that typical use, not the > complexity of the theoretical use. There is no problem with semaphores, > other than that they are not being used as semaphores all that often. It shouldn't be that out of the blue for you. I don't mind the whole concept of mutexes and I agree that that's what most semaphores are used for. Please stop for a moment trying to sell mutexes; the basic question I'd like to get answered is, what is the worst-case scenario if we convert everything to mutexes? To make it very clear: I'm not arguing against mutexes, I only want to look at the complete picture; I don't only want to see the undoubted advantages, but also what are the risks? Is the semaphore wakeup behaviour really only a bug or does it fix some problem that just nobody remembers (and maybe even doesn't exist anymore)? What about the fairness issues mentioned, how easy is it to starve waiters? Ingo, you've been working on this for a while already, so I would expect you have already thought about possible problems, so why don't you take a look at the risks for us instead of just explaining the advantages? What are the chances we end up with semaphores just under a different name? 
Are there other possible problems, which can then only be solved e.g. by adding priority inheritance? The point of this is to be prepared for any predictable problem, since this change is rather huge in its consequences and we don't have the luxury of a single development tree anymore, which is used by most developers. bye, Roman ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-25 16:08 ` Roman Zippel 2005-12-25 22:54 ` Ingo Molnar @ 2005-12-25 23:04 ` Andrew Morton 2005-12-25 23:22 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 62+ messages in thread From: Andrew Morton @ 2005-12-25 23:04 UTC (permalink / raw) To: Roman Zippel Cc: hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml Roman Zippel <zippel@linux-m68k.org> wrote: > > IMO there are still too many questions open, so I can understand Andrew. We > may only cover up the problem instead of fixing it. I understand that mutexes > have advantages, but if we compare them to semaphores it should be a fair > comparison, otherwise people start to think semaphores are something bad. The > majority of the discussion has been about microoptimisation, but on some > archs non-debug mutexes and semaphores may very well be the same thing. Ingo has told me offline that he thinks that we can indeed remove the current semaphore implementation. 90% of existing semaphore users get migrated to mutexes. 9% of current semaphore users get migrated to completions. The remaining 1% of semaphore users are using the counting feature. We reimplement that in a mostly-arch-independent fashion and then remove the current semaphore implementation. Note that there are no sequencing dependencies in all the above. It's a lot of churn, but we'll end up with a better end result and a somewhat-net-simpler kernel, so I'm happy. One side point on semaphores and mutexes: the so-called "fast path" is generally not performance-critical, because we just don't take them at high frequencies. Any workload which involves taking a semaphore at more than 50,000-100,000 times/second tends to have ghastly overscheduling failure scenarios on SMP. So people hit those scenarios and the code gets converted to a lockless algorithm or to use spinlocking. 
For example, for a while ext3/JBD was doing 200,000 context-switches per second due to taking lock_super() at high frequencies. When I converted the whole fs to use spin locking throughout, performance in some workloads went up by 1000%. Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which is _designed_ to hit that lock hard. So if/when i_sem is converted to a mutex, I figure that the benefits to ARM in that workload will be about a 0.01% performance increase. ie: about two hours' worth of Moore's law in a dopey microbenchmark. For these reasons, I think that with sleeping locks, the fastpath is relatively unimportant. What _is_ important is not screwing up the slowpath and not taking the lock at too high a frequency. Because either one will cause overscheduling, which is a grossly worse problem. Also, there's very little point in adding lots of tricky arch-dependent code and generally mucking up the kernel source to squeeze the last drop of performance out of the sleeping lock fastpath. Because if those changes actually make a difference, we've already lost - the code needs to be changed to use spinlocking. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-25 23:04 ` Andrew Morton @ 2005-12-25 23:22 ` Ingo Molnar 2005-12-26 10:35 ` Andrew Morton 2005-12-26 0:33 ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek 2005-12-26 15:29 ` [patch 0/9] mutex subsystem, -V4 Nicolas Pitre 2 siblings, 1 reply; 62+ messages in thread From: Ingo Molnar @ 2005-12-25 23:22 UTC (permalink / raw) To: Andrew Morton Cc: Roman Zippel, hch, alan, arjan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml * Andrew Morton <akpm@osdl.org> wrote: > One side point on semaphores and mutexes: the so-called "fast path" is > generally not performance-critical, because we just don't take them at > high frequencies. Any workload which involves taking a semaphore at > more than 50,000-100,000 times/second tends to have ghastly > overscheduling failure scenarios on SMP. So people hit those > scenarios and the code gets converted to a lockless algorithm or to > use spinlocking. > > For example, for a while ext3/JBD was doing 200,000 context-switches > per second due to taking lock_super() at high frequencies. When I > converted the whole fs to use spin locking throughout the performance > in some workloads went up by 1000%. actually, i'm 99.9% certain [ ;-) ] that all that ext3 spinlock conversion pain could have been avoided by converting ext3 to the mutex code. Mutexes definitely do not overschedule, even in very high frequency lock/unlock scenarios. They behave and perform quite close to spinlocks. (which property is obviously a must for the -rt kernel, where all spinlocks, rwlocks, seqlocks, rwsems and semaphores are mutexes - providing a big playground for locking constructs) hm, can you see any easy way for me to test my bold assertion on ext3, by somehow moving/hacking it back to semaphores? I could then apply the MUTEX_DEBUG_FULL mechanism to it and redo those ext3 semaphore vs. 
spinlock benchmarks on a couple of boxes, and add a mutex column to the table of numbers. also, often it's simpler to use a sleeping lock than to use a spinlock for something, so we want to have that option open. Ingo ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-25 23:22 ` Ingo Molnar @ 2005-12-26 10:35 ` Andrew Morton 2005-12-26 10:42 ` Arjan van de Ven 2005-12-27 14:42 ` Ingo Molnar 0 siblings, 2 replies; 62+ messages in thread From: Andrew Morton @ 2005-12-26 10:35 UTC (permalink / raw) To: Ingo Molnar Cc: zippel, hch, alan, arjan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml Ingo Molnar <mingo@elte.hu> wrote: > > > * Andrew Morton <akpm@osdl.org> wrote: > > > One side point on semaphores and mutexes: the so-called "fast path" is > > generally not performance-critical, because we just don't take them at > > high frequencies. Any workload which involves taking a semaphore at > > more than 50,000-100,000 times/second tends to have ghastly > > overscheduling failure scenarios on SMP. So people hit those > > scenarios and the code gets converted to a lockless algorithm or to > > use spinlocking. > > > > For example, for a while ext3/JBD was doing 200,000 context-switches > > per second due to taking lock_super() at high frequencies. When I > > converted the whole fs to use spin locking throughout the performance > > in some workloads went up by 1000%. > > actually, i'm 99.9% certain [ ;-) ] that all that ext3 spinlock > conversion pain could have been avoided by converting ext3 to the mutex > code. Mutexes definitely do not overschedule, even in very high > frequency lock/unlock scenarios. They behave and perform quite close to > spinlocks. (which property is obviously a must for the -rt kernel, where > all spinlocks, rwlocks, seqlocks, rwsems and semaphores are mutexes - > providing a big playground for locking constructs) hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost of a short spin is much less than the cost of a sleep/wakeup. The machine was doing 100,000 - 200,000 context switches per second. 
> hm, can you see any easy way for me to test my bold assertion on ext3, > by somehow moving/hacking it back to semaphores? Not really. The problem was most apparent after the lock_kernel() removal patches. The first thing a CPU hit when it entered the fs was previously lock_kernel(). That became lock_super() and performance went down the tubes. From memory, the bad kernel was tip-of-tree as of Memorial Weekend 2003 ;) I guess you could re-add all the lock_super()s as per 2.5.x's ext3/jbd, check that it sucks running SDET on 8-way then implement the lock_super()s via a mutex. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-26 10:35 ` Andrew Morton @ 2005-12-26 10:42 ` Arjan van de Ven 2005-12-26 11:11 ` Andrew Morton 2005-12-26 18:15 ` Linus Torvalds 2005-12-27 14:42 ` Ingo Molnar 1 sibling, 2 replies; 62+ messages in thread From: Arjan van de Ven @ 2005-12-26 10:42 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, zippel, hch, alan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost > of a short spin is much less than the cost of a sleep/wakeup. The machine > was doing 100,000 - 200,000 context switches per second. interesting.. this might be a good indication that a "spin a bit first" mutex slowpath for some locks might be worth implementing... ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-26 10:42 ` Arjan van de Ven @ 2005-12-26 11:11 ` Andrew Morton 2005-12-26 17:15 ` Mike Galbraith 2005-12-27 0:32 ` David Lang 2005-12-26 18:15 ` Linus Torvalds 1 sibling, 2 replies; 62+ messages in thread From: Andrew Morton @ 2005-12-26 11:11 UTC (permalink / raw) To: Arjan van de Ven Cc: mingo, zippel, hch, alan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml Arjan van de Ven <arjan@infradead.org> wrote: > > > > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost > > of a short spin is much less than the cost of a sleep/wakeup. The machine > > was doing 100,000 - 200,000 context switches per second. > > interesting.. this might be a good indication that a "spin a bit first" > mutex slowpath for some locks might be worth implementing... If we see a workload which is triggering such high context switch rates, maybe. But I don't think we've seen any such for a long time. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-26 11:11 ` Andrew Morton @ 2005-12-26 17:15 ` Mike Galbraith 2005-12-26 17:44 ` Lee Revell 2005-12-27 0:32 ` David Lang 1 sibling, 1 reply; 62+ messages in thread From: Mike Galbraith @ 2005-12-26 17:15 UTC (permalink / raw) To: Andrew Morton, Arjan van de Ven Cc: mingo, zippel, hch, alan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml At 03:11 AM 12/26/2005 -0800, Andrew Morton wrote: >Arjan van de Ven <arjan@infradead.org> wrote: > > > > > > > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost > > > of a short spin is much less than the cost of a sleep/wakeup. The > machine > > > was doing 100,000 - 200,000 context switches per second. > > > > interesting.. this might be a good indication that a "spin a bit first" > > mutex slowpath for some locks might be worth implementing... > >If we see a workload which is triggering such high context switch rates, >maybe. But I don't think we've seen any such for a long time. Hmm. Is there a real workload where such a high context switch rate is necessary? Every time I've seen a high (100,000 - 200,000 is beyond absurd on my little box, but...) context switch rate, it's been because something sucked. -Mike ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-26 17:15 ` Mike Galbraith @ 2005-12-26 17:44 ` Lee Revell 0 siblings, 0 replies; 62+ messages in thread From: Lee Revell @ 2005-12-26 17:44 UTC (permalink / raw) To: Mike Galbraith Cc: Andrew Morton, Arjan van de Ven, mingo, zippel, hch, alan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml On Mon, 2005-12-26 at 18:15 +0100, Mike Galbraith wrote: > At 03:11 AM 12/26/2005 -0800, Andrew Morton wrote: > >Arjan van de Ven <arjan@infradead.org> wrote: > > > > > > > > > > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost > > > > of a short spin is much less than the cost of a sleep/wakeup. The > > machine > > > > was doing 100,000 - 200,000 context switches per second. > > > > > > interesting.. this might be a good indication that a "spin a bit first" > > > mutex slowpath for some locks might be worth implementing... > > > >If we see a workload which is triggering such high context switch rates, > >maybe. But I don't think we've seen any such for a long time. > > Hmm. Is there a real workload where such a high context switch rate is > necessary? Every time I've seen a high (100,000 - 200,000 is beyond absurd > on my little box, but...) context switch rate, it's been because something > sucked. I can trivially produce 20K per second on my little sub-GHz box, so 100K on a busy server is certainly plausible. Especially if for the purposes of this discussion we are also worried about -rt + IRQ threading, where each IRQ costs two context switches (more if it raises a softirq). Lee ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-26 11:11 ` Andrew Morton 2005-12-26 17:15 ` Mike Galbraith @ 2005-12-27 0:32 ` David Lang 1 sibling, 0 replies; 62+ messages in thread From: David Lang @ 2005-12-27 0:32 UTC (permalink / raw) To: Andrew Morton Cc: Arjan van de Ven, mingo, zippel, hch, alan, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml On Mon, 26 Dec 2005, Andrew Morton wrote: > Arjan van de Ven <arjan@infradead.org> wrote: >> >> >>> hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost >>> of a short spin is much less than the cost of a sleep/wakeup. The machine >>> was doing 100,000 - 200,000 context switches per second. >> >> interesting.. this might be a good indication that a "spin a bit first" >> mutex slowpath for some locks might be worth implementing... > > If we see a workload which is triggering such high context switch rates, > maybe. But I don't think we've seen any such for a long time. > How does 'spin a bit' interact with virtual CPUs (HT and equivalent)? It would seem to me that on a multi true-core system the spin-a-bit is a win because it allows the other CPUs to release the lock, but on virtual CPUs all the spinning would do is delay the other virtual CPUs from running and therefore delay the lock getting released. David Lang -- There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies. -- C.A.R. Hoare ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-26 10:42 ` Arjan van de Ven 2005-12-26 11:11 ` Andrew Morton @ 2005-12-26 18:15 ` Linus Torvalds 1 sibling, 0 replies; 62+ messages in thread From: Linus Torvalds @ 2005-12-26 18:15 UTC (permalink / raw) To: Arjan van de Ven Cc: Andrew Morton, Ingo Molnar, zippel, hch, alan, linux-kernel, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml On Mon, 26 Dec 2005, Arjan van de Ven wrote: > > > hm. 16 CPUs hitting the same semaphore at great arrival rates. The cost > > of a short spin is much less than the cost of a sleep/wakeup. The machine > > was doing 100,000 - 200,000 context switches per second. > > interesting.. this might be a good indication that a "spin a bit first" > mutex slowpath for some locks might be worth implementing... No, please don't. Almost always it's a huge sign that the locking is totally broken. And yes, the fix was to _fix_ the ext3 locking. Not to make semaphores spin. The ext3 locking was using a semaphore for totally the wrong reasons, it made zero sense at all. I think it's fixed now. Linus ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-26 10:35 ` Andrew Morton 2005-12-26 10:42 ` Arjan van de Ven @ 2005-12-27 14:42 ` Ingo Molnar 2005-12-27 23:02 ` Andrew Morton 1 sibling, 1 reply; 62+ messages in thread From: Ingo Molnar @ 2005-12-27 14:42 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel * Andrew Morton <akpm@osdl.org> wrote: > > hm, can you see any easy way for me to test my bold assertion on ext3, > > by somehow moving/hacking it back to semaphores? > > Not really. The problem was most apparent after the lock_kernel() > removal patches. The first thing a CPU hit when it entered the fs was > previously lock_kernel(). That became lock_super() and performance > went down the tubes. From memory, the bad kernel was tip-of-tree as > of Memorial Weekend 2003 ;) > > I guess you could re-add all the lock_super()s as per 2.5.x's > ext3/jbd, check that it sucks running SDET on 8-way then implement the > lock_super()s via a mutex. ok - does the patch below look roughly ok as a really bad (but functional) hack to restore that old behavior, for ext2? 
Ingo fs/ext2/balloc.c | 19 ++++++++++++++++--- fs/ext2/ialloc.c | 15 +++++++++++++-- 2 files changed, 30 insertions(+), 6 deletions(-) Index: linux/fs/ext2/balloc.c =================================================================== --- linux.orig/fs/ext2/balloc.c +++ linux/fs/ext2/balloc.c @@ -139,7 +139,7 @@ static void release_blocks(struct super_ } } -static int group_reserve_blocks(struct ext2_sb_info *sbi, int group_no, +static int group_reserve_blocks(struct super_block *sb, struct ext2_sb_info *sbi, int group_no, struct ext2_group_desc *desc, struct buffer_head *bh, int count) { unsigned free_blocks; @@ -147,12 +147,15 @@ static int group_reserve_blocks(struct e if (!desc->bg_free_blocks_count) return 0; + lock_super(sb); spin_lock(sb_bgl_lock(sbi, group_no)); free_blocks = le16_to_cpu(desc->bg_free_blocks_count); if (free_blocks < count) count = free_blocks; desc->bg_free_blocks_count = cpu_to_le16(free_blocks - count); spin_unlock(sb_bgl_lock(sbi, group_no)); + unlock_super(sb); + mark_buffer_dirty(bh); return count; } @@ -164,10 +167,12 @@ static void group_release_blocks(struct struct ext2_sb_info *sbi = EXT2_SB(sb); unsigned free_blocks; + lock_super(sb); spin_lock(sb_bgl_lock(sbi, group_no)); free_blocks = le16_to_cpu(desc->bg_free_blocks_count); desc->bg_free_blocks_count = cpu_to_le16(free_blocks + count); spin_unlock(sb_bgl_lock(sbi, group_no)); + unlock_super(sb); sb->s_dirt = 1; mark_buffer_dirty(bh); } @@ -234,6 +239,7 @@ do_more: "Block = %lu, count = %lu", block, count); + lock_super(sb); for (i = 0, group_freed = 0; i < count; i++) { if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group), bit + i, bitmap_bh->b_data)) { @@ -243,6 +249,7 @@ do_more: group_freed++; } } + unlock_super(sb); mark_buffer_dirty(bitmap_bh); if (sb->s_flags & MS_SYNCHRONOUS) @@ -377,7 +384,7 @@ int ext2_new_block(struct inode *inode, goto io_error; } - group_alloc = group_reserve_blocks(sbi, group_no, desc, + group_alloc = group_reserve_blocks(sb, sbi, group_no, 
desc, gdp_bh, es_alloc); if (group_alloc) { ret_block = ((goal - le32_to_cpu(es->s_first_data_block)) % @@ -389,8 +396,10 @@ int ext2_new_block(struct inode *inode, ext2_debug("goal is at %d:%d.\n", group_no, ret_block); + lock_super(sb); ret_block = grab_block(sb_bgl_lock(sbi, group_no), bitmap_bh->b_data, group_size, ret_block); + unlock_super(sb); if (ret_block >= 0) goto got_block; group_release_blocks(sb, group_no, desc, gdp_bh, group_alloc); @@ -413,7 +422,7 @@ retry: desc = ext2_get_group_desc(sb, group_no, &gdp_bh); if (!desc) goto io_error; - group_alloc = group_reserve_blocks(sbi, group_no, desc, + group_alloc = group_reserve_blocks(sb, sbi, group_no, desc, gdp_bh, es_alloc); } if (!group_alloc) { @@ -425,8 +434,10 @@ retry: if (!bitmap_bh) goto io_error; + lock_super(sb); ret_block = grab_block(sb_bgl_lock(sbi, group_no), bitmap_bh->b_data, group_size, 0); + unlock_super(sb); if (ret_block < 0) { /* * If a free block counter is corrupted we can loop inifintely. @@ -485,12 +496,14 @@ got_block: if (group_alloc && !*prealloc_count) { unsigned n; + lock_super(sb); for (n = 0; n < group_alloc && ++ret_block < group_size; n++) { if (ext2_set_bit_atomic(sb_bgl_lock(sbi, group_no), ret_block, (void*) bitmap_bh->b_data)) break; } + unlock_super(sb); *prealloc_block = block + 1; *prealloc_count = n; es_alloc -= n; Index: linux/fs/ext2/ialloc.c =================================================================== --- linux.orig/fs/ext2/ialloc.c +++ linux/fs/ext2/ialloc.c @@ -75,6 +75,7 @@ static void ext2_release_inode(struct su return; } + lock_super(sb); spin_lock(sb_bgl_lock(EXT2_SB(sb), group)); desc->bg_free_inodes_count = cpu_to_le16(le16_to_cpu(desc->bg_free_inodes_count) + 1); @@ -82,6 +83,7 @@ static void ext2_release_inode(struct su desc->bg_used_dirs_count = cpu_to_le16(le16_to_cpu(desc->bg_used_dirs_count) - 1); spin_unlock(sb_bgl_lock(EXT2_SB(sb), group)); + unlock_super(sb); if (dir) percpu_counter_dec(&EXT2_SB(sb)->s_dirs_counter); sb->s_dirt = 1; @@ 
-148,12 +150,16 @@ void ext2_free_inode (struct inode * ino goto error_return; /* Ok, now we can actually update the inode bitmaps.. */ + lock_super(sb); if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group), - bit, (void *) bitmap_bh->b_data)) + bit, (void *) bitmap_bh->b_data)) { + unlock_super(sb); ext2_error (sb, "ext2_free_inode", "bit already cleared for inode %lu", ino); - else + } else { + unlock_super(sb); ext2_release_inode(sb, block_group, is_directory); + } mark_buffer_dirty(bitmap_bh); if (sb->s_flags & MS_SYNCHRONOUS) sync_dirty_buffer(bitmap_bh); @@ -507,8 +513,10 @@ repeat_in_this_group: group = 0; continue; } + lock_super(sb); if (ext2_set_bit_atomic(sb_bgl_lock(sbi, group), ino, bitmap_bh->b_data)) { + unlock_super(sb); /* we lost this inode */ if (++ino >= EXT2_INODES_PER_GROUP(sb)) { /* this group is exhausted, try next group */ @@ -519,6 +527,7 @@ repeat_in_this_group: /* try to find free inode in the same group */ goto repeat_in_this_group; } + unlock_super(sb); goto got; } @@ -547,6 +556,7 @@ got: if (S_ISDIR(mode)) percpu_counter_inc(&sbi->s_dirs_counter); + lock_super(sb); spin_lock(sb_bgl_lock(sbi, group)); gdp->bg_free_inodes_count = cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1); @@ -560,6 +570,7 @@ got: sbi->s_debts[group]--; } spin_unlock(sb_bgl_lock(sbi, group)); + unlock_super(sb); sb->s_dirt = 1; mark_buffer_dirty(bh2); ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-27 14:42 ` Ingo Molnar @ 2005-12-27 23:02 ` Andrew Morton 0 siblings, 0 replies; 62+ messages in thread From: Andrew Morton @ 2005-12-27 23:02 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar <mingo@elte.hu> wrote: > > > * Andrew Morton <akpm@osdl.org> wrote: > > > > hm, can you see any easy way for me to test my bold assertion on ext3, > > > by somehow moving/hacking it back to semaphores? > > > > Not really. The problem was most apparent after the lock_kernel() > > removal patches. The first thing a CPU hit when it entered the fs was > > previously lock_kernel(). That became lock_super() and performance > > went down the tubes. From memory, the bad kernel was tip-of-tree as > > of Memorial Weekend 2003 ;) > > > > I guess you could re-add all the lock_super()s as per 2.5.x's > > ext3/jbd, check that it sucks running SDET on 8-way then implement the > > lock_super()s via a mutex. > > ok - does the patch below look roughly ok as a really bad (but > functional) hack to restore that old behavior, for ext2? > Hard to tell ;) 2.5.20's ext2 had 7 lock_super()s whereas for some reason this patch adds 12... I don't recall whether ext2 suffered wild context switches as badly as ext3 did. It becomes pretty obvious in testing. The really bad workload was SDET, which isn't available to mortals. So some hunting might be needed to find a suitable alternative. dbench would be a good start I guess. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) 2005-12-25 23:04 ` Andrew Morton 2005-12-25 23:22 ` Ingo Molnar @ 2005-12-26 0:33 ` Pavel Machek 2006-01-05 15:30 ` Andi Kleen 2005-12-26 15:29 ` [patch 0/9] mutex subsystem, -V4 Nicolas Pitre 2 siblings, 1 reply; 62+ messages in thread From: Pavel Machek @ 2005-12-26 0:33 UTC (permalink / raw) To: Andrew Morton Cc: Roman Zippel, hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml Hi! > Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only > does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which > is _designed_ to hit that lock hard. So if/when i_sem is converted to a > mutex, I figure that the benefits to ARM in that workload will be about a > 0.01% performance increase. ie: about two hours' worth of Moore's law in a > dopey microbenchmark. :-) Expressing performance increases in Moore's hours seems like a neat trick. OTOH I do not think it is valid any more. Single-threaded performance stopped increasing 2 years ago AFAICS. Plus people are pushing Linux onto smaller machines that were unavailable 2 years ago. Pavel -- Thanks, Sharp! ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) 2005-12-26 0:33 ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek @ 2006-01-05 15:30 ` Andi Kleen 2006-01-05 19:08 ` Pavel Machek 0 siblings, 1 reply; 62+ messages in thread From: Andi Kleen @ 2006-01-05 15:30 UTC (permalink / raw) To: Pavel Machek; +Cc: linux-kernel On Monday 26 December 2005 01:33, Pavel Machek wrote: [cc list from hell trimmed down] > > Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only > > does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which > > is _designed_ to hit that lock hard. So if/when i_sem is converted to a > > mutex, I figure that the benefits to ARM in that workload will be about a > > 0.01% performance increase. ie: about two hours' worth of Moore's law in a > > dopey microbenchmark. Moore's law actually doesn't say anything about performance increases, just about the number of transistors available. > > :-) Expressing performance increases in Moore's hours seems like > neat trick. OTOH I do not think it is valid any more. Single-threaded > performance stopped increasing 2 years ago AFAICS. It's not true. e.g. a 2.6 GHz FX-57 is significantly faster than the top-end CPU you could get 2 years ago. And I'm sure this year's CPUs will still be faster than last year's. > Plus people are > pushing Linux onto smaller machines, that were unavailable 2 years > ago. Even smaller systems are still getting faster. -Andi ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) 2006-01-05 15:30 ` Andi Kleen @ 2006-01-05 19:08 ` Pavel Machek 0 siblings, 0 replies; 62+ messages in thread From: Pavel Machek @ 2006-01-05 19:08 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel On Thu 05-01-06 16:30:59, Andi Kleen wrote: > On Monday 26 December 2005 01:33, Pavel Machek wrote: > > [cc list from hell trimmed down] > > :-) Expressing performance increases in Moore's hours seems like > > neat trick. OTOH I do not think it is valid any more. Single-threaded > > performance stopped increasing 2 years ago AFAICS. > > It's not true. e.g. a 2.6 Ghz FX-57 is significantly faster than the > top end CPU you could get 2 years ago. And I'm sure this years CPUs > will be still faster than last years. Well, but it is no longer 2x as fast... > > Plus people are > > pushing Linux onto smaller machines, that were unavailable 2 years > > ago. > > Even smaller systems are still getting faster. Yep. OTOH this is the first time the *average* linux box is getting *slower* -- mostly due to the massive number of cheap, linux-based wifi access points on the market. (The average linux desktop box is still getting faster, though) -- Thanks, Sharp! ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-25 23:04 ` Andrew Morton 2005-12-25 23:22 ` Ingo Molnar 2005-12-26 0:33 ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek @ 2005-12-26 15:29 ` Nicolas Pitre 2 siblings, 0 replies; 62+ messages in thread From: Nicolas Pitre @ 2005-12-26 15:29 UTC (permalink / raw) To: Andrew Morton Cc: Roman Zippel, hch, alan, arjan, mingo, lkml, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml On Sun, 25 Dec 2005, Andrew Morton wrote: > Ingo has told me offline that he thinks that we can indeed remove the > current semaphore implementation. > > 90% of existing semaphore users get migrated to mutexes. > > 9% of current semaphore users get migrated to completions. > > The remaining 1% of semaphore users are using the counting feature. We > reimplement that in a mostly-arch-independent fashion and then remove the > current semaphore implementation. Note that there are no sequencing > dependencies in all the above. Great! I'm glad you're amenable to such changes. Especially the "remove the current semaphore implementation" part that many people have been complaining about for all sorts of reasons. > Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only > does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which > is _designed_ to hit that lock hard. So if/when i_sem is converted to a > mutex, I figure that the benefits to ARM in that workload will be about a > 0.01% performance increase. ie: about two hours' worth of Moore's law in a > dopey microbenchmark. There is not only the question of cycles. There is also kernel text size. And _that_ is significant. You could argue that a fastpath which only adds a function call for every mutex lock/unlock cannot be made smaller. However on ARM a function call is quite a costly operation with frame pointers (and frame pointers are on by default on ARM, otherwise the kernel is undebuggable). 
So given that the current semaphore fast path is inlined to save on function call overhead, the transition on ARM from semaphores to mutexes, which shrinks that fast path from 8 instructions down to 3 (or even 2 in some cases), will not only save cycles but a bunch of kernel .text bytes as well, without compromising on cycle count. This is therefore an all-win situation for ARM. > Also, there's very little point in adding lots of tricky arch-dependent > code and generally mucking up the kernel source to squeeze the last drop of > performance out of the sleeping lock fastpath. Because if those changes > actually make a difference, we've already lost - the code needs to be > changed to use spinlocking. Consider my text size argument above, which is something I believe you value more. Nicolas ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 13:44 ` Andrew Morton 2005-12-22 14:11 ` Alan Cox @ 2005-12-22 15:46 ` Thomas Gleixner 2005-12-22 17:40 ` Linus Torvalds 2005-12-22 17:17 ` Christoph Hellwig 2 siblings, 1 reply; 62+ messages in thread From: Thomas Gleixner @ 2005-12-22 15:46 UTC (permalink / raw) To: Andrew Morton Cc: Arjan van de Ven, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml On Thu, 2005-12-22 at 05:44 -0800, Andrew Morton wrote: > There's no need for two sets of behaviour. What I'm saying is that we > could add similar debugging features to semaphores (if useful). Yes, we > would have to tell the kernel at the source level to defeat that debugging > if a particular lock isn't being used as a mutex. That's rather less > intrusive than adding a whole new type of lock. But I'd question the value > even of doing that, given the general historical non-bugginess of existing > semaphore users. Semaphores need a counter; mutexes need only a binary representation of the locked/unlocked state, which makes it possible to use instructions like xchg, swap, or test_and_set_bit, depending on what atomic operations are available on the architecture. On many architectures this is more efficient than the counter-based implementation. Also, the wakeup rules allow smaller and faster implementations for mutexes. tglx ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 15:46 ` Thomas Gleixner @ 2005-12-22 17:40 ` Linus Torvalds 2005-12-22 20:09 ` Steven Rostedt 0 siblings, 1 reply; 62+ messages in thread From: Linus Torvalds @ 2005-12-22 17:40 UTC (permalink / raw) To: Thomas Gleixner Cc: Andrew Morton, Arjan van de Ven, mingo, linux-kernel, arjanv, nico, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml On Thu, 22 Dec 2005, Thomas Gleixner wrote: > > Semaphores need a counter, mutexes only a binary representation of the > locked/unlocked state Actually, that's not true. A _spinlock_ only needs a binary representation of the locked/unlocked state. A mutex needs a _ternary_ representation. It needs an additional "contention" state to tell the wakeup that extra action is needed. If you don't handle contention (and do extra action all the time), you're screwed from a performance standpoint. Linus ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 17:40 ` Linus Torvalds @ 2005-12-22 20:09 ` Steven Rostedt 0 siblings, 0 replies; 62+ messages in thread From: Steven Rostedt @ 2005-12-22 20:09 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Andrew Morton, Arjan van de Ven, mingo, linux-kernel, arjanv, nico, jes, zwane, oleg, dhowells, alan, bcrl, hch, ak, rmk+lkml On Thu, 22 Dec 2005, Linus Torvalds wrote: > > > On Thu, 22 Dec 2005, Thomas Gleixner wrote: > > > > Semaphores need a counter, mutexes only a binary representation of the > > locked/unlocked state > > Actually, that's not true. > > A _spinlock_ only needs a binary representation of the locked/unlocked > state. > > A mutex needs a _ternary_ representation. It needs an additional > "contention" state to tell the wakeup that extra action is needed. True, and that's exactly what Ingo has. 1 unlocked, 0 locked, -1 locked with waiters. But it still works well with xchg. -- Steve > > If you don't handle contention (and do extra action all the time), you're > screwed from a performance standpoint. > > Linus > ^ permalink raw reply [flat|nested] 62+ messages in thread
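The encoding Steven describes above (1 = unlocked, 0 = locked, -1 = locked with waiters) and the way it still works with a plain exchange can be modeled in a few lines. This is a hypothetical, single-threaded sketch of the fast-path/slow-path decision using C11 atomics with invented function names, not the kernel's actual implementation (the real slow path runs under a spinlock with a wait list):

```c
#include <stdatomic.h>

/* Ternary mutex state: 1 = unlocked, 0 = locked, -1 = locked with
 * waiters.  Single global for simplicity of the sketch. */
static atomic_int count = 1;

static int sketch_mutex_lock(void)
{
    /* Fast path: one unconditional swap to "locked".  If the old
     * value was 1 the mutex was free and we now own it. */
    if (atomic_exchange(&count, 0) == 1)
        return 0;                /* uncontended: got the lock */
    /* Slow path: flag "locked with waiters" so the unlocker knows
     * a wakeup is needed; in reality we would now sleep. */
    atomic_store(&count, -1);
    return -1;                   /* contended */
}

static int sketch_mutex_unlock(void)
{
    /* Swap back to "unlocked".  An old value of -1 is the extra
     * "contention" state Linus mentions: a wakeup must follow. */
    return atomic_exchange(&count, 1) == -1;  /* 1 = wakeup needed */
}
```

The point of the third state is visible in the unlock path: the uncontended case (old value 0) needs no extra action, so the wakeup cost is only paid when someone is actually waiting.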
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 13:44 ` Andrew Morton 2005-12-22 14:11 ` Alan Cox 2005-12-22 15:46 ` Thomas Gleixner @ 2005-12-22 17:17 ` Christoph Hellwig 2 siblings, 0 replies; 62+ messages in thread From: Christoph Hellwig @ 2005-12-22 17:17 UTC (permalink / raw) To: Andrew Morton Cc: Arjan van de Ven, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml On Thu, Dec 22, 2005 at 05:44:13AM -0800, Andrew Morton wrote: > > You keep saying 10 times "so please enhance semaphores to do this". > > Well I think diligence requires that we be able to demonstrate why it's not > possible. > > It's plain dumb for us to justify a fancy-pants new system based on > features which we could have added to the old one, no? But why should we add features to the semaphores? There are very few users of real semaphore semantics, and they could do with a generic, all-C implementation because they are not important fast paths. OTOH we have lots of places needing plain mutex semantics that are important fast paths. Let's optimize for the common case instead of the corner case. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 13:07 ` Andrew Morton 2005-12-22 13:23 ` Arjan van de Ven @ 2005-12-22 15:19 ` Nicolas Pitre 2005-12-22 21:43 ` Paul Mackerras 2 siblings, 0 replies; 62+ messages in thread From: Nicolas Pitre @ 2005-12-22 15:19 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml On Thu, 22 Dec 2005, Andrew Morton wrote: > I'd prefer to see mutexes compared with semaphores after you've put as much > work into improving semaphores as you have into developing mutexes. There is a fundamental difference between semaphores and mutexes. The semaphore semantics _require_ atomic increments/decrements, where mutexes do not. This makes a huge difference on ARM, where 99% of all ARM processors out there can only perform atomic swaps, which is sufficient for mutexes but insufficient for semaphores. Therefore, on ARM, performing an atomic increment/decrement (the semaphore fast path) requires extra costly locking and .text space (23 cycles over 8 instructions), while the mutex fast path is about 2-3 instructions and needs 7-8 cycles. I bet many other architectures are in the same camp. Nicolas ^ permalink raw reply [flat|nested] 62+ messages in thread
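The distinction Nicolas draws can be illustrated with C11 atomics: a mutex acquisition needs only one unconditional swap, while a semaphore decrement is a full read-modify-write that a swap-only CPU has to emulate with extra locking. A hypothetical sketch with invented names, standing in for the respective fast paths:

```c
#include <stdatomic.h>

/* Mutex-style trylock: a single unconditional exchange suffices,
 * which is why an atomic swap instruction alone is enough. */
static int swap_trylock(atomic_int *lock)
{
    return atomic_exchange(lock, 0) == 1;   /* 1 = it was unlocked */
}

/* Semaphore-style down_trylock: must atomically read, test, and
 * decrement the counter -- a read-modify-write that a bare swap
 * cannot express, hence the extra cost on swap-only hardware. */
static int sem_down_trylock(atomic_int *count)
{
    int old = atomic_load(count);
    while (old > 0) {
        /* On failure 'old' is refreshed with the current value
         * and we retry. */
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return 1;   /* got the semaphore */
    }
    return 0;           /* count was already 0 */
}
```

On architectures without compare-and-swap, the loop in the second function must instead be protected by disabling interrupts or taking a spinlock, which is where the cycle and .text cost comes from.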
* Re: [patch 0/9] mutex subsystem, -V4 2005-12-22 13:07 ` Andrew Morton 2005-12-22 13:23 ` Arjan van de Ven 2005-12-22 15:19 ` Nicolas Pitre @ 2005-12-22 21:43 ` Paul Mackerras 2 siblings, 0 replies; 62+ messages in thread From: Paul Mackerras @ 2005-12-22 21:43 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml Andrew Morton writes: > Ingo Molnar <mingo@elte.hu> wrote: > > - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes, > > 'struct mutex' is 16 bytes. A smaller structure size means less RAM > > footprint, and better CPU-cache utilization. > > Because of the .sleepers thing. Perhaps a revamped semaphore wouldn't need > that field? Note that semaphores on 32-bit PPC are only 16 bytes already, since I got rid of the sleepers field ages ago. The fast path is just atomic_dec/inc, and the slow path needs an atomic op that does x = max(x, 0) + inc atomically and returns the old value of x. That's easy to do with lwarx/stwcx. (just two more instructions than an atomic inc), and can also be done quite straightforwardly with cmpxchg. Alpha, mips, s390 and sparc64 also use this scheme. In fact I would go so far as to say that I cannot see how it would be possible to do a mutex any smaller or faster than a counting semaphore on these architectures. Regards, Paul. ^ permalink raw reply [flat|nested] 62+ messages in thread
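The primitive Paul describes, x = max(x, 0) + inc performed atomically while returning the old x, maps naturally onto a compare-and-swap retry loop, the moral equivalent of the lwarx/stwcx. loop on PPC. A sketch using C11 atomics; the function name is invented for illustration:

```c
#include <stdatomic.h>

/* Atomically perform  x = max(x, 0) + inc  and return the old x.
 * In the semaphore slow path, a negative old value tells the caller
 * that sleepers were (or may have been) present. */
static int atomic_max0_add_return_old(atomic_int *x, int inc)
{
    int old = atomic_load(x);
    for (;;) {
        int new = (old > 0 ? old : 0) + inc;
        /* On failure, 'old' is refreshed with the current value and
         * we retry -- mirroring a failed stwcx. store-conditional. */
        if (atomic_compare_exchange_weak(x, &old, new))
            return old;
    }
}
```

With cmpxchg this costs a loop iteration per contended retry; with lwarx/stwcx. the max-and-add sits directly between the load-reserve and store-conditional, which is Paul's "just two more instructions than an atomic inc".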
end of thread, other threads:[~2006-01-05 19:11 UTC | newest]
Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
2005-12-22 11:41 [patch 0/9] mutex subsystem, -V4 Ingo Molnar
2005-12-22 11:53 ` Christoph Hellwig
2005-12-22 12:45 ` Ingo Molnar
2005-12-22 15:34 ` Nicolas Pitre
2005-12-22 15:40 ` Ingo Molnar
2005-12-22 16:32 ` Nicolas Pitre
2005-12-22 16:44 ` Ingo Molnar
2005-12-22 16:58 ` Russell King
2005-12-22 21:04 ` Ingo Molnar
2005-12-22 21:26 ` Russell King
2005-12-22 21:27 ` Nicolas Pitre
2005-12-22 21:37 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
2005-12-22 21:53 ` Christoph Hellwig
2005-12-22 21:40 ` [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines Nicolas Pitre
2005-12-22 21:54 ` [patch 0/9] mutex subsystem, -V4 Ingo Molnar
2005-12-22 16:58 ` Nicolas Pitre
2005-12-22 17:20 ` Christoph Hellwig
2005-12-22 17:33 ` Steven Rostedt
2005-12-22 18:24 ` Nicolas Pitre
2005-12-22 11:54 ` Andrew Morton
2005-12-22 12:20 ` Ingo Molnar
2005-12-22 13:07 ` Andrew Morton
2005-12-22 13:23 ` Arjan van de Ven
2005-12-22 13:44 ` Andrew Morton
2005-12-22 14:11 ` Alan Cox
2005-12-22 23:30 ` Andrew Morton
2005-12-22 23:34 ` Christoph Hellwig
[not found] ` <32801.10.10.10.28.1135295357.squirrel@linux1>
2005-12-22 23:49 ` Sean
2005-12-22 23:53 ` Randy.Dunlap
[not found] ` <50572.10.10.10.28.1135296023.squirrel@linux1>
2005-12-23 0:00 ` Sean
2005-12-23 0:00 ` Steven Rostedt
[not found] ` <20051222221311.2f6056ec.akpm@osdl.org>
2005-12-23 14:24 ` Nicolas Pitre
2005-12-23 14:51 ` Andrew Morton
2005-12-23 14:57 ` Russell King
2005-12-23 15:04 ` Xavier Bestel
2005-12-23 15:27 ` Andrew Morton
2005-12-23 15:00 ` Steven Rostedt
2006-01-03 17:54 ` Abhijit Bhopatkar
2005-12-25 16:08 ` Roman Zippel
2005-12-25 22:54 ` Ingo Molnar
2005-12-26 21:49 ` Roman Zippel
2005-12-25 23:04 ` Andrew Morton
2005-12-25 23:22 ` Ingo Molnar
2005-12-26 10:35 ` Andrew Morton
2005-12-26 10:42 ` Arjan van de Ven
2005-12-26 11:11 ` Andrew Morton
2005-12-26 17:15 ` Mike Galbraith
2005-12-26 17:44 ` Lee Revell
2005-12-27 0:32 ` David Lang
2005-12-26 18:15 ` Linus Torvalds
2005-12-27 14:42 ` Ingo Molnar
2005-12-27 23:02 ` Andrew Morton
2005-12-26 0:33 ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek
2006-01-05 15:30 ` Andi Kleen
2006-01-05 19:08 ` Pavel Machek
2005-12-26 15:29 ` [patch 0/9] mutex subsystem, -V4 Nicolas Pitre
2005-12-22 15:46 ` Thomas Gleixner
2005-12-22 17:40 ` Linus Torvalds
2005-12-22 20:09 ` Steven Rostedt
2005-12-22 17:17 ` Christoph Hellwig
2005-12-22 15:19 ` Nicolas Pitre
2005-12-22 21:43 ` Paul Mackerras