* Announce: Semaphore-Removal tree
@ 2008-04-25 17:00 Matthew Wilcox
2008-04-25 20:24 ` Daniel Walker
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Matthew Wilcox @ 2008-04-25 17:00 UTC (permalink / raw)
To: linux-kernel; +Cc: Stephen Rothwell
It's been a Good Idea for a while to use mutexes instead of
semaphores where possible. Additional debuggability, better optimised,
better-enforced semantics, etc.
Obviously, there are some places that can't be converted to mutexes.
I'm not proposing blind changes. I've been through a dozen places
and found some that can be converted to spinlocks, others to
completions, but mostly to mutexes. So as not to lose these patches,
I've made them available as a git tree. I think the right way to get
these patches in will be to go through the maintainers of each file.
I'll post a list of the places using semaphores later. My current list
is a bit out of date and contains insulting comments that I would like
to rephrase before the authors of the code in question see them ;-)
http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=semaphore-removal
Stephen, could you pick these patches up for linux-next?
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git semaphore-removal
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-25 17:00 Announce: Semaphore-Removal tree Matthew Wilcox
@ 2008-04-25 20:24 ` Daniel Walker
2008-04-25 20:38 ` Daniel Walker
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: Daniel Walker @ 2008-04-25 20:24 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-kernel, Stephen Rothwell
On Fri, 2008-04-25 at 11:00 -0600, Matthew Wilcox wrote:
> It's been a Good Idea for a while to use mutexes instead of
> semaphores where possible. Additional debuggability, better optimised,
> better-enforced semantics, etc.
>
> Obviously, there are some places that can't be converted to mutexes.
> I'm not proposing blind changes. I've been through a dozen places
> and found some that can be converted to spinlocks, others to
> completions, but mostly to mutexes. So as not to lose these patches,
> I've made them available as a git tree. I think the right way to get
> these patches in will be to go through the maintainers of each file.
>
> I'll post a list of the places using semaphores later. My current list
> is a bit out of date and contains insulting comments that I would like
> to rephrase before the authors of the code in question see them ;-)
I've got a bunch of these if you want to incorporate them.. I've been
slowly trickling them up stream ..
Daniel
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-25 17:00 Announce: Semaphore-Removal tree Matthew Wilcox
2008-04-25 20:24 ` Daniel Walker
@ 2008-04-25 20:38 ` Daniel Walker
2008-04-25 21:12 ` Christoph Hellwig
2008-04-26 13:54 ` Stephen Rothwell
2008-04-28 5:10 ` David Chinner
3 siblings, 1 reply; 21+ messages in thread
From: Daniel Walker @ 2008-04-25 20:38 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-kernel, Stephen Rothwell
On Fri, 2008-04-25 at 11:00 -0600, Matthew Wilcox wrote:
> http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=semaphore-removal
>
I was reviewing your patches, and I don't like the semaphore to spinlock
changes.. There's no reason to start adding spinlocks, unless it's
really performance sensitive which none of those places are..
Also the ps3-gelic changes for instance is three locks in one patch. It
should be one lock per patch .. I have a broken out conversion for gelic
which I was going to submit in my next round ..
There's also a number of other people doing these , so you might want to
hold off on doing more unless you know they aren't already completed ..
Daniel
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-25 20:38 ` Daniel Walker
@ 2008-04-25 21:12 ` Christoph Hellwig
2008-04-25 21:22 ` Daniel Walker
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2008-04-25 21:12 UTC (permalink / raw)
To: Daniel Walker; +Cc: Matthew Wilcox, linux-kernel, Stephen Rothwell
On Fri, Apr 25, 2008 at 01:38:37PM -0700, Daniel Walker wrote:
> I was reviewing your patches, and I don't like the semaphore to spinlock
> changes.. There's no reason to start adding spinlocks, unless it's
> really performance sensitive which none of those places are..
Yes, there is. The spinlock is our most efficient locking primitive
for the normal mostly un-contentded case. Please get out of your
realtime-ghetto.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-25 21:12 ` Christoph Hellwig
@ 2008-04-25 21:22 ` Daniel Walker
2008-04-26 9:30 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Daniel Walker @ 2008-04-25 21:22 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Matthew Wilcox, linux-kernel, Stephen Rothwell
On Fri, 2008-04-25 at 17:12 -0400, Christoph Hellwig wrote:
> On Fri, Apr 25, 2008 at 01:38:37PM -0700, Daniel Walker wrote:
> > I was reviewing your patches, and I don't like the semaphore to spinlock
> > changes.. There's no reason to start adding spinlocks, unless it's
> > really performance sensitive which none of those places are..
>
> Yes, there is. The spinlock is our most efficient locking primitive
> for the normal mostly un-contentded case. Please get out of your
> realtime-ghetto.
If you can make a case for converting some semaphores to spinlocks be my
guest .. If you have good reasoning I wouldn't stand in the way.. (Real
time converts all the spinlocks to mutexes anyway ..)
Daniel
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-25 21:22 ` Daniel Walker
@ 2008-04-26 9:30 ` Christoph Hellwig
2008-04-26 13:39 ` Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2008-04-26 9:30 UTC (permalink / raw)
To: Daniel Walker
Cc: Christoph Hellwig, Matthew Wilcox, linux-kernel, Stephen Rothwell
On Fri, Apr 25, 2008 at 02:22:31PM -0700, Daniel Walker wrote:
> If you can make a case for converting some semaphores to spinlocks be my
> guest .. If you have good reasoning I wouldn't stand in the way.. (Real
> time converts all the spinlocks to mutexes anyway ..)
Right at hand I have the XFS inode hash lock was converted from a rw_semaphore
to a rwlock_t becuase the context switch overhead was killing
performance in various benchmarks. This is a very typical scenary for
locks that are taken often and held for a rather short time. Add to
that fact that a spinlock is compltely optimized away for an UP kernel
while a mutex is not and the amount of memory that any mutex takes
compared to a spinlock you have a clear winner.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-26 9:30 ` Christoph Hellwig
@ 2008-04-26 13:39 ` Peter Zijlstra
2008-04-26 13:44 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2008-04-26 13:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Daniel Walker, Matthew Wilcox, linux-kernel, Stephen Rothwell
On Sat, 2008-04-26 at 05:30 -0400, Christoph Hellwig wrote:
> On Fri, Apr 25, 2008 at 02:22:31PM -0700, Daniel Walker wrote:
> > If you can make a case for converting some semaphores to spinlocks be my
> > guest .. If you have good reasoning I wouldn't stand in the way.. (Real
> > time converts all the spinlocks to mutexes anyway ..)
>
> Right at hand I have the XFS inode hash lock was converted from a rw_semaphore
> to a rwlock_t becuase the context switch overhead was killing
> performance in various benchmarks. This is a very typical scenary for
> locks that are taken often and held for a rather short time. Add to
> that fact that a spinlock is compltely optimized away for an UP kernel
> while a mutex is not and the amount of memory that any mutex takes
> compared to a spinlock you have a clear winner.
I'm guessing RCU would be a bit more work?
The problem with rwlock_t is that for it to be a spinning lock the hold
times should be short, for it to be a rwlock over a spinlock there
should be a significant amount of concurrency, these two things together
make for a cache-line bouncing fest.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-26 13:39 ` Peter Zijlstra
@ 2008-04-26 13:44 ` Christoph Hellwig
2008-04-26 14:04 ` Peter Zijlstra
2008-04-28 4:59 ` David Chinner
0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2008-04-26 13:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Hellwig, Daniel Walker, Matthew Wilcox, linux-kernel,
Stephen Rothwell
On Sat, Apr 26, 2008 at 03:39:13PM +0200, Peter Zijlstra wrote:
> I'm guessing RCU would be a bit more work?
Actually the whole XFS inode hash is gone now and replaced by a radix tree.
As soon as nick and your work on the lockless radix trees goes in I'm
pretty sure Dave will make use of that.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-25 17:00 Announce: Semaphore-Removal tree Matthew Wilcox
2008-04-25 20:24 ` Daniel Walker
2008-04-25 20:38 ` Daniel Walker
@ 2008-04-26 13:54 ` Stephen Rothwell
2008-04-26 15:59 ` Matthew Wilcox
2008-04-28 5:10 ` David Chinner
3 siblings, 1 reply; 21+ messages in thread
From: Stephen Rothwell @ 2008-04-26 13:54 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 456 bytes --]
Hi Willy,
On Fri, 25 Apr 2008 11:00:21 -0600 Matthew Wilcox <matthew@wil.cx> wrote:
>
> Stephen, could you pick these patches up for linux-next?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git semaphore-removal
This is in addition to the semaphore branch? Should it go before or after
that i.e. does it depend on it?
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-26 13:44 ` Christoph Hellwig
@ 2008-04-26 14:04 ` Peter Zijlstra
2008-04-28 4:59 ` David Chinner
1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2008-04-26 14:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Daniel Walker, Matthew Wilcox, linux-kernel, Stephen Rothwell
On Sat, 2008-04-26 at 09:44 -0400, Christoph Hellwig wrote:
> On Sat, Apr 26, 2008 at 03:39:13PM +0200, Peter Zijlstra wrote:
> > I'm guessing RCU would be a bit more work?
>
> Actually the whole XFS inode hash is gone now and replaced by a radix tree.
> As soon as nick and your work on the lockless radix trees goes in I'm
> pretty sure Dave will make use of that.
Nick's RCU radix tree is already merged - its just the speculative page
references and the rest of the lockless page-cache that is still
missing.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-26 13:54 ` Stephen Rothwell
@ 2008-04-26 15:59 ` Matthew Wilcox
2008-04-26 16:43 ` Stephen Rothwell
0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2008-04-26 15:59 UTC (permalink / raw)
To: Stephen Rothwell; +Cc: linux-kernel
On Sat, Apr 26, 2008 at 11:54:39PM +1000, Stephen Rothwell wrote:
> Hi Willy,
>
> On Fri, 25 Apr 2008 11:00:21 -0600 Matthew Wilcox <matthew@wil.cx> wrote:
> >
> > Stephen, could you pick these patches up for linux-next?
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git semaphore-removal
>
> This is in addition to the semaphore branch? Should it go before or after
> that i.e. does it depend on it?
It's in addition to, and it's independent of it. I don't intend to
create any conflicts or dependencies between them. In case I do ...
let's say semaphore-removal goes after semaphore.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-26 15:59 ` Matthew Wilcox
@ 2008-04-26 16:43 ` Stephen Rothwell
0 siblings, 0 replies; 21+ messages in thread
From: Stephen Rothwell @ 2008-04-26 16:43 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 410 bytes --]
Hi Willy,
On Sat, 26 Apr 2008 09:59:39 -0600 Matthew Wilcox <matthew@wil.cx> wrote:
>
> It's in addition to, and it's independent of it. I don't intend to
> create any conflicts or dependencies between them. In case I do ...
> let's say semaphore-removal goes after semaphore.
OK, added.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-26 13:44 ` Christoph Hellwig
2008-04-26 14:04 ` Peter Zijlstra
@ 2008-04-28 4:59 ` David Chinner
1 sibling, 0 replies; 21+ messages in thread
From: David Chinner @ 2008-04-28 4:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Peter Zijlstra, Daniel Walker, Matthew Wilcox, linux-kernel,
Stephen Rothwell
On Sat, Apr 26, 2008 at 09:44:30AM -0400, Christoph Hellwig wrote:
> On Sat, Apr 26, 2008 at 03:39:13PM +0200, Peter Zijlstra wrote:
> > I'm guessing RCU would be a bit more work?
>
> Actually the whole XFS inode hash is gone now and replaced by a radix tree.
> As soon as nick and your work on the lockless radix trees goes in I'm
> pretty sure Dave will make use of that.
I have patches for that, but came across reference counting issues
with the refcount being held in the linux inode not the XFS inode
and hence not being able to properly refcount the xfs inode once
the linux inode is gone away....
Solvable, but I've got to find some time to work on stuff that isn't
just bug fixing....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-25 17:00 Announce: Semaphore-Removal tree Matthew Wilcox
` (2 preceding siblings ...)
2008-04-26 13:54 ` Stephen Rothwell
@ 2008-04-28 5:10 ` David Chinner
2008-04-28 12:20 ` Matthew Wilcox
3 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2008-04-28 5:10 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-kernel, Stephen Rothwell
On Fri, Apr 25, 2008 at 11:00:21AM -0600, Matthew Wilcox wrote:
>
> It's been a Good Idea for a while to use mutexes instead of
> semaphores where possible. Additional debuggability, better optimised,
> better-enforced semantics, etc.
>
> Obviously, there are some places that can't be converted to mutexes.
> I'm not proposing blind changes.
Matthew, what's the plan for code using semaphores that cannot be
easily converted to something else? e.g. XFS?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-28 5:10 ` David Chinner
@ 2008-04-28 12:20 ` Matthew Wilcox
2008-04-29 0:09 ` David Chinner
0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2008-04-28 12:20 UTC (permalink / raw)
To: David Chinner; +Cc: linux-kernel, Stephen Rothwell
On Mon, Apr 28, 2008 at 03:10:40PM +1000, David Chinner wrote:
> On Fri, Apr 25, 2008 at 11:00:21AM -0600, Matthew Wilcox wrote:
> >
> > It's been a Good Idea for a while to use mutexes instead of
> > semaphores where possible. Additional debuggability, better optimised,
> > better-enforced semantics, etc.
> >
> > Obviously, there are some places that can't be converted to mutexes.
> > I'm not proposing blind changes.
>
> Matthew, what's the plan for code using semaphores that cannot be
> easily converted to something else? e.g. XFS?
I'm glad you asked!
Arjan, Ingo and I have been batting around something called a kcounter.
I appear to have misplaced the patch right now, but the basic idea is
that it returns you a cookie when you down(), which you then have to
pass to the up()-equivalent. This gives you at least some of the
assurances you get from mutexes.
Though ... looking at XFS, you have 5 counting semaphores currently:
1. i_flock
This one seems to be a mutex. I'd need to immerse myself in XFS for a
couple of days to verify that though -- there's a lot of places that use
it, and it doesn't have obvious lock/unlock pairing. Is it sometimes
unlocked from a different thread than the one which locked it? If so,
kcounters might be the right thing to use here.
2. l_flushsema
This seems to be a completion. ie you're using it to wait for the log
to be flushed.
3. q_flock
Ow. ow. My brain hurts. What are these semantics?
4. b_iodonesema
This should be a completion. It's used to wait for the io to be
complete.
5. b_sema
This looks like a mutex, but I think it's released in a different
context from the one which acquires it.
-----
Possibly XFS should be using constructs like wait_on_bit instead of
semaphores. See the implementation of wait_on_buffer for an example.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-28 12:20 ` Matthew Wilcox
@ 2008-04-29 0:09 ` David Chinner
2008-04-29 2:35 ` Matthew Wilcox
2008-04-30 10:06 ` Matthew Wilcox
0 siblings, 2 replies; 21+ messages in thread
From: David Chinner @ 2008-04-29 0:09 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: David Chinner, linux-kernel, Stephen Rothwell
On Mon, Apr 28, 2008 at 06:20:04AM -0600, Matthew Wilcox wrote:
> On Mon, Apr 28, 2008 at 03:10:40PM +1000, David Chinner wrote:
> > On Fri, Apr 25, 2008 at 11:00:21AM -0600, Matthew Wilcox wrote:
> > >
> > > It's been a Good Idea for a while to use mutexes instead of
> > > semaphores where possible. Additional debuggability, better optimised,
> > > better-enforced semantics, etc.
> > >
> > > Obviously, there are some places that can't be converted to mutexes.
> > > I'm not proposing blind changes.
> >
> > Matthew, what's the plan for code using semaphores that cannot be
> > easily converted to something else? e.g. XFS?
>
> I'm glad you asked!
>
> Arjan, Ingo and I have been batting around something called a kcounter.
> I appear to have misplaced the patch right now, but the basic idea is
> that it returns you a cookie when you down(), which you then have to
> pass to the up()-equivalent. This gives you at least some of the
> assurances you get from mutexes.
<sigh>
back to the days of cookies being required for locks. We only just
removed all the remaining lock cruft left over from Irix that used
cookies like this. i.e.:
DECL_LOCK_COOKIE(cookie);
cookie = spin_lock(&lock);
.....
spin_unlock(&lock, cookie);
it's an ugly, ugly API....
> Though ... looking at XFS, you have 5 counting semaphores currently:
>
> 1. i_flock
>
> This one seems to be a mutex.
No, it's a semaphore. It is the inode flush lock and is held over
I/O on the inode. It is released in a different context to the
process that holds it. We use trylock semantics on it all the time
to determine if we can write the inode to disk.
> 2. l_flushsema
>
> This seems to be a completion. ie you're using it to wait for the log
> to be flushed.
Yes, that could probably be a completion. I'm assuming that a completion
can handle several thousand waiting processes, right?
> 3. q_flock
>
> Ow. ow. My brain hurts. What are these semantics?
Same semantics as the i_flock - it's held while flushing the dquot
to disk and is released by a different thread. Trylocks are used on
this as well...
> 4. b_iodonesema
>
> This should be a completion. It's used to wait for the io to be
> complete.
Yup, that could be done.
> 5. b_sema
>
> This looks like a mutex, but I think it's released in a different
> context from the one which acquires it.
Yup. held across I/O and typically released by a different thread.
Trylock semantics used as well.
> Possibly XFS should be using constructs like wait_on_bit instead of
> semaphores. See the implementation of wait_on_buffer for an example.
That sounds to me like you are saying is "semaphores are going away so
implement your own semaphore-like thingy using some other construct".
Right?
If that's the case, then AFAICT changing to completions and then
s/semaphore/rw_semaphore/ and using only {down,up}_write() for
the rest should work, right? Or are rwsem's going to go away, too?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-29 0:09 ` David Chinner
@ 2008-04-29 2:35 ` Matthew Wilcox
2008-04-29 3:56 ` David Chinner
2008-04-30 10:06 ` Matthew Wilcox
1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2008-04-29 2:35 UTC (permalink / raw)
To: David Chinner; +Cc: linux-kernel, Stephen Rothwell, Christoph Hellwig
On Tue, Apr 29, 2008 at 10:09:30AM +1000, David Chinner wrote:
> On Mon, Apr 28, 2008 at 06:20:04AM -0600, Matthew Wilcox wrote:
> > Arjan, Ingo and I have been batting around something called a kcounter.
> > I appear to have misplaced the patch right now, but the basic idea is
> > that it returns you a cookie when you down(), which you then have to
> > pass to the up()-equivalent. This gives you at least some of the
> > assurances you get from mutexes.
>
> <sigh>
>
> back to the days of cookies being required for locks. We only just
> removed all the remaining lock cruft left over from Irix that used
> cookies like this. i.e.:
>
> DECL_LOCK_COOKIE(cookie);
>
> cookie = spin_lock(&lock);
> .....
> spin_unlock(&lock, cookie);
>
> it's an ugly, ugly API....
Perhaps you can suggest a better one? Our thought was that you have ...
struct xfs_inode {
struct kcounter_t i_flock
};
struct foo {
... other stuff you need for the io ...
kcounter_cookie_t kct;
}
int err = kcounter_claim(&ino->i_flock, &foo->kct);
...
kcounter_release(&ino->i_flock, &foo->kct);
> > Though ... looking at XFS, you have 5 counting semaphores currently:
> >
> > 1. i_flock
> >
> > This one seems to be a mutex.
>
> No, it's a semaphore. It is the inode flush lock and is held over
> I/O on the inode. It is released in a different context to the
> process that holds it. We use trylock semantics on it all the time
> to determine if we can write the inode to disk.
If you're always using trylock semantics on it, then it's not really a
semaphore, is it?
> > 2. l_flushsema
> >
> > This seems to be a completion. ie you're using it to wait for the log
> > to be flushed.
>
> Yes, that could probably be a completion. I'm assuming that a completion
> can handle several thousand waiting processes, right?
Up to 2 billion.
> > 3. q_flock
> >
> > Ow. ow. My brain hurts. What are these semantics?
>
> Same semantics as the i_flock - it's held while flushing the dquot
> to disk and is released by a different thread. Trylocks are used on
> this as well...
... but not just trylocks, right? There's a sleeping aspect to them
too.
> > 4. b_iodonesema
> >
> > This should be a completion. It's used to wait for the io to be
> > complete.
>
> Yup, that could be done.
>
> > 5. b_sema
> >
> > This looks like a mutex, but I think it's released in a different
> > context from the one which acquires it.
>
> Yup. held across I/O and typically released by a different thread.
> Trylock semantics used as well.
OK.
> > Possibly XFS should be using constructs like wait_on_bit instead of
> > semaphores. See the implementation of wait_on_buffer for an example.
>
> That sounds to me like you are saying is "semaphores are going away so
> implement your own semaphore-like thingy using some other construct".
> Right?
I don't want to say that. People (and I'm *not* referring to XFS here)
manage to abuse semaphores in the most hideous ways. If we tell them to
use lower-level constructs, they'll make a mess of using those too. I
think we need to look for patterns in the semaphore users which don't
fit the mutex pattern or the completion pattern and figure out how to
satisfy those users.
> If that's the case, then AFAICT changing to completions and then
> s/semaphore/rw_semaphore/ and using only {down,up}_write() for
> the rest should work, right? Or are rwsem's going to go away, too?
I don't think there are any plans to get rid of rwsems, though the RT
people probably hate rwsems even more than they hate regular semaphores.
The mmap rwsem is a compelling argument ;-)
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-29 2:35 ` Matthew Wilcox
@ 2008-04-29 3:56 ` David Chinner
2008-04-30 10:21 ` Matthew Wilcox
0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2008-04-29 3:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Chinner, linux-kernel, Stephen Rothwell, Christoph Hellwig
On Mon, Apr 28, 2008 at 08:35:20PM -0600, Matthew Wilcox wrote:
> On Tue, Apr 29, 2008 at 10:09:30AM +1000, David Chinner wrote:
> > On Mon, Apr 28, 2008 at 06:20:04AM -0600, Matthew Wilcox wrote:
> > > Arjan, Ingo and I have been batting around something called a kcounter.
> > > I appear to have misplaced the patch right now, but the basic idea is
> > > that it returns you a cookie when you down(), which you then have to
> > > pass to the up()-equivalent. This gives you at least some of the
> > > assurances you get from mutexes.
> >
> > <sigh>
> >
> > back to the days of cookies being required for locks. We only just
> > removed all the remaining lock cruft left over from Irix that used
> > cookies like this. i.e.:
> >
> > DECL_LOCK_COOKIE(cookie);
> >
> > cookie = spin_lock(&lock);
> > .....
> > spin_unlock(&lock, cookie);
> >
> > it's an ugly, ugly API....
>
> Perhaps you can suggest a better one? Our thought was that you have ...
>
> struct xfs_inode {
> struct kcounter_t i_flock
> };
>
> struct foo {
> ... other stuff you need for the io ...
> kcounter_cookie_t kct;
> }
You mean:
struct kcounter_sem {
struct kcounter cnt;
kcounter_cookie_t cookie;
};
#define down(s) kcounter_claim(&s->cnt, &s->cookie);
#define up(s) kcounter_release(&s->cnt, &s->cookie);
I can't see how this fixes the semaphore abuse problem at all
because you can trivially roll your own. We know where that
leads (i.e. everyone does it their own unique way)...
> int err = kcounter_claim(&ino->i_flock, &foo->kct);
> ...
> kcounter_release(&ino->i_flock, &foo->kct);
Is there the possibility of errors when taking a counter reference
in this api? i.e. can the equivalent of "down()" return an error?
> > > Though ... looking at XFS, you have 5 counting semaphores currently:
> > >
> > > 1. i_flock
> > >
> > > This one seems to be a mutex.
> >
> > No, it's a semaphore. It is the inode flush lock and is held over
> > I/O on the inode. It is released in a different context to the
> > process that holds it. We use trylock semantics on it all the time
> > to determine if we can write the inode to disk.
>
> If you're always using trylock semantics on it, then it's not really a
> semaphore, is it?
I should have been more precise with my description - we use trylock
semantics on them when we need to gain them in different orders to
the normal heirachy (which is quite often) or we are operating in
non-blocking conditions (again quite often). Otherwise we do normal
sleeping down() calls.
> > > 3. q_flock
> > >
> > > Ow. ow. My brain hurts. What are these semantics?
> >
> > Same semantics as the i_flock - it's held while flushing the dquot
> > to disk and is released by a different thread. Trylocks are used on
> > this as well...
>
> ... but not just trylocks, right? There's a sleeping aspect to them
> too.
*nod*
> > > Possibly XFS should be using constructs like wait_on_bit instead of
> > > semaphores. See the implementation of wait_on_buffer for an example.
> >
> > That sounds to me like you are saying is "semaphores are going away so
> > implement your own semaphore-like thingy using some other construct".
> > Right?
>
> I don't want to say that. People (and I'm *not* referring to XFS here)
> manage to abuse semaphores in the most hideous ways.
Yes, I've been following the argume^W discussions wating for an outcome.
> If we tell them to
> use lower-level constructs, they'll make a mess of using those too.
See above ;)
> I
> think we need to look for patterns in the semaphore users which don't
> fit the mutex pattern or the completion pattern and figure out how to
> satisfy those users.
Ok, so here's a user that says they need a semaphore-like construct that
behaves the same way the current semaphores do. What is the solution?
> > If that's the case, then AFAICT changing to completions and then
> > s/semaphore/rw_semaphore/ and using only {down,up}_write() for
> > the rest should work, right? Or are rwsem's going to go away, too?
>
> I don't think there are any plans to get rid of rwsems, though the RT
> people probably hate rwsems even more than they hate regular semaphores.
Fmeh.
> The mmap rwsem is a compelling argument ;-)
It's an argument for a different lock type for that particular case, not
an argument for removing the lock type completely.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-29 0:09 ` David Chinner
2008-04-29 2:35 ` Matthew Wilcox
@ 2008-04-30 10:06 ` Matthew Wilcox
2008-04-30 11:01 ` David Chinner
1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2008-04-30 10:06 UTC (permalink / raw)
To: David Chinner; +Cc: linux-kernel, Stephen Rothwell
On Tue, Apr 29, 2008 at 10:09:30AM +1000, David Chinner wrote:
> > 2. l_flushsema
> >
> > This seems to be a completion. ie you're using it to wait for the log
> > to be flushed.
>
> Yes, that could probably be a completion. I'm assuming that a completion
> can handle several thousand waiting processes, right?
By the way ... is it common that you get several thousand waiting
processes? I ask because you wake them all up, then the herd thunders
into the l_icloglock spinlock. Or is this a worst-case scenario that
happens once in a blue moon?
If l_flushsema does typically get more than one waiter, we can instead
wake the waiters one at a time.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-29 3:56 ` David Chinner
@ 2008-04-30 10:21 ` Matthew Wilcox
0 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2008-04-30 10:21 UTC (permalink / raw)
To: David Chinner; +Cc: linux-kernel, Stephen Rothwell, Christoph Hellwig
On Tue, Apr 29, 2008 at 01:56:59PM +1000, David Chinner wrote:
> On Mon, Apr 28, 2008 at 08:35:20PM -0600, Matthew Wilcox wrote:
> > Perhaps you can suggest a better one? Our thought was that you have ...
> >
> > struct xfs_inode {
> > struct kcounter_t i_flock
> > };
> >
> > struct foo {
> > ... other stuff you need for the io ...
> > kcounter_cookie_t kct;
> > }
>
> You mean:
>
> struct kcounter_sem {
> struct kcounter cnt;
> kcounter_cookie_t cookie;
> };
>
> #define down(s) kcounter_claim(&s->cnt, &s->cookie);
> #define up(s) kcounter_release(&s->cnt, &s->cookie);
>
> I can't see how this fixes the semaphore abuse problem at all
> because you can trivially roll your own. We know where that
> leads (i.e. everyone does it their own unique way)...
That would actually work -- if you use it in a mutex way. It would then
have slightly sloppier semantics than mutexes -- ie you could unlock from
a different context than the one which locked. But it would whinge if
you try to use it as a completion.
> > int err = kcounter_claim(&ino->i_flock, &foo->kct);
> > ...
> > kcounter_release(&ino->i_flock, &foo->kct);
>
> Is there the possibility of errors when taking a counter reference
> in this api? i.e. can the equivalent of "down()" return an error?
My current thinking was not to support an equivalent to down(), only to
down_killable(). On further thought, that may not be wise.
> > If you're always using trylock semantics on it, then it's not really a
> > semaphore, is it?
>
> I should have been more precise with my description - we use trylock
> semantics on them when we need to gain them in different orders to
> the normal heirachy (which is quite often) or we are operating in
> non-blocking conditions (again quite often). Otherwise we do normal
> sleeping down() calls.
OK.
> > > > Possibly XFS should be using constructs like wait_on_bit instead of
> > > > semaphores. See the implementation of wait_on_buffer for an example.
> > >
> > > That sounds to me like you are saying is "semaphores are going away so
> > > implement your own semaphore-like thingy using some other construct".
> > > Right?
> >
> > I don't want to say that. People (and I'm *not* referring to XFS here)
> > manage to abuse semaphores in the most hideous ways.
>
> Yes, I've been following the argume^W discussions wating for an outcome.
>
> > If we tell them to
> > use lower-level constructs, they'll make a mess of using those too.
>
> See above ;)
>
> > I
> > think we need to look for patterns in the semaphore users which don't
> > fit the mutex pattern or the completion pattern and figure out how to
> > satisfy those users.
>
> Ok, so here's a user that says they need a semaphore-like construct that
> behaves the same way the current semaphores do. What is the solution?
I think it's kcounters. The trick is creating an API that's
lockdep-friendly and works for the users.
> > I don't think there are any plans to get rid of rwsems, though the RT
> > people probably hate rwsems even more than they hate regular semaphores.
>
> Fmeh.
>
> > The mmap rwsem is a compelling argument ;-)
>
> It's an argument for a different lock type for that particular case, not
> an argument for removing the lock type completely.
Ah, I meant it's a compelling argument for keeping rwsems.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Announce: Semaphore-Removal tree
2008-04-30 10:06 ` Matthew Wilcox
@ 2008-04-30 11:01 ` David Chinner
0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2008-04-30 11:01 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: David Chinner, linux-kernel, Stephen Rothwell
On Wed, Apr 30, 2008 at 04:06:13AM -0600, Matthew Wilcox wrote:
> On Tue, Apr 29, 2008 at 10:09:30AM +1000, David Chinner wrote:
> > > 2. l_flushsema
> > >
> > > This seems to be a completion. ie you're using it to wait for the log
> > > to be flushed.
> >
> > Yes, that could probably be a completion. I'm assuming that a completion
> > can handle several thousand waiting processes, right?
>
> By the way ... is it common that you get several thousand waiting
> processes? I ask because you wake them all up, then the herd thunders
> into the l_icloglock spinlock. Or is this a worst-case scenario that
> happens once in a blue moon?
I'm thinking of a certain 2048p machine at NASA where they run
MPI jobs that do synchronised exit() calls with about 6-7 open
file descriptors that all run into ->release at the same time
and try to do EOF truncation transcations all at the same time.
We get the first 100-200 processes filling all the log buffers
and forcing them to disk, and then the rest start waiting on
the flush sema.
Given that we're already pushing 4096p support for the next gen
machines, this problem isn't going to get any better...
> If l_flushsema does typically get more than one waiter, we can instead
> wake the waiters one at a time.
The current code does them one at a time, but in such a way that is
not any better than a thundering herd. wake_up_all() is probably
a more efficient thundering herd as it doesn't require picking up
and dropping a spinlock for every task being woken.
I think that we need to redesign this code to prevent the thundering
herd problem - it's not problem solvable by just changing the wakeup
call. Besides, the evidence points to the fact that a thundering
herd on the flush sema is not the limiting factor in thoughput.
There's still several layers of the onion to peel before
we get to that point.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2008-04-30 11:02 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-25 17:00 Announce: Semaphore-Removal tree Matthew Wilcox
2008-04-25 20:24 ` Daniel Walker
2008-04-25 20:38 ` Daniel Walker
2008-04-25 21:12 ` Christoph Hellwig
2008-04-25 21:22 ` Daniel Walker
2008-04-26 9:30 ` Christoph Hellwig
2008-04-26 13:39 ` Peter Zijlstra
2008-04-26 13:44 ` Christoph Hellwig
2008-04-26 14:04 ` Peter Zijlstra
2008-04-28 4:59 ` David Chinner
2008-04-26 13:54 ` Stephen Rothwell
2008-04-26 15:59 ` Matthew Wilcox
2008-04-26 16:43 ` Stephen Rothwell
2008-04-28 5:10 ` David Chinner
2008-04-28 12:20 ` Matthew Wilcox
2008-04-29 0:09 ` David Chinner
2008-04-29 2:35 ` Matthew Wilcox
2008-04-29 3:56 ` David Chinner
2008-04-30 10:21 ` Matthew Wilcox
2008-04-30 10:06 ` Matthew Wilcox
2008-04-30 11:01 ` David Chinner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox