[parisc-linux] ldcw in __pthread

* [parisc-linux] ldcw in __pthread_acquire
@ 2000-12-15 10:12 John Marvin
  2000-12-15 11:37 ` Alan Modra
  0 siblings, 1 reply; 29+ messages in thread
From: John Marvin @ 2000-12-15 10:12 UTC (permalink / raw)
  To: parisc-linux

I just ran into an unaligned data reference in user land.  The problem is
that the routine __pthread_acquire (in libpthread) does a ldcw, but it is
not ensuring that the address it is operating on is 16 byte aligned (it is
operating on the address that was passed in as the first argument).  Some
processors don't require the 16 byte alignment, but many do.

I haven't looked at the source (I just found the location by disassembly),
so I don't know what the root cause is.  The actual ldcw is probably from
an inlined function or macro, e.g. spin_lock().  My first guess would be
that it is using the machine dependent spin_lock macro, but the procedure
which called __pthread_acquire is passing in a structure whose lock field
does not have the aligned(16) attribute.

I don't have time to look at this right now.  Any volunteers?

John Marvin
jsm@fc.hp.com

* Re: [parisc-linux] ldcw in __pthread_acquire
@ 2000-12-15 10:26 John Marvin
  0 siblings, 0 replies; 29+ messages in thread
From: John Marvin @ 2000-12-15 10:26 UTC (permalink / raw)
  To: parisc-linux

I just realized that I forgot to mention that __pthread_acquire was called by
__pthread_alt_lock, and that it is passing a pointer that has a 4 byte offset
from whatever was passed as the first argument to __pthread_alt_lock.

John

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-15 10:12 [parisc-linux] ldcw in __pthread_acquire John Marvin
@ 2000-12-15 11:37 ` Alan Modra
  2000-12-15 16:37   ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: Alan Modra @ 2000-12-15 11:37 UTC (permalink / raw)
  To: John Marvin; +Cc: parisc-linux, parisc-linux

On Fri, 15 Dec 2000, John Marvin wrote:

> 
> I just ran into an unaligned data reference in user land.  The problem is
> that the routine __pthread_acquire (in libpthread) does a ldcw, but it is
> not ensuring that the address it is operating on is 16 byte aligned (it is
> operating on the address that was passed in as the first argument).  Some
> processors don't require the 16 byte alignment, but many do.
> 
> I haven't looked at the source (I just found the location by disassembly),
> so I don't know what the root cause is.  The actual ldcw is probably from
> an inlined function or macro, e.g. spin_lock().  My first guess would be
> that it is using the machine dependent spin_lock macro, but the procedure
> which called __pthread_acquire is passing in a structure whose lock field
> does not have the aligned(16) attribute.
> 
> I don't have time to look at this right now.  Any volunteers?

This isn't exactly volunteering, but I've looked at this code before.
Here's where the problem is:

glibc/linuxthreads/sysdeps/pthread/bits/pthreadtypes.h

struct _pthread_fastlock
{
  long int __status;   /* "Free" or "taken" or head of waiting list */
  int __spinlock;      /* Used by compare_and_swap emulation. Also,
			  adaptive SMP lock stores spin count here. */
};
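
A minimal sketch of why the second field is a problem for ldcw. The names
fastlock_demo, is_ldcw_aligned and spinlock_offset are hypothetical
stand-ins, not glibc identifiers; the struct only mirrors the field order
shown above. The spinlock word sits sizeof(long) bytes into the struct. On
an ABI with a 4-byte long that is offset 4, which would be consistent with
the 4-byte offset reported earlier in the thread, so the word cannot be
16-byte aligned when the struct itself is.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the field order of _pthread_fastlock. */
struct fastlock_demo {
    long status;    /* "free", "taken", or head of waiting list */
    int  spinlock;  /* the word a ldcw-based spin lock would operate on */
};

/* Returns nonzero if addr is a multiple of 16, as many PA-RISC CPUs
   require for the operand of ldcw. */
int is_ldcw_aligned(const void *addr)
{
    return ((uintptr_t)addr & 15u) == 0;
}

/* Byte offset of the spinlock word from the start of the struct. */
size_t spinlock_offset(void)
{
    return offsetof(struct fastlock_demo, spinlock);
}
```

Because the offset is never a multiple of 16 with this field order, a
16-byte-aligned struct always yields a misaligned spinlock word.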

--
Linuxcare.  Support for the Revolution.

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-15 11:37 ` Alan Modra
@ 2000-12-15 16:37   ` Matthew Wilcox
  2000-12-15 17:32     ` Jes Sorensen
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2000-12-15 16:37 UTC (permalink / raw)
  To: Alan Modra; +Cc: John Marvin, parisc-linux, parisc-linux

On Fri, Dec 15, 2000 at 10:37:55PM +1100, Alan Modra wrote:
> On Fri, 15 Dec 2000, John Marvin wrote:
> 
> > 
> > I just ran into an unaligned data reference in user land.  The problem is
> > that the routine __pthread_acquire (in libpthread) does a ldcw, but it is
> > not ensuring that the address it is operating on is 16 byte aligned (it is
> > operating on the address that was passed in as the first argument).  Some
> > processors don't require the 16 byte alignment, but many do.

> Here's where the problem is:
> 
> glibc/linuxthreads/sysdeps/pthread/bits/pthreadtypes.h
> 
> struct _pthread_fastlock
> {
>   long int __status;   /* "Free" or "taken" or head of waiting list */
>   int __spinlock;      /* Used by compare_and_swap emulation. Also,
> 			  adaptive SMP lock stores spin count here. */
> };

Can we move the __spinlock to be the first element of the struct,
and then mark the struct as requiring 16-byte alignment?  I have the
feeling this might `just work', since most structs will be allocated at
16-byte-aligned addresses anyway.
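
A rough sketch of that proposal, assuming GCC or Clang attribute syntax.
The names fastlock_aligned, holder and embedded_lock_is_aligned are
hypothetical and this is not the actual glibc header. With the attribute
on the struct, the compiler pads any enclosing struct so the lock lands on
a 16-byte boundary. It does not by itself constrain memory handed out by
an allocator, so heap-allocated locks would still depend on the
allocator's alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parisc-specific layout: spinlock word first, whole
   struct marked as requiring 16-byte alignment. */
struct fastlock_aligned {
    int  spinlock;  /* now at offset 0 */
    long status;
} __attribute__((aligned(16)));

/* Compile-time checks of the layout assumptions. */
_Static_assert(offsetof(struct fastlock_aligned, spinlock) == 0,
               "spinlock must be the first member");
_Static_assert(_Alignof(struct fastlock_aligned) == 16,
               "struct must be 16-byte aligned");

/* A struct that embeds the lock after a small, oddly sized member. */
struct holder {
    char tag;
    struct fastlock_aligned lock;
};

/* Returns nonzero if the embedded spinlock word is 16-byte aligned. */
int embedded_lock_is_aligned(const struct holder *h)
{
    return ((uintptr_t)&h->lock.spinlock & 15u) == 0;
}
```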

-- 
Revolutions do not require corporate support.

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-15 16:37   ` Matthew Wilcox
@ 2000-12-15 17:32     ` Jes Sorensen
  2000-12-16 19:29       ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: Jes Sorensen @ 2000-12-15 17:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Alan Modra, John Marvin, parisc-linux

>>>>> "Matthew" == Matthew Wilcox <matthew@wil.cx> writes:

Matthew> Can we move the __spinlock to be the first element of the
Matthew> struct, and then mark the struct as requiring 16-byte
Matthew> alignment?  i have the feeling this might `just work', since
Matthew> most structs will be allocated at 16-byte-aligned addresses
Matthew> anyway.

We probably cannot, we can however create our own version of the
header file for the parisc port.

Jes
(yuck yuck, I just hate linuxthreads)

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-15 17:32     ` Jes Sorensen
@ 2000-12-16 19:29       ` Matthew Wilcox
  2000-12-16 21:58         ` Jes Sorensen
  2000-12-17  1:22         ` Stan Sieler
  0 siblings, 2 replies; 29+ messages in thread
From: Matthew Wilcox @ 2000-12-16 19:29 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Matthew Wilcox, Alan Modra, John Marvin, parisc-linux

On Fri, Dec 15, 2000 at 06:32:12PM +0100, Jes Sorensen wrote:
> >>>>> "Matthew" == Matthew Wilcox <matthew@wil.cx> writes:
> 
> Matthew> Can we move the __spinlock to be the first element of the
> Matthew> struct, and then mark the struct as requiring 16-byte
> Matthew> alignment?  i have the feeling this might `just work', since
> Matthew> most structs will be allocated at 16-byte-aligned addresses
> Matthew> anyway.
> 
> We probably cannot, we can however create our own version of the
> header file for the parisc port.

That was what I meant.  The question is, would it _work_?  Or is this
struct embedded in other structs at non-16byte-aligned positions?

-- 
Revolutions do not require corporate support.

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-16 19:29       ` Matthew Wilcox
@ 2000-12-16 21:58         ` Jes Sorensen
  2000-12-17  4:31           ` Alan Modra
  2000-12-17  1:22         ` Stan Sieler
  1 sibling, 1 reply; 29+ messages in thread
From: Jes Sorensen @ 2000-12-16 21:58 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Alan Modra, John Marvin, parisc-linux

>>>>> "Matthew" == Matthew Wilcox <matthew@wil.cx> writes:

Matthew> On Fri, Dec 15, 2000 at 06:32:12PM +0100, Jes Sorensen wrote:
>>  We probably cannot, we can however create our own version of the
>> header file for the parisc port.

Matthew> That was what I meant.  the question is, would it _work_?  Or
Matthew> is this struct embedded in other structs at
Matthew> non-16byte-aligned positions?

That can be solved by adding an aligned attribute to the struct definition.

Jes

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-16 19:29       ` Matthew Wilcox
  2000-12-16 21:58         ` Jes Sorensen
@ 2000-12-17  1:22         ` Stan Sieler
  2000-12-17  2:38           ` Alan Cox
  1 sibling, 1 reply; 29+ messages in thread
From: Stan Sieler @ 2000-12-17  1:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jes Sorensen, Matthew Wilcox, Alan Modra, John Marvin,
	parisc-linux

Re:

> > Matthew> Can we move the __spinlock to be the first element of the

Why not do it "right", with a kernel locking call of some kind?
User code should never do LDCW calls ... we've just seen yet another
example of why: it's difficult to do correctly.  Providing fast
locking mechanisms is the *kernel's* job.

(That's the real short version of the well-founded diatribe I launched
against HP for suggesting that developers do their own spinlocks...
and I was proved correct when a couple of months later HP said "oops,
our suggested code was wrong". :)

Another reason for kernel implementation: *real world experience*:
the programmer might *think* that the lock will *always* be held
for a very short time, so the spin loop is acceptable ... but ... yes,
things happen ... and sometimes the programmer is wrong, and that
can have tragic consequences.  A kernel-implemented fast-lock would
presumably ...like on MPE/iX ... have a max number of hard loops,
and then revert to a blocking-lock of some kind.

BTW, MPE's solution to misaligned semaphore structures is to select
one of 15 reserved kernel semaphores (based on the bottom 4 bits of
the address of the user's misaligned semaphore) as a temporary
"helper" semaphore (to loop/ldcws on).  At least that way,
there's a 14/15 chance that two different misaligned semaphores won't
compete with each other when they shouldn't.  
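
A small sketch of how such a helper could be selected. This is only an
illustration of the idea described above, not MPE code: the name
helper_index and the exact mapping (low nibble minus one) are assumptions.
A correctly aligned semaphore has a low nibble of 0 and needs no helper,
which leaves 15 nonzero values for the 15 helpers.

```c
#include <stdint.h>

#define HELPER_COUNT 15  /* number of reserved helper semaphores */

/* Returns -1 if sem_addr is already 16-byte aligned (no helper needed),
   otherwise an index in the range 0..HELPER_COUNT-1 derived from the
   bottom 4 bits of the address. The "minus one" mapping is assumed. */
int helper_index(const void *sem_addr)
{
    unsigned low = (unsigned)((uintptr_t)sem_addr & 0xFu);

    if (low == 0)
        return -1;
    return (int)low - 1;
}
```

Two misaligned semaphores share a helper only when their addresses have
the same bottom 4 bits.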

Sorry this isn't a solution, just a suggestion that requires a bit 
more work :)

-- 
Stan Sieler                                           sieler@allegro.com
www.allegro.com/sieler/wanted/index.html                  www.sieler.com        

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-17  1:22         ` Stan Sieler
@ 2000-12-17  2:38           ` Alan Cox
  2000-12-17  4:18             ` LaMont Jones
  2000-12-18  0:29             ` Stan Sieler
  0 siblings, 2 replies; 29+ messages in thread
From: Alan Cox @ 2000-12-17  2:38 UTC (permalink / raw)
  To: Stan Sieler
  Cc: Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin,
	parisc-linux

> User code should never do LDCW calls ... we've just seen yet another
> example of why: it's difficult to do correctly.  Providing fast
> locking mechanisms is the *kernel's* job.

There are good reasons for doing buzzlocks in user space

> for a very short time, so the spin loop is acceptable ... but ... yes,
> things happen ... and sometimes the programmer is wrong, and that
> can have tragic consequences.  A kernel-implemented fast-lock would
> presumably ...like on MPE/iX ... have a max number of hard loops,
> and then revert to a blocking-lock of some kind.

That's how Mozilla does it, and with a short spin in user space performance
is way higher than always bugging the kernel

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-17  2:38           ` Alan Cox
@ 2000-12-17  4:18             ` LaMont Jones
  2000-12-18  0:29             ` Stan Sieler
  1 sibling, 0 replies; 29+ messages in thread
From: LaMont Jones @ 2000-12-17  4:18 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stan Sieler, Matthew Wilcox, Jes Sorensen, Alan Modra,
	John Marvin, parisc-linux, lamont

> Thats how Mozilla does it, and with a short spin in user space performance
> is way higher than always bugging the kernel

The hp-ux implementation of msem_lock() takes the advantages of both:
Gateway to kernel mode (very light weight - one predicted branch beyond
the function call), and then it does the spin and wait as needed.  Very
low cost, with the kernel-mode blocking easier to do than a combined
user-and-kernel-mode implementation.

Just my $.02,
lamont

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-16 21:58         ` Jes Sorensen
@ 2000-12-17  4:31           ` Alan Modra
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Modra @ 2000-12-17  4:31 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Matthew Wilcox, John Marvin, parisc-linux

On 16 Dec 2000, Jes Sorensen wrote:

> >>>>> "Matthew" == Matthew Wilcox <matthew@wil.cx> writes:
> 
> Matthew> On Fri, Dec 15, 2000 at 06:32:12PM +0100, Jes Sorensen wrote:
> >>  We probably cannot, we can however create our own version of the
> >> header file for the parisc port.
> 
> Matthew> That was what I meant.  the question is, would it _work_?  Or
> Matthew> is this struct embedded in other structs at
> Matthew> non-16byte-aligned positions?
> 
> That can be solved by adding an aligned attribute to the struct definition.

Doesn't a struct automatically inherit the maximum alignment of its
fields?
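
A quick way to check this, sketched with GCC or Clang attribute syntax and
C11 _Alignof. The struct and function names are hypothetical. Here the
aligned attribute is put on the field only, and the alignment of the
containing structs is then queried.

```c
#include <stddef.h>

/* Only the spinlock field carries the alignment requirement. */
struct fastlock_field_aligned {
    long status;
    int  spinlock __attribute__((aligned(16)));
};

/* The lock embedded after a one-byte member. */
struct outer {
    char tag;
    struct fastlock_field_aligned lock;
};

/* Alignment the compiler assigns to the lock struct itself. */
size_t fastlock_alignment(void)
{
    return _Alignof(struct fastlock_field_aligned);
}

/* Alignment the compiler assigns to the enclosing struct. */
size_t outer_alignment(void)
{
    return _Alignof(struct outer);
}

/* Offset of the spinlock word from the start of the enclosing struct. */
size_t spinlock_offset_in_outer(void)
{
    return offsetof(struct outer, lock)
         + offsetof(struct fastlock_field_aligned, spinlock);
}
```

Both structs come out 16-byte aligned, so the field's requirement does
propagate outward. The cost is padding: the spinlock word moves to offset
16 inside the lock struct.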

-- 
Linuxcare.  Support for the Revolution.

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-17  2:38           ` Alan Cox
  2000-12-17  4:18             ` LaMont Jones
@ 2000-12-18  0:29             ` Stan Sieler
  2000-12-18  0:36               ` Alan Cox
  1 sibling, 1 reply; 29+ messages in thread
From: Stan Sieler @ 2000-12-18  0:29 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin,
	parisc-linux

Re:

> There are good reasons for doing buzzlocks in user space

Not really...they all vanish when you look with the microscope of
experience.  Trust me.  I've been doing this stuff (multi-processor,
semaphores, locks, etc.) for 30 years.  

And, Lamont agrees, apparently :)

(Thanks, Lamont, and hi!)

The apparent advantages are *strictly* short term.  A single mistake
using a buzz lock from user code in a single process on a single computer
can cost more time than all properly implemented buzz locks ever save.

The "it's faster" argument is the same kind of argument as "not indenting my 
code makes it faster to write, because I don't have to waste the 
time pressing that space bar or tab key"....and precisely as bad an argument :)

Operating system functions, strangely enough, deserve to be implemented
in the *operating system*!  Gaining exclusive access to a data structure
is such a function.

-- 
Stan Sieler                                           sieler@allegro.com
www.allegro.com/sieler/wanted/index.html                  www.sieler.com        

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18  0:29             ` Stan Sieler
@ 2000-12-18  0:36               ` Alan Cox
  2000-12-18  0:48                 ` Stan Sieler
  2000-12-18  7:10                 ` Philippe Benard
  0 siblings, 2 replies; 29+ messages in thread
From: Alan Cox @ 2000-12-18  0:36 UTC (permalink / raw)
  To: Stan Sieler
  Cc: Alan Cox, Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin,
	parisc-linux

> The apparent advantages are *strictly* short term.  A single mistake
> using a buzz lock from user code in a single process on a single computer
> can cost more time than all properly implemented buzz locks ever save.

The cost of a syscall against a rarely contended lock is huge. So you
test in user space, you spin enough times to cover SMP contentions then
you leap into the kernel since it's a context switch away while the
other guy held the lock, which at maybe a 30 clock window is unlikely
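
A minimal sketch of that pattern using C11 atomics. The type hybrid_lock,
the function names and SPIN_LIMIT are hypothetical, and sched_yield() is
only a stand-in for a real blocking kernel call, which this sketch does
not implement.

```c
#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 100  /* assumed bound on user-space spinning */

typedef struct {
    atomic_flag held;
} hybrid_lock;

/* Spin in user space for at most `spins` attempts.
   Returns 1 if the lock was acquired, 0 otherwise. */
int spin_try_acquire(hybrid_lock *l, int spins)
{
    for (int i = 0; i < spins; i++) {
        if (!atomic_flag_test_and_set_explicit(&l->held,
                                               memory_order_acquire))
            return 1;
    }
    return 0;
}

/* Acquire the lock: short spin first, then hand the CPU back to the
   kernel and retry. Returns how many times it had to yield
   (0 means the uncontended fast path succeeded). */
int hybrid_lock_acquire(hybrid_lock *l)
{
    int yields = 0;

    while (!spin_try_acquire(l, SPIN_LIMIT)) {
        sched_yield();  /* stand-in for a blocking syscall */
        yields++;
    }
    return yields;
}

void hybrid_lock_release(hybrid_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

In the uncontended case no kernel entry happens at all, which is the case
the argument above is about.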

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18  0:36               ` Alan Cox
@ 2000-12-18  0:48                 ` Stan Sieler
  2000-12-18  0:59                   ` Alan Cox
  2000-12-18  7:10                 ` Philippe Benard
  1 sibling, 1 reply; 29+ messages in thread
From: Stan Sieler @ 2000-12-18  0:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Alan Cox, Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin,
	parisc-linux

Hi Alan,

As Lamont Jones agreed, lightweight system calls exist, and their
cost is on the order of a couple of instructions (typically the equivalent
of a missed branch prediction).  HP-UX has some, Lamont speaks from experience
(as do I).

> > The apparent advantages are *strictly* short term.  A single mistake
> > using a buzz lock from user code in a single process on a single computer
> > can cost more time than all properly implemented buzz locks ever save.
> 
> The cost of a syscall against a rarely contended lock is huge. So you

Thus, it is *NOT* true that a syscall cost must be "huge".

You can say: I don't want to change it, or "we've always done it that way".
but, the one thing you can't accurately say is "this is the right way".

Been there, learned that, tried to share it with you, and am now 
giving up on it.

-- 
Stan (still right :) Sieler

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18  0:48                 ` Stan Sieler
@ 2000-12-18  0:59                   ` Alan Cox
  2000-12-18  4:43                     ` LaMont Jones
  0 siblings, 1 reply; 29+ messages in thread
From: Alan Cox @ 2000-12-18  0:59 UTC (permalink / raw)
  To: Stan Sieler
  Cc: Alan Cox, Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin,
	parisc-linux

> As Lamont Jones agreed, lightweight system calls exist, and their
> cost is on the order of a couple of instructions (typically the equivalent
> of a missed branch prediction).  HP-UX has some, Lamont speaks from experience
> (as do I).

x86 is about 50 clocks to do a syscall so the maths is strongly in favour
of user mode spins.

> Thus, it is *NOT* true that a syscall cost must be "huge".

On 99.99% of the machines people care about it is 8)

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18  0:59                   ` Alan Cox
@ 2000-12-18  4:43                     ` LaMont Jones
  2000-12-18 11:53                       ` Alan Cox
  0 siblings, 1 reply; 29+ messages in thread
From: LaMont Jones @ 2000-12-18  4:43 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stan Sieler, Matthew Wilcox, Jes Sorensen, Alan Modra,
	John Marvin, parisc-linux, lamont

> x86 is about 50 clocks to do a syscall so the maths is strongly in favour
> of user mode spins.

It really comes down to what is the cost to get into kernel mode, as
compared to the cost of dealing with the atomicity problems that come
in when you try to implement locking completely in user space.

In the ideal world, I think that what we want is a libc entry point that
we can use for semaphoring (msem_lock comes to mind...), which the lib
(in arch specific code) either implements in a mixture of user/kernel
space, or (if you can get to kernel mode cheaply), does it in kernel mode
via a lightweight system call.

Note also that spinning in a ldcw is very painful for the bus, and
switching to a ldw-loop followed by ldcw results in starvation in a
greater-than-2-way MP system.

lamont

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18  0:36               ` Alan Cox
  2000-12-18  0:48                 ` Stan Sieler
@ 2000-12-18  7:10                 ` Philippe Benard
  2000-12-18 12:06                   ` Alan Cox
  1 sibling, 1 reply; 29+ messages in thread
From: Philippe Benard @ 2000-12-18  7:10 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stan Sieler, Matthew Wilcox, Jes Sorensen, Alan Modra,
	John Marvin, parisc-linux

> The cost of a syscall against a rarely contended lock is huge. So you
> test in user space, you spin enough times to cover SMP contentions then
> you leap into the kernel since its a context switch away while the
> other guy held the lock, which at maybe a 30 clock window is unlikely
> 

I am not sure I understand this thread very well. I would say that going
to a syscall for a mutex lock (ldcw, testset, spinlock, whatever you call it)
is not a question of cost; it is simply impossible to avoid.

Atomically loading/clearing (or testing/setting) a word can indeed be done
(and should be done) in user space. But once one thread has got the lock (for
a very short period of time, claims the getter), the getter can be pre-empted
at any time while owning the lock word because, as far as I know, interrupts
are not disabled when getting the lock word. The other wanters will then spin
for a very long time, so what appeared to be a fast non-syscall get becomes a
CPU hog, since there is a spinlock running.

So I agree with what you said: the wanters want to spin a little on the lock
word, because the taker may be right sometimes and release it quickly. After
a small number of loops, the wanter goes to sleep on the lock word so that it
gets a chance to be woken up. And who takes care of sleep/wakeup? I think it
is the OS.

Then how do we communicate the lock word to the OS, given that the OS must do
the atomic load/clear (test/set) on the same lock word? I think it is with an
API, so a libcall/syscall or macro/syscall. I think a libcall is better,
since the tiny user space function can be inlined... I think the pthread API
is perfect for this, and this thread started from pthread. People who need
'fast' mutual exclusion without making their program pthreaded can, I think,
implement their own mutex_lock() syscall, i.e. the one to call if you did
not succeed in getting the lock in user land after a little while.

Phi

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18  4:43                     ` LaMont Jones
@ 2000-12-18 11:53                       ` Alan Cox
  2000-12-18 12:27                         ` Philippe Benard
  0 siblings, 1 reply; 29+ messages in thread
From: Alan Cox @ 2000-12-18 11:53 UTC (permalink / raw)
  To: LaMont Jones
  Cc: Alan Cox, Stan Sieler, Matthew Wilcox, Jes Sorensen, Alan Modra,
	John Marvin, parisc-linux, lamont

> we can use for semaphoring (msem_lock comes to mind...), which the lib
> (in arch specific code) either implements in a mixture of user/kernel
> space, or (if you can get to kernel mode cheaply), does it in kernel mode
> via a lightweight system call.
> 
> Note also that spinning in a ldcw is very painful for the bus, and
> switching to a ldw-loop followed by ldcw results in starvation in a
> greater-than-2-way MP system.

That sounds a good reason to do at least most of it in kernel space on hppa

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18  7:10                 ` Philippe Benard
@ 2000-12-18 12:06                   ` Alan Cox
  2000-12-18 14:49                     ` LaMont Jones
  0 siblings, 1 reply; 29+ messages in thread
From: Alan Cox @ 2000-12-18 12:06 UTC (permalink / raw)
  To: Philippe Benard
  Cc: Alan Cox, Stan Sieler, Matthew Wilcox, Jes Sorensen, Alan Modra,
	John Marvin, parisc-linux

> I am not sure I understand this thread very well. I would say that going
> syscall() for a mutex lock (ldcw, testset, spinlock, whatever you name it) is
> not a question of cost it is just simply impossible to avoid. 

It comes down to probability

A syscall on x86 gives you a 50+ clock overhead at all times
A user mode test and short spin has a 1 or 2 clock overhead if uncontended
You spin for a few clocks in case the contention is SMP and if that works
you win

IFF the lock is contended then you just spent 100 clocks instead of 50
doing a short spin then entering the kernel.

With 1% contention that means you spent 99 times doing 2 clocks, 1 time doing
100, which is a win over 100 times doing 50 clocks.
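
The arithmetic above, written out as a small sketch. The function names
are hypothetical; the clock counts are the ones quoted in this message,
not measurements.

```c
/* Total clocks when spinning in user space first: uncontended
   acquisitions pay only the cheap test, contended ones pay the spin
   plus the kernel entry. */
long spin_first_cost(int acquisitions, int contended,
                     int uncontended_clocks, int contended_clocks)
{
    return (long)(acquisitions - contended) * uncontended_clocks
         + (long)contended * contended_clocks;
}

/* Total clocks when every acquisition enters the kernel. */
long always_syscall_cost(int acquisitions, int syscall_clocks)
{
    return (long)acquisitions * syscall_clocks;
}
```

With 100 acquisitions and 1% contention, spin_first_cost(100, 1, 2, 100)
gives 99 * 2 + 100 = 298 clocks, against always_syscall_cost(100, 50) =
5000 clocks. Under these figures the two approaches break even at roughly
49% contention, where 2 + 98p = 50.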

Since on hppa it's apparently 2 clocks to the syscall, the numbers are
apparently different.

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 11:53                       ` Alan Cox
@ 2000-12-18 12:27                         ` Philippe Benard
  2000-12-18 14:40                           ` LaMont Jones
  0 siblings, 1 reply; 29+ messages in thread
From: Philippe Benard @ 2000-12-18 12:27 UTC (permalink / raw)
  To: Alan Cox
  Cc: LaMont Jones, Stan Sieler, Matthew Wilcox, Jes Sorensen,
	Alan Modra, John Marvin, parisc-linux

> >
> > Note also that spinning in a ldcw is very painful for the bus, and
> > switching to a ldw-loop followed by ldcw results in starvation in a
> > greater-than-2-way MP system.
> 
> That sounds a good reason to do at least most of it in kernel space on hppa
> 

I think LaMont Jones meant that the spinlock loop (in user space AND in kernel
space) must implement a load-word loop and issue a load-and-clear word only
when the lock word looks 'free'. This is because the 'write' part of the
load-and-clear word issues a cache broadcast transaction on the bus. So the
loop should be on load-word, followed by load-and-clear-word when the lock
looks free. After a given number of loops you have to ask for kernel
arbitration; the OS then once again tries to get the lock and, if that fails,
goes to sleep.
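
A sketch of that loop in portable C11 rather than PA-RISC assembly. The
names are hypothetical. atomic_exchange with 0 stands in for ldcw, and the
lock word follows the ldcw convention that nonzero means free and zero
means held. The kernel arbitration step is left to the caller.

```c
#include <stdatomic.h>

#define LDCW_FREE 1u  /* ldcw convention: nonzero = free, zero = held */

typedef struct {
    atomic_uint word;
} ldcw_lock;

void ldcw_lock_init(ldcw_lock *l)
{
    atomic_init(&l->word, LDCW_FREE);
}

/* Poll up to max_polls times. Returns 1 if the lock was acquired, or 0
   if the caller should now ask the kernel to arbitrate. */
int ldcw_lock_try(ldcw_lock *l, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        /* Plain load (the "ldw" step): read-only, so it avoids the
           write traffic described above while the lock is held. */
        if (atomic_load_explicit(&l->word, memory_order_relaxed) == 0)
            continue;

        /* Looks free: load-and-clear (the "ldcw" step). A nonzero old
           value means this thread took the lock. */
        if (atomic_exchange_explicit(&l->word, 0u,
                                     memory_order_acquire) != 0)
            return 1;
    }
    return 0;
}

void ldcw_lock_release(ldcw_lock *l)
{
    atomic_store_explicit(&l->word, LDCW_FREE, memory_order_release);
}
```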

Phi

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 12:27                         ` Philippe Benard
@ 2000-12-18 14:40                           ` LaMont Jones
  2000-12-18 19:44                             ` Stan Sieler
  0 siblings, 1 reply; 29+ messages in thread
From: LaMont Jones @ 2000-12-18 14:40 UTC (permalink / raw)
  To: Philippe Benard
  Cc: Alan Cox, LaMont Jones, Stan Sieler, Matthew Wilcox, Jes Sorensen,
	Alan Modra, John Marvin, parisc-linux, lamont

> I think LaMont Jones meant that the spinlock loop (in user space AND in kernel
> space) must implement a load word loop, and issue a load-and-clear word only
> when the lock word looks 'free', this is because the 'write' part of the
> load-and-clear word issue a cache broadcast transaction on the bus, so the
> loop should be on load-word followed by load-clear-word when it looks free,

That was one of the first solutions tried in HP-UX, and it resulted in
processor 4 not getting any time (3 wasn't much better), due to the way
that bus arbitration works (it favors one end of the bus.)

The current semaphore operations in the HP-UX kernel do not use ldcw: they
use stb and ldw in some interesting orders (which break when we get weak
ordering with IA64, but then we'll have a low-cost test-and-set.)

lamont

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 12:06                   ` Alan Cox
@ 2000-12-18 14:49                     ` LaMont Jones
  2000-12-18 15:59                       ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: LaMont Jones @ 2000-12-18 14:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Philippe Benard, Stan Sieler, Matthew Wilcox, Jes Sorensen,
	Alan Modra, John Marvin, parisc-linux, lamont

> Since hppa its apparently 2 clocks to the syscall the numbers are apparently
> different.

Actually, the sequence consists of:

libc_stub:
	bl	gateway_page_addr
	...

gateway_page_addr:
	gate	.+8
	...
	...

And we find ourselves in kernel mode after 2 branches (unconditional
and pre-computed ==> predict correctly) and two delay slots.  At that
point we have kernel data structures at our fingertips, but have in no
way done a complete 'syscall' entry - those are (at least on hp-ux) a
bit more expensive...  (It also means that kernel vs user detection in
traps code needs to look at the priv level, not the stack pointer...)

lamont

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 14:49                     ` LaMont Jones
@ 2000-12-18 15:59                       ` Matthew Wilcox
  0 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2000-12-18 15:59 UTC (permalink / raw)
  To: LaMont Jones
  Cc: Alan Cox, Philippe Benard, Stan Sieler, Matthew Wilcox,
	Jes Sorensen, Alan Modra, John Marvin, parisc-linux

On Mon, Dec 18, 2000 at 07:49:32AM -0700, LaMont Jones wrote:
> > Since hppa its apparently 2 clocks to the syscall the numbers are apparently
> > different.
> 
> Actually, the sequence consists of:
> 
> libc_stub:
> 	bl	gateway_page_addr
> 	...
> 
> gateway_page_addr:
> 	gate	.+8
> 	...
> 	...
> 
> And we find ourselves in kernel mode after 2 branches (unconditional
> and pre-computed ==> predict correctly) and two delay slots.  At that
> point we have kernel data structures at our fingertips, but have in no
> way done a complete 'syscall' entry - those are (at least on hp-ux) a
> bit more expensive...  (It also means that kernel vs user detection in
> traps code needs to look at the priv level, not the stack pointer...)

... ignoring the potential pain of TLB misses caused by losing locality.
But this is merely a quibble, I think.

-- 
Revolutions do not require corporate support.

* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 14:40                           ` LaMont Jones
@ 2000-12-18 19:44                             ` Stan Sieler
  2000-12-18 19:54                               ` Alan Cox
  2000-12-18 22:26                               ` LaMont Jones
  0 siblings, 2 replies; 29+ messages in thread
From: Stan Sieler @ 2000-12-18 19:44 UTC (permalink / raw)
  To: parisc-linux
  Cc: Philippe Benard, Alan Cox, LaMont Jones, Matthew Wilcox,
	Jes Sorensen, Alan Modra, John Marvin, parisc-linux

Re:

LaMont writes:
...
> That was one of the first solutions tried in HP-UX, and it resulted in
> processor 4 not getting any time (3 wasn't much better), due to the way
> that bus arbitration works (it favors one end of the bus.)
> 
> The current semaphore operations in the HP-UX kernel do not use ldcw: they
> use stb and ldw in some interesting orders (which break when we get weak
> ordering with IA64, but then we'll have a low-cost test-and-set.)

Although I said I'd stay out...

Alan...this is *important*...re-read what's clear between the lines above:

   The user is the *WRONG* person to implement locks.
   (this includes user libraries)

Why?

   - they make mistakes

   - they don't know as much as they need to know

   - their code runs on slightly different hardware (e.g., different
     models of PA-RISC with slightly different characteristics).

   - the cost of multiple copies of code (some copies by one user
     programmer, some by another) 
     ... many of which are "wrong" ... can be extreme.

Simply put:

   Locking is *important*:

      - it must be done correctly (e.g., for single-owner locks, only one
        thread must think it owns it at a time; and the owner shouldn't
        be starved of CPU time; and a requestor shouldn't run away with
        CPU resources)

      - it must be efficient.

Note that efficiency *IS ALWAYS LESS IMPORTANT THAN CORRECTNESS*.
That's 100%, totally vital!  To say "important" is to make a severe
understatement.

Well then, where can we put locking such that it's more likely to be
correct?  The kernel.  You can (and have to) rely on the kernel more than on
user code.  The kernel gets patched/fixed/updated regularly.  The kernel
is a *single point* of implementation, as opposed to hundreds of separate
points of implementation.

Why not rely on libraries?  Because code in libraries is potentially
staler than the kernel, and you have potentially many different variations.
Can you interrogate and ask what version of msem_lock() you're calling?  
Can you find out what version of msem_lock an archive-linked application
you downloaded from a web site is running?
No...but you *can* ask what version of Linux (or whatever) you're running!

Alan...this is the voice of experience again...shouting louder! :)

An operating system should provide a user-callable locking mechanism that:

   - provides a single-owner lock;

   - provides an optional multi-owner lock (e.g., multiple processes 
     can lock for shared read access, or it can be locked by a single 
     owner for "write" access);

   - provides an optional (short-term) priority boost if a high priority
     process wants to obtain a lock owned by a low priority process

   - identifies what locks are currently held by what processes (and
     for how long)

   - is 100% reliable and, if possible, highly efficient

   - allows the programmer to give a hint to the OS about the length
     of time they'll have the lock locked

   - allows a root process to unlock a lock owned by a hung/dead process
     (with stated semantics...e.g., does the first waiter get the lock,
     or receive an error (i.e., ERR_PRIOR_OWNER_DIED))

   - allows the programmer to specify what happens to a locked lock
     owned by a process that then dies. 

   - optionally detects deadlocks, and/or prevents deadlock attempts.

Although I can't find the man pages for Linux msem_lock, I know that the
HP-UX msem_lock doesn't meet all of these criteria (nor does MPE/iX, although
it comes a lot closer).

-- 
Stan Sieler                                           sieler@allegro.com
www.allegro.com/sieler/wanted/index.html                  www.sieler.com        


* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 19:44                             ` Stan Sieler
@ 2000-12-18 19:54                               ` Alan Cox
  2000-12-18 20:15                                 ` Stan Sieler
  2000-12-18 20:44                                 ` LaMont Jones
  2000-12-18 22:26                               ` LaMont Jones
  1 sibling, 2 replies; 29+ messages in thread
From: Alan Cox @ 2000-12-18 19:54 UTC (permalink / raw)
  To: Stan Sieler
  Cc: parisc-linux, Philippe Benard, Alan Cox, LaMont Jones,
	Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin

> Note that efficiency *IS ALWAYS LESS IMPORTANT THAN CORRECTNESS*.
> That's 100%, totally vital!  To say "important" is to make a severe
> understatement.

Tell that to the folks I work with at times for whom user space lock testing
shaves 4 weeks off a run. Try the difference in Mozilla.

In both cases I'm forced to disagree - at least for x86.

> Can you interrogate and ask what version of msem_lock() you're calling?  

Yes. ELF has versioned symbols if they have changed. You can use those for
many things. X86 however has a stable instruction set abi for locking.

> Although I can't find the man pages for Linux msem_lock, I know that the
> HP-UX msem_lock doesn't meet all of these criteria (nor does MPE/iX, although
> it comes a lot closer).

We use user space locks for stuff like pthreads on most platforms with
the kernel doing the contention cases. I'm not arguing that it wouldn't be nice
to let the kernel do it all if we had cheap syscalls. 

Alan


* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 19:54                               ` Alan Cox
@ 2000-12-18 20:15                                 ` Stan Sieler
  2000-12-18 20:44                                 ` LaMont Jones
  1 sibling, 0 replies; 29+ messages in thread
From: Stan Sieler @ 2000-12-18 20:15 UTC (permalink / raw)
  To: parisc-linux

Re:

> 
> > Note that efficiency *IS ALWAYS LESS IMPORTANT THAN CORRECTNESS*.
> > That's 100%, totally vital!  To say "important" is to make a severe
> > understatement.
> 
> Tell that to the folks I work with at times for whom user space lock testing
> shaves 4 weeks off a run. Try the difference in Mozilla.

To say I'm shocked is an understatement.

Alan, if you think efficiency is more important than correctness, then
please let us know what programs you've worked on,
so we can all avoid them! :)

However, I don't really believe you meant that!

Note that I never said efficiency isn't important.  But, it's like
a car with no brakes...that car can go faster in a straight line, but
it isn't correct ... and I sure as hell am not going to deploy such a car
for my commute!

> In both cases I'm forced to disagree - at least for x86.

So...a fast x86 lock is more important than one that's correct?
I'm glad I'm not doing mission critical work on an x86! 

> > Can you interrogate and ask what version of msem_lock() you're calling?  
> 
> Yes. ELF has versioned symbols if they have changed. You can use those for
> many things. X86 however has a stable instruction set abi for locking.

So, the man page for msem_lock for ELF documents various versions?  Great!

Wait...no, it doesn't.

Scanning through a few ELF x-86 libraries fails to show any per-procedure
version information (although you can generally guess an overall version
of the library package).

With HP-UX, or MPE/iX, I can make a system call to inquire about the
OS version ... that, coupled with information available to the programmer
at writing time, allows a program to make decisions like "I won't try
to do X on this release, because I know that X didn't work correctly
until the next release".

Alan...I write products; some of them run on every release of MPE/iX
that's ever come out.  That requires attention to detail, but the 
payoff is that you don't have to tell me what version you're running when
you get one of my programs.  I'd *like* to be able to do that on HP-UX
and Linux, but it's an awful lot harder. 

> We use user space locks for stuff like pthreads on most platforms with
> the kernel doing the contention cases. I'm not arguing that it wouldn't be nice
> to let the kernel do it all if we had cheap syscalls. 

Ok...the "I'm not arguing" sure wasn't clear before!

And...you still don't get it...it's not merely "nice", it's clearly better.
...just like the idea of adding brakes to that car.

(BTW, I'm trying to strip the cc list, so the interested parties don't get
tons of copies :)

-- 
Stan Sieler                                           sieler@allegro.com
www.allegro.com/sieler/wanted/index.html                  www.sieler.com        


* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 19:54                               ` Alan Cox
  2000-12-18 20:15                                 ` Stan Sieler
@ 2000-12-18 20:44                                 ` LaMont Jones
  2000-12-18 21:56                                   ` Alan Cox
  1 sibling, 1 reply; 29+ messages in thread
From: LaMont Jones @ 2000-12-18 20:44 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stan Sieler, parisc-linux, Philippe Benard, LaMont Jones,
	Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin, lamont

> > Note that efficiency *IS ALWAYS LESS IMPORTANT THAN CORRECTNESS*.
> > That's 100%, totally vital!  To say "important" is to make a severe
> > understatement.
> Tell that to the folks I work with at times for whom user space lock testing
> shaves 4 weeks off a run. Try the difference in Mozilla.
> In both cases I'm forced to disagree - at least for x86.

Correct appears to be a relative term.  An implementation of semaphores is
correct if it provides mutual exclusion.  Performance becomes a secondary
consideration to that, of course.

I have to side with Alan on this point.  x86 costs too much to go into the
kernel for something that you'll normally not have contention on.  The only
challenge there is that the owner of the resource may hold it for a _long_
time if he gets swapped out, since the kernel has no knowledge.  Note,
however, that it is still correct: eventually, all of the waiters will go
to sleep, and the owner will finally get time to do his thing and free the
sema.

With PA, there's a chance to get into kernel mode cheaply, and it probably
makes sense to do an arch-specific msem_lock() that avoids the heavy
contention issues.

> > Can you interrogate and ask what version of msem_lock() you're calling?  
> Yes. ELF has versioned symbols if they have changed. You can use those for
> many things. X86 however has a stable instruction set abi for locking.

parisc's set is usable as well, but there's a difference in how you code
test-and-set vs test-and-clear.  This makes it a royal pain to port code
that expects test-and-set to parisc...  (And test and inc/dec makes it
even easier...)

> > Although I can't find the man pages for Linux msem_lock, I know that the
> > HP-UX msem_lock doesn't meet all of these criteria (nor does MPE/iX, although
> > it comes a lot closer).
> We use user space locks for stuff like pthreads on most platforms with
> the kernel doing the contention cases. I'm not arguing that it wouldn't be nice
> to let the kernel do it all if we had cheap syscalls. 

Sounds perfectly reasonable.  How hard is it to put arch-specific things
into that path?

lamont


* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 20:44                                 ` LaMont Jones
@ 2000-12-18 21:56                                   ` Alan Cox
  0 siblings, 0 replies; 29+ messages in thread
From: Alan Cox @ 2000-12-18 21:56 UTC (permalink / raw)
  To: LaMont Jones
  Cc: Alan Cox, Stan Sieler, parisc-linux, Philippe Benard,
	LaMont Jones, Matthew Wilcox, Jes Sorensen, Alan Modra,
	John Marvin

> > We use user space locks for stuff like pthreads on most platforms with
> > the kernel doing the contention cases. I'm not arguing that it wouldn't be nice
> > to let the kernel do it all if we had cheap syscalls. 
> 
> Sounds perfectly reasonable.  How hard is it to put arch-specific things
> into that path?

It's all arch specific, so you can do it right for hppa. The code in db3 is
also arch specific (falling back to kernel) as is the Mozilla code.


* Re: [parisc-linux] ldcw in __pthread_acquire
  2000-12-18 19:44                             ` Stan Sieler
  2000-12-18 19:54                               ` Alan Cox
@ 2000-12-18 22:26                               ` LaMont Jones
  1 sibling, 0 replies; 29+ messages in thread
From: LaMont Jones @ 2000-12-18 22:26 UTC (permalink / raw)
  To: Stan Sieler
  Cc: parisc-linux, Philippe Benard, Alan Cox, LaMont Jones,
	Matthew Wilcox, Jes Sorensen, Alan Modra, John Marvin, lamont

> Why?
>    - they make mistakes
So do kernel engineers.

>    - they don't know as much as they need to know
>    - their code runs on slightly different hardware (e.g., different
>      models of PA-RISC with slightly different characteristics).
These are the same point, and it ain't necessarily so.

>    - the cost of multiple copies of code (some copies by one user
>      programmer, some by another) 
>      ... many of which are "wrong" ... can be extreme.
This isn't a right vs wrong, just a code bloat issue...

>       - it must be done correctly (e.g., for single-owner locks, only one
>         thread must think it owns it at a time;
It doesn't matter if a thread thinks it owns the lock if it can't access
the resource.  (Yes, you can do locking that way, I can think of at least
two places in HP-UX where that is the case:  one in kernel mode, and one
in user mode with a kernel assist...)  Both of those are based on the fact
that it's more efficient to run like hell and then pick yourself up when
you trip than it is to lock before using.  Both cases were driven by the
simple fact that efficiency was the difference between having a product
and having a piece of junk.

>	  and the owner shouldn't be starved of CPU time;
Doesn't matter until someone else wants the resource...  Given a finite
amount of CPU resource, and a given number of locks, someone is going to
get starved sometime.

>	  and a requestor shouldn't run away with CPU resources)
Shouldn't hold the resource _longer_than_necessary_.  But now you're
talking performance.

>       - it must be efficient.
It _should_ be efficient.

> Note that efficiency *IS ALWAYS LESS IMPORTANT THAN CORRECTNESS*.
> That's 100%, totally vital!  To say "important" is to make a severe
> understatement.

See above.  The correct technical solution is not always the correct
business decision.  (And man, does it hurt parts of me to say that.)
If the efficient solution allows a bit of starvation in a corner case,
then it may be best to just document the corner case and live with it,
based on how much better the normal case is.

> Well then, where can we put locking such that it's more likely to be
> correct?  The kernel.  You can (and have to) rely on the kernel more than on
> user code.  The kernel gets patched/fixed/updated regularly.  The kernel
> is a *single point* of implementation, as opposed to hundreds of separate
> points of implementation.

A single shared-only library pretty much constitutes a single point of
implementation as well.

> Why not rely on libraries?  Because code in libraries is potentially
> staler than the kernel, and you have potentially many different variations.
> Can you interrogate and ask what version of msem_lock() you're calling?  
> Can you find out what version of msem_lock an archive-linked application
> you downloaded from a web site is running?
> No...but you *can* ask what version of Linux (or whatever) you're running!

You can also provide the locking code in a shared-only library.  Depending
on what is being locked, you may not have to worry about all of the above:
if the entire set of binaries that will be locking the shared resource
arrive as a set, then you just make sure that you deliver the set.  If you
have, say, a database that is accessed by everyone and their mother, then
you may have a different situation on your hands.

Spinning in user space before going to the kernel wastes every cycle spent
spinning, but only in the cases where you end up in the kernel anyway.  Faced
with a 50-state kernel-mode cost, I would be strongly inclined (for a
performance sensitive app) to go with a user space spin, with kernel
assisted blocking.  If I were concerned about the starvation potential,
I would consider some minor adjustments to the blocking code in the
kernel to promote the owner of the lock, in order to reduce the starvation
issues.

There are situations where performance is _EVERYTHING_.  In those cases,
you pay a higher support price, and just do what has to be done.

> Alan...this is the voice of experience again...shouting louder! :)
> An operating system should provide a user-callable locking mechanism that:

>    - allows the programmer to give a hint to the OS about the length
>      of time they'll have the lock locked

Not really needed, but certainly on the wishlist.

>    - allows a root process to unlock a lock owned by a hung/dead process
>      (with stated semantics...e.g., does the first waiter get the lock,
>      or receive an error (i.e., ERR_PRIOR_OWNER_DIED))

If the lock is not in kernel memory, then I don't have to have someone
unlock the sema; that becomes an app issue.

>    - optionally detects deadlocks, and/or prevents deadlock attempts.

If sleeping is done at interruptible priorities, then this is an app
problem, although it sure is nice when the locking API takes care of
deadlock detection - it lets you be sloppy in defining your locking
strategy and get away with it.

lamont



Thread overview: 29+ messages
2000-12-15 10:12 [parisc-linux] ldcw in __pthread_acquire John Marvin
2000-12-15 11:37 ` Alan Modra
2000-12-15 16:37   ` Matthew Wilcox
2000-12-15 17:32     ` Jes Sorensen
2000-12-16 19:29       ` Matthew Wilcox
2000-12-16 21:58         ` Jes Sorensen
2000-12-17  4:31           ` Alan Modra
2000-12-17  1:22         ` Stan Sieler
2000-12-17  2:38           ` Alan Cox
2000-12-17  4:18             ` LaMont Jones
2000-12-18  0:29             ` Stan Sieler
2000-12-18  0:36               ` Alan Cox
2000-12-18  0:48                 ` Stan Sieler
2000-12-18  0:59                   ` Alan Cox
2000-12-18  4:43                     ` LaMont Jones
2000-12-18 11:53                       ` Alan Cox
2000-12-18 12:27                         ` Philippe Benard
2000-12-18 14:40                           ` LaMont Jones
2000-12-18 19:44                             ` Stan Sieler
2000-12-18 19:54                               ` Alan Cox
2000-12-18 20:15                                 ` Stan Sieler
2000-12-18 20:44                                 ` LaMont Jones
2000-12-18 21:56                                   ` Alan Cox
2000-12-18 22:26                               ` LaMont Jones
2000-12-18  7:10                 ` Philippe Benard
2000-12-18 12:06                   ` Alan Cox
2000-12-18 14:49                     ` LaMont Jones
2000-12-18 15:59                       ` Matthew Wilcox
  -- strict thread matches above, loose matches on Subject: below --
2000-12-15 10:26 John Marvin
