* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
@ 2004-11-22 16:51 ` Andreas Schwab
2004-11-22 16:54 ` Andi Kleen
2004-11-22 17:23 ` Philip J. Mucci
` (3 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Andreas Schwab @ 2004-11-22 16:51 UTC (permalink / raw)
To: Andi Kleen
Cc: Ray Bryant, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Andi Kleen <ak@suse.de> writes:
> At least in traditional signal semantics you have to call sigaction
> or signal in each signal handler to reset the signal. So that
> assumption is not necessarily true.
If you use sigaction then you get POSIX semantics, which don't have this
problem.
Andreas.
--
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:51 ` Andreas Schwab
@ 2004-11-22 16:54 ` Andi Kleen
2004-11-22 18:56 ` Ray Bryant
2004-11-22 19:22 ` Ray Bryant
0 siblings, 2 replies; 19+ messages in thread
From: Andi Kleen @ 2004-11-22 16:54 UTC (permalink / raw)
To: Andreas Schwab
Cc: Andi Kleen, Ray Bryant, Kernel Mailing List,
linux-ia64@vger.kernel.org, lse-tech, holt, Dean Roe,
Brian Sumner, John Hawkes
On Mon, Nov 22, 2004 at 05:51:59PM +0100, Andreas Schwab wrote:
> Andi Kleen <ak@suse.de> writes:
>
> > At least in traditional signal semantics you have to call sigaction
> > or signal in each signal handler to reset the signal. So that
> > assumption is not necessarily true.
>
> If you use sigaction then you get POSIX semantics, which don't have this
> problem.
It's just a common case where Ray's assumption is not true.
-Andi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:54 ` Andi Kleen
@ 2004-11-22 18:56 ` Ray Bryant
2004-11-22 19:22 ` Ray Bryant
1 sibling, 0 replies; 19+ messages in thread
From: Ray Bryant @ 2004-11-22 18:56 UTC (permalink / raw)
To: Andi Kleen
Cc: Andreas Schwab, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Andi Kleen wrote:
> On Mon, Nov 22, 2004 at 05:51:59PM +0100, Andreas Schwab wrote:
>
>>Andi Kleen <ak@suse.de> writes:
>>
>>
>>>At least in traditional signal semantics you have to call sigaction
>>>or signal in each signal handler to reset the signal. So that
>>>assumption is not necessarily true.
>>
>>If you use sigaction then you get POSIX semantics, which don't have this
>>problem.
>
>
> It's just a common case where Ray's assumption is not true.
>
> -Andi
>
True enough. And in that case the design I was describing wouldn't
make sigaction() that much more expensive, since if you are not in the POSIX
thread environment (more precisely, the thread was not created with
CLONE_SIGHAND) each thread has its own sighand structure and the "global"
locking mechanism I had proposed would only require the taking of one
additional lock.
However, special-casing ITIMER_PROF is also a reasonable approach.
The performance monitor code can also deliver signals to user space when
a sampling buffer overflows, and this can have the same kind of scaling
problem as ITIMER_PROF. I'll have to do a little research to figure out
how exactly that works, but that signal (SIGIO?) would also be a candidate
for special casing on our platform.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:54 ` Andi Kleen
2004-11-22 18:56 ` Ray Bryant
@ 2004-11-22 19:22 ` Ray Bryant
1 sibling, 0 replies; 19+ messages in thread
From: Ray Bryant @ 2004-11-22 19:22 UTC (permalink / raw)
To: Andi Kleen
Cc: Andreas Schwab, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
OK, apparently SIGPROF is delivered in both the ITIMER_PROF and
PMU interrupt cases, so if we special-case that signal we should
be fine.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
2004-11-22 16:51 ` Andreas Schwab
@ 2004-11-22 17:23 ` Philip J. Mucci
2004-11-22 21:26 ` Boehm, Hans
` (2 subsequent siblings)
4 siblings, 0 replies; 19+ messages in thread
From: Philip J. Mucci @ 2004-11-22 17:23 UTC (permalink / raw)
To: Andi Kleen
Cc: Ray Bryant, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Hi Andi,
Allow me to say that this particular case is very special, because
ITIMER_PROF is used in many performance tools. Inside of PAPI for
example, we use ITIMER_PROF for both counter multiplexing and for
statistical profiling. Tools such as HPCToolkit, PerfSuite and others
often enable ITIMER_PROF at the highest available resolution. So while
there may be hundreds or thousands of ways to do this, this particular
avenue has many useful tools out there that can easily trigger this
case.
However, I think your suggestion is an excellent one regarding a fast
path for ITIMER_PROF.
FWIW, Solaris has ITIMER_REALPROF, which, when enabled, sends SIGPROF
to all LWPs in the process on each tick. In this way,
each thread doesn't have to call setitimer()
by itself to get a signal... yeah yeah, I know what POSIX says about
signal delivery to any LWP, but on every Linux I've tested, the
ITIMER_PROF signal goes only to the thread that registered the timer.
Granted, I haven't run the test on an Altix.
Regards,
Philip Mucci
> I suspect there are hundreds or thousands of ways on such a big system to
> exploit some lock to make the system unresponsive. If you wanted
> to fix them all you would be in a large-scale redesign effort.
> It's not clear why this particular case is special.
> > Since signals are sent much more often than sigaction() is called, it would
> > seem to make more sense to make sigaction() take a heavier weight lock of
>
> At least in traditional signal semantics you have to call sigaction
> or signal in each signal handler to reset the signal. So that
> assumption is not necessarily true.
>
> > It seems to me that scalability would be improved if we moved the siglock
> > from
> > the sighand structure to the task_struct. (keep reading, please...) Code
> > that manipulates the current task signal data only would just obtain that
> > lock. Code that needs to change the sighand structure (e. g. sigaction())
> > would obtain all of the siglock's of all tasks using the same sighand
> > structure. A list of those task_struct's would be added to the sighand
> > structure to enable finding these structurs without having to take the
> > task_list_lock and search for them.
>
> Taking all these locks without risking deadlock would be tricky.
> You could just use a ring, but would need to point to a common
> anchor and always start from there to make sure all lock grabbers
> acquire the locks in the same order.
>
> > Anyway, we would be interested in the community's ideas about dealing with
> > this signal delivery scalability issue, and, comments on the solution above
> > or suggestions for alternative solutions are welcome.
>
> How about you figure out a fast path of some signals that can work
> without locking: e.g. no load balancing needed, no queued signal, etc.
> and then just do the delivery of SIGPROF lockless? Or just ignore it
> since the original premise doesn't seem too useful.
>
> -Andi
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
2004-11-22 16:51 ` Andreas Schwab
2004-11-22 17:23 ` Philip J. Mucci
@ 2004-11-22 21:26 ` Boehm, Hans
2004-11-22 21:34 ` Andi Kleen
2004-11-22 21:27 ` Rick Lindsley
2004-11-22 23:01 ` Boehm, Hans
4 siblings, 1 reply; 19+ messages in thread
From: Boehm, Hans @ 2004-11-22 21:26 UTC (permalink / raw)
To: Ray Bryant, Andi Kleen
Cc: Andreas Schwab, Kernel Mailing List, linux-ia64, lse-tech, holt,
Dean Roe, Brian Sumner, John Hawkes
Although I don't fully understand all the issues here,
I'm concerned about this proposal. In particular, our
garbage collector (used by gcj, and Mono, among others)
uses signals to stop threads for each garbage collection.
With a small heap, and many threads, I would expect the
frequency of signal delivery to be similar to what you
get with performance tools. But it does not, and should not,
use SIGPROF.
I think this is a more general issue. Special casing one
piece of it is only going to make performance more surprising,
something I think should be avoided if at all possible.
Hans
> -----Original Message-----
> From: linux-ia64-owner@vger.kernel.org
> [mailto:linux-ia64-owner@vger.kernel.org]On Behalf Of Ray Bryant
> Sent: Monday, November 22, 2004 11:23 AM
> To: Andi Kleen
> Cc: Andreas Schwab; Kernel Mailing List; linux-ia64@vger.kernel.org;
> lse-tech; holt@sgi.com; Dean Roe; Brian Sumner; John Hawkes
> Subject: Re: [Lse-tech] scalability of signal delivery for
> Posix Threads
>
>
> OK, apparently SIGPROF is delivered in both the ITIMER_PROF and
> PMU interrupt cases, so if we special-case that signal we should
> be fine.
> --
> Best Regards,
> Ray
> -----------------------------------------------
> Ray Bryant
> 512-453-9679 (work) 512-507-7807 (cell)
> raybry@sgi.com raybry@austin.rr.com
> The box said: "Requires Windows 98 or better",
> so I installed Linux.
> -----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 21:26 ` Boehm, Hans
@ 2004-11-22 21:34 ` Andi Kleen
2004-12-01 22:53 ` Brent Casavant
0 siblings, 1 reply; 19+ messages in thread
From: Andi Kleen @ 2004-11-22 21:34 UTC (permalink / raw)
To: Boehm, Hans
Cc: Ray Bryant, Andi Kleen, Andreas Schwab, Kernel Mailing List,
linux-ia64, lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
> I think this is a more general issue. Special casing one
It just cannot be done in the general case without slowing
down sigaction significantly. Or maybe it can, but nobody
has proposed a way to do it so far.
It's difficult to design for machines where a simple spinlock
doesn't work properly anymore.
> piece of it is only going to make performance more surprising,
> something I think should be avoided if at all possible.
The special case in particular would be signals directed to a specific TID,
as opposed to signals load-balanced over the thread group, which needs
shared writable state. To simplify the fast path you could also make
further simplifications: no queueing (otherwise you would need to duplicate
a lot of state into the task_struct to handle that) and probably
no SIGCHLD, which is also full of special cases.
-Andi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 21:34 ` Andi Kleen
@ 2004-12-01 22:53 ` Brent Casavant
0 siblings, 0 replies; 19+ messages in thread
From: Brent Casavant @ 2004-12-01 22:53 UTC (permalink / raw)
To: Andi Kleen
Cc: Boehm, Hans, Ray Bryant, Andreas Schwab, Kernel Mailing List,
linux-ia64, lse-tech, holt, Dean Roe, John Hawkes
On Mon, 22 Nov 2004, Andi Kleen wrote:
> > I think this is a more general issue. Special casing one
>
> It just cannot be done in the general case without slowing
> down sigaction significantly. Or maybe it can, but nobody
> has proposed a way to do it so far.
Sorry for the late reply, but I just inherited some of this work
from Ray and am catching up.
At a high level the seqlock seemed like the right idea, though
neither it nor seqcount is appropriate since in the case of signal
processing we can't tolerate consuming stale information and redoing
the operation.
But it got me thinking in a good direction. We could add a per-task
shadow copy of the per-process sighand_struct. Added to sighand_struct
would be a generation number. Whenever we perform an operation that
currently consumes data in the sighand_struct, we would first check the
shadow copy generation number against the per-process generation number.
If there is a mismatch, the per-process siglock is taken and the shadow
copy is updated, then the siglock is dropped. Whether or not this update
was necessary, we complete the signal processing using only the shadow copy.
Whenever the per-process sighand_struct needs updating, the structure
would be updated as normal, and as a last operation before unlocking
the generation number would be bumped.
This lazy update method would not suffer a significant slowdown during
a sigaction(2) call. The only potentially significant penalty occurs at
the time of signal delivery when a signal disposition/handler has changed.
Even this would be limited to a memcpy() of sighand_struct->action,
which is not significant unless the disposition/handler is changing rapidly.
Does this seem like a solution that would be worth pursuing? I see some
potential pitfalls in that siglock protects more than sighand itself, and
that IRQs would not be disabled except during the shadowing operation.
There would be a race where the generation numbers match, so we begin
using the shadowed data, but simultaneously another task updates the
per-process sighand_struct. This causes no direct ill effect as the
shadowed data is coherent, however I'm not sure whether an application
could possibly be sensitive to this race. It seems that any such
application already suffers from a race as to which task obtains the
siglock first, but we are at least guaranteed that if signal delivery
begins, it is complete through signal_wake_up() before the racing
sigaction(2) begins. I suspect there's nothing to worry about here, but
I haven't convinced myself of this quite yet.
I see that signal_wake_up() currently requires interrupts be disabled
on its behalf by holding siglock. Under this new scheme it may be
necessary to lock interrupts without taking siglock itself, unless a
way can be found to make signal_wake_up() interrupt-safe.
Anyway, all that to once again ask if this seems like a beneficial
or feasible method to pursue? Any glaring holes? Any opinion as
to whether we should track the generation for the sighand_struct as a
whole, or for each individual element of sighand_struct->action
(seems like overkill to me, but it was casually suggested in hallway
chatter)?
Thanks,
Brent Casavant
--
Brent Casavant If you had nothing to fear,
bcasavan@sgi.com how then could you be brave?
Silicon Graphics, Inc. -- Queen Dama, Source Wars
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
` (2 preceding siblings ...)
2004-11-22 21:26 ` Boehm, Hans
@ 2004-11-22 21:27 ` Rick Lindsley
2004-11-22 23:39 ` Ray Bryant
2004-11-22 23:01 ` Boehm, Hans
4 siblings, 1 reply; 19+ messages in thread
From: Rick Lindsley @ 2004-11-22 21:27 UTC (permalink / raw)
To: Ray Bryant
Cc: Kernel Mailing List, linux-ia64@vger.kernel.org, lse-tech, holt,
Dean Roe, Brian Sumner, John Hawkes
So with CLONE_SIGHAND, we share the handler assignments and which signals
are blocked, but retain the ability for individual threads to receive
a signal. And when all of them receive signals in quick succession,
we see lock contention because they're sharing the same (effectively)
global lock to receive all of their (effectively) individual signals
... is that correct?
Are you contending on tasklist_lock, or on siglock?
It seems to me that scalability would be improved if we moved the
siglock from the sighand structure to the task_struct.
Only if you want to keep its current semantics of it being a lock for
all things signal. Finer granularity would, it seems at first look,
afford you the benefits you're looking for. (But not without the cost of
a fair amount of work to make sure the new locks are utilized correctly.)
For the problem you're describing, it sounds like the contention is occurring
at delivery, so a new lock for pending, blocked, and real_blocked might be
in order.
Rick
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 21:27 ` Rick Lindsley
@ 2004-11-22 23:39 ` Ray Bryant
0 siblings, 0 replies; 19+ messages in thread
From: Ray Bryant @ 2004-11-22 23:39 UTC (permalink / raw)
To: Rick Lindsley
Cc: Kernel Mailing List, linux-ia64@vger.kernel.org, lse-tech, holt,
Dean Roe, Brian Sumner, John Hawkes
Rick Lindsley wrote:
> So with CLONE_SIGHAND, we share the handler assignments and which signals
> are blocked, but retain the ability for individual threads to receive
> a signal. And when all of them receive signals in quick succession,
> we see lock contention because they're sharing the same (effectively)
> global lock to receive all of their (effectively) individual signals
> .. is that correct?
>
Yes, I think that's what's happening, except that I think the blocked
signal list is per thread as well. The shared sighand structure just
has the saved arguments from sigaction, as I remember. (It's confusing:
the set of signals blocked during execution of the signal handler is
part of the sigaction structure and hence is global to the entire
thread group, whilst the set of signals blocked in general is per thread.)
> Are you contending on tasklist_lock, or on siglock?
Definitely siglock. All of the profiling ticks occur at
spin_unlock_irqrestore(&p->sighand->siglock) in the routines I
mentioned before. [We don't have NMI profiling on Altix,
so profiling typically can't look inside
code sections that run with interrupts disabled.]
>
> It seems to me that scalability would be improved if we moved the
> siglock from the sighand structure to the task_struct.
>
> Only if you want to keep its current semantics of it being a lock for
> all things signal. Finer granularity would, it seems at first look,
> afford you the benefits you're looking for. (But not without the cost of
> a fair amount of work to make sure the new locks are utilized correctly.)
> For the problem you're describing, it sounds like the contention is occuring
> at delivery, so a new lock for pending, blocked, and real_blocked might be
> in order.
>
> Rick
>
Yes, I was hoping to keep the current semantics of siglock as the lock for
all things signal, just make it local per thread, and require that all of the
siglocks be held to change the sighand structure. That seemed like a change I
could manage. My personal notion was that the slowdown of sigaction()
processing for multi-threaded POSIX programs was not that big of a deal because
it doesn't happen very often, and for non-CLONE_SIGHAND threads the additional
cost would be minor. But if the slowdown in the CLONE_SIGHAND case is not
acceptable then I'm stuck as to how to do this.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
` (3 preceding siblings ...)
2004-11-22 21:27 ` Rick Lindsley
@ 2004-11-22 23:01 ` Boehm, Hans
4 siblings, 0 replies; 19+ messages in thread
From: Boehm, Hans @ 2004-11-22 23:01 UTC (permalink / raw)
To: Andi Kleen
Cc: Ray Bryant, Andreas Schwab, Kernel Mailing List, linux-ia64,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Just to clarify:
I have no problem with special-casing signals sent to a specific
thread. Our garbage collector uses pthread_kill, and thus should
also benefit from that change. And it makes sense to me that this
kind of signal should be cheaper to deliver.
SIGSEGV delivery also matters to me. But that should presumably
also fall into the same class.
I would prefer to avoid special handling for just SIGPROF.
If that was never proposed, please ignore my comments.
Hans
> -----Original Message-----
> From: Andi Kleen [mailto:ak@suse.de]
> Sent: Monday, November 22, 2004 1:35 PM
> To: Boehm, Hans
> Cc: Ray Bryant; Andi Kleen; Andreas Schwab; Kernel Mailing List;
> linux-ia64@vger.kernel.org; lse-tech; holt@sgi.com; Dean Roe; Brian
> Sumner; John Hawkes
> Subject: Re: [Lse-tech] scalability of signal delivery for
> Posix Threads
>
>
> > I think this is a more general issue. Special casing one
>
> It just cannot be done in the general case without slowing
> down sigaction significantly. Or maybe it can, but nobody
> has proposed a way to do it so far.
>
> It's difficult to design for machines where a simple spinlock
> doesn't work properly anymore.
>
> > piece of it is only going to make performance more surprising,
> > something I think should be avoided if at all possible.
>
> The special case in particular would be signals directed to a
> specific TID;
> compared to signals load balanced over the thread group which needs
> shared writable state. To simplify the fast path you could also make
> further simplifications: no queueing (otherwise you would need to duplicate
> a lot of state into the task_struct to handle that) and probably
> no SIGCHLD, which is also full of special cases.
>
> -Andi
>
^ permalink raw reply [flat|nested] 19+ messages in thread