* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
@ 2004-11-22 16:51 ` Andreas Schwab
2004-11-22 16:54 ` Andi Kleen
2004-11-22 17:23 ` Philip J. Mucci
` (3 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Andreas Schwab @ 2004-11-22 16:51 UTC (permalink / raw)
To: Andi Kleen
Cc: Ray Bryant, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Andi Kleen <ak@suse.de> writes:
> At least in traditional signal semantics you have to call sigaction
> or signal in each signal handler to reset the signal. So that
> assumption is not necessarily true.
If you use sigaction then you get POSIX semantics, which don't have this
problem.
Andreas.
--
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:51 ` Andreas Schwab
@ 2004-11-22 16:54 ` Andi Kleen
2004-11-22 18:56 ` Ray Bryant
2004-11-22 19:22 ` Ray Bryant
0 siblings, 2 replies; 19+ messages in thread
From: Andi Kleen @ 2004-11-22 16:54 UTC (permalink / raw)
To: Andreas Schwab
Cc: Andi Kleen, Ray Bryant, Kernel Mailing List,
linux-ia64@vger.kernel.org, lse-tech, holt, Dean Roe,
Brian Sumner, John Hawkes
On Mon, Nov 22, 2004 at 05:51:59PM +0100, Andreas Schwab wrote:
> Andi Kleen <ak@suse.de> writes:
>
> > At least in traditional signal semantics you have to call sigaction
> > or signal in each signal handler to reset the signal. So that
> > assumption is not necessarily true.
>
> If you use sigaction then you get POSIX semantics, which don't have this
> problem.
It's just a common case where Ray's assumption is not true.
-Andi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:54 ` Andi Kleen
@ 2004-11-22 18:56 ` Ray Bryant
2004-11-22 19:22 ` Ray Bryant
1 sibling, 0 replies; 19+ messages in thread
From: Ray Bryant @ 2004-11-22 18:56 UTC (permalink / raw)
To: Andi Kleen
Cc: Andreas Schwab, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Andi Kleen wrote:
> On Mon, Nov 22, 2004 at 05:51:59PM +0100, Andreas Schwab wrote:
>
>>Andi Kleen <ak@suse.de> writes:
>>
>>
>>>At least in traditional signal semantics you have to call sigaction
>>>or signal in each signal handler to reset the signal. So that
>>>assumption is not necessarily true.
>>
>>If you use sigaction then you get POSIX semantics, which don't have this
>>problem.
>
>
> It's just a common case where Ray's assumption is not true.
>
> -Andi
>
True enough. And in that case the design I was describing wouldn't
make sigaction() that much more expensive, since if you are not in the POSIX
thread environment (more precisely, the thread was not created with
CLONE_SIGHAND) each thread has its own sighand structure and the "global"
locking mechanism I had proposed would only require the taking of one
additional lock.
However, special-casing ITIMER_PROF is also a reasonable approach.
The performance monitor code can also deliver signals to user space when
a sampling buffer overflows, and this can have the same kind of scaling
problem as ITIMER_PROF. I'll have to do a little research to figure out
how exactly that works, but that signal (SIGIO?) would also be a candidate
for special casing on our platform.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:54 ` Andi Kleen
2004-11-22 18:56 ` Ray Bryant
@ 2004-11-22 19:22 ` Ray Bryant
1 sibling, 0 replies; 19+ messages in thread
From: Ray Bryant @ 2004-11-22 19:22 UTC (permalink / raw)
To: Andi Kleen
Cc: Andreas Schwab, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
OK, apparently SIGPROF is delivered in both the ITIMER_PROF and
PMU interrupt cases, so if we special-case that signal we should
be fine.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
2004-11-22 16:51 ` Andreas Schwab
@ 2004-11-22 17:23 ` Philip J. Mucci
2004-11-22 21:26 ` Boehm, Hans
` (2 subsequent siblings)
4 siblings, 0 replies; 19+ messages in thread
From: Philip J. Mucci @ 2004-11-22 17:23 UTC (permalink / raw)
To: Andi Kleen
Cc: Ray Bryant, Kernel Mailing List, linux-ia64@vger.kernel.org,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Hi Andi,
Allow me to say that this particular case is very special, because
ITIMER_PROF is used in many performance tools. Inside of PAPI for
example, we use ITIMER_PROF for both counter multiplexing and for
statistical profiling. Tools such as HPCToolkit, PerfSuite and others
often enable ITIMER_PROF at the highest available resolution. So while
there may be hundreds or thousands of ways to do this, this particular
avenue has many useful tools out there that can easily trigger this
case.
However, I think your suggestion is an excellent one regarding a fast
path for ITIMER_PROF.
FWIW, Solaris has ITIMER_REALPROF, which, when enabled, sends SIGPROF
to all LWPs in the process on each tick. In this way,
each thread doesn't have to call setitimer()
by itself to get a signal... yeah yeah, I know what POSIX says about
signal delivery to any LWP, but on every Linux I've tested, the
ITIMER_PROF signal goes only to the thread that registered the timer.
Granted, I haven't run the test on an Altix.
Regards,
Philip Mucci
> I suspect there are hundreds or thousands of ways on such a big system to
> exploit some lock to make the system unresponsive. If you wanted
> to fix them all you would be in a large-scale redesign effort.
> It's not clear why this particular case is special.
> > Since signals are sent much more often than sigaction() is called, it would
> > seem to make more sense to make sigaction() take a heavier weight lock of
>
> At least in traditional signal semantics you have to call sigaction
> or signal in each signal handler to reset the signal. So that
> assumption is not necessarily true.
>
> > It seems to me that scalability would be improved if we moved the siglock
> > from
> > the sighand structure to the task_struct. (keep reading, please...) Code
> > that manipulates the current task signal data only would just obtain that
> > lock. Code that needs to change the sighand structure (e. g. sigaction())
> > would obtain all of the siglock's of all tasks using the same sighand
> > structure. A list of those task_struct's would be added to the sighand
> > structure to enable finding these structurs without having to take the
> > task_list_lock and search for them.
>
> Taking all these locks without risking deadlock would be tricky.
> You could just use a ring, but would need to point to a common
> anchor and always start from there to make sure all lock grabbers
> acquire the locks in the same order.
>
> > Anyway, we would be interested in the community's ideas about dealing with
> > this signal delivery scalability issue, and, comments on the solution above
> > or suggestions for alternative solutions are welcome.
>
> How about you figure out a fast path of some signals that can work
> without locking: e.g. no load balancing needed, no queued signal, etc.
> and then just do the delivery of SIGPROF lockless? Or just ignore it
> since the original premise doesn't seem too useful.
>
> -Andi
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
2004-11-22 16:51 ` Andreas Schwab
2004-11-22 17:23 ` Philip J. Mucci
@ 2004-11-22 21:26 ` Boehm, Hans
2004-11-22 21:34 ` Andi Kleen
2004-11-22 21:27 ` Rick Lindsley
2004-11-22 23:01 ` Boehm, Hans
4 siblings, 1 reply; 19+ messages in thread
From: Boehm, Hans @ 2004-11-22 21:26 UTC (permalink / raw)
To: Ray Bryant, Andi Kleen
Cc: Andreas Schwab, Kernel Mailing List, linux-ia64, lse-tech, holt,
Dean Roe, Brian Sumner, John Hawkes
Although I don't fully understand all the issues here,
I'm concerned about this proposal. In particular, our
garbage collector (used by gcj, and Mono, among others)
uses signals to stop threads for each garbage collection.
With a small heap, and many threads, I would expect the
frequency of signal delivery to be similar to what you
get with performance tools. But it does not, and should not,
use SIGPROF.
I think this is a more general issue. Special casing one
piece of it is only going to make performance more surprising,
something I think should be avoided if at all possible.
Hans
> -----Original Message-----
> From: linux-ia64-owner@vger.kernel.org
> [mailto:linux-ia64-owner@vger.kernel.org]On Behalf Of Ray Bryant
> Sent: Monday, November 22, 2004 11:23 AM
> To: Andi Kleen
> Cc: Andreas Schwab; Kernel Mailing List; linux-ia64@vger.kernel.org;
> lse-tech; holt@sgi.com; Dean Roe; Brian Sumner; John Hawkes
> Subject: Re: [Lse-tech] scalability of signal delivery for
> Posix Threads
>
>
> OK, apparently SIGPROF is delivered in both the ITIMER_PROF and
> PMU interrupt cases, so if we special-case that signal we should
> be fine.
> --
> Best Regards,
> Ray
> -----------------------------------------------
> Ray Bryant
> 512-453-9679 (work) 512-507-7807 (cell)
> raybry@sgi.com raybry@austin.rr.com
> The box said: "Requires Windows 98 or better",
> so I installed Linux.
> -----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 21:26 ` Boehm, Hans
@ 2004-11-22 21:34 ` Andi Kleen
2004-12-01 22:53 ` Brent Casavant
0 siblings, 1 reply; 19+ messages in thread
From: Andi Kleen @ 2004-11-22 21:34 UTC (permalink / raw)
To: Boehm, Hans
Cc: Ray Bryant, Andi Kleen, Andreas Schwab, Kernel Mailing List,
linux-ia64, lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
> I think this is a more general issue. Special casing one
It just cannot be done in the general case without slowing
down sigaction significantly. Or maybe it can, but nobody
has proposed a way to do it so far.
It's difficult to design for machines where a simple spinlock
doesn't work properly anymore.
> piece of it is only going to make performance more surprising,
> something I think should be avoided if at all possible.
The special case in particular would be signals directed to a specific TID,
as opposed to signals load-balanced over the thread group, which needs
shared writable state. To simplify the fast path you could also make
further simplifications: no queueing (otherwise you would need to duplicate
a lot of state into the task_struct to handle that) and probably
no SIGCHLD, which is also full of special cases.
-Andi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 21:34 ` Andi Kleen
@ 2004-12-01 22:53 ` Brent Casavant
0 siblings, 0 replies; 19+ messages in thread
From: Brent Casavant @ 2004-12-01 22:53 UTC (permalink / raw)
To: Andi Kleen
Cc: Boehm, Hans, Ray Bryant, Andreas Schwab, Kernel Mailing List,
linux-ia64, lse-tech, holt, Dean Roe, John Hawkes
On Mon, 22 Nov 2004, Andi Kleen wrote:
> > I think this is a more general issue. Special casing one
>
> It just cannot be done in the general case without slowing
> down sigaction significantly. Or maybe it can, but nobody
> has proposed a way to do it so far.
Sorry for the late reply, but I just inherited some of this work
from Ray and am catching up.
At a high level the seqlock seemed like the right idea, though
neither it nor seqcount is appropriate since in the case of signal
processing we can't tolerate consuming stale information and redoing
the operation.
But it got me thinking in a good direction. We could add a per-task
shadow copy of the per-process sighand_struct. Added to sighand_struct
would be a generation number. Whenever we perform an operation that
currently consumes data in the sighand_struct, we would first check the
shadow copy generation number against the per-process generation number.
If there is a mismatch, the per-process siglock is taken and the shadow
copy is updated, then the siglock is dropped. Whether or not this update
was necessary, we complete the signal processing using only the shadow copy.
Whenever the per-process sighand_struct needs updating, the structure
would be updated as normal, and as a last operation before unlocking
the generation number would be bumped.
This lazy update method would not suffer a significant slowdown during
a sigaction(2) call. The only potentially significant penalty occurs at
the time of signal delivery when a signal disposition/handler has changed.
Even this would be limited to a memcpy() of sighand_struct->action,
which is not significant unless the disposition/handler is changing rapidly.
Does this seem like a solution that would be worth pursuing? I see some
potential pitfalls in that siglock protects more than sighand itself, and
that IRQs would not be disabled except during the shadowing operation.
There would be a race where the generation numbers match, so we begin
using the shadowed data, but simultaneously another task updates the
per-process sighand_struct. This causes no direct ill effect as the
shadowed data is coherent, however I'm not sure whether an application
could possibly be sensitive to this race. It seems that any such
application already suffers from a race as to which task obtains the
siglock first, but we are at least guaranteed that if signal delivery
begins, it is complete through signal_wake_up() before the racing
sigaction(2) begins. I suspect there's nothing to worry about here, but
I haven't convinced myself of this quite yet.
I see that signal_wake_up() currently requires interrupts be disabled
on its behalf by holding siglock. Under this new scheme it may be
necessary to lock interrupts without taking siglock itself, unless a
way can be found to make signal_wake_up() interrupt-safe.
Anyway, all that to once again ask if this seems like a beneficial
or feasible method to pursue? Any glaring holes? Any opinion as
to whether we should track the generation for the sighand_struct as a
whole, or for each individual element of sighand_struct->action
(seems like overkill to me, but it was casually suggested in hallway
chatter)?
Thanks,
Brent Casavant
--
Brent Casavant If you had nothing to fear,
bcasavan@sgi.com how then could you be brave?
Silicon Graphics, Inc. -- Queen Dama, Source Wars
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
` (2 preceding siblings ...)
2004-11-22 21:26 ` Boehm, Hans
@ 2004-11-22 21:27 ` Rick Lindsley
2004-11-22 23:39 ` Ray Bryant
2004-11-22 23:01 ` Boehm, Hans
4 siblings, 1 reply; 19+ messages in thread
From: Rick Lindsley @ 2004-11-22 21:27 UTC (permalink / raw)
To: Ray Bryant
Cc: Kernel Mailing List, linux-ia64@vger.kernel.org, lse-tech, holt,
Dean Roe, Brian Sumner, John Hawkes
So with CLONE_SIGHAND, we share the handler assignments and which signals
are blocked, but retain the ability for individual threads to receive
a signal. And when all of them receive signals in quick succession,
we see lock contention because they're sharing the same (effectively)
global lock to receive all of their (effectively) individual signals
... is that correct?
Are you contending on tasklist_lock, or on siglock?
It seems to me that scalability would be improved if we moved the
siglock from the sighand structure to the task_struct.
Only if you want to keep its current semantics of it being a lock for
all things signal. Finer granularity would, it seems at first look,
afford you the benefits you're looking for. (But not without the cost of
a fair amount of work to make sure the new locks are utilized correctly.)
For the problem you're describing, it sounds like the contention is occurring
at delivery, so a new lock for pending, blocked, and real_blocked might be
in order.
Rick
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 21:27 ` Rick Lindsley
@ 2004-11-22 23:39 ` Ray Bryant
0 siblings, 0 replies; 19+ messages in thread
From: Ray Bryant @ 2004-11-22 23:39 UTC (permalink / raw)
To: Rick Lindsley
Cc: Kernel Mailing List, linux-ia64@vger.kernel.org, lse-tech, holt,
Dean Roe, Brian Sumner, John Hawkes
Rick Lindsley wrote:
> So with CLONE_SIGHAND, we share the handler assignments and which signals
> are blocked, but retain the ability for individual threads to receive
> a signal. And when all of them receive signals in quick succession,
> we see lock contention because they're sharing the same (effectively)
> global lock to receive all of their (effectively) individual signals
> .. is that correct?
>
Yes, I think that's what's happening, except that I think the blocked
signal list is per thread as well. The shared sighand structure just
has the saved arguments from sigaction, as I remember. (It's confusing:
the set of signals blocked during execution of the signal handler is
part of the sigaction structure and hence is global to the entire
thread group, whilst the set of signals blocked in general is per thread.)
> Are you contending on tasklist_lock, or on siglock?
Definitely siglock. All of the profiling ticks occur at
spin_unlock_irqrestore(&p->sighand->siglock) in the routines I
mentioned before. [We don't have NMI profiling on Altix,
so profiling typically can't look inside
code sections that run with interrupts disabled.]
>
> It seems to me that scalability would be improved if we moved the
> siglock from the sighand structure to the task_struct.
>
> Only if you want to keep its current semantics of it being a lock for
> all things signal. Finer granularity would, it seems at first look,
> afford you the benefits you're looking for. (But not without the cost of
> a fair amount of work to make sure the new locks are utilized correctly.)
> For the problem you're describing, it sounds like the contention is occuring
> at delivery, so a new lock for pending, blocked, and real_blocked might be
> in order.
>
> Rick
>
Yes, I was hoping to keep the current semantics of siglock as the lock for
all things signal, just make it local per thread, and require that all of the
siglocks be held to change the sighand structure. That seemed like a change I
could manage. My personal notion was that the slowdown of sigaction()
processing for multi-threaded POSIX programs was not that big of a deal because
it doesn't happen very often, and for non-CLONE_SIGHAND threads the additional
cost would be minor. But if the slowdown in the CLONE_SIGHAND case is not
acceptable then I'm stuck as to how to do this.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [Lse-tech] scalability of signal delivery for Posix Threads
2004-11-22 16:22 ` [Lse-tech] " Andi Kleen
` (3 preceding siblings ...)
2004-11-22 21:27 ` Rick Lindsley
@ 2004-11-22 23:01 ` Boehm, Hans
4 siblings, 0 replies; 19+ messages in thread
From: Boehm, Hans @ 2004-11-22 23:01 UTC (permalink / raw)
To: Andi Kleen
Cc: Ray Bryant, Andreas Schwab, Kernel Mailing List, linux-ia64,
lse-tech, holt, Dean Roe, Brian Sumner, John Hawkes
Just to clarify:
I have no problem with special-casing signals sent to a specific
thread. Our garbage collector uses pthread_kill, and thus should
also benefit from that change. And it makes sense to me that this
kind of signal should be cheaper to deliver.
SIGSEGV delivery also matters to me. But that should presumably
also fall into the same class.
I would prefer to avoid special handling for just SIGPROF.
If that was never proposed, please ignore my comments.
Hans
> -----Original Message-----
> From: Andi Kleen [mailto:ak@suse.de]
> Sent: Monday, November 22, 2004 1:35 PM
> To: Boehm, Hans
> Cc: Ray Bryant; Andi Kleen; Andreas Schwab; Kernel Mailing List;
> linux-ia64@vger.kernel.org; lse-tech; holt@sgi.com; Dean Roe; Brian
> Sumner; John Hawkes
> Subject: Re: [Lse-tech] scalability of signal delivery for
> Posix Threads
>
>
> > I think this is a more general issue. Special casing one
>
> It just cannot be done in the general case without slowing
> down sigaction significantly. Or maybe it can, but nobody
> has proposed a way to do it so far.
>
> It's difficult to design for machines where a simple spinlock
> doesn't work properly anymore.
>
> > piece of it is only going to make performance more surprising,
> > something I think should be avoided if at all possible.
>
> The special case in particular would be signals directed to a
> specific TID;
> compared to signals load balanced over the thread group which needs
> shared writable state. To simplify the fast path you could also make
> further simplifications: no queueing (otherwise you would need to duplicate
> a lot of state into the task_struct to handle that) and probably
> no SIGCHLD, which is also full of special cases.
>
> -Andi
>
^ permalink raw reply [flat|nested] 19+ messages in thread