Why processes on linux loses signals?

* Why processes on linux loses signals?
From: Michael Tokarev @ 2009-11-22 21:14 UTC
  To: Linux-kernel

It's a very old issue, but I still don't know the answer.

In short, processes on Linux lose signals.  It happens rarely,
but often enough to be annoying.

For example, I have a program that used alarm(2) to check for
something periodically.  Nothing fancy is done in the signal
handler, no long operations: it is installed with plain signal(2)
and just sets a global variable.  Under heavy usage (it's a DNS
nameserver), after about a week (sometimes a few hours, sometimes
a month) it stops checking for updates, apparently because some
SIGALRM got lost.

For this program I had to replace alarm() with setitimer(), but
only on Linux.  On all the other operating systems where it is
used (Solaris, FreeBSD, HP-UX, AIX), everything works as expected.
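
For reference, a minimal sketch of the setitimer() variant, in Python
for brevity (the program described above is presumably C; the function
name and parameters here are illustrative, not taken from it).  With a
non-zero interval the kernel re-arms the timer itself, so a periodic
check does not depend on another alarm() call after every tick.
Whether that is why the switch helped in this case is not established
in this thread.

```python
import signal
import time


def count_timer_ticks(wanted=3, interval=0.01, timeout=5.0):
    """Arm a periodic ITIMER_REAL and return how many SIGALRMs were seen."""
    ticks = 0

    def on_alarm(signum, frame):
        # Like the handler described above: only record that the timer fired.
        nonlocal ticks
        ticks += 1

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    # First expiry after `interval` seconds, then every `interval` seconds.
    # The kernel re-arms the timer; no alarm() call is needed per tick.
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    deadline = time.monotonic() + timeout
    try:
        while ticks < wanted and time.monotonic() < deadline:
            time.sleep(interval / 2)  # stand-in for the real work loop
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)     # disarm the timer first
        signal.signal(signal.SIGALRM, old_handler)  # then restore the handler
    return ticks


if __name__ == "__main__":
    print("ticks seen:", count_timer_ticks())
```

The sketch must run in the main thread, since Python only allows
signal.signal() there.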

Another common issue is the SIGIO-based event loop, in its
classical form, in a process that is not heavily loaded.  Quite
often the server loses a SIGIO, so even though I/O is possible,
the process does not know about it.  The pending (stuck) I/O gets
processed on receipt of the next SIGIO, which indicates readiness
of another file descriptor: since the process does poll() after
each SIGIO, it notices both.
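
One documented property that can look like a lost SIGIO: standard
(non-real-time) signals are not queued.  If several instances are
generated while the signal is blocked (which is also the case while
its handler runs, unless SA_NODEFER is used), only one stays pending,
so a handler has to treat SIGIO as a hint and check every descriptor.
The sketch below, in Python for brevity and with an illustrative
function name, shows the merging.  Whether this is what happens in the
cases described here is not established in this thread.

```python
import signal


def count_coalesced_deliveries(sends=3):
    """Raise SIGIO `sends` times while it is blocked.

    Returns how many times the handler ran once the signal was unblocked.
    """
    delivered = 0

    def on_sigio(signum, frame):
        nonlocal delivered
        delivered += 1

    old_handler = signal.signal(signal.SIGIO, on_sigio)
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGIO})
    try:
        for _ in range(sends):
            # Each call generates the signal again, but a standard signal
            # has a single pending bit, so repeats are merged into one.
            signal.raise_signal(signal.SIGIO)
    finally:
        # Unblocking delivers the one pending instance; the handler runs here.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(signal.SIGIO, old_handler)
    return delivered


if __name__ == "__main__":
    print("handler invocations:", count_coalesced_deliveries())
```

Needs Python 3.8 or later for signal.raise_signal(), and must run in
the main thread.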

A "classical" (for me) example of this is Oracle database
version 8 (we still have many of these in production; in later
versions they rewrote the event loop to use different techniques).
There is a dispatcher process that does nothing but listen on the
network, receive requests and pass them to a set of worker
processes.  Everything is non-blocking and the process is mostly
idle.  It is very annoying when trivial actions in a user
application cause long delays: an app sends a request to the
Oracle DB and the request gets stuck in the event queue because
the corresponding SIGIO was never delivered.  Making another
connection to the same DB immediately "unsticks" the request.
This happens transparently when many users are working with the
database at the same time, each making requests: any stuck or
lost I/O gets unstuck immediately because new requests keep coming
from other users.  But in the evenings, or during periods of low
activity, it becomes a real problem.

I have looked at the server's behaviour numerous times: the server
(Oracle) behaves quite reasonably, and its strace output is sane
enough.  That is to say, one can't blame "stupid closed-source
programmers" for this.

There are other examples like this, all involving lost signals.
The two above are just the ones most familiar to me.

The problem becomes much worse when a system has multiple cores.
On a single-CPU system it is rare enough to be almost
unnoticeable.  But with even a second core the issue shows up
almost immediately, often enough for many users to start calling
tech support because their apps are very slow.

Last time I asked a similar question here, I was told that
signals are unreliable and should not be used.  But what is the
reason for the unreliability, and why should signals be unreliable
only on Linux?

Thanks!

/mjt

* Re: Why processes on linux loses signals?
From: Nikita V. Youshchenko @ 2009-11-22 22:21 UTC
  To: Michael Tokarev; +Cc: Linux-kernel

> In short, processes on Linux lose signals.  It happens rarely,
> but often enough to be annoying.
>
> ...
>
> The problem becomes much worse when a system has multiple
> cores... But with even a second core the issue shows up almost
> immediately ...

Looks like a classic description of a race.
Double-check your user-space code for signal-related races.
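
A sketch of one classic user-space race of this kind and a way around
it: code that checks a flag set by a handler and then sleeps (pause(),
select(), poll()) leaves a window in which the signal can arrive after
the check but before the sleep, so the wakeup is missed.  Keeping the
signal blocked and collecting it synchronously closes that window.
Python is used for brevity and the function name is illustrative;
nothing in this thread shows that the programs in question contain
this particular race.

```python
import signal


def wait_for_alarm(delay=0.01, timeout=2.0):
    """Return True if SIGALRM was collected within `timeout` seconds.

    Single-threaded sketch; uses sigtimedwait(), so Linux is assumed.
    """
    # Block SIGALRM *before* arming the timer: from here on an expiry can
    # only become pending, it cannot slip in between a check and a sleep.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGALRM})
    try:
        signal.setitimer(signal.ITIMER_REAL, delay)  # one-shot timer
        # sigtimedwait() waits for the blocked signal and consumes it in one
        # step; it returns None if the timeout expires first.
        info = signal.sigtimedwait({signal.SIGALRM}, timeout)
        return info is not None and info.si_signo == signal.SIGALRM
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        # Discard a late expiry, if any, so that unblocking cannot trigger
        # the default action of SIGALRM (process termination).
        signal.sigtimedwait({signal.SIGALRM}, 0)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    print("alarm collected:", wait_for_alarm())
```

In C the same idea is usually written with sigprocmask() plus
sigsuspend(), sigwaitinfo() or ppoll().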

* Re: Why processes on linux loses signals?
From: Ray Lee @ 2009-11-23  1:39 UTC
  To: Michael Tokarev, Oleg Nesterov, roland; +Cc: Linux-kernel

[ Adding potentially interested parties to the CC.  Michael, please
respond with the latest kernel version you've tried that exhibits the
problem, and say whether you've been able to create a test case that
shows the signal loss. ]


* Re: Why processes on linux loses signals?
From: Mikael Pettersson @ 2009-11-23 10:34 UTC
  To: Michael Tokarev; +Cc: Linux-kernel

Michael Tokarev writes:
 > In short, processes on Linux lose signals.

You neglected to attach a self-contained test program for this alleged problem.

* Re: Why processes on linux loses signals?
From: Oleg Nesterov @ 2009-11-23 14:40 UTC
  To: Ray Lee; +Cc: Michael Tokarev, roland, Linux-kernel

On 11/22, Ray Lee wrote:
>
> [ Adding potentially interested parties to the CC.  Michael, please
> respond with the latest kernel version you've tried that exhibits the
> problem, and say whether you've been able to create a test case that
> shows the signal loss. ]

Yes, it would be nice to have a test-case.

> On Sun, Nov 22, 2009 at 1:14 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>
> > It's a very old issue, but I still don't know the answer.
> >
> > In short, processes on Linux lose signals.  It happens rarely,
> > but often enough to be annoying.
> >
> > For example, I have a program that used alarm(2) to check for
> > something periodically.  Nothing fancy is done in the signal
> > handler, no long operations: it is installed with plain signal(2)
> > and just sets a global variable.  Under heavy usage (it's a DNS
> > nameserver), after about a week (sometimes a few hours, sometimes
> > a month) it stops checking for updates, apparently because some
> > SIGALRM got lost.

This shouldn't happen (assuming your application is correct ;)

If this happens again, could you look in /proc/pid/status? I don't
really think this will help, but still.
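
A small sketch for reading the signal-related fields of that file, in
Python for brevity (the function names are illustrative).  Per proc(5),
SigPnd and ShdPnd are the signals pending for the thread and for the
whole process, SigBlk, SigIgn and SigCgt are the blocked, ignored and
caught signals, and SigQ is the number of queued signals and its limit.
As an assumption about how one might read the result: SIGALRM showing
up as blocked or pending, or missing from the caught set, would point
at the application's signal setup rather than at a signal being
dropped.

```python
SIGNAL_FIELDS = ("SigQ", "SigPnd", "ShdPnd", "SigBlk", "SigIgn", "SigCgt")


def decode_sigmask(hex_mask):
    """Turn a hex mask from /proc/<pid>/status into a set of signal numbers."""
    bits = int(hex_mask, 16)
    # Bit n-1 of the mask corresponds to signal number n.
    return {n for n in range(1, 65) if bits & (1 << (n - 1))}


def signal_state(pid="self"):
    """Read the signal-related lines of /proc/<pid>/status (Linux only)."""
    state = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in SIGNAL_FIELDS:
                value = value.strip()
                # SigQ is "queued/limit"; the other fields are hex bitmasks.
                state[key] = value if key == "SigQ" else decode_sigmask(value)
    return state


if __name__ == "__main__":
    state = signal_state()
    print("SigQ", state.get("SigQ"))
    for key in SIGNAL_FIELDS[1:]:
        print(key, sorted(state.get(key, ())))
```

Pass the PID of the stuck process as a string, for example
signal_state("1234"), to inspect it instead of the calling process.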

> > Last time I asked a similar question here, I was told that
> > signals are unreliable

They should be reliable.  If not, we have a kernel bug.

Oleg.

