public inbox for linux-kernel@vger.kernel.org
* Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)
@ 2004-04-09  9:11 Nikita V. Youshchenko
  2004-04-09 14:45 ` Denis Vlasenko
  2004-04-13 13:10 ` Marcelo Tosatti
  0 siblings, 2 replies; 6+ messages in thread
From: Nikita V. Youshchenko @ 2004-04-09  9:11 UTC (permalink / raw)
  To: linux-kernel

Hello.

Several days ago I posted to linux-kernel describing a "zombie problem" 
related to sigqueue overflow.

Further exploration of the problem showed that the cause of the described 
behaviour is in user-space: a process blocks a signal and later receives 
tons of such signals, which effectively causes sigqueue overflow.

The following program gives the same effect:

#include <signal.h>
#include <unistd.h>

int main()
{
        sigset_t set;
        int i;
        pid_t pid;

        /* Block real-time signal 40 so every delivery stays queued. */
        sigemptyset(&set);
        sigaddset(&set, 40);
        sigprocmask(SIG_BLOCK, &set, 0);

        /* Queue 1024 instances; each one consumes a sigqueue slot. */
        pid = getpid();
        for (i = 0; i < 1024; i++)
                kill(pid, 40);

        /* Keep the process alive so the queued signals stay pending. */
        while (1)
                sleep(1);
}

Running this program on a 2.4 or 2.6 kernel with the 
default /proc/sys/kernel/rtsig-max value will cause sigqueue overflow, and 
all linuxthreads-based programs, INCLUDING DAEMONS RUNNING AS ROOT, will 
stop receiving notifications about thread exits, so all completed threads 
will become zombies. The exact reason why this happens is described in 
detail in my previous postings.

This is a local DoS.

Affected system services include (but are not limited to) mysql and clamav. 
In fact, any linuxthreads application will be affected.

The problem is not as bad on 2.6, since NPTL is used instead of 
linuxthreads, so there are no zombies from system daemons. However, bad 
things still happen: when the sigqueue overflows, all processes get zeroed 
siginfo, which causes random application misbehaviour (such as hangs in 
pthread_cancel()).

I don't know what the correct solution for this issue is. Probably there 
should be per-process or per-user (but not system-wide) limits on the 
number of pending signals.



* Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)
  2004-04-09  9:11 Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6) Nikita V. Youshchenko
@ 2004-04-09 14:45 ` Denis Vlasenko
  2004-04-13 13:10 ` Marcelo Tosatti
  1 sibling, 0 replies; 6+ messages in thread
From: Denis Vlasenko @ 2004-04-09 14:45 UTC (permalink / raw)
  To: Nikita V. Youshchenko, linux-kernel

On Friday 09 April 2004 12:11, Nikita V. Youshchenko wrote:
> Hello.
>
> Several days ago I posted to linux-kernel describing a "zombie problem"
> related to sigqueue overflow.
>
> Further exploration of the problem showed that the cause of the described
> behaviour is in user-space: a process blocks a signal and later receives
> tons of such signals, which effectively causes sigqueue overflow.

One solution would be to watermark the sigqueue: upon reaching the
high mark, find the process with the most signals queued and drop those.
This prevents one buggy process, even a root-launched one, from
interfering with non-buggy ones.

If that does not bring the queue back below the low watermark, find the
_UID_ which has the most signals pending, and drop them all. This works
against a rogue user trying to DoS the box who is careful enough to do
it from multiple processes.
--
vda



* Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)
  2004-04-09  9:11 Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6) Nikita V. Youshchenko
  2004-04-09 14:45 ` Denis Vlasenko
@ 2004-04-13 13:10 ` Marcelo Tosatti
  2004-06-14 17:01   ` David Lang
  1 sibling, 1 reply; 6+ messages in thread
From: Marcelo Tosatti @ 2004-04-13 13:10 UTC (permalink / raw)
  To: Nikita V. Youshchenko; +Cc: linux-kernel

On Fri, Apr 09, 2004 at 01:11:50PM +0400, Nikita V. Youshchenko wrote:
> Hello.
> 
> Several days ago I posted to linux-kernel describing a "zombie problem" 
> related to sigqueue overflow.
> 
> Further exploration of the problem showed that the cause of the described 
> behaviour is in user-space: a process blocks a signal and later receives 
> tons of such signals, which effectively causes sigqueue overflow.
> 
> The following program gives the same effect:
> 
> #include <signal.h>
> #include <unistd.h>
> #include <stdlib.h>
> 
> int main()
> {
>         sigset_t set;
>         int i;
>         pid_t pid;
> 
>         sigemptyset(&set);
>         sigaddset(&set, 40);
>         sigprocmask(SIG_BLOCK, &set, 0);
> 
>         pid = getpid();
>         for (i = 0; i < 1024; i++)
>                 kill(pid, 40);
> 
>         while (1)
>                 sleep(1);
> }
> 
> Running this program on a 2.4 or 2.6 kernel with the 
> default /proc/sys/kernel/rtsig-max value will cause sigqueue overflow, and 
> all linuxthreads-based programs, INCLUDING DAEMONS RUNNING AS ROOT, will 
> stop receiving notifications about thread exits, so all completed threads 
> will become zombies. The exact reason why this happens is described in 
> detail in my previous postings.
> 
> This is a local DoS.
> 
> Affected system services include (but are not limited to) mysql and clamav. 
> In fact, any linuxthreads application will be affected.
> 
> The problem is not as bad on 2.6, since NPTL is used instead of 
> linuxthreads, so there are no zombies from system daemons. However, bad 
> things still happen: when the sigqueue overflows, all processes get zeroed 
> siginfo, which causes random application misbehaviour (such as hangs in 
> pthread_cancel()).
> 
> I don't know what the correct solution for this issue is. Probably there 
> should be per-process or per-user (but not system-wide) limits on the 
> number of pending signals.

Indeed, a per-user sigqueue limit is the way to fix this.

Anyone willing to implement it?



* Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)
  2004-04-13 13:10 ` Marcelo Tosatti
@ 2004-06-14 17:01   ` David Lang
  2004-06-15  0:27     ` Marcelo Tosatti
  0 siblings, 1 reply; 6+ messages in thread
From: David Lang @ 2004-06-14 17:01 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Nikita V. Youshchenko, linux-kernel

I think I may be running into the same (or a similar) issue with a 
workload that forks heavily (~3500 forks/sec). What can I do to let the 
system survive this sort of load?

David Lang

On Tue, 13 Apr 2004, Marcelo Tosatti wrote:

> [quoted headers and original report snipped]
>
> Indeed, per-user sigqueue limit is the way to fix this.
>
> Anyone willing to implement it ?

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)
  2004-06-14 17:01   ` David Lang
@ 2004-06-15  0:27     ` Marcelo Tosatti
  2004-06-15  1:31       ` David Lang
  0 siblings, 1 reply; 6+ messages in thread
From: Marcelo Tosatti @ 2004-06-15  0:27 UTC (permalink / raw)
  To: David Lang; +Cc: Nikita V. Youshchenko, linux-kernel

On Mon, Jun 14, 2004 at 10:01:53AM -0700, David Lang wrote:
> I think I may be running into the same (or a similar) issue with a 
> workload that forks heavily (~3500 forks/sec). What can I do to let the 
> system survive this sort of load? 

Hi David,

The v2.6.7-mm tree contains a fix for this: a new rlimit for
pending signals.

Can you describe the problem you are seeing in more detail?




* Re: Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6)
  2004-06-15  0:27     ` Marcelo Tosatti
@ 2004-06-15  1:31       ` David Lang
  0 siblings, 0 replies; 6+ messages in thread
From: David Lang @ 2004-06-15  1:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Nikita V. Youshchenko, linux-kernel

On Mon, 14 Jun 2004, Marcelo Tosatti wrote:
> On Mon, Jun 14, 2004 at 10:01:53AM -0700, David Lang wrote:
>> I think I may be running into the same (or a similar) issue with a
>> workload that forks heavily (~3500 forks/sec). What can I do to let the
>> system survive this sort of load?
>
> Hi David,
>
> The v2.6.7-mm tree contains a fix for this: a new rlimit for
> pending signals.

I'll have to give this a try.

> Can you describe the problem you are seeing in more detail?

I have a stress test I am running on a dual Opteron 1.4GHz box that 
receives a network connection, forks a new process, does a little bit of 
network traffic, then the child exits. When I hammer this I get ~3500 
connections/sec (with a significant amount of spare CPU; I'm limited by 
my load boxes), but after a few seconds (8-10) something happens and the 
parent stops receiving the SIGCHLD signals. If I attach strace to the 
parent process the signals are re-enabled and everything works for a 
little while longer before the cycle repeats.

If I hit it with only ~10,000 connections and then pause, the box 
survives indefinitely.

Running the same test on a dual Athlon MP 2200+ I get ~2500 connections/sec 
and it has no problems. I just compiled a 32-bit kernel for the Opteron 
and get ~3300 connections/sec (with no idle CPU time) and the box doesn't 
lock up.

I don't know if this is because it's just below the threshold of the 
problem or if there is a bug in the 64-bit kernel (or both).

I'm currently trying to tweak the 32-bit Opteron kernel to get a smidge 
more speed out of it, to see whether getting back up to the same speed 
starts triggering the problem again.

David Lang

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Thread overview: 6+ messages
2004-04-09  9:11 Local DoS (was: Strange 'zombie' problem both in 2.4 and 2.6) Nikita V. Youshchenko
2004-04-09 14:45 ` Denis Vlasenko
2004-04-13 13:10 ` Marcelo Tosatti
2004-06-14 17:01   ` David Lang
2004-06-15  0:27     ` Marcelo Tosatti
2004-06-15  1:31       ` David Lang
