* A possible sys_wait* bug
From: Salman Qazi @ 2010-05-20 20:32 UTC
To: linux-kernel, akpm
One of our internal workloads ran into a problem with waitpid. A
simple repro case is as follows:
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <assert.h>
#include <sched.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_CPUS 4

void *thread_code(void *args)
{
	int j;
	int pid2;

	for (j = 0; j < 1000; j++) {
		pid2 = fork();
		if (pid2 == 0)
			while (1) { sleep(1000); }
	}

	while (1) {
		int status;
		if (waitpid(-1, &status, WNOHANG) < 0) {
			printf("! %d\n", errno);
		}
	}
	exit(0);
}

/*
 * Non-blocking waitpids in a tight loop, with many children to go through,
 * done on multiple threads, so that they can "pass the torch" to each other
 * and eliminate the window that a writer has to get in.
 *
 * This maximizes the holding of the tasklist_lock in read mode, starving
 * any attempt to take the lock in write mode.
 */
int main(int argc, char **argv)
{
	int i;
	pthread_attr_t attr;
	pthread_t threads[NUM_CPUS];

	for (i = 0; i < NUM_CPUS; i++) {
		assert(!pthread_attr_init(&attr));
		assert(!pthread_create(&threads[i], &attr, thread_code, NULL));
	}
	while (1) { sleep(1000); }
	return 0;
}
Basically, it is possible for readers to continuously hold
tasklist_lock (theoretically forever, as they pass it from one to
another), preventing the writer from taking that lock. This typically
causes a lockup on a CPU where a task is attempting to do a fork() or
exit(), resulting in the NMI watchdog firing.
Yes, WNOHANG is being used. And I agree that this is an inefficient
use of wait(). However, I think it should be possible to produce the
same effect without WNOHANG with a sufficiently large number of
threads: by arranging that at least one thread always holds the reader
lock.
I think the most direct approach to the problem is to make the
reader-writer locks writer-biased (i.e. as soon as a writer
contends, we do not permit any new readers). However, all suggestions
are welcome.
* Re: A possible sys_wait* bug
From: KOSAKI Motohiro @ 2010-07-01 0:47 UTC
To: Salman Qazi
Cc: kosaki.motohiro, linux-kernel, akpm, Oleg Nesterov,
Roland McGrath, Ingo Molnar, Peter Zijlstra
Hello, (cc to some core developers)
Is anyone tracking this issue? It looks like a security issue.
> One of our internal workloads ran into a problem with waitpid. A
> simple repro case is as follows:
>
> [full repro program quoted in the original message above; snipped]
>
> Basically, it is possible for readers to continuously hold
> tasklist_lock (theoretically forever, as they pass it from one to
> another), preventing the writer from taking that lock. This typically
> causes a lockup on a CPU where a task is attempting to do a fork() or
> exit(), resulting in the NMI watchdog firing.
>
> Yes, WNOHANG is being used. And I agree that this is an inefficient
> use of wait(). However, I think it should be possible to produce the
> same effect without WNOHANG with a sufficiently large number of
> threads: by arranging that at least one thread always holds the reader
> lock.
>
> I think the most direct approach to the problem is to make the
> reader-writer locks writer-biased (i.e. as soon as a writer
> contends, we do not permit any new readers). However, all suggestions
> are welcome.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
* Re: A possible sys_wait* bug
From: Salman Qazi @ 2010-07-01 5:00 UTC
To: KOSAKI Motohiro
Cc: linux-kernel, akpm, Oleg Nesterov, Roland McGrath, Ingo Molnar,
Peter Zijlstra
On Wed, Jun 30, 2010 at 5:47 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hello, (cc to some core developers)
>
> Is anyone tracking this issue? It looks like a security issue.
Please explain why this is a security issue. This is not readily
apparent to me. As far as Google is concerned it is a low/medium
priority bug, as there are user space workarounds, at least for the
time being.
>
> [remainder of the quoted original message snipped]
* Re: A possible sys_wait* bug
From: Roland McGrath @ 2010-07-01 5:34 UTC
To: KOSAKI Motohiro
Cc: Salman Qazi, linux-kernel, akpm, Oleg Nesterov, Ingo Molnar,
Peter Zijlstra
There are several other ways than wait that pathological user processes can
arrange to do tight loops taking tasklist_lock. So if you call that a DoS
risk then it's not solved with anything specific to the wait syscalls.
Using different rwlock behavior (at least for that lock) would be one way
to address it. Of course, changing the contention dynamics might have
other effects that you don't want (i.e. fork/exec/exit/reap drowning out
the reader uses, cascading to some bad pattern). I don't have an opinion
on that, and wish you good luck figuring it out.
In the long run, things have been moving to finer-grained locking and/or
less direct locking. We can imagine tasklist_lock might go this way one
day too, removing the general bottleneck (i.e. slicing it up into
different, smaller bottlenecks). As things stand today, it's a bit hard to see
exactly how we might get there e.g. for wait/exit with the synchronization
requirements for reparenting and all that. But it's clear that eventually
the big system-wide bottlenecks like tasklist_lock all get reduced. I
don't know of anything like this to expect to see any time soon, however.
Thanks,
Roland
* Re: A possible sys_wait* bug
From: Oleg Nesterov @ 2010-07-01 14:08 UTC
To: KOSAKI Motohiro
Cc: Salman Qazi, linux-kernel, akpm, Roland McGrath, Ingo Molnar,
Peter Zijlstra
On 07/01, KOSAKI Motohiro wrote:
>
> > Basically, it is possible for readers to continuously hold
> > tasklist_lock
Yes, this is the known problem.
Perhaps do_wait() is not the worst example. sys_kill(-1),
sys_ioprio_set() scan the global list.
> > I think the most direct approach to the problem is to make the
> > reader-writer locks writer-biased (i.e. as soon as a writer
> > contends, we do not permit any new readers).
I thought about this too, but this is deadlockable. At least,
read_lock(tasklist) should nest, and it should work in irq context.
We need the more fine-grained locking, but it is not clear to me what
should be done in the long term. Afaics, this is very nontrivial.
Oleg.
* Re: A possible sys_wait* bug
From: KOSAKI Motohiro @ 2010-07-02 6:18 UTC
To: Oleg Nesterov
Cc: kosaki.motohiro, Salman Qazi, linux-kernel, akpm, Roland McGrath,
Ingo Molnar, Peter Zijlstra
> On 07/01, KOSAKI Motohiro wrote:
> >
> > > Basically, it is possible for readers to continuously hold
> > > tasklist_lock
>
> Yes, this is the known problem.
>
> Perhaps do_wait() is not the worst example. sys_kill(-1),
> sys_ioprio_set() scan the global list.
Ah, I see.
Yup, Roland also pointed out that this is NOT the biggest risk; there
are many other ways. My thinking was too narrow. Sorry.
> > > I think the most direct approach to the problem is to make the
> > > reader-writer locks writer-biased (i.e. as soon as a writer
> > > contends, we do not permit any new readers).
>
> I thought about this too, but this is deadlockable. At least,
> read_lock(tasklist) should nest, and it should work in irq context.
>
> We need the more fine-grained locking, but it is not clear to me what
> should be done in the long term. Afaics, this is very nontrivial.
Thank you for the kind explanation.