* select() not returning though pipe became readable
From: Lutz Vieweg @ 2005-03-24 15:46 UTC
To: linux-kernel
Hi everyone,
I'm currently investigating the following problem, which seems to indicate
a misbehaviour of the kernel:
Server software we implemented has been sporadically "hanging" in a select()
call since we upgraded from kernel 2.4 to (currently) 2.6.9. (We have to wait
for 2.6.12 before we can upgrade again, due to the
shared-mem-not-dumped-into-core-files problem addressed there.)
What's suspicious is that whenever we attach with gdb to such a hanging process,
we can see that a pipe, whose file-descriptor is definitely included in the
fd_set "readfds" (and "n" is also high enough) has a byte in it available for
reading - and just leaving gdb again is enough to let the server continue just
fine.
We use that pipe, which is known only to this one process, to make
select() return immediately when a signal (SIGUSR1) has been delivered to the
process (by another process). A signal handler is installed that does
nothing but a (non-blocking) write of 1 byte to the writing end of the pipe.
This mechanism worked fine before kernel 2.6, and it is still working in 99.99% of
the cases, but under heavy load, every few hours, we'll see the hanging select()
as mentioned above.
I noticed a recent thread on lkml about poll() and pipes, but that seems to address a
different issue, where more events are reported than occurred. What we
see is quite the opposite: select() does not return even though the pipe became readable.
Any ideas?
Any hints on what to do to investigate the problem further?
Regards,
Lutz Vieweg
* Re: select() not returning though pipe became readable
From: Andrew Morton @ 2005-03-25 1:07 UTC
To: Lutz Vieweg; +Cc: linux-kernel
Lutz Vieweg <lutz.vieweg@is-teledata.com> wrote:
>
> I'm currently investigating the following problem, which seems to indicate
> a misbehaviour of the kernel:
>
> A server software we implemented is sporadically "hanging" in a select()
> call since we upgraded from kernel 2.4 to (currently) 2.6.9 (we have to wait
> for 2.6.12 before we can upgrade again due to the shared-mem-not-dumped-into-
> core-files problem addressed there).
>
> What's suspicious is that whenever we attach with gdb to such a hanging process,
> we can see that a pipe, whose file-descriptor is definitely included in the
> fd_set "readfds" (and "n" is also high enough) has a byte in it available for
> reading - and just leaving gdb again is enough to let the server continue just
> fine.
>
> We are using that pipe, which is known only to the same one process, to cause
> select() to return immediately if a signal (SIGUSR1) had been delivered to the
> process (by another process), there's a signal handler installed that does
> nothing but a (non-blocking) write of 1 byte to the writing end of the pipe.
>
> This mechanism worked fine before kernel 2.6, and it is still working in 99.99% of
> the cases, but under heavy load, every few hours, we'll see the hanging select()
> as mentioned above.
>
> I noticed a recent thread at lkml about poll() and pipes, but that seems to address a
> different issue, where there are more events reported than occurred, what we
> see is quite the opposite, we want select() to return on that pipe becoming readable...
>
> Any ideas?
> Any hints on what to do to investigate the problem further?
Could you at least test 2.6.12-rc1? Otherwise we might be looking for a
bug which isn't there.
* Re: select() not returning though pipe became readable
From: Lutz Vieweg @ 2005-03-30 17:29 UTC
To: Andrew Morton; +Cc: linux-kernel
Andrew Morton wrote:
> Lutz Vieweg <lutz.vieweg@is-teledata.com> wrote:
>
>>I'm currently investigating the following problem, which seems to indicate
>>a misbehaviour of the kernel:
>>
>>A server software we implemented is sporadically "hanging" in a select()
>>call since we upgraded from kernel 2.4 to (currently) 2.6.9 (we have to wait
>>for 2.6.12 before we can upgrade again due to the shared-mem-not-dumped-into-
>>core-files problem addressed there).
...
>>Any ideas?
>>Any hints on what to do to investigate the problem further?
>
>
> Could you at least test 2.6.12-rc1? Otherwise we might be looking for a
> bug which isn't there.
We'll do that, but it will take some time: the server requirements are such
that we cannot easily set up yet another instance. We don't have that many
32GB-RAM 4-way Opterons :-)
Jim Nance wrote:
>>We are using that pipe, which is known only to the same one process, to
>>cause select() to return immediately if a signal (SIGUSR1) had been
>>delivered to the process (by another process), there's a signal handler
>>installed that does nothing but a (non-blocking) write of 1 byte to the
>>writing end of the pipe.
>
>
> I'm not sure if this is what is causing your problem, but shouldn't you
> be doing a blocking write? It may be that the pipe is not writeable
> at the moment the signal arrives. I think that could cause the symptoms
> you describe.
If the pipe wasn't writeable at the time the signal handler tried to
write a byte, that would mean there were already N (probably 4096) bytes in
the pipe, which would cause select() to fall through anyway. The purpose of
the pipe is not to count signal deliveries, but only to contain "something"
whenever there has been a reason for select() to fall through.
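The argument can be checked with a small sketch in C: fill a pipe with non-blocking writes until the kernel refuses more, then ask select() whether the read end is readable. The helper names fill_pipe_nonblocking and readable_now are made up for the example, and the actual pipe capacity depends on the kernel, so the sketch does not assume a particular N.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

/* Write single bytes to wfd in non-blocking mode until the pipe is full.
   Returns the number of bytes accepted before EAGAIN, or -1 on error. */
static long fill_pipe_nonblocking(int wfd)
{
    long total = 0;
    int fl;

    assert(wfd >= 0);
    fl = fcntl(wfd, F_GETFL);
    if (fl == -1 || fcntl(wfd, F_SETFL, fl | O_NONBLOCK) == -1)
        return -1;
    for (;;) {
        ssize_t n = write(wfd, "x", 1);
        if (n == 1) {
            total++;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;   /* pipe is full: this is the case in question */
        return -1;
    }
}

/* Poll rfd with a zero timeout.
   Returns 1 if select() reports it readable, 0 if not, -1 on error. */
static int readable_now(int rfd)
{
    fd_set readfds;
    struct timeval tv = { 0, 0 };

    assert(rfd >= 0);
    FD_ZERO(&readfds);
    FD_SET(rfd, &readfds);
    return select(rfd + 1, &readfds, NULL, NULL, &tv);
}
```

On a fresh pipe, readable_now() on the read end should report 0; after fill_pipe_nonblocking() on the write end it should report 1, so a failed write in the handler never leaves the pipe empty.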
Regards,
Lutz Vieweg
* Re: select() not returning though pipe became readable
From: Lutz Vieweg @ 2005-03-31 16:14 UTC
To: linux-kernel; +Cc: Andrew Morton
Lutz Vieweg <lutz.vieweg@is-teledata.com> wrote:
>I'm currently investigating the following problem, which seems to indicate
>a misbehaviour of the kernel:
>
>A server software we implemented is sporadically "hanging" in a select()
>call since we upgraded from kernel 2.4 to (currently) 2.6.9 (we have to wait
>for 2.6.12 before we can upgrade again due to the shared-mem-not-dumped-into-
>core-files problem addressed there).
>
>What's suspicious is that whenever we attach with gdb to such a hanging process,
>we can see that a pipe, whose file-descriptor is definitely included in the
>fd_set "readfds" (and "n" is also high enough) has a byte in it available for
>reading - and just leaving gdb again is enough to let the server continue just
>fine.
>
>We are using that pipe, which is known only to the same one process, to cause
>select() to return immediately if a signal (SIGUSR1) had been delivered to the
>process (by another process), there's a signal handler installed that does
>nothing but a (non-blocking) write of 1 byte to the writing end of the pipe.
>
>This mechanism worked fine before kernel 2.6, and it is still working in 99.99% of
>the cases, but under heavy load, every few hours, we'll see the hanging select()
>as mentioned above.
Following up on my own message (yes, still using kernel 2.6.9; we will try 2.6.12 later,
but I wanted to share the latest results of my investigation nevertheless):
We found that when the server process hangs inside the select() call, the
kernel's signal state for the process indicates a situation where select() should indeed return:
The result of
> ps -eo cmd,pid,sig_pend,sig_block,sig_catch,sig_ignore
for the hanging process is:
CMD             PID    SIGNAL            BLOCKED           CATCHED           IGNORED
./csn io_child  10972  0000000000000200  0000000000000000  000000001181764b  0000000000000000
which means that SIGUSR1 is known to be pending (and of course SIGUSR1 is also caught,
as there's a signal handler installed as described above).
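For readers who want to check how the masks are read: in these hexadecimal masks, bit (n - 1) stands for signal number n, and on x86 and x86-64 Linux SIGUSR1 is signal 10, i.e. bit 9, i.e. 0x200. A minimal sketch in C that decodes such a mask follows; the helper names mask_has_signal and print_mask are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit (signo - 1) of the mask corresponds to signal number signo. */
static int mask_has_signal(uint64_t mask, int signo)
{
    assert(signo >= 1 && signo <= 64);
    return (int)((mask >> (signo - 1)) & 1u);
}

/* Print the numbers of all signals whose bit is set in the mask. */
static void print_mask(const char *label, uint64_t mask)
{
    printf("%s:", label);
    for (int signo = 1; signo <= 64; signo++)
        if (mask_has_signal(mask, signo))
            printf(" %d", signo);
    printf("\n");
}
```

With the values above, mask_has_signal(0x0000000000000200, 10) and mask_has_signal(0x000000001181764b, 10) are both true, which matches "pending" and "caught" for SIGUSR1; print_mask() can be used to list the other caught signals.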
Correct me if I'm wrong, but isn't it a clear sign of something being wrong
with select() if it does not return in this situation?
Sending the hanging process another "kill -s SIGUSR1 10972" does not change the
situation: the process keeps hanging and the values printed above do not change.
Sending a different signal or attaching/detaching gdb causes select() to return,
with the pending value returning to 0 as expected.
So my suspicion is that there's a race condition where select() goes to sleep
even though SIGUSR1 arrives at just that moment.
I will follow up once we have been able to upgrade to 2.6.12 or have significant news.
I'm thankful for any ideas on this issue at any time.
Regards,
Lutz Vieweg