* recv() hangs until SIGCHLD ?
@ 2008-10-10 13:30 Nicolas Cannasse
2008-10-10 19:17 ` Stephen Hemminger
0 siblings, 1 reply; 10+ messages in thread
From: Nicolas Cannasse @ 2008-10-10 13:30 UTC (permalink / raw)
To: linux-kernel
Hi,
We've been tracking a bug in our server application for some time now,
and now that we have isolated it we're stuck without a meaningful
explanation. We hope someone will be able to give us some answers.
We run a multithreaded application that uses pthreads and sockets. One
thread calls accept() and then dispatches the socket to one of the
worker threads, which processes it. A socket is therefore never used by
several threads simultaneously.
In some rare cases, one (or several) threads hang in recv(). Both
lsof and ls /proc/<pid>/fd show that the socket is in the ESTABLISHED
state, but when checking on the host to which it is connected (a MySQL
DB) we can't find the corresponding client socket (as it has already
been closed on the other side).
We are using the Boehm GC, which uses the signals SIGXCPU and SIGPWR to
pause and restart the threads when running a GC cycle. We correctly
handle EINTR in send() and recv() by restarting the call in case it
gets interrupted this way.
However, when attaching GDB to our blocked thread, it seems that even
when the GC runs, recv() does not return (the breakpoint after it is not
reached). If we send SIGCHLD to the hanging thread with GDB, recv() does
return and the thread is correctly unblocked. If we don't, it will hang
forever.
Additional details: recv() is using MSG_NOSIGNAL and we have enabled
TCP_NODELAY on the socket using setsockopt(). Some other
non-multithreaded apps are using the same databases and this behavior
does not occur for them.
Any idea how we can stop this from happening, or what additional things
we can check to get more information on what's occurring?
Thanks a lot,
Nicolas
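For readers following along, the socket settings described in the message above can be sketched as follows. This is a minimal illustration, not the application's actual code: the helper names enable_nodelay, nodelay_enabled and send_nosignal are hypothetical. MSG_NOSIGNAL is documented for send(), where it suppresses SIGPIPE, so the sketch passes it only there.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: disable Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_nodelay(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Hypothetical helper: read the option back.
 * Returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int nodelay_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
        return -1;
    return val != 0;
}

/* Hypothetical helper: send without raising SIGPIPE if the peer is gone;
 * a broken connection is reported as -1 with errno == EPIPE instead. */
ssize_t send_nosignal(int fd, const void *buf, size_t len)
{
    return send(fd, buf, len, MSG_NOSIGNAL);
}
```

Reading the option back with getsockopt() is a cheap way to confirm that the setting really took effect on the descriptor being debugged.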
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: recv() hangs until SIGCHLD ?
2008-10-10 13:30 recv() hangs until SIGCHLD ? Nicolas Cannasse
@ 2008-10-10 19:17 ` Stephen Hemminger
2008-10-11 8:28 ` Nicolas Cannasse
0 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2008-10-10 19:17 UTC (permalink / raw)
To: Nicolas Cannasse; +Cc: linux-net, linux-kernel
On Fri, 10 Oct 2008 15:30:01 +0200
Nicolas Cannasse <ncannasse@motion-twin.com> wrote:
> Hi,
>
> We've been tracking a bug in our server application for some time now,
> and now that we could isolate it we're stuck without a meaningful
> explanation. Hope somehow would be able to give use some answers.
>
> We run a multithread application which is using pthreads and sockets. A
> thread uses accept() then dispatch the socket to one of the workers
> threads that process it. Sockets are then not used simultaneously by
> several threads.
>
> In some rare cases, one (or several) threads are hanging in recv(). Both
> lsof and ls /proc/<pid>/fd show that the socket used is in ESTABLISHED
> mode but when checking on the host on which it's connected (a mysql DB)
> we can't find the corresponding client socket (as it's been closed
> already on the other side).
>
> We are using the Boehm GC which uses the signals SIGXCPU and SIGPWR to
> pause+restart the threads when running a GC cycle. We are correctly
> handling EINTR in send() and recv() by restarting the call in case they
> get interrupted this way.
>
> However, when attaching GDB to our locked thread it seems that even when
> the GC runs, recv() does not exit (the breakpoint after it is not
> reached). If we send SIGCHLD to the hanging thread with GDB, recv() does
> exit and the thread is correctly unlocked. If we don't, it will hang
> forever.
>
> Additional details : recv() is using MSG_NOSIGNAL and we have enabled
> TCP_NODELAY on the socket by using setsockopt. Some other
> not-multithreaded apps are using the same Databases and this behavior
> does not occur for them.
>
> Any idea how we can stop this from happening or what additional things
> we can check to get more informations on what's occurring ?
>
> Thanks a lot,
> Nicolas
Look at the receive queue length (Recv-Q) with ss or netstat for the
socket of the hung thread. It will show whether there is anything that
thread could read.
If there is data and the thread didn't wake up, then that is a libc or
kernel problem; but if there is no data, then look for cases where
earlier interrupted I/O actually consumed the data already, or blame the
sending process, not the receiver.
Also, are the sockets blocking or non-blocking?
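Both questions above can also be answered from inside the process. The following is a minimal sketch, assuming Linux; the helper names bytes_queued and is_nonblocking are hypothetical. FIONREAD reports how many bytes are waiting to be read on the descriptor, which is roughly what the Recv-Q column shows for an established TCP socket, and F_GETFL reveals whether O_NONBLOCK is set.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: number of bytes queued for reading on fd,
 * or -1 on error. */
int bytes_queued(int fd)
{
    int n = 0;
    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}

/* Hypothetical helper: 1 if fd is in non-blocking mode, 0 if it is
 * blocking, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}
```

Calling these from GDB on the hung descriptor (or logging them just before recv()) gives the same information as ss or netstat without having to match sockets to threads by hand.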
* RE: recv() hangs until SIGCHLD ?
2008-10-10 16:43 Nicolas Cannasse
@ 2008-10-11 4:48 ` David Schwartz
2008-10-11 9:30 ` Samuel Thibault
0 siblings, 1 reply; 10+ messages in thread
From: David Schwartz @ 2008-10-11 4:48 UTC (permalink / raw)
To: linux-kernel
Nicolas Cannasse wrote:
> In some rare cases, one (or several) threads are hanging in recv().
> Both lsof and ls /proc/<pid>/fd show that the socket used is in
> ESTABLISHED mode but when checking on the host on which it's connected
> (a mysql DB) we can't find the corresponding client socket (as it's
> been closed already on the other side).
Blocking sockets will block until data is received. If no other thread is
sending data, this can block forever.
> We are using the Boehm GC which uses the signals SIGXCPU and SIGPWR to
> pause+restart the threads when running a GC cycle. We are correctly
> handling EINTR in send() and recv() by restarting the call in case
> they get interrupted this way.
>
> However, when attaching GDB to our locked thread it seems that even
> when the GC runs, recv() does not exit (the breakpoint after it is not
> reached). If we send SIGCHLD to the hanging thread with GDB, recv()
> does exit and the thread is correctly unlocked. If we don't, it will
> hang forever.
Why shouldn't it hang forever? What was supposed to wake it that's not?
> Any idea how we can stop this from happening or what additional things
> we can check to get more informations on what's occurring ?
You say a thread is hanging in receive and not returning. But you've yet to
explain why it should return. Was it interrupted by a signal? Was data
received? Is the socket non-blocking? Why isn't this expected behavior?
Blocking sockets block, full stop.
DS
* Re: recv() hangs until SIGCHLD ?
2008-10-10 19:17 ` Stephen Hemminger
@ 2008-10-11 8:28 ` Nicolas Cannasse
2008-10-11 12:20 ` David Schwartz
2008-10-13 8:31 ` Nicolas Cannasse
0 siblings, 2 replies; 10+ messages in thread
From: Nicolas Cannasse @ 2008-10-11 8:28 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: linux-net, linux-kernel
>> We run a multithread application which is using pthreads and sockets. A
>> thread uses accept() then dispatch the socket to one of the workers
>> threads that process it. Sockets are then not used simultaneously by
>> several threads.
>>
>> In some rare cases, one (or several) threads are hanging in recv(). Both
>> lsof and ls /proc/<pid>/fd show that the socket used is in ESTABLISHED
>> mode but when checking on the host on which it's connected (a mysql DB)
>> we can't find the corresponding client socket (as it's been closed
>> already on the other side).
>>
>> We are using the Boehm GC which uses the signals SIGXCPU and SIGPWR to
>> pause+restart the threads when running a GC cycle. We are correctly
>> handling EINTR in send() and recv() by restarting the call in case they
>> get interrupted this way.
>>
>> However, when attaching GDB to our locked thread it seems that even when
>> the GC runs, recv() does not exit (the breakpoint after it is not
>> reached). If we send SIGCHLD to the hanging thread with GDB, recv() does
>> exit and the thread is correctly unlocked. If we don't, it will hang
>> forever.
>>
>> Additional details : recv() is using MSG_NOSIGNAL and we have enabled
>> TCP_NODELAY on the socket by using setsockopt. Some other
>> not-multithreaded apps are using the same Databases and this behavior
>> does not occur for them.
>>
>> Any idea how we can stop this from happening or what additional things
>> we can check to get more informations on what's occurring ?
>>
>> Thanks a lot,
>> Nicolas
>
> Look at Receive queue length with ss or netstat for the hung thread. It will
> show if there is anything that thread could read.
>
> If there is data and the thread didn't wake up then that is a libc or kernel problem;
> but if there is no data, then look for cases where earlier interrupted io actually
> consumed the data already or blame the sending process not the receiver.
> Also are the sockets blocking or non-blocking?
The sockets are non-blocking.
Checking with netstat and ss I can confirm that both Send and Recv
queues are empty, which makes the recv() behavior consistent.
However since this problem does not occur without threads, we can be
sure that the blame is still on the receiver.
In a practical case, we have a thread blocked in recv() for more than 12
hours, which is way beyond the timeout of the sender connection. The
socket has already been closed by the sender, so recv() should at least
notice this and return 0.
Is it safe to assume that when either send() or recv() is interrupted
by a signal and returns EINTR, no actual data has been either sent or
consumed? And if it's not, is there any other way around this?
Best,
Nicolas
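On the "any other way around this" question: one commonly used safeguard, which nobody in the thread proposes and which treats the symptom rather than the cause, is to bound how long a blocking recv() may wait. The sketch below assumes the standard SO_RCVTIMEO socket option; set_recv_timeout is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: limit how long a blocking recv() on fd may wait.
 * Once the timeout expires with no data, recv() returns -1 with errno
 * set to EAGAIN or EWOULDBLOCK instead of blocking forever.
 * Returns 0 on success, -1 on error. */
int set_recv_timeout(int fd, long seconds, long microseconds)
{
    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = microseconds;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

With a timeout somewhat longer than the server-side idle timeout, a worker stuck on a connection the peer has already dropped would get control back and could close the socket.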
* Re: recv() hangs until SIGCHLD ?
2008-10-11 4:48 ` David Schwartz
@ 2008-10-11 9:30 ` Samuel Thibault
0 siblings, 0 replies; 10+ messages in thread
From: Samuel Thibault @ 2008-10-11 9:30 UTC (permalink / raw)
To: David Schwartz; +Cc: linux-kernel
David Schwartz, on Fri 10 Oct 2008 21:48:45 -0700, wrote:
> > We are using the Boehm GC which uses the signals SIGXCPU and SIGPWR to
> > pause+restart the threads when running a GC cycle.
> > [...]
> >
> > However, when attaching GDB to our locked thread it seems that even
> > when the GC runs, recv() does not exit (the breakpoint after it is not
> > reached).
>
> But you've yet to explain why it should return. Was it interrupted by
> a signal?
See quote above.
> Was data received?
No
> Is the socket non-blocking?
No
> Why isn't this expected behavior? Blocking sockets block, full stop.
But using a signal is a common technique to get it unblocked,
since SUS says
« [EINTR] The recv() function was interrupted by a signal that was
caught, before any data was available. »
Samuel
* RE: recv() hangs until SIGCHLD ?
2008-10-11 8:28 ` Nicolas Cannasse
@ 2008-10-11 12:20 ` David Schwartz
2008-10-12 15:47 ` Stephen Hemminger
2008-10-13 8:31 ` Nicolas Cannasse
1 sibling, 1 reply; 10+ messages in thread
From: David Schwartz @ 2008-10-11 12:20 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: linux-net, linux-kernel
Nicolas Cannasse wrote:
> The sockets are non-blocking.
Ouch, that's a serious bug. Non-blocking operations shouldn't block!
> Checking with netstat and ss I can confirm that both Send and Recv
> queues are empty, which makes the recv() behavior consistent.
>
> However since this problem does not occur without threads, we can be
> sure that the blame is still on the receiver.
>
> In a practical case, we have a thread blocked in recv() for more than 12
> hours, which is way beyond the timeout of the sender connection. The
> socket has already been closed by the sender so recv() should at least
> be noticed and returns 0.
Can you clarify what you mean by "the socket has already been closed by the
sender"? You mean the other end of the TCP connection shut it down? By "the
socket", you don't mean the socket you called 'recv' on, right? You mean the
socket on the other end that's connected to it?
> Is it safe to assume that when either send() or recv() get interrupted
> by a signal and returns EINTR, no actual data has been either sent or
> consumed ? And if it's not, is there any other way around this ?
EINTR can only be returned if 'send' or 'recv' has not sent or received
anything. Otherwise the connection would be left in an indeterminate state.
DS
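A practical consequence of the rule above is that a call interrupted after some bytes were transferred returns a short count rather than EINTR, so a sender has to handle both cases. A minimal sketch, with the hypothetical helper name send_all:

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: send exactly len bytes on a blocking socket.
 * Returns 0 on success, -1 on error (errno is set by send). */
int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
        if (n == -1) {
            if (errno == EINTR)
                continue; /* nothing was sent by this call: retry as is */
            return -1;
        }
        p += n;           /* partial send: skip what already went out */
        len -= (size_t)n;
    }
    return 0;
}
```

The receive side needs the mirror image only when the protocol expects a fixed number of bytes; a single recv() returning fewer bytes than requested is normal for stream sockets.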
* Re: recv() hangs until SIGCHLD ?
2008-10-11 12:20 ` David Schwartz
@ 2008-10-12 15:47 ` Stephen Hemminger
0 siblings, 0 replies; 10+ messages in thread
From: Stephen Hemminger @ 2008-10-12 15:47 UTC (permalink / raw)
To: davids; +Cc: linux-net, linux-kernel
On Sat, 11 Oct 2008 05:20:37 -0700
"David Schwartz" <davids@webmaster.com> wrote:
>
> Nicolas Cannasse wrote:
>
> > The sockets are non-blocking.
>
> Ouch, that's a serious bug. Non-blocking operations shouldn't block!
>
> > Checking with netstat and ss I can confirm that both Send and Recv
> > queues are empty, which makes the recv() behavior consistent.
> >
> > However since this problem does not occur without threads, we can be
> > sure that the blame is still on the receiver.
> >
> > In a practical case, we have a thread blocked in recv() for more than 12
> > hours, which is way beyond the timeout of the sender connection. The
> > socket has already been closed by the sender so recv() should at least
> > be noticed and returns 0.
>
> Can you clarify what you mean by "the socket has already been closed by the
> sender"? You mean the other end of the TCP connection shut it down? By "the
> socket", you don't mean the socket you called 'recv' on, right? You mean the
> socket on the other end that's connected to it?
>
> > Is it safe to assume that when either send() or recv() get interrupted
> > by a signal and returns EINTR, no actual data has been either sent or
> > consumed ? And if it's not, is there any other way around this ?
>
> EINTR can only be return if 'send' or 'recv' have not sent or received
> anything. Otherwise the connection would be left in an indeterminate state.
Does the application correctly handle the case where recv() returns 0?
This indicates that the TCP connection was closed by the other end.
It is incorrect to assume that a return of 0 in non-blocking mode
is the same as -1. The only correct action after receiving 0 bytes
(even in non-blocking mode) is to close the socket. If you attempt
to do another receive, the result could be that recv() waits for
another event (more data or FIN), which can never happen since the
socket is closed.
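The handling described above can be sketched as a single read step that separates "data", "peer closed" and "error", and closes the descriptor in the last two cases so that no further recv() is attempted on it. The names read_status and read_once are hypothetical, and the sketch assumes len is greater than zero (a zero-length request also returns 0).

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum read_status { READ_DATA, READ_PEER_CLOSED, READ_ERROR };

/* Hypothetical helper: one blocking read, retried on EINTR.
 * On READ_DATA, *nread holds the byte count. On READ_PEER_CLOSED or
 * READ_ERROR the descriptor has been closed and must not be reused. */
enum read_status read_once(int fd, void *buf, size_t len, ssize_t *nread)
{
    ssize_t n;

    do {
        n = recv(fd, buf, len, 0);
    } while (n == -1 && errno == EINTR);

    *nread = n;
    if (n > 0)
        return READ_DATA;

    close(fd); /* 0: orderly shutdown by the peer; -1: real error */
    return n == 0 ? READ_PEER_CLOSED : READ_ERROR;
}
```

Keeping the close() next to the check makes it harder for a caller to treat a return of 0 as "try again later".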
* Re: recv() hangs until SIGCHLD ?
2008-10-11 8:28 ` Nicolas Cannasse
2008-10-11 12:20 ` David Schwartz
@ 2008-10-13 8:31 ` Nicolas Cannasse
2008-10-13 15:02 ` Nicolas Cannasse
1 sibling, 1 reply; 10+ messages in thread
From: Nicolas Cannasse @ 2008-10-13 8:31 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: linux-net, linux-kernel
>> If there is data and the thread didn't wake up then that is a libc or
>> kernel problem;
>> but if there is no data, then look for cases where earlier interrupted
>> io actually
>> consumed the data already or blame the sending process not the receiver.
>> Also are the sockets blocking or non-blocking?
>
> The sockets are non-blocking.
Sorry, I made a typo here.
I meant to say that the sockets ARE blocking (the default behavior).
> In a practical case, we have a thread blocked in recv() for more than 12
> hours, which is way beyond the timeout of the sender connection. The
> socket has already been closed by the sender so recv() should at least
> be noticed and returns 0.
To provide more information:
Running lsof on the receiver, we can see that it has several ESTABLISHED
sockets connected to a given host/sender. Running lsof on the host does
not show any socket connected to the receiver (since they have been
closed due to a timeout).
Also, the application correctly handles a return value of 0.
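The logic is roughly the following (a C sketch; recv_retry is an
illustrative name):
    #include <errno.h>
    #include <sys/socket.h>
    /* Restart recv() whenever a signal interrupts it (EINTR). */
    ssize_t recv_retry(int fd, void *buf, size_t len, int flags)
    {
        ssize_t ret;
        do {
            ret = recv(fd, buf, len, flags);
        } while (ret == -1 && errno == EINTR);
        return ret; /* 0: peer closed; -1: error other than EINTR */
    }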
Then, at the higher level, if we get an error (ret <= 0), we close the
socket.
At first we were using libmysqlclient, but since we had the bug with
it we wrote our own MySQL client so that we can more easily check what
is occurring. The same bug seems to occur with both implementations.
Best,
Nicolas
* Re: recv() hangs until SIGCHLD ?
2008-10-13 8:31 ` Nicolas Cannasse
@ 2008-10-13 15:02 ` Nicolas Cannasse
0 siblings, 0 replies; 10+ messages in thread
From: Nicolas Cannasse @ 2008-10-13 15:02 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: linux-net, linux-kernel
Nicolas Cannasse wrote:
>>> If there is data and the thread didn't wake up then that is a libc or
>>> kernel problem;
>>> but if there is no data, then look for cases where earlier
>>> interrupted io actually
>>> consumed the data already or blame the sending process not the receiver.
>>> Also are the sockets blocking or non-blocking?
One other thing:
We tried calling poll() with POLLIN on the socket before entering recv().
poll() does return (and we loop when it fails with EINTR), but
after that recv() blocks indefinitely.
Hope that helps,
Nicolas
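The poll-then-read pattern described above can be sketched as follows. This is not the poster's code: poll_then_recv is a hypothetical name, and the sketch adds MSG_DONTWAIT (available on Linux) so that the recv() after a successful poll() cannot block even if the readiness it reported has gone away in the meantime.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read without blocking. Returns the number of bytes read
 * (0 means the peer closed), -1 on error, or -2 on timeout. */
ssize_t poll_then_recv(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    ssize_t n;
    int rc;

    /* Note: retrying after EINTR restarts the full timeout. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);
    if (rc == -1)
        return -1;
    if (rc == 0)
        return -2;

    /* MSG_DONTWAIT makes this one call non-blocking: if nothing is
     * readable after all, it fails with EAGAIN instead of hanging. */
    do {
        n = recv(fd, buf, len, MSG_DONTWAIT);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

If recv() with MSG_DONTWAIT reports EAGAIN right after poll() signalled POLLIN, that would suggest the data was consumed elsewhere between the two calls, which is worth logging along with the revents value.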
end of thread, other threads:[~2008-10-13 15:03 UTC | newest]
Thread overview: 10+ messages
2008-10-10 13:30 recv() hangs until SIGCHLD ? Nicolas Cannasse
2008-10-10 19:17 ` Stephen Hemminger
2008-10-11 8:28 ` Nicolas Cannasse
2008-10-11 12:20 ` David Schwartz
2008-10-12 15:47 ` Stephen Hemminger
2008-10-13 8:31 ` Nicolas Cannasse
2008-10-13 15:02 ` Nicolas Cannasse
-- strict thread matches above, loose matches on Subject: below --
2008-10-10 16:43 Nicolas Cannasse
2008-10-11 4:48 ` David Schwartz
2008-10-11 9:30 ` Samuel Thibault