Re: dccp bugs


* Re: dccp bugs
From: Tomasz Grobelny @ 2008-03-23 20:31 UTC
  To: dccp

One more thing I noticed when using dccp...
I have a server which accepts a connection, receives data and finishes
execution, and a client that sends 1000 data packets and finishes execution
(see http://dccp.one.pl/svn/userspace/test/). When I run ./server and then
./client, packets are sent, but the client program finishes execution only
after all packets from the queue are sent, which I guess is quite ok. The
problem happens when I kill the server program while the client is running
and sending packets. The client detects that the connection is broken, and
the sendmsg call starts returning error 32. But after it finishes sending
packets it hangs on exit, and even kill -9 doesn't work. It finishes after
quite a long time (e.g. 10 minutes). Am I doing something wrong or is it a
bug in dccp? Tested on loopback with rate limiting (sudo tc qdisc add dev lo
root handle 1:0 tbf rate 3kbit burst 3kbit latency 500ms). With rate limiting
turned off I don't see any problems. Testing between two virtual machines
with rate limiting on shows the same problem.
-- 
Regards,
Tomasz Grobelny

* Re: dccp bugs
From: Gerrit Renker @ 2008-03-24 10:11 UTC
  To: dccp

| One more thing I noticed when using dccp...
| I have a server which accepts connection, receives data and finishes execution 
| and a client that sends 1000 data packets and finishes execution (see 
| http://dccp.one.pl/svn/userspace/test/). When I run ./server then ./client 
| packets are sent but client program finishes executions only after all 
| packets from queue are sent. Which is I guess quite ok. The problem happens 
| when I kill the server program while client is running and sending packets. 
| The client detects that the connection is broken and starts returning error 
| 32 from sendmsg call.
Error 32 is EPIPE ("broken pipe"), so this looks correct.
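
For anyone who wants to check the errno mapping, here is a minimal sketch (not taken from the test programs in this thread). The helper names `describe_errno` and `send_all` are made up for illustration, and the value 32 for EPIPE is assumed to be the Linux one. The sender stops at the first EPIPE, which is only one possible way to react.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def send_all(sock, packets):
    """Send packets until done or until the peer is gone (EPIPE).

    Returns the number of packets handed to the kernel. `sock` is any
    object with a send() method; the thread assumes a DCCP socket.
    """
    sent = 0
    for payload in packets:
        try:
            sock.send(payload)
        except OSError as exc:
            if exc.errno == errno.EPIPE:
                break  # connection is broken; further sends are futile
            raise
        sent += 1
    return sent


if __name__ == "__main__":
    print(describe_errno(32))
```

Stopping at the first EPIPE keeps the write queue from filling up with packets that can no longer be delivered.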

| But after it finishes sending packets it hangs on exit 
| and even kill -9 doesn't work. It finishes after quite a long time (eg. 10 
| minutes). Am I doing something wrong or is it a bug in dccp? Tested on 
| loopback with rate limiting (sudo tc qdisc add dev lo root handle 1:0 tbf 
| rate 3kbit burst 3kbit latency 500ms). With rate limiting turned off I don't 
| see any problems. Testing between two virtual machines with rate limiting on 
| shows the same problem.
| -- 
Can you try the `ss' command from the iproute package when the problem
occurs, using `ss -nadep' to display the DCCP states?

DCCP is connection-oriented, so killing a server/client is different
from UDP. When you try to kill a DCCP node, it will first try to finish
its connection. The `hang' effect is most likely due to an uncompleted
system call such as close(), and it is in a non-interruptible state.

What is far more important to know - are you using a standard kernel, a
netdev kernel, or the test tree? And from what you describe, I suspect
you are using CCID-3 - does the same problem happen with CCID-2?

I am aware of at least two patches which may remedy the problem you
encountered: the patch to clean up the write queue on (forced)
disconnect, and the wait-for-ccid cleanup routine, which flushes the
write queue at the end of the connection.

There are also sysctls to reduce the number of attempts to repeat a
(futile) close at the end, in Documentation/networking/dccp.txt

* Re: dccp bugs
From: Tomasz Grobelny @ 2008-03-24 13:07 UTC
  To: dccp

On Monday 24 March 2008, Gerrit Renker wrote:
> | One more thing I noticed when using dccp...
> | I have a server which accepts connection, receives data and finishes
> | execution and a client that sends 1000 data packets and finishes
> | execution (see http://dccp.one.pl/svn/userspace/test/). When I run
> | ./server then ./client packets are sent but client program finishes
> | executions only after all packets from queue are sent. Which is I guess
> | quite ok. The problem happens when I kill the server program while client
> | is running and sending packets. The client detects that the connection is
> | broken and starts returning error 32 from sendmsg call.
>
> Error 32 is EPIPE ("broken pipe"), so this looks correct.
>
> | But after it finishes sending packets it hangs on exit
> | and even kill -9 doesn't work. It finishes after quite a long time (eg.
> | 10 minutes). Am I doing something wrong or is it a bug in dccp? Tested on
> | loopback with rate limiting (sudo tc qdisc add dev lo root handle 1:0 tbf
> | rate 3kbit burst 3kbit latency 500ms). With rate limiting turned off I
> | don't see any problems. Testing between two virtual machines with rate
> | limiting on shows the same problem.
> | --
>
> Can you try the `ss' command from the iproute package when the problem
> occurs, using `ss -nadep' to display the DCCP states?
>
$ ss -nadep
State       Recv-Q Send-Q   Local Address:Port   Peer Address:Port
FIN-WAIT-1  0      0        127.0.0.1:2008       127.0.0.1:29792  ino:0 sk:d301d3c0

> DCCP is connection-oriented, so killing a server/client is different
> from UDP. When you try to kill a DCCP node, it will first try to finish
> its connection. The `hang' effect is most likely due to an uncompleted
> system call such as close(), and it is in a non-interruptible state.
>
10 minutes for an uninterruptible call seems to be quite a long time. If I 
were a system administrator it would probably drive me mad.

> What is far more important to know - are you using a standard kernel, a
> netdev kernel, or the test tree? And from what you describe, I suspect
> you are using CCID-3 - does the same problem happen with CCID-2?
>
I'm using a not-so-fresh DCCP experimental tree. Tested with CCID-2, but the
same thing happens with CCID-3.

> I am aware that there is at least one patch which may remedy the problem
> you encountered, which is the patch to clean up the write queue on
> (forced) disconnect, also the wait-for-ccid cleanup routine which
> flushes the write queue at the end of the connection.
>
Is it in the experimental tree?

> There are also sysctls to reduce the number of attempts to repeat a
> (futile) close at the end, in Documentation/networking/dccp.txt
You mean the *retries* entries? Setting all three to 1 doesn't make it any 
better.

And one more thing: if I try to interrupt the client program before it reaches 
its end all is fine - the program finishes execution immediately.
-- 
Regards,
Tomasz Grobelny

* Re: dccp bugs
From: Gerrit Renker @ 2008-03-24 14:04 UTC
  To: dccp

| > Can you try the `ss' command from the iproute package when the problem
| > occurs, using `ss -nadep' to display the DCCP states?
| >
| $ ss -nadep
| State       Recv-Q Send-Q                        Local Address:Port                          
| Peer Address:Port
| FIN-WAIT-1  0      0                                 127.0.0.1:2008                             
| 127.0.0.1:29792  ino:0 sk:d301d3c0
| 
I almost expected that it would be this state. I have also encountered
this when aborting applications in an unexpected way.

The state FIN-WAIT-1 maps to ACTIVE_CLOSEREQ, i.e. the server seems
to be the one running on port 2008; it has sent a CloseReq, asking the client
to terminate the connection. The DCCP spec says that the CloseReq must be
retransmitted (RFC 4340, 8.3.1):

   Endpoints in the CLOSEREQ and CLOSING states MUST retransmit DCCP-
   CloseReq and DCCP-Close packets, respectively, until leaving those
   states.  The retransmission timer should initially be set to go off
   in two round-trip times and should back off to not less than once
   every 64 seconds if no relevant response is received.

Hence the implementation is according to the spec - the server will
retry until it gets the required DCCP-Close from the client, and only
then leave the CLOSEREQ state.
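
To get a feel for how such a retransmission timer adds up, here is a rough sketch. It assumes the timer starts at two round-trip times and doubles after every unanswered retransmission, capped at 64 seconds. The doubling is an assumption of this sketch: the RFC text quoted above only fixes the starting point and the 64-second bound. The function name `closereq_schedule` is hypothetical and this is not the kernel's timer code.

```python
def closereq_schedule(rtt, attempts, cap=64.0):
    """Return the wait (in seconds) before each retransmission.

    Assumption: the timer starts at 2 * RTT and doubles after every
    unanswered retransmission, never exceeding `cap` seconds.
    """
    waits = []
    timeout = min(2.0 * rtt, cap)
    for _ in range(attempts):
        waits.append(timeout)
        timeout = min(timeout * 2.0, cap)
    return waits


if __name__ == "__main__":
    waits = closereq_schedule(rtt=0.5, attempts=10)
    print(waits, sum(waits))
```

Under these assumptions, with an RTT of 0.5 s, ten unanswered retransmissions add up to 319 seconds, so a small number of retries can already mean minutes of waiting.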

| > DCCP is connection-oriented, so killing a server/client is different
| > from UDP. When you try to kill a DCCP node, it will first try to finish
| > its connection. The `hang' effect is most likely due to an uncompleted
| > system call such as close(), and it is in a non-interruptible state.
| >
| 10 minutes for an uninterruptible call seems to be quite a long time. If I 
| were a system administrator it would probably drive me mad.
| 
I think that there are cases in TCP where TCP is similarly pernicious.
Not least because DCCP uses the same sysctls (request_retries,
retries1, retries2).
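
A small sketch for inspecting the three retry sysctls named above. The directory /proc/sys/net/dccp/default is assumed from Documentation/networking/dccp.txt, and the helper name `read_dccp_retries` is made up. Entries that cannot be read (for example because the dccp module is not loaded) are reported as None.

```python
import os

# Assumed location of the DCCP sysctls, per Documentation/networking/dccp.txt.
DCCP_SYSCTL_DIR = "/proc/sys/net/dccp/default"
RETRY_SYSCTLS = ("request_retries", "retries1", "retries2")


def read_dccp_retries(base=DCCP_SYSCTL_DIR):
    """Return {name: int or None}; None means the entry could not be read."""
    values = {}
    for name in RETRY_SYSCTLS:
        try:
            with open(os.path.join(base, name)) as handle:
                values[name] = int(handle.read().strip())
        except (OSError, ValueError):
            values[name] = None  # module not loaded, or no such entry
    return values


if __name__ == "__main__":
    print(read_dccp_retries())
```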

| > I am aware that there is at least one patch which may remedy the problem
| > you encountered, which is the patch to clean up the write queue on
| > (forced) disconnect, also the wait-for-ccid cleanup routine which
| > flushes the write queue at the end of the connection.
| >
| Is it in experimental tree?
| 
Yes, these patches are all in the experimental tree. What they do is
purge the write queue on abnormal termination, and they ensure that
flushing the write queue at the end of a connection takes no longer than
the SO_LINGER time.

But as you already said that the problem happens with both CCIDs, it may
(or may not, I am not entirely sure) be that this does not help with the
long timeouts. It is definitely worth a try:
http://www.linux-foundation.org/en/Net:DCCP_Testing#Experimental_DCCP_source_tree	
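
Since the SO_LINGER time bounds how long the flush may take, here is a sketch of how an application could set it. A TCP socket stands in so the example runs on machines without DCCP; it is assumed, not verified here, that a DCCP socket takes the same SOL_SOCKET option. The helper names are hypothetical.

```python
import socket
import struct


def set_linger(sock, seconds):
    """Enable SO_LINGER so close() lingers for at most `seconds`."""
    # struct linger { int l_onoff; int l_linger; }
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, seconds))


def get_linger(sock):
    """Return (enabled, seconds) as currently set on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize("ii"))
    onoff, seconds = struct.unpack("ii", raw)
    return bool(onoff), seconds


if __name__ == "__main__":
    # TCP stand-in; a DCCP socket is assumed to accept the same option.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        set_linger(s, 5)
        print(get_linger(s))
```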

| > There are also sysctls to reduce the number of attempts to repeat a
| > (futile) close at the end, in Documentation/networking/dccp.txt
| You mean the *retries* entries? Setting all three to 1 doesn't make it any 
| better.
How to quantify `better'? Changing the values will not change the problem as
such, but it will reduce the timeout until the server gives up. And from
earlier tests (about one and a half years ago, when the sysctls were
activated) I recall that this worked correctly.

There are alternatives; the first relates to your comment below.
| And one more thing: if I try to interrupt the client program before it reaches 
| its end all is fine - the program finishes execution immediately.
| -- 
In this case the long timeout is avoided: the client sends either a Reset
(when it still has unread data, or when SO_LINGER with linger=0 is used) or
a Close (if it terminates cleanly). The server then closes directly and
does not enter CLOSEREQ, where it would be required to retransmit the
CloseReq.

So to get around the annoyance, killing the client first avoids these
long waiting times.
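
The rule just described can be written down as a toy decision function. This is only a model of the description above, not kernel code, and the name `client_close_packet` and its parameters are invented for the example.

```python
def client_close_packet(has_unread_data, linger_enabled=False, linger_seconds=0):
    """Return the packet type a closing client sends, per the rule above."""
    if has_unread_data or (linger_enabled and linger_seconds == 0):
        return "Reset"  # abortive close
    return "Close"      # clean termination


if __name__ == "__main__":
    print(client_close_packet(has_unread_data=False))
    print(client_close_packet(False, linger_enabled=True, linger_seconds=0))
```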

(The other alternative is to enable the DCCP_SOCKOPT_SERVER_TIMEWAIT
 option (documented also in Documentation/networking/dccp.txt), where the
 server sends just a single close. But if here also the client dies
 before the server, the server would have to retransmit the Close also.)
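
A sketch of switching that option on from user space. Python's socket module does not export the DCCP constants, so the numeric values below are written out by hand from the Linux headers; treat them as assumptions and check them against your kernel. As far as dccp.txt describes it, the option is meant to be set on the socket returned by accept(), which is what the hypothetical helper `enable_server_timewait` expects.

```python
import socket

# Assumed values from <linux/socket.h> and <linux/dccp.h>; verify locally.
SOCK_DCCP = 6
IPPROTO_DCCP = 33
SOL_DCCP = 269
DCCP_SOCKOPT_SERVER_TIMEWAIT = 6


def enable_server_timewait(conn):
    """Ask the server side to send a Close and hold TIMEWAIT itself.

    `conn` is assumed to be a DCCP socket returned by accept().
    Returns True on success, False if the kernel rejects the option.
    """
    try:
        conn.setsockopt(SOL_DCCP, DCCP_SOCKOPT_SERVER_TIMEWAIT, 1)
    except OSError:
        return False
    return True
```

A server would call this right after accept(), before it starts reading from the connection.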

* Re: dccp bugs
From: Arnaldo Carvalho de Melo @ 2008-03-24 14:32 UTC
  To: dccp

On Mon, Mar 24, 2008 at 02:04:15PM +0000, Gerrit Renker wrote:
> | > Can you try the `ss' command from the iproute package when the problem
> | > occurs, using `ss -nadep' to display the DCCP states?
> | >
> | $ ss -nadep
> | State       Recv-Q Send-Q                        Local Address:Port                          
> | Peer Address:Port
> | FIN-WAIT-1  0      0                                 127.0.0.1:2008                             
> | 127.0.0.1:29792  ino:0 sk:d301d3c0
> | 
> I almost expected that it would be this state. I have also encountered
> this when aborting applications in a non-expected way.
> 
> The state FIN-WAIT-1 is mapped into ACTIVE_CLOSEREQ, i.e. the server seems
> to be the one running on port 2008; it has sent a CloseReq, asking the client
> to terminate the connection. The DCCP spec says that the CloseReq must be
> retransmitted (RFC 4340, 8.3.1).
> 
>    Endpoints in the CLOSEREQ and CLOSING states MUST retransmit DCCP-
>    CloseReq and DCCP-Close packets, respectively, until leaving those
>    states.  The retransmission timer should initially be set to go off
>    in two round-trip times and should back off to not less than once
>    every 64 seconds if no relevant response is received.
> 
> Hence the implementation is according to the spec - the server will
> retry until it gets the required DCCP-Close from the client, and only
> then leave the CLOSEREQ state.

IIRC Thomas said that the server process terminates, it's just the client
that hangs for a long time, lemme re-read the other messages...

<quote Thomas>
 When I run ./server then ./client packets are sent but client program
 finishes executions only after all packets from queue are sent. Which
 is I guess quite ok. The problem happens when I kill the server program
 while client is running and sending packets.  The client detects that
 the connection is broken and starts returning error 32 from sendmsg
 call. But after it finishes sending packets it hangs on exit and even
 kill -9 doesn't work. It finishes after quite a long time (eg.  10
 minutes).
</quote>

Thomas, is it the client that hangs? Or is it the server?

- Arnaldo

* Re: dccp bugs
From: Gerrit Renker @ 2008-03-24 16:02 UTC
  To: dccp

| > | > Can you try the `ss' command from the iproute package when the problem
| > | > occurs, using `ss -nadep' to display the DCCP states?
| > | >
| > | $ ss -nadep
| > | State       Recv-Q Send-Q                        Local Address:Port                          
| > | Peer Address:Port
| > | FIN-WAIT-1  0      0                                 127.0.0.1:2008                             
| > | 127.0.0.1:29792  ino:0 sk:d301d3c0
| > | 
<snip>
| 
| IIRC Thomas said that the server process terminates, its just the client
| that hangs for a long time, lemme re-read the other messages...
| 
| <quote Thomas>
|  When I run ./server then ./client packets are sent but client program
|  finishes executions only after all packets from queue are sent. Which
|  is I guess quite ok. The problem happens when I kill the server program
|  while client is running and sending packets.  The client detects that
|  the connection is broken and starts returning error 32 from sendmsg
|  call. But after it finishes sending packets it hangs on exit and even
|  kill -9 doesn't work. It finishes after quite a long time (eg.  10
|  minutes).
| </quote>
| 
This is a good point, but now it gets confusing. The FIN-WAIT-1/active
CLOSEREQ state can only happen on the server, so maybe the above output
is not the right one.

* Re: dccp bugs
From: Tomasz Grobelny @ 2008-03-25  0:03 UTC
  To: dccp

On Monday 24 March 2008, Arnaldo Carvalho de Melo wrote:
> IIRC Thomas said that the server process terminates, its just the client
> that hangs for a long time, lemme re-read the other messages...
Yes, you are right.

> Thomas, is it the client that hangs? Or is it the server?
>
It is the client (the program that does connect(...)) that hangs. The server 
(the one which does listen(...)) is already dead. Well, the server process 
finished execution but the socket is somehow busy: when I try to run the 
server once again (on the same port) and try to connect to it with another 
instance of the client program I always get this 32 return code from sendmsg 
call.
-- 
Regards,
Tomasz Grobelny

* Re: dccp bugs
From: Tomasz Grobelny @ 2008-03-25  0:13 UTC
  To: dccp

On Monday 24 March 2008, Gerrit Renker wrote:
> | > | > Can you try the `ss' command from the iproute package when the
> | > | > problem occurs, using `ss -nadep' to display the DCCP states?
> | > |
> | > | $ ss -nadep
> | > | State       Recv-Q Send-Q                        Local Address:Port
> | > | Peer Address:Port
> | > | FIN-WAIT-1  0      0                                 127.0.0.1:2008
> | > | 127.0.0.1:29792  ino:0 sk:d301d3c0
>
> <snip>
>
> This is a good point, but now it gets confusing. The FIN-WAIT-1/active
> CLOSEREQ state can only happen on the server, so maybe the above output
> is not the right one.
I checked it once again and it looks almost the same. But even if the server
is dead, its socket is not that dead. At least it cannot be reused by another
instance of the server.
-- 
Regards,
Tomasz Grobelny

* Re: dccp bugs
From: Gerrit Renker @ 2008-03-25 10:48 UTC
  To: dccp

Quoting Tomasz Grobelny:
| On Monday 24 March 2008, Arnaldo Carvalho de Melo wrote:
| > IIRC Thomas said that the server process terminates, its just the client
| > that hangs for a long time, lemme re-read the other messages...
| Yes, you are right.
| 
| > Thomas, is it the client that hangs? Or is it the server?
| >
| It is the client (the program that does connect(...)) that hangs. The server 
| (the one which does listen(...)) is already dead. Well, the server process 
| finished execution but the socket is somehow busy: when I try to run the 
| server once again (on the same port) and try to connect to it with another 
| instance of the client program I always get this 32 return code from sendmsg 
| call.
Ok, it is clearer now: the "somehow busy" is related to the `ss' output,
where the server tries to resend its CloseReq (as per other email).

Since it is the client that hangs and which sends packets, there is a
very good chance that the recent patches in the test tree will fix that
problem, in particular as it is related to 
 * aborting a client with a full write queue,
 * closing a client with a full write queue while the client can't send.

If possible, can you please try the test tree patches and see if the
problem persists? I am doing a lot of testing and have not observed any
similar errors in the test tree -- the closest being the server hang on
retransmitting CloseReq, which may account for the earlier confusion.

You need not pull from git; patches are available for recent standard kernels at:
http://www.erg.abdn.ac.uk/users/gerrit/dccp/testing_dccp/test-tree/

* Re: dccp bugs
From: Tomasz Grobelny @ 2008-03-29 17:00 UTC
  To: dccp

On Tuesday 25 March 2008, Gerrit Renker wrote:
> If possible, can you please try the test tree patches and see if the
> problem persists. I am doing a lot of testing and have not observed any
> similar errors in the test tree -- the closest being the server-hand on
> retransmitting CloseReq, which may account for the earlier confusion.
>
Yes, it works ok now. I was already using the test tree before but the wrong 
branch.
-- 
Regards,
Tomasz Grobelny

* Re: dccp bugs
From: Gerrit Renker @ 2008-03-31  6:41 UTC
  To: dccp

| > If possible, can you please try the test tree patches and see if the
| > problem persists. I am doing a lot of testing and have not observed any
| > similar errors in the test tree -- the closest being the server-hand on
| > retransmitting CloseReq, which may account for the earlier confusion.
| >
| Yes, it works ok now. I was already using the test tree before but the wrong 
| branch.
Thanks very much for testing. I have a suspicion that the fix is in one
of the following patches:

 [CCID]: Refine the wait-for-ccid mechanism        
 [DCCP]: Empty the write queue when disconnecting 

Both flush the write queue on exit. The first one limits the amount of
time that close() will allow to drain the write queue, the second patch
flushes the write queue when the connection is aborted.
