listen(2) backlog changes in or around Linux 3.1?

netdev.vger.kernel.org archive mirror

* listen(2) backlog changes in or around Linux 3.1?
From: enh @ 2012-10-12 23:40 UTC (permalink / raw)
  To: netdev

i used to use the following hack to unit test connect timeouts: i'd
call listen(2) on a socket and then deliberately connect (backlog + 3)
sockets without accept(2)ing any of the connections. (why 3? because
Stevens told me so, and experiment backed him up. see figure 4.10 in
his UNIX Network Programming.)

with "old" kernels, 2.6.35-ish to 3.0-ish, this worked great. my next
connect(2) to the same loopback port would hang indefinitely. i could
even unblock the connect by calling accept(2) in another thread. this
was awesome for testing.

in 3.1 on ARM, 3.2 on x86 (Ubuntu desktop), and 3.4 on ARM, this no
longer works. it doesn't seem to be as simple as "the constant is no
longer 3". my tests are now flaky. sometimes they work like they used
to, and sometimes an extra connect(2) will succeed. (or, if i'm in
non-blocking mode, my poll(2) will return with the non-blocking socket
that's trying to connect now ready.)

i'm guessing if this changed in 3.1 and is still changed in 3.4,
whatever's changed wasn't an accident. but i haven't been able to find
the right search terms to RTFM. i also finally got around to grepping
the kernel for the "+ 3", but wasn't able to find that. (so i'd be
interested to know where the old behavior came from too.)

my least worst workaround at the moment is to use one of RFC5737's
test networks, but that requires that the device have a network
connection, otherwise my connect(2)s fail immediately with
ENETUNREACH, which is no use to me. also, unlike my old trick, i've
got no way to suddenly "unblock" a slow connect(2) (this is useful for
unit testing the code that does the poll(2) part of the usual
connect-with-timeout implementation).
https://android-review.googlesource.com/#/c/44563/

hopefully someone here can shed some light on this? ideally someone
will have a workaround as good as my old trick. i realize i was
relying on undocumented behavior, and i'm happy to have to check
/proc/version and behave appropriately, but i'd really like a way to
keep my unit tests!

thanks,
 elliott
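For readers who have not seen the "usual connect-with-timeout implementation" mentioned above: it is typically a non-blocking connect(2), then poll(2) for writability, then a check of SO_ERROR. The following is a minimal sketch under that assumption. The name connect_with_timeout and its return convention are hypothetical and are not taken from the linked Android change.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>

// Hypothetical helper: returns 0 on success, otherwise an errno value
// (ETIMEDOUT if the handshake did not complete within timeout_ms).
// The socket is left in non-blocking mode.
int connect_with_timeout(int fd, const sockaddr_in& sa, int timeout_ms) {
  assert(timeout_ms >= 0);

  // Switch the socket to non-blocking mode so connect(2) returns at once.
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
    return errno;
  }

  if (connect(fd, reinterpret_cast<const sockaddr*>(&sa), sizeof(sa)) == 0) {
    return 0;  // Connected immediately (possible on loopback).
  }
  if (errno != EINPROGRESS) {
    return errno;  // Immediate failure, e.g. ENETUNREACH.
  }

  // Wait until the socket becomes writable or the timeout expires.
  // Simplification: an EINTR restart waits for the full timeout again.
  pollfd pfd = {};
  pfd.fd = fd;
  pfd.events = POLLOUT;
  int n;
  do {
    n = poll(&pfd, 1, timeout_ms);
  } while (n == -1 && errno == EINTR);
  if (n == 0) return ETIMEDOUT;
  if (n == -1) return errno;

  // Writable does not mean connected: fetch the real outcome.
  int err = 0;
  socklen_t len = sizeof(err);
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1) return errno;
  return err;
}
```

The poll(2) step is the part the old backlog trick exercised: a connect that never completes drives the ETIMEDOUT path, and an accept(2) from another thread drives the "became writable" path.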


* Re: listen(2) backlog changes in or around Linux 3.1?
From: Venkat Venkatsubra @ 2012-10-15 17:12 UTC (permalink / raw)
  To: enh; +Cc: netdev

Hi Elliott,

In BSD, I think the backlog used to be scaled to 3/2 times the value
passed by the user, so 2 becomes 3.
The extra 1/2 was probably there to accommodate connections in the
partial/incomplete queue.
In Linux, is it possible you were getting the same behavior before the
commit below?
Since the check used to be effectively "backlog+1", a backlog of 2 would
behave as 3?

commit 8488df894d05d6fa41c2bd298c335f944bb0e401
Author: Wei Dong <weid@np.css.fujitsu.com>
Date:   Fri Mar 2 12:37:26 2007 -0800

    [NET]: Fix bugs in "Whether sock accept queue is full" checking

    when I use linux TCP socket, and find there is a bug in function
    sk_acceptq_is_full().

    When a new SYN comes, TCP module first checks its validation. If valid,
    send SYN,ACK to the client and add the sock to the syn hash table. Next
    time if received the valid ACK for SYN,ACK from the client. server will
    accept this connection and increase the sk->sk_ack_backlog -- which is
    done in function tcp_check_req().We check wether acceptq is full in
    function tcp_v4_syn_recv_sock().

    Consider an example:

    After listen(sockfd, 1) system call, sk->sk_max_ack_backlog is set to
    1. As we know, sk->sk_ack_backlog is initialized to 0. Assuming accept()
    system call is not invoked now.

    1. 1st connection comes. invoke sk_acceptq_is_full().
       sk->sk_ack_backlog=0 sk->sk_max_ack_backlog=1, function return 0
       accept this connection.
       Increase the sk->sk_ack_backlog
    2. 2nd connection comes. invoke sk_acceptq_is_full().
       sk->sk_ack_backlog=1 sk->sk_max_ack_backlog=1, function return 0
       accept this connection.
       Increase the sk->sk_ack_backlog
    3. 3rd connection comes. invoke sk_acceptq_is_full().
       sk->sk_ack_backlog=2 sk->sk_max_ack_backlog=1, function return 1.
       Refuse this connection.

    I think it has bugs. after listen system call. sk->sk_max_ack_backlog=1
    but now it can accept 2 connections.

    Signed-off-by: Wei Dong <weid@np.css.fujitsu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Venkat
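The arithmetic in the commit message above can be replayed with a toy model. This is only a sketch: ToySock, full_gt, full_ge and admitted are hypothetical names, not kernel code. It assumes the check being discussed is a strict `>` comparison of the two counters and that the patch proposed `>=`, which is how the `>` to `>=` change is described elsewhere in this thread.

```cpp
#include <cassert>

// Toy stand-in for the two counters named in the commit message.
struct ToySock {
  unsigned ack_backlog = 0;      // models sk->sk_ack_backlog
  unsigned max_ack_backlog = 0;  // models sk->sk_max_ack_backlog
};

// "Full" check with a strict comparison (assumed pre-patch behavior).
bool full_gt(const ToySock& sk) { return sk.ack_backlog > sk.max_ack_backlog; }

// "Full" check with the comparison the patch is assumed to introduce.
bool full_ge(const ToySock& sk) { return sk.ack_backlog >= sk.max_ack_backlog; }

// Counts how many connections get queued, with no accept() calls,
// before the given check first reports the queue as full.
unsigned admitted(unsigned backlog, bool (*is_full)(const ToySock&)) {
  assert(is_full != nullptr);
  ToySock sk;
  sk.max_ack_backlog = backlog;
  while (!is_full(sk)) {
    ++sk.ack_backlog;  // one more completed connection waiting in the queue
  }
  return sk.ack_backlog;
}
```

With a backlog of 1, admitted(1, full_gt) is 2 and admitted(1, full_ge) is 1. With a backlog of 0, the strict check still admits one connection and the `>=` check admits none.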


* Re: listen(2) backlog changes in or around Linux 3.1?
From: enh @ 2012-10-15 17:26 UTC (permalink / raw)
  To: Venkat Venkatsubra; +Cc: netdev

On Mon, Oct 15, 2012 at 10:12 AM, Venkat Venkatsubra
<venkat.x.venkatsubra@oracle.com> wrote:
> Hi Elliott,
>
> In BSD I think the backlog used to be reset to 3/2 times that passed by the
> user. So, 2 becomes 3.
> Probably the 1/2 times increase was to accommodate the ones in
> partial/incomplete queue.
> In Linux is it possible you were getting the same behavior before the below
> commit ?
> Since the check used to be "backlog+1" a 2 will behave as 3 ?

i don't think so, because with <= 3.0 kernels i used to have a backlog
of 1 and be able to make _4_ connections before my next connect would
hang. but this > to >= change is at least something for me to
investigate...



* Re: listen(2) backlog changes in or around Linux 3.1?
From: Venkat Venkatsubra @ 2012-10-15 21:30 UTC (permalink / raw)
  To: enh; +Cc: netdev

Ignore what I said about the sk_acceptq_is_full() > to >= change.
That commit was reverted by this one:
commit 64a146513f8f12ba204b7bf5cb7e9505594ead42
Author: David S. Miller <davem@sunset.davemloft.net>
Date:   Tue Mar 6 11:21:05 2007 -0800

     [NET]: Revert incorrect accept queue backlog changes.

     This reverts two changes:

     8488df894d05d6fa41c2bd298c335f944bb0e401
     248f06726e866942b3d8ca8f411f9067713b7ff8

     A backlog value of N really does mean allow "N + 1" connections
     to queue to a listening socket.  This allows one to specify
     "0" as the backlog and still get 1 connection.

     Noticed by Gerrit Renker and Rick Jones.

     Signed-off-by: David S. Miller <davem@davemloft.net>

Venkat


* Re: listen(2) backlog changes in or around Linux 3.1?
From: enh @ 2012-10-16 23:31 UTC (permalink / raw)
  To: netdev

boiling things down to a short C++ program, i see that i can reproduce
the behavior even on 2.6 kernels. if i run this, i see 4 connections
immediately (3 + 1, as i'd expect)... but then about 10s later i see
another 2. and every few seconds after that, i see another 2. i've let
this run until i have hundreds of connect(2) calls that have returned,
despite my small listen(2) backlog and the fact that i'm not
accept(2)ing.

so i guess the only thing that's changed with newer kernels is timing
(hell, since i only see newer kernels on newer hardware, it might just
be a hardware thing).

and clearly i don't understand what the listen(2) backlog means any more.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <iostream>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

// Print the accept-queue counters of a listening socket.
void dump_ti(int fd) {
  tcp_info ti;
  socklen_t tcp_info_length = sizeof(tcp_info);
  // TCP_INFO is a TCP-level option, so the level must be IPPROTO_TCP
  // (the original used SOL_IP, which selects an unrelated IP option).
  int rc = getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &tcp_info_length);
  if (rc == -1) {
    std::cout << "getsockopt rc " << rc << ": " << strerror(errno) << "\n";
    return;
  }

  // For a socket in the LISTEN state, Linux reports the current accept
  // queue length in tcpi_unacked and its limit in tcpi_sacked.
  std::cout << "ti.tcpi_unacked=" << ti.tcpi_unacked << "\n";
  std::cout << "ti.tcpi_sacked=" << ti.tcpi_sacked << "\n";
}

// Open one more connection to the listener. The fd is deliberately never
// closed so that the connection stays open.
void connect_to(sockaddr_in& sa) {
  int s = socket(AF_INET, SOCK_STREAM, 0);
  if (s == -1) {
    abort();
  }

  int rc = connect(s, (sockaddr*) &sa, sizeof(sockaddr_in));
  std::cout << "connect = " << rc << "\n";
}

int main() {
  int ss = socket(AF_INET, SOCK_STREAM, 0);
  std::cout << "socket fd " << ss << "\n";

  sockaddr_in sa;
  memset(&sa, 0, sizeof(sa));
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = htonl(INADDR_ANY);
  sa.sin_port = htons(9877);
  int rc = bind(ss, (sockaddr*) &sa, sizeof(sa));
  std::cout << "bind rc " << rc << ": " << strerror(errno) << "\n";
  // sin_port is stored in network byte order; convert it before printing.
  std::cout << "bind port " << ntohs(sa.sin_port) << "\n";

  rc = listen(ss, 1);
  std::cout << "listen rc " << rc << ": " << strerror(errno) << "\n";
  dump_ti(ss);

  // Keep connecting without ever calling accept(2).
  while (true) {
    connect_to(sa);
    dump_ti(ss);
  }

  return 0;
}




* Re: listen(2) backlog changes in or around Linux 3.1?
From: Venkat Venkatsubra @ 2012-10-18 16:00 UTC (permalink / raw)
  To: enh; +Cc: netdev

Hi Elliott,

I see the same behavior with your test program.
connect() keeps succeeding even though accept() is never called.
It pauses for a while after 4 connections and then periodically adds a
few more (2 at a time, I think).

But the server-side endpoints are terminated too: you will see only the
first 2 sessions on the server side.
If you modify your test program to, say, read or poll the sockets, I
think you should get a termination notification on them.

Overall the behavior looks fine in my opinion, but it could be a change
of behavior for your test program.

Venkat
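One way to try the "read or poll the sockets" suggestion above is a small non-blocking probe on each connected client socket. This is a sketch only: PeerState and peer_state are hypothetical names, and whether or when the kernel reports anything on these particular sockets is exactly the open question in this thread.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

enum class PeerState { Alive, Closed, Reset };

// Hypothetical helper: checks, without blocking, whether the peer of a
// connected stream socket has gone away.
PeerState peer_state(int fd) {
  assert(fd >= 0);

  pollfd pfd = {};
  pfd.fd = fd;
  pfd.events = POLLIN;
  // Timeout 0: only report what is already pending. A poll() failure is
  // treated as "nothing known yet" to keep the sketch short.
  if (poll(&pfd, 1, 0) <= 0) return PeerState::Alive;

  if (pfd.revents & POLLERR) return PeerState::Reset;

  // Peek so that real data, if any, stays queued for the caller.
  char byte;
  ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
  if (n == 0) return PeerState::Closed;  // orderly shutdown (EOF)
  if (n == -1 && errno != EAGAIN && errno != EWOULDBLOCK) {
    return PeerState::Reset;  // e.g. ECONNRESET
  }
  return PeerState::Alive;
}
```

In the test program this would mean keeping the fds returned by socket() in connect_to() and calling peer_state() on each of them after the pauses.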

On 10/16/2012 6:31 PM, enh wrote:
> boiling things down to a short C++ program, i see that i can reproduce
> the behavior even on 2.6 kernels. if i run this, i see 4 connections
> immediately (3 + 1, as i'd expect)... but then about 10s later i see
> another 2. and every few seconds after that, i see another 2. i've let
> this run until i have hundreds of connect(2) calls that have returned,
> despite my small listen(2) backlog and the fact that i'm not
> accept(2)ing.
>
> so i guess the only thing that's changed with newer kernels is timing
> (hell, since i only see newer kernels on newer hardware, it might just
> be a hardware thing).
>
> and clearly i don't understand what the listen(2) backlog means any more.
>
> #include<netinet/ip.h>
> #include<netinet/tcp.h>
> #include<sys/types.h>
> #include<sys/socket.h>
> #include<iostream>
> #include<stdlib.h>
> #include<string.h>
> #include<errno.h>
>
> void dump_ti(int fd) {
>   tcp_info ti;
>   socklen_t tcp_info_length = sizeof(tcp_info);
>   int rc = getsockopt(fd, SOL_IP, TCP_INFO,&ti,&tcp_info_length);
>   if (rc == -1) {
>     std::cout<<  "getsockopt rc "<<  rc<<  ": "<<  strerror(errno)<<  "\n";
>     return;
>   }
>
>   std::cout<<  "ti.tcpi_unacked="<<  ti.tcpi_unacked<<  "\n";
>   std::cout<<  "ti.tcpi_sacked="<<  ti.tcpi_sacked<<  "\n";
> }
>
> void connect_to(sockaddr_in&  sa) {
>   int s = socket(AF_INET, SOCK_STREAM, 0);
>   if (s == -1) {
>     abort();
>   }
>
>   int rc = connect(s, (sockaddr*)&sa, sizeof(sockaddr_in));
>   std::cout<<  "connect = "<<  rc<<  "\n";
> }
>
> int main() {
>   int ss = socket(AF_INET, SOCK_STREAM, 0);
>   std::cout<<  "socket fd "<<  ss<<  "\n";
>
>   sockaddr_in sa;
>   memset(&sa, 0, sizeof(sa));
>   sa.sin_family = AF_INET;
>   sa.sin_addr.s_addr = htonl(INADDR_ANY);
>   sa.sin_port = htons(9877);
>   int rc = bind(ss, (sockaddr*)&sa, sizeof(sa));
>   std::cout<<  "bind rc "<<  rc<<  ": "<<  strerror(errno)<<  "\n";
>   std::cout<<  "bind port "<<  sa.sin_port<<  "\n";
>
>   rc = listen(ss, 1);
>   std::cout<<  "listen rc "<<  rc<<  ": "<<  strerror(errno)<<  "\n";
>   dump_ti(ss);
>
>   while (true) {
>    connect_to(sa);
>    dump_ti(ss);
>   }
>
>   return 0;
> }
>
>
> On Mon, Oct 15, 2012 at 10:26 AM, enh<enh@google.com>  wrote:
>> On Mon, Oct 15, 2012 at 10:12 AM, Venkat Venkatsubra
>> <venkat.x.venkatsubra@oracle.com>  wrote:
>>> On 10/12/2012 6:40 PM, enh wrote:
>>>> i used to use the following hack to unit test connect timeouts: i'd
>>>> call listen(2) on a socket and then deliberately connect (backlog + 3)
>>>> sockets without accept(2)ing any of the connections. (why 3? because
>>>> Stevens told me so, and experiment backed him up. see figure 4.10 in
>>>> his UNIX Network Programming.)
>>>>
>>>> with "old" kernels, 2.6.35-ish to 3.0-ish, this worked great. my next
>>>> connect(2) to the same loopback port would hang indefinitely. i could
>>>> even unblock the connect by calling accept(2) in another thread. this
>>>> was awesome for testing.
>>>>
>>>> in 3.1 on ARM, 3.2 on x86 (Ubuntu desktop), and 3.4 on ARM, this no
>>>> longer works. it doesn't seem to be as simple as "the constant is no
>>>> longer 3". my tests are now flaky. sometimes they work like they used
>>>> to, and sometimes an extra connect(2) will succeed. (or, if i'm in
>>>> non-blocking mode, my poll(2) will return with the non-blocking socket
>>>> that's trying to connect now ready.)
>>>>
>>>> i'm guessing if this changed in 3.1 and is still changed in 3.4,
>>>> whatever's changed wasn't an accident. but i haven't been able to find
>>>> the right search terms to RTFM. i also finally got around to grepping
>>>> the kernel for the "+ 3", but wasn't able to find that. (so i'd be
>>>> interested to know where the old behavior came from too.)
>>>>
>>>> my least worst workaround at the moment is to use one of RFC5737's
>>>> test networks, but that requires that the device have a network
>>>> connection, otherwise my connect(2)s fail immediately with
>>>> ENETUNREACH, which is no use to me. also, unlike my old trick, i've
>>>> got no way to suddenly "unblock" a slow connect(2) (this is useful for
>>>> unit testing the code that does the poll(2) part of the usual
>>>> connect-with-timeout implementation).
>>>> https://android-review.googlesource.com/#/c/44563/
>>>>
>>>> hopefully someone here can shed some light on this? ideally someone
>>>> will have a workaround as good as my old trick. i realize i was
>>>> relying on undocumented behavior, and i'm happy to have to check
>>>> /proc/version and behave appropriately, but i'd really like a way to
>>>> keep my unit tests!
>>>>
>>>> thanks,
>>>>    elliott
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Hi Elliott,
>>>
>>> In BSD I think the backlog used to be reset to 3/2 times that passed by the
>>> user. So, 2 becomes 3.
>>> Probably the 1/2 times increase was to accommodate the ones in
>>> partial/incomplete queue.
>>> In Linux is it possible you were getting the same behavior before the below
>>> commit ?
>>> Since the check used to be "backlog+1" a 2 will behave as 3 ?
>> i don't think so, because with <= 3.0 kernels i used to have a backlog
>> of 1 and be able to make _4_ connections before my next connect would
>> hang. but this > to >= change is at least something for me to
>> investigate...
>>
>>> commit 8488df894d05d6fa41c2bd298c335f944bb0e401
>>> Author: Wei Dong <weid@np.css.fujitsu.com>
>>> Date:   Fri Mar 2 12:37:26 2007 -0800
>>>
>>>      [NET]: Fix bugs in "Whether sock accept queue is full" checking
>>>
>>>          when I use linux TCP socket, and find there is a bug in function
>>>      sk_acceptq_is_full().
>>>
>>>          When a new SYN comes, TCP module first checks its validation. If valid,
>>>      send SYN,ACK to the client and add the sock to the syn hash table. Next
>>>      time if received the valid ACK for SYN,ACK from the client. server will
>>>      accept this connection and increase the sk->sk_ack_backlog -- which is
>>>      done in function tcp_check_req().We check wether acceptq is full in
>>>      function tcp_v4_syn_recv_sock().
>>>
>>>      Consider an example:
>>>
>>>       After listen(sockfd, 1) system call, sk->sk_max_ack_backlog is set to
>>>      1. As we know, sk->sk_ack_backlog is initialized to 0. Assuming accept()
>>>      system call is not invoked now.
>>>
>>>      1. 1st connection comes. invoke sk_acceptq_is_full().
>>>       sk->sk_ack_backlog=0 sk->sk_max_ack_backlog=1, function return 0 accept
>>>       this connection.
>>>       Increase the sk->sk_ack_backlog
>>>      2. 2nd connection comes. invoke sk_acceptq_is_full().
>>>       sk->sk_ack_backlog=1 sk->sk_max_ack_backlog=1, function return 0 accept
>>>       this connection.
>>>       Increase the sk->sk_ack_backlog
>>>      3. 3rd connection comes. invoke sk_acceptq_is_full().
>>>       sk->sk_ack_backlog=2 sk->sk_max_ack_backlog=1, function return 1.
>>>       Refuse this connection.
>>>
>>>      I think it has bugs. after listen system call. sk->sk_max_ack_backlog=1
>>>      but now it can accept 2 connections.
>>>
>>>      Signed-off-by: Wei Dong <weid@np.css.fujitsu.com>
>>>      Signed-off-by: David S. Miller <davem@davemloft.net>
>>>
>>> Venkat


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-18 16:00       ` Venkat Venkatsubra
@ 2012-10-18 16:53         ` Venkat Venkatsubra
  2012-10-18 17:20           ` enh
  0 siblings, 1 reply; 20+ messages in thread
From: Venkat Venkatsubra @ 2012-10-18 16:53 UTC (permalink / raw)
  To: enh; +Cc: netdev

Correction. I don't see the client side receiving any abort/termination
notification.
They all remain in the ESTABLISHED state on the client side.
In tcpdump I don't see a FIN or RST coming from the server for the
aborted connections.

Venkat

On 10/18/2012 11:00 AM, Venkat Venkatsubra wrote:
> Hi Elliott,
>
> I see the same behavior with your test program.
> The connect() keeps succeeding even though accept() is not performed.
> It pauses after 4 connections for a while and then periodically keeps 
> adding few (2 I think).
>
> But the server side end points are terminated too. You will see only 
> the first 2 sessions on the server side.
> If you modify your test program to say read or poll the sockets you 
> should get a termination notification on them I think .
>
> The behavior overall looks fine in my opinion.  But it could be a 
> change of behavior for your test program.
>
> Venkat
>
> On 10/16/2012 6:31 PM, enh wrote:
>> boiling things down to a short C++ program, i see that i can reproduce
>> the behavior even on 2.6 kernels. if i run this, i see 4 connections
>> immediately (3 + 1, as i'd expect)... but then about 10s later i see
>> another 2. and every few seconds after that, i see another 2. i've let
>> this run until i have hundreds of connect(2) calls that have returned,
>> despite my small listen(2) backlog and the fact that i'm not
>> accept(2)ing.
>>
>> so i guess the only thing that's changed with newer kernels is timing
>> (hell, since i only see newer kernels on newer hardware, it might just
>> be a hardware thing).
>>
>> and clearly i don't understand what the listen(2) backlog means any 
>> more.
>>
>> #include <netinet/in.h>
>> #include <netinet/ip.h>
>> #include <netinet/tcp.h>
>> #include <sys/types.h>
>> #include <sys/socket.h>
>> #include <iostream>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <errno.h>
>>
>> // On Linux, TCP_INFO on a listening socket reports the current accept
>> // queue length in tcpi_unacked and the backlog limit in tcpi_sacked.
>> void dump_ti(int fd) {
>>   tcp_info ti;
>>   socklen_t tcp_info_length = sizeof(tcp_info);
>>   // TCP_INFO is a TCP-level option, so the level must be IPPROTO_TCP.
>>   int rc = getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &tcp_info_length);
>>   if (rc == -1) {
>>     std::cout << "getsockopt rc " << rc << ": " << strerror(errno) << "\n";
>>     return;
>>   }
>>
>>   std::cout << "ti.tcpi_unacked=" << ti.tcpi_unacked << "\n";
>>   std::cout << "ti.tcpi_sacked=" << ti.tcpi_sacked << "\n";
>> }
>>
>> // Connects to sa and deliberately leaves the socket open.
>> void connect_to(sockaddr_in& sa) {
>>   int s = socket(AF_INET, SOCK_STREAM, 0);
>>   if (s == -1) {
>>     abort();
>>   }
>>
>>   int rc = connect(s, (sockaddr*)&sa, sizeof(sockaddr_in));
>>   std::cout << "connect = " << rc << "\n";
>> }
>>
>> int main() {
>>   int ss = socket(AF_INET, SOCK_STREAM, 0);
>>   std::cout << "socket fd " << ss << "\n";
>>
>>   sockaddr_in sa;
>>   memset(&sa, 0, sizeof(sa));
>>   sa.sin_family = AF_INET;
>>   sa.sin_addr.s_addr = htonl(INADDR_ANY);
>>   sa.sin_port = htons(9877);
>>   int rc = bind(ss, (sockaddr*)&sa, sizeof(sa));
>>   std::cout << "bind rc " << rc << ": " << strerror(errno) << "\n";
>>   // sin_port is stored in network byte order; convert before printing.
>>   std::cout << "bind port " << ntohs(sa.sin_port) << "\n";
>>
>>   rc = listen(ss, 1);
>>   std::cout << "listen rc " << rc << ": " << strerror(errno) << "\n";
>>   dump_ti(ss);
>>
>>   // Keep connecting without ever calling accept(2).
>>   while (true) {
>>    connect_to(sa);
>>    dump_ti(ss);
>>   }
>>
>>   return 0;
>> }
>>
>>
>> On Mon, Oct 15, 2012 at 10:26 AM, enh <enh@google.com> wrote:
>>> On Mon, Oct 15, 2012 at 10:12 AM, Venkat Venkatsubra
>>> <venkat.x.venkatsubra@oracle.com> wrote:
>>>> On 10/12/2012 6:40 PM, enh wrote:
>>>>> i used to use the following hack to unit test connect timeouts: i'd
>>>>> call listen(2) on a socket and then deliberately connect (backlog 
>>>>> + 3)
>>>>> sockets without accept(2)ing any of the connections. (why 3? because
>>>>> Stevens told me so, and experiment backed him up. see figure 4.10 in
>>>>> his UNIX Network Programming.)
>>>>>
>>>>> with "old" kernels, 2.6.35-ish to 3.0-ish, this worked great. my next
>>>>> connect(2) to the same loopback port would hang indefinitely. i could
>>>>> even unblock the connect by calling accept(2) in another thread. this
>>>>> was awesome for testing.
>>>>>
>>>>> in 3.1 on ARM, 3.2 on x86 (Ubuntu desktop), and 3.4 on ARM, this no
>>>>> longer works. it doesn't seem to be as simple as "the constant is no
>>>>> longer 3". my tests are now flaky. sometimes they work like they used
>>>>> to, and sometimes an extra connect(2) will succeed. (or, if i'm in
>>>>> non-blocking mode, my poll(2) will return with the non-blocking 
>>>>> socket
>>>>> that's trying to connect now ready.)
>>>>>
>>>>> i'm guessing if this changed in 3.1 and is still changed in 3.4,
>>>>> whatever's changed wasn't an accident. but i haven't been able to 
>>>>> find
>>>>> the right search terms to RTFM. i also finally got around to grepping
>>>>> the kernel for the "+ 3", but wasn't able to find that. (so i'd be
>>>>> interested to know where the old behavior came from too.)
>>>>>
>>>>> my least worst workaround at the moment is to use one of RFC5737's
>>>>> test networks, but that requires that the device have a network
>>>>> connection, otherwise my connect(2)s fail immediately with
>>>>> ENETUNREACH, which is no use to me. also, unlike my old trick, i've
>>>>> got no way to suddenly "unblock" a slow connect(2) (this is useful 
>>>>> for
>>>>> unit testing the code that does the poll(2) part of the usual
>>>>> connect-with-timeout implementation).
>>>>> https://android-review.googlesource.com/#/c/44563/
>>>>>
>>>>> hopefully someone here can shed some light on this? ideally someone
>>>>> will have a workaround as good as my old trick. i realize i was
>>>>> relying on undocumented behavior, and i'm happy to have to check
>>>>> /proc/version and behave appropriately, but i'd really like a way to
>>>>> keep my unit tests!
>>>>>
>>>>> thanks,
>>>>>    elliott
>>>> Hi Elliott,
>>>>
>>>> In BSD I think the backlog used to be reset to 3/2 times that 
>>>> passed by the
>>>> user. So, 2 becomes 3.
>>>> Probably the 1/2 times increase was to accommodate the ones in
>>>> partial/incomplete queue.
>>>> In Linux is it possible you were getting the same behavior before 
>>>> the below
>>>> commit ?
>>>> Since the check used to be "backlog+1" a 2 will behave as 3 ?
>>> i don't think so, because with <= 3.0 kernels i used to have a backlog
>>> of 1 and be able to make _4_ connections before my next connect would
>>> hang. but this > to >= change is at least something for me to
>>> investigate...
>>>
>>>> commit 8488df894d05d6fa41c2bd298c335f944bb0e401
>>>> Author: Wei Dong <weid@np.css.fujitsu.com>
>>>> Date:   Fri Mar 2 12:37:26 2007 -0800
>>>>
>>>>      [NET]: Fix bugs in "Whether sock accept queue is full" checking
>>>>
>>>>          when I use linux TCP socket, and find there is a bug in function
>>>>      sk_acceptq_is_full().
>>>>
>>>>          When a new SYN comes, TCP module first checks its validation. If valid,
>>>>      send SYN,ACK to the client and add the sock to the syn hash table. Next
>>>>      time if received the valid ACK for SYN,ACK from the client. server will
>>>>      accept this connection and increase the sk->sk_ack_backlog -- which is
>>>>      done in function tcp_check_req().We check wether acceptq is full in
>>>>      function tcp_v4_syn_recv_sock().
>>>>
>>>>      Consider an example:
>>>>
>>>>       After listen(sockfd, 1) system call, sk->sk_max_ack_backlog is set to
>>>>      1. As we know, sk->sk_ack_backlog is initialized to 0. Assuming accept()
>>>>      system call is not invoked now.
>>>>
>>>>      1. 1st connection comes. invoke sk_acceptq_is_full().
>>>>       sk->sk_ack_backlog=0 sk->sk_max_ack_backlog=1, function return 0 accept
>>>>       this connection.
>>>>       Increase the sk->sk_ack_backlog
>>>>      2. 2nd connection comes. invoke sk_acceptq_is_full().
>>>>       sk->sk_ack_backlog=1 sk->sk_max_ack_backlog=1, function return 0 accept
>>>>       this connection.
>>>>       Increase the sk->sk_ack_backlog
>>>>      3. 3rd connection comes. invoke sk_acceptq_is_full().
>>>>       sk->sk_ack_backlog=2 sk->sk_max_ack_backlog=1, function return 1.
>>>>       Refuse this connection.
>>>>
>>>>      I think it has bugs. after listen system call. sk->sk_max_ack_backlog=1
>>>>      but now it can accept 2 connections.
>>>>
>>>>      Signed-off-by: Wei Dong <weid@np.css.fujitsu.com>
>>>>      Signed-off-by: David S. Miller <davem@davemloft.net>
>>>>
>>>> Venkat


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-18 16:53         ` Venkat Venkatsubra
@ 2012-10-18 17:20           ` enh
  2012-10-19  6:02             ` Vijay Subramanian
  0 siblings, 1 reply; 20+ messages in thread
From: enh @ 2012-10-18 17:20 UTC (permalink / raw)
  To: Venkat Venkatsubra; +Cc: netdev

On Thu, Oct 18, 2012 at 9:53 AM, Venkat Venkatsubra
<venkat.x.venkatsubra@oracle.com> wrote:
> Correction. I don't see the client side receiving any abort/termination
> notification.
> They all remain on ESTABLISHED state on the client side.

yeah, that's what i see with netstat -t too.

in the meantime i'm working around this by connecting to one of
RFC5737's test networks
(https://android-review.googlesource.com/#/c/44563/), but i'd love to
at least understand what's going on here, even if it's just that i
have a fundamental misunderstanding of what the listen backlog is
supposed to mean.
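
for reference, the "usual connect-with-timeout implementation" mentioned earlier in the thread (a non-blocking connect(2) followed by poll(2)) can be sketched roughly as below. this is only an illustration, not the code from the linked review: connect_with_timeout and fail_keep_errno are hypothetical helper names, and the choice to report a timeout as ETIMEDOUT is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: close fd but keep the errno that caused the failure. */
static int fail_keep_errno(int fd)
{
    int saved = errno;
    close(fd);
    errno = saved;
    return -1;
}

/* Hypothetical helper: returns a connected (still non-blocking) fd, or -1
 * with errno set. errno is ETIMEDOUT if the handshake did not complete
 * within timeout_ms milliseconds. */
static int connect_with_timeout(const struct sockaddr_in *addr, int timeout_ms)
{
    assert(addr != NULL);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return fail_keep_errno(fd);

    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
        return fd;                      /* connected immediately */
    if (errno != EINPROGRESS)
        return fail_keep_errno(fd);     /* e.g. ENETUNREACH without a network */

    /* Wait until the socket becomes writable or the timeout expires. */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int n;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);
    if (n == 0)
        errno = ETIMEDOUT;
    if (n <= 0)
        return fail_keep_errno(fd);

    /* Writable does not mean connected: SO_ERROR holds the real outcome. */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
        return fail_keep_errno(fd);
    if (err != 0) {
        errno = err;
        return fail_keep_errno(fd);
    }
    return fd;
}
```

pointing this at an address in one of the RFC5737 documentation ranges (192.0.2.0/24, for example) is the workaround described above; as noted earlier in the thread, without a network connection the connect(2) fails immediately with ENETUNREACH instead of timing out.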

> In tcpdump I don't see a FIN or RST coming from the server for the aborted
> connections.
>
> Venkat
>
>
> On 10/18/2012 11:00 AM, Venkat Venkatsubra wrote:
>>
>> Hi Elliott,
>>
>> I see the same behavior with your test program.
>> The connect() keeps succeeding even though accept() is not performed.
>> It pauses after 4 connections for a while and then periodically keeps
>> adding few (2 I think).
>>
>> But the server side end points are terminated too. You will see only the
>> first 2 sessions on the server side.
>> If you modify your test program to say read or poll the sockets you should
>> get a termination notification on them I think .
>>
>> The behavior overall looks fine in my opinion.  But it could be a change
>> of behavior for your test program.
>>
>> Venkat
>>
>> On 10/16/2012 6:31 PM, enh wrote:
>>>
>>> boiling things down to a short C++ program, i see that i can reproduce
>>> the behavior even on 2.6 kernels. if i run this, i see 4 connections
>>> immediately (3 + 1, as i'd expect)... but then about 10s later i see
>>> another 2. and every few seconds after that, i see another 2. i've let
>>> this run until i have hundreds of connect(2) calls that have returned,
>>> despite my small listen(2) backlog and the fact that i'm not
>>> accept(2)ing.
>>>
>>> so i guess the only thing that's changed with newer kernels is timing
>>> (hell, since i only see newer kernels on newer hardware, it might just
>>> be a hardware thing).
>>>
>>> and clearly i don't understand what the listen(2) backlog means any more.
>>>
>>> #include <netinet/in.h>
>>> #include <netinet/ip.h>
>>> #include <netinet/tcp.h>
>>> #include <sys/types.h>
>>> #include <sys/socket.h>
>>> #include <iostream>
>>> #include <stdlib.h>
>>> #include <string.h>
>>> #include <errno.h>
>>>
>>> // On Linux, TCP_INFO on a listening socket reports the current accept
>>> // queue length in tcpi_unacked and the backlog limit in tcpi_sacked.
>>> void dump_ti(int fd) {
>>>   tcp_info ti;
>>>   socklen_t tcp_info_length = sizeof(tcp_info);
>>>   // TCP_INFO is a TCP-level option, so the level must be IPPROTO_TCP.
>>>   int rc = getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &tcp_info_length);
>>>   if (rc == -1) {
>>>     std::cout << "getsockopt rc " << rc << ": " << strerror(errno) << "\n";
>>>     return;
>>>   }
>>>
>>>   std::cout << "ti.tcpi_unacked=" << ti.tcpi_unacked << "\n";
>>>   std::cout << "ti.tcpi_sacked=" << ti.tcpi_sacked << "\n";
>>> }
>>>
>>> // Connects to sa and deliberately leaves the socket open.
>>> void connect_to(sockaddr_in& sa) {
>>>   int s = socket(AF_INET, SOCK_STREAM, 0);
>>>   if (s == -1) {
>>>     abort();
>>>   }
>>>
>>>   int rc = connect(s, (sockaddr*)&sa, sizeof(sockaddr_in));
>>>   std::cout << "connect = " << rc << "\n";
>>> }
>>>
>>> int main() {
>>>   int ss = socket(AF_INET, SOCK_STREAM, 0);
>>>   std::cout << "socket fd " << ss << "\n";
>>>
>>>   sockaddr_in sa;
>>>   memset(&sa, 0, sizeof(sa));
>>>   sa.sin_family = AF_INET;
>>>   sa.sin_addr.s_addr = htonl(INADDR_ANY);
>>>   sa.sin_port = htons(9877);
>>>   int rc = bind(ss, (sockaddr*)&sa, sizeof(sa));
>>>   std::cout << "bind rc " << rc << ": " << strerror(errno) << "\n";
>>>   // sin_port is stored in network byte order; convert before printing.
>>>   std::cout << "bind port " << ntohs(sa.sin_port) << "\n";
>>>
>>>   rc = listen(ss, 1);
>>>   std::cout << "listen rc " << rc << ": " << strerror(errno) << "\n";
>>>   dump_ti(ss);
>>>
>>>   // Keep connecting without ever calling accept(2).
>>>   while (true) {
>>>    connect_to(sa);
>>>    dump_ti(ss);
>>>   }
>>>
>>>   return 0;
>>> }
>>>
>>>
>>> On Mon, Oct 15, 2012 at 10:26 AM, enh <enh@google.com> wrote:
>>>>
>>>> On Mon, Oct 15, 2012 at 10:12 AM, Venkat Venkatsubra
>>>> <venkat.x.venkatsubra@oracle.com>  wrote:
>>>>>
>>>>> On 10/12/2012 6:40 PM, enh wrote:
>>>>>>
>>>>>> i used to use the following hack to unit test connect timeouts: i'd
>>>>>> call listen(2) on a socket and then deliberately connect (backlog + 3)
>>>>>> sockets without accept(2)ing any of the connections. (why 3? because
>>>>>> Stevens told me so, and experiment backed him up. see figure 4.10 in
>>>>>> his UNIX Network Programming.)
>>>>>>
>>>>>> with "old" kernels, 2.6.35-ish to 3.0-ish, this worked great. my next
>>>>>> connect(2) to the same loopback port would hang indefinitely. i could
>>>>>> even unblock the connect by calling accept(2) in another thread. this
>>>>>> was awesome for testing.
>>>>>>
>>>>>> in 3.1 on ARM, 3.2 on x86 (Ubuntu desktop), and 3.4 on ARM, this no
>>>>>> longer works. it doesn't seem to be as simple as "the constant is no
>>>>>> longer 3". my tests are now flaky. sometimes they work like they used
>>>>>> to, and sometimes an extra connect(2) will succeed. (or, if i'm in
>>>>>> non-blocking mode, my poll(2) will return with the non-blocking socket
>>>>>> that's trying to connect now ready.)
>>>>>>
>>>>>> i'm guessing if this changed in 3.1 and is still changed in 3.4,
>>>>>> whatever's changed wasn't an accident. but i haven't been able to find
>>>>>> the right search terms to RTFM. i also finally got around to grepping
>>>>>> the kernel for the "+ 3", but wasn't able to find that. (so i'd be
>>>>>> interested to know where the old behavior came from too.)
>>>>>>
>>>>>> my least worst workaround at the moment is to use one of RFC5737's
>>>>>> test networks, but that requires that the device have a network
>>>>>> connection, otherwise my connect(2)s fail immediately with
>>>>>> ENETUNREACH, which is no use to me. also, unlike my old trick, i've
>>>>>> got no way to suddenly "unblock" a slow connect(2) (this is useful for
>>>>>> unit testing the code that does the poll(2) part of the usual
>>>>>> connect-with-timeout implementation).
>>>>>> https://android-review.googlesource.com/#/c/44563/
>>>>>>
>>>>>> hopefully someone here can shed some light on this? ideally someone
>>>>>> will have a workaround as good as my old trick. i realize i was
>>>>>> relying on undocumented behavior, and i'm happy to have to check
>>>>>> /proc/version and behave appropriately, but i'd really like a way to
>>>>>> keep my unit tests!
>>>>>>
>>>>>> thanks,
>>>>>>    elliott
>>>>>
>>>>> Hi Elliott,
>>>>>
>>>>> In BSD I think the backlog used to be reset to 3/2 times that passed by
>>>>> the
>>>>> user. So, 2 becomes 3.
>>>>> Probably the 1/2 times increase was to accommodate the ones in
>>>>> partial/incomplete queue.
>>>>> In Linux is it possible you were getting the same behavior before the
>>>>> below
>>>>> commit ?
>>>>> Since the check used to be "backlog+1" a 2 will behave as 3 ?
>>>>
>>>> i don't think so, because with <= 3.0 kernels i used to have a backlog
>>>> of 1 and be able to make _4_ connections before my next connect would
>>>> hang. but this > to >= change is at least something for me to
>>>> investigate...
>>>>
>>>>> commit 8488df894d05d6fa41c2bd298c335f944bb0e401
>>>>> Author: Wei Dong <weid@np.css.fujitsu.com>
>>>>> Date:   Fri Mar 2 12:37:26 2007 -0800
>>>>>
>>>>>      [NET]: Fix bugs in "Whether sock accept queue is full" checking
>>>>>
>>>>>          when I use linux TCP socket, and find there is a bug in function
>>>>>      sk_acceptq_is_full().
>>>>>
>>>>>          When a new SYN comes, TCP module first checks its validation. If valid,
>>>>>      send SYN,ACK to the client and add the sock to the syn hash table. Next
>>>>>      time if received the valid ACK for SYN,ACK from the client. server will
>>>>>      accept this connection and increase the sk->sk_ack_backlog -- which is
>>>>>      done in function tcp_check_req().We check wether acceptq is full in
>>>>>      function tcp_v4_syn_recv_sock().
>>>>>
>>>>>      Consider an example:
>>>>>
>>>>>       After listen(sockfd, 1) system call, sk->sk_max_ack_backlog is set to
>>>>>      1. As we know, sk->sk_ack_backlog is initialized to 0. Assuming accept()
>>>>>      system call is not invoked now.
>>>>>
>>>>>      1. 1st connection comes. invoke sk_acceptq_is_full().
>>>>>       sk->sk_ack_backlog=0 sk->sk_max_ack_backlog=1, function return 0 accept
>>>>>       this connection.
>>>>>       Increase the sk->sk_ack_backlog
>>>>>      2. 2nd connection comes. invoke sk_acceptq_is_full().
>>>>>       sk->sk_ack_backlog=1 sk->sk_max_ack_backlog=1, function return 0 accept
>>>>>       this connection.
>>>>>       Increase the sk->sk_ack_backlog
>>>>>      3. 3rd connection comes. invoke sk_acceptq_is_full().
>>>>>       sk->sk_ack_backlog=2 sk->sk_max_ack_backlog=1, function return 1.
>>>>>       Refuse this connection.
>>>>>
>>>>>      I think it has bugs. after listen system call. sk->sk_max_ack_backlog=1
>>>>>      but now it can accept 2 connections.
>>>>>
>>>>>      Signed-off-by: Wei Dong <weid@np.css.fujitsu.com>
>>>>>      Signed-off-by: David S. Miller <davem@davemloft.net>
>>>>>
>>>>> Venkat
>>>
>
>



-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
NIO, JNI, or bionic questions? Mail me/drop by/add me as a reviewer.


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-18 17:20           ` enh
@ 2012-10-19  6:02             ` Vijay Subramanian
  2012-10-19  6:50               ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Vijay Subramanian @ 2012-10-19  6:02 UTC (permalink / raw)
  To: enh; +Cc: Venkat Venkatsubra, netdev, Eric Dumazet

>> They all remain on ESTABLISHED state on the client side.
>
> yeah, that's what i see with netstat -t too.
>

> (https://android-review.googlesource.com/#/c/44563/), but i'd love to
> at least understand what's going on here, even if it's just that i
> have a fundamental misunderstanding of what the listen backlog is
> supposed to mean.
>

The listen backlog represents the number of received SYNs that have
not been processed, i.e. for which a SYN-ACK has not been sent. In
fact, the number of SYNs that can be pending is backlog+1: with a
backlog of 1, there can be 2 SYNs pending for processing.

Once a SYN is processed by the server socket (in LISTEN state) and a
SYN-ACK is sent back, a request_sock is created to represent it. Once
the client completes the last step of connect(), i.e. sends the ACK,
a fully established socket is created. The number of queued
request_socks for a LISTEN socket can be much larger than the backlog
limit given in listen() (which is 1 in your case). If the three-way
handshake is not completed after a short period (once the SYN-ACK
retries are exhausted), request_socks can be silently discarded.

When a SYN is received, it is processed by tcp_v4_conn_request(),
where we have:
if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
        goto drop;

So, for the SYN to be dropped, backlog limit must be exceeded _and_ we
must have recently accepted another SYN request. So, even when backlog
limit is exceeded, SYNs are processed and syn-acks are sent back. It
seems that the listen backlog limit is applied definitively only in
the third step in tcp_v4_syn_recv_sock() and not in the first step.
In  tcp_v4_syn_recv_sock(), we have
 if (sk_acceptq_is_full(sk))
                goto exit_overflow;

This prevents the socket from being fully created. On the client side,
however, the three-way handshake has finished, so the socket goes
into ESTABLISHED state, which is what you see with netstat. In your
test case, typically 2 connections are in the state where the SYN has
yet to be processed, and the rest exist as request_socks for which
SYN-ACKs have been sent. However, those may never become fully created
sockets, as they will fail in step 3 as described above.
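
To make the two checks concrete, here is a small userspace model (a sketch only, not kernel code). The struct listener_model type and the function names are made up for illustration; the comparisons mirror the two snippets quoted above, with the young field standing in for inet_csk_reqsk_queue_young(sk).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the listener state involved. */
struct listener_model {
    unsigned int ack_backlog;     /* models sk->sk_ack_backlog */
    unsigned int max_ack_backlog; /* models sk->sk_max_ack_backlog (listen() backlog) */
    unsigned int young;           /* models inet_csk_reqsk_queue_young(sk) */
};

/* Same comparison as sk_acceptq_is_full(): full only once the queue
 * length exceeds the backlog, so backlog + 1 entries fit. */
static bool acceptq_is_full(const struct listener_model *l)
{
    assert(l != NULL);
    return l->ack_backlog > l->max_ack_backlog;
}

/* Step 1 (SYN arrives): the SYN is dropped only if the accept queue is
 * full AND more than one young request is pending. */
static bool syn_is_dropped(const struct listener_model *l)
{
    return acceptq_is_full(l) && l->young > 1;
}

/* Step 3 (final ACK arrives): no child socket is created while the
 * accept queue is full, so the ACK is effectively ignored. */
static bool ack_is_ignored(const struct listener_model *l)
{
    return acceptq_is_full(l);
}
```

With a backlog of 1 and two connections already queued, acceptq_is_full() is true, so a final ACK is ignored, yet a new SYN is still answered with a SYN-ACK unless more than one young request is pending.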

The listen(2) man page says
"The backlog argument defines the maximum length to which the
queue of pending connections for sockfd may grow." In your case, where
backlog is 1, there can be a max of 2 pending connections (SYNs not
yet processed) and this is what we see. By this interpretation,
behavior seems correct.

Not sure if this behavior is a bug, but the processing in
tcp_v4_conn_request() does look suspicious. Should we terminate
earlier, without doing the three-way handshake?
Perhaps someone who knows this better can clarify.

Hope this helps.
Vijay


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-19  6:02             ` Vijay Subramanian
@ 2012-10-19  6:50               ` Eric Dumazet
  2012-10-19  8:06                 ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-10-19  6:50 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: enh, Venkat Venkatsubra, netdev

On Thu, 2012-10-18 at 23:02 -0700, Vijay Subramanian wrote:
> >> They all remain on ESTABLISHED state on the client side.
> >
> > yeah, that's what i see with netstat -t too.
> >
> 
> > (https://android-review.googlesource.com/#/c/44563/), but i'd love to
> > at least understand what's going on here, even if it's just that i
> > have a fundamental misunderstanding of what the listen backlog is
> > supposed to mean.
> >
> 
> The listen backlog represents the number of received SYNs that have
> not been processed i.e. for which a SYN-ACK has not been sent.
> Actually, the number of SYNs
> that can be pending for processing is actually backlog+1. With a
> backlog of 1, there can be 2 SYNs that can be pending for processing.
> 
> Once a SYN is processed by the server socket (in LISTEN state) and a
> syn-ack is sent back, a request_sock is created to represent it. Once
> the client replies with the last step of connect() i.e. with  an ack,
> a fully established socket is created. The number of queued
> request-socks for a LISTEN socket can be much more than the backlog
> limit given in listen() (which is 1 in your case). If after a short
> period (after SYNACK_RETRIES), the three way handshake is not
> completed, request_socks can be silently discarded.
> 
> When a SYN is received, it is processed by   tcp_v4_conn_request()
> where we have..
> if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
>         goto drop;
> 
> So, for the SYN to be dropped, backlog limit must be exceeded _and_ we
> must have recently accepted another SYN request. So, even when backlog
> limit is exceeded, SYNs are processed and syn-acks are sent back. It
> seems that the listen backlog limit is applied definitively only in
> the third step in tcp_v4_syn_recv_sock() and not in the first step.
> In  tcp_v4_syn_recv_sock(), we have
>  if (sk_acceptq_is_full(sk))
>                 goto exit_overflow;
> 
> This prevents the socket from being created fully. On the client side
> however, since the three way handshake has finished, the socket goes
> into ESTABLISHED state which is what you see with netstat. In your
> test case, typically 2 connections are in state where SYN has to be
> processed  and rest are as request_sock where synacks have been sent.
> However,
> they may not become fully created sockets as they will fail in step 3
> as described above.
> 
> man listen() says
> " The  backlog  argument  defines  the maximum length to which the
> queue of pending connections for sockfd may grow. " In your case where
> backlog is 1, there can be a max of 2 pending connections (SYNs not
> yet processed) and this is what we see. By this interpretation,
> behavior seems correct.
> 
> Not sure if this behavior is a bug but the processing in
> tcp_v4_conn_request()  does look suspicious. Should we terminate
> earlier without doing three way hand shake?
> Perhaps someone who knows this better can clarify.
> 
> Hope this helps.
> Vijay

I came to the same analysis as you.

Current behavior is stupid, because the traffic for such 'sockets' is
insane :

As we sent a SYNACK, client sends the 3rd packet (ACK), and we ignore
it.

Then we keep retransmitting SYNACKS....

Oh well.

21:38:27.459937 IP glaptop.53627 > 172.30.42.23.9877: Flags [S], seq 1124582230, win 14600, options [mss 1460,sackOK,TS val 84038374 ecr 0,nop,wscale 7], length 0
21:38:27.460007 IP 172.30.42.23.9877 > glaptop.53627: Flags [S.], seq 1077519728, ack 1124582231, win 14480, options [mss 1460,sackOK,TS val 4230664 ecr 84038374,nop,wscale 7], length 0
21:38:27.460235 IP glaptop.53627 > 172.30.42.23.9877: Flags [.], ack 1, win 115, options [nop,nop,TS val 84038374 ecr 4230664], length 0

21:38:28.661139 IP 172.30.42.23.9877 > glaptop.53627: Flags [S.], seq 1077519728, ack 1124582231, win 14480, options [mss 1460,sackOK,TS val 4231866 ecr 84038374,nop,wscale 7], length 0
21:38:28.661428 IP glaptop.53627 > 172.30.42.23.9877: Flags [.], ack 1, win 115, options [nop,nop,TS val 84038494 ecr 4231866,nop,nop,sack 1 {0:1}], length 0
21:38:30.661138 IP 172.30.42.23.9877 > glaptop.53627: Flags [S.], seq 1077519728, ack 1124582231, win 14480, options [mss 1460,sackOK,TS val 4233866 ecr 84038494,nop,wscale 7], length 0
21:38:30.661412 IP glaptop.53627 > 172.30.42.23.9877: Flags [.], ack 1, win 115, options [nop,nop,TS val 84038694 ecr 4233866,nop,nop,sack 1 {0:1}], length 0
21:38:35.061135 IP 172.30.42.23.9877 > glaptop.53627: Flags [S.], seq 1077519728, ack 1124582231, win 14480, options [mss 1460,sackOK,TS val 4238266 ecr 84038694,nop,wscale 7], length 0
21:38:35.061413 IP glaptop.53627 > 172.30.42.23.9877: Flags [.], ack 1, win 115, options [nop,nop,TS val 84039134 ecr 4238266,nop,nop,sack 1 {0:1}], length 0
21:38:43.061118 IP 172.30.42.23.9877 > glaptop.53627: Flags [S.], seq 1077519728, ack 1124582231, win 14480, options [mss 1460,sackOK,TS val 4246266 ecr 84039134,nop,wscale 7], length 0
21:38:43.061357 IP glaptop.53627 > 172.30.42.23.9877: Flags [.], ack 1, win 115, options [nop,nop,TS val 84039934 ecr 4246266,nop,nop,sack 1 {0:1}], length 0
21:38:59.061135 IP 172.30.42.23.9877 > glaptop.53627: Flags [S.], seq 1077519728, ack 1124582231, win 14480, options [mss 1460,sackOK,TS val 4262266 ecr 84039934,nop,wscale 7], length 0
21:38:59.061434 IP glaptop.53627 > 172.30.42.23.9877: Flags [.], ack 1, win 115, options [nop,nop,TS val 84041534 ecr 4262266,nop,nop,sack 1 {0:1}], length 0
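
The retransmission pattern in the capture is easier to see as intervals between successive SYNACKs. A minimal sketch follows; the timestamps are the six [S.] send times copied from the trace above (as seconds after 21:38:00), while retransmit_gaps and synack_times are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* SYNACK ([S.]) send times from the capture, in seconds after 21:38:00. */
static const double synack_times[] = {
    27.460007, 28.661139, 30.661138, 35.061135, 43.061118, 59.061135
};

/* Writes t[i + 1] - t[i] into gaps[0 .. n - 2] and returns the number of
 * gaps written (0 when fewer than two timestamps are given). */
static size_t retransmit_gaps(const double *t, size_t n, double *gaps)
{
    assert(t != NULL && gaps != NULL);
    if (n < 2)
        return 0;
    for (size_t i = 0; i + 1 < n; i++)
        gaps[i] = t[i + 1] - t[i];
    return n - 1;
}
```

Applied to synack_times, the five gaps come out at about 1.2, 2.0, 4.4, 8.0 and 16.0 seconds, i.e. the interval roughly doubles each time while every retransmitted SYNACK is answered by an ACK that gets ignored.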


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-19  6:50               ` Eric Dumazet
@ 2012-10-19  8:06                 ` Eric Dumazet
  2012-10-19  9:14                   ` Vijay Subramanian
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-10-19  8:06 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: enh, Venkat Venkatsubra, netdev

On Fri, 2012-10-19 at 08:50 +0200, Eric Dumazet wrote:

> I came to the same analysis than you.
> 
> Current behavior is stupid, because the traffic for such 'sockets' is
> insane :
> 
> As we sent a SYNACK, client sends the 3rd packet (ACK), and we ignore
> it.
> 
> Then we keep retransmitting SYNACKS....
> 
> Oh well.


What about the following patch ?

 include/net/sock.h        |    7 ++++++-
 include/uapi/linux/snmp.h |    1 +
 net/ipv4/proc.c           |    1 +
 net/ipv4/tcp_ipv4.c       |    4 +++-
 net/ipv6/tcp_ipv6.c       |    3 ++-
 5 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0baccb6..d2ecfbe 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -698,9 +698,14 @@ static inline void sk_acceptq_added(struct sock *sk)
 	sk->sk_ack_backlog++;
 }
 
+static inline bool __sk_acceptq_is_full(const struct sock *sk, unsigned int young)
+{
+	return (sk->sk_ack_backlog + young) > sk->sk_max_ack_backlog;
+}
+
 static inline bool sk_acceptq_is_full(const struct sock *sk)
 {
-	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
+	return __sk_acceptq_is_full(sk, 0);
 }
 
 /*
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index fdfba23..5ff2daf 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -245,6 +245,7 @@ enum
 	LINUX_MIB_TCPFASTOPENPASSIVEFAIL,	/* TCPFastOpenPassiveFail */
 	LINUX_MIB_TCPFASTOPENLISTENOVERFLOW,	/* TCPFastOpenListenOverflow */
 	LINUX_MIB_TCPFASTOPENCOOKIEREQD,	/* TCPFastOpenCookieReqd */
+	LINUX_MIB_TCPSYNDROP,			/* TCPSynDrop */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 8de53e1..a5f59ab 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -267,6 +267,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPFastOpenPassiveFail", LINUX_MIB_TCPFASTOPENPASSIVEFAIL),
 	SNMP_MIB_ITEM("TCPFastOpenListenOverflow", LINUX_MIB_TCPFASTOPENLISTENOVERFLOW),
 	SNMP_MIB_ITEM("TCPFastOpenCookieReqd", LINUX_MIB_TCPFASTOPENCOOKIEREQD),
+	SNMP_MIB_ITEM("TCPSynDrop", LINUX_MIB_TCPSYNDROP),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ef998b0..0404926 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1507,7 +1507,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	 * clogging syn queue with openreqs with exponentially increasing
 	 * timeout.
 	 */
-	if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+	if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_young(sk)))
 		goto drop;
 
 	req = inet_reqsk_alloc(&tcp_request_sock_ops);
@@ -1673,6 +1673,7 @@ drop_and_release:
 drop_and_free:
 	reqsk_free(req);
 drop:
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNDROP);
 	return 0;
 }
 EXPORT_SYMBOL(tcp_v4_conn_request);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 26175bf..39ffc54 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1054,7 +1054,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+	if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_young(sk)))
 		goto drop;
 
 	req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
@@ -1204,6 +1204,7 @@ drop_and_release:
 drop_and_free:
 	reqsk_free(req);
 drop:
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNDROP);
 	return 0; /* don't send reset */
 }
 


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-19  8:06                 ` Eric Dumazet
@ 2012-10-19  9:14                   ` Vijay Subramanian
  2012-10-19 10:29                     ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Vijay Subramanian @ 2012-10-19  9:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: enh, Venkat Venkatsubra, netdev

> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ef998b0..0404926 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1507,7 +1507,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>          * clogging syn queue with openreqs with exponentially increasing
>          * timeout.
>          */
> -       if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
> +       if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_young(sk)))
>                 goto drop;
>

For what it's worth, I think the changes make sense. But is there any
reason to exclude old request_socks in the call to
__sk_acceptq_is_full()?
As in:
      if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_len(sk)))
               goto drop;

I am not sure why the current code looks only at young request_socks.
Thanks,
Vijay


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-19  9:14                   ` Vijay Subramanian
@ 2012-10-19 10:29                     ` Eric Dumazet
  2012-10-19 11:39                       ` Eric Dumazet
  2012-10-22 20:00                       ` Vijay Subramanian
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-10-19 10:29 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: enh, Venkat Venkatsubra, netdev

On Fri, 2012-10-19 at 02:14 -0700, Vijay Subramanian wrote:
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index ef998b0..0404926 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1507,7 +1507,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
> >          * clogging syn queue with openreqs with exponentially increasing
> >          * timeout.
> >          */
> > -       if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
> > +       if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_young(sk)))
> >                 goto drop;
> >
> 
> For what its worth, I think the changes make sense. But is there any
> reason to exclude old request_socks in the  call to
> __sk_acceptq_is_full().?
> as in
>       if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_len(sk)))
>                goto drop;
> 
> I am not sure why the current code looks only at young request_socks.
> Thanks,
> Vijay

Old requests are assumed to be unlikely to complete (SYN attack).

young requests are assumed to have a reasonable chance to complete.

Note that we drop the SYN packet, so it's not a 'final' decision: the
client will retransmit it.

Some other OSes send RST in case the listener queue is full
(I tested FreeBSD 9.0 this morning.)

Note also that we probably have a bug elsewhere:

If we send a SYNACK, then receive the ACK from the client, and the acceptq
is full, we should reset the connection. Right now we have a kind of stupid
situation, where we drop the ACK and leave the REQ in the SYN_RECV
state, so we keep retransmitting SYNACKs.

I am working on this part as well.
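
For readers following along, the pre-patch and post-patch SYN drop
checks can be compared in isolation with a minimal userspace sketch.
This is not kernel code: struct listener_model and both helper names
are hypothetical, and the meaning given to "young" (requests whose
SYNACK has not been retransmitted yet) is an assumption based on the
discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few listener fields the check reads. */
struct listener_model {
	unsigned int ack_backlog;     /* connections waiting for accept() */
	unsigned int max_ack_backlog; /* backlog given to listen() */
	unsigned int young;           /* assumed: requests with no SYNACK retransmit yet */
};

/* Pre-patch condition: accept queue over the limit AND more than one young request. */
static bool old_should_drop_syn(const struct listener_model *l)
{
	assert(l != NULL);
	return l->ack_backlog > l->max_ack_backlog && l->young > 1;
}

/* Proposed condition: young requests are counted against the backlog limit. */
static bool new_should_drop_syn(const struct listener_model *l)
{
	assert(l != NULL);
	return l->ack_backlog + l->young > l->max_ack_backlog;
}
```

With ack_backlog == max_ack_backlog == 3 and two young requests, the
old check still admits a new SYN while the proposed one drops it.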


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-19 10:29                     ` Eric Dumazet
@ 2012-10-19 11:39                       ` Eric Dumazet
  2012-10-22 20:00                       ` Vijay Subramanian
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-10-19 11:39 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: enh, Venkat Venkatsubra, netdev

On Fri, 2012-10-19 at 12:29 +0200, Eric Dumazet wrote:
> On Fri, 2012-10-19 at 02:14 -0700, Vijay Subramanian wrote:
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index ef998b0..0404926 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1507,7 +1507,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
> > >          * clogging syn queue with openreqs with exponentially increasing
> > >          * timeout.
> > >          */
> > > -       if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
> > > +       if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_young(sk)))
> > >                 goto drop;
> > >
> > 
> > For what its worth, I think the changes make sense. But is there any
> > reason to exclude old request_socks in the  call to
> > __sk_acceptq_is_full().?
> > as in
> >       if (__sk_acceptq_is_full(sk, inet_csk_reqsk_queue_len(sk)))
> >                goto drop;
> > 
> > I am not sure why the current code looks only at young request_socks.
> > Thanks,
> > Vijay
> 
> Old requests are assumed to be unlikely to complete (SYN attack).
> 
> young requests are assumed to have a reasonable chance to complete.
> 
> Note that we drop the SYN packet, so its not a 'final' decision.
> 
> Some other OSes send RST in case the listener queue is full
> (I tested FreeBSD 9.0 this morning.)
> 
> Note also we probably have a bug elsewhere :
> 
> If we send a SYNACK, then receive the ACK from client, and the acceptq
> is full, we should reset the connexion. Right now we have kind of stupid
> situation, were we drop the ACK, and leave the REQ in the SYN_RECV
> state, so we retransmit SYNACKS.
> 
> I am working on this part as well.
> 

Well, it seems to be a documented feature:

tcp_abort_on_overflow - BOOLEAN
        If listening service is too slow to accept new connections,
        reset them. Default state is FALSE. It means that if overflow
        occurred due to a burst, connection will recover. Enable this
        option _only_ if you are really sure that listening daemon
        cannot be tuned to accept connections faster. Enabling this
        option can harm clients of your server.
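
To see which behavior a given machine is configured for, the setting
can be read from procfs. A minimal sketch, assuming the usual
/proc/sys/net/ipv4/tcp_abort_on_overflow path is available; the helper
name read_bool_sysctl is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read an integer sysctl from the given procfs path.
 * Returns the parsed value, or -1 if the file cannot be opened or parsed.
 */
static int read_bool_sysctl(const char *path)
{
	FILE *f;
	int value = -1;

	assert(path != NULL);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

Calling read_bool_sysctl("/proc/sys/net/ipv4/tcp_abort_on_overflow")
should return 0 on a default configuration, per the text quoted above.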


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-19 10:29                     ` Eric Dumazet
  2012-10-19 11:39                       ` Eric Dumazet
@ 2012-10-22 20:00                       ` Vijay Subramanian
  2012-10-22 20:08                         ` Eric Dumazet
  1 sibling, 1 reply; 20+ messages in thread
From: Vijay Subramanian @ 2012-10-22 20:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, enh, Venkat Venkatsubra, netdev


>
> If we send a SYNACK, then receive the ACK from client, and the acceptq
> is full, we should reset the connexion. Right now we have kind of stupid
> situation, were we drop the ACK, and leave the REQ in the SYN_RECV
> state, so we retransmit SYNACKS.
>


It seems the third ACK is remembered in inet_rsk(req)->acked in
tcp_check_req(). However, because of the order in which the tests are performed,
the server still retransmits the SYNACK needlessly. The following patch
(for review) prevents this SYNACK retransmission if the third ACK has been
received.

The request_sock will expire in around 30 seconds and will be dropped if it does
not move into the accept_queue by then. Maybe we should also call
req->rsk_ops->send_reset(sk,skb);
when the request_sock expires and is dropped?


net/ipv4/inet_connection_sock.c |    5 ++---
  1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d34ce29..4e8e52e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
                                                &expire, &resend);
                                 req->rsk_ops->syn_ack_timeout(parent, req);
                                 if (!expire &&
-                                   (!resend ||
-                                    !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
-                                    inet_rsk(req)->acked)) {
+                                   (!resend || inet_rsk(req)->acked ||
+                                    !req->rsk_ops->rtx_syn_ack(parent, req, NULL))) {
                                         unsigned long timeo;

                                         if (req->retrans++ == 0)

Thanks,
Vijay
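
The effect of the reordering comes purely from C's short-circuit
evaluation of ||, which a small userspace sketch can show. Everything
here is hypothetical: fake_rtx_syn_ack() only counts calls and returns
0 (taken to mean success, as the !rtx_syn_ack() test in the patch
suggests), and the two keep_req_* helpers mirror just the boolean
expression, not the surrounding timer logic.

```c
#include <assert.h>
#include <stdbool.h>

static int rtx_calls;

/* Hypothetical stand-in for rtx_syn_ack(): counts calls, returns 0 for success. */
static int fake_rtx_syn_ack(void)
{
	rtx_calls++;
	return 0;
}

/* Original order: the retransmit runs before 'acked' is consulted. */
static bool keep_req_old(bool resend, bool acked)
{
	return !resend || !fake_rtx_syn_ack() || acked;
}

/* Patched order: 'acked' short-circuits the expression, skipping the retransmit. */
static bool keep_req_new(bool resend, bool acked)
{
	return !resend || acked || !fake_rtx_syn_ack();
}

/* Number of retransmit calls made for a request that was already ACKed. */
static int rtx_calls_when_acked(bool reordered)
{
	bool kept;

	rtx_calls = 0;
	kept = reordered ? keep_req_new(true, true) : keep_req_old(true, true);
	assert(kept); /* both orders keep the request; only the side effect differs */
	return rtx_calls;
}
```

Both orders produce the same truth value, so the request is kept either
way; only the needless retransmit goes away.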


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-22 20:00                       ` Vijay Subramanian
@ 2012-10-22 20:08                         ` Eric Dumazet
  2012-10-22 22:11                           ` Vijay Subramanian
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-10-22 20:08 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: enh, Venkat Venkatsubra, netdev

On Mon, 2012-10-22 at 13:00 -0700, Vijay Subramanian wrote:
> >
> > If we send a SYNACK, then receive the ACK from client, and the acceptq
> > is full, we should reset the connexion. Right now we have kind of stupid
> > situation, were we drop the ACK, and leave the REQ in the SYN_RECV
> > state, so we retransmit SYNACKS.
> >
> 
> 
> It seems the third ack is remembered in inet_rsk(req)->acked in
> tcp_check_req(). However, because of the order in which the tests are performed, 
> server stills retransmits the synack needlessly. Following patch 
> (for review) prevents this synack retransmission if third ack has been 
> received.
> 
> The request_sock will expire in around 30 seconds and will be dropped if it does
> not move into accept_queue by then.  Maybe we should also call 
> req->rsk_ops->send_reset(sk,skb); 
> when the request_sock expires and is dropped?
> 

Not sure it's needed, and we are under stress.

> 
> net/ipv4/inet_connection_sock.c |    5 ++---
>   1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index d34ce29..4e8e52e 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -598,9 +598,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
>                                                 &expire, &resend);
>                                  req->rsk_ops->syn_ack_timeout(parent, req);
>                                  if (!expire &&
> -                                   (!resend ||
> -                                    !req->rsk_ops->rtx_syn_ack(parent, req, NULL) ||
> -                                    inet_rsk(req)->acked)) {
> +                                   (!resend || inet_rsk(req)->acked ||
> +                                    !req->rsk_ops->rtx_syn_ack(parent, req, NULL))) {
>                                          unsigned long timeo;
> 
>                                          if (req->retrans++ == 0)

I wonder then if we don't need to retransmit the SYNACK when the req moves
into the accept_queue?

Or else how can the client 'know' it can send data to the server?

All these facilities sound very complex and not really usable by clients
(i.e. users not willing to wait more than a few seconds anyway).


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-22 20:08                         ` Eric Dumazet
@ 2012-10-22 22:11                           ` Vijay Subramanian
  2012-10-25 22:50                             ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Vijay Subramanian @ 2012-10-22 22:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: enh, Venkat Venkatsubra, netdev

>
> I wonder then if we dont need to retransmit the synack when req moves
> into accept_queue then ?

If I understood the code correctly, the socket moves into accept_queue
only when the
third ack (with or without data) comes in. So, there should be no need
to resend syn-ack. The issue is that there is no mechanism to promote
req sockets which have finished TWHS to accept_queue currently.
Socket can move into accept_queue only when third ack is processed.
If we stop resending synacks, then socket will move into accept_queue
when client sends data.

>
> Or else how the client can 'knows' it can send data to server ?

From the client's point of view, TWHS is finished. The client is already
in the established state and can send data even now. Currently, such
packets with data will be dropped if the accept_queue is full.
If the accept_queue is not full, the socket moves into the accept_queue
and the established state, and processes the data.

I think the only thing my patch does is reorder the tests so that
needless syn-ack retransmissions are stopped.

>
> All these facilities sound very complex and not really usable by clients
> (ie users not willing to wait more than few seconds anyway)
>

Fair enough. We can drop this if it is not worth the trouble or if I
have missed any other scenario.

Thanks for your review and time!
Vijay


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-22 22:11                           ` Vijay Subramanian
@ 2012-10-25 22:50                             ` Eric Dumazet
  2012-10-25 23:16                               ` Vijay Subramanian
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-10-25 22:50 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: enh, Venkat Venkatsubra, netdev

On Mon, 2012-10-22 at 15:11 -0700, Vijay Subramanian wrote:

> >
> > All these facilities sound very complex and not really usable by clients
> > (ie users not willing to wait more than few seconds anyway)
> >
> 
> Fair enough. We can drop this if it is not worth the trouble or if I
> have missed any other scenario.
> 

Sorry, my comment was not about your patch, but about the existing logic.

It seems there is no value in resending the SYNACK, as we already
received the client's ACK.

Please send an official patch ?


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-25 22:50                             ` Eric Dumazet
@ 2012-10-25 23:16                               ` Vijay Subramanian
  0 siblings, 0 replies; 20+ messages in thread
From: Vijay Subramanian @ 2012-10-25 23:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: enh, Venkat Venkatsubra, netdev

On 25 October 2012 15:50, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2012-10-22 at 15:11 -0700, Vijay Subramanian wrote:
>
>> >
>> > All these facilities sound very complex and not really usable by clients
>> > (ie users not willing to wait more than few seconds anyway)
>> >
>>
>> Fair enough. We can drop this if it is not worth the trouble or if I
>> have missed any other scenario.
>>
>
> Sorry my comment was not related to your patch, but existing logic.
>
> It seems there is no value resending SYNACK, as we received the client
> ACK.
>
> Please send an official patch ?
>
>
>
Eric,
I will send a patch shortly.

Thanks,
Vijay


* Re: listen(2) backlog changes in or around Linux 3.1?
  2012-10-16 23:31     ` enh
  2012-10-18 16:00       ` Venkat Venkatsubra
@ 2012-10-18 16:54       ` Eric Dumazet
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-10-18 16:54 UTC (permalink / raw)
  To: enh; +Cc: netdev

On Tue, 2012-10-16 at 16:31 -0700, enh wrote:
> boiling things down to a short C++ program, i see that i can reproduce
> the behavior even on 2.6 kernels. if i run this, i see 4 connections
> immediately (3 + 1, as i'd expect)... but then about 10s later i see
> another 2. and every few seconds after that, i see another 2. i've let
> this run until i have hundreds of connect(2) calls that have returned,
> despite my small listen(2) backlog and the fact that i'm not
> accept(2)ing.
> 
> so i guess the only thing that's changed with newer kernels is timing
> (hell, since i only see newer kernels on newer hardware, it might just
> be a hardware thing).
> 
> and clearly i don't understand what the listen(2) backlog means any more.


Hi Elliott

I would say there is a bug (or several !!), and this needs a fix.

I am investigating.
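
For anyone who wants to reproduce the report, the experiment described
in the quoted text can be approximated as follows. This is a sketch
under assumptions, not Elliott's original C++ program: the function
name, MAX_ATTEMPTS, and the single wait-then-sample approach are all
hypothetical choices, and SOCK_NONBLOCK makes it Linux-specific.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_ATTEMPTS 64

/*
 * Listen on an ephemeral loopback port with the given backlog, never call
 * accept(), and start 'attempts' non-blocking connects. Returns how many
 * connects have completed after wait_ms, or -1 if the listener setup fails.
 */
static int count_unaccepted_connects(int backlog, int attempts, int wait_ms)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	struct pollfd fds[MAX_ATTEMPTS];
	int n = 0, completed = 0, i;
	int lfd;

	assert(backlog >= 0 && attempts >= 0);
	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* port 0: kernel picks one */
	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, backlog) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) {
		close(lfd);
		return -1;
	}
	if (attempts > MAX_ATTEMPTS)
		attempts = MAX_ATTEMPTS;

	for (i = 0; i < attempts; i++) {
		int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);

		if (fd < 0)
			break;
		if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
			completed++;             /* connected immediately */
			fds[n].events = 0;
		} else if (errno == EINPROGRESS) {
			fds[n].events = POLLOUT; /* handshake still pending */
		} else {
			close(fd);
			continue;
		}
		fds[n].fd = fd;
		fds[n].revents = 0;
		n++;
	}

	poll(NULL, 0, wait_ms);                  /* plain sleep */
	if (poll(fds, n, 0) > 0) {               /* sample once, without blocking */
		for (i = 0; i < n; i++) {
			int err = 0;
			socklen_t elen = sizeof(err);

			if ((fds[i].revents & POLLOUT) &&
			    getsockopt(fds[i].fd, SOL_SOCKET, SO_ERROR, &err, &elen) == 0 &&
			    err == 0)
				completed++;
		}
	}
	for (i = 0; i < n; i++)
		close(fds[i].fd);
	close(lfd);
	return completed;
}
```

Comparing, say, count_unaccepted_connects(1, 16, 200) against the same
call with a wait of 15000 ms is one way to look for the extra
connections that show up after about 10 seconds in the report; the
exact counts depend on the kernel version.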

Thanks


end of thread, other threads:[~2012-10-25 23:16 UTC | newest]

Thread overview: 20+ messages
2012-10-12 23:40 listen(2) backlog changes in or around Linux 3.1? enh
2012-10-15 17:12 ` Venkat Venkatsubra
2012-10-15 17:26   ` enh
2012-10-15 21:30     ` Venkat Venkatsubra
2012-10-16 23:31     ` enh
2012-10-18 16:00       ` Venkat Venkatsubra
2012-10-18 16:53         ` Venkat Venkatsubra
2012-10-18 17:20           ` enh
2012-10-19  6:02             ` Vijay Subramanian
2012-10-19  6:50               ` Eric Dumazet
2012-10-19  8:06                 ` Eric Dumazet
2012-10-19  9:14                   ` Vijay Subramanian
2012-10-19 10:29                     ` Eric Dumazet
2012-10-19 11:39                       ` Eric Dumazet
2012-10-22 20:00                       ` Vijay Subramanian
2012-10-22 20:08                         ` Eric Dumazet
2012-10-22 22:11                           ` Vijay Subramanian
2012-10-25 22:50                             ` Eric Dumazet
2012-10-25 23:16                               ` Vijay Subramanian
2012-10-18 16:54       ` Eric Dumazet
