* setsockopt()
@ 2008-07-07 18:18 Olga Kornievskaia
2008-07-07 21:24 ` setsockopt() Stephen Hemminger
2008-07-08 1:17 ` setsockopt() John Heffner
0 siblings, 2 replies; 43+ messages in thread
From: Olga Kornievskaia @ 2008-07-07 18:18 UTC (permalink / raw)
To: netdev; +Cc: Jim Rees, J. Bruce Fields
Hi,
I'd like to ask a question regarding socket options, more
specifically the send and receive buffer sizes.
One simple question: on the server side, is it true that
setsockopt() can only be called before listen() to set the send/receive
buffer sizes? From what I can tell, if I set socket options on the
listening socket, they are inherited by the socket created during
accept(). However, when I try to change the send/receive buffer sizes
on the new socket, the change has no effect.
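For concreteness, here is a minimal userspace sketch of the pattern in
question (this is not NFSD's in-kernel code, which adjusts the socket's
buffer fields directly inside the kernel; the port, the 8MB figure, and
the missing error handling are only there to keep the sketch short):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    int bufsz = 8 * 1024 * 1024;   /* 8MB, the size used in the experiment below */
    struct sockaddr_in addr;
    int csock, cur;
    socklen_t len = sizeof(cur);

    /* Set the buffer sizes on the *listening* socket, before listen().
     * The effective value is capped by net.core.rmem_max/wmem_max. */
    setsockopt(lsock, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));
    setsockopt(lsock, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12049);  /* arbitrary port for the sketch */
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsock, 16);

    /* The accepted socket inherits the listener's buffer sizes; calling
     * setsockopt() again on it is the step that appears to have no
     * effect on the advertised window. */
    csock = accept(lsock, NULL, NULL);
    getsockopt(csock, SOL_SOCKET, SO_RCVBUF, &cur, &len);
    printf("accepted socket SO_RCVBUF = %d\n", cur);

    close(csock);
    close(lsock);
    return 0;
}

The accepted socket starts out with the listener's (kernel-doubled)
values; it is the later setsockopt() on the accepted socket whose effect
never shows up in the advertised window.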
The server in question is the in-kernel NFSD server. NFSD's code
tries to adjust the buffer sizes (so that TCP will increase the
window size appropriately), but it does so after the new socket is
created. As a result, the TCP window never opens beyond TCP's
"default" sysctl value (the 2nd value in the net.ipv4.tcp_rmem
triple, which on our system is set to 64KB). We changed the code so
that setsockopt() is called when the listening socket is created,
setting the buffer sizes to something bigger, like 8MB. We then try
to increase the buffer sizes for each socket created by accept(), but
the network trace shows that the window size doesn't open beyond the
values used for the listening socket.
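For reference, the tcp_rmem triple mentioned above can also be checked
programmatically; a small sketch with no assumptions beyond the standard
/proc path:

#include <stdio.h>

/* Print the net.ipv4.tcp_rmem triple: the minimum, the initial/default
 * size, and the ceiling that the receive buffer may grow to. */
int main(void)
{
    long min, def, max;
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");

    if (!f) {
        perror("tcp_rmem");
        return 1;
    }
    if (fscanf(f, "%ld %ld %ld", &min, &def, &max) == 3)
        printf("tcp_rmem: min=%ld default=%ld max=%ld\n", min, def, max);
    fclose(f);
    return 0;
}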
I looked around in the code. There is a variable called
"window_clamp" that seems to specify the largest possible window
advertisement. window_clamp gets set during the creation of the
accept socket. At that time, its value is based on the sk_rcvbuf of
the listening socket. That would explain why the window doesn't grow
beyond the values used in setsockopt() for the listening socket, even
though the new socket has larger sk_sndbuf and sk_rcvbuf values than
the listening socket.
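As an aside, Linux exposes this clamp to userspace as the
TCP_WINDOW_CLAMP socket option, so it can be read (or raised) on the
accepted socket directly. A diagnostic sketch only, with the 4MB value
purely illustrative; raising the clamp by itself may not be enough if
the receive buffer stays small:

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Inspect (and optionally raise) the window clamp on an accepted TCP
 * socket 'fd'. */
void show_window_clamp(int fd)
{
    int clamp;
    socklen_t len = sizeof(clamp);

    if (getsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, &clamp, &len) == 0)
        printf("current window_clamp = %d bytes\n", clamp);

    clamp = 4 * 1024 * 1024;    /* illustrative new ceiling */
    if (setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, &clamp, sizeof(clamp)) != 0)
        perror("TCP_WINDOW_CLAMP");
}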
I realize that the send/receive buffer sizes and the window
advertisement are different things, but they are related: by telling
TCP that we have a certain amount of memory available for the socket,
we expect it to open a big enough window (provided there is no
congestion).
Can somebody advise us on how to properly set the send/receive
buffer sizes for the in-kernel NFSD such that (1) the window is not
bound by TCP's default sysctl value, and (2) if possible, the sizes
are set on the accepted sockets rather than on the listening socket?
I would appreciate it if we could be CC'd on any replies, as we are
not subscribed to the netdev mailing list.
Thank you.
-Olga
^ permalink raw reply [flat|nested] 43+ messages in thread* Re: setsockopt() 2008-07-07 18:18 setsockopt() Olga Kornievskaia @ 2008-07-07 21:24 ` Stephen Hemminger 2008-07-07 21:30 ` setsockopt() Olga Kornievskaia 2008-07-07 21:32 ` setsockopt() J. Bruce Fields 2008-07-08 1:17 ` setsockopt() John Heffner 1 sibling, 2 replies; 43+ messages in thread From: Stephen Hemminger @ 2008-07-07 21:24 UTC (permalink / raw) To: Olga Kornievskaia; +Cc: netdev, Jim Rees, J. Bruce Fields On Mon, 07 Jul 2008 14:18:38 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote: > Hi, > > I'd like to ask a question regarding socket options, more > specifically send and receive buffer sizes. > > One simple question: (on the server-side) is it true that, to set > send/receive buffer size, setsockopt() can only be called before > listen()? From what I can tell, if I were to set socket options for the > listening socket, they get inherited by the socket created during the > accept(). However, when I try to change send/receive buffer size for the > new socket, they take no affect. > > The server in question is the NFSD server in the kernel. NFSD's code > tries to adjust the buffer size (in order to have TCP increase the > window size appropriately) but it does so after the new socket is > created. It leads to the fact that the TCP window doesn't open beyond > the TCP's "default" sysctl value (that would be the 2nd value in the > triple net.ipv4.tcp_rmem, which on our system is set to 64KB). We > changed the code so that setsockopt() is called for the listening socket > is created and we set the buffer sizes to something bigger, like 8MB. > Then we try to increase the buffer size for each socket created by the > accept() but what is seen on the network trace is that window size > doesn't open beyond the values used for the listening socket. It would be better if NFSD stayed out of doign setsockopt and just let the sender/receiver autotuning work? > I looked around in the code. There is a variable called > "window_clamp" that seems to specifies the largest possible window > advertisement. window_clamp gets set during the creation of the accept > socket. At that time, it's value is based on the sk_rcvbuf of the > listening socket. Thus, that would explain the behavior that window > doesn't grow beyond the values used in setsockopt() for the listening > socket, even though the new socket has new (larger) sk_sndbuf and > sk_rcvbuf than the listening socket. > > I realize that send/receive buffer size and window advertisement are > different but they are related in the way that by telling TCP that we > have a certain amount of memory for socket operations, it should try to > open big enough window (provided that there is no congestion). > > Can somebody advise us on how to properly set send/receive buffer > sizes for the NFSD in the kernel such that (1) the window is not bound > by the TCP's default sysctl value and (2) if it is possible to do so for > the accept sockets and not the listening socket. > > I would appreciate if we could be CC-ed on the reply as we are not > subscribed to the netdev mailing list. > > Thank you. > > -Olga > > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 21:24 ` setsockopt() Stephen Hemminger @ 2008-07-07 21:30 ` Olga Kornievskaia 2008-07-07 21:33 ` setsockopt() Stephen Hemminger ` (2 more replies) 2008-07-07 21:32 ` setsockopt() J. Bruce Fields 1 sibling, 3 replies; 43+ messages in thread From: Olga Kornievskaia @ 2008-07-07 21:30 UTC (permalink / raw) To: Stephen Hemminger; +Cc: netdev, Jim Rees, J. Bruce Fields Stephen Hemminger wrote: > On Mon, 07 Jul 2008 14:18:38 -0400 > Olga Kornievskaia <aglo@citi.umich.edu> wrote: > > >> Hi, >> >> I'd like to ask a question regarding socket options, more >> specifically send and receive buffer sizes. >> >> One simple question: (on the server-side) is it true that, to set >> send/receive buffer size, setsockopt() can only be called before >> listen()? From what I can tell, if I were to set socket options for the >> listening socket, they get inherited by the socket created during the >> accept(). However, when I try to change send/receive buffer size for the >> new socket, they take no affect. >> >> The server in question is the NFSD server in the kernel. NFSD's code >> tries to adjust the buffer size (in order to have TCP increase the >> window size appropriately) but it does so after the new socket is >> created. It leads to the fact that the TCP window doesn't open beyond >> the TCP's "default" sysctl value (that would be the 2nd value in the >> triple net.ipv4.tcp_rmem, which on our system is set to 64KB). We >> changed the code so that setsockopt() is called for the listening socket >> is created and we set the buffer sizes to something bigger, like 8MB. >> Then we try to increase the buffer size for each socket created by the >> accept() but what is seen on the network trace is that window size >> doesn't open beyond the values used for the listening socket. >> > > It would be better if NFSD stayed out of doign setsockopt and just > let the sender/receiver autotuning work? > Auto-tuning would be guided by the sysctl values that are set for all applications. I could be wrong but what I see is that unless an application does a setsockopt(), its window is bound by the default sysctl value. If it is true, than it is not acceptable. It means that in order for NFSD to achieve a large enough window it needs to modify TCP's sysctl value which will effect all other applications. >> I looked around in the code. There is a variable called >> "window_clamp" that seems to specifies the largest possible window >> advertisement. window_clamp gets set during the creation of the accept >> socket. At that time, it's value is based on the sk_rcvbuf of the >> listening socket. Thus, that would explain the behavior that window >> doesn't grow beyond the values used in setsockopt() for the listening >> socket, even though the new socket has new (larger) sk_sndbuf and >> sk_rcvbuf than the listening socket. >> >> I realize that send/receive buffer size and window advertisement are >> different but they are related in the way that by telling TCP that we >> have a certain amount of memory for socket operations, it should try to >> open big enough window (provided that there is no congestion). >> >> Can somebody advise us on how to properly set send/receive buffer >> sizes for the NFSD in the kernel such that (1) the window is not bound >> by the TCP's default sysctl value and (2) if it is possible to do so for >> the accept sockets and not the listening socket. >> >> I would appreciate if we could be CC-ed on the reply as we are not >> subscribed to the netdev mailing list. 
>> >> Thank you. >> >> -Olga >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 21:30 ` setsockopt() Olga Kornievskaia @ 2008-07-07 21:33 ` Stephen Hemminger 2008-07-07 21:49 ` setsockopt() David Miller 2008-07-07 22:50 ` setsockopt() Rick Jones 2 siblings, 0 replies; 43+ messages in thread From: Stephen Hemminger @ 2008-07-07 21:33 UTC (permalink / raw) To: Olga Kornievskaia; +Cc: netdev, Jim Rees, J. Bruce Fields On Mon, 07 Jul 2008 17:30:49 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote: > > > Stephen Hemminger wrote: > > On Mon, 07 Jul 2008 14:18:38 -0400 > > Olga Kornievskaia <aglo@citi.umich.edu> wrote: > > > > > >> Hi, > >> > >> I'd like to ask a question regarding socket options, more > >> specifically send and receive buffer sizes. > >> > >> One simple question: (on the server-side) is it true that, to set > >> send/receive buffer size, setsockopt() can only be called before > >> listen()? From what I can tell, if I were to set socket options for the > >> listening socket, they get inherited by the socket created during the > >> accept(). However, when I try to change send/receive buffer size for the > >> new socket, they take no affect. > >> > >> The server in question is the NFSD server in the kernel. NFSD's code > >> tries to adjust the buffer size (in order to have TCP increase the > >> window size appropriately) but it does so after the new socket is > >> created. It leads to the fact that the TCP window doesn't open beyond > >> the TCP's "default" sysctl value (that would be the 2nd value in the > >> triple net.ipv4.tcp_rmem, which on our system is set to 64KB). We > >> changed the code so that setsockopt() is called for the listening socket > >> is created and we set the buffer sizes to something bigger, like 8MB. > >> Then we try to increase the buffer size for each socket created by the > >> accept() but what is seen on the network trace is that window size > >> doesn't open beyond the values used for the listening socket. > >> > > > > It would be better if NFSD stayed out of doign setsockopt and just > > let the sender/receiver autotuning work? > > > Auto-tuning would be guided by the sysctl values that are set for all > applications. I could be wrong but what I see is that unless an > application does a setsockopt(), its window is bound by the default > sysctl value. If it is true, than it is not acceptable. It means that in > order for NFSD to achieve a large enough window it needs to modify TCP's > sysctl value which will effect all other applications. > Auto tuning starts at the default and will expand to the max allowed. If you set a value with setsockopt, then the kernel just uses that value and does no tuning. ^ permalink raw reply [flat|nested] 43+ messages in thread
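A userspace way to see the behaviour Stephen describes:
getsockopt(SO_SNDBUF) reflects the kernel's current sk_sndbuf, so on a
connection where no explicit SO_SNDBUF has been set the reported size
can be watched growing during a bulk transfer, while a socket that has
called setsockopt() stays pinned at its requested (doubled) value. A
rough sketch; the chunk size and iteration counts are arbitrary:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

/* Watch the kernel grow the send buffer during a bulk write.  'fd' is a
 * connected TCP socket on which no SO_SNDBUF has been set, so sk_sndbuf
 * is free to expand up to tcp_wmem[2]. */
void watch_sndbuf_growth(int fd)
{
    char chunk[64 * 1024];
    int i, sz;
    socklen_t len = sizeof(sz);

    memset(chunk, 0, sizeof(chunk));
    for (i = 0; i < 1000; i++) {
        if (write(fd, chunk, sizeof(chunk)) < 0)
            break;
        if (i % 100 == 0) {
            getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, &len);
            printf("after %d writes: SO_SNDBUF = %d\n", i, sz);
        }
    }
}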
* Re: setsockopt() 2008-07-07 21:30 ` setsockopt() Olga Kornievskaia 2008-07-07 21:33 ` setsockopt() Stephen Hemminger @ 2008-07-07 21:49 ` David Miller 2008-07-08 4:54 ` setsockopt() Evgeniy Polyakov 2008-07-08 20:12 ` setsockopt() Jim Rees 2008-07-07 22:50 ` setsockopt() Rick Jones 2 siblings, 2 replies; 43+ messages in thread From: David Miller @ 2008-07-07 21:49 UTC (permalink / raw) To: aglo; +Cc: shemminger, netdev, rees, bfields From: Olga Kornievskaia <aglo@citi.umich.edu> Date: Mon, 07 Jul 2008 17:30:49 -0400 > Auto-tuning would be guided by the sysctl values that are set for all > applications. I could be wrong but what I see is that unless an > application does a setsockopt(), its window is bound by the default > sysctl value. If it is true, than it is not acceptable. It means that in > order for NFSD to achieve a large enough window it needs to modify TCP's > sysctl value which will effect all other applications. This is nonsense. The kernel autotunes the receive and send buffers based upon the dynamic behavior of the connection. The sysctls only control limits. If you set the socket buffer sizes explicitly, you essentially turn off half of the TCP stack because it won't do dynamic socket buffer sizing afterwards. There is no reason these days to ever explicitly set the socket buffer sizes on TCP sockets under Linux. If something is going wrong it's a bug and we should fix it. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt()
2008-07-07 21:49 ` setsockopt() David Miller
@ 2008-07-08 4:54 ` Evgeniy Polyakov
2008-07-08 6:02 ` setsockopt() Bill Fink
2008-07-08 20:12 ` setsockopt() Jim Rees
1 sibling, 1 reply; 43+ messages in thread
From: Evgeniy Polyakov @ 2008-07-08 4:54 UTC (permalink / raw)
To: David Miller; +Cc: aglo, shemminger, netdev, rees, bfields
Hi.
On Mon, Jul 07, 2008 at 02:49:12PM -0700, David Miller (davem@davemloft.net) wrote:
> There is no reason these days to ever explicitly set the socket
> buffer sizes on TCP sockets under Linux.
>
> If something is going wrong it's a bug and we should fix it.
Just for reference: autosizing is (was?) not always working correctly
for some workloads, at least a couple of years ago.
For example, I worked with small embedded systems with 16-32 MB of RAM
where the socket buffer size never grew beyond about 200Kb (100mbit
network), but the workload was very bursty, so if the remote system
froze for several milliseconds (and sometimes up to a couple of
seconds), the socket buffer was completely filled by the new burst of
data and sending either started to sleep or returned EAGAIN, which
resulted in the semi-realtime data being dropped.
Setting the buffer size explicitly to a large enough value like 8Mb
fixed these burst issues. Another fix was to allocate a buffer each
time data became ready and copy a portion into it, but allocation was
quite slow, which led to unneeded latencies, which again could lead to
data loss.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 43+ messages in thread
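To make the failure mode concrete, a hedged sketch of the burst scenario
described above (the helper and its drop policy are illustrative, not
the actual application code):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Bursty, semi-realtime sender on a non-blocking socket: when the remote
 * side stalls and the send buffer fills, write() fails with EAGAIN and
 * the fresh burst must be dropped or queued in userspace.  A large
 * explicit SO_SNDBUF simply absorbs the burst instead. */
ssize_t send_burst(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;    /* buffer full: caller drops or requeues the burst */
    return n;
}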
* Re: setsockopt() 2008-07-08 4:54 ` setsockopt() Evgeniy Polyakov @ 2008-07-08 6:02 ` Bill Fink 2008-07-08 6:29 ` setsockopt() Roland Dreier 0 siblings, 1 reply; 43+ messages in thread From: Bill Fink @ 2008-07-08 6:02 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: David Miller, aglo, shemminger, netdev, rees, bfields On Tue, 8 Jul 2008, Evgeniy Polyakov wrote: > On Mon, Jul 07, 2008 at 02:49:12PM -0700, David Miller (davem@davemloft.net) wrote: > > There is no reason these days to ever explicitly set the socket > > buffer sizes on TCP sockets under Linux. > > > > If something is going wrong it's a bug and we should fix it. > > Just for the reference: autosizing is (was?) not always working correctly > for some workloads at least couple of years ago. > For example I worked with small enough embedded systems with 16-32 MB > of RAM where socket buffer size never grew up more than 200Kb (100mbit > network), but workload was very bursty, so if remote system froze for > several milliseconds (and sometimes upto couple of seconds), socket > buffer was completely filled with new burst of data and either sending > started to sleep or returned EAGAIN, which resulted in semi-realtime > data to be dropped. > > Setting buffer size explicitely to large enough value like 8Mb fixed > this burst issues. Another fix was to allocate data each time it becomes > ready and copy portion to this buffer, but allocation was quite slow, > which led to unneded latencies, which again could lead to data loss. I admittedly haven't tested on the latest greatest kernel versions, but the testing I have done on large RTT 10-GigE networks, if I want to get the ultimate TCP performance I still need to explicitly set the socket buffer sizes, although I give kudos to the autotuning which does remarkably well. Here's a comparison across an ~72 ms RTT 10-GigE path (sender is 2.6.20.7 and receiver is 2.6.22.9). 
Autotuning (30-second TCP test with 1-second interval reports): # nuttcp -T30 -i1 192.168.21.82 nuttcp-6.0.1: Using beta version: retrans interface/output subject to change (to suppress this message use "-f-beta") 7.2500 MB / 1.01 sec = 60.4251 Mbps 0 retrans 43.6875 MB / 1.00 sec = 366.4509 Mbps 0 retrans 169.4375 MB / 1.00 sec = 1421.2296 Mbps 0 retrans 475.3125 MB / 1.00 sec = 3986.8873 Mbps 0 retrans 827.6250 MB / 1.00 sec = 6942.0247 Mbps 0 retrans 877.6250 MB / 1.00 sec = 7361.2792 Mbps 0 retrans 878.1250 MB / 1.00 sec = 7365.7750 Mbps 0 retrans 878.4375 MB / 1.00 sec = 7368.2710 Mbps 0 retrans 878.3750 MB / 1.00 sec = 7367.7173 Mbps 0 retrans 878.7500 MB / 1.00 sec = 7370.6932 Mbps 0 retrans 878.8125 MB / 1.00 sec = 7371.6818 Mbps 0 retrans 879.1875 MB / 1.00 sec = 7374.5546 Mbps 0 retrans 878.6875 MB / 1.00 sec = 7370.3754 Mbps 0 retrans 878.2500 MB / 1.00 sec = 7366.3742 Mbps 0 retrans 878.6875 MB / 1.00 sec = 7370.6407 Mbps 0 retrans 878.8125 MB / 1.00 sec = 7371.4239 Mbps 0 retrans 878.5000 MB / 1.00 sec = 7368.8174 Mbps 0 retrans 879.0625 MB / 1.00 sec = 7373.4766 Mbps 0 retrans 878.8125 MB / 1.00 sec = 7371.4386 Mbps 0 retrans 878.3125 MB / 1.00 sec = 7367.2152 Mbps 0 retrans 878.8125 MB / 1.00 sec = 7371.3723 Mbps 0 retrans 878.6250 MB / 1.00 sec = 7369.8585 Mbps 0 retrans 878.8125 MB / 1.00 sec = 7371.4460 Mbps 0 retrans 875.5000 MB / 1.00 sec = 7373.0401 Mbps 0 retrans 878.8125 MB / 1.00 sec = 7371.5123 Mbps 0 retrans 878.3750 MB / 1.00 sec = 7367.5037 Mbps 0 retrans 878.5000 MB / 1.00 sec = 7368.9647 Mbps 0 retrans 879.4375 MB / 1.00 sec = 7376.6073 Mbps 0 retrans 878.8750 MB / 1.00 sec = 7371.8891 Mbps 0 retrans 878.4375 MB / 1.00 sec = 7368.3521 Mbps 0 retrans 23488.6875 MB / 30.10 sec = 6547.0228 Mbps 81 %TX 49 %RX 0 retrans Same test but with explicitly specified 100 MB socket buffer: # nuttcp -T30 -i1 -w100m 192.168.21.82 nuttcp-6.0.1: Using beta version: retrans interface/output subject to change (to suppress this message use "-f-beta") 7.1250 MB / 1.01 sec = 59.4601 Mbps 0 retrans 120.3750 MB / 1.00 sec = 1009.7464 Mbps 0 retrans 859.4375 MB / 1.00 sec = 7208.5832 Mbps 0 retrans 939.3125 MB / 1.00 sec = 7878.9965 Mbps 0 retrans 935.5000 MB / 1.00 sec = 7847.0249 Mbps 0 retrans 934.8125 MB / 1.00 sec = 7841.1248 Mbps 0 retrans 933.8125 MB / 1.00 sec = 7832.7291 Mbps 0 retrans 933.1875 MB / 1.00 sec = 7827.5727 Mbps 0 retrans 932.1875 MB / 1.00 sec = 7819.1300 Mbps 0 retrans 933.1250 MB / 1.00 sec = 7826.8059 Mbps 0 retrans 933.3125 MB / 1.00 sec = 7828.6760 Mbps 0 retrans 933.0000 MB / 1.00 sec = 7825.9608 Mbps 0 retrans 932.6875 MB / 1.00 sec = 7823.1753 Mbps 0 retrans 932.0625 MB / 1.00 sec = 7818.0268 Mbps 0 retrans 931.7500 MB / 1.00 sec = 7815.6088 Mbps 0 retrans 931.0625 MB / 1.00 sec = 7809.7717 Mbps 0 retrans 931.5000 MB / 1.00 sec = 7813.3711 Mbps 0 retrans 931.8750 MB / 1.00 sec = 7816.4931 Mbps 0 retrans 932.0625 MB / 1.00 sec = 7817.8157 Mbps 0 retrans 931.5000 MB / 1.00 sec = 7813.4180 Mbps 0 retrans 931.6250 MB / 1.00 sec = 7814.5134 Mbps 0 retrans 931.6250 MB / 1.00 sec = 7814.4821 Mbps 0 retrans 931.3125 MB / 1.00 sec = 7811.7124 Mbps 0 retrans 930.8750 MB / 1.00 sec = 7808.0818 Mbps 0 retrans 931.0625 MB / 1.00 sec = 7809.6233 Mbps 0 retrans 930.6875 MB / 1.00 sec = 7806.6964 Mbps 0 retrans 931.2500 MB / 1.00 sec = 7811.0164 Mbps 0 retrans 931.3125 MB / 1.00 sec = 7811.9077 Mbps 0 retrans 931.3750 MB / 1.00 sec = 7812.3617 Mbps 0 retrans 931.4375 MB / 1.00 sec = 7812.6750 Mbps 0 retrans 26162.6875 MB / 30.15 sec = 7279.7648 Mbps 93 %TX 54 %RX 0 
retrans As you can see, the autotuned case maxed out at about 7.37 Gbps, whereas by explicitly specifying a 100 MB socket buffer it was possible to achieve a somewhat higher rate of about 7.81 Gbps. Admittedly the autotuning did great, with a difference of only about 6 %, but if you want to squeeze the last drop of performance out of your network, explicitly setting the socket buffer sizes can still be helpful in certain situations (perhaps newer kernels have reduced the gap even more). But I would definitely agree with the general recommendation to just take advantage of the excellent Linux TCP autotuning for most common scenarios. -Bill ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 6:02 ` setsockopt() Bill Fink @ 2008-07-08 6:29 ` Roland Dreier 2008-07-08 6:43 ` setsockopt() Evgeniy Polyakov ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread From: Roland Dreier @ 2008-07-08 6:29 UTC (permalink / raw) To: Bill Fink Cc: Evgeniy Polyakov, David Miller, aglo, shemminger, netdev, rees, bfields Interesting... I'd not tried nuttcp before, and on my testbed, which is a very high-bandwidth, low-RTT network (IP-over-InfiniBand with DDR IB, so the network is capable of 16 Gbps, and the RTT is ~25 microseconds), the difference between autotuning and not for nuttcp is huge (testing with 2.6.26-rc8 plus some pending 2.6.27 patches that add checksum offload, LSO and LRO to the IP-over-IB driver): nuttcp -T30 -i1 ends up with: 14465.0625 MB / 30.01 sec = 4043.6073 Mbps 82 %TX 2 %RX while setting the window even to 128 KB with nuttcp -w128k -T30 -i1 ends up with: 36416.8125 MB / 30.00 sec = 10182.8137 Mbps 90 %TX 96 %RX so it's a factor of 2.5 with nuttcp. I've never seen other apps behave like that -- for example NPtcp (netpipe) only gets slower when explicitly setting the window size. Strange... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 6:29 ` setsockopt() Roland Dreier @ 2008-07-08 6:43 ` Evgeniy Polyakov 2008-07-08 7:03 ` setsockopt() Roland Dreier 2008-07-08 18:48 ` setsockopt() Bill Fink 2008-07-08 20:48 ` setsockopt() Stephen Hemminger 2 siblings, 1 reply; 43+ messages in thread From: Evgeniy Polyakov @ 2008-07-08 6:43 UTC (permalink / raw) To: Roland Dreier Cc: Bill Fink, David Miller, aglo, shemminger, netdev, rees, bfields On Mon, Jul 07, 2008 at 11:29:31PM -0700, Roland Dreier (rdreier@cisco.com) wrote: > nuttcp -T30 -i1 ends up with: > > 14465.0625 MB / 30.01 sec = 4043.6073 Mbps 82 %TX 2 %RX > > while setting the window even to 128 KB with > nuttcp -w128k -T30 -i1 ends up with: > > 36416.8125 MB / 30.00 sec = 10182.8137 Mbps 90 %TX 96 %RX > > so it's a factor of 2.5 with nuttcp. I've never seen other apps behave > like that -- for example NPtcp (netpipe) only gets slower when > explicitly setting the window size. > > Strange... Maybe nuttcp by default sets very small socket buffer size? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt()
2008-07-08 6:43 ` setsockopt() Evgeniy Polyakov
@ 2008-07-08 7:03 ` Roland Dreier
0 siblings, 0 replies; 43+ messages in thread
From: Roland Dreier @ 2008-07-08 7:03 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Bill Fink, David Miller, aglo, shemminger, netdev, rees, bfields
> Maybe nuttcp by default sets very small socket buffer size?
strace shows no setsockopt() calls other than SO_REUSEADDR in the
default case.
- R.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 6:29 ` setsockopt() Roland Dreier 2008-07-08 6:43 ` setsockopt() Evgeniy Polyakov @ 2008-07-08 18:48 ` Bill Fink 2008-07-09 18:10 ` setsockopt() Roland Dreier 2008-07-08 20:48 ` setsockopt() Stephen Hemminger 2 siblings, 1 reply; 43+ messages in thread From: Bill Fink @ 2008-07-08 18:48 UTC (permalink / raw) To: Roland Dreier Cc: Evgeniy Polyakov, David Miller, aglo, shemminger, netdev, rees, bfields Hi Roland, I think you set a new nuttcp speed record. :-) I've merely had 10-GigE networks to play with. On Mon, 07 Jul 2008, Roland Dreier wrote: > Interesting... I'd not tried nuttcp before, and on my testbed, which is > a very high-bandwidth, low-RTT network (IP-over-InfiniBand with DDR IB, > so the network is capable of 16 Gbps, and the RTT is ~25 microseconds), > the difference between autotuning and not for nuttcp is huge (testing > with 2.6.26-rc8 plus some pending 2.6.27 patches that add checksum > offload, LSO and LRO to the IP-over-IB driver): > > nuttcp -T30 -i1 ends up with: > > 14465.0625 MB / 30.01 sec = 4043.6073 Mbps 82 %TX 2 %RX > > while setting the window even to 128 KB with > nuttcp -w128k -T30 -i1 ends up with: > > 36416.8125 MB / 30.00 sec = 10182.8137 Mbps 90 %TX 96 %RX > > so it's a factor of 2.5 with nuttcp. I've never seen other apps behave > like that -- for example NPtcp (netpipe) only gets slower when > explicitly setting the window size. > > Strange... It is strange. The first case just uses the TCP autotuning, since as you discovered, nuttcp doesn't make any SO_SNDBUF/SO_RCVBUF setsockopt() calls unless you explicitly set the "-w" option. Perhaps the maximum value for tcp_rmem/tcp_wmem is smallish on your systems (check both client and server). On my system: # cat /proc/sys/net/ipv4/tcp_rmem 4096 524288 104857600 # cat /proc/sys/net/ipv4/tcp_wmem 4096 524288 104857600 IIRC the explicit setting of SO_SNDBUF/SO_RCVBUF is instead governed by rmem_max/wmem_max. # cat /proc/sys/net/core/rmem_max 104857600 # cat /proc/sys/net/core/wmem_max 104857600 The other weird thing about your test is the huge difference in the receiver (and server in this case) CPU utilization between the autotuning and explicit setting cases (2 %RX versus 96 %RX). -Bill ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 18:48 ` setsockopt() Bill Fink @ 2008-07-09 18:10 ` Roland Dreier 2008-07-09 18:34 ` setsockopt() Evgeniy Polyakov 0 siblings, 1 reply; 43+ messages in thread From: Roland Dreier @ 2008-07-09 18:10 UTC (permalink / raw) To: Bill Fink Cc: Evgeniy Polyakov, David Miller, aglo, shemminger, netdev, rees, bfields > The other weird thing about your test is the huge difference in > the receiver (and server in this case) CPU utilization between the > autotuning and explicit setting cases (2 %RX versus 96 %RX). I think I found another clue -- it seems that CPU affinity has something to do with the results. Usually I pin the adapter interrupt to CPU 0 and use "taskset 4" to pin the benchmarking process to CPU 2 (this leads to the best performance for these particular systems in almost all benchmarks). But with nuttcp I see the following: with taskset 4: $ taskset 4 nuttcp -T30 192.168.145.73 9911.3125 MB / 30.01 sec = 2770.3202 Mbps 42 %TX 10 %RX $ taskset 4 nuttcp -w128k -T30 192.168.145.73 36241.9375 MB / 30.00 sec = 10133.8931 Mbps 89 %TX 96 %RX with no taskset (ie let kernel schedule as it wants to): $ nuttcp -T30 192.168.145.73 36689.6875 MB / 30.00 sec = 10259.1525 Mbps 99 %TX 96 %RX $ nuttcp -w128k -T30 192.168.145.73 36486.0000 MB / 30.00 sec = 10202.1870 Mbps 74 %TX 95 %RX so somehow setting the window helps with the scheduling of processes... I guess autotuning lets some queue get too long or something like that. The actual window doesn't matter too much, since the delay of the network is low enough that even though the bandwidth is very high, the BDP is quite small. (With a 25 usec RTT, a 128 KB window should be enough for 40 Gbps, well over the raw link speed of 16 Gbps that I have) - R. ^ permalink raw reply [flat|nested] 43+ messages in thread
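For reference, the pinning Roland gets with "taskset 4" can also be done
from inside the benchmark process itself; a sketch using the Linux
affinity API (the CPU number matches the taskset example above; pinning
the NIC interrupt is done separately, e.g. via /proc/irq/*/smp_affinity):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Equivalent of "taskset 4 nuttcp ...": pin the calling process to CPU 2
 * (mask 0x4) before starting the benchmark loop. */
int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the sender/receiver here ... */
    return 0;
}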
* Re: setsockopt() 2008-07-09 18:10 ` setsockopt() Roland Dreier @ 2008-07-09 18:34 ` Evgeniy Polyakov 2008-07-10 2:50 ` setsockopt() Bill Fink 0 siblings, 1 reply; 43+ messages in thread From: Evgeniy Polyakov @ 2008-07-09 18:34 UTC (permalink / raw) To: Roland Dreier Cc: Bill Fink, David Miller, aglo, shemminger, netdev, rees, bfields On Wed, Jul 09, 2008 at 11:10:31AM -0700, Roland Dreier (rdreier@cisco.com) wrote: > so somehow setting the window helps with the scheduling of > processes... I guess autotuning lets some queue get too long or > something like that. The actual window doesn't matter too much, since > the delay of the network is low enough that even though the bandwidth is > very high, the BDP is quite small. (With a 25 usec RTT, a 128 KB window > should be enough for 40 Gbps, well over the raw link speed of 16 Gbps > that I have) That may be cache issues: depending on what application does it can be useful or not to be bound to the same CPU. I suppose if benchmark looks into the packet content, then it likely wants to be on the same CPU to aliminate cache line ping-pongs, otherwise it only needs to be awakened to send/receive next chunk, so having it on different CPU may result in better its utilization... -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-09 18:34 ` setsockopt() Evgeniy Polyakov @ 2008-07-10 2:50 ` Bill Fink 2008-07-10 17:26 ` setsockopt() Rick Jones 0 siblings, 1 reply; 43+ messages in thread From: Bill Fink @ 2008-07-10 2:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Roland Dreier, David Miller, aglo, shemminger, netdev, rees, bfields On Wed, 9 Jul 2008, Evgeniy Polyakov wrote: > On Wed, Jul 09, 2008 at 11:10:31AM -0700, Roland Dreier (rdreier@cisco.com) wrote: > > so somehow setting the window helps with the scheduling of > > processes... I guess autotuning lets some queue get too long or > > something like that. The actual window doesn't matter too much, since > > the delay of the network is low enough that even though the bandwidth is > > very high, the BDP is quite small. (With a 25 usec RTT, a 128 KB window > > should be enough for 40 Gbps, well over the raw link speed of 16 Gbps > > that I have) > > That may be cache issues: depending on what application does it can be > useful or not to be bound to the same CPU. I suppose if benchmark looks > into the packet content, then it likely wants to be on the same CPU to > aliminate cache line ping-pongs, otherwise it only needs to be awakened > to send/receive next chunk, so having it on different CPU may result in > better its utilization... In my own network benchmarking experience, I've generally gotten the best performance results when the nuttcp application and the NIC interrupts are on the same CPU, which I understood was because of cache effects. I wonder if the "-w128" forces the socket buffer to a small enough size that it totally fits in cache and this helps the performance. -Bill ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-10 2:50 ` setsockopt() Bill Fink @ 2008-07-10 17:26 ` Rick Jones 2008-07-11 0:50 ` setsockopt() Bill Fink 0 siblings, 1 reply; 43+ messages in thread From: Rick Jones @ 2008-07-10 17:26 UTC (permalink / raw) To: Bill Fink Cc: Evgeniy Polyakov, Roland Dreier, David Miller, aglo, shemminger, netdev, rees, bfields > In my own network benchmarking experience, I've generally gotten the > best performance results when the nuttcp application and the NIC > interrupts are on the same CPU, which I understood was because of > cache effects. Interestingly enough I have a slightly different experience: *) single-transaction, single-stream TCP_RR - best when app and NIC use same core *) bulk transfer - either TCP_STREAM or aggregate TCP_RR: a) enough CPU on one core to reach max tput, best when same core b) not enough, tput max when app and NIC on separate cores, preferably cores sharing some cache That is in the context of either maximizing throughput or minimizing latency. If the context is most efficient transfer, then in all cases my experience thusfar agrees with yours. rick jones ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-10 17:26 ` setsockopt() Rick Jones @ 2008-07-11 0:50 ` Bill Fink 0 siblings, 0 replies; 43+ messages in thread From: Bill Fink @ 2008-07-11 0:50 UTC (permalink / raw) To: Rick Jones Cc: Evgeniy Polyakov, Roland Dreier, David Miller, aglo, shemminger, netdev, rees, bfields On Thu, 10 Jul 2008, Rick Jones wrote: > > In my own network benchmarking experience, I've generally gotten the > > best performance results when the nuttcp application and the NIC > > interrupts are on the same CPU, which I understood was because of > > cache effects. > > Interestingly enough I have a slightly different experience: > > *) single-transaction, single-stream TCP_RR - best when app and NIC use > same core > > *) bulk transfer - either TCP_STREAM or aggregate TCP_RR: > a) enough CPU on one core to reach max tput, best when same core > b) not enough, tput max when app and NIC on separate cores, > preferably cores sharing some cache > > That is in the context of either maximizing throughput or minimizing > latency. If the context is most efficient transfer, then in all cases > my experience thusfar agrees with yours. Yes, I was talking about single stream bulk data transfers, where the CPU was not a limiting factor (just barely when doing full 10-GigE line rate transfers with 9000-byte jumbo frames). On multiple stream tests there can be a benefit to spreading the load across multiple cores. -Bill ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 6:29 ` setsockopt() Roland Dreier 2008-07-08 6:43 ` setsockopt() Evgeniy Polyakov 2008-07-08 18:48 ` setsockopt() Bill Fink @ 2008-07-08 20:48 ` Stephen Hemminger 2008-07-08 22:05 ` setsockopt() Bill Fink 2 siblings, 1 reply; 43+ messages in thread From: Stephen Hemminger @ 2008-07-08 20:48 UTC (permalink / raw) To: Roland Dreier Cc: Bill Fink, Evgeniy Polyakov, David Miller, aglo, shemminger, netdev, rees, bfields On Mon, 07 Jul 2008 23:29:31 -0700 Roland Dreier <rdreier@cisco.com> wrote: > Interesting... I'd not tried nuttcp before, and on my testbed, which is > a very high-bandwidth, low-RTT network (IP-over-InfiniBand with DDR IB, > so the network is capable of 16 Gbps, and the RTT is ~25 microseconds), > the difference between autotuning and not for nuttcp is huge (testing > with 2.6.26-rc8 plus some pending 2.6.27 patches that add checksum > offload, LSO and LRO to the IP-over-IB driver): > > nuttcp -T30 -i1 ends up with: > > 14465.0625 MB / 30.01 sec = 4043.6073 Mbps 82 %TX 2 %RX > > while setting the window even to 128 KB with > nuttcp -w128k -T30 -i1 ends up with: > > 36416.8125 MB / 30.00 sec = 10182.8137 Mbps 90 %TX 96 %RX > > so it's a factor of 2.5 with nuttcp. I've never seen other apps behave > like that -- for example NPtcp (netpipe) only gets slower when > explicitly setting the window size. > > Strange... I suspect that the link is so fast that the window growth isn't happening fast enough. With only a 30 second test, you probably barely made it out of TCP slow start. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 20:48 ` setsockopt() Stephen Hemminger @ 2008-07-08 22:05 ` Bill Fink 2008-07-09 5:25 ` setsockopt() Evgeniy Polyakov 0 siblings, 1 reply; 43+ messages in thread From: Bill Fink @ 2008-07-08 22:05 UTC (permalink / raw) To: Stephen Hemminger Cc: Roland Dreier, Evgeniy Polyakov, David Miller, aglo, shemminger, netdev, rees, bfields On Tue, 8 Jul 2008, Stephen Hemminger wrote: > On Mon, 07 Jul 2008 23:29:31 -0700 > Roland Dreier <rdreier@cisco.com> wrote: > > > Interesting... I'd not tried nuttcp before, and on my testbed, which is > > a very high-bandwidth, low-RTT network (IP-over-InfiniBand with DDR IB, > > so the network is capable of 16 Gbps, and the RTT is ~25 microseconds), > > the difference between autotuning and not for nuttcp is huge (testing > > with 2.6.26-rc8 plus some pending 2.6.27 patches that add checksum > > offload, LSO and LRO to the IP-over-IB driver): > > > > nuttcp -T30 -i1 ends up with: > > > > 14465.0625 MB / 30.01 sec = 4043.6073 Mbps 82 %TX 2 %RX > > > > while setting the window even to 128 KB with > > nuttcp -w128k -T30 -i1 ends up with: > > > > 36416.8125 MB / 30.00 sec = 10182.8137 Mbps 90 %TX 96 %RX > > > > so it's a factor of 2.5 with nuttcp. I've never seen other apps behave > > like that -- for example NPtcp (netpipe) only gets slower when > > explicitly setting the window size. > > > > Strange... > > I suspect that the link is so fast that the window growth isn't happening > fast enough. With only a 30 second test, you probably barely made it > out of TCP slow start. Nah. 30 seconds is plenty of time. I got up to nearly 8 Gbps in 4 seconds (see my test report in earlier message in this thread), and that was on an ~72 ms RTT network path. Roland's IB network only has a ~25 usec RTT. BTW I believe there is one other important difference between the way the tcp_rmem/tcp_wmem autotuning parameters are handled versus the way the rmem_max/wmem_max parameters are used when explicitly setting the socket buffer sizes. I believe the tcp_rmem/tcp_wmem autotuning maximum parameters are hard limits, with the default maximum tcp_rmem setting being ~170 KB and the default maximum tcp_wmem setting being 128 KB. On the other hand, I believe the rmem_max/wmem_max determines the maximum value allowed to be set via the SO_RCVBUF/SO_SNDBUF setsockopt() call. But then Linux doubles the requested value, so when Roland specified a "-w128" nuttcp parameter, he actually got a socket buffer size of 256 KB, which would thus be double that available in the autotuning case assuming the tcp_rmem/tcp_wmem settings are using their default values. This could then account for a factor of 2 X between the two test cases. The "-v" verbose option to nuttcp might shed some light on this hypothesis. -Bill ^ permalink raw reply [flat|nested] 43+ messages in thread
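A quick way to check the doubling hypothesis on any given kernel (the
requested value is first capped by net.core.rmem_max and then doubled,
so on a typical configuration this prints 262144 for a 128 KB request):

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask for a 128 KB receive buffer and read back what the kernel actually
 * booked for the socket. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int req = 128 * 1024, got;
    socklen_t len = sizeof(got);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("requested %d, kernel reports %d\n", req, got);
    return 0;
}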
* Re: setsockopt()
2008-07-08 22:05 ` setsockopt() Bill Fink
@ 2008-07-09 5:25 ` Evgeniy Polyakov
2008-07-09 5:47 ` setsockopt() Bill Fink
0 siblings, 1 reply; 43+ messages in thread
From: Evgeniy Polyakov @ 2008-07-09 5:25 UTC (permalink / raw)
To: Bill Fink
Cc: Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees, bfields
On Tue, Jul 08, 2008 at 06:05:00PM -0400, Bill Fink (billfink@mindspring.com) wrote:
> BTW I believe there is one other important difference between the way
> the tcp_rmem/tcp_wmem autotuning parameters are handled versus the way
> the rmem_max/wmem_max parameters are used when explicitly setting the
> socket buffer sizes. I believe the tcp_rmem/tcp_wmem autotuning maximum
> parameters are hard limits, with the default maximum tcp_rmem setting
> being ~170 KB and the default maximum tcp_wmem setting being 128 KB.
The maximum tcp_wmem depends on the amount of available RAM, but is at
least 64k. Maybe Roland's distro set a hard limit of just 128k...
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-09 5:25 ` setsockopt() Evgeniy Polyakov @ 2008-07-09 5:47 ` Bill Fink 2008-07-09 6:03 ` setsockopt() Evgeniy Polyakov 0 siblings, 1 reply; 43+ messages in thread From: Bill Fink @ 2008-07-09 5:47 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees, bfields On Wed, 9 Jul 2008, Evgeniy Polyakov wrote: > On Tue, Jul 08, 2008 at 06:05:00PM -0400, Bill Fink (billfink@mindspring.com) wrote: > > BTW I believe there is one other important difference between the way > > the tcp_rmem/tcp_wmem autotuning parameters are handled versus the way > > the rmem_max/wmem_max parameters are used when explicitly setting the > > socket buffer sizes. I believe the tcp_rmem/tcp_wmem autotuning maximum > > parameters are hard limits, with the default maximum tcp_rmem setting > > being ~170 KB and the default maximum tcp_wmem setting being 128 KB. > > Maximum tcp_wmem depends on amount of available RAM, but at least 64k. > Maybe Reoland's distro set hard limit just to 128k... Are you sure you're not thinking about tcp_mem, which is a function of available memory, or has this been changed in more recent kernels? The 2.6.22.9 Documentation/networking/ip-sysctl.txt indicates: tcp_wmem - vector of 3 INTEGERs: min, default, max ... max: Maximal amount of memory allowed for automatically selected send buffers for TCP socket. This value does not override net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. Default: 128K I also ran a purely local 10-GigE nuttcp TCP test, with and without autotuning (0.13 ms RTT). Autotuning (standard 10-second TCP test): # nuttcp 192.168.88.13 ... 11818.0625 MB / 10.01 sec = 9906.0223 Mbps 100 %TX 72 %RX 0 retrans Same test but with explicitly specified 1 MB socket buffer: # nuttcp -w1m 192.168.88.13 ... 11818.0000 MB / 10.01 sec = 9902.0102 Mbps 99 %TX 71 %RX 0 retrans The TCP autotuning worked great, with both tests basically achieving full 10-GigE line rate. The test with the TCP autotuning actually did slightly better than the test where an explicitly specified 1 MB socket buffer was used, although this could just be within the margin of error of the testing. -Bill ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-09 5:47 ` setsockopt() Bill Fink @ 2008-07-09 6:03 ` Evgeniy Polyakov 2008-07-09 18:11 ` setsockopt() J. Bruce Fields 0 siblings, 1 reply; 43+ messages in thread From: Evgeniy Polyakov @ 2008-07-09 6:03 UTC (permalink / raw) To: Bill Fink Cc: Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees, bfields On Wed, Jul 09, 2008 at 01:47:58AM -0400, Bill Fink (billfink@mindspring.com) wrote: > Are you sure you're not thinking about tcp_mem, which is a function > of available memory, or has this been changed in more recent kernels? > The 2.6.22.9 Documentation/networking/ip-sysctl.txt indicates: In 2.6.25 tcp_mem depends on amount of ram, and third tcp_wmem is calculated based on it: /* Set per-socket limits to no more than 1/128 the pressure threshold */ limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7); max_share = min(4UL*1024*1024, limit); sysctl_tcp_wmem[0] = SK_MEM_QUANTUM; sysctl_tcp_wmem[1] = 16*1024; sysctl_tcp_wmem[2] = max(64*1024, max_share); sysctl_tcp_rmem[0] = SK_MEM_QUANTUM; sysctl_tcp_rmem[1] = 87380; sysctl_tcp_rmem[2] = max(87380, max_share); > tcp_wmem - vector of 3 INTEGERs: min, default, max > ... > max: Maximal amount of memory allowed for automatically selected > send buffers for TCP socket. This value does not override > net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. > Default: 128K Yeah, its a bit confusing. It probably was copypasted, there is no default, but minimum possible value. > > The TCP autotuning worked great, with both tests basically achieving > full 10-GigE line rate. The test with the TCP autotuning actually > did slightly better than the test where an explicitly specified 1 MB > socket buffer was used, although this could just be within the margin > of error of the testing. If you will check tcp_wmem on your machine, it will likely show that tcp_wmem[max] is far larger than 128k. It is equal to 1mb on my old laptop with 256mb of ram, I suppose machines equipped with 10gige network adapters usually have slightly more. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 43+ messages in thread
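Plugging a value into the snippet above shows how the ceiling scales
with RAM: with PAGE_SHIFT = 12 and a purely hypothetical tcp_mem[1] of
49152 pages (an example figure, not a measured value), the formula gives
a 1.5 MB cap, and max_share saturates at 4 MB on larger machines:

#include <stdio.h>

/* Worked example of the kernel formula quoted above. */
int main(void)
{
    unsigned long tcp_mem_pressure = 49152;              /* pages (example) */
    unsigned long limit = tcp_mem_pressure << (12 - 7);  /* = pages * 32 bytes */
    unsigned long max_share = limit < 4UL * 1024 * 1024 ? limit : 4UL * 1024 * 1024;
    unsigned long wmem_max = max_share > 64 * 1024 ? max_share : 64 * 1024;
    unsigned long rmem_max = max_share > 87380 ? max_share : 87380;

    /* 49152 pages -> limit = 1572864, so both maxima come out at 1.5 MB
     * here; with more RAM max_share saturates at 4 MB. */
    printf("tcp_wmem[2]=%lu tcp_rmem[2]=%lu\n", wmem_max, rmem_max);
    return 0;
}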
* Re: setsockopt() 2008-07-09 6:03 ` setsockopt() Evgeniy Polyakov @ 2008-07-09 18:11 ` J. Bruce Fields 2008-07-09 18:43 ` setsockopt() Evgeniy Polyakov 0 siblings, 1 reply; 43+ messages in thread From: J. Bruce Fields @ 2008-07-09 18:11 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Bill Fink, Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees On Wed, Jul 09, 2008 at 10:03:41AM +0400, Evgeniy Polyakov wrote: > On Wed, Jul 09, 2008 at 01:47:58AM -0400, Bill Fink (billfink@mindspring.com) wrote: > > Are you sure you're not thinking about tcp_mem, which is a function > > of available memory, or has this been changed in more recent kernels? > > The 2.6.22.9 Documentation/networking/ip-sysctl.txt indicates: > > In 2.6.25 tcp_mem depends on amount of ram, and third tcp_wmem is > calculated based on it: > > /* Set per-socket limits to no more than 1/128 the pressure threshold */ > limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7); > max_share = min(4UL*1024*1024, limit); > > sysctl_tcp_wmem[0] = SK_MEM_QUANTUM; > sysctl_tcp_wmem[1] = 16*1024; > sysctl_tcp_wmem[2] = max(64*1024, max_share); > > sysctl_tcp_rmem[0] = SK_MEM_QUANTUM; > sysctl_tcp_rmem[1] = 87380; > sysctl_tcp_rmem[2] = max(87380, max_share); > > > tcp_wmem - vector of 3 INTEGERs: min, default, max > > ... > > max: Maximal amount of memory allowed for automatically selected > > send buffers for TCP socket. This value does not override > > net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. > > Default: 128K > > Yeah, its a bit confusing. It probably was copypasted, there is no > default, but minimum possible value. I don't understand; what do you mean by "there is no default"? (And if not, what does tcp_wmem[1] mean?) --b. > > > > > The TCP autotuning worked great, with both tests basically achieving > > full 10-GigE line rate. The test with the TCP autotuning actually > > did slightly better than the test where an explicitly specified 1 MB > > socket buffer was used, although this could just be within the margin > > of error of the testing. > > If you will check tcp_wmem on your machine, it will likely show that > tcp_wmem[max] is far larger than 128k. It is equal to 1mb on my old > laptop with 256mb of ram, I suppose machines equipped with 10gige > network adapters usually have slightly more. > > -- > Evgeniy Polyakov ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-09 18:11 ` setsockopt() J. Bruce Fields @ 2008-07-09 18:43 ` Evgeniy Polyakov 2008-07-09 22:28 ` setsockopt() J. Bruce Fields 0 siblings, 1 reply; 43+ messages in thread From: Evgeniy Polyakov @ 2008-07-09 18:43 UTC (permalink / raw) To: J. Bruce Fields Cc: Bill Fink, Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees On Wed, Jul 09, 2008 at 02:11:22PM -0400, J. Bruce Fields (bfields@fieldses.org) wrote: > > Yeah, its a bit confusing. It probably was copypasted, there is no > > default, but minimum possible value. > > I don't understand; what do you mean by "there is no default"? (And if > not, what does tcp_wmem[1] mean?) I meant there is no default value for tcp_w/rmem[2], which is calculated based on tcp_mem, which in turn is calculated based on amount RAM of in the system. tcp_wmem[2] will be at least 64k, but its higher limit (calculated by system, which of course can be overwritten) is RAM/256 on x86 (iirc only low mem is counted, although that was different in various kernel versions), but not more than 4Mb. tcp_wmem[1] means initial send buffer size, it can grow up to tcp_wmem[2]. There is a default for this parameter. Actually all this numbers are a bit fluffy, so they are kind of soft rules for socket memory accounting mechanics. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-09 18:43 ` setsockopt() Evgeniy Polyakov @ 2008-07-09 22:28 ` J. Bruce Fields 2008-07-10 1:06 ` setsockopt() Evgeniy Polyakov 0 siblings, 1 reply; 43+ messages in thread From: J. Bruce Fields @ 2008-07-09 22:28 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Bill Fink, Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees On Wed, Jul 09, 2008 at 10:43:30PM +0400, Evgeniy Polyakov wrote: > On Wed, Jul 09, 2008 at 02:11:22PM -0400, J. Bruce Fields (bfields@fieldses.org) wrote: > > > Yeah, its a bit confusing. It probably was copypasted, there is no > > > default, but minimum possible value. > > > > I don't understand; what do you mean by "there is no default"? (And if > > not, what does tcp_wmem[1] mean?) > > I meant there is no default value for tcp_w/rmem[2], which is calculated > based on tcp_mem, which in turn is calculated based on amount RAM of in > the system. tcp_wmem[2] will be at least 64k, but its higher limit > (calculated by system, which of course can be overwritten) is RAM/256 on > x86 (iirc only low mem is counted, although that was different in > various kernel versions), but not more than 4Mb. > > tcp_wmem[1] means initial send buffer size, it can grow up to tcp_wmem[2]. > There is a default for this parameter. Actually all this numbers are a > bit fluffy, so they are kind of soft rules for socket memory accounting > mechanics. OK, thanks. Would the following be any more clearer and/or accurate? --b. diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 17a6e46..a22af04 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -336,7 +336,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max pressure. Default: 8K - default: default size of receive buffer used by TCP sockets. + default: initial size of receive buffer used by TCP sockets. This value overrides net.core.rmem_default used by other protocols. Default: 87380 bytes. This value results in window of 65535 with default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit @@ -344,8 +344,10 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max max: maximal size of receive buffer allowed for automatically selected receiver buffers for TCP socket. This value does not override - net.core.rmem_max, "static" selection via SO_RCVBUF does not use this. - Default: 87380*2 bytes. + net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + automatic tuning of that socket's receive buffer size, in which + case this value is ignored. + Default: between 87380B and 4MB, depending on RAM size. tcp_sack - BOOLEAN Enable select acknowledgments (SACKS). @@ -419,19 +421,21 @@ tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. tcp_wmem - vector of 3 INTEGERs: min, default, max - min: Amount of memory reserved for send buffers for TCP socket. + min: Amount of memory reserved for send buffers for TCP sockets. Each TCP socket has rights to use it due to fact of its birth. Default: 4K - default: Amount of memory allowed for send buffers for TCP socket - by default. This value overrides net.core.wmem_default used - by other protocols, it is usually lower than net.core.wmem_default. + default: initial size of send buffer used by TCP sockets. This + value overrides net.core.wmem_default used by other protocols. + It is usually lower than net.core.wmem_default. Default: 16K - max: Maximal amount of memory allowed for automatically selected - send buffers for TCP socket. 
This value does not override - net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. - Default: 128K + max: Maximal amount of memory allowed for automatically tuned + send buffers for TCP sockets. This value does not override + net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables + automatic tuning of that socket's send buffer size, in which case + this value is ignored. + Default: between 64K and 4MB, depending on RAM size. tcp_workaround_signed_windows - BOOLEAN If set, assume no receipt of a window scaling option means the ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: setsockopt()
2008-07-09 22:28 ` setsockopt() J. Bruce Fields
@ 2008-07-10 1:06 ` Evgeniy Polyakov
2008-07-10 20:05 ` [PATCH] Documentation: clarify tcp_{r,w}mem sysctl docs J. Bruce Fields
0 siblings, 1 reply; 43+ messages in thread
From: Evgeniy Polyakov @ 2008-07-10 1:06 UTC (permalink / raw)
To: J. Bruce Fields
Cc: Bill Fink, Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees
On Wed, Jul 09, 2008 at 06:28:02PM -0400, J. Bruce Fields (bfields@fieldses.org) wrote:
> OK, thanks. Would the following be any more clearer and/or accurate?
Looks good, thank you :)
It is likely the first ever mention of the fact that SO_RCVBUF/SO_SNDBUF
disable autotuning.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH] Documentation: clarify tcp_{r,w}mem sysctl docs 2008-07-10 1:06 ` setsockopt() Evgeniy Polyakov @ 2008-07-10 20:05 ` J. Bruce Fields 2008-07-10 23:50 ` David Miller 0 siblings, 1 reply; 43+ messages in thread From: J. Bruce Fields @ 2008-07-10 20:05 UTC (permalink / raw) To: Jonathan Corbet Cc: Bill Fink, Stephen Hemminger, Roland Dreier, David Miller, aglo, shemminger, netdev, rees, Evgeniy Polyakov From: J. Bruce Fields <bfields@citi.umich.edu> Date: Wed, 9 Jul 2008 18:28:48 -0400 Fix some of the defaults and attempt to clarify some language. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> --- Documentation/networking/ip-sysctl.txt | 26 +++++++++++++++----------- 1 files changed, 15 insertions(+), 11 deletions(-) On Thu, Jul 10, 2008 at 05:06:50AM +0400, Evgeniy Polyakov wrote: > On Wed, Jul 09, 2008 at 06:28:02PM -0400, J. Bruce Fields > (bfields@fieldses.org) wrote: > > OK, thanks. Would the following be any more clearer and/or > > accurate? > > Looks good, thank you :) > It is likely the first ever mention of the fact, that SO_RECV/SNDBUF > disables autotuning. OK! Uh, I'm assuming Jon Corbet's documentation tree would be a reasonable route to submit this by.... --b. diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 17a6e46..a22af04 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -336,7 +336,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max pressure. Default: 8K - default: default size of receive buffer used by TCP sockets. + default: initial size of receive buffer used by TCP sockets. This value overrides net.core.rmem_default used by other protocols. Default: 87380 bytes. This value results in window of 65535 with default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit @@ -344,8 +344,10 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max max: maximal size of receive buffer allowed for automatically selected receiver buffers for TCP socket. This value does not override - net.core.rmem_max, "static" selection via SO_RCVBUF does not use this. - Default: 87380*2 bytes. + net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + automatic tuning of that socket's receive buffer size, in which + case this value is ignored. + Default: between 87380B and 4MB, depending on RAM size. tcp_sack - BOOLEAN Enable select acknowledgments (SACKS). @@ -419,19 +421,21 @@ tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. tcp_wmem - vector of 3 INTEGERs: min, default, max - min: Amount of memory reserved for send buffers for TCP socket. + min: Amount of memory reserved for send buffers for TCP sockets. Each TCP socket has rights to use it due to fact of its birth. Default: 4K - default: Amount of memory allowed for send buffers for TCP socket - by default. This value overrides net.core.wmem_default used - by other protocols, it is usually lower than net.core.wmem_default. + default: initial size of send buffer used by TCP sockets. This + value overrides net.core.wmem_default used by other protocols. + It is usually lower than net.core.wmem_default. Default: 16K - max: Maximal amount of memory allowed for automatically selected - send buffers for TCP socket. This value does not override - net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. - Default: 128K + max: Maximal amount of memory allowed for automatically tuned + send buffers for TCP sockets. 
This value does not override + net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables + automatic tuning of that socket's send buffer size, in which case + this value is ignored. + Default: between 64K and 4MB, depending on RAM size. tcp_workaround_signed_windows - BOOLEAN If set, assume no receipt of a window scaling option means the -- 1.5.5.rc1 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH] Documentation: clarify tcp_{r,w}mem sysctl docs 2008-07-10 20:05 ` [PATCH] Documentation: clarify tcp_{r,w}mem sysctl docs J. Bruce Fields @ 2008-07-10 23:50 ` David Miller 0 siblings, 0 replies; 43+ messages in thread From: David Miller @ 2008-07-10 23:50 UTC (permalink / raw) To: bfields Cc: corbet, billfink, stephen.hemminger, rdreier, aglo, shemminger, netdev, rees, johnpol From: "J. Bruce Fields" <bfields@fieldses.org> Date: Thu, 10 Jul 2008 16:05:10 -0400 > Fix some of the defaults and attempt to clarify some language. > > Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Applied, thanks! ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 21:49 ` setsockopt() David Miller 2008-07-08 4:54 ` setsockopt() Evgeniy Polyakov @ 2008-07-08 20:12 ` Jim Rees 2008-07-08 21:54 ` setsockopt() John Heffner 1 sibling, 1 reply; 43+ messages in thread From: Jim Rees @ 2008-07-08 20:12 UTC (permalink / raw) To: netdev; +Cc: aglo, shemminger, bfields David Miller wrote: If you set the socket buffer sizes explicitly, you essentially turn off half of the TCP stack because it won't do dynamic socket buffer sizing afterwards. There is no reason these days to ever explicitly set the socket buffer sizes on TCP sockets under Linux. So it seems clear that nfsd should stop setting the socket buffer sizes. The problem we run into if we try that is that the server won't read any incoming data from its socket until an entire rpc has been assembled and is waiting to be read off the socket. An rpc can be almost any size up to about 1MB, but the socket buffer never grows past about 50KB, so the rpc can never be assembled entirely in the socket buf. Maybe the nfsd needs a way to tell the socket/tcp layers that it wants a minimum size socket buffer. Or maybe nfsd needs to be modified so that it will read partial rpcs. I would appreciate suggestions as to which is the better fix. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt()
  2008-07-08 20:12 ` setsockopt() Jim Rees
@ 2008-07-08 21:54 ` John Heffner
  2008-07-08 23:51   ` setsockopt() Jim Rees
  0 siblings, 1 reply; 43+ messages in thread
From: John Heffner @ 2008-07-08 21:54 UTC (permalink / raw)
To: Jim Rees; +Cc: netdev, aglo, shemminger, bfields

On Tue, Jul 8, 2008 at 1:12 PM, Jim Rees <rees@umich.edu> wrote:
> David Miller wrote:
>
> If you set the socket buffer sizes explicitly, you essentially turn
> off half of the TCP stack because it won't do dynamic socket buffer
> sizing afterwards.
>
> There is no reason these days to ever explicitly set the socket
> buffer sizes on TCP sockets under Linux.
>
> So it seems clear that nfsd should stop setting the socket buffer sizes.
>
> The problem we run into if we try that is that the server won't read any
> incoming data from its socket until an entire rpc has been assembled and is
> waiting to be read off the socket.  An rpc can be almost any size up to
> about 1MB, but the socket buffer never grows past about 50KB, so the rpc can
> never be assembled entirely in the socket buf.
>
> Maybe the nfsd needs a way to tell the socket/tcp layers that it wants a
> minimum size socket buffer.  Or maybe nfsd needs to be modified so that it
> will read partial rpcs.  I would appreciate suggestions as to which is the
> better fix.

This is an interesting observation.  It turns out that the best way to
solve send-side autotuning is not to "tune" the send buffer at all, but
to change its semantics.  From your example, we can clearly see that the
send buffer is overloaded.  It's used to buffer data between a scheduled
application and the event-driven kernel, and also to store data that may
need to be retransmitted.  If you separate the socket buffer from the
retransmit queue, you can size the socket buffer based on the
application's needs (e.g., you want about 1 MB), and the retransmit
queue's size will naturally be bound by cwnd.

I implemented this split about six years ago, but never submitted it,
largely because it wasn't clear how to handle backward/cross-platform
compatibility of socket options, and because no one seemed to care about
it too much.  (I think you are the first person I remember to bring up
this issue.)

Unfortunately, where this leaves you is still trying to guess the right
socket buffer size.  I actually like your idea for a "soft" SO_SNDBUF --
ask the kernel for at least that much, but let it autotune higher if
needed.  This is almost trivial to implement -- it's the same as
SO_SNDBUF but don't set the sock sndbuf lock.

One thing to note here.  While this option would solve your problem,
there's another similar issue that would not be addressed.  Some
applications want to "feel" the network -- that is, they want to observe
changes in sending rate as quickly as possible.  (Say you have an
adaptive codec.)  This application would want a small send buffer, but a
larger retransmit queue.  It's not possible to do this without splitting
the send buffer.

-John
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 21:54 ` setsockopt() John Heffner @ 2008-07-08 23:51 ` Jim Rees 2008-07-09 0:07 ` setsockopt() John Heffner 0 siblings, 1 reply; 43+ messages in thread From: Jim Rees @ 2008-07-08 23:51 UTC (permalink / raw) To: John Heffner; +Cc: netdev, aglo, shemminger, bfields John Heffner wrote: I actually like your idea for a "soft" SO_SNDBUF -- ask the kernel for at least that much, but let it autotune higher if needed. This is almost trivial to implement -- it's the same as SO_SNDBUF but don't set the sock sndbuf lock. Which brings me to another issue. The nfs server doesn't call sock_setsockopt(), it diddles sk_sndbuf and sk_rcvbuf directly, so as to get around the max socket buf limit. I don't like this. If this is a legit thing to do, there should be an api. I'm thinking we need a sock_set_min_bufsize(), where the values passed in are minimums, subject to autotuning, and maybe are not limited by the max. It would, as you say, just set sk_sndbuf and sk_rcvbuf without setting the corresponding flags SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK. Would this do the trick, or is there a danger that autotuning would reduce the buffer sizes below the given minimum? If so, we might need sk_min_rcvbuf or something like that. ^ permalink raw reply [flat|nested] 43+ messages in thread
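A rough sketch of what Jim's proposed helper might look like; sock_set_min_bufsize() is hypothetical, not an existing kernel API, and the body below is an assumption drawn from the discussion rather than an actual patch:

    /* Hypothetical helper based on the suggestion above: raise the buffer
     * floors without setting SOCK_SNDBUF_LOCK/SOCK_RCVBUF_LOCK, so
     * autotuning may still grow the buffers later. */
    #include <net/sock.h>

    void sock_set_min_bufsize(struct sock *sk, int sndbuf, int rcvbuf)
    {
            lock_sock(sk);
            sk->sk_sndbuf = max_t(int, sk->sk_sndbuf, sndbuf);
            sk->sk_rcvbuf = max_t(int, sk->sk_rcvbuf, rcvbuf);
            /* deliberately leave sk->sk_userlocks untouched */
            release_sock(sk);
    }

Whether the minimum should also be exempt from net.core.{r,w}mem_max, and whether autotuning should ever be allowed to shrink below it, are exactly the open questions raised in this subthread.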
* Re: setsockopt() 2008-07-08 23:51 ` setsockopt() Jim Rees @ 2008-07-09 0:07 ` John Heffner 0 siblings, 0 replies; 43+ messages in thread From: John Heffner @ 2008-07-09 0:07 UTC (permalink / raw) To: Jim Rees; +Cc: netdev, aglo, shemminger, bfields On Tue, Jul 8, 2008 at 4:51 PM, Jim Rees <rees@umich.edu> wrote: > John Heffner wrote: > > I actually like your idea for a "soft" > SO_SNDBUF -- ask the kernel for at least that much, but let it > autotune higher if needed. This is almost trivial to implement -- > it's the same as SO_SNDBUF but don't set the sock sndbuf lock. > > Which brings me to another issue. The nfs server doesn't call > sock_setsockopt(), it diddles sk_sndbuf and sk_rcvbuf directly, so as to get > around the max socket buf limit. I don't like this. If this is a legit > thing to do, there should be an api. > > I'm thinking we need a sock_set_min_bufsize(), where the values passed in > are minimums, subject to autotuning, and maybe are not limited by the max. > It would, as you say, just set sk_sndbuf and sk_rcvbuf without setting the > corresponding flags SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK. > > Would this do the trick, or is there a danger that autotuning would reduce > the buffer sizes below the given minimum? If so, we might need > sk_min_rcvbuf or something like that. TCP buffer sizes will only be pulled back if the system runs into the global tcp memory limits (sysctl_tcp_mem). I think this is correct behavior regardless of the requested value. -John ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 21:30 ` setsockopt() Olga Kornievskaia 2008-07-07 21:33 ` setsockopt() Stephen Hemminger 2008-07-07 21:49 ` setsockopt() David Miller @ 2008-07-07 22:50 ` Rick Jones 2008-07-07 23:00 ` setsockopt() David Miller 2008-07-08 3:33 ` setsockopt() John Heffner 2 siblings, 2 replies; 43+ messages in thread From: Rick Jones @ 2008-07-07 22:50 UTC (permalink / raw) To: Olga Kornievskaia; +Cc: Stephen Hemminger, netdev, Jim Rees, J. Bruce Fields Olga Kornievskaia wrote: > Stephen Hemminger wrote: >> It would be better if NFSD stayed out of doign setsockopt and just >> let the sender/receiver autotuning work? >> > > Auto-tuning would be guided by the sysctl values that are set for all > applications. I could be wrong but what I see is that unless an > application does a setsockopt(), its window is bound by the default > sysctl value. If it is true, than it is not acceptable. It means that in > order for NFSD to achieve a large enough window it needs to modify TCP's > sysctl value which will effect all other applications. My experience thusfar is that the sysctl defaults will allow an autotuned TCP receive window far larger than it will allow with a direct setsockopt() call. I'm still a triffle puzzled/concerned/confused by the extent to which autotuning will allow the receive window to grow, again based on some netperf experience thusfar, and patient explanations provided here and elsewhere, it seems as though autotuning will let things get to 2x what it thinks the sender's cwnd happens to be. So far under netperf testing that seems to be the case, and 99 times out of ten my netperf tests will have the window grow to the max. rick jones ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 22:50 ` setsockopt() Rick Jones @ 2008-07-07 23:00 ` David Miller 2008-07-07 23:27 ` setsockopt() Rick Jones 2008-07-08 3:33 ` setsockopt() John Heffner 1 sibling, 1 reply; 43+ messages in thread From: David Miller @ 2008-07-07 23:00 UTC (permalink / raw) To: rick.jones2; +Cc: aglo, shemminger, netdev, rees, bfields From: Rick Jones <rick.jones2@hp.com> Date: Mon, 07 Jul 2008 15:50:21 -0700 > I'm still a triffle puzzled/concerned/confused by the extent to which > autotuning will allow the receive window to grow, again based on some > netperf experience thusfar, and patient explanations provided here and > elsewhere, it seems as though autotuning will let things get to 2x what > it thinks the sender's cwnd happens to be. So far under netperf testing > that seems to be the case, and 99 times out of ten my netperf tests will > have the window grow to the max. We need 2x, in order to have a full window during recovery. There was a measurement bug found a few months ago when the google folks were probing in this area, which was fixed by John Heffner. Most of which had to deal with TSO subtleties. -------------------- commit 246eb2af060fc32650f07203c02bdc0456ad76c7 Author: John Heffner <johnwheffner@gmail.com> Date: Tue Apr 29 03:13:52 2008 -0700 tcp: Limit cwnd growth when deferring for GSO This fixes inappropriately large cwnd growth on sender-limited flows when GSO is enabled, limiting cwnd growth to 64k. Signed-off-by: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> commit ce447eb91409225f8a488f6b7b2a1bdf7b2d884f Author: John Heffner <johnwheffner@gmail.com> Date: Tue Apr 29 03:13:02 2008 -0700 tcp: Allow send-limited cwnd to grow up to max_burst when gso disabled This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow up to tcp_max_burst() even when sk_can_gso() is false, or when sysctl_tcp_tso_win_divisor != 0. Signed-off-by: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> -------------------- Setting TCP socket buffer via setsockopt() is always wrong. If there is a bug, let's fix it. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 23:00 ` setsockopt() David Miller @ 2008-07-07 23:27 ` Rick Jones 2008-07-08 1:15 ` setsockopt() Rick Jones 2008-07-08 1:44 ` setsockopt() David Miller 0 siblings, 2 replies; 43+ messages in thread From: Rick Jones @ 2008-07-07 23:27 UTC (permalink / raw) To: David Miller; +Cc: aglo, shemminger, netdev, rees, bfields David Miller wrote: > We need 2x, in order to have a full window during recovery. > > There was a measurement bug found a few months ago when the > google folks were probing in this area, which was fixed > by John Heffner. Most of which had to deal with TSO subtleties. > > -------------------- > commit 246eb2af060fc32650f07203c02bdc0456ad76c7 > Author: John Heffner <johnwheffner@gmail.com> > Date: Tue Apr 29 03:13:52 2008 -0700 > > tcp: Limit cwnd growth when deferring for GSO > > This fixes inappropriately large cwnd growth on sender-limited flows > when GSO is enabled, limiting cwnd growth to 64k. > > Signed-off-by: John Heffner <johnwheffner@gmail.com> > Signed-off-by: David S. Miller <davem@davemloft.net> > > commit ce447eb91409225f8a488f6b7b2a1bdf7b2d884f > Author: John Heffner <johnwheffner@gmail.com> > Date: Tue Apr 29 03:13:02 2008 -0700 > > tcp: Allow send-limited cwnd to grow up to max_burst when gso disabled > > This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow > up to tcp_max_burst() even when sk_can_gso() is false, or when > sysctl_tcp_tso_win_divisor != 0. > > Signed-off-by: John Heffner <johnwheffner@gmail.com> > Signed-off-by: David S. Miller <davem@davemloft.net> > -------------------- I'll try my tests again with newer kernels since I'm not 100% certain I was trying with those commits in place. > Setting TCP socket buffer via setsockopt() is always wrong. Does that apply equally to SO_SNDBUF and SO_RCVBUF? rick jones ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 23:27 ` setsockopt() Rick Jones @ 2008-07-08 1:15 ` Rick Jones 2008-07-08 1:48 ` setsockopt() J. Bruce Fields 2008-07-08 1:44 ` setsockopt() David Miller 1 sibling, 1 reply; 43+ messages in thread From: Rick Jones @ 2008-07-08 1:15 UTC (permalink / raw) To: netdev; +Cc: David Miller, aglo, shemminger, rees, bfields Rick Jones wrote: > David Miller wrote: > >> We need 2x, in order to have a full window during recovery. >> >> There was a measurement bug found a few months ago when the >> google folks were probing in this area, which was fixed >> by John Heffner. Most of which had to deal with TSO subtleties. >> >> -------------------- >> commit 246eb2af060fc32650f07203c02bdc0456ad76c7 >> ... >> commit ce447eb91409225f8a488f6b7b2a1bdf7b2d884f >> ... > I'll try my tests again with newer kernels since I'm not 100% certain I > was trying with those commits in place. Did those commits make it into 2.6.26-rc9? (Gentle taps of clue-bat as to how to use git to check commits in various trees would be welcome - to say I am a git noob would be an understatement - the tree from which that kernel was made was cloned from Linus' about 16:00 to 17:00 pacific time) Assuming they did, a pair of systems with tg3-driven BCM5704's: moe:~# ethtool -i eth0 driver: tg3 version: 3.92.1 firmware-version: 5704-v3.27 bus-info: 0000:01:02.0 moe:~# uname -a Linux moe 2.6.26-rc9-raj #1 SMP Mon Jul 7 17:26:15 PDT 2008 ia64 GNU/Linux with TSO enabled still takes the socket buffers all the way out to 4MB for a GbE LAN test: moe:~# netperf -t omni -H manny -- -o foo OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to manny.west (10.208.0.13) port 0 AF_INET Throughput,Local Send Socket Size Requested,Local Send Socket Size Initial,Local Send Socket Size Final,Remote Recv Socket Size Requested,Remote Recv Socket Size Initial,Remote Recv Socket Size Final 941.41,-1,16384,4194304,-1,87380,4194304 When a 64K socket buffer request was sufficient: moe:~# netperf -t omni -H manny -- -o foo -s 64K -S 64K -m 16K OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to manny.west (10.208.0.13) port 0 AF_INET Throughput,Local Send Socket Size Requested,Local Send Socket Size Initial,Local Send Socket Size Final,Remote Recv Socket Size Requested,Remote Recv Socket Size Initial,Remote Recv Socket Size Final 941.12,65536,131072,131072,65536,131072,131072 FWIW, disabling TSO via ethtool didn't seem to change the behaviour: moe:~# ethtool -K eth0 tso off moe:~# ethtool -k eth0 Offload parameters for eth0: rx-checksumming: on tx-checksumming: on scatter-gather: on tcp segmentation offload: off udp fragmentation offload: off generic segmentation offload: off moe:~# netperf -t omni -H manny -- -o foo OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to manny.west (10.208.0.13) port 0 AF_INET Throughput,Local Send Socket Size Requested,Local Send Socket Size Initial,Local Send Socket Size Final,Remote Recv Socket Size Requested,Remote Recv Socket Size Initial,Remote Recv Socket Size Final 941.40,-1,16384,4194304,-1,87380,4194304 If I was cloning off the wrong tree, my apologies and redirects to the correct tree would be gladly accepted. rick jones moe:~# cat foo throughput,lss_size_req,lss_size,lss_size_end,rsr_size_req,rsr_size,rsr_size_end ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt()
  2008-07-08  1:15 ` setsockopt() Rick Jones
@ 2008-07-08  1:48 ` J. Bruce Fields
  0 siblings, 0 replies; 43+ messages in thread
From: J. Bruce Fields @ 2008-07-08 1:48 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev, David Miller, aglo, shemminger, rees

On Mon, Jul 07, 2008 at 06:15:01PM -0700, Rick Jones wrote:
> Rick Jones wrote:
>> David Miller wrote:
>>
>>> We need 2x, in order to have a full window during recovery.
>>>
>>> There was a measurement bug found a few months ago when the
>>> google folks were probing in this area, which was fixed
>>> by John Heffner.  Most of which had to deal with TSO subtleties.
>>>
>>> --------------------
>>> commit 246eb2af060fc32650f07203c02bdc0456ad76c7
>>> ...
>>> commit ce447eb91409225f8a488f6b7b2a1bdf7b2d884f
>>> ...
>> I'll try my tests again with newer kernels since I'm not 100% certain I
>> was trying with those commits in place.
>
> Did those commits make it into 2.6.26-rc9?  (Gentle taps of clue-bat as
> to how to use git to check commits in various trees would be welcome -
> to say I am a git noob would be an understatement - the tree from which
> that kernel was made was cloned from Linus' about 16:00 to 17:00 pacific
> time)

"git describe --contains" will tell you the first tag git finds
containing the given commit:

	bfields@pickle:linux$ git describe --contains 246eb2af060
	v2.6.26-rc1~95^2~18
	bfields@pickle:linux$ git describe --contains ce447eb9140
	v2.6.26-rc1~95^2~19

So both were already in 2.6.26-rc1.

Or if you forget that, another trick is to note that "git log A..B"
really means "tell me all the commits contained in B but not in A".  So
"git log A..B" returns output if (and only if) B is not contained in A.
For example, if "git log HEAD..246eb2af060" returns without any output,
then you know that 246eb2af060 is contained in the head of the
currently checked-out branch.

--b.
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 23:27 ` setsockopt() Rick Jones 2008-07-08 1:15 ` setsockopt() Rick Jones @ 2008-07-08 1:44 ` David Miller 1 sibling, 0 replies; 43+ messages in thread From: David Miller @ 2008-07-08 1:44 UTC (permalink / raw) To: rick.jones2; +Cc: aglo, shemminger, netdev, rees, bfields From: Rick Jones <rick.jones2@hp.com> Date: Mon, 07 Jul 2008 16:27:14 -0700 > David Miller wrote: > > Setting TCP socket buffer via setsockopt() is always wrong. > > Does that apply equally to SO_SNDBUF and SO_RCVBUF? Yes. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 22:50 ` setsockopt() Rick Jones 2008-07-07 23:00 ` setsockopt() David Miller @ 2008-07-08 3:33 ` John Heffner 2008-07-08 18:16 ` setsockopt() Rick Jones [not found] ` <349f35ee0807090255s58fd040bne265ee117d06d397@mail.gmail.com> 1 sibling, 2 replies; 43+ messages in thread From: John Heffner @ 2008-07-08 3:33 UTC (permalink / raw) To: Rick Jones; +Cc: netdev On Mon, Jul 7, 2008 at 3:50 PM, Rick Jones <rick.jones2@hp.com> wrote: > I'm still a triffle puzzled/concerned/confused by the extent to which > autotuning will allow the receive window to grow, again based on some > netperf experience thusfar, and patient explanations provided here and > elsewhere, it seems as though autotuning will let things get to 2x what it > thinks the sender's cwnd happens to be. So far under netperf testing that > seems to be the case, and 99 times out of ten my netperf tests will have the > window grow to the max. Rick, I thought this was covered pretty thoroughly back in April. The behavior you're seeing is 100% expected, and not likely to change unless Jerry Chu gets his local queued data measurement patch working. I'm not sure what ultimately happened there, but it was a cool idea and I hope he has time to polish it up. It's definitely tricky to get right. Jerry's optimization is a sender-side change. The fact that the receiver announces enough window is almost certainly the right thing for it to do, and (I hope) this will not change. If you're still curious: http://www.psc.edu/networking/ftp/papers/autotune_sigcomm98.ps http://www.lanl.gov/radiant/pubs/drs/lacsi2001.pdf http://staff.psc.edu/jheffner/papers/senior_thesis.pdf -John ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-08 3:33 ` setsockopt() John Heffner @ 2008-07-08 18:16 ` Rick Jones 2008-07-08 19:10 ` setsockopt() John Heffner [not found] ` <349f35ee0807090255s58fd040bne265ee117d06d397@mail.gmail.com> 1 sibling, 1 reply; 43+ messages in thread From: Rick Jones @ 2008-07-08 18:16 UTC (permalink / raw) To: John Heffner; +Cc: netdev John Heffner wrote: > On Mon, Jul 7, 2008 at 3:50 PM, Rick Jones <rick.jones2@hp.com> wrote: > >>I'm still a triffle puzzled/concerned/confused by the extent to which >>autotuning will allow the receive window to grow, again based on some >>netperf experience thusfar, and patient explanations provided here and >>elsewhere, it seems as though autotuning will let things get to 2x what it >>thinks the sender's cwnd happens to be. So far under netperf testing that >>seems to be the case, and 99 times out of ten my netperf tests will have the >>window grow to the max. > > > > Rick, > > I thought this was covered pretty thoroughly back in April. I'll plead bit errors in the dimm wetware memory :( And go back through the archives. > The behavior you're seeing is 100% expected, and not likely to change > unless Jerry Chu gets his local queued data measurement patch > working. I'm not sure what ultimately happened there, but it was a > cool idea and I hope he has time to polish it up. It's definitely > tricky to get right. > > Jerry's optimization is a sender-side change. The fact that the > receiver announces enough window is almost certainly the right thing > for it to do, and (I hope) this will not change. It just seems to be so, well, trusting of the sender. > > If you're still curious: > http://www.psc.edu/networking/ftp/papers/autotune_sigcomm98.ps > http://www.lanl.gov/radiant/pubs/drs/lacsi2001.pdf That one didn't show the effect on LANs, only WANs, although it did say things like this when discussing timer granularity and estimating the sender's window: page 4 - "Thus in no case will the actual window be larger than the measured amount of data recieved during the period. However, the amount of data received during the period may be three times the actual window size when measurements are made across wide-area networks with rtt > 20 ms. Further, local networks with small round-trip delays may be grossly over-estimated." I imagine that the 20 ms bit depends on the release and its timer granularity. It was really that last sentence that caught my eye. Still, rerunning with multiple concurrent tests showed they weren't all going to the limit: moe:~# for i in 1 2 3 4; do netperf -t omni -l 30 -H manny -P 0 -- -o foo & done moe:~# 289.31,-1,16384,2635688,-1,87380,2389248 210.43,-1,16384,2415720,-1,87380,2084736 194.87,-1,16384,1783312,-1,87380,1760704 247.00,-1,16384,2647472,-1,87380,2646912 moe:~# for i in 1 2 3 4; do netperf -t omni -l 120 -H manny -P 0 -- -o foo & done moe:~# 240.19,-1,16384,2761384,-1,87380,2635200 197.78,-1,16384,2337160,-1,87380,2225280 220.47,-1,16384,2867440,-1,87380,2834304 283.08,-1,16384,3244528,-1,87380,3091968 I'm not sure the extent to which skew error might be at issue in those measurements. rick jones ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: setsockopt()
  2008-07-08 18:16 ` setsockopt() Rick Jones
@ 2008-07-08 19:10 ` John Heffner
  0 siblings, 0 replies; 43+ messages in thread
From: John Heffner @ 2008-07-08 19:10 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev

On Tue, Jul 8, 2008 at 11:16 AM, Rick Jones <rick.jones2@hp.com> wrote:
> John Heffner wrote:
>> Jerry's optimization is a sender-side change.  The fact that the
>> receiver announces enough window is almost certainly the right thing
>> for it to do, and (I hope) this will not change.
>
> It just seems to be so, well, trusting of the sender.

Really, it's not trusting the sender at all.  The way it works is that a
receiver sizes its buffer based strictly on *how much its application
has read in an RTT*.  If the application reads slowly, it uses a small
buffer.  If it reads quickly, it increases the size so that the TCP
buffer is big enough to not be the limiting factor (if system limits
allow).  That's about all there is to it.  The only effect the sender
has is that if it sends slowly, it bounds the rate at which the receiver
can read, and consequently results in an appropriately small receive
buffer.

The issue you're talking about is when the RTT gets inflated by filling
a buffer somewhere in that path -- in your case, specifically in the
sender's interface queue.  When the RTT gets inflated, the receiver will
continue to track it, and continue announcing a window large enough that
it doesn't limit the sender's window.  In this case, it is not really
paying a penalty, since it's keeping up and it doesn't actually have to
buffer any data.  It will happily let the sender continue to fill its
own buffers, and the sender will pay the penalty.

The receiver *can* try to do something about this situation, basically
by seeing that the RTT is increasing and not using the higher RTT values
in its calculation.  However, this is a very dangerous game, and comes
with all the issues of delay-based congestion control.  (Basically, you
can't tell if your flow is the one causing the queuing or if it's
cross/reverse-path traffic.  Or if increased delay is caused by a
routing change.  Or wireless link layer games.)  If you're going to try
to solve this problem, the sender is the better place to do it, because
it has better information, and because it pays the higher cost.

-John
^ permalink raw reply	[flat|nested] 43+ messages in thread
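John's description can be boiled down to a toy model of the receiver-side rule, in plain C rather than kernel code; the 2x headroom echoes David's earlier point about keeping a full window available during recovery, and the exact kernel logic is more involved:

    #include <stddef.h>

    /* Toy model: grow the advertised receive buffer toward roughly twice the
     * bytes the application consumed during the last RTT, capped by the
     * tcp_rmem "max" value; never shrink it here. */
    static size_t autotune_rcvbuf(size_t cur_rcvbuf,
                                  size_t bytes_read_last_rtt,
                                  size_t rmem_max)
    {
        size_t want = 2 * bytes_read_last_rtt;   /* headroom for recovery */

        if (want > cur_rcvbuf)
            cur_rcvbuf = want < rmem_max ? want : rmem_max;
        return cur_rcvbuf;
    }

A slow-reading application therefore keeps a small buffer, while a fast reader lets the buffer track the sender's rate up to the sysctl ceiling.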
[parent not found: <349f35ee0807090255s58fd040bne265ee117d06d397@mail.gmail.com>]
* Re: setsockopt()
  [not found] ` <349f35ee0807090255s58fd040bne265ee117d06d397@mail.gmail.com>
@ 2008-07-09 10:38 ` Jerry Chu
  0 siblings, 0 replies; 43+ messages in thread
From: Jerry Chu @ 2008-07-09 10:38 UTC (permalink / raw)
To: johnwheffner; +Cc: netdev, aglo, ranjitm

On Wed, Jul 9, 2008 at 2:55 AM, H.K. Jerry Chu <hkjerry.chu@gmail.com> wrote:
>
>
> ---------- Forwarded message ----------
> From: John Heffner <johnwheffner@gmail.com>
> Date: Mon, Jul 7, 2008 at 8:33 PM
> Subject: Re: setsockopt()
> To: Rick Jones <rick.jones2@hp.com>
> Cc: netdev@vger.kernel.org
>
>
> On Mon, Jul 7, 2008 at 3:50 PM, Rick Jones <rick.jones2@hp.com> wrote:
> > I'm still a triffle puzzled/concerned/confused by the extent to which
> > autotuning will allow the receive window to grow, again based on some
> > netperf experience thusfar, and patient explanations provided here and
> > elsewhere, it seems as though autotuning will let things get to 2x what it
> > thinks the sender's cwnd happens to be.  So far under netperf testing that
> > seems to be the case, and 99 times out of ten my netperf tests will have the
> > window grow to the max.
>
>
> Rick,
>
> I thought this was covered pretty thoroughly back in April.  The
> behavior you're seeing is 100% expected, and not likely to change
> unless Jerry Chu gets his local queued data measurement patch working.
> I'm not sure what ultimately happened there, but it was a cool idea
> and I hope he has time to polish it up.  It's definitely tricky to get
> right.

Yes, most certainly!  I've had the non-TSO code mostly working for the
past couple of months (i.e., cwnd grows only to ~50KB on a local 1GbE
setup), but no such luck with TSO.  Although the idea (excluding pkts
still stuck in some queues inside the sending host from "in_flight"
when deciding whether cwnd needs to grow or not) seems simple, getting
the accounting right for TSO seems impossible.  After catching and
fixing a slew of cases for 1GbE and seemingly getting close to the end
of the tunnel, I moved my tests to 10GbE last month and discovered
accounting leakage again.  Basically my count of all the pkts still
stuck inside the host sometimes becomes larger than the total
in-flight.  I have not figured out what skb paths I might have missed,
or is it possible the over-zealously-tuned 10G drivers are doing
something funky?

Not to mention a number of other tricky scenarios - e.g., when TSO is
enabled on 1GbE, the code works well for a netperf streaming test but
not the RR test with 1MB request size.  After a while I discovered that
tcp_sendmsg() for the 1MB RR tests often runs in a tight loop without
flow control, hence always hitting snd_cwnd, even though acks have come
back.  This is because the socket lock is only released during flow
control.  The problem went away once I checked for and let the return
traffic in inside the tcp_sendmsg() loop.  This kind of stuff can
easily spoil my original simple algorithm.

Jerry

> Jerry's optimization is a sender-side change.  The fact that the
> receiver announces enough window is almost certainly the right thing
> for it to do, and (I hope) this will not change.
>
> If you're still curious:
> http://www.psc.edu/networking/ftp/papers/autotune_sigcomm98.ps
> http://www.lanl.gov/radiant/pubs/drs/lacsi2001.pdf
> http://staff.psc.edu/jheffner/papers/senior_thesis.pdf
>
> -John
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: setsockopt() 2008-07-07 21:24 ` setsockopt() Stephen Hemminger 2008-07-07 21:30 ` setsockopt() Olga Kornievskaia @ 2008-07-07 21:32 ` J. Bruce Fields 1 sibling, 0 replies; 43+ messages in thread From: J. Bruce Fields @ 2008-07-07 21:32 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Olga Kornievskaia, netdev, Jim Rees On Mon, Jul 07, 2008 at 02:24:08PM -0700, Stephen Hemminger wrote: > On Mon, 07 Jul 2008 14:18:38 -0400 > Olga Kornievskaia <aglo@citi.umich.edu> wrote: > > > Hi, > > > > I'd like to ask a question regarding socket options, more > > specifically send and receive buffer sizes. > > > > One simple question: (on the server-side) is it true that, to set > > send/receive buffer size, setsockopt() can only be called before > > listen()? From what I can tell, if I were to set socket options for the > > listening socket, they get inherited by the socket created during the > > accept(). However, when I try to change send/receive buffer size for the > > new socket, they take no affect. > > > > The server in question is the NFSD server in the kernel. NFSD's code > > tries to adjust the buffer size (in order to have TCP increase the > > window size appropriately) but it does so after the new socket is > > created. It leads to the fact that the TCP window doesn't open beyond > > the TCP's "default" sysctl value (that would be the 2nd value in the > > triple net.ipv4.tcp_rmem, which on our system is set to 64KB). We > > changed the code so that setsockopt() is called for the listening socket > > is created and we set the buffer sizes to something bigger, like 8MB. > > Then we try to increase the buffer size for each socket created by the > > accept() but what is seen on the network trace is that window size > > doesn't open beyond the values used for the listening socket. > > It would be better if NFSD stayed out of doign setsockopt and just > let the sender/receiver autotuning work? Just googling around.... Yes, that's probably exactly what we want, thanks! Any pointers to a good tutorial on the autotuning behavior? So all we should have to do is never mess with setsockopt, and the receive buffer size can increase up to the maximum (the third integer in the tcp_rmem sysctl) if necessary? --b. > > > I looked around in the code. There is a variable called > > "window_clamp" that seems to specifies the largest possible window > > advertisement. window_clamp gets set during the creation of the accept > > socket. At that time, it's value is based on the sk_rcvbuf of the > > listening socket. Thus, that would explain the behavior that window > > doesn't grow beyond the values used in setsockopt() for the listening > > socket, even though the new socket has new (larger) sk_sndbuf and > > sk_rcvbuf than the listening socket. > > > > I realize that send/receive buffer size and window advertisement are > > different but they are related in the way that by telling TCP that we > > have a certain amount of memory for socket operations, it should try to > > open big enough window (provided that there is no congestion). > > > > Can somebody advise us on how to properly set send/receive buffer > > sizes for the NFSD in the kernel such that (1) the window is not bound > > by the TCP's default sysctl value and (2) if it is possible to do so for > > the accept sockets and not the listening socket. > > > > I would appreciate if we could be CC-ed on the reply as we are not > > subscribed to the netdev mailing list. > > > > Thank you. 
> > > > -Olga > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 43+ messages in thread
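One quick way to check the ceiling Bruce is asking about -- the third field of net.ipv4.tcp_rmem -- from userspace is to read it back directly; a small sketch, assuming the usual Linux procfs path and three-integer format:

    #include <stdio.h>

    int main(void)
    {
        long min, def, max;
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");

        if (!f || fscanf(f, "%ld %ld %ld", &min, &def, &max) != 3) {
            perror("tcp_rmem");
            return 1;
        }
        fclose(f);
        printf("receive autotuning starts at %ld bytes, may grow to %ld bytes\n",
               def, max);
        return 0;
    }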
* Re: setsockopt() 2008-07-07 18:18 setsockopt() Olga Kornievskaia 2008-07-07 21:24 ` setsockopt() Stephen Hemminger @ 2008-07-08 1:17 ` John Heffner 1 sibling, 0 replies; 43+ messages in thread From: John Heffner @ 2008-07-08 1:17 UTC (permalink / raw) To: Olga Kornievskaia; +Cc: netdev, Jim Rees, J. Bruce Fields On Mon, Jul 7, 2008 at 11:18 AM, Olga Kornievskaia <aglo@citi.umich.edu> wrote: > Can somebody advise us on how to properly set send/receive buffer sizes > for the NFSD in the kernel such that (1) the window is not bound by the > TCP's default sysctl value and (2) if it is possible to do so for the accept > sockets and not the listening socket. As others have said, most likely you'd be better off without calling SO_{SND,RCV}BUF. It's possible but difficult in some circumstances to do better than the kernel's autotuning. If you must do SO_RCVBUF, you also need to set TCP_WINDOW_CLAMP. It would probably be better if the kernel would recalculate window_clamp on an SO_RCVBUF automatically, though this is slightly problematic from a layering point of view. Note, however, that changing SO_RCVBUF after connection establishment is not supported on many OS's, and usually isn't what you want to do. -John ^ permalink raw reply [flat|nested] 43+ messages in thread
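If an application really must pin the receive buffer, John's advice translates into roughly the following sketch.  The values are illustrative, the error handling is minimal, and it assumes the window clamp, like the buffer sizes, is inherited by sockets returned from accept():

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Fix the receive buffer AND the window clamp on the listening socket,
     * before listen(), so accepted sockets pick up both settings. */
    int listen_with_fixed_rcvbuf(unsigned short port, int bytes)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        if (fd < 0)
            return -1;

        /* Disables receive autotuning on every socket accepted from fd. */
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
        /* Without this, window_clamp stays derived from the old sk_rcvbuf. */
        setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, &bytes, sizeof(bytes));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 64) < 0)
            return -1;
        return fd;
    }

As the thread repeatedly notes, though, the simpler and usually better option is to set neither and let autotuning size the buffers.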
end of thread, other threads: [~2008-07-11  0:51 UTC | newest]

Thread overview: 43+ messages -- links below jump to the message on this page:
2008-07-07 18:18 setsockopt() Olga Kornievskaia
2008-07-07 21:24 ` setsockopt() Stephen Hemminger
2008-07-07 21:30 ` setsockopt() Olga Kornievskaia
2008-07-07 21:33 ` setsockopt() Stephen Hemminger
2008-07-07 21:49 ` setsockopt() David Miller
2008-07-08 4:54 ` setsockopt() Evgeniy Polyakov
2008-07-08 6:02 ` setsockopt() Bill Fink
2008-07-08 6:29 ` setsockopt() Roland Dreier
2008-07-08 6:43 ` setsockopt() Evgeniy Polyakov
2008-07-08 7:03 ` setsockopt() Roland Dreier
2008-07-08 18:48 ` setsockopt() Bill Fink
2008-07-09 18:10 ` setsockopt() Roland Dreier
2008-07-09 18:34 ` setsockopt() Evgeniy Polyakov
2008-07-10 2:50 ` setsockopt() Bill Fink
2008-07-10 17:26 ` setsockopt() Rick Jones
2008-07-11 0:50 ` setsockopt() Bill Fink
2008-07-08 20:48 ` setsockopt() Stephen Hemminger
2008-07-08 22:05 ` setsockopt() Bill Fink
2008-07-09 5:25 ` setsockopt() Evgeniy Polyakov
2008-07-09 5:47 ` setsockopt() Bill Fink
2008-07-09 6:03 ` setsockopt() Evgeniy Polyakov
2008-07-09 18:11 ` setsockopt() J. Bruce Fields
2008-07-09 18:43 ` setsockopt() Evgeniy Polyakov
2008-07-09 22:28 ` setsockopt() J. Bruce Fields
2008-07-10 1:06 ` setsockopt() Evgeniy Polyakov
2008-07-10 20:05 ` [PATCH] Documentation: clarify tcp_{r,w}mem sysctl docs J. Bruce Fields
2008-07-10 23:50 ` David Miller
2008-07-08 20:12 ` setsockopt() Jim Rees
2008-07-08 21:54 ` setsockopt() John Heffner
2008-07-08 23:51 ` setsockopt() Jim Rees
2008-07-09 0:07 ` setsockopt() John Heffner
2008-07-07 22:50 ` setsockopt() Rick Jones
2008-07-07 23:00 ` setsockopt() David Miller
2008-07-07 23:27 ` setsockopt() Rick Jones
2008-07-08 1:15 ` setsockopt() Rick Jones
2008-07-08 1:48 ` setsockopt() J. Bruce Fields
2008-07-08 1:44 ` setsockopt() David Miller
2008-07-08 3:33 ` setsockopt() John Heffner
2008-07-08 18:16 ` setsockopt() Rick Jones
2008-07-08 19:10 ` setsockopt() John Heffner
[not found] ` <349f35ee0807090255s58fd040bne265ee117d06d397@mail.gmail.com>
2008-07-09 10:38 ` setsockopt() Jerry Chu
2008-07-07 21:32 ` setsockopt() J. Bruce Fields
2008-07-08 1:17 ` setsockopt() John Heffner