* can TCP socket send buffer be over used?
@ 2010-08-04  0:22 UTC
From: Jack Zhang
To: netdev

Hi there,

I'm doing experiments with (modified*) software iSCSI over a link with an
emulated Round-Trip Time (RTT) of 100 ms, set up with netem.

For example, when I set the send buffer size to 128 KB, I get throughput of
up to 43 Mbps, which seems impossible, since (buffer size) / RTT is only
10 Mbps. And when I set the send buffer size to 512 KB, I get throughput of
up to 60 Mbps, which also seems impossible, since (buffer size) / RTT is
only 40 Mbps.

I understand that when I set the buffer size to 128 KB, I actually get a
buffer of 256 KB, because the kernel doubles the requested size. I also
understand that half of the doubled buffer is used for metadata rather than
the actual data to be transferred. So the effective buffer sizes in the two
examples are just 128 KB and 512 KB respectively.

This confused me because, theoretically, send buffers of 128 KB and 512 KB
should achieve no more than 10 Mbps and 40 Mbps respectively, yet I got far
more than that limit. Can the send buffer somehow be "overused"? Or is some
other mechanism inside TCP doing an optimization?

* The modifications are: disable TCP_NODELAY, enable "use_clustering" for
SCSI, and set different send buffer sizes on the TCP socket.

Any ideas will be highly appreciated. Thanks a lot!
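[Editor's note: the window-limited bound Jack is computing can be sanity-checked numerically. A minimal sketch, using the 128 KB / 512 KB figures and 100 ms RTT from the message above; this models only the textbook buffer/RTT ceiling, not actual kernel behavior:]

```python
def window_limited_mbps(buffer_bytes, rtt_s):
    """Classic window-limited TCP bound: at most one send buffer
    of data can be in flight per round trip."""
    return buffer_bytes * 8 / rtt_s / 1e6

# Jack's two cases over the 100 ms netem link:
print(window_limited_mbps(128 * 1024, 0.100))  # ~10.5 Mbps
print(window_limited_mbps(512 * 1024, 0.100))  # ~41.9 Mbps
```

Both observed rates (43 and 60 Mbps) exceed these ceilings, which is the puzzle the thread goes on to resolve.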
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  0:30 UTC
From: Rick Jones
To: Jack Zhang; +Cc: netdev

Jack Zhang wrote:
> I understand that when the buffer size is set to 128 KB, I actually
> got a buffer of 256 KB as the kernel doubles the buffer size. I also
> understand that half the doubled buffer size is used for meta data
> instead of the actual data to be transferred. So basically the
> effective buffer sizes for the two examples are just 128 KB and 512
> KB respectively.

It may not be strictly 1/2. One way to check would be to take a tcpdump
trace on the sending side, and either work out manually the most the
connection has outstanding at a time, or run the binary trace through
something like tcptrace.

rick jones
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  0:48 UTC
From: Jack Zhang
To: Rick Jones; +Cc: netdev

Hi Rick,

Thanks for your reply.

Do you happen to know which part of the source code implements how much of
the send buffer can actually be used for the data payload?

Thanks a lot!
Jack

On 3 August 2010 18:30, Rick Jones <rick.jones2@hp.com> wrote:
> It may not be strictly 1/2. One way to check would be to take a tcpdump
> trace on the sending side, and either work out manually the most the
> connection has outstanding at a time, or run the binary trace through
> something like tcptrace.
>
> rick jones
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  1:17 UTC
From: Rick Jones
To: Jack Zhang; +Cc: netdev

Jack Zhang wrote:
> Do you maybe know which part of the source code implements the details
> about how much send buffer can actually be used for the data payload?

Specifically no. I've not had to go looking for it.

happy benchmarking,

rick jones
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  7:20 UTC
From: Bill Fink
To: Jack Zhang; +Cc: netdev

On Tue, 3 Aug 2010, Jack Zhang wrote:

> For example, when I set the send buffer size to 128 KB, i could get a
> throughput up to 43 Mbps, which seems to be impossible as the (buffer
> size) / RTT is only 10 Mbps.

I'm not sure what's going on with this first case.

> And When I set the send buffer size to 512 KB, i can get a throughput
> up to 60 Mbps, which also seems to be impossible as the (buffer size)
> / RTT is only 40 Mbps.

But this case seems just about right. Linux doubles the requested buffer
size, then uses one quarter of that for overhead (not half), so you
effectively get 50% more than requested (2X * 3/4 = 1.5X). Plugging your
case into bc:

wizin% bc
scale=10
512*1024*8/0.100/10^6*3/2
62.9145600000

-Bill
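[Editor's note: Bill's bc arithmetic can be restated as a small function — the kernel doubles the requested size and treats one quarter of the doubled amount as overhead, leaving 1.5× the request as effective window. A sketch of the calculation only, not of kernel internals:]

```python
def effective_window_mbps(requested_bytes, rtt_s):
    # Kernel doubles the request; 1/4 of the doubled size is overhead,
    # leaving 2 * 3/4 = 1.5x the requested size as usable window.
    effective_bytes = requested_bytes * 2 * 3 / 4
    return effective_bytes * 8 / rtt_s / 1e6

print(effective_window_mbps(512 * 1024, 0.100))  # ~62.9 Mbps, matching bc
print(effective_window_mbps(128 * 1024, 0.100))  # ~15.7 Mbps
```

The 512 KB case lands right on the observed 60 Mbps; the 128 KB case (~15.7 Mbps) still falls well short of the observed 43 Mbps, which is the "first case" oddity Bill sets aside.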
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  8:00 UTC
From: Jack Zhang
To: Bill Fink; +Cc: netdev

Hi Bill,

Thanks a lot for your help. It does make sense!

As I'm writing this up in my master's thesis: do you happen to know which
part of the source code I could cite as evidence that Linux uses 1/4 of
the doubled buffer size for metadata?

Thanks,
Jack

On 4 August 2010 01:20, Bill Fink <billfink@mindspring.com> wrote:
> But this case seems just about right. Linux doubles the requested
> buffer size, then uses one quarter of that for overhead (not half),
> so you effectively get 50% more than requested (2X * 3/4 = 1.5X).
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  9:07 UTC
From: Bill Fink
To: Jack Zhang; +Cc: netdev

On Wed, 4 Aug 2010, Jack Zhang wrote:

> As I'm writing this part into my master thesis, do you happen to know
> which part in the source code I can maybe use as a proof in the thesis
> that Linux uses 1/4 of the doubled buffer size for metadata?

I don't know about the source code, but from
Documentation/networking/ip-sysctl.txt:

    tcp_adv_win_scale - INTEGER
        Count buffering overhead as bytes/2^tcp_adv_win_scale
        (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
        if it is <= 0.
        Default: 2

wizin% cat /proc/sys/net/ipv4/tcp_adv_win_scale
2

For the oddity involving the 128 KB window case, it seems to have something
to do with TCP receiver autotuning. On a real cross-country link (~80 ms
RTT), the best to be expected is:

wizin% bc
scale=10
128*1024*8/0.080/10^6*3/2
19.6608000000

And an actual 60-second nuttcp test (which by default sets both the sender
and receiver socket buffer sizes):

netem1% nuttcp -T60 -i5 -w128 192.168.1.18
  8.8125 MB /  5.00 sec =  14.7849 Mbps     0 retrans
  9.2500 MB /  5.00 sec =  15.5189 Mbps     0 retrans
  9.1875 MB /  5.00 sec =  15.4141 Mbps     0 retrans
  9.5000 MB /  5.00 sec =  15.9384 Mbps     0 retrans
  9.1250 MB /  5.00 sec =  15.3092 Mbps     0 retrans
  9.1875 MB /  5.00 sec =  15.4141 Mbps     0 retrans
  9.4375 MB /  5.00 sec =  15.8335 Mbps     0 retrans
  9.3125 MB /  5.00 sec =  15.6238 Mbps     0 retrans
  9.3125 MB /  5.00 sec =  15.6238 Mbps     0 retrans
  9.1250 MB /  5.00 sec =  15.3092 Mbps     0 retrans
  9.1875 MB /  5.00 sec =  15.4141 Mbps     0 retrans
  9.4375 MB /  5.00 sec =  15.8335 Mbps     0 retrans
111.0100 MB / 60.13 sec =  15.4867 Mbps 0 %TX 0 %RX 0 retrans 80.59 msRTT

But if I allow the receiver to do autotuning by specifying a server window
size of 0:

netem1% nuttcp -T60 -i5 -w128 -ws0 192.168.1.18
 14.3125 MB /  5.00 sec =  24.0123 Mbps     0 retrans
 15.5000 MB /  5.00 sec =  26.0047 Mbps     0 retrans
 15.5000 MB /  5.00 sec =  26.0047 Mbps     0 retrans
 15.5000 MB /  5.00 sec =  26.0047 Mbps     0 retrans
 15.3750 MB /  5.00 sec =  25.7950 Mbps     0 retrans
 15.3750 MB /  5.00 sec =  25.7950 Mbps     0 retrans
 15.5000 MB /  5.00 sec =  26.0047 Mbps     0 retrans
 15.5000 MB /  5.00 sec =  26.0047 Mbps     0 retrans
 15.5000 MB /  5.00 sec =  26.0047 Mbps     0 retrans
 15.3750 MB /  5.00 sec =  25.7950 Mbps     0 retrans
 15.5000 MB /  5.00 sec =  26.0047 Mbps     0 retrans
 15.3750 MB /  5.00 sec =  25.7950 Mbps     0 retrans
184.3643 MB / 60.04 sec =  25.7609 Mbps 0 %TX 0 %RX 0 retrans 80.58 msRTT

This kind of makes sense, since with autotuning the receiver is allowed to
increase the socket buffer size beyond 128 KB. One would have to tcpdump
the packet flow to see what the receiver's advertised TCP window was. Rate
throttling by specifying the socket buffer size only seems to be truly
effective when done by the receiver, not when it's only done on the sender
side.

-Bill

P.S. BTW, I've also seen cases (on some older kernels) where the window
scale used was 1 more than it should have been, resulting in the receiver's
advertised TCP window being twice what one would have expected. tcpdump can
also be used to verify proper functioning of the window scaling.
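[Editor's note: the tcp_adv_win_scale formula Bill quotes can be sketched as a function mapping the sysctl value to the fraction of the socket buffer usable as window. Illustration only; the authoritative definition is the ip-sysctl.txt text above:]

```python
def usable_fraction(tcp_adv_win_scale):
    """Fraction of the socket buffer counted as window rather than
    overhead, per Documentation/networking/ip-sysctl.txt."""
    if tcp_adv_win_scale > 0:
        overhead = 1.0 / 2 ** tcp_adv_win_scale           # bytes/2^scale
    else:
        overhead = 1.0 - 1.0 / 2 ** (-tcp_adv_win_scale)  # bytes - bytes/2^(-scale)
    return 1.0 - overhead

print(usable_fraction(2))  # 0.75 -- the default: 1/4 reserved for overhead
print(usable_fraction(1))  # 0.5  -- half the buffer reserved
```

With the default of 2, the usable 3/4 of the doubled buffer is exactly the 2 × 3/4 = 1.5× factor in Bill's earlier calculation.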
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  7:33 UTC
From: Mikael Abrahamsson
To: Jack Zhang; +Cc: netdev

On Tue, 3 Aug 2010, Jack Zhang wrote:

> For example, when I set the send buffer size to 128 KB, i could get a
> throughput up to 43 Mbps, which seems to be impossible as the (buffer
> size) / RTT is only 10 Mbps.

Are you sure the buffer actually corresponds to the congestion window TCP
uses? I think you should use wireshark to dump the traffic and look in the
TCP headers of the packets to see what is actually going on on the wire.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  8:03 UTC
From: Jack Zhang
To: Mikael Abrahamsson; +Cc: netdev

Hi Mikael,

Thanks for your reply. I'll definitely try that.

Quick question, though: the link I use in my test does not have any packet
loss (it's a straight-through cable between two PCs). In this case, would
the TCP congestion window size affect the result at all?

Thanks,
Jack

On 4 August 2010 01:33, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
> Are you sure the buffer actually corresponds to the congestion window TCP
> uses? I think you should use wireshark to dump the traffic and look in the
> TCP headers of the packets to see what is actually going on on the wire.
* Re: can TCP socket send buffer be over used?
@ 2010-08-04  8:21 UTC
From: Mikael Abrahamsson
To: Jack Zhang; +Cc: netdev

On Wed, 4 Aug 2010, Jack Zhang wrote:

> Quick question though... the link I use in my test does not have any
> packet loss (it's a straight through cable between two PCs)... in this
> case, would TCP congestion window size affect the result at all?

I always mix up the different TCP windows, but I mean the maximum window
size the sender will use for the session. I'm not sure this is the same as
the buffer you're tuning with your userspace option.

An easy test: set the userspace option to 64k and measure the speed you
get, then turn off window scaling in /proc and try again. If you get wildly
different results (turning off window scaling limits the window to 64k),
then the buffer you're tuning doesn't do what you think it does.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
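[Editor's note: Mikael's proposed test has a predictable signature. With window scaling disabled, the advertised window is capped by the 16-bit TCP window field at 65535 bytes (RFC 1323), so there is a hard throughput ceiling; a back-of-the-envelope check using the ~80 ms RTT from Bill's measurements:]

```python
def capped_mbps(window_bytes, rtt_s):
    # Throughput ceiling when the advertised receive window is the
    # bottleneck: one window of data per round trip.
    return window_bytes * 8 / rtt_s / 1e6

# Maximum 16-bit window without the window scale option:
print(capped_mbps(65535, 0.080))  # ~6.6 Mbps
```

If throughput with window scaling off stays pinned near this ceiling while the tuned-buffer run goes well past it, the socket option being set is not what is actually limiting the sender's window.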
Thread overview: 10+ messages (newest: 2010-08-04 9:07 UTC)

2010-08-04  0:22 can TCP socket send buffer be over used? Jack Zhang
2010-08-04  0:30 ` Rick Jones
2010-08-04  0:48 ` Jack Zhang
2010-08-04  1:17 ` Rick Jones
2010-08-04  7:20 ` Bill Fink
2010-08-04  8:00 ` Jack Zhang
2010-08-04  9:07 ` Bill Fink
2010-08-04  7:33 ` Mikael Abrahamsson
2010-08-04  8:03 ` Jack Zhang
2010-08-04  8:21 ` Mikael Abrahamsson