* can TCP socket send buffer be over used?
From: Jack Zhang @ 2010-08-04 0:22 UTC
To: netdev
Hi there,
I'm doing experiments with (modified*) software iSCSI over a link with
a Round-Trip Time (RTT) of 100 ms emulated by netem.
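For what it's worth, the delay is added with a plain netem qdisc, roughly
like this (eth1 stands in for whatever the actual test interface is):
# 50 ms of egress delay on each host adds up to a 100 ms round trip;
# adding the full 100 ms on one side only would work just as well
tc qdisc add dev eth1 root netem delay 50ms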
For example, when I set the send buffer size to 128 KB, I could get a
throughput of up to 43 Mbps, which seems impossible as (buffer size) /
RTT is only about 10 Mbps.
And when I set the send buffer size to 512 KB, I can get a throughput
of up to 60 Mbps, which also seems impossible as (buffer size) / RTT
is only about 40 Mbps.
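Spelling out the arithmetic behind those nominal limits:
# (buffer size) / RTT for the two cases, in Mbps
echo 'scale=2; 128*1024*8/0.100/10^6; 512*1024*8/0.100/10^6' | bc
10.48
41.94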
I understand that when the buffer size is set to 128 KB, I actually
get a buffer of 256 KB, as the kernel doubles the requested size. I
also understand that half of the doubled buffer is used for metadata
rather than the actual data to be transferred, so the effective buffer
sizes in the two examples are just 128 KB and 512 KB respectively.
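If it helps, the doubling is easy to see on the live socket: with a
reasonably recent iproute2, the tb<bytes> value in ss's skmem output is
the kernel's send buffer limit, i.e. the doubled size.
# per-socket memory details; find the iSCSI connection and look at its
# skmem:(...,tb<bytes>,...) entry, which is sk_sndbuf
ss -tmn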
So I am confused: theoretically, send buffers of 128 KB and 512 KB
should achieve no more than about 10 Mbps and 40 Mbps respectively, but
I was able to get far more than that. Is there any chance the send
buffer can be "overused", or is some other mechanism inside TCP doing
some optimization?
* The modifications are to disable TCP_NODELAY, enable "use_clustering"
for SCSI, and set different send buffer sizes on the TCP socket.
Any ideas would be highly appreciated.
Thanks a lot!
* Re: can TCP socket send buffer be over used?
From: Rick Jones @ 2010-08-04 0:30 UTC
To: Jack Zhang; +Cc: netdev
Jack Zhang wrote:
> I understand that when the buffer size is set to 128 KB, I actually
> get a buffer of 256 KB, as the kernel doubles the requested size. I
> also understand that half of the doubled buffer is used for metadata
> rather than the actual data to be transferred, so the effective buffer
> sizes in the two examples are just 128 KB and 512 KB respectively.
It may not be strictly 1/2. One way to check would be to take a tcpdump
trace on the sending side, and either work out by hand the most data the
connection has outstanding at any one time, or run the binary trace
through something like tcptrace.
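Roughly along these lines, with the interface, port and file name as
placeholders; if memory serves, tcptrace's long output includes figures
for how much data the connection had outstanding:
# capture headers on the sending side for the duration of a run
# (3260 being the usual iSCSI port)
tcpdump -i eth1 -s 128 -w iscsi.pcap port 3260
# then print detailed per-connection statistics from the capture
tcptrace -l iscsi.pcap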
rick jones
* Re: can TCP socket send buffer be over used?
From: Jack Zhang @ 2010-08-04 0:48 UTC
To: Rick Jones; +Cc: netdev
Hi Rick,
Thanks for your reply.
Do you happen to know which part of the source code determines how
much of the send buffer can actually be used for the data payload?
Thanks a lot!
Jack
* Re: can TCP socket send buffer be over used?
From: Rick Jones @ 2010-08-04 1:17 UTC
To: Jack Zhang; +Cc: netdev
Jack Zhang wrote:
> Hi Rick,
>
> Thanks for your reply.
>
> Do you happen to know which part of the source code determines how
> much of the send buffer can actually be used for the data payload?
Specifically no. I've not had to go looking for it.
happy benchmarking,
rick jones
* Re: can TCP socket send buffer be over used?
From: Bill Fink @ 2010-08-04 7:20 UTC
To: Jack Zhang; +Cc: netdev
On Tue, 3 Aug 2010, Jack Zhang wrote:
> Hi there,
>
> I'm doing experiments with (modified*) software iSCSI over a link with
> a Round-Trip Time (RTT) of 100 ms emulated by netem.
>
> For example, when I set the send buffer size to 128 KB, I could get a
> throughput of up to 43 Mbps, which seems impossible as (buffer size) /
> RTT is only about 10 Mbps.
I'm not sure what's going on with this first case.
> And when I set the send buffer size to 512 KB, I can get a throughput
> of up to 60 Mbps, which also seems impossible as (buffer size) / RTT
> is only about 40 Mbps.
But this case seems just about right. Linux doubles the requested
buffer size, then uses one quarter of that for overhead (not half),
so you effectively get 50% more than requested (2X * 3/4 = 1.5X).
Plugging your case into bc:
wizin% bc
scale=10
512*1024*8/0.100/10^6*3/2
62.9145600000
-Bill
* Re: can TCP socket send buffer be over used?
From: Mikael Abrahamsson @ 2010-08-04 7:33 UTC
To: Jack Zhang; +Cc: netdev
On Tue, 3 Aug 2010, Jack Zhang wrote:
> For example, when I set the send buffer size to 128 KB, I could get a
> throughput of up to 43 Mbps, which seems impossible as (buffer size) /
> RTT is only about 10 Mbps.
Are you sure the buffer actually corresponds to the congestion window TCP
uses? I think you should use Wireshark to dump the traffic and look at the
TCP headers of the packets to see what is actually going on on the wire.
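If clicking through a GUI is too tedious, tshark can pull the advertised
windows straight out of a capture (field names vary a bit between
Wireshark versions, and iscsi.pcap is just a placeholder name):
# advertised receive window per packet, with timestamp and source address
tshark -r iscsi.pcap -T fields \
    -e frame.time_relative -e ip.src -e tcp.window_size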
--
Mikael Abrahamsson email: swmike@swm.pp.se
* Re: can TCP socket send buffer be over used?
From: Jack Zhang @ 2010-08-04 8:00 UTC
To: Bill Fink; +Cc: netdev
Hi Bill,
Thanks a lot for your help.
It does make sense!
As I'm writing this up in my master's thesis, do you happen to know
which part of the source code I could cite as evidence that Linux uses
1/4 of the doubled buffer size for metadata?
Thanks,
Jack
* Re: can TCP socket send buffer be over used?
From: Jack Zhang @ 2010-08-04 8:03 UTC
To: Mikael Abrahamsson; +Cc: netdev
Hi Mikael,
Thanks for your reply. I'll definitely try that.
Quick question though... the link I use in my test does not have any
packet loss (it's a straight-through cable between two PCs)... in this
case, would the TCP congestion window size affect the result at all?
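In the meantime I'll keep an eye on the sender's congestion window while
a test runs; something like this should do, assuming iproute2's ss is
available:
# refresh the TCP socket details every second; the cwnd: field is the
# congestion window in segments
watch -n1 'ss -tin'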
Thanks,
Jack
* Re: can TCP socket send buffer be over used?
From: Mikael Abrahamsson @ 2010-08-04 8:21 UTC
To: Jack Zhang; +Cc: netdev
On Wed, 4 Aug 2010, Jack Zhang wrote:
> Hi Mikael,
>
> Thanks for your reply. I'll definitely try that.
>
> Quick question though... the link I use in my test does not have any
> packet loss (it's a straight-through cable between two PCs)... in this
> case, would the TCP congestion window size affect the result at all?
I always mix up the different TCP windows, but I mean the maximum window
size the sender will use for the session. I'm not sure this is the same as
the buffer you're tuning with your userspace option.
An easy test would be to set the userspace option to 64k, test what
speeds you get, then turn off window scaling in /proc and try again. If
you get wildly different results (turning off window scaling limits the
window to 64k), then the buffer you're tuning doesn't do what you think
it does.
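Something along these lines (the sysctl name should be right, but
double-check on your kernel):
# window scaling is negotiated on the SYN, so restart the connection
# after flipping this
sysctl -w net.ipv4.tcp_window_scaling=0
# ... rerun the 64k transfer, compare, then turn scaling back on ...
sysctl -w net.ipv4.tcp_window_scaling=1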
--
Mikael Abrahamsson email: swmike@swm.pp.se
* Re: can TCP socket send buffer be over used?
From: Bill Fink @ 2010-08-04 9:07 UTC
To: Jack Zhang; +Cc: netdev
On Wed, 4 Aug 2010, Jack Zhang wrote:
> Hi Bill,
>
> Thanks a lot for your help.
>
> It does make sense!
>
> As I'm writing this up in my master's thesis, do you happen to know
> which part of the source code I could cite as evidence that Linux uses
> 1/4 of the doubled buffer size for metadata?
Don't know about the source code, but from
Documentation/networking/ip-sysctl.txt:
tcp_adv_win_scale - INTEGER
Count buffering overhead as bytes/2^tcp_adv_win_scale
(if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
if it is <= 0.
Default: 2
wizin% cat /proc/sys/net/ipv4/tcp_adv_win_scale
2
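So with the default of 2, the overhead share is 1/2^2 and the usable
share of the (doubled) buffer is 3/4, which is where the 2X * 3/4 = 1.5X
factor from my earlier mail comes from. For your 512 KB case:
# usable bytes out of the doubled 512 KB buffer with tcp_adv_win_scale = 2
echo 'scale=10; 2*512*1024*(1 - 1/2^2)' | bc
786432.0000000000
786432 bytes is 768 KB, and 768 KB per 100 ms RTT is the 62.9 Mbps from
the earlier calculation.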
For the oddity involving the 128 KB window case, it seems to have
something to do with the TCP receiver autotuning. On a real
cross-country link (~80 ms RTT), the best to be expected is:
wizin% bc
scale=10
128*1024*8/0.080/10^6*3/2
19.6608000000
And an actual 60-second nuttcp test (which by default sets both
the sender and receiver socket buffer sizes):
netem1% nuttcp -T60 -i5 -w128 192.168.1.18
8.8125 MB / 5.00 sec = 14.7849 Mbps 0 retrans
9.2500 MB / 5.00 sec = 15.5189 Mbps 0 retrans
9.1875 MB / 5.00 sec = 15.4141 Mbps 0 retrans
9.5000 MB / 5.00 sec = 15.9384 Mbps 0 retrans
9.1250 MB / 5.00 sec = 15.3092 Mbps 0 retrans
9.1875 MB / 5.00 sec = 15.4141 Mbps 0 retrans
9.4375 MB / 5.00 sec = 15.8335 Mbps 0 retrans
9.3125 MB / 5.00 sec = 15.6238 Mbps 0 retrans
9.3125 MB / 5.00 sec = 15.6238 Mbps 0 retrans
9.1250 MB / 5.00 sec = 15.3092 Mbps 0 retrans
9.1875 MB / 5.00 sec = 15.4141 Mbps 0 retrans
9.4375 MB / 5.00 sec = 15.8335 Mbps 0 retrans
111.0100 MB / 60.13 sec = 15.4867 Mbps 0 %TX 0 %RX 0 retrans 80.59 msRTT
But if I allow the receiver to do autotuning by specifying
a server window size of 0:
netem1% nuttcp -T60 -i5 -w128 -ws0 192.168.1.18
14.3125 MB / 5.00 sec = 24.0123 Mbps 0 retrans
15.5000 MB / 5.00 sec = 26.0047 Mbps 0 retrans
15.5000 MB / 5.00 sec = 26.0047 Mbps 0 retrans
15.5000 MB / 5.00 sec = 26.0047 Mbps 0 retrans
15.3750 MB / 5.00 sec = 25.7950 Mbps 0 retrans
15.3750 MB / 5.00 sec = 25.7950 Mbps 0 retrans
15.5000 MB / 5.00 sec = 26.0047 Mbps 0 retrans
15.5000 MB / 5.00 sec = 26.0047 Mbps 0 retrans
15.5000 MB / 5.00 sec = 26.0047 Mbps 0 retrans
15.3750 MB / 5.00 sec = 25.7950 Mbps 0 retrans
15.5000 MB / 5.00 sec = 26.0047 Mbps 0 retrans
15.3750 MB / 5.00 sec = 25.7950 Mbps 0 retrans
184.3643 MB / 60.04 sec = 25.7609 Mbps 0 %TX 0 %RX 0 retrans 80.58 msRTT
This kind of makes sense since with autotuning, the receiver
is allowed to increase the socket buffer size beyond 128 KB.
One would have to tcpdump the packet flow to see what the
receiver's advertised TCP window was. Rate throttling by
specifying the socket buffer size only seems to be truly
effective when done by the receiver, not when it's only
done on the sender side.
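The headroom the receiver has for autotuning is set by the tcp_rmem
limits (min, default, max, in bytes); it's the max value that lets the
advertised window grow well beyond the 128 KB the sender asked for:
# receiver-side autotuning operates within these three limits;
# the third number bounds how large the receive buffer can grow
cat /proc/sys/net/ipv4/tcp_rmem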
-Bill
P.S. BTW, I've also seen cases (on some older kernels) where the window
scale used was 1 more than it should have been, resulting in the
receiver's advertised TCP window being twice what one would have
expected. tcpdump can also be used to verify that window scaling is
functioning properly.
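Something like this on either end shows the negotiated shift counts,
which turn up as the wscale option on the SYN and SYN-ACK (eth1 being
whatever interface the traffic crosses):
# capture just the connection setup; tcpdump prints the TCP options,
# including "wscale N", for each SYN
tcpdump -n -i eth1 'tcp[tcpflags] & (tcp-syn) != 0'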