* Socket buffer sizes with autotuning
@ 2008-04-23 0:38 Rick Jones
2008-04-23 2:17 ` John Heffner
0 siblings, 1 reply; 56+ messages in thread
From: Rick Jones @ 2008-04-23 0:38 UTC (permalink / raw)
To: Linux Network Development list
One of the issues with netperf and linux is that netperf only snaps the
socket buffer size at the beginning of the connection. This of course
does not catch what the socket buffer size might become over the
lifetime of the connection. So, in the in-development "omni" tests I've
added code that, when running on Linux, will snap the socket buffer sizes
at both the beginning and end of the data connection. I was a trifle
surprised at some of what I saw with a 1G connection between systems -
when autoscaling/ranging/tuning/whatever was active (netperf taking
defaults and not calling setsockopt()) I was seeing the socket buffer
size at the end of the connection up at 4MB:
sut34:~/netperf2_trunk# netperf -l 1 -t omni -H oslowest -- -d 4 -o bar
-s -1 -S -1 -m ,16K
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.raj
(10.208.0.1) port 0 AF_INET
Throughput,Direction,Local Release,Local Recv Socket Size
Requested,Local Recv Socket Size Initial,Local Recv Socket Size
Final,Remote Release,Remote Send Socket Size Requested,Remote Send
Socket Size Initial,Remote Send Socket Size Final
940.52,Receive,2.6.25-raj,-1,87380,4194304,2.6.18-5-mckinley,-1,16384,4194304
Which was the limit of the autotuning:
net.ipv4.tcp_wmem = 16384 16384 4194304
net.ipv4.tcp_rmem = 16384 87380 4194304
The test above is basically the omni version of a TCP_MAERTS test from a
2.6.18 system to a 2.6.25 system (kernel bits grabbed about 40 minutes
ago from http://www.kernel.org/hg/linux-2.6). The receiving system on
which the 2.6.25 bits were compiled and run started life as a Debian
Lenny/Testing system. The sender is, IIRC, Debian Etch.
It seemed odd to me that one would need a 4MB socket buffer to get
link-rate on gigabit, so I ran a quick set of tests to confirm in my
mind that indeed, a much smaller socket buffer was sufficient:
sut34:~/netperf2_trunk# HDR="-P 1"; for i in -1 32K 64K 128K 256K 512K;
do netperf -l 20 -t omni -H oslowest $HDR -- -d 4 -o bar -s $i -S $i -m
,16K; HDR="-P 0"; done
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.raj
(10.208.0.1) port 0 AF_INET
Throughput,Direction,Local Release,Local Recv Socket Size
Requested,Local Recv Socket Size Initial,Local Recv Socket Size
Final,Remote Release,Remote Send Socket Size Requested,Remote Send
Socket Size Initial,Remote Send Socket Size Final
941.38,Receive,2.6.25-raj,-1,87380,4194304,2.6.18-5-mckinley,-1,16384,4194304
939.29,Receive,2.6.25-raj,32768,65536,65536,2.6.18-5-mckinley,32768,65536,65536
940.28,Receive,2.6.25-raj,65536,131072,131072,2.6.18-5-mckinley,65536,131072,131072
940.96,Receive,2.6.25-raj,131072,262142,262142,2.6.18-5-mckinley,131072,253952,253952
940.99,Receive,2.6.25-raj,262144,262142,262142,2.6.18-5-mckinley,262144,253952,253952
940.98,Receive,2.6.25-raj,524288,262142,262142,2.6.18-5-mckinley,524288,253952,253952
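For context, a rough bandwidth-delay-product estimate is consistent with a
socket buffer in the 64K-256K range being plenty at gigabit speed. The RTT
below is an assumed LAN-ish figure, not something measured in these runs:
/* Back-of-the-envelope BDP check; the RTT is assumed, not measured. */
#include <stdio.h>
int main(void)
{
        double rate_bps = 1e9;      /* 1 Gbit/s link */
        double rtt_s    = 0.0005;   /* assumed round-trip time, ~0.5 ms */
        double bdp      = rate_bps * rtt_s / 8.0;   /* bits -> bytes */
        printf("BDP = %.0f bytes (~%.0f KB)\n", bdp, bdp / 1024.0);
        return 0;
}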
And then I decided to let the receiver autotune while the sender was
either autotuned or fixed (simulating something other than Linux sending,
I suppose):
sut34:~/netperf2_trunk# HDR="-P 1"; for i in -1 32K 64K 128K 256K 512K;
do netperf -l 20 -t omni -H oslowest $HDR -- -d 4 -o bar -s -1 -S $i -m
,16K; HDR="-P 0"; done
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.raj
(10.208.0.1) port 0 AF_INET
Throughput,Direction,Local Release,Local Recv Socket Size
Requested,Local Recv Socket Size Initial,Local Recv Socket Size
Final,Remote Release,Remote Send Socket Size Requested,Remote Send
Socket Size Initial,Remote Send Socket Size Final
941.38,Receive,2.6.25-raj,-1,87380,4194304,2.6.18-5-mckinley,-1,16384,4194304
941.34,Receive,2.6.25-raj,-1,87380,1337056,2.6.18-5-mckinley,32768,65536,65536
941.35,Receive,2.6.25-raj,-1,87380,1814576,2.6.18-5-mckinley,65536,131072,131072
941.38,Receive,2.6.25-raj,-1,87380,2645664,2.6.18-5-mckinley,131072,253952,253952
941.39,Receive,2.6.25-raj,-1,87380,2649728,2.6.18-5-mckinley,262144,253952,253952
941.38,Receive,2.6.25-raj,-1,87380,2653792,2.6.18-5-mckinley,524288,253952,253952
Finally, to see what was going on the wire (in case it was simply the
socket buffer getting larger and not also the window), I took a packet
trace on the sender to look at the window updates coming back, and sure
enough, by the end of the connection the advertised window was huge
(win 24576 with wscale 7 is 24576 * 128, or roughly 3.1 MB):
17:10:00.522200 IP sut34.raj.53459 > oslowest.raj.37322: S
3334965237:3334965237(0) win 5840 <mss 1460,sackOK,timestamp 4294921737
0,nop,wscale 7>
17:10:00.522214 IP oslowest.raj.37322 > sut34.raj.53459: S
962695631:962695631(0) ack 3334965238 win 5792 <mss
1460,sackOK,timestamp 3303630187 4294921737,nop,wscale 7>
...
17:10:01.554698 IP sut34.raj.53459 > oslowest.raj.37322: . ack 121392225
win 24576 <nop,nop,timestamp 4294921995 3303630438>
17:10:01.554706 IP sut34.raj.53459 > oslowest.raj.37322: . ack 121395121
win 24576 <nop,nop,timestamp 4294921995 3303630438>
I also checked (during a different connection, autotuning at both ends)
how much was actually queued at the sender, and it was indeed rather large:
oslowest:~# netstat -an | grep ESTAB
...
tcp 0 2760560 10.208.0.1:40500 10.208.0.45:42049 ESTABLISHED
...
Is this expected behaviour?
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 0:38 Rick Jones
@ 2008-04-23 2:17 ` John Heffner
2008-04-23 3:59 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-23 2:17 UTC (permalink / raw)
To: Rick Jones; +Cc: Linux Network Development list
On Tue, Apr 22, 2008 at 5:38 PM, Rick Jones <rick.jones2@hp.com> wrote:
> oslowest:~# netstat -an | grep ESTAB
> ...
> tcp 0 2760560 10.208.0.1:40500 10.208.0.45:42049 ESTABLISHED
> ...
>
> Is this expected behaviour?
What is your interface txqueuelen and mtu? If you have a very large
interface queue, TCP will happily fill it up unless you are using a
delay-based congestion controller.
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 2:17 ` John Heffner
@ 2008-04-23 3:59 ` David Miller
2008-04-23 16:32 ` Rick Jones
2008-04-24 22:21 ` Andi Kleen
0 siblings, 2 replies; 56+ messages in thread
From: David Miller @ 2008-04-23 3:59 UTC (permalink / raw)
To: johnwheffner; +Cc: rick.jones2, netdev
From: "John Heffner" <johnwheffner@gmail.com>
Date: Tue, 22 Apr 2008 19:17:39 -0700
> On Tue, Apr 22, 2008 at 5:38 PM, Rick Jones <rick.jones2@hp.com> wrote:
> > oslowest:~# netstat -an | grep ESTAB
> > ...
> > tcp 0 2760560 10.208.0.1:40500 10.208.0.45:42049 ESTABLISHED
> > ...
> >
> > Is this expected behaviour?
>
> What is your interface txqueuelen and mtu? If you have a very large
> interface queue, TCP will happily fill it up unless you are using a
> delay-based congestion controller.
Yes, that's the fundamental problem with loss based congestion
control. If there are any queues in the path, TCP will fill them up.
Vegas and other similar techniques are able to avoid this, but come
with the fundamental flaw that it's easy to get them into situations
where they do not respond to increases in pipe space adequately, and
thus underperform compared to loss based algorithms.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 3:59 ` David Miller
@ 2008-04-23 16:32 ` Rick Jones
2008-04-23 16:58 ` John Heffner
2008-04-24 22:21 ` Andi Kleen
1 sibling, 1 reply; 56+ messages in thread
From: Rick Jones @ 2008-04-23 16:32 UTC (permalink / raw)
To: David Miller; +Cc: johnwheffner, netdev
David Miller wrote:
> From: "John Heffner" <johnwheffner@gmail.com>
> Date: Tue, 22 Apr 2008 19:17:39 -0700
>
>
>>On Tue, Apr 22, 2008 at 5:38 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>
>>> oslowest:~# netstat -an | grep ESTAB
>>> ...
>>> tcp 0 2760560 10.208.0.1:40500 10.208.0.45:42049 ESTABLISHED
>>> ...
>>>
>>> Is this expected behaviour?
>>
>>What is your interface txqueuelen and mtu?
1000, MTU 1500 and TSO is enabled.
>> If you have a very large interface queue, TCP will happily fill it
>> up unless you are using a delay-based congestion controller.
At the same time, the sender will not try to send any more than the
receiver is willing to advertise as window.
> Yes, that's the fundamental problem with loss based congestion
> control. If there are any queues in the path, TCP will fill them up.
>
> Vegas and other similar techniques are able to avoid this, but come
> with the fundamental flaw that it's easy to get them into situations
> where they do not respond to increases in pipe space adequately, and
> thus underperform compared to loss based algorithms.
I can see that for the sending side being willing to send into the
receiver's ever increasing window, but is autotuning supposed to keep
growing and growing the receive window the way it seems to be?
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 16:32 ` Rick Jones
@ 2008-04-23 16:58 ` John Heffner
2008-04-23 17:24 ` Rick Jones
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-23 16:58 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, netdev
On Wed, Apr 23, 2008 at 9:32 AM, Rick Jones <rick.jones2@hp.com> wrote:
> I can see that for the sending side being willing to send into the
> receiver's ever increasing window, but is autotuning supposed to keep
> growing and growing the receive window the way it seems to be?
Receive-side autotuning by design will attempt to grow the rcvbuf
(adjusting for overhead) to twice the observed cwnd[1]. When the
sender keeps growing its window to fill up your interface queue, the
receiver will continue to grow its window to let the sender do what it
wants. It's not the receiver's job to do congestion control.
One interesting observation is that when timestamps are turned off the
receiver autotuning actually has the property that if the RTT is
growing (in this case due to a queue filling), it will not grow the
window since it's not able to update its RTT estimate. This property
was described as a feature of the Dynamic Right-Sizing algorithm
(http://public.lanl.gov/radiant/software/drs.html), and obviously in
some cases it is. However, in general it has the same types of
problems that delay-based congestion control has. And, it's not the
receiver's job to do congestion control. :-)
One thing you will note if you run many flows is that the aggregate
buffer space used should be much less than n*2MB, since each flow is
competing for the same queue space. This has good scaling properties.
-John
[1] It's not exactly accurate that it tries to set rcvbuf to 2*cwnd.
A subtle but important distinction is that it tries to set rcvbuf to
twice the data read by the application in any RTT.
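A much-simplified sketch of that footnote's idea (illustrative only: these
are not the actual kernel symbols, and the adjustment for skb/metadata
overhead is glossed over):
struct rcv_autotune {
        unsigned long copied_this_rtt;  /* bytes read by the app over the last RTT */
        unsigned long rcvbuf;           /* current receive buffer target */
        unsigned long rcvbuf_max;       /* cap, e.g. tcp_rmem[2] = 4194304 */
};
/* Called roughly once per RTT: grow rcvbuf toward twice what the
 * application actually consumed in that RTT, never shrinking it. */
static void rcv_autotune_on_rtt(struct rcv_autotune *at)
{
        unsigned long want = 2 * at->copied_this_rtt;
        if (want > at->rcvbuf)
                at->rcvbuf = want < at->rcvbuf_max ? want : at->rcvbuf_max;
        at->copied_this_rtt = 0;
}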
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 16:58 ` John Heffner
@ 2008-04-23 17:24 ` Rick Jones
2008-04-23 17:41 ` John Heffner
0 siblings, 1 reply; 56+ messages in thread
From: Rick Jones @ 2008-04-23 17:24 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, netdev
John Heffner wrote:
> On Wed, Apr 23, 2008 at 9:32 AM, Rick Jones <rick.jones2@hp.com> wrote:
>
>> I can see that for the sending side being willing to send into the
>>receiver's ever increasing window, but is autotuning supposed to keep
>>growing and growing the receive window the way it seems to be?
>
>
> Receive-side autotuning by design will attempt to grow the rcvbuf
> (adjusting for overhead) to twice the observed cwnd[1]. When the
> sender keeps growing its window to fill up your interface queue, the
> receiver will continue to grow its window to let the sender do what it
> wants. It's not the receiver's job to do congestion control.
Then why is it starting small and growing the advertised window?
> One interesting observation is that when timestamps are turned off the
> receiver autotuning actually has the property that if the RTT is
> growing (in this case due to a queue filling), it will not grow the
> window since it's not able to update its RTT estimate. This property
> was described as a feature of the Dynamic Right-Sizing algorithm
> (http://public.lanl.gov/radiant/software/drs.html), and obviously in
> some cases it is. However, in general it has the same types of
> problems that delay-based congestion control has. And, it's not the
> receiver's job to do congestion control. :-)
Then why is it starting small and growing the advertised window?-)
> One thing you will note if you run many flows is that the aggregate
> buffer space used should be much less than n*2MB, since each flow is
> competing for the same queue space. This has good scaling properties.
I'll see about getting another system on which to test (the one I got
2.6.25 onto yesterday is being pulled back for other things, grrr...)
and see what happens when I run multiple netperfs.
> [1] It's not exactly accurate that it tries to set rcvbuf to 2*cwnd.
> A subtle but important distinction is that it tries to set rcvbuf to
> twice the data read by the application in any RTT.
I guess that explains why it grew but didn't keep growing when the
sender's SO_SNDBUF was fixed.
What concerns me is that we tell folks to rely on autotuning because it
will set their socket buffers to the size they need, but it seems to be
setting their socket buffers to very much more than they need. Given
that netperf couldn't have been receiving any more data in an RTT if it
was always going at ~940 Mbit/s, I'm still perplexed, unless there is a
positive feedback loop here - allow a greater window, which allows more
data to be queued, which increases the RTT, which allows more data to be
received in an "RTT", which allows a greater window...
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 17:24 ` Rick Jones
@ 2008-04-23 17:41 ` John Heffner
2008-04-23 17:46 ` Rick Jones
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-23 17:41 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, netdev
On Wed, Apr 23, 2008 at 10:24 AM, Rick Jones <rick.jones2@hp.com> wrote:
> John Heffner wrote:
> > Receive-side autotuning by design will attempt to grow the rcvbuf
> > (adjusting for overhead) to twice the observed cwnd[1]. When the
> > sender keeps growing its window to fill up your interface queue, the
> > receiver will continue to grow its window to let the sender do what it
> > wants. It's not the receiver's job to do congestion control.
> >
>
> Then why is it starting small and growing the advertised window?
Because the receiver tracks the sender's behavior. (Also because of
the algorithm for figuring out the buffer overhead ratio, but that's a
different story.)
> What concerns me is that we tell folks to rely on autotuning because it
> will set their socket buffers to the size they need, but it seems to be
> setting their socket buffers to very much more than they need. Given that
> netperf couldn't have been receiving any more data in an RTT if it was
> always going at ~940 Mbit/s, I'm still perplexed, unless there is a positive
> feedback loop here - allow a greater window, which allows more data to be queued,
> which increases the RTT, which allows more data to be received in an "RTT",
> which allows a greater window...
It's true that careful hand-tuning can do better than autotuning in
some cases, due to the sub-optimality of congestion control. This is
one such case, where you have an overbuffered bottleneck. Another
case is where you have an underbuffered bottleneck, especially with a
large BDP. You can calculate exactly the window size you need to fill
the bottleneck and set your send or receive buffer to throttle you to
exactly that window, so that you will get 100% utilization. If you
let autotuning work, it will keep increasing your buffer as congestion
control asks for more until you overflow the buffer and take a loss.
Then, congestion control will have to back off, likely resulting in
under-utilization.
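For illustration, the hand-tuning described above might look like the sketch
below: the window is computed out of band and the socket buffers are pinned
to it before connecting. The numbers are made up, not taken from this thread:
#include <sys/socket.h>
#include <netinet/in.h>
/* Pin a TCP socket's buffers to a hand-computed window.  Explicitly setting
 * SO_SNDBUF/SO_RCVBUF locks the sizes and disables autotuning for this
 * socket; Linux doubles the requested value internally to allow for
 * bookkeeping overhead. */
int make_pinned_socket(int window_bytes)
{
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s >= 0) {
                setsockopt(s, SOL_SOCKET, SO_RCVBUF, &window_bytes,
                           sizeof(window_bytes));
                setsockopt(s, SOL_SOCKET, SO_SNDBUF, &window_bytes,
                           sizeof(window_bytes));
        }
        return s;
}
E.g. make_pinned_socket(1250000) for a hypothetical 100 Mbit/s bottleneck
with a 100 ms RTT (100e6 * 0.1 / 8 = 1.25 MB).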
In these cases, hand-tuning can do better because you have out-of-band
knowledge of the path. Also, smarter congestion control algorithms
can do better than traditional AIMD, but it's a very hard problem to
do this well in general. Autotuning's job is mainly just to grow the
buffer sizes to as much as congestion control needs to do its thing.
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 17:41 ` John Heffner
@ 2008-04-23 17:46 ` Rick Jones
0 siblings, 0 replies; 56+ messages in thread
From: Rick Jones @ 2008-04-23 17:46 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, netdev
> Autotuning's job is mainly just to grow the buffer sizes to as much
> as congestion control needs to do its thing.
It seems very, well, trusting in that regard.
rick jones
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
@ 2008-04-23 23:29 Jerry Chu
2008-04-24 16:32 ` John Heffner
2008-04-25 7:05 ` David Miller
0 siblings, 2 replies; 56+ messages in thread
From: Jerry Chu @ 2008-04-23 23:29 UTC (permalink / raw)
To: netdev
I've been seeing the same problem here and am trying to fix it.
My fix is to not count those pkts still in the host queue as "prior_in_flight"
when feeding the latter to tcp_cong_avoid(). This should cause the
tcp_is_cwnd_limited() test to fail when the previous in_flight build-up
is all due to the large host queue, and stop the cwnd from growing beyond
what's really necessary.
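A rough sketch of that idea (illustrative only; "host_queued_segs" is a
hypothetical count of segments still sitting in the local qdisc/driver
queue, not an existing kernel field):
/* Only segments that have actually left the host should count toward
 * in_flight for congestion avoidance; otherwise a big local queue keeps
 * tcp_is_cwnd_limited() returning true. */
static unsigned int effective_in_flight(unsigned int packets_out,
                                        unsigned int host_queued_segs)
{
        return packets_out > host_queued_segs ?
               packets_out - host_queued_segs : 0;
}
This value, rather than the raw in-flight count, would then be fed to
tcp_cong_avoid() as prior_in_flight.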
"sysctl_tcp_tso_win_divisor causes cwnd" also unnecessarily inflates
cwnd quite a bit when TSO is enabled.
Jerry
From: David Miller <davem@davemloft.net>
>
>
> Date: Tue, Apr 22, 2008 at 8:59 PM
>
> Subject: Re: Socket buffer sizes with autotuning
> To: johnwheffner@gmail.com
> Cc: rick.jones2@hp.com, netdev@vger.kernel.org
>
>
> From: "John Heffner" <johnwheffner@gmail.com>
> Date: Tue, 22 Apr 2008 19:17:39 -0700
>
>
>
> > On Tue, Apr 22, 2008 at 5:38 PM, Rick Jones <rick.jones2@hp.com> wrote:
> > > oslowest:~# netstat -an | grep ESTAB
> > > ...
> > > tcp 0 2760560 10.208.0.1:40500 10.208.0.45:42049 ESTABLISHED
> > > ...
> > >
> > > Is this expected behaviour?
> >
>
> > What is your interface txqueuelen and mtu? If you have a very large
> > interface queue, TCP will happily fill it up unless you are using a
> > delay-based congestion controller.
>
>
> Yes, that's the fundamental problem with loss based congestion
> control. If there are any queues in the path, TCP will fill them up.
>
> Vegas and other similar techniques are able to avoid this, but come
> with the fundamental flaw that it's easy to get them into situations
> where they do not respond to increases in pipe space adequately, and
> thus underperform compared to loss based algorithms.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 23:29 Socket buffer sizes with autotuning Jerry Chu
@ 2008-04-24 16:32 ` John Heffner
2008-04-25 0:49 ` Jerry Chu
2008-04-25 7:05 ` David Miller
1 sibling, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-24 16:32 UTC (permalink / raw)
To: Jerry Chu; +Cc: netdev
On Wed, Apr 23, 2008 at 4:29 PM, Jerry Chu <hkchu@google.com> wrote:
>
> I've been seeing the same problem here and am trying to fix it.
> My fix is to not count those pkts still in the host queue as "prior_in_flight"
> when feeding the latter to tcp_cong_avoid(). This should cause
> tcp_is_cwnd_limited() test to fail when the previous in_flight build-up
> is all due to the large host queue, and stop the cwnd from growing beyond
> what's really necessary.
Sounds like a useful optimization. Do you have a patch?
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 3:59 ` David Miller
2008-04-23 16:32 ` Rick Jones
@ 2008-04-24 22:21 ` Andi Kleen
2008-04-24 22:39 ` John Heffner
` (2 more replies)
1 sibling, 3 replies; 56+ messages in thread
From: Andi Kleen @ 2008-04-24 22:21 UTC (permalink / raw)
To: David Miller; +Cc: johnwheffner, rick.jones2, netdev
David Miller <davem@davemloft.net> writes:
>> What is your interface txqueuelen and mtu? If you have a very large
>> interface queue, TCP will happily fill it up unless you are using a
>> delay-based congestion controller.
>
> Yes, that's the fundamental problem with loss based congestion
> control. If there are any queues in the path, TCP will fill them up.
That just means Linux does too much queueing by default. Perhaps that
should be fixed. On Ethernet hardware the NIC TX queue should be
usually sufficient anyways I would guess. Do we really need the long
qdisc queue too?
-Andi
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-24 22:21 ` Andi Kleen
@ 2008-04-24 22:39 ` John Heffner
2008-04-25 1:28 ` David Miller
[not found] ` <65634d660804242234w66455bedve44801a98e3de9d9@mail.gmail.com>
2 siblings, 0 replies; 56+ messages in thread
From: John Heffner @ 2008-04-24 22:39 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Miller, rick.jones2, netdev
On Thu, Apr 24, 2008 at 3:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
> David Miller <davem@davemloft.net> writes:
>
> >> What is your interface txqueuelen and mtu? If you have a very large
> >> interface queue, TCP will happily fill it up unless you are using a
> >> delay-based congestion controller.
> >
> > Yes, that's the fundamental problem with loss based congestion
> > control. If there are any queues in the path, TCP will fill them up.
>
> That just means Linux does too much queueing by default. Perhaps that
> should be fixed. On Ethernet hardware the NIC TX queue should be
> usually sufficient anyways I would guess. Do we really need the long
> qdisc queue too?
The default on most ethernet devices used to be 100 packets. This is
pretty short when your flow's window size is thousands of packets.
It's especially a killer for large BDP flows because it causes
slow-start to cut out early and you have to ramp up slowly with
congestion avoidance. (This effect is mitigated somewhat by cubic,
etc, or even better by limited slow-start, but it's still very
significant.)
I have in the past used a hack that suppressed congestion notification
when overflowing the local interface queue. This is a very simple
approach, and works fairly well, but doesn't have the property of
converging toward fairness if multiple flows are competing for
bandwidth at the local interface. It might be possible to cook
something up that's smarter.
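A purely conceptual sketch of that decision (none of these names are real
kernel symbols):
enum drop_origin { DROP_LOCAL_QDISC, DROP_NETWORK };
/* The hack's core decision: a drop caused by our own interface queue
 * overflowing is not reported to congestion control as a loss, so cwnd is
 * not reduced for it; anything else is treated as real congestion. */
static int report_loss_to_cc(enum drop_origin origin)
{
        return origin != DROP_LOCAL_QDISC;
}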
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-24 16:32 ` John Heffner
@ 2008-04-25 0:49 ` Jerry Chu
2008-04-25 6:46 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-04-25 0:49 UTC (permalink / raw)
To: John Heffner; +Cc: netdev, rick.jones2, davem
On Thu, Apr 24, 2008 at 9:32 AM, John Heffner <johnwheffner@gmail.com> wrote:
>
> On Wed, Apr 23, 2008 at 4:29 PM, Jerry Chu <hkchu@google.com> wrote:
> >
> > I've been seeing the same problem here and am trying to fix it.
> > My fix is to not count those pkts still in the host queue as "prior_in_flight"
> > when feeding the latter to tcp_cong_avoid(). This should cause
> > tcp_is_cwnd_limited() test to fail when the previous in_flight build-up
> > is all due to the large host queue, and stop the cwnd from growing beyond
> > what's really necessary.
>
> Sounds like a useful optimization. Do you have a patch?
Am working on one, but I still need to completely root-cause the problem
first, and do a lot more testing. Like Rick Jones, I had for a while
thought that either the autotuning or the Congestion Window Validation
(RFC 2861) code should dampen the cwnd growth, so the bug must be there -
until last week, when I decided to get to the bottom of this problem.
One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the
sk_write_queue list as the heuristic to determine if a packet has hit the wire.
This seems a good solution for the normal cases without requiring changes
to the driver to notify TCP in the xmit completion path. But I can imagine there
may be cases where another below-IP consumer of the skb, e.g., tcpdump,
can nullify the above heuristic. If the below-IP consumer causes the skb
ref count to drop to 1 prematurely, the inflated cwnd problem comes back,
but it's no worse than before. What if the below-IP skb reader causes the
skb ref count to remain > 1 after the pkts have long since hit the wire?
This may cause the fix to prevent cwnd from growing when needed, hence
hurting performance.
Is there a better solution than checking against dataref to determine if a pkt
has hit the wire?
Also the code to determine when/how much to defer in the TSO path seems
too aggressive. It's currently based on a percentage
(sysctl_tcp_tso_win_divisor)
of min(snd_wnd, snd_cwnd). Won't that be too much when that value is
large? E.g., when I disable sysctl_tcp_tso_win_divisor, the cwnd of my
simple netperf run drops by exactly 1/3, from 1037 (segments) to 695. It
seems to me the TSO defer factor should be based on an absolute count,
e.g., 64KB.
Jerry
>
> -John
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-24 22:21 ` Andi Kleen
2008-04-24 22:39 ` John Heffner
@ 2008-04-25 1:28 ` David Miller
[not found] ` <65634d660804242234w66455bedve44801a98e3de9d9@mail.gmail.com>
2 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2008-04-25 1:28 UTC (permalink / raw)
To: andi; +Cc: johnwheffner, rick.jones2, netdev
From: Andi Kleen <andi@firstfloor.org>
Date: Fri, 25 Apr 2008 00:21:08 +0200
> That just means Linux does too much queueing by default. Perhaps that
> should be fixed. On Ethernet hardware the NIC TX queue should be
> usually sufficient anyways I would guess. Do we really need the long
> qdisc queue too?
It is definitely an area that deserves attention and a real
investigation, that's for sure.
I spent a lot of time looking at the behavior in this area
while writing the NIU 10G driver. As usual, I got distracted
before I could make any real experiments.
I'll try to get back to this if someone else doesn't beat
me to it. We also have the TX multiqueue stuff to move
forward, and it's deeply related to these issues.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
[not found] ` <65634d660804242234w66455bedve44801a98e3de9d9@mail.gmail.com>
@ 2008-04-25 6:36 ` David Miller
2008-04-25 7:42 ` Tom Herbert
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-04-25 6:36 UTC (permalink / raw)
To: therbert; +Cc: andi, johnwheffner, rick.jones2, netdev
From: "Tom Herbert" <therbert@google.com>
Date: Thu, 24 Apr 2008 22:34:50 -0700
> Queuing in the host as opposed to in the TX NIC might be beneficial to make
> work conserving queuing disciplines more effective, since once a packet is
> queued in the NIC the host can no longer perform any more scheduling on
> it. In fact, we have been working on a mechanism to dynamically limit
> queuing in the NIC (by number of bytes), so that more packets will be queued
> in host. This seems to give better results with some of the qdiscs (I can
> post a patch if there's interest in this).
This work is interesting, but doesn't it make more sense to limit by
number of packets instead of bytes?
No intermediate node that I know of drops bytes, they drop packets
instead :-)
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 0:49 ` Jerry Chu
@ 2008-04-25 6:46 ` David Miller
2008-04-25 21:29 ` Jerry Chu
2008-04-28 18:30 ` Jerry Chu
0 siblings, 2 replies; 56+ messages in thread
From: David Miller @ 2008-04-25 6:46 UTC (permalink / raw)
To: hkchu; +Cc: johnwheffner, netdev, rick.jones2
From: "Jerry Chu" <hkchu@google.com>
Date: Thu, 24 Apr 2008 17:49:33 -0700
> One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the
> sk_write_queue list as the heuristic to determine if a packet has hit the wire.
This doesn't work for the reasons that you mention in detail next :-)
> Is there a better solution than checking against dataref to determine if a pkt
> has hit the wire?
Unfortunately, no there isn't.
Part of the issue is that the driver is only working with a clone, but
if a packet gets resent before the driver gives up its reference,
we'll make a completely new copy.
But even assuming we could say that the driver gets a clone all the
time, the "sent" state would need to be in the shared data area.
> Also the code to determine when/how much to defer in the TSO path seems
> too aggressive. It's currently based on a percentage
> (sysctl_tcp_tso_win_divisor)
> of min(snd_wnd, snd_cwnd). Would it be too much if the value is large? E.g.,
> when I disable sysctl_tcp_tso_win_divisor, the cwnd of my simple netperf run
> drops exactly 1/3 from 1037 (segments) to 695. It seems to me the TSO
> defer factor should be based on an absolute count, e.g., 64KB.
This is one of the most difficult knobs to get right in the TSO code.
If the percentage is too low, you'll notice that cpu utilization
increases because you aren't accumulating enough data to send down the
largest possible TSO frames.
But yes you are absolutely right that we should have a hard limit
of 64K here, since we can't build a larger TSO frame anyways.
In fact I thought we had something like that here already :-/
Wait, in fact we do, it's just hidden behind a variable now:
/* If a full-sized TSO skb can be sent, do it. */
if (limit >= sk->sk_gso_max_size)
goto send_now;
:-)
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-23 23:29 Socket buffer sizes with autotuning Jerry Chu
2008-04-24 16:32 ` John Heffner
@ 2008-04-25 7:05 ` David Miller
2008-05-07 3:57 ` Jerry Chu
1 sibling, 1 reply; 56+ messages in thread
From: David Miller @ 2008-04-25 7:05 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 23 Apr 2008 16:29:58 -0700
> I've been seeing the same problem here and am trying to fix it.
> My fix is to not count those pkts still in the host queue as "prior_in_flight"
> when feeding the latter to tcp_cong_avoid(). This should cause
> tcp_is_cwnd_limited() test to fail when the previous in_flight build-up
> is all due to the large host queue, and stop the cwnd from growing beyond
> what's really necessary.
Does something like the following suit your needs?
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 299ec4b..6cdf4be 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -140,6 +140,7 @@ struct skb_frag_struct {
*/
struct skb_shared_info {
atomic_t dataref;
+ atomic_t *in_flight;
unsigned short nr_frags;
unsigned short gso_size;
/* Warning: this field is not always filled in (UFO)! */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d96d9b1..62bb58d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -271,6 +271,8 @@ struct tcp_sock {
u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
+ atomic_t host_inflight; /* packets queued in transmit path */
+
/* Data for direct copy to user */
struct {
struct sk_buff_head prequeue;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4fe605f..a6880c2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -212,6 +212,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
/* make sure we initialize shinfo sequentially */
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
+ shinfo->in_flight = NULL;
shinfo->nr_frags = 0;
shinfo->gso_size = 0;
shinfo->gso_segs = 0;
@@ -403,6 +404,8 @@ static void skb_release_all(struct sk_buff *skb)
void __kfree_skb(struct sk_buff *skb)
{
skb_release_all(skb);
+ if (skb_shinfo(skb)->in_flight)
+ atomic_dec(skb_shinfo(skb)->in_flight);
kfree_skbmem(skb);
}
@@ -486,6 +489,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
atomic_set(&n->users, 1);
atomic_inc(&(skb_shinfo(skb)->dataref));
+ if (skb_shinfo(skb)->in_flight)
+ atomic_inc(skb_shinfo(skb)->in_flight);
skb->cloned = 1;
return n;
@@ -743,6 +748,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
skb->hdr_len = 0;
skb->nohdr = 0;
atomic_set(&skb_shinfo(skb)->dataref, 1);
+ skb_shinfo(skb)->in_flight = NULL;
return 0;
nodata:
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f886531..28a71fd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -479,6 +479,7 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+ skb_shinfo(skb)->in_flight = &tp->host_inflight;
skb->csum = 0;
tcb->seq = tcb->end_seq = tp->write_seq;
tcb->flags = TCPCB_FLAG_ACK;
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
[not found] <d1c2719f0804241829s1bc3f41ejf7ebbff73ed96578@mail.gmail.com>
@ 2008-04-25 7:06 ` Andi Kleen
2008-04-25 7:28 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-04-25 7:06 UTC (permalink / raw)
To: Jerry Chu, davem, johnwheffner, rick.jones2, netdev
[fixed cc and subject]
Jerry Chu wrote:
> On Thu, Apr 24, 2008 at 3:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> David Miller <davem@davemloft.net> writes:
>>
>>>> What is your interface txqueuelen and mtu? If you have a very large
>>>> interface queue, TCP will happily fill it up unless you are using a
>>>> delay-based congestion controller.
>>> Yes, that's the fundamental problem with loss based congestion
>>> control. If there are any queues in the path, TCP will fill them up.
>> That just means Linux does too much queueing by default. Perhaps that
>> should be fixed. On Ethernet hardware the NIC TX queue should be
>> usually sufficient anyways I would guess. Do we really need the long
>> qdisc queue too?
>
> I think we really need the large xmit queue, especially when the CPU speed,
> or the aggregated CPU bandwidth in the case of multi-cores, is >> NIC speed
> for the following reason:
>
> If the qdisc and/or NIC queue is not large enough, it may not absorb the high
> burst rate from the much faster CPU xmit threads, hence causing pkts to
> be dropped before they hit the wire.
sendmsg should just be a little smarter about when to block, depending on
the state of the interface. There is already some minor code for that,
as you'll have noted. Then the bursts would be much less of a problem.
We already had this discussion recently together with better behaviour
on bounding.
The only big problem then would be if there are more submitting threads
than packets in the TX queue, but I would consider that unlikely for
GB+ NICs at least (might be an issue for older designs with smaller queues)
> Here the CPU/NIC relation is much like
> a router
It doesn't need to be. Unlike a true network it is very cheap here
to do direct feedback.
> Removing the unnecessary cwnd growth by counting out those pkts that are
> still stuck in the host queue may be a simpler solution. I'll find out
> how well it
> works soon.
I think that's a great start, but probably not enough.
-Andi
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 7:06 ` Andi Kleen
@ 2008-04-25 7:28 ` David Miller
2008-04-25 7:48 ` Andi Kleen
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-04-25 7:28 UTC (permalink / raw)
To: andi; +Cc: hkchu, johnwheffner, rick.jones2, netdev
From: Andi Kleen <andi@firstfloor.org>
Date: Fri, 25 Apr 2008 09:06:48 +0200
> The only big problem then would be if there are more submitting threads
> than packets in the TX queue, but I would consider that unlikely for
> GB+ NICs at least (might be an issue for older designs with smaller queues)
It's probably exactly what happens in those chat server benchmarks.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 6:36 ` David Miller
@ 2008-04-25 7:42 ` Tom Herbert
2008-04-25 7:46 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Tom Herbert @ 2008-04-25 7:42 UTC (permalink / raw)
To: David Miller; +Cc: andi, johnwheffner, rick.jones2, netdev
On Thu, Apr 24, 2008 at 11:36 PM, David Miller <davem@davemloft.net> wrote:
> From: "Tom Herbert" <therbert@google.com>
> Date: Thu, 24 Apr 2008 22:34:50 -0700
>
>
> > Queuing in the host as opposed to in the TX NIC might be beneficial to make
> > work conserving queuing disciplines more effective, since once a packet is
> > queued in the NIC the host can no longer perform any more scheduling on
> > it. In fact, we have been working on a mechanism to dynamically limit
> > queuing in the NIC (by number of bytes), so that more packets will be queued
> > in host. This seems to give better results with some of the qdiscs (I can
> > post a patch if there's interest in this).
>
> This work is interesting, but doesn't it make more sense to limit by
> number of packets instead of bytes?
>
If the NIC is interrupt driven, enough data must be queued to span
consecutive interrupts. So for instance, if a 10G NIC is generating an
interrupt every 100us, about 125,000 bytes need to be queued at each
interrupt to prevent starvation -- this doesn't translate to a fixed
number of packets. Limiting by packets seems somewhat ad hoc; if the
limit is too small, the link will be starved; too big, and
over-queuing results.
> No intermediate node that I know of drops bytes, they drop packets
> instead :-)
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 7:42 ` Tom Herbert
@ 2008-04-25 7:46 ` David Miller
2008-04-28 17:51 ` Tom Herbert
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-04-25 7:46 UTC (permalink / raw)
To: therbert; +Cc: andi, johnwheffner, rick.jones2, netdev
From: "Tom Herbert" <therbert@google.com>
Date: Fri, 25 Apr 2008 00:42:03 -0700
> If NIC is interrupt driven, enough data must be queued to span
> consecutive interrupts. So for instance if a 10G NIC is generating an
> interrupt every 100us, about 125,000 bytes needs to queued at each
> interrupt to prevent starvation-- this doesn't translate to a fixed
> number of packets. Limiting by packets seems somewhat ad hoc; if the
> limit is to small and the link will be starved, too big and
> over-queuing results.
Yes, however when packets are small the limiting factor becomes
per-transfer transaction related overhead.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 7:28 ` David Miller
@ 2008-04-25 7:48 ` Andi Kleen
0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-04-25 7:48 UTC (permalink / raw)
To: David Miller; +Cc: hkchu, johnwheffner, rick.jones2, netdev
David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Fri, 25 Apr 2008 09:06:48 +0200
>
>> The only big problem then would be if there are more submitting threads
>> than packets in the TX queue, but I would consider that unlikely for
>> GB+ NICs at least (might be an issue for older designs with smaller queues)
>
> It's probably exactly what happens in those chat server benchmarks.
In that case we might be better off just blocking some of these threads
until the TX queue cleared? What good is it to send faster than the
network allows?
-Andi
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 6:46 ` David Miller
@ 2008-04-25 21:29 ` Jerry Chu
2008-04-25 21:35 ` David Miller
2008-04-28 18:30 ` Jerry Chu
1 sibling, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-04-25 21:29 UTC (permalink / raw)
To: David Miller; +Cc: johnwheffner, netdev, rick.jones2
On Thu, Apr 24, 2008 at 11:46 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Thu, 24 Apr 2008 17:49:33 -0700
>
>
> > One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the
> > sk_write_queue list as the heuristic to determine if a packet has hit the wire.
>
> This doesn't work for the reasons that you mention in detail next :-)
>
>
> > Is there a better solution than checking against dataref to determine if a pkt
> > has hit the wire?
>
> Unfortunately, no there isn't.
>
> Part of the issue is that the driver is only working with a clone, but
> if a packet gets resent before the driver gives up its reference,
> we'll make a completely new copy.
I think we can ignore this case if it happens rarely.
>
> But even assuming we could say that the driver gets a clone all the
> time, the "sent" state would need to be in the shared data area.
Ok.
>
>
> > Also the code to determine when/how much to defer in the TSO path seems
> > too aggressive. It's currently based on a percentage
> > (sysctl_tcp_tso_win_divisor)
> > of min(snd_wnd, snd_cwnd). Would it be too much if the value is large? E.g.,
> > when I disable sysctl_tcp_tso_win_divisor, the cwnd of my simple netperf run
> > drops exactly 1/3 from 1037 (segments) to 695. It seems to me the TSO
> > defer factor should be based on an absolute count, e.g., 64KB.
>
> This is one of the most difficult knobs to get right in the TSO code.
>
> If the percentage is too low, you'll notice that cpu utilization
> increases because you aren't accumulating enough data to send down the
> largest possible TSO frames.
Well, there is a fine line to walk between CPU efficiency and traffic
burstiness. The TSO defer code causes a few hundred KB of bursts that
quickly blow away our small switch buffers. The matter may get even
worse for 10GE.
>
> But yes you are absolutely right that we should have a hard limit
> of 64K here, since we can't build a larger TSO frame anyways.
>
> In fact I thought we had something like that here already :-/
>
> Wait, in fact we do, it's just hidden behind a variable now:
>
> /* If a full-sized TSO skb can be sent, do it. */
> if (limit >= sk->sk_gso_max_size)
> goto send_now;
Oh, just realized I've been working on a very "old" (2.6.18 :-)
version of the kernel.
Will get the latest 2.6.25 and take a look. I can't find the
"skb_release_all()" function you pointed at in a later mail either. Guess
the Linux kernel code is rewritten every few months :-(.
Jerry
>
> :-)
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 21:29 ` Jerry Chu
@ 2008-04-25 21:35 ` David Miller
0 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2008-04-25 21:35 UTC (permalink / raw)
To: hkchu; +Cc: johnwheffner, netdev, rick.jones2
From: "Jerry Chu" <hkchu@google.com>
Date: Fri, 25 Apr 2008 14:29:25 -0700
> Will get the latest 2.6.25 and take a look. I can't find "skb_release_all()"
> function you pointed in a later mail either. Guess the Linux kernel
> code is rewritten every few month :-(.
You want bugs fixed, like the one you reported here, don't you?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 7:46 ` David Miller
@ 2008-04-28 17:51 ` Tom Herbert
0 siblings, 0 replies; 56+ messages in thread
From: Tom Herbert @ 2008-04-28 17:51 UTC (permalink / raw)
To: David Miller; +Cc: andi, johnwheffner, rick.jones2, netdev
> > If NIC is interrupt driven, enough data must be queued to span
> > consecutive interrupts. So for instance if a 10G NIC is generating an
> > interrupt every 100us, about 125,000 bytes needs to queued at each
> > interrupt to prevent starvation-- this doesn't translate to a fixed
> > number of packets. Limiting by packets seems somewhat ad hoc; if the
> > limit is to small and the link will be starved, too big and
> > over-queuing results.
>
> Yes, however when packets are small the limiting factor becomes
> per-transfer transaction related overhead.
>
The gist of our algorithm is to determine when the device has been
starved (i.e., when all packets queued to the device have been sent, and
there are packets ready in host queues). When the device is starved,
the byte limit is increased by some amount; if the device is not
starved, the byte limit is decreased by an amount, subject to some
mechanisms to prevent hysteresis. If throughput is less than line
rate, because of small packets for example, the byte limit will
decrease accordingly, since fewer packets need to be queued to
prevent starvation. This algorithm requires no a priori knowledge of
link speed or interrupt rate; it will adjust the byte queue limit to
track such variable parameters.
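A rough sketch of that adjustment loop (illustrative only; the step sizes
and the hysteresis handling here are made up, not taken from the actual
patch):
struct byte_limit {
        unsigned int limit;     /* max bytes allowed to sit in the NIC queue */
};
/* "device_starved" means everything queued to the NIC was sent while
 * packets were still waiting in host queues. */
static void byte_limit_adjust(struct byte_limit *bl, int device_starved,
                              unsigned int step)
{
        if (device_starved)
                bl->limit += step;      /* allow more queuing next time */
        else if (bl->limit > step)
                bl->limit -= step / 2;  /* back off more gently than we grow,
                                         * to damp oscillation */
}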
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 6:46 ` David Miller
2008-04-25 21:29 ` Jerry Chu
@ 2008-04-28 18:30 ` Jerry Chu
2008-04-28 19:21 ` John Heffner
1 sibling, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-04-28 18:30 UTC (permalink / raw)
To: David Miller; +Cc: johnwheffner, netdev, rick.jones2
On Thu, Apr 24, 2008 at 11:46 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Thu, 24 Apr 2008 17:49:33 -0700
>
>
> > One question: I currently use skb_shinfo(skb)->dataref == 1 for skb's on the
> > sk_write_queue list as the heuristic to determine if a packet has hit the wire.
>
> This doesn't work for the reasons that you mention in detail next :-)
>
>
> > Is there a better solution than checking against dataref to determine if a pkt
> > has hit the wire?
>
> Unfortunately, no there isn't.
>
> Part of the issue is that the driver is only working with a clone, but
> > if a packet gets resent before the driver gives up its reference,
> we'll make a completely new copy.
>
> But even assuming we could say that the driver gets a clone all the
> time, the "sent" state would need to be in the shared data area.
>
>
> > Also the code to determine when/how much to defer in the TSO path seems
> > too aggressive. It's currently based on a percentage
> > (sysctl_tcp_tso_win_divisor)
> > of min(snd_wnd, snd_cwnd). Would it be too much if the value is large? E.g.,
> > when I disable sysctl_tcp_tso_win_divisor, the cwnd of my simple netperf run
> > drops exactly 1/3 from 1037 (segments) to 695. It seems to me the TSO
> > defer factor should be based on an absolute count, e.g., 64KB.
>
> This is one of the most difficult knobs to get right in the TSO code.
>
> If the percentage is too low, you'll notice that cpu utilization
> increases because you aren't accumulating enough data to send down the
> largest possible TSO frames.
>
> But yes you are absolutely right that we should have a hard limit
> of 64K here, since we can't build a larger TSO frame anyways.
>
> In fact I thought we had something like that here already :-/
>
> Wait, in fact we do, it's just hidden behind a variable now:
>
> /* If a full-sized TSO skb can be sent, do it. */
> if (limit >= sk->sk_gso_max_size)
> goto send_now;
>
> :-)
Correct, but its counterpart doesn't exist in tcp_is_cwnd_limited(). So
cwnd will continue to grow when left < cwnd/sysctl_tcp_tso_win_divisor,
which can be very large when cwnd is large.
If I change tcp_tso_win_divisor to 0, cwnd maxes out at 695 rather than 1037,
exactly off by 1/3. I tried to add the same check to tcp_is_cwnd_limited():
diff -c /tmp/tcp.h.old tcp.h
*** /tmp/tcp.h.old Mon Apr 28 11:00:44 2008
--- tcp.h Mon Apr 28 10:54:10 2008
***************
*** 828,833 ****
--- 828,835 ----
return 0;
left = tp->snd_cwnd - in_flight;
+ if (left >= 65536)
+ return 0;
if (sysctl_tcp_tso_win_divisor)
return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
else
>
But it doesn't seem to help (cwnd still grows to 1037).
Jerry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-28 18:30 ` Jerry Chu
@ 2008-04-28 19:21 ` John Heffner
2008-04-28 20:44 ` Jerry Chu
[not found] ` <d1c2719f0804281338j3984cf2bga31def0c2c1192a1@mail.gmail.com>
0 siblings, 2 replies; 56+ messages in thread
From: John Heffner @ 2008-04-28 19:21 UTC (permalink / raw)
To: Jerry Chu; +Cc: David Miller, netdev, rick.jones2
[-- Attachment #1: Type: text/plain, Size: 1036 bytes --]
On Mon, Apr 28, 2008 at 11:30 AM, Jerry Chu <hkchu@google.com> wrote:
> Correct, but its counterpart doesn't exist in tcp_is_cwnd_limited(). So
> cwnd will continue to grow when left < cwnd/sysctl_tcp_tso_win_divisor,
> which can be very large when cwnd is large.
>
> If I change tcp_tso_win_divisor to 0, cwnd max out at 695 rather than 1037,
> exactly off by 1/3. I tried to add the same check to tcp_is_cwnd_limited():
>
> diff -c /tmp/tcp.h.old tcp.h
> *** /tmp/tcp.h.old Mon Apr 28 11:00:44 2008
> --- tcp.h Mon Apr 28 10:54:10 2008
> ***************
> *** 828,833 ****
> --- 828,835 ----
> return 0;
>
> left = tp->snd_cwnd - in_flight;
> + if (left >= 65536)
> + return 0;
> if (sysctl_tcp_tso_win_divisor)
> return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
> else
>
> >
>
> But it doesn't seem to help (cwnd still grows to 1037).
Cwnd is in packets, not bytes.
Try this series of patches, against net-next.
-John
[-- Attachment #2: 0001-NET-Allow-send-limited-cwnd-to-grow-up-to-max_bur.patch --]
[-- Type: text/x-diff; name=0001-NET-Allow-send-limited-cwnd-to-grow-up-to-max_bur.patch, Size: 1231 bytes --]
From 57039cfc486ae4deac3b1fecaa238b5483c034df Mon Sep 17 00:00:00 2001
From: John Heffner <johnwheffner@gmail.com>
Date: Mon, 28 Apr 2008 12:17:49 -0700
Subject: [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled
This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow
up to tcp_max_burst() even when sk_can_gso() is false, or when
sysctl_tcp_tso_win_divisor != 0.
Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
net/ipv4/tcp_cong.c | 11 ++++-------
1 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 3a6be23..bfb1996 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -285,14 +285,11 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
if (in_flight >= tp->snd_cwnd)
return 1;
- if (!sk_can_gso(sk))
- return 0;
-
left = tp->snd_cwnd - in_flight;
- if (sysctl_tcp_tso_win_divisor)
- return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
- else
- return left <= tcp_max_burst(tp);
+ if (sk_can_gso(sk) &&
+ left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+ return 1;
+ return left <= tcp_max_burst(tp);
}
EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
--
1.5.2.5
[-- Attachment #3: 0002-NET-Limit-cwnd-growth-when-deferring-for-GSO.patch --]
[-- Type: text/x-diff; name=0002-NET-Limit-cwnd-growth-when-deferring-for-GSO.patch, Size: 992 bytes --]
From fc1d9fe99748f3811a0a5b9d93463cf0e4788f56 Mon Sep 17 00:00:00 2001
From: John Heffner <johnwheffner@gmail.com>
Date: Mon, 28 Apr 2008 12:19:22 -0700
Subject: [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO
This fixes inappropriately large cwnd growth on sender-limited flows
when GSO is enabled, limiting cwnd growth to 64k.
Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
net/ipv4/tcp_cong.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index bfb1996..6a25082 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -287,7 +287,8 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
left = tp->snd_cwnd - in_flight;
if (sk_can_gso(sk) &&
- left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+ left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
+ left * tp->mss_cache < sk->sk_gso_max_size)
return 1;
return left <= tcp_max_burst(tp);
}
--
1.5.2.5
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-28 19:21 ` John Heffner
@ 2008-04-28 20:44 ` Jerry Chu
2008-04-28 23:22 ` [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled John Heffner
[not found] ` <d1c2719f0804281338j3984cf2bga31def0c2c1192a1@mail.gmail.com>
1 sibling, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-04-28 20:44 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, netdev, rick.jones2
On Mon, Apr 28, 2008 at 12:21 PM, John Heffner <johnwheffner@gmail.com> wrote:
>
> On Mon, Apr 28, 2008 at 11:30 AM, Jerry Chu <hkchu@google.com> wrote:
> > Correct, but its counterpart doesn't exist in tcp_is_cwnd_limited(). So
> > cwnd will continue to grow when left < cwnd/sysctl_tcp_tso_win_divisor,
> > which can be very large when cwnd is large.
> >
> > If I change tcp_tso_win_divisor to 0, cwnd max out at 695 rather than 1037,
> > exactly off by 1/3. I tried to add the same check to tcp_is_cwnd_limited():
> >
> > diff -c /tmp/tcp.h.old tcp.h
> > *** /tmp/tcp.h.old Mon Apr 28 11:00:44 2008
> > --- tcp.h Mon Apr 28 10:54:10 2008
> > ***************
> > *** 828,833 ****
> > --- 828,835 ----
> > return 0;
> >
> > left = tp->snd_cwnd - in_flight;
> > + if (left >= 65536)
> > + return 0;
> > if (sysctl_tcp_tso_win_divisor)
> > return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
> > else
> >
> > >
> >
> > But it doesn't seem to help (cwnd still grows to 1037).
>
> Cwnd is in packets, not bytes.
Yes, of course.
>
> Try this series of patches, against net-next.
Ah, you already know about this problem. Yes it does the trick.
Jerry
>
> -John
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled
2008-04-28 20:44 ` Jerry Chu
@ 2008-04-28 23:22 ` John Heffner
2008-04-28 23:22 ` [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO John Heffner
0 siblings, 1 reply; 56+ messages in thread
From: John Heffner @ 2008-04-28 23:22 UTC (permalink / raw)
To: davem; +Cc: netdev, John Heffner
This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow
up to tcp_max_burst() even when sk_can_gso() is false, or when
sysctl_tcp_tso_win_divisor != 0.
Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
net/ipv4/tcp_cong.c | 11 ++++-------
1 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 3a6be23..bfb1996 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -285,14 +285,11 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
if (in_flight >= tp->snd_cwnd)
return 1;
- if (!sk_can_gso(sk))
- return 0;
-
left = tp->snd_cwnd - in_flight;
- if (sysctl_tcp_tso_win_divisor)
- return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
- else
- return left <= tcp_max_burst(tp);
+ if (sk_can_gso(sk) &&
+ left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+ return 1;
+ return left <= tcp_max_burst(tp);
}
EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
--
1.5.2.5
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO
2008-04-28 23:22 ` [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled John Heffner
@ 2008-04-28 23:22 ` John Heffner
0 siblings, 0 replies; 56+ messages in thread
From: John Heffner @ 2008-04-28 23:22 UTC (permalink / raw)
To: davem; +Cc: netdev, John Heffner
This fixes inappropriately large cwnd growth on sender-limited flows
when GSO is enabled, limiting cwnd growth to 64k.
Signed-off-by: John Heffner <johnwheffner@gmail.com>
---
net/ipv4/tcp_cong.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index bfb1996..6a25082 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -287,7 +287,8 @@ int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
left = tp->snd_cwnd - in_flight;
if (sk_can_gso(sk) &&
- left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd)
+ left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
+ left * tp->mss_cache < sk->sk_gso_max_size)
return 1;
return left <= tcp_max_burst(tp);
}
--
1.5.2.5
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
[not found] ` <d1c2719f0804281338j3984cf2bga31def0c2c1192a1@mail.gmail.com>
@ 2008-04-28 23:28 ` John Heffner
2008-04-28 23:35 ` David Miller
2008-04-29 2:20 ` Jerry Chu
0 siblings, 2 replies; 56+ messages in thread
From: John Heffner @ 2008-04-28 23:28 UTC (permalink / raw)
To: Jerry Chu; +Cc: David Miller, netdev, rick.jones2
On Mon, Apr 28, 2008 at 1:38 PM, Jerry Chu <hkchu@google.com> wrote:
> > Try this series of patches, against net-next.
>
> Ah, you already know about this problem. Yes it does the trick.
I had not actually known about this. (Just cooked up the patches
after I saw your mail.) This looks to me like a clear bug. After
looking carefully at the code there, it seems there was another small
problem, too, though not likely to have much effect.
Dave, can you apply the patches I just sent?
Thanks,
-John
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-28 23:28 ` Socket buffer sizes with autotuning John Heffner
@ 2008-04-28 23:35 ` David Miller
2008-04-29 2:20 ` Jerry Chu
1 sibling, 0 replies; 56+ messages in thread
From: David Miller @ 2008-04-28 23:35 UTC (permalink / raw)
To: johnwheffner; +Cc: hkchu, netdev, rick.jones2
From: "John Heffner" <johnwheffner@gmail.com>
Date: Mon, 28 Apr 2008 16:28:16 -0700
> Dave, can you apply the patches I just sent?
Sure thing.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-28 23:28 ` Socket buffer sizes with autotuning John Heffner
2008-04-28 23:35 ` David Miller
@ 2008-04-29 2:20 ` Jerry Chu
1 sibling, 0 replies; 56+ messages in thread
From: Jerry Chu @ 2008-04-29 2:20 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, netdev, rick.jones2
Just for the record, in my netperf RR (1MB request/20B reply) over
1GE/tg3 test on top of 2.6.18, changing tcp_tso_win_divisor to 0
from the default of 3 gets the cwnd down to 695 from 1037. Applying
the bound check (left * tp->mss_cache < 65536) will get cwnd down
to 737, not 695. Still a big improvement, though. (Anyway this may be
moot after I fix the real culprit.)
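For reference, the bound check I applied on 2.6.18 was along these lines
(a sketch only; the surrounding code in my 2.6.18 tree differs a little
from the net-next patch above):

	left = tp->snd_cwnd - in_flight;
	if (left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
	    left * tp->mss_cache < 65536)
		return 1;
	return left <= tcp_max_burst(tp);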
Jerry
On Mon, Apr 28, 2008 at 4:28 PM, John Heffner <johnwheffner@gmail.com> wrote:
>
> On Mon, Apr 28, 2008 at 1:38 PM, Jerry Chu <hkchu@google.com> wrote:
> > > Try this series of patches, against net-next.
> >
> > Ah, you already know about this problem. Yes it does the trick.
>
> I had not actually known about this. (Just cooked up the patches
> after I saw your mail.) This looks to me like a clear bug. After
> looking carefully at the code there, it seems there was another small
> problem, too, though not likely to have much effect.
>
> Dave, can you apply the patches I just sent?
>
> Thanks,
> -John
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-04-25 7:05 ` David Miller
@ 2008-05-07 3:57 ` Jerry Chu
2008-05-07 4:27 ` David Miller
2008-05-07 4:28 ` David Miller
0 siblings, 2 replies; 56+ messages in thread
From: Jerry Chu @ 2008-05-07 3:57 UTC (permalink / raw)
To: David Miller; +Cc: netdev
I fail to see how adding shinfo->in_flight to count how many outstanding
clones there are can help account for how many "host_inflight" pkts there
are. Part of the problem, as you've mentioned before, is that the driver
may not always get a clone. It may be getting a copy (e.g., when GSO is
on?), hence losing all its connection to the original tp and any chance
to have the pkt properly accounted for as host_inflight by TCP. The skb
may also be cloned more than once (e.g., due to tcpdump)...
That said, I also fail to come up with a more bullet-proof solution after
studying much of the TSO/GSO code, short of requiring driver and more skb
changes. So I'm currently leaning toward my original fix of checking
if (1 == (atomic_read(&skb_shinfo(skb1)->dataref) & SKB_DATAREF_MASK))
My current prototype scans either sk_send_head or sk_write_queue backwards
until the above condition is true. I'm thinking about adding and
maintaining a new "tp->host_queue_head" field to avoid most of the
scanning. Also, it seems much less costly to add a new field to tcp_sock
than to skb/skb_shared_info. If you have a better idea please let me know.
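For concreteness, here is a rough sketch of the counting (shown as a
plain forward walk over the write queue rather than the backwards scan,
and with a helper name of my own; it inherits the weaknesses of the
dataref==1 heuristic above):

/* Count the transmitted-but-unacked pkts whose data is still referenced
 * outside TCP, i.e. presumed to still be sitting in the local transmit
 * path (qdisc, driver ring, ...).
 */
static u32 tcp_count_host_inflight(struct sock *sk)
{
	struct sk_buff *skb;
	u32 host_inflight = 0;

	skb_queue_walk(&sk->sk_write_queue, skb) {
		if (skb == tcp_send_head(sk))
			break;	/* not transmitted yet */
		if ((atomic_read(&skb_shinfo(skb)->dataref) &
		     SKB_DATAREF_MASK) != 1)
			host_inflight += tcp_skb_pcount(skb);
	}
	return host_inflight;
}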
Jerry
On Fri, Apr 25, 2008 at 12:05 AM, David Miller <davem@davemloft.net> wrote:
>
> From: "Jerry Chu" <hkchu@google.com>
> Date: Wed, 23 Apr 2008 16:29:58 -0700
>
>
> > I've been seeing the same problem here and am trying to fix it.
> > My fix is to not count those pkts still in the host queue as "prior_in_flight"
> > when feeding the latter to tcp_cong_avoid(). This should cause
> > tcp_is_cwnd_limited() test to fail when the previous in_flight build-up
> > is all due to the large host queue, and stop the cwnd from growing beyond
> > what's really necessary.
>
> Does something like the following suit your needs?
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 299ec4b..6cdf4be 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -140,6 +140,7 @@ struct skb_frag_struct {
> */
> struct skb_shared_info {
> atomic_t dataref;
> + atomic_t *in_flight;
> unsigned short nr_frags;
> unsigned short gso_size;
> /* Warning: this field is not always filled in (UFO)! */
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d96d9b1..62bb58d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -271,6 +271,8 @@ struct tcp_sock {
> u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
> u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
>
> + atomic_t host_inflight; /* packets queued in transmit path */
> +
> /* Data for direct copy to user */
> struct {
> struct sk_buff_head prequeue;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4fe605f..a6880c2 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -212,6 +212,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> /* make sure we initialize shinfo sequentially */
> shinfo = skb_shinfo(skb);
> atomic_set(&shinfo->dataref, 1);
> + shinfo->in_flight = NULL;
> shinfo->nr_frags = 0;
> shinfo->gso_size = 0;
> shinfo->gso_segs = 0;
> @@ -403,6 +404,8 @@ static void skb_release_all(struct sk_buff *skb)
> void __kfree_skb(struct sk_buff *skb)
> {
> skb_release_all(skb);
> + if (skb_shinfo(skb)->in_flight)
> + atomic_dec(skb_shinfo(skb)->in_flight);
> kfree_skbmem(skb);
> }
>
> @@ -486,6 +489,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
> atomic_set(&n->users, 1);
>
> atomic_inc(&(skb_shinfo(skb)->dataref));
> + if (skb_shinfo(skb)->in_flight)
> + atomic_inc(skb_shinfo(skb)->in_flight);
> skb->cloned = 1;
>
> return n;
> @@ -743,6 +748,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> skb->hdr_len = 0;
> skb->nohdr = 0;
> atomic_set(&skb_shinfo(skb)->dataref, 1);
> + skb_shinfo(skb)->in_flight = NULL;
> return 0;
>
> nodata:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index f886531..28a71fd 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -479,6 +479,7 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
> struct tcp_sock *tp = tcp_sk(sk);
> struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
>
> + skb_shinfo(skb)->in_flight = &tp->host_inflight;
> skb->csum = 0;
> tcb->seq = tcb->end_seq = tp->write_seq;
> tcb->flags = TCPCB_FLAG_ACK;
>
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 3:57 ` Jerry Chu
@ 2008-05-07 4:27 ` David Miller
2008-05-07 18:36 ` Jerry Chu
2008-05-07 4:28 ` David Miller
1 sibling, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-07 4:27 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Tue, 6 May 2008 20:57:46 -0700
> I fail to see how adding shinfo->in_flight to count how many
> outstanding clones are there can help accounting for how many
> "host_inflight" pkts. Part of the problems, as you've mentioned
> before, is that the driver may not always get a clone.
Sure, but it will get one 99.9999% of the time.
With TCP, as long as that clone is alive the driver has it. And the
counter only counts the clones.
Anyways, did you even test my patch and try to use it for your needs
or is this analysis purely from your inspection of it? :-/
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 3:57 ` Jerry Chu
2008-05-07 4:27 ` David Miller
@ 2008-05-07 4:28 ` David Miller
2008-05-07 18:54 ` Jerry Chu
1 sibling, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-07 4:28 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Tue, 6 May 2008 20:57:46 -0700
> It may be getting a copy (e.g., when GSO is on?) hence losing all
> its connection to the original tp and any chance to have the pkt
> properly accounted for as host_inflight by TCP. The skb may also be
> cloned more than once (e.g., due to tcpdump)...
It only gets a copy if the SKB is already cloned and that clone is
alive somewhere (f.e. stuck in the device, a rare occurrence by
the time we retransmit).
It is a clone 99.999999% of the time.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 4:27 ` David Miller
@ 2008-05-07 18:36 ` Jerry Chu
2008-05-07 21:18 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-07 18:36 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Tue, May 6, 2008 at 9:27 PM, David Miller <davem@davemloft.net> wrote:
>
> From: "Jerry Chu" <hkchu@google.com>
> Date: Tue, 6 May 2008 20:57:46 -0700
>
>
> > I fail to see how adding shinfo->in_flight to count how many
> > outstanding clones are there can help accounting for how many
> > "host_inflight" pkts. Part of the problems, as you've mentioned
> > before, is that the driver may not always get a clone.
>
> Sure but it will get one %99.9999 of the time.
I couldn't care less about some rare boundary case either. What I'm
concerned about is whether there is some popular below-IP module/feature
that, when turned on, will render my host_inflight check (using dataref==1
as an indication that the pkt has left the host) completely bogus. Turning
on pkt tapping (e.g., tcpdump) seems to be one such case, although one
may argue that belongs to those edge usage conditions that we don't
need to worry about. (OTOH it may be undesirable for tcpdump to have a
significant performance side effect for those who try to use tcpdump to
debug performance problems.) Turning on GSO but not TSO seems to
be another case where the driver gets a copy all the time (kernel version
2.6.18).
If there is no other popular below-IP module/feature that will break the
dataref==1 check, then perhaps my initial concern was invalid. In either
case, I don't see your patch as any better than my dataref==1 check.
Neither one addresses the below-IP module/feature concern described before.
>
>
> With TCP, as long as that clone is alive the driver has it. And the
> counter only counts the clones.
>
> Anyways, did you even test my patch and try to use it for your needs
> or is this analysis purely from your inspection of it? :-/
No, I haven't tested your patch. I tried to understand skb better before
applying your patch. After I studied a bunch of code, I came to the
conclusion that your patch won't work for me. First, it tracks # of
clones, which is not what I need. E.g., tcpdump will cause host_inflight
to be grossly wrong. OK, maybe we can ignore tcpdump; even so, your patch
counts # of cloned skbs whereas I need a count of # of pkts. Perhaps this
can be fixed too, but it then dawned on me: wouldn't it be more desirable
to add the space overhead per tcp_sock than per skb?
Jerry
>
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 4:28 ` David Miller
@ 2008-05-07 18:54 ` Jerry Chu
2008-05-07 21:20 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-07 18:54 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Tue, May 6, 2008 at 9:28 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
>
> Date: Tue, 6 May 2008 20:57:46 -0700
>
>
> > It may be getting a copy (e.g., when GSO is on?) hence losing all
> > its connection to the original tp and any chance to have the pkt
> > properly accounted for as host_inflight by TCP. The skb may also be
> > cloned more than once (e.g., due to tcpdump)...
>
> It only gets a copy if the SKB is already cloned and that clone is
> alive somewhere (f.e. stuck in the device, a rare occurrence by
> the time we retransmit).
This is one of many things about skb that I still don't completely understand.
Why, in tcp_transmit_skb(), do we have to pskb_copy() if skb_cloned()?
Can't we clone an skb multiple times? Is it due to some special optimization
from the skb->fclone stuff... that imposes this restriction?
>
> It is a clone 99.999999% of the time.
When I turned on GSO but not TSO, I believe it's a copy 99.999999% of
the time. (I must confess I don't understand the GSO code yet. I
temporarily ran out of steam after studying a bunch of other code.)
Jerry
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 18:36 ` Jerry Chu
@ 2008-05-07 21:18 ` David Miller
2008-05-08 1:37 ` Jerry Chu
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-07 21:18 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 7 May 2008 11:36:59 -0700
> No I haven't tested your patch. I tried to understand skb better before
> applying your patch. After I studied bunch of code, I come to the conclusion
> that your patch won't work for me. First it tracks # of clones, which is not
> what I need. E.g., tcpdump will cause host_inflight to be grossly wrong.
We can make sub-clones not count.
Also, we can already distinguish this case, because all SKB clones
made by TCP are fast-clones. So we could only bump the counter for
fast clones. If tcpdump clones it again, it won't be a fast clone and
therefore we can avoid bumping the counter in that case. Similarly
for other features that want to clone.
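Roughly like the sketch below (the helper name is hypothetical, building
on the in_flight pointer from my earlier patch):

/* Clone for transmit and bump the counter only when the clone came from
 * the fast-clone cache; an ordinary clone (e.g. tcpdump's) is skipped.
 */
static struct sk_buff *tcp_clone_for_xmit(struct sk_buff *skb, gfp_t gfp)
{
	struct sk_buff *n = skb_clone(skb, gfp);

	if (n && n->fclone == SKB_FCLONE_CLONE &&
	    skb_shinfo(skb)->in_flight)
		atomic_inc(skb_shinfo(skb)->in_flight);
	return n;
}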
Please try to get your idea working with my infrastructure. We
can modify it to behave however you need it to, but at the core
it's the idea that tracks the state most directly and properly.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 18:54 ` Jerry Chu
@ 2008-05-07 21:20 ` David Miller
2008-05-08 0:16 ` Jerry Chu
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-07 21:20 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 7 May 2008 11:54:01 -0700
> This is one of many things about skb that I still don't completely understand.
> Why in tcp_transmit_skb() we'll have to pskb_copy() if skb_cloned()?
The other clone holder owns the packet header area. All packets on
the retransmit queue of TCP are headerless. The call sites down
into the device add the headers.
Therefore we cannot have two paths modifying the headers at the same
time.
> Can't we clone an skb multiple times? Is it due to some special optimization
> from skb->fclone stuff... that imposes this restriction?
No, it has nothing to do with fclone. It has to do with what instance
owns the packet header area in front of the user's TCP data.
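For reference, the relevant logic at the top of tcp_transmit_skb() is
roughly:

	if (likely(clone_it)) {
		if (unlikely(skb_cloned(skb)))
			skb = pskb_copy(skb, gfp_mask);
		else
			skb = skb_clone(skb, gfp_mask);
		if (unlikely(!skb))
			return -ENOBUFS;
	}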
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 21:20 ` David Miller
@ 2008-05-08 0:16 ` Jerry Chu
0 siblings, 0 replies; 56+ messages in thread
From: Jerry Chu @ 2008-05-08 0:16 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Wed, May 7, 2008 at 2:20 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Wed, 7 May 2008 11:54:01 -0700
>
>
> > This is one of many things about skb that I still don't completely understand.
> > Why in tcp_transmit_skb() we'll have to pskb_copy() if skb_cloned()?
>
> The other clone holder owns the packet header area. All packets on
> the retransmit queue of TCP are headerless. The call sites down
> into the device add the headers.
>
> Therefore we cannot have two paths modifying the headers at the same
> time.
Oh, I see. I think I confused pskb_copy() with skb_copy() and kept
wondering why we would need to copy the whole thing...
>
>
> > Can't we clone an skb multiple times? Is it due to some special optimization
> > from skb->fclone stuff... that imposes this restriction?
>
> No, it has nothing to do with fclone. It has to do with what instance
> owns the packet header area in front of the user's TCP data.
Thanks for explaining.
Jerry
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-07 21:18 ` David Miller
@ 2008-05-08 1:37 ` Jerry Chu
2008-05-08 1:43 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-08 1:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Wed, May 7, 2008 at 2:18 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Wed, 7 May 2008 11:36:59 -0700
>
>
> > No I haven't tested your patch. I tried to understand skb better before
> > applying your patch. After I studied bunch of code, I come to the conclusion
> > that your patch won't work for me. First it tracks # of clones, which is not
> > what I need. E.g., tcpdump will cause host_inflight to be grossly wrong.
>
> We can make sub-clones not count.
>
> Also, we already can distinguish this case, because all SKB clones
> made by TCP are fast-clones. So we could only bump the counter for
> fast clones. If tcpdump clones it again, it won't be a fast clone and
> therefore we can avoid bumping the counter in that case. Similarly
> for other features that want to clone.
OK, I just did a Google search for "skb fast clone" and found some
postings from you. Will take a look.
>
> Please try to get your idea working with my infrastructure. We
> can modify it to behave however you need it to, but at the core
> it's the idea that tracks the state most directly and properly.
>
OK, I'll give it a try. First I'll fix your patch to
atomic_add()/atomic_sub() by skb_shinfo(skb)->gso_segs rather than
always 1, in order for GSO/TSO to work.
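Something like the following sketch (the helper names are mine, not in
the tree):

/* Account a whole TSO/GSO skb's worth of packets instead of always 1. */
static inline void host_inflight_add(struct sk_buff *skb)
{
	if (skb_shinfo(skb)->in_flight)
		atomic_add(max_t(int, 1, skb_shinfo(skb)->gso_segs),
			   skb_shinfo(skb)->in_flight);
}

static inline void host_inflight_sub(struct sk_buff *skb)
{
	if (skb_shinfo(skb)->in_flight)
		atomic_sub(max_t(int, 1, skb_shinfo(skb)->gso_segs),
			   skb_shinfo(skb)->in_flight);
}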
One problem came to mind - it seems possible for __kfree_skb() to
access a skb_shinfo(skb)->in_flight whose tp has already been freed,
since only the original skbs on TCP's rexmit list have the owner set and
the socket held. One solution is for TCP to zap the
skb_shinfo(skb)->in_flight field when it's ready to free up the skb.
I can hack sock_wfree() to do this, but I don't know how to do it right.
Jerry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-08 1:37 ` Jerry Chu
@ 2008-05-08 1:43 ` David Miller
2008-05-08 3:33 ` Jerry Chu
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-08 1:43 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Wed, 7 May 2008 18:37:01 -0700
> Ok, will give it a try. First i'll fix your patch to
> atomic_add()/atomic_sub() by
> skb_shinfo(skb)->gso_segs rather than always 1, in order for GSO/TSO to work.
That might not work. gso_segs can change over time as retransmit
packets get split up due to SACKs etc. it needs to be audited,
at the very least.
> One problem came up to my mind - it seems possible for __kfree_skb() to
> access skb_shinfo(skb)->in_flight whose tp has been freed up since only the
> original skb's on TCP's rexmit list have the owner set and socket
> held. One solution
> is for TCP to zap skb_shinfo(skb)->in_flight field when it's ready to
> free up skb.
> I can hack sock_wfree() to do this, but I don't know how to do it right.
There will be references to the socket, so this should be ok.
If it isn't we can adjust the count and zap the pointer in
skb_orphan().
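I.e. something along these lines (sketch only, on top of the earlier
in_flight patch):

static inline void skb_orphan(struct sk_buff *skb)
{
	/* drop the in-flight accounting before the socket ref goes away */
	if (skb_shinfo(skb)->in_flight) {
		atomic_dec(skb_shinfo(skb)->in_flight);
		skb_shinfo(skb)->in_flight = NULL;
	}
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}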
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-08 1:43 ` David Miller
@ 2008-05-08 3:33 ` Jerry Chu
2008-05-12 22:22 ` Jerry Chu
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-08 3:33 UTC (permalink / raw)
To: David Miller; +Cc: netdev
There seems to be quite a bit of complexity, plus one additional pointer
field per skb_shared_info, involved in making skb better track when a pkt
leaves the host. Now I wonder if it's really a better solution than my
original approach of simply checking dataref==1, which, although not
bulletproof, may be "good enough" for all practical purposes?
Jerry
On Wed, May 7, 2008 at 6:43 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Wed, 7 May 2008 18:37:01 -0700
>
>
> > Ok, will give it a try. First i'll fix your patch to
> > atomic_add()/atomic_sub() by
> > skb_shinfo(skb)->gso_segs rather than always 1, in order for GSO/TSO to work.
>
> That might not work. gso_segs can change over time as retransmit
> packets get split up due to SACKs etc. it needs to be audited,
> at the very least.
>
>
> > One problem came up to my mind - it seems possible for __kfree_skb() to
> > access skb_shinfo(skb)->in_flight whose tp has been freed up since only the
> > original skb's on TCP's rexmit list have the owner set and socket
> > held. One solution
> > is for TCP to zap skb_shinfo(skb)->in_flight field when it's ready to
> > free up skb.
> > I can hack sock_wfree() to do this, but I don't know how to do it right.
>
> There will be references to the socket, so this should be ok.
>
> If it isn't we can adjust the count and zap the pointer in
> skb_orphan().
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-08 3:33 ` Jerry Chu
@ 2008-05-12 22:22 ` Jerry Chu
2008-05-12 22:29 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-12 22:22 UTC (permalink / raw)
To: David Miller; +Cc: netdev
I did a quick prototype based on your idea of adding an "in_flight"
field to skb_shared_info to track how many in-flight clones are in the
host. I tested it quickly and it doesn't work. After some thought it was
obvious why it won't work: what the TCP stack needs is to track how many
in-flight pkts are in the host, but your proposed patch increments
"in_flight" once, on the 1st __skb_clone() to be sent to the driver, yet
decrements "in_flight" TWICE, once for each of the clones to be freed.
I did a quick hack to make it work for my limited test case but I haven't
figured out an acceptable (non-hack) solution.
While continuing to test, I discovered that the problem I described
below, where "in_flight" may point to a tp that has already been freed,
cannot be addressed by zapping skb_shinfo(skb)->in_flight in
sock_wfree(). The reason is that pkts may be acked and freed by TCP
before the driver frees up its clone copy (e.g., due to the driver's
lazy reclaim...). When that happens, the "host_inflight" accounting will
get messed up.
Jerry
On Wed, May 7, 2008 at 8:33 PM, Jerry Chu <hkchu@google.com> wrote:
> There seems to be quite a bit of complexity plus one additional pointer
> field per skb_shared_info to make skb better track when a pkt leaves
> the host. Now I wonder if it's really a better solution than my original,
> simply checking dataref==1 approach which, although not bullet proof,
> may be "good enough" for all practical purposes?
>
> Jerry
>
>
>
> On Wed, May 7, 2008 at 6:43 PM, David Miller <davem@davemloft.net> wrote:
> > From: "Jerry Chu" <hkchu@google.com>
> > Date: Wed, 7 May 2008 18:37:01 -0700
> >
> >
> > > Ok, will give it a try. First i'll fix your patch to
> > > atomic_add()/atomic_sub() by
> > > skb_shinfo(skb)->gso_segs rather than always 1, in order for GSO/TSO to work.
> >
> > That might not work. gso_segs can change over time as retransmit
> > packets get split up due to SACKs etc. it needs to be audited,
> > at the very least.
> >
> >
> > > One problem came up to my mind - it seems possible for __kfree_skb() to
> > > access skb_shinfo(skb)->in_flight whose tp has been freed up since only the
> > > original skb's on TCP's rexmit list have the owner set and socket
> > > held. One solution
> > > is for TCP to zap skb_shinfo(skb)->in_flight field when it's ready to
> > > free up skb.
> > > I can hack sock_wfree() to do this, but I don't know how to do it right.
> >
> > There will be references to the socket, so this should be ok.
> >
> > If it isn't we can adjust the count and zap the pointer in
> > skb_orphan().
> >
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-12 22:22 ` Jerry Chu
@ 2008-05-12 22:29 ` David Miller
2008-05-12 22:31 ` David Miller
2008-05-12 22:58 ` Jerry Chu
0 siblings, 2 replies; 56+ messages in thread
From: David Miller @ 2008-05-12 22:29 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 15:22:55 -0700
> I did a quick prototype based on your idea of adding an "in_flight"
> field to skb_shared_info to track how many in-flight clones in the
> host. I tested
> it quickly and it doesn't work. After some thought it was obvious why it
> won't work. It's because what the TCP stack needs is to track how
> many in-flight pkts are in the host, but your proposed patch increments
> "in_flight" once on the 1st __skb_clone() to be sent to the driver, but
> decrements "in_flight" TWICE, one for each of the clones to be freed.
> I did a quick hack to make it work for my limited test case but I haven't
> figured out an acceptable (non-hack) solution.
That's easy to fix: only set the in_flight pointer in the child clone
skb. Thanks for figuring that out.
> Continued testing, I discovered the problem I described below where
> "in_flight" may point to a tp that has already been freed can not be
> addressed by zapping skb_shinfo(skb)->in_flight in sock_wfree(). The
> reason is that pkts may be acked and freed by TCP before driver freeing
> up its clone copy (e.g., due to driver lazy reclaim...) When that happens
> the "host_inflight" accounting will get messed up.
Simply notice, when we're about to decrement in_flight, that the data
reference is one. You can take appropriate actions if so.
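In other words, at the spot where the decrement happens, something like
(sketch):

	/* only act when we are the last holder of the data */
	if (skb_shinfo(skb)->in_flight &&
	    (atomic_read(&skb_shinfo(skb)->dataref) &
	     SKB_DATAREF_MASK) == 1)
		atomic_dec(skb_shinfo(skb)->in_flight);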
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-12 22:29 ` David Miller
@ 2008-05-12 22:31 ` David Miller
2008-05-13 3:56 ` Jerry Chu
2008-05-12 22:58 ` Jerry Chu
1 sibling, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-12 22:31 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: David Miller <davem@davemloft.net>
Date: Mon, 12 May 2008 15:29:00 -0700 (PDT)
> Simply notice, when we're about to decrement in_flight, that the data
> reference is one. You can take appropriate actions if so.
Actually, there is an even simpler solution. Since this is a fast
clone, when the parent SKB is freed we can see if the fast-clone child
is active and has the in_flight atomic_t pointer. If so, we decrement
in_flight and zap it to NULL.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-12 22:29 ` David Miller
2008-05-12 22:31 ` David Miller
@ 2008-05-12 22:58 ` Jerry Chu
2008-05-12 23:01 ` David Miller
1 sibling, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-12 22:58 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, May 12, 2008 at 3:29 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Mon, 12 May 2008 15:22:55 -0700
>
>
> > I did a quick prototype based on your idea of adding an "in_flight"
> > field to skb_shared_info to track how many in-flight clones in the
> > host. I tested
> > it quickly and it doesn't work. After some thought it was obvious why it
> > won't work. It's because what the TCP stack needs is to track how
> > many in-flight pkts are in the host, but your proposed patch increments
> > "in_flight" once on the 1st __skb_clone() to be sent to the driver, but
> > decrements "in_flight" TWICE, one for each of the clones to be freed.
> > I did a quick hack to make it work for my limited test case but I haven't
> > figured out an acceptable (non-hack) solution.
>
> That's easy to fix, only set the in_flight pointer in the child clone
> skb. Thanks for figuring that out.
I must be missing something. All the clones share the same
"skb_shared_info"; how does one set in_flight in only one clone but
not the other?
>
>
> > Continued testing, I discovered the problem I described below where
> > "in_flight" may point to a tp that has already been freed can not be
> > addressed by zapping skb_shinfo(skb)->in_flight in sock_wfree(). The
> > reason is that pkts may be acked and freed by TCP before driver freeing
> > up its clone copy (e.g., due to driver lazy reclaim...) When that happens
> > the "host_inflight" accounting will get messed up.
>
> Simply notice, when we're about to decrement in_flight, that the data
> reference is one. You can take appropriate actions if so.
>
Jerry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-12 22:58 ` Jerry Chu
@ 2008-05-12 23:01 ` David Miller
0 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2008-05-12 23:01 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 15:58:58 -0700
> I must be missing something. All the clones share the same
> "skb_shared_info", how does one only set in_flight in one clone but
> not the other?
These are fast clones; we "know" which is the parent and which is
the child.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-12 22:31 ` David Miller
@ 2008-05-13 3:56 ` Jerry Chu
2008-05-13 3:58 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-13 3:56 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, May 12, 2008 at 3:31 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Mon, 12 May 2008 15:29:00 -0700 (PDT)
>
>
> > Simply notice, when we're about to decrement in_flight, that the data
> > reference is one. You can take appropriate actions if so.
>
> Actually, there is an even simpler solution. Since this is a fast
> clone, when the parent SKB is freed we can see if the fast-clone child
> is active and has the in_flight atomic_t pointer. If so, we decrement
> in_flight and zap it to NULL.
I thought about that too. But what happens if the clone is being freed by
the driver at the same time as TCP is freeing the parent? Isn't there an
intrinsic race condition here? If so, can we use some atomic operation to
zap in_flight, but one that also tells the caller whether in_flight has
already been nullified, so that the accounting can be handled correctly? (Although
tp->host_inflight doesn't need to be super accurate, if there is no simple
solution above I'll have to reset tp->host_inflight periodically and convince
myself the pathological case of host_inflight drifting forever won't ever
happen...)
Jerry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-13 3:56 ` Jerry Chu
@ 2008-05-13 3:58 ` David Miller
2008-05-13 4:00 ` Jerry Chu
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-13 3:58 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 20:56:16 -0700
> If true, can we use some atomic operation to zap in_flight, but also
> return to the caller if in_flight has been nullified already so that
> the accounting can be handled correctly?
xchg() should work for that purpose.
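I.e. (sketch):

	/* whichever path gets here first does the decrement; the other
	 * side sees NULL and skips it */
	atomic_t *in_flight = xchg(&skb_shinfo(skb)->in_flight, NULL);

	if (in_flight)
		atomic_dec(in_flight);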
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-13 3:58 ` David Miller
@ 2008-05-13 4:00 ` Jerry Chu
2008-05-13 4:02 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-13 4:00 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Does xchg() return the old value? (Can't find any comments on it.)
Jerry
On Mon, May 12, 2008 at 8:58 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Mon, 12 May 2008 20:56:16 -0700
>
>
> > If true, can we use some atomic operation to zap in_flight, but also
> > return to the caller if in_flight has been nullified already so that
> > the accounting can be handled correctly?
>
> xchg() should work for that purpose.
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-13 4:00 ` Jerry Chu
@ 2008-05-13 4:02 ` David Miller
2008-05-17 1:13 ` Jerry Chu
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-13 4:02 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Mon, 12 May 2008 21:00:08 -0700
> Does xchg() return the old value? (Can't find any comments on it.)
Yes, otherwise it wouldn't be very useful.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-13 4:02 ` David Miller
@ 2008-05-17 1:13 ` Jerry Chu
2008-05-17 1:29 ` David Miller
0 siblings, 1 reply; 56+ messages in thread
From: Jerry Chu @ 2008-05-17 1:13 UTC (permalink / raw)
To: David Miller; +Cc: netdev
OK, now the code seems to work when TSO is disabled, i.e., cwnd
remains low (40-60 pkts) and only grows slowly for an extended period on
a back-to-back 1GbE link. It also works with tcpdump running, after I
fixed the code, per your suggestion, to only bump the in_flight count
for fast clones.
But it doesn't work when TSO is enabled. I've fixed tso_fragment() to
correctly set skb_shinfo(buff)->in_flight after the split, but cwnd still
grows rapidly to a few hundred pkts (although smaller than without
the fix). The host inflight accounting gets screwed up. It looks like
pskb_expand_head(), called by tcp_tso_acked()->tcp_trim_head(),
messes up the accounting, but I don't know how to fix it (still trying
to understand this complex piece of code). There could be other
reasons as well.
Jerry
On Mon, May 12, 2008 at 9:02 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Mon, 12 May 2008 21:00:08 -0700
>
>> Does xchg() return the old value? (Can't find any comments on it.)
>
> Yes, otherwise it wouldn't be very useful.
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-17 1:13 ` Jerry Chu
@ 2008-05-17 1:29 ` David Miller
2008-05-17 1:47 ` Jerry Chu
0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2008-05-17 1:29 UTC (permalink / raw)
To: hkchu; +Cc: netdev
From: "Jerry Chu" <hkchu@google.com>
Date: Fri, 16 May 2008 18:13:20 -0700
> The host inflight accounting gets screwed up. It looks like
> pskb_expand_head() called by tcp_tso_acked()->tcp_trim_head() messes
> up the accounting but I don't know how to fix it (still trying to
> understand this complex piece of code). There could be other reason
> as well.
This is just like freeing up a normal SKB, so decrementing the
in_flight value by the appropriate number of packets should do the right
thing.
tcp_tso_acked() calculates this adjustment for you, in packets_acked.
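Roughly (sketch, inside tcp_tso_acked() once packets_acked is known):

	/* the trimmed portion is no longer host-inflight */
	if (skb_shinfo(skb)->in_flight)
		atomic_sub(packets_acked, skb_shinfo(skb)->in_flight);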
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Socket buffer sizes with autotuning
2008-05-17 1:29 ` David Miller
@ 2008-05-17 1:47 ` Jerry Chu
0 siblings, 0 replies; 56+ messages in thread
From: Jerry Chu @ 2008-05-17 1:47 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Fri, May 16, 2008 at 6:29 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jerry Chu" <hkchu@google.com>
> Date: Fri, 16 May 2008 18:13:20 -0700
>
>> The host inflight accounting gets screwed up. It looks like
>> pskb_expand_head() called by tcp_tso_acked()->tcp_trim_head() messes
>> up the accounting but I don't know how to fix it (still trying to
>> understand this complex piece of code). There could be other reason
>> as well.
>
> This is just like freeing up a normal SKB, so decrementing the
> in_flight value the appropriate number of packets should do the right
> thing.
>
> tcp_tso_acked() calculates this adjustment for you, in packets_acked.
>
The current pskb_expand_head() code sets skb_shinfo(skb)->in_flight
to NULL to leave the in_flight accounting solely to the other clone to
handle. To calculate in_flight more accurately, it seems to require
splitting the gso_segs and in_flight accounting between the clone and
the skb (with the new header), right? Plus any race conditions to take
care of... Hard to think clearly on a Friday afternoon :-(
Jerry
^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2008-05-17 1:47 UTC | newest]
Thread overview: 56+ messages
2008-04-23 23:29 Socket buffer sizes with autotuning Jerry Chu
2008-04-24 16:32 ` John Heffner
2008-04-25 0:49 ` Jerry Chu
2008-04-25 6:46 ` David Miller
2008-04-25 21:29 ` Jerry Chu
2008-04-25 21:35 ` David Miller
2008-04-28 18:30 ` Jerry Chu
2008-04-28 19:21 ` John Heffner
2008-04-28 20:44 ` Jerry Chu
2008-04-28 23:22 ` [PATCH 1/2] [NET]: Allow send-limited cwnd to grow up to max_burst when gso disabled John Heffner
2008-04-28 23:22 ` [PATCH 2/2] [NET]: Limit cwnd growth when deferring for GSO John Heffner
[not found] ` <d1c2719f0804281338j3984cf2bga31def0c2c1192a1@mail.gmail.com>
2008-04-28 23:28 ` Socket buffer sizes with autotuning John Heffner
2008-04-28 23:35 ` David Miller
2008-04-29 2:20 ` Jerry Chu
2008-04-25 7:05 ` David Miller
2008-05-07 3:57 ` Jerry Chu
2008-05-07 4:27 ` David Miller
2008-05-07 18:36 ` Jerry Chu
2008-05-07 21:18 ` David Miller
2008-05-08 1:37 ` Jerry Chu
2008-05-08 1:43 ` David Miller
2008-05-08 3:33 ` Jerry Chu
2008-05-12 22:22 ` Jerry Chu
2008-05-12 22:29 ` David Miller
2008-05-12 22:31 ` David Miller
2008-05-13 3:56 ` Jerry Chu
2008-05-13 3:58 ` David Miller
2008-05-13 4:00 ` Jerry Chu
2008-05-13 4:02 ` David Miller
2008-05-17 1:13 ` Jerry Chu
2008-05-17 1:29 ` David Miller
2008-05-17 1:47 ` Jerry Chu
2008-05-12 22:58 ` Jerry Chu
2008-05-12 23:01 ` David Miller
2008-05-07 4:28 ` David Miller
2008-05-07 18:54 ` Jerry Chu
2008-05-07 21:20 ` David Miller
2008-05-08 0:16 ` Jerry Chu
[not found] <d1c2719f0804241829s1bc3f41ejf7ebbff73ed96578@mail.gmail.com>
2008-04-25 7:06 ` Andi Kleen
2008-04-25 7:28 ` David Miller
2008-04-25 7:48 ` Andi Kleen
-- strict thread matches above, loose matches on Subject: below --
2008-04-23 0:38 Rick Jones
2008-04-23 2:17 ` John Heffner
2008-04-23 3:59 ` David Miller
2008-04-23 16:32 ` Rick Jones
2008-04-23 16:58 ` John Heffner
2008-04-23 17:24 ` Rick Jones
2008-04-23 17:41 ` John Heffner
2008-04-23 17:46 ` Rick Jones
2008-04-24 22:21 ` Andi Kleen
2008-04-24 22:39 ` John Heffner
2008-04-25 1:28 ` David Miller
[not found] ` <65634d660804242234w66455bedve44801a98e3de9d9@mail.gmail.com>
2008-04-25 6:36 ` David Miller
2008-04-25 7:42 ` Tom Herbert
2008-04-25 7:46 ` David Miller
2008-04-28 17:51 ` Tom Herbert