netdev.vger.kernel.org archive mirror
* issue with new TCP TSO stuff
@ 2005-05-12  5:30 David S. Miller
  2005-05-12 10:05 ` Herbert Xu
  2005-05-12 14:13 ` Andi Kleen
  0 siblings, 2 replies; 17+ messages in thread
From: David S. Miller @ 2005-05-12  5:30 UTC (permalink / raw)
  To: netdev


It's more expensive than the old code.  And I know the main reason.

There is a higher cost now because we do a large number of page
refcount grabs and releases that never occurred before.  Before,
we just cloned and thus grabbed a ref to the whole TSO data area in
one go.
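
As a rough illustration (this is not code from the patch, just the
shape of the cost, and tso_skb is a made-up name): building the TSO
frame from the write queue pages now takes one page reference per
fragment, plus the matching put when the frame is freed:

	int i;

	/* hypothetical sketch of the per-fragment refcounting */
	for (i = 0; i < skb_shinfo(tso_skb)->nr_frags; i++)
		get_page(skb_shinfo(tso_skb)->frags[i].page);

whereas the old path took a single atomic reference on the whole
frame via the clone.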

This shows up in testing where the connection is application limited.
For example, an "scp" goes more slowly over TSO now, since fewer CPU
cycles are available for the encryption.

It's tricky to come up with a scheme to fix this.  I would love to be
able to not do the page grabs/releases in the actual TSO frame.  I
really haven't come up with a clean way to do that however.

Basically, if we could somehow delay the freeing of the actual SKBs in
the socket write queue until the device frees up the TSO frame, we
could avoid the page gets and puts.  Almost all the time, this delay
would never actually be needed, because the ACKs come back _long_
after the device liberates the TSO frame.  But it can happen, due to
packet taps, packet reordering, and other reasons.  So we do have to
handle it.

I don't know, maybe we can do something clever with the
skb_shinfo(skb)->frag_list pointer.

Any bright ideas? :-)

* Re: issue with new TCP TSO stuff
  2005-05-12  5:30 issue with new TCP TSO stuff David S. Miller
@ 2005-05-12 10:05 ` Herbert Xu
  2005-05-12 20:13   ` David S. Miller
  2005-05-12 14:13 ` Andi Kleen
  1 sibling, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2005-05-12 10:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Hi Dave:

I've got a number of comments for your patch which I will send once
I've finished reading it all.

David S. Miller <davem@davemloft.net> wrote:
> 
> I don't know, maybe we can do something clever with the
> skb_shinfo(skb)->frag_list pointer.

This was in fact going to be one of my comments :) 

What we could do is get the TSO drivers to all implement NETIF_F_FRAGLIST.
Once they do that, you can simply chain up the skb's and send it off to
them.  The coalescing will need to be done in the drivers.  However, that's
not too bad because coalescing only has to be done at the skb boundaries.

In fact, this is how we can simplify the unwinding stuff in your
skb_append_pages function.  Because the coalescing only needs to occur
between skb's, you only need to check the first frag to know whether it
can be coalesced or not.  This means that the unwinding stuff can mostly
go away.
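
As a rough sketch of what I have in mind (the sk_buff fields are real,
the helper itself is made up):

	/* chain an already-built segment behind the head skb instead of
	 * copying its page frags into one big TSO frame */
	static void chain_segment(struct sk_buff *head, struct sk_buff *skb)
	{
		struct sk_buff **p = &skb_shinfo(head)->frag_list;

		while (*p)
			p = &(*p)->next;
		*p = skb;

		head->len      += skb->len;
		head->data_len += skb->len;
		head->truesize += skb->truesize;
	}

The driver then walks the head skb followed by its frag_list when it
builds its descriptors.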

We'll have to watch out for retransmissions of the frame with a non-null
frag_list pointer.  They will need to be copied if their clone is still
hanging around.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: issue with new TCP TSO stuff
  2005-05-12  5:30 issue with new TCP TSO stuff David S. Miller
  2005-05-12 10:05 ` Herbert Xu
@ 2005-05-12 14:13 ` Andi Kleen
  2005-05-12 19:26   ` David S. Miller
  1 sibling, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2005-05-12 14:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

"David S. Miller" <davem@davemloft.net> writes:
>
> This shows up in testing where the connection is application limited.
> For example, an "scp" goes more slowly over TSO now, since fewer CPU
> cycles are available for the encryption.
>
> It's tricky to come up with a scheme to fix this.  I would love to be
> able to not do the page grabs/releases in the actual TSO frame.  I
> really haven't come up with a clean way to do that however.

Are you sure a few atomic_inc/dec are really causing a noticeable
slowdown? That would surprise me unless you have lots of cache line
bouncing on an MP system.

What CPU did you test it on? Does it happen with only a single CPU?
And did you actually see them in some profile?

Assuming the struct page is in cache, the P4 core is the slowest CPU
at this that I know of, but even on that one it should be lost in the
noise next to the other overhead of talking to a NIC over a PCI bus.

Perhaps it is something else...

-Andi

* Re: issue with new TCP TSO stuff
  2005-05-12 14:13 ` Andi Kleen
@ 2005-05-12 19:26   ` David S. Miller
       [not found]     ` <20050512200251.GA72662@muc.de>
  0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2005-05-12 19:26 UTC (permalink / raw)
  To: ak; +Cc: netdev

From: Andi Kleen <ak@muc.de>
Subject: Re: issue with new TCP TSO stuff
Date: Thu, 12 May 2005 16:13:05 +0200

> Are you sure a few atomic_inc/dec are really causing a noticeable
> slowdown? That would surprise me unless you have lots of cache line
> bouncing on an MP system.

It's about 7 or 8 times as many atomic gets and puts on pages compared
to the older code (with an 8K page size).

Some of it we can't get rid of, because we keep a normal write queue
segmented set of SKBs around.

But the TSO part we can eliminate.

* Re: issue with new TCP TSO stuff
       [not found]     ` <20050512200251.GA72662@muc.de>
@ 2005-05-12 20:03       ` David S. Miller
  2005-05-12 20:26         ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2005-05-12 20:03 UTC (permalink / raw)
  To: ak; +Cc: netdev

From: Andi Kleen <ak@muc.de>
Subject: Re: issue with new TCP TSO stuff
Date: Thu, 12 May 2005 22:02:51 +0200

> Sure, but did you verify it was the actual problem? (e.g. with a profiler) 
> If the cache line the atomic operation is done on is EXCLUSIVE to the
> CPU then it should not take *that* long to do the atomic operations.

Such issues cannot be measured like that, they tend to make
other operations slower by inducing cache misses elsewhere.

I used my brain to analyze this slowdown, instead of the
computer, I'm sorry if that disturbs you :-)

* Re: issue with new TCP TSO stuff
  2005-05-12 10:05 ` Herbert Xu
@ 2005-05-12 20:13   ` David S. Miller
  2005-05-12 21:47     ` Herbert Xu
  0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2005-05-12 20:13 UTC (permalink / raw)
  To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Subject: Re: issue with new TCP TSO stuff
Date: Thu, 12 May 2005 20:05:48 +1000

> What we could do is get the TSO drivers to all implement NETIF_F_FRAGLIST.
> Once they do that, you can simply chain up the skb's and send it off to
> them.  The coalescing will need to be done in the drivers.  However, that's
> not too bad because coalescing only has to be done at the skb boundaries.
> 
> In fact, this is how we can simplify the unwinding stuff in your
> skb_append_pages function.  Because the coalescing only needs to occur
> between skb's, you only need to check the first frag to know whether it
> can be coalesced or not.  This means that the unwinding stuff can mostly
> go away.
> 
> We'll have to watch out for retransmissions of the frame with a non-null
> frag_list pointer.  They will need to be copied if their clone is still
> hanging around.

Yes, we can just add a frag_list pointer check to the skb_cloned()
tests we do when cons'ing up retransmit SKBs for tcp_transmit_skb().
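
Something like this, roughly (the helper name is made up, the test
itself is the point):

	/* a frame with a frag_list chain needs a private copy before
	 * we retransmit it, just like a cloned one does */
	static inline int tcp_skb_needs_copy(struct sk_buff *skb)
	{
		return skb_cloned(skb) || skb_shinfo(skb)->frag_list != NULL;
	}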

But this still has the early free problem, I think.  If an ACK
comes in which releases an SKB on the chain, while the driver
is still working with that chain, we cannot free the SKB.  We
have to do it some time later.

One way to prevent that would be to do an skb_get() on every
SKB in the chain, but then we're back to the original problem
of all the extra atomic operations.

A secondary point is that I'd like to use a name other than
NETIF_F_FRAGLIST because people are super confused as to what this
device flag even means.  Some people confuse it with NETIF_F_SG,
others think it takes a huge UDP frame and fragments it into MTU-sized
IP frames and checksums the whole thing.  None of which is true.

Loopback is the only driver which supports this properly, by
simply doing nothing with the packet :-)

So back to the main point, we are in quite a conundrum.  The whole
point of TSO is to offload the segmentation overhead, but we're in
fact making the TCP output engine more expensive for the TSO path.

I've also considered a longer term idea where we store the write queue
in some minimal abstract format, instead of a list of SKBs.  Just a
data collection and some sequence numbers.  But that would be a huge
change with questionable gains.

* Re: issue with new TCP TSO stuff
  2005-05-12 20:03       ` David S. Miller
@ 2005-05-12 20:26         ` Andi Kleen
  2005-05-12 22:34           ` David S. Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2005-05-12 20:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Thu, May 12, 2005 at 01:03:41PM -0700, David S. Miller wrote:
> From: Andi Kleen <ak@muc.de>
> Subject: Re: issue with new TCP TSO stuff
> Date: Thu, 12 May 2005 22:02:51 +0200
> 
> > Sure, but did you verify it was the actual problem? (e.g. with a profiler) 
> > If the cache line the atomic operation is done on is EXCLUSIVE to the
> > CPU then it should not take *that* long to do the atomic operations.
> 
> Such issues cannot be measured like that, they tend to make
> other operations slower by inducing cache misses elsewhere.

Atomic operations, especially with cache misses, normally show up in a
fine-grained profile.  They also don't cause additional cache misses
over non-atomic writes.

> I used my brain to analyze this slowdown, instead of the
> computer, I'm sorry if that disturbs you :-)

What disturbs me is your conclusion :)

-Andi 

* Re: issue with new TCP TSO stuff
  2005-05-12 20:13   ` David S. Miller
@ 2005-05-12 21:47     ` Herbert Xu
  2005-05-12 22:10       ` Herbert Xu
  2005-05-12 22:46       ` David S. Miller
  0 siblings, 2 replies; 17+ messages in thread
From: Herbert Xu @ 2005-05-12 21:47 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Thu, May 12, 2005 at 01:13:49PM -0700, David S. Miller wrote:
> 
> But this still has the early free problem, I think.  If an ACK
> comes in which releases an SKB on the chain, while the driver
> is still working with that chain, we cannot free the SKB.  We
> have to do it some time later.

This is fine because we have to skb_clone the packet.  Otherwise
the next pointer for the frag/xmit list will be messed up.

> One way to prevent that would be to do an skb_get() on every
> SKB in the chain, but then we're back to the original problem
> of all the extra atomic operations.

You're being frugal Dave :) I was happy with one skb_clone per
skb and you're having problems with skb_get? :)

Seriously, we already do one skb_clone for every packet sent
so we won't be incurring any extra overhead with this.

> A secondary point is that I'd like to use a name other than
> NETIF_F_FRAGLIST because people are super confused as to what this
> device flag even means.  Some people confuse it with NETIF_F_SG,
> others think it takes a huge UDP frame and fragments it into MTU-sized
> IP frames and checksums the whole thing.  None of which is true.

Fine by me.  But you know I'm not good with names :)

> Loopback is the only driver which supports this properly, by
> simply doing nothing with the packet :-)

If we implement this in the TSO-capable drivers, IPsec could benefit
as well.  Two inbound IPsec fragments that are reassembled are currently
linearised in dev_queue_xmit.  With this they could be sent directly to
the driver.

> So back to the main point, we are in quite a conundrum.  The whole
> point of TSO is to offload the segmentation overhead, but we're in
> fact making the TCP output engine more expensive for the TSO path.

I'll need to finish reading the entire patch before I can respond to
this.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: issue with new TCP TSO stuff
  2005-05-12 21:47     ` Herbert Xu
@ 2005-05-12 22:10       ` Herbert Xu
  2005-05-12 22:52         ` David S. Miller
  2005-05-12 22:46       ` David S. Miller
  1 sibling, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2005-05-12 22:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Fri, May 13, 2005 at 07:47:44AM +1000, herbert wrote:
> 
> You're being frugal Dave :) I was happy with one skb_clone per
> skb and you're having problems with skb_get? :)
> 
> Seriously, we already do one skb_clone for every packet sent
> so we won't be incurring any extra overhead with this.

Nevermind, you're comparing to the existing TSO implementation.
So how big exactly is the slowdown?

> > A secondary point is that I'd like to use a name other than
> > NETIF_F_FRAGLIST because people are super confused as to what this
> > device flag even means.  Some people confuse it with NETIF_F_SG,
> > others think it takes a huge UDP frame and fragments it into MTU-sized
> > IP frames and checksums the whole thing.  None of which is true.
> 
> Fine by me.  But you know I'm not good with names :)

This just occurred to me: what about NETIF_F_SKBLIST?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: issue with new TCP TSO stuff
  2005-05-12 20:26         ` Andi Kleen
@ 2005-05-12 22:34           ` David S. Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David S. Miller @ 2005-05-12 22:34 UTC (permalink / raw)
  To: ak; +Cc: netdev

From: Andi Kleen <ak@muc.de>
Subject: Re: issue with new TCP TSO stuff
Date: Thu, 12 May 2005 22:26:29 +0200

> On Thu, May 12, 2005 at 01:03:41PM -0700, David S. Miller wrote:
> > I used my brain to analyze this slowdown, instead of the
> > computer, I'm sorry if that disturbs you :-)
> 
> What disturbs me is your conclusion :)

Fair enough, I'll get some numbers. :-)

* Re: issue with new TCP TSO stuff
  2005-05-12 21:47     ` Herbert Xu
  2005-05-12 22:10       ` Herbert Xu
@ 2005-05-12 22:46       ` David S. Miller
  1 sibling, 0 replies; 17+ messages in thread
From: David S. Miller @ 2005-05-12 22:46 UTC (permalink / raw)
  To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Subject: Re: issue with new TCP TSO stuff
Date: Fri, 13 May 2005 07:47:44 +1000

> On Thu, May 12, 2005 at 01:13:49PM -0700, David S. Miller wrote:
> > One way to prevent that would be to do an skb_get() on every
> > SKB in the chain, but then we're back to the original problem
> > of all the extra atomic operations.
> 
> You're being frugal Dave :) I was happy with one skb_clone per
> skb and you're having problems with skb_get? :)
> 
> Seriously, we already do one skb_clone for every packet sent
> so we won't be incurring any extra overhead with this.

It's "relative to existing TSO" that this cost is.  For the existing
TSO handling we do one atomic ref for the entire TSO frame.  Now we're
going to do it for the N frames that a TSO packet is composed of.

So what this ends up meaning is that the new TSO code's only gains
(relative to non-TSO) will basically be that we:

1) Incur only one trip down the device XMIT path for N frames.
2) Only send one set of headers over the bus to the device.

Whereas the existing TSO support has a third benefit, which is:

3) An entire TSO frame is carried as a single SKB, rather than as
   N segment SKBs.

This reduction in SKB allocations and subsequent handling is
non-trivial.

It has ramifications beyond the transmit path.  When the ACKs
come back in, we make fewer kfree_skb() calls, for example.  But
then again, we had all of that ugly TSO partial-ACK trimming
stuff.

I'm going to do some profiling to make sure it really matters.

* Re: issue with new TCP TSO stuff
  2005-05-12 22:10       ` Herbert Xu
@ 2005-05-12 22:52         ` David S. Miller
  2005-05-12 23:10           ` Herbert Xu
  0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2005-05-12 22:52 UTC (permalink / raw)
  To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Subject: Re: issue with new TCP TSO stuff
Date: Fri, 13 May 2005 08:10:46 +1000

> Nevermind, you're comparing to the existing TSO implementation.
> So how big exactly is the slowdown?

The "scp largefile dest:" test case went down from 4.6MB/sec
to 4.3MB/sec when I enable TSO on the tg3 device with the
new TSO code.

So that drop is relative to non-TSO.  That's how I knew there
was some trouble.  TSO should definitely not have more overhead
than non-TSO.

For a clean bulk transfer which is not application limited like
scp, we get full line rate.  But that doesn't show how expensive
TSO is, since the link is the bottleneck, not the CPU.

> This just occurred to me: what about NETIF_F_SKBLIST?

Sounds ok.

* Re: issue with new TCP TSO stuff
  2005-05-12 22:52         ` David S. Miller
@ 2005-05-12 23:10           ` Herbert Xu
  2005-05-12 23:24             ` David S. Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2005-05-12 23:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Thu, May 12, 2005 at 03:52:30PM -0700, David S. Miller wrote:
> 
> The "scp largefile dest:" test case went down from 4.6MB/sec
> to 4.3MB/sec when I enable TSO on the tg3 device with the
> new TSO code.

OK, I'm sure we can optimise the code so that it beats the
non-TSO case.

However, I think you're right that this does have some fundamental
overheads compared to the existing TSO code which we can't remove.

More specifically, the existing TSO code really does avoid
segmentation in that no MTU-sized skb's are allocated unless
the congestion window requires that to be done.  The new code
will always allocate MTU-sized skb's no matter what.

The ideal solution should bring the best of both worlds :) That is,
no segmentation on output unless required by the congestion window,
while at the same time avoiding the tcp_skb_pcount logic.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: issue with new TCP TSO stuff
  2005-05-12 23:10           ` Herbert Xu
@ 2005-05-12 23:24             ` David S. Miller
  2005-05-12 23:52               ` Herbert Xu
  0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2005-05-12 23:24 UTC (permalink / raw)
  To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Subject: Re: issue with new TCP TSO stuff
Date: Fri, 13 May 2005 09:10:38 +1000

> However, I think you're right that this does have some fundamental
> overheads compared to the existing TSO code which we can't remove.
> 
> More specifically, the existing TSO code really does avoid
> segmentation in that no MTU-sized skb's are allocated unless
> the congestion window requires that to be done.  The new code
> will always allocate MTU-sized skb's no matter what.
> 
> The ideal solution should bring the best of both worlds :) That is,
> no segmentation on output unless required by the congestion window,
> while at the same time avoiding the tcp_skb_pcount logic.

Right.

Ok, the two true downfalls of the current TSO code are:

1) It does not attempt to predict what the CWND will
   be at packet output time.  So smaller-than-ideal
   TSO frames are built.
2) Packet loss is not handled gracefully; in fact, TSO
   is disabled when this happens :-)

So, I'm mentioning this because it may end up being better to try and
solve those two problems instead of going to my new stuff.

#2 could be handled by down-sizing TSO frames when packet loss occurs.
I.e. tcp_retransmit_skb() or whatever will segmentize a TSO packet
which is within the sequence it is trying to retransmit.  Implementing
this is non-trivial, mostly because it has to handle GFP_ATOMIC memory
failures and also get all the pcount crap correct.

#1 is more tricky, and is the main reason I explored the "TSO
Reloaded" idea.  I wonder if we could just build enormous TSO frames
_always_.  We pass down these huge things to the output path, with a
struct sk_buff local offset and size.  That way, if the packet is too
large for the congestion window, we're fine, we just set the offset
and size appropriately.  I think the tcp_snd_test() simplifications
made by my TSO Reloaded patch would help a lot here.  The send test is
now logically split into its two parts: 1) whether to send anything at
all, and 2) once #1 passes, how many such packets to send.

This would be a sort of super-TSO that would do less segmenting work
than even a "perfect" TSO segmenter would.
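
To make the offset/size idea a bit more concrete, the output path
could do something like this (all of it is a sketch; the tso_offset
field and the tcp_cwnd_room() helper are made up):

	unsigned int offset, room, len;

	/* expose only as much of the huge frame as the congestion
	 * window currently allows */
	offset = TCP_SKB_CB(skb)->tso_offset;
	room   = tcp_cwnd_room(tp) * tp->mss_cache;	/* made-up helper */
	len    = min_t(unsigned int, skb->len - offset, room);

	/* transmit bytes [offset, offset + len) of skb, then advance */
	TCP_SKB_CB(skb)->tso_offset += len;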

I'm still not sure which approach is best, just throwing around some
ideas.

* Re: issue with new TCP TSO stuff
  2005-05-12 23:24             ` David S. Miller
@ 2005-05-12 23:52               ` Herbert Xu
  2005-05-13  4:36                 ` David S. Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2005-05-12 23:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Thu, May 12, 2005 at 04:24:26PM -0700, David S. Miller wrote:
> 
> Ok, the two true downfalls of the current TSO code are:

You're spot on.
 
> #2 could be handled by down-sizing TSO frames when packet loss occurs.
> I.e. tcp_retransmit_skb() or whatever will segmentize a TSO packet
> which is within the sequence it is trying to retransmit.  Implementing
> this is non-trivial, mostly because it has to handle GFP_ATOMIC memory
> failures and also get all the pcount crap correct.

I think most of the code to do that might already be there.  Pity
we'll have to keep track of the pcount though.

> #1 is more tricky, and is the main reason I explored the "TSO
> Reloaded" idea.  I wonder if we could just build enormous TSO frames
> _always_.  We pass down these huge things to the output path, with a

I agree that this is the way to go.

> and size appropriately.  I think the tcp_snd_test() simplifications
> made by my TSO Reloaded patch would help a lot here.  The send test is
> now logically split into its two parts: 1) whether to send anything at
> all, and 2) once #1 passes, how many such packets to send.

That was another one of my comments to your patch :) I was going to
suggest that we move the cwnd/in_flight loop from tcp_snd_test to
emit_send_queue.

> This would be a sort of super-TSO that would do less segmenting work
> than even a "perfect" TSO segmenter would.
> 
> I'm still not sure which approach is best, just throwing around some
> ideas.

On paper your new super-TSO approach sounds the best.  However, I
suppose we won't know for sure until we try :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: issue with new TCP TSO stuff
  2005-05-12 23:52               ` Herbert Xu
@ 2005-05-13  4:36                 ` David S. Miller
  2005-05-13 13:25                   ` Herbert Xu
  0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2005-05-13  4:36 UTC (permalink / raw)
  To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Subject: Re: issue with new TCP TSO stuff
Date: Fri, 13 May 2005 09:52:36 +1000

> > #2 could be handled by down-sizing TSO frames when packet loss occurs.
> > I.e. tcp_retransmit_skb() or whatever will segmentize a TSO packet
> > which is within the sequence it is trying to retransmit.  Implementing
> > this is non-trivial, mostly because it has to handle GFP_ATOMIC memory
> > failures and also get all the pcount crap correct.
> 
> I think most of the code to do that might already be there.  Pity
> we'll have to keep track of the pcount though.

Yes, when we walk the queue to retransmit, we chop up the
TSO frame via tcp_trim_head() and tcp_fragment().  It is
very loose and would need refinements though.

I think we'd be OK, because even if we get a GFP_ATOMIC allocation
failure, it's just like any other failure.  We're doing retransmits
so a timer will be the worst case deadlock breaker should we get an
allocation failure while chopping up the TSO frame.

This actually makes me worried about sk->sk_send_head advancement.
I think we have a deadlock possible here.  If sk->sk_send_head is
the only packet in the output queue, and the remote system has no
reason to send us zero-window probes or any other packets, we could
wedge if the skb_clone() fails for the tcp_transmit_skb() call.
We won't set any timers if we fail to send that single sk->sk_send_head
write queue packet, and we have no pending frames waiting to be ACK'd,
so nothing will wake us up to retry the transmit.

> > and size appropriately.  I think the tcp_snd_test() simplifications
> > made by my TSO Reloaded patch would help a lot here.  The send test is
> > now logically split into its two parts: 1) whether to send anything at
> > all, and 2) once #1 passes, how many such packets to send.
> 
> That was another one of my comments to your patch :) I was going to
> suggest that we move the cwnd/in_flight loop from tcp_snd_test to
> emit_send_queue.

One elegant part of the tcp_snd_test() in the TCP Reloaded patch is
that tcp_write_queue() calls it exactly once.

That loop in there is bogus and unnecessary.  In the changelog for
the TCP Reloaded patch I mention that this can be simplified into:

	new_packets = skb_queue_len(&sk->sk_write_queue) - in_flight;
	cwnd -= in_flight;
	return min(new_packets, cwnd);

It is a BUG() for this calculation to be larger than the number of
packets from sk->sk_send_head to the end of the write queue (which is
the invariant we must guarantee for our caller).  We could even check
for this while debugging the implementation early on.
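
(For the debugging check, something as simple as

	/* tcp_packets_after_send_head() is a hypothetical helper that
	 * counts packets from sk->sk_send_head to the write queue tail */
	BUG_ON(min(new_packets, cwnd) > tcp_packets_after_send_head(sk));

would catch a violation of the invariant immediately.)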

> > This would be a sort of super-TSO that would do less segmenting work
> > than even a "perfect" TSO segmenter would.
> > 
> > I'm still not sure which approach is best, just throwing around some
> > ideas.
> 
> On paper your new super-TSO approach sounds the best.  However, I
> suppose we won't know for sure until we try :)

Yeah, TCP Reloaded sounded really nice on paper too :-)

To handle these super-TSO frames properly, we would need to add
sk->sk_send_offset to go along with sk->sk_send_head.  So then
update_send_head() would advance sk->sk_send_offset, and if that
completely consumed sk->sk_send_head then the pointer would be
advanced.

Tests of sk->sk_send_head would need to be fixed up similarly.
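
A minimal sketch of that advancement (sk_send_offset is the new,
not-yet-existing field; everything else is the existing write queue
machinery):

	static void update_send_head(struct sock *sk, struct sk_buff *skb,
				     unsigned int sent)
	{
		sk->sk_send_offset += sent;
		if (sk->sk_send_offset < skb->len)
			return;

		/* super-TSO frame fully consumed, move to the next one */
		sk->sk_send_offset = 0;
		sk->sk_send_head = skb->next;
		if (sk->sk_send_head == (struct sk_buff *)&sk->sk_write_queue)
			sk->sk_send_head = NULL;
	}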

* Re: issue with new TCP TSO stuff
  2005-05-13  4:36                 ` David S. Miller
@ 2005-05-13 13:25                   ` Herbert Xu
  0 siblings, 0 replies; 17+ messages in thread
From: Herbert Xu @ 2005-05-13 13:25 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Thu, May 12, 2005 at 09:36:18PM -0700, David S. Miller wrote:
> 
> Yes, when we walk the queue to retransmit, we chop up the
> TSO frame via tcp_trim_head() and tcp_fragment().  It is
> very loose and would need refinements though.

I'm also thinking that we might want to do segmentation in
response to SACKs.  So we'd start with a super-frame and keep
trimming bits off at the head as normal ACKs stream in.  Once
we get a SACK, we chop it down the middle to create two SKBs.

> I think we'd be OK, because even if we get a GFP_ATOMIC allocation
> failure, it's just like any other failure.  We're doing retransmits
> so a timer will be the worst case deadlock breaker should we get an
> allocation failure while chopping up the TSO frame.

Yep.

> This actually makes me worried about sk->sk_send_head advancement.
> I think we have a deadlock possible here.  If sk->sk_send_head is
> the only packet in the output queue, and the remote system has no
> reason to send us zero-window probes or any other packets, we could
> wedge if the skb_clone() fails for the tcp_transmit_skb() call.
> We won't set any timers if we fail to send that single sk->sk_send_head
> write queue packet, and we have no pending frames waiting to be ACK'd,
> so nothing will wake us up to retry the transmit.

AFAIK all the paths that do the initial send will start a timer if
it fails.  I've checked the callers of tcp_push_one, tcp_write_xmit
and tcp_write_wakeup.  I might've easily missed something though.
 
> That loop in there is bogus and unnecessary.  In the changelog for
> the TCP Reloaded patch I mention that this can be simplified into:
> 
> 	new_packets = skb_queue_len(&sk->sk_write_queue) - in_flight;
> 	cwnd -= in_flight;
> 	return min(new_packets, cwnd);

I think we still need to limit it to snd_wnd which is what the
loop was doing.

> To handle these super-TSO frames properly, we would need to add
> sk->sk_send_offset to go along with sk->sk_send_head.  So then
> update_send_head() would advance sk->sk_send_offset, and if that
> completely consumed sk->sk_send_head then the pointer would be
> advanced.

We probably shouldn't avoid segmentation altogether.  We could just
segment lazily.  So when we come around to actually transmitting
something, we will allocate an skb and cut off the data according
to the cwnd.  We should then store the segmented skb's on our send
queue instead of the super-frame since we've spent that effort
anyway and it's pretty unlikely for us to have to retransmit
something that crosses the point where we cut off the skb.
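
Roughly like this (tcp_carve_off() is invented, and the tcp_snd_test()
call is just shorthand for "the send tests pass"):

	/* lazily cut mss-sized pieces off the super-frame at transmit
	 * time, leaving each carved skb on the write queue in its place */
	while (tcp_snd_test(sk, skb, mss_now, nonagle)) {
		struct sk_buff *seg, *clone;

		seg = tcp_carve_off(sk, skb, mss_now, GFP_ATOMIC);
		if (seg == NULL)
			break;			/* timers recover later */

		clone = skb_clone(seg, GFP_ATOMIC);
		if (clone == NULL || tcp_transmit_skb(sk, clone))
			break;
	}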

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

end of thread

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-12  5:30 issue with new TCP TSO stuff David S. Miller
2005-05-12 10:05 ` Herbert Xu
2005-05-12 20:13   ` David S. Miller
2005-05-12 21:47     ` Herbert Xu
2005-05-12 22:10       ` Herbert Xu
2005-05-12 22:52         ` David S. Miller
2005-05-12 23:10           ` Herbert Xu
2005-05-12 23:24             ` David S. Miller
2005-05-12 23:52               ` Herbert Xu
2005-05-13  4:36                 ` David S. Miller
2005-05-13 13:25                   ` Herbert Xu
2005-05-12 22:46       ` David S. Miller
2005-05-12 14:13 ` Andi Kleen
2005-05-12 19:26   ` David S. Miller
     [not found]     ` <20050512200251.GA72662@muc.de>
2005-05-12 20:03       ` David S. Miller
2005-05-12 20:26         ` Andi Kleen
2005-05-12 22:34           ` David S. Miller
