TSO trimming question

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* TSO trimming question
@ 2007-12-19 21:46 Ilpo Järvinen
  2007-12-19 22:02 ` Ilpo Järvinen
  2007-12-20  7:54 ` David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-19 21:46 UTC (permalink / raw)
  To: Netdev, David Miller

Hi all,

I'm not fully sure what's purpose of this code in tcp_write_xmit:

                       if (skb->len < limit) {
                               unsigned int trim = skb->len % mss_now;

                               if (trim)
                                       limit = skb->len - trim;
			}

Is it used to make sure we send only multiples of mss_now here and leave 
the left-over into another skb? Or does it try to make sure that
tso_fragment result honors multiple of mss_now boundaries when snd_wnd
is the limitting factor? For latter IMHO this would be necessary:

			if (skb->len > limit)
				limit -= limit % mss_now;

-- 
 i.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-19 21:46 TSO trimming question Ilpo Järvinen
@ 2007-12-19 22:02 ` Ilpo Järvinen
  2007-12-20  3:26   ` Herbert Xu
  2007-12-20  7:54 ` David Miller
  1 sibling, 1 reply; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-19 22:02 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1717 bytes --]

On Wed, 19 Dec 2007, Ilpo Järvinen wrote:

> I'm not fully sure what's purpose of this code in tcp_write_xmit:
> 
>                        if (skb->len < limit) {
>                                unsigned int trim = skb->len % mss_now;
> 
>                                if (trim)
>                                        limit = skb->len - trim;
> 			}
> 
> Is it used to make sure we send only multiples of mss_now here and leave 
> the left-over into another skb? Or does it try to make sure that
> tso_fragment result honors multiple of mss_now boundaries when snd_wnd
> is the limitting factor? For latter IMHO this would be necessary:
> 
> 			if (skb->len > limit)
> 				limit -= limit % mss_now;

...Anyway, here's an untested patch to just do it :-):

-- 
 i.


[PATCH] [TCP]: Force tso skb split to mss_now boundary (if snd_wnd limits)

If snd_wnd was limitting factor, the tso_fragment might not
create full-sized skbs but would include the window left-overs
as well.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_output.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1c7ef17..8dafda9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1453,6 +1453,8 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle)
 
 				if (trim)
 					limit = skb->len - trim;
+			} else if (skb->len > limit) {
+				limit -= limit % mss_now;
 			}
 		}
 
@@ -1525,6 +1527,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
 
 				if (trim)
 					limit = skb->len - trim;
+			} else if (skb->len > limit) {
+				limit -= limit % mss_now;
 			}
 		}
 
-- 
1.5.0.6

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-19 22:02 ` Ilpo Järvinen
@ 2007-12-20  3:26   ` Herbert Xu
  2007-12-20  7:55     ` David Miller
  0 siblings, 1 reply; 26+ messages in thread
From: Herbert Xu @ 2007-12-20  3:26 UTC (permalink / raw)
  To: Ilpo J??rvinen; +Cc: davem, netdev

Ilpo J??rvinen <ilpo.jarvinen@helsinki.fi> wrote:
>
>> is the limitting factor? For latter IMHO this would be necessary:
>> 
>>                       if (skb->len > limit)
>>                               limit -= limit % mss_now;

Good catch! But how about putting this logic into tcp_window_allows
since then you can avoid the divide for the case when we're not
bound by the receiver window.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-19 21:46 TSO trimming question Ilpo Järvinen
  2007-12-19 22:02 ` Ilpo Järvinen
@ 2007-12-20  7:54 ` David Miller
  2007-12-20 11:40   ` Ilpo Järvinen
  1 sibling, 1 reply; 26+ messages in thread
From: David Miller @ 2007-12-20  7:54 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 19 Dec 2007 23:46:33 +0200 (EET)

> I'm not fully sure what's purpose of this code in tcp_write_xmit:
> 
>                        if (skb->len < limit) {
>                                unsigned int trim = skb->len % mss_now;
> 
>                                if (trim)
>                                        limit = skb->len - trim;
> 			}
> 
> Is it used to make sure we send only multiples of mss_now here and leave 
> the left-over into another skb? Or does it try to make sure that
> tso_fragment result honors multiple of mss_now boundaries when snd_wnd
> is the limitting factor? For latter IMHO this would be necessary:
> 
> 			if (skb->len > limit)
> 				limit -= limit % mss_now;

The purpose of the test is to make sure we process tail sub-mss chunks
correctly wrt. Nagle, which most closely matches the first purpose
you've listed.

So I think the calculation really does belong where it is.

Because of the way that the sendmsg() super-skb formation logic
works, we always will tack on more data and grow the tail
SKB before creating a new one.  So any sub-mss chunk at the
end of a TSO frame really is at the end of the write queue
and really should get nagle processing.

Actually, there is an exception, which is when we run out of
skb_frag_list slots.  In that case we'll potentially have breaks at
odd boundaries in the middle of the queue.  But this can only happen
in exceptional cases (user does tons of 1-byte sendfile()'s over
random non-consequetive locations of a file) or outright bugs
(MAX_SKB_FRAGS is defined incorrectly, for example) and thus this
situation is not worth coding for.

sendmsg()'s goal, when TSO is enabled, is to append to the write
queue tail SKB, and it does so until the MAX_SKB_FRAGS limit
is reached.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20  3:26   ` Herbert Xu
@ 2007-12-20  7:55     ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2007-12-20  7:55 UTC (permalink / raw)
  To: herbert; +Cc: ilpo.jarvinen, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 Dec 2007 11:26:18 +0800

> Ilpo J??rvinen <ilpo.jarvinen@helsinki.fi> wrote:
> >
> >> is the limitting factor? For latter IMHO this would be necessary:
> >> 
> >>                       if (skb->len > limit)
> >>                               limit -= limit % mss_now;
> 
> Good catch! But how about putting this logic into tcp_window_allows
> since then you can avoid the divide for the case when we're not
> bound by the receiver window.

Because the purpose of this calculation is different, as I
described in my other reply, I don't think this change is
correct, regardless of where it is placed :-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20  7:54 ` David Miller
@ 2007-12-20 11:40   ` Ilpo Järvinen
  2007-12-20 11:56     ` David Miller
  2007-12-20 12:00     ` David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-20 11:40 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, Herbert Xu

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4087 bytes --]

On Wed, 19 Dec 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Wed, 19 Dec 2007 23:46:33 +0200 (EET)
> 
> > I'm not fully sure what's purpose of this code in tcp_write_xmit:
> > 
> >                        if (skb->len < limit) {
> >                                unsigned int trim = skb->len % mss_now;
> > 
> >                                if (trim)
> >                                        limit = skb->len - trim;
> > 			}
> > 
> > Is it used to make sure we send only multiples of mss_now here and leave 
> > the left-over into another skb?

Yeah, I now understand that this part is correct. I somehow got such 
impression while trying to figure this out that it ends up being dead code 
but that wasn't correct thought from my side. However, it caught my 
attention and after some thinking I'd say there's more to handle here 
(covered by the second question).

Also note that patch I sent earlier is not right either but needs some 
refining to do the right thing.

> > Or does it try to make sure that
> > tso_fragment result honors multiple of mss_now boundaries when snd_wnd
> > is the limitting factor? For latter IMHO this would be necessary:
> > 
> > 			if (skb->len > limit)
> > 				limit -= limit % mss_now;
> 
> The purpose of the test is to make sure we process tail sub-mss chunks
> correctly wrt. Nagle, which most closely matches the first purpose
> you've listed.
> 
> So I think the calculation really does belong where it is.
>
> Because of the way that the sendmsg() super-skb formation logic
> works, we always will tack on more data and grow the tail
> SKB before creating a new one.  So any sub-mss chunk at the
> end of a TSO frame really is at the end of the write queue
> and really should get nagle processing.

Yes, I now agree this is fully correct for this task.

> Actually, there is an exception, which is when we run out of
> skb_frag_list slots.  In that case we'll potentially have breaks at
> odd boundaries in the middle of the queue.  But this can only happen
> in exceptional cases (user does tons of 1-byte sendfile()'s over
> random non-consequetive locations of a file) or outright bugs
> (MAX_SKB_FRAGS is defined incorrectly, for example) and thus this
> situation is not worth coding for.

That's not the only case, IMHO if there's odd boundary due to 
snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't 
consider it as odd but break the skb at arbitary point resulting
two small segments to the network, and what's worse, when the later skb 
resulting from the first split is matching skb->len < limit check as well 
causing an unnecessary small skb to be created for nagle purpose too, 
solving it fully requires some thought in case the mss_now != mss_cache 
even if non-odd boundaries are honored in the middle of skb.

Though whether we get there is depending on what tcp_tso_should_defer() 
decided. Hmm, there seems to be an unrelated bug in it as well :-/. A 
patch below. Please consider the fact that enabling TSO deferring may 
have some unpleasant effect to TCP dynamics, considering that I don't find 
stable mandatory for this to avoid breaking, besides things have been 
working quite well without it too... Only compile tested.

-- 
 i.

--
[PATCH] [TCP]: Fix TSO deferring

I'd say that most of what tcp_tso_should_defer had in between
there was dead code because of this.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_output.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8dafda9..693b9f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 		goto send_now;

 	/* Defer for less than two clock ticks. */
-	if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
+	if (tp->tso_deferred &&
+	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
 		goto send_now;

 	in_flight = tcp_packets_in_flight(tp);
-- 
1.5.0.6

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 11:40   ` Ilpo Järvinen
@ 2007-12-20 11:56     ` David Miller
  2007-12-20 16:02       ` John Heffner
  2007-12-21  8:06       ` Bill Fink
  2007-12-20 12:00     ` David Miller
  1 sibling, 2 replies; 26+ messages in thread
From: David Miller @ 2007-12-20 11:56 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, herbert, jheffner

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

> [PATCH] [TCP]: Fix TSO deferring
> 
> I'd say that most of what tcp_tso_should_defer had in between
> there was dead code because of this.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Yikes!!!!!

John, we've been living a lie for more than a year. :-/

On the bright side this explains a lot of small TSO frames I've been
seeing in traces over the past year but never got a chance to
investigate.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 8dafda9..693b9f6 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>  		goto send_now;
>  
>  	/* Defer for less than two clock ticks. */
> -	if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
> +	if (tp->tso_deferred &&
> +	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
>  		goto send_now;
>  
>  	in_flight = tcp_packets_in_flight(tp);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 11:40   ` Ilpo Järvinen
  2007-12-20 11:56     ` David Miller
@ 2007-12-20 12:00     ` David Miller
  2007-12-20 12:35       ` Ilpo Järvinen
  2007-12-20 14:00       ` Herbert Xu
  1 sibling, 2 replies; 26+ messages in thread
From: David Miller @ 2007-12-20 12:00 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, herbert

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

> That's not the only case, IMHO if there's odd boundary due to 
> snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't 
> consider it as odd but break the skb at arbitary point resulting
> two small segments to the network, and what's worse, when the later skb 
> resulting from the first split is matching skb->len < limit check as well 
> causing an unnecessary small skb to be created for nagle purpose too, 
> solving it fully requires some thought in case the mss_now != mss_cache 
> even if non-odd boundaries are honored in the middle of skb.

In the most ideal sense, tcp_window_allows() should probably
be changed to only return MSS multiples.

Unfortunately this would add an expensive modulo operation,
however I think it would elimiate this problem case.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 12:00     ` David Miller
@ 2007-12-20 12:35       ` Ilpo Järvinen
  2007-12-20 14:00       ` Herbert Xu
  1 sibling, 0 replies; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-20 12:35 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, Herbert Xu

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1190 bytes --]

On Thu, 20 Dec 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)
> 
> > That's not the only case, IMHO if there's odd boundary due to 
> > snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't 
> > consider it as odd but break the skb at arbitary point resulting
> > two small segments to the network, and what's worse, when the later skb 
> > resulting from the first split is matching skb->len < limit check as well 
> > causing an unnecessary small skb to be created for nagle purpose too, 
> > solving it fully requires some thought in case the mss_now != mss_cache 
> > even if non-odd boundaries are honored in the middle of skb.
> 
> In the most ideal sense, tcp_window_allows() should probably
> be changed to only return MSS multiples.

That's what Herbert suggested already, I'll send a patch later
on... :-)

> Unfortunately this would add an expensive modulo operation,
> however I think it would elimiate this problem case.

Yes. Should we still call tcp_minshall_update() if split in the middle of 
wq results in smaller than MSS tail (occurs only if mss_now != mss_cache)?

-- 
 i.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 12:00     ` David Miller
  2007-12-20 12:35       ` Ilpo Järvinen
@ 2007-12-20 14:00       ` Herbert Xu
  2007-12-20 23:55         ` David Miller
  1 sibling, 1 reply; 26+ messages in thread
From: Herbert Xu @ 2007-12-20 14:00 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, netdev

On Thu, Dec 20, 2007 at 04:00:37AM -0800, David Miller wrote:
>
> In the most ideal sense, tcp_window_allows() should probably
> be changed to only return MSS multiples.
> 
> Unfortunately this would add an expensive modulo operation,
> however I think it would elimiate this problem case.

Well you only have to divide in the unlikely case of us being
limited by the receiver window.  In that case speed is probably
not of the essence anyway.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 11:56     ` David Miller
@ 2007-12-20 16:02       ` John Heffner
  2007-12-21  4:36         ` David Miller
  2007-12-21  8:06       ` Bill Fink
  1 sibling, 1 reply; 26+ messages in thread
From: John Heffner @ 2007-12-20 16:02 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, netdev, herbert

David Miller wrote:
> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)
> 
>> [PATCH] [TCP]: Fix TSO deferring
>>
>> I'd say that most of what tcp_tso_should_defer had in between
>> there was dead code because of this.
>>
>> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> 
> Yikes!!!!!
> 
> John, we've been living a lie for more than a year. :-/
> 
> On the bright side this explains a lot of small TSO frames I've been
> seeing in traces over the past year but never got a chance to
> investigate.

Ouch.  This fix may improve some benchmarks.

Re-checking this function was on my list of things to do because I had 
also noticed some TSO frames that seemed a bit small.  This clearly 
explains it.

   -John

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 14:00       ` Herbert Xu
@ 2007-12-20 23:55         ` David Miller
  2007-12-21 18:55           ` [PATCH] [TCP]: Force TSO splits to MSS boundaries Ilpo Järvinen
  0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2007-12-20 23:55 UTC (permalink / raw)
  To: herbert; +Cc: ilpo.jarvinen, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 20 Dec 2007 22:00:12 +0800

> On Thu, Dec 20, 2007 at 04:00:37AM -0800, David Miller wrote:
> >
> > In the most ideal sense, tcp_window_allows() should probably
> > be changed to only return MSS multiples.
> > 
> > Unfortunately this would add an expensive modulo operation,
> > however I think it would elimiate this problem case.
> 
> Well you only have to divide in the unlikely case of us being
> limited by the receiver window.  In that case speed is probably
> not of the essence anyway.

Agreed, to some extent.

I say "to some extent" because it might be realistic, with
lots (millions) of sockets to hit this case a lot.

There are so many things that are a "don't care" performance
wise until you have a lot of stinky connections over crappy
links.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 16:02       ` John Heffner
@ 2007-12-21  4:36         ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2007-12-21  4:36 UTC (permalink / raw)
  To: jheffner; +Cc: ilpo.jarvinen, netdev, herbert

From: John Heffner <jheffner@psc.edu>
Date: Thu, 20 Dec 2007 11:02:21 -0500

> David Miller wrote:
> > From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> > Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)
> > 
> >> [PATCH] [TCP]: Fix TSO deferring
> >>
> >> I'd say that most of what tcp_tso_should_defer had in between
> >> there was dead code because of this.
> >>
> >> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > 
> > Yikes!!!!!
> > 
> > John, we've been living a lie for more than a year. :-/
> > 
> > On the bright side this explains a lot of small TSO frames I've been
> > seeing in traces over the past year but never got a chance to
> > investigate.
> 
> Ouch.  This fix may improve some benchmarks.
> 
> Re-checking this function was on my list of things to do because I had 
> also noticed some TSO frames that seemed a bit small.  This clearly 
> explains it.

What I'll do for now is I'll put this into net-2.6.25 to let it
"cook" for a while.  It's an obvious fix but it's enabling code
that's effectively been disabled for more than a year so something
might turn up that we need to fix.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-20 11:56     ` David Miller
  2007-12-20 16:02       ` John Heffner
@ 2007-12-21  8:06       ` Bill Fink
  2007-12-21  9:26         ` Ilpo Järvinen
  2007-12-21  9:27         ` David Miller
  1 sibling, 2 replies; 26+ messages in thread
From: Bill Fink @ 2007-12-21  8:06 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, netdev, herbert, jheffner

On Thu, 20 Dec 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)
> 
> > [PATCH] [TCP]: Fix TSO deferring
> > 
> > I'd say that most of what tcp_tso_should_defer had in between
> > there was dead code because of this.
> > 
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> 
> Yikes!!!!!
> 
> John, we've been living a lie for more than a year. :-/
> 
> On the bright side this explains a lot of small TSO frames I've been
> seeing in traces over the past year but never got a chance to
> investigate.
> 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 8dafda9..693b9f6 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
> >  		goto send_now;
> >  
> >  	/* Defer for less than two clock ticks. */
> > -	if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
> > +	if (tp->tso_deferred &&
> > +	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
> >  		goto send_now;
> >  
> >  	in_flight = tcp_packets_in_flight(tp);

I meant to ask about this a while back but then got distracted by
other things.  But now since the subject has come up, I had a couple
of more questions about this code.

What's with all the shifting back and forth?  Here with:

	((jiffies<<1)>>1) - (tp->tso_deferred>>1)

and later with:

	/* Ok, it looks like it is advisable to defer.  */
	tp->tso_deferred = 1 | (jiffies<<1);

Is this just done to try and avoid the special case of jiffies==0 
when the jiffies wrap?  If so it seems like a lot of unnecessary
work just to avoid a 1 in 4 billion event, since it's my understanding
that the whole tcp_tso_should_defer function is just an optimization
and not a criticality to the proper functioning of TCP, especially
considering it hasn't even been executing at all up to now.

My second question is more basic and if I'm not mistaken actually
relates to a remaining bug in the (corrected) test:

	/* Defer for less than two clock ticks. */
	if (tp->tso_deferred &&
	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)

Since jiffies is an unsigned long, which is 64-bits on a 64-bit system,
whereas tp->tso_deferred is a u32, once jiffies exceeds 31-bits, which
will happen in about 25 days if HZ=1000, won't the second part of the
test always be true after that?  Or am I missing something obvious?

						-Bill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21  8:06       ` Bill Fink
@ 2007-12-21  9:26         ` Ilpo Järvinen
  2007-12-21  9:28           ` Ilpo Järvinen
  2007-12-21  9:27         ` David Miller
  1 sibling, 1 reply; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-21  9:26 UTC (permalink / raw)
  To: Bill Fink; +Cc: David Miller, Netdev, Herbert Xu, John Heffner

On Fri, 21 Dec 2007, Bill Fink wrote:

> I meant to ask about this a while back but then got distracted by
> other things.  But now since the subject has come up, I had a couple
> of more questions about this code.
> 
> What's with all the shifting back and forth?  Here with:
> 
> 	((jiffies<<1)>>1) - (tp->tso_deferred>>1)
> 
> and later with:
> 
> 	/* Ok, it looks like it is advisable to defer.  */
> 	tp->tso_deferred = 1 | (jiffies<<1);
> 
> Is this just done to try and avoid the special case of jiffies==0 
> when the jiffies wrap? 

I thought that it must be the reason. I couldn't figure out any other 
explination while thinking the same thing but since I saw no problem in 
that rather weird approach, I didn't want to touch that in a patch which 
had net-2.6 (or stable) potential.

> If so it seems like a lot of unnecessary
> work just to avoid a 1 in 4 billion event, since it's my understanding
> that the whole tcp_tso_should_defer function is just an optimization
> and not a criticality to the proper functioning of TCP, especially
> considering it hasn't even been executing at all up to now.

It would still be good to not return 1 in that case we didn't flag the 
deferral, how about adding one additional tick (+comment) in the zero 
case making tso_deferred == 0 again unambiguously defined, e.g.:

	tp->tso_deferred = min_t(u32, jiffies, 1);

...I'm relatively sure that nobody would ever notice that tick :-) and we 
kept return value consistent with tso_deferred state invariant.

> My second question is more basic and if I'm not mistaken actually
> relates to a remaining bug in the (corrected) test:
> 
> 	/* Defer for less than two clock ticks. */
> 	if (tp->tso_deferred &&
> 	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
> 
> Since jiffies is an unsigned long, which is 64-bits on a 64-bit system,
> whereas tp->tso_deferred is a u32, once jiffies exceeds 31-bits, which
> will happen in about 25 days if HZ=1000, won't the second part of the
> test always be true after that?  Or am I missing something obvious?

I didn't notice that earlier but I think you're right though my knowledge 
about jiffies and such is quite limited.

...Feel free to submit a patch to correct these.


-- 
 i.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21  8:06       ` Bill Fink
  2007-12-21  9:26         ` Ilpo Järvinen
@ 2007-12-21  9:27         ` David Miller
  2007-12-21  9:29           ` Herbert Xu
  1 sibling, 1 reply; 26+ messages in thread
From: David Miller @ 2007-12-21  9:27 UTC (permalink / raw)
  To: billfink; +Cc: ilpo.jarvinen, netdev, herbert, jheffner

From: Bill Fink <billfink@mindspring.com>
Date: Fri, 21 Dec 2007 03:06:48 -0500

> What's with all the shifting back and forth?  Here with:
> 
> 	((jiffies<<1)>>1) - (tp->tso_deferred>>1)
> 
> and later with:
> 
> 	/* Ok, it looks like it is advisable to defer.  */
> 	tp->tso_deferred = 1 | (jiffies<<1);
> 
> Is this just done to try and avoid the special case of jiffies==0 
> when the jiffies wrap?  If so it seems like a lot of unnecessary
> work just to avoid a 1 in 4 billion event, since it's my understanding
> that the whole tcp_tso_should_defer function is just an optimization
> and not a criticality to the proper functioning of TCP, especially
> considering it hasn't even been executing at all up to now.

How else would you avoid the incorrect result when jiffies is
indeed zero?

It's two shifts, and this gets scheduled along with the other
instructions on many cpus so it's effectively free.

I don't see why this is even worth mentioning and discussing.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21  9:26         ` Ilpo Järvinen
@ 2007-12-21  9:28           ` Ilpo Järvinen
  0 siblings, 0 replies; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-21  9:28 UTC (permalink / raw)
  To: Bill Fink; +Cc: David Miller, Netdev, Herbert Xu, John Heffner

[-- Attachment #1: Type: TEXT/PLAIN, Size: 717 bytes --]

On Fri, 21 Dec 2007, Ilpo Järvinen wrote:

> On Fri, 21 Dec 2007, Bill Fink wrote:
> 
> > If so it seems like a lot of unnecessary
> > work just to avoid a 1 in 4 billion event, since it's my understanding
> > that the whole tcp_tso_should_defer function is just an optimization
> > and not a criticality to the proper functioning of TCP, especially
> > considering it hasn't even been executing at all up to now.
> 
> It would still be good to not return 1 in that case we didn't flag the 
> deferral, how about adding one additional tick (+comment) in the zero 
> case making tso_deferred == 0 again unambiguously defined, e.g.:
> 
> 	tp->tso_deferred = min_t(u32, jiffies, 1);

Blah, max_t of course :-).


-- 
 i.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21  9:27         ` David Miller
@ 2007-12-21  9:29           ` Herbert Xu
  2007-12-21  9:36             ` David Miller
  0 siblings, 1 reply; 26+ messages in thread
From: Herbert Xu @ 2007-12-21  9:29 UTC (permalink / raw)
  To: David Miller; +Cc: billfink, ilpo.jarvinen, netdev, jheffner

On Fri, Dec 21, 2007 at 01:27:20AM -0800, David Miller wrote:
>
> It's two shifts, and this gets scheduled along with the other
> instructions on many cpus so it's effectively free.
> 
> I don't see why this is even worth mentioning and discussing.

I totally agree.  Two shifts are way better than a branch.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21  9:29           ` Herbert Xu
@ 2007-12-21  9:36             ` David Miller
  2007-12-21 10:58               ` Bill Fink
  0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2007-12-21  9:36 UTC (permalink / raw)
  To: herbert; +Cc: billfink, ilpo.jarvinen, netdev, jheffner

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 21 Dec 2007 17:29:27 +0800

> On Fri, Dec 21, 2007 at 01:27:20AM -0800, David Miller wrote:
> >
> > It's two shifts, and this gets scheduled along with the other
> > instructions on many cpus so it's effectively free.
> > 
> > I don't see why this is even worth mentioning and discussing.
> 
> I totally agree.  Two shifts are way better than a branch.

We take probably a thousand+ 100+ cycle cache misses in the TCP stack
on big window openning ACKs.

Instead of discussing ways to solve that huge performance killer we're
wanking about two friggin' integer shifts.

It's hilarious isn't it? :-)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21  9:36             ` David Miller
@ 2007-12-21 10:58               ` Bill Fink
  2007-12-21 18:54                 ` Bill Fink
  0 siblings, 1 reply; 26+ messages in thread
From: Bill Fink @ 2007-12-21 10:58 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, ilpo.jarvinen, netdev, jheffner

On Fri, 21 Dec 2007, David Miller wrote:

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 21 Dec 2007 17:29:27 +0800
> 
> > On Fri, Dec 21, 2007 at 01:27:20AM -0800, David Miller wrote:
> > >
> > > It's two shifts, and this gets scheduled along with the other
> > > instructions on many cpus so it's effectively free.
> > > 
> > > I don't see why this is even worth mentioning and discussing.
> > 
> > I totally agree.  Two shifts are way better than a branch.
> 
> We take probably a thousand+ 100+ cycle cache misses in the TCP stack
> on big window openning ACKs.
> 
> Instead of discussing ways to solve that huge performance killer we're
> wanking about two friggin' integer shifts.
> 
> It's hilarious isn't it? :-)

I don't think obfuscated code is hilarious.  Instead of the convoluted
and dense code:

	/* Defer for less than two clock ticks. */
	if (tp->tso_deferred &&
	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)

You can have the much simpler and more easily understandable:

	/* Defer for less than two clock ticks. */
	if (tp->tso_deferred && (jiffies - tp->tso_deferred) > 1)

And instead of:

	/* Ok, it looks like it is advisable to defer.  */
        tp->tso_deferred = 1 | (jiffies<<1);

        return 1;

You could do as Ilpo suggested:

	/* Ok, it looks like it is advisable to defer.  */
	tp->tso_deferred = max_t(u32, jiffies, 1);

        return 1;

Or perhaps more efficiently:

	/* Ok, it looks like it is advisable to defer.  */
	tp->tso_deferred = jiffies;
	if (unlikely(jiffies == 0))
		tp->tso_deferred = 1;

        return 1;

Or perhaps even:

	/* Ok, it looks like it is advisable to defer.  */
	tp->tso_deferred = jiffies;

	/* need to return a non-zero value to defer, which means won't
	 * defer if jiffies == 0 but it's only a 1 in 4 billion event
	 * (and avoids a compare/branch by not checking jiffies)
	 /
        return jiffies;

Since it really only needs a non-zero return value to defer.

See, no branches needed and much clearer code.  That seems worthwhile
to me from a code maintenance standpoint, even if it isn't any speed
improvement.

And what about the 64-bit jiffies versus 32-bit tp->tso_deferred issue?
Should tso_deferred be made unsigned long to match jiffies?

						-Bill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21 10:58               ` Bill Fink
@ 2007-12-21 18:54                 ` Bill Fink
  2007-12-21 18:58                   ` Ilpo Järvinen
  0 siblings, 1 reply; 26+ messages in thread
From: Bill Fink @ 2007-12-21 18:54 UTC (permalink / raw)
  To: Bill Fink; +Cc: David Miller, herbert, ilpo.jarvinen, netdev, jheffner

On Fri, 21 Dec 2007, Bill Fink wrote:

> Or perhaps even:
> 
> 	/* Ok, it looks like it is advisable to defer.  */
> 	tp->tso_deferred = jiffies;
> 
> 	/* need to return a non-zero value to defer, which means won't
> 	 * defer if jiffies == 0 but it's only a 1 in 4 billion event
> 	 * (and avoids a compare/branch by not checking jiffies)
> 	 /
>         return jiffies;

Ack.  I introduced my own 64-bit to 32-bit issue (too late at night).
How about:

	/* Ok, it looks like it is advisable to defer.  */
	tp->tso_deferred = jiffies;

	/* this won't defer if jiffies == 0 but it's only a 1 in
	 * 4 billion event (and avoids a branch)
	 */
	return (jiffies != 0);

							-Bill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] [TCP]: Force TSO splits to MSS boundaries
  2007-12-20 23:55         ` David Miller
@ 2007-12-21 18:55           ` Ilpo Järvinen
  2007-12-21 20:01             ` Ilpo Järvinen
  2007-12-25  5:35             ` David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-21 18:55 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4791 bytes --]

On Thu, 20 Dec 2007, David Miller wrote:

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Thu, 20 Dec 2007 22:00:12 +0800
> 
> > On Thu, Dec 20, 2007 at 04:00:37AM -0800, David Miller wrote:
> > >
> > > In the most ideal sense, tcp_window_allows() should probably
> > > be changed to only return MSS multiples.
> > > 
> > > Unfortunately this would add an expensive modulo operation,
> > > however I think it would elimiate this problem case.
> > 
> > Well you only have to divide in the unlikely case of us being
> > limited by the receiver window.  In that case speed is probably
> > not of the essence anyway.
> 
> Agreed, to some extent.
> 
> I say "to some extent" because it might be realistic, with
> lots (millions) of sockets to hit this case a lot.
> 
> There are so many things that are a "don't care" performance
> wise until you have a lot of stinky connections over crappy
> links.

How about this, I had to use another approach due to reasons
outlined in the commit message:

--
[PATCH] [TCP]: Force TSO splits to MSS boundaries

If snd_wnd - snd_nxt wasn't multiple of MSS, skb was split on
odd boundary by the callers of tcp_window_allows.

We try really hard to avoid unnecessary modulos. Therefore the
old caller side check "if (skb->len < limit)" was too wide as
well because limit is not bound in any way to skb->len and can
cause spurious testing for trimming in the middle of the queue
while we only wanted that to happen at the tail of the queue.
A simple additional caller side check for tcp_write_queue_tail
would likely have resulted 2 x modulos because the limit would
have to be first calculated from window, however, doing that
unnecessary modulo is not mandatory. After a minor change to
the algorithm, simply determine first if the modulo is needed
at all and at that point immediately decide also from which
value it should be calculated from.

This approach also kills some duplicated code.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_output.c |   51 ++++++++++++++++++++++++-------------------------
 1 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1852698..ea92a1b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1008,13 +1008,29 @@ static void tcp_cwnd_validate(struct sock *sk)
 	}
 }
 
-static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff *skb, unsigned int mss_now, unsigned int cwnd)
+/* Returns the portion of skb which can be sent right away without
+ * introducing MSS oddities to segment boundaries. In rare cases where
+ * mss_now != mss_cache, we will request caller to create a small skb
+ * per input skb which could be mostly avoided here (if desired).
+ */
+static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb,
+					unsigned int mss_now,
+					unsigned int cwnd)
 {
-	u32 window, cwnd_len;
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 needed, window, cwnd_len;
 
 	window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq);
 	cwnd_len = mss_now * cwnd;
-	return min(window, cwnd_len);
+
+	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
+		return cwnd_len;
+
+	if (skb == tcp_write_queue_tail(sk) && cwnd_len <= skb->len)
+		return cwnd_len;
+
+	needed = min(skb->len, window);
+	return needed - needed % mss_now;
 }
 
 /* Can at least one segment of SKB be sent right now, according to the
@@ -1445,17 +1461,9 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle)
 		}
 
 		limit = mss_now;
-		if (tso_segs > 1) {
-			limit = tcp_window_allows(tp, skb,
-						  mss_now, cwnd_quota);
-
-			if (skb->len < limit) {
-				unsigned int trim = skb->len % mss_now;
-
-				if (trim)
-					limit = skb->len - trim;
-			}
-		}
+		if (tso_segs > 1)
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    cwnd_quota);
 
 		if (skb->len > limit &&
 		    unlikely(tso_fragment(sk, skb, limit, mss_now)))
@@ -1502,7 +1510,6 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
  */
 void tcp_push_one(struct sock *sk, unsigned int mss_now)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb = tcp_send_head(sk);
 	unsigned int tso_segs, cwnd_quota;
 
@@ -1517,17 +1524,9 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
 		BUG_ON(!tso_segs);
 
 		limit = mss_now;
-		if (tso_segs > 1) {
-			limit = tcp_window_allows(tp, skb,
-						  mss_now, cwnd_quota);
-
-			if (skb->len < limit) {
-				unsigned int trim = skb->len % mss_now;
-
-				if (trim)
-					limit = skb->len - trim;
-			}
-		}
+		if (tso_segs > 1)
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    cwnd_quota);
 
 		if (skb->len > limit &&
 		    unlikely(tso_fragment(sk, skb, limit, mss_now)))
-- 
1.5.0.6

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21 18:54                 ` Bill Fink
@ 2007-12-21 18:58                   ` Ilpo Järvinen
  2007-12-21 19:37                     ` Bill Fink
  0 siblings, 1 reply; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-21 18:58 UTC (permalink / raw)
  To: Bill Fink; +Cc: David Miller, Herbert Xu, Netdev, John Heffner

On Fri, 21 Dec 2007, Bill Fink wrote:

> On Fri, 21 Dec 2007, Bill Fink wrote:
> 
> > Or perhaps even:
> > 
> > 	/* Ok, it looks like it is advisable to defer.  */
> > 	tp->tso_deferred = jiffies;
> > 
> > 	/* need to return a non-zero value to defer, which means won't
> > 	 * defer if jiffies == 0 but it's only a 1 in 4 billion event
> > 	 * (and avoids a compare/branch by not checking jiffies)
> > 	 /
> >         return jiffies;
> 
> Ack.  I introduced my own 64-bit to 32-bit issue (too late at night).
> How about:
> 
> 	/* Ok, it looks like it is advisable to defer.  */
> 	tp->tso_deferred = jiffies;
> 
> 	/* this won't defer if jiffies == 0 but it's only a 1 in
> 	 * 4 billion event (and avoids a branch)
> 	 */
> 	return (jiffies != 0);

I'm not sure how the jiffies work but is this racy as well?

Simple return tp->tso_deferred; should work, shouldn't it? :-)


-- 
 i.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: TSO trimming question
  2007-12-21 18:58                   ` Ilpo Järvinen
@ 2007-12-21 19:37                     ` Bill Fink
  0 siblings, 0 replies; 26+ messages in thread
From: Bill Fink @ 2007-12-21 19:37 UTC (permalink / raw)
  To: Ilpo Järvinen ; +Cc: David Miller, Herbert Xu, Netdev, John Heffner

On Fri, 21 Dec 2007, Ilpo Järvinen wrote:

> On Fri, 21 Dec 2007, Bill Fink wrote:
> 
> > On Fri, 21 Dec 2007, Bill Fink wrote:
> > 
> > > Or perhaps even:
> > > 
> > > 	/* Ok, it looks like it is advisable to defer.  */
> > > 	tp->tso_deferred = jiffies;
> > > 
> > > 	/* need to return a non-zero value to defer, which means won't
> > > 	 * defer if jiffies == 0 but it's only a 1 in 4 billion event
> > > 	 * (and avoids a compare/branch by not checking jiffies)
> > > 	 /
> > >         return jiffies;
> > 
> > Ack.  I introduced my own 64-bit to 32-bit issue (too late at night).
> > How about:
> > 
> > 	/* Ok, it looks like it is advisable to defer.  */
> > 	tp->tso_deferred = jiffies;
> > 
> > 	/* this won't defer if jiffies == 0 but it's only a 1 in
> > 	 * 4 billion event (and avoids a branch)
> > 	 */
> > 	return (jiffies != 0);
> 
> I'm not sure how the jiffies work but is this racy as well?
> 
> Simple return tp->tso_deferred; should work, shouldn't it? :-)

As long as tp->tso_deferred remains u32, pending the other issue.

						-Bill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] [TCP]: Force TSO splits to MSS boundaries
  2007-12-21 18:55           ` [PATCH] [TCP]: Force TSO splits to MSS boundaries Ilpo Järvinen
@ 2007-12-21 20:01             ` Ilpo Järvinen
  2007-12-25  5:35             ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: Ilpo Järvinen @ 2007-12-21 20:01 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3013 bytes --]

On Fri, 21 Dec 2007, Ilpo Järvinen wrote:

> How about this, I had to use another approach due to reasons
> outlined in the commit message:
> 
> --
> [PATCH] [TCP]: Force TSO splits to MSS boundaries
> 
> If snd_wnd - snd_nxt wasn't multiple of MSS, skb was split on
> odd boundary by the callers of tcp_window_allows.
> 
> We try really hard to avoid unnecessary modulos. Therefore the
> old caller side check "if (skb->len < limit)" was too wide as
> well because limit is not bound in any way to skb->len and can
> cause spurious testing for trimming in the middle of the queue
> while we only wanted that to happen at the tail of the queue.
> A simple additional caller side check for tcp_write_queue_tail
> would likely have resulted 2 x modulos because the limit would
> have to be first calculated from window, however, doing that
> unnecessary modulo is not mandatory. After a minor change to
> the algorithm, simply determine first if the modulo is needed
> at all and at that point immediately decide also from which
> value it should be calculated from.
> 
> This approach also kills some duplicated code.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  net/ipv4/tcp_output.c |   51 ++++++++++++++++++++++++-------------------------
>  1 files changed, 25 insertions(+), 26 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 1852698..ea92a1b 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1008,13 +1008,29 @@ static void tcp_cwnd_validate(struct sock *sk)
>  	}
>  }
>  
> -static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff *skb, unsigned int mss_now, unsigned int cwnd)
> +/* Returns the portion of skb which can be sent right away without
> + * introducing MSS oddities to segment boundaries. In rare cases where
> + * mss_now != mss_cache, we will request caller to create a small skb
> + * per input skb which could be mostly avoided here (if desired).
> + */
> +static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb,
> +					unsigned int mss_now,
> +					unsigned int cwnd)
>  {
> -	u32 window, cwnd_len;
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	u32 needed, window, cwnd_len;
>  
>  	window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq);
>  	cwnd_len = mss_now * cwnd;
> -	return min(window, cwnd_len);
> +
> +	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
> +		return cwnd_len;
> +
> +	if (skb == tcp_write_queue_tail(sk) && cwnd_len <= skb->len)
> +		return cwnd_len;

...if desired that this won't cause small skbs in the middle of the queue 
(except in case where just the sub-MSS portion is left out above window), 
something like this could be added here (again, avoiding more modulos):

	if (skb != tcp_write_queue_tail(sk) && window > skb->len)
		return skb->len;

> +	needed = min(skb->len, window);
> +	return needed - needed % mss_now;
>  }
>  
>  /* Can at least one segment of SKB be sent right now, according to the

-- 
 i.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] [TCP]: Force TSO splits to MSS boundaries
  2007-12-21 18:55           ` [PATCH] [TCP]: Force TSO splits to MSS boundaries Ilpo Järvinen
  2007-12-21 20:01             ` Ilpo Järvinen
@ 2007-12-25  5:35             ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: David Miller @ 2007-12-25  5:35 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: herbert, netdev

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 21 Dec 2007 20:55:28 +0200 (EET)

> [PATCH] [TCP]: Force TSO splits to MSS boundaries
> 
> If snd_wnd - snd_nxt wasn't multiple of MSS, skb was split on
> odd boundary by the callers of tcp_window_allows.
> 
> We try really hard to avoid unnecessary modulos. Therefore the
> old caller side check "if (skb->len < limit)" was too wide as
> well because limit is not bound in any way to skb->len and can
> cause spurious testing for trimming in the middle of the queue
> while we only wanted that to happen at the tail of the queue.
> A simple additional caller side check for tcp_write_queue_tail
> would likely have resulted 2 x modulos because the limit would
> have to be first calculated from window, however, doing that
> unnecessary modulo is not mandatory. After a minor change to
> the algorithm, simply determine first if the modulo is needed
> at all and at that point immediately decide also from which
> value it should be calculated from.
> 
> This approach also kills some duplicated code.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Looks good, patch applied, thanks.

With respect to code duplicateion, tcp_push_one() is essentially
the inner loop of tcp_write_xmit() :-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2007-12-25  5:35 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-19 21:46 TSO trimming question Ilpo Järvinen
2007-12-19 22:02 ` Ilpo Järvinen
2007-12-20  3:26   ` Herbert Xu
2007-12-20  7:55     ` David Miller
2007-12-20  7:54 ` David Miller
2007-12-20 11:40   ` Ilpo Järvinen
2007-12-20 11:56     ` David Miller
2007-12-20 16:02       ` John Heffner
2007-12-21  4:36         ` David Miller
2007-12-21  8:06       ` Bill Fink
2007-12-21  9:26         ` Ilpo Järvinen
2007-12-21  9:28           ` Ilpo Järvinen
2007-12-21  9:27         ` David Miller
2007-12-21  9:29           ` Herbert Xu
2007-12-21  9:36             ` David Miller
2007-12-21 10:58               ` Bill Fink
2007-12-21 18:54                 ` Bill Fink
2007-12-21 18:58                   ` Ilpo Järvinen
2007-12-21 19:37                     ` Bill Fink
2007-12-20 12:00     ` David Miller
2007-12-20 12:35       ` Ilpo Järvinen
2007-12-20 14:00       ` Herbert Xu
2007-12-20 23:55         ` David Miller
2007-12-21 18:55           ` [PATCH] [TCP]: Force TSO splits to MSS boundaries Ilpo Järvinen
2007-12-21 20:01             ` Ilpo Järvinen
2007-12-25  5:35             ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).