* design for TSO performance fix
@ 2005-01-28 0:31 David S. Miller
2005-01-28 0:51 ` Rick Jones
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: David S. Miller @ 2005-01-28 0:31 UTC (permalink / raw)
To: netdev
Ok, here is the best idea I've been able to come up with
so far.
The basic idea is that we stop trying to build TSO frames
in the actual transmit queue. Instead, TSO packets are
built on the fly when we actually output packets from the
transmit queue.
Advantages:
1) No knowledge of TSO frames needs to exist anywhere besides
tcp_write_xmit(), tcp_transmit_skb(), and
tcp_xmit_retransmit_queue()
2) As a result of #1, all the pcount crap goes away.
The need for two MSS state variables (mss_cache
and mss_cache_std) and the associated complexity is
eliminated as well.
3) Keeping TSO enabled after packet loss "just works".
4) CWND is sampled at the correct moment when deciding
the TSO packet arity.
The one disadvantage is that it might be a tiny bit more
expensive to build TSO frames. But I am sure we can find
ways to optimize that quite well.
The main element of the TSO output logic is a function
that is schemed as follows:
static inline int tcp_skb_data_all_paged(struct sk_buff *skb)
{
	return (skb->len == skb->data_len);
}

/* If possible, append paged data of SRC_SKB onto the
 * tail of DST_SKB.
 */
static int skb_append_pages(struct sk_buff *dst_skb, struct sk_buff *src_skb)
{
	int i;

	if (!tcp_skb_data_all_paged(src_skb))
		return -EINVAL;

	for (i = 0; i < skb_shinfo(src_skb)->nr_frags; i++) {
		skb_frag_t *src_frag = &skb_shinfo(src_skb)->frags[i];
		skb_frag_t *dst_frag;
		int dst_frag_idx;

		dst_frag_idx = skb_shinfo(dst_skb)->nr_frags;

		if (skb_can_coalesce(dst_skb, dst_frag_idx,
				     src_frag->page, src_frag->page_offset)) {
			dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx-1];
			dst_frag->size += src_frag->size;
		} else {
			if (dst_frag_idx >= MAX_SKB_FRAGS)
				return -EMSGSIZE;

			dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx];
			skb_shinfo(dst_skb)->nr_frags = dst_frag_idx + 1;

			dst_frag->page = src_frag->page;
			get_page(src_frag->page);

			dst_frag->page_offset = src_frag->page_offset;
			dst_frag->size = src_frag->size;
		}
		/* Keep len == data_len so the all-paged invariant holds. */
		dst_skb->len += src_frag->size;
		dst_skb->data_len += src_frag->size;
	}

	return 0;
}
static struct sk_buff *tcp_tso_build(struct sk_buff *head, int mss, int num)
{
	struct sk_buff *skb;
	struct sock *sk;
	int err;

	sk = head->sk;
	skb = alloc_skb(sk->sk_prot->max_header, GFP_ATOMIC);
	err = -ENOMEM;
	if (!skb)
		goto fail;

	err = 0;
	skb_shinfo(skb)->tso_size = mss;
	skb_shinfo(skb)->tso_segs = num;
	while (num--) {
		err = skb_append_pages(skb, head);
		if (err)
			goto fail;

		head = head->next;
	}
	return skb;

fail:
	/* kfree_skb() drops the page references taken by
	 * skb_append_pages(), so no explicit put_page() loop
	 * is needed here.
	 */
	if (skb)
		kfree_skb(skb);
	return NULL;
}
If tcp_tso_build() fails, the caller just falls back to the
normal path of sending the frames non-TSO one-by-one.
The logic is simple because if TSO is being done we know
that all of the SKB data is paged (since SG+CSUM is a
requirement for TSO). The one case where that
invariant might fail is due to a routing change (previous
device cannot do SG+CSUM, new device has full TSO capability)
and that is handled via the tcp_skb_data_all_paged() checks.
My thinking is that whatever added expense this new scheme
has is offset by the simplifications to the rest of the TCP
stack, since it will no longer need to know anything
about multiple MSS values and packet counts.
Comments?
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: design for TSO performance fix 2005-01-28 0:31 design for TSO performance fix David S. Miller @ 2005-01-28 0:51 ` Rick Jones 2005-01-28 0:58 ` David S. Miller 2005-01-28 1:31 ` Herbert Xu ` (2 subsequent siblings) 3 siblings, 1 reply; 13+ messages in thread From: Rick Jones @ 2005-01-28 0:51 UTC (permalink / raw) To: netdev David S. Miller wrote: > Ok, here is the best idea I've been able to come up with > so far. > > The basic idea is that we stop trying to build TSO frames > in the actual transmit queue. Instead, TSO packets are > built impromptu when we actually output packets on the > transmit queue. > > Advantages: > > 1) No knowledge of TSO frames need exist anywhere besides > tcp_write_xmit(), tcp_transmit_skb(), and > tcp_xmit_retransmit_queue() > > 2) As a result of #1, all the pcount crap goes away. > The need for two MSS state variables (mss_cache, > and mss_cache_std) and assosciated complexity is > eliminated as well. > > 3) Keeping TSO enabled after packet loss "just works". Doubleplusgood. > > 4) CWND sampled at the correct moment when deciding > the TSO packet arity. > > The one disadvantage is that it might be a tiny bit more > expensive to build TSO frames. But I am sure we can find > ways to optimize that quite well. > > The main element of the TSO output logic is a function > that is schemed as follows: > > ... > > If tcp_tso_build() fails, the caller just falls back to the > normal path of sending the frames non-TSO one-by-one. > > The logic is simple because if TSO is being done we know > that all of the SKB data is paged (since SG+CSUM is a > requirement for TSO). The one case where that > invariant might fail is due to a routing change (previous > device cannot do SG+CSUM, new device has full TSO capability) > and that is handled via the tcp_skb_data_all_paged() checks. 
> > My thinking is that whatever added expensive this new scheme > has, is offset by the simplifications the rest of the TCP > stack will have since it will no longer need to know anything > about multiple MSS values and packet counts. > > Comments? Does anything (need to) change wrt getting the size of the TSO's to increase as cwnd increases? rick jones ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 0:51 ` Rick Jones @ 2005-01-28 0:58 ` David S. Miller 0 siblings, 0 replies; 13+ messages in thread From: David S. Miller @ 2005-01-28 0:58 UTC (permalink / raw) To: Rick Jones; +Cc: netdev On Thu, 27 Jan 2005 16:51:31 -0800 Rick Jones <rick.jones2@hp.com> wrote: > Does anything (need to) change wrt getting the size of the TSO's to increase as > cwnd increases? Nope, we use the same algorithm we use currently to determine the "TSO mss", except that we compute and apply it at the correct moment. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 0:31 design for TSO performance fix David S. Miller 2005-01-28 0:51 ` Rick Jones @ 2005-01-28 1:31 ` Herbert Xu 2005-01-28 5:19 ` David S. Miller 2005-01-28 1:57 ` Thomas Graf 2005-01-28 6:25 ` Andi Kleen 3 siblings, 1 reply; 13+ messages in thread From: Herbert Xu @ 2005-01-28 1:31 UTC (permalink / raw) To: David S. Miller; +Cc: netdev David S. Miller <davem@davemloft.net> wrote: > > Ok, here is the best idea I've been able to come up with > so far. It sounds great! > 2) As a result of #1, all the pcount crap goes away. > The need for two MSS state variables (mss_cache, > and mss_cache_std) and assosciated complexity is > eliminated as well. Does this mean that we'll start counting bytes instead of packets? If not then please let me know on how you plan to do the packet counting. > static struct sk_buff *tcp_tso_build(struct sk_buff *head, int mss, int num) > { > struct sk_buff *skb; > struct sock *sk; > int err; > > sk = head->sk; > skb = alloc_skb(sk->sk_prot->max_header, GFP_ATOMIC); The other good thing about this is that if we do this for all packets including non-TSO ones, then the TCP stack doesn't have to own the TCP/IP headers at all. Then we can stop worrying about the TSO/COW mangling. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 1:31 ` Herbert Xu @ 2005-01-28 5:19 ` David S. Miller 2005-01-28 5:44 ` Herbert Xu 0 siblings, 1 reply; 13+ messages in thread From: David S. Miller @ 2005-01-28 5:19 UTC (permalink / raw) To: Herbert Xu; +Cc: netdev On Fri, 28 Jan 2005 12:31:53 +1100 Herbert Xu <herbert@gondor.apana.org.au> wrote: > > 2) As a result of #1, all the pcount crap goes away. > > The need for two MSS state variables (mss_cache, > > and mss_cache_std) and assosciated complexity is > > eliminated as well. > > Does this mean that we'll start counting bytes instead > of packets? > > If not then please let me know on how you plan to do the > packet counting. Things will be same as what we have now, except multi-packet SKBs will no longer exist in the retransmit queue. > The other good thing about this is that if we do this for all > packets including non-TSO ones, then the TCP stack doesn't have > to own the TCP/IP headers at all. Then we can stop worrying > about the TSO/COW mangling. Hmmm, have to think about that some more. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 5:19 ` David S. Miller @ 2005-01-28 5:44 ` Herbert Xu 2005-01-28 19:28 ` David S. Miller 0 siblings, 1 reply; 13+ messages in thread From: Herbert Xu @ 2005-01-28 5:44 UTC (permalink / raw) To: David S. Miller; +Cc: netdev On Thu, Jan 27, 2005 at 09:19:40PM -0800, David S. Miller wrote: > > > Does this mean that we'll start counting bytes instead > > of packets? > > > > If not then please let me know on how you plan to do the > > packet counting. > > Things will be same as what we have now, except multi-packet > SKBs will no longer exist in the retransmit queue. Colour me confused then. How are you going to remember the packet boundaries which we need to do if we're going to keep counting packets instead of bytes? -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 5:44 ` Herbert Xu @ 2005-01-28 19:28 ` David S. Miller 2005-01-29 10:12 ` Herbert Xu 0 siblings, 1 reply; 13+ messages in thread From: David S. Miller @ 2005-01-28 19:28 UTC (permalink / raw) To: Herbert Xu; +Cc: netdev On Fri, 28 Jan 2005 16:44:41 +1100 Herbert Xu <herbert@gondor.apana.org.au> wrote: > Colour me confused then. How are you going to remember the > packet boundaries which we need to do if we're going to keep > counting packets instead of bytes? It's just like how the code was before I added all of that tcp_pcount_t code. The retransmit queue only ever contains normal MSS sized frames. When we decide to send something off the queue, we try to build them up into TSO frames. Congestion control etc. decisions are still made by packet counting. When we get ACKs and SACKs back, we can just trim and mark the retransmit queue in the simplest way since we don't have TSO packets in there anymore. TSO packets only exist in the tcp_transmit_skb() path, nothing else in the stack sees them. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 19:28 ` David S. Miller @ 2005-01-29 10:12 ` Herbert Xu 0 siblings, 0 replies; 13+ messages in thread From: Herbert Xu @ 2005-01-29 10:12 UTC (permalink / raw) To: David S. Miller; +Cc: netdev On Fri, Jan 28, 2005 at 11:28:38AM -0800, David S. Miller wrote: > > TSO packets only exist in the tcp_transmit_skb() path, > nothing else in the stack sees them. Cool, that should be really good then. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 0:31 design for TSO performance fix David S. Miller 2005-01-28 0:51 ` Rick Jones 2005-01-28 1:31 ` Herbert Xu @ 2005-01-28 1:57 ` Thomas Graf 2005-02-01 23:04 ` David S. Miller 2005-01-28 6:25 ` Andi Kleen 3 siblings, 1 reply; 13+ messages in thread From: Thomas Graf @ 2005-01-28 1:57 UTC (permalink / raw) To: David S. Miller; +Cc: netdev * David S. Miller <20050127163146.33b01e95.davem@davemloft.net> 2005-01-27 16:31 > The basic idea is that we stop trying to build TSO frames > in the actual transmit queue. Instead, TSO packets are > built impromptu when we actually output packets on the > transmit queue. Sound great. > static inline int tcp_skb_data_all_paged(struct sk_buff *skb) > { > return (skb->len == skb->data_len); > } You could also define this as (skb_headlen(skb) == 0) > The logic is simple because if TSO is being done we know > that all of the SKB data is paged (since SG+CSUM is a > requirement for TSO). The one case where that > invariant might fail is due to a routing change (previous > device cannot do SG+CSUM, new device has full TSO capability) > and that is handled via the tcp_skb_data_all_paged() checks. I assume the case when reroute changes oif to a device no longer capable of SG+CSUM stays the same and the skb remains paged until dev_queue_xmit? > My thinking is that whatever added expensive this new scheme > has, is offset by the simplifications the rest of the TCP > stack will have since it will no longer need to know anything > about multiple MSS values and packet counts. I think the overhead is really worth the complexity that can be removed with these changes. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 1:57 ` Thomas Graf @ 2005-02-01 23:04 ` David S. Miller 0 siblings, 0 replies; 13+ messages in thread From: David S. Miller @ 2005-02-01 23:04 UTC (permalink / raw) To: Thomas Graf; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 1631 bytes --] On Fri, 28 Jan 2005 02:57:51 +0100 Thomas Graf <tgraf@suug.ch> wrote: > > static inline int tcp_skb_data_all_paged(struct sk_buff *skb) > > { > > return (skb->len == skb->data_len); > > } > > You could also define this as (skb_headlen(skb) == 0) Good point, I'll do it that way. > I assume the case when reroute changes oif to a device no > longer capable of SG+CSUM stays the same and the skb remains > paged until dev_queue_xmit? That's correct. The only difference is that the TSO building path of send queue transmit will not be executed. I'm slowly piecing together an implementation. The most non- trivial aspect is the frame pushing logic. While building the queue from userspace, we wish to defer until either 1) the user will not supply more data or 2) there is enough in the send queue for an optimally sized TSO frame to be built. For the curious, there is attached my current state of implementation. It's very raw, but it starts to give the basic ideas. The first attachment are the design notes I've been jotting down casually while thinking about this, and the second is the rough beginnings of a patch. The patch implements the tp->tso_goal calculations, and the TSO segmentizer, but nothing else. The missing pieces are: 1) the push-pending-frames logic, it requires the most thought 1.5) the code in tcp_write_xmit() that tries to call the TSO segmenter with groups of SKBs to send 2) killing of tp->mss_cache_std, use tp->mss_cache for everything 3) kill all the code disabling TSO during packet drops 4) kill all the pcount stuff I'll continue trying to make more progress with this thing. 
[-- Attachment #2: tcp_tso.txt --] [-- Type: text/plain, Size: 1432 bytes --] Maintain some "TSO potential" state during segmentation at sendmsg()/sendpage() time. Use this at push-pending-frames time to defer tcp_write_xmit() calls and control it's behavior. Add tcp_flush_queue() which doesn't try to optimize TSO, it is invoked when getting packets out is more important than producing larger TSO chunks. These two cases are: 1) At end of sendmsg()/sendpage() call without MSG_MORE, indicating that we have no way to know for sure if the user will queue up more TCP data to send. 2) When sleeping within sendmsg()/sendpage() waiting for memory. Pushing out packets and receiving the ACKs may very well be the event that will free up send queue space for us. (Must consider interactions with Nagle and Minshall rules) Consider tcp_opt state which keeps a "TSO goal", it must be in sync with tcp_opt MSS state. Initially define "TSO goal" using tcp_tso_win_divisor and the current congestion window. Formally this is: max(1U, CWND / TCP_TSO_WIN_DIVISOR) We could either maintain this lazily, costing us a divide each time it is recalculated. Or, we can update it incrementally each time snd_cwnd is updated. To save some state testing during output decisions, define "TSO goal" as one for non-TSO flows. 
Possible send test logic: if (no new data possibly coming from user) send_now(); if (sending due to ACK queue advancement) send_now(); send_tso_goal_sized_chunks(); [-- Attachment #3: diff --] [-- Type: application/octet-stream, Size: 3165 bytes --] ===== include/linux/tcp.h 1.34 vs edited ===== --- 1.34/include/linux/tcp.h 2005-01-17 14:09:33 -08:00 +++ edited/include/linux/tcp.h 2005-01-31 16:03:32 -08:00 @@ -262,6 +262,7 @@ __u32 pmtu_cookie; /* Last pmtu seen by socket */ __u32 mss_cache; /* Cached effective mss, not including SACKS */ __u16 mss_cache_std; /* Like mss_cache, but without TSO */ + __u16 tso_goal; /* TSO packet count goal, 1 w/non-TSO paths */ __u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ __u16 ext_header_len; /* Network protocol overhead (IP/IPv6 options) */ __u16 ext2_header_len;/* Options depending on route */ ===== net/ipv4/tcp_output.c 1.77 vs edited ===== --- 1.77/net/ipv4/tcp_output.c 2005-01-18 12:23:36 -08:00 +++ edited/net/ipv4/tcp_output.c 2005-02-01 14:32:46 -08:00 @@ -707,15 +707,103 @@ if (factor > limit) factor = limit; - tp->mss_cache = mss_now * factor; + /* If this ever triggers, change tp->tso_goal to + * a larger type and update this bug check. + */ + BUG_ON(factor > 65535); - mss_now = tp->mss_cache; - } + tp->tso_goal = factor; + } else + tp->tso_goal = 1; if (tp->eff_sacks) mss_now -= (TCPOLEN_SACK_BASE_ALIGNED + (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK)); return mss_now; +} + +static inline int tcp_skb_data_all_paged(struct sk_buff *skb) +{ + return skb_headlen(skb) == 0; +} + +/* If possible, append paged data of SRC_SKB onto the + * tail of DST_SKB. 
+ */ +static int skb_append_pages(struct sk_buff *dst_skb, struct sk_buff *src_skb) +{ + int i; + + if (!tcp_skb_data_all_paged(src_skb)) + return -EINVAL; + + for (i = 0; i < skb_shinfo(src_skb)->nr_frags; i++) { + skb_frag_t *src_frag = &skb_shinfo(src_skb)->frags[i]; + skb_frag_t *dst_frag; + int dst_frag_idx; + + dst_frag_idx = skb_shinfo(dst_skb)->nr_frags; + + if (skb_can_coalesce(dst_skb, dst_frag_idx, + src_frag->page, src_frag->page_offset)) { + dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx-1]; + dst_frag->size += src_frag->size; + } else { + if (dst_frag_idx >= MAX_SKB_FRAGS) + return -EMSGSIZE; + + dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx]; + skb_shinfo(dst_skb)->nr_frags = dst_frag_idx + 1; + + dst_frag->page = src_frag->page; + get_page(src_frag->page); + + dst_frag->page_offset = src_frag->page_offset; + dst_frag->size = src_frag->size; + } + dst_skb->data_len += src_frag->size; + } + + return 0; +} + +static struct sk_buff *tcp_tso_build(struct sk_buff *head, int mss, int num) +{ + struct sk_buff *skb; + struct sock *sk; + int err; + + sk = head->sk; + skb = alloc_skb(sk->sk_prot->max_header, GFP_ATOMIC); + err = -ENOMEM; + if (!skb) + goto fail; + + err = 0; + skb_shinfo(skb)->tso_size = mss; + skb_shinfo(skb)->tso_segs = num; + while (num--) { + err = skb_append_pages(skb, head); + if (err) + goto fail; + + head = head->next; + } + return skb; + +fail: + if (skb) { + int i; + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + + put_page(frag->page); + } + + kfree_skb(skb); + } + return NULL; } /* This routine writes packets to the network. It advances the ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 0:31 design for TSO performance fix David S. Miller ` (2 preceding siblings ...) 2005-01-28 1:57 ` Thomas Graf @ 2005-01-28 6:25 ` Andi Kleen 2005-01-28 6:44 ` Nivedita Singhvi 2005-01-28 19:30 ` David S. Miller 3 siblings, 2 replies; 13+ messages in thread From: Andi Kleen @ 2005-01-28 6:25 UTC (permalink / raw) To: David S. Miller; +Cc: netdev "David S. Miller" <davem@davemloft.net> writes: > Ok, here is the best idea I've been able to come up with > so far. > > The basic idea is that we stop trying to build TSO frames > in the actual transmit queue. Instead, TSO packets are > built impromptu when we actually output packets on the > transmit queue. I don't quite get how it should work. Currently tcp_sendmsg will always push the first packet when the send_head is empty way down to hard_queue_xmit, and then queue up some others and then finally push them out. You would always miss the first one with that right? (assuming MTU sized packets) I looked at this some time ago to pass lists of packets to qdisc and hard_queue_xmit, because that would allow less locking overhead and allow some drivers to send stuff more efficiently to the hardware registers (It was one of the items in my "how to speed up the stack" list ;-) I never ended up implementing it because TSO gave most of the advantages anyways. > Advantages: > > 1) No knowledge of TSO frames need exist anywhere besides > tcp_write_xmit(), tcp_transmit_skb(), and > tcp_xmit_retransmit_queue() > > 2) As a result of #1, all the pcount crap goes away. > The need for two MSS state variables (mss_cache, > and mss_cache_std) and assosciated complexity is > eliminated as well. > > 3) Keeping TSO enabled after packet loss "just works". > > 4) CWND sampled at the correct moment when deciding > the TSO packet arity. > > The one disadvantage is that it might be a tiny bit more > expensive to build TSO frames. But I am sure we can find > ways to optimize that quite well. 
Without lists of packets through qdiscs etc. it will likely need a lot more spin locking than it used to be (and spinlocks tend to be quite expensive). Luckily the high level queuing you need for this could be used to implement the list of packets too (and then finally pass them to hard_queue_xmit to allow drivers more optimizations) -Andi ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix 2005-01-28 6:25 ` Andi Kleen @ 2005-01-28 6:44 ` Nivedita Singhvi 2005-01-28 19:30 ` David S. Miller 1 sibling, 0 replies; 13+ messages in thread From: Nivedita Singhvi @ 2005-01-28 6:44 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, netdev Andi Kleen wrote: > I looked at this some time ago to pass lists of packets > to qdisc and hard_queue_xmit, because that would allow less locking > overhead and allow some drivers to send stuff more efficiently > to the hardware registers > (It was one of the items in my "how to speed up the stack" list ;-) > > I never ended up implementing it because TSO gave most of the advantages > anyways. I admit that it's been several months since I last looked at this - and was just handwaving, had no code. But I had thought the converse then - that it might be better to abandon TSO and just have the stack pass down the list of skbs in one pass. Had been mentioned by Andi as well as Anton. We'd get much of the gain, avoid a lot of the complexity, and the code would be simpler. And I'm not positive about this but it seemed it would handle memory fragmentation better, too. Bogus? thanks, Nivedita ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: design for TSO performance fix
2005-01-28 6:25 ` Andi Kleen
2005-01-28 6:44 ` Nivedita Singhvi
@ 2005-01-28 19:30 ` David S. Miller
1 sibling, 0 replies; 13+ messages in thread
From: David S. Miller @ 2005-01-28 19:30 UTC (permalink / raw)
To: Andi Kleen; +Cc: netdev
On Fri, 28 Jan 2005 07:25:54 +0100 Andi Kleen <ak@muc.de> wrote:
> Currently tcp_sendmsg will always push the first packet when the send_head
> is empty way down to hard_queue_xmit, and then queue up some others
> and then finally push them out. You would always miss the first
> one with that right? (assuming MTU sized packets)
We could make push_pending_frames defer if we're doing TSO and
might potentially be building such frames.
It's just a detail. The main idea is what counts, which is to keep
all the TSO packets out of the view of most of the stack, which is
where all the complexity came from.
^ permalink raw reply [flat|nested] 13+ messages in thread