netdev.vger.kernel.org archive mirror
From: "David S. Miller" <davem@davemloft.net>
To: Thomas Graf <tgraf@suug.ch>
Cc: netdev@oss.sgi.com
Subject: Re: design for TSO performance fix
Date: Tue, 1 Feb 2005 15:04:30 -0800	[thread overview]
Message-ID: <20050201150430.309978b6.davem@davemloft.net> (raw)
In-Reply-To: <20050128015751.GT31837@postel.suug.ch>

[-- Attachment #1: Type: text/plain, Size: 1631 bytes --]

On Fri, 28 Jan 2005 02:57:51 +0100
Thomas Graf <tgraf@suug.ch> wrote:

> > static inline int tcp_skb_data_all_paged(struct sk_buff *skb)
> > {
> > 	return (skb->len == skb->data_len);
> > }
> 
> You could also define this as (skb_headlen(skb) == 0)

Good point, I'll do it that way.
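The two tests are equivalent because skb_headlen() is defined as skb->len - skb->data_len, so a zero-length linear area means every byte lives in page frags. A user-space model (mock_skb and its fields are hypothetical stand-ins for the kernel's sk_buff) illustrates the equivalence:

```c
#include <assert.h>

/* Minimal user-space model of the two sk_buff length fields:
 * len is the total payload, data_len is the paged (non-linear) part.
 */
struct mock_skb {
	unsigned int len;      /* total data length */
	unsigned int data_len; /* bytes held in page frags */
};

/* Mirrors skb_headlen(): bytes in the linear (kmalloc'd) area. */
static unsigned int mock_headlen(const struct mock_skb *skb)
{
	return skb->len - skb->data_len;
}

/* Same predicate as tcp_skb_data_all_paged() in either spelling. */
static int all_paged(const struct mock_skb *skb)
{
	return mock_headlen(skb) == 0; /* equivalent to len == data_len */
}
```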

> I assume the case when reroute changes oif to a device no
> longer capable of SG+CSUM stays the same and the skb remains
> paged until dev_queue_xmit?

That's correct.  The only difference is that the TSO building
path of send queue transmit will not be executed.

I'm slowly piecing together an implementation.  The most non-
trivial aspect is the frame pushing logic.  While building the
queue from userspace, we wish to defer until either 1) the user
will not supply more data or 2) there is enough in the send
queue for an optimally sized TSO frame to be built.

For the curious, I've attached my current state of implementation.
It's very raw, but it starts to convey the basic ideas.  The first
attachment is the design notes I've been jotting down casually
while thinking about this, and the second is the rough beginnings
of a patch.

The patch implements the tp->tso_goal calculations, and the TSO
segmentizer, but nothing else.  The missing pieces are:

1) the push-pending-frames logic, it requires the most thought
1.5) the code in tcp_write_xmit() that tries to call the TSO
     segmenter with groups of SKBs to send
2) killing of tp->mss_cache_std, use tp->mss_cache for everything
3) kill all the code disabling TSO during packet drops
4) kill all the pcount stuff

I'll continue trying to make more progress with this thing.

[-- Attachment #2: tcp_tso.txt --]
[-- Type: text/plain, Size: 1432 bytes --]


Maintain some "TSO potential" state during segmentation at
sendmsg()/sendpage() time.  Use this at push-pending-frames
time to defer tcp_write_xmit() calls and control its behavior.

Add tcp_flush_queue(), which doesn't try to optimize TSO;
it is invoked when getting packets out is more important
than producing larger TSO chunks.  These two cases are:

1) At end of sendmsg()/sendpage() call without MSG_MORE,
   indicating that we have no way to know for sure if
   the user will queue up more TCP data to send.

2) When sleeping within sendmsg()/sendpage() waiting for
   memory.  Pushing out packets and receiving the ACKs
   may very well be the event that will free up send
   queue space for us.

(Must consider interactions with Nagle and Minshall rules)

Consider tcp_opt state which keeps a "TSO goal"; it
must be kept in sync with tcp_opt MSS state.  Initially
define the "TSO goal" using tcp_tso_win_divisor and the
current congestion window.  Formally this is:

	max(1U, CWND / TCP_TSO_WIN_DIVISOR)

We could either maintain this lazily, costing us a divide
each time it is recalculated, or update it incrementally
each time snd_cwnd is updated.
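As a sketch, the lazy variant reduces to one clamped integer divide per recalculation; an incremental scheme would have to reproduce the same result on every snd_cwnd change. This user-space model uses hypothetical names, and the divisor value is assumed for illustration:

```c
#include <assert.h>

#define TCP_TSO_WIN_DIVISOR 8 /* assumed value, for illustration only */

/* Lazy variant: recompute the goal from the current congestion
 * window each time it is needed, costing one integer divide.
 */
static unsigned int compute_tso_goal(unsigned int snd_cwnd, int tso_capable)
{
	unsigned int goal;

	if (!tso_capable)
		return 1; /* non-TSO flows: a goal of one saves state tests */

	goal = snd_cwnd / TCP_TSO_WIN_DIVISOR;
	return goal > 1U ? goal : 1U; /* max(1U, CWND / TCP_TSO_WIN_DIVISOR) */
}
```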

To save some state testing during output decisions,
define "TSO goal" as one for non-TSO flows.

Possible send test logic:

	if (no new data possibly coming from user)
		send_now();
	if (sending due to ACK queue advancement)
		send_now();
	send_tso_goal_sized_chunks();
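The decision order above can be modeled in user space. The function and parameter names here are hypothetical stand-ins for the real socket state, not kernel APIs:

```c
#include <assert.h>

/* Returns how many MSS-sized segments to emit right now:
 * everything queued when a flush is forced, otherwise only
 * whole goal-sized chunks, deferring the remainder.
 * Assumes tso_goal >= 1 (it is defined as one for non-TSO flows).
 */
static unsigned int segs_to_send(unsigned int queued_segs,
				 unsigned int tso_goal,
				 int user_done,     /* no more data coming */
				 int ack_advanced)  /* sending due to ACKs */
{
	if (user_done || ack_advanced)
		return queued_segs; /* flush: latency beats batching */

	/* Defer until a full TSO-goal-sized chunk has accumulated. */
	return (queued_segs / tso_goal) * tso_goal;
}
```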

[-- Attachment #3: diff --]
[-- Type: application/octet-stream, Size: 3165 bytes --]

===== include/linux/tcp.h 1.34 vs edited =====
--- 1.34/include/linux/tcp.h	2005-01-17 14:09:33 -08:00
+++ edited/include/linux/tcp.h	2005-01-31 16:03:32 -08:00
@@ -262,6 +262,7 @@
 	__u32	pmtu_cookie;	/* Last pmtu seen by socket		*/
 	__u32	mss_cache;	/* Cached effective mss, not including SACKS */
 	__u16	mss_cache_std;	/* Like mss_cache, but without TSO */
+	__u16	tso_goal;	/* TSO packet count goal, 1 w/non-TSO paths */
 	__u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 	__u16	ext_header_len;	/* Network protocol overhead (IP/IPv6 options) */
 	__u16	ext2_header_len;/* Options depending on route */
===== net/ipv4/tcp_output.c 1.77 vs edited =====
--- 1.77/net/ipv4/tcp_output.c	2005-01-18 12:23:36 -08:00
+++ edited/net/ipv4/tcp_output.c	2005-02-01 14:32:46 -08:00
@@ -707,15 +707,103 @@
 		if (factor > limit)
 			factor = limit;
 
-		tp->mss_cache = mss_now * factor;
+		/* If this ever triggers, change tp->tso_goal to
+		 * a larger type and update this bug check.
+		 */
+		BUG_ON(factor > 65535);
 
-		mss_now = tp->mss_cache;
-	}
+		tp->tso_goal = factor;
+	} else
+		tp->tso_goal = 1;
 
 	if (tp->eff_sacks)
 		mss_now -= (TCPOLEN_SACK_BASE_ALIGNED +
 			    (tp->eff_sacks * TCPOLEN_SACK_PERBLOCK));
 	return mss_now;
+}
+
+static inline int tcp_skb_data_all_paged(struct sk_buff *skb)
+{
+	return skb_headlen(skb) == 0;
+}
+
+/* If possible, append paged data of SRC_SKB onto the
+ * tail of DST_SKB.
+ */
+static int skb_append_pages(struct sk_buff *dst_skb, struct sk_buff *src_skb)
+{
+	int i;
+
+	if (!tcp_skb_data_all_paged(src_skb))
+		return -EINVAL;
+
+	for (i = 0; i < skb_shinfo(src_skb)->nr_frags; i++) {
+		skb_frag_t *src_frag = &skb_shinfo(src_skb)->frags[i];
+		skb_frag_t *dst_frag;
+		int dst_frag_idx;
+
+		dst_frag_idx = skb_shinfo(dst_skb)->nr_frags;
+
+		if (skb_can_coalesce(dst_skb, dst_frag_idx,
+				     src_frag->page, src_frag->page_offset)) {
+			dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx-1];
+			dst_frag->size += src_frag->size;
+		} else {
+			if (dst_frag_idx >= MAX_SKB_FRAGS)
+				return -EMSGSIZE;
+
+			dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx];
+			skb_shinfo(dst_skb)->nr_frags = dst_frag_idx + 1;
+
+			dst_frag->page = src_frag->page;
+			get_page(src_frag->page);
+
+			dst_frag->page_offset = src_frag->page_offset;
+			dst_frag->size = src_frag->size;
+		}
+		dst_skb->data_len += src_frag->size;
+	}
+
+	return 0;
+}
+
+static struct sk_buff *tcp_tso_build(struct sk_buff *head, int mss, int num)
+{
+	struct sk_buff *skb;
+	struct sock *sk;
+	int err;
+
+	sk = head->sk;
+	skb = alloc_skb(sk->sk_prot->max_header, GFP_ATOMIC);
+	err = -ENOMEM;
+	if (!skb)
+		goto fail;
+
+	err = 0;
+	skb_shinfo(skb)->tso_size = mss;
+	skb_shinfo(skb)->tso_segs = num;
+	while (num--) {
+		err = skb_append_pages(skb, head);
+		if (err)
+			goto fail;
+
+		head = head->next;
+	}
+	return skb;
+
+fail:
+	if (skb) {
+		/* kfree_skb() drops the page references taken by
+		 * skb_append_pages(), so no explicit put_page()
+		 * loop is needed here (it would double-put).
+		 */
+		kfree_skb(skb);
+	}
+	return NULL;
 }
 
 /* This routine writes packets to the network.  It advances the

Thread overview: 13+ messages
2005-01-28  0:31 design for TSO performance fix David S. Miller
2005-01-28  0:51 ` Rick Jones
2005-01-28  0:58   ` David S. Miller
2005-01-28  1:31 ` Herbert Xu
2005-01-28  5:19   ` David S. Miller
2005-01-28  5:44     ` Herbert Xu
2005-01-28 19:28       ` David S. Miller
2005-01-29 10:12         ` Herbert Xu
2005-01-28  1:57 ` Thomas Graf
2005-02-01 23:04   ` David S. Miller [this message]
2005-01-28  6:25 ` Andi Kleen
2005-01-28  6:44   ` Nivedita Singhvi
2005-01-28 19:30   ` David S. Miller
