design for TSO performance fix

From: "David S. Miller" <davem@davemloft.net>
To: netdev@oss.sgi.com
Subject: design for TSO performance fix
Date: Thu, 27 Jan 2005 16:31:46 -0800
Message-ID: <20050127163146.33b01e95.davem@davemloft.net>

Ok, here is the best idea I've been able to come up with
so far.

The basic idea is that we stop trying to build TSO frames
in the actual transmit queue.  Instead, TSO packets are
built impromptu when we actually output packets on the
transmit queue.

Advantages:

1) No knowledge of TSO frames need exist anywhere besides
   tcp_write_xmit(), tcp_transmit_skb(), and
   tcp_xmit_retransmit_queue()

2) As a result of #1, all the pcount crap goes away.
   The need for two MSS state variables (mss_cache,
   and mss_cache_std) and associated complexity is
   eliminated as well.

3) Keeping TSO enabled after packet loss "just works".

4) CWND is sampled at the correct moment when deciding
   the TSO packet arity.
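
To illustrate point 4, here is a minimal userspace sketch of picking
the arity at transmit time. Everything in it is hypothetical and not
kernel API: struct tx_state, tso_arity() and the max_tso_segs limit
are made up for illustration. It assumes the arity is bounded by the
room left in the congestion window, the number of full-sized segments
queued, and some per-frame device limit.

```c
#include <stdint.h>

/* Hypothetical snapshot of sender state, taken at transmit time. */
struct tx_state {
	uint32_t cwnd;         /* congestion window, in packets */
	uint32_t in_flight;    /* packets sent but not yet acked */
	uint32_t queued_segs;  /* MSS-sized segments waiting on the queue */
	uint32_t max_tso_segs; /* assumed per-frame device limit */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Number of segments to merge into one frame right now.
 * Returns 0 when the window is full or nothing is queued;
 * a result of 1 means an ordinary non-TSO frame.
 */
static uint32_t tso_arity(const struct tx_state *st)
{
	uint32_t room;

	if (st->in_flight >= st->cwnd)
		return 0;

	room = st->cwnd - st->in_flight;
	return min_u32(room, min_u32(st->queued_segs, st->max_tso_segs));
}
```

Because the computation runs when the frame is about to be sent, it
sees the current window rather than the one that was in effect when
the data was queued.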

The one disadvantage is that it might be a tiny bit more
expensive to build TSO frames.  But I am sure we can find
ways to optimize that quite well.

The main element of the TSO output logic is a function
that is sketched as follows:

static inline int tcp_skb_data_all_paged(struct sk_buff *skb)
{
	return (skb->len == skb->data_len);
}

/* If possible, append paged data of SRC_SKB onto the
 * tail of DST_SKB.
 */
static int skb_append_pages(struct sk_buff *dst_skb, struct sk_buff *src_skb)
{
	int i;

	if (!tcp_skb_data_all_paged(src_skb))
		return -EINVAL;

	for (i = 0; i < skb_shinfo(src_skb)->nr_frags; i++) {
		skb_frag_t *src_frag = &skb_shinfo(src_skb)->frags[i];
		skb_frag_t *dst_frag;
		int dst_frag_idx;

		dst_frag_idx = skb_shinfo(dst_skb)->nr_frags;

		if (skb_can_coalesce(dst_skb, dst_frag_idx,
				     src_frag->page, src_frag->page_offset)) {
			dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx-1];
			dst_frag->size += src_frag->size;
		} else {
			if (dst_frag_idx >= MAX_SKB_FRAGS)
				return -EMSGSIZE;

			dst_frag = &skb_shinfo(dst_skb)->frags[dst_frag_idx];
			skb_shinfo(dst_skb)->nr_frags = dst_frag_idx + 1;

			dst_frag->page = src_frag->page;
			get_page(src_frag->page);

			dst_frag->page_offset = src_frag->page_offset;
			dst_frag->size = src_frag->size;
		}
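		/* All of the appended data is paged, so both counters grow. */
		dst_skb->len += src_frag->size;
		dst_skb->data_len += src_frag->size;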
	}

	return 0;
}

static struct sk_buff *tcp_tso_build(struct sk_buff *head, int mss, int num)
{
	struct sk_buff *skb;
	struct sock *sk;
	int err;

	sk = head->sk;
	skb = alloc_skb(sk->sk_prot->max_header, GFP_ATOMIC);
	err = -ENOMEM;
	if (!skb)
		goto fail;

	err = 0;
	skb_shinfo(skb)->tso_size = mss;
	skb_shinfo(skb)->tso_segs = num;
	while (num--) {
		err = skb_append_pages(skb, head);
		if (err)
			goto fail;

		head = head->next;
	}
	return skb;

fail:
	/* kfree_skb() drops the page references held by the frags,
	 * so they must not be put by hand here as well.
	 */
	if (skb)
		kfree_skb(skb);
	return NULL;
}

If tcp_tso_build() fails, the caller just falls back to the
normal path of sending the frames non-TSO one-by-one.
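
A rough userspace model of that caller-side fallback, for
illustration only. struct seg, try_build_tso(), xmit_burst() and
struct tx_stats are hypothetical stand-ins and not the real
tcp_write_xmit() logic; the "paged" flag stands in for the
tcp_skb_data_all_paged() test.

```c
#include <stddef.h>

/* Hypothetical queue entry; paged != 0 means all data is in page frags. */
struct seg {
	struct seg *next;
	int paged;
};

struct tx_stats {
	int tso_frames;
	int plain_frames;
};

/* Stand-in for tcp_tso_build(): succeeds only if the first num
 * segments exist and are all fully paged.
 */
static int try_build_tso(const struct seg *head, int num)
{
	while (num--) {
		if (!head || !head->paged)
			return -1;
		head = head->next;
	}
	return 0;
}

/* Try one TSO frame for the burst; on failure send one by one. */
static void xmit_burst(const struct seg *head, int num, struct tx_stats *st)
{
	if (num > 1 && try_build_tso(head, num) == 0) {
		st->tso_frames++;
		return;
	}

	/* Fallback: the normal non-TSO path, one frame per segment. */
	while (num-- > 0 && head) {
		st->plain_frames++;
		head = head->next;
	}
}
```

The point of the sketch is that a failed build costs nothing beyond
the attempt: the queue itself was never modified, so the ordinary
path can run on it unchanged.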

The logic is simple because if TSO is being done we know
that all of the SKB data is paged (since SG+CSUM is a
requirement for TSO).  The one case where that
invariant might fail is due to a routing change (previous
device cannot do SG+CSUM, new device has full TSO capability)
and that is handled via the tcp_skb_data_all_paged() checks.
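
The all-paged check and the fragment merging can be tried out in
userspace with a toy model like the one below. It is only a sketch:
struct buf, struct frag, MAX_FRAGS and append_frags() are
hypothetical simplifications of sk_buff, skb_frag_t, MAX_SKB_FRAGS
and skb_append_pages(), pages are plain integers, and there is no
reference counting.

```c
#include <stddef.h>

#define MAX_FRAGS 4 /* small stand-in for MAX_SKB_FRAGS */

struct frag {
	int page;      /* page identity, just a number here */
	unsigned off;  /* offset of the data within the page */
	unsigned size; /* bytes of data */
};

struct buf {
	unsigned len;      /* total bytes */
	unsigned data_len; /* bytes held in frags */
	int nr_frags;
	struct frag frags[MAX_FRAGS];
};

/* True when no data lives in the linear area. */
static int all_paged(const struct buf *b)
{
	return b->len == b->data_len;
}

/* Returns 0 on success, -1 if src has linear data, -2 if dst has
 * no free slot.  As in the sketch above, dst may be left partially
 * filled on failure and the caller is expected to discard it.
 */
static int append_frags(struct buf *dst, const struct buf *src)
{
	int i;

	if (!all_paged(src))
		return -1;

	for (i = 0; i < src->nr_frags; i++) {
		const struct frag *s = &src->frags[i];
		struct frag *last = dst->nr_frags ?
			&dst->frags[dst->nr_frags - 1] : NULL;

		if (last && last->page == s->page &&
		    last->off + last->size == s->off) {
			/* Contiguous in the same page: extend the last slot. */
			last->size += s->size;
		} else {
			if (dst->nr_frags >= MAX_FRAGS)
				return -2;
			dst->frags[dst->nr_frags++] = *s;
		}
		dst->len += s->size;
		dst->data_len += s->size;
	}
	return 0;
}
```

A source buffer with len 60 and data_len 40 models the routing-change
case: 20 bytes sit in the linear area, so the append is refused
before anything is copied.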

My thinking is that whatever added expense this new scheme
has is offset by the simplifications the rest of the TCP
stack will gain, since it will no longer need to know anything
about multiple MSS values and packet counts.

Comments?
