[0/4] gro: Optimise GRO receive functions


* [0/4] gro: Optimise GRO receive functions
From: Herbert Xu @ 2009-02-09  3:59 UTC
  To: David S. Miller, netdev

Hi Dave:

This is a split up version of a patch that Divy tested earlier.
I was going to do this earlier but I wanted to track down the
igb/ixgbe regressions first.

After this I'm going to look at moving the skb construction calls
into cxgb3 itself in order to avoid copying the meta data twice.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* [PATCH 1/4] gro: Remember number of held packets instead of counting every time
From: Herbert Xu @ 2009-02-09  4:00 UTC
  To: David S. Miller, netdev

gro: Remember number of held packets instead of counting every time

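This patch prepares for moving the same_flow checks out of
dev_gro_receive.  Once they are moved, dev_gro_receive no longer
walks gro_list, so it needs another way to know how many packets are
held.  Keeping a running count in gro_count avoids looping over the
list just to count them every time.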

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/netdevice.h |    3 +++
 net/core/dev.c            |   12 +++++++-----
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7a5057f..e1d482c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -314,6 +314,9 @@ struct napi_struct {
 	spinlock_t		poll_lock;
 	int			poll_owner;
 #endif
+
+	unsigned int		gro_count;
+
 	struct net_device	*dev;
 	struct list_head	dev_list;
 	struct sk_buff		*gro_list;
diff --git a/net/core/dev.c b/net/core/dev.c
index 247f161..330534e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2372,6 +2372,7 @@ void napi_gro_flush(struct napi_struct *napi)
 		napi_gro_complete(skb);
 	}
 
+	napi->gro_count = 0;
 	napi->gro_list = NULL;
 }
 EXPORT_SYMBOL(napi_gro_flush);
@@ -2402,7 +2403,6 @@ int dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 	struct packet_type *ptype;
 	__be16 type = skb->protocol;
 	struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
-	int count = 0;
 	int same_flow;
 	int mac_len;
 	int ret;
@@ -2430,8 +2430,6 @@ int dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 		NAPI_GRO_CB(skb)->free = 0;
 
 		for (p = napi->gro_list; p; p = p->next) {
-			count++;
-
 			if (!NAPI_GRO_CB(p)->same_flow)
 				continue;
 
@@ -2457,15 +2455,16 @@ int dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 		*pp = nskb->next;
 		nskb->next = NULL;
 		napi_gro_complete(nskb);
-		count--;
+		napi->gro_count--;
 	}
 
 	if (same_flow)
 		goto ok;
 
-	if (NAPI_GRO_CB(skb)->flush || count >= MAX_GRO_SKBS)
+	if (NAPI_GRO_CB(skb)->flush || napi->gro_count >= MAX_GRO_SKBS)
 		goto normal;
 
+	napi->gro_count++;
 	NAPI_GRO_CB(skb)->count = 1;
 	skb_shinfo(skb)->gso_size = skb_gro_len(skb);
 	skb->next = napi->gro_list;
@@ -2713,6 +2712,7 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight)
 {
 	INIT_LIST_HEAD(&napi->poll_list);
+	napi->gro_count = 0;
 	napi->gro_list = NULL;
 	napi->skb = NULL;
 	napi->poll = poll;
@@ -2741,6 +2741,7 @@ void netif_napi_del(struct napi_struct *napi)
 	}
 
 	napi->gro_list = NULL;
+	napi->gro_count = 0;
 }
 EXPORT_SYMBOL(netif_napi_del);
 
@@ -5246,6 +5247,7 @@ static int __init net_dev_init(void)
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
+		queue->backlog.gro_count = 0;
 	}
 
 	dev_boot_phase = 0;
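
To illustrate the idea in isolation, here is a minimal userspace sketch (not kernel code) of keeping a running count next to a singly linked list, as gro_count does for gro_list. The types `struct pkt` and `struct holder`, the helper names and the cap `MAX_HELD` are hypothetical stand-ins for sk_buff, napi_struct and MAX_GRO_SKBS. The capacity check becomes O(1), at the price of keeping the count in sync wherever the list is initialised, grown, shrunk or flushed; the patch accordingly updates netif_napi_add, netif_napi_del, net_dev_init and napi_gro_flush alongside dev_gro_receive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sk_buff and napi_struct. */
#define MAX_HELD 8	/* assumed cap, in the role of MAX_GRO_SKBS */

struct pkt {
	struct pkt *next;
};

struct holder {
	struct pkt *list;
	unsigned int count;	/* kept in step with list, like gro_count */
};

/* Old approach: walk the whole list just to learn its length. */
static unsigned int count_by_walking(const struct holder *h)
{
	unsigned int n = 0;
	const struct pkt *p;

	for (p = h->list; p; p = p->next)
		n++;
	return n;
}

/* Debug check: the cached count must equal the walked length. */
static void check_consistent(const struct holder *h)
{
	assert(h->count == count_by_walking(h));
}

/* Initialisation must set the count too (cf. netif_napi_add). */
static void holder_init(struct holder *h)
{
	h->list = NULL;
	h->count = 0;
}

/* Hold a packet if there is room: an O(1) test against the cached count.
 * Returns 0 when full; the caller would then pass the packet up as is. */
static int hold(struct holder *h, struct pkt *p)
{
	if (h->count >= MAX_HELD)
		return 0;
	p->next = h->list;
	h->list = p;
	h->count++;
	return 1;
}

/* Unlink the head packet (e.g. once it has been completed). */
static struct pkt *release_head(struct holder *h)
{
	struct pkt *p = h->list;

	if (p) {
		h->list = p->next;
		p->next = NULL;
		h->count--;
	}
	return p;
}

/* Flush: reset the count alongside the list, as napi_gro_flush does. */
static void flush_all(struct holder *h)
{
	struct pkt *p, *next;

	for (p = h->list; p; p = next) {
		next = p->next;
		p->next = NULL;	/* a real flush would complete p here */
	}
	h->list = NULL;
	h->count = 0;
}
```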


* [PATCH 2/4] gro: Optimise Ethernet header comparison
From: Herbert Xu @ 2009-02-09  4:00 UTC
  To: David S. Miller, netdev

gro: Optimise Ethernet header comparison

This patch optimises the Ethernet header comparison to use 2-byte
and 4-byte XORs instead of memcmp.  To facilitate this, the actual
comparison is now carried out by the callers of the shared
dev_gro_receive function (__napi_gro_receive and vlan_gro_common),
which compare the fixed-size Ethernet header directly.

This has a significant impact when receiving 1500B packets through
10GbE.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/etherdevice.h |   21 +++++++++++++++++++++
 include/linux/netdevice.h   |    7 +++++++
 net/8021q/vlan_core.c       |    4 +++-
 net/core/dev.c              |   23 ++---------------------
 4 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 1cb0f0b..a1f17ab 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -184,4 +184,25 @@ static inline unsigned compare_ether_addr_64bits(const u8 addr1[6+2],
 }
 #endif	/* __KERNEL__ */
 
+/**
+ * compare_ether_header - Compare two Ethernet headers
+ * @a: Pointer to Ethernet header
+ * @b: Pointer to Ethernet header
+ *
+ * Compare two ethernet headers, returns 0 if equal.
+ * This assumes that the network header (i.e., IP header) is 4-byte
+ * aligned OR the platform can handle unaligned access.  This is the
+ * case for all packets coming into netif_receive_skb or similar
+ * entry points.
+ */
+
+static inline int compare_ether_header(const void *a, const void *b)
+{
+	u32 *a32 = (u32 *)((u8 *)a + 2);
+	u32 *b32 = (u32 *)((u8 *)b + 2);
+
+	return (*(u16 *)a ^ *(u16 *)b) | (a32[0] ^ b32[0]) |
+	       (a32[1] ^ b32[1]) | (a32[2] ^ b32[2]);
+}
+
 #endif	/* _LINUX_ETHERDEVICE_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e1d482c..c3af50b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1117,6 +1117,13 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb)
 	NAPI_GRO_CB(skb)->data_offset = 0;
 }
 
+static inline void *skb_gro_mac_header(struct sk_buff *skb)
+{
+	return skb_mac_header(skb) < skb->data ? skb_mac_header(skb) :
+	       page_address(skb_shinfo(skb)->frags[0].page) +
+	       skb_shinfo(skb)->frags[0].page_offset;
+}
+
 static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev,
 				  unsigned short type,
 				  const void *daddr, const void *saddr,
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 378fa69..70435af 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -85,7 +85,9 @@ static int vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
 		goto drop;
 
 	for (p = napi->gro_list; p; p = p->next) {
-		NAPI_GRO_CB(p)->same_flow = p->dev == skb->dev;
+		NAPI_GRO_CB(p)->same_flow =
+			p->dev == skb->dev && !compare_ether_header(
+				skb_mac_header(p), skb_gro_mac_header(skb));
 		NAPI_GRO_CB(p)->flush = 0;
 	}
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 330534e..dca2225 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -215,13 +215,6 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 	return &net->dev_index_head[ifindex & ((1 << NETDEV_HASHBITS) - 1)];
 }
 
-static inline void *skb_gro_mac_header(struct sk_buff *skb)
-{
-	return skb_mac_header(skb) < skb->data ? skb_mac_header(skb) :
-	       page_address(skb_shinfo(skb)->frags[0].page) +
-	       skb_shinfo(skb)->frags[0].page_offset;
-}
-
 /* Device list insertion */
 static int list_netdevice(struct net_device *dev)
 {
@@ -2415,29 +2408,16 @@ int dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(ptype, head, list) {
-		struct sk_buff *p;
-		void *mac;
-
 		if (ptype->type != type || ptype->dev || !ptype->gro_receive)
 			continue;
 
 		skb_set_network_header(skb, skb_gro_offset(skb));
-		mac = skb_gro_mac_header(skb);
 		mac_len = skb->network_header - skb->mac_header;
 		skb->mac_len = mac_len;
 		NAPI_GRO_CB(skb)->same_flow = 0;
 		NAPI_GRO_CB(skb)->flush = 0;
 		NAPI_GRO_CB(skb)->free = 0;
 
-		for (p = napi->gro_list; p; p = p->next) {
-			if (!NAPI_GRO_CB(p)->same_flow)
-				continue;
-
-			if (p->mac_len != mac_len ||
-			    memcmp(skb_mac_header(p), mac, mac_len))
-				NAPI_GRO_CB(p)->same_flow = 0;
-		}
-
 		pp = ptype->gro_receive(&napi->gro_list, skb);
 		break;
 	}
@@ -2492,7 +2472,8 @@ static int __napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 	struct sk_buff *p;
 
 	for (p = napi->gro_list; p; p = p->next) {
-		NAPI_GRO_CB(p)->same_flow = 1;
+		NAPI_GRO_CB(p)->same_flow = !compare_ether_header(
+			skb_mac_header(p), skb_gro_mac_header(skb));
 		NAPI_GRO_CB(p)->flush = 0;
 	}
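
A minimal userspace sketch of the same comparison follows; the helper names are hypothetical. It assumes, as the kernel comment does, that the IP header following the 14-byte Ethernet header is 4-byte aligned, so the Ethernet header itself starts 2 bytes past a 4-byte boundary: one 2-byte load at offset 0 leaves the remaining 12 bytes on 4-byte boundaries. Unlike compare_ether_header, which casts pointers directly, the sketch loads through memcpy so that it remains valid portable C for any alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* memcpy-based loads sidestep alignment and strict-aliasing issues in
 * portable C; compilers typically emit a single load instruction. */
static uint16_t load16(const unsigned char *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

static uint32_t load32(const unsigned char *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Returns 0 iff the two 14-byte Ethernet headers are identical.
 * Same split as compare_ether_header(): a 2-byte word at offset 0, then
 * 4-byte words at offsets 2, 6 and 10 (2 + 3 * 4 = 14 bytes). XOR is 0
 * for equal words and OR accumulates any difference without branching. */
static uint32_t eth_hdr_differs(const void *a, const void *b)
{
	const unsigned char *pa = a, *pb = b;

	return (uint32_t)(load16(pa) ^ load16(pb)) |
	       (load32(pa + 2) ^ load32(pb + 2)) |
	       (load32(pa + 6) ^ load32(pb + 6)) |
	       (load32(pa + 10) ^ load32(pb + 10));
}

/* Debug cross-check: agrees with the memcmp() it replaces. */
static void check_against_memcmp(const void *a, const void *b)
{
	assert((eth_hdr_differs(a, b) != 0) == (memcmp(a, b, 14) != 0));
}
```

Flipping any single byte of either header makes the result nonzero, exactly as memcmp over 14 bytes would report.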
 


* [PATCH 3/4] gro: Optimise IPv4 packet reception
From: Herbert Xu @ 2009-02-09  4:00 UTC
  To: David S. Miller, netdev

gro: Optimise IPv4 packet reception

As this function can be called more than half a million times per
second on 10GbE, it's important to optimise it as much as we can.

This patch makes some straightforward changes to use 2-byte and 4-byte
operations instead of byte-oriented ones where possible; for example,
saddr and daddr are now compared as two 4-byte words rather than with
memcmp.  Bit ops also replace logical ops to reduce branching.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/ipv4/af_inet.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c790877..627be4d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1263,7 +1263,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 	if (!ops || !ops->gro_receive)
 		goto out_unlock;
 
-	if (iph->version != 4 || iph->ihl != 5)
+	if (*(u8 *)iph != 0x45)
 		goto out_unlock;
 
 	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
@@ -1281,17 +1281,18 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 
 		iph2 = ip_hdr(p);
 
-		if (iph->protocol != iph2->protocol ||
-		    iph->tos != iph2->tos ||
-		    memcmp(&iph->saddr, &iph2->saddr, 8)) {
+		if ((iph->protocol ^ iph2->protocol) |
+		    (iph->tos ^ iph2->tos) |
+		    (iph->saddr ^ iph2->saddr) |
+		    (iph->daddr ^ iph2->daddr)) {
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
 		}
 
 		/* All fields must match except length and checksum. */
 		NAPI_GRO_CB(p)->flush |=
-			memcmp(&iph->frag_off, &iph2->frag_off, 4) ||
-			(u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) != id;
+			(iph->ttl ^ iph2->ttl) |
+			((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
 
 		NAPI_GRO_CB(p)->flush |= flush;
 	}
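
The two IPv4 tricks can be sketched in userspace C over a raw header buffer using RFC 791 byte offsets. The function and enum names are hypothetical, and the sketch covers only the checks visible in this hunk, not the rest of inet_gro_receive. The first byte of an IPv4 header holds the version in its high nibble and the IHL (header length in 32-bit words) in its low nibble, so 0x45 means version 4 with a 20-byte, option-free header. The flow and flush tests XOR each pair of fields and OR the results, so any difference yields a nonzero value without a branch per field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets in an IPv4 header (RFC 791). */
enum { IP_VIHL = 0, IP_TOS = 1, IP_ID = 4, IP_TTL = 8, IP_PROTO = 9,
       IP_SADDR = 12, IP_DADDR = 16 };

static uint32_t load32(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Version (high nibble) == 4 and IHL (low nibble) == 5, i.e. a 20-byte
 * header without options, tested with a single byte compare. */
static int ipv4_plain_header(const uint8_t *iph)
{
	return iph[IP_VIHL] == 0x45;
}

/* Nonzero if the two headers belong to different flows. */
static uint32_t ipv4_flow_differs(const uint8_t *a, const uint8_t *b)
{
	return (uint32_t)(a[IP_PROTO] ^ b[IP_PROTO]) |
	       (uint32_t)(a[IP_TOS] ^ b[IP_TOS]) |
	       (load32(a + IP_SADDR) ^ load32(b + IP_SADDR)) |
	       (load32(a + IP_DADDR) ^ load32(b + IP_DADDR));
}

/* Nonzero if a same-flow packet must not be merged into the held one:
 * the TTL changed, or the IP ID is not the held packet's ID plus the
 * number of segments already merged (16-bit wraparound included). */
static uint32_t ipv4_must_flush(const uint8_t *held, unsigned int held_count,
				const uint8_t *incoming)
{
	uint16_t id_held = (uint16_t)((held[IP_ID] << 8) | held[IP_ID + 1]);
	uint16_t id_new = (uint16_t)((incoming[IP_ID] << 8) |
				     incoming[IP_ID + 1]);

	assert(held_count >= 1);	/* a held packet counts itself */
	return (uint32_t)(held[IP_TTL] ^ incoming[IP_TTL]) |
	       (uint32_t)((uint16_t)(id_held + held_count) ^ id_new);
}
```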


* [PATCH 4/4] gro: Optimise TCP packet reception
From: Herbert Xu @ 2009-02-09  4:00 UTC
  To: David S. Miller, netdev

gro: Optimise TCP packet reception

As this function can be called more than half a million times per
second on 10GbE, it's important to optimise it as much as we can.

This patch replaces logical ops with bit ops, and open codes memcmp
to exploit alignment properties: the TCP options are now compared
4 bytes at a time.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/ipv4/tcp.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 73266b7..90b2f3c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2478,9 +2478,9 @@ struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 	struct tcphdr *th2;
 	unsigned int thlen;
 	unsigned int flags;
-	unsigned int total;
 	unsigned int mss = 1;
 	int flush = 1;
+	int i;
 
 	th = skb_gro_header(skb, sizeof(*th));
 	if (unlikely(!th))
@@ -2504,7 +2504,7 @@ struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 
 		th2 = tcp_hdr(p);
 
-		if (th->source != th2->source || th->dest != th2->dest) {
+		if ((th->source ^ th2->source) | (th->dest ^ th2->dest)) {
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
 		}
@@ -2519,14 +2519,15 @@ found:
 	flush |= flags & TCP_FLAG_CWR;
 	flush |= (flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
-	flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
-	flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
+	flush |= (th->ack_seq ^ th2->ack_seq) | (th->window ^ th2->window);
+	for (i = sizeof(*th); !flush && i < thlen; i += 4)
+		flush |= *(u32 *)((u8 *)th + i) ^
+			 *(u32 *)((u8 *)th2 + i);
 
-	total = skb_gro_len(p);
 	mss = skb_shinfo(p)->gso_size;
 
-	flush |= skb_gro_len(skb) > mss || !skb_gro_len(skb);
-	flush |= ntohl(th2->seq) + total != ntohl(th->seq);
+	flush |= (skb_gro_len(skb) > mss) | !skb_gro_len(skb);
+	flush |= (ntohl(th2->seq) + skb_gro_len(p)) ^ ntohl(th->seq);
 
 	if (flush || skb_gro_receive(head, skb)) {
 		mss = 1;
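
Below is a userspace sketch of the TCP checks touched by this patch, over raw header bytes with RFC 793 offsets. The names and parameter list are hypothetical: `held_payload` plays the role of skb_gro_len(p) and `mss` that of gso_size, and flag handling is omitted. The sketch compares the data-offset nibbles explicitly so that the options loop never reads past the held header. As in the patch, option words are compared 4 bytes at a time, and the loop stops as soon as a mismatch (or an earlier flush condition) has been found.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets in a TCP header (RFC 793). */
enum { TH_SPORT = 0, TH_DPORT = 2, TH_SEQ = 4, TH_ACK = 8, TH_DOFF = 12,
       TH_WIN = 14, TH_BASE_LEN = 20 };

static uint16_t load16(const uint8_t *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

static uint32_t load32(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Big-endian (network order) 32-bit read, for arithmetic on seq. */
static uint32_t be32(const uint8_t *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | p[3];
}

/* Nonzero unless both source and destination ports match. */
static uint32_t tcp_ports_differ(const uint8_t *a, const uint8_t *b)
{
	return (uint32_t)(load16(a + TH_SPORT) ^ load16(b + TH_SPORT)) |
	       (uint32_t)(load16(a + TH_DPORT) ^ load16(b + TH_DPORT));
}

/* Nonzero if segment th (payload bytes) cannot be appended to the held
 * segment, which already carries held_payload bytes. */
static uint32_t tcp_must_flush(const uint8_t *held, uint32_t held_payload,
			       uint32_t mss, const uint8_t *th,
			       uint32_t payload)
{
	unsigned int thlen = (th[TH_DOFF] >> 4) * 4u;
	uint32_t flush;
	unsigned int i;

	assert(thlen >= TH_BASE_LEN);	/* data offset < 5 is malformed */

	/* Header lengths, ACK number and window must all match. */
	flush = (uint32_t)((th[TH_DOFF] ^ held[TH_DOFF]) & 0xf0) |
		(load32(th + TH_ACK) ^ load32(held + TH_ACK)) |
		(uint32_t)(load16(th + TH_WIN) ^ load16(held + TH_WIN));

	/* Options: 4 bytes at a time, stopping at the first difference. */
	for (i = TH_BASE_LEN; !flush && i < thlen; i += 4)
		flush |= load32(th + i) ^ load32(held + i);

	/* Payload must be non-empty and no larger than the held MSS. */
	flush |= (uint32_t)(payload > mss) | (uint32_t)(payload == 0);

	/* Contiguity: new seq == held seq + bytes already held (mod 2^32). */
	flush |= (be32(held + TH_SEQ) + held_payload) ^ be32(th + TH_SEQ);
	return flush;
}
```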


* Re: [0/4] gro: Optimise GRO receive functions
From: David Miller @ 2009-02-09  6:04 UTC
  To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 9 Feb 2009 14:59:30 +1100

> Hi Dave:
> 
> This is a split up version of a patch that Divy tested earlier.
> I was going to do this earlier but I wanted to track down the
> igb/ixgbe regressions first.
> 
> After this I'm going to look at moving the skb construction calls
> into cxgb3 itself in order to avoid copying the meta data twice.

All applied, thanks Herbert.
