* [0/8] net: Generic Receive Offload
@ 2008-12-12 5:31 Herbert Xu
2008-12-12 5:31 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
` (8 more replies)
0 siblings, 9 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 5:31 UTC
To: David S. Miller, netdev
Hi:
This patch series introduces the Generic Receive Offload system. Just
as GSO generalised TSO, GRO does the same thing for LRO. Most of the
technical details are pulled straight out of LRO, although there is no
code shared between them.
For the moment, GRO is actually more stringent than LRO, meaning that
the set of packets it merges is strictly smaller than LRO's. However,
I don't anticipate this being a huge issue, as testing shows that most
packets worth merging will still qualify under GRO.
The reason GRO strengthens the conditions for merging is to allow the
packets to be refragmented on output. The lack of this ability is what
stopped us from allowing LRO to be used on routers and bridges, and it
is essential for receive offload to be deployed on a virtual host that
uses bridging.
In future we may extend GRO to be less strict. In order to do so it
would be necessary for the network to be able to handle super-packets
that are different from our existing GSO packets. Such super-packets
would retain their individual headers which would be examined and/or
modified where necessary.
As it stands GRO should be a complete no-op unless the user enables
GRO using ethtool (and someone modifies the driver involved to call
napi_gro_receive). Once we're comfortable with GRO we could enable
it automatically whenever RX checksum offload is available.
If this series is accepted, the next step would be to convert all
the relevant drivers to call napi_gro_receive (note that ironically
drivers using netif_rx won't need to be modified to use GRO, though
they won't actually merge anything unless they support RX checksum
offload). I've converted e1000e as an example (also because that's
the only piece of hardware I've got access to right now :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* [PATCH 1/8] net: Add frag_list support to skb_segment
From: Herbert Xu @ 2008-12-12 5:31 UTC
To: David S. Miller, netdev
net: Add frag_list support to skb_segment
This patch adds limited support for handling frag_list packets in
skb_segment. The intention is to support GRO (Generic Receive Offload)
packets which will be constructed by chaining normal packets using
frag_list.
As such, we require that all frag_list members terminate on exact MSS
boundaries. This is checked using BUG_ON.
As there should be only one producer of such packets in the kernel,
namely GRO, this requirement should not be difficult to maintain.
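As an illustration, this boundary invariant can be modelled in plain
userspace C (a sketch with illustrative names, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the skb_segment invariant: a GRO super-packet is a chain of
 * normal packets, and when re-segmenting, every chained packet except
 * the last must be exactly one MSS long so that each maps 1:1 onto an
 * output segment (the last may be shorter). */
static int fraglist_on_mss_boundaries(const unsigned int *lens, size_t n,
                                      unsigned int mss)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n ? lens[i] != mss
                      : (lens[i] == 0 || lens[i] > mss))
            return 0;   /* would trip the BUG_ON in skb_segment */
    }
    return 1;
}
```

For example, a chain of 1448, 1448, 600 byte members segments cleanly
at an MSS of 1448, whereas 600 followed by 1448 does not.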
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
net/core/skbuff.c | 70 +++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 56 insertions(+), 14 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d49ef83..d24c181 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2285,6 +2285,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
{
struct sk_buff *segs = NULL;
struct sk_buff *tail = NULL;
+ struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
unsigned int mss = skb_shinfo(skb)->gso_size;
unsigned int doffset = skb->data - skb_mac_header(skb);
unsigned int offset = doffset;
@@ -2304,7 +2305,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
struct sk_buff *nskb;
skb_frag_t *frag;
int hsize;
- int k;
int size;
len = skb->len - offset;
@@ -2317,9 +2317,34 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
if (hsize > len || !sg)
hsize = len;
- nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
- if (unlikely(!nskb))
- goto err;
+ if (!hsize && i >= nfrags) {
+ BUG_ON(fskb->len != len);
+
+ pos += len;
+ nskb = skb_clone(fskb, GFP_ATOMIC);
+ fskb = fskb->next;
+
+ if (unlikely(!nskb))
+ goto err;
+
+ hsize = skb_end_pointer(nskb) - nskb->head;
+ if (skb_cow_head(nskb, doffset + headroom))
+ goto err;
+
+ nskb->truesize += skb_end_pointer(nskb) - nskb->head -
+ hsize;
+ skb_release_head_state(nskb);
+ __skb_push(nskb, doffset);
+ } else {
+ nskb = alloc_skb(hsize + doffset + headroom,
+ GFP_ATOMIC);
+
+ if (unlikely(!nskb))
+ goto err;
+
+ skb_reserve(nskb, headroom);
+ __skb_put(nskb, doffset);
+ }
if (segs)
tail->next = nskb;
@@ -2330,13 +2355,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
__copy_skb_header(nskb, skb);
nskb->mac_len = skb->mac_len;
- skb_reserve(nskb, headroom);
skb_reset_mac_header(nskb);
skb_set_network_header(nskb, skb->mac_len);
nskb->transport_header = (nskb->network_header +
skb_network_header_len(skb));
- skb_copy_from_linear_data(skb, skb_put(nskb, doffset),
- doffset);
+ skb_copy_from_linear_data(skb, nskb->data, doffset);
+
+ if (pos >= offset + len)
+ continue;
+
if (!sg) {
nskb->ip_summed = CHECKSUM_NONE;
nskb->csum = skb_copy_and_csum_bits(skb, offset,
@@ -2346,14 +2373,11 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
}
frag = skb_shinfo(nskb)->frags;
- k = 0;
skb_copy_from_linear_data_offset(skb, offset,
skb_put(nskb, hsize), hsize);
- while (pos < offset + len) {
- BUG_ON(i >= nfrags);
-
+ while (pos < offset + len && i < nfrags) {
*frag = skb_shinfo(skb)->frags[i];
get_page(frag->page);
size = frag->size;
@@ -2363,20 +2387,38 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
frag->size -= offset - pos;
}
- k++;
+ skb_shinfo(nskb)->nr_frags++;
if (pos + size <= offset + len) {
i++;
pos += size;
} else {
frag->size -= pos + size - (offset + len);
- break;
+ goto skip_fraglist;
}
frag++;
}
- skb_shinfo(nskb)->nr_frags = k;
+ if (pos < offset + len) {
+ struct sk_buff *fskb2 = fskb;
+
+ BUG_ON(pos + fskb->len != offset + len);
+
+ pos += fskb->len;
+ fskb = fskb->next;
+
+ if (fskb2->next) {
+ fskb2 = skb_clone(fskb2, GFP_ATOMIC);
+ if (!fskb2)
+ goto err;
+ } else
+ skb_get(fskb2);
+
+ skb_shinfo(nskb)->frag_list = fskb2;
+ }
+
+skip_fraglist:
nskb->data_len = len - hsize;
nskb->len += nskb->data_len;
nskb->truesize += nskb->data_len;
* [PATCH 2/8] net: Add frag_list support to GSO
From: Herbert Xu @ 2008-12-12 5:31 UTC
To: David S. Miller, netdev
net: Add frag_list support to GSO
This patch allows GSO to handle frag_list in a limited way for the
purposes of allowing packets merged by GRO to be refragmented on
output.
Most hardware won't (and isn't expected to) support handling GRO
frag_list packets directly. Therefore we will perform GSO in
software for those cases.
However, for drivers that can support it (such as virtual NICs) we
may not have to segment the packets at all.
Whether the added overhead of GRO/GSO is worthwhile for bridges
and routers when weighed against the benefit of potentially
increasing the MTU within the host is still an open question.
However, for the case of host nodes this is undoubtedly a win.
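The device-feature decision above can be sketched in userspace C; the
feature bits below are illustrative stand-ins for the kernel's
NETIF_F_* flags, not the real values:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative feature bits (stand-ins, not the kernel's values). */
#define F_FRAGLIST (1u << 0)   /* device accepts frag_list skbs */
#define F_GSO_OK   (1u << 1)   /* device handles this GSO type  */

/* Mirror of the netif_needs_gso() logic after this patch: segment in
 * software if the device can't do the GSO type at all, or if the skb
 * carries a GRO frag_list that the device can't accept. */
static bool needs_sw_gso(bool is_gso, bool has_fraglist,
                         unsigned int dev_features)
{
    if (!is_gso)
        return false;
    return !(dev_features & F_GSO_OK) ||
           (has_fraglist && !(dev_features & F_FRAGLIST));
}
```

A virtual NIC advertising both bits would thus pass GRO frag_list
packets through unsegmented.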
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/netdevice.h | 2 ++
net/core/dev.c | 2 --
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9d77b1d..8bf9127 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1724,6 +1724,8 @@ static inline int netif_needs_gso(struct net_device *dev, struct sk_buff *skb)
{
return skb_is_gso(skb) &&
(!skb_gso_ok(skb, dev->features) ||
+ (skb_shinfo(skb)->frag_list &&
+ !(dev->features & NETIF_F_FRAGLIST)) ||
unlikely(skb->ip_summed != CHECKSUM_PARTIAL));
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 9174c77..4388e27 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1527,8 +1527,6 @@ struct sk_buff *skb_gso_segment(struct sk_buff *skb, int features)
__be16 type = skb->protocol;
int err;
- BUG_ON(skb_shinfo(skb)->frag_list);
-
skb_reset_mac_header(skb);
skb->mac_len = skb->network_header - skb->mac_header;
__skb_pull(skb, skb->mac_len);
* [PATCH 3/8] net: Add Generic Receive Offload infrastructure
From: Herbert Xu @ 2008-12-12 5:31 UTC
To: David S. Miller, netdev
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
Drivers that intend to use this can set the NETIF_F_GRO bit and call
napi_gro_receive instead of netif_receive_skb, or just call netif_rx;
the latter will invoke napi_gro_receive automatically. When
napi_gro_receive is used directly, the driver must either call
napi_complete/netif_rx_complete, or call napi_gro_flush in softirq
context if it uses the primitives __napi_complete/__netif_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive gets the list of currently held
packets. Each packet in the list has a same_flow field which is
non-zero if it is a potential match for the new packet, and a flush
field which is non-zero if the held packet must not be merged with the
new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged
into the existing packet and NULL should be returned, unless the new
skb makes any further merges impossible (e.g., a FIN packet), in which
case the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush, which is called by
napi_complete and netif_rx_complete.
Currently held packets are stored in a singly linked list, just as in
LRO. The list is limited to a maximum of 8 entries. In future, this
may be expanded to use a hash table to allow more flows to be held
for merging.
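The hold-or-flush decision at the tail of napi_gro_receive can be
modelled as a small userspace sketch (the limit mirrors the patch;
the enum and function names are illustrative):

```c
#include <assert.h>

#define MAX_GRO_SKBS 8   /* as in the patch; beyond this, use a hash table */

enum gro_verdict { GRO_MERGED, GRO_HELD, GRO_NORMAL };

/* After the protocol's gro_receive callback has run: a same_flow skb
 * was absorbed into a held packet; otherwise it is either passed up
 * the stack immediately (flush set, or too many flows already held)
 * or kept on gro_list for future merging. */
static enum gro_verdict gro_decide(int same_flow, int flush, int held_count)
{
    if (same_flow)
        return GRO_MERGED;
    if (flush || held_count >= MAX_GRO_SKBS)
        return GRO_NORMAL;
    return GRO_HELD;
}
```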
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/netdevice.h | 74 ++++++------------
include/linux/netpoll.h | 5 -
net/core/dev.c | 186 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 212 insertions(+), 53 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8bf9127..5acd176 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -311,8 +311,9 @@ struct napi_struct {
spinlock_t poll_lock;
int poll_owner;
struct net_device *dev;
- struct list_head dev_list;
#endif
+ struct list_head dev_list;
+ struct sk_buff *gro_list;
};
enum
@@ -372,22 +373,8 @@ static inline int napi_reschedule(struct napi_struct *napi)
*
* Mark NAPI processing as complete.
*/
-static inline void __napi_complete(struct napi_struct *n)
-{
- BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
- list_del(&n->poll_list);
- smp_mb__before_clear_bit();
- clear_bit(NAPI_STATE_SCHED, &n->state);
-}
-
-static inline void napi_complete(struct napi_struct *n)
-{
- unsigned long flags;
-
- local_irq_save(flags);
- __napi_complete(n);
- local_irq_restore(flags);
-}
+extern void __napi_complete(struct napi_struct *n);
+extern void napi_complete(struct napi_struct *n);
/**
* napi_disable - prevent NAPI from scheduling
@@ -495,9 +482,7 @@ struct net_device
unsigned long state;
struct list_head dev_list;
-#ifdef CONFIG_NETPOLL
struct list_head napi_list;
-#endif
/* The device initialization function. Called only once. */
int (*init)(struct net_device *dev);
@@ -521,6 +506,7 @@ struct net_device
#define NETIF_F_LLTX 4096 /* LockLess TX - deprecated. Please */
/* do not use LLTX in new drivers */
#define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */
+#define NETIF_F_GRO 16384 /* Generic receive offload */
#define NETIF_F_LRO 32768 /* large receive offload */
/* Segmentation offload features */
@@ -858,22 +844,8 @@ static inline void *netdev_priv(const struct net_device *dev)
* netif_napi_add() must be used to initialize a napi context prior to calling
* *any* of the other napi related functions.
*/
-static inline void netif_napi_add(struct net_device *dev,
- struct napi_struct *napi,
- int (*poll)(struct napi_struct *, int),
- int weight)
-{
- INIT_LIST_HEAD(&napi->poll_list);
- napi->poll = poll;
- napi->weight = weight;
-#ifdef CONFIG_NETPOLL
- napi->dev = dev;
- list_add(&napi->dev_list, &dev->napi_list);
- spin_lock_init(&napi->poll_lock);
- napi->poll_owner = -1;
-#endif
- set_bit(NAPI_STATE_SCHED, &napi->state);
-}
+void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight);
/**
* netif_napi_del - remove a napi context
@@ -881,12 +853,20 @@ static inline void netif_napi_add(struct net_device *dev,
*
* netif_napi_del() removes a napi context from the network device napi list
*/
-static inline void netif_napi_del(struct napi_struct *napi)
-{
-#ifdef CONFIG_NETPOLL
- list_del(&napi->dev_list);
-#endif
-}
+void netif_napi_del(struct napi_struct *napi);
+
+struct napi_gro_cb {
+ /* This is non-zero if the packet may be of the same flow. */
+ int same_flow;
+
+ /* This is non-zero if the packet cannot be merged with the new skb. */
+ int flush;
+
+ /* Number of segments aggregated. */
+ int count;
+};
+
+#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
struct packet_type {
__be16 type; /* This is really htons(ether_type). */
@@ -898,6 +878,9 @@ struct packet_type {
struct sk_buff *(*gso_segment)(struct sk_buff *skb,
int features);
int (*gso_send_check)(struct sk_buff *skb);
+ struct sk_buff **(*gro_receive)(struct sk_buff **head,
+ struct sk_buff *skb);
+ int (*gro_complete)(struct sk_buff *skb);
void *af_packet_priv;
struct list_head list;
};
@@ -1251,6 +1234,9 @@ extern int netif_rx(struct sk_buff *skb);
extern int netif_rx_ni(struct sk_buff *skb);
#define HAVE_NETIF_RECEIVE_SKB 1
extern int netif_receive_skb(struct sk_buff *skb);
+extern void napi_gro_flush(struct napi_struct *napi);
+extern int napi_gro_receive(struct napi_struct *napi,
+ struct sk_buff *skb);
extern void netif_nit_deliver(struct sk_buff *skb);
extern int dev_valid_name(const char *name);
extern int dev_ioctl(struct net *net, unsigned int cmd, void __user *);
@@ -1495,11 +1481,7 @@ static inline void __netif_rx_complete(struct net_device *dev,
static inline void netif_rx_complete(struct net_device *dev,
struct napi_struct *napi)
{
- unsigned long flags;
-
- local_irq_save(flags);
- __netif_rx_complete(dev, napi);
- local_irq_restore(flags);
+ napi_complete(napi);
}
static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index e3d7959..e38d3c9 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -94,11 +94,6 @@ static inline void netpoll_poll_unlock(void *have)
rcu_read_unlock();
}
-static inline void netpoll_netdev_init(struct net_device *dev)
-{
- INIT_LIST_HEAD(&dev->napi_list);
-}
-
#else
static inline int netpoll_rx(struct sk_buff *skb)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index 4388e27..5e5132c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,9 @@
#include "net-sysfs.h"
+/* Instead of increasing this, you should create a hash table. */
+#define MAX_GRO_SKBS 8
+
/*
* The list of packet types we will receive (as opposed to discard)
* and the routines to invoke.
@@ -2323,6 +2326,122 @@ static void flush_backlog(void *arg)
}
}
+static int napi_gro_complete(struct sk_buff *skb)
+{
+ struct packet_type *ptype;
+ __be16 type = skb->protocol;
+ struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
+ int err = -ENOENT;
+
+ if (!skb_shinfo(skb)->frag_list)
+ goto out;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ptype, head, list) {
+ if (ptype->type != type || ptype->dev || !ptype->gro_complete)
+ continue;
+
+ err = ptype->gro_complete(skb);
+ break;
+ }
+ rcu_read_unlock();
+
+ if (err) {
+ WARN_ON(&ptype->list == head);
+ kfree_skb(skb);
+ return NET_RX_SUCCESS;
+ }
+
+out:
+ __skb_push(skb, -skb_network_offset(skb));
+ return netif_receive_skb(skb);
+}
+
+void napi_gro_flush(struct napi_struct *napi)
+{
+ struct sk_buff *skb, *next;
+
+ for (skb = napi->gro_list; skb; skb = next) {
+ next = skb->next;
+ skb->next = NULL;
+ napi_gro_complete(skb);
+ }
+
+ napi->gro_list = NULL;
+}
+EXPORT_SYMBOL(napi_gro_flush);
+
+int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
+{
+ struct sk_buff **pp;
+ struct packet_type *ptype;
+ __be16 type = skb->protocol;
+ struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
+ int count = 0;
+ int mac_len;
+
+ if (!(skb->dev->features & NETIF_F_GRO))
+ goto normal;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ptype, head, list) {
+ struct sk_buff *p;
+
+ if (ptype->type != type || ptype->dev || !ptype->gro_receive)
+ continue;
+
+ skb_reset_network_header(skb);
+ mac_len = skb->network_header - skb->mac_header;
+ skb->mac_len = mac_len;
+ NAPI_GRO_CB(skb)->same_flow = 0;
+ NAPI_GRO_CB(skb)->flush = 0;
+
+ for (p = napi->gro_list; p; p = p->next) {
+ count++;
+ NAPI_GRO_CB(p)->same_flow =
+ p->mac_len == mac_len &&
+ !memcmp(skb_mac_header(p), skb_mac_header(skb),
+ mac_len);
+ NAPI_GRO_CB(p)->flush = 0;
+ }
+
+ pp = ptype->gro_receive(&napi->gro_list, skb);
+ break;
+ }
+ rcu_read_unlock();
+
+ if (&ptype->list == head)
+ goto normal;
+
+ if (pp) {
+ struct sk_buff *nskb = *pp;
+
+ *pp = nskb->next;
+ nskb->next = NULL;
+ napi_gro_complete(nskb);
+ count--;
+ }
+
+ if (NAPI_GRO_CB(skb)->same_flow)
+ goto ok;
+
+ if (NAPI_GRO_CB(skb)->flush || count >= MAX_GRO_SKBS) {
+ __skb_push(skb, -skb_network_offset(skb));
+ goto normal;
+ }
+
+ NAPI_GRO_CB(skb)->count = 1;
+ skb->next = napi->gro_list;
+ napi->gro_list = skb;
+
+ok:
+ return NET_RX_SUCCESS;
+
+normal:
+ return netif_receive_skb(skb);
+}
+EXPORT_SYMBOL(napi_gro_receive);
+
static int process_backlog(struct napi_struct *napi, int quota)
{
int work = 0;
@@ -2342,9 +2461,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
}
local_irq_enable();
- netif_receive_skb(skb);
+ napi_gro_receive(napi, skb);
} while (++work < quota && jiffies == start_time);
+ napi_gro_flush(napi);
+
return work;
}
@@ -2365,6 +2486,61 @@ void __napi_schedule(struct napi_struct *n)
}
EXPORT_SYMBOL(__napi_schedule);
+void __napi_complete(struct napi_struct *n)
+{
+ BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
+ BUG_ON(n->gro_list);
+
+ list_del(&n->poll_list);
+ smp_mb__before_clear_bit();
+ clear_bit(NAPI_STATE_SCHED, &n->state);
+}
+EXPORT_SYMBOL(__napi_complete);
+
+void napi_complete(struct napi_struct *n)
+{
+ unsigned long flags;
+
+ napi_gro_flush(n);
+ local_irq_save(flags);
+ __napi_complete(n);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(napi_complete);
+
+void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight)
+{
+ INIT_LIST_HEAD(&napi->poll_list);
+ napi->gro_list = NULL;
+ napi->poll = poll;
+ napi->weight = weight;
+ list_add(&napi->dev_list, &dev->napi_list);
+#ifdef CONFIG_NETPOLL
+ napi->dev = dev;
+ spin_lock_init(&napi->poll_lock);
+ napi->poll_owner = -1;
+#endif
+ set_bit(NAPI_STATE_SCHED, &napi->state);
+}
+EXPORT_SYMBOL(netif_napi_add);
+
+void netif_napi_del(struct napi_struct *napi)
+{
+ struct sk_buff *skb, *next;
+
+ list_del(&napi->dev_list);
+
+ for (skb = napi->gro_list; skb; skb = next) {
+ next = skb->next;
+ skb->next = NULL;
+ kfree_skb(skb);
+ }
+
+ napi->gro_list = NULL;
+}
+EXPORT_SYMBOL(netif_napi_del);
+
static void net_rx_action(struct softirq_action *h)
{
@@ -4348,7 +4524,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
netdev_init_queues(dev);
dev->get_stats = internal_stats;
- netpoll_netdev_init(dev);
+ INIT_LIST_HEAD(&dev->napi_list);
setup(dev);
strcpy(dev->name, name);
return dev;
@@ -4365,10 +4541,15 @@ EXPORT_SYMBOL(alloc_netdev_mq);
*/
void free_netdev(struct net_device *dev)
{
+ struct napi_struct *p, *n;
+
release_net(dev_net(dev));
kfree(dev->_tx);
+ list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
+ netif_napi_del(p);
+
/* Compatibility with error handling in drivers */
if (dev->reg_state == NETREG_UNINITIALIZED) {
kfree((char *)dev - dev->padded);
@@ -4904,6 +5085,7 @@ static int __init net_dev_init(void)
queue->backlog.poll = process_backlog;
queue->backlog.weight = weight_p;
+ queue->backlog.gro_list = NULL;
}
netdev_dma_register();
* [PATCH 4/8] ipv4: Add GRO infrastructure
From: Herbert Xu @ 2008-12-12 5:31 UTC
To: David S. Miller, netdev
ipv4: Add GRO infrastructure
This patch adds GRO support for IPv4.
The criteria for merging are more stringent than LRO's: in particular,
we require all fields in the IP header to be identical except for the
length, ID and checksum. In addition, the IDs must form an arithmetic
sequence with a common difference of one.
The ID requirement might seem overly strict; however, most hardware
TSO solutions already obey this rule, and Linux itself obeys it
whether GSO is in use or not.
In future we could relax this rule by storing the IDs (or rather
making sure that we don't drop them when pulling the aggregate
skb's tail).
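The wrap-safe ID check can be illustrated in userspace C, mirroring
the (u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) != id test in the
patch (function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the IPv4 GRO ID check: with 'count' segments already merged
 * into a held packet whose first ID was 'held_id', the next packet's ID
 * must continue the arithmetic sequence modulo 2^16. */
static int ip_id_continues(uint16_t held_id, int count, uint16_t new_id)
{
    return (uint16_t)(held_id + count) == new_id;
}
```

The cast back to 16 bits makes the check survive the ID counter
wrapping from 65535 to 0.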
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/protocol.h | 3 +
net/ipv4/af_inet.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 99 insertions(+)
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 8d024d7..cb2965a 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -39,6 +39,9 @@ struct net_protocol {
int (*gso_send_check)(struct sk_buff *skb);
struct sk_buff *(*gso_segment)(struct sk_buff *skb,
int features);
+ struct sk_buff **(*gro_receive)(struct sk_buff **head,
+ struct sk_buff *skb);
+ int (*gro_complete)(struct sk_buff *skb);
unsigned int no_policy:1,
netns_ok:1;
};
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1aa2dc9..260f081 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -94,6 +94,7 @@
#include <linux/igmp.h>
#include <linux/inetdevice.h>
#include <linux/netdevice.h>
+#include <net/checksum.h>
#include <net/ip.h>
#include <net/protocol.h>
#include <net/arp.h>
@@ -1245,6 +1246,99 @@ out:
return segs;
}
+static struct sk_buff **inet_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb)
+{
+ struct net_protocol *ops;
+ struct sk_buff **pp = NULL;
+ struct sk_buff *p;
+ struct iphdr *iph;
+ int flush = 1;
+ int proto;
+ int id;
+
+ if (unlikely(!pskb_may_pull(skb, sizeof(*iph))))
+ goto out;
+
+ iph = ip_hdr(skb);
+ proto = iph->protocol & (MAX_INET_PROTOS - 1);
+
+ rcu_read_lock();
+ ops = rcu_dereference(inet_protos[proto]);
+ if (!ops || !ops->gro_receive)
+ goto out_unlock;
+
+ if (iph->version != 4 || iph->ihl != 5)
+ goto out_unlock;
+
+ if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
+ goto out_unlock;
+
+ flush = ntohs(iph->tot_len) != skb->len ||
+ iph->frag_off != htons(IP_DF);
+ id = ntohs(iph->id);
+
+ for (p = *head; p; p = p->next) {
+ struct iphdr *iph2;
+
+ if (!NAPI_GRO_CB(p)->same_flow)
+ continue;
+
+ iph2 = ip_hdr(p);
+
+ if (iph->protocol != iph2->protocol ||
+ memcmp(&iph->saddr, &iph2->saddr, 8)) {
+ NAPI_GRO_CB(p)->same_flow = 0;
+ continue;
+ }
+
+ /* All fields must match except length and checksum. */
+ NAPI_GRO_CB(p)->flush |=
+ memcmp(&iph->frag_off, &iph2->frag_off, 4) ||
+ (u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) != id;
+
+ NAPI_GRO_CB(p)->flush |= flush;
+ }
+
+ NAPI_GRO_CB(skb)->flush |= flush;
+ __skb_pull(skb, sizeof(*iph));
+ skb_reset_transport_header(skb);
+
+ pp = ops->gro_receive(head, skb);
+
+out_unlock:
+ rcu_read_unlock();
+
+out:
+ NAPI_GRO_CB(skb)->flush |= flush;
+
+ return pp;
+}
+
+static int inet_gro_complete(struct sk_buff *skb)
+{
+ struct net_protocol *ops;
+ struct iphdr *iph = ip_hdr(skb);
+ int proto = iph->protocol & (MAX_INET_PROTOS - 1);
+ int err = -ENOSYS;
+ __be16 newlen = htons(skb->len - skb_network_offset(skb));
+
+ csum_replace2(&iph->check, iph->tot_len, newlen);
+ iph->tot_len = newlen;
+
+ rcu_read_lock();
+ ops = rcu_dereference(inet_protos[proto]);
+ if (WARN_ON(!ops || !ops->gro_complete))
+ goto out_unlock;
+
+ err = ops->gro_complete(skb);
+
+out_unlock:
+ rcu_read_unlock();
+
+ return err;
+}
+
int inet_ctl_sock_create(struct sock **sk, unsigned short family,
unsigned short type, unsigned char protocol,
struct net *net)
@@ -1411,6 +1505,8 @@ static struct packet_type ip_packet_type = {
.func = ip_rcv,
.gso_send_check = inet_gso_send_check,
.gso_segment = inet_gso_segment,
+ .gro_receive = inet_gro_receive,
+ .gro_complete = inet_gro_complete,
};
static int __init inet_init(void)
* [PATCH 5/8] net: Add skb_gro_receive
From: Herbert Xu @ 2008-12-12 5:31 UTC
To: David S. Miller, netdev
net: Add skb_gro_receive
This patch adds the helper skb_gro_receive to merge packets for
GRO. The current method is to allocate a new header skb and then
chain the original packets to its frag_list. This is done to
make it easier to integrate into the existing GSO framework.
In the future, as GSO is moved into the drivers, we can undo this and
simply chain the original packets together.
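The O(1) chaining this patch implements (the kernel reuses skb->prev
on the head as a tail pointer) can be modelled in userspace C with
illustrative names:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the merge in skb_gro_receive: the head packet
 * keeps a pointer to the last chained packet so that every merge is
 * an O(1) append rather than a list walk. */
struct pkt {
    unsigned int len;
    struct pkt *frag_list;   /* first chained packet */
    struct pkt *next;        /* next packet in the chain */
    struct pkt *tail;        /* last chained packet, for O(1) append */
};

static void gro_merge(struct pkt *head, struct pkt *skb)
{
    if (!head->frag_list)
        head->frag_list = skb;       /* first merge starts the chain */
    else
        head->tail->next = skb;
    head->tail = skb;
    head->len += skb->len;           /* aggregate length grows */
}
```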
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/skbuff.h | 2 +
net/core/skbuff.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2725f4e..2bdb539 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1649,6 +1649,8 @@ extern void skb_split(struct sk_buff *skb,
struct sk_buff *skb1, const u32 len);
extern struct sk_buff *skb_segment(struct sk_buff *skb, int features);
+extern int skb_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb);
static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
int len, void *buffer)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d24c181..c0acb5b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2436,6 +2436,63 @@ err:
EXPORT_SYMBOL_GPL(skb_segment);
+int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+ struct sk_buff *p = *head;
+ struct sk_buff *nskb;
+ unsigned int headroom;
+ unsigned int hlen;
+
+ if (skb_shinfo(p)->frag_list)
+ goto merge;
+
+ headroom = skb_headroom(p);
+ nskb = netdev_alloc_skb(p->dev, headroom);
+ if (unlikely(!nskb))
+ return -ENOMEM;
+
+ __copy_skb_header(nskb, p);
+ nskb->mac_len = p->mac_len;
+
+ skb_reserve(nskb, headroom);
+
+ hlen = p->data - skb_mac_header(p);
+ skb_set_mac_header(nskb, -hlen);
+ skb_set_network_header(nskb, skb_network_offset(p));
+ skb_set_transport_header(nskb, skb_transport_offset(p));
+
+ memcpy(skb_mac_header(nskb), skb_mac_header(p), hlen);
+
+ *NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p);
+ skb_shinfo(nskb)->frag_list = p;
+ skb_header_release(p);
+ nskb->prev = p;
+
+ nskb->data_len += p->len;
+ nskb->truesize += p->len;
+ nskb->len += p->len;
+
+ *head = nskb;
+ nskb->next = p->next;
+ p->next = NULL;
+
+ p = nskb;
+
+merge:
+ NAPI_GRO_CB(p)->count++;
+ p->prev->next = skb;
+ p->prev = skb;
+ skb_header_release(skb);
+
+ p->data_len += skb->len;
+ p->truesize += skb->len;
+ p->len += skb->len;
+
+ NAPI_GRO_CB(skb)->same_flow = 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(skb_gro_receive);
+
void __init skb_init(void)
{
skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
* [PATCH 6/8] tcp: Add GRO support
From: Herbert Xu @ 2008-12-12 5:31 UTC
To: David S. Miller, netdev
tcp: Add GRO support
This patch adds the TCP-specific portion of GRO. The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation. Otherwise this
is pretty much identical to LRO, except that we support the merging
of ECN packets.
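The flag comparison in tcp_gro_receive can be sketched in userspace C;
the FLAG_* values below are illustrative stand-ins for the kernel's
TCP_FLAG_* definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative TCP flag bits (stand-ins, not the kernel's values). */
#define FLAG_FIN 0x01u
#define FLAG_SYN 0x02u
#define FLAG_PSH 0x08u
#define FLAG_ACK 0x10u
#define FLAG_CWR 0x80u

/* Sketch of the flag test: an incoming CWR (ECN congestion mark)
 * always forces a flush, and otherwise the two flag words must be
 * identical outside CWR/FIN/PSH for a merge to proceed. */
static int tcp_flags_force_flush(uint32_t flags, uint32_t flags2)
{
    if (flags & FLAG_CWR)
        return 1;
    return ((flags ^ flags2) & ~(FLAG_CWR | FLAG_FIN | FLAG_PSH)) != 0;
}
```

So a segment differing from the held packet only in PSH can still be
merged, while a SYN or otherwise differing header cannot.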
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/tcp.h | 6 +++
net/ipv4/af_inet.c | 2 +
net/ipv4/tcp.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_ipv4.c | 35 ++++++++++++++++++
4 files changed, 143 insertions(+)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 438014d..cd571a9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1358,6 +1358,12 @@ extern void tcp_v4_destroy_sock(struct sock *sk);
extern int tcp_v4_gso_send_check(struct sk_buff *skb);
extern struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features);
+extern struct sk_buff **tcp_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb);
+extern struct sk_buff **tcp4_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb);
+extern int tcp_gro_complete(struct sk_buff *skb);
+extern int tcp4_gro_complete(struct sk_buff *skb);
#ifdef CONFIG_PROC_FS
extern int tcp4_proc_init(void);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 260f081..dafbfbd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1413,6 +1413,8 @@ static struct net_protocol tcp_protocol = {
.err_handler = tcp_v4_err,
.gso_send_check = tcp_v4_gso_send_check,
.gso_segment = tcp_tso_segment,
+ .gro_receive = tcp4_gro_receive,
+ .gro_complete = tcp4_gro_complete,
.no_policy = 1,
.netns_ok = 1,
};
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c5aca0b..294d838 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2461,6 +2461,106 @@ out:
}
EXPORT_SYMBOL(tcp_tso_segment);
+struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+	struct sk_buff **pp = NULL;
+	struct sk_buff *p;
+	struct tcphdr *th;
+	struct tcphdr *th2;
+	unsigned int thlen;
+	unsigned int flags;
+	unsigned int total;
+	unsigned int mss = 1;
+	int flush = 1;
+
+	if (!pskb_may_pull(skb, sizeof(*th)))
+		goto out;
+
+	th = tcp_hdr(skb);
+	thlen = th->doff * 4;
+	if (thlen < sizeof(*th))
+		goto out;
+
+	if (!pskb_may_pull(skb, thlen))
+		goto out;
+
+	th = tcp_hdr(skb);
+	__skb_pull(skb, thlen);
+
+	flags = tcp_flag_word(th);
+
+	for (; (p = *head); head = &p->next) {
+		if (!NAPI_GRO_CB(p)->same_flow)
+			continue;
+
+		th2 = tcp_hdr(p);
+
+		if (th->source != th2->source || th->dest != th2->dest) {
+			NAPI_GRO_CB(p)->same_flow = 0;
+			continue;
+		}
+
+		goto found;
+	}
+
+	goto out_check_final;
+
+found:
+	flush = NAPI_GRO_CB(p)->flush;
+	flush |= flags & TCP_FLAG_CWR;
+	flush |= (flags ^ tcp_flag_word(th2)) &
+		 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
+	flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
+	flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
+
+	total = p->len;
+	mss = total;
+	if (skb_shinfo(p)->frag_list)
+		mss = skb_shinfo(p)->frag_list->len;
+
+	flush |= skb->len > mss || skb->len <= 0;
+	flush |= ntohl(th2->seq) + total != ntohl(th->seq);
+
+	if (flush || skb_gro_receive(head, skb)) {
+		mss = 1;
+		goto out_check_final;
+	}
+
+	p = *head;
+	th2 = tcp_hdr(p);
+	tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
+
+out_check_final:
+	flush = skb->len < mss;
+	flush |= flags & (TCP_FLAG_URG | TCP_FLAG_PSH | TCP_FLAG_RST |
+			  TCP_FLAG_SYN | TCP_FLAG_FIN);
+
+	if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
+		pp = head;
+
+out:
+	NAPI_GRO_CB(skb)->flush |= flush;
+
+	return pp;
+}
+
+int tcp_gro_complete(struct sk_buff *skb)
+{
+	struct tcphdr *th = tcp_hdr(skb);
+
+	skb->csum_start = skb_transport_header(skb) - skb->head;
+	skb->csum_offset = offsetof(struct tcphdr, check);
+	skb->ip_summed = CHECKSUM_PARTIAL;
+
+	skb_shinfo(skb)->gso_size = skb_shinfo(skb)->frag_list->len;
+	skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
+
+	if (th->cwr)
+		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+
+	return 0;
+}
+
#ifdef CONFIG_TCP_MD5SIG
static unsigned long tcp_md5sig_users;
static struct tcp_md5sig_pool **tcp_md5sig_pool;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5c8fa7f..5b7ce84 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2350,6 +2350,41 @@ void tcp4_proc_exit(void)
}
#endif /* CONFIG_PROC_FS */
+struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+	struct iphdr *iph = ip_hdr(skb);
+
+	switch (skb->ip_summed) {
+	case CHECKSUM_COMPLETE:
+		if (!tcp_v4_check(skb->len, iph->saddr, iph->daddr,
+				  skb->csum)) {
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+			break;
+		}
+
+		/* fall through */
+	case CHECKSUM_NONE:
+		NAPI_GRO_CB(skb)->flush = 1;
+		return NULL;
+	}
+
+	return tcp_gro_receive(head, skb);
+}
+EXPORT_SYMBOL(tcp4_gro_receive);
+
+int tcp4_gro_complete(struct sk_buff *skb)
+{
+	struct iphdr *iph = ip_hdr(skb);
+	struct tcphdr *th = tcp_hdr(skb);
+
+	th->check = ~tcp_v4_check(skb->len - skb_transport_offset(skb),
+				  iph->saddr, iph->daddr, 0);
+	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+
+	return tcp_gro_complete(skb);
+}
+EXPORT_SYMBOL(tcp4_gro_complete);
+
struct proto tcp_prot = {
.name = "TCP",
.owner = THIS_MODULE,
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-12 5:31 [0/8] net: Generic Receive Offload Herbert Xu
` (5 preceding siblings ...)
2008-12-12 5:31 ` [PATCH 6/8] tcp: Add GRO support Herbert Xu
@ 2008-12-12 5:31 ` Herbert Xu
2008-12-12 20:11 ` Ben Hutchings
2008-12-12 5:31 ` [PATCH 8/8] e1000e: Add GRO support Herbert Xu
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
8 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 5:31 UTC (permalink / raw)
To: David S. Miller, netdev
ethtool: Add GGRO and SGRO ops
This patch adds the ethtool ops to enable and disable GRO. It also
makes GRO depend on RX checksum offload much the same as how TSO
depends on SG support.
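The dependency can be sketched as a toy model (hypothetical names; the real checks are in the ethtool_set_gro() and ethtool_set_rx_csum() hunks of this patch): enabling GRO fails unless RX checksumming is on, and turning RX checksumming off implicitly drops GRO.

```c
#include <assert.h>
#include <errno.h>

#define NETIF_F_GRO 0x4000

/* Toy device: one feature word plus an RX-checksum toggle. */
struct dev { unsigned int features; int rx_csum; };

static int set_gro(struct dev *d, int on)
{
	if (on) {
		if (!d->rx_csum)
			return -EINVAL;		/* no RX csum, no GRO */
		d->features |= NETIF_F_GRO;
	} else {
		d->features &= ~NETIF_F_GRO;
	}
	return 0;
}

static void set_rx_csum(struct dev *d, int on)
{
	d->rx_csum = on;
	if (!on)
		d->features &= ~NETIF_F_GRO;	/* mirrors ethtool_set_rx_csum() */
}
```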
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/ethtool.h | 2 +
net/core/ethtool.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 53 insertions(+), 2 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index b4b038b..27c67a5 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -467,6 +467,8 @@ struct ethtool_ops {
#define ETHTOOL_GRXFH 0x00000029 /* Get RX flow hash configuration */
#define ETHTOOL_SRXFH 0x0000002a /* Set RX flow hash configuration */
+#define ETHTOOL_GGRO 0x0000002b /* Get GRO enable (ethtool_value) */
+#define ETHTOOL_SGRO 0x0000002c /* Set GRO enable (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 14ada53..947710a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -528,6 +528,22 @@ static int ethtool_set_tx_csum(struct net_device *dev, char __user *useraddr)
return dev->ethtool_ops->set_tx_csum(dev, edata.data);
}
+static int ethtool_set_rx_csum(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata;
+
+	if (!dev->ethtool_ops->set_rx_csum)
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+
+	if (!edata.data && dev->ethtool_ops->set_sg)
+		dev->features &= ~NETIF_F_GRO;
+
+	return dev->ethtool_ops->set_rx_csum(dev, edata.data);
+}
+
static int ethtool_set_sg(struct net_device *dev, char __user *useraddr)
{
struct ethtool_value edata;
@@ -599,6 +615,34 @@ static int ethtool_set_gso(struct net_device *dev, char __user *useraddr)
return 0;
}
+static int ethtool_get_gro(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata = { ETHTOOL_GGRO };
+
+	edata.data = dev->features & NETIF_F_GRO;
+	if (copy_to_user(useraddr, &edata, sizeof(edata)))
+		return -EFAULT;
+	return 0;
+}
+
+static int ethtool_set_gro(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+
+	if (edata.data) {
+		if (!dev->ethtool_ops->get_rx_csum ||
+		    !dev->ethtool_ops->get_rx_csum(dev))
+			return -EINVAL;
+		dev->features |= NETIF_F_GRO;
+	} else
+		dev->features &= ~NETIF_F_GRO;
+
+	return 0;
+}
+
static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
{
struct ethtool_test test;
@@ -932,8 +976,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
dev->ethtool_ops->get_rx_csum);
break;
case ETHTOOL_SRXCSUM:
- rc = ethtool_set_value(dev, useraddr,
- dev->ethtool_ops->set_rx_csum);
+ rc = ethtool_set_rx_csum(dev, useraddr);
break;
case ETHTOOL_GTXCSUM:
rc = ethtool_get_value(dev, useraddr, ethcmd,
@@ -1014,6 +1057,12 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_SRXFH:
rc = ethtool_set_rxhash(dev, useraddr);
break;
+ case ETHTOOL_GGRO:
+ rc = ethtool_get_gro(dev, useraddr);
+ break;
+ case ETHTOOL_SGRO:
+ rc = ethtool_set_gro(dev, useraddr);
+ break;
default:
rc = -EOPNOTSUPP;
}
* [PATCH 8/8] e1000e: Add GRO support
2008-12-12 5:31 [0/8] net: Generic Receive Offload Herbert Xu
` (6 preceding siblings ...)
2008-12-12 5:31 ` [PATCH 7/8] ethtool: Add GGRO and SGRO ops Herbert Xu
@ 2008-12-12 5:31 ` Herbert Xu
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
8 siblings, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 5:31 UTC (permalink / raw)
To: David S. Miller, netdev
e1000e: Add GRO support
This patch adds GRO support to e1000e by making it invoke napi_gro_receive
instead of netif_receive_skb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
drivers/net/e1000e/netdev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 91795f7..fb7c4bd 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -102,7 +102,7 @@ static void e1000_receive_skb(struct e1000_adapter *adapter,
vlan_hwaccel_receive_skb(skb, adapter->vlgrp,
le16_to_cpu(vlan));
else
- netif_receive_skb(skb);
+ napi_gro_receive(&adapter->napi, skb);
netdev->last_rx = jiffies;
}
* Re: [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-12 5:31 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
@ 2008-12-12 19:46 ` Evgeniy Polyakov
2008-12-12 21:41 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-12 19:46 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
Hi.
Couple of comments below.
On Fri, Dec 12, 2008 at 04:31:49PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> @@ -2317,9 +2317,34 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
> if (hsize > len || !sg)
> hsize = len;
>
> - nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
> - if (unlikely(!nskb))
> - goto err;
> + if (!hsize && i >= nfrags) {
> + BUG_ON(fskb->len != len);
> +
> + pos += len;
> + nskb = skb_clone(fskb, GFP_ATOMIC);
> + fskb = fskb->next;
> +
> + if (unlikely(!nskb))
> + goto err;
> +
> + hsize = skb_end_pointer(nskb) - nskb->head;
> + if (skb_cow_head(nskb, doffset + headroom))
> + goto err;
This should free nskb first.
> + nskb->truesize += skb_end_pointer(nskb) - nskb->head -
> + hsize;
> + skb_release_head_state(nskb);
> + __skb_push(nskb, doffset);
> + } else {
> + nskb = alloc_skb(hsize + doffset + headroom,
> + GFP_ATOMIC);
> +
> + if (unlikely(!nskb))
> + goto err;
> +
The same.
> + skb_reserve(nskb, headroom);
> + __skb_put(nskb, doffset);
> + }
>
> if (segs)
> tail->next = nskb;
> @@ -2330,13 +2355,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
> __copy_skb_header(nskb, skb);
> nskb->mac_len = skb->mac_len;
>
> - skb_reserve(nskb, headroom);
> skb_reset_mac_header(nskb);
> skb_set_network_header(nskb, skb->mac_len);
> nskb->transport_header = (nskb->network_header +
> skb_network_header_len(skb));
> - skb_copy_from_linear_data(skb, skb_put(nskb, doffset),
> - doffset);
> + skb_copy_from_linear_data(skb, nskb->data, doffset);
> +
> + if (pos >= offset + len)
> + continue;
> +
> if (!sg) {
> nskb->ip_summed = CHECKSUM_NONE;
> nskb->csum = skb_copy_and_csum_bits(skb, offset,
> @@ -2346,14 +2373,11 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
> }
>
> frag = skb_shinfo(nskb)->frags;
> - k = 0;
>
> skb_copy_from_linear_data_offset(skb, offset,
> skb_put(nskb, hsize), hsize);
>
> - while (pos < offset + len) {
> - BUG_ON(i >= nfrags);
> -
> + while (pos < offset + len && i < nfrags) {
> *frag = skb_shinfo(skb)->frags[i];
> get_page(frag->page);
> size = frag->size;
> @@ -2363,20 +2387,38 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
> frag->size -= offset - pos;
> }
>
> - k++;
> + skb_shinfo(nskb)->nr_frags++;
>
> if (pos + size <= offset + len) {
> i++;
> pos += size;
> } else {
> frag->size -= pos + size - (offset + len);
> - break;
> + goto skip_fraglist;
> }
>
> frag++;
> }
>
> - skb_shinfo(nskb)->nr_frags = k;
> + if (pos < offset + len) {
> + struct sk_buff *fskb2 = fskb;
> +
> + BUG_ON(pos + fskb->len != offset + len);
> +
> + pos += fskb->len;
> + fskb = fskb->next;
> +
> + if (fskb2->next) {
> + fskb2 = skb_clone(fskb2, GFP_ATOMIC);
> + if (!fskb2)
> + goto err;
> + } else
> + skb_get(fskb2);
> +
> + skb_shinfo(nskb)->frag_list = fskb2;
Is it guaranteed that nskb does not have frags here?
--
Evgeniy Polyakov
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-12 5:31 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
@ 2008-12-12 19:51 ` Evgeniy Polyakov
2008-12-12 21:45 ` Herbert Xu
2008-12-12 22:25 ` Ben Hutchings
1 sibling, 1 reply; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-12 19:51 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 04:31:51PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> +static int napi_gro_complete(struct sk_buff *skb)
> +{
> + struct packet_type *ptype;
> + __be16 type = skb->protocol;
> + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
> + int err = -ENOENT;
> +
> + if (!skb_shinfo(skb)->frag_list)
> + goto out;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ptype, head, list) {
> + if (ptype->type != type || ptype->dev || !ptype->gro_complete)
> + continue;
> +
> + err = ptype->gro_complete(skb);
> + break;
> + }
> + rcu_read_unlock();
> +
> + if (err) {
> + WARN_ON(&ptype->list == head);
> + kfree_skb(skb);
> + return NET_RX_SUCCESS;
Success after freeing skb on error? :)
Why not WARN_ON_ONCE()?
> + }
> +
> +out:
> + __skb_push(skb, -skb_network_offset(skb));
> + return netif_receive_skb(skb);
> +}
> +
> +void napi_gro_flush(struct napi_struct *napi)
> +{
> + struct sk_buff *skb, *next;
> +
> + for (skb = napi->gro_list; skb; skb = next) {
> + next = skb->next;
> + skb->next = NULL;
> + napi_gro_complete(skb);
> + }
> +
> + napi->gro_list = NULL;
> +}
> +EXPORT_SYMBOL(napi_gro_flush);
> +
> +int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> +{
> + struct sk_buff **pp;
> + struct packet_type *ptype;
> + __be16 type = skb->protocol;
> + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
> + int count = 0;
> + int mac_len;
> +
> + if (!(skb->dev->features & NETIF_F_GRO))
> + goto normal;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ptype, head, list) {
> + struct sk_buff *p;
> +
> + if (ptype->type != type || ptype->dev || !ptype->gro_receive)
> + continue;
> +
> + skb_reset_network_header(skb);
> + mac_len = skb->network_header - skb->mac_header;
> + skb->mac_len = mac_len;
> + NAPI_GRO_CB(skb)->same_flow = 0;
> + NAPI_GRO_CB(skb)->flush = 0;
> +
> + for (p = napi->gro_list; p; p = p->next) {
> + count++;
> + NAPI_GRO_CB(p)->same_flow =
> + p->mac_len == mac_len &&
> + !memcmp(skb_mac_header(p), skb_mac_header(skb),
> + mac_len);
> + NAPI_GRO_CB(p)->flush = 0;
> + }
> +
> + pp = ptype->gro_receive(&napi->gro_list, skb);
> + break;
> + }
> + rcu_read_unlock();
> +
> + if (&ptype->list == head)
> + goto normal;
> +
> + if (pp) {
> + struct sk_buff *nskb = *pp;
> +
> + *pp = nskb->next;
> + nskb->next = NULL;
> + napi_gro_complete(nskb);
> + count--;
> + }
> +
> + if (NAPI_GRO_CB(skb)->same_flow)
> + goto ok;
Should this set ->count?
> +
> + if (NAPI_GRO_CB(skb)->flush || count >= MAX_GRO_SKBS) {
> + __skb_push(skb, -skb_network_offset(skb));
> + goto normal;
> + }
> +
> + NAPI_GRO_CB(skb)->count = 1;
Maybe this should be set to 'count', not 1?
> + skb->next = napi->gro_list;
> + napi->gro_list = skb;
> +
> +ok:
> + return NET_RX_SUCCESS;
> +
> +normal:
> + return netif_receive_skb(skb);
> +}
> +EXPORT_SYMBOL(napi_gro_receive);
* Re: [PATCH 6/8] tcp: Add GRO support
2008-12-12 5:31 ` [PATCH 6/8] tcp: Add GRO support Herbert Xu
@ 2008-12-12 19:56 ` Evgeniy Polyakov
2008-12-12 21:46 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-12 19:56 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 04:31:55PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> +struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
> +{
> + struct sk_buff **pp = NULL;
> + struct sk_buff *p;
> + struct tcphdr *th;
> + struct tcphdr *th2;
> + unsigned int thlen;
> + unsigned int flags;
> + unsigned int total;
> + unsigned int mss = 1;
> + int flush = 1;
> +
> + if (!pskb_may_pull(skb, sizeof(*th)))
> + goto out;
> +
> + th = tcp_hdr(skb);
> + thlen = th->doff * 4;
> + if (thlen < sizeof(*th))
> + goto out;
> +
> + if (!pskb_may_pull(skb, thlen))
> + goto out;
> +
> + th = tcp_hdr(skb);
> + __skb_pull(skb, thlen);
> +
> + flags = tcp_flag_word(th);
> +
> + for (; (p = *head); head = &p->next) {
> + if (!NAPI_GRO_CB(p)->same_flow)
> + continue;
> +
> + th2 = tcp_hdr(p);
> +
> + if (th->source != th2->source || th->dest != th2->dest) {
> + NAPI_GRO_CB(p)->same_flow = 0;
> + continue;
> + }
> +
> + goto found;
> + }
> +
> + goto out_check_final;
> +
> +found:
> + flush = NAPI_GRO_CB(p)->flush;
> + flush |= flags & TCP_FLAG_CWR;
> + flush |= (flags ^ tcp_flag_word(th2)) &
> + ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
> + flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
> + flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
> +
> + total = p->len;
> + mss = total;
> + if (skb_shinfo(p)->frag_list)
> + mss = skb_shinfo(p)->frag_list->len;
> +
> + flush |= skb->len > mss || skb->len <= 0;
> + flush |= ntohl(th2->seq) + total != ntohl(th->seq);
> +
No timestamp check?
> + if (flush || skb_gro_receive(head, skb)) {
> + mss = 1;
> + goto out_check_final;
> + }
> +
> + p = *head;
> + th2 = tcp_hdr(p);
> + tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
> +
> +out_check_final:
> + flush = skb->len < mss;
> + flush |= flags & (TCP_FLAG_URG | TCP_FLAG_PSH | TCP_FLAG_RST |
> + TCP_FLAG_SYN | TCP_FLAG_FIN);
> +
> + if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
> + pp = head;
> +
> +out:
> + NAPI_GRO_CB(skb)->flush |= flush;
> +
> + return pp;
> +}
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-12 5:31 ` [PATCH 7/8] ethtool: Add GGRO and SGRO ops Herbert Xu
@ 2008-12-12 20:11 ` Ben Hutchings
2008-12-12 21:48 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Ben Hutchings @ 2008-12-12 20:11 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:
> ethtool: Add GGRO and SGRO ops
>
> This patch adds the ethtool ops to enable and disable GRO. It also
> makes GRO depend on RX checksum offload much the same as how TSO
> depends on SG support.
[...]
Why don't you add NETIF_F_GRO to the flags handled by get_flags() and
set_flags()?
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-12 19:46 ` Evgeniy Polyakov
@ 2008-12-12 21:41 ` Herbert Xu
2008-12-13 2:38 ` Evgeniy Polyakov
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 21:41 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 10:46:18PM +0300, Evgeniy Polyakov wrote:
>
> On Fri, Dec 12, 2008 at 04:31:49PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > @@ -2317,9 +2317,34 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
> > if (hsize > len || !sg)
> > hsize = len;
> >
> > - nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
> > - if (unlikely(!nskb))
> > - goto err;
> > + if (!hsize && i >= nfrags) {
> > + BUG_ON(fskb->len != len);
> > +
> > + pos += len;
> > + nskb = skb_clone(fskb, GFP_ATOMIC);
> > + fskb = fskb->next;
> > +
> > + if (unlikely(!nskb))
> > + goto err;
> > +
> > + hsize = skb_end_pointer(nskb) - nskb->head;
> > + if (skb_cow_head(nskb, doffset + headroom))
> > + goto err;
>
> This should free nskb first.
Good catch! I'll get this fixed.
> > + nskb->truesize += skb_end_pointer(nskb) - nskb->head -
> > + hsize;
> > + skb_release_head_state(nskb);
> > + __skb_push(nskb, doffset);
> > + } else {
> > + nskb = alloc_skb(hsize + doffset + headroom,
> > + GFP_ATOMIC);
> > +
> > + if (unlikely(!nskb))
> > + goto err;
> > +
>
> The same.
nskb is NULL here.
> > + if (pos < offset + len) {
> > + struct sk_buff *fskb2 = fskb;
> > +
> > + BUG_ON(pos + fskb->len != offset + len);
> > +
> > + pos += fskb->len;
> > + fskb = fskb->next;
> > +
> > + if (fskb2->next) {
> > + fskb2 = skb_clone(fskb2, GFP_ATOMIC);
> > + if (!fskb2)
> > + goto err;
> > + } else
> > + skb_get(fskb2);
> > +
> > + skb_shinfo(nskb)->frag_list = fskb2;
>
> Is it guaranteed that nskb does not have frags here?
This bit only gets executed the first time we hit the frag_list
in the original skb. Therefore we've just allocated nskb for
the very first time. I'll add a BUG_ON for you :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-12 19:51 ` Evgeniy Polyakov
@ 2008-12-12 21:45 ` Herbert Xu
0 siblings, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 21:45 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 10:51:30PM +0300, Evgeniy Polyakov wrote:
> On Fri, Dec 12, 2008 at 04:31:51PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > +static int napi_gro_complete(struct sk_buff *skb)
> > +{
> > + struct packet_type *ptype;
> > + __be16 type = skb->protocol;
> > + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
> > + int err = -ENOENT;
> > +
> > + if (!skb_shinfo(skb)->frag_list)
> > + goto out;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(ptype, head, list) {
> > + if (ptype->type != type || ptype->dev || !ptype->gro_complete)
> > + continue;
> > +
> > + err = ptype->gro_complete(skb);
> > + break;
> > + }
> > + rcu_read_unlock();
> > +
> > + if (err) {
> > + WARN_ON(&ptype->list == head);
> > + kfree_skb(skb);
> > + return NET_RX_SUCCESS;
>
> Success after freeing skb on error? :)
These return values are for congestion control rather than errors.
> Why not WARN_ON_ONCE()?
If this ever gets run we're in big trouble :) And if you're hacking
protocols and trigger this then you want to see all the backtraces.
> > + if (NAPI_GRO_CB(skb)->same_flow)
> > + goto ok;
>
> Should this set ->count?
Nope, same_flow == 1 means that the skb was merged into an existing
entry. ->count is only set for the skb that consumed this one.
> > +
> > + if (NAPI_GRO_CB(skb)->flush || count >= MAX_GRO_SKBS) {
> > + __skb_push(skb, -skb_network_offset(skb));
> > + goto normal;
> > + }
> > +
> > + NAPI_GRO_CB(skb)->count = 1;
>
> Maybe this should be set to 'count', not 1?
No, ->count is for how many original skb's we have merged into
skb, while the local count is for how many held skb's we have
in total (used to limit the length of that list).
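A toy model of the two counters (illustrative only; names and shapes are stand-ins for NAPI_GRO_CB(skb)->count and the local variable in napi_gro_receive()):

```c
#include <assert.h>

#define MAX_GRO_SKBS 8

struct held { int count; };	/* stand-in for NAPI_GRO_CB(skb)->count */

/* One packet arrives: a same-flow packet folds into its consumer and
 * bumps that consumer's count; a new flow becomes a fresh held entry
 * with count = 1.  The list length is capped at MAX_GRO_SKBS (the real
 * code flushes instead once the list is full). */
static int receive(struct held *list, int *list_len, int same_flow_idx)
{
	if (same_flow_idx >= 0) {
		list[same_flow_idx].count++;	/* merged, list unchanged */
	} else if (*list_len < MAX_GRO_SKBS) {
		list[*list_len].count = 1;	/* new held flow */
		(*list_len)++;
	}
	return *list_len;
}
```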
Cheers,
* Re: [PATCH 6/8] tcp: Add GRO support
2008-12-12 19:56 ` Evgeniy Polyakov
@ 2008-12-12 21:46 ` Herbert Xu
2008-12-13 2:40 ` Evgeniy Polyakov
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 21:46 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 10:56:15PM +0300, Evgeniy Polyakov wrote:
>
> > +found:
> > + flush = NAPI_GRO_CB(p)->flush;
> > + flush |= flags & TCP_FLAG_CWR;
> > + flush |= (flags ^ tcp_flag_word(th2)) &
> > + ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
> > + flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
> > + flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
> > +
> > + total = p->len;
> > + mss = total;
> > + if (skb_shinfo(p)->frag_list)
> > + mss = skb_shinfo(p)->frag_list->len;
> > +
> > + flush |= skb->len > mss || skb->len <= 0;
> > + flush |= ntohl(th2->seq) + total != ntohl(th->seq);
> > +
>
> No timestamp check?
The memcmp does that for us.
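For illustration: the TCP timestamp option lives in the option bytes after the fixed header, which is exactly the region that memcmp(th + 1, th2 + 1, ...) covers, so any timestamp difference forces a flush. A minimal sketch (the option encoding below follows the TCP timestamp option layout, kind 8 / length 10, padded with NOPs):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write a 12-byte option area: NOP, NOP, then a timestamp option. */
static void put_tsopt(uint8_t *opt, uint32_t tsval, uint32_t tsecr)
{
	opt[0] = 1; opt[1] = 1;		/* NOP, NOP padding */
	opt[2] = 8; opt[3] = 10;	/* TCPOPT_TIMESTAMP, length */
	memcpy(opt + 4, &tsval, 4);
	memcpy(opt + 8, &tsecr, 4);
}
```

Two segments whose timestamps differ by even one tick produce a nonzero memcmp over the option area, so they never merge.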
Cheers,
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-12 20:11 ` Ben Hutchings
@ 2008-12-12 21:48 ` Herbert Xu
2008-12-12 22:35 ` Ben Hutchings
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 21:48 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 08:11:17PM +0000, Ben Hutchings wrote:
> On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:
> > ethtool: Add GGRO and SGRO ops
> >
> > This patch adds the ethtool ops to enable and disable GRO. It also
> > makes GRO depend on RX checksum offload much the same as how TSO
> > depends on SG support.
> [...]
>
> Why don't you add NETIF_F_GRO to the flags handled by get_flags() and
> set_flags()?
Surely the patch itself has answered this :) It's because of the
dependency on RX checksum offload. We want GRO to be off unless
RX checksum offload is on.
Cheers,
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-12 5:31 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
2008-12-12 19:51 ` Evgeniy Polyakov
@ 2008-12-12 22:25 ` Ben Hutchings
2008-12-12 22:56 ` Herbert Xu
1 sibling, 1 reply; 61+ messages in thread
From: Ben Hutchings @ 2008-12-12 22:25 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:
[...]
> Whenever the skb is merged into an existing entry, the gro_receive
> function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
> merely matches an existing entry but can't be merged with it, then
> this shouldn't be set.
So why not call this field "merged"?
[...]
> Once gro_receive has determined that the new skb matches a held packet,
> the held packet may be processed immediately if the new skb cannot be
> merged with it. In this case gro_receive should return the pointer to
> the existing skb in gro_list. Otherwise the new skb should be merged into
> the existing packet and NULL should be returned, unless the new skb makes
> it impossible for any further merges to be made (e.g., FIN packet) where
> the merged skb should be returned.
This belongs in a kernel-doc comment, not in the commit message.
[...]
> Currently held packets are stored in a singly liked list just like LRO.
> The list is limited to a maximum of 8 entries. In future, this may be
> expanded to use a hash table to allow more flows to be held for merging.
We used a hash table in our own soft-LRO, shipped in out-of-tree driver
releases. This certainly improved performance in many-to-one
benchmarks. How much it matters in real applications, I'm less sure.
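A sketch of that many-to-one lookup with a small hash table (all names here are illustrative, not from any driver): index held flows by a hash of the 4-tuple instead of walking a linear list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRO_HASH_BUCKETS 16

struct flow {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	struct flow *next;	/* per-bucket collision chain */
};

static unsigned int flow_hash(uint32_t saddr, uint32_t daddr,
			      uint16_t sport, uint16_t dport)
{
	uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

	h ^= h >> 16;
	return h & (GRO_HASH_BUCKETS - 1);
}

/* Find a held flow in O(1) expected time instead of O(list length). */
static struct flow *flow_lookup(struct flow **table, uint32_t saddr,
				uint32_t daddr, uint16_t sport, uint16_t dport)
{
	struct flow *f = table[flow_hash(saddr, daddr, sport, dport)];

	for (; f; f = f->next)
		if (f->saddr == saddr && f->daddr == daddr &&
		    f->sport == sport && f->dport == dport)
			return f;
	return NULL;
}
```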
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4388e27..5e5132c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[...]
> +int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> +{
> + struct sk_buff **pp;
> + struct packet_type *ptype;
> + __be16 type = skb->protocol;
> + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
Are you intending for the VLAN driver to call napi_gro_receive()? If
not, I think this should treat VLAN tags as part of the MAC header.
Not every NIC separates them out!
> + int count = 0;
> + int mac_len;
> +
> + if (!(skb->dev->features & NETIF_F_GRO))
> + goto normal;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ptype, head, list) {
> + struct sk_buff *p;
> +
> + if (ptype->type != type || ptype->dev || !ptype->gro_receive)
> + continue;
> +
> + skb_reset_network_header(skb);
> + mac_len = skb->network_header - skb->mac_header;
> + skb->mac_len = mac_len;
> + NAPI_GRO_CB(skb)->same_flow = 0;
> + NAPI_GRO_CB(skb)->flush = 0;
> + for (p = napi->gro_list; p; p = p->next) {
> + count++;
> + NAPI_GRO_CB(p)->same_flow =
> + p->mac_len == mac_len &&
> + !memcmp(skb_mac_header(p), skb_mac_header(skb),
> + mac_len);
> + NAPI_GRO_CB(p)->flush = 0;
Is this assignment to flush really necessary? Surely any skb on the
gro_list with flush == 1 gets removed before the next call to
napi_gro_receive()?
> + }
> +
> + pp = ptype->gro_receive(&napi->gro_list, skb);
> + break;
> +
> + }
> + rcu_read_unlock();
> +
> + if (&ptype->list == head)
> + goto normal;
The above loop is unclear because most of the body is supposed to run at
most once; I would suggest writing the loop and the failure case as:
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list)
if (ptype->type == type && !ptype->dev && ptype->gro_receive)
break;
if (&ptype->list == head) {
rcu_read_unlock();
goto normal;
}
and then moving the rest of the loop body after this.
The inet_lro code accepts either skbs or pages and the sfc driver takes
advantage of this: so long as most packets can be coalesced by LRO, it's
cheaper to allocate page buffers in advance and then attach them to skbs
during LRO. I think you should support the use of page buffers.
Obviously it adds complexity but there's a real performance benefit.
(Alternately you could work out how to make skb allocation cheaper, and
everyone would be happy!)
[...]
> +void netif_napi_del(struct napi_struct *napi)
> +{
> + struct sk_buff *skb, *next;
> +
> + list_del(&napi->dev_list);
> +
> + for (skb = napi->gro_list; skb; skb = next) {
> + next = skb->next;
> + skb->next = NULL;
> + kfree_skb(skb);
> + }
[...]
Shouldn't the list already be empty at this point?
Ben.
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-12 21:48 ` Herbert Xu
@ 2008-12-12 22:35 ` Ben Hutchings
2008-12-12 22:49 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Ben Hutchings @ 2008-12-12 22:35 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Sat, 2008-12-13 at 08:48 +1100, Herbert Xu wrote:
> On Fri, Dec 12, 2008 at 08:11:17PM +0000, Ben Hutchings wrote:
> > On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:
> > > ethtool: Add GGRO and SGRO ops
> > >
> > > This patch adds the ethtool ops to enable and disable GRO. It also
> > > makes GRO depend on RX checksum offload much the same as how TSO
> > > depends on SG support.
> > [...]
> >
> > Why don't you add NETIF_F_GRO to the flags handled by get_flags() and
> > set_flags()?
>
> Surely the patch itself has answered this :) It's because of the
> dependency on RX checksum offload. We want GRO to be off unless
> RX checksum offload is on.
If I'm not mistaken, the whole point of set_flags() is to end the
continued expansion of struct ethtool_ops by another 2 functions for
every new offload feature. The comments make it fairly clear that Jeff
anticipated that it might be necessary to include such checks for some
flags.
Ben.
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-12 22:35 ` Ben Hutchings
@ 2008-12-12 22:49 ` Herbert Xu
2008-12-14 19:36 ` Waskiewicz Jr, Peter P
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 22:49 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 10:35:27PM +0000, Ben Hutchings wrote:
>
> If I'm not mistaken, the whole point of set_flags() is to end the
> continued expansion of struct ethtool_ops by another 2 functions for
> every new offload feature. The comments make it fairly clear that Jeff
> anticipated that it might be necessary to include such checks for some
> flags.
GRO is purely a stack feature so it doesn't need anything in
ethtool_ops at all.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 4/8] ipv4: Add GRO infrastructure
2008-12-12 5:31 ` [PATCH 4/8] ipv4: Add GRO infrastructure Herbert Xu
@ 2008-12-12 22:55 ` Ben Hutchings
2008-12-12 23:04 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Ben Hutchings @ 2008-12-12 22:55 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:
[...]
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 1aa2dc9..260f081 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -94,6 +94,7 @@
> #include <linux/igmp.h>
> #include <linux/inetdevice.h>
> #include <linux/netdevice.h>
> +#include <net/checksum.h>
> #include <net/ip.h>
> #include <net/protocol.h>
> #include <net/arp.h>
> @@ -1245,6 +1246,99 @@ out:
> return segs;
> }
>
> +static struct sk_buff **inet_gro_receive(struct sk_buff **head,
> + struct sk_buff *skb)
> +{
> + struct net_protocol *ops;
> + struct sk_buff **pp = NULL;
> + struct sk_buff *p;
> + struct iphdr *iph;
> + int flush = 1;
> + int proto;
> + int id;
> +
> + if (unlikely(!pskb_may_pull(skb, sizeof(*iph))))
> + goto out;
> +
> + iph = ip_hdr(skb);
> + proto = iph->protocol & (MAX_INET_PROTOS - 1);
> +
> + rcu_read_lock();
> + ops = rcu_dereference(inet_protos[proto]);
> + if (!ops || !ops->gro_receive)
> + goto out_unlock;
> +
> + if (iph->version != 4 || iph->ihl != 5)
> + goto out_unlock;
> +
> + if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> + goto out_unlock;
Don't we also need to check that skb->len >= iph->ihl * 4?
> + flush = ntohs(iph->tot_len) != skb->len ||
> + iph->frag_off != htons(IP_DF);
> + id = ntohs(iph->id);
> +
> + for (p = *head; p; p = p->next) {
> + struct iphdr *iph2;
> +
> + if (!NAPI_GRO_CB(p)->same_flow)
> + continue;
> +
> + iph2 = ip_hdr(p);
> +
> + if (iph->protocol != iph2->protocol ||
> + memcmp(&iph->saddr, &iph2->saddr, 8)) {
> + NAPI_GRO_CB(p)->same_flow = 0;
> + continue;
> + }
> +
> + /* All fields must match except length and checksum. */
> + NAPI_GRO_CB(p)->flush |=
> + memcmp(&iph->frag_off, &iph2->frag_off, 4) ||
This covers frag_off, ttl and protocol. But frag_off and protocol have
already been covered, and tos seems to have been missed.
> + (u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) != id;
> + NAPI_GRO_CB(p)->flush |= flush;
> + }
> +
> + NAPI_GRO_CB(skb)->flush |= flush;
> + __skb_pull(skb, sizeof(*iph));
> + skb_reset_transport_header(skb);
> +
> + pp = ops->gro_receive(head, skb);
> +
> +out_unlock:
> + rcu_read_unlock();
> +
> +out:
> + NAPI_GRO_CB(skb)->flush |= flush;
> +
> + return pp;
> +}
[...]
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-12 22:25 ` Ben Hutchings
@ 2008-12-12 22:56 ` Herbert Xu
2008-12-12 23:11 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 22:56 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 10:25:24PM +0000, Ben Hutchings wrote:
> On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:
> > Whenever the skb is merged into an existing entry, the gro_receive
> > function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
> > merely matches an existing entry but can't be merged with it, then
> > this shouldn't be set.
>
> So why not call this field "merged"?
Because when it's set on held skbs it indicates whether the new
skb may be of the same flow.
> > +int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> > +{
> > + struct sk_buff **pp;
> > + struct packet_type *ptype;
> > + __be16 type = skb->protocol;
> > + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
>
> Are you intending for the VLAN driver to call napi_gro_receive()? If
> not, I think this should treat VLAN tags as part of the MAC header.
> Not every NIC separates them out!
Yes, the idea is to look at the VLAN+Ethernet header in future.
> > + for (p = napi->gro_list; p; p = p->next) {
> > + count++;
> > + NAPI_GRO_CB(p)->same_flow =
> > + p->mac_len == mac_len &&
> > + !memcmp(skb_mac_header(p), skb_mac_header(skb),
> > + mac_len);
> > + NAPI_GRO_CB(p)->flush = 0;
>
> Is this assignment to flush really necessary? Surely any skb on the
> gro_list with flush == 1 gets removed before the next call to
> napi_gro_receive()?
Yes, it is necessary. Only the entry with flush != 0 that is of the
same flow as the new skb will be removed.
> The inet_lro code accepts either skbs or pages and the sfc driver takes
> advantage of this: so long as most packets can be coalesced by LRO, it's
> cheaper to allocate page buffers in advance and then attach them to skbs
> during LRO. I think you should support the use of page buffers.
> Obviously it adds complexity but there's a real performance benefit.
> (Alternately you could work out how to make skb allocation cheaper, and
> everyone would be happy!)
This is something which I will look at in future. The top priority
right now is to get bridging to work with GRO.
> Shouldn't the list already be empty at this point?
It should be. However, if your driver is buggy and forgets to
call napi_complete (or calls __napi_complete without napi_gro_flush)
then we'll need this.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 4/8] ipv4: Add GRO infrastructure
2008-12-12 22:55 ` Ben Hutchings
@ 2008-12-12 23:04 ` Herbert Xu
0 siblings, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 23:04 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David S. Miller, netdev
On Fri, Dec 12, 2008 at 10:55:40PM +0000, Ben Hutchings wrote:
>
> > + if (unlikely(!pskb_may_pull(skb, sizeof(*iph))))
> > + goto out;
> > +
> > + iph = ip_hdr(skb);
> > + proto = iph->protocol & (MAX_INET_PROTOS - 1);
> > +
> > + rcu_read_lock();
> > + ops = rcu_dereference(inet_protos[proto]);
> > + if (!ops || !ops->gro_receive)
> > + goto out_unlock;
> > +
> > + if (iph->version != 4 || iph->ihl != 5)
> > + goto out_unlock;
> > +
> > + if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> > + goto out_unlock;
>
> Don't we also need to check that skb->len >= iph->ihl * 4?
We already did, through pskb_may_pull and the iph->ihl != 5 check.
> > + if (iph->protocol != iph2->protocol ||
> > + memcmp(&iph->saddr, &iph2->saddr, 8)) {
> > + NAPI_GRO_CB(p)->same_flow = 0;
> > + continue;
> > + }
> > +
> > + /* All fields must match except length and checksum. */
> > + NAPI_GRO_CB(p)->flush |=
> > + memcmp(&iph->frag_off, &iph2->frag_off, 4) ||
>
> This covers frag_off, ttl and protocol. But frag_off and protocol have
> already been covered, and tos seems to have been missed.
No, we haven't checked frag_off yet; we've only checked the DF bit.
In any case, checking twice doesn't hurt. Good catch on the missing
tos check, though. I'll get that added.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-12 22:56 ` Herbert Xu
@ 2008-12-12 23:11 ` Herbert Xu
2008-12-13 3:43 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-12 23:11 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 09:56:59AM +1100, Herbert Xu wrote:
>
> > The inet_lro code accepts either skbs or pages and the sfc driver takes
> > advantage of this: so long as most packets can be coalesced by LRO, it's
> > cheaper to allocate page buffers in advance and then attach them to skbs
> > during LRO. I think you should support the use of page buffers.
> > Obviously it adds complexity but there's a real performance benefit.
> > (Alternately you could work out how to make skb allocation cheaper, and
> > everyone would be happy!)
>
> This is something which I will look at in future. The top priority
> right now is to get bridging to work with GRO.
BTW this should be pretty easy to implement through a second entry
point, e.g., napi_gro_receive_pages(), which works just like its
LRO counterpart lro_receive_frags. This would have its own
protocol hooks so it doesn't need to do anything nasty to get
at the packet headers.
But I'd like to get the main infrastructure nailed down first
before we tack on bells and whistles like this and VLAN support.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [0/8] net: Generic Receive Offload
2008-12-12 5:31 [0/8] net: Generic Receive Offload Herbert Xu
` (7 preceding siblings ...)
2008-12-12 5:31 ` [PATCH 8/8] e1000e: Add GRO support Herbert Xu
@ 2008-12-13 1:34 ` Herbert Xu
2008-12-13 1:35 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
` (7 more replies)
8 siblings, 8 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:34 UTC (permalink / raw)
To: David S. Miller, netdev
On Fri, Dec 12, 2008 at 04:31:16PM +1100, Herbert Xu wrote:
>
> This patch series introduces the Generic Receive Offload system. Just
> as GSO generalised TSO, GRO does the same thing for LRO. Most of the
> technical details are pulled straight out of LRO, although there is no
> code shared between them.
I've incorporated fixes for the bugs reported against this series;
here is an updated version.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-16 7:27 ` David Miller
2008-12-13 1:35 ` [PATCH 2/8] net: Add frag_list support to GSO Herbert Xu
` (6 subsequent siblings)
7 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
net: Add frag_list support to skb_segment
This patch adds limited support for handling frag_list packets in
skb_segment. The intention is to support GRO (Generic Receive Offload)
packets which will be constructed by chaining normal packets using
frag_list.
As such we require that all frag_list members terminate on exact MSS
boundaries. This is enforced using BUG_ON.
As there should only be one producer in the kernel of such packets,
namely GRO, this requirement should not be difficult to maintain.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
net/core/skbuff.c | 73 +++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 59 insertions(+), 14 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d49ef83..cf05a8c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2285,6 +2285,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
{
struct sk_buff *segs = NULL;
struct sk_buff *tail = NULL;
+ struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
unsigned int mss = skb_shinfo(skb)->gso_size;
unsigned int doffset = skb->data - skb_mac_header(skb);
unsigned int offset = doffset;
@@ -2304,7 +2305,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
struct sk_buff *nskb;
skb_frag_t *frag;
int hsize;
- int k;
int size;
len = skb->len - offset;
@@ -2317,9 +2317,36 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
if (hsize > len || !sg)
hsize = len;
- nskb = alloc_skb(hsize + doffset + headroom, GFP_ATOMIC);
- if (unlikely(!nskb))
- goto err;
+ if (!hsize && i >= nfrags) {
+ BUG_ON(fskb->len != len);
+
+ pos += len;
+ nskb = skb_clone(fskb, GFP_ATOMIC);
+ fskb = fskb->next;
+
+ if (unlikely(!nskb))
+ goto err;
+
+ hsize = skb_end_pointer(nskb) - nskb->head;
+ if (skb_cow_head(nskb, doffset + headroom)) {
+ kfree_skb(nskb);
+ goto err;
+ }
+
+ nskb->truesize += skb_end_pointer(nskb) - nskb->head -
+ hsize;
+ skb_release_head_state(nskb);
+ __skb_push(nskb, doffset);
+ } else {
+ nskb = alloc_skb(hsize + doffset + headroom,
+ GFP_ATOMIC);
+
+ if (unlikely(!nskb))
+ goto err;
+
+ skb_reserve(nskb, headroom);
+ __skb_put(nskb, doffset);
+ }
if (segs)
tail->next = nskb;
@@ -2330,13 +2357,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
__copy_skb_header(nskb, skb);
nskb->mac_len = skb->mac_len;
- skb_reserve(nskb, headroom);
skb_reset_mac_header(nskb);
skb_set_network_header(nskb, skb->mac_len);
nskb->transport_header = (nskb->network_header +
skb_network_header_len(skb));
- skb_copy_from_linear_data(skb, skb_put(nskb, doffset),
- doffset);
+ skb_copy_from_linear_data(skb, nskb->data, doffset);
+
+ if (pos >= offset + len)
+ continue;
+
if (!sg) {
nskb->ip_summed = CHECKSUM_NONE;
nskb->csum = skb_copy_and_csum_bits(skb, offset,
@@ -2346,14 +2375,11 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
}
frag = skb_shinfo(nskb)->frags;
- k = 0;
skb_copy_from_linear_data_offset(skb, offset,
skb_put(nskb, hsize), hsize);
- while (pos < offset + len) {
- BUG_ON(i >= nfrags);
-
+ while (pos < offset + len && i < nfrags) {
*frag = skb_shinfo(skb)->frags[i];
get_page(frag->page);
size = frag->size;
@@ -2363,20 +2389,39 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
frag->size -= offset - pos;
}
- k++;
+ skb_shinfo(nskb)->nr_frags++;
if (pos + size <= offset + len) {
i++;
pos += size;
} else {
frag->size -= pos + size - (offset + len);
- break;
+ goto skip_fraglist;
}
frag++;
}
- skb_shinfo(nskb)->nr_frags = k;
+ if (pos < offset + len) {
+ struct sk_buff *fskb2 = fskb;
+
+ BUG_ON(pos + fskb->len != offset + len);
+
+ pos += fskb->len;
+ fskb = fskb->next;
+
+ if (fskb2->next) {
+ fskb2 = skb_clone(fskb2, GFP_ATOMIC);
+ if (!fskb2)
+ goto err;
+ } else
+ skb_get(fskb2);
+
+ BUG_ON(skb_shinfo(nskb)->frag_list);
+ skb_shinfo(nskb)->frag_list = fskb2;
+ }
+
+skip_fraglist:
nskb->data_len = len - hsize;
nskb->len += nskb->data_len;
nskb->truesize += nskb->data_len;
* [PATCH 2/8] net: Add frag_list support to GSO
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
2008-12-13 1:35 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-16 7:30 ` David Miller
2008-12-13 1:35 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
` (5 subsequent siblings)
7 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
net: Add frag_list support to GSO
This patch allows GSO to handle frag_list in a limited way for the
purposes of allowing packets merged by GRO to be refragmented on
output.
Most hardware won't (and isn't expected to) support handling GRO
frag_list packets directly. Therefore we will perform GSO in
software for those cases.
However, for drivers that can support it (such as virtual NICs) we
may not have to segment the packets at all.
Whether the added overhead of GRO/GSO is worthwhile for bridges
and routers when weighed against the benefit of potentially
increasing the MTU within the host is still an open question.
However, for the case of host nodes this is undoubtedly a win.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/netdevice.h | 2 ++
net/core/dev.c | 2 --
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9d77b1d..8bf9127 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1724,6 +1724,8 @@ static inline int netif_needs_gso(struct net_device *dev, struct sk_buff *skb)
{
return skb_is_gso(skb) &&
(!skb_gso_ok(skb, dev->features) ||
+ (skb_shinfo(skb)->frag_list &&
+ !(dev->features & NETIF_F_FRAGLIST)) ||
unlikely(skb->ip_summed != CHECKSUM_PARTIAL));
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 9174c77..4388e27 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1527,8 +1527,6 @@ struct sk_buff *skb_gso_segment(struct sk_buff *skb, int features)
__be16 type = skb->protocol;
int err;
- BUG_ON(skb_shinfo(skb)->frag_list);
-
skb_reset_mac_header(skb);
skb->mac_len = skb->network_header - skb->mac_header;
__skb_pull(skb, skb->mac_len);
* [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
2008-12-13 1:35 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
2008-12-13 1:35 ` [PATCH 2/8] net: Add frag_list support to GSO Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-15 23:29 ` Paul E. McKenney
2008-12-16 7:40 ` David Miller
2008-12-13 1:35 ` [PATCH 4/8] ipv4: Add GRO infrastructure Herbert Xu
` (4 subsequent siblings)
7 siblings, 2 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
It is pretty similar to LRO, except that it is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb, or just call netif_rx.
The latter will call napi_gro_receive automatically. When napi_gro_receive
is used, the driver must either call napi_complete/netif_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__netif_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
any further merges impossible (e.g., a FIN packet), in which case the
merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush, which is called by
napi_complete and netif_rx_complete.
Currently held packets are stored in a singly linked list, just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/netdevice.h | 74 ++++++------------
include/linux/netpoll.h | 5 -
net/core/dev.c | 186 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 212 insertions(+), 53 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8bf9127..5acd176 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -311,8 +311,9 @@ struct napi_struct {
spinlock_t poll_lock;
int poll_owner;
struct net_device *dev;
- struct list_head dev_list;
#endif
+ struct list_head dev_list;
+ struct sk_buff *gro_list;
};
enum
@@ -372,22 +373,8 @@ static inline int napi_reschedule(struct napi_struct *napi)
*
* Mark NAPI processing as complete.
*/
-static inline void __napi_complete(struct napi_struct *n)
-{
- BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
- list_del(&n->poll_list);
- smp_mb__before_clear_bit();
- clear_bit(NAPI_STATE_SCHED, &n->state);
-}
-
-static inline void napi_complete(struct napi_struct *n)
-{
- unsigned long flags;
-
- local_irq_save(flags);
- __napi_complete(n);
- local_irq_restore(flags);
-}
+extern void __napi_complete(struct napi_struct *n);
+extern void napi_complete(struct napi_struct *n);
/**
* napi_disable - prevent NAPI from scheduling
@@ -495,9 +482,7 @@ struct net_device
unsigned long state;
struct list_head dev_list;
-#ifdef CONFIG_NETPOLL
struct list_head napi_list;
-#endif
/* The device initialization function. Called only once. */
int (*init)(struct net_device *dev);
@@ -521,6 +506,7 @@ struct net_device
#define NETIF_F_LLTX 4096 /* LockLess TX - deprecated. Please */
/* do not use LLTX in new drivers */
#define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */
+#define NETIF_F_GRO 16384 /* Generic receive offload */
#define NETIF_F_LRO 32768 /* large receive offload */
/* Segmentation offload features */
@@ -858,22 +844,8 @@ static inline void *netdev_priv(const struct net_device *dev)
* netif_napi_add() must be used to initialize a napi context prior to calling
* *any* of the other napi related functions.
*/
-static inline void netif_napi_add(struct net_device *dev,
- struct napi_struct *napi,
- int (*poll)(struct napi_struct *, int),
- int weight)
-{
- INIT_LIST_HEAD(&napi->poll_list);
- napi->poll = poll;
- napi->weight = weight;
-#ifdef CONFIG_NETPOLL
- napi->dev = dev;
- list_add(&napi->dev_list, &dev->napi_list);
- spin_lock_init(&napi->poll_lock);
- napi->poll_owner = -1;
-#endif
- set_bit(NAPI_STATE_SCHED, &napi->state);
-}
+void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight);
/**
* netif_napi_del - remove a napi context
@@ -881,12 +853,20 @@ static inline void netif_napi_add(struct net_device *dev,
*
* netif_napi_del() removes a napi context from the network device napi list
*/
-static inline void netif_napi_del(struct napi_struct *napi)
-{
-#ifdef CONFIG_NETPOLL
- list_del(&napi->dev_list);
-#endif
-}
+void netif_napi_del(struct napi_struct *napi);
+
+struct napi_gro_cb {
+ /* This is non-zero if the packet may be of the same flow. */
+ int same_flow;
+
+ /* This is non-zero if the packet cannot be merged with the new skb. */
+ int flush;
+
+ /* Number of segments aggregated. */
+ int count;
+};
+
+#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
struct packet_type {
__be16 type; /* This is really htons(ether_type). */
@@ -898,6 +878,9 @@ struct packet_type {
struct sk_buff *(*gso_segment)(struct sk_buff *skb,
int features);
int (*gso_send_check)(struct sk_buff *skb);
+ struct sk_buff **(*gro_receive)(struct sk_buff **head,
+ struct sk_buff *skb);
+ int (*gro_complete)(struct sk_buff *skb);
void *af_packet_priv;
struct list_head list;
};
@@ -1251,6 +1234,9 @@ extern int netif_rx(struct sk_buff *skb);
extern int netif_rx_ni(struct sk_buff *skb);
#define HAVE_NETIF_RECEIVE_SKB 1
extern int netif_receive_skb(struct sk_buff *skb);
+extern void napi_gro_flush(struct napi_struct *napi);
+extern int napi_gro_receive(struct napi_struct *napi,
+ struct sk_buff *skb);
extern void netif_nit_deliver(struct sk_buff *skb);
extern int dev_valid_name(const char *name);
extern int dev_ioctl(struct net *net, unsigned int cmd, void __user *);
@@ -1495,11 +1481,7 @@ static inline void __netif_rx_complete(struct net_device *dev,
static inline void netif_rx_complete(struct net_device *dev,
struct napi_struct *napi)
{
- unsigned long flags;
-
- local_irq_save(flags);
- __netif_rx_complete(dev, napi);
- local_irq_restore(flags);
+ napi_complete(napi);
}
static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index e3d7959..e38d3c9 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -94,11 +94,6 @@ static inline void netpoll_poll_unlock(void *have)
rcu_read_unlock();
}
-static inline void netpoll_netdev_init(struct net_device *dev)
-{
- INIT_LIST_HEAD(&dev->napi_list);
-}
-
#else
static inline int netpoll_rx(struct sk_buff *skb)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index 4388e27..5e5132c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,9 @@
#include "net-sysfs.h"
+/* Instead of increasing this, you should create a hash table. */
+#define MAX_GRO_SKBS 8
+
/*
* The list of packet types we will receive (as opposed to discard)
* and the routines to invoke.
@@ -2323,6 +2326,122 @@ static void flush_backlog(void *arg)
}
}
+static int napi_gro_complete(struct sk_buff *skb)
+{
+ struct packet_type *ptype;
+ __be16 type = skb->protocol;
+ struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
+ int err = -ENOENT;
+
+ if (!skb_shinfo(skb)->frag_list)
+ goto out;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ptype, head, list) {
+ if (ptype->type != type || ptype->dev || !ptype->gro_complete)
+ continue;
+
+ err = ptype->gro_complete(skb);
+ break;
+ }
+ rcu_read_unlock();
+
+ if (err) {
+ WARN_ON(&ptype->list == head);
+ kfree_skb(skb);
+ return NET_RX_SUCCESS;
+ }
+
+out:
+ __skb_push(skb, -skb_network_offset(skb));
+ return netif_receive_skb(skb);
+}
+
+void napi_gro_flush(struct napi_struct *napi)
+{
+ struct sk_buff *skb, *next;
+
+ for (skb = napi->gro_list; skb; skb = next) {
+ next = skb->next;
+ skb->next = NULL;
+ napi_gro_complete(skb);
+ }
+
+ napi->gro_list = NULL;
+}
+EXPORT_SYMBOL(napi_gro_flush);
+
+int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
+{
+ struct sk_buff **pp;
+ struct packet_type *ptype;
+ __be16 type = skb->protocol;
+ struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
+ int count = 0;
+ int mac_len;
+
+ if (!(skb->dev->features & NETIF_F_GRO))
+ goto normal;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ptype, head, list) {
+ struct sk_buff *p;
+
+ if (ptype->type != type || ptype->dev || !ptype->gro_receive)
+ continue;
+
+ skb_reset_network_header(skb);
+ mac_len = skb->network_header - skb->mac_header;
+ skb->mac_len = mac_len;
+ NAPI_GRO_CB(skb)->same_flow = 0;
+ NAPI_GRO_CB(skb)->flush = 0;
+
+ for (p = napi->gro_list; p; p = p->next) {
+ count++;
+ NAPI_GRO_CB(p)->same_flow =
+ p->mac_len == mac_len &&
+ !memcmp(skb_mac_header(p), skb_mac_header(skb),
+ mac_len);
+ NAPI_GRO_CB(p)->flush = 0;
+ }
+
+ pp = ptype->gro_receive(&napi->gro_list, skb);
+ break;
+ }
+ rcu_read_unlock();
+
+ if (&ptype->list == head)
+ goto normal;
+
+ if (pp) {
+ struct sk_buff *nskb = *pp;
+
+ *pp = nskb->next;
+ nskb->next = NULL;
+ napi_gro_complete(nskb);
+ count--;
+ }
+
+ if (NAPI_GRO_CB(skb)->same_flow)
+ goto ok;
+
+ if (NAPI_GRO_CB(skb)->flush || count >= MAX_GRO_SKBS) {
+ __skb_push(skb, -skb_network_offset(skb));
+ goto normal;
+ }
+
+ NAPI_GRO_CB(skb)->count = 1;
+ skb->next = napi->gro_list;
+ napi->gro_list = skb;
+
+ok:
+ return NET_RX_SUCCESS;
+
+normal:
+ return netif_receive_skb(skb);
+}
+EXPORT_SYMBOL(napi_gro_receive);
+
static int process_backlog(struct napi_struct *napi, int quota)
{
int work = 0;
@@ -2342,9 +2461,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
}
local_irq_enable();
- netif_receive_skb(skb);
+ napi_gro_receive(napi, skb);
} while (++work < quota && jiffies == start_time);
+ napi_gro_flush(napi);
+
return work;
}
@@ -2365,6 +2486,61 @@ void __napi_schedule(struct napi_struct *n)
}
EXPORT_SYMBOL(__napi_schedule);
+void __napi_complete(struct napi_struct *n)
+{
+ BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
+ BUG_ON(n->gro_list);
+
+ list_del(&n->poll_list);
+ smp_mb__before_clear_bit();
+ clear_bit(NAPI_STATE_SCHED, &n->state);
+}
+EXPORT_SYMBOL(__napi_complete);
+
+void napi_complete(struct napi_struct *n)
+{
+ unsigned long flags;
+
+ napi_gro_flush(n);
+ local_irq_save(flags);
+ __napi_complete(n);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(napi_complete);
+
+void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight)
+{
+ INIT_LIST_HEAD(&napi->poll_list);
+ napi->gro_list = NULL;
+ napi->poll = poll;
+ napi->weight = weight;
+ list_add(&napi->dev_list, &dev->napi_list);
+#ifdef CONFIG_NETPOLL
+ napi->dev = dev;
+ spin_lock_init(&napi->poll_lock);
+ napi->poll_owner = -1;
+#endif
+ set_bit(NAPI_STATE_SCHED, &napi->state);
+}
+EXPORT_SYMBOL(netif_napi_add);
+
+void netif_napi_del(struct napi_struct *napi)
+{
+ struct sk_buff *skb, *next;
+
+ list_del(&napi->dev_list);
+
+ for (skb = napi->gro_list; skb; skb = next) {
+ next = skb->next;
+ skb->next = NULL;
+ kfree_skb(skb);
+ }
+
+ napi->gro_list = NULL;
+}
+EXPORT_SYMBOL(netif_napi_del);
+
static void net_rx_action(struct softirq_action *h)
{
@@ -4348,7 +4524,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
netdev_init_queues(dev);
dev->get_stats = internal_stats;
- netpoll_netdev_init(dev);
+ INIT_LIST_HEAD(&dev->napi_list);
setup(dev);
strcpy(dev->name, name);
return dev;
@@ -4365,10 +4541,15 @@ EXPORT_SYMBOL(alloc_netdev_mq);
*/
void free_netdev(struct net_device *dev)
{
+ struct napi_struct *p, *n;
+
release_net(dev_net(dev));
kfree(dev->_tx);
+ list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
+ netif_napi_del(p);
+
/* Compatibility with error handling in drivers */
if (dev->reg_state == NETREG_UNINITIALIZED) {
kfree((char *)dev - dev->padded);
@@ -4904,6 +5085,7 @@ static int __init net_dev_init(void)
queue->backlog.poll = process_backlog;
queue->backlog.weight = weight_p;
+ queue->backlog.gro_list = NULL;
}
netdev_dma_register();
* [PATCH 4/8] ipv4: Add GRO infrastructure
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
` (2 preceding siblings ...)
2008-12-13 1:35 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-16 7:41 ` David Miller
2008-12-13 1:35 ` [PATCH 5/8] net: Add skb_gro_receive Herbert Xu
` (3 subsequent siblings)
7 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
ipv4: Add GRO infrastructure
This patch adds GRO support for IPv4.
The criteria for merging are more stringent than LRO's: in particular,
we require all fields in the IP header to be identical except for
the length, ID and checksum. In addition, the IDs must form an
arithmetic sequence with a common difference of one.
The ID requirement might seem overly strict; however, most hardware
TSO solutions already obey this rule, and Linux itself also obeys it
whether GSO is in use or not.
In future we could relax this rule by storing the IDs (or rather
making sure that we don't drop them when pulling the aggregate
skb's tail).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/protocol.h | 3 +
net/ipv4/af_inet.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 8d024d7..cb2965a 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -39,6 +39,9 @@ struct net_protocol {
int (*gso_send_check)(struct sk_buff *skb);
struct sk_buff *(*gso_segment)(struct sk_buff *skb,
int features);
+ struct sk_buff **(*gro_receive)(struct sk_buff **head,
+ struct sk_buff *skb);
+ int (*gro_complete)(struct sk_buff *skb);
unsigned int no_policy:1,
netns_ok:1;
};
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1aa2dc9..8dc1ebd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -94,6 +94,7 @@
#include <linux/igmp.h>
#include <linux/inetdevice.h>
#include <linux/netdevice.h>
+#include <net/checksum.h>
#include <net/ip.h>
#include <net/protocol.h>
#include <net/arp.h>
@@ -1245,6 +1246,100 @@ out:
return segs;
}
+static struct sk_buff **inet_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb)
+{
+ struct net_protocol *ops;
+ struct sk_buff **pp = NULL;
+ struct sk_buff *p;
+ struct iphdr *iph;
+ int flush = 1;
+ int proto;
+ int id;
+
+ if (unlikely(!pskb_may_pull(skb, sizeof(*iph))))
+ goto out;
+
+ iph = ip_hdr(skb);
+ proto = iph->protocol & (MAX_INET_PROTOS - 1);
+
+ rcu_read_lock();
+ ops = rcu_dereference(inet_protos[proto]);
+ if (!ops || !ops->gro_receive)
+ goto out_unlock;
+
+ if (iph->version != 4 || iph->ihl != 5)
+ goto out_unlock;
+
+ if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
+ goto out_unlock;
+
+ flush = ntohs(iph->tot_len) != skb->len ||
+ iph->frag_off != htons(IP_DF);
+ id = ntohs(iph->id);
+
+ for (p = *head; p; p = p->next) {
+ struct iphdr *iph2;
+
+ if (!NAPI_GRO_CB(p)->same_flow)
+ continue;
+
+ iph2 = ip_hdr(p);
+
+ if (iph->protocol != iph2->protocol ||
+ iph->tos != iph2->tos ||
+ memcmp(&iph->saddr, &iph2->saddr, 8)) {
+ NAPI_GRO_CB(p)->same_flow = 0;
+ continue;
+ }
+
+ /* All fields must match except length and checksum. */
+ NAPI_GRO_CB(p)->flush |=
+ memcmp(&iph->frag_off, &iph2->frag_off, 4) ||
+ (u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) != id;
+
+ NAPI_GRO_CB(p)->flush |= flush;
+ }
+
+ NAPI_GRO_CB(skb)->flush |= flush;
+ __skb_pull(skb, sizeof(*iph));
+ skb_reset_transport_header(skb);
+
+ pp = ops->gro_receive(head, skb);
+
+out_unlock:
+ rcu_read_unlock();
+
+out:
+ NAPI_GRO_CB(skb)->flush |= flush;
+
+ return pp;
+}
+
+static int inet_gro_complete(struct sk_buff *skb)
+{
+ struct net_protocol *ops;
+ struct iphdr *iph = ip_hdr(skb);
+ int proto = iph->protocol & (MAX_INET_PROTOS - 1);
+ int err = -ENOSYS;
+ __be16 newlen = htons(skb->len - skb_network_offset(skb));
+
+ csum_replace2(&iph->check, iph->tot_len, newlen);
+ iph->tot_len = newlen;
+
+ rcu_read_lock();
+ ops = rcu_dereference(inet_protos[proto]);
+ if (WARN_ON(!ops || !ops->gro_complete))
+ goto out_unlock;
+
+ err = ops->gro_complete(skb);
+
+out_unlock:
+ rcu_read_unlock();
+
+ return err;
+}
+
int inet_ctl_sock_create(struct sock **sk, unsigned short family,
unsigned short type, unsigned char protocol,
struct net *net)
@@ -1411,6 +1506,8 @@ static struct packet_type ip_packet_type = {
.func = ip_rcv,
.gso_send_check = inet_gso_send_check,
.gso_segment = inet_gso_segment,
+ .gro_receive = inet_gro_receive,
+ .gro_complete = inet_gro_complete,
};
static int __init inet_init(void)
* [PATCH 5/8] net: Add skb_gro_receive
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
` (3 preceding siblings ...)
2008-12-13 1:35 ` [PATCH 4/8] ipv4: Add GRO infrastructure Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-13 2:52 ` Herbert Xu
2008-12-13 1:35 ` [PATCH 6/8] tcp: Add GRO support Herbert Xu
` (2 subsequent siblings)
7 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
net: Add skb_gro_receive
This patch adds the helper skb_gro_receive to merge packets for
GRO. The current method is to allocate a new header skb and then
chain the original packets to its frag_list. This is done to
make it easier to integrate into the existing GSO framework.
In future, as GSO is moved into the drivers, we can undo this and
simply chain the original packets together.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/skbuff.h | 2 +
net/core/skbuff.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2725f4e..2bdb539 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1649,6 +1649,8 @@ extern void skb_split(struct sk_buff *skb,
struct sk_buff *skb1, const u32 len);
extern struct sk_buff *skb_segment(struct sk_buff *skb, int features);
+extern int skb_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb);
static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
int len, void *buffer)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cf05a8c..8ea648e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2439,6 +2439,63 @@ err:
EXPORT_SYMBOL_GPL(skb_segment);
+int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+ struct sk_buff *p = *head;
+ struct sk_buff *nskb;
+ unsigned int headroom;
+ unsigned int hlen;
+
+ if (skb_shinfo(p)->frag_list)
+ goto merge;
+
+ headroom = skb_headroom(p);
+ nskb = netdev_alloc_skb(p->dev, headroom);
+ if (unlikely(!nskb))
+ return -ENOMEM;
+
+ __copy_skb_header(nskb, p);
+ nskb->mac_len = p->mac_len;
+
+ skb_reserve(nskb, headroom);
+
+ hlen = p->data - skb_mac_header(p);
+ skb_set_mac_header(nskb, -hlen);
+ skb_set_network_header(nskb, skb_network_offset(p));
+ skb_set_transport_header(nskb, skb_transport_offset(p));
+
+ memcpy(skb_mac_header(nskb), skb_mac_header(p), hlen);
+
+ *NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p);
+ skb_shinfo(nskb)->frag_list = p;
+ skb_header_release(p);
+ nskb->prev = p;
+
+ nskb->data_len += p->len;
+ nskb->truesize += p->len;
+ nskb->len += p->len;
+
+ *head = nskb;
+ nskb->next = p->next;
+ p->next = NULL;
+
+ p = nskb;
+
+merge:
+ NAPI_GRO_CB(p)->count++;
+ p->prev->next = skb;
+ p->prev = skb;
+ skb_header_release(skb);
+
+ p->data_len += skb->len;
+ p->truesize += skb->len;
+ p->len += skb->len;
+
+ NAPI_GRO_CB(skb)->same_flow = 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(skb_gro_receive);
+
void __init skb_init(void)
{
skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
* [PATCH 6/8] tcp: Add GRO support
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
` (4 preceding siblings ...)
2008-12-13 1:35 ` [PATCH 5/8] net: Add skb_gro_receive Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-16 7:43 ` David Miller
2008-12-13 1:35 ` [PATCH 7/8] ethtool: Add GGRO and SGRO ops Herbert Xu
2008-12-13 1:35 ` [PATCH 8/8] e1000e: Add GRO support Herbert Xu
7 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
tcp: Add GRO support
This patch adds the TCP-specific portion of GRO. The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation. Otherwise this
is pretty much identical to LRO, except that we support the merging
of ECN packets.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/tcp.h | 6 +++
net/ipv4/af_inet.c | 2 +
net/ipv4/tcp.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_ipv4.c | 35 ++++++++++++++++++
4 files changed, 143 insertions(+)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 438014d..cd571a9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1358,6 +1358,12 @@ extern void tcp_v4_destroy_sock(struct sock *sk);
extern int tcp_v4_gso_send_check(struct sk_buff *skb);
extern struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features);
+extern struct sk_buff **tcp_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb);
+extern struct sk_buff **tcp4_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb);
+extern int tcp_gro_complete(struct sk_buff *skb);
+extern int tcp4_gro_complete(struct sk_buff *skb);
#ifdef CONFIG_PROC_FS
extern int tcp4_proc_init(void);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8dc1ebd..97e973f 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1414,6 +1414,8 @@ static struct net_protocol tcp_protocol = {
.err_handler = tcp_v4_err,
.gso_send_check = tcp_v4_gso_send_check,
.gso_segment = tcp_tso_segment,
+ .gro_receive = tcp4_gro_receive,
+ .gro_complete = tcp4_gro_complete,
.no_policy = 1,
.netns_ok = 1,
};
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c5aca0b..294d838 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2461,6 +2461,106 @@ out:
}
EXPORT_SYMBOL(tcp_tso_segment);
+struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+ struct sk_buff **pp = NULL;
+ struct sk_buff *p;
+ struct tcphdr *th;
+ struct tcphdr *th2;
+ unsigned int thlen;
+ unsigned int flags;
+ unsigned int total;
+ unsigned int mss = 1;
+ int flush = 1;
+
+ if (!pskb_may_pull(skb, sizeof(*th)))
+ goto out;
+
+ th = tcp_hdr(skb);
+ thlen = th->doff * 4;
+ if (thlen < sizeof(*th))
+ goto out;
+
+ if (!pskb_may_pull(skb, thlen))
+ goto out;
+
+ th = tcp_hdr(skb);
+ __skb_pull(skb, thlen);
+
+ flags = tcp_flag_word(th);
+
+ for (; (p = *head); head = &p->next) {
+ if (!NAPI_GRO_CB(p)->same_flow)
+ continue;
+
+ th2 = tcp_hdr(p);
+
+ if (th->source != th2->source || th->dest != th2->dest) {
+ NAPI_GRO_CB(p)->same_flow = 0;
+ continue;
+ }
+
+ goto found;
+ }
+
+ goto out_check_final;
+
+found:
+ flush = NAPI_GRO_CB(p)->flush;
+ flush |= flags & TCP_FLAG_CWR;
+ flush |= (flags ^ tcp_flag_word(th2)) &
+ ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
+ flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
+ flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
+
+ total = p->len;
+ mss = total;
+ if (skb_shinfo(p)->frag_list)
+ mss = skb_shinfo(p)->frag_list->len;
+
+ flush |= skb->len > mss || skb->len <= 0;
+ flush |= ntohl(th2->seq) + total != ntohl(th->seq);
+
+ if (flush || skb_gro_receive(head, skb)) {
+ mss = 1;
+ goto out_check_final;
+ }
+
+ p = *head;
+ th2 = tcp_hdr(p);
+ tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
+
+out_check_final:
+ flush = skb->len < mss;
+ flush |= flags & (TCP_FLAG_URG | TCP_FLAG_PSH | TCP_FLAG_RST |
+ TCP_FLAG_SYN | TCP_FLAG_FIN);
+
+ if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
+ pp = head;
+
+out:
+ NAPI_GRO_CB(skb)->flush |= flush;
+
+ return pp;
+}
+
+int tcp_gro_complete(struct sk_buff *skb)
+{
+ struct tcphdr *th = tcp_hdr(skb);
+
+ skb->csum_start = skb_transport_header(skb) - skb->head;
+ skb->csum_offset = offsetof(struct tcphdr, check);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ skb_shinfo(skb)->gso_size = skb_shinfo(skb)->frag_list->len;
+ skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
+
+ if (th->cwr)
+ skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+
+ return 0;
+}
+
#ifdef CONFIG_TCP_MD5SIG
static unsigned long tcp_md5sig_users;
static struct tcp_md5sig_pool **tcp_md5sig_pool;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5c8fa7f..5b7ce84 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2350,6 +2350,41 @@ void tcp4_proc_exit(void)
}
#endif /* CONFIG_PROC_FS */
+struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+ struct iphdr *iph = ip_hdr(skb);
+
+ switch (skb->ip_summed) {
+ case CHECKSUM_COMPLETE:
+ if (!tcp_v4_check(skb->len, iph->saddr, iph->daddr,
+ skb->csum)) {
+ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ break;
+ }
+
+ /* fall through */
+ case CHECKSUM_NONE:
+ NAPI_GRO_CB(skb)->flush = 1;
+ return NULL;
+ }
+
+ return tcp_gro_receive(head, skb);
+}
+EXPORT_SYMBOL(tcp4_gro_receive);
+
+int tcp4_gro_complete(struct sk_buff *skb)
+{
+ struct iphdr *iph = ip_hdr(skb);
+ struct tcphdr *th = tcp_hdr(skb);
+
+ th->check = ~tcp_v4_check(skb->len - skb_transport_offset(skb),
+ iph->saddr, iph->daddr, 0);
+ skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+
+ return tcp_gro_complete(skb);
+}
+EXPORT_SYMBOL(tcp4_gro_complete);
+
struct proto tcp_prot = {
.name = "TCP",
.owner = THIS_MODULE,
* [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
` (5 preceding siblings ...)
2008-12-13 1:35 ` [PATCH 6/8] tcp: Add GRO support Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-16 7:44 ` David Miller
2008-12-13 1:35 ` [PATCH 8/8] e1000e: Add GRO support Herbert Xu
7 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
ethtool: Add GGRO and SGRO ops
This patch adds the ethtool ops to enable and disable GRO. It also
makes GRO depend on RX checksum offload, in much the same way that TSO
depends on SG support.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/linux/ethtool.h | 2 +
net/core/ethtool.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 53 insertions(+), 2 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index b4b038b..27c67a5 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -467,6 +467,8 @@ struct ethtool_ops {
#define ETHTOOL_GRXFH 0x00000029 /* Get RX flow hash configuration */
#define ETHTOOL_SRXFH 0x0000002a /* Set RX flow hash configuration */
+#define ETHTOOL_GGRO 0x0000002b /* Get GRO enable (ethtool_value) */
+#define ETHTOOL_SGRO 0x0000002c /* Set GRO enable (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 14ada53..947710a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -528,6 +528,22 @@ static int ethtool_set_tx_csum(struct net_device *dev, char __user *useraddr)
return dev->ethtool_ops->set_tx_csum(dev, edata.data);
}
+static int ethtool_set_rx_csum(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata;
+
+ if (!dev->ethtool_ops->set_rx_csum)
+ return -EOPNOTSUPP;
+
+ if (copy_from_user(&edata, useraddr, sizeof(edata)))
+ return -EFAULT;
+
+ if (!edata.data && dev->ethtool_ops->set_sg)
+ dev->features &= ~NETIF_F_GRO;
+
+ return dev->ethtool_ops->set_rx_csum(dev, edata.data);
+}
+
static int ethtool_set_sg(struct net_device *dev, char __user *useraddr)
{
struct ethtool_value edata;
@@ -599,6 +615,34 @@ static int ethtool_set_gso(struct net_device *dev, char __user *useraddr)
return 0;
}
+static int ethtool_get_gro(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata = { ETHTOOL_GGRO };
+
+ edata.data = dev->features & NETIF_F_GRO;
+ if (copy_to_user(useraddr, &edata, sizeof(edata)))
+ return -EFAULT;
+ return 0;
+}
+
+static int ethtool_set_gro(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata;
+
+ if (copy_from_user(&edata, useraddr, sizeof(edata)))
+ return -EFAULT;
+
+ if (edata.data) {
+ if (!dev->ethtool_ops->get_rx_csum ||
+ !dev->ethtool_ops->get_rx_csum(dev))
+ return -EINVAL;
+ dev->features |= NETIF_F_GRO;
+ } else
+ dev->features &= ~NETIF_F_GRO;
+
+ return 0;
+}
+
static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
{
struct ethtool_test test;
@@ -932,8 +976,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
dev->ethtool_ops->get_rx_csum);
break;
case ETHTOOL_SRXCSUM:
- rc = ethtool_set_value(dev, useraddr,
- dev->ethtool_ops->set_rx_csum);
+ rc = ethtool_set_rx_csum(dev, useraddr);
break;
case ETHTOOL_GTXCSUM:
rc = ethtool_get_value(dev, useraddr, ethcmd,
@@ -1014,6 +1057,12 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_SRXFH:
rc = ethtool_set_rxhash(dev, useraddr);
break;
+ case ETHTOOL_GGRO:
+ rc = ethtool_get_gro(dev, useraddr);
+ break;
+ case ETHTOOL_SGRO:
+ rc = ethtool_set_gro(dev, useraddr);
+ break;
default:
rc = -EOPNOTSUPP;
}
* [PATCH 8/8] e1000e: Add GRO support
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
` (6 preceding siblings ...)
2008-12-13 1:35 ` [PATCH 7/8] ethtool: Add GGRO and SGRO ops Herbert Xu
@ 2008-12-13 1:35 ` Herbert Xu
2008-12-16 7:46 ` David Miller
7 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 1:35 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
e1000e: Add GRO support
This patch adds GRO support to e1000e by making it invoke napi_gro_receive
instead of netif_receive_skb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
drivers/net/e1000e/netdev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 91795f7..fb7c4bd 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -102,7 +102,7 @@ static void e1000_receive_skb(struct e1000_adapter *adapter,
vlan_hwaccel_receive_skb(skb, adapter->vlgrp,
le16_to_cpu(vlan));
else
- netif_receive_skb(skb);
+ napi_gro_receive(&adapter->napi, skb);
netdev->last_rx = jiffies;
}
* Re: [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-12 21:41 ` Herbert Xu
@ 2008-12-13 2:38 ` Evgeniy Polyakov
2008-12-13 2:43 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-13 2:38 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 08:41:25AM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > > + nskb->truesize += skb_end_pointer(nskb) - nskb->head -
> > > + hsize;
> > > + skb_release_head_state(nskb);
> > > + __skb_push(nskb, doffset);
> > > + } else {
> > > + nskb = alloc_skb(hsize + doffset + headroom,
> > > + GFP_ATOMIC);
> > > +
> > > + if (unlikely(!nskb))
> > > + goto err;
> > > +
> >
> > The same.
>
> nskb is NULL here.
Yup, I meant the 'goto err' below, where we fail to clone the skb.
> > > + if (pos < offset + len) {
> > > + struct sk_buff *fskb2 = fskb;
> > > +
> > > + BUG_ON(pos + fskb->len != offset + len);
> > > +
> > > + pos += fskb->len;
> > > + fskb = fskb->next;
> > > +
> > > + if (fskb2->next) {
> > > + fskb2 = skb_clone(fskb2, GFP_ATOMIC);
> > > + if (!fskb2)
> > > + goto err;
> > > + } else
> > > + skb_get(fskb2);
> > > +
> > > + skb_shinfo(nskb)->frag_list = fskb2;
--
Evgeniy Polyakov
* Re: [PATCH 6/8] tcp: Add GRO support
2008-12-12 21:46 ` Herbert Xu
@ 2008-12-13 2:40 ` Evgeniy Polyakov
2008-12-13 2:46 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-13 2:40 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 08:46:27AM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > > + flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
> > > +
> > > + total = p->len;
> > > + mss = total;
> > > + if (skb_shinfo(p)->frag_list)
> > > + mss = skb_shinfo(p)->frag_list->len;
> > > +
> > > + flush |= skb->len > mss || skb->len <= 0;
> > > + flush |= ntohl(th2->seq) + total != ntohl(th->seq);
> > > +
> >
> > No timestamp check?
>
> The memcmp does that for us.
So it will fail if the timestamp changed? Or if some new option is added?
--
Evgeniy Polyakov
* Re: [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-13 2:38 ` Evgeniy Polyakov
@ 2008-12-13 2:43 ` Herbert Xu
2008-12-13 3:11 ` Evgeniy Polyakov
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 2:43 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 05:38:55AM +0300, Evgeniy Polyakov wrote:
>
> Yup, I meant below 'goto err' when we failed to clone skb.
>
> > > > + if (pos < offset + len) {
> > > > + struct sk_buff *fskb2 = fskb;
> > > > +
> > > > + BUG_ON(pos + fskb->len != offset + len);
> > > > +
> > > > + pos += fskb->len;
> > > > + fskb = fskb->next;
> > > > +
> > > > + if (fskb2->next) {
> > > > + fskb2 = skb_clone(fskb2, GFP_ATOMIC);
> > > > + if (!fskb2)
> > > > + goto err;
> > > > + } else
> > > > + skb_get(fskb2);
> > > > +
> > > > + skb_shinfo(nskb)->frag_list = fskb2;
This isn't a problem. fskb2 is still on the original frag_list so
we don't need to free it if skb_clone fails.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 6/8] tcp: Add GRO support
2008-12-13 2:40 ` Evgeniy Polyakov
@ 2008-12-13 2:46 ` Herbert Xu
2008-12-13 3:10 ` Evgeniy Polyakov
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 2:46 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 05:40:46AM +0300, Evgeniy Polyakov wrote:
>
> So it will fail if timestamp changed? Or if some new option added?
Correct, it must remain exactly the same. Even at 100Mbps any
sane clock frequency will result in an average of 8 packets per
clock update. At the sort of speeds (>1Gbps) at which we're targeting
this'll easily get us to 64K, which is our maximum (actually it
looks like I forgot to add a check to stop this from growing beyond
64K :)
As I alluded to in the opening message, in future we can consider
a more generalised form of GRO where we end up with a super-packet
that retains individual headers which can then be refragmented as
necessary. This would allow us to merge in the presence of timestamp
changes. However, IMHO this isn't a killer feature for merging TCP
packets.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 5/8] net: Add skb_gro_receive
2008-12-13 1:35 ` [PATCH 5/8] net: Add skb_gro_receive Herbert Xu
@ 2008-12-13 2:52 ` Herbert Xu
2008-12-16 7:42 ` David Miller
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 2:52 UTC (permalink / raw)
To: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
On Sat, Dec 13, 2008 at 12:35:23PM +1100, Herbert Xu wrote:
> net: Add skb_gro_receive
>
> This patch adds the helper skb_gro_receive to merge packets for
> GRO. The current method is to allocate a new header skb and then
> chain the original packets to its frag_list. This is done to
> make it easier to integrate into the existing GSO framework.
>
> In future as GSO is moved into the drivers, we can undo this and
> simply chain the original packets together.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Looks like I forgot to stop the merging at 64K. Here is an update
with the check added.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2725f4e..2bdb539 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1649,6 +1649,8 @@ extern void skb_split(struct sk_buff *skb,
struct sk_buff *skb1, const u32 len);
extern struct sk_buff *skb_segment(struct sk_buff *skb, int features);
+extern int skb_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb);
static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
int len, void *buffer)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cf05a8c..88c37a5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2439,6 +2439,65 @@ err:
EXPORT_SYMBOL_GPL(skb_segment);
+int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+ struct sk_buff *p = *head;
+ struct sk_buff *nskb;
+ unsigned int headroom;
+ unsigned int hlen = p->data - skb_mac_header(p);
+
+ if (hlen + p->len + skb->len >= 65536)
+ return -E2BIG;
+
+ if (skb_shinfo(p)->frag_list)
+ goto merge;
+
+ headroom = skb_headroom(p);
+ nskb = netdev_alloc_skb(p->dev, headroom);
+ if (unlikely(!nskb))
+ return -ENOMEM;
+
+ __copy_skb_header(nskb, p);
+ nskb->mac_len = p->mac_len;
+
+ skb_reserve(nskb, headroom);
+
+ skb_set_mac_header(nskb, -hlen);
+ skb_set_network_header(nskb, skb_network_offset(p));
+ skb_set_transport_header(nskb, skb_transport_offset(p));
+
+ memcpy(skb_mac_header(nskb), skb_mac_header(p), hlen);
+
+ *NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p);
+ skb_shinfo(nskb)->frag_list = p;
+ skb_header_release(p);
+ nskb->prev = p;
+
+ nskb->data_len += p->len;
+ nskb->truesize += p->len;
+ nskb->len += p->len;
+
+ *head = nskb;
+ nskb->next = p->next;
+ p->next = NULL;
+
+ p = nskb;
+
+merge:
+ NAPI_GRO_CB(p)->count++;
+ p->prev->next = skb;
+ p->prev = skb;
+ skb_header_release(skb);
+
+ p->data_len += skb->len;
+ p->truesize += skb->len;
+ p->len += skb->len;
+
+ NAPI_GRO_CB(skb)->same_flow = 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(skb_gro_receive);
+
void __init skb_init(void)
{
skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 6/8] tcp: Add GRO support
2008-12-13 2:46 ` Herbert Xu
@ 2008-12-13 3:10 ` Evgeniy Polyakov
2008-12-13 3:19 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-13 3:10 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 01:46:45PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> On Sat, Dec 13, 2008 at 05:40:46AM +0300, Evgeniy Polyakov wrote:
> >
> > So it will fail if timestamp changed? Or if some new option added?
>
> Correct, it must remain exactly the same. Even at 100Mbps any
> sane clock frequency will result in an average of 8 packets per
> clock update. At the sort of speeds (>1Gbps) at which we're targeting
> this'll easily get us to 64K which is our maximum (actually it
> looks like I forgot to add a check to stop this from growing beyond
> 64K :)
Some stacks just use an increasing counter, since that is allowed by the RFC :)
Probably 'not sane' is the appropriate name, but still... Probably not a
serious problem, but what if we just checked the timestamp option before/after,
as LRO did?
--
Evgeniy Polyakov
* Re: [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-13 2:43 ` Herbert Xu
@ 2008-12-13 3:11 ` Evgeniy Polyakov
2008-12-13 3:20 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-13 3:11 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 01:43:05PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > > > > + if (pos < offset + len) {
> > > > > + struct sk_buff *fskb2 = fskb;
> > > > > +
> > > > > + BUG_ON(pos + fskb->len != offset + len);
> > > > > +
> > > > > + pos += fskb->len;
> > > > > + fskb = fskb->next;
> > > > > +
> > > > > + if (fskb2->next) {
> > > > > + fskb2 = skb_clone(fskb2, GFP_ATOMIC);
> > > > > + if (!fskb2)
> > > > > + goto err;
> > > > > + } else
> > > > > + skb_get(fskb2);
> > > > > +
> > > > > + skb_shinfo(nskb)->frag_list = fskb2;
>
> This isn't a problem. fskb2 is still on the original frag_list so
> we don't need to free it if skb_clone fails.
But nskb was not added into the list, or was it?
--
Evgeniy Polyakov
* Re: [PATCH 6/8] tcp: Add GRO support
2008-12-13 3:10 ` Evgeniy Polyakov
@ 2008-12-13 3:19 ` Herbert Xu
0 siblings, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 3:19 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 06:10:19AM +0300, Evgeniy Polyakov wrote:
>
> Some stacks use just increased counter since it is allowed by the RFC :)
> Probably 'not sane' is appropriate name, but still... Probably not a
Well, one big reason why TSO worked so well is that the rest
of the stack (well, most of it) simply treated it as a large packet.
As it stands this means keeping below the 64K threshold (e.g.,
the IP header length field is 16 bits long).
In future of course we'd want to increase this, be it through
using IPv6 or some other means.
> serious problem, but what if just check the timestamp option before/after
> like it was in LRO?
As I said I don't think the restriction on the timestamp is such
a big deal. At the sort of speeds where merging is actually useful
only an insane clock frequency would require us to merge packets
with different time stamps. Note also that the limited length of
the TCP timestamp option means that insane clock frequencies aren't
practical anyway as it'll wrap too quickly, thus defeating its
purpose.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-13 3:11 ` Evgeniy Polyakov
@ 2008-12-13 3:20 ` Herbert Xu
0 siblings, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 3:20 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 06:11:10AM +0300, Evgeniy Polyakov wrote:
>
> > This isn't a problem. fskb2 is still on the original frag_list so
> > we don't need to free it if skb_clone fails.
>
> But nskb was not added into the list or it was?
We've added it to the segs list which is freed at err:.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-12 23:11 ` Herbert Xu
@ 2008-12-13 3:43 ` Herbert Xu
2008-12-13 14:03 ` Evgeniy Polyakov
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-13 3:43 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David S. Miller, netdev
On Sat, Dec 13, 2008 at 10:11:38AM +1100, Herbert Xu wrote:
>
> BTW this should be pretty easy to implement through a second
> entry, e.g., napi_gro_receive_pages() that works just like its
> LRO counter-part lro_receive_frags. This would have its own
> protocol hooks so it doesn't need to do anything nasty to get
> at the packet headers.
In fact we don't even need extra hooks. All we need to do is
keep one pre-allocated skb in the napi struct, and use that to
hold the pages while we process them. If it's merged (we'd modify
the low-level skb_gro_receive to merge the pages directly rather
than the skb), then we return the skb to use it again for the
next packet. If not then we just allocate a new skb. This way
none of the protocol-specific code needs to handle pages at all.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-13 3:43 ` Herbert Xu
@ 2008-12-13 14:03 ` Evgeniy Polyakov
0 siblings, 0 replies; 61+ messages in thread
From: Evgeniy Polyakov @ 2008-12-13 14:03 UTC (permalink / raw)
To: Herbert Xu; +Cc: Ben Hutchings, David S. Miller, netdev
On Sat, Dec 13, 2008 at 02:43:56PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > BTW this should be pretty easy to implement through a second
> > entry, e.g., napi_gro_receive_pages() that works just like its
> > LRO counter-part lro_receive_frags. This would have its own
> > protocol hooks so it doesn't need to do anything nasty to get
> > at the packet headers.
>
> In fact we don't even need extra hooks. All we need to do is to
> keep one pre-allocated skb in the napi struct, and use that to hold
> the pages while we process it. If it's merged (we'd modify the
> low-level skb_gro_receive to merge the pages directly rather than
> the skb), then we return the skb to use it again for the next
> packet. If not then we just allocate a new skb. This way none
> of the protocol-specific code needs to handle pages at all.
This sounds good!
--
Evgeniy Polyakov
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-12 22:49 ` Herbert Xu
@ 2008-12-14 19:36 ` Waskiewicz Jr, Peter P
2008-12-14 21:09 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-12-14 19:36 UTC (permalink / raw)
To: Herbert Xu; +Cc: Ben Hutchings, David S. Miller, netdev@vger.kernel.org
On Fri, 12 Dec 2008, Herbert Xu wrote:
> On Fri, Dec 12, 2008 at 10:35:27PM +0000, Ben Hutchings wrote:
> >
> > If I'm not mistaken, the whole point of set_flags() is to end the
> > continued expansion of struct ethtool_ops by another 2 functions for
> > every new offload feature. The comments make it fairly clear that Jeff
> > anticipated that it might be necessary to include such checks for some
> > flags.
>
> GRO is purely a stack feature so it doesn't need anything in
> ethtool_ops at all.
I'm confused then. You're adding two ethtool entry points with
ETHTOOL_GGRO and ETHTOOL_SGRO, adding the callpoints in dev_ethtool with
set_gro and get_gro, but how do you manipulate this without adding to the
userspace application? Adding this functionality to the set/get_flags
will keep the userspace app from needing a patch to support the new
callbacks.
Cheers,
-PJ Waskiewicz
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-14 19:36 ` Waskiewicz Jr, Peter P
@ 2008-12-14 21:09 ` Herbert Xu
2008-12-14 22:00 ` Waskiewicz Jr, Peter P
0 siblings, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-14 21:09 UTC (permalink / raw)
To: Waskiewicz Jr, Peter P
Cc: Ben Hutchings, David S. Miller, netdev@vger.kernel.org
On Sun, Dec 14, 2008 at 11:36:14AM -0800, Waskiewicz Jr, Peter P wrote:
>
> I'm confused then. You're adding two ethtool entry points with
> ETHTOOL_GGRO and ETHTOOL_SGRO, adding the callpoints in dev_ethtool with
> set_gro and get_gro, but how do you manipulate this without adding to the
> userspace application? Adding this functionality to the set/get_flags
> will keep the userspace app from needing a patch to support the new
> callbacks.
Huh? Whether you use get_flags or not you still need to patch the
user-space application so that it knows how to handle
ethtool -K eth0 gro on/off
There is no way around that.
As I said earlier, I didn't use get_flags/set_flags because of the
need to depend on RX checksum offload.
BTW, here is the patch for ethtool to set the GRO flags.
diff --git a/ethtool-copy.h b/ethtool-copy.h
index eadba25..3ca4e2c 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -336,6 +336,8 @@ struct ethtool_rxnfc {
#define ETHTOOL_GRXFH 0x00000029 /* Get RX flow hash configuration */
#define ETHTOOL_SRXFH 0x0000002a /* Set RX flow hash configuration */
+#define ETHTOOL_GGRO 0x0000002b /* Get GRO enable (ethtool_value) */
+#define ETHTOOL_SGRO 0x0000002c /* Set GRO enable (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/ethtool.c b/ethtool.c
index a7c02d0..502fc8f 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -160,6 +160,7 @@ static struct option {
" [ tso on|off ]\n"
" [ ufo on|off ]\n"
" [ gso on|off ]\n"
+ " [ gro on|off ]\n"
" [ lro on|off ]\n"
},
{ "-i", "--driver", MODE_GDRV, "Show driver information" },
@@ -218,6 +219,7 @@ static int off_sg_wanted = -1;
static int off_tso_wanted = -1;
static int off_ufo_wanted = -1;
static int off_gso_wanted = -1;
+static int off_gro_wanted = -1;
static int off_lro_wanted = -1;
static struct ethtool_pauseparam epause;
@@ -333,6 +335,7 @@ static struct cmdline_info cmdline_offload[] = {
{ "tso", CMDL_BOOL, &off_tso_wanted, NULL },
{ "ufo", CMDL_BOOL, &off_ufo_wanted, NULL },
{ "gso", CMDL_BOOL, &off_gso_wanted, NULL },
+ { "gro", CMDL_BOOL, &off_gro_wanted, NULL },
{ "lro", CMDL_BOOL, &off_lro_wanted, NULL },
};
@@ -1387,7 +1390,8 @@ static int dump_coalesce(void)
return 0;
}
-static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso, int lro)
+static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso,
+ int gro, int lro)
{
fprintf(stdout,
"rx-checksumming: %s\n"
@@ -1396,6 +1400,7 @@ static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso, int l
"tcp segmentation offload: %s\n"
"udp fragmentation offload: %s\n"
"generic segmentation offload: %s\n"
+ "generic receive offload: %s\n"
"large receive offload: %s\n",
rx ? "on" : "off",
tx ? "on" : "off",
@@ -1403,6 +1408,7 @@ static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso, int l
tso ? "on" : "off",
ufo ? "on" : "off",
gso ? "on" : "off",
+ gro ? "on" : "off",
lro ? "on" : "off");
return 0;
@@ -1714,7 +1720,7 @@ static int do_goffload(int fd, struct ifreq *ifr)
{
struct ethtool_value eval;
int err, allfail = 1, rx = 0, tx = 0, sg = 0;
- int tso = 0, ufo = 0, gso = 0, lro = 0;
+ int tso = 0, ufo = 0, gso = 0, gro = 0, lro = 0;
fprintf(stdout, "Offload parameters for %s:\n", devname);
@@ -1778,6 +1784,16 @@ static int do_goffload(int fd, struct ifreq *ifr)
allfail = 0;
}
+ eval.cmd = ETHTOOL_GGRO;
+ ifr->ifr_data = (caddr_t)&eval;
+ err = ioctl(fd, SIOCETHTOOL, ifr);
+ if (err)
+ perror("Cannot get device generic receive offload settings");
+ else {
+ gro = eval.data;
+ allfail = 0;
+ }
+
eval.cmd = ETHTOOL_GFLAGS;
ifr->ifr_data = (caddr_t)&eval;
err = ioctl(fd, SIOCETHTOOL, ifr);
@@ -1793,7 +1809,7 @@ static int do_goffload(int fd, struct ifreq *ifr)
return 83;
}
- return dump_offload(rx, tx, sg, tso, ufo, gso, lro);
+ return dump_offload(rx, tx, sg, tso, ufo, gso, gro, lro);
}
static int do_soffload(int fd, struct ifreq *ifr)
@@ -1870,6 +1886,17 @@ static int do_soffload(int fd, struct ifreq *ifr)
return 90;
}
}
+ if (off_gro_wanted >= 0) {
+ changed = 1;
+ eval.cmd = ETHTOOL_SGRO;
+ eval.data = (off_gro_wanted == 1);
+ ifr->ifr_data = (caddr_t)&eval;
+ err = ioctl(fd, SIOCETHTOOL, ifr);
+ if (err) {
+ perror("Cannot set device generic receive offload settings");
+ return 93;
+ }
+ }
if (off_lro_wanted >= 0) {
changed = 1;
eval.cmd = ETHTOOL_GFLAGS;
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-14 21:09 ` Herbert Xu
@ 2008-12-14 22:00 ` Waskiewicz Jr, Peter P
2008-12-15 3:40 ` Herbert Xu
0 siblings, 1 reply; 61+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-12-14 22:00 UTC (permalink / raw)
To: Herbert Xu; +Cc: Ben Hutchings, David S. Miller, netdev@vger.kernel.org
On Sun, 14 Dec 2008, Herbert Xu wrote:
> On Sun, Dec 14, 2008 at 11:36:14AM -0800, Waskiewicz Jr, Peter P wrote:
> >
> > I'm confused then. You're adding two ethtool entry points with
> > ETHTOOL_GGRO and ETHTOOL_SGRO, adding the callpoints in dev_ethtool with
> > set_gro and get_gro, but how do you manipulate this without adding to the
> > userspace application? Adding this functionality to the set/get_flags
> > will keep the userspace app from needing a patch to support the new
> > callbacks.
>
> Huh? Whether you use get_flags or not you still need to patch the
> user-space application so that it knows how to handle
>
> ethtool -K eth0 gro on/off
>
> There is no way around that.
>
> As I said earlier, I didn't use get_flags/set_flags because of the
> need to depend on RX checksum offload.
Ok, that makes sense.
>
> BTW, here is the patch for ethtool to set the GRO flags.
Thanks Herbert. I was slightly confused when you said you wouldn't need
to affect the ethtool_ops struct, which you don't. But my initial read
didn't think to just pop it into the existing offload ethtool_ops entry
point after seeing the get/set ops you had in the kernel interface. I've
been enlightened now (or my brain has finally slipped into gear). :-)
> -static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso, int lro)
> +static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso,
> + int gro, int lro)
Would it make better sense to add GRO after LRO? I suppose it doesn't
make any difference, but I'm thinking of any possible way to provide ABI
compatibility. I don't think it's possible though; just thinking out loud
here.
Cheers,
-PJ Waskiewicz
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-14 22:00 ` Waskiewicz Jr, Peter P
@ 2008-12-15 3:40 ` Herbert Xu
0 siblings, 0 replies; 61+ messages in thread
From: Herbert Xu @ 2008-12-15 3:40 UTC (permalink / raw)
To: Waskiewicz Jr, Peter P
Cc: Ben Hutchings, David S. Miller, netdev@vger.kernel.org
On Sun, Dec 14, 2008 at 02:00:18PM -0800, Waskiewicz Jr, Peter P wrote:
>
> > -static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso, int lro)
> > +static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso,
> > + int gro, int lro)
>
> Would it make better sense to add GRO after LRO? I suppose it doesn't
> make any difference, but I'm thinking of any possible way to provide ABI
> compatibility. I don't think it's possible though; just thinking out loud
> here.
This is an internal ethtool function, not part of any ABI :)
Longer term, my plan is to first restrict LRO to hardware
offload (which cannot coexist with forwarding/bridging), and
eventually phase it out completely.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-13 1:35 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
@ 2008-12-15 23:29 ` Paul E. McKenney
2008-12-15 23:39 ` David Miller
2008-12-16 7:40 ` David Miller
1 sibling, 1 reply; 61+ messages in thread
From: Paul E. McKenney @ 2008-12-15 23:29 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev, Evgeniy Polyakov, Ben Hutchings
On Sat, Dec 13, 2008 at 12:35:21PM +1100, Herbert Xu wrote:
> net: Add Generic Receive Offload infrastructure
>
> This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
> This is pretty similar to LRO except that this is protocol-independent.
> Instead of holding packets in an lro_mgr structure, they're now held in
> napi_struct.
Looks good from an RCU standpoint, assuming that the hash table is
fixed and never resized. (If not, please see below.)
Thanx, Paul
> For drivers that intend to use this, they can set the NETIF_F_GRO bit and
> call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
> The latter will call napi_gro_receive automatically. When napi_gro_receive
> is used, the driver must either call napi_complete/napi_rx_complete, or
> call napi_gro_flush in softirq context if the driver uses the primitives
> __napi_complete/__napi_rx_complete.
>
> Protocols will set the gro_receive and gro_complete function pointers in
> order to participate in this scheme.
>
> In addition to the packet, gro_receive will get a list of currently held
> packets. Each packet in the list has a same_flow field which is non-zero
> if it is a potential match for the new packet. For each packet that may
> match, it also has a flush field which is non-zero if the held packet
> must not be merged with the new packet.
>
> Once gro_receive has determined that the new skb matches a held packet,
> the held packet may be processed immediately if the new skb cannot be
> merged with it. In this case gro_receive should return the pointer to
> the existing skb in gro_list. Otherwise the new skb should be merged into
> the existing packet and NULL should be returned, unless the new skb makes
> it impossible for any further merges to be made (e.g., a FIN packet), in
> which case the merged skb should be returned.
>
> Whenever the skb is merged into an existing entry, the gro_receive
> function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
> merely matches an existing entry but can't be merged with it, then
> this shouldn't be set.
>
> If gro_receive finds it pointless to hold the new skb for future merging,
> it should set NAPI_GRO_CB(skb)->flush.
>
> Held packets will be flushed by napi_gro_flush which is called by
> napi_complete and napi_rx_complete.
>
> Currently held packets are stored in a singly linked list just like LRO.
> The list is limited to a maximum of 8 entries. In future, this may be
> expanded to use a hash table to allow more flows to be held for merging.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---
>
> include/linux/netdevice.h | 74 ++++++------------
> include/linux/netpoll.h | 5 -
> net/core/dev.c | 186 +++++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 212 insertions(+), 53 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8bf9127..5acd176 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -311,8 +311,9 @@ struct napi_struct {
> spinlock_t poll_lock;
> int poll_owner;
> struct net_device *dev;
> - struct list_head dev_list;
> #endif
> + struct list_head dev_list;
> + struct sk_buff *gro_list;
> };
>
> enum
> @@ -372,22 +373,8 @@ static inline int napi_reschedule(struct napi_struct *napi)
> *
> * Mark NAPI processing as complete.
> */
> -static inline void __napi_complete(struct napi_struct *n)
> -{
> - BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> - list_del(&n->poll_list);
> - smp_mb__before_clear_bit();
> - clear_bit(NAPI_STATE_SCHED, &n->state);
> -}
> -
> -static inline void napi_complete(struct napi_struct *n)
> -{
> - unsigned long flags;
> -
> - local_irq_save(flags);
> - __napi_complete(n);
> - local_irq_restore(flags);
> -}
> +extern void __napi_complete(struct napi_struct *n);
> +extern void napi_complete(struct napi_struct *n);
>
> /**
> * napi_disable - prevent NAPI from scheduling
> @@ -495,9 +482,7 @@ struct net_device
> unsigned long state;
>
> struct list_head dev_list;
> -#ifdef CONFIG_NETPOLL
> struct list_head napi_list;
> -#endif
>
> /* The device initialization function. Called only once. */
> int (*init)(struct net_device *dev);
> @@ -521,6 +506,7 @@ struct net_device
> #define NETIF_F_LLTX 4096 /* LockLess TX - deprecated. Please */
> /* do not use LLTX in new drivers */
> #define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */
> +#define NETIF_F_GRO 16384 /* Generic receive offload */
> #define NETIF_F_LRO 32768 /* large receive offload */
>
> /* Segmentation offload features */
> @@ -858,22 +844,8 @@ static inline void *netdev_priv(const struct net_device *dev)
> * netif_napi_add() must be used to initialize a napi context prior to calling
> * *any* of the other napi related functions.
> */
> -static inline void netif_napi_add(struct net_device *dev,
> - struct napi_struct *napi,
> - int (*poll)(struct napi_struct *, int),
> - int weight)
> -{
> - INIT_LIST_HEAD(&napi->poll_list);
> - napi->poll = poll;
> - napi->weight = weight;
> -#ifdef CONFIG_NETPOLL
> - napi->dev = dev;
> - list_add(&napi->dev_list, &dev->napi_list);
> - spin_lock_init(&napi->poll_lock);
> - napi->poll_owner = -1;
> -#endif
> - set_bit(NAPI_STATE_SCHED, &napi->state);
> -}
> +void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> + int (*poll)(struct napi_struct *, int), int weight);
>
> /**
> * netif_napi_del - remove a napi context
> @@ -881,12 +853,20 @@ static inline void netif_napi_add(struct net_device *dev,
> *
> * netif_napi_del() removes a napi context from the network device napi list
> */
> -static inline void netif_napi_del(struct napi_struct *napi)
> -{
> -#ifdef CONFIG_NETPOLL
> - list_del(&napi->dev_list);
> -#endif
> -}
> +void netif_napi_del(struct napi_struct *napi);
> +
> +struct napi_gro_cb {
> + /* This is non-zero if the packet may be of the same flow. */
> + int same_flow;
> +
> + /* This is non-zero if the packet cannot be merged with the new skb. */
> + int flush;
> +
> + /* Number of segments aggregated. */
> + int count;
> +};
> +
> +#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
>
> struct packet_type {
> __be16 type; /* This is really htons(ether_type). */
> @@ -898,6 +878,9 @@ struct packet_type {
> struct sk_buff *(*gso_segment)(struct sk_buff *skb,
> int features);
> int (*gso_send_check)(struct sk_buff *skb);
> + struct sk_buff **(*gro_receive)(struct sk_buff **head,
> + struct sk_buff *skb);
> + int (*gro_complete)(struct sk_buff *skb);
> void *af_packet_priv;
> struct list_head list;
> };
> @@ -1251,6 +1234,9 @@ extern int netif_rx(struct sk_buff *skb);
> extern int netif_rx_ni(struct sk_buff *skb);
> #define HAVE_NETIF_RECEIVE_SKB 1
> extern int netif_receive_skb(struct sk_buff *skb);
> +extern void napi_gro_flush(struct napi_struct *napi);
> +extern int napi_gro_receive(struct napi_struct *napi,
> + struct sk_buff *skb);
> extern void netif_nit_deliver(struct sk_buff *skb);
> extern int dev_valid_name(const char *name);
> extern int dev_ioctl(struct net *net, unsigned int cmd, void __user *);
> @@ -1495,11 +1481,7 @@ static inline void __netif_rx_complete(struct net_device *dev,
> static inline void netif_rx_complete(struct net_device *dev,
> struct napi_struct *napi)
> {
> - unsigned long flags;
> -
> - local_irq_save(flags);
> - __netif_rx_complete(dev, napi);
> - local_irq_restore(flags);
> + napi_complete(napi);
> }
>
> static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
> diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
> index e3d7959..e38d3c9 100644
> --- a/include/linux/netpoll.h
> +++ b/include/linux/netpoll.h
> @@ -94,11 +94,6 @@ static inline void netpoll_poll_unlock(void *have)
> rcu_read_unlock();
> }
>
> -static inline void netpoll_netdev_init(struct net_device *dev)
> -{
> - INIT_LIST_HEAD(&dev->napi_list);
> -}
> -
> #else
> static inline int netpoll_rx(struct sk_buff *skb)
> {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4388e27..5e5132c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -130,6 +130,9 @@
>
> #include "net-sysfs.h"
>
> +/* Instead of increasing this, you should create a hash table. */
> +#define MAX_GRO_SKBS 8
> +
> /*
> * The list of packet types we will receive (as opposed to discard)
> * and the routines to invoke.
> @@ -2323,6 +2326,122 @@ static void flush_backlog(void *arg)
> }
> }
>
> +static int napi_gro_complete(struct sk_buff *skb)
> +{
> + struct packet_type *ptype;
> + __be16 type = skb->protocol;
> + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
> + int err = -ENOENT;
> +
> + if (!skb_shinfo(skb)->frag_list)
> + goto out;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ptype, head, list) {
> + if (ptype->type != type || ptype->dev || !ptype->gro_complete)
> + continue;
> +
> + err = ptype->gro_complete(skb);
> + break;
> + }
> + rcu_read_unlock();
> +
> + if (err) {
> + WARN_ON(&ptype->list == head);
Presumably ptype_base[] is a static array rather than a dynamically
allocated array that is resized under RCU protection, right? Otherwise,
you could get in trouble if the above raced with the resize operation due
to the fact that you are outside of the RCU read-side critical section.
> + kfree_skb(skb);
> + return NET_RX_SUCCESS;
> + }
> +
> +out:
> + __skb_push(skb, -skb_network_offset(skb));
> + return netif_receive_skb(skb);
> +}
> +
> +void napi_gro_flush(struct napi_struct *napi)
> +{
> + struct sk_buff *skb, *next;
> +
> + for (skb = napi->gro_list; skb; skb = next) {
> + next = skb->next;
> + skb->next = NULL;
> + napi_gro_complete(skb);
> + }
> +
> + napi->gro_list = NULL;
> +}
> +EXPORT_SYMBOL(napi_gro_flush);
> +
> +int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> +{
> + struct sk_buff **pp;
> + struct packet_type *ptype;
> + __be16 type = skb->protocol;
> + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
> + int count = 0;
> + int mac_len;
> +
> + if (!(skb->dev->features & NETIF_F_GRO))
> + goto normal;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ptype, head, list) {
> + struct sk_buff *p;
> +
> + if (ptype->type != type || ptype->dev || !ptype->gro_receive)
> + continue;
> +
> + skb_reset_network_header(skb);
> + mac_len = skb->network_header - skb->mac_header;
> + skb->mac_len = mac_len;
> + NAPI_GRO_CB(skb)->same_flow = 0;
> + NAPI_GRO_CB(skb)->flush = 0;
> +
> + for (p = napi->gro_list; p; p = p->next) {
> + count++;
> + NAPI_GRO_CB(p)->same_flow =
> + p->mac_len == mac_len &&
> + !memcmp(skb_mac_header(p), skb_mac_header(skb),
> + mac_len);
> + NAPI_GRO_CB(p)->flush = 0;
> + }
> +
> + pp = ptype->gro_receive(&napi->gro_list, skb);
> + break;
> + }
> + rcu_read_unlock();
> +
> + if (&ptype->list == head)
Ditto.
> + goto normal;
> +
> + if (pp) {
> + struct sk_buff *nskb = *pp;
> +
> + *pp = nskb->next;
> + nskb->next = NULL;
> + napi_gro_complete(nskb);
> + count--;
> + }
> +
> + if (NAPI_GRO_CB(skb)->same_flow)
> + goto ok;
> +
> + if (NAPI_GRO_CB(skb)->flush || count >= MAX_GRO_SKBS) {
> + __skb_push(skb, -skb_network_offset(skb));
> + goto normal;
> + }
> +
> + NAPI_GRO_CB(skb)->count = 1;
> + skb->next = napi->gro_list;
> + napi->gro_list = skb;
> +
> +ok:
> + return NET_RX_SUCCESS;
> +
> +normal:
> + return netif_receive_skb(skb);
> +}
> +EXPORT_SYMBOL(napi_gro_receive);
> +
> static int process_backlog(struct napi_struct *napi, int quota)
> {
> int work = 0;
> @@ -2342,9 +2461,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
> }
> local_irq_enable();
>
> - netif_receive_skb(skb);
> + napi_gro_receive(napi, skb);
> } while (++work < quota && jiffies == start_time);
>
> + napi_gro_flush(napi);
> +
> return work;
> }
>
> @@ -2365,6 +2486,61 @@ void __napi_schedule(struct napi_struct *n)
> }
> EXPORT_SYMBOL(__napi_schedule);
>
> +void __napi_complete(struct napi_struct *n)
> +{
> + BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> + BUG_ON(n->gro_list);
> +
> + list_del(&n->poll_list);
> + smp_mb__before_clear_bit();
> + clear_bit(NAPI_STATE_SCHED, &n->state);
> +}
> +EXPORT_SYMBOL(__napi_complete);
> +
> +void napi_complete(struct napi_struct *n)
> +{
> + unsigned long flags;
> +
> + napi_gro_flush(n);
> + local_irq_save(flags);
> + __napi_complete(n);
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(napi_complete);
> +
> +void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> + int (*poll)(struct napi_struct *, int), int weight)
> +{
> + INIT_LIST_HEAD(&napi->poll_list);
> + napi->gro_list = NULL;
> + napi->poll = poll;
> + napi->weight = weight;
> + list_add(&napi->dev_list, &dev->napi_list);
> +#ifdef CONFIG_NETPOLL
> + napi->dev = dev;
> + spin_lock_init(&napi->poll_lock);
> + napi->poll_owner = -1;
> +#endif
> + set_bit(NAPI_STATE_SCHED, &napi->state);
> +}
> +EXPORT_SYMBOL(netif_napi_add);
> +
> +void netif_napi_del(struct napi_struct *napi)
> +{
> + struct sk_buff *skb, *next;
> +
> + list_del(&napi->dev_list);
> +
> + for (skb = napi->gro_list; skb; skb = next) {
> + next = skb->next;
> + skb->next = NULL;
> + kfree_skb(skb);
> + }
> +
> + napi->gro_list = NULL;
> +}
> +EXPORT_SYMBOL(netif_napi_del);
> +
>
> static void net_rx_action(struct softirq_action *h)
> {
> @@ -4348,7 +4524,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
> netdev_init_queues(dev);
>
> dev->get_stats = internal_stats;
> - netpoll_netdev_init(dev);
> + INIT_LIST_HEAD(&dev->napi_list);
> setup(dev);
> strcpy(dev->name, name);
> return dev;
> @@ -4365,10 +4541,15 @@ EXPORT_SYMBOL(alloc_netdev_mq);
> */
> void free_netdev(struct net_device *dev)
> {
> + struct napi_struct *p, *n;
> +
> release_net(dev_net(dev));
>
> kfree(dev->_tx);
>
> + list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
> + netif_napi_del(p);
> +
> /* Compatibility with error handling in drivers */
> if (dev->reg_state == NETREG_UNINITIALIZED) {
> kfree((char *)dev - dev->padded);
> @@ -4904,6 +5085,7 @@ static int __init net_dev_init(void)
>
> queue->backlog.poll = process_backlog;
> queue->backlog.weight = weight_p;
> + queue->backlog.gro_list = NULL;
> }
>
> netdev_dma_register();
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-15 23:29 ` Paul E. McKenney
@ 2008-12-15 23:39 ` David Miller
2008-12-16 0:02 ` Paul E. McKenney
2008-12-16 2:04 ` Herbert Xu
0 siblings, 2 replies; 61+ messages in thread
From: David Miller @ 2008-12-15 23:39 UTC (permalink / raw)
To: paulmck; +Cc: herbert, netdev, johnpol, bhutchings
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Mon, 15 Dec 2008 15:29:42 -0800
> On Sat, Dec 13, 2008 at 12:35:21PM +1100, Herbert Xu wrote:
> > +static int napi_gro_complete(struct sk_buff *skb)
> > +{
> > + struct packet_type *ptype;
> > + __be16 type = skb->protocol;
> > + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
> > + int err = -ENOENT;
> > +
> > + if (!skb_shinfo(skb)->frag_list)
> > + goto out;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(ptype, head, list) {
> > + if (ptype->type != type || ptype->dev || !ptype->gro_complete)
> > + continue;
> > +
> > + err = ptype->gro_complete(skb);
> > + break;
> > + }
> > + rcu_read_unlock();
> > +
> > + if (err) {
> > + WARN_ON(&ptype->list == head);
>
> Presumably ptype_base[] is a static array rather than a dynamically
> allocated array that is resized under RCU protection, right? Otherwise,
> you could get in trouble if the above raced with the resize operation due
> to the fact that you are outside of the RCU read-side critical section.
Yes, and we've been using RCU this way for this table for quite
some time. From net/core/dev.c:
#define PTYPE_HASH_SIZE (16)
#define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)
static DEFINE_SPINLOCK(ptype_lock);
static struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
static struct list_head ptype_all __read_mostly; /* Taps */
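The `WARN_ON(&ptype->list == head)` test that prompted the question relies on the kernel list idiom: after a full traversal the cursor coincides with the head sentinel, and only a pointer comparison — never a dereference — is performed on it. A minimal user-space analogue (illustrative only, not kernel code):

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct ptype { int type; struct list_head list; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk a circular list the way list_for_each_entry() does.  When no
 * entry matches, the cursor ends up as the fictitious entry embedding
 * the head sentinel, so &p->list == head after the loop: comparing the
 * pointers is valid, but dereferencing p at that point would not be. */
static int list_contains(struct list_head *head, int type)
{
	struct ptype *p;

	for (p = container_of(head->next, struct ptype, list);
	     &p->list != head;
	     p = container_of(p->list.next, struct ptype, list)) {
		if (p->type == type)
			return 1;
	}
	return 0;  /* loop ran to completion: &p->list == head */
}
```

Because ptype_base[] is a static array, the sentinel address is stable, so the comparison is safe even outside the RCU read-side critical section — which is exactly the property being confirmed here.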
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-15 23:39 ` David Miller
@ 2008-12-16 0:02 ` Paul E. McKenney
2008-12-16 2:04 ` Herbert Xu
1 sibling, 0 replies; 61+ messages in thread
From: Paul E. McKenney @ 2008-12-16 0:02 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, johnpol, bhutchings
On Mon, Dec 15, 2008 at 03:39:20PM -0800, David Miller wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Date: Mon, 15 Dec 2008 15:29:42 -0800
>
> > On Sat, Dec 13, 2008 at 12:35:21PM +1100, Herbert Xu wrote:
> > > +static int napi_gro_complete(struct sk_buff *skb)
> > > +{
> > > + struct packet_type *ptype;
> > > + __be16 type = skb->protocol;
> > > + struct list_head *head = &ptype_base[ntohs(type) & PTYPE_HASH_MASK];
> > > + int err = -ENOENT;
> > > +
> > > + if (!skb_shinfo(skb)->frag_list)
> > > + goto out;
> > > +
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(ptype, head, list) {
> > > + if (ptype->type != type || ptype->dev || !ptype->gro_complete)
> > > + continue;
> > > +
> > > + err = ptype->gro_complete(skb);
> > > + break;
> > > + }
> > > + rcu_read_unlock();
> > > +
> > > + if (err) {
> > > + WARN_ON(&ptype->list == head);
> >
> > Presumably ptype_base[] is a static array rather than a dynamically
> > allocated array that is resized under RCU protection, right? Otherwise,
> > you could get in trouble if the above raced with the resize operation due
> > to the fact that you are outside of the RCU read-side critical section.
>
> Yes, and we've been using RCU this way for this table for quite
> some time. From net/core/dev.c:
>
> #define PTYPE_HASH_SIZE (16)
> #define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)
>
> static DEFINE_SPINLOCK(ptype_lock);
> static struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
> static struct list_head ptype_all __read_mostly; /* Taps */
Works for me, then!
Thanx, Paul
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-15 23:39 ` David Miller
2008-12-16 0:02 ` Paul E. McKenney
@ 2008-12-16 2:04 ` Herbert Xu
2008-12-16 16:37 ` Paul E. McKenney
1 sibling, 1 reply; 61+ messages in thread
From: Herbert Xu @ 2008-12-16 2:04 UTC (permalink / raw)
To: David Miller; +Cc: paulmck, netdev, johnpol, bhutchings
On Mon, Dec 15, 2008 at 03:39:20PM -0800, David Miller wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>
> > Presumably ptype_base[] is a static array rather than a dynamically
> > allocated array that is resized under RCU protection, right? Otherwise,
> > you could get in trouble if the above raced with the resize operation due
> > to the fact that you are outside of the RCU read-side critical section.
>
> Yes, and we've been using RCU this way for this table for quite
> some time. From net/core/dev.c:
>
> #define PTYPE_HASH_SIZE (16)
> #define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)
>
> static DEFINE_SPINLOCK(ptype_lock);
> static struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
> static struct list_head ptype_all __read_mostly; /* Taps */
Thanks for confirming. Yes, unless there is a sudden surge of
new Ethernet protocols we shouldn't ever need to resize this :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 1/8] net: Add frag_list support to skb_segment
2008-12-13 1:35 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
@ 2008-12-16 7:27 ` David Miller
0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:27 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 12:35:19 +1100
> net: Add frag_list support to skb_segment
>
> This patch adds limited support for handling frag_list packets in
> skb_segment. The intention is to support GRO (Generic Receive Offload)
> packets which will be constructed by chaining normal packets using
> frag_list.
>
> As such we require that all frag_list members terminate on exact MSS
> boundaries. This is checked using BUG_ON.
>
> As there should only be one producer in the kernel of such packets,
> namely GRO, this requirement should not be difficult to maintain.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied.
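The invariant the patch's BUG_ON enforces can be sketched in plain C (a hypothetical userspace helper, not the kernel code): every frag_list member must end on a cumulative multiple of the MSS, so segmentation can later emit each member as-is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the skb_segment precondition: walking the frag_list, the
 * running payload total must sit on an MSS boundary after every
 * member. GRO, the only producer of such chains, builds them this
 * way, so in the kernel a violation is a bug (hence BUG_ON). */
static bool frag_list_on_mss_boundaries(const size_t *lens, size_t n,
                                        size_t mss)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        total += lens[i];
        if (total % mss != 0)  /* would trip the BUG_ON */
            return false;
    }
    return true;
}
```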
* Re: [PATCH 2/8] net: Add frag_list support to GSO
2008-12-13 1:35 ` [PATCH 2/8] net: Add frag_list support to GSO Herbert Xu
@ 2008-12-16 7:30 ` David Miller
0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:30 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 12:35:20 +1100
> net: Add frag_list support to GSO
>
> This patch allows GSO to handle frag_list in a limited way for the
> purposes of allowing packets merged by GRO to be refragmented on
> output.
>
> Most hardware won't (and isn't expected to) support handling GRO
> frag_list packets directly. Therefore we will perform GSO in
> software for those cases.
>
> However, for drivers that can support it (such as virtual NICs) we
> may not have to segment the packets at all.
>
> Whether the added overhead of GRO/GSO is worthwhile for bridges
> and routers when weighed against the benefit of potentially
> increasing the MTU within the host is still an open question.
> However, for the case of host nodes this is undoubtedly a win.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied.
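Because GRO builds the chain on exact MSS boundaries, software resegmentation is cheap: GSO walks the chain and emits members without copying payload. One consequence is that the output frame count is just the merged payload divided by the MSS (a sketch under that assumption, rounding up for a short head/tail segment):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of frames software GSO emits when
 * refragmenting a GRO super-packet of total_payload bytes. */
static size_t gso_segment_count(size_t total_payload, size_t mss)
{
    return (total_payload + mss - 1) / mss;
}
```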
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-13 1:35 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
2008-12-15 23:29 ` Paul E. McKenney
@ 2008-12-16 7:40 ` David Miller
1 sibling, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:40 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 12:35:21 +1100
> net: Add Generic Receive Offload infrastructure
...
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied, although there were some merge issues to deal with
wrt. Neil's netpoll race fix. That wasn't too bad.
Also:
> +int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> +{
> + struct sk_buff **pp;
I made this 'pp' variable get initialized to NULL because
gcc is stupid and thinks this might get used uninitialized.
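The warning pattern gcc trips on looks roughly like this (a userspace reduction, not the kernel code; names are ours):

```c
#include <assert.h>
#include <stddef.h>

/* 'pp' is assigned before every use that can actually occur, but the
 * compiler cannot always prove that, so -Wmaybe-uninitialized may
 * fire. Initializing to NULL, as was done for napi_gro_receive,
 * silences the warning and makes the fallback path explicit. */
static int *find_entry(int *table[], size_t n, int key)
{
    int **pp = NULL;

    for (size_t i = 0; i < n; i++) {
        if (table[i] && *table[i] == key) {
            pp = &table[i];
            break;
        }
    }
    return pp ? *pp : NULL;
}
```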
* Re: [PATCH 4/8] ipv4: Add GRO infrastructure
2008-12-13 1:35 ` [PATCH 4/8] ipv4: Add GRO infrastructure Herbert Xu
@ 2008-12-16 7:41 ` David Miller
0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:41 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 12:35:22 +1100
> ipv4: Add GRO infrastructure
...
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied.
* Re: [PATCH 5/8] net: Add skb_gro_receive
2008-12-13 2:52 ` Herbert Xu
@ 2008-12-16 7:42 ` David Miller
0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:42 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 13:52:46 +1100
> On Sat, Dec 13, 2008 at 12:35:23PM +1100, Herbert Xu wrote:
> > net: Add skb_gro_receive
> >
> > This patch adds the helper skb_gro_receive to merge packets for
> > GRO. The current method is to allocate a new header skb and then
> > chain the original packets to its frag_list. This is done to
> > make it easier to integrate into the existing GSO framework.
> >
> > In future as GSO is moved into the drivers, we can undo this and
> > simply chain the original packets together.
> >
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> Looks like I forgot to stop the merging at 64K. Here is an update
> with the check added.
Applied.
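The 64K cutoff exists because the merged packet must still fit the 16-bit IP total-length field when it is refragmented. A sketch of the merge gate (the helper and constant names are ours; the exact comparison in the patch may differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Maximum size of a merged GRO super-packet: bounded by the 16-bit
 * IP length field, matching the limit on GSO packets. */
#define GRO_MAX_MERGED 65536u

/* May the incoming packet still be merged into the held one? */
static bool gro_can_merge(uint32_t held_len, uint32_t new_len)
{
    return held_len + new_len <= GRO_MAX_MERGED;
}
```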
* Re: [PATCH 6/8] tcp: Add GRO support
2008-12-13 1:35 ` [PATCH 6/8] tcp: Add GRO support Herbert Xu
@ 2008-12-16 7:43 ` David Miller
0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:43 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 12:35:24 +1100
> tcp: Add GRO support
>
> This patch adds the TCP-specific portion of GRO. The criterion for
> merging is extremely strict (the TCP header must match exactly apart
> from the checksum) so as to allow refragmentation. Otherwise this
> is pretty much identical to LRO, except that we support the merging
> of ECN packets.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied.
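The "match exactly apart from the checksum" criterion can be sketched as a header comparison plus an in-order sequence check (a simplified userspace model; the real code compares the raw TCP header words and handles options, ECN, and flags more carefully):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct tcphdr; field set is illustrative. */
struct tcphdr_lite {
    uint32_t seq;
    uint16_t source, dest;
    uint16_t flags;
    uint16_t window;
};

/* Merge test: every header field must match the held packet and the
 * new segment must start exactly where the held payload ends, so the
 * merged result can be resegmented into byte-identical frames. */
static bool tcp_gro_match(const struct tcphdr_lite *held,
                          uint32_t held_payload,
                          const struct tcphdr_lite *in)
{
    return in->source == held->source &&
           in->dest   == held->dest   &&
           in->flags  == held->flags  &&
           in->window == held->window &&
           in->seq    == held->seq + held_payload;
}
```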
* Re: [PATCH 7/8] ethtool: Add GGRO and SGRO ops
2008-12-13 1:35 ` [PATCH 7/8] ethtool: Add GGRO and SGRO ops Herbert Xu
@ 2008-12-16 7:44 ` David Miller
0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:44 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 12:35:25 +1100
> ethtool: Add GGRO and SGRO ops
>
> This patch adds the ethtool ops to enable and disable GRO. It also
> makes GRO depend on RX checksum offload much the same as how TSO
> depends on SG support.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied.
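The dependency rule ("GRO requires RX checksum offload, as TSO requires SG") amounts to masking the requested feature set (a sketch with invented flag names, not the ethtool API):

```c
#include <assert.h>

enum {
    FEAT_RXCSUM = 1 << 0,  /* stand-in for RX checksum offload */
    FEAT_GRO    = 1 << 1,  /* stand-in for the GRO flag        */
};

/* Enforce the dependency: GRO cannot be on without RX checksumming,
 * and clearing RX checksum implicitly clears GRO as well. */
static unsigned int apply_feature_deps(unsigned int requested)
{
    if (!(requested & FEAT_RXCSUM))
        requested &= ~FEAT_GRO;
    return requested;
}
```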
* Re: [PATCH 8/8] e1000e: Add GRO support
2008-12-13 1:35 ` [PATCH 8/8] e1000e: Add GRO support Herbert Xu
@ 2008-12-16 7:46 ` David Miller
0 siblings, 0 replies; 61+ messages in thread
From: David Miller @ 2008-12-16 7:46 UTC (permalink / raw)
To: herbert; +Cc: netdev, johnpol, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 13 Dec 2008 12:35:26 +1100
> e1000e: Add GRO support
>
> This patch adds GRO support to e1000e by making it invoke napi_gro_receive
> instead of netif_receive_skb.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied, thanks a lot Herbert.
* Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure
2008-12-16 2:04 ` Herbert Xu
@ 2008-12-16 16:37 ` Paul E. McKenney
0 siblings, 0 replies; 61+ messages in thread
From: Paul E. McKenney @ 2008-12-16 16:37 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, johnpol, bhutchings
On Tue, Dec 16, 2008 at 01:04:32PM +1100, Herbert Xu wrote:
> On Mon, Dec 15, 2008 at 03:39:20PM -0800, David Miller wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> >
> > > Presumably ptype_base[] is a static array rather than a dynamically
> > > allocated array that is resized under RCU protection, right? Otherwise,
> > > you could get in trouble if the above raced with the resize operation due
> > > to the fact that you are outside of the RCU read-side critical section.
> >
> > Yes, and we've been using RCU this way for this table for quite
> > some time. From net/core/dev.c:
> >
> > #define PTYPE_HASH_SIZE (16)
> > #define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)
> >
> > static DEFINE_SPINLOCK(ptype_lock);
> > static struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
> > static struct list_head ptype_all __read_mostly; /* Taps */
>
> Thanks for confirming. Yes, unless there is a sudden surge of
> new Ethernet protocols, we shouldn't need to resize this ever :)
;-)
Thanx, Paul
end of thread, other threads:[~2008-12-16 16:38 UTC | newest]
Thread overview: 61+ messages
2008-12-12 5:31 [0/8] net: Generic Receive Offload Herbert Xu
2008-12-12 5:31 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
2008-12-12 19:46 ` Evgeniy Polyakov
2008-12-12 21:41 ` Herbert Xu
2008-12-13 2:38 ` Evgeniy Polyakov
2008-12-13 2:43 ` Herbert Xu
2008-12-13 3:11 ` Evgeniy Polyakov
2008-12-13 3:20 ` Herbert Xu
2008-12-12 5:31 ` [PATCH 2/8] net: Add frag_list support to GSO Herbert Xu
2008-12-12 5:31 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
2008-12-12 19:51 ` Evgeniy Polyakov
2008-12-12 21:45 ` Herbert Xu
2008-12-12 22:25 ` Ben Hutchings
2008-12-12 22:56 ` Herbert Xu
2008-12-12 23:11 ` Herbert Xu
2008-12-13 3:43 ` Herbert Xu
2008-12-13 14:03 ` Evgeniy Polyakov
2008-12-12 5:31 ` [PATCH 4/8] ipv4: Add GRO infrastructure Herbert Xu
2008-12-12 22:55 ` Ben Hutchings
2008-12-12 23:04 ` Herbert Xu
2008-12-12 5:31 ` [PATCH 5/8] net: Add skb_gro_receive Herbert Xu
2008-12-12 5:31 ` [PATCH 6/8] tcp: Add GRO support Herbert Xu
2008-12-12 19:56 ` Evgeniy Polyakov
2008-12-12 21:46 ` Herbert Xu
2008-12-13 2:40 ` Evgeniy Polyakov
2008-12-13 2:46 ` Herbert Xu
2008-12-13 3:10 ` Evgeniy Polyakov
2008-12-13 3:19 ` Herbert Xu
2008-12-12 5:31 ` [PATCH 7/8] ethtool: Add GGRO and SGRO ops Herbert Xu
2008-12-12 20:11 ` Ben Hutchings
2008-12-12 21:48 ` Herbert Xu
2008-12-12 22:35 ` Ben Hutchings
2008-12-12 22:49 ` Herbert Xu
2008-12-14 19:36 ` Waskiewicz Jr, Peter P
2008-12-14 21:09 ` Herbert Xu
2008-12-14 22:00 ` Waskiewicz Jr, Peter P
2008-12-15 3:40 ` Herbert Xu
2008-12-12 5:31 ` [PATCH 8/8] e1000e: Add GRO support Herbert Xu
2008-12-13 1:34 ` [0/8] net: Generic Receive Offload Herbert Xu
2008-12-13 1:35 ` [PATCH 1/8] net: Add frag_list support to skb_segment Herbert Xu
2008-12-16 7:27 ` David Miller
2008-12-13 1:35 ` [PATCH 2/8] net: Add frag_list support to GSO Herbert Xu
2008-12-16 7:30 ` David Miller
2008-12-13 1:35 ` [PATCH 3/8] net: Add Generic Receive Offload infrastructure Herbert Xu
2008-12-15 23:29 ` Paul E. McKenney
2008-12-15 23:39 ` David Miller
2008-12-16 0:02 ` Paul E. McKenney
2008-12-16 2:04 ` Herbert Xu
2008-12-16 16:37 ` Paul E. McKenney
2008-12-16 7:40 ` David Miller
2008-12-13 1:35 ` [PATCH 4/8] ipv4: Add GRO infrastructure Herbert Xu
2008-12-16 7:41 ` David Miller
2008-12-13 1:35 ` [PATCH 5/8] net: Add skb_gro_receive Herbert Xu
2008-12-13 2:52 ` Herbert Xu
2008-12-16 7:42 ` David Miller
2008-12-13 1:35 ` [PATCH 6/8] tcp: Add GRO support Herbert Xu
2008-12-16 7:43 ` David Miller
2008-12-13 1:35 ` [PATCH 7/8] ethtool: Add GGRO and SGRO ops Herbert Xu
2008-12-16 7:44 ` David Miller
2008-12-13 1:35 ` [PATCH 8/8] e1000e: Add GRO support Herbert Xu
2008-12-16 7:46 ` David Miller