netdev.vger.kernel.org archive mirror
* [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
@ 2014-09-25 14:17 Eric Dumazet
  2014-09-25 14:43 ` Jesper Dangaard Brouer
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 14:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Amir Vadai

From: Eric Dumazet <edumazet@google.com>

skb->xmit_more tells us if another skb is coming next.

We need to send the doorbell when xmit_more is not set, or when the
txqueue is stopped (preventing the next skb from coming immediately).

Tested with a modified version of pktgen, I got a 40% increase in
throughput.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |   23 +++++++++++++------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index c44f4237b9be..adedc47e947d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -667,6 +667,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	int lso_header_size;
 	void *fragptr;
 	bool bounce = false;
+	bool send_doorbell;
 
 	if (!priv->port_up)
 		goto tx_drop;
@@ -878,12 +879,16 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	skb_tx_timestamp(skb);
 
-	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce && !vlan_tx_tag_present(skb)) {
+	send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
+
+	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
+	    !vlan_tx_tag_present(skb) && send_doorbell) {
 		tx_desc->ctrl.bf_qpn |= cpu_to_be32(ring->doorbell_qpn);
 
 		op_own |= htonl((bf_index & 0xffff) << 8);
-		/* Ensure new descirptor hits memory
-		* before setting ownership of this descriptor to HW */
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
 		wmb();
 		tx_desc->ctrl.owner_opcode = op_own;
 
@@ -896,12 +901,16 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		ring->bf.offset ^= ring->bf.buf_size;
 	} else {
-		/* Ensure new descirptor hits memory
-		* before setting ownership of this descriptor to HW */
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
 		wmb();
 		tx_desc->ctrl.owner_opcode = op_own;
-		wmb();
-		iowrite32be(ring->doorbell_qpn, ring->bf.uar->map + MLX4_SEND_DOORBELL);
+		if (send_doorbell) {
+			wmb();
+			iowrite32be(ring->doorbell_qpn,
+				    ring->bf.uar->map + MLX4_SEND_DOORBELL);
+		}
 	}
 
 	return NETDEV_TX_OK;


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 14:17 [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell Eric Dumazet
@ 2014-09-25 14:43 ` Jesper Dangaard Brouer
  2014-09-25 15:06   ` Eric Dumazet
  2014-09-25 15:53 ` [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell Alexei Starovoitov
  2014-09-28 21:27 ` David Miller
  2 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-25 14:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, David Miller, netdev, Amir Vadai, John Fastabend

On Thu, 25 Sep 2014 07:17:49 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> Tested with a modified version of pktgen, I got a 40% increase in
> throughput.

Pktgen is artificial benchmarking, you know ;-) :-P

If you really must use pktgen for this, you can use an unmodified
pktgen against the qdisc layer by adding a VLAN interface on top of the
device you want to test.  I blame John for telling me this ;-)

/me running away
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 14:43 ` Jesper Dangaard Brouer
@ 2014-09-25 15:06   ` Eric Dumazet
  2014-09-25 15:36     ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 15:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Miller, netdev, Amir Vadai, John Fastabend

On Thu, 2014-09-25 at 16:43 +0200, Jesper Dangaard Brouer wrote:
> On Thu, 25 Sep 2014 07:17:49 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Tested with a modified version of pktgen, I got a 40% increase in
> > throughput.
> 
> Pktgen is artificial benchmarking, you know ;-) :-P
> 
> If you really must use pktgen for this, you can use an unmodified
> pktgen against the qdisc layer by adding a VLAN interface on-top of the
> device you want to test.  I blame John, for telling me this ;-)
> 
> /me running away


But... nothing in the qdisc layer sets xmit_more yet, unless TSO is
disabled and a GSO packet gets segmented.

So you might notice some improvement with this patch, but not if you use
TSO, and a 40Gb NIC is better with TSO on, you know ;)
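
For reference, the core loop that hands a segment list to the driver
looks roughly like this (a simplified sketch, not the exact
net/core/dev.c code):

	struct sk_buff *next;
	netdev_tx_t rc;

	while (skb) {
		next = skb->next;
		skb->next = NULL;
		/* more segments follow, so the driver may defer its doorbell */
		skb->xmit_more = (next != NULL);
		rc = dev->netdev_ops->ndo_start_xmit(skb, dev);
		skb = next;
	}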


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 15:06   ` Eric Dumazet
@ 2014-09-25 15:36     ` Eric Dumazet
  2014-09-25 15:40       ` Eric Dumazet
                         ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 15:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Miller, netdev, Amir Vadai, John Fastabend

On Thu, 2014-09-25 at 08:06 -0700, Eric Dumazet wrote:

> 
> But... nothing in the qdisc layer sets xmit_more yet, unless TSO is
> disabled and a GSO packet gets segmented.
> 
> So you might notice some improvement with this patch, but not if you use
> TSO, and a 40Gb NIC is better with TSO on, you know ;)
> 

So playing with TSO=off (but GSO=on), we see about a 12% increase in
throughput on a TCP_STREAM flow.

And the kernel profile clearly shows the difference between sending the
doorbell for every packet or not, as mlx4_en_xmit() shrinks. Note also
that mlx4_en_free_tx_desc() disappears; this might point to some false
sharing, or something else worth investigating.

Before patch:

    26.36%  [kernel]  [k] __copy_skb_header             
    16.59%  [kernel]  [k] mlx4_en_xmit                  
     5.62%  [kernel]  [k] __alloc_skb                   
     5.48%  [kernel]  [k] copy_user_enhanced_fast_string
     4.48%  [kernel]  [k] skb_segment                   
     2.46%  [kernel]  [k] mlx4_en_free_tx_desc.isra.27  
     2.31%  [kernel]  [k] _raw_spin_lock                
     2.10%  [kernel]  [k] memcpy                        
     2.01%  [kernel]  [k] tcp_sendmsg                   
     1.62%  [kernel]  [k] __iowrite64_copy              


After patch:

    32.78%  [kernel]  [k] __copy_skb_header             
     8.26%  [kernel]  [k] mlx4_en_xmit                  
     7.25%  [kernel]  [k] __alloc_skb                   
     7.18%  [kernel]  [k] copy_user_enhanced_fast_string
     4.39%  [kernel]  [k] skb_segment                   
     2.87%  [kernel]  [k] memcpy                        
     2.59%  [kernel]  [k] tcp_sendmsg                   
     2.50%  [kernel]  [k] _raw_spin_lock                
     2.38%  [kernel]  [k] dev_hard_start_xmit           
     1.52%  [kernel]  [k] tcp_gso_segment               
     1.50%  [kernel]  [k] kmem_cache_alloc_node_trace   
     1.40%  [kernel]  [k] kmem_cache_alloc_node         
     1.16%  [kernel]  [k] ip_send_check                 


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 15:36     ` Eric Dumazet
@ 2014-09-25 15:40       ` Eric Dumazet
  2014-09-25 16:19       ` Eric Dumazet
  2014-09-25 20:47       ` Eric Dumazet
  2 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 15:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Miller, netdev, Amir Vadai, John Fastabend

On Thu, 2014-09-25 at 08:36 -0700, Eric Dumazet wrote:

> So playing with TSO=off (but GSO=on), we see about a 12% increase in
> throughput on a TCP_STREAM flow.

Correction: this was a 20% increase.

(849284 pps -> 1023469 pps)


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 14:17 [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell Eric Dumazet
  2014-09-25 14:43 ` Jesper Dangaard Brouer
@ 2014-09-25 15:53 ` Alexei Starovoitov
  2014-09-25 16:05   ` Eric Dumazet
  2014-09-28 21:27 ` David Miller
  2 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2014-09-25 15:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Amir Vadai

On Thu, Sep 25, 2014 at 7:17 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> skb->xmit_more tells us if another skb is coming next.
>
> We need to send the doorbell when xmit_more is not set, or when the
> txqueue is stopped (preventing the next skb from coming immediately).
>
> Tested with a modified version of pktgen, I got a 40% increase in
> throughput.

this is awesome!

I've been hacking pktgen as well, based on Jesper's earlier patch but
in a slightly different way. Sounds like you already have a working
pktgen with xmit_more. Do you mind sharing it? Even a rough patch
would be great.

Thanks


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 15:53 ` [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell Alexei Starovoitov
@ 2014-09-25 16:05   ` Eric Dumazet
  2014-09-25 16:08     ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 16:05 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Miller, netdev, Amir Vadai

On Thu, 2014-09-25 at 08:53 -0700, Alexei Starovoitov wrote:
> On Thu, Sep 25, 2014 at 7:17 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > skb->xmit_more tells us if another skb is coming next.
> >
> > We need to send the doorbell when xmit_more is not set, or when the
> > txqueue is stopped (preventing the next skb from coming immediately).
> >
> > Tested with a modified version of pktgen, I got a 40% increase in
> > throughput.
> 
> this is awesome!
> 
> I've been hacking pktgen as well, based on Jesper's earlier patch but
> in a slightly different way. Sounds like you already have a working
> pktgen with xmit_more. Do you mind sharing it? Even a rough patch
> would be great.

The plan was to add a 'pburst x' parameter.

Because not only do we want to avoid sending the doorbell, we could
also avoid taking the spinlock for every start_xmit().

We can also enforce a global [p]rate (say 100000 packets per second),
but still allow pktgen to send bursts of x packets to use (and test)
this xmit_more.

Nothing ready yet ;)
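
The burst loop could look roughly like this (a sketch of the idea only,
not the eventual patch; 'burst' would be the new parameter):

	int i;
	netdev_tx_t ret;

	/* take the TX queue lock once for the whole burst, not per packet */
	HARD_TX_LOCK(odev, txq, smp_processor_id());
	for (i = 0; i < burst; i++) {
		/* every skb but the last announces a follow-up packet,
		 * so the driver can defer its doorbell
		 */
		pkt_dev->skb->xmit_more = (i + 1 < burst);
		atomic_inc(&pkt_dev->skb->users);
		ret = odev->netdev_ops->ndo_start_xmit(pkt_dev->skb, odev);
		if (ret != NETDEV_TX_OK ||
		    netif_xmit_frozen_or_stopped(txq))
			break;
	}
	HARD_TX_UNLOCK(odev, txq);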


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 16:05   ` Eric Dumazet
@ 2014-09-25 16:08     ` Alexei Starovoitov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2014-09-25 16:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Amir Vadai

On Thu, Sep 25, 2014 at 9:05 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2014-09-25 at 08:53 -0700, Alexei Starovoitov wrote:
>> On Thu, Sep 25, 2014 at 7:17 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > skb->xmit_more tells us if another skb is coming next.
>> >
>> > We need to send the doorbell when xmit_more is not set, or when the
>> > txqueue is stopped (preventing the next skb from coming immediately).
>> >
>> > Tested with a modified version of pktgen, I got a 40% increase in
>> > throughput.
>>
>> this is awesome!
>>
>> I've been hacking pktgen as well, based on Jesper's earlier patch but
>> in a slightly different way. Sounds like you already have a working
>> pktgen with xmit_more. Do you mind sharing it? Even a rough patch
>> would be great.
>
> The plan was to add a 'pburst x' parameter.
>
> Because not only do we want to avoid sending the doorbell, we could
> also avoid taking the spinlock for every start_xmit().

exactly!
I also hacked
atomic_inc(&(pkt_dev->skb->users))
into
a single atomic_add of N for a whole burst, to amortize the cost
of both the spin_lock and the lock xadd.
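
Roughly, as a sketch of that idea ('burst' being the number of packets
queued back-to-back):

	/* take all the references for the burst in one locked operation,
	 * instead of one atomic_inc() per packet sent
	 */
	atomic_add(burst, &pkt_dev->skb->users);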

> We can also enforce a global [p]rate (say 100000 packets per second),
> but still allow pktgen to send bursts of x packets to use (and test)
> this xmit_more.

that would be nice. I'm not sure how yet.

> Nothing ready yet ;)

can't wait :)


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 15:36     ` Eric Dumazet
  2014-09-25 15:40       ` Eric Dumazet
@ 2014-09-25 16:19       ` Eric Dumazet
  2014-09-25 20:47       ` Eric Dumazet
  2 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 16:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Miller, netdev, Amir Vadai, John Fastabend

On Thu, 2014-09-25 at 08:36 -0700, Eric Dumazet wrote:


>     26.36%  [kernel]  [k] __copy_skb_header     

Note to myself:

Time to optimize this thing, using a single memcpy() to copy a bunch of
consecutive fields.
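
The idea would be to group the copied fields between two zero-size
markers, so the whole range can be copied in one call; the sk_buff
patch later in this thread does exactly this:

	memcpy(&new->headers_start, &old->headers_start,
	       offsetof(struct sk_buff, headers_end) -
	       offsetof(struct sk_buff, headers_start));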


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 15:36     ` Eric Dumazet
  2014-09-25 15:40       ` Eric Dumazet
  2014-09-25 16:19       ` Eric Dumazet
@ 2014-09-25 20:47       ` Eric Dumazet
  2014-09-25 20:58         ` Joe Perches
  2014-09-25 21:20         ` [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header() Eric Dumazet
  2 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 20:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Miller, netdev, Amir Vadai, John Fastabend

On Thu, 2014-09-25 at 08:36 -0700, Eric Dumazet wrote:

> After patch:
> 
>     32.78%  [kernel]  [k] __copy_skb_header             
>      8.26%  [kernel]  [k] mlx4_en_xmit                  
>      7.25%  [kernel]  [k] __alloc_skb                   
>      7.18%  [kernel]  [k] copy_user_enhanced_fast_string
>      4.39%  [kernel]  [k] skb_segment                   
>      2.87%  [kernel]  [k] memcpy                        
>      2.59%  [kernel]  [k] tcp_sendmsg                   
>      2.50%  [kernel]  [k] _raw_spin_lock                
>      2.38%  [kernel]  [k] dev_hard_start_xmit           
>      1.52%  [kernel]  [k] tcp_gso_segment               
>      1.50%  [kernel]  [k] kmem_cache_alloc_node_trace   
>      1.40%  [kernel]  [k] kmem_cache_alloc_node         
>      1.16%  [kernel]  [k] ip_send_check                 
> 

After the __copy_skb_header() optimization, I get something even nicer ;)

    19.90%  [kernel]          [k] __copy_skb_header             
    15.16%  [kernel]          [k] skb_segment                   
     7.02%  [kernel]          [k] __alloc_skb                   
     6.89%  [kernel]          [k] copy_user_enhanced_fast_string
     6.60%  [kernel]          [k] mlx4_en_xmit                  
     2.70%  [kernel]          [k] _raw_spin_lock                
     2.68%  [kernel]          [k] memcpy                        
     2.41%  [kernel]          [k] tcp_sendmsg                   

I'll post this patch asap.


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 20:47       ` Eric Dumazet
@ 2014-09-25 20:58         ` Joe Perches
  2014-09-25 21:20         ` [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header() Eric Dumazet
  1 sibling, 0 replies; 19+ messages in thread
From: Joe Perches @ 2014-09-25 20:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, David Miller, netdev, Amir Vadai,
	John Fastabend

On Thu, 2014-09-25 at 13:47 -0700, Eric Dumazet wrote:
> After patch:
>     32.78%  [kernel]  [k] __copy_skb_header
> After the __copy_skb_header() optimization, I get something even nicer ;)
>     19.90%  [kernel]          [k] __copy_skb_header             
> I'll post this patch asap.

Eric, you are _killing_ it.  Network ninja indeed...


* [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header()
  2014-09-25 20:47       ` Eric Dumazet
  2014-09-25 20:58         ` Joe Perches
@ 2014-09-25 21:20         ` Eric Dumazet
  2014-09-26  8:35           ` Jesper Dangaard Brouer
                             ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Eric Dumazet @ 2014-09-25 21:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Amir Vadai, John Fastabend, Jesper Dangaard Brouer

From: Eric Dumazet <edumazet@google.com>

With the proliferation of bit fields in sk_buff, __copy_skb_header()
became quite expensive, showing up as the most expensive function in a
GSO workload.

__copy_skb_header() performance is also critical for non-GSO TCP
operations, as it is used from skb_clone().

This patch carefully moves all the fields that were not copied into a
separate zone: cloned, nohdr, fclone, peeked, head_frag, xmit_more.

Then I moved all the other copied fields into a section delimited by
headers_start[0]/headers_end[0] so that we can use a single memcpy()
call, inlined by the compiler using long word load/stores.

I also tried to make all copies follow the natural order of sk_buff,
to help hardware prefetching.

I made sure the sk_buff size did not change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  132 ++++++++++++++++++++-------------------
 net/core/skbuff.c      |   80 ++++++++++++-----------
 2 files changed, 112 insertions(+), 100 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f1bfa3781c75..180f55b9ef21 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -527,27 +527,41 @@ struct sk_buff {
 	char			cb[48] __aligned(8);
 
 	unsigned long		_skb_refdst;
+	void			(*destructor)(struct sk_buff *skb);
 #ifdef CONFIG_XFRM
 	struct	sec_path	*sp;
 #endif
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+	struct nf_conntrack	*nfct;
+#endif
+#ifdef CONFIG_BRIDGE_NETFILTER
+	struct nf_bridge_info	*nf_bridge;
+#endif
 	unsigned int		len,
 				data_len;
 	__u16			mac_len,
 				hdr_len;
-	union {
-		__wsum		csum;
-		struct {
-			__u16	csum_start;
-			__u16	csum_offset;
-		};
-	};
-	__u32			priority;
+
+	/* Following fields are _not_ copied in __copy_skb_header()
+	 * Note that queue_mapping is here mostly to fill a hole.
+	 */
 	kmemcheck_bitfield_begin(flags1);
-	__u8			ignore_df:1,
-				cloned:1,
-				ip_summed:2,
+	__u16			queue_mapping;
+	__u8			cloned:1,
 				nohdr:1,
-				nfctinfo:3;
+				fclone:2,
+				peeked:1,
+				head_frag:1,
+				xmit_more:1;
+	/* one bit hole */
+	kmemcheck_bitfield_end(flags1);
+
+
+
+	/* fields enclosed in headers_start/headers_end are copied
+	 * using a single memcpy() in __copy_skb_header()
+	 */
+	__u32			headers_start[0];
 
 /* if you move pkt_type around you also must adapt those constants */
 #ifdef __BIG_ENDIAN_BITFIELD
@@ -558,58 +572,52 @@ struct sk_buff {
 #define PKT_TYPE_OFFSET()	offsetof(struct sk_buff, __pkt_type_offset)
 
 	__u8			__pkt_type_offset[0];
-	__u8			pkt_type:3,
-				fclone:2,
-				ipvs_property:1,
-				peeked:1,
-				nf_trace:1;
-	kmemcheck_bitfield_end(flags1);
-	__be16			protocol;
-
-	void			(*destructor)(struct sk_buff *skb);
-#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-	struct nf_conntrack	*nfct;
-#endif
-#ifdef CONFIG_BRIDGE_NETFILTER
-	struct nf_bridge_info	*nf_bridge;
-#endif
-
-	int			skb_iif;
-
-	__u32			hash;
-
-	__be16			vlan_proto;
-	__u16			vlan_tci;
-
-#ifdef CONFIG_NET_SCHED
-	__u16			tc_index;	/* traffic control index */
-#ifdef CONFIG_NET_CLS_ACT
-	__u16			tc_verd;	/* traffic control verdict */
-#endif
-#endif
-
-	__u16			queue_mapping;
-	kmemcheck_bitfield_begin(flags2);
-	__u8			xmit_more:1;
-#ifdef CONFIG_IPV6_NDISC_NODETYPE
-	__u8			ndisc_nodetype:2;
-#endif
+	__u8			pkt_type:3;
 	__u8			pfmemalloc:1;
+	__u8			ignore_df:1;
+	__u8			nfctinfo:3;
+
+	__u8			nf_trace:1;
+	__u8			ip_summed:2;
 	__u8			ooo_okay:1;
 	__u8			l4_hash:1;
 	__u8			sw_hash:1;
 	__u8			wifi_acked_valid:1;
 	__u8			wifi_acked:1;
+
 	__u8			no_fcs:1;
-	__u8			head_frag:1;
 	/* Indicates the inner headers are valid in the skbuff. */
 	__u8			encapsulation:1;
 	__u8			encap_hdr_csum:1;
 	__u8			csum_valid:1;
 	__u8			csum_complete_sw:1;
-	/* 1/3 bit hole (depending on ndisc_nodetype presence) */
-	kmemcheck_bitfield_end(flags2);
+	__u8			csum_level:2;
+	__u8			csum_bad:1;
 
+#ifdef CONFIG_IPV6_NDISC_NODETYPE
+	__u8			ndisc_nodetype:2;
+#endif
+	/* 6/8 bit hole */
+
+#ifdef CONFIG_NET_SCHED
+	__u16			tc_index;	/* traffic control index */
+#ifdef CONFIG_NET_CLS_ACT
+	__u16			tc_verd;	/* traffic control verdict */
+#endif
+#endif
+
+	union {
+		__wsum		csum;
+		struct {
+			__u16	csum_start;
+			__u16	csum_offset;
+		};
+	};
+	__u32			priority;
+	int			skb_iif;
+	__u32			hash;
+	__be16			vlan_proto;
+	__u16			vlan_tci;
 #if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
 	union {
 		unsigned int	napi_id;
@@ -625,19 +633,18 @@ struct sk_buff {
 		__u32		reserved_tailroom;
 	};
 
-	kmemcheck_bitfield_begin(flags3);
-	__u8			csum_level:2;
-	__u8			csum_bad:1;
-	/* 13 bit hole */
-	kmemcheck_bitfield_end(flags3);
-
 	__be16			inner_protocol;
 	__u16			inner_transport_header;
 	__u16			inner_network_header;
 	__u16			inner_mac_header;
+
+	__be16			protocol;
 	__u16			transport_header;
 	__u16			network_header;
 	__u16			mac_header;
+
+	__u32			headers_end[0];
+
 	/* These elements must be at the end, see alloc_skb() for details.  */
 	sk_buff_data_t		tail;
 	sk_buff_data_t		end;
@@ -3025,19 +3032,22 @@ static inline void nf_reset_trace(struct sk_buff *skb)
 }
 
 /* Note: This doesn't put any conntrack and bridge info in dst. */
-static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src)
+static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
+			     bool copy)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	dst->nfct = src->nfct;
 	nf_conntrack_get(src->nfct);
-	dst->nfctinfo = src->nfctinfo;
+	if (copy)
+		dst->nfctinfo = src->nfctinfo;
 #endif
 #ifdef CONFIG_BRIDGE_NETFILTER
 	dst->nf_bridge  = src->nf_bridge;
 	nf_bridge_get(src->nf_bridge);
 #endif
 #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE) || defined(CONFIG_NF_TABLES)
-	dst->nf_trace = src->nf_trace;
+	if (copy)
+		dst->nf_trace = src->nf_trace;
 #endif
 }
 
@@ -3049,7 +3059,7 @@ static inline void nf_copy(struct sk_buff *dst, const struct sk_buff *src)
 #ifdef CONFIG_BRIDGE_NETFILTER
 	nf_bridge_put(dst->nf_bridge);
 #endif
-	__nf_copy(dst, src);
+	__nf_copy(dst, src, true);
 }
 
 #ifdef CONFIG_NETWORK_SECMARK
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 06a8feb10099..57821927700e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -261,7 +261,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
 		kmemcheck_annotate_bitfield(child, flags1);
-		kmemcheck_annotate_bitfield(child, flags2);
 		skb->fclone = SKB_FCLONE_ORIG;
 		atomic_set(fclone_ref, 1);
 
@@ -674,57 +673,61 @@ void consume_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(consume_skb);
 
+/* Make sure a field is enclosed inside headers_start/headers_end section */
+#define CHECK_SKB_FIELD(field) \
+	BUILD_BUG_ON(offsetof(struct sk_buff, field) <		\
+		     offsetof(struct sk_buff, headers_start));	\
+	BUILD_BUG_ON(offsetof(struct sk_buff, field) >=		\
+		     offsetof(struct sk_buff, headers_end));	\
+
 static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	new->tstamp		= old->tstamp;
+	/* We do not copy old->sk */
 	new->dev		= old->dev;
-	new->transport_header	= old->transport_header;
-	new->network_header	= old->network_header;
-	new->mac_header		= old->mac_header;
-	new->inner_protocol	= old->inner_protocol;
-	new->inner_transport_header = old->inner_transport_header;
-	new->inner_network_header = old->inner_network_header;
-	new->inner_mac_header = old->inner_mac_header;
+	memcpy(new->cb, old->cb, sizeof(old->cb));
 	skb_dst_copy(new, old);
-	skb_copy_hash(new, old);
-	new->ooo_okay		= old->ooo_okay;
-	new->no_fcs		= old->no_fcs;
-	new->encapsulation	= old->encapsulation;
-	new->encap_hdr_csum	= old->encap_hdr_csum;
-	new->csum_valid		= old->csum_valid;
-	new->csum_complete_sw	= old->csum_complete_sw;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
-	memcpy(new->cb, old->cb, sizeof(old->cb));
-	new->csum		= old->csum;
-	new->ignore_df		= old->ignore_df;
-	new->pkt_type		= old->pkt_type;
-	new->ip_summed		= old->ip_summed;
-	skb_copy_queue_mapping(new, old);
-	new->priority		= old->priority;
-#if IS_ENABLED(CONFIG_IP_VS)
-	new->ipvs_property	= old->ipvs_property;
+	__nf_copy(new, old, false);
+
+	/* Note : this field could be in headers_start/headers_end section
+	 * It is not yet because we do not want to have a 16 bit hole
+	 */
+	new->queue_mapping = old->queue_mapping;
+
+	memcpy(&new->headers_start, &old->headers_start,
+	       offsetof(struct sk_buff, headers_end) -
+	       offsetof(struct sk_buff, headers_start));
+	CHECK_SKB_FIELD(protocol);
+	CHECK_SKB_FIELD(csum);
+	CHECK_SKB_FIELD(hash);
+	CHECK_SKB_FIELD(priority);
+	CHECK_SKB_FIELD(skb_iif);
+	CHECK_SKB_FIELD(vlan_proto);
+	CHECK_SKB_FIELD(vlan_tci);
+	CHECK_SKB_FIELD(transport_header);
+	CHECK_SKB_FIELD(network_header);
+	CHECK_SKB_FIELD(mac_header);
+	CHECK_SKB_FIELD(inner_protocol);
+	CHECK_SKB_FIELD(inner_transport_header);
+	CHECK_SKB_FIELD(inner_network_header);
+	CHECK_SKB_FIELD(inner_mac_header);
+	CHECK_SKB_FIELD(mark);
+#ifdef CONFIG_NETWORK_SECMARK
+	CHECK_SKB_FIELD(secmark);
+#endif
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	CHECK_SKB_FIELD(napi_id);
 #endif
-	new->pfmemalloc		= old->pfmemalloc;
-	new->protocol		= old->protocol;
-	new->mark		= old->mark;
-	new->skb_iif		= old->skb_iif;
-	__nf_copy(new, old);
 #ifdef CONFIG_NET_SCHED
-	new->tc_index		= old->tc_index;
+	CHECK_SKB_FIELD(tc_index);
 #ifdef CONFIG_NET_CLS_ACT
-	new->tc_verd		= old->tc_verd;
+	CHECK_SKB_FIELD(tc_verd);
 #endif
 #endif
-	new->vlan_proto		= old->vlan_proto;
-	new->vlan_tci		= old->vlan_tci;
-
-	skb_copy_secmark(new, old);
 
-#ifdef CONFIG_NET_RX_BUSY_POLL
-	new->napi_id	= old->napi_id;
-#endif
 }
 
 /*
@@ -875,7 +878,6 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 			return NULL;
 
 		kmemcheck_annotate_bitfield(n, flags1);
-		kmemcheck_annotate_bitfield(n, flags2);
 		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 


* Re: [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header()
  2014-09-25 21:20         ` [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header() Eric Dumazet
@ 2014-09-26  8:35           ` Jesper Dangaard Brouer
  2014-09-28 21:37           ` David Miller
  2014-09-29  5:18           ` [PATCH v2 " Eric Dumazet
  2 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-26  8:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Amir Vadai, John Fastabend, brouer

On Thu, 25 Sep 2014 14:20:04 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> With the proliferation of bit fields in sk_buff, __copy_skb_header()
> became quite expensive, showing up as the most expensive function in a
> GSO workload.
>
> __copy_skb_header() performance is also critical for non-GSO TCP
> operations, as it is used from skb_clone().
>
> This patch carefully moves all the fields that were not copied into a
> separate zone: cloned, nohdr, fclone, peeked, head_frag, xmit_more.
>
> Then I moved all the other copied fields into a section delimited by
> headers_start[0]/headers_end[0] so that we can use a single memcpy()
> call, inlined by the compiler using long word load/stores.
>
> I also tried to make all copies follow the natural order of sk_buff,
> to help hardware prefetching.
>
> I made sure the sk_buff size did not change.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

I'm impressed, network ninja!

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

It would have been nice if you had included the performance
improvement you saw, e.g. from:
 http://thread.gmane.org/gmane.linux.network/332035/focus=332086

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell
  2014-09-25 14:17 [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell Eric Dumazet
  2014-09-25 14:43 ` Jesper Dangaard Brouer
  2014-09-25 15:53 ` [PATCH net-next] mlx4: exploit skb->xmit_more to conditionally send doorbell Alexei Starovoitov
@ 2014-09-28 21:27 ` David Miller
  2 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2014-09-28 21:27 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, amirv

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 25 Sep 2014 07:17:49 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> skb->xmit_more tells us if another skb is coming next.
> 
> We need to send the doorbell when xmit_more is not set, or when the
> txqueue is stopped (preventing the next skb from coming immediately).
>
> Tested with a modified version of pktgen, I got a 40% increase in
> throughput.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.


* Re: [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header()
  2014-09-25 21:20         ` [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header() Eric Dumazet
  2014-09-26  8:35           ` Jesper Dangaard Brouer
@ 2014-09-28 21:37           ` David Miller
  2014-09-29  4:06             ` David Miller
  2014-09-29  5:18           ` [PATCH v2 " Eric Dumazet
  2 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2014-09-28 21:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, amirv, john.r.fastabend, brouer

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 25 Sep 2014 14:20:04 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> With the proliferation of bit fields in sk_buff, __copy_skb_header()
> became quite expensive, showing up as the most expensive function in a
> GSO workload.
>
> __copy_skb_header() performance is also critical for non-GSO TCP
> operations, as it is used from skb_clone().
>
> This patch carefully moves all the fields that were not copied into a
> separate zone: cloned, nohdr, fclone, peeked, head_frag, xmit_more.
>
> Then I moved all the other copied fields into a section delimited by
> headers_start[0]/headers_end[0] so that we can use a single memcpy()
> call, inlined by the compiler using long word load/stores.
>
> I also tried to make all copies follow the natural order of sk_buff,
> to help hardware prefetching.
>
> I made sure the sk_buff size did not change.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks!


* Re: [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header()
  2014-09-28 21:37           ` David Miller
@ 2014-09-29  4:06             ` David Miller
  2014-09-29  4:57               ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2014-09-29  4:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, amirv, john.r.fastabend, brouer

From: David Miller <davem@davemloft.net>
Date: Sun, 28 Sep 2014 17:37:59 -0400 (EDT)

> Applied, thanks!

Eric, I had to revert.  Please enable IPVS in your build tests.

net/ipv4/ip_output.c: In function ‘ip_copy_metadata’:
net/ipv4/ip_output.c:470:4: error: ‘struct sk_buff’ has no member named ‘ipvs_property’
  to->ipvs_property = from->ipvs_property;
    ^
net/ipv4/ip_output.c:470:26: error: ‘struct sk_buff’ has no member named ‘ipvs_property’
  to->ipvs_property = from->ipvs_property;
                          ^


* Re: [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header()
  2014-09-29  4:06             ` David Miller
@ 2014-09-29  4:57               ` Eric Dumazet
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2014-09-29  4:57 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, amirv, john.r.fastabend, brouer

On Mon, 2014-09-29 at 00:06 -0400, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 28 Sep 2014 17:37:59 -0400 (EDT)
> 
> > Applied, thanks!
> 
> Eric, I had to revert.  Please enable IPVS in your build tests.
> 
> net/ipv4/ip_output.c: In function ‘ip_copy_metadata’:
> net/ipv4/ip_output.c:470:4: error: ‘struct sk_buff’ has no member named ‘ipvs_property’
>   to->ipvs_property = from->ipvs_property;
>     ^
> net/ipv4/ip_output.c:470:26: error: ‘struct sk_buff’ has no member named ‘ipvs_property’
>   to->ipvs_property = from->ipvs_property;

Ah sorry, will send a v2.

Thanks


* [PATCH v2 net-next] net: reorganize sk_buff for faster __copy_skb_header()
  2014-09-25 21:20         ` [PATCH net-next] net: reorganize sk_buff for faster __copy_skb_header() Eric Dumazet
  2014-09-26  8:35           ` Jesper Dangaard Brouer
  2014-09-28 21:37           ` David Miller
@ 2014-09-29  5:18           ` Eric Dumazet
  2014-09-29 16:27             ` David Miller
  2 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2014-09-29  5:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Amir Vadai, John Fastabend, Jesper Dangaard Brouer

From: Eric Dumazet <edumazet@google.com>

With the proliferation of bit fields in sk_buff, __copy_skb_header()
became quite expensive, showing up as the most expensive function in a
GSO workload.

__copy_skb_header() performance is also critical for non-GSO TCP
operations, as it is used from skb_clone().

This patch carefully moves all the fields that were not copied into a
separate zone: cloned, nohdr, fclone, peeked, head_frag, xmit_more.

Then I moved all the other copied fields into a section delimited by
headers_start[0]/headers_end[0] so that we can use a single memcpy()
call, inlined by the compiler using long word load/stores.

I also tried to make all copies follow the natural order of sk_buff,
to help hardware prefetching.

I made sure the sk_buff size did not change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
v2: added back ipvs_property, oops ;)

 include/linux/skbuff.h |  133 +++++++++++++++++++++------------------
 net/core/skbuff.c      |   80 ++++++++++++-----------
 2 files changed, 113 insertions(+), 100 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8eaa62400fca..b6cced304b26 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -527,27 +527,41 @@ struct sk_buff {
 	char			cb[48] __aligned(8);
 
 	unsigned long		_skb_refdst;
+	void			(*destructor)(struct sk_buff *skb);
 #ifdef CONFIG_XFRM
 	struct	sec_path	*sp;
 #endif
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+	struct nf_conntrack	*nfct;
+#endif
+#ifdef CONFIG_BRIDGE_NETFILTER
+	struct nf_bridge_info	*nf_bridge;
+#endif
 	unsigned int		len,
 				data_len;
 	__u16			mac_len,
 				hdr_len;
-	union {
-		__wsum		csum;
-		struct {
-			__u16	csum_start;
-			__u16	csum_offset;
-		};
-	};
-	__u32			priority;
+
+	/* Following fields are _not_ copied in __copy_skb_header()
+	 * Note that queue_mapping is here mostly to fill a hole.
+	 */
 	kmemcheck_bitfield_begin(flags1);
-	__u8			ignore_df:1,
-				cloned:1,
-				ip_summed:2,
+	__u16			queue_mapping;
+	__u8			cloned:1,
 				nohdr:1,
-				nfctinfo:3;
+				fclone:2,
+				peeked:1,
+				head_frag:1,
+				xmit_more:1;
+	/* one bit hole */
+	kmemcheck_bitfield_end(flags1);
+
+
+
+	/* fields enclosed in headers_start/headers_end are copied
+	 * using a single memcpy() in __copy_skb_header()
+	 */
+	__u32			headers_start[0];
 
 /* if you move pkt_type around you also must adapt those constants */
 #ifdef __BIG_ENDIAN_BITFIELD
@@ -558,58 +572,53 @@ struct sk_buff {
 #define PKT_TYPE_OFFSET()	offsetof(struct sk_buff, __pkt_type_offset)
 
 	__u8			__pkt_type_offset[0];
-	__u8			pkt_type:3,
-				fclone:2,
-				ipvs_property:1,
-				peeked:1,
-				nf_trace:1;
-	kmemcheck_bitfield_end(flags1);
-	__be16			protocol;
-
-	void			(*destructor)(struct sk_buff *skb);
-#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-	struct nf_conntrack	*nfct;
-#endif
-#ifdef CONFIG_BRIDGE_NETFILTER
-	struct nf_bridge_info	*nf_bridge;
-#endif
-
-	int			skb_iif;
-
-	__u32			hash;
-
-	__be16			vlan_proto;
-	__u16			vlan_tci;
-
-#ifdef CONFIG_NET_SCHED
-	__u16			tc_index;	/* traffic control index */
-#ifdef CONFIG_NET_CLS_ACT
-	__u16			tc_verd;	/* traffic control verdict */
-#endif
-#endif
-
-	__u16			queue_mapping;
-	kmemcheck_bitfield_begin(flags2);
-	__u8			xmit_more:1;
-#ifdef CONFIG_IPV6_NDISC_NODETYPE
-	__u8			ndisc_nodetype:2;
-#endif
+	__u8			pkt_type:3;
 	__u8			pfmemalloc:1;
+	__u8			ignore_df:1;
+	__u8			nfctinfo:3;
+
+	__u8			nf_trace:1;
+	__u8			ip_summed:2;
 	__u8			ooo_okay:1;
 	__u8			l4_hash:1;
 	__u8			sw_hash:1;
 	__u8			wifi_acked_valid:1;
 	__u8			wifi_acked:1;
+
 	__u8			no_fcs:1;
-	__u8			head_frag:1;
 	/* Indicates the inner headers are valid in the skbuff. */
 	__u8			encapsulation:1;
 	__u8			encap_hdr_csum:1;
 	__u8			csum_valid:1;
 	__u8			csum_complete_sw:1;
-	/* 1/3 bit hole (depending on ndisc_nodetype presence) */
-	kmemcheck_bitfield_end(flags2);
+	__u8			csum_level:2;
+	__u8			csum_bad:1;
 
+#ifdef CONFIG_IPV6_NDISC_NODETYPE
+	__u8			ndisc_nodetype:2;
+#endif
+	__u8			ipvs_property:1;
+	/* 5 or 7 bit hole */
+
+#ifdef CONFIG_NET_SCHED
+	__u16			tc_index;	/* traffic control index */
+#ifdef CONFIG_NET_CLS_ACT
+	__u16			tc_verd;	/* traffic control verdict */
+#endif
+#endif
+
+	union {
+		__wsum		csum;
+		struct {
+			__u16	csum_start;
+			__u16	csum_offset;
+		};
+	};
+	__u32			priority;
+	int			skb_iif;
+	__u32			hash;
+	__be16			vlan_proto;
+	__u16			vlan_tci;
 #if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
 	union {
 		unsigned int	napi_id;
@@ -625,19 +634,18 @@ struct sk_buff {
 		__u32		reserved_tailroom;
 	};
 
-	kmemcheck_bitfield_begin(flags3);
-	__u8			csum_level:2;
-	__u8			csum_bad:1;
-	/* 13 bit hole */
-	kmemcheck_bitfield_end(flags3);
-
 	__be16			inner_protocol;
 	__u16			inner_transport_header;
 	__u16			inner_network_header;
 	__u16			inner_mac_header;
+
+	__be16			protocol;
 	__u16			transport_header;
 	__u16			network_header;
 	__u16			mac_header;
+
+	__u32			headers_end[0];
+
 	/* These elements must be at the end, see alloc_skb() for details.  */
 	sk_buff_data_t		tail;
 	sk_buff_data_t		end;
@@ -3040,19 +3048,22 @@ static inline void nf_reset_trace(struct sk_buff *skb)
 }
 
 /* Note: This doesn't put any conntrack and bridge info in dst. */
-static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src)
+static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
+			     bool copy)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	dst->nfct = src->nfct;
 	nf_conntrack_get(src->nfct);
-	dst->nfctinfo = src->nfctinfo;
+	if (copy)
+		dst->nfctinfo = src->nfctinfo;
 #endif
 #ifdef CONFIG_BRIDGE_NETFILTER
 	dst->nf_bridge  = src->nf_bridge;
 	nf_bridge_get(src->nf_bridge);
 #endif
 #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE) || defined(CONFIG_NF_TABLES)
-	dst->nf_trace = src->nf_trace;
+	if (copy)
+		dst->nf_trace = src->nf_trace;
 #endif
 }
 
@@ -3064,7 +3075,7 @@ static inline void nf_copy(struct sk_buff *dst, const struct sk_buff *src)
 #ifdef CONFIG_BRIDGE_NETFILTER
 	nf_bridge_put(dst->nf_bridge);
 #endif
-	__nf_copy(dst, src);
+	__nf_copy(dst, src, true);
 }
 
 #ifdef CONFIG_NETWORK_SECMARK
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d4fdc649112c..4be570a4ab21 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -261,7 +261,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		atomic_t *fclone_ref = (atomic_t *) (child + 1);
 
 		kmemcheck_annotate_bitfield(child, flags1);
-		kmemcheck_annotate_bitfield(child, flags2);
 		skb->fclone = SKB_FCLONE_ORIG;
 		atomic_set(fclone_ref, 1);
 
@@ -675,57 +674,61 @@ void consume_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(consume_skb);
 
+/* Make sure a field is enclosed inside headers_start/headers_end section */
+#define CHECK_SKB_FIELD(field) \
+	BUILD_BUG_ON(offsetof(struct sk_buff, field) <		\
+		     offsetof(struct sk_buff, headers_start));	\
+	BUILD_BUG_ON(offsetof(struct sk_buff, field) >		\
+		     offsetof(struct sk_buff, headers_end));	\
+
 static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	new->tstamp		= old->tstamp;
+	/* We do not copy old->sk */
 	new->dev		= old->dev;
-	new->transport_header	= old->transport_header;
-	new->network_header	= old->network_header;
-	new->mac_header		= old->mac_header;
-	new->inner_protocol	= old->inner_protocol;
-	new->inner_transport_header = old->inner_transport_header;
-	new->inner_network_header = old->inner_network_header;
-	new->inner_mac_header = old->inner_mac_header;
+	memcpy(new->cb, old->cb, sizeof(old->cb));
 	skb_dst_copy(new, old);
-	skb_copy_hash(new, old);
-	new->ooo_okay		= old->ooo_okay;
-	new->no_fcs		= old->no_fcs;
-	new->encapsulation	= old->encapsulation;
-	new->encap_hdr_csum	= old->encap_hdr_csum;
-	new->csum_valid		= old->csum_valid;
-	new->csum_complete_sw	= old->csum_complete_sw;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
-	memcpy(new->cb, old->cb, sizeof(old->cb));
-	new->csum		= old->csum;
-	new->ignore_df		= old->ignore_df;
-	new->pkt_type		= old->pkt_type;
-	new->ip_summed		= old->ip_summed;
-	skb_copy_queue_mapping(new, old);
-	new->priority		= old->priority;
-#if IS_ENABLED(CONFIG_IP_VS)
-	new->ipvs_property	= old->ipvs_property;
+	__nf_copy(new, old, false);
+
+	/* Note : this field could be in headers_start/headers_end section
+	 * It is not yet because we do not want to have a 16 bit hole
+	 */
+	new->queue_mapping = old->queue_mapping;
+
+	memcpy(&new->headers_start, &old->headers_start,
+	       offsetof(struct sk_buff, headers_end) -
+	       offsetof(struct sk_buff, headers_start));
+	CHECK_SKB_FIELD(protocol);
+	CHECK_SKB_FIELD(csum);
+	CHECK_SKB_FIELD(hash);
+	CHECK_SKB_FIELD(priority);
+	CHECK_SKB_FIELD(skb_iif);
+	CHECK_SKB_FIELD(vlan_proto);
+	CHECK_SKB_FIELD(vlan_tci);
+	CHECK_SKB_FIELD(transport_header);
+	CHECK_SKB_FIELD(network_header);
+	CHECK_SKB_FIELD(mac_header);
+	CHECK_SKB_FIELD(inner_protocol);
+	CHECK_SKB_FIELD(inner_transport_header);
+	CHECK_SKB_FIELD(inner_network_header);
+	CHECK_SKB_FIELD(inner_mac_header);
+	CHECK_SKB_FIELD(mark);
+#ifdef CONFIG_NETWORK_SECMARK
+	CHECK_SKB_FIELD(secmark);
+#endif
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	CHECK_SKB_FIELD(napi_id);
 #endif
-	new->pfmemalloc		= old->pfmemalloc;
-	new->protocol		= old->protocol;
-	new->mark		= old->mark;
-	new->skb_iif		= old->skb_iif;
-	__nf_copy(new, old);
 #ifdef CONFIG_NET_SCHED
-	new->tc_index		= old->tc_index;
+	CHECK_SKB_FIELD(tc_index);
 #ifdef CONFIG_NET_CLS_ACT
-	new->tc_verd		= old->tc_verd;
+	CHECK_SKB_FIELD(tc_verd);
 #endif
 #endif
-	new->vlan_proto		= old->vlan_proto;
-	new->vlan_tci		= old->vlan_tci;
-
-	skb_copy_secmark(new, old);
 
-#ifdef CONFIG_NET_RX_BUSY_POLL
-	new->napi_id	= old->napi_id;
-#endif
 }
 
 /*
@@ -876,7 +879,6 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 			return NULL;
 
 		kmemcheck_annotate_bitfield(n, flags1);
-		kmemcheck_annotate_bitfield(n, flags2);
 		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 


* Re: [PATCH v2 net-next] net: reorganize sk_buff for faster __copy_skb_header()
  2014-09-29  5:18           ` [PATCH v2 " Eric Dumazet
@ 2014-09-29 16:27             ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2014-09-29 16:27 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, amirv, john.r.fastabend, brouer

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 28 Sep 2014 22:18:47 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> With the proliferation of bit fields in sk_buff, __copy_skb_header()
> became quite expensive, showing up as the most expensive function in a
> GSO workload.
>
> __copy_skb_header() performance is also critical for non-GSO TCP
> operations, as it is used from skb_clone().
>
> This patch carefully moves all the fields that were not copied into a
> separate zone: cloned, nohdr, fclone, peeked, head_frag, xmit_more.
>
> Then I moved all the other copied fields into a section delimited by
> headers_start[0]/headers_end[0] so that we can use a single memcpy()
> call, inlined by the compiler using long word load/stores.
>
> I also tried to make all copies follow the natural order of sk_buff,
> to help hardware prefetching.
>
> I made sure the sk_buff size did not change.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> v2: added back ipvs_property, oops ;)

Applied :-)

