netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC SKBUFF]: Keep track of writable header len of headerless clones
@ 2007-06-24 12:53 Patrick McHardy
  2007-06-25  2:37 ` David Miller
  2007-06-25  2:58 ` David Miller
  0 siblings, 2 replies; 6+ messages in thread
From: Patrick McHardy @ 2007-06-24 12:53 UTC (permalink / raw)
  To: Linux Netdev List; +Cc: Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

Currently NAT (and others) that want to modify cloned skbs copy them,
even if in the vast majority of cases its not necessary because the
skb is a clone made by TCP and the portion NAT wants to modify is
actually writable because TCP release the header reference before
cloning.

The problem is that there is no clean way for NAT to find out how
long the writable header area is, so this patch introduces skb->hdr_len
to hold this length. When a headerless skb is cloned skb->hdr_len
is set to the current headroom, for regular clones it is copied from
the original. A new function skb_clone_writable(skb, len) returns
whether the skb is writable up to len bytes from skb->data. To avoid
enlarging the skb the mac_len field is reduced to 16 bit and the
new hdr_len field is put in the remaining 16 bit.

I've done a few rough benchmarks of NAT (not with this exact patch,
but a very similar one). As expected it saves huge amounts of system
time in case of sendfile, bringing it down to basically the same
amount as without NAT, with sendmsg it only helps on loopback,
probably because of the large MTU.

Transmit a 1GB file using sendfile/sendmsg over eth0/lo with and
without NAT:

- sendfile eth0, no NAT:	sys     0m0.388s
- sendfile eth0, NAT:		sys     0m1.835s
- sendfile eth0: NAT + path:	sys     0m0.370s	(~ -80%)

- sendfile lo, no NAT:		sys     0m0.258s
- sendfile lo, NAT:		sys     0m2.609s
- sendfile lo, NAT + patch:	sys     0m0.260s	(~ -90%)

- sendmsg eth0, no NAT:		sys     0m2.508s
- sendmsg eth0, NAT:		sys     0m2.539s
- sendmsg eth0, NAT + patch:	sys     0m2.445s	(no change)

- sendmsg lo, no NAT:		sys	0m2.151s
- sendmsg lo, NAT:		sys     0m3.557s
- sendmsg lo, NAT + patch:	sys     0m2.159s	(~ -40%)


I expect other users can see a similar performance improvement,
packet mangling iptables targets, ipip and ip_gre come to mind ..

Comments welcome.

[-- Attachment #2: x --]
[-- Type: text/plain, Size: 3069 bytes --]

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6f0b2f7..881fe80 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -147,8 +147,8 @@ struct skb_shared_info {
 
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
- * the entire skb->data.  It is up to the users of the skb to agree on
- * where the payload starts.
+ * the entire skb->data.  A clone of a headerless skb holds the length of
+ * the header in skb->hdr_len.
  *
  * All users must obey the rule that the skb->data reference count must be
  * greater than or equal to the payload reference count.
@@ -206,6 +206,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@len: Length of actual data
  *	@data_len: Data length
  *	@mac_len: Length of link layer header
+ *	@hdr_len: writable header length of cloned skb
  *	@csum: Checksum (must include start/offset pair)
  *	@csum_start: Offset from skb->head where checksumming should start
  *	@csum_offset: Offset from csum_start where checksum should be stored
@@ -260,8 +261,9 @@ struct sk_buff {
 	char			cb[48];
 
 	unsigned int		len,
-				data_len,
-				mac_len;
+				data_len;
+	__u16			mac_len,
+				hdr_len;
 	union {
 		__wsum		csum;
 		struct {
@@ -1322,6 +1324,20 @@ static inline struct sk_buff *netdev_alloc_skb(struct net_device *dev,
 }
 
 /**
+ *	skb_clone_writable - is the header of a clone writable
+ *	@skb: buffer to check
+ *	@len: length up to which to write
+ *
+ *	Returns true if modifying the header part of the cloned buffer
+ *	does not requires the data to be copied.
+ */
+static inline int skb_clone_writable(struct sk_buff *skb, int len)
+{
+	return !skb_header_cloned(skb) &&
+	       skb_headroom(skb) + len <= skb->hdr_len;
+}
+
+/**
  *	skb_cow - copy header of skb when it is required
  *	@skb: buffer to cow
  *	@headroom: needed headroom
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7c6a34e..8d8e8fc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -415,6 +415,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	C(csum);
 	C(local_df);
 	n->cloned = 1;
+	n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
@@ -676,6 +677,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	skb->network_header   += off;
 	skb->mac_header	      += off;
 	skb->cloned   = 0;
+	skb->hdr_len  = 0;
 	skb->nohdr    = 0;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 	return 0;
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index a84478e..3aaabec 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -203,7 +203,9 @@ int skb_make_writable(struct sk_buff **pskb, unsigned int writable_len)
 		return 0;
 
 	/* Not exclusive use of packet?  Must copy. */
-	if (skb_shared(*pskb) || skb_cloned(*pskb))
+	if (skb_cloned(*pskb) && !skb_clone_writable(*pskb, writable_len))
+		goto copy_skb;
+	if (skb_shared(*pskb))
 		goto copy_skb;
 
 	return pskb_may_pull(*pskb, writable_len);

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC SKBUFF]: Keep track of writable header len of headerless clones
  2007-06-24 12:53 [RFC SKBUFF]: Keep track of writable header len of headerless clones Patrick McHardy
@ 2007-06-25  2:37 ` David Miller
  2007-06-25  8:53   ` Patrick McHardy
  2007-06-25  2:58 ` David Miller
  1 sibling, 1 reply; 6+ messages in thread
From: David Miller @ 2007-06-25  2:37 UTC (permalink / raw)
  To: kaber; +Cc: netdev, netfilter-devel

From: Patrick McHardy <kaber@trash.net>
Date: Sun, 24 Jun 2007 14:53:36 +0200

> - sendmsg eth0, no NAT:		sys     0m2.508s
> - sendmsg eth0, NAT:		sys     0m2.539s
> - sendmsg eth0, NAT + patch:	sys     0m2.445s	(no change)

This is probably because we're touching all the data anyways
and if the resident set size is small enough (as a gigabit
or 100MB ethernet TCP stream window would be) then it all fits
mostly in the L2 caches of the cpu.

Oh yeah and I like your patch Patrick :-)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC SKBUFF]: Keep track of writable header len of headerless clones
  2007-06-24 12:53 [RFC SKBUFF]: Keep track of writable header len of headerless clones Patrick McHardy
  2007-06-25  2:37 ` David Miller
@ 2007-06-25  2:58 ` David Miller
  2007-06-25  8:51   ` Patrick McHardy
  1 sibling, 1 reply; 6+ messages in thread
From: David Miller @ 2007-06-25  2:58 UTC (permalink / raw)
  To: kaber; +Cc: netdev, netfilter-devel

From: Patrick McHardy <kaber@trash.net>
Date: Sun, 24 Jun 2007 14:53:36 +0200

> I expect other users can see a similar performance improvement,
> packet mangling iptables targets, ipip and ip_gre come to mind ..
> 
> Comments welcome.

Patrick please give me a suitable signed-off-by line, I'd
like to apply this to net-2.6.23

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC SKBUFF]: Keep track of writable header len of headerless clones
  2007-06-25  2:58 ` David Miller
@ 2007-06-25  8:51   ` Patrick McHardy
  2007-06-25 11:42     ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick McHardy @ 2007-06-25  8:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, netfilter-devel

David Miller wrote:
> Patrick please give me a suitable signed-off-by line, I'd
> like to apply this to net-2.6.23
>   

Signed-off-by: Patrick McHardy <kaber@trash.net>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC SKBUFF]: Keep track of writable header len of headerless clones
  2007-06-25  2:37 ` David Miller
@ 2007-06-25  8:53   ` Patrick McHardy
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick McHardy @ 2007-06-25  8:53 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, netfilter-devel

David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Sun, 24 Jun 2007 14:53:36 +0200
>
>   
>> - sendmsg eth0, no NAT:		sys     0m2.508s
>> - sendmsg eth0, NAT:		sys     0m2.539s
>> - sendmsg eth0, NAT + patch:	sys     0m2.445s	(no change)
>>     
>
> This is probably because we're touching all the data anyways
> and if the resident set size is small enough (as a gigabit
> or 100MB ethernet TCP stream window would be) then it all fits
> mostly in the L2 caches of the cpu.

Yeah, thats what I suspected too. On loopback it seems to exceed
the cache size, which is a bit strange since its 1MB. Or maybe
there is some loopback optimization interfering.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC SKBUFF]: Keep track of writable header len of headerless clones
  2007-06-25  8:51   ` Patrick McHardy
@ 2007-06-25 11:42     ` David Miller
  0 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2007-06-25 11:42 UTC (permalink / raw)
  To: kaber; +Cc: netdev, netfilter-devel

From: Patrick McHardy <kaber@trash.net>
Date: Mon, 25 Jun 2007 10:51:59 +0200

> David Miller wrote:
> > Patrick please give me a suitable signed-off-by line, I'd
> > like to apply this to net-2.6.23
> >   
> 
> Signed-off-by: Patrick McHardy <kaber@trash.net>

Applied and pushed out, thanks Patrick!

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-06-25 11:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-24 12:53 [RFC SKBUFF]: Keep track of writable header len of headerless clones Patrick McHardy
2007-06-25  2:37 ` David Miller
2007-06-25  8:53   ` Patrick McHardy
2007-06-25  2:58 ` David Miller
2007-06-25  8:51   ` Patrick McHardy
2007-06-25 11:42     ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).