* TX pre-headers...
@ 2009-02-06 9:41 David Miller
2009-02-06 10:46 ` Steve.Glendinning
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: David Miller @ 2009-02-06 9:41 UTC (permalink / raw)
To: netdev
Some NIC hardware wants a pre-header pushed in front of the packet
data on transmit.
When routing or bridging this will cause a reallocation of skb->data
on every packet forwarded because there will only be NET_IP_ALIGN
space reserved at the head by the device receive path.
NIU is one such NIC and I only noticed this because of some things I
saw in some of Robert Olsson's routing stress test oprofile dumps.
Putting a hack into NIU is the wrong way to do this and would only fix
cases where NIU is the receiver and transmitting device. e1000 to
NIU would still be broken, for example.
I think the way to solve this is to have each device indicate how
much TX slack space it needs for its pre-headers.  On device
registration we have some global "netdev_max_tx_hdr_space" that
records the maximum value seen.
We could decrease it on unregister (by walking the device list)
but I don't think that is worth it.
We also round netdev_max_tx_hdr_space up to be a multiple of 4
or something reasonable like that.
Then we get drivers to use a new interface:
struct sk_buff *netdev_alloc_rx_skb(struct net_device *dev, int size);
which is nearly identical to netdev_alloc_skb() except that it does:
	size += NET_IP_ALIGN + netdev_max_tx_hdr_space;
	skb = netdev_alloc_skb(dev, size);
	if (skb)
		skb_reserve(skb, NET_IP_ALIGN + netdev_max_tx_hdr_space);
	return skb;
Seems reasonable?
* Re: TX pre-headers...
@ 2009-02-06 10:46 Steve.Glendinning
From: Steve.Glendinning @ 2009-02-06 10:46 UTC (permalink / raw)
To: David Miller; +Cc: netdev

David Miller <davem@davemloft.net> wrote on 06/02/2009 09:41:07:

> Some NIC hardware wants a pre-header pushed in front of the packet
> data on transmit.
>
> When routing or bridging this will cause a reallocation of skb->data
> on every packet forwarded because there will only be NET_IP_ALIGN
> space reserved at the head by the device receive path.
>
> NIU is one such NIC and I only noticed this because of some things I
> saw in some of Robert Olsson's routing stress test oprofile dumps.

Many of the usbnet drivers do this as well - for example smsc95xx needs
either 8 or 12 bytes for its pre-header (depending whether tx checksum
offload is enabled) and does this in its tx_fixup:

	if (skb_headroom(skb) < overhead) {
		struct sk_buff *skb2 = skb_copy_expand(skb,
			overhead, 0, flags);
		dev_kfree_skb_any(skb);
		skb = skb2;
		if (!skb)
			return NULL;
	}

It would be really good to get rid of these copies.

Interestingly, this device can change the amount of pre-header space it
needs if tx checksums are enabled/disabled at runtime.  I guess the
device would just indicate its worst case requirement at registration.

Does this just apply to the routing/bridging case, or can packets
originating from the host also allocate this extra pre-header?

Steve
* Re: TX pre-headers...
@ 2009-02-06 11:11 David Miller
From: David Miller @ 2009-02-06 11:11 UTC (permalink / raw)
To: Steve.Glendinning; +Cc: netdev

From: Steve.Glendinning@smsc.com
Date: Fri, 6 Feb 2009 10:46:24 +0000

> Many of the usbnet drivers do this as well - for example smsc95xx needs
> either 8 or 12 bytes for its pre-header (depending whether tx checksum
> offload is enabled) and does this in its tx_fixup:
>
> 	if (skb_headroom(skb) < overhead) {
> 		struct sk_buff *skb2 = skb_copy_expand(skb,
> 			overhead, 0, flags);
> 		dev_kfree_skb_any(skb);
> 		skb = skb2;
> 		if (!skb)
> 			return NULL;
> 	}
>
> It would be really good to get rid of these copies.

This is exactly the kind of code that the NIU transmit handler has.

> Interestingly, this device can change the amount of pre-header space it
> needs if tx checksums are enabled/disabled at runtime.  I guess the
> device would just indicate its worst case requirement at registration.

Right.

> Does this just apply to the routing/bridging case, or can packets
> originating from the host also allocate this extra pre-header?

ARP packets, as one specific case, are still going to trigger that
conditional above and need the copy.

TCP, UDP and others usually don't trigger it, but only because they add
MAX_TCP_HEADER et al. to all of their allocations, which essentially
gives 128 bytes of slack space at the front of every TX frame from
transport protocols.
* Re: TX pre-headers...
@ 2009-02-06 18:53 Inaky Perez-Gonzalez
From: Inaky Perez-Gonzalez @ 2009-02-06 18:53 UTC (permalink / raw)
To: Steve.Glendinning; +Cc: David Miller, netdev

On Friday 06 February 2009, Steve.Glendinning@smsc.com wrote:
> David Miller <davem@davemloft.net> wrote on 06/02/2009 09:41:07:
> > Some NIC hardware wants a pre-header pushed in front of the packet
> > data on transmit.
> >
> > When routing or bridging this will cause a reallocation of skb->data
> > on every packet forwarded because there will only be NET_IP_ALIGN
> > space reserved at the head by the device receive path.
>
> Many of the usbnet drivers do this as well - for example smsc95xx needs
> either 8 or 12 bytes for its pre-header (depending whether tx checksum
> offload is enabled) and does this in its tx_fixup:

The wimax i2400m has a similar thing, but on top of that it requires
packets to be coalesced, and then yet another header with packet
descriptors has to be added to them:

	(header pld0 pld1 pld2 ... padding pl0 pl1 pl2 ...)

	pld - payload descriptor
	pl  - payload

Each IP packet payload has a prefix header too, inside the payload.

So it would not be able to take advantage of this (it uses a FIFO right
now).  I wanted to explore doing all that coalescing using an sg vector,
but until the USB and SDIO stacks support that without too much
overhead, it is not worth the effort.

-- Inaky
* Re: TX pre-headers...
@ 2009-02-06 12:02 Frank Blaschka
From: Frank Blaschka @ 2009-02-06 12:02 UTC (permalink / raw)
To: David Miller; +Cc: netdev

David Miller schrieb:
> Some NIC hardware wants a pre-header pushed in front of the packet
> data on transmit.
>
> When routing or bridging this will cause a reallocation of skb->data
> on every packet forwarded because there will only be NET_IP_ALIGN
> space reserved at the head by the device receive path.
>
> [...]
>
> Then we get drivers to use a new interface:
>
> 	struct sk_buff *netdev_alloc_rx_skb(struct net_device *dev, int size);
>
> which is nearly identical to netdev_alloc_skb() except that it does:
>
> 	size += NET_IP_ALIGN + netdev_max_tx_hdr_space;
> 	skb = netdev_alloc_skb(dev, size);
> 	if (skb)
> 		skb_reserve(skb, NET_IP_ALIGN + netdev_max_tx_hdr_space);
> 	return skb;
>
> Seems reasonable?

Absolutely yes, this would help the s/390 qeth drivers too.
* Re: TX pre-headers...
@ 2009-02-07 8:10 David Miller
From: David Miller @ 2009-02-07 8:10 UTC (permalink / raw)
To: blaschka; +Cc: netdev

From: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Date: Fri, 06 Feb 2009 13:02:13 +0100

> Absolutely yes, this would help the s/390 qeth drivers too

Well, I did some research and it seems all of the cases we could
actually solve with such a scheme need at most 32 bytes.

We already ensure 16 bytes, via NET_SKB_PAD.

So instead of all of this complex "who has the largest TX header size
and what is it" code, we can simply increase NET_SKB_PAD to 32.

You still need that headroom check there, simply because tunneling and
other device stacking situations can cause the headroom to be depleted
before your device sees the packet.

Any objections? :-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 08670d0..5eba400 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1287,7 +1287,7 @@ static inline int skb_network_offset(const struct sk_buff *skb)
  * The networking layer reserves some headroom in skb data (via
  * dev_alloc_skb). This is used to avoid having to reallocate skb data when
  * the header has to grow. In the default case, if the header has to grow
- * 16 bytes or less we avoid the reallocation.
+ * 32 bytes or less we avoid the reallocation.
  *
  * Unfortunately this headroom changes the DMA alignment of the resulting
  * network packet. As for NET_IP_ALIGN, this unaligned DMA is expensive
@@ -1295,11 +1295,11 @@ static inline int skb_network_offset(const struct sk_buff *skb)
  * perhaps setting it to a cacheline in size (since that will maintain
  * cacheline alignment of the DMA). It must be a power of 2.
  *
- * Various parts of the networking layer expect at least 16 bytes of
+ * Various parts of the networking layer expect at least 32 bytes of
  * headroom, you should not reduce this.
  */
 #ifndef NET_SKB_PAD
-#define NET_SKB_PAD	16
+#define NET_SKB_PAD	32
 #endif
 
 extern int ___pskb_trim(struct sk_buff *skb, unsigned int len);
* Re: TX pre-headers...
@ 2009-02-09 3:34 David Miller
From: David Miller @ 2009-02-09 3:34 UTC (permalink / raw)
To: blaschka; +Cc: netdev

From: David Miller <davem@davemloft.net>
Date: Sat, 07 Feb 2009 00:10:41 -0800 (PST)

> Well, I did some research and it seems all of the cases we could
> actually solve with such a scheme need at most 32 bytes.
>
> [...]
>
> Any objections? :-)

Nobody objected, at least for now, so I committed this change, as
follows:

    net: Increase default NET_SKB_PAD to 32.

    Several devices need to insert some "pre headers" in front of the
    main packet data when they transmit a packet.  Currently we allocate
    only 16 bytes of pad room and this ends up not being enough for some
    types of hardware (NIU, usb-net, s390 qeth, etc.)

    So increase this to 32.

    Note that drivers still need to check in their transmit routine
    whether enough headroom exists, and if not use
    skb_realloc_headroom().  Tunneling, IPSEC, and other encapsulation
    methods can cause the padding area to be used up.

    Signed-off-by: David S. Miller <davem@davemloft.net>

---
 include/linux/skbuff.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 08670d0..5eba400 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1287,7 +1287,7 @@ static inline int skb_network_offset(const struct sk_buff *skb)
  * The networking layer reserves some headroom in skb data (via
  * dev_alloc_skb). This is used to avoid having to reallocate skb data when
  * the header has to grow. In the default case, if the header has to grow
- * 16 bytes or less we avoid the reallocation.
+ * 32 bytes or less we avoid the reallocation.
  *
  * Unfortunately this headroom changes the DMA alignment of the resulting
  * network packet. As for NET_IP_ALIGN, this unaligned DMA is expensive
@@ -1295,11 +1295,11 @@ static inline int skb_network_offset(const struct sk_buff *skb)
  * perhaps setting it to a cacheline in size (since that will maintain
  * cacheline alignment of the DMA). It must be a power of 2.
  *
- * Various parts of the networking layer expect at least 16 bytes of
+ * Various parts of the networking layer expect at least 32 bytes of
  * headroom, you should not reduce this.
  */
 #ifndef NET_SKB_PAD
-#define NET_SKB_PAD	16
+#define NET_SKB_PAD	32
 #endif
 
 extern int ___pskb_trim(struct sk_buff *skb, unsigned int len);
-- 
1.6.1.2.253.ga34a
* Re: TX pre-headers...
@ 2009-02-09 10:07 Herbert Xu
From: Herbert Xu @ 2009-02-09 10:07 UTC (permalink / raw)
To: David Miller; +Cc: netdev

David Miller <davem@davemloft.net> wrote:
>
> I think the way to solve this is to have each device indicate how
> much TX slack space it needs for its pre-headers.  On device
> registration we have some global "netdev_max_tx_hdr_space" that
> records the maximum value seen.

We can already solve it right now with dev->needed_headroom.

Of course it's not optimal because it involves a copy, so we might
still need your solution.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: TX pre-headers...
@ 2009-02-09 10:14 David Miller
From: David Miller @ 2009-02-09 10:14 UTC (permalink / raw)
To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 9 Feb 2009 21:07:13 +1100

> We can already solve it right now with dev->needed_headroom.
>
> Of course it's not optimal because it involves a copy so we
> might still need your solution.

Indeed, I noticed needed_headroom when I set out to start coding
things up. :)

There are all sorts of cases that won't work well, as you implicitly
suggest.  For example, forwarding to a path that encapsulates to a GRE
or other type of tunnel.

For IPSEC we're predominantly COW'ing the data anyways, so...
* Re: TX pre-headers...
@ 2009-02-09 10:19 Herbert Xu
From: Herbert Xu @ 2009-02-09 10:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev

On Mon, Feb 09, 2009 at 02:14:03AM -0800, David Miller wrote:
>
> Indeed, I noticed needed_headroom when I set out to start coding
> things up. :)
>
> There are all sorts of cases that won't work well, as you implicitly
> suggest.  For example, forwarding to a path that encapsulates to a GRE
> or other type of tunnel.
>
> For IPSEC we're predominantly COW'ing the data anyways, so...

Yes, tunneling (especially when it's hidden by netfilter) is the
hard part.

Here's a thought: what if we had your global maximum, and every
time we have to reallocate a packet because of a failed headroom
check, we increase the maximum by the needed amount (we should
add a ceiling in case some buggy path ignores this maximum when
allocating a new packet)?

This way even tunnels could benefit from not having to copy all
the time.

As a failed check should be rare (given that we're continuously
increasing it), the overhead in updating the maximum should be
reasonable.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: TX pre-headers...
@ 2009-02-09 15:39 Krzysztof Halasa
From: Krzysztof Halasa @ 2009-02-09 15:39 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev

Herbert Xu <herbert@gondor.apana.org.au> writes:

> Here's a thought, what if we had your global maximum, and every
> time we have to reallocate a packet because of a failed headroom
> check, we increase the maximum by the needed amount (we should
> add a ceiling in case some buggy path ignores this maximum when
> allocating a new packet).
>
> This way even tunnels could benefit from not having to copy all
> the time.

I agree.

OTOH and FWIW, my stuff (= WAN) alone, without stacking, fits in 32
bytes; the typical max case is 10 bytes for the Frame Relay header +
14 bytes for the Ethernet header + 4 bytes for the 802.1Q tag =
28 bytes.

OTOH I wonder if these changes make dev->hard_header_len no longer
needed.
-- 
Krzysztof Halasa
* Re: TX pre-headers...
@ 2009-02-10 8:15 David Miller
From: David Miller @ 2009-02-10 8:15 UTC (permalink / raw)
To: khc; +Cc: herbert, netdev

From: Krzysztof Halasa <khc@pm.waw.pl>
Date: Mon, 09 Feb 2009 16:39:48 +0100

> OTOH I wonder if these changes make dev->hard_header_len no longer
> needed.

Not really; this value tells the kernel how large the cached link-level
headers are that get copied directly under the network header.
end of thread, other threads:[~2009-02-10 8:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-06  9:41 TX pre-headers... David Miller
2009-02-06 10:46 ` Steve.Glendinning
2009-02-06 11:11   ` David Miller
2009-02-06 18:53   ` Inaky Perez-Gonzalez
2009-02-06 12:02 ` Frank Blaschka
2009-02-07  8:10   ` David Miller
2009-02-09  3:34     ` David Miller
2009-02-09 10:07 ` Herbert Xu
2009-02-09 10:14   ` David Miller
2009-02-09 10:19     ` Herbert Xu
2009-02-09 15:39       ` Krzysztof Halasa
2009-02-10  8:15         ` David Miller