Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net/fec: carrier off initially to avoid root mount failure
From: Greg Ungerer @ 2010-10-11 12:38 UTC (permalink / raw)
  To: Oskar Schirmer; +Cc: David Miller, netdev, dan, bigeasy, hjk, gerg, bhutchings
In-Reply-To: <20101011075420.GA17320@www.tglx.de>

On 11/10/10 17:54, Oskar Schirmer wrote:
> On Sun, Oct 10, 2010 at 21:19:56 -0700, David Miller wrote:
>> From: Oskar Schirmer<oskar@linutronix.de>
>> Date: Thu,  7 Oct 2010 14:30:30 +0200
>>
>>> with hardware slow in negotiation, the system did freeze
>>> while trying to mount root on nfs at boot time.
>>>
>>> the link state has not been initialised so network stack
>>> tried to start transmission right away. this caused instant
>>> retries, as the driver solely stated business upon link down,
>>> rendering the system unusable.
>>>
>>> notify carrier off initially to prevent transmission until
>>> phylib will report link up.
>>>
>>> Signed-off-by: Oskar Schirmer<oskar@linutronix.de>
>>
>> I did some more investigation into this situation, and for now I'm
>> going to apply your patch.  It is correct, and it also matches what
>> the only other seemingly correct driver I could find using phylib does
>> (gianfar) :-) Actually, although I didn't check, bi-modal drivers
>> (those that only use phylib for some phy types) like tg3 probably do
>> the right thing here too.
>>
>> Longer term I think the right thing to do might be:
>>
>> 1) Create some notion of "network device has managed carrier"
>>
>>     This could simply be a flag bit in the netdev or netdev_ops,
>>     or some other kind of attribute.
>>
>> 2) Managed carrier devices start with netif_carrier_off(), otherwise
>>     the device starts with netif_carrier_on().
>
> This last conditional (managed vs otherwise) would be implicit
> with a null PHY driver as Ben Hutchings proposes it to Greg Ungerers
> "allow FEC driver to not have attached PHY", 2010/10/07,
> with the null PHY simply switching to netif_carrier_on right after
> machine start.
>
> Otherwise my patch would need another #ifdef to live in
> peace with Gregs patch.

You can ignore my patch for now. I am reworking the it to
use fixed phy. It will look quite different when it is done.

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer  --  Principal Engineer        EMAIL:     gerg@snapgear.com
SnapGear Group, McAfee                      PHONE:       +61 7 3435 2888
8 Gardner Close,                            FAX:         +61 7 3891 3630
Milton, QLD, 4064, Australia                WEB: http://www.SnapGear.com

^ permalink raw reply

* Re: sending VLAN packets via packet_mmap
From: Phil Sutter @ 2010-10-11 13:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, johann.baudy, eric.dumazet
In-Reply-To: <20101006.235343.98903661.davem@davemloft.net>

David,

On Wed, Oct 06, 2010 at 11:53:43PM -0700, David Miller wrote:
> From: Phil Sutter <phil@nwl.cc>
> Date: Thu, 30 Sep 2010 21:24:14 +0200
> 
> > The actual problem in tpacket_snd() is this:
> > 
> > | reserve = dev->hard_header_len;
> > | [...]
> > | if (size_max > dev->mtu + reserve)
> > | 	size_max = dev->mtu + reserve;
> > 
> > I guess the check is there to avoid skb overflows on malicious data
> > input. Is this correct? Are there other reasons for it's existence?
> 
> We can add a special allowance of 4 extra bytes in this size check
> _iff_ the device is ethernet and the NETIF_F_VLAN_CHALLENGED netdev
> feature bit is not set.

Making sure the device can actually handle the additional data, good
point!

Perhaps a bit philosophical, but what about NETIF_F_HW_VLAN_TX? AFAICT,
NICs providing that feature insert the VLAN header themselfs, correct?
Therefore __vlan_hwaccel_put_tag() just sets skb->vlan_tci. In order to
make use of that feature, one could always insert the VLAN header in the
kernel using vlan_put_tag(), depending on tpacket2_hdr.tp_vlan_tci
(probably in combination with VLAN_TAG_PRESENT to allow for VLAN ID 0).

Resumptively speaking, I'm probably talking about another feature
instead of solution for the bug in first place so I'll reply to this
mail with a patch implementing the suggested solution.

Greetings, Phil

^ permalink raw reply

* [PATCH] af_packet: account for VLAN when checking packet size
From: Phil Sutter @ 2010-10-11 13:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, johann.baudy, eric.dumazet
In-Reply-To: <20101011131500.GA12342@orbit.nwl.cc>

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 net/packet/af_packet.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 9a17f28..ad37754 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -56,6 +56,7 @@
 #include <linux/in.h>
 #include <linux/inet.h>
 #include <linux/netdevice.h>
+#include <linux/if_arp.h>
 #include <linux/if_packet.h>
 #include <linux/wireless.h>
 #include <linux/kernel.h>
@@ -983,6 +984,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 	reserve = dev->hard_header_len;
 
+	/* allow VLAN packets when device supports them */
+	if (likely(dev->type == ARPHRD_ETHER) &&
+	    likely(!(dev->features & NETIF_F_VLAN_CHALLENGED)))
+		reserve += VLAN_HLEN;
+
 	err = -ENETDOWN;
 	if (unlikely(!(dev->flags & IFF_UP)))
 		goto out_put;
-- 
1.7.2.2


^ permalink raw reply related

* RE: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: Ben Hutchings @ 2010-10-11 14:00 UTC (permalink / raw)
  To: Vladislav Zolotarov
  Cc: David Miller, Dmitry Kravkov, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDEE4281D8@SJEXCHCCR02.corp.ad.broadcom.com>

On Mon, 2010-10-11 at 01:53 -0700, Vladislav Zolotarov wrote:
[...]
> Dave, it's a gSo_size, not a gRo_size and we are handling an LRO skb
> here. It's quite confusing, I agree... ;) This is currently the way 
> to mark an LRO skb in order to properly handle the LRO skbs that 
> might still be forwarded to the stack short time after the LRO has 
> been disabled due to enabling the IP forwarding (or if there is a 
> bug in a driver).
> See the skb_warn_if_lro() below:
> 
> static inline bool skb_warn_if_lro(const struct sk_buff *skb)
> {
> 	/* LRO sets gso_size but not gso_type, whereas if GSO is really
> 	 * wanted then gso_type will be set. */
> 	struct skb_shared_info *shinfo = skb_shinfo(skb);
> 	if (skb_is_nonlinear(skb) && shinfo->gso_size != 0 &&
> 	    unlikely(shinfo->gso_type == 0)) {
> 		__skb_warn_lro_forwarding(skb);
> 		return true;
> 	}
> 	return false;
> }
> 
> So, the convention is to set the gSo_size to a none-zero value leaving
> the gSo_type zero for an LRO skbs. It's quite hacky but this is what
> we have for quite a while and it quite worked so far. ;) To make it 
> cleaner we might have done the following:
[...]

The requirement (or as you call it, "convention") is to set gso_size to
the observed MSS of the packets that have been combined.  If you don't
do that then TCP will not know the true number of packets for purposes
of delayed-ACK etc.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: Eric Dumazet @ 2010-10-11 14:03 UTC (permalink / raw)
  To: Phil Sutter; +Cc: davem, netdev, johann.baudy
In-Reply-To: <1286803522-16478-1-git-send-email-phil@nwl.cc>

Le lundi 11 octobre 2010 à 15:25 +0200, Phil Sutter a écrit :
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---
>  net/packet/af_packet.c |    6 ++++++
>  1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 9a17f28..ad37754 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -56,6 +56,7 @@
>  #include <linux/in.h>
>  #include <linux/inet.h>
>  #include <linux/netdevice.h>
> +#include <linux/if_arp.h>
>  #include <linux/if_packet.h>
>  #include <linux/wireless.h>
>  #include <linux/kernel.h>
> @@ -983,6 +984,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>  
>  	reserve = dev->hard_header_len;
>  
> +	/* allow VLAN packets when device supports them */
> +	if (likely(dev->type == ARPHRD_ETHER) &&
> +	    likely(!(dev->features & NETIF_F_VLAN_CHALLENGED)))
> +		reserve += VLAN_HLEN;
> +
>  	err = -ENETDOWN;
>  	if (unlikely(!(dev->flags & IFF_UP)))
>  		goto out_put;

If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
frames for MTU=1500

Should we really care ?

If not, just do

reserve = dev->hard_header_len + VLAN_HLEN;




^ permalink raw reply

* [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Andrey Vagin @ 2010-10-11 15:20 UTC (permalink / raw)
  To: stable; +Cc: netdev, Krishna Kumar, David S. Miller, Andrey Vagin

From: Krishna Kumar <krkumar2@in.ibm.com>

commit 068a2de57ddf4f472e32e7af868613c574ad1d88 upstream.

Non-GSO code drops dst entry for performance reasons, but
the same is missing for GSO code. Drop dst while cache-hot
for GSO case too.

Note: Without this patch the kernel may oops if used bridged veth
devices. A bridge set skb->dst = fake_dst_ops, veth transfers this skb
to netif_receive_skb...ip_rcv_finish and it calls dst_input(skb), but
fake_dst_ops->input = NULL -> Oops

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 net/core/dev.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 915d0ae..c325ab6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1747,6 +1747,14 @@ gso:
 
 		skb->next = nskb->next;
 		nskb->next = NULL;
+
+		/*
+		 * If device doesnt need nskb->dst, release it right now while
+		 * its hot in this cpu cache
+		 */
+		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
+			skb_dst_drop(nskb);
+
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			nskb->next = skb->next;
-- 
1.7.2.1


^ permalink raw reply related

* Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
From: David Miller @ 2010-10-11 15:27 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, herbert, jdike
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D793064B2A9D8@shsmsx502.ccr.corp.intel.com>

From: "Xin, Xiaohui" <xiaohui.xin@intel.com>
Date: Mon, 11 Oct 2010 15:06:05 +0800

> That's to avoid the new cache miss caused by using destructor_arg in data path
> like skb_release_data().
> That's based on the comment from Eric Dumazet on v7 patches.

Thanks for the explanation.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Michael Tokarev @ 2010-10-11 15:40 UTC (permalink / raw)
  To: Andrey Vagin; +Cc: stable, netdev, Krishna Kumar, David S. Miller
In-Reply-To: <1286810413-30238-1-git-send-email-avagin@openvz.org>

Andrey Vagin wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> commit 068a2de57ddf4f472e32e7af868613c574ad1d88 upstream.
> 
> Non-GSO code drops dst entry for performance reasons, but
> the same is missing for GSO code. Drop dst while cache-hot
> for GSO case too.
> 
> Note: Without this patch the kernel may oops if used bridged veth
> devices. A bridge set skb->dst = fake_dst_ops, veth transfers this skb
> to netif_receive_skb...ip_rcv_finish and it calls dst_input(skb), but
> fake_dst_ops->input = NULL -> Oops

Hmm.  Isn't this the reason for my mysterious OOPSes (jump to
NULL) which I concluded are due to stack overflow?  See f.e.
http://www.spinics.net/lists/netdev/msg142104.html

This started happening when I updated virtio drivers in a windows
virtual machne to the ones which supports GSO, and my config
involves bridging veth devices, and this is where the prob
actually occurs - when doing guest => virtio => tap => bridge => veth
route....

Thanks!

/mjt

^ permalink raw reply

* Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
From: Eric Dumazet @ 2010-10-11 15:42 UTC (permalink / raw)
  To: David Miller
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mst, mingo, herbert,
	jdike
In-Reply-To: <20101011.082728.193709462.davem@davemloft.net>

Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit :
> From: "Xin, Xiaohui" <xiaohui.xin@intel.com>
> Date: Mon, 11 Oct 2010 15:06:05 +0800
> 
> > That's to avoid the new cache miss caused by using destructor_arg in data path
> > like skb_release_data().
> > That's based on the comment from Eric Dumazet on v7 patches.
> 
> Thanks for the explanation.

Anyway, frags[] must be the last field of "struct skb_shared_info"
since commit fed66381 (net: pskb_expand_head() optimization)

It seems Xin worked on a quite old tree.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Eric Dumazet @ 2010-10-11 15:46 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Andrey Vagin, stable, netdev, Krishna Kumar, David S. Miller
In-Reply-To: <4CB32FDB.4090001@msgid.tls.msk.ru>

Le lundi 11 octobre 2010 à 19:40 +0400, Michael Tokarev a écrit :
> Andrey Vagin wrote:
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> > 
> > commit 068a2de57ddf4f472e32e7af868613c574ad1d88 upstream.
> > 
> > Non-GSO code drops dst entry for performance reasons, but
> > the same is missing for GSO code. Drop dst while cache-hot
> > for GSO case too.
> > 
> > Note: Without this patch the kernel may oops if used bridged veth
> > devices. A bridge set skb->dst = fake_dst_ops, veth transfers this skb
> > to netif_receive_skb...ip_rcv_finish and it calls dst_input(skb), but
> > fake_dst_ops->input = NULL -> Oops
> 
> Hmm.  Isn't this the reason for my mysterious OOPSes (jump to
> NULL) which I concluded are due to stack overflow?  See f.e.
> http://www.spinics.net/lists/netdev/msg142104.html
> 
> This started happening when I updated virtio drivers in a windows
> virtual machne to the ones which supports GSO, and my config
> involves bridging veth devices, and this is where the prob
> actually occurs - when doing guest => virtio => tap => bridge => veth
> route....
> 
> Thanks!

This patch was an optimization, not a bug fix.

If something gets better, then a bug is somewhere else ?




^ permalink raw reply

* Re: [PATCH] net: introduce alloc_skb_order0
From: Stanislaw Gruszka @ 2010-10-11 15:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <1286639996.2692.148.camel@edumazet-laptop>

On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> 
> > > Switch to SLAB -> no more problem ;)
> > 
> > yeh, I wish to, but fedora use SLUB because of some debugging
> > capabilities. 
> 
> Yes, of course, I was kidding :)
> 
> echo 0 >/sys/kernel/slab/kmalloc-2048/order
> echo 0 >/sys/kernel/slab/kmalloc-1024/order
> echo 0 >/sys/kernel/slab/kmalloc-512/order
> 
> Should do the trick : No more high order allocations for MTU=1500
> frames.

So the SLUB is great, but we need a patch to avoid using it :-)

> For MTU=9000 frames, we probably need something like this patch :
>
> Reception of big frames hit a memory allocation problem, because of high
> order pages allocations (order-3 sometimes for MTU=9000). This patch
> introduces alloc_skb_order0(), to build skbs with order-0 pages only.

I had never seen allocation problems in rtl8169_try_rx_copy or in any
other driver rx path (except iwlwifi, but now this is solved by using
skb_add_rx_frag), so I'm not sure if need this patch.

However I see other benefit of that patch. We save memory. Allocating
for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
+ 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
every allocation. With patch wastage would be about 2k per allocation
(assuming 4kB and 8kB page size)

However I started this thread thinking about other memory wastage,
in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
from kmalloc-32768, almost 50% wastage.
 
> +struct sk_buff *alloc_skb_order0(int pkt_size)
> +{
> +	int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> +	struct sk_buff *skb;
> +
> +	skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> +			GFP_ATOMIC | __GFP_NOWARN);
> +	if (!skb)
> +		return NULL;
> +	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> +	skb_put(skb, head);
> +	pkt_size -= head;
> +
> +	skb->len += pkt_size;
> +	skb->data_len += pkt_size;
> +	skb->truesize += pkt_size;
> +	while (pkt_size) {

if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
	goto error;

> +		int i = skb_shinfo(skb)->nr_frags++;
> +		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> +		int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> +		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> +		if (!page)
> +			goto error;
> +		frag->page = page;
> +		frag->size = fragsize;
> +		frag->page_offset = 0;
> +		pkt_size -= fragsize;
> +	}
> +	return skb;
> +
> +error:
> +	kfree_skb(skb);
> +	return NULL;	
> +}
> +EXPORT_SYMBOL(alloc_skb_order0);
> +
>  /* Checksum skb data. */
>  
>  __wsum skb_checksum(const struct sk_buff *skb, int offset,

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: David Miller @ 2010-10-11 15:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <1286812009.2737.30.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 17:46:49 +0200

> This patch was an optimization, not a bug fix.

Right, this has no business going into 2.6.32-stable at all.

^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: David Miller @ 2010-10-11 16:01 UTC (permalink / raw)
  To: eric.dumazet; +Cc: phil, netdev, johann.baudy
In-Reply-To: <1286805782.2737.25.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 16:03:02 +0200

> If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> frames for MTU=1500
> 
> Should we really care ?
> 
> If not, just do
> 
> reserve = dev->hard_header_len + VLAN_HLEN;

Thats a good point, I think we need to validate the SKB protocol
field.

^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286550247.2959.444.camel@edumazet-laptop>

On Fri, 8 Oct 2010, Eric Dumazet wrote:

> 8 in the <pagesperslab> column just says that : order-3 pages, even for
> small allocations.

Those allocations will fallback to smaller allocs if the page allocator
has trouble satisfying those requests.


^ permalink raw reply

* Re: [PATCH] net: introduce alloc_skb_order0
From: Eric Dumazet @ 2010-10-11 16:05 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <20101011155556.GA2431@redhat.com>

Le lundi 11 octobre 2010 à 17:55 +0200, Stanislaw Gruszka a écrit :
> On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> > Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> > 
> > > > Switch to SLAB -> no more problem ;)
> > > 
> > > yeh, I wish to, but fedora use SLUB because of some debugging
> > > capabilities. 
> > 
> > Yes, of course, I was kidding :)
> > 
> > echo 0 >/sys/kernel/slab/kmalloc-2048/order
> > echo 0 >/sys/kernel/slab/kmalloc-1024/order
> > echo 0 >/sys/kernel/slab/kmalloc-512/order
> > 
> > Should do the trick : No more high order allocations for MTU=1500
> > frames.
> 
> So the SLUB is great, but we need a patch to avoid using it :-)
> 
> > For MTU=9000 frames, we probably need something like this patch :
> >
> > Reception of big frames hit a memory allocation problem, because of high
> > order pages allocations (order-3 sometimes for MTU=9000). This patch
> > introduces alloc_skb_order0(), to build skbs with order-0 pages only.
> 
> I had never seen allocation problems in rtl8169_try_rx_copy or in any
> other driver rx path (except iwlwifi, but now this is solved by using
> skb_add_rx_frag), so I'm not sure if need this patch.
> 
> However I see other benefit of that patch. We save memory. Allocating
> for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
> + 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
> every allocation. With patch wastage would be about 2k per allocation
> (assuming 4kB and 8kB page size)
> 
> However I started this thread thinking about other memory wastage,
> in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
> from kmalloc-32768, almost 50% wastage.
>  

You cannot use my patch to avoid this waste. Really.

You have two different things in this driver :

1) Allocation of a physically continous 16Kbytes bloc for the rx-ring,
at device initialization (GFP_KERNEL OK here)

   Here, the only thing you could do is to not allocate real skbs but
only 16KB data blocs (no need for the sk_buf, only the ->data part), and
force copybreak for all incoming packets (remove the rx_copybreak
tunable)

2) Allocation of order0 skb to perform the copybreak in rx path.
(GFP_ATOMIC) : My patch.


> > +struct sk_buff *alloc_skb_order0(int pkt_size)
> > +{
> > +	int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> > +	struct sk_buff *skb;
> > +
> > +	skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> > +			GFP_ATOMIC | __GFP_NOWARN);
> > +	if (!skb)
> > +		return NULL;
> > +	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> > +	skb_put(skb, head);
> > +	pkt_size -= head;
> > +
> > +	skb->len += pkt_size;
> > +	skb->data_len += pkt_size;
> > +	skb->truesize += pkt_size;
> > +	while (pkt_size) {
> 
> if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
> 	goto error;

Not needed. A frame is < 16383 bytes, so _must_ fit in an skb,
(skb can hold up to 64 Kbytes)

> 
> > +		int i = skb_shinfo(skb)->nr_frags++;
> > +		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> > +		int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> > +		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> > +
> > +		if (!page)
> > +			goto error;
> > +		frag->page = page;
> > +		frag->size = fragsize;
> > +		frag->page_offset = 0;
> > +		pkt_size -= fragsize;
> > +	}
> > +	return skb;
> > +
> > +error:
> > +	kfree_skb(skb);
> > +	return NULL;	
> > +}
> > +EXPORT_SYMBOL(alloc_skb_order0);
> > +
> >  /* Checksum skb data. */
> >  
> >  __wsum skb_checksum(const struct sk_buff *skb, int offset,



^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Eric Dumazet @ 2010-10-11 16:07 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <alpine.DEB.2.00.1010111102390.11247@router.home>

Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> On Fri, 8 Oct 2010, Eric Dumazet wrote:
> 
> > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > small allocations.
> 
> Those allocations will fallback to smaller allocs if the page allocator
> has trouble satisfying those requests.
> 

Interesting, do you have an idea when this feature was added ?

Thanks



^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286813237.2737.37.camel@edumazet-laptop>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 498 bytes --]

On Mon, 11 Oct 2010, Eric Dumazet wrote:

> Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> > On Fri, 8 Oct 2010, Eric Dumazet wrote:
> >
> > > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > > small allocations.
> >
> > Those allocations will fallback to smaller allocs if the page allocator
> > has trouble satisfying those requests.
> >
>
> Interesting, do you have an idea when this feature was added ?

A couple of years ago.

^ permalink raw reply

* Re: [PATCH net-next] neigh: speedup neigh_hh_init()
From: David Miller @ 2010-10-11 16:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286456008.2912.171.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 07 Oct 2010 14:53:28 +0200

> When a new dst is used to send a frame, neigh_resolve_output() tries to
> associate an struct hh_cache to this dst, calling neigh_hh_init() with
> the neigh rwlock write locked.
> 
> Most of the time, hh_cache is already known and linked into neighbour,
> so we find it and increment its refcount.
> 
> This patch changes the logic so that we call neigh_hh_init() with
> neighbour lock read locked only, so that fast path can be run in
> parallel by concurrent cpus.
> 
> This brings part of the speedup we got with commit c7d4426a98a5f
> (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
> removing one of the contention point that routers hit on multiqueue
> enabled machines.
> 
> Further improvements would need to use a seqlock instead of an rwlock to
> protect neigh->ha[], to not dirty neigh too often and remove two atomic
> ops.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Ok this patch assumes that neighbours are highly shared, which
is a reasonable thing to optimize for.

So, applied, thanks a lot!

BTW, I think you can RCU this thing.  Require that every change
to the 'hh' entry be done in a newly allocated entry.

Then the cmpxchg() on the hh pointer interlocks.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Andrew Vagin @ 2010-10-11 16:19 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <20101011.085952.193718228.davem@davemloft.net>

  On 10/11/2010 07:59 PM, David Miller wrote:
> From: Eric Dumazet<eric.dumazet@gmail.com>
> Date: Mon, 11 Oct 2010 17:46:49 +0200
>
>> This patch was an optimization, not a bug fix.
> Right, this has no business going into 2.6.32-stable at all.
This is bug fix. Now nobody drops dst in case gso and veth, because the 
commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
from the veth.c. We should commit my patch or revert commit 60df914e.

We have two bug reports:

http://www.spinics.net/lists/netdev/msg142104.html

http://bugzilla.openvz.org/show_bug.cgi?id=1634

Taylan verified that the patch really fix his bug.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Eric Dumazet @ 2010-10-11 16:30 UTC (permalink / raw)
  To: avagin; +Cc: David Miller, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <4CB33907.5060803@gmail.com>

Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
> On 10/11/2010 07:59 PM, David Miller wrote:
> > From: Eric Dumazet<eric.dumazet@gmail.com>
> > Date: Mon, 11 Oct 2010 17:46:49 +0200
> >
> >> This patch was an optimization, not a bug fix.
> > Right, this has no business going into 2.6.32-stable at all.
> This is bug fix. Now nobody drops dst in case gso and veth, because the 
> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
> from the veth.c. We should commit my patch or revert commit 60df914e.
> 
> We have two bug reports:
> 
> http://www.spinics.net/lists/netdev/msg142104.html
> 
> http://bugzilla.openvz.org/show_bug.cgi?id=1634
> 
> Taylan verified that the patch really fix his bug.

Now that makes sense ;)

This is always good to mention commit id of a bug origin.

Because reading your patch (changelog and content), it was not obvious
it fixed a bug.

Thanks !



^ permalink raw reply

* Re: [PATCH 3/3] NET: wimax, fix use after free
From: Inaky Perez-Gonzalez @ 2010-10-11 16:46 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, jirislaby@gmail.com, linux-wimax
In-Reply-To: <1286789218-13976-3-git-send-email-jslaby@suse.cz>

On Mon, 2010-10-11 at 02:26 -0700, Jiri Slaby wrote: 
> Stanse found that i2400m_rx frees skb, but still uses skb->len even
> though it has skb_len defined. So use skb_len properly in the code.
> 
> And also define it unsinged int rather than size_t to solve
> compilation warnings.
> 
> Signed-off-by: Jiri Slaby <jslaby@suse.cz>

Ops, fail. Thanks for the catch. I assume you have compile tested it.

Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>

> Cc: linux-wimax@intel.com
> ---
>  drivers/net/wimax/i2400m/rx.c |   26 +++++++++++++-------------
>  1 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
> index c4876d0..844133b 100644
> --- a/drivers/net/wimax/i2400m/rx.c
> +++ b/drivers/net/wimax/i2400m/rx.c
> @@ -1244,16 +1244,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	int i, result;
>  	struct device *dev = i2400m_dev(i2400m);
>  	const struct i2400m_msg_hdr *msg_hdr;
> -	size_t pl_itr, pl_size, skb_len;
> +	size_t pl_itr, pl_size;
>  	unsigned long flags;
> -	unsigned num_pls, single_last;
> +	unsigned num_pls, single_last, skb_len;
>  
>  	skb_len = skb->len;
> -	d_fnstart(4, dev, "(i2400m %p skb %p [size %zu])\n",
> +	d_fnstart(4, dev, "(i2400m %p skb %p [size %u])\n",
>  		  i2400m, skb, skb_len);
>  	result = -EIO;
>  	msg_hdr = (void *) skb->data;
> -	result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb->len);
> +	result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb_len);
>  	if (result < 0)
>  		goto error_msg_hdr_check;
>  	result = -EIO;
> @@ -1261,10 +1261,10 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	pl_itr = sizeof(*msg_hdr) +	/* Check payload descriptor(s) */
>  		num_pls * sizeof(msg_hdr->pld[0]);
>  	pl_itr = ALIGN(pl_itr, I2400M_PL_ALIGN);
> -	if (pl_itr > skb->len) {	/* got all the payload descriptors? */
> +	if (pl_itr > skb_len) {	/* got all the payload descriptors? */
>  		dev_err(dev, "RX: HW BUG? message too short (%u bytes) for "
>  			"%u payload descriptors (%zu each, total %zu)\n",
> -			skb->len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
> +			skb_len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
>  		goto error_pl_descr_short;
>  	}
>  	/* Walk each payload payload--check we really got it */
> @@ -1272,7 +1272,7 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  		/* work around old gcc warnings */
>  		pl_size = i2400m_pld_size(&msg_hdr->pld[i]);
>  		result = i2400m_rx_pl_descr_check(i2400m, &msg_hdr->pld[i],
> -						  pl_itr, skb->len);
> +						  pl_itr, skb_len);
>  		if (result < 0)
>  			goto error_pl_descr_check;
>  		single_last = num_pls == 1 || i == num_pls - 1;
> @@ -1290,16 +1290,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	if (i < i2400m->rx_pl_min)
>  		i2400m->rx_pl_min = i;
>  	i2400m->rx_num++;
> -	i2400m->rx_size_acc += skb->len;
> -	if (skb->len < i2400m->rx_size_min)
> -		i2400m->rx_size_min = skb->len;
> -	if (skb->len > i2400m->rx_size_max)
> -		i2400m->rx_size_max = skb->len;
> +	i2400m->rx_size_acc += skb_len;
> +	if (skb_len < i2400m->rx_size_min)
> +		i2400m->rx_size_min = skb_len;
> +	if (skb_len > i2400m->rx_size_max)
> +		i2400m->rx_size_max = skb_len;
>  	spin_unlock_irqrestore(&i2400m->rx_lock, flags);
>  error_pl_descr_check:
>  error_pl_descr_short:
>  error_msg_hdr_check:
> -	d_fnend(4, dev, "(i2400m %p skb %p [size %zu]) = %d\n",
> +	d_fnend(4, dev, "(i2400m %p skb %p [size %u]) = %d\n",
>  		i2400m, skb, skb_len, result);
>  	return result;
>  }




^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Michael Tokarev @ 2010-10-11 16:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: avagin, David Miller, avagin, stable, netdev, krkumar2
In-Reply-To: <1286814652.2737.41.camel@edumazet-laptop>

11.10.2010 20:30, Eric Dumazet wrote:
> Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
>> On 10/11/2010 07:59 PM, David Miller wrote:
>>> From: Eric Dumazet<eric.dumazet@gmail.com>
>>> Date: Mon, 11 Oct 2010 17:46:49 +0200
>>>
>>>> This patch was an optimization, not a bug fix.
>>> Right, this has no business going into 2.6.32-stable at all.
>> This is bug fix. Now nobody drops dst in case gso and veth, because the 
>> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
>> from the veth.c. We should commit my patch or revert commit 60df914e.
>>
>> We have two bug reports:
>>
>> http://www.spinics.net/lists/netdev/msg142104.html

This is korg#17251: https://bugzilla.kernel.org/show_bug.cgi?id=17251 ,
from me.

But I really wonder how that thing does not happen anymore
when I disabled netfilter hooks...  I can't experiment till
weekend, but I'll try to get back to it and re-verify again,
with and without this fix, with and without netfilter hooks.

What I know for sure is 2 facts: I can't trigger the problem
now (with the hooks disabled), and I can't trigger it on a
subsequent kernel releases - e.g. 2.6.35 does not have the
issue, but 2.6.32 has.

>> http://bugzilla.openvz.org/show_bug.cgi?id=1634

Thanks!

/mjt

^ permalink raw reply

* RE: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: Vladislav Zolotarov @ 2010-10-11 17:06 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, Dmitry Kravkov, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <1286805633.2349.64.camel@achroite.uk.solarflarecom.com>



> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Monday, October 11, 2010 4:01 PM
> To: Vladislav Zolotarov
> Cc: David Miller; Dmitry Kravkov; netdev@vger.kernel.org; Eilon
> Greenstein
> Subject: RE: [PATCH net-next 2/5] bnx2x: save cycles in setting
> gso_size
> 
> On Mon, 2010-10-11 at 01:53 -0700, Vladislav Zolotarov wrote:
> [...]
> > Dave, it's a gSo_size, not a gRo_size and we are handling an LRO skb
> > here. It's quite confusing, I agree... ;) This is currently the way
> > to mark an LRO skb in order to properly handle the LRO skbs that
> > might still be forwarded to the stack short time after the LRO has
> > been disabled due to enabling the IP forwarding (or if there is a
> > bug in a driver).
> > See the skb_warn_if_lro() below:
> >
> > static inline bool skb_warn_if_lro(const struct sk_buff *skb)
> > {
> > 	/* LRO sets gso_size but not gso_type, whereas if GSO is really
> > 	 * wanted then gso_type will be set. */
> > 	struct skb_shared_info *shinfo = skb_shinfo(skb);
> > 	if (skb_is_nonlinear(skb) && shinfo->gso_size != 0 &&
> > 	    unlikely(shinfo->gso_type == 0)) {
> > 		__skb_warn_lro_forwarding(skb);
> > 		return true;
> > 	}
> > 	return false;
> > }
> >
> > So, the convention is to set the gSo_size to a none-zero value
> leaving
> > the gSo_type zero for an LRO skbs. It's quite hacky but this is what
> > we have for quite a while and it quite worked so far. ;) To make it
> > cleaner we might have done the following:
> [...]
> 
> The requirement (or as you call it, "convention") is to set gso_size to
> the observed MSS of the packets that have been combined.  If you don't
> do that then TCP will not know the true number of packets for purposes
> of delayed-ACK etc.

Thanks, Ben. I see that now. 
Dave, we will respin this patch series.

Thanks,
vlad

> 
> Ben.
> 
> --
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 


^ permalink raw reply

* [patch 1/2] vhost: potential integer overflows
From: Dan Carpenter @ 2010-10-11 17:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
	virtualization, netdev, kernel-janitors

I did an audit for potential integer overflows of values which get passed
to access_ok() and here are the results.

Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dd3d6f7..c2aa12c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
 			struct vring_avail __user *avail,
 			struct vring_used __user *used)
 {
+
+	if (num > UINT_MAX / sizeof *desc)
+		return 0;
+	if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
+		return 0;
+	if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
+		return 0;
+
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
 			 sizeof *avail + num * sizeof *avail->ring) &&
@@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 /* Caller should have vq mutex and device mutex */
 static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 {
+	if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
+		return 0;
+
 	return vq_memory_access_ok(log_base, vq->dev->memory,
 			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
@@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 			}
 
 			/* Also validate log access for used ring if enabled. */
-			if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
-			    !log_access_ok(vq->log_base, a.log_guest_addr,
+			if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
+				if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used) {
+					r = -EINVAL;
+					break;
+				}
+				if (!log_access_ok(vq->log_base, a.log_guest_addr,
 					   sizeof *vq->used +
 					   vq->num * sizeof *vq->used->ring)) {
-				r = -EINVAL;
-				break;
+					r = -EINVAL;
+					break;
+				}
 			}
 		}
 

^ permalink raw reply related

* [patch 2/2] vhost: fix return code for log_access_ok()
From: Dan Carpenter @ 2010-10-11 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
	virtualization, netdev, kernel-janitors

access_ok() returns 1 if it's OK otherwise it should return 0.

Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c2aa12c..f82fe57 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -371,7 +371,7 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 	/* Make sure 64 bit math will not overflow. */
 	if (a > ULONG_MAX - (unsigned long)log_base ||
 	    a + (unsigned long)log_base > ULONG_MAX)
-		return -EFAULT;
+		return 0;
 
 	return access_ok(VERIFY_WRITE, log_base + a,
 			 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox