Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: introduce alloc_skb_order0
From: Stanislaw Gruszka @ 2010-10-11 15:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <1286639996.2692.148.camel@edumazet-laptop>

On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> 
> > > Switch to SLAB -> no more problem ;)
> > 
> > yeh, I wish to, but fedora use SLUB because of some debugging
> > capabilities. 
> 
> Yes, of course, I was kidding :)
> 
> echo 0 >/sys/kernel/slab/kmalloc-2048/order
> echo 0 >/sys/kernel/slab/kmalloc-1024/order
> echo 0 >/sys/kernel/slab/kmalloc-512/order
> 
> Should do the trick : No more high order allocations for MTU=1500
> frames.

So the SLUB is great, but we need a patch to avoid using it :-)

> For MTU=9000 frames, we probably need something like this patch :
>
> Reception of big frames hit a memory allocation problem, because of high
> order pages allocations (order-3 sometimes for MTU=9000). This patch
> introduces alloc_skb_order0(), to build skbs with order-0 pages only.

I had never seen allocation problems in rtl8169_try_rx_copy or in any
other driver rx path (except iwlwifi, but now this is solved by using
skb_add_rx_frag), so I'm not sure if need this patch.

However I see other benefit of that patch. We save memory. Allocating
for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
+ 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
every allocation. With patch wastage would be about 2k per allocation
(assuming 4kB and 8kB page size)

However I started this thread thinking about other memory wastage,
in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
from kmalloc-32768, almost 50% wastage.
 
> +struct sk_buff *alloc_skb_order0(int pkt_size)
> +{
> +	int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> +	struct sk_buff *skb;
> +
> +	skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> +			GFP_ATOMIC | __GFP_NOWARN);
> +	if (!skb)
> +		return NULL;
> +	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> +	skb_put(skb, head);
> +	pkt_size -= head;
> +
> +	skb->len += pkt_size;
> +	skb->data_len += pkt_size;
> +	skb->truesize += pkt_size;
> +	while (pkt_size) {

if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
	goto error;

> +		int i = skb_shinfo(skb)->nr_frags++;
> +		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> +		int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> +		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> +		if (!page)
> +			goto error;
> +		frag->page = page;
> +		frag->size = fragsize;
> +		frag->page_offset = 0;
> +		pkt_size -= fragsize;
> +	}
> +	return skb;
> +
> +error:
> +	kfree_skb(skb);
> +	return NULL;	
> +}
> +EXPORT_SYMBOL(alloc_skb_order0);
> +
>  /* Checksum skb data. */
>  
>  __wsum skb_checksum(const struct sk_buff *skb, int offset,

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: David Miller @ 2010-10-11 15:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <1286812009.2737.30.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 17:46:49 +0200

> This patch was an optimization, not a bug fix.

Right, this has no business going into 2.6.32-stable at all.

^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: David Miller @ 2010-10-11 16:01 UTC (permalink / raw)
  To: eric.dumazet; +Cc: phil, netdev, johann.baudy
In-Reply-To: <1286805782.2737.25.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 16:03:02 +0200

> If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> frames for MTU=1500
> 
> Should we really care ?
> 
> If not, just do
> 
> reserve = dev->hard_header_len + VLAN_HLEN;

Thats a good point, I think we need to validate the SKB protocol
field.

^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286550247.2959.444.camel@edumazet-laptop>

On Fri, 8 Oct 2010, Eric Dumazet wrote:

> 8 in the <pagesperslab> column just says that : order-3 pages, even for
> small allocations.

Those allocations will fallback to smaller allocs if the page allocator
has trouble satisfying those requests.


^ permalink raw reply

* Re: [PATCH] net: introduce alloc_skb_order0
From: Eric Dumazet @ 2010-10-11 16:05 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <20101011155556.GA2431@redhat.com>

Le lundi 11 octobre 2010 à 17:55 +0200, Stanislaw Gruszka a écrit :
> On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> > Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> > 
> > > > Switch to SLAB -> no more problem ;)
> > > 
> > > yeh, I wish to, but fedora use SLUB because of some debugging
> > > capabilities. 
> > 
> > Yes, of course, I was kidding :)
> > 
> > echo 0 >/sys/kernel/slab/kmalloc-2048/order
> > echo 0 >/sys/kernel/slab/kmalloc-1024/order
> > echo 0 >/sys/kernel/slab/kmalloc-512/order
> > 
> > Should do the trick : No more high order allocations for MTU=1500
> > frames.
> 
> So the SLUB is great, but we need a patch to avoid using it :-)
> 
> > For MTU=9000 frames, we probably need something like this patch :
> >
> > Reception of big frames hit a memory allocation problem, because of high
> > order pages allocations (order-3 sometimes for MTU=9000). This patch
> > introduces alloc_skb_order0(), to build skbs with order-0 pages only.
> 
> I had never seen allocation problems in rtl8169_try_rx_copy or in any
> other driver rx path (except iwlwifi, but now this is solved by using
> skb_add_rx_frag), so I'm not sure if need this patch.
> 
> However I see other benefit of that patch. We save memory. Allocating
> for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
> + 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
> every allocation. With patch wastage would be about 2k per allocation
> (assuming 4kB and 8kB page size)
> 
> However I started this thread thinking about other memory wastage,
> in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
> from kmalloc-32768, almost 50% wastage.
>  

You cannot use my patch to avoid this waste. Really.

You have two different things in this driver :

1) Allocation of a physically continous 16Kbytes bloc for the rx-ring,
at device initialization (GFP_KERNEL OK here)

   Here, the only thing you could do is to not allocate real skbs but
only 16KB data blocs (no need for the sk_buf, only the ->data part), and
force copybreak for all incoming packets (remove the rx_copybreak
tunable)

2) Allocation of order0 skb to perform the copybreak in rx path.
(GFP_ATOMIC) : My patch.


> > +struct sk_buff *alloc_skb_order0(int pkt_size)
> > +{
> > +	int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> > +	struct sk_buff *skb;
> > +
> > +	skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> > +			GFP_ATOMIC | __GFP_NOWARN);
> > +	if (!skb)
> > +		return NULL;
> > +	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> > +	skb_put(skb, head);
> > +	pkt_size -= head;
> > +
> > +	skb->len += pkt_size;
> > +	skb->data_len += pkt_size;
> > +	skb->truesize += pkt_size;
> > +	while (pkt_size) {
> 
> if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
> 	goto error;

Not needed. A frame is < 16383 bytes, so _must_ fit in an skb,
(skb can hold up to 64 Kbytes)

> 
> > +		int i = skb_shinfo(skb)->nr_frags++;
> > +		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> > +		int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> > +		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> > +
> > +		if (!page)
> > +			goto error;
> > +		frag->page = page;
> > +		frag->size = fragsize;
> > +		frag->page_offset = 0;
> > +		pkt_size -= fragsize;
> > +	}
> > +	return skb;
> > +
> > +error:
> > +	kfree_skb(skb);
> > +	return NULL;	
> > +}
> > +EXPORT_SYMBOL(alloc_skb_order0);
> > +
> >  /* Checksum skb data. */
> >  
> >  __wsum skb_checksum(const struct sk_buff *skb, int offset,



^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Eric Dumazet @ 2010-10-11 16:07 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <alpine.DEB.2.00.1010111102390.11247@router.home>

Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> On Fri, 8 Oct 2010, Eric Dumazet wrote:
> 
> > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > small allocations.
> 
> Those allocations will fallback to smaller allocs if the page allocator
> has trouble satisfying those requests.
> 

Interesting, do you have an idea when this feature was added ?

Thanks



^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286813237.2737.37.camel@edumazet-laptop>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 498 bytes --]

On Mon, 11 Oct 2010, Eric Dumazet wrote:

> Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> > On Fri, 8 Oct 2010, Eric Dumazet wrote:
> >
> > > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > > small allocations.
> >
> > Those allocations will fallback to smaller allocs if the page allocator
> > has trouble satisfying those requests.
> >
>
> Interesting, do you have an idea when this feature was added ?

A couple of years ago.

^ permalink raw reply

* Re: [PATCH net-next] neigh: speedup neigh_hh_init()
From: David Miller @ 2010-10-11 16:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286456008.2912.171.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 07 Oct 2010 14:53:28 +0200

> When a new dst is used to send a frame, neigh_resolve_output() tries to
> associate an struct hh_cache to this dst, calling neigh_hh_init() with
> the neigh rwlock write locked.
> 
> Most of the time, hh_cache is already known and linked into neighbour,
> so we find it and increment its refcount.
> 
> This patch changes the logic so that we call neigh_hh_init() with
> neighbour lock read locked only, so that fast path can be run in
> parallel by concurrent cpus.
> 
> This brings part of the speedup we got with commit c7d4426a98a5f
> (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
> removing one of the contention point that routers hit on multiqueue
> enabled machines.
> 
> Further improvements would need to use a seqlock instead of an rwlock to
> protect neigh->ha[], to not dirty neigh too often and remove two atomic
> ops.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Ok this patch assumes that neighbours are highly shared, which
is a reasonable thing to optimize for.

So, applied, thanks a lot!

BTW, I think you can RCU this thing.  Require that every change
to the 'hh' entry be done in a newly allocated entry.

Then the cmpxchg() on the hh pointer interlocks.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Andrew Vagin @ 2010-10-11 16:19 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <20101011.085952.193718228.davem@davemloft.net>

  On 10/11/2010 07:59 PM, David Miller wrote:
> From: Eric Dumazet<eric.dumazet@gmail.com>
> Date: Mon, 11 Oct 2010 17:46:49 +0200
>
>> This patch was an optimization, not a bug fix.
> Right, this has no business going into 2.6.32-stable at all.
This is bug fix. Now nobody drops dst in case gso and veth, because the 
commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
from the veth.c. We should commit my patch or revert commit 60df914e.

We have two bug reports:

http://www.spinics.net/lists/netdev/msg142104.html

http://bugzilla.openvz.org/show_bug.cgi?id=1634

Taylan verified that the patch really fix his bug.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Eric Dumazet @ 2010-10-11 16:30 UTC (permalink / raw)
  To: avagin; +Cc: David Miller, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <4CB33907.5060803@gmail.com>

Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
> On 10/11/2010 07:59 PM, David Miller wrote:
> > From: Eric Dumazet<eric.dumazet@gmail.com>
> > Date: Mon, 11 Oct 2010 17:46:49 +0200
> >
> >> This patch was an optimization, not a bug fix.
> > Right, this has no business going into 2.6.32-stable at all.
> This is bug fix. Now nobody drops dst in case gso and veth, because the 
> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
> from the veth.c. We should commit my patch or revert commit 60df914e.
> 
> We have two bug reports:
> 
> http://www.spinics.net/lists/netdev/msg142104.html
> 
> http://bugzilla.openvz.org/show_bug.cgi?id=1634
> 
> Taylan verified that the patch really fix his bug.

Now that makes sense ;)

This is always good to mention commit id of a bug origin.

Because reading your patch (changelog and content), it was not obvious
it fixed a bug.

Thanks !



^ permalink raw reply

* Re: [PATCH 3/3] NET: wimax, fix use after free
From: Inaky Perez-Gonzalez @ 2010-10-11 16:46 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, jirislaby@gmail.com, linux-wimax
In-Reply-To: <1286789218-13976-3-git-send-email-jslaby@suse.cz>

On Mon, 2010-10-11 at 02:26 -0700, Jiri Slaby wrote: 
> Stanse found that i2400m_rx frees skb, but still uses skb->len even
> though it has skb_len defined. So use skb_len properly in the code.
> 
> And also define it unsinged int rather than size_t to solve
> compilation warnings.
> 
> Signed-off-by: Jiri Slaby <jslaby@suse.cz>

Ops, fail. Thanks for the catch. I assume you have compile tested it.

Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>

> Cc: linux-wimax@intel.com
> ---
>  drivers/net/wimax/i2400m/rx.c |   26 +++++++++++++-------------
>  1 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
> index c4876d0..844133b 100644
> --- a/drivers/net/wimax/i2400m/rx.c
> +++ b/drivers/net/wimax/i2400m/rx.c
> @@ -1244,16 +1244,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	int i, result;
>  	struct device *dev = i2400m_dev(i2400m);
>  	const struct i2400m_msg_hdr *msg_hdr;
> -	size_t pl_itr, pl_size, skb_len;
> +	size_t pl_itr, pl_size;
>  	unsigned long flags;
> -	unsigned num_pls, single_last;
> +	unsigned num_pls, single_last, skb_len;
>  
>  	skb_len = skb->len;
> -	d_fnstart(4, dev, "(i2400m %p skb %p [size %zu])\n",
> +	d_fnstart(4, dev, "(i2400m %p skb %p [size %u])\n",
>  		  i2400m, skb, skb_len);
>  	result = -EIO;
>  	msg_hdr = (void *) skb->data;
> -	result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb->len);
> +	result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb_len);
>  	if (result < 0)
>  		goto error_msg_hdr_check;
>  	result = -EIO;
> @@ -1261,10 +1261,10 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	pl_itr = sizeof(*msg_hdr) +	/* Check payload descriptor(s) */
>  		num_pls * sizeof(msg_hdr->pld[0]);
>  	pl_itr = ALIGN(pl_itr, I2400M_PL_ALIGN);
> -	if (pl_itr > skb->len) {	/* got all the payload descriptors? */
> +	if (pl_itr > skb_len) {	/* got all the payload descriptors? */
>  		dev_err(dev, "RX: HW BUG? message too short (%u bytes) for "
>  			"%u payload descriptors (%zu each, total %zu)\n",
> -			skb->len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
> +			skb_len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
>  		goto error_pl_descr_short;
>  	}
>  	/* Walk each payload payload--check we really got it */
> @@ -1272,7 +1272,7 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  		/* work around old gcc warnings */
>  		pl_size = i2400m_pld_size(&msg_hdr->pld[i]);
>  		result = i2400m_rx_pl_descr_check(i2400m, &msg_hdr->pld[i],
> -						  pl_itr, skb->len);
> +						  pl_itr, skb_len);
>  		if (result < 0)
>  			goto error_pl_descr_check;
>  		single_last = num_pls == 1 || i == num_pls - 1;
> @@ -1290,16 +1290,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	if (i < i2400m->rx_pl_min)
>  		i2400m->rx_pl_min = i;
>  	i2400m->rx_num++;
> -	i2400m->rx_size_acc += skb->len;
> -	if (skb->len < i2400m->rx_size_min)
> -		i2400m->rx_size_min = skb->len;
> -	if (skb->len > i2400m->rx_size_max)
> -		i2400m->rx_size_max = skb->len;
> +	i2400m->rx_size_acc += skb_len;
> +	if (skb_len < i2400m->rx_size_min)
> +		i2400m->rx_size_min = skb_len;
> +	if (skb_len > i2400m->rx_size_max)
> +		i2400m->rx_size_max = skb_len;
>  	spin_unlock_irqrestore(&i2400m->rx_lock, flags);
>  error_pl_descr_check:
>  error_pl_descr_short:
>  error_msg_hdr_check:
> -	d_fnend(4, dev, "(i2400m %p skb %p [size %zu]) = %d\n",
> +	d_fnend(4, dev, "(i2400m %p skb %p [size %u]) = %d\n",
>  		i2400m, skb, skb_len, result);
>  	return result;
>  }




^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Michael Tokarev @ 2010-10-11 16:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: avagin, David Miller, avagin, stable, netdev, krkumar2
In-Reply-To: <1286814652.2737.41.camel@edumazet-laptop>

11.10.2010 20:30, Eric Dumazet wrote:
> Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
>> On 10/11/2010 07:59 PM, David Miller wrote:
>>> From: Eric Dumazet<eric.dumazet@gmail.com>
>>> Date: Mon, 11 Oct 2010 17:46:49 +0200
>>>
>>>> This patch was an optimization, not a bug fix.
>>> Right, this has no business going into 2.6.32-stable at all.
>> This is bug fix. Now nobody drops dst in case gso and veth, because the 
>> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
>> from the veth.c. We should commit my patch or revert commit 60df914e.
>>
>> We have two bug reports:
>>
>> http://www.spinics.net/lists/netdev/msg142104.html

This is korg#17251: https://bugzilla.kernel.org/show_bug.cgi?id=17251 ,
from me.

But I really wonder how that thing does not happen anymore
when I disabled netfilter hooks...  I can't experiment till
weekend, but I'll try to get back to it and re-verify again,
with and without this fix, with and without netfilter hooks.

What I know for sure is 2 facts: I can't trigger the problem
now (with the hooks disabled), and I can't trigger it on a
subsequent kernel releases - e.g. 2.6.35 does not have the
issue, but 2.6.32 has.

>> http://bugzilla.openvz.org/show_bug.cgi?id=1634

Thanks!

/mjt

^ permalink raw reply

* RE: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: Vladislav Zolotarov @ 2010-10-11 17:06 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, Dmitry Kravkov, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <1286805633.2349.64.camel@achroite.uk.solarflarecom.com>



> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Monday, October 11, 2010 4:01 PM
> To: Vladislav Zolotarov
> Cc: David Miller; Dmitry Kravkov; netdev@vger.kernel.org; Eilon
> Greenstein
> Subject: RE: [PATCH net-next 2/5] bnx2x: save cycles in setting
> gso_size
> 
> On Mon, 2010-10-11 at 01:53 -0700, Vladislav Zolotarov wrote:
> [...]
> > Dave, it's a gSo_size, not a gRo_size and we are handling an LRO skb
> > here. It's quite confusing, I agree... ;) This is currently the way
> > to mark an LRO skb in order to properly handle the LRO skbs that
> > might still be forwarded to the stack short time after the LRO has
> > been disabled due to enabling the IP forwarding (or if there is a
> > bug in a driver).
> > See the skb_warn_if_lro() below:
> >
> > static inline bool skb_warn_if_lro(const struct sk_buff *skb)
> > {
> > 	/* LRO sets gso_size but not gso_type, whereas if GSO is really
> > 	 * wanted then gso_type will be set. */
> > 	struct skb_shared_info *shinfo = skb_shinfo(skb);
> > 	if (skb_is_nonlinear(skb) && shinfo->gso_size != 0 &&
> > 	    unlikely(shinfo->gso_type == 0)) {
> > 		__skb_warn_lro_forwarding(skb);
> > 		return true;
> > 	}
> > 	return false;
> > }
> >
> > So, the convention is to set the gSo_size to a none-zero value
> leaving
> > the gSo_type zero for an LRO skbs. It's quite hacky but this is what
> > we have for quite a while and it quite worked so far. ;) To make it
> > cleaner we might have done the following:
> [...]
> 
> The requirement (or as you call it, "convention") is to set gso_size to
> the observed MSS of the packets that have been combined.  If you don't
> do that then TCP will not know the true number of packets for purposes
> of delayed-ACK etc.

Thanks, Ben. I see that now. 
Dave, we will respin this patch series.

Thanks,
vlad

> 
> Ben.
> 
> --
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 


^ permalink raw reply

* [patch 1/2] vhost: potential integer overflows
From: Dan Carpenter @ 2010-10-11 17:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
	virtualization, netdev, kernel-janitors

I did an audit for potential integer overflows of values which get passed
to access_ok() and here are the results.

Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dd3d6f7..c2aa12c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
 			struct vring_avail __user *avail,
 			struct vring_used __user *used)
 {
+
+	if (num > UINT_MAX / sizeof *desc)
+		return 0;
+	if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
+		return 0;
+	if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
+		return 0;
+
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
 			 sizeof *avail + num * sizeof *avail->ring) &&
@@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 /* Caller should have vq mutex and device mutex */
 static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 {
+	if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
+		return 0;
+
 	return vq_memory_access_ok(log_base, vq->dev->memory,
 			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
@@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 			}
 
 			/* Also validate log access for used ring if enabled. */
-			if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
-			    !log_access_ok(vq->log_base, a.log_guest_addr,
+			if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
+				if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used) {
+					r = -EINVAL;
+					break;
+				}
+				if (!log_access_ok(vq->log_base, a.log_guest_addr,
 					   sizeof *vq->used +
 					   vq->num * sizeof *vq->used->ring)) {
-				r = -EINVAL;
-				break;
+					r = -EINVAL;
+					break;
+				}
 			}
 		}
 

^ permalink raw reply related

* [patch 2/2] vhost: fix return code for log_access_ok()
From: Dan Carpenter @ 2010-10-11 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
	virtualization, netdev, kernel-janitors

access_ok() returns 1 if it's OK otherwise it should return 0.

Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c2aa12c..f82fe57 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -371,7 +371,7 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 	/* Make sure 64 bit math will not overflow. */
 	if (a > ULONG_MAX - (unsigned long)log_base ||
 	    a + (unsigned long)log_base > ULONG_MAX)
-		return -EFAULT;
+		return 0;
 
 	return access_ok(VERIFY_WRITE, log_base + a,
 			 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);

^ permalink raw reply related

* Re: [patch 1/2] vhost: potential integer overflows
From: Al Viro @ 2010-10-11 17:26 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Michael S. Tsirkin, Juan Quintela, David S. Miller, Rusty Russell,
	kvm, virtualization, netdev, kernel-janitors
In-Reply-To: <20101011172256.GF5851@bicker>

On Mon, Oct 11, 2010 at 07:22:57PM +0200, Dan Carpenter wrote:
> I did an audit for potential integer overflows of values which get passed
> to access_ok() and here are the results.

FWIW, UINT_MAX is wrong here.  What you want is maximal size_t value.

> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index dd3d6f7..c2aa12c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
>  			struct vring_avail __user *avail,
>  			struct vring_used __user *used)
>  {
> +
> +	if (num > UINT_MAX / sizeof *desc)
> +		return 0;
> +	if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
> +		return 0;
> +	if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
> +		return 0;
> +
>  	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
>  	       access_ok(VERIFY_READ, avail,
>  			 sizeof *avail + num * sizeof *avail->ring) &&
> @@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
>  /* Caller should have vq mutex and device mutex */
>  static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
>  {
> +	if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
> +		return 0;
> +
>  	return vq_memory_access_ok(log_base, vq->dev->memory,
>  			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
>  		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
> @@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  			}
>  
>  			/* Also validate log access for used ring if enabled. */
> -			if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
> -			    !log_access_ok(vq->log_base, a.log_guest_addr,
> +			if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
> +				if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used) {
> +					r = -EINVAL;
> +					break;
> +				}
> +				if (!log_access_ok(vq->log_base, a.log_guest_addr,
>  					   sizeof *vq->used +
>  					   vq->num * sizeof *vq->used->ring)) {
> -				r = -EINVAL;
> -				break;
> +					r = -EINVAL;
> +					break;
> +				}
>  			}
>  		}
>  
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: Phil Sutter @ 2010-10-11 17:29 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, johann.baudy
In-Reply-To: <20101011.090153.226774563.davem@davemloft.net>

On Mon, Oct 11, 2010 at 09:01:53AM -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 11 Oct 2010 16:03:02 +0200
> 
> > If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> > frames for MTU=1500
> > 
> > Should we really care ?
> > 
> > If not, just do
> > 
> > reserve = dev->hard_header_len + VLAN_HLEN;
> 
> Thats a good point, I think we need to validate the SKB protocol
> field.

Which is set to the value of the passed struct sockaddr_ll field
sll_protocol. At least in the two userspace code samples I have here,
the later field is set to htons(ETH_P_ALL). So unless one changes the
API, the only way to find out the packet type is to actually parse the
given ethernet header.

Since tpacket_rcv() just interprets the vlan_tci skb field, such
detailed packet inspection is otherwise not done in af_packet.c.

Greetings, Phil

^ permalink raw reply

* Re: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: David Miller @ 2010-10-11 17:48 UTC (permalink / raw)
  To: vladz; +Cc: bhutchings, dmitry, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDEE428246@SJEXCHCCR02.corp.ad.broadcom.com>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Mon, 11 Oct 2010 10:06:19 -0700

> Dave, we will respin this patch series.

Ok, thanks guys.

^ permalink raw reply

* Re: [PATCH 1/1] ATM: solos-pci, remove use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
  To: eric.dumazet
  Cc: jslaby, netdev, linux-atm-general, linux-kernel, jirislaby, chas
In-Reply-To: <1286785169.2737.3.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 10:19:29 +0200

> Le lundi 11 octobre 2010 à 09:50 +0200, Jiri Slaby a écrit :
>> Stanse found we do in console_show:
>>   kfree_skb(skb);
>>   return skb->len;
>> which is not good. Fix that by remembering the len and use it in the
>> function instead.
>> 
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>> Cc: Chas Williams <chas@cmf.nrl.navy.mil>
>> ---
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] NET: wimax, fix use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
  To: inaky.perez-gonzalez; +Cc: jslaby, netdev, linux-kernel, jirislaby, linux-wimax
In-Reply-To: <1286815606.21592.11.camel@localhost.localdomain>

From: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Date: Mon, 11 Oct 2010 09:46:46 -0700

> On Mon, 2010-10-11 at 02:26 -0700, Jiri Slaby wrote: 
>> Stanse found that i2400m_rx frees skb, but still uses skb->len even
>> though it has skb_len defined. So use skb_len properly in the code.
>> 
>> And also define it unsinged int rather than size_t to solve
>> compilation warnings.
>> 
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
> 
> Ops, fail. Thanks for the catch. I assume you have compile tested it.
> 
> Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/1] ATM: mpc, fix use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jslaby, netdev, linux-atm-general, linux-kernel, jirislaby
In-Reply-To: <1286787580.2737.4.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 10:59:40 +0200

> Le lundi 11 octobre 2010 à 10:46 +0200, Jiri Slaby a écrit :
>> Stanse found that mpc_push frees skb and then it dereferences it. It
>> is a typo, new_skb should be dereferenced there.
>> 
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
 ...
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/3] ATM: iphase, remove sleep-inside-atomic
From: David Miller @ 2010-10-11 18:13 UTC (permalink / raw)
  To: jslaby; +Cc: netdev, linux-kernel, jirislaby, chas
In-Reply-To: <1286789218-13976-2-git-send-email-jslaby@suse.cz>

From: Jiri Slaby <jslaby@suse.cz>
Date: Mon, 11 Oct 2010 11:26:57 +0200

> Stanse found that ia_init_one locks a spinlock and inside of that it
> calls ia_start which calls:
> * request_irq
> * tx_init which does kmalloc(GFP_KERNEL)
> 
> Both of them can thus sleep and result in a deadlock. I don't see a
> reason to have a per-device spinlock there which is used only there
> and inited right before the lock location. So remove it completely.
> 
> Signed-off-by: Jiri Slaby <jslaby@suse.cz>

Applied.

^ permalink raw reply

* Re: [PATCH 1/3] NET: pch, fix use after free
From: David Miller @ 2010-10-11 18:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jslaby, netdev, linux-kernel, jirislaby, masa-korg
In-Reply-To: <1286789918.2737.8.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 11:38:38 +0200

> Le lundi 11 octobre 2010 à 11:26 +0200, Jiri Slaby a écrit :
>> Stanse found that pch_gbe_xmit_frame uses skb after it is freed. Fix
>> that.
>> 
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>> Cc: Masayuki Ohtake <masa-korg@dsn.okisemi.com>
>> ---
>>  drivers/net/pch_gbe/pch_gbe_main.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> Applicable to net-next-2.6 only, this driver is not yet in Linus tree
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

I'll apply this to net-next-2.6, thanks.

^ permalink raw reply

* Re: NFS & atl1c : "RPC: multiple fragments per record not supported"
From: J. Bruce Fields @ 2010-10-11 18:16 UTC (permalink / raw)
  To: Phil Endecott; +Cc: Linux Kernel Mailing List, Linux NFS Mailing List, netdev
In-Reply-To: <1286722097803@dmwebmail.dmwebmail.chezphil.org>

On Sun, Oct 10, 2010 at 03:48:17PM +0100, Phil Endecott wrote:
> Dear Experts,
> 
> I am seeing the error "RPC: multiple fragments per record not
> supported" on my NFS server when an NFS client with an atl1c network
> driver talks to it.
> 
> The server is a QNAP TS119 ARM box running Debian's 2.6.33.2 kernel.
> It works reliably with other clients.
> 
> The client is a new x86 system with an "Atheros Communications
> AR8131 Gigabit Ethernet (rev c0)" (1969:1063).  The kernel is
> Debian's 2.6.32-5-686 and the driver seems to be atl1c.

To my knowledge the Linux client has never sent packets that would
trigger the prink above, so off hand it does sound like some sort of
corruption at the network level.

(Independently of that: we should fix the server to support multiple
fragments per record at some point.  But if you hadn't hit that printk,
I'm guessing you would have had a failure soon enough anyway.)

> Typically NFS works for a few seconds and then stops, with that
> message repeated on the server.  Other network activity seems
> reliable (e.g. HTTP, ssh, etc.)
> 
> If I use a USB-ethernet adaptor instead of the built-in gigabit it
> works reliably.  (The USB device is not gigabit, but I do still see
> the problems if I limit the port to 100 Mbit on the switch.)
> 
> I see the problem with NFS v3 and v4.  However, I only see it with
> proto=tcp.  By changing the NFS protocol to UDP, the problem seems
> to go away [well, it has been working for about 20 minutes now
> without any issues].
> 
> Google finds a previous report here:
> http://lkml.org/lkml/2010/1/20/198 ; the suggestion is to turn off
> tcp segmentation offload, but it seems that this is not possible
> with my system:
> 
> # ethtool -K eth0 tso off
> Cannot set device tcp segmentation offload settings: Operation not supported
> 
> I have looked at the changes to atl1c since 2.6.32 (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=drivers/net/atl1c;h=53cd10d07d040b7bec957acb1c69bc7b44897e69;hb=HEAD)
> and they seem harmless.
> 
> I wiresharked the network activity while this error was being shown,
> and it did include some packets with the high-contrast colour
> schemes that wireshark uses for "bad" packets.  Unfortunately my
> laptop ran out of battery before I could decipher these packets
> further.
> 
> So, is this a known issue?  Do people agree that the atl1c driver is
> most likely the culprit?  Can I offer any further debugging?

I haven't seen that before.  Adding netdev to the cc:, as you seem to
have reasonable evidence that the problem is the network driver.

--b.

^ permalink raw reply

* Re: [PATCH] atl1c: Add support for Atheros AR8152 and AR8152
From: Luis R. Rodriguez @ 2010-10-11 18:48 UTC (permalink / raw)
  To: David Miller
  Cc: ben@decadent.org.uk, Luis Rodriguez, netdev@vger.kernel.org,
	Jie Yang, linux-team
In-Reply-To: <20101010.210304.71107535.davem@davemloft.net>

On Sun, Oct 10, 2010 at 09:03:04PM -0700, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Mon, 11 Oct 2010 02:18:50 +0100
> 
> > Your commit 496c185c9495629ef1c65387cb2594578393cfe0 "atl1c: Add support
> > for Atheros AR8152 and AR8152" included the following changes:
>  ...
> >> +		if (hw->nic_type == athr_l1c || hw->nic_type == athr_l2c_b) {
>  ...
> >> +	if ((hw->nic_type == athr_l1c || hw->nic_type == athr_l2c)) {
>  ...
> > Shouldn't the first if-statement use the same condition as the second
> > i.e. matching the previously-defined hardware types athr_l1c and
> > athr_l2c?
> 
> Yeah that definitely looks like a bug to me.

Good catch, unfortunatley I don't have the source code I used to port
this work the day I did this anymore locally, so adding 
Jie Yang who is actually our maintainer for this driver.

Jie, can you please confirm if this patch is correct?

diff --git a/drivers/net/atl1c/atl1c_hw.c b/drivers/net/atl1c/atl1c_hw.c
index d8501f0..0a7b786 100644
--- a/drivers/net/atl1c/atl1c_hw.c
+++ b/drivers/net/atl1c/atl1c_hw.c
@@ -132,7 +132,7 @@ static int atl1c_get_permanent_address(struct atl1c_hw *hw)
 			return -1;
 	}
 	/* Disable OTP_CLK */
-	if ((hw->nic_type == athr_l1c || hw->nic_type == athr_l2c)) {
+	if ((hw->nic_type == athr_l1c || hw->nic_type == athr_l2c_b)) {
 		otp_ctrl_data &= ~OTP_CTRL_CLK_EN;
 		AT_WRITE_REG(hw, REG_OTP_CTRL, otp_ctrl_data);
 		msleep(1);

  Luis

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox