Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/1] ATM: mpc, fix use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jslaby, netdev, linux-atm-general, linux-kernel, jirislaby
In-Reply-To: <1286787580.2737.4.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 10:59:40 +0200

> Le lundi 11 octobre 2010 à 10:46 +0200, Jiri Slaby a écrit :
>> Stanse found that mpc_push frees skb and then it dereferences it. It
>> is a typo, new_skb should be dereferenced there.
>> 
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
 ...
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] NET: wimax, fix use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
  To: inaky.perez-gonzalez; +Cc: jslaby, netdev, linux-kernel, jirislaby, linux-wimax
In-Reply-To: <1286815606.21592.11.camel@localhost.localdomain>

From: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Date: Mon, 11 Oct 2010 09:46:46 -0700

> On Mon, 2010-10-11 at 02:26 -0700, Jiri Slaby wrote: 
>> Stanse found that i2400m_rx frees skb, but still uses skb->len even
>> though it has skb_len defined. So use skb_len properly in the code.
>> 
>> And also define it unsinged int rather than size_t to solve
>> compilation warnings.
>> 
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
> 
> Ops, fail. Thanks for the catch. I assume you have compile tested it.
> 
> Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/1] ATM: solos-pci, remove use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
  To: eric.dumazet
  Cc: jslaby, netdev, linux-atm-general, linux-kernel, jirislaby, chas
In-Reply-To: <1286785169.2737.3.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 10:19:29 +0200

> Le lundi 11 octobre 2010 à 09:50 +0200, Jiri Slaby a écrit :
>> Stanse found we do in console_show:
>>   kfree_skb(skb);
>>   return skb->len;
>> which is not good. Fix that by remembering the len and use it in the
>> function instead.
>> 
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>> Cc: Chas Williams <chas@cmf.nrl.navy.mil>
>> ---
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: David Miller @ 2010-10-11 17:48 UTC (permalink / raw)
  To: vladz; +Cc: bhutchings, dmitry, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDEE428246@SJEXCHCCR02.corp.ad.broadcom.com>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Mon, 11 Oct 2010 10:06:19 -0700

> Dave, we will respin this patch series.

Ok, thanks guys.

^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: Phil Sutter @ 2010-10-11 17:29 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, johann.baudy
In-Reply-To: <20101011.090153.226774563.davem@davemloft.net>

On Mon, Oct 11, 2010 at 09:01:53AM -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 11 Oct 2010 16:03:02 +0200
> 
> > If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> > frames for MTU=1500
> > 
> > Should we really care ?
> > 
> > If not, just do
> > 
> > reserve = dev->hard_header_len + VLAN_HLEN;
> 
> Thats a good point, I think we need to validate the SKB protocol
> field.

Which is set to the value of the passed struct sockaddr_ll field
sll_protocol. At least in the two userspace code samples I have here,
the later field is set to htons(ETH_P_ALL). So unless one changes the
API, the only way to find out the packet type is to actually parse the
given ethernet header.

Since tpacket_rcv() just interprets the vlan_tci skb field, such
detailed packet inspection is otherwise not done in af_packet.c.

Greetings, Phil

^ permalink raw reply

* Re: [patch 1/2] vhost: potential integer overflows
From: Al Viro @ 2010-10-11 17:26 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Michael S. Tsirkin, Juan Quintela, David S. Miller, Rusty Russell,
	kvm, virtualization, netdev, kernel-janitors
In-Reply-To: <20101011172256.GF5851@bicker>

On Mon, Oct 11, 2010 at 07:22:57PM +0200, Dan Carpenter wrote:
> I did an audit for potential integer overflows of values which get passed
> to access_ok() and here are the results.

FWIW, UINT_MAX is wrong here.  What you want is maximal size_t value.

> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index dd3d6f7..c2aa12c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
>  			struct vring_avail __user *avail,
>  			struct vring_used __user *used)
>  {
> +
> +	if (num > UINT_MAX / sizeof *desc)
> +		return 0;
> +	if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
> +		return 0;
> +	if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
> +		return 0;
> +
>  	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
>  	       access_ok(VERIFY_READ, avail,
>  			 sizeof *avail + num * sizeof *avail->ring) &&
> @@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
>  /* Caller should have vq mutex and device mutex */
>  static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
>  {
> +	if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
> +		return 0;
> +
>  	return vq_memory_access_ok(log_base, vq->dev->memory,
>  			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
>  		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
> @@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  			}
>  
>  			/* Also validate log access for used ring if enabled. */
> -			if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
> -			    !log_access_ok(vq->log_base, a.log_guest_addr,
> +			if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
> +				if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used) {
> +					r = -EINVAL;
> +					break;
> +				}
> +				if (!log_access_ok(vq->log_base, a.log_guest_addr,
>  					   sizeof *vq->used +
>  					   vq->num * sizeof *vq->used->ring)) {
> -				r = -EINVAL;
> -				break;
> +					r = -EINVAL;
> +					break;
> +				}
>  			}
>  		}
>  
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [patch 2/2] vhost: fix return code for log_access_ok()
From: Dan Carpenter @ 2010-10-11 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
	virtualization, netdev, kernel-janitors

access_ok() returns 1 if it's OK otherwise it should return 0.

Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c2aa12c..f82fe57 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -371,7 +371,7 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 	/* Make sure 64 bit math will not overflow. */
 	if (a > ULONG_MAX - (unsigned long)log_base ||
 	    a + (unsigned long)log_base > ULONG_MAX)
-		return -EFAULT;
+		return 0;
 
 	return access_ok(VERIFY_WRITE, log_base + a,
 			 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);

^ permalink raw reply related

* [patch 1/2] vhost: potential integer overflows
From: Dan Carpenter @ 2010-10-11 17:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
	virtualization, netdev, kernel-janitors

I did an audit for potential integer overflows of values which get passed
to access_ok() and here are the results.

Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dd3d6f7..c2aa12c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
 			struct vring_avail __user *avail,
 			struct vring_used __user *used)
 {
+
+	if (num > UINT_MAX / sizeof *desc)
+		return 0;
+	if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
+		return 0;
+	if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
+		return 0;
+
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
 			 sizeof *avail + num * sizeof *avail->ring) &&
@@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 /* Caller should have vq mutex and device mutex */
 static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 {
+	if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
+		return 0;
+
 	return vq_memory_access_ok(log_base, vq->dev->memory,
 			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
@@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 			}
 
 			/* Also validate log access for used ring if enabled. */
-			if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
-			    !log_access_ok(vq->log_base, a.log_guest_addr,
+			if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
+				if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used) {
+					r = -EINVAL;
+					break;
+				}
+				if (!log_access_ok(vq->log_base, a.log_guest_addr,
 					   sizeof *vq->used +
 					   vq->num * sizeof *vq->used->ring)) {
-				r = -EINVAL;
-				break;
+					r = -EINVAL;
+					break;
+				}
 			}
 		}
 

^ permalink raw reply related

* RE: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: Vladislav Zolotarov @ 2010-10-11 17:06 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, Dmitry Kravkov, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <1286805633.2349.64.camel@achroite.uk.solarflarecom.com>



> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Monday, October 11, 2010 4:01 PM
> To: Vladislav Zolotarov
> Cc: David Miller; Dmitry Kravkov; netdev@vger.kernel.org; Eilon
> Greenstein
> Subject: RE: [PATCH net-next 2/5] bnx2x: save cycles in setting
> gso_size
> 
> On Mon, 2010-10-11 at 01:53 -0700, Vladislav Zolotarov wrote:
> [...]
> > Dave, it's a gSo_size, not a gRo_size and we are handling an LRO skb
> > here. It's quite confusing, I agree... ;) This is currently the way
> > to mark an LRO skb in order to properly handle the LRO skbs that
> > might still be forwarded to the stack short time after the LRO has
> > been disabled due to enabling the IP forwarding (or if there is a
> > bug in a driver).
> > See the skb_warn_if_lro() below:
> >
> > static inline bool skb_warn_if_lro(const struct sk_buff *skb)
> > {
> > 	/* LRO sets gso_size but not gso_type, whereas if GSO is really
> > 	 * wanted then gso_type will be set. */
> > 	struct skb_shared_info *shinfo = skb_shinfo(skb);
> > 	if (skb_is_nonlinear(skb) && shinfo->gso_size != 0 &&
> > 	    unlikely(shinfo->gso_type == 0)) {
> > 		__skb_warn_lro_forwarding(skb);
> > 		return true;
> > 	}
> > 	return false;
> > }
> >
> > So, the convention is to set the gSo_size to a none-zero value
> leaving
> > the gSo_type zero for an LRO skbs. It's quite hacky but this is what
> > we have for quite a while and it quite worked so far. ;) To make it
> > cleaner we might have done the following:
> [...]
> 
> The requirement (or as you call it, "convention") is to set gso_size to
> the observed MSS of the packets that have been combined.  If you don't
> do that then TCP will not know the true number of packets for purposes
> of delayed-ACK etc.

Thanks, Ben. I see that now. 
Dave, we will respin this patch series.

Thanks,
vlad

> 
> Ben.
> 
> --
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 


^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Michael Tokarev @ 2010-10-11 16:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: avagin, David Miller, avagin, stable, netdev, krkumar2
In-Reply-To: <1286814652.2737.41.camel@edumazet-laptop>

11.10.2010 20:30, Eric Dumazet wrote:
> Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
>> On 10/11/2010 07:59 PM, David Miller wrote:
>>> From: Eric Dumazet<eric.dumazet@gmail.com>
>>> Date: Mon, 11 Oct 2010 17:46:49 +0200
>>>
>>>> This patch was an optimization, not a bug fix.
>>> Right, this has no business going into 2.6.32-stable at all.
>> This is bug fix. Now nobody drops dst in case gso and veth, because the 
>> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
>> from the veth.c. We should commit my patch or revert commit 60df914e.
>>
>> We have two bug reports:
>>
>> http://www.spinics.net/lists/netdev/msg142104.html

This is korg#17251: https://bugzilla.kernel.org/show_bug.cgi?id=17251 ,
from me.

But I really wonder how that thing does not happen anymore
when I disabled netfilter hooks...  I can't experiment till
weekend, but I'll try to get back to it and re-verify again,
with and without this fix, with and without netfilter hooks.

What I know for sure is 2 facts: I can't trigger the problem
now (with the hooks disabled), and I can't trigger it on a
subsequent kernel releases - e.g. 2.6.35 does not have the
issue, but 2.6.32 has.

>> http://bugzilla.openvz.org/show_bug.cgi?id=1634

Thanks!

/mjt

^ permalink raw reply

* Re: [PATCH 3/3] NET: wimax, fix use after free
From: Inaky Perez-Gonzalez @ 2010-10-11 16:46 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, jirislaby@gmail.com, linux-wimax
In-Reply-To: <1286789218-13976-3-git-send-email-jslaby@suse.cz>

On Mon, 2010-10-11 at 02:26 -0700, Jiri Slaby wrote: 
> Stanse found that i2400m_rx frees skb, but still uses skb->len even
> though it has skb_len defined. So use skb_len properly in the code.
> 
> And also define it unsinged int rather than size_t to solve
> compilation warnings.
> 
> Signed-off-by: Jiri Slaby <jslaby@suse.cz>

Ops, fail. Thanks for the catch. I assume you have compile tested it.

Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>

> Cc: linux-wimax@intel.com
> ---
>  drivers/net/wimax/i2400m/rx.c |   26 +++++++++++++-------------
>  1 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
> index c4876d0..844133b 100644
> --- a/drivers/net/wimax/i2400m/rx.c
> +++ b/drivers/net/wimax/i2400m/rx.c
> @@ -1244,16 +1244,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	int i, result;
>  	struct device *dev = i2400m_dev(i2400m);
>  	const struct i2400m_msg_hdr *msg_hdr;
> -	size_t pl_itr, pl_size, skb_len;
> +	size_t pl_itr, pl_size;
>  	unsigned long flags;
> -	unsigned num_pls, single_last;
> +	unsigned num_pls, single_last, skb_len;
>  
>  	skb_len = skb->len;
> -	d_fnstart(4, dev, "(i2400m %p skb %p [size %zu])\n",
> +	d_fnstart(4, dev, "(i2400m %p skb %p [size %u])\n",
>  		  i2400m, skb, skb_len);
>  	result = -EIO;
>  	msg_hdr = (void *) skb->data;
> -	result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb->len);
> +	result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb_len);
>  	if (result < 0)
>  		goto error_msg_hdr_check;
>  	result = -EIO;
> @@ -1261,10 +1261,10 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	pl_itr = sizeof(*msg_hdr) +	/* Check payload descriptor(s) */
>  		num_pls * sizeof(msg_hdr->pld[0]);
>  	pl_itr = ALIGN(pl_itr, I2400M_PL_ALIGN);
> -	if (pl_itr > skb->len) {	/* got all the payload descriptors? */
> +	if (pl_itr > skb_len) {	/* got all the payload descriptors? */
>  		dev_err(dev, "RX: HW BUG? message too short (%u bytes) for "
>  			"%u payload descriptors (%zu each, total %zu)\n",
> -			skb->len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
> +			skb_len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
>  		goto error_pl_descr_short;
>  	}
>  	/* Walk each payload payload--check we really got it */
> @@ -1272,7 +1272,7 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  		/* work around old gcc warnings */
>  		pl_size = i2400m_pld_size(&msg_hdr->pld[i]);
>  		result = i2400m_rx_pl_descr_check(i2400m, &msg_hdr->pld[i],
> -						  pl_itr, skb->len);
> +						  pl_itr, skb_len);
>  		if (result < 0)
>  			goto error_pl_descr_check;
>  		single_last = num_pls == 1 || i == num_pls - 1;
> @@ -1290,16 +1290,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
>  	if (i < i2400m->rx_pl_min)
>  		i2400m->rx_pl_min = i;
>  	i2400m->rx_num++;
> -	i2400m->rx_size_acc += skb->len;
> -	if (skb->len < i2400m->rx_size_min)
> -		i2400m->rx_size_min = skb->len;
> -	if (skb->len > i2400m->rx_size_max)
> -		i2400m->rx_size_max = skb->len;
> +	i2400m->rx_size_acc += skb_len;
> +	if (skb_len < i2400m->rx_size_min)
> +		i2400m->rx_size_min = skb_len;
> +	if (skb_len > i2400m->rx_size_max)
> +		i2400m->rx_size_max = skb_len;
>  	spin_unlock_irqrestore(&i2400m->rx_lock, flags);
>  error_pl_descr_check:
>  error_pl_descr_short:
>  error_msg_hdr_check:
> -	d_fnend(4, dev, "(i2400m %p skb %p [size %zu]) = %d\n",
> +	d_fnend(4, dev, "(i2400m %p skb %p [size %u]) = %d\n",
>  		i2400m, skb, skb_len, result);
>  	return result;
>  }




^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Eric Dumazet @ 2010-10-11 16:30 UTC (permalink / raw)
  To: avagin; +Cc: David Miller, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <4CB33907.5060803@gmail.com>

Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
> On 10/11/2010 07:59 PM, David Miller wrote:
> > From: Eric Dumazet<eric.dumazet@gmail.com>
> > Date: Mon, 11 Oct 2010 17:46:49 +0200
> >
> >> This patch was an optimization, not a bug fix.
> > Right, this has no business going into 2.6.32-stable at all.
> This is bug fix. Now nobody drops dst in case gso and veth, because the 
> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
> from the veth.c. We should commit my patch or revert commit 60df914e.
> 
> We have two bug reports:
> 
> http://www.spinics.net/lists/netdev/msg142104.html
> 
> http://bugzilla.openvz.org/show_bug.cgi?id=1634
> 
> Taylan verified that the patch really fix his bug.

Now that makes sense ;)

This is always good to mention commit id of a bug origin.

Because reading your patch (changelog and content), it was not obvious
it fixed a bug.

Thanks !



^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Andrew Vagin @ 2010-10-11 16:19 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <20101011.085952.193718228.davem@davemloft.net>

  On 10/11/2010 07:59 PM, David Miller wrote:
> From: Eric Dumazet<eric.dumazet@gmail.com>
> Date: Mon, 11 Oct 2010 17:46:49 +0200
>
>> This patch was an optimization, not a bug fix.
> Right, this has no business going into 2.6.32-stable at all.
This is bug fix. Now nobody drops dst in case gso and veth, because the 
commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop 
from the veth.c. We should commit my patch or revert commit 60df914e.

We have two bug reports:

http://www.spinics.net/lists/netdev/msg142104.html

http://bugzilla.openvz.org/show_bug.cgi?id=1634

Taylan verified that the patch really fix his bug.

^ permalink raw reply

* Re: [PATCH net-next] neigh: speedup neigh_hh_init()
From: David Miller @ 2010-10-11 16:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286456008.2912.171.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 07 Oct 2010 14:53:28 +0200

> When a new dst is used to send a frame, neigh_resolve_output() tries to
> associate an struct hh_cache to this dst, calling neigh_hh_init() with
> the neigh rwlock write locked.
> 
> Most of the time, hh_cache is already known and linked into neighbour,
> so we find it and increment its refcount.
> 
> This patch changes the logic so that we call neigh_hh_init() with
> neighbour lock read locked only, so that fast path can be run in
> parallel by concurrent cpus.
> 
> This brings part of the speedup we got with commit c7d4426a98a5f
> (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
> removing one of the contention point that routers hit on multiqueue
> enabled machines.
> 
> Further improvements would need to use a seqlock instead of an rwlock to
> protect neigh->ha[], to not dirty neigh too often and remove two atomic
> ops.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Ok this patch assumes that neighbours are highly shared, which
is a reasonable thing to optimize for.

So, applied, thanks a lot!

BTW, I think you can RCU this thing.  Require that every change
to the 'hh' entry be done in a newly allocated entry.

Then the cmpxchg() on the hh pointer interlocks.

^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286813237.2737.37.camel@edumazet-laptop>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 498 bytes --]

On Mon, 11 Oct 2010, Eric Dumazet wrote:

> Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> > On Fri, 8 Oct 2010, Eric Dumazet wrote:
> >
> > > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > > small allocations.
> >
> > Those allocations will fallback to smaller allocs if the page allocator
> > has trouble satisfying those requests.
> >
>
> Interesting, do you have an idea when this feature was added ?

A couple of years ago.

^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Eric Dumazet @ 2010-10-11 16:07 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <alpine.DEB.2.00.1010111102390.11247@router.home>

Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> On Fri, 8 Oct 2010, Eric Dumazet wrote:
> 
> > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > small allocations.
> 
> Those allocations will fallback to smaller allocs if the page allocator
> has trouble satisfying those requests.
> 

Interesting, do you have an idea when this feature was added ?

Thanks



^ permalink raw reply

* Re: [PATCH] net: introduce alloc_skb_order0
From: Eric Dumazet @ 2010-10-11 16:05 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <20101011155556.GA2431@redhat.com>

Le lundi 11 octobre 2010 à 17:55 +0200, Stanislaw Gruszka a écrit :
> On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> > Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> > 
> > > > Switch to SLAB -> no more problem ;)
> > > 
> > > yeh, I wish to, but fedora use SLUB because of some debugging
> > > capabilities. 
> > 
> > Yes, of course, I was kidding :)
> > 
> > echo 0 >/sys/kernel/slab/kmalloc-2048/order
> > echo 0 >/sys/kernel/slab/kmalloc-1024/order
> > echo 0 >/sys/kernel/slab/kmalloc-512/order
> > 
> > Should do the trick : No more high order allocations for MTU=1500
> > frames.
> 
> So the SLUB is great, but we need a patch to avoid using it :-)
> 
> > For MTU=9000 frames, we probably need something like this patch :
> >
> > Reception of big frames hit a memory allocation problem, because of high
> > order pages allocations (order-3 sometimes for MTU=9000). This patch
> > introduces alloc_skb_order0(), to build skbs with order-0 pages only.
> 
> I had never seen allocation problems in rtl8169_try_rx_copy or in any
> other driver rx path (except iwlwifi, but now this is solved by using
> skb_add_rx_frag), so I'm not sure if need this patch.
> 
> However I see other benefit of that patch. We save memory. Allocating
> for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
> + 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
> every allocation. With patch wastage would be about 2k per allocation
> (assuming 4kB and 8kB page size)
> 
> However I started this thread thinking about other memory wastage,
> in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
> from kmalloc-32768, almost 50% wastage.
>  

You cannot use my patch to avoid this waste. Really.

You have two different things in this driver :

1) Allocation of a physically continous 16Kbytes bloc for the rx-ring,
at device initialization (GFP_KERNEL OK here)

   Here, the only thing you could do is to not allocate real skbs but
only 16KB data blocs (no need for the sk_buf, only the ->data part), and
force copybreak for all incoming packets (remove the rx_copybreak
tunable)

2) Allocation of order0 skb to perform the copybreak in rx path.
(GFP_ATOMIC) : My patch.


> > +struct sk_buff *alloc_skb_order0(int pkt_size)
> > +{
> > +	int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> > +	struct sk_buff *skb;
> > +
> > +	skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> > +			GFP_ATOMIC | __GFP_NOWARN);
> > +	if (!skb)
> > +		return NULL;
> > +	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> > +	skb_put(skb, head);
> > +	pkt_size -= head;
> > +
> > +	skb->len += pkt_size;
> > +	skb->data_len += pkt_size;
> > +	skb->truesize += pkt_size;
> > +	while (pkt_size) {
> 
> if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
> 	goto error;

Not needed. A frame is < 16383 bytes, so _must_ fit in an skb,
(skb can hold up to 64 Kbytes)

> 
> > +		int i = skb_shinfo(skb)->nr_frags++;
> > +		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> > +		int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> > +		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> > +
> > +		if (!page)
> > +			goto error;
> > +		frag->page = page;
> > +		frag->size = fragsize;
> > +		frag->page_offset = 0;
> > +		pkt_size -= fragsize;
> > +	}
> > +	return skb;
> > +
> > +error:
> > +	kfree_skb(skb);
> > +	return NULL;	
> > +}
> > +EXPORT_SYMBOL(alloc_skb_order0);
> > +
> >  /* Checksum skb data. */
> >  
> >  __wsum skb_checksum(const struct sk_buff *skb, int offset,



^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286550247.2959.444.camel@edumazet-laptop>

On Fri, 8 Oct 2010, Eric Dumazet wrote:

> 8 in the <pagesperslab> column just says that : order-3 pages, even for
> small allocations.

Those allocations will fallback to smaller allocs if the page allocator
has trouble satisfying those requests.


^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: David Miller @ 2010-10-11 16:01 UTC (permalink / raw)
  To: eric.dumazet; +Cc: phil, netdev, johann.baudy
In-Reply-To: <1286805782.2737.25.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 16:03:02 +0200

> If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> frames for MTU=1500
> 
> Should we really care ?
> 
> If not, just do
> 
> reserve = dev->hard_header_len + VLAN_HLEN;

Thats a good point, I think we need to validate the SKB protocol
field.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: David Miller @ 2010-10-11 15:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <1286812009.2737.30.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 17:46:49 +0200

> This patch was an optimization, not a bug fix.

Right, this has no business going into 2.6.32-stable at all.

^ permalink raw reply

* Re: [PATCH] net: introduce alloc_skb_order0
From: Stanislaw Gruszka @ 2010-10-11 15:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <1286639996.2692.148.camel@edumazet-laptop>

On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> 
> > > Switch to SLAB -> no more problem ;)
> > 
> > yeh, I wish to, but fedora use SLUB because of some debugging
> > capabilities. 
> 
> Yes, of course, I was kidding :)
> 
> echo 0 >/sys/kernel/slab/kmalloc-2048/order
> echo 0 >/sys/kernel/slab/kmalloc-1024/order
> echo 0 >/sys/kernel/slab/kmalloc-512/order
> 
> Should do the trick : No more high order allocations for MTU=1500
> frames.

So the SLUB is great, but we need a patch to avoid using it :-)

> For MTU=9000 frames, we probably need something like this patch :
>
> Reception of big frames hit a memory allocation problem, because of high
> order pages allocations (order-3 sometimes for MTU=9000). This patch
> introduces alloc_skb_order0(), to build skbs with order-0 pages only.

I had never seen allocation problems in rtl8169_try_rx_copy or in any
other driver rx path (except iwlwifi, but now this is solved by using
skb_add_rx_frag), so I'm not sure if need this patch.

However I see other benefit of that patch. We save memory. Allocating
for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
+ 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
every allocation. With patch wastage would be about 2k per allocation
(assuming 4kB and 8kB page size)

However I started this thread thinking about other memory wastage,
in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
from kmalloc-32768, almost 50% wastage.
 
> +struct sk_buff *alloc_skb_order0(int pkt_size)
> +{
> +	int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> +	struct sk_buff *skb;
> +
> +	skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> +			GFP_ATOMIC | __GFP_NOWARN);
> +	if (!skb)
> +		return NULL;
> +	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> +	skb_put(skb, head);
> +	pkt_size -= head;
> +
> +	skb->len += pkt_size;
> +	skb->data_len += pkt_size;
> +	skb->truesize += pkt_size;
> +	while (pkt_size) {

if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
	goto error;

> +		int i = skb_shinfo(skb)->nr_frags++;
> +		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> +		int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> +		struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> +		if (!page)
> +			goto error;
> +		frag->page = page;
> +		frag->size = fragsize;
> +		frag->page_offset = 0;
> +		pkt_size -= fragsize;
> +	}
> +	return skb;
> +
> +error:
> +	kfree_skb(skb);
> +	return NULL;	
> +}
> +EXPORT_SYMBOL(alloc_skb_order0);
> +
>  /* Checksum skb data. */
>  
>  __wsum skb_checksum(const struct sk_buff *skb, int offset,

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Eric Dumazet @ 2010-10-11 15:46 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Andrey Vagin, stable, netdev, Krishna Kumar, David S. Miller
In-Reply-To: <4CB32FDB.4090001@msgid.tls.msk.ru>

Le lundi 11 octobre 2010 à 19:40 +0400, Michael Tokarev a écrit :
> Andrey Vagin wrote:
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> > 
> > commit 068a2de57ddf4f472e32e7af868613c574ad1d88 upstream.
> > 
> > Non-GSO code drops dst entry for performance reasons, but
> > the same is missing for GSO code. Drop dst while cache-hot
> > for GSO case too.
> > 
> > Note: Without this patch the kernel may oops if used bridged veth
> > devices. A bridge set skb->dst = fake_dst_ops, veth transfers this skb
> > to netif_receive_skb...ip_rcv_finish and it calls dst_input(skb), but
> > fake_dst_ops->input = NULL -> Oops
> 
> Hmm.  Isn't this the reason for my mysterious OOPSes (jump to
> NULL) which I concluded are due to stack overflow?  See f.e.
> http://www.spinics.net/lists/netdev/msg142104.html
> 
> This started happening when I updated virtio drivers in a windows
> virtual machne to the ones which supports GSO, and my config
> involves bridging veth devices, and this is where the prob
> actually occurs - when doing guest => virtio => tap => bridge => veth
> route....
> 
> Thanks!

This patch was an optimization, not a bug fix.

If something gets better, then a bug is somewhere else ?




^ permalink raw reply

* Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
From: Eric Dumazet @ 2010-10-11 15:42 UTC (permalink / raw)
  To: David Miller
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mst, mingo, herbert,
	jdike
In-Reply-To: <20101011.082728.193709462.davem@davemloft.net>

Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit :
> From: "Xin, Xiaohui" <xiaohui.xin@intel.com>
> Date: Mon, 11 Oct 2010 15:06:05 +0800
> 
> > That's to avoid the new cache miss caused by using destructor_arg in data path
> > like skb_release_data().
> > That's based on the comment from Eric Dumazet on v7 patches.
> 
> Thanks for the explanation.

Anyway, frags[] must be the last field of "struct skb_shared_info"
since commit fed66381 (net: pskb_expand_head() optimization)

It seems Xin worked on a quite old tree.

^ permalink raw reply

* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Michael Tokarev @ 2010-10-11 15:40 UTC (permalink / raw)
  To: Andrey Vagin; +Cc: stable, netdev, Krishna Kumar, David S. Miller
In-Reply-To: <1286810413-30238-1-git-send-email-avagin@openvz.org>

Andrey Vagin wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> commit 068a2de57ddf4f472e32e7af868613c574ad1d88 upstream.
> 
> Non-GSO code drops dst entry for performance reasons, but
> the same is missing for GSO code. Drop dst while cache-hot
> for GSO case too.
> 
> Note: Without this patch the kernel may oops if used bridged veth
> devices. A bridge set skb->dst = fake_dst_ops, veth transfers this skb
> to netif_receive_skb...ip_rcv_finish and it calls dst_input(skb), but
> fake_dst_ops->input = NULL -> Oops

Hmm.  Isn't this the reason for my mysterious OOPSes (jump to
NULL) which I concluded are due to stack overflow?  See f.e.
http://www.spinics.net/lists/netdev/msg142104.html

This started happening when I updated virtio drivers in a windows
virtual machne to the ones which supports GSO, and my config
involves bridging veth devices, and this is where the prob
actually occurs - when doing guest => virtio => tap => bridge => veth
route....

Thanks!

/mjt

^ permalink raw reply

* Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
From: David Miller @ 2010-10-11 15:27 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, herbert, jdike
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D793064B2A9D8@shsmsx502.ccr.corp.intel.com>

From: "Xin, Xiaohui" <xiaohui.xin@intel.com>
Date: Mon, 11 Oct 2010 15:06:05 +0800

> That's to avoid the new cache miss caused by using destructor_arg in data path
> like skb_release_data().
> That's based on the comment from Eric Dumazet on v7 patches.

Thanks for the explanation.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox