* Re: [PATCH 1/1] ATM: mpc, fix use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
To: eric.dumazet; +Cc: jslaby, netdev, linux-atm-general, linux-kernel, jirislaby
In-Reply-To: <1286787580.2737.4.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 10:59:40 +0200
> Le lundi 11 octobre 2010 à 10:46 +0200, Jiri Slaby a écrit :
>> Stanse found that mpc_push frees skb and then it dereferences it. It
>> is a typo, new_skb should be dereferenced there.
>>
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
...
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH 3/3] NET: wimax, fix use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
To: inaky.perez-gonzalez; +Cc: jslaby, netdev, linux-kernel, jirislaby, linux-wimax
In-Reply-To: <1286815606.21592.11.camel@localhost.localdomain>
From: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Date: Mon, 11 Oct 2010 09:46:46 -0700
> On Mon, 2010-10-11 at 02:26 -0700, Jiri Slaby wrote:
>> Stanse found that i2400m_rx frees skb, but still uses skb->len even
>> though it has skb_len defined. So use skb_len properly in the code.
>>
>> And also define it unsinged int rather than size_t to solve
>> compilation warnings.
>>
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>
> Ops, fail. Thanks for the catch. I assume you have compile tested it.
>
> Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Applied.
^ permalink raw reply
* Re: [PATCH 1/1] ATM: solos-pci, remove use after free
From: David Miller @ 2010-10-11 18:12 UTC (permalink / raw)
To: eric.dumazet
Cc: jslaby, netdev, linux-atm-general, linux-kernel, jirislaby, chas
In-Reply-To: <1286785169.2737.3.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 10:19:29 +0200
> Le lundi 11 octobre 2010 à 09:50 +0200, Jiri Slaby a écrit :
>> Stanse found we do in console_show:
>> kfree_skb(skb);
>> return skb->len;
>> which is not good. Fix that by remembering the len and use it in the
>> function instead.
>>
>> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
>> Cc: Chas Williams <chas@cmf.nrl.navy.mil>
>> ---
>
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: David Miller @ 2010-10-11 17:48 UTC (permalink / raw)
To: vladz; +Cc: bhutchings, dmitry, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDEE428246@SJEXCHCCR02.corp.ad.broadcom.com>
From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Mon, 11 Oct 2010 10:06:19 -0700
> Dave, we will respin this patch series.
Ok, thanks guys.
^ permalink raw reply
* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: Phil Sutter @ 2010-10-11 17:29 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev, johann.baudy
In-Reply-To: <20101011.090153.226774563.davem@davemloft.net>
On Mon, Oct 11, 2010 at 09:01:53AM -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 11 Oct 2010 16:03:02 +0200
>
> > If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> > frames for MTU=1500
> >
> > Should we really care ?
> >
> > If not, just do
> >
> > reserve = dev->hard_header_len + VLAN_HLEN;
>
> Thats a good point, I think we need to validate the SKB protocol
> field.
Which is set to the value of the passed struct sockaddr_ll field
sll_protocol. At least in the two userspace code samples I have here,
the later field is set to htons(ETH_P_ALL). So unless one changes the
API, the only way to find out the packet type is to actually parse the
given ethernet header.
Since tpacket_rcv() just interprets the vlan_tci skb field, such
detailed packet inspection is otherwise not done in af_packet.c.
Greetings, Phil
^ permalink raw reply
* Re: [patch 1/2] vhost: potential integer overflows
From: Al Viro @ 2010-10-11 17:26 UTC (permalink / raw)
To: Dan Carpenter
Cc: Michael S. Tsirkin, Juan Quintela, David S. Miller, Rusty Russell,
kvm, virtualization, netdev, kernel-janitors
In-Reply-To: <20101011172256.GF5851@bicker>
On Mon, Oct 11, 2010 at 07:22:57PM +0200, Dan Carpenter wrote:
> I did an audit for potential integer overflows of values which get passed
> to access_ok() and here are the results.
FWIW, UINT_MAX is wrong here. What you want is maximal size_t value.
> Signed-off-by: Dan Carpenter <error27@gmail.com>
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index dd3d6f7..c2aa12c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
> struct vring_avail __user *avail,
> struct vring_used __user *used)
> {
> +
> + if (num > UINT_MAX / sizeof *desc)
> + return 0;
> + if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
> + return 0;
> + if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
> + return 0;
> +
> return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
> access_ok(VERIFY_READ, avail,
> sizeof *avail + num * sizeof *avail->ring) &&
> @@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
> /* Caller should have vq mutex and device mutex */
> static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
> {
> + if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
> + return 0;
> +
> return vq_memory_access_ok(log_base, vq->dev->memory,
> vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
> (!vq->log_used || log_access_ok(log_base, vq->log_addr,
> @@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
> }
>
> /* Also validate log access for used ring if enabled. */
> - if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
> - !log_access_ok(vq->log_base, a.log_guest_addr,
> + if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
> + if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used) {
> + r = -EINVAL;
> + break;
> + }
> + if (!log_access_ok(vq->log_base, a.log_guest_addr,
> sizeof *vq->used +
> vq->num * sizeof *vq->used->ring)) {
> - r = -EINVAL;
> - break;
> + r = -EINVAL;
> + break;
> + }
> }
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [patch 2/2] vhost: fix return code for log_access_ok()
From: Dan Carpenter @ 2010-10-11 17:24 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
virtualization, netdev, kernel-janitors
access_ok() returns 1 if it's OK otherwise it should return 0.
Signed-off-by: Dan Carpenter <error27@gmail.com>
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c2aa12c..f82fe57 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -371,7 +371,7 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
- return -EFAULT;
+ return 0;
return access_ok(VERIFY_WRITE, log_base + a,
(sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
^ permalink raw reply related
* [patch 1/2] vhost: potential integer overflows
From: Dan Carpenter @ 2010-10-11 17:22 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Juan Quintela, David S. Miller, Rusty Russell, kvm,
virtualization, netdev, kernel-janitors
I did an audit for potential integer overflows of values which get passed
to access_ok() and here are the results.
Signed-off-by: Dan Carpenter <error27@gmail.com>
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dd3d6f7..c2aa12c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
struct vring_avail __user *avail,
struct vring_used __user *used)
{
+
+ if (num > UINT_MAX / sizeof *desc)
+ return 0;
+ if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
+ return 0;
+ if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
+ return 0;
+
return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
access_ok(VERIFY_READ, avail,
sizeof *avail + num * sizeof *avail->ring) &&
@@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
/* Caller should have vq mutex and device mutex */
static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
{
+ if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
+ return 0;
+
return vq_memory_access_ok(log_base, vq->dev->memory,
vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
(!vq->log_used || log_access_ok(log_base, vq->log_addr,
@@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
}
/* Also validate log access for used ring if enabled. */
- if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
- !log_access_ok(vq->log_base, a.log_guest_addr,
+ if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
+ if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used) {
+ r = -EINVAL;
+ break;
+ }
+ if (!log_access_ok(vq->log_base, a.log_guest_addr,
sizeof *vq->used +
vq->num * sizeof *vq->used->ring)) {
- r = -EINVAL;
- break;
+ r = -EINVAL;
+ break;
+ }
}
}
^ permalink raw reply related
* RE: [PATCH net-next 2/5] bnx2x: save cycles in setting gso_size
From: Vladislav Zolotarov @ 2010-10-11 17:06 UTC (permalink / raw)
To: Ben Hutchings
Cc: David Miller, Dmitry Kravkov, netdev@vger.kernel.org,
Eilon Greenstein
In-Reply-To: <1286805633.2349.64.camel@achroite.uk.solarflarecom.com>
> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Monday, October 11, 2010 4:01 PM
> To: Vladislav Zolotarov
> Cc: David Miller; Dmitry Kravkov; netdev@vger.kernel.org; Eilon
> Greenstein
> Subject: RE: [PATCH net-next 2/5] bnx2x: save cycles in setting
> gso_size
>
> On Mon, 2010-10-11 at 01:53 -0700, Vladislav Zolotarov wrote:
> [...]
> > Dave, it's a gSo_size, not a gRo_size and we are handling an LRO skb
> > here. It's quite confusing, I agree... ;) This is currently the way
> > to mark an LRO skb in order to properly handle the LRO skbs that
> > might still be forwarded to the stack short time after the LRO has
> > been disabled due to enabling the IP forwarding (or if there is a
> > bug in a driver).
> > See the skb_warn_if_lro() below:
> >
> > static inline bool skb_warn_if_lro(const struct sk_buff *skb)
> > {
> > /* LRO sets gso_size but not gso_type, whereas if GSO is really
> > * wanted then gso_type will be set. */
> > struct skb_shared_info *shinfo = skb_shinfo(skb);
> > if (skb_is_nonlinear(skb) && shinfo->gso_size != 0 &&
> > unlikely(shinfo->gso_type == 0)) {
> > __skb_warn_lro_forwarding(skb);
> > return true;
> > }
> > return false;
> > }
> >
> > So, the convention is to set the gSo_size to a none-zero value
> leaving
> > the gSo_type zero for an LRO skbs. It's quite hacky but this is what
> > we have for quite a while and it quite worked so far. ;) To make it
> > cleaner we might have done the following:
> [...]
>
> The requirement (or as you call it, "convention") is to set gso_size to
> the observed MSS of the packets that have been combined. If you don't
> do that then TCP will not know the true number of packets for purposes
> of delayed-ACK etc.
Thanks, Ben. I see that now.
Dave, we will respin this patch series.
Thanks,
vlad
>
> Ben.
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
^ permalink raw reply
* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Michael Tokarev @ 2010-10-11 16:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: avagin, David Miller, avagin, stable, netdev, krkumar2
In-Reply-To: <1286814652.2737.41.camel@edumazet-laptop>
11.10.2010 20:30, Eric Dumazet wrote:
> Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
>> On 10/11/2010 07:59 PM, David Miller wrote:
>>> From: Eric Dumazet<eric.dumazet@gmail.com>
>>> Date: Mon, 11 Oct 2010 17:46:49 +0200
>>>
>>>> This patch was an optimization, not a bug fix.
>>> Right, this has no business going into 2.6.32-stable at all.
>> This is bug fix. Now nobody drops dst in case gso and veth, because the
>> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop
>> from the veth.c. We should commit my patch or revert commit 60df914e.
>>
>> We have two bug reports:
>>
>> http://www.spinics.net/lists/netdev/msg142104.html
This is korg#17251: https://bugzilla.kernel.org/show_bug.cgi?id=17251 ,
from me.
But I really wonder how that thing does not happen anymore
when I disabled netfilter hooks... I can't experiment till
weekend, but I'll try to get back to it and re-verify again,
with and without this fix, with and without netfilter hooks.
What I know for sure is 2 facts: I can't trigger the problem
now (with the hooks disabled), and I can't trigger it on a
subsequent kernel releases - e.g. 2.6.35 does not have the
issue, but 2.6.32 has.
>> http://bugzilla.openvz.org/show_bug.cgi?id=1634
Thanks!
/mjt
^ permalink raw reply
* Re: [PATCH 3/3] NET: wimax, fix use after free
From: Inaky Perez-Gonzalez @ 2010-10-11 16:46 UTC (permalink / raw)
To: Jiri Slaby
Cc: davem@davemloft.net, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, jirislaby@gmail.com, linux-wimax
In-Reply-To: <1286789218-13976-3-git-send-email-jslaby@suse.cz>
On Mon, 2010-10-11 at 02:26 -0700, Jiri Slaby wrote:
> Stanse found that i2400m_rx frees skb, but still uses skb->len even
> though it has skb_len defined. So use skb_len properly in the code.
>
> And also define it unsinged int rather than size_t to solve
> compilation warnings.
>
> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Ops, fail. Thanks for the catch. I assume you have compile tested it.
Acked-by: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
> Cc: linux-wimax@intel.com
> ---
> drivers/net/wimax/i2400m/rx.c | 26 +++++++++++++-------------
> 1 files changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
> index c4876d0..844133b 100644
> --- a/drivers/net/wimax/i2400m/rx.c
> +++ b/drivers/net/wimax/i2400m/rx.c
> @@ -1244,16 +1244,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
> int i, result;
> struct device *dev = i2400m_dev(i2400m);
> const struct i2400m_msg_hdr *msg_hdr;
> - size_t pl_itr, pl_size, skb_len;
> + size_t pl_itr, pl_size;
> unsigned long flags;
> - unsigned num_pls, single_last;
> + unsigned num_pls, single_last, skb_len;
>
> skb_len = skb->len;
> - d_fnstart(4, dev, "(i2400m %p skb %p [size %zu])\n",
> + d_fnstart(4, dev, "(i2400m %p skb %p [size %u])\n",
> i2400m, skb, skb_len);
> result = -EIO;
> msg_hdr = (void *) skb->data;
> - result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb->len);
> + result = i2400m_rx_msg_hdr_check(i2400m, msg_hdr, skb_len);
> if (result < 0)
> goto error_msg_hdr_check;
> result = -EIO;
> @@ -1261,10 +1261,10 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
> pl_itr = sizeof(*msg_hdr) + /* Check payload descriptor(s) */
> num_pls * sizeof(msg_hdr->pld[0]);
> pl_itr = ALIGN(pl_itr, I2400M_PL_ALIGN);
> - if (pl_itr > skb->len) { /* got all the payload descriptors? */
> + if (pl_itr > skb_len) { /* got all the payload descriptors? */
> dev_err(dev, "RX: HW BUG? message too short (%u bytes) for "
> "%u payload descriptors (%zu each, total %zu)\n",
> - skb->len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
> + skb_len, num_pls, sizeof(msg_hdr->pld[0]), pl_itr);
> goto error_pl_descr_short;
> }
> /* Walk each payload payload--check we really got it */
> @@ -1272,7 +1272,7 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
> /* work around old gcc warnings */
> pl_size = i2400m_pld_size(&msg_hdr->pld[i]);
> result = i2400m_rx_pl_descr_check(i2400m, &msg_hdr->pld[i],
> - pl_itr, skb->len);
> + pl_itr, skb_len);
> if (result < 0)
> goto error_pl_descr_check;
> single_last = num_pls == 1 || i == num_pls - 1;
> @@ -1290,16 +1290,16 @@ int i2400m_rx(struct i2400m *i2400m, struct sk_buff *skb)
> if (i < i2400m->rx_pl_min)
> i2400m->rx_pl_min = i;
> i2400m->rx_num++;
> - i2400m->rx_size_acc += skb->len;
> - if (skb->len < i2400m->rx_size_min)
> - i2400m->rx_size_min = skb->len;
> - if (skb->len > i2400m->rx_size_max)
> - i2400m->rx_size_max = skb->len;
> + i2400m->rx_size_acc += skb_len;
> + if (skb_len < i2400m->rx_size_min)
> + i2400m->rx_size_min = skb_len;
> + if (skb_len > i2400m->rx_size_max)
> + i2400m->rx_size_max = skb_len;
> spin_unlock_irqrestore(&i2400m->rx_lock, flags);
> error_pl_descr_check:
> error_pl_descr_short:
> error_msg_hdr_check:
> - d_fnend(4, dev, "(i2400m %p skb %p [size %zu]) = %d\n",
> + d_fnend(4, dev, "(i2400m %p skb %p [size %u]) = %d\n",
> i2400m, skb, skb_len, result);
> return result;
> }
^ permalink raw reply
* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Eric Dumazet @ 2010-10-11 16:30 UTC (permalink / raw)
To: avagin; +Cc: David Miller, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <4CB33907.5060803@gmail.com>
Le lundi 11 octobre 2010 à 20:19 +0400, Andrew Vagin a écrit :
> On 10/11/2010 07:59 PM, David Miller wrote:
> > From: Eric Dumazet<eric.dumazet@gmail.com>
> > Date: Mon, 11 Oct 2010 17:46:49 +0200
> >
> >> This patch was an optimization, not a bug fix.
> > Right, this has no business going into 2.6.32-stable at all.
> This is bug fix. Now nobody drops dst in case gso and veth, because the
> commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop
> from the veth.c. We should commit my patch or revert commit 60df914e.
>
> We have two bug reports:
>
> http://www.spinics.net/lists/netdev/msg142104.html
>
> http://bugzilla.openvz.org/show_bug.cgi?id=1634
>
> Taylan verified that the patch really fix his bug.
Now that makes sense ;)
This is always good to mention commit id of a bug origin.
Because reading your patch (changelog and content), it was not obvious
it fixed a bug.
Thanks !
^ permalink raw reply
* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Andrew Vagin @ 2010-10-11 16:19 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <20101011.085952.193718228.davem@davemloft.net>
On 10/11/2010 07:59 PM, David Miller wrote:
> From: Eric Dumazet<eric.dumazet@gmail.com>
> Date: Mon, 11 Oct 2010 17:46:49 +0200
>
>> This patch was an optimization, not a bug fix.
> Right, this has no business going into 2.6.32-stable at all.
This is bug fix. Now nobody drops dst in case gso and veth, because the
commit 60df914e295a21a223e43a7ee01e0c73c64dd111 deletes skb_dst_drop
from the veth.c. We should commit my patch or revert commit 60df914e.
We have two bug reports:
http://www.spinics.net/lists/netdev/msg142104.html
http://bugzilla.openvz.org/show_bug.cgi?id=1634
Taylan verified that the patch really fix his bug.
^ permalink raw reply
* Re: [PATCH net-next] neigh: speedup neigh_hh_init()
From: David Miller @ 2010-10-11 16:18 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286456008.2912.171.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 07 Oct 2010 14:53:28 +0200
> When a new dst is used to send a frame, neigh_resolve_output() tries to
> associate an struct hh_cache to this dst, calling neigh_hh_init() with
> the neigh rwlock write locked.
>
> Most of the time, hh_cache is already known and linked into neighbour,
> so we find it and increment its refcount.
>
> This patch changes the logic so that we call neigh_hh_init() with
> neighbour lock read locked only, so that fast path can be run in
> parallel by concurrent cpus.
>
> This brings part of the speedup we got with commit c7d4426a98a5f
> (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
> removing one of the contention point that routers hit on multiqueue
> enabled machines.
>
> Further improvements would need to use a seqlock instead of an rwlock to
> protect neigh->ha[], to not dirty neigh too often and remove two atomic
> ops.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Ok this patch assumes that neighbours are highly shared, which
is a reasonable thing to optimize for.
So, applied, thanks a lot!
BTW, I think you can RCU this thing. Require that every change
to the 'hh' entry be done in a newly allocated entry.
Then the cmpxchg() on the hh pointer interlocks.
^ permalink raw reply
* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:14 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286813237.2737.37.camel@edumazet-laptop>
[-- Attachment #1: Type: TEXT/PLAIN, Size: 498 bytes --]
On Mon, 11 Oct 2010, Eric Dumazet wrote:
> Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> > On Fri, 8 Oct 2010, Eric Dumazet wrote:
> >
> > > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > > small allocations.
> >
> > Those allocations will fallback to smaller allocs if the page allocator
> > has trouble satisfying those requests.
> >
>
> Interesting, do you have an idea when this feature was added ?
A couple of years ago.
^ permalink raw reply
* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Eric Dumazet @ 2010-10-11 16:07 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <alpine.DEB.2.00.1010111102390.11247@router.home>
Le lundi 11 octobre 2010 à 11:03 -0500, Christoph Lameter a écrit :
> On Fri, 8 Oct 2010, Eric Dumazet wrote:
>
> > 8 in the <pagesperslab> column just says that : order-3 pages, even for
> > small allocations.
>
> Those allocations will fallback to smaller allocs if the page allocator
> has trouble satisfying those requests.
>
Interesting, do you have an idea when this feature was added ?
Thanks
^ permalink raw reply
* Re: [PATCH] net: introduce alloc_skb_order0
From: Eric Dumazet @ 2010-10-11 16:05 UTC (permalink / raw)
To: Stanislaw Gruszka; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <20101011155556.GA2431@redhat.com>
Le lundi 11 octobre 2010 à 17:55 +0200, Stanislaw Gruszka a écrit :
> On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> > Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> >
> > > > Switch to SLAB -> no more problem ;)
> > >
> > > yeh, I wish to, but fedora use SLUB because of some debugging
> > > capabilities.
> >
> > Yes, of course, I was kidding :)
> >
> > echo 0 >/sys/kernel/slab/kmalloc-2048/order
> > echo 0 >/sys/kernel/slab/kmalloc-1024/order
> > echo 0 >/sys/kernel/slab/kmalloc-512/order
> >
> > Should do the trick : No more high order allocations for MTU=1500
> > frames.
>
> So the SLUB is great, but we need a patch to avoid using it :-)
>
> > For MTU=9000 frames, we probably need something like this patch :
> >
> > Reception of big frames hit a memory allocation problem, because of high
> > order pages allocations (order-3 sometimes for MTU=9000). This patch
> > introduces alloc_skb_order0(), to build skbs with order-0 pages only.
>
> I had never seen allocation problems in rtl8169_try_rx_copy or in any
> other driver rx path (except iwlwifi, but now this is solved by using
> skb_add_rx_frag), so I'm not sure if need this patch.
>
> However I see other benefit of that patch. We save memory. Allocating
> for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
> + 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
> every allocation. With patch wastage would be about 2k per allocation
> (assuming 4kB and 8kB page size)
>
> However I started this thread thinking about other memory wastage,
> in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
> from kmalloc-32768, almost 50% wastage.
>
You cannot use my patch to avoid this waste. Really.
You have two different things in this driver :
1) Allocation of a physically continous 16Kbytes bloc for the rx-ring,
at device initialization (GFP_KERNEL OK here)
Here, the only thing you could do is to not allocate real skbs but
only 16KB data blocs (no need for the sk_buf, only the ->data part), and
force copybreak for all incoming packets (remove the rx_copybreak
tunable)
2) Allocation of order0 skb to perform the copybreak in rx path.
(GFP_ATOMIC) : My patch.
> > +struct sk_buff *alloc_skb_order0(int pkt_size)
> > +{
> > + int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> > + struct sk_buff *skb;
> > +
> > + skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> > + GFP_ATOMIC | __GFP_NOWARN);
> > + if (!skb)
> > + return NULL;
> > + skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> > + skb_put(skb, head);
> > + pkt_size -= head;
> > +
> > + skb->len += pkt_size;
> > + skb->data_len += pkt_size;
> > + skb->truesize += pkt_size;
> > + while (pkt_size) {
>
> if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
> goto error;
Not needed. A frame is < 16383 bytes, so _must_ fit in an skb,
(skb can hold up to 64 Kbytes)
>
> > + int i = skb_shinfo(skb)->nr_frags++;
> > + skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> > + int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> > + struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> > +
> > + if (!page)
> > + goto error;
> > + frag->page = page;
> > + frag->size = fragsize;
> > + frag->page_offset = 0;
> > + pkt_size -= fragsize;
> > + }
> > + return skb;
> > +
> > +error:
> > + kfree_skb(skb);
> > + return NULL;
> > +}
> > +EXPORT_SYMBOL(alloc_skb_order0);
> > +
> > /* Checksum skb data. */
> >
> > __wsum skb_checksum(const struct sk_buff *skb, int offset,
^ permalink raw reply
* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Christoph Lameter @ 2010-10-11 16:03 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Stanislaw Gruszka, Francois Romieu, netdev
In-Reply-To: <1286550247.2959.444.camel@edumazet-laptop>
On Fri, 8 Oct 2010, Eric Dumazet wrote:
> 8 in the <pagesperslab> column just says that : order-3 pages, even for
> small allocations.
Those allocations will fallback to smaller allocs if the page allocator
has trouble satisfying those requests.
^ permalink raw reply
* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: David Miller @ 2010-10-11 16:01 UTC (permalink / raw)
To: eric.dumazet; +Cc: phil, netdev, johann.baudy
In-Reply-To: <1286805782.2737.25.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 16:03:02 +0200
> If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> frames for MTU=1500
>
> Should we really care ?
>
> If not, just do
>
> reserve = dev->hard_header_len + VLAN_HLEN;
Thats a good point, I think we need to validate the SKB protocol
field.
^ permalink raw reply
* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: David Miller @ 2010-10-11 15:59 UTC (permalink / raw)
To: eric.dumazet; +Cc: mjt, avagin, stable, netdev, krkumar2
In-Reply-To: <1286812009.2737.30.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 17:46:49 +0200
> This patch was an optimization, not a bug fix.
Right, this has no business going into 2.6.32-stable at all.
^ permalink raw reply
* Re: [PATCH] net: introduce alloc_skb_order0
From: Stanislaw Gruszka @ 2010-10-11 15:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, Francois Romieu, netdev
In-Reply-To: <1286639996.2692.148.camel@edumazet-laptop>
On Sat, Oct 09, 2010 at 05:59:56PM +0200, Eric Dumazet wrote:
> Le vendredi 08 octobre 2010 à 18:03 +0200, Stanislaw Gruszka a écrit :
> > On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
>
> > > Switch to SLAB -> no more problem ;)
> >
> > yeh, I wish to, but fedora use SLUB because of some debugging
> > capabilities.
>
> Yes, of course, I was kidding :)
>
> echo 0 >/sys/kernel/slab/kmalloc-2048/order
> echo 0 >/sys/kernel/slab/kmalloc-1024/order
> echo 0 >/sys/kernel/slab/kmalloc-512/order
>
> Should do the trick : No more high order allocations for MTU=1500
> frames.
So the SLUB is great, but we need a patch to avoid using it :-)
> For MTU=9000 frames, we probably need something like this patch :
>
> Reception of big frames hit a memory allocation problem, because of high
> order pages allocations (order-3 sometimes for MTU=9000). This patch
> introduces alloc_skb_order0(), to build skbs with order-0 pages only.
I had never seen allocation problems in rtl8169_try_rx_copy or in any
other driver rx path (except iwlwifi, but now this is solved by using
skb_add_rx_frag), so I'm not sure if need this patch.
However I see other benefit of that patch. We save memory. Allocating
for MTU 9000 gives something like skb->data = kmalloc(9000 + 32 + 2
+ 334). So we take data from kmalloc-16384 cache, we waste about 7kB on
every allocation. With patch wastage would be about 2k per allocation
(assuming 4kB and 8kB page size)
However I started this thread thinking about other memory wastage,
in rtl8169_alloc_rx_skb, skb->data = kmalloc(16383 + 32 + 2 + 334), taken
from kmalloc-32768, almost 50% wastage.
> +struct sk_buff *alloc_skb_order0(int pkt_size)
> +{
> + int head = min_t(int, pkt_size, SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN));
> + struct sk_buff *skb;
> +
> + skb = alloc_skb(head + NET_SKB_PAD + NET_IP_ALIGN,
> + GFP_ATOMIC | __GFP_NOWARN);
> + if (!skb)
> + return NULL;
> + skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> + skb_put(skb, head);
> + pkt_size -= head;
> +
> + skb->len += pkt_size;
> + skb->data_len += pkt_size;
> + skb->truesize += pkt_size;
> + while (pkt_size) {
if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS - 1)
goto error;
> + int i = skb_shinfo(skb)->nr_frags++;
> + skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> + int fragsize = min_t(int, pkt_size, PAGE_SIZE);
> + struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> + if (!page)
> + goto error;
> + frag->page = page;
> + frag->size = fragsize;
> + frag->page_offset = 0;
> + pkt_size -= fragsize;
> + }
> + return skb;
> +
> +error:
> + kfree_skb(skb);
> + return NULL;
> +}
> +EXPORT_SYMBOL(alloc_skb_order0);
> +
> /* Checksum skb data. */
>
> __wsum skb_checksum(const struct sk_buff *skb, int offset,
^ permalink raw reply
* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Eric Dumazet @ 2010-10-11 15:46 UTC (permalink / raw)
To: Michael Tokarev
Cc: Andrey Vagin, stable, netdev, Krishna Kumar, David S. Miller
In-Reply-To: <4CB32FDB.4090001@msgid.tls.msk.ru>
Le lundi 11 octobre 2010 à 19:40 +0400, Michael Tokarev a écrit :
> Andrey Vagin wrote:
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > commit 068a2de57ddf4f472e32e7af868613c574ad1d88 upstream.
> >
> > Non-GSO code drops dst entry for performance reasons, but
> > the same is missing for GSO code. Drop dst while cache-hot
> > for GSO case too.
> >
> > Note: Without this patch the kernel may oops if used bridged veth
> > devices. A bridge set skb->dst = fake_dst_ops, veth transfers this skb
> > to netif_receive_skb...ip_rcv_finish and it calls dst_input(skb), but
> > fake_dst_ops->input = NULL -> Oops
>
> Hmm. Isn't this the reason for my mysterious OOPSes (jump to
> NULL) which I concluded are due to stack overflow? See f.e.
> http://www.spinics.net/lists/netdev/msg142104.html
>
> This started happening when I updated virtio drivers in a windows
> virtual machne to the ones which supports GSO, and my config
> involves bridging veth devices, and this is where the prob
> actually occurs - when doing guest => virtio => tap => bridge => veth
> route....
>
> Thanks!
This patch was an optimization, not a bug fix.
If something gets better, then a bug is somewhere else ?
^ permalink raw reply
* Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
From: Eric Dumazet @ 2010-10-11 15:42 UTC (permalink / raw)
To: David Miller
Cc: xiaohui.xin, netdev, kvm, linux-kernel, mst, mingo, herbert,
jdike
In-Reply-To: <20101011.082728.193709462.davem@davemloft.net>
Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit :
> From: "Xin, Xiaohui" <xiaohui.xin@intel.com>
> Date: Mon, 11 Oct 2010 15:06:05 +0800
>
> > That's to avoid the new cache miss caused by using destructor_arg in data path
> > like skb_release_data().
> > That's based on the comment from Eric Dumazet on v7 patches.
>
> Thanks for the explanation.
Anyway, frags[] must be the last field of "struct skb_shared_info"
since commit fed66381 (net: pskb_expand_head() optimization)
It seems Xin worked on a quite old tree.
^ permalink raw reply
* Re: [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Michael Tokarev @ 2010-10-11 15:40 UTC (permalink / raw)
To: Andrey Vagin; +Cc: stable, netdev, Krishna Kumar, David S. Miller
In-Reply-To: <1286810413-30238-1-git-send-email-avagin@openvz.org>
Andrey Vagin wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> commit 068a2de57ddf4f472e32e7af868613c574ad1d88 upstream.
>
> Non-GSO code drops dst entry for performance reasons, but
> the same is missing for GSO code. Drop dst while cache-hot
> for GSO case too.
>
> Note: Without this patch the kernel may oops if used bridged veth
> devices. A bridge set skb->dst = fake_dst_ops, veth transfers this skb
> to netif_receive_skb...ip_rcv_finish and it calls dst_input(skb), but
> fake_dst_ops->input = NULL -> Oops
Hmm. Isn't this the reason for my mysterious OOPSes (jump to
NULL) which I concluded are due to stack overflow? See f.e.
http://www.spinics.net/lists/netdev/msg142104.html
This started happening when I updated virtio drivers in a windows
virtual machne to the ones which supports GSO, and my config
involves bridging veth devices, and this is where the prob
actually occurs - when doing guest => virtio => tap => bridge => veth
route....
Thanks!
/mjt
^ permalink raw reply
* Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
From: David Miller @ 2010-10-11 15:27 UTC (permalink / raw)
To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, herbert, jdike
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D793064B2A9D8@shsmsx502.ccr.corp.intel.com>
From: "Xin, Xiaohui" <xiaohui.xin@intel.com>
Date: Mon, 11 Oct 2010 15:06:05 +0800
> That's to avoid the new cache miss caused by using destructor_arg in data path
> like skb_release_data().
> That's based on the comment from Eric Dumazet on v7 patches.
Thanks for the explanation.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox