Netdev List
 help / color / mirror / Atom feed
* [PATCH net] ipv6: move stub initialization after ipv6 setup completion
From: Paolo Abeni @ 2017-04-24 12:18 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Cong Wang, Hannes Frederic Sowa

The ipv6 stub pointer is currently initialized before the ipv6
routing subsystem: a 3rd party can access and use such stub
before the routing data is ready.
Moreover, such pointer is not cleared in case of initialization
error, possibly leading to dangling pointers usage.

This change addresses the above moving the stub initialization
at the end of ipv6 init code.

Fixes: 5f81bd2e5d80 ("ipv6: export a stub for IPv6 symbols used by vxlan")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv6/af_inet6.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index a9a9553..e82e59f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -933,8 +933,6 @@ static int __init inet6_init(void)
 	if (err)
 		goto igmp_fail;
 
-	ipv6_stub = &ipv6_stub_impl;
-
 	err = ipv6_netfilter_init();
 	if (err)
 		goto netfilter_fail;
@@ -1010,6 +1008,10 @@ static int __init inet6_init(void)
 	if (err)
 		goto sysctl_fail;
 #endif
+
+	/* ensure that ipv6 stubs are visible only after ipv6 is ready */
+	wmb();
+	ipv6_stub = &ipv6_stub_impl;
 out:
 	return err;
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next 1/2] l2tp: set name_assign_type for devices created by l2tp_eth.c
From: Guillaume Nault @ 2017-04-24 12:16 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1493035407.git.g.nault@alphalink.fr>

Export naming scheme used when creating l2tpeth interfaces
(/sys/class/net/<iface>/name_assign_type). This let userspace know if
the device's name has been generated automatically or defined manually.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_eth.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index b722d559c544..5e44b3cc1212 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -258,6 +258,7 @@ static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
 
 static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 peer_session_id, struct l2tp_session_cfg *cfg)
 {
+	unsigned char name_assign_type;
 	struct net_device *dev;
 	char name[IFNAMSIZ];
 	struct l2tp_tunnel *tunnel;
@@ -281,8 +282,11 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 p
 			goto out;
 		}
 		strlcpy(name, cfg->ifname, IFNAMSIZ);
-	} else
+		name_assign_type = NET_NAME_USER;
+	} else {
 		strcpy(name, L2TP_ETH_DEV_NAME);
+		name_assign_type = NET_NAME_ENUM;
+	}
 
 	session = l2tp_session_create(sizeof(*spriv), tunnel, session_id,
 				      peer_session_id, cfg);
@@ -291,7 +295,7 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 p
 		goto out;
 	}
 
-	dev = alloc_netdev(sizeof(*priv), name, NET_NAME_UNKNOWN,
+	dev = alloc_netdev(sizeof(*priv), name, name_assign_type,
 			   l2tp_eth_dev_setup);
 	if (!dev) {
 		rc = -ENOMEM;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 2/2] l2tp: define "l2tpeth" device type
From: Guillaume Nault @ 2017-04-24 12:16 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1493035407.git.g.nault@alphalink.fr>

Export type of l2tpeth interfaces to userspace
(/sys/class/net/<iface>/uevent).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_eth.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 5e44b3cc1212..59aba8aeac03 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -130,8 +130,13 @@ static const struct net_device_ops l2tp_eth_netdev_ops = {
 	.ndo_set_mac_address	= eth_mac_addr,
 };
 
+static struct device_type l2tpeth_type = {
+	.name = "l2tpeth",
+};
+
 static void l2tp_eth_dev_setup(struct net_device *dev)
 {
+	SET_NETDEV_DEVTYPE(dev, &l2tpeth_type);
 	ether_setup(dev);
 	dev->priv_flags		&= ~IFF_TX_SKB_SHARING;
 	dev->features		|= NETIF_F_LLTX;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 0/2] l2tp: add informations about l2tpeth interfaces in /sys
From: Guillaume Nault @ 2017-04-24 12:16 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman

Patch #1 lets userspace retrieve the naming scheme of an l2tpeth
interface, using /sys/class/net/<iface>/name_assign_type.

Patch #2 adds the DEVTYPE field in /sys/class/net/<iface>/uevent so
that userspace can reliably know if a device is an l2tpeth interface.


Guillaume Nault (2):
  l2tp: set name_assign_type for devices created by l2tp_eth.c
  l2tp: define "l2tpeth" device type

 net/l2tp/l2tp_eth.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

-- 
2.11.0

^ permalink raw reply

* Re: [PATCH] macsec: avoid heap overflow in skb_to_sgvec
From: Jason A. Donenfeld @ 2017-04-24 12:15 UTC (permalink / raw)
  To: David Laight
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	davem@davemloft.net, stable@vger.kernel.org, security@kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DCFFD9487@AcuExch.aculab.com>

On Mon, Apr 24, 2017 at 1:02 PM, David Laight <David.Laight@aculab.com> wrote:
> ...
>
> Shouldn't skb_to_sgvec() be checking the number of fragments against
> the size of the sg list?
> The callers would then all need auditing to allow for failure.

This has never been done before, since this is one of those operations
that simply _shouldn't fail_ this late in the driver's path. There's
an easy way to use a fixed size array of MAX_SKB_FRAGS+1, and then
just not specify FRAGLIST as a device feature. Then the function
succeeds every time, rather than dropping packets. Alternatively, if
the array is being allocated dynamically (kmalloc), a call to
skb_cow_data returns the number of fragments needed; since usually
people using scattergather are going to be modifying the skb anyway, I
believe this function should be being called anyway...

It would be possible to do as you suggest, though, by using sg_is_last
in skb_to_sgvec. In this case we'd need to change every call site of
skb_to_sgvec to ensure the return value is being checked as well as
making sure that the sglist is initialized with sg_init_table to
ensure the last frag is properly marked. I wouldn't be opposed to
this, though it is potentially error prone work.

In any case, this patch here follows the pattern of the entire rest of
the present-day kernel, so it ought to be merged as-is.

^ permalink raw reply

* Re: net: cleanup_net is slow
From: Florian Westphal @ 2017-04-24 12:08 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Florian Westphal, Eric Dumazet, Cong Wang, netdev, LKML,
	Dmitry Vyukov, Kostya Serebryany, syzkaller
In-Reply-To: <CAAeHK+zwsdzVrgwLccbXs7QUEsFkeTL13_AfNKMD1G2BkOr0eg@mail.gmail.com>

Andrey Konovalov <andreyknvl@google.com> wrote:
> On Fri, Apr 21, 2017 at 9:45 PM, Florian Westphal <fw@strlen.de> wrote:
> > Florian Westphal <fw@strlen.de> wrote:
> >> Indeed.  Setting net.netfilter.nf_conntrack_default_on=0 cuts time
> >> cleanup time by 2/3 ...
> >>
> >> nf unregister is way too happy to issue synchronize_net(), I'll work on
> >> a fix.
> >
> > I'll test this patch as a start.  Maybe we can also leverage exit_batch
> > more on netfilter side.
> 
> Hi Florian,
> 
> Your patch improves fuzzing speed at least twice, which is a great start!

Great.  I'll try to push the patches i have to nf-next ASAP.

^ permalink raw reply

* Re: [PATCH net] bridge: shutdown bridge device before removing it
From: Nikolay Aleksandrov @ 2017-04-24 12:07 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: David S. Miller, Stephen Hemminger,
	bridge@lists.linux-foundation.org
In-Reply-To: <3d77b47d-c8c3-de57-946e-59096a88964e@cumulusnetworks.com>

On 24/04/17 14:01, Nikolay Aleksandrov wrote:
> On 24/04/17 10:25, Xin Long wrote:
>> During removing a bridge device, if the bridge is still up, a new mdb entry
>> still can be added in br_multicast_add_group() after all mdb entries are
>> removed in br_multicast_dev_del(). Like the path:
>>
>>   mld_ifc_timer_expire ->
>>     mld_sendpack -> ...
>>       br_multicast_rcv ->
>>         br_multicast_add_group
>>
>> The new mp's timer will be set up. If the timer expires after the bridge
>> is freed, it may cause use-after-free panic in br_multicast_group_expired.
>> This can happen when ip link remove a bridge or destroy a netns with a
>> bridge device inside.
>>
>> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
>> device after it's shutdown.
>>
>> This patch is to call dev_close at the beginning of br_dev_delete so that
>> netif_running check in br_multicast_add_group can avoid this issue. But
>> to keep consistent with before, it will not remove the IFF_UP check in
>> br_del_bridge for brctl.
>>
>> Reported-by: Jianwen Ji <jiji@redhat.com>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> ---
>>  net/bridge/br_if.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
> 
> +CC bridge maintainers
> 
> I can see how this could happen, could you also provide the traceback ?
> 
> The patch looks good to me, actually I think it fixes another issue with
> mcast stats where the percpu pointer can be accessed after it's freed if
> an mcast packet can get sent via br->dev after the br_multicast_dev_del() call.
> This is definitely stable material, if I'm not mistaken the issue is there since
> the introduction of br_dev_delete:
> commit e10177abf842
> Author: Satish Ashok <sashok@cumulusnetworks.com>
> Date:   Wed Jul 15 07:16:51 2015 -0700
> 
>     bridge: multicast: fix handling of temp and perm entries
> 
> 
> 
> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> 

Actually I have a better idea for a fix because dev_close() for a single device is rather heavy.
Why don't you move the mdb flush logic in the bridge's ndo_uninit() callback ?
That should have the same effect and be much faster.

By the way I just noticed that there's also a memory leak - the mdb hash is reallocated
and not freed due to the mdb rehash, here's also kmemleak's object:

unreferenced object 0xffff8800540ba800 (size 2048):
  comm "softirq", pid 0, jiffies 4520588901 (age 5787.284s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
    [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
    [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
    [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
    [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
    [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
    [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
    [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
    [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
    [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
    [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
    [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
    [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
    [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
    [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
    [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]

^ permalink raw reply

* Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume
From: Michael S. Tsirkin @ 2017-04-24 12:00 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, netdev
In-Reply-To: <21a19608-40be-38d4-9843-088a273fd71a@redhat.com>

On Mon, Apr 24, 2017 at 07:54:18PM +0800, Jason Wang wrote:
> 
> 
> On 2017年04月24日 07:28, Michael S. Tsirkin wrote:
> > On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:
> > > 
> > > On 2017年04月17日 07:19, Michael S. Tsirkin wrote:
> > > > Applications that consume a batch of entries in one go
> > > > can benefit from ability to return some of them back
> > > > into the ring.
> > > > 
> > > > Add an API for that - assuming there's space. If there's no space
> > > > naturally we can't do this and have to drop entries, but this implies
> > > > ring is full so we'd likely drop some anyway.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > > 
> > > > Jason, in my mind the biggest issue with your batching patchset is the
> > > > backet drops on disconnect.  This API will help avoid that in the common
> > > > case.
> > > Ok, I will rebase the series on top of this. (Though I don't think we care
> > > the packet loss).
> > E.g. I care - I often start sending packets to VM before it's
> > fully booted. Several vhost resets might follow.
> 
> Ok.
> 
> > 
> > > > I would still prefer that we understand what's going on,
> > > I try to reply in another thread, does it make sense?
> > > 
> > > >    and I would
> > > > like to know what's the smallest batch size that's still helpful,
> > > Yes, I've replied in another thread, the result is:
> > > 
> > > 
> > > no batching   1.88Mpps
> > > RX_BATCH=1    1.93Mpps
> > > RX_BATCH=4    2.11Mpps
> > > RX_BATCH=16   2.14Mpps
> > > RX_BATCH=64   2.25Mpps
> > > RX_BATCH=256  2.18Mpps
> > Essentially 4 is enough, other stuf looks more like noise
> > to me. What about 2?
> 
> The numbers are pretty stable, so probably not noise. Retested on top of
> batch zeroing:
> 
> no  1.97Mpps
> 1   2.09Mpps
> 2   2.11Mpps
> 4   2.16Mpps
> 8   2.19Mpps
> 16  2.21Mpps
> 32  2.25Mpps
> 64  2.30Mpps
> 128 2.21Mpps
> 256 2.21Mpps
> 
> 64 performs best.
> 
> Thanks

OK but it might be e.g. a function of the ring size, host cache size or
whatever. As we don't really understand the why, if we just optimize for
your setup we risk regressions in others.  64 entries is a lot, it
increases the queue size noticeably.  Could this be part of the effect?
Could you try changing the queue size to see what happens?

> > 
> > > >    but
> > > > I'm not going to block the patch on these grounds assuming packet drops
> > > > are fixed.
> > > Thanks a lot.
> > > 
> > > > Lightly tested - this is on top of consumer batching patches.
> > > > 
> > > > Thanks!
> > > > 
> > > >    include/linux/ptr_ring.h | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > >    1 file changed, 57 insertions(+)
> > > > 
> > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > > index 783e7f5..5fbeab4 100644
> > > > --- a/include/linux/ptr_ring.h
> > > > +++ b/include/linux/ptr_ring.h
> > > > @@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int size, gfp_t gfp)
> > > >    	return 0;
> > > >    }
> > > > +/*
> > > > + * Return entries into ring. Destroy entries that don't fit.
> > > > + *
> > > > + * Note: this is expected to be a rare slow path operation.
> > > > + *
> > > > + * Note: producer lock is nested within consumer lock, so if you
> > > > + * resize you must make sure all uses nest correctly.
> > > > + * In particular if you consume ring in interrupt or BH context, you must
> > > > + * disable interrupts/BH when doing so.
> > > > + */
> > > > +static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
> > > > +				      void (*destroy)(void *))
> > > > +{
> > > > +	unsigned long flags;
> > > > +	int head;
> > > > +
> > > > +	spin_lock_irqsave(&(r)->consumer_lock, flags);
> > > > +	spin_lock(&(r)->producer_lock);
> > > > +
> > > > +	if (!r->size)
> > > > +		goto done;
> > > > +
> > > > +	/*
> > > > +	 * Clean out buffered entries (for simplicity). This way following code
> > > > +	 * can test entries for NULL and if not assume they are valid.
> > > > +	 */
> > > > +	head = r->consumer_head - 1;
> > > > +	while (likely(head >= r->consumer_tail))
> > > > +		r->queue[head--] = NULL;
> > > > +	r->consumer_tail = r->consumer_head;
> > > > +
> > > > +	/*
> > > > +	 * Go over entries in batch, start moving head back and copy entries.
> > > > +	 * Stop when we run into previously unconsumed entries.
> > > > +	 */
> > > > +	while (n--) {
> > > > +		head = r->consumer_head - 1;
> > > > +		if (head < 0)
> > > > +			head = r->size - 1;
> > > > +		if (r->queue[head]) {
> > > > +			/* This batch entry will have to be destroyed. */
> > > > +			++n;
> > > > +			goto done;
> > > > +		}
> > > > +		r->queue[head] = batch[n];
> > > > +		r->consumer_tail = r->consumer_head = head;
> > > > +	}
> > > > +
> > > > +done:
> > > > +	/* Destroy all entries left in the batch. */
> > > > +	while (n--) {
> > > > +		destroy(batch[n]);
> > > > +	}
> > > > +	spin_unlock(&(r)->producer_lock);
> > > > +	spin_unlock_irqrestore(&(r)->consumer_lock, flags);
> > > > +}
> > > > +
> > > >    static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
> > > >    					   int size, gfp_t gfp,
> > > >    					   void (*destroy)(void *))

^ permalink raw reply

* Re: net: cleanup_net is slow
From: Andrey Konovalov @ 2017-04-24 11:58 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Eric Dumazet, Cong Wang, netdev, LKML, Dmitry Vyukov,
	Kostya Serebryany, syzkaller
In-Reply-To: <20170421194515.GB8853@breakpoint.cc>

On Fri, Apr 21, 2017 at 9:45 PM, Florian Westphal <fw@strlen.de> wrote:
> Florian Westphal <fw@strlen.de> wrote:
>> Indeed.  Setting net.netfilter.nf_conntrack_default_on=0 cuts time
>> cleanup time by 2/3 ...
>>
>> nf unregister is way too happy to issue synchronize_net(), I'll work on
>> a fix.
>
> I'll test this patch as a start.  Maybe we can also leverage exit_batch
> more on netfilter side.

Hi Florian,

Your patch improves fuzzing speed at least twice, which is a great start!

Thanks!

>
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index a87a6f8a74d8..08fe1f526265 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -126,14 +126,15 @@ int nf_register_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  }
>  EXPORT_SYMBOL(nf_register_net_hook);
>
> -void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> +static struct nf_hook_entry *
> +__nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  {
>         struct nf_hook_entry __rcu **pp;
>         struct nf_hook_entry *p;
>
>         pp = nf_hook_entry_head(net, reg);
>         if (WARN_ON_ONCE(!pp))
> -               return;
> +               return NULL;
>
>         mutex_lock(&nf_hook_mutex);
>         for (; (p = nf_entry_dereference(*pp)) != NULL; pp = &p->next) {
> @@ -145,7 +146,7 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
>         mutex_unlock(&nf_hook_mutex);
>         if (!p) {
>                 WARN(1, "nf_unregister_net_hook: hook not found!\n");
> -               return;
> +               return NULL;
>         }
>  #ifdef CONFIG_NETFILTER_INGRESS
>         if (reg->pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
> @@ -154,6 +155,17 @@ void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
>  #ifdef HAVE_JUMP_LABEL
>         static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
>  #endif
> +
> +       return p;
> +}
> +
> +void nf_unregister_net_hook(struct net *net, const struct nf_hook_ops *reg)
> +{
> +       struct nf_hook_entry *p = __nf_unregister_net_hook(net, reg);
> +
> +       if (!p)
> +               return;
> +
>         synchronize_net();
>         nf_queue_nf_hook_drop(net, p);
>         /* other cpu might still process nfqueue verdict that used reg */
> @@ -183,10 +195,36 @@ int nf_register_net_hooks(struct net *net, const struct nf_hook_ops *reg,
>  EXPORT_SYMBOL(nf_register_net_hooks);
>
>  void nf_unregister_net_hooks(struct net *net, const struct nf_hook_ops *reg,
> -                            unsigned int n)
> +                            unsigned int hookcount)
>  {
> -       while (n-- > 0)
> -               nf_unregister_net_hook(net, &reg[n]);
> +       struct nf_hook_entry *to_free[16];
> +       unsigned int i, n;
> +
> +       WARN_ON_ONCE(hookcount > ARRAY_SIZE(to_free));
> +
> + next_round:
> +       n = min_t(unsigned int, hookcount, ARRAY_SIZE(to_free));
> +
> +       for (i = 0; i < n; i++)
> +               to_free[i] = __nf_unregister_net_hook(net, &reg[i]);
> +
> +       synchronize_net();
> +
> +       for (i = 0; i < n; i++) {
> +               if (to_free[i])
> +                       nf_queue_nf_hook_drop(net, to_free[i]);
> +       }
> +
> +       synchronize_net();
> +
> +       for (i = 0; i < n; i++)
> +               kfree(to_free[i]);
> +
> +       if (n < hookcount) {
> +               hookcount -= n;
> +               reg += n;
> +               goto next_round;
> +       }
>  }
>  EXPORT_SYMBOL(nf_unregister_net_hooks);
>

^ permalink raw reply

* Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume
From: Jason Wang @ 2017-04-24 11:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, netdev
In-Reply-To: <20170424022711-mutt-send-email-mst@kernel.org>



On 2017年04月24日 07:28, Michael S. Tsirkin wrote:
> On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:
>>
>> On 2017年04月17日 07:19, Michael S. Tsirkin wrote:
>>> Applications that consume a batch of entries in one go
>>> can benefit from ability to return some of them back
>>> into the ring.
>>>
>>> Add an API for that - assuming there's space. If there's no space
>>> naturally we can't do this and have to drop entries, but this implies
>>> ring is full so we'd likely drop some anyway.
>>>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>>
>>> Jason, in my mind the biggest issue with your batching patchset is the
>>> backet drops on disconnect.  This API will help avoid that in the common
>>> case.
>> Ok, I will rebase the series on top of this. (Though I don't think we care
>> the packet loss).
> E.g. I care - I often start sending packets to VM before it's
> fully booted. Several vhost resets might follow.

Ok.

>
>>> I would still prefer that we understand what's going on,
>> I try to reply in another thread, does it make sense?
>>
>>>    and I would
>>> like to know what's the smallest batch size that's still helpful,
>> Yes, I've replied in another thread, the result is:
>>
>>
>> no batching   1.88Mpps
>> RX_BATCH=1    1.93Mpps
>> RX_BATCH=4    2.11Mpps
>> RX_BATCH=16   2.14Mpps
>> RX_BATCH=64   2.25Mpps
>> RX_BATCH=256  2.18Mpps
> Essentially 4 is enough, other stuf looks more like noise
> to me. What about 2?

The numbers are pretty stable, so probably not noise. Retested on top of 
batch zeroing:

no  1.97Mpps
1   2.09Mpps
2   2.11Mpps
4   2.16Mpps
8   2.19Mpps
16  2.21Mpps
32  2.25Mpps
64  2.30Mpps
128 2.21Mpps
256 2.21Mpps

64 performs best.

Thanks

>
>>>    but
>>> I'm not going to block the patch on these grounds assuming packet drops
>>> are fixed.
>> Thanks a lot.
>>
>>> Lightly tested - this is on top of consumer batching patches.
>>>
>>> Thanks!
>>>
>>>    include/linux/ptr_ring.h | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 57 insertions(+)
>>>
>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>>> index 783e7f5..5fbeab4 100644
>>> --- a/include/linux/ptr_ring.h
>>> +++ b/include/linux/ptr_ring.h
>>> @@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int size, gfp_t gfp)
>>>    	return 0;
>>>    }
>>> +/*
>>> + * Return entries into ring. Destroy entries that don't fit.
>>> + *
>>> + * Note: this is expected to be a rare slow path operation.
>>> + *
>>> + * Note: producer lock is nested within consumer lock, so if you
>>> + * resize you must make sure all uses nest correctly.
>>> + * In particular if you consume ring in interrupt or BH context, you must
>>> + * disable interrupts/BH when doing so.
>>> + */
>>> +static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
>>> +				      void (*destroy)(void *))
>>> +{
>>> +	unsigned long flags;
>>> +	int head;
>>> +
>>> +	spin_lock_irqsave(&(r)->consumer_lock, flags);
>>> +	spin_lock(&(r)->producer_lock);
>>> +
>>> +	if (!r->size)
>>> +		goto done;
>>> +
>>> +	/*
>>> +	 * Clean out buffered entries (for simplicity). This way following code
>>> +	 * can test entries for NULL and if not assume they are valid.
>>> +	 */
>>> +	head = r->consumer_head - 1;
>>> +	while (likely(head >= r->consumer_tail))
>>> +		r->queue[head--] = NULL;
>>> +	r->consumer_tail = r->consumer_head;
>>> +
>>> +	/*
>>> +	 * Go over entries in batch, start moving head back and copy entries.
>>> +	 * Stop when we run into previously unconsumed entries.
>>> +	 */
>>> +	while (n--) {
>>> +		head = r->consumer_head - 1;
>>> +		if (head < 0)
>>> +			head = r->size - 1;
>>> +		if (r->queue[head]) {
>>> +			/* This batch entry will have to be destroyed. */
>>> +			++n;
>>> +			goto done;
>>> +		}
>>> +		r->queue[head] = batch[n];
>>> +		r->consumer_tail = r->consumer_head = head;
>>> +	}
>>> +
>>> +done:
>>> +	/* Destroy all entries left in the batch. */
>>> +	while (n--) {
>>> +		destroy(batch[n]);
>>> +	}
>>> +	spin_unlock(&(r)->producer_lock);
>>> +	spin_unlock_irqrestore(&(r)->consumer_lock, flags);
>>> +}
>>> +
>>>    static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
>>>    					   int size, gfp_t gfp,
>>>    					   void (*destroy)(void *))

^ permalink raw reply

* [PATCH v2] brcmfmac: Ensure pointer correctly set if skb data location changes
From: James Hughes @ 2017-04-24 11:40 UTC (permalink / raw)
  To: Arend van Spriel, Franky Lin, Hante Meuleman, Kalle Valo,
	linux-wireless, brcm80211-dev-list.pdl, netdev
  Cc: James Hughes

The incoming skb header may be resized if header space is
insufficient, which might change the data adddress in the skb.
Ensure that a cached pointer to that data is correctly set by
moving assignment to after any possible changes.

Signed-off-by: James Hughes <james.hughes@raspberrypi.org>

Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
---
Changes in v2

Moved the assignment below the len check - saves a few cycles 
if the length check fails. 


 drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index 5eaac13e2317..9b7c19a508ac 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -198,7 +198,7 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff *skb,
 	int ret;
 	struct brcmf_if *ifp = netdev_priv(ndev);
 	struct brcmf_pub *drvr = ifp->drvr;
-	struct ethhdr *eh = (struct ethhdr *)(skb->data);
+	struct ethhdr *eh;
 
 	brcmf_dbg(DATA, "Enter, bsscfgidx=%d\n", ifp->bsscfgidx);
 
@@ -236,6 +236,8 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff *skb,
 		goto done;
 	}
 
+	eh = (struct ethhdr *)(skb->data);
+
 	if (eh->h_proto == htons(ETH_P_PAE))
 		atomic_inc(&ifp->pend_8021x_cnt);
 
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net] bridge: shutdown bridge device before removing it
From: Nikolay Aleksandrov @ 2017-04-24 11:29 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: David S. Miller, Stephen Hemminger,
	bridge@lists.linux-foundation.org
In-Reply-To: <3d77b47d-c8c3-de57-946e-59096a88964e@cumulusnetworks.com>

On 24/04/17 14:01, Nikolay Aleksandrov wrote:
> On 24/04/17 10:25, Xin Long wrote:
>> During removing a bridge device, if the bridge is still up, a new mdb entry
>> still can be added in br_multicast_add_group() after all mdb entries are
>> removed in br_multicast_dev_del(). Like the path:
>>
>>   mld_ifc_timer_expire ->
>>     mld_sendpack -> ...
>>       br_multicast_rcv ->
>>         br_multicast_add_group
>>
>> The new mp's timer will be set up. If the timer expires after the bridge
>> is freed, it may cause use-after-free panic in br_multicast_group_expired.
>> This can happen when ip link remove a bridge or destroy a netns with a
>> bridge device inside.
>>
>> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
>> device after it's shutdown.
>>
>> This patch is to call dev_close at the beginning of br_dev_delete so that
>> netif_running check in br_multicast_add_group can avoid this issue. But
>> to keep consistent with before, it will not remove the IFF_UP check in
>> br_del_bridge for brctl.
>>
>> Reported-by: Jianwen Ji <jiji@redhat.com>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> ---
>>  net/bridge/br_if.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
> 
> +CC bridge maintainers
> 
> I can see how this could happen, could you also provide the traceback ?
> 
> The patch looks good to me, actually I think it fixes another issue with
> mcast stats where the percpu pointer can be accessed after it's freed if
> an mcast packet can get sent via br->dev after the br_multicast_dev_del() call.

Never mind the another issue part, Ido's recent ndo_uninit() patch fixed it.

> This is definitely stable material, if I'm not mistaken the issue is there since
> the introduction of br_dev_delete:
> commit e10177abf842
> Author: Satish Ashok <sashok@cumulusnetworks.com>
> Date:   Wed Jul 15 07:16:51 2015 -0700
> 
>     bridge: multicast: fix handling of temp and perm entries
> 
> 
> 
> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> 
> 
> 
> 
> 
> 
> 

^ permalink raw reply

* Re: [PATCH] brcmfmac: Ensure pointer correctly set if skb data location changes
From: Arend van Spriel @ 2017-04-24 11:10 UTC (permalink / raw)
  To: James Hughes, Franky Lin, Hante Meuleman, Kalle Valo,
	linux-wireless, brcm80211-dev-list.pdl, netdev
In-Reply-To: <20170424105219.18797-1-james.hughes@raspberrypi.org>

On 4/24/2017 12:52 PM, James Hughes wrote:
> The incoming skb header may be resized if header space is
> insufficient, which might change the data adddress in the skb.
> Ensure that a cached pointer to that data is correctly set by
> moving assignment to after any possible changes.

Thanks, James

Minor nit below...

You may add my acknowledgement:

Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
> Signed-off-by: James Hughes <james.hughes@raspberrypi.org>
> ---
>   drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c

[...]

> @@ -229,6 +229,8 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff *skb,
>   		}
>   	}
>   
> +	eh = (struct ethhdr *)(skb->data);
> +

Please move after the length validation below.

Regards,
Arend

>   	/* validate length for ether packet */
>   	if (skb->len < sizeof(*eh)) {
>   		ret = -EINVAL;
> 

^ permalink raw reply

* Re: [PATCH iproute2 net 3/8] tc/pedit: Introduce 'add' operation
From: Or Gerlitz @ 2017-04-24 11:04 UTC (permalink / raw)
  To: Amir Vadai
  Cc: Jamal Hadi Salim, Stephen Hemminger, Linux Netdev List,
	Or Gerlitz
In-Reply-To: <20170424075203.GA3265@office.localdomain>

On Mon, Apr 24, 2017 at 10:52 AM, Amir Vadai <amir@vadai.me> wrote:
> On Sun, Apr 23, 2017 at 01:44:51PM -0400, Jamal Hadi Salim wrote:
>> Thanks for the excellent work.

sure, it's Amir, you know..

>> On 17-04-23 08:53 AM, Amir Vadai wrote:

>> Mostly curious about hardware handling.

> As to hardware handling, Or did the implementation so he will answer
> better. AFAIK, current implementation doesn't allow partial field
> setting/adding, so such a rule can't be offloaded. I think it is only to
> make things simple and because there was no use case for that.
> To my knowledge, hardware should support such thing if needed.

yeah (currently no partial setting) and yeah (in the future yes, will
support partial setting of offset into
field and partial length to set from there) and no for partial addition.

The reason we decided to disallow partial addition in the FW API was
what do you do with the carry, e.g you
add a value to the 2nd nibble of IP address and there are carry bits,
where do you take them? this becomes
even more ugly if you are dealing with mac addresses. I guess once
there's use case for partial add offload,
we will revisit that.

Or.

^ permalink raw reply

* Re: [PATCH net] team: fix memory leaks
From: Jiri Pirko @ 2017-04-24 11:04 UTC (permalink / raw)
  To: Pan Bian; +Cc: netdev, linux-kernel
In-Reply-To: <1493029756-13171-1-git-send-email-bianpan2016@163.com>

Mon, Apr 24, 2017 at 12:29:16PM CEST, bianpan2016@163.com wrote:
>In functions team_nl_send_port_list_get() and
>team_nl_send_options_get(), pointer skb keeps the return value of
>nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
>freed(). This will result in memory leak bugs.
>
>Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for options transfers")
>Signed-off-by: Pan Bian <bianpan2016@163.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* RE: [PATCH] macsec: avoid heap overflow in skb_to_sgvec
From: David Laight @ 2017-04-24 11:02 UTC (permalink / raw)
  To: 'Jason A. Donenfeld', netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net
  Cc: stable@vger.kernel.org, security@kernel.org
In-Reply-To: <20170421211448.16995-1-Jason@zx2c4.com>

From: Jason A. Donenfeld
> Sent: 21 April 2017 22:15
> While this may appear as a humdrum one line change, it's actually quite
> important. An sk_buff stores data in three places:
> 
> 1. A linear chunk of allocated memory in skb->data. This is the easiest
>    one to work with, but it precludes using scatterdata since the memory
>    must be linear.
> 2. The array skb_shinfo(skb)->frags, which is of maximum length
>    MAX_SKB_FRAGS. This is nice for scattergather, since these fragments
>    can point to different pages.
> 3. skb_shinfo(skb)->frag_list, which is a pointer to another sk_buff,
>    which in turn can have data in either (1) or (2).
> 
> The first two are rather easy to deal with, since they're of a fixed
> maximum length, while the third one is not, since there can be
> potentially limitless chains of fragments. Fortunately dealing with
> frag_list is opt-in for drivers, so drivers don't actually have to deal
> with this mess. For whatever reason, macsec decided it wanted pain, and
> so it explicitly specified NETIF_F_FRAGLIST.
> 
> Because dealing with (1), (2), and (3) is insane, most users of sk_buff
> doing any sort of crypto or paging operation calls a convenient function
> called skb_to_sgvec (which happens to be recursive if (3) is in use!).
> This takes a sk_buff as input, and writes into its output pointer an
> array of scattergather list items. Sometimes people like to declare a
> fixed size scattergather list on the stack; othertimes people like to
> allocate a fixed size scattergather list on the heap. However, if you're
> doing it in a fixed-size fashion, you really shouldn't be using
> NETIF_F_FRAGLIST too (unless you're also ensuring the sk_buff and its
> frag_list children arent't shared and then you check the number of
> fragments in total required.)
> 
> Macsec specifically does this:
> 
>         size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
>         tmp = kmalloc(size, GFP_ATOMIC);
>         *sg = (struct scatterlist *)(tmp + sg_offset);
> 	...
>         sg_init_table(sg, MAX_SKB_FRAGS + 1);
>         skb_to_sgvec(skb, sg, 0, skb->len);
> 
> Specifying MAX_SKB_FRAGS + 1 is the right answer usually, but not if you're
> using NETIF_F_FRAGLIST, in which case the call to skb_to_sgvec will
> overflow the heap, and disaster ensues.
...

Shouldn't skb_to_sgvec() be checking the number of fragments against
the size of the sg list?
The callers would then all need auditing to allow for failure.

	David

^ permalink raw reply

* Re: [PATCH net] bridge: shutdown bridge device before removing it
From: Nikolay Aleksandrov @ 2017-04-24 11:01 UTC (permalink / raw)
  To: Xin Long, network dev; +Cc: bridge@lists.linux-foundation.org, David S. Miller
In-Reply-To: <a8b2b2c876f416431dac031a0beb7ce079c7d0a9.1493018730.git.lucien.xin@gmail.com>

On 24/04/17 10:25, Xin Long wrote:
> During removing a bridge device, if the bridge is still up, a new mdb entry
> still can be added in br_multicast_add_group() after all mdb entries are
> removed in br_multicast_dev_del(). Like the path:
> 
>   mld_ifc_timer_expire ->
>     mld_sendpack -> ...
>       br_multicast_rcv ->
>         br_multicast_add_group
> 
> The new mp's timer will be set up. If the timer expires after the bridge
> is freed, it may cause use-after-free panic in br_multicast_group_expired.
> This can happen when ip link remove a bridge or destroy a netns with a
> bridge device inside.
> 
> As we can see in br_del_bridge, brctl is also supposed to remove a bridge
> device after it's shutdown.
> 
> This patch is to call dev_close at the beginning of br_dev_delete so that
> netif_running check in br_multicast_add_group can avoid this issue. But
> to keep consistent with before, it will not remove the IFF_UP check in
> br_del_bridge for brctl.
> 
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_if.c | 2 ++
>  1 file changed, 2 insertions(+)
> 

+CC bridge maintainers

I can see how this could happen, could you also provide the traceback ?

The patch looks good to me, actually I think it fixes another issue with
mcast stats where the percpu pointer can be accessed after it's freed if
an mcast packet can get sent via br->dev after the br_multicast_dev_del() call.
This is definitely stable material, if I'm not mistaken the issue is there since
the introduction of br_dev_delete:
commit e10177abf842
Author: Satish Ashok <sashok@cumulusnetworks.com>
Date:   Wed Jul 15 07:16:51 2015 -0700

    bridge: multicast: fix handling of temp and perm entries



Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: PROBLEM: IPVS incorrectly reverse-NATs traffic to LVS host
From: Nick Moriarty @ 2017-04-24 10:55 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Wensong Zhang, Simon Horman, netdev, lvs-devel
In-Reply-To: <alpine.LFD.2.20.1704221417080.1974@ja.home.ssi.bg>

Hi Julian,

Many thanks for your prompt fix!  I've now tested this patch against
the 4.11.0-rc8 kernel on Ubuntu, and I can confirm that my check
script is no longer seeing incorrect addresses in its responses.

Could you please keep me posted as this is merged?

Thanks again

On 22 April 2017 at 18:06, Julian Anastasov <ja@ssi.bg> wrote:
>
>         Hello,
>
> On Wed, 12 Apr 2017, Nick Moriarty wrote:
>
>> Hi,
>>
>> I've experienced a problem in how traffic returning to an LVS host is
>> handled in certain circumstances.  Please find a bug report below - if
>> there's any further information you'd like, please let me know.
>>
>> [1.] One line summary of the problem:
>> IPVS incorrectly reverse-NATs traffic to LVS host
>>
>> [2.] Full description of the problem/report:
>> When using IPVS in direct-routing mode, normal traffic from the LVS
>> host to a back-end server is sometimes incorrectly NATed on the way
>> back into the LVS host.  Using tcpdump shows that the return packets
>> have the correct source IP, but by the time it makes it back to the
>> application, it's been changed.
>>
>> To reproduce this, a configuration such as the following will work:
>> - Set up an LVS system with a VIP serving UDP to a backend DNS server
>> using the direct-routing method in IPVS
>> - Make an outgoing UDP request to the VIP from the LVS system itself
>> (this causes a connection to be added to the IPVS connection table)
>> - The request should succeed as normal
>> - Note the UDP source port used
>> - Within 5 minutes (before the UDP connection entry expires), make an
>> outgoing UDP request directly to the backend DNS server
>> - The request will fail as the reply is incorrectly modified on its
>> way back and appears to return from the VIP
>>
>> Monitoring the above sequence with tcpdump verifies that the returned
>> packet (as it enters the host) is from the DNS IP, even though the
>> application sees the VIP.
>>
>> If an outgoing request direct to the DNS server is made from a port
>> not in the connection table, everything is fine.
>
>         Thanks for the detailed report! I think, I fixed the
> problem. Let me know if you are able to test the appended fix.
>
>> I expect that somewhere, something (e.g. functionality for IPVS MASQ
>> responses) is applying IPVS connection
>> information to incoming traffic, matching a DROUTE rule, and treating
>> it as NAT traffic.
>
>         Yep, that is what happens.
>
> ================================================================
>
> [PATCH net] ipvs: SNAT packet replies only for NATed connections
>
> We do not check if packet from real server is for NAT
> connection before performing SNAT. This causes problems
> for setups that use DR/TUN and allow local clients to
> access the real server directly, for example:
>
> - local client in director creates IPVS-DR/TUN connection
> CIP->VIP and the request packets are routed to RIP.
> Talks are finished but IPVS connection is not expired yet.
>
> - second local client creates non-IPVS connection CIP->RIP
> with same reply tuple RIP->CIP and when replies are received
> on LOCAL_IN we wrongly assign them for the first client
> connection because RIP->CIP matches the reply direction.
>
> The problem is more visible to local UDP clients but in rare
> cases it can happen also for TCP or remote clients when the
> real server sends the reply traffic via the director.
>
> So, better to be more precise for the reply traffic.
> As replies are not expected for DR/TUN connections, better
> to not touch them.
>
> Reported-by: Nick Moriarty <nick.moriarty@york.ac.uk>
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
>  net/netfilter/ipvs/ip_vs_core.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index db40050..ee44ed5 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -849,10 +849,8 @@ static int handle_response_icmp(int af, struct sk_buff *skb,
>  {
>         unsigned int verdict = NF_DROP;
>
> -       if (IP_VS_FWD_METHOD(cp) != 0) {
> -               pr_err("shouldn't reach here, because the box is on the "
> -                      "half connection in the tun/dr module.\n");
> -       }
> +       if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ)
> +               goto ignore_cp;
>
>         /* Ensure the checksum is correct */
>         if (!skb_csum_unnecessary(skb) && ip_vs_checksum_complete(skb, ihl)) {
> @@ -886,6 +884,8 @@ static int handle_response_icmp(int af, struct sk_buff *skb,
>                 ip_vs_notrack(skb);
>         else
>                 ip_vs_update_conntrack(skb, cp, 0);
> +
> +ignore_cp:
>         verdict = NF_ACCEPT;
>
>  out:
> @@ -1385,8 +1385,11 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, in
>          */
>         cp = pp->conn_out_get(ipvs, af, skb, &iph);
>
> -       if (likely(cp))
> +       if (likely(cp)) {
> +               if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ)
> +                       goto ignore_cp;
>                 return handle_response(af, skb, pd, cp, &iph, hooknum);
> +       }
>
>         /* Check for real-server-started requests */
>         if (atomic_read(&ipvs->conn_out_counter)) {
> @@ -1444,9 +1447,15 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, in
>                         }
>                 }
>         }
> +
> +out:
>         IP_VS_DBG_PKT(12, af, pp, skb, iph.off,
>                       "ip_vs_out: packet continues traversal as normal");
>         return NF_ACCEPT;
> +
> +ignore_cp:
> +       __ip_vs_conn_put(cp);
> +       goto out;
>  }
>
>  /*
> --
> 2.9.3
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>



-- 
Nick Moriarty
Linux Systems Administrator and Developer

IT Services
Computer Science Building
University of York
York
YO10 5GH
+44 (0)1904 32 3484

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm

^ permalink raw reply

* [PATCH] brcmfmac: Ensure pointer correctly set if skb data location changes
From: James Hughes @ 2017-04-24 10:52 UTC (permalink / raw)
  To: Arend van Spriel, Franky Lin, Hante Meuleman, Kalle Valo,
	linux-wireless, brcm80211-dev-list.pdl, netdev
  Cc: James Hughes

The incoming skb header may be resized if header space is
insufficient, which might change the data adddress in the skb.
Ensure that a cached pointer to that data is correctly set by
moving assignment to after any possible changes.

Signed-off-by: James Hughes <james.hughes@raspberrypi.org>
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index 5eaac13e2317..934fe00e28a0 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -198,7 +198,7 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff *skb,
 	int ret;
 	struct brcmf_if *ifp = netdev_priv(ndev);
 	struct brcmf_pub *drvr = ifp->drvr;
-	struct ethhdr *eh = (struct ethhdr *)(skb->data);
+	struct ethhdr *eh;
 
 	brcmf_dbg(DATA, "Enter, bsscfgidx=%d\n", ifp->bsscfgidx);
 
@@ -229,6 +229,8 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff *skb,
 		}
 	}
 
+	eh = (struct ethhdr *)(skb->data);
+
 	/* validate length for ether packet */
 	if (skb->len < sizeof(*eh)) {
 		ret = -EINVAL;
-- 
2.11.0

^ permalink raw reply related

* Re: bluetooth 6lowpan interfaces are not virtual anymore
From: Luiz Augusto von Dentz @ 2017-04-24 10:35 UTC (permalink / raw)
  To: Alexander Aring
  Cc: Network Development, Jukka Rissanen, linux-wpan@vger.kernel.org,
	Linux Bluetooth, Michael Richardson
In-Reply-To: <f3c04860-651e-c9f3-e7ab-db609d105436@pengutronix.de>

Hi Alex,

On Wed, Apr 19, 2017 at 8:43 PM, Alexander Aring <aar@pengutronix.de> wrote:
> Hi,
>
> at first I want to clarify what my definition of virtual and non virtual
> interface is:
>
> - virtual interfaces: has no queue(s)
> - non virtual interfaces: has a queue(s)
>
> I did some "big" ASCII graphic what I think it will do now.
> At first the virtual interface:
>
> --- snip ---
>
> Transmit side only! Case of virtual interface.
>
> IPv6 Stack
>
>                 +-----------------------------+
>                 |    IPv6 Packet (ND, etc)    |
>                 +--------------+--------------+
>                                |
>       -------------------------+---------------------------------
> 6LoWPAN Interface              |
>                                |
>                 +--------------v--------------+
>                 | Adaptation (IPv6 to 6Lo)    |
>                 +--------------+--------------+
>     +--------------------------+
>     |
>     | -----------------------------------------------------------
> Link|Layer (802.15.4/hci_dev/etc)
>     |
>     |                  SKB Queue (With kind of discipline)
>     |           +----------------+---------------+-------------+
>     +----------->      SKB1      |     SKBN      |     ...     +----+
>                 +----------------+---------------+-------------+    |
>                                                                     |
>                                                                     |
>       ------------------------------------------------------------  |
> Real Transeiver Hardware                                            |
>                                        +----------------------------+
>                                        |
>               +------------------------v-----------------------+
>               |        Framebuffers (YOUR hardware resource)   |
>               +------------------------------------------------+
>
> --- snap ---
>
> The non virtual interface:
>
> --- snip ---
>
> Transmit side only! Case of non virtual interface.
>
> IPv6 Stack
>
>                 +-----------------------------+
>                 |    IPv6 Packet (ND, etc)    |
>                 +--------------+--------------+
>                                |
>       -------------------------+---------------------------------
> 6LoWPAN Interface              |
>                                |
>     +--------------------------+
>     |                                                              /> Will be queued if you stop the queue
>     |                  SKB Queue (With kind of discipline)      /--   Because Link Layer is full/software limitation
>     |           +----------------+---------------+-------------+
>     +----------->      SKB1      |     SKBN      |     ...     +----+
>                 +----------------+---------------+-------------+    |
>                                                                     |
>                                +------------------------------------+
>                                |
>                 +--------------v--------------+
>                 | Adaptation (IPv6 to 6Lo)    |
>                 +--------------+--------------+
>     +--------------------------+
>     |
>     | -----------------------------------------------------------
> Link|Layer (802.15.4/hci_dev/etc)
>     |                                                              -> software limitation (tx credits?)
>     |                  SKB Queue (With kind of discipline)     ---/   Right position to make different handling (for virtual case).
>     |           +----------------+---------------+------------/+
>     +----------->      SKB1      |     SKBN      |     ...     +----+
>                 +----------------+---------------+-------------+    |
>                                                                     |
>                                                                     |
>       ------------------------------------------------------------  |
> Real Transeiver Hardware                                            |
>                                        +----------------------------+
>                                        |
>               +------------------------v-----------------------+
>               |        Framebuffers (YOUR hardware resource)   |
>               +------------------------------------------------+
>
>
> --- snap ---
>
> Adding a queue to a interface which does a protocol translation only is
> in my opinion: wrong.
>
> At least if you want to try to make a more efficient qdisc handling,
> means dropping skb's in queue when userspace sends too much. You will
> change it back, control two queues seems to be difficult.

It is wrong only if you look at 6LoWPAN interface as being owned by
6lowpan modules, but it doesn't, in fact it won't anything by itself
so it basically acts as a library to that perform 6LoWPAN operation on
the packet level, thus why it is possible to switch from virtual to
real interface. Also in terms of 15.4, the 6LoWPAN interfaces you
depicted above is also no controlled by 6LoWPAN itself, in fact it
seeems to represent links in 15.4, and I also assume these links may
not use 6LoWPAN.

> On 04/18/2017 12:43 PM, Luiz Augusto von Dentz wrote:
>> Hi Alex,
>>
>> On Mon, Apr 17, 2017 at 8:56 PM, Alexander Aring <aar@pengutronix.de> wrote:
>>> Hi,
>>>
>>> bluetooth-next contains patches which introduces a queue for bluetooth
>>> 6LoWPAN interfaces. [0]
>>>
>>> At first, the current behaviour is now that 802.15.4 6LoWPAN interfaces
>>> are virtual and bluetooth 6LoWPAN interfaces are not virtual anymore.
>>> To have a different handling in both subsystems is _definitely_ wrong.
>>
>> Well 15.4 and Bluetooth have very different handling, iirc in 15.4 you
>> do have a real interface which has, in fact, queueing support:
>>
>> net/mac802154/iface.c:ieee802154_if_setup:
>> dev->tx_queue_len = 300;
>>
>
> Is l2cap_chan_send a call with is synchronized which wait for it so
> after that the "l2cap bluetooth *frame*. etc." is send?
>
> I looked at the code [0]. It will queue it in some skb queue, so far I
> see. Hard to see it in this code, but I see some queueing and workqueue
> scheduling [1] in the function which will be called by l2cap_chan_send.

There is a tx_queue per channel but that doesn't mean there is a per
channel and introducing one when we can use the net qdisc support
seems useless to me.

> This queue should have the same meaning like our queue in IEEE
> 802.15.4 interface which you mentioned above.
> At this point, we have no difference. There exists already a queue for
> your real transeiver hardware.

The queue here is per channel, btw the queue size is not decided by us
it is the remote side that provides the tx credits so to some extend
the tx_queue is actually the remote queue in case of CoC. When testing
this it was quite clear this does not work in practice, with IPv6 at
least, because the remote side may not have enough credits for a
single packet (MPS * tx_credits < MTU) which means without proper
queueing support we would be dropping every packet.

>>> What does the 6LoWPAN interface?
>>>
>>> It will do a protocol change (an adaptation, because 6LoWPAN should
>>> provide the same functionality as IPv6) from IPv6 to 6LoWPAN (tx) and
>>> vice versa for (rx). In my opinion this should be handled as a virtual
>>> interface and not as an interface with a queue.
>>
>> Then we need a real netif for Bluetooth as well, one that does
>> queueing since introducing for l2cap_chan makes no sense when most of
>> time the userspace has a socket which does its own queueing. Btw, I
>> have the opinion that doing a second interface just to make Bluetooth
>> behave like 15.4 is very wasteful since we don't have any other
>> network support except for bnep, which in fact does the same thing
>> regarding queueing.
>>
>
> If bnep just generates some ethernet frames and you put L2CAP around,
> then it should be virtual too. If it queue skbs for sending out for the
> hci_dev. Then it ends in a similar queue handling like "non virtual
> 6LoWPAN interface".

BNEP does not use the same L2CAP channel mode, in fact there is no
queueing whatsoever since it use basic mode, and considering it has
been working for this long I don't think there is a real problem in it
using the net qdisc infra, actually I don't quite get this virtual to
non-virtual complaint, IMO this is only valid if you do have a
tunnel-like design where there are multiple interfaces involved so
there is no point in having qdisc enabled in all of them but we only
have one interface in Bluetooth.

> What you mean with second interface? At least you need a capable
> net_device interface to offer an IP connection from userspace.
>
>>> What makes a queue on 6LoWPAN interfaces?
>>>
>>> It will queue IPv6 packets which waits for it adaptation (the protocol
>>> change) with some additional qdisc handling.
>>> If finally the xmit_do callback will occur it will replace IPv6 header
>>> with 6LoWPAN, etc. After that it should be queued into some queue on
>>> link layer side which should be do the transmit finally.
>>
>> In case of Bluetooth it does the Link Layer using L2CAP, which is not
>> exactly at Link Layer, in fact it is one of the major problems of
>> having this in Kernel space since it is similar of doing IP over TCP
>> tunnel.
>>
>
> aha, but this has nothing to do to decide that you have a 6LoWPAN queue,
> or? I think this is one reason why you want a "6lowtun" interface. To
> handle L2CAP in userspace.

Indeed it is one of main reasons.

>>> Why I think bluetooth introduced a queue handling on 6LoWPAN interfaces?
>>>
>>> Because I think they don't like their own *qdisc* handling on their link
>>> layer interface. I write *qdisc* here because, they have no net_devices
>>> and use some kind of own qdisc handling.
>>
>> No we don't, and except we do change the whole architecture of
>> Bluetooth to work as a net_device I don't think we can have a similar
>> design as 15.4. Anyway Bluetooth is not really designed to carry IP,
>> the L2CAP and other profiles above are mostly a Bluetooth thing so
>> having a net_device would not buy us much more than queueing for bnep
>> and ipsp.
>>
>
> At least, having a queue and don't put anything anymore into the queue
> when "tx credits" is at zero (because queue is full). _Is_ a kind of
> qdisc handling.

The tx_credits operate at L2CAP segment level (MPS), so it is at a
completely different level, there are no guarantees the credits will
attend to a single IP packet.

In case you are curious when testing with qdisc disabled the interface
will basically turn down as soon as there are no credits left which
makes even pings be dropped.

>>> My question: is this correct?
>>>
>>> How to fix that (In my opinion):
>>>
>>> So commit [0] says something "out of credits" that's what I think it's
>>> the *qdisc* handling. If you cannot queue anymore -> you need to drop
>>> it. If you don't like how the current behaviour is, you need to change
>>> your *qdisc* handling on your link layer interface. Introducing queue at
>>> 6LoWPAN interfaces will introduce "buffer bloating".
>>
>> That is not what the code comments says, eg. netif_stop_queue:
>>
>>  * Stop upper layers calling the device hard_start_xmit routine.
>>  * Used for flow control when transmit resources are unavailable.
>>
>
> This comments are correct so long you operate on "real hardware
> resources" which 6LoWPAN interfaces doesn't do. In case so far I see, it
> will call l2cap_chan_send which putting some L2CAP protocol on it (to
> provide data) and somewhere queue it for transmit to fill "real hardware
> resources" e.g. framebuffers.

The interface in question is created by Bluetooth 6lowpan module, the
TX and RX is responsibility of the implementation of the interface
which I assume it to avoid leaking this sort of details to 6LoWPAN
itself or having yet another level of indirection where the 6LoWPAN
would need to define another API/callbacks to interface with the
modules trying to use it.

> Your "resource" which is unavailable is the link layer queue, but this
> is just a software limitation by bluetooth, which you can change by
> software to make a different handling, than just adding more software queues.
>
> What _virtual_ 6LoWPAN interface does is:
>
> Input: IPv6 -> Output: 6LoWPAN, without any queueing in front of that.
>
>> The fact 15.4, and bnep, uses these APIs makes me believe this is the
>> proper way of using qdisc handling, the only difference here is at
>> what level we would be using them.
>>
>
> On 15.4 we use it for link layer interface queue, but we don't put
> another queue in 6LoWPAN. See my aboves ASCII arts which shows the
> different when 6LoWPAN is virtual and not virtual.
>
> ---
>
> We have already qdisc problems in our case. In 15.4 it "just works" for
> now, but we need more handling there to send not garbage if one frame
> gets dropped there, see [2].
>
> In my opinion, if bluetooth people will hit similar issues, you will
> change it back that 6LoWPAN has no queue anymore. Otherwise it will very
> hard to decide which skbs you drop or not - because you need to deal
> with two queues.
>
> To make 6LoWPAN non virtual anymore and just adding a queue will occur
> that you don't drop much anymore... but the queue at link layer
> (hci_dev) will work as before and this occurs buffer bloating only.
>
>>> ---
>>>
>>> I don't care what bluetooth does with the 6LoWPAN interface. If bluetooth
>>> people wants such behaviour, then I am okay with that.
>>>
>>> I will bookmark this mail for the time when somebody reverts it to a
>>> virtual interface again. I think somebody will change it again, or maybe
>>> somebody will argument why we need a queue here.
>>
>> From what I could grasp from the code except if 6LoWPAN start creating
>> its own interface, which it doesn't currently, it should never assume
>> any specific behavior for an interface. 6LoWPAN right now only have
>> helper functions which the interface code call on TX/RX, and by
>> looking at code under 15.4 the only thing we would be doing is to set
>> skb->dev = wdev; (wdev being the real device) as Bluetooth don't use
>> 6lowpan fragmentation.
>>
>
> yep, there is a lot of work do make a better framework, but has nothing
> do to why bluetooth introduced a second queue on 6lowpan interfaces.
>
>> Btw, another big different in terms of design that it seems 15.4 does
>> create interfaces to each and every link, in Bluetooth the interface
>> is per adapter so it is in fact multi-link, I guess in practice 15.4
>> shall only have one 6lowpan link since it is connected to mesh, but in
>> Bluetooth we may have more than one Link and I don't think it would be
>> a good idea to have each of these links as a different interface,
>> especially not with the same mac address as it seems to be what 15.4
>> code is doing:
>>
>> /* Set the lowpan hardware address to the wpan hardware address. */
>> memcpy(ldev->dev_addr, wdev->dev_addr, IEEE802154_ADDR_LEN);
>>
>
> Of course we need the same mac address there (at first), because we
> have an address filter on hardware.
>
> (Be aware that you _cannot_ change the mac address (dev->dev_addr) while IP
>  capable net device is up. They assume the mac address is read only if net
>  device is up.)
>
> There is currently an idea to provide multiple links, but then hardware
> need to go into promiscuous mode (disable address filtering) but then we
> will lose some hardware offloading stuff e.g. ACK handling.
> If we have that, we can provide multiple links with different mac
> addresses, at least this will be handled as multiple wpan interfaces for
> one PHY.

We do need multi-link from the start, and it has been decided it
wouldn't be with multiple interfaces but with a single one.

>> So at least with IPSP/RFC-7668 the topology is completely different
>> from what we can see in 15.4, it may happen that this changes with
>> upcoming changes to Bluetooth, adding yet another topology that we
>> will need to support. Personally I wouldn't mind if 15.4 and Bluetooth
>> had more in common, at least at 6lowpan level, so we wouldn't have the
>> need to create separated modules just to deal with their differences,
>> but in reality, these technologies are competing for the same market
>> so I don't think it will happen, unfortunately.
>>
>
> You already have separated modules, this is .e.g IPHC stuff and 6lowpan
> specific implementation for bluetooth. In net/bluetooth/6lowpan.c you
> can put your changes for make specific bluetooth handling.
>
> This discussion moved currently to a more complex question about how to
> make the whole 6lowpan architecture...
>
> I think I tried to explain so much as possible what a queue on top of
> 6lowpan interface will do, at least if you want to make a better queue
> handling (when dropping skb's inside the queue(s)) then you will fail
> and revert that change (I think). But then I can say "I told you so" :-)
>
> - Alex
>
> [0] http://lxr.free-electrons.com/source/net/bluetooth/l2cap_core.c#L2455
> [1] http://lxr.free-electrons.com/source/net/bluetooth/hci_core.c#L3490
> [2] http://www.spinics.net/lists/linux-wpan/msg03679.html



-- 
Luiz Augusto von Dentz

^ permalink raw reply

* [PATCH net] team: fix memory leaks
From: Pan Bian @ 2017-04-24 10:29 UTC (permalink / raw)
  To: Jiri Pirko, netdev; +Cc: linux-kernel, Pan Bian

In functions team_nl_send_port_list_get() and
team_nl_send_options_get(), pointer skb keeps the return value of
nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
freed(). This will result in memory leak bugs.

Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for options transfers")
Signed-off-by: Pan Bian <bianpan2016@163.com>
---
 drivers/net/team/team.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f8c81f1..85c0124 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, u32 portid, u32 seq,
 
 	hdr = genlmsg_put(skb, portid, seq, &team_nl_family, flags | NLM_F_MULTI,
 			  TEAM_CMD_OPTIONS_GET);
-	if (!hdr)
+	if (!hdr) {
+		nlmsg_free(skb);
 		return -EMSGSIZE;
+	}
 
 	if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
 		goto nla_put_failure;
@@ -2634,8 +2636,10 @@ static int team_nl_send_port_list_get(struct team *team, u32 portid, u32 seq,
 
 	hdr = genlmsg_put(skb, portid, seq, &team_nl_family, flags | NLM_F_MULTI,
 			  TEAM_CMD_PORT_LIST_GET);
-	if (!hdr)
+	if (!hdr) {
+		nlmsg_free(skb);
 		return -EMSGSIZE;
+	}
 
 	if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
 		goto nla_put_failure;
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net v2 1/3] net: hns: support deferred probe when can not obtain irq
From: Matthias Brugger @ 2017-04-24 10:28 UTC (permalink / raw)
  To: netdev; +Cc: netdev, charles.chenxin, linuxarm
In-Reply-To: <1492760684-117205-2-git-send-email-yankejian@huawei.com>

On 21/04/17 09:44, Yankejian wrote:
> From: lipeng <lipeng321@huawei.com>
>
> In the hip06 and hip07 SoCs, the interrupt lines from the
> DSAF controllers are connected to mbigen hw module.
> The mbigen module is probed with module_init, and, as such,
> is not guaranteed to probe before the HNS driver. So we need
> to support deferred probe.
>
> We check for probe deferral in the hw layer probe, so we not
> probe into the main layer and memories, etc., to later learn
> that we need to defer the probe.
>

Why? This looks like a hack.
 From what I see, we can handle EPROBE_DEFER easily inside hns_ppe_init 
checking the return value of hns_rcb_get_cfg. Like you do in 2/3 of this 
series.

Regards,
Matthias

> Signed-off-by: lipeng <lipeng321@huawei.com>
> Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
> index 403ea9d..2da5b42 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
> @@ -2971,6 +2971,18 @@ static int hns_dsaf_probe(struct platform_device *pdev)
>  	struct dsaf_device *dsaf_dev;
>  	int ret;
>
> +	/*
> +	 * Check if we should defer the probe before we probe the
> +	 * dsaf, as it's hard to defer later on.
> +	 */
> +	ret = platform_get_irq(pdev, 0);
> +	if (ret < 0) {
> +		if (ret != -EPROBE_DEFER)
> +			dev_err(&pdev->dev, "Cannot obtain irq\n");
> +
> +		return ret;
> +	}
> +
>  	dsaf_dev = hns_dsaf_alloc_dev(&pdev->dev, sizeof(struct dsaf_drv_priv));
>  	if (IS_ERR(dsaf_dev)) {
>  		ret = PTR_ERR(dsaf_dev);
>

^ permalink raw reply

* Re: [PATCH] ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
From: Simon Horman @ 2017-04-24  9:54 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Paolo Abeni, netfilter-devel, lvs-devel, Wensong Zhang, netdev
In-Reply-To: <alpine.LFD.2.20.1704241017030.2492@ja.home.ssi.bg>

On Mon, Apr 24, 2017 at 10:21:30AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Mon, 24 Apr 2017, Paolo Abeni wrote:
> 
> > Hi,
> > 
> > The problem with the patched code is that it tries to resolve ipv6
> > addresses that are not created/validated by the kernel.
> 
> 	OK. Simon, please apply to ipvs tree.
> 
> Acked-by: Julian Anastasov <ja@ssi.bg>

Thanks, done.

^ permalink raw reply

* Re: [PATCH] net: ipv6: check route protocol when deleting routes
From: Lorenzo Colitti @ 2017-04-24  9:48 UTC (permalink / raw)
  To: Mantas Mikulėnas
  Cc: netdev@vger.kernel.org, lkml, David Miller, Joel Scherpelz,
	Hannes Frederic Sowa, YOSHIFUJI Hideaki, Greg KH
In-Reply-To: <20161216083059.251368-1-grawity@gmail.com>

On Fri, Dec 16, 2016 at 5:30 PM, Mantas Mikulėnas <grawity@gmail.com> wrote:
> The protocol field is checked when deleting IPv4 routes, but ignored for
> IPv6, which causes problems with routing daemons accidentally deleting
> externally set routes (observed by multiple bird6 users).
>
> This can be verified using `ip -6 route del <prefix> proto something`.

I think this change might have broken userspace deleting routes that
were created by RAs. This is because the rtm_protocol returned to
userspace is actually synthesized only at route dump time by
rt6_fill_node:

        else if (rt->rt6i_flags & RTF_ADDRCONF) {
                if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ROUTEINFO))
                        rtm->rtm_protocol = RTPROT_RA;
                else
                        rtm->rtm_protocol = RTPROT_KERNEL;
        }

but rt6_add_dflt_router and rt6_add_route_info add the route with a
protocol of 0, and 0 is silently upgraded to RTPROT_BOOT by
ip6_route_info_create.

        if (cfg->fc_protocol == RTPROT_UNSPEC)
                cfg->fc_protocol = RTPROT_BOOT;
        rt->rt6i_protocol = cfg->fc_protocol;

So an app that was previously trying to delete routes looking at
rtm_proto, and issuing a delete with whatever rtm_proto was returned
by netlink, will result in this check failing because its passed-in
protocol (RTPROT_RA or RTPROT_KERNEL) will not match the actual FIB
value, which is RTPROT_BOOT.

I can't easily test on a vanilla kernel, but on a system running a
slightly modified 4.4.63, I see the code fail like this:

# ip -6 route show
2001:db8:64::/64 dev nettest100  proto kernel  metric 256  expires 291sec
fe80::/64 dev nettest100  proto kernel  metric 256
default via fe80::6400 dev nettest100  proto ra  metric 1024  expires 291sec
# ip -6 route flush
Failed to send flush request: No such process
# ip -6 route show
default via fe80::6400 dev nettest100  proto ra  metric 1024  expires 286sec

If so, it seems unfortunate to have brought this into -stable.

For non-stable kernels, it seems that the proper fix would be:

1. Ensure that when an RA creates a route, it properly sets
rtm_protocol at time of route creation.
2. When we dump routes to userspace, we don't overwrite the rtm_protocol.

I can try to write that up, but I'm not sure why the code doesn't do
this already. Perhaps there's a good reason for it. Yoshifuji, Hannes,
any thoughts?

^ permalink raw reply

* Re: [net 1/1] team: fix memory leaks
From: Jiri Pirko @ 2017-04-24  9:43 UTC (permalink / raw)
  To: Pan Bian; +Cc: netdev, linux-kernel
In-Reply-To: <1493026612-12383-1-git-send-email-bianpan2016@163.com>

Plus since you have only one patch, please do not do "1/1" in the
email subject.  Thanks.


Mon, Apr 24, 2017 at 11:36:52AM CEST, bianpan2016@163.com wrote:
>In functions team_nl_send_port_list_get() and
>team_nl_send_options_get(), pointer skb keeps the return value of
>nlmsg_new(). When the call to genlmsg_put() fails, the memory is not
>freed(). This will result in memory leak bugs.
>
>Fixes: 9b00cf2d1024 ("team: implement multipart netlink messages for
>options transfers")
>Signed-off-by: Pan Bian <bianpan2016@163.com>
>---
> drivers/net/team/team.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index f8c81f1..85c0124 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -2361,8 +2361,10 @@ static int team_nl_send_options_get(struct team *team, u32 portid, u32 seq,
> 
> 	hdr = genlmsg_put(skb, portid, seq, &team_nl_family, flags | NLM_F_MULTI,
> 			  TEAM_CMD_OPTIONS_GET);
>-	if (!hdr)
>+	if (!hdr) {
>+		nlmsg_free(skb);
> 		return -EMSGSIZE;
>+	}
> 
> 	if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
> 		goto nla_put_failure;
>@@ -2634,8 +2636,10 @@ static int team_nl_send_port_list_get(struct team *team, u32 portid, u32 seq,
> 
> 	hdr = genlmsg_put(skb, portid, seq, &team_nl_family, flags | NLM_F_MULTI,
> 			  TEAM_CMD_PORT_LIST_GET);
>-	if (!hdr)
>+	if (!hdr) {
>+		nlmsg_free(skb);
> 		return -EMSGSIZE;
>+	}
> 
> 	if (nla_put_u32(skb, TEAM_ATTR_TEAM_IFINDEX, team->dev->ifindex))
> 		goto nla_put_failure;
>-- 
>1.9.1
>
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox