Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] wl1251: add a missing spin_lock_init()
From: David Miller @ 2017-08-31 17:33 UTC (permalink / raw)
  To: pavel
  Cc: kvalo, torvalds, xiyou.wangcong, akpm, netdev, linux-kernel,
	linux-wireless
In-Reply-To: <20170831144743.GA21261@amd>

From: Pavel Machek <pavel@ucw.cz>
Date: Thu, 31 Aug 2017 16:47:43 +0200

> Dave, Linus -- can you still take the patch?

Pavel, please do not bypass maintainers like this.

It's really rude, and if you do things like that instead of
trying to work properly with us, your relationship with
these maintainers will suffer in the long term.

Thank you.

^ permalink raw reply

* Re: tip -ENOBOOT - bisected to locking/refcounts, x86/asm: Implement fast refcount overflow protection
From: Mike Galbraith @ 2017-08-31 17:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Peter Zijlstra, LKML, Ingo Molnar,
	Reshetova, Elena, Network Development
In-Reply-To: <CAGXu5jL8Ueuw-DLY+Hz8Jd7NgqgyaaN8OVDicFxmcjBhScebOA@mail.gmail.com>

On Thu, 2017-08-31 at 10:00 -0700, Kees Cook wrote:
> 
> Oh! So it's gcc-version sensitive? That's alarming. Is this mapping correct:
> 
> 4.8.5: WARN, eventual kernel hang
> 6.3.1, 7.0.1: WARN, but continues working

Yeah, that's correct.  I find that troubling, simply because this gcc
version has been through one hell of a lot of kernels with me.  Yeah, I
know, that doesn't exempt it from having bugs, but color me suspicious.

	-Mike

^ permalink raw reply

* [PATCH][next] net/mlx4_core: fix incorrect size allocation for dev->caps.spec_qps
From: Colin King @ 2017-08-31 17:07 UTC (permalink / raw)
  To: Tariq Toukan, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: kernel-janitors-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

From: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The current allocation for dev->caps.spec_qps is for the size of the
pointer and not the size of the actual  mlx4_spec_qps structure.  Fix
this by using the correct size.   Also splint allocation over a few
lines to make it cppcheck clean on overly wide lines.

Detected by CoverityScan, CID#1455222 ("Wrong sizeof argument")

Fixes: c73c8b1e47ca ("net/mlx4_core: Dynamically allocate structs at mlx4_slave_cap")
Signed-off-by: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 drivers/net/ethernet/mellanox/mlx4/qp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c
index b16fc441609e..728a2fb1f5c0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -845,8 +845,9 @@ int mlx4_init_qp_table(struct mlx4_dev *dev)
 
 		/* In mfunc, calculate proxy and tunnel qp offsets for the PF here,
 		 * since the PF does not call mlx4_slave_caps */
-		dev->caps.spec_qps = kcalloc(dev->caps.num_ports, sizeof(dev->caps.spec_qps), GFP_KERNEL);
-
+		dev->caps.spec_qps = kcalloc(dev->caps.num_ports,
+					     sizeof(*dev->caps.spec_qps),
+					     GFP_KERNEL);
 		if (!dev->caps.spec_qps) {
 			err = -ENOMEM;
 			goto err_mem;
-- 
2.14.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* virtio_net: ethtool supported link modes
From: Radu Rendec @ 2017-08-31 17:04 UTC (permalink / raw)
  To: virtualization, netdev, linux-kernel; +Cc: Michael S. Tsirkin, Jason Wang

Hello,

Looking at the code in virtnet_set_link_ksettings, it seems the speed
and duplex can be set to any valid value. The driver will "remember"
them and report them back in virtnet_get_link_ksettings.

However, the supported link modes (link_modes.supported in struct
ethtool_link_ksettings) is always 0, indicating that no speed/duplex
setting is supported.

Does it make more sense to set (at least a few of) the supported link
modes, such as 10baseT_Half ... 10000baseT_Full?

I would expect to see consistency between what is reported in
link_modes.supported and what can actually be set. Could you please
share your opinion on this?

Thank you,
Radu Rendec

^ permalink raw reply

* Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
From: Florian Fainelli @ 2017-08-31 17:03 UTC (permalink / raw)
  To: Marc Gonzalez, David Daney
  Cc: netdev, Geert Uytterhoeven, David Miller, Andrew Lunn,
	Mans Rullgard, Mason
In-Reply-To: <f4bb5ac8-dae8-c0af-7aa6-e546fc0783fa@sigmadesigns.com>

On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
> On 31/08/2017 02:49, Florian Fainelli wrote:
> 
>> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
>> Correctly process PHY_HALTED in phy_stop_machine()") because it is
>> creating the possibility for a NULL pointer dereference.
>>
>> David Daney provide the following call trace and diagram of events:
>>
>> When ndo_stop() is called we call:
>>
>>  phy_disconnect()
>>     +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
> 
> What does this mean?
> 
> On the contrary, phy_stop_interrupts() is only called when *not* polling.
> 
> 	if (phydev->irq > 0)
> 		phy_stop_interrupts(phydev);
> 
>>     +---> phy_stop_machine()
>>     |      +---> phy_state_machine()
>>     |              +----> queue_delayed_work(): Work queued.
> 
> You're referring to the fact that, at the end of phy_state_machine()
> (in polling mode) the code reschedules itself through:
> 
> 	if (phydev->irq == PHY_POLL)
> 		queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, PHY_STATE_TIME * HZ);
> 
>>     +--->phy_detach() implies: phydev->attached_dev = NULL;
>>
>> Now at a later time the queued work does:
>>
>>  phy_state_machine()
>>     +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
> 
> I tested a sequence of 500 link up / link down in polling mode,
> and saw no such issue. Race condition?
> 
> For what case in phy_state_machine() is netif_carrier_off()
> being called? Surely not PHY_HALTED?
> 
> 
>> The original motivation for this change originated from Marc Gonzales
>> indicating that his network driver did not have its adjust_link callback
>> executing with phydev->link = 0 while he was expecting it.
> 
> I expect the core to call phy_adjust_link() for link changes.
> This used to work back in 3.4 and was broken somewhere along
> the way.

If that was working correctly in 3.4 surely we can look at the diff and
figure out what changed, even maybe find the offending commit, can you
do that?

> 
>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>> just tells the workqueue to move into PHY_HALTED state which will happen
>> asynchronously.
> 
> My original proposal was to fix the issue in the driver.
> I'll try locating it in my archives.

Yes I remember you telling that, by the way I don't think you ever
provided a clear explanation why this is absolutely necessary for your
driver though?
-- 
Florian

^ permalink raw reply

* Re: tip -ENOBOOT - bisected to locking/refcounts, x86/asm: Implement fast refcount overflow protection
From: Kees Cook @ 2017-08-31 17:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: David S. Miller, Peter Zijlstra, LKML, Ingo Molnar,
	Reshetova, Elena, Network Development
In-Reply-To: <1504187918.27500.16.camel@gmx.de>

On Thu, Aug 31, 2017 at 6:58 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Wed, 2017-08-30 at 21:10 -0700, Kees Cook wrote:
>> On Wed, Aug 30, 2017 at 9:01 PM, Kees Cook <keescook@chromium.org> wrote:
>> > On Wed, Aug 30, 2017 at 8:12 PM, Mike Galbraith <efault@gmx.de> wrote:
>> >> On Wed, 2017-08-30 at 19:27 -0700, Kees Cook wrote:
>> >>
>> >>> Interesting! Can you try with 633547973ffc3 ("net: convert
>> >>> sk_buff.users from atomic_t to refcount_t") reverted? I'll see if
>> >>> running haveged will help me trigger this on my system...
>> >>
>> >> With that (plus 230cd1279d001 fix to it) reverted, vbox boots.
>> >
>> > Wonderful! Thank you so much for helping track this down.
>> >
>> > So, it seems that sk_buff.users will need some more special attention
>> > before we can convert it to refcount.
>> >
>> > x86-refcount will saturate with refcount_dec_and_test() if the result
>> > is negative. But that would mean at least starting at 0. FULL should
>> > have WARNed in this case, so I remain slightly confused why it was
>> > missed by FULL.
>>
>> Actually, if this is a race condition it's possible that FULL is slow
>> enough to miss it...
>>
>> I bet something briefly takes the refcount negative, and with
>> unchecked atomics, it come back up positive again during the race.
>> FULL may miss the race, and x86-refcount will catch it and saturate...
>
> (gdb) list *in6_dev_get+0x1e
> 0xffffffff8166d3de is in in6_dev_get (./arch/x86/include/asm/refcount.h:52).
> 47                      : "cc", "cx");
> 48      }
> 49
> 50      static __always_inline void refcount_inc(refcount_t *r)
> 51      {
> 52              asm volatile(LOCK_PREFIX "incl %0\n\t"
> 53                      REFCOUNT_CHECK_LT_ZERO
> 54                      : [counter] "+m" (r->refs.counter)
> 55                      : : "cc", "cx");
> 56
>
> gdb) list *in6_dev_get+0x10
> 0xffffffff8166d3d0 is in in6_dev_get (./include/net/addrconf.h:318).
> 313     {
> 314             struct inet6_dev *idev;
> 315
> 316             rcu_read_lock();
> 317             idev = rcu_dereference(dev->ip6_ptr);
> 318             if (idev)
> 319                     refcount_inc(&idev->refcnt);
> 320             rcu_read_unlock();
> 321             return idev;
> 322
>
> That's from kernel with no revert, but your silent saturation patch
> still applied, AND built with gcc-6.3.1.  Kernel traps, but it boots
> and works, as does kernel built with gcc-7.0.1.  Remove your silent
> saturation patch, kernel doesn't notice a thing, just works.
>
> With gcc-4.8.5, trap means you're as good as dead, with the other two,
> trap means the intended.  Compiler, constraints, dark elves.. pick one.

Oh! So it's gcc-version sensitive? That's alarming. Is this mapping correct:

4.8.5: WARN, eventual kernel hang
6.3.1, 7.0.1: WARN, but continues working

> Full first splat from bootable gcc-6.3.1 built kernel.
>
> [    1.293962] NET: Registered protocol family 10
> [    1.294635] refcount_t silent saturation at in6_dev_get+0x25/0x104 in swapper/0[1], uid/euid: 0/0

That's an _increment_ saturation? Which means the result must be
negative, so it started from least -2.

> [    1.295616] ------------[ cut here ]------------
> [    1.296120] WARNING: CPU: 0 PID: 1 at kernel/panic.c:612 refcount_error_report+0x94/0x9e
> [    1.296950] Modules linked in:
> [    1.297276] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0.g152d54a-tip-default #53
> [    1.299179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [    1.300743] task: ffff88013ab84040 task.stack: ffffc9000062c000
> [    1.301825] RIP: 0010:refcount_error_report+0x94/0x9e
> [    1.302804] RSP: 0018:ffffc9000062fc10 EFLAGS: 00010282
> [    1.303791] RAX: 0000000000000055 RBX: ffffffff81a34274 RCX: ffffffff81c605e8
> [    1.304991] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000246
> [    1.306189] RBP: ffffc9000062fd58 R08: 0000000000000000 R09: 0000000000000175
> [    1.307392] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88013ab84040
> [    1.308583] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff81a256c8
> [    1.309768] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
> [    1.311052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.312100] CR2: 00007f4631fe8df0 CR3: 0000000137d09003 CR4: 00000000001606f0
> [    1.313301] Call Trace:
> [    1.314012]  ex_handler_refcount+0x63/0x70
> [    1.314893]  fixup_exception+0x32/0x40
> [    1.315737]  do_trap+0x8c/0x170
> [    1.316519]  do_error_trap+0x70/0xd0
> [    1.317340]  ? in6_dev_get+0x23/0x104
> [    1.318172]  ? netlink_broadcast_filtered+0x2bd/0x430
> [    1.319156]  ? kmem_cache_alloc_trace+0xce/0x5d0
> [    1.320098]  ? set_debug_rodata+0x11/0x11
> [    1.320964]  invalid_op+0x1e/0x30
> [    1.322520] RIP: 0010:in6_dev_get+0x25/0x104
> [    1.323631] RSP: 0018:ffffc9000062fe00 EFLAGS: 00010202
> [    1.324614] RAX: ffff880137de2400 RBX: ffff880137df4600 RCX: ffff880137de24f0
> [    1.325793] RDX: ffff88013a5e4000 RSI: 00000000fffffe00 RDI: ffff88013a5e4000
> [    1.326964] RBP: 00000000000000d1 R08: 0000000000000000 R09: ffff880137de7600
> [    1.328150] R10: 0000000000000000 R11: ffff8801398a4df8 R12: 0000000000000000
> [    1.329374] R13: ffffffff82137872 R14: 014200ca00000000 R15: 0000000000000000
> [    1.330547]  ? set_debug_rodata+0x11/0x11
> [    1.331392]  ip6_route_init_special_entries+0x2a/0x89
> [    1.332369]  addrconf_init+0x9e/0x203
> [    1.333173]  inet6_init+0x1af/0x365
> [    1.333956]  ? af_unix_init+0x4e/0x4e
> [    1.334753]  do_one_initcall+0x4e/0x190
> [    1.335555]  ? set_debug_rodata+0x11/0x11
> [    1.336369]  kernel_init_freeable+0x189/0x20e
> [    1.337230]  ? rest_init+0xd0/0xd0
> [    1.337999]  kernel_init+0xa/0xf7
> [    1.338744]  ret_from_fork+0x25/0x30
> [    1.339500] Code: 48 8b 95 80 00 00 00 41 55 49 8d 8c 24 f0 0a 00 00 45 8b 84 24 10 09 00 00 41 89 c1 48 89 de 48 c7 c7 60 7a a3 81 e8 07 de 05 00 <0f> ff 58 5b 5d 41 5c 41 5d c3 0f 1f 44 00 00 55 48 89 e5 41 56
> [    1.342243] ---[ end trace b5d40c0fccce776c ]---

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* [PATCH net-next v5 0/2] report TCP MD5 signing keys and addresses
From: Ivan Delalande @ 2017-08-31 16:59 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Ivan Delalande

Allow userspace to retrieve MD5 signature keys and addresses configured
on TCP sockets through inet_diag.

Thanks to Eric Dumazet and Stephen Hemminger for their useful
explanations and feedback.

v5: - memset the whole netlink payload after it has been nla_reserve-d
      in tcp_diag_put_md5sig (a third memset had to be added for
      tcpm_key so we might as well have just one for entire region).
    - move the nla_total_size call from inet_sk_attr_size to the
      idiag_get_aux_size defined by protocols as they could add multiple
      netlink attributes,
    - add check for net_admin in tcp_diag_get_aux_size.

v4: - add new struct tcp_diag_md5sig to report the data instead of
      tcp_md5sig to avoid wasting 112 bytes on every tcpm_addr,
    - memset tcpm_addr on IPv4 addresses to avoid leaks,
    - style fix in inet_diag_dump_one_icsk.

v3: - rename inet_diag_*md5sig in tcp_diag.c to tcp_diag_* for
      consistency,
    - don't lock the socket in tcp_diag_put_md5sig,
    - add checks on md5sig_count in tcp_diag_put_md5sig to not create
      the netlink attribute if the list is empty, and to avoid overflows
      or memory leaks if the list has changed in the meantime.

v2: - move changes to tcp_diag.c and extend inet_diag_handler to allow
      protocols to provide additional data on INET_DIAG_INFO,
    - lock socket before calling tcp_diag_put_md5sig.


I also have a patch for iproute2/ss to test this change, making it print
this new attribute. I'm planning to polish and send it if this series
gets applied.


Ivan Delalande (2):
  inet_diag: allow protocols to provide additional data
  tcp_diag: report TCP MD5 signing keys and addresses

 include/linux/inet_diag.h      |   7 +++
 include/uapi/linux/inet_diag.h |   1 +
 include/uapi/linux/tcp.h       |   9 ++++
 net/ipv4/inet_diag.c           |  22 +++++++--
 net/ipv4/tcp_diag.c            | 109 ++++++++++++++++++++++++++++++++++++++---
 5 files changed, 138 insertions(+), 10 deletions(-)

-- 
2.14.1

^ permalink raw reply

* [PATCH net-next v5 2/2] tcp_diag: report TCP MD5 signing keys and addresses
From: Ivan Delalande @ 2017-08-31 16:59 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Ivan Delalande
In-Reply-To: <20170831165939.5121-1-colona@arista.com>

Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
not possible to retrieve these from the kernel once they have been
configured on sockets.

Signed-off-by: Ivan Delalande <colona@arista.com>
---
 include/uapi/linux/inet_diag.h |   1 +
 include/uapi/linux/tcp.h       |   9 ++++
 net/ipv4/tcp_diag.c            | 109 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 113 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 678496897a68..f52ff62bfabe 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -143,6 +143,7 @@ enum {
 	INET_DIAG_MARK,
 	INET_DIAG_BBRINFO,
 	INET_DIAG_CLASS_ID,
+	INET_DIAG_MD5SIG,
 	__INET_DIAG_MAX,
 };
 
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 030e594bab45..15c25eccab2b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -256,4 +256,13 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];		/* key (binary) */
 };
 
+/* INET_DIAG_MD5SIG */
+struct tcp_diag_md5sig {
+	__u8	tcpm_family;
+	__u8	tcpm_prefixlen;
+	__u16	tcpm_keylen;
+	__be32	tcpm_addr[4];
+	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];
+};
+
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index a748c74aa8b7..abbf0edcf6c2 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -16,6 +16,7 @@
 
 #include <linux/tcp.h>
 
+#include <net/netlink.h>
 #include <net/tcp.h>
 
 static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
@@ -36,6 +37,100 @@ static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
 		tcp_get_info(sk, info);
 }
 
+#ifdef CONFIG_TCP_MD5SIG
+static void tcp_diag_md5sig_fill(struct tcp_diag_md5sig *info,
+				 const struct tcp_md5sig_key *key)
+{
+	info->tcpm_family = key->family;
+	info->tcpm_prefixlen = key->prefixlen;
+	info->tcpm_keylen = key->keylen;
+	memcpy(info->tcpm_key, key->key, key->keylen);
+
+	if (key->family == AF_INET)
+		info->tcpm_addr[0] = key->addr.a4.s_addr;
+	#if IS_ENABLED(CONFIG_IPV6)
+	else if (key->family == AF_INET6)
+		memcpy(&info->tcpm_addr, &key->addr.a6,
+		       sizeof(info->tcpm_addr));
+	#endif
+}
+
+static int tcp_diag_put_md5sig(struct sk_buff *skb,
+			       const struct tcp_md5sig_info *md5sig)
+{
+	const struct tcp_md5sig_key *key;
+	struct tcp_diag_md5sig *info;
+	struct nlattr *attr;
+	int md5sig_count = 0;
+
+	hlist_for_each_entry_rcu(key, &md5sig->head, node)
+		md5sig_count++;
+	if (md5sig_count == 0)
+		return 0;
+
+	attr = nla_reserve(skb, INET_DIAG_MD5SIG,
+			   md5sig_count * sizeof(struct tcp_diag_md5sig));
+	if (!attr)
+		return -EMSGSIZE;
+
+	info = nla_data(attr);
+	memset(info, 0, md5sig_count * sizeof(struct tcp_diag_md5sig));
+	hlist_for_each_entry_rcu(key, &md5sig->head, node) {
+		tcp_diag_md5sig_fill(info++, key);
+		if (--md5sig_count == 0)
+			break;
+	}
+
+	return 0;
+}
+#endif
+
+static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
+			    struct sk_buff *skb)
+{
+#ifdef CONFIG_TCP_MD5SIG
+	if (net_admin) {
+		struct tcp_md5sig_info *md5sig;
+		int err = 0;
+
+		rcu_read_lock();
+		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+		if (md5sig)
+			err = tcp_diag_put_md5sig(skb, md5sig);
+		rcu_read_unlock();
+		if (err < 0)
+			return err;
+	}
+#endif
+
+	return 0;
+}
+
+static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
+{
+	size_t size = 0;
+
+#ifdef CONFIG_TCP_MD5SIG
+	if (net_admin && sk_fullsock(sk)) {
+		const struct tcp_md5sig_info *md5sig;
+		const struct tcp_md5sig_key *key;
+		size_t md5sig_count = 0;
+
+		rcu_read_lock();
+		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+		if (md5sig) {
+			hlist_for_each_entry_rcu(key, &md5sig->head, node)
+				md5sig_count++;
+		}
+		rcu_read_unlock();
+		size += nla_total_size(md5sig_count *
+				       sizeof(struct tcp_diag_md5sig));
+	}
+#endif
+
+	return size;
+}
+
 static void tcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r, struct nlattr *bc)
 {
@@ -68,13 +163,15 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 #endif
 
 static const struct inet_diag_handler tcp_diag_handler = {
-	.dump		 = tcp_diag_dump,
-	.dump_one	 = tcp_diag_dump_one,
-	.idiag_get_info	 = tcp_diag_get_info,
-	.idiag_type	 = IPPROTO_TCP,
-	.idiag_info_size = sizeof(struct tcp_info),
+	.dump			= tcp_diag_dump,
+	.dump_one		= tcp_diag_dump_one,
+	.idiag_get_info		= tcp_diag_get_info,
+	.idiag_get_aux		= tcp_diag_get_aux,
+	.idiag_get_aux_size	= tcp_diag_get_aux_size,
+	.idiag_type		= IPPROTO_TCP,
+	.idiag_info_size	= sizeof(struct tcp_info),
 #ifdef CONFIG_INET_DIAG_DESTROY
-	.destroy	 = tcp_diag_destroy,
+	.destroy		= tcp_diag_destroy,
 #endif
 };
 
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v5 1/2] inet_diag: allow protocols to provide additional data
From: Ivan Delalande @ 2017-08-31 16:59 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Ivan Delalande
In-Reply-To: <20170831165939.5121-1-colona@arista.com>

Extend inet_diag_handler to allow individual protocols to report
additional data on INET_DIAG_INFO through idiag_get_aux. The size
can be dynamic and is computed by idiag_get_aux_size.

Signed-off-by: Ivan Delalande <colona@arista.com>
---
 include/linux/inet_diag.h |  7 +++++++
 net/ipv4/inet_diag.c      | 22 ++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index 65da430e260f..ee251c585854 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -25,6 +25,13 @@ struct inet_diag_handler {
 					  struct inet_diag_msg *r,
 					  void *info);
 
+	int		(*idiag_get_aux)(struct sock *sk,
+					 bool net_admin,
+					 struct sk_buff *skb);
+
+	size_t		(*idiag_get_aux_size)(struct sock *sk,
+					      bool net_admin);
+
 	int		(*destroy)(struct sk_buff *in_skb,
 				   const struct inet_diag_req_v2 *req);
 
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 67325d5832d7..c9c35b61a027 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -93,8 +93,17 @@ void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(inet_diag_msg_common_fill);
 
-static size_t inet_sk_attr_size(void)
+static size_t inet_sk_attr_size(struct sock *sk,
+				const struct inet_diag_req_v2 *req,
+				bool net_admin)
 {
+	const struct inet_diag_handler *handler;
+	size_t aux = 0;
+
+	handler = inet_diag_table[req->sdiag_protocol];
+	if (handler && handler->idiag_get_aux_size)
+		aux = handler->idiag_get_aux_size(sk, net_admin);
+
 	return	  nla_total_size(sizeof(struct tcp_info))
 		+ nla_total_size(1) /* INET_DIAG_SHUTDOWN */
 		+ nla_total_size(1) /* INET_DIAG_TOS */
@@ -105,6 +114,7 @@ static size_t inet_sk_attr_size(void)
 		+ nla_total_size(SK_MEMINFO_VARS * sizeof(u32))
 		+ nla_total_size(TCP_CA_NAME_MAX)
 		+ nla_total_size(sizeof(struct tcpvegas_info))
+		+ aux
 		+ 64;
 }
 
@@ -260,6 +270,10 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 
 	handler->idiag_get_info(sk, r, info);
 
+	if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux)
+		if (handler->idiag_get_aux(sk, net_admin, skb) < 0)
+			goto errout;
+
 	if (sk->sk_state < TCP_TIME_WAIT) {
 		union tcp_cc_info info;
 		size_t sz = 0;
@@ -449,6 +463,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
 			    const struct nlmsghdr *nlh,
 			    const struct inet_diag_req_v2 *req)
 {
+	bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN);
 	struct net *net = sock_net(in_skb->sk);
 	struct sk_buff *rep;
 	struct sock *sk;
@@ -458,7 +473,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
 	if (IS_ERR(sk))
 		return PTR_ERR(sk);
 
-	rep = nlmsg_new(inet_sk_attr_size(), GFP_KERNEL);
+	rep = nlmsg_new(inet_sk_attr_size(sk, req, net_admin), GFP_KERNEL);
 	if (!rep) {
 		err = -ENOMEM;
 		goto out;
@@ -467,8 +482,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
 	err = sk_diag_fill(sk, rep, req,
 			   sk_user_ns(NETLINK_CB(in_skb).sk),
 			   NETLINK_CB(in_skb).portid,
-			   nlh->nlmsg_seq, 0, nlh,
-			   netlink_net_capable(in_skb, CAP_NET_ADMIN));
+			   nlh->nlmsg_seq, 0, nlh, net_admin);
 	if (err < 0) {
 		WARN_ON(err == -EMSGSIZE);
 		nlmsg_free(rep);
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
From: Florian Fainelli @ 2017-08-31 16:57 UTC (permalink / raw)
  To: David Daney, Marc Gonzalez
  Cc: netdev, Geert Uytterhoeven, David Miller, Andrew Lunn,
	Mans Rullgard, Mason
In-Reply-To: <e24693e8-d8ae-188a-2a38-c9a83fdc94e3@gmail.com>

On 08/31/2017 09:36 AM, David Daney wrote:
> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>
>>> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
>>> Correctly process PHY_HALTED in phy_stop_machine()") because it is
>>> creating the possibility for a NULL pointer dereference.
>>>
>>> David Daney provide the following call trace and diagram of events:
>>>
>>> When ndo_stop() is called we call:
>>>
>>>   phy_disconnect()
>>>      +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
>>
>> What does this mean?
> 
> I meant that after the call to phy_stop_interrupts(), phydev->irq =
> PHY_POLL;
> 
> 
>>
>> On the contrary, phy_stop_interrupts() is only called when *not* polling.
> 
> That is the case I have.  We are using interrupts from the phy.
> 
> 
>>
>>     if (phydev->irq > 0)
>>         phy_stop_interrupts(phydev);
>>
>>>      +---> phy_stop_machine()
>>>      |      +---> phy_state_machine()
>>>      |              +----> queue_delayed_work(): Work queued.
>>
>> You're referring to the fact that, at the end of phy_state_machine()
>> (in polling mode) the code reschedules itself through:
>>
>>     if (phydev->irq == PHY_POLL)
>>         queue_delayed_work(system_power_efficient_wq,
>> &phydev->state_queue, PHY_STATE_TIME * HZ);
> 
> Exactly.  The call to phy_disconnect() ensures that there are no more
> interrupts and also that phydev->irq = PHY_POLL
> 
> The call to cancel_delayed_work_sync() at the top of phy_stop_machine()
> was meant to ensure that phy_state_machine() was never run again.  No
> interrupts + no queued work means that it should be save to do...
> 
>>
>>>      +--->phy_detach() implies: phydev->attached_dev = NULL;
> 
> The problem is that by calling phy_state_machine() again (which the
> offending patch added) we now have work scheduled that will try to
> dereference the pointer that was set to NULL as a result of the
> phy_detach()

And the race is between phy_detach() setting phydev->attached_dev = NULL
and phy_state_machine() running in PHY_HALTED state and calling
netif_carrier_off().

> 
> 
>>>
>>> Now at a later time the queued work does:
>>>
>>>   phy_state_machine()
>>>      +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
>>
>> I tested a sequence of 500 link up / link down in polling mode,
>> and saw no such issue. Race condition?
>>
> 
> You were lucky.

I too tested this a number of times on a 2 core and 4 core system, but
the race is there, both of us just were lucky enough we did not see any
crash. I suspect the race is easier to reproduce on a (at least 12 core)
system with possibly a higher clock speed.

> 
>> For what case in phy_state_machine() is netif_carrier_off()
>> being called? Surely not PHY_HALTED?
>>
> 
> The phy can be in a variety of states.  It is connected to something
> outside of the system that we don't control, so you cannot assume any
> particular state.  We must have code that doesn't crash the system no
> matter what state the phy is in.
> 
> I suspect, but have not checked, that the phy is in PHY_RUNNING.  I
> think that means that because this patch turned the state machine back
> on, it will start transitioning through PHY_UP, PHY_AN, ... and
> eventually get to the crash we see because phydev->attached_dev = NULL

I actually think the PHY remains in PHY_HALTED but just re-schedules
itself and keeps being in PHY_HALTED again until a call to phy_resume or
phy_start() moves it back to another state. This is largely inefficient,
and we should look into using the patch I posted yesterday which would
prevent a re-schedule when moved to PHY_HALTED:

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index d0626bf5c540..78168e19bd5d 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1234,7 +1234,7 @@ void phy_state_machine(struct work_struct *work)
         * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
         * between states from phy_mac_interrupt()
         */
-       if (phydev->irq == PHY_POLL)
+       if (phydev->irq == PHY_POLL && phydev->state != PHY_HALTED)
                queue_delayed_work(system_power_efficient_wq,
&phydev->state_queue,
                                   PHY_STATE_TIME * HZ);
 }



-- 
Florian

^ permalink raw reply related

* Re: [PATCH net] ipv4: Don't override return code from ip_route_input_noref()
From: Wei Wang @ 2017-08-31 16:39 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David S . Miller, Linux Kernel Network Developers,
	Sabrina Dubroca
In-Reply-To: <c4d8970306e614ac139cf2f33ab5e6bb73b30154.1504194307.git.sbrivio@redhat.com>

> After ip_route_input() calls ip_route_input_noref(), another
> check on skb_dst() is done, but if this fails, we shouldn't
> override the return code from ip_route_input_noref(), as it
> could have been more specific (i.e. -EHOSTUNREACH).
>
> This also saves one call to skb_dst_force_safe() and one to
> skb_dst() in case the ip_route_input_noref() check fails.
>
> Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
> Fixes: ad65a2f05695 ("ipv4: call dst_hold_safe() properly")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Acked-by: Wei Wang <weiwan@google.com>


On Thu, Aug 31, 2017 at 9:11 AM, Stefano Brivio <sbrivio@redhat.com> wrote:
> After ip_route_input() calls ip_route_input_noref(), another
> check on skb_dst() is done, but if this fails, we shouldn't
> override the return code from ip_route_input_noref(), as it
> could have been more specific (i.e. -EHOSTUNREACH).
>
> This also saves one call to skb_dst_force_safe() and one to
> skb_dst() in case the ip_route_input_noref() check fails.
>
> Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
> Fixes: ad65a2f05695 ("ipv4: call dst_hold_safe() properly")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  include/net/route.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/route.h b/include/net/route.h
> index cb0a76d9dde1..1b09a9368c68 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -189,10 +189,11 @@ static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
>
>         rcu_read_lock();
>         err = ip_route_input_noref(skb, dst, src, tos, devin);
> -       if (!err)
> +       if (!err) {
>                 skb_dst_force_safe(skb);
> -       if (!skb_dst(skb))
> -               err = -EINVAL;
> +               if (!skb_dst(skb))
> +                       err = -EINVAL;
> +       }
>         rcu_read_unlock();
>
>         return err;
> --
> 2.9.4
>

^ permalink raw reply

* Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
From: David Daney @ 2017-08-31 16:36 UTC (permalink / raw)
  To: Marc Gonzalez, Florian Fainelli
  Cc: netdev, Geert Uytterhoeven, David Miller, Andrew Lunn,
	Mans Rullgard, Mason
In-Reply-To: <f4bb5ac8-dae8-c0af-7aa6-e546fc0783fa@sigmadesigns.com>

On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
> On 31/08/2017 02:49, Florian Fainelli wrote:
> 
>> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
>> Correctly process PHY_HALTED in phy_stop_machine()") because it is
>> creating the possibility for a NULL pointer dereference.
>>
>> David Daney provide the following call trace and diagram of events:
>>
>> When ndo_stop() is called we call:
>>
>>   phy_disconnect()
>>      +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
> 
> What does this mean?

I meant that after the call to phy_stop_interrupts(), phydev->irq = 
PHY_POLL;


> 
> On the contrary, phy_stop_interrupts() is only called when *not* polling.

That is the case I have.  We are using interrupts from the phy.


> 
> 	if (phydev->irq > 0)
> 		phy_stop_interrupts(phydev);
> 
>>      +---> phy_stop_machine()
>>      |      +---> phy_state_machine()
>>      |              +----> queue_delayed_work(): Work queued.
> 
> You're referring to the fact that, at the end of phy_state_machine()
> (in polling mode) the code reschedules itself through:
> 
> 	if (phydev->irq == PHY_POLL)
> 		queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, PHY_STATE_TIME * HZ);

Exactly.  The call to phy_disconnect() ensures that there are no more 
interrupts and also that phydev->irq = PHY_POLL

The call to cancel_delayed_work_sync() at the top of phy_stop_machine() 
was meant to ensure that phy_state_machine() was never run again.  No 
interrupts + no queued work means that it should be save to do...

> 
>>      +--->phy_detach() implies: phydev->attached_dev = NULL;

The problem is that by calling phy_state_machine() again (which the 
offending patch added) we now have work scheduled that will try to 
dereference the pointer that was set to NULL as a result of the phy_detach()


>>
>> Now at a later time the queued work does:
>>
>>   phy_state_machine()
>>      +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
> 
> I tested a sequence of 500 link up / link down in polling mode,
> and saw no such issue. Race condition?
> 

You were lucky.

> For what case in phy_state_machine() is netif_carrier_off()
> being called? Surely not PHY_HALTED?
> 

The phy can be in a variety of states.  It is connected to something 
outside of the system that we don't control, so you cannot assume any 
particular state.  We must have code that doesn't crash the system no 
matter what state the phy is in.

I suspect, but have not checked, that the phy is in PHY_RUNNING.  I 
think that means that because this patch turned the state machine back 
on, it will start transitioning through PHY_UP, PHY_AN, ... and 
eventually get to the crash we see because phydev->attached_dev = NULL


> 
>> The original motivation for this change originated from Marc Gonzales
>> indicating that his network driver did not have its adjust_link callback
>> executing with phydev->link = 0 while he was expecting it.
> 
> I expect the core to call phy_adjust_link() for link changes.
> This used to work back in 3.4 and was broken somewhere along
> the way.
> 
>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>> just tells the workqueue to move into PHY_HALTED state which will happen
>> asynchronously.
> 
> My original proposal was to fix the issue in the driver.
> I'll try locating it in my archives.
> 
> Regards.
> 

^ permalink raw reply

* [PATCH][next] net/mlx4_core: fix memory leaks on error exit path
From: Colin King @ 2017-08-31 16:30 UTC (permalink / raw)
  To: Tariq Toukan, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: kernel-janitors-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

From: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The structures hca_param and func_cap are not being kfree'd on an error
exit path causing two memory leaks. Fix this by jumping to the existing
free memory error exit path.

Detected by CoverityScan, CID#1455219, CID#1455224 ("Resource Leak")

Fixes: c73c8b1e47ca ("net/mlx4_core: Dynamically allocate structs at mlx4_slave_cap")
Signed-off-by: Colin Ian King <colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 1c92101b3ec2..d46f3283ec36 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -977,7 +977,8 @@ static int mlx4_slave_cap(struct mlx4_dev *dev)
 	if (dev->caps.num_ports > MLX4_MAX_PORTS) {
 		mlx4_err(dev, "HCA has %d ports, but we only support %d, aborting\n",
 			 dev->caps.num_ports, MLX4_MAX_PORTS);
-		return -ENODEV;
+		err = -ENODEV;
+		goto free_mem;
 	}
 
 	mlx4_replace_zero_macs(dev);
-- 
2.14.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare
From: Michal Kubecek @ 2017-08-31 16:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: liujian56, netdev, Florian Westphal
In-Reply-To: <150417481955.28907.15567119824187929000.stgit@firesoul>

On Thu, Aug 31, 2017 at 12:20:19PM +0200, Jesper Dangaard Brouer wrote:
> To: Liujian can you please test this patch?
>  I want to understand if using __percpu_counter_compare() solves
>  the problem correctness wise (even-though this will be slower
>  than using a simple atomic_t on your big system).
> 
> Fix bug in fragmentation codes use of the percpu_counter API, that
> cause issues on systems with many CPUs.
> 
> The frag_mem_limit() just reads the global counter (fbc->count),
> without considering other CPUs can have upto batch size (130K) that
> haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
> this become dangerous at >=24 CPUs (3*1024*1024/130000=24).
> 
> The __percpu_counter_compare() does the right thing, and takes into
> account the number of (online) CPUs and batch size, to account for
> this and call __percpu_counter_sum() when needed.
> 
> On systems with many CPUs this will unfortunately always result in the
> heavier fully locked __percpu_counter_sum() which touch the
> per_cpu_ptr of all (online) CPUs.
> 
> On systems with a smaller number of CPUs this solution is also not
> optimal, because __percpu_counter_compare()/__percpu_counter_sum()
> doesn't help synchronize the global counter.
>  Florian Westphal have an idea of adding some counter sync points,
> which should help address this issue.
> ---
>  include/net/inet_frag.h  |   16 ++++++++++++++--
>  net/ipv4/inet_fragment.c |    6 +++---
>  2 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index 6fdcd2427776..b586e320783d 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -147,9 +147,21 @@ static inline bool inet_frag_evicting(struct inet_frag_queue *q)
>   */
>  static unsigned int frag_percpu_counter_batch = 130000;
>  
> -static inline int frag_mem_limit(struct netns_frags *nf)
> +static inline bool frag_mem_over_limit(struct netns_frags *nf, int thresh)
>  {
> -	return percpu_counter_read(&nf->mem);
> +	/* When reading counter here, __percpu_counter_compare() call
> +	 * will invoke __percpu_counter_sum() when needed.  Which
> +	 * depend on num_online_cpus()*batch size, as each CPU can
> +	 * potentential can hold a batch count.
> +	 *
> +	 * With many CPUs this heavier sum operation will
> +	 * unfortunately always occur.
> +	 */
> +	if (__percpu_counter_compare(&nf->mem, thresh,
> +				     frag_percpu_counter_batch) > 0)
> +		return true;
> +	else
> +		return false;
>  }
>  
>  static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 96e95e83cc61..ee2cf56900e6 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct inet_frags *f)
>  static bool inet_fragq_should_evict(const struct inet_frag_queue *q)
>  {
>  	return q->net->low_thresh == 0 ||
> -	       frag_mem_limit(q->net) >= q->net->low_thresh;
> +		frag_mem_over_limit(q->net, q->net->low_thresh);
>  }
>  
>  static unsigned int
> @@ -355,7 +355,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
>  {
>  	struct inet_frag_queue *q;
>  
> -	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
> +	if (!nf->high_thresh || frag_mem_over_limit(nf, nf->high_thresh)) {
>  		inet_frag_schedule_worker(f);
>  		return NULL;
>  	}

If we go this way (which would IMHO require some benchmarks to make sure
it doesn't harm performance too much) we can drop the explicit checks
for zero thresholds which were added to work around the unreliability of
fast checks of percpu counters (or at least the second one was by commit
30759219f562 ("net: disable fragment reassembly if high_thresh is zero").
 
Michal Kubecek

> @@ -396,7 +396,7 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
>  	struct inet_frag_queue *q;
>  	int depth = 0;
>  
> -	if (frag_mem_limit(nf) > nf->low_thresh)
> +	if (frag_mem_over_limit(nf, nf->low_thresh))
>  		inet_frag_schedule_worker(f);
>  
>  	hash &= (INETFRAGS_HASHSZ - 1);
> 

^ permalink raw reply

* Re: [PATCH net-next] samples/bpf: Fix compilation issue in redirect dummy program
From: Daniel Borkmann @ 2017-08-31 16:20 UTC (permalink / raw)
  To: Y Song
  Cc: Jesper Dangaard Brouer, Tariq Toukan, David S. Miller, netdev,
	Eran Ben Elisha, Alexei Starovoitov
In-Reply-To: <CAH3MdRVszTVtjruq=-rh26jBtY9evuv7yBz37Ydb8EWSexL+zQ@mail.gmail.com>

On 08/31/2017 05:54 PM, Y Song wrote:
> On Thu, Aug 31, 2017 at 4:43 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 08/31/2017 01:27 PM, Jesper Dangaard Brouer wrote:
>>> On Thu, 31 Aug 2017 14:16:39 +0300
>>> Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>>> Fix compilation error below:
>>>>
>>>> $ make samples/bpf/
>>>>
>>>> LLVM ERROR: 'xdp_redirect_dummy' label emitted multiple times to
>>>> assembly file
>>>> make[1]: *** [samples/bpf/xdp_redirect_kern.o] Error 1
>>>> make: *** [samples/bpf/] Error 2
>>>>
>>>> Fixes: 306da4e685b4 ("samples/bpf: xdp_redirect load XDP dummy prog on TX
>>>> device")
>>>> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
>>>> ---
>>>
>>> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>
>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>
>>> What LLVM/clang version do you use?
>>>
>>> I don't see this compile error, and I have:
>>>    $ clang --version
>>>    clang version 3.9.1 (tags/RELEA
>>
>> I'm seeing the error as well with a fairly recent LLVM from git
>> tree (6.0.0git-2d810c2).
>>
>> Looks like the llvm error is triggered when section name and
>> the function name for XDP prog is the same. Changing either the
>> function or the section name right above resolves the issue. If
>> such error didn't trigger on older versions, people could be
>> using such naming scheme as done here, so seems to me like a
>> regression on LLVM side we might need to look at ...
>
> Martin fixed a similar bug earlier:
> =====
> commit a2e8bbd2ef5457485f00b6b947bbbfa2778e5b1e
> Author: Martin KaFai Lau <kafai@fb.com>
> Date:   Thu Jun 8 22:30:17 2017 -0700
>
>      bpf: Fix test_obj_id.c for llvm 5.0
>
>      llvm 5.0 does not like the section name and the function name
>      to be the same:
> ...
> =====

Yeah indeed.

> gcc also has this behavior. Section name is treated as global
> and hence cannot collide with a function name...

Okay, so seems at least 3.9.1 treated this slightly different then
where it didn't cause a collision.

^ permalink raw reply

* Re: [PATCH net-next] bridge: add tracepoint in br_fdb_update
From: Jesper Dangaard Brouer @ 2017-08-31 16:20 UTC (permalink / raw)
  To: David Ahern
  Cc: Roopa Prabhu, davem@davemloft.net, netdev@vger.kernel.org,
	Nikolay Aleksandrov, Florian Fainelli, Andrew Lunn, bridge,
	brouer, Arnaldo Carvalho de Melo, Peter Zijlstra
In-Reply-To: <a9349049-bfd7-b6c0-d1c7-2f70b0b0ab11@gmail.com>

On Thu, 31 Aug 2017 09:30:05 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 8/31/17 9:21 AM, Roopa Prabhu wrote:
> > On Thu, Aug 31, 2017 at 5:38 AM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:  
> >> On Wed, 30 Aug 2017 22:18:13 -0700
> >> Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> >>  
> >>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> >>>
> >>> This extends bridge fdb table tracepoints to also cover
> >>> learned fdb entries in the br_fdb_update path. Note that
> >>> unlike other tracepoints I have moved this to when the fdb
> >>> is modified because this is in the datapath and can generate
> >>> a lot of noise in the trace output. br_fdb_update is also called
> >>> from added_by_user context in the NTF_USE case which is already
> >>> traced ..hence the !added_by_user check.
> >>>
> >>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> >>> ---
> >>>  include/trace/events/bridge.h | 31 +++++++++++++++++++++++++++++++
> >>>  net/bridge/br_fdb.c           |  5 ++++-
> >>>  net/core/net-traces.c         |  1 +
> >>>  3 files changed, 36 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/trace/events/bridge.h b/include/trace/events/bridge.h
> >>> index 0f1cde0..1bee3e7 100644
> >>> --- a/include/trace/events/bridge.h
> >>> +++ b/include/trace/events/bridge.h
> >>> @@ -92,6 +92,37 @@ TRACE_EVENT(fdb_delete,
> >>>                 __entry->addr[4], __entry->addr[5], __entry->vid)
> >>>  );
> >>>
> >>> +TRACE_EVENT(br_fdb_update,
> >>> +
> >>> +     TP_PROTO(struct net_bridge *br, struct net_bridge_port *source,
> >>> +              const unsigned char *addr, u16 vid, bool added_by_user),
> >>> +
> >>> +     TP_ARGS(br, source, addr, vid, added_by_user),
> >>> +
> >>> +     TP_STRUCT__entry(
> >>> +             __string(br_dev, br->dev->name)
> >>> +             __string(dev, source->dev->name)  
> >>
> >> I have found that using the device string name is
> >>
> >> (1) slow as it involves strcpy+strlen
> >>
> >>  See [1]+[2] where a single dev-name costed me 16 ns, and the base
> >>  overhead of a bpf attached tracepoint is 25 ns (see [3]).
> >>
> >>  [1] https://git.kernel.org/davem/net-next/c/e7d12ce121a
> >>  [2] https://git.kernel.org/davem/net-next/c/315ec3990ef
> >>  [3] https://git.kernel.org/davem/net-next/c/25d4dae1a64
> >>
> >> (2) strings are also harder to work-with/extract when attaching a bpf_prog
> >>
> >> See the trouble I'm in accessing a dev string here napi:napi_poll here:
> >>  https://github.com/netoptimizer/prototype-kernel/blob/103b955a080/kernel/samples/bpf/napi_monitor_kern.c#L52-L58
> >>
> >> Using ifindex'es in userspace is fairly easy see man if_indextoname(3).
> >>  
> > 
> > Jesper thanks for the data!. GTK. Looking at include/trace/events,
> > currently almost all tracepoints use dev->name.

True, but with my recent experience and benchmarking, I consider this
generally a bad choice we have made for all these tracepoints.  In your
case with 2 strings, 2x16=32ns, you basically introduced a overhead
that is larger that to invocation cost.

> > These bridge tracepoints in context are primarily for debugging fdb
> > updates only, not for every packet and hence not in the performance
> > path.
> > In large scale deployments with thousands of bridge ports and fdb
> > entries, dev->name will definately make it easier to trouble-shoot.
> > So, I did like to leave these with dev->name unless there are strong objections.  
> 
> +1 for user friendliness for debugging tracepoints. The device name is
> also more user friendly when adding filters to the data collection.
>
> Being able to add bpf everywhere certainly changes the game a bit, but
> we should not relinquish ease of use and understanding for the potential
> that someone might want to put a bpf program on the tracepoint and want
> to maintain high performance.

(Cc. Acme and Peterz)
I wonder if we can create a special perf-tracepoint type for ifindex'es
and the tool reading (e.g. perf-script) can perform the name lookup in
userspace (calling if_indextoname(3)) ?

I don't know the perf tools well enough to know if this is possible?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH net] ipv4: Don't override return code from ip_route_input_noref()
From: Stefano Brivio @ 2017-08-31 16:11 UTC (permalink / raw)
  To: David S . Miller, netdev; +Cc: Sabrina Dubroca, Wei Wang

After ip_route_input() calls ip_route_input_noref(), another
check on skb_dst() is done, but if this fails, we shouldn't
override the return code from ip_route_input_noref(), as it
could have been more specific (i.e. -EHOSTUNREACH).

This also saves one call to skb_dst_force_safe() and one to
skb_dst() in case the ip_route_input_noref() check fails.

Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
Fixes: ad65a2f05695 ("ipv4: call dst_hold_safe() properly")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/route.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index cb0a76d9dde1..1b09a9368c68 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -189,10 +189,11 @@ static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
 
 	rcu_read_lock();
 	err = ip_route_input_noref(skb, dst, src, tos, devin);
-	if (!err)
+	if (!err) {
 		skb_dst_force_safe(skb);
-	if (!skb_dst(skb))
-		err = -EINVAL;
+		if (!skb_dst(skb))
+			err = -EINVAL;
+	}
 	rcu_read_unlock();
 
 	return err;
-- 
2.9.4

^ permalink raw reply related

* [patch net-next repost 8/8] mlxsw: spectrum_dpipe: Add support for controlling IPv6 neighbor counters
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

Add support for controlling IPv6 neighbor counters via dpipe.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
index 75da2ef..51e6846 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
@@ -679,8 +679,15 @@ mlxsw_sp_dpipe_table_host_counters_update(struct mlxsw_sp *mlxsw_sp,
 		if (!rif)
 			continue;
 		mlxsw_sp_rif_neigh_for_each(neigh_entry, rif) {
-			if (mlxsw_sp_neigh_entry_type(neigh_entry) != type)
+			int neigh_type = mlxsw_sp_neigh_entry_type(neigh_entry);
+
+			if (neigh_type != type)
+				continue;
+
+			if (neigh_type == AF_INET6 &&
+			    mlxsw_sp_neigh_ipv6_ignore(neigh_entry))
 				continue;
+
 			mlxsw_sp_neigh_entry_counter_update(mlxsw_sp,
 							    neigh_entry,
 							    enable);
@@ -778,6 +785,14 @@ mlxsw_sp_dpipe_table_host6_entries_dump(void *priv, bool counters_enabled,
 						      dump_ctx, AF_INET6);
 }
 
+static int mlxsw_sp_dpipe_table_host6_counters_update(void *priv, bool enable)
+{
+	struct mlxsw_sp *mlxsw_sp = priv;
+
+	mlxsw_sp_dpipe_table_host_counters_update(mlxsw_sp, enable, AF_INET6);
+	return 0;
+}
+
 static u64 mlxsw_sp_dpipe_table_host6_size_get(void *priv)
 {
 	struct mlxsw_sp *mlxsw_sp = priv;
@@ -789,6 +804,7 @@ static struct devlink_dpipe_table_ops mlxsw_sp_host6_ops = {
 	.matches_dump = mlxsw_sp_dpipe_table_host6_matches_dump,
 	.actions_dump = mlxsw_sp_dpipe_table_host_actions_dump,
 	.entries_dump = mlxsw_sp_dpipe_table_host6_entries_dump,
+	.counters_set_update = mlxsw_sp_dpipe_table_host6_counters_update,
 	.size_get = mlxsw_sp_dpipe_table_host6_size_get,
 };
 
-- 
2.9.3

^ permalink raw reply related

* [patch net-next repost 7/8] mlxsw: spectrum_router: Add support for setting counters on IPv6 neighbors
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

Add support for setting counters on IPv6 neighbors based on dpipe's host6
table counter status.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index db57c0c..0cf6810 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1008,21 +1008,33 @@ mlxsw_sp_neigh_entry_remove(struct mlxsw_sp *mlxsw_sp,
 }
 
 static bool
-mlxsw_sp_neigh4_counter_should_alloc(struct mlxsw_sp *mlxsw_sp)
+mlxsw_sp_neigh_counter_should_alloc(struct mlxsw_sp *mlxsw_sp,
+				    struct mlxsw_sp_neigh_entry *neigh_entry)
 {
 	struct devlink *devlink;
+	const char *table_name;
+
+	switch (mlxsw_sp_neigh_entry_type(neigh_entry)) {
+	case AF_INET:
+		table_name = MLXSW_SP_DPIPE_TABLE_NAME_HOST4;
+		break;
+	case AF_INET6:
+		table_name = MLXSW_SP_DPIPE_TABLE_NAME_HOST6;
+		break;
+	default:
+		WARN_ON(1);
+		return false;
+	}
 
 	devlink = priv_to_devlink(mlxsw_sp->core);
-	return devlink_dpipe_table_counter_enabled(devlink,
-						   MLXSW_SP_DPIPE_TABLE_NAME_HOST4);
+	return devlink_dpipe_table_counter_enabled(devlink, table_name);
 }
 
 static void
 mlxsw_sp_neigh_counter_alloc(struct mlxsw_sp *mlxsw_sp,
 			     struct mlxsw_sp_neigh_entry *neigh_entry)
 {
-	if (mlxsw_sp_neigh_entry_type(neigh_entry) != AF_INET ||
-	    !mlxsw_sp_neigh4_counter_should_alloc(mlxsw_sp))
+	if (!mlxsw_sp_neigh_counter_should_alloc(mlxsw_sp, neigh_entry))
 		return;
 
 	if (mlxsw_sp_flow_counter_alloc(mlxsw_sp, &neigh_entry->counter_index))
-- 
2.9.3

^ permalink raw reply related

* [patch net-next repost 6/8] mlxsw: spectrum_dpipe: Add support for IPv6 host table dump
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

Add support for IPv6 host table dump.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_dpipe.c   | 75 ++++++++++++++++++++--
 1 file changed, 70 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
index 5924e97..75da2ef 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
@@ -386,8 +386,19 @@ mlxsw_sp_dpipe_table_host_match_action_prepare(struct devlink_dpipe_match *match
 
 	match = &matches[MLXSW_SP_DPIPE_TABLE_HOST_MATCH_DIP];
 	match->type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
-	match->header = &devlink_dpipe_header_ipv4;
-	match->field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
+	switch (type) {
+	case AF_INET:
+		match->header = &devlink_dpipe_header_ipv4;
+		match->field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
+		break;
+	case AF_INET6:
+		match->header = &devlink_dpipe_header_ipv6;
+		match->field_id = DEVLINK_DPIPE_FIELD_IPV6_DST_IP;
+		break;
+	default:
+		WARN_ON(1);
+		return;
+	}
 
 	action->type = DEVLINK_DPIPE_ACTION_TYPE_FIELD_MODIFY;
 	action->header = &devlink_dpipe_header_ethernet;
@@ -424,7 +435,18 @@ mlxsw_sp_dpipe_table_host_entry_prepare(struct devlink_dpipe_entry *entry,
 	match_value = &match_values[MLXSW_SP_DPIPE_TABLE_HOST_MATCH_DIP];
 
 	match_value->match = match;
-	match_value->value_size = sizeof(u32);
+	switch (type) {
+	case AF_INET:
+		match_value->value_size = sizeof(u32);
+		break;
+	case AF_INET6:
+		match_value->value_size = sizeof(struct in6_addr);
+		break;
+	default:
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
 	match_value->value = kmalloc(match_value->value_size, GFP_KERNEL);
 	if (!match_value->value)
 		return -ENOMEM;
@@ -479,6 +501,20 @@ mlxsw_sp_dpipe_table_host4_entry_fill(struct devlink_dpipe_entry *entry,
 }
 
 static void
+mlxsw_sp_dpipe_table_host6_entry_fill(struct devlink_dpipe_entry *entry,
+				      struct mlxsw_sp_neigh_entry *neigh_entry,
+				      struct mlxsw_sp_rif *rif)
+{
+	struct in6_addr *dip;
+	unsigned char *ha;
+
+	ha = mlxsw_sp_neigh_entry_ha(neigh_entry);
+	dip = mlxsw_sp_neigh6_entry_dip(neigh_entry);
+
+	__mlxsw_sp_dpipe_table_host_entry_fill(entry, rif, ha, dip);
+}
+
+static void
 mlxsw_sp_dpipe_table_host_entry_fill(struct mlxsw_sp *mlxsw_sp,
 				     struct devlink_dpipe_entry *entry,
 				     struct mlxsw_sp_neigh_entry *neigh_entry,
@@ -487,7 +523,18 @@ mlxsw_sp_dpipe_table_host_entry_fill(struct mlxsw_sp *mlxsw_sp,
 {
 	int err;
 
-	mlxsw_sp_dpipe_table_host4_entry_fill(entry, neigh_entry, rif);
+	switch (type) {
+	case AF_INET:
+		mlxsw_sp_dpipe_table_host4_entry_fill(entry, neigh_entry, rif);
+		break;
+	case AF_INET6:
+		mlxsw_sp_dpipe_table_host6_entry_fill(entry, neigh_entry, rif);
+		break;
+	default:
+		WARN_ON(1);
+		return;
+	}
+
 	err = mlxsw_sp_neigh_counter_get(mlxsw_sp, neigh_entry,
 					 &entry->counter);
 	if (!err)
@@ -526,7 +573,13 @@ mlxsw_sp_dpipe_table_host_entries_get(struct mlxsw_sp *mlxsw_sp,
 
 		rif_neigh_count = 0;
 		mlxsw_sp_rif_neigh_for_each(neigh_entry, rif) {
-			if (mlxsw_sp_neigh_entry_type(neigh_entry) != type)
+			int neigh_type = mlxsw_sp_neigh_entry_type(neigh_entry);
+
+			if (neigh_type != type)
+				continue;
+
+			if (neigh_type == AF_INET6 &&
+			    mlxsw_sp_neigh_ipv6_ignore(neigh_entry))
 				continue;
 
 			if (rif_neigh_count < rif_neigh_skip)
@@ -714,6 +767,17 @@ mlxsw_sp_dpipe_table_host6_matches_dump(void *priv, struct sk_buff *skb)
 	return mlxsw_sp_dpipe_table_host_matches_dump(skb, AF_INET6);
 }
 
+static int
+mlxsw_sp_dpipe_table_host6_entries_dump(void *priv, bool counters_enabled,
+					struct devlink_dpipe_dump_ctx *dump_ctx)
+{
+	struct mlxsw_sp *mlxsw_sp = priv;
+
+	return mlxsw_sp_dpipe_table_host_entries_dump(mlxsw_sp,
+						      counters_enabled,
+						      dump_ctx, AF_INET6);
+}
+
 static u64 mlxsw_sp_dpipe_table_host6_size_get(void *priv)
 {
 	struct mlxsw_sp *mlxsw_sp = priv;
@@ -724,6 +788,7 @@ static u64 mlxsw_sp_dpipe_table_host6_size_get(void *priv)
 static struct devlink_dpipe_table_ops mlxsw_sp_host6_ops = {
 	.matches_dump = mlxsw_sp_dpipe_table_host6_matches_dump,
 	.actions_dump = mlxsw_sp_dpipe_table_host_actions_dump,
+	.entries_dump = mlxsw_sp_dpipe_table_host6_entries_dump,
 	.size_get = mlxsw_sp_dpipe_table_host6_size_get,
 };
 
-- 
2.9.3

^ permalink raw reply related

* [patch net-next repost 5/8] mlxsw: spectrum_dpipe: Make host entry fill handler more generic
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

Change the host entry filler helper to be applicable for both IPv4/6
addresses.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
index 93ba6a6..5924e97 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
@@ -439,13 +439,12 @@ mlxsw_sp_dpipe_table_host_entry_prepare(struct devlink_dpipe_entry *entry,
 }
 
 static void
-__mlxsw_sp_dpipe_table_host4_entry_fill(struct devlink_dpipe_entry *entry,
-					struct mlxsw_sp_rif *rif,
-					unsigned char *ha, u32 dip)
+__mlxsw_sp_dpipe_table_host_entry_fill(struct devlink_dpipe_entry *entry,
+				       struct mlxsw_sp_rif *rif,
+				       unsigned char *ha, void *dip)
 {
 	struct devlink_dpipe_value *value;
 	u32 *rif_value;
-	u32 *dip_value;
 	u8 *ha_value;
 
 	/* Set Match RIF index */
@@ -458,9 +457,7 @@ __mlxsw_sp_dpipe_table_host4_entry_fill(struct devlink_dpipe_entry *entry,
 
 	/* Set Match DIP */
 	value = &entry->match_values[MLXSW_SP_DPIPE_TABLE_HOST_MATCH_DIP];
-
-	dip_value = value->value;
-	*dip_value = dip;
+	memcpy(value->value, dip, value->value_size);
 
 	/* Set Action DMAC */
 	value = entry->action_values;
@@ -478,7 +475,7 @@ mlxsw_sp_dpipe_table_host4_entry_fill(struct devlink_dpipe_entry *entry,
 
 	ha = mlxsw_sp_neigh_entry_ha(neigh_entry);
 	dip = mlxsw_sp_neigh4_entry_dip(neigh_entry);
-	__mlxsw_sp_dpipe_table_host4_entry_fill(entry, rif, ha, dip);
+	__mlxsw_sp_dpipe_table_host_entry_fill(entry, rif, ha, &dip);
 }
 
 static void
-- 
2.9.3

^ permalink raw reply related

* [patch net-next repost 4/8] mlxsw: spectrum_router: Add IPv6 neighbor access helper
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

Add helper for accessing destination IP in case of IPv6 neighbor.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 9 +++++++++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 1f41bcd..db57c0c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -947,6 +947,15 @@ u32 mlxsw_sp_neigh4_entry_dip(struct mlxsw_sp_neigh_entry *neigh_entry)
 	return ntohl(*((__be32 *) n->primary_key));
 }
 
+struct in6_addr *
+mlxsw_sp_neigh6_entry_dip(struct mlxsw_sp_neigh_entry *neigh_entry)
+{
+	struct neighbour *n;
+
+	n = neigh_entry->key.n;
+	return (struct in6_addr *) &n->primary_key;
+}
+
 int mlxsw_sp_neigh_counter_get(struct mlxsw_sp *mlxsw_sp,
 			       struct mlxsw_sp_neigh_entry *neigh_entry,
 			       u64 *p_counter)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h
index 5b68616..87a04af 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h
@@ -65,6 +65,8 @@ int mlxsw_sp_neigh_entry_type(struct mlxsw_sp_neigh_entry *neigh_entry);
 unsigned char *
 mlxsw_sp_neigh_entry_ha(struct mlxsw_sp_neigh_entry *neigh_entry);
 u32 mlxsw_sp_neigh4_entry_dip(struct mlxsw_sp_neigh_entry *neigh_entry);
+struct in6_addr *
+mlxsw_sp_neigh6_entry_dip(struct mlxsw_sp_neigh_entry *neigh_entry);
 
 #define mlxsw_sp_rif_neigh_for_each(neigh_entry, rif)				\
 	for (neigh_entry = mlxsw_sp_rif_neigh_next(rif, NULL); neigh_entry;	\
-- 
2.9.3

^ permalink raw reply related

* [patch net-next repost 3/8] mlxsw: spectrum_dpipe: Add IPv6 host table initial support
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

Add IPv6 host table initial support. The action behavior for both IPv4/6
tables is the same, thus the same action dump op is used. Neighbors with
link local address are ignored.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_dpipe.c   | 76 ++++++++++++++++++++--
 .../net/ethernet/mellanox/mlxsw/spectrum_dpipe.h   |  1 +
 2 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
index 3ea1314..93ba6a6 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.c
@@ -76,6 +76,7 @@ static struct devlink_dpipe_header *mlxsw_dpipe_headers[] = {
 	&mlxsw_sp_dpipe_header_metadata,
 	&devlink_dpipe_header_ethernet,
 	&devlink_dpipe_header_ipv4,
+	&devlink_dpipe_header_ipv6,
 };
 
 static struct devlink_dpipe_headers mlxsw_sp_dpipe_headers = {
@@ -328,9 +329,21 @@ static int mlxsw_sp_dpipe_table_host_matches_dump(struct sk_buff *skb, int type)
 	if (err)
 		return err;
 
-	match.type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
-	match.header = &devlink_dpipe_header_ipv4;
-	match.field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
+	switch (type) {
+	case AF_INET:
+		match.type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
+		match.header = &devlink_dpipe_header_ipv4;
+		match.field_id = DEVLINK_DPIPE_FIELD_IPV4_DST_IP;
+		break;
+	case AF_INET6:
+		match.type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT;
+		match.header = &devlink_dpipe_header_ipv6;
+		match.field_id = DEVLINK_DPIPE_FIELD_IPV6_DST_IP;
+		break;
+	default:
+		WARN_ON(1);
+		return -EINVAL;
+	}
 
 	return devlink_dpipe_match_put(skb, &match);
 }
@@ -342,7 +355,7 @@ mlxsw_sp_dpipe_table_host4_matches_dump(void *priv, struct sk_buff *skb)
 }
 
 static int
-mlxsw_sp_dpipe_table_host4_actions_dump(void *priv, struct sk_buff *skb)
+mlxsw_sp_dpipe_table_host_actions_dump(void *priv, struct sk_buff *skb)
 {
 	struct devlink_dpipe_action action = {0};
 
@@ -648,8 +661,15 @@ mlxsw_sp_dpipe_table_host_size_get(struct mlxsw_sp *mlxsw_sp, int type)
 		if (!rif)
 			continue;
 		mlxsw_sp_rif_neigh_for_each(neigh_entry, rif) {
-			if (mlxsw_sp_neigh_entry_type(neigh_entry) != type)
+			int neigh_type = mlxsw_sp_neigh_entry_type(neigh_entry);
+
+			if (neigh_type != type)
 				continue;
+
+			if (neigh_type == AF_INET6 &&
+			    mlxsw_sp_neigh_ipv6_ignore(neigh_entry))
+				continue;
+
 			size++;
 		}
 	}
@@ -667,7 +687,7 @@ static u64 mlxsw_sp_dpipe_table_host4_size_get(void *priv)
 
 static struct devlink_dpipe_table_ops mlxsw_sp_host4_ops = {
 	.matches_dump = mlxsw_sp_dpipe_table_host4_matches_dump,
-	.actions_dump = mlxsw_sp_dpipe_table_host4_actions_dump,
+	.actions_dump = mlxsw_sp_dpipe_table_host_actions_dump,
 	.entries_dump = mlxsw_sp_dpipe_table_host4_entries_dump,
 	.counters_set_update = mlxsw_sp_dpipe_table_host4_counters_update,
 	.size_get = mlxsw_sp_dpipe_table_host4_size_get,
@@ -691,6 +711,43 @@ static void mlxsw_sp_dpipe_host4_table_fini(struct mlxsw_sp *mlxsw_sp)
 				       MLXSW_SP_DPIPE_TABLE_NAME_HOST4);
 }
 
+static int
+mlxsw_sp_dpipe_table_host6_matches_dump(void *priv, struct sk_buff *skb)
+{
+	return mlxsw_sp_dpipe_table_host_matches_dump(skb, AF_INET6);
+}
+
+static u64 mlxsw_sp_dpipe_table_host6_size_get(void *priv)
+{
+	struct mlxsw_sp *mlxsw_sp = priv;
+
+	return mlxsw_sp_dpipe_table_host_size_get(mlxsw_sp, AF_INET6);
+}
+
+static struct devlink_dpipe_table_ops mlxsw_sp_host6_ops = {
+	.matches_dump = mlxsw_sp_dpipe_table_host6_matches_dump,
+	.actions_dump = mlxsw_sp_dpipe_table_host_actions_dump,
+	.size_get = mlxsw_sp_dpipe_table_host6_size_get,
+};
+
+static int mlxsw_sp_dpipe_host6_table_init(struct mlxsw_sp *mlxsw_sp)
+{
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	return devlink_dpipe_table_register(devlink,
+					    MLXSW_SP_DPIPE_TABLE_NAME_HOST6,
+					    &mlxsw_sp_host6_ops,
+					    mlxsw_sp, false);
+}
+
+static void mlxsw_sp_dpipe_host6_table_fini(struct mlxsw_sp *mlxsw_sp)
+{
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	devlink_dpipe_table_unregister(devlink,
+				       MLXSW_SP_DPIPE_TABLE_NAME_HOST6);
+}
+
 int mlxsw_sp_dpipe_init(struct mlxsw_sp *mlxsw_sp)
 {
 	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
@@ -707,8 +764,14 @@ int mlxsw_sp_dpipe_init(struct mlxsw_sp *mlxsw_sp)
 	err = mlxsw_sp_dpipe_host4_table_init(mlxsw_sp);
 	if (err)
 		goto err_host4_table_init;
+
+	err = mlxsw_sp_dpipe_host6_table_init(mlxsw_sp);
+	if (err)
+		goto err_host6_table_init;
 	return 0;
 
+err_host6_table_init:
+	mlxsw_sp_dpipe_host4_table_fini(mlxsw_sp);
 err_host4_table_init:
 	mlxsw_sp_dpipe_erif_table_fini(mlxsw_sp);
 err_erif_table_init:
@@ -720,6 +783,7 @@ void mlxsw_sp_dpipe_fini(struct mlxsw_sp *mlxsw_sp)
 {
 	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
 
+	mlxsw_sp_dpipe_host6_table_fini(mlxsw_sp);
 	mlxsw_sp_dpipe_host4_table_fini(mlxsw_sp);
 	mlxsw_sp_dpipe_erif_table_fini(mlxsw_sp);
 	devlink_dpipe_headers_unregister(devlink);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.h
index c56a3d9..283fde4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.h
@@ -55,5 +55,6 @@ static inline void mlxsw_sp_dpipe_fini(struct mlxsw_sp *mlxsw_sp)
 
 #define MLXSW_SP_DPIPE_TABLE_NAME_ERIF "mlxsw_erif"
 #define MLXSW_SP_DPIPE_TABLE_NAME_HOST4 "mlxsw_host4"
+#define MLXSW_SP_DPIPE_TABLE_NAME_HOST6 "mlxsw_host6"
 
 #endif /* _MLXSW_PIPELINE_H_*/
-- 
2.9.3

^ permalink raw reply related

* [patch net-next repost 2/8] mlxsw: spectrum_router: Export IPv6 link local address check helper
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

Neighbors with link local addresses are not offloaded to the host table,
yet, the are maintained in the driver for adjacency table usage. When
dumping the IPv6 host neighbors this link local neighbors should be
ignored. This patch exports this helper for dpipe usage.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 6 ++++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 377d85c..1f41bcd 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1396,8 +1396,10 @@ mlxsw_sp_router_neigh_entry_op6(struct mlxsw_sp *mlxsw_sp,
 	mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rauht), rauht_pl);
 }
 
-static bool mlxsw_sp_neigh_ipv6_ignore(struct neighbour *n)
+bool mlxsw_sp_neigh_ipv6_ignore(struct mlxsw_sp_neigh_entry *neigh_entry)
 {
+	struct neighbour *n = neigh_entry->key.n;
+
 	/* Packets with a link-local destination address are trapped
 	 * after LPM lookup and never reach the neighbour table, so
 	 * there is no need to program such neighbours to the device.
@@ -1420,7 +1422,7 @@ mlxsw_sp_neigh_entry_update(struct mlxsw_sp *mlxsw_sp,
 		mlxsw_sp_router_neigh_entry_op4(mlxsw_sp, neigh_entry,
 						mlxsw_sp_rauht_op(adding));
 	} else if (neigh_entry->key.n->tbl->family == AF_INET6) {
-		if (mlxsw_sp_neigh_ipv6_ignore(neigh_entry->key.n))
+		if (mlxsw_sp_neigh_ipv6_ignore(neigh_entry))
 			return;
 		mlxsw_sp_router_neigh_entry_op6(mlxsw_sp, neigh_entry,
 						mlxsw_sp_rauht_op(adding));
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h
index fb0f971..5b68616 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.h
@@ -76,5 +76,6 @@ void
 mlxsw_sp_neigh_entry_counter_update(struct mlxsw_sp *mlxsw_sp,
 				    struct mlxsw_sp_neigh_entry *neigh_entry,
 				    bool adding);
+bool mlxsw_sp_neigh_ipv6_ignore(struct mlxsw_sp_neigh_entry *neigh_entry);
 
 #endif /* _MLXSW_ROUTER_H_*/
-- 
2.9.3

^ permalink raw reply related

* [patch net-next repost 1/8] devlink: Add IPv6 header for dpipe
From: Jiri Pirko @ 2017-08-31 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, arkadis, idosch, mlxsw
In-Reply-To: <20170831155919.1366-1-jiri@resnulli.us>

From: Arkadi Sharshevsky <arkadis@mellanox.com>

This will be used by the IPv6 host table which will be introduced in the
following patches. The fields in the header are added per-use. This header
is global and can be reused by many drivers.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h        |  1 +
 include/uapi/linux/devlink.h |  5 +++++
 net/core/devlink.c           | 17 +++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index aaf7178..b9654e1 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -330,6 +330,7 @@ int devlink_dpipe_match_put(struct sk_buff *skb,
 			    struct devlink_dpipe_match *match);
 extern struct devlink_dpipe_header devlink_dpipe_header_ethernet;
 extern struct devlink_dpipe_header devlink_dpipe_header_ipv4;
+extern struct devlink_dpipe_header devlink_dpipe_header_ipv6;
 
 #else
 
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 6c17254..0cbca96 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -234,9 +234,14 @@ enum devlink_dpipe_field_ipv4_id {
 	DEVLINK_DPIPE_FIELD_IPV4_DST_IP,
 };
 
+enum devlink_dpipe_field_ipv6_id {
+	DEVLINK_DPIPE_FIELD_IPV6_DST_IP,
+};
+
 enum devlink_dpipe_header_id {
 	DEVLINK_DPIPE_HEADER_ETHERNET,
 	DEVLINK_DPIPE_HEADER_IPV4,
+	DEVLINK_DPIPE_HEADER_IPV6,
 };
 
 #endif /* _UAPI_LINUX_DEVLINK_H_ */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index cbc4b04..7d430c1 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -63,6 +63,23 @@ struct devlink_dpipe_header devlink_dpipe_header_ipv4 = {
 };
 EXPORT_SYMBOL(devlink_dpipe_header_ipv4);
 
+static struct devlink_dpipe_field devlink_dpipe_fields_ipv6[] = {
+	{
+		.name = "destination ip",
+		.id = DEVLINK_DPIPE_FIELD_IPV6_DST_IP,
+		.bitwidth = 128,
+	},
+};
+
+struct devlink_dpipe_header devlink_dpipe_header_ipv6 = {
+	.name = "ipv6",
+	.id = DEVLINK_DPIPE_HEADER_IPV6,
+	.fields = devlink_dpipe_fields_ipv6,
+	.fields_count = ARRAY_SIZE(devlink_dpipe_fields_ipv6),
+	.global = true,
+};
+EXPORT_SYMBOL(devlink_dpipe_header_ipv6);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_hwmsg);
 
 static LIST_HEAD(devlink_list);
-- 
2.9.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox