Netdev List
 help / color / mirror / Atom feed
* [net-next] net: crypto set sk to NULL when af_alg_release.
From: Mao Wenan @ 2019-02-15 14:24 UTC (permalink / raw)
  To: netdev, davem, xiyou.wangcong, linux-kernel

KASAN has found use-after-free in sockfs_setattr.
The existed commit 6d8c50dcb029 ("socket: close race condition between sock_close()
and sockfs_setattr()") is to fix this simillar issue, but it seems to ignore
that crypto module forgets to set the sk to NULL after af_alg_release.

KASAN report details as below:
BUG: KASAN: use-after-free in sockfs_setattr+0x120/0x150
Write of size 4 at addr ffff88837b956128 by task syz-executor0/4186

CPU: 2 PID: 4186 Comm: syz-executor0 Not tainted xxx + #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0xca/0x13e
 print_address_description+0x79/0x330
 ? vprintk_func+0x5e/0xf0
 kasan_report+0x18a/0x2e0
 ? sockfs_setattr+0x120/0x150
 sockfs_setattr+0x120/0x150
 ? sock_register+0x2d0/0x2d0
 notify_change+0x90c/0xd40
 ? chown_common+0x2ef/0x510
 chown_common+0x2ef/0x510
 ? chmod_common+0x3b0/0x3b0
 ? __lock_is_held+0xbc/0x160
 ? __sb_start_write+0x13d/0x2b0
 ? __mnt_want_write+0x19a/0x250
 do_fchownat+0x15c/0x190
 ? __ia32_sys_chmod+0x80/0x80
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 __x64_sys_fchownat+0xbf/0x160
 ? lockdep_hardirqs_on+0x39a/0x5e0
 do_syscall_64+0xc8/0x580
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462589
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89
f7 48 89 d6 48 89
ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3
48 c7 c1 bc ff ff
ff f7 d8 64 89 01 48
RSP: 002b:00007fb4b2c83c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000104
RAX: ffffffffffffffda RBX: 000000000072bfa0 RCX: 0000000000462589
RDX: 0000000000000000 RSI: 00000000200000c0 RDI: 0000000000000007
RBP: 0000000000000005 R08: 0000000000001000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb4b2c846bc
R13: 00000000004bc733 R14: 00000000006f5138 R15: 00000000ffffffff

Allocated by task 4185:
 kasan_kmalloc+0xa0/0xd0
 __kmalloc+0x14a/0x350
 sk_prot_alloc+0xf6/0x290
 sk_alloc+0x3d/0xc00
 af_alg_accept+0x9e/0x670
 hash_accept+0x4a3/0x650
 __sys_accept4+0x306/0x5c0
 __x64_sys_accept4+0x98/0x100
 do_syscall_64+0xc8/0x580
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 4184:
 __kasan_slab_free+0x12e/0x180
 kfree+0xeb/0x2f0
 __sk_destruct+0x4e6/0x6a0
 sk_destruct+0x48/0x70
 __sk_free+0xa9/0x270
 sk_free+0x2a/0x30
 af_alg_release+0x5c/0x70
 __sock_release+0xd3/0x280
 sock_close+0x1a/0x20
 __fput+0x27f/0x7f0
 task_work_run+0x136/0x1b0
 exit_to_usermode_loop+0x1a7/0x1d0
 do_syscall_64+0x461/0x580
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Syzkaller reproducer:
r0 = perf_event_open(&(0x7f0000000000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_config_ext}, 0x0, 0x0,
0xffffffffffffffff, 0x0)
r1 = socket$alg(0x26, 0x5, 0x0)
getrusage(0x0, 0x0)
bind(r1, &(0x7f00000001c0)=@alg={0x26, 'hash\x00', 0x0, 0x0,
'sha256-ssse3\x00'}, 0x80)
r2 = accept(r1, 0x0, 0x0)
r3 = accept4$unix(r2, 0x0, 0x0, 0x0)
r4 = dup3(r3, r0, 0x0)
fchownat(r4, &(0x7f00000000c0)='\x00', 0x0, 0x0, 0x1000)

Fixes: 6d8c50dcb029 ("socket: close race condition between sock_close() and sockfs_setattr()")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 crypto/af_alg.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 17eb09d222ff..ec78a04eb136 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -122,8 +122,10 @@ static void alg_do_release(const struct af_alg_type *type, void *private)
 
 int af_alg_release(struct socket *sock)
 {
-	if (sock->sk)
+	if (sock->sk) {
 		sock_put(sock->sk);
+		sock->sk = NULL;
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(af_alg_release);
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] net: dsa: Implement flow_dissect callback for tag_dsa.
From: Andrew Lunn @ 2019-02-15 14:05 UTC (permalink / raw)
  To: Rundong Ge; +Cc: vivien.didelot, f.fainelli, davem, netdev, linux-kernel
In-Reply-To: <20190215082253.2682-1-rdong.ge@gmail.com>

On Fri, Feb 15, 2019 at 08:22:53AM +0000, Rundong Ge wrote:
> RPS not work for DSA devices since the 'skb_get_hash'
> will always get the invalid hash for dsa tagged packets.
> 
> "[PATCH] tag_mtk: add flow_dissect callback to the ops struct"
> introduced the flow_dissect callback to get the right hash for
> MTK tagged packet. And tag_dsa also needs to implement
> the callback.

Hi Rundong

This looks good. Do you have any sort of numbers to show the
performance improvements?

Could you make a similar change to tag_edsa.c? It uses an 8 byte
header.

Thanks
	Andrew

^ permalink raw reply

* Bug in br_handle_frame
From: Tomas Paukrt @ 2019-02-15 14:03 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: roopa@cumulusnetworks.com, nikolay@cumulusnetworks.com

[-- Attachment #1: Type: text/plain, Size: 483 bytes --]

Hi,

I have recently discovered that kernel 3.12.10 is occasionally crashing 
due to NULL pointer dereference in function br_handle_frame when we 
reconfigure the bridge, because function br_port_get_rcu returns NULL.

It is very hard for us to replicate this issue, because it happens about 
once per month in our testing environment, but I have created the 
attached patch. Can you please check it? The latest kernel seems to be 
affected too.

Best regards

Tomas


[-- Attachment #2: lin3-12-10-net-bridge.patch --]
[-- Type: text/plain, Size: 474 bytes --]

diff --exclude CVS --exclude .git -uNr linux-3.12.10/net/bridge/br_input.c linux-3.12.10.modified/net/bridge/br_input.c
--- linux-3.12.10/net/bridge/br_input.c	2014-03-31 03:41:44.000000000 +0200
+++ linux-3.12.10.modified/net/bridge/br_input.c	2019-02-15 10:51:23.376424560 +0100
@@ -174,6 +174,8 @@
 		return RX_HANDLER_CONSUMED;
 
 	p = br_port_get_rcu(skb->dev);
+	if (!p)
+		return RX_HANDLER_PASS;
 
 	if (unlikely(is_link_local_ether_addr(dest))) {
 		/*

^ permalink raw reply

* Re: [RFC PATCH] bonding: use mutex lock in bond_get_stats()
From: Eric Dumazet @ 2019-02-15 13:57 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: netdev, Willem de Bruijn, David S . Miller, Jay Vosburgh,
	Veaceslav Falico, weiyongjun1
In-Reply-To: <20190215135056.18376-1-wangkefeng.wang@huawei.com>

On Fri, Feb 15, 2019 at 5:37 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> With CONFIG_DEBUG_SPINLOCK=y, we find following stack,
>
>  BUG: spinlock wrong CPU on CPU#0, ip/16047
>   lock: 0xffff803f5febc998, .magic: dead4ead, .owner: ip/16047, .owner_cpu: 0
>  CPU: 1 PID: 16047 Comm: ip Kdump: loaded Tainted: G            E 4.19.12.aarch64 #1
>  Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS TaiShan 2280 V2 - B900 01/29/2019
>  Call trace:
>   dump_backtrace+0x0/0x1c0
>   show_stack+0x24/0x30
>   dump_stack+0x90/0xbc
>   spin_dump+0x84/0xa8
>   do_raw_spin_unlock+0xf8/0x100
>   _raw_spin_unlock+0x20/0x30
>   bond_get_stats+0x110/0x140 [bonding]
>   rtnl_fill_stats+0x50/0x150
>   rtnl_fill_ifinfo+0x4d4/0xd18
>   rtnl_dump_ifinfo+0x200/0x3a8
>   netlink_dump+0x100/0x2b0
>   netlink_recvmsg+0x310/0x3e8
>   sock_recvmsg+0x58/0x68
>   ___sys_recvmsg+0xd0/0x278
>   __sys_recvmsg+0x74/0xd0
>   __arm64_sys_recvmsg+0x2c/0x38
>   el0_svc_common+0x7c/0x118
>   el0_svc_handler+0x30/0x40
>   el0_svc+0x8/0xc
>
> and then lead to softlockup issue, fix this by using mutex lock instead
> of spin lock.
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>
> Not sure if this is right fix, please correct me if I'm wrong.
>

Make sure to also try :

cat /proc/net/dev

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: phy: improve phy_resolve_aneg_linkmode
From: Andrew Lunn @ 2019-02-15 13:57 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <1571d9d1-b3ad-8c96-a476-4ae18d20abfe@gmail.com>

On Thu, Feb 14, 2019 at 10:15:31PM +0100, Heiner Kallweit wrote:
> We have the settings array of modes which is sorted based on aneg
> priority. Instead of checking each mode manually let's simply iterate
> over the sorted settings.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  drivers/net/phy/phy-core.c | 43 +++++++-------------------------------
>  1 file changed, 7 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/phy/phy-core.c b/drivers/net/phy/phy-core.c
> index cdea028d1..5d43106fe 100644
> --- a/drivers/net/phy/phy-core.c
> +++ b/drivers/net/phy/phy-core.c
> @@ -349,45 +349,16 @@ size_t phy_speeds(unsigned int *speeds, size_t size,
>  void phy_resolve_aneg_linkmode(struct phy_device *phydev)
>  {
>  	__ETHTOOL_DECLARE_LINK_MODE_MASK(common);
> +	int i;
>  
>  	linkmode_and(common, phydev->lp_advertising, phydev->advertising);
>  
> -	if (linkmode_test_bit(ETHTOOL_LINK_MODE_10000baseT_Full_BIT, common)) {
> -		phydev->speed = SPEED_10000;
> -		phydev->duplex = DUPLEX_FULL;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_5000baseT_Full_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_5000;
> -		phydev->duplex = DUPLEX_FULL;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_2500;
> -		phydev->duplex = DUPLEX_FULL;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_1000;
> -		phydev->duplex = DUPLEX_FULL;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_1000;
> -		phydev->duplex = DUPLEX_HALF;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_100;
> -		phydev->duplex = DUPLEX_FULL;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_100;
> -		phydev->duplex = DUPLEX_HALF;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_10;
> -		phydev->duplex = DUPLEX_FULL;
> -	} else if (linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
> -				     common)) {
> -		phydev->speed = SPEED_10;
> -		phydev->duplex = DUPLEX_HALF;
> -	}
> +	for (i = 0; i < ARRAY_SIZE(settings); i++)
> +		if (test_bit(settings[i].bit, common)) {
> +			phydev->speed = settings[i].speed;
> +			phydev->duplex = settings[i].duplex;
> +			break;
> +		}

Hi Heiner

Nice simplification.

I just have one thought about this. The original code was limited to
baseT. The new code could in theory return a non BaseT speed. For that
to happen, it would require that phydev->lp_advertising and
phydev->advertising contain a non BaseT link mode? Is that possible?
I don't think it is.

  Andrew

^ permalink raw reply

* [PATCH] net: marvell: mvneta: fix DMA debug warning
From: Russell King @ 2019-02-15 13:55 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: David S. Miller, netdev

Booting 4.20 on SolidRun Clearfog issues this warning with DMA API
debug enabled:

WARNING: CPU: 0 PID: 555 at kernel/dma/debug.c:1230 check_sync+0x514/0x5bc
mvneta f1070000.ethernet: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x000000002dd7dc00] [size=240 bytes]
Modules linked in: ahci mv88e6xxx dsa_core xhci_plat_hcd xhci_hcd devlink armada_thermal marvell_cesa des_generic ehci_orion phy_armada38x_comphy mcp3021 spi_orion evbug sfp mdio_i2c ip_tables x_tables
CPU: 0 PID: 555 Comm: bridge-network- Not tainted 4.20.0+ #291
Hardware name: Marvell Armada 380/385 (Device Tree)
[<c0019638>] (unwind_backtrace) from [<c0014888>] (show_stack+0x10/0x14)
[<c0014888>] (show_stack) from [<c07f54e0>] (dump_stack+0x9c/0xd4)
[<c07f54e0>] (dump_stack) from [<c00312bc>] (__warn+0xf8/0x124)
[<c00312bc>] (__warn) from [<c00313b0>] (warn_slowpath_fmt+0x38/0x48)
[<c00313b0>] (warn_slowpath_fmt) from [<c00b0370>] (check_sync+0x514/0x5bc)
[<c00b0370>] (check_sync) from [<c00b04f8>] (debug_dma_sync_single_range_for_cpu+0x6c/0x74)
[<c00b04f8>] (debug_dma_sync_single_range_for_cpu) from [<c051bd14>] (mvneta_poll+0x298/0xf58)
[<c051bd14>] (mvneta_poll) from [<c0656194>] (net_rx_action+0x128/0x424)
[<c0656194>] (net_rx_action) from [<c000a230>] (__do_softirq+0xf0/0x540)
[<c000a230>] (__do_softirq) from [<c00386e0>] (irq_exit+0x124/0x144)
[<c00386e0>] (irq_exit) from [<c009b5e0>] (__handle_domain_irq+0x58/0xb0)
[<c009b5e0>] (__handle_domain_irq) from [<c03a63c4>] (gic_handle_irq+0x48/0x98)
[<c03a63c4>] (gic_handle_irq) from [<c0009a10>] (__irq_svc+0x70/0x98)
...

This appears to be caused by mvneta_rx_hwbm() calling
dma_sync_single_range_for_cpu() with the wrong struct device pointer,
as the buffer manager device pointer is used to map and unmap the
buffer.  Fix this.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
Please check that this is the correct fix.

 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index ef0a85bbc586..0e1f87fca952 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2086,7 +2086,7 @@ static int mvneta_rx_hwbm(struct napi_struct *napi,
 			if (unlikely(!skb))
 				goto err_drop_frame_ret_pool;
 
-			dma_sync_single_range_for_cpu(dev->dev.parent,
+			dma_sync_single_range_for_cpu(&pp->bm_priv->pdev->dev,
 			                              rx_desc->buf_phys_addr,
 			                              MVNETA_MH_SIZE + NET_SKB_PAD,
 			                              rx_bytes,
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH RFC] net: bridge: don't flood known multicast traffic when snooping is enabled
From: Ido Schimmel @ 2019-02-15 13:53 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, roopa, wkok, anuradhak, bridge, linus.luessing, davem,
	stephen
In-Reply-To: <20190215130427.29824-1-nikolay@cumulusnetworks.com>

On Fri, Feb 15, 2019 at 03:04:27PM +0200, Nikolay Aleksandrov wrote:
> The behaviour since b00589af3b04 ("bridge: disable snooping if there is
> no querier") is wrong, we shouldn't be flooding multicast traffic when
> there is an mdb entry and we know where it should be forwarded to when
> multicast snooping is enabled. This patch changes the behaviour to not
> flood known unicast traffic. I'll give two obviously broken cases:
>  - most obvious: static mdb created by the user with snooping enabled
>  - user-space daemon controlling the mdb table (e.g. MLAG)
> 
> Every user would expect to have traffic forwarded only to the configured
> mdb destination when snooping is enabled, instead now to get that one
> needs to enable both snooping and querier. Enabling querier on all
> switches could be problematic and is not a good solution, for example
> as summarized by our multicast experts:
> "every switch would send an IGMP query for any random multicast traffic it
> received across the entire domain and it would send it forever as long as a
> host exists wanting that stream even if it has no downstream/directly
> connected receivers"
> 
> Sending as an RFC to get the discussion going, but I'm strongly for
> removing this behaviour and would like to send this patch officially.
> 
> We could make this behaviour possible via a knob if necessary, but
> it really should not be the default.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

Yes. This is great! :)

I had to enable a multicast querier when testing static mdb entries only
because the test was "too long" and this check kicked in. Never made
sense to me. Lets drop it if we can.

Reviewed-by: Ido Schimmel <idosch@mellanox.com>

^ permalink raw reply

* Re: [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
From: Ido Schimmel @ 2019-02-15 13:47 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem
In-Reply-To: <20190215121120.4971-1-vladbu@mellanox.com>

On Fri, Feb 15, 2019 at 02:11:20PM +0200, Vlad Buslov wrote:
> Check that filter is not NULL before passing it to tcf_walker->fn()
> callback. This can happen when mall_change() failed to offload filter to
> hardware.
> 
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Works for me. Thanks

Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>

^ permalink raw reply

* [PATCH net-next 2/3] net: stmmac: dwmac4: Also use TBU interrupt to clean TX path
From: Jose Abreu @ 2019-02-15 13:42 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Jose Abreu, Joao Pinto, David S . Miller, Giuseppe Cavallaro,
	Alexandre Torgue
In-Reply-To: <cover.1550237884.git.joabreu@synopsys.com>

TBU interrupt is a normal interrupt and can be used to trigger the
cleaning of TX path. Lets check if it's active in DMA interrupt handler.

While at it, refactor a little bit the function:
	- Don't check if RI is enabled because at function exit we will
	  only clear the interrupts that are enabled so, no event will be
	  missed.

In my tests with GMAC5 this increased performance.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
index 49f5687879df..545cb9c47433 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
@@ -124,9 +124,9 @@ void dwmac4_disable_dma_irq(void __iomem *ioaddr, u32 chan)
 int dwmac4_dma_interrupt(void __iomem *ioaddr,
 			 struct stmmac_extra_stats *x, u32 chan)
 {
-	int ret = 0;
-
 	u32 intr_status = readl(ioaddr + DMA_CHAN_STATUS(chan));
+	u32 intr_en = readl(ioaddr + DMA_CHAN_INTR_ENA(chan));
+	int ret = 0;
 
 	/* ABNORMAL interrupts */
 	if (unlikely(intr_status & DMA_CHAN_STATUS_AIS)) {
@@ -151,16 +151,11 @@ int dwmac4_dma_interrupt(void __iomem *ioaddr,
 	if (likely(intr_status & DMA_CHAN_STATUS_NIS)) {
 		x->normal_irq_n++;
 		if (likely(intr_status & DMA_CHAN_STATUS_RI)) {
-			u32 value;
-
-			value = readl(ioaddr + DMA_CHAN_INTR_ENA(chan));
-			/* to schedule NAPI on real RIE event. */
-			if (likely(value & DMA_CHAN_INTR_ENA_RIE)) {
-				x->rx_normal_irq_n++;
-				ret |= handle_rx;
-			}
+			x->rx_normal_irq_n++;
+			ret |= handle_rx;
 		}
-		if (likely(intr_status & DMA_CHAN_STATUS_TI)) {
+		if (likely(intr_status & (DMA_CHAN_STATUS_TI |
+					  DMA_CHAN_STATUS_TBU))) {
 			x->tx_normal_irq_n++;
 			ret |= handle_tx;
 		}
@@ -168,12 +163,7 @@ int dwmac4_dma_interrupt(void __iomem *ioaddr,
 			x->rx_early_irq++;
 	}
 
-	/* Clear the interrupt by writing a logic 1 to the chanX interrupt
-	 * status [21-0] expect reserved bits [5-3]
-	 */
-	writel((intr_status & 0x3fffc7),
-	       ioaddr + DMA_CHAN_STATUS(chan));
-
+	writel(intr_status & intr_en, ioaddr + DMA_CHAN_STATUS(chan));
 	return ret;
 }
 
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 3/3] net: stmmac: dwxgmac2: Also use TBU interrupt to clean TX path
From: Jose Abreu @ 2019-02-15 13:42 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Jose Abreu, Joao Pinto, David S . Miller, Giuseppe Cavallaro,
	Alexandre Torgue
In-Reply-To: <cover.1550237884.git.joabreu@synopsys.com>

TBU interrupt is a normal interrupt and can be used to trigger the
cleaning of TX path. Lets check if it's active in DMA interrupt handler.

While at it, refactor a little bit the function:
	- Don't check if RI is enabled because at function exit we will
	  only clear the interrupts that are enabled so, no event will
	  be missed.

In my tests withe XGMAC2 this increased performance.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h     | 4 +++-
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c | 8 +++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
index d6bb953685fa..37d5e6fe7473 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
@@ -193,9 +193,10 @@
 #define XGMAC_AIE			BIT(14)
 #define XGMAC_RBUE			BIT(7)
 #define XGMAC_RIE			BIT(6)
+#define XGMAC_TBUE			BIT(2)
 #define XGMAC_TIE			BIT(0)
 #define XGMAC_DMA_INT_DEFAULT_EN	(XGMAC_NIE | XGMAC_AIE | XGMAC_RBUE | \
-					XGMAC_RIE | XGMAC_TIE)
+					XGMAC_RIE | XGMAC_TBUE | XGMAC_TIE)
 #define XGMAC_DMA_CH_Rx_WATCHDOG(x)	(0x0000313c + (0x80 * (x)))
 #define XGMAC_RWT			GENMASK(7, 0)
 #define XGMAC_DMA_CH_STATUS(x)		(0x00003160 + (0x80 * (x)))
@@ -204,6 +205,7 @@
 #define XGMAC_FBE			BIT(12)
 #define XGMAC_RBU			BIT(7)
 #define XGMAC_RI			BIT(6)
+#define XGMAC_TBU			BIT(2)
 #define XGMAC_TPS			BIT(1)
 #define XGMAC_TI			BIT(0)
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
index c5e25580a43f..2ba712b48a89 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
@@ -283,12 +283,10 @@ static int dwxgmac2_dma_interrupt(void __iomem *ioaddr,
 		x->normal_irq_n++;
 
 		if (likely(intr_status & XGMAC_RI)) {
-			if (likely(intr_en & XGMAC_RIE)) {
-				x->rx_normal_irq_n++;
-				ret |= handle_rx;
-			}
+			x->rx_normal_irq_n++;
+			ret |= handle_rx;
 		}
-		if (likely(intr_status & XGMAC_TI)) {
+		if (likely(intr_status & (XGMAC_TI | XGMAC_TBU))) {
 			x->tx_normal_irq_n++;
 			ret |= handle_tx;
 		}
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 0/3] net: stmmac: Performance improvements in Multi-Queue
From: Jose Abreu @ 2019-02-15 13:42 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Jose Abreu, Florian Fainelli, Joao Pinto, David S . Miller,
	Giuseppe Cavallaro, Alexandre Torgue

Tested in XGMAC2 and GMAC5. I guess 1/3 can be backported but besides being a
kind of "not that small" refactoring it's also for a scenario that no-one
complained yet ... Advice ?

Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>

Jose Abreu (3):
  net: stmmac: Fix NAPI poll in TX path when in multi-queue
  net: stmmac: dwmac4: Also use TBU interrupt to clean TX path
  net: stmmac: dwxgmac2: Also use TBU interrupt to clean TX path

 drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c   |  24 ++---
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h     |   4 +-
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c |   8 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac.h       |   5 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 112 ++++++++++++---------
 5 files changed, 77 insertions(+), 76 deletions(-)

-- 
2.7.4


^ permalink raw reply

* [PATCH net-next 1/3] net: stmmac: Fix NAPI poll in TX path when in multi-queue
From: Jose Abreu @ 2019-02-15 13:42 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Jose Abreu, Florian Fainelli, Joao Pinto, David S . Miller,
	Giuseppe Cavallaro, Alexandre Torgue
In-Reply-To: <cover.1550237884.git.joabreu@synopsys.com>

Commit 8fce33317023 introduced the concept of NAPI per-channel and
independent cleaning of TX path.

This is currently breaking performance in some cases. The scenario
happens when all packets are being received in Queue 0 but the TX is
performed in Queue != 0.

Fix this by using different NAPI instances per each TX and RX queue, as
suggested by Florian.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |   5 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 112 ++++++++++++----------
 2 files changed, 64 insertions(+), 53 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 63e1064b27a2..e697ecd9b0a6 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -78,11 +78,10 @@ struct stmmac_rx_queue {
 };
 
 struct stmmac_channel {
-	struct napi_struct napi ____cacheline_aligned_in_smp;
+	struct napi_struct rx_napi ____cacheline_aligned_in_smp;
+	struct napi_struct tx_napi ____cacheline_aligned_in_smp;
 	struct stmmac_priv *priv_data;
 	u32 index;
-	int has_rx;
-	int has_tx;
 };
 
 struct stmmac_tc_entry {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 685d20472358..89a4dd75af76 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -155,7 +155,10 @@ static void stmmac_disable_all_queues(struct stmmac_priv *priv)
 	for (queue = 0; queue < maxq; queue++) {
 		struct stmmac_channel *ch = &priv->channel[queue];
 
-		napi_disable(&ch->napi);
+		if (queue < rx_queues_cnt)
+			napi_disable(&ch->rx_napi);
+		if (queue < tx_queues_cnt)
+			napi_disable(&ch->tx_napi);
 	}
 }
 
@@ -173,7 +176,10 @@ static void stmmac_enable_all_queues(struct stmmac_priv *priv)
 	for (queue = 0; queue < maxq; queue++) {
 		struct stmmac_channel *ch = &priv->channel[queue];
 
-		napi_enable(&ch->napi);
+		if (queue < rx_queues_cnt)
+			napi_enable(&ch->rx_napi);
+		if (queue < tx_queues_cnt)
+			napi_enable(&ch->tx_napi);
 	}
 }
 
@@ -1939,6 +1945,10 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
 		mod_timer(&priv->eee_ctrl_timer, STMMAC_LPI_T(eee_timer));
 	}
 
+	/* We still have pending packets, let's call for a new scheduling */
+	if (tx_q->dirty_tx != tx_q->cur_tx)
+		mod_timer(&tx_q->txtimer, STMMAC_COAL_TIMER(10));
+
 	__netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue));
 
 	return count;
@@ -2029,23 +2039,15 @@ static int stmmac_napi_check(struct stmmac_priv *priv, u32 chan)
 	int status = stmmac_dma_interrupt_status(priv, priv->ioaddr,
 						 &priv->xstats, chan);
 	struct stmmac_channel *ch = &priv->channel[chan];
-	bool needs_work = false;
-
-	if ((status & handle_rx) && ch->has_rx) {
-		needs_work = true;
-	} else {
-		status &= ~handle_rx;
-	}
 
-	if ((status & handle_tx) && ch->has_tx) {
-		needs_work = true;
-	} else {
-		status &= ~handle_tx;
+	if ((status & handle_rx) && (chan < priv->plat->rx_queues_to_use)) {
+		stmmac_disable_dma_irq(priv, priv->ioaddr, chan);
+		napi_schedule_irqoff(&ch->rx_napi);
 	}
 
-	if (needs_work && napi_schedule_prep(&ch->napi)) {
+	if ((status & handle_tx) && (chan < priv->plat->tx_queues_to_use)) {
 		stmmac_disable_dma_irq(priv, priv->ioaddr, chan);
-		__napi_schedule(&ch->napi);
+		napi_schedule_irqoff(&ch->tx_napi);
 	}
 
 	return status;
@@ -2241,8 +2243,14 @@ static void stmmac_tx_timer(struct timer_list *t)
 
 	ch = &priv->channel[tx_q->queue_index];
 
-	if (likely(napi_schedule_prep(&ch->napi)))
-		__napi_schedule(&ch->napi);
+	/*
+	 * If NAPI is already running we can miss some events. Let's rearm
+	 * the timer and try again.
+	 */
+	if (likely(napi_schedule_prep(&ch->tx_napi)))
+		__napi_schedule(&ch->tx_napi);
+	else
+		mod_timer(&tx_q->txtimer, STMMAC_COAL_TIMER(10));
 }
 
 /**
@@ -3498,7 +3506,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 			else
 				skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-			napi_gro_receive(&ch->napi, skb);
+			napi_gro_receive(&ch->rx_napi, skb);
 
 			priv->dev->stats.rx_packets++;
 			priv->dev->stats.rx_bytes += frame_len;
@@ -3513,41 +3521,41 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 	return count;
 }
 
-/**
- *  stmmac_poll - stmmac poll method (NAPI)
- *  @napi : pointer to the napi structure.
- *  @budget : maximum number of packets that the current CPU can receive from
- *	      all interfaces.
- *  Description :
- *  To look at the incoming frames and clear the tx resources.
- */
-static int stmmac_napi_poll(struct napi_struct *napi, int budget)
+static int stmmac_napi_poll_rx(struct napi_struct *napi, int budget)
 {
 	struct stmmac_channel *ch =
-		container_of(napi, struct stmmac_channel, napi);
+		container_of(napi, struct stmmac_channel, rx_napi);
 	struct stmmac_priv *priv = ch->priv_data;
-	int work_done, rx_done = 0, tx_done = 0;
 	u32 chan = ch->index;
+	int work_done;
 
 	priv->xstats.napi_poll++;
 
-	if (ch->has_tx)
-		tx_done = stmmac_tx_clean(priv, budget, chan);
-	if (ch->has_rx)
-		rx_done = stmmac_rx(priv, budget, chan);
+	work_done = stmmac_rx(priv, budget, chan);
+	if (work_done < budget && napi_complete_done(napi, work_done))
+		stmmac_enable_dma_irq(priv, priv->ioaddr, chan);
+	return work_done;
+}
 
-	work_done = max(rx_done, tx_done);
-	work_done = min(work_done, budget);
+static int stmmac_napi_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct stmmac_channel *ch =
+		container_of(napi, struct stmmac_channel, tx_napi);
+	struct stmmac_priv *priv = ch->priv_data;
+	struct stmmac_tx_queue *tx_q;
+	u32 chan = ch->index;
+	int work_done;
 
-	if (work_done < budget && napi_complete_done(napi, work_done)) {
-		int stat;
+	priv->xstats.napi_poll++;
 
+	work_done = stmmac_tx_clean(priv, budget, chan);
+	if (work_done < budget && napi_complete_done(napi, work_done))
 		stmmac_enable_dma_irq(priv, priv->ioaddr, chan);
-		stat = stmmac_dma_interrupt_status(priv, priv->ioaddr,
-						   &priv->xstats, chan);
-		if (stat && napi_reschedule(napi))
-			stmmac_disable_dma_irq(priv, priv->ioaddr, chan);
-	}
+
+	/* Force transmission restart */
+	tx_q = &priv->tx_queue[chan];
+	stmmac_enable_dma_transmission(priv, priv->ioaddr);
+	stmmac_set_tx_tail_ptr(priv, priv->ioaddr, tx_q->tx_tail_addr, chan);
 
 	return work_done;
 }
@@ -4323,13 +4331,14 @@ int stmmac_dvr_probe(struct device *device,
 		ch->priv_data = priv;
 		ch->index = queue;
 
-		if (queue < priv->plat->rx_queues_to_use)
-			ch->has_rx = true;
-		if (queue < priv->plat->tx_queues_to_use)
-			ch->has_tx = true;
-
-		netif_napi_add(ndev, &ch->napi, stmmac_napi_poll,
-			       NAPI_POLL_WEIGHT);
+		if (queue < priv->plat->rx_queues_to_use) {
+			netif_napi_add(ndev, &ch->rx_napi, stmmac_napi_poll_rx,
+				       NAPI_POLL_WEIGHT);
+		}
+		if (queue < priv->plat->tx_queues_to_use) {
+			netif_napi_add(ndev, &ch->tx_napi, stmmac_napi_poll_tx,
+				       NAPI_POLL_WEIGHT);
+		}
 	}
 
 	mutex_init(&priv->lock);
@@ -4385,7 +4394,10 @@ int stmmac_dvr_probe(struct device *device,
 	for (queue = 0; queue < maxq; queue++) {
 		struct stmmac_channel *ch = &priv->channel[queue];
 
-		netif_napi_del(&ch->napi);
+		if (queue < priv->plat->rx_queues_to_use)
+			netif_napi_del(&ch->rx_napi);
+		if (queue < priv->plat->tx_queues_to_use)
+			netif_napi_del(&ch->tx_napi);
 	}
 error_hw_init:
 	destroy_workqueue(priv->wq);
-- 
2.7.4


^ permalink raw reply related

* [RFC PATCH] bonding: use mutex lock in bond_get_stats()
From: Kefeng Wang @ 2019-02-15 13:50 UTC (permalink / raw)
  To: netdev, Willem de Bruijn, David S . Miller, Jay Vosburgh,
	Veaceslav Falico, Eric Dumazet
  Cc: weiyongjun1, Kefeng Wang

With CONFIG_DEBUG_SPINLOCK=y, we find following stack,

 BUG: spinlock wrong CPU on CPU#0, ip/16047
  lock: 0xffff803f5febc998, .magic: dead4ead, .owner: ip/16047, .owner_cpu: 0
 CPU: 1 PID: 16047 Comm: ip Kdump: loaded Tainted: G            E 4.19.12.aarch64 #1
 Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS TaiShan 2280 V2 - B900 01/29/2019
 Call trace:
  dump_backtrace+0x0/0x1c0
  show_stack+0x24/0x30
  dump_stack+0x90/0xbc
  spin_dump+0x84/0xa8
  do_raw_spin_unlock+0xf8/0x100
  _raw_spin_unlock+0x20/0x30
  bond_get_stats+0x110/0x140 [bonding]
  rtnl_fill_stats+0x50/0x150
  rtnl_fill_ifinfo+0x4d4/0xd18
  rtnl_dump_ifinfo+0x200/0x3a8
  netlink_dump+0x100/0x2b0
  netlink_recvmsg+0x310/0x3e8
  sock_recvmsg+0x58/0x68
  ___sys_recvmsg+0xd0/0x278
  __sys_recvmsg+0x74/0xd0
  __arm64_sys_recvmsg+0x2c/0x38
  el0_svc_common+0x7c/0x118
  el0_svc_handler+0x30/0x40
  el0_svc+0x8/0xc

and then lead to softlockup issue, fix this by using mutex lock instead
of spin lock.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---

Not sure if this is right fix, please correct me if I'm wrong.

 drivers/net/bonding/bond_main.c | 6 +++---
 include/net/bonding.h           | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 485462d3087f..3f7849525759 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3452,7 +3452,7 @@ static void bond_get_stats(struct net_device *bond_dev,
 	struct list_head *iter;
 	struct slave *slave;
 
-	spin_lock_nested(&bond->stats_lock, bond_get_nest_level(bond_dev));
+	mutex_lock_nested(&bond->stats_lock, bond_get_nest_level(bond_dev));
 	memcpy(stats, &bond->bond_stats, sizeof(*stats));
 
 	rcu_read_lock();
@@ -3468,7 +3468,7 @@ static void bond_get_stats(struct net_device *bond_dev,
 	rcu_read_unlock();
 
 	memcpy(&bond->bond_stats, stats, sizeof(*stats));
-	spin_unlock(&bond->stats_lock);
+	mutex_unlock(&bond->stats_lock);
 }
 
 static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int cmd)
@@ -4284,7 +4284,7 @@ void bond_setup(struct net_device *bond_dev)
 	struct bonding *bond = netdev_priv(bond_dev);
 
 	spin_lock_init(&bond->mode_lock);
-	spin_lock_init(&bond->stats_lock);
+	mutex_init(&bond->stats_lock);
 	bond->params = bonding_defaults;
 
 	/* Initialize pointers */
diff --git a/include/net/bonding.h b/include/net/bonding.h
index b46d68acf701..3a6dbb2b376c 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -205,7 +205,7 @@ struct bonding {
 	 * ALB mode (6) - to sync the use and modifications of its hash table
 	 */
 	spinlock_t mode_lock;
-	spinlock_t stats_lock;
+	struct mutex	stats_lock;
 	u8	 send_peer_notif;
 	u8       igmp_retrans;
 #ifdef CONFIG_PROC_FS
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net-next v2 3/3] enetc: Add ENETC PF level external MDIO support
From: Andrew Lunn @ 2019-02-15 13:34 UTC (permalink / raw)
  To: Claudiu Manoil
  Cc: Shawn Guo, Li Yang, David S . Miller, alexandru.marginean,
	linux-arm-kernel, devicetree, netdev, linux-kernel
In-Reply-To: <1550225414-12125-4-git-send-email-claudiu.manoil@nxp.com>

On Fri, Feb 15, 2019 at 12:10:14PM +0200, Claudiu Manoil wrote:
> Each ENETC PF has its own MDIO interface, the corresponding
> MDIO registers are mapped in the ENETC's Port register block.
> The current patch adds a driver for these PF level MDIO buses,
> so that each PF can manage directly its own external link.
> 
> Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com>
> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
> ---
> v2 - used readx_poll_timeout()
>    - added mdio node child to the port node

Hi Claudiu

Please document this in the device tree binding.

> +	/* return all Fs if nothing was there */
> +	if (enetc_rd_reg(&regs->mdio_cfg) & MDIO_CFG_RD_ER) {
> +		dev_err(&bus->dev,
> +			"Error while reading PHY%d reg at %d.%hhu\n",
> +			phy_id, dev_addr, regnum);
> +		return 0xffff;

I'm not sure you want a dev_err() here. The device tree binding allows
you to have a phy node without a reg value. When that happens, the
core code will scan all 32 addresses to find the PHY. This is going to
spam the log. dev_dbg() might be better.

     Andrew

^ permalink raw reply

* Re: [PATCH net-next] net: phy: marvell10g: Don't explicitly set Pause and Asym_Pause
From: Andrew Lunn @ 2019-02-15 13:24 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: davem, netdev, linux-kernel, Florian Fainelli, Heiner Kallweit,
	Russell King, linux-arm-kernel, Antoine Tenart, thomas.petazzoni,
	gregory.clement, miquel.raynal
In-Reply-To: <20190215083347.5886-1-maxime.chevallier@bootlin.com>

On Fri, Feb 15, 2019 at 09:33:47AM +0100, Maxime Chevallier wrote:
> The PHY core expects PHY drivers not to set Pause and Asym_Pause bits,
> unless the driver only wants to specify one of them due to HW
> limitation. In the case of the Marvell10g driver, we don't need to set
> them.
> 
> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
> Suggested-by: Andrew Lunn <andrew@lunn.ch>

Thanks Maxime

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH RFC] net: bridge: don't flood known multicast traffic when snooping is enabled
From: Nikolay Aleksandrov @ 2019-02-15 13:09 UTC (permalink / raw)
  To: netdev; +Cc: roopa, wkok, anuradhak, bridge, linus.luessing, davem, stephen
In-Reply-To: <20190215130427.29824-1-nikolay@cumulusnetworks.com>

On 15/02/2019 15:04, Nikolay Aleksandrov wrote:
> The behaviour since b00589af3b04 ("bridge: disable snooping if there is
> no querier") is wrong, we shouldn't be flooding multicast traffic when
> there is an mdb entry and we know where it should be forwarded to when
> multicast snooping is enabled. This patch changes the behaviour to not
> flood known unicast traffic. I'll give two obviously broken cases:
>  - most obvious: static mdb created by the user with snooping enabled
>  - user-space daemon controlling the mdb table (e.g. MLAG)
> 

I had to mention: when snooping is enabled and there is *no querier
configured*, that is the important bit here. In most setups the querier
is explicitly configured when there is no mcast router, but it shouldn't
be required to have the intuitive and normal behaviour.

> Every user would expect to have traffic forwarded only to the configured
> mdb destination when snooping is enabled, instead now to get that one
> needs to enable both snooping and querier. Enabling querier on all
> switches could be problematic and is not a good solution, for example
> as summarized by our multicast experts:
> "every switch would send an IGMP query for any random multicast traffic it
> received across the entire domain and it would send it forever as long as a
> host exists wanting that stream even if it has no downstream/directly
> connected receivers"
> 
> Sending as an RFC to get the discussion going, but I'm strongly for
> removing this behaviour and would like to send this patch officially.
> 
> We could make this behaviour possible via a knob if necessary, but
> it really should not be the default.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
>  net/bridge/br_device.c | 3 +--
>  net/bridge/br_input.c  | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index 013323b6dbe4..2aa8a6509924 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -96,8 +96,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>  		}
>  
>  		mdst = br_mdb_get(br, skb, vid);
> -		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
> -		    br_multicast_querier_exists(br, eth_hdr(skb)))
> +		if (mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb))
>  			br_multicast_flood(mdst, skb, false, true);
>  		else
>  			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 5ea7e56119c1..aae78095cf67 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -136,8 +136,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>  	switch (pkt_type) {
>  	case BR_PKT_MULTICAST:
>  		mdst = br_mdb_get(br, skb, vid);
> -		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
> -		    br_multicast_querier_exists(br, eth_hdr(skb))) {
> +		if (mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) {
>  			if ((mdst && mdst->host_joined) ||
>  			    br_multicast_is_router(br)) {
>  				local_rcv = true;
> 


^ permalink raw reply

* [PATCH RFC] net: bridge: don't flood known multicast traffic when snooping is enabled
From: Nikolay Aleksandrov @ 2019-02-15 13:04 UTC (permalink / raw)
  To: netdev
  Cc: roopa, wkok, anuradhak, bridge, linus.luessing, davem, stephen,
	Nikolay Aleksandrov

The behaviour since b00589af3b04 ("bridge: disable snooping if there is
no querier") is wrong, we shouldn't be flooding multicast traffic when
there is an mdb entry and we know where it should be forwarded to when
multicast snooping is enabled. This patch changes the behaviour to not
flood known unicast traffic. I'll give two obviously broken cases:
 - most obvious: static mdb created by the user with snooping enabled
 - user-space daemon controlling the mdb table (e.g. MLAG)

Every user would expect to have traffic forwarded only to the configured
mdb destination when snooping is enabled, instead now to get that one
needs to enable both snooping and querier. Enabling querier on all
switches could be problematic and is not a good solution, for example
as summarized by our multicast experts:
"every switch would send an IGMP query for any random multicast traffic it
received across the entire domain and it would send it forever as long as a
host exists wanting that stream even if it has no downstream/directly
connected receivers"

Sending as an RFC to get the discussion going, but I'm strongly for
removing this behaviour and would like to send this patch officially.

We could make this behaviour possible via a knob if necessary, but
it really should not be the default.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 net/bridge/br_device.c | 3 +--
 net/bridge/br_input.c  | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 013323b6dbe4..2aa8a6509924 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -96,8 +96,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 
 		mdst = br_mdb_get(br, skb, vid);
-		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
-		    br_multicast_querier_exists(br, eth_hdr(skb)))
+		if (mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb))
 			br_multicast_flood(mdst, skb, false, true);
 		else
 			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 5ea7e56119c1..aae78095cf67 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -136,8 +136,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 	switch (pkt_type) {
 	case BR_PKT_MULTICAST:
 		mdst = br_mdb_get(br, skb, vid);
-		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
-		    br_multicast_querier_exists(br, eth_hdr(skb))) {
+		if (mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) {
 			if ((mdst && mdst->host_joined) ||
 			    br_multicast_is_router(br)) {
 				local_rcv = true;
-- 
2.17.2


^ permalink raw reply related

* pull-request: mac80211 2019-02-15
From: Johannes Berg @ 2019-02-15 12:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

It's clear things are winding down, this is basically just the stuff
from Herbert that we've been discussing. I threw in a simple error
path fix, mostly because it's simple :-)

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit f9bcc9f3ee4fbbe8f11dfec76745476f5780517e:

  net: ethernet: freescale: set FEC ethtool regs version (2019-02-14 12:45:35 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2019-02-15

for you to fetch changes up to 83e37e0bdd1470bbe6612250b745ad39b1a7b130:

  mac80211: Restore vif beacon interval if start ap fails (2019-02-15 13:30:24 +0100)

----------------------------------------------------------------
Just a few fixes this time:
 * mesh rhashtable fixes from Herbert
 * a small error path fix when starting AP interfaces

----------------------------------------------------------------
Herbert Xu (2):
      mac80211: Use linked list instead of rhashtable walk for mesh tables
      mac80211: Free mpath object when rhashtable insertion fails

Rakesh Pillai (1):
      mac80211: Restore vif beacon interval if start ap fails

 net/mac80211/cfg.c          |   6 +-
 net/mac80211/mesh.h         |   6 ++
 net/mac80211/mesh_pathtbl.c | 155 +++++++++++++-------------------------------
 3 files changed, 57 insertions(+), 110 deletions(-)


^ permalink raw reply

* Re: [PATCH] qmi_wwan: apply SET_DTR quirk to Sierra WP7607
From: Bjørn Mork @ 2019-02-15 12:32 UTC (permalink / raw)
  To: Beniamino Galvani; +Cc: David S . Miller, netdev, linux-usb, linux-kernel
In-Reply-To: <20190215122042.23195-1-bgalvani@redhat.com>

Beniamino Galvani <bgalvani@redhat.com> writes:

> The 1199:68C0 USB ID is reused by Sierra WP7607 which requires the DTR
> quirk to be detected. Apply QMI_QUIRK_SET_DTR unconditionally as
> already done for other IDs shared between different devices.
>
> Signed-off-by: Beniamino Galvani <bgalvani@redhat.com>

Thanks. Looks like it's time to consider applying this unconditionally
to all devices.  Feel free to propose something like that the next time
this issue comes up.

Acked-by: Bjørn Mork <bjorn@mork.no>


^ permalink raw reply

* Re: [PATCH net-next 03/12] net: sched: flower: introduce reference counting for filters
From: Stefano Brivio @ 2019-02-15 12:32 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: netdev@vger.kernel.org, jhs@mojatatu.com,
	xiyou.wangcong@gmail.com, jiri@resnulli.us, davem@davemloft.net
In-Reply-To: <vbfef89jny6.fsf@mellanox.com>

On Fri, 15 Feb 2019 11:22:45 +0000
Vlad Buslov <vladbu@mellanox.com> wrote:

> On Thu 14 Feb 2019 at 20:34, Stefano Brivio <sbrivio@redhat.com> wrote:
> > On Thu, 14 Feb 2019 09:47:03 +0200
> > Vlad Buslov <vladbu@mellanox.com> wrote:
> >  
> >> +static struct cls_fl_filter *fl_get_next_filter(struct tcf_proto *tp,
> >> +						unsigned long *handle)
> >> +{
> >> +	struct cls_fl_head *head = fl_head_dereference(tp);
> >> +	struct cls_fl_filter *f;
> >> +
> >> +	rcu_read_lock();
> >> +	/* don't return filters that are being deleted */
> >> +	while ((f = idr_get_next_ul(&head->handle_idr,
> >> +				    handle)) != NULL &&
> >> +	       !refcount_inc_not_zero(&f->refcnt))
> >> +		++(*handle);  
> >
> > This... hurts :) What about:
> >
> > 	while ((f = idr_get_next_ul(&head->handle_idr, &handle))) {
> > 		if (refcount_inc_not_zero(&f->refcnt))
> > 			break;
> > 		++(*handle);
> > 	}
> >
> > ?  
> 
> I prefer to avoid using value of assignment as boolean and
> non-structured jumps, when possible. In this case it seems OK either
> way, but how about:
> 
> 	for (f = idr_get_next_ul(&head->handle_idr, handle);
> 	     f && !refcount_inc_not_zero(&f->refcnt);
> 	     f = idr_get_next_ul(&head->handle_idr, handle))
> 		++(*handle);

Honestly, I preferred the original, this is repeating idr_get_next_ul()
twice.

Maybe, just:

	[...]
	struct idr *idr;

	[...]
	idr = &head->handle_idr;
	while ((f = idr_get_next_ul(idr, handle)) != NULL &&
	       !refcount_inc_not_zero(&f->refcnt))
		++(*handle);

also rather ugly, but not entirely unreadable. I tried drafting a
helper for this, but it just ends up hiding what this does.

> >> @@ -1349,6 +1404,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> >>  		err = -ENOBUFS;
> >>  		goto errout_tb;
> >>  	}
> >> +	refcount_set(&fnew->refcnt, 1);
> >>
> >>  	err = tcf_exts_init(&fnew->exts, TCA_FLOWER_ACT, 0);
> >>  	if (err < 0)
> >> @@ -1381,6 +1437,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> >>  	if (!tc_in_hw(fnew->flags))
> >>  		fnew->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
> >>
> >> +	refcount_inc(&fnew->refcnt);  
> >
> > I guess I'm not getting the semantics but... why is it 2 now?  
> 
> As soon as fnew is inserted into head->handle_idr (one reference), it
> becomes accessible to concurrent users, which means that it can be
> deleted at any time. However, tp->change() returns a reference to newly
> created filter to cls_api by assigning "arg" parameter to it (second
> reference). After tp->change() returns, cls API continues to use fnew
> and releases it with tfilter_put() when finished.

I see, thanks for the explanation!

-- 
Stefano

^ permalink raw reply

* Re: [PATCH AUTOSEL 3.18 15/16] cfg80211: extend range deviation for DMG
From: Johannes Berg @ 2019-02-15 12:31 UTC (permalink / raw)
  To: Sasha Levin, linux-kernel, stable
  Cc: Chaitanya Tata, Chaitanya Tata, linux-wireless, netdev
In-Reply-To: <20190215021546.179605-15-sashal@kernel.org>

On Thu, 2019-02-14 at 21:15 -0500, Sasha Levin wrote:
> From: Chaitanya Tata <chaitanya.tata@bluwirelesstechnology.com>
> 
> [ Upstream commit 93183bdbe73bbdd03e9566c8dc37c9d06b0d0db6 ]
> 
> Recently, DMG frequency bands have been extended till 71GHz, so extend
> the range check till 20GHz (45-71GHZ), else some channels will be marked
> as disabled.

There's not really any danger in picking this up for old kernels, but
also practically no value since those kernels wouldn't have supoprt for
the higher ranges ("recently, ...") part :)

johannes


^ permalink raw reply

* [PATCH] qmi_wwan: apply SET_DTR quirk to Sierra WP7607
From: Beniamino Galvani @ 2019-02-15 12:20 UTC (permalink / raw)
  To: Bjørn Mork, David S . Miller, netdev, linux-usb,
	linux-kernel

The 1199:68C0 USB ID is reused by Sierra WP7607 which requires the DTR
quirk to be detected. Apply QMI_QUIRK_SET_DTR unconditionally as
already done for other IDs shared between different devices.

Signed-off-by: Beniamino Galvani <bgalvani@redhat.com>
---
 drivers/net/usb/qmi_wwan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 735ad838e2ba..18af2f8eee96 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1201,8 +1201,8 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x114f, 0x68a2, 8)},    /* Sierra Wireless MC7750 */
 	{QMI_FIXED_INTF(0x1199, 0x68a2, 8)},	/* Sierra Wireless MC7710 in QMI mode */
 	{QMI_FIXED_INTF(0x1199, 0x68a2, 19)},	/* Sierra Wireless MC7710 in QMI mode */
-	{QMI_FIXED_INTF(0x1199, 0x68c0, 8)},	/* Sierra Wireless MC7304/MC7354 */
-	{QMI_FIXED_INTF(0x1199, 0x68c0, 10)},	/* Sierra Wireless MC7304/MC7354 */
+	{QMI_QUIRK_SET_DTR(0x1199, 0x68c0, 8)},	/* Sierra Wireless MC7304/MC7354, WP76xx */
+	{QMI_QUIRK_SET_DTR(0x1199, 0x68c0, 10)},/* Sierra Wireless MC7304/MC7354 */
 	{QMI_FIXED_INTF(0x1199, 0x901c, 8)},    /* Sierra Wireless EM7700 */
 	{QMI_FIXED_INTF(0x1199, 0x901f, 8)},    /* Sierra Wireless EM7355 */
 	{QMI_FIXED_INTF(0x1199, 0x9041, 8)},	/* Sierra Wireless MC7305/MC7355 */
-- 
2.20.1


^ permalink raw reply related

* Re: [v3 PATCH 0/4] mac80211: Fix incorrect usage of rhashtable walk API
From: Johannes Berg @ 2019-02-15 12:21 UTC (permalink / raw)
  To: Herbert Xu, David Miller, linux-wireless, netdev, j, tgraf,
	Julia Lawall
In-Reply-To: <20190214140236.omt74prxhkfaasue@gondor.apana.org.au>

> The first two patches in this series are bug fixes and should be
> backported to stable.
> 
> They fixes a number of issues with the use of the rhashtable API
> in mac80211.  First of all it converts the use of rashtable walks over
> to a simple linked list.  This is because an rhashtable walk is
> inherently unstable and not meant for uses that require stability,
> e.g., when you're trying to lookup an object to delete.

Thanks a lot, Herbert.

Applied those now, I'll send a pull request to Dave with them. Once that
trickles back into net-next I'll apply the third patch (it doesn't apply
without the others), and then Dave you can take the rhashtable one.

Let me know if you'd prefer I take the rhashtable one through my tree,
which really would be only so you don't have to track the dependency.

NB: it'd be easier in patchwork if you tagged all the patches with v3 in
their own PATCH tag, or put the "v3" tag into the actual subject (not
the "[PATCH 0/4]" tag because evidently patchwork drops the tags and
doesn't track them for the *series* just each *patch* ... so with what
you did nothing is visible in patchwork, even just appending "(v3)" to
the subject of the cover letter would've fixed that...

johannes


^ permalink raw reply

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
From: Vlad Buslov @ 2019-02-15 12:15 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev@vger.kernel.org, jhs@mojatatu.com,
	xiyou.wangcong@gmail.com, jiri@resnulli.us, davem@davemloft.net,
	ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190215113041.GA10511@splinter>


On Fri 15 Feb 2019 at 11:30, Ido Schimmel <idosch@idosch.org> wrote:
> On Fri, Feb 15, 2019 at 10:02:11AM +0000, Vlad Buslov wrote:
>>
>> On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
>> > On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> >> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> >> when accessing filter_chain list, instead of relying on rtnl lock.
>> >> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> >> verify that all users of chain_list have the lock taken.
>> >>
>> >> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> >> all necessary code while holding chain lock in order to prevent
>> >> invalidation of chain_info structure by potential concurrent change. This
>> >> also serializes calls to tcf_chain0_head_change(), which allows head change
>> >> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> >> mutex.
>> >>
>> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
>> >
>> > With this sequence [1] I get the following trace [2]. Bisected it to
>> > this patch. Note that second filter is rejected by the device driver
>> > (that's the intention). I guess it is not properly removed from the
>> > filter chain in the error path?
>> >
>> > Thanks
>> >
>> > [1]
>> > # tc qdisc add dev swp3 clsact
>> > # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > RTNETLINK answers: File exists
>> > We have an error talking to the kernel, -1
>> > # tc qdisc del dev swp3 clsact
>> >
>> > [2]
>> > [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
>> > [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
>> > [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
>> > [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
>> > [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
>> > [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
>> > e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
>> > [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
>> > [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
>> > [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
>> > [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
>> > [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
>> > [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
>> > [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
>> > [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
>> > [   70.675633] Call Trace:
>> > [   70.693046]  tcf_block_playback_offloads+0x94/0x230
>> > [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
>> > [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
>> > [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
>> > [   70.748898]  __tcf_block_put+0x203/0x630
>> > [   70.769700]  tcf_block_put_ext+0x2f/0x40
>> > [   70.774098]  clsact_destroy+0x7a/0xb0
>> > [   70.782604]  qdisc_destroy+0x11a/0x5c0
>> > [   70.786810]  qdisc_put+0x5a/0x70
>> > [   70.790435]  notify_and_destroy+0x8e/0xa0
>> > [   70.794933]  qdisc_graft+0xbb7/0xef0
>> > [   70.809009]  tc_get_qdisc+0x518/0xa30
>> > [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
>> > [   70.835510]  netlink_rcv_skb+0x132/0x380
>> > [   70.848419]  netlink_unicast+0x4c0/0x690
>> > [   70.866019]  netlink_sendmsg+0x929/0xe10
>> > [   70.882134]  sock_sendmsg+0xc8/0x110
>> > [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
>> > [   70.924064]  __sys_sendmsg+0xf7/0x250
>> > [   70.955727]  do_syscall_64+0x14d/0x610
>>
>> Hi Ido,
>>
>> Thanks for reporting this.
>>
>> I looked at the code and problem seems to be matchall classifier
>> specific. My implementation of unlocked cls API assumes that concurrent
>> insertions are possible and checks for it when deleting "empty" tp.
>> Since classifiers don't expose number of elements, the only way to test
>> this is to do tp->walk() on them and assume that walk callback is called
>> once per filter on every classifier. In your example new tp is created
>> for second filter, filter insertion fails, number of elements on newly
>> created tp is checked with tp->walk() before deleting it. However,
>> matchall classifier always calls the tp->walk() callback once, even when
>> it doesn't have a valid filter (in this case with NULL filter pointer).
>>
>> Possible ways to fix this:
>>
>> 1) Check for NULL filter pointer in tp->walk() callback and ignore it
>> when counting filters on tp - will work but I don't like it because I
>> don't think it is ever correct to call tp->walk() callback with NULL
>> filter pointer.
>>
>> 2) Fix matchall to not call tp->walk() callback with NULL filter
>> pointers - my preferred simple solution.
>>
>> 3) Extend tp API to have direct way to count filters by implementing
>> tp->count - requires change to every classifier. Or maybe add it as
>> optional API that only unlocked classifiers are required to implement
>> and just delete rtnl lock dependent tp's without checking for concurrent
>> insertions.
>>
>> What do you think?
>
> Since the problem is matchall-specific, then it makes sense to fix it in
> matchall like you suggested in option #2.
>
> Can you please use this opportunity and audit other classifiers to
> confirm problem is indeed specific to matchall?
>
> In any case, feel free to send me a patch and I'll test it.
>
> Thanks

I've sent you the patch for matchall and will audit all other
classifiers for this issue.

Thanks,
Vlad

^ permalink raw reply

* [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
From: Vlad Buslov @ 2019-02-15 12:11 UTC (permalink / raw)
  To: idosch; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, Vlad Buslov
In-Reply-To: <20190215113041.GA10511@splinter>

Check that filter is not NULL before passing it to tcf_walker->fn()
callback. This can happen when mall_change() failed to offload filter to
hardware.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 net/sched/cls_matchall.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index a37137430e61..1f9d481b0fbb 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -247,6 +247,9 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg,
 
 	if (arg->count < arg->skip)
 		goto skip;
+
+	if (!head)
+		return;
 	if (arg->fn(tp, head, arg) < 0)
 		arg->stop = 1;
 skip:
-- 
2.13.6


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox