Netdev List


* Re: [PATCH v2 3/9] rcu/sync: Remove custom check for reader-section
From: Joel Fernandes @ 2019-07-12 21:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oleg Nesterov, Alexey Kuznetsov, Bjorn Helgaas, Borislav Petkov,
	c0d1n61at3, David S. Miller, edumazet, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Josh Triplett, keescook, kernel-hardening, kernel-team,
	Lai Jiangshan, Len Brown, linux-acpi, linux-doc, linux-pci,
	linux-pm, Mathieu Desnoyers, neilb, netdev, Paul E. McKenney,
	Pavel Machek, peterz, Rafael J. Wysocki, Rasmus Villemoes, rcu,
	Steven Rostedt, Tejun Heo, Thomas Gleixner, will,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190712170024.111093-4-joel@joelfernandes.org>

On Fri, Jul 12, 2019 at 01:00:18PM -0400, Joel Fernandes (Google) wrote:
> The rcu/sync code was doing its own check whether we are in a reader
> section. With RCU consolidating flavors and the generic helper added in
> this series, this is no longer needed. We can just use the generic helper
> and it results in a nice cleanup.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Hi Oleg,
Slightly unrelated to the patch,
I tried hard to understand this comment below in percpu_down_read() but no dice.

I do understand how rcu sync and percpu rwsem work, however the comment
below didn't make much sense to me. For one, there's no readers_fast anymore
so I did not follow what readers_fast means. Could the comment be updated to
reflect the latest changes?
Also, could you help me understand how a writer is not able to change
sem->state and count the per-cpu read counters at the same time, as the
comment tries to say?

	/*
	 * We are in an RCU-sched read-side critical section, so the writer
	 * cannot both change sem->state from readers_fast and start checking
	 * counters while we are here. So if we see !sem->state, we know that
	 * the writer won't be checking until we're past the preempt_enable()
	 * and that once the synchronize_rcu() is done, the writer will see
	 * anything we did within this RCU-sched read-size critical section.
	 */

Also,
I guess we could get rid of all of the gp_ops struct stuff now that all
the callbacks are the same. I will post that as a follow-up patch to this
series.

thanks!

 - Joel


> ---
> Please note: Only build and boot tested this particular patch so far.
> 
>  include/linux/rcu_sync.h |  5 ++---
>  kernel/rcu/sync.c        | 22 ----------------------
>  2 files changed, 2 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
> index 6fc53a1345b3..c954f1efc919 100644
> --- a/include/linux/rcu_sync.h
> +++ b/include/linux/rcu_sync.h
> @@ -39,9 +39,8 @@ extern void rcu_sync_lockdep_assert(struct rcu_sync *);
>   */
>  static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
>  {
> -#ifdef CONFIG_PROVE_RCU
> -	rcu_sync_lockdep_assert(rsp);
> -#endif
> +	RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(),
> +			 "suspicious rcu_sync_is_idle() usage");
>  	return !rsp->gp_state; /* GP_IDLE */
>  }
>  
> diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
> index a8304d90573f..535e02601f56 100644
> --- a/kernel/rcu/sync.c
> +++ b/kernel/rcu/sync.c
> @@ -10,37 +10,25 @@
>  #include <linux/rcu_sync.h>
>  #include <linux/sched.h>
>  
> -#ifdef CONFIG_PROVE_RCU
> -#define __INIT_HELD(func)	.held = func,
> -#else
> -#define __INIT_HELD(func)
> -#endif
> -
>  static const struct {
>  	void (*sync)(void);
>  	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
>  	void (*wait)(void);
> -#ifdef CONFIG_PROVE_RCU
> -	int  (*held)(void);
> -#endif
>  } gp_ops[] = {
>  	[RCU_SYNC] = {
>  		.sync = synchronize_rcu,
>  		.call = call_rcu,
>  		.wait = rcu_barrier,
> -		__INIT_HELD(rcu_read_lock_held)
>  	},
>  	[RCU_SCHED_SYNC] = {
>  		.sync = synchronize_rcu,
>  		.call = call_rcu,
>  		.wait = rcu_barrier,
> -		__INIT_HELD(rcu_read_lock_sched_held)
>  	},
>  	[RCU_BH_SYNC] = {
>  		.sync = synchronize_rcu,
>  		.call = call_rcu,
>  		.wait = rcu_barrier,
> -		__INIT_HELD(rcu_read_lock_bh_held)
>  	},
>  };
>  
> @@ -49,16 +37,6 @@ enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
>  
>  #define	rss_lock	gp_wait.lock
>  
> -#ifdef CONFIG_PROVE_RCU
> -void rcu_sync_lockdep_assert(struct rcu_sync *rsp)
> -{
> -	RCU_LOCKDEP_WARN(!gp_ops[rsp->gp_type].held(),
> -			 "suspicious rcu_sync_is_idle() usage");
> -}
> -
> -EXPORT_SYMBOL_GPL(rcu_sync_lockdep_assert);
> -#endif
> -
>  /**
>   * rcu_sync_init() - Initialize an rcu_sync structure
>   * @rsp: Pointer to rcu_sync structure to be initialized
> -- 
> 2.22.0.510.g264f2c817a-goog
> 


* Re: [PATCH net-next] net: openvswitch: do not update max_headroom if new headroom is equal to old headroom
From: David Miller @ 2019-07-12 22:18 UTC (permalink / raw)
  To: ap420073; +Cc: pshelar, netdev, dev
In-Reply-To: <20190705160809.5202-1-ap420073@gmail.com>

From: Taehee Yoo <ap420073@gmail.com>
Date: Sat,  6 Jul 2019 01:08:09 +0900

> When a vport is deleted, the maximum headroom size would be changed.
> If the vport which has the largest headroom is deleted,
> the new max_headroom would be set.
> But, if the new headroom size is equal to the old headroom size,
> updating routine is unnecessary.
> 
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>

I don't think Taehee should be punished because it took several days
to get someone to look at and review and/or test this patch and
meanwhile the net-next tree closed down.

I ask for maintainer review as both a courtesy and a way to lessen
my workload.  But if that means patches rot for days in patchwork
I'm just going to apply them after my own review.

So I'm applying this now.


* Re: [PATCH] [net-next] davinci_cpdma: don't cast dma_addr_t to pointer
From: David Miller @ 2019-07-12 22:19 UTC (permalink / raw)
  To: arnd
  Cc: ivan.khoronzhuk, grygorii.strashko, andrew, ilias.apalodimas,
	linux-omap, netdev, linux-kernel
In-Reply-To: <20190710080106.24237-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 10 Jul 2019 10:00:33 +0200

> dma_addr_t may be 64-bit wide on 32-bit architectures, so it is not
> valid to cast between it and a pointer:
> 
> drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit_si':
> drivers/net/ethernet/ti/davinci_cpdma.c:1047:12: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
> drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_idle_submit_mapped':
> drivers/net/ethernet/ti/davinci_cpdma.c:1114:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
> drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit_mapped':
> drivers/net/ethernet/ti/davinci_cpdma.c:1164:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
> 
> Solve this by using two separate members in 'struct submit_info'.
> Since this avoids the use of the 'flag' member, the structure does
> not even grow in typical configurations.
> 
> Fixes: 6670acacd59e ("net: ethernet: ti: davinci_cpdma: add dma mapped submit")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied.
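
To illustrate the truncation described in the commit message, here is a
minimal userspace sketch. It is not the driver code: demo_dma_addr_t,
demo_ptr32_t and struct demo_submit_info are hypothetical stand-ins that
model a 32-bit kernel where dma_addr_t is 64 bits wide, and the member
names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: a 64-bit DMA address type on a 32-bit system. */
typedef uint64_t demo_dma_addr_t;

/* Hypothetical stand-in for the value held by a 32-bit pointer. */
typedef uint32_t demo_ptr32_t;

/* The problematic pattern: pushing a DMA address through a pointer-sized slot. */
static demo_dma_addr_t roundtrip_through_pointer(demo_dma_addr_t addr)
{
	demo_ptr32_t as_ptr = (demo_ptr32_t)addr;	/* upper 32 bits are dropped */

	return (demo_dma_addr_t)as_ptr;
}

/* The shape of the fix: one member per kind of value (names are assumed). */
struct demo_submit_info {
	void *data_virt;		/* CPU virtual address, if any */
	demo_dma_addr_t data_dma;	/* DMA address, if already mapped */
};

/* Storing the DMA address in its own member preserves all 64 bits. */
static demo_dma_addr_t roundtrip_through_member(demo_dma_addr_t addr)
{
	struct demo_submit_info si = { .data_virt = NULL, .data_dma = addr };

	return si.data_dma;
}
```

With an address above 4 GiB, the first helper returns only the low 32
bits, while the second returns the address unchanged.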


* [PATCH net-next 1/2] net sched: update skbedit action for batched events operations
From: Roman Mashak @ 2019-07-12 22:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak

Add get_fill_size() routine used to calculate the action size
when building a batch of events.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 net/sched/act_skbedit.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 215a06705cef..dc3c653ec45e 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -306,6 +306,17 @@ static int tcf_skbedit_search(struct net *net, struct tc_action **a, u32 index)
 	return tcf_idr_search(tn, a, index);
 }
 
+static size_t tcf_skbedit_get_fill_size(const struct tc_action *act)
+{
+	return nla_total_size(sizeof(struct tc_skbedit))
+		+ nla_total_size(sizeof(u32)) /* TCA_SKBEDIT_PRIORITY */
+		+ nla_total_size(sizeof(u16)) /* TCA_SKBEDIT_QUEUE_MAPPING */
+		+ nla_total_size(sizeof(u32)) /* TCA_SKBEDIT_MARK */
+		+ nla_total_size(sizeof(u16)) /* TCA_SKBEDIT_PTYPE */
+		+ nla_total_size(sizeof(u32)) /* TCA_SKBEDIT_MASK */
+		+ nla_total_size_64bit(sizeof(u64)); /* TCA_SKBEDIT_FLAGS */
+}
+
 static struct tc_action_ops act_skbedit_ops = {
 	.kind		=	"skbedit",
 	.id		=	TCA_ID_SKBEDIT,
@@ -315,6 +326,7 @@ static struct tc_action_ops act_skbedit_ops = {
 	.init		=	tcf_skbedit_init,
 	.cleanup	=	tcf_skbedit_cleanup,
 	.walk		=	tcf_skbedit_walker,
+	.get_fill_size	=	tcf_skbedit_get_fill_size,
 	.lookup		=	tcf_skbedit_search,
 	.size		=	sizeof(struct tcf_skbedit),
 };
-- 
2.7.4
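
For readers unfamiliar with the netlink sizing helpers, the arithmetic
behind get_fill_size() can be sketched in userspace as below. This is an
illustration under stated assumptions, not the kernel implementation: the
demo_* names are hypothetical, the attribute header is assumed to be 4
bytes with 4-byte alignment, and the extra padding attribute for 64-bit
values is modelled by a flag because it depends on the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed netlink attribute layout: 4-byte header, 4-byte alignment. */
#define DEMO_NLA_HDRLEN 4

/* Round len up to the next multiple of 4. */
static size_t demo_nla_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Header plus payload, padded to the alignment boundary. */
static size_t demo_nla_total_size(size_t payload)
{
	return demo_nla_align(DEMO_NLA_HDRLEN + payload);
}

/* 64-bit attributes may need one extra empty attribute as padding. */
static size_t demo_nla_total_size_64bit(size_t payload, bool needs_pad)
{
	size_t size = demo_nla_total_size(payload);

	if (needs_pad)
		size += demo_nla_total_size(0);
	return size;
}

/* Mirrors the sum in the patch; parm_size stands for sizeof(struct tc_skbedit). */
static size_t demo_skbedit_fill_size(size_t parm_size, bool needs_pad)
{
	return demo_nla_total_size(parm_size)
		+ demo_nla_total_size(4)	/* priority, u32 */
		+ demo_nla_total_size(2)	/* queue mapping, u16 */
		+ demo_nla_total_size(4)	/* mark, u32 */
		+ demo_nla_total_size(2)	/* ptype, u16 */
		+ demo_nla_total_size(4)	/* mask, u32 */
		+ demo_nla_total_size_64bit(8, needs_pad);	/* flags, u64 */
}
```

Note that a u16 attribute costs as much as a u32 one here, because both
round up to 8 bytes once the header and alignment are included.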



* [PATCH net-next 2/2] tc-testing: updated skbedit action tests with batch create/delete
From: Roman Mashak @ 2019-07-12 22:22 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak
In-Reply-To: <1562970120-29517-1-git-send-email-mrv@mojatatu.com>

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 .../tc-testing/tc-tests/actions/skbedit.json       | 47 ++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/skbedit.json b/tools/testing/selftests/tc-testing/tc-tests/actions/skbedit.json
index 45e7e89928a5..797477c1208f 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/skbedit.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/skbedit.json
@@ -553,5 +553,52 @@
         "teardown": [
             "$TC actions flush action skbedit"
         ]
+    },
+    {
+        "id": "630c",
+        "name": "Add batch of 32 skbedit actions with all parameters and cookie",
+        "category": [
+            "actions",
+            "skbedit"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action skbedit",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "bash -c \"for i in \\`seq 1 32\\`; do cmd=\\\"action skbedit queue_mapping 2 priority 10 mark 7/0xaabbccdd ptype host inheritdsfield index \\$i cookie aabbccddeeff112233445566778800a1 \\\"; args=\"\\$args\\$cmd\"; done && $TC actions add \\$args\"",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action skbedit",
+        "matchPattern": "^[ \t]+index [0-9]+ ref",
+        "matchCount": "32",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "706d",
+        "name": "Delete batch of 32 skbedit actions with all parameters",
+        "category": [
+            "actions",
+            "skbedit"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action skbedit",
+                0,
+                1,
+                255
+            ],
+            "bash -c \"for i in \\`seq 1 32\\`; do cmd=\\\"action skbedit queue_mapping 2 priority 10 mark 7/0xaabbccdd ptype host inheritdsfield index \\$i \\\"; args=\\\"\\$args\\$cmd\\\"; done && $TC actions add \\$args\""
+        ],
+        "cmdUnderTest": "bash -c \"for i in \\`seq 1 32\\`; do cmd=\\\"action skbedit index \\$i \\\"; args=\"\\$args\\$cmd\"; done && $TC actions del \\$args\"",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action skbedit",
+        "matchPattern": "^[ \t]+index [0-9]+ ref",
+        "matchCount": "0",
+        "teardown": []
     }
 ]
-- 
2.7.4



* Re: [PATCH net-next] net: sched: Fix NULL-pointer dereference in tc_indr_block_ing_cmd()
From: David Miller @ 2019-07-12 22:25 UTC (permalink / raw)
  To: vladbu; +Cc: netdev, jhs, xiyou.wangcong, jiri, pablo, saeedm
In-Reply-To: <20190710171229.26900-1-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Wed, 10 Jul 2019 20:12:29 +0300

> After recent refactoring of block offloads infrastructure, indr_dev->block
> pointer is dereferenced before it is verified to be non-NULL. Example stack
> trace where this behavior leads to NULL-pointer dereference error when
> creating vxlan dev on system with mlx5 NIC with offloads enabled:
 ...
> Introduce new function tcf_block_non_null_shared() that verifies block
> pointer before dereferencing it to obtain index. Use the function in
> tc_indr_block_ing_cmd() to prevent NULL pointer dereference.
> 
> Fixes: 955bcb6ea0df ("drivers: net: use flow block API")
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Applied.


* Re: [PATCH] net: phy: make exported variables non-static
From: David Miller @ 2019-07-12 22:26 UTC (permalink / raw)
  To: efremov; +Cc: andrew, f.fainelli, hkallweit1, netdev, linux-kernel
In-Reply-To: <20190710180324.8131-1-efremov@linux.com>

From: Denis Efremov <efremov@linux.com>
Date: Wed, 10 Jul 2019 21:03:24 +0300

> The variables phy_basic_ports_array, phy_fibre_port_array and
> phy_all_ports_features_array are declared static and marked
> EXPORT_SYMBOL_GPL(), which is at best an odd combination.
> Because it was decided that the variables are part of the API, this commit
> removes the static attributes and adds the declarations to the header.
> 
> Fixes: 3c1bcc8614db ("net: ethernet: Convert phydev advertize and supported from u32 to link mode")
> Signed-off-by: Denis Efremov <efremov@linux.com>

Applied, thanks.


* Re: [PATCH net-next] net/mlx5e: Provide cb_list pointer when setting up tc block on rep
From: David Miller @ 2019-07-12 22:29 UTC (permalink / raw)
  To: vladbu; +Cc: netdev, jhs, xiyou.wangcong, jiri, pablo, saeedm
In-Reply-To: <20190710182554.2988-1-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Wed, 10 Jul 2019 21:25:54 +0300

> Recent refactoring of tc block offloads infrastructure introduced new
> flow_block_cb_setup_simple() method intended to be used as unified way for
> all drivers to register offload callbacks. However, commit that actually
> extended all users (drivers) with block cb list and provided it to
> flow_block infra missed mlx5 en_rep. This leads to following NULL-pointer
> dereference when creating Qdisc:
 ...
> Extend en_rep with new static mlx5e_rep_block_cb_list list and pass it to
> flow_block_cb_setup_simple() function instead of hardcoded NULL pointer.
> 
> Fixes: 955bcb6ea0df ("drivers: net: use flow block API")
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Applied, thanks.


* Re: [PATCH net-next 0/2] Fix bugs in NFP flower match offload
From: David Miller @ 2019-07-12 22:33 UTC (permalink / raw)
  To: john.hurley; +Cc: netdev, simon.horman, jakub.kicinski, oss-drivers
In-Reply-To: <1562783430-7031-1-git-send-email-john.hurley@netronome.com>

From: John Hurley <john.hurley@netronome.com>
Date: Wed, 10 Jul 2019 19:30:28 +0100

> This patchset contains bug fixes for corner cases in the match fields of
> flower offloads. The patches ensure that flows that should not be
> supported are not (incorrectly) offloaded. These include rules that match
> on layer 3 and/or 4 data without specified ethernet or ip protocol fields.

Series applied.


* Re: [PATCH net-next 1/1] tc-tests: updated skbedit tests
From: David Miller @ 2019-07-12 22:33 UTC (permalink / raw)
  To: mrv; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1562862540-16509-1-git-send-email-mrv@mojatatu.com>

From: Roman Mashak <mrv@mojatatu.com>
Date: Thu, 11 Jul 2019 12:29:00 -0400

> - Added mask upper bound test case
> - Added mask validation test case
> - Added mask replacement case
> 
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>

I'll still allow new tests for now; applied, thanks.


* Re: [PATCH v2] tipc: ensure head->lock is initialised
From: David Miller @ 2019-07-12 22:34 UTC (permalink / raw)
  To: chris.packham
  Cc: jon.maloy, eric.dumazet, ying.xue, linux-kernel, netdev,
	tipc-discussion
In-Reply-To: <20190711224115.21499-1-chris.packham@alliedtelesis.co.nz>

From: Chris Packham <chris.packham@alliedtelesis.co.nz>
Date: Fri, 12 Jul 2019 10:41:15 +1200

> tipc_named_node_up() creates a skb list. It passes the list to
> tipc_node_xmit() which has some code paths that can call
> skb_queue_purge() which relies on the list->lock being initialised.
> 
> The spin_lock is only needed if the messages end up on the receive path
> but when the list is created in tipc_named_node_up() we don't
> necessarily know if it is going to end up there.
> 
> Once all the skb list users are updated in tipc it will then be possible
> to update them to use the unlocked variants of the skb list functions
> and initialise the lock when we know the message will follow the receive
> path.
> 
> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>

Applied.


* Re: [PATCH] [net-next] cxgb4: reduce kernel stack usage in cudbg_collect_mem_region()
From: David Miller @ 2019-07-12 22:36 UTC (permalink / raw)
  To: arnd
  Cc: vishal, rahul.lakkireddy, ganeshgr, alexios.zavras, arjun,
	surendra, netdev, linux-kernel, clang-built-linux
In-Reply-To: <20190712090700.317887-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Fri, 12 Jul 2019 11:06:33 +0200

> The cudbg_collect_mem_region() and cudbg_read_fw_mem() both use several
> hundred bytes of kernel stack space. One gets inlined into the other,
> which causes the stack usage to be combined beyond the warning limit
> when building with clang:
> 
> drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c:1057:12: error: stack frame size of 1244 bytes in function 'cudbg_collect_mem_region' [-Werror,-Wframe-larger-than=]
> 
> Restructuring cudbg_collect_mem_region() lets clang do the same
> optimization that gcc does and reuse the stack slots as it can
> see that the large variables are never used together.
> 
> A better fix might be to avoid using cudbg_meminfo on the stack
> altogether, but that requires a larger rewrite.
> 
> Fixes: a1c69520f785 ("cxgb4: collect MC memory dump")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied.


* Re: [PATCH] net: hisilicon: Use devm_platform_ioremap_resource
From: David Miller @ 2019-07-12 22:37 UTC (permalink / raw)
  To: xiaojiangfeng
  Cc: yisen.zhuang, salil.mehta, netdev, linux-kernel, sergei.shtylyov,
	leeyou.li, nixiaoming
In-Reply-To: <1562937384-121710-1-git-send-email-xiaojiangfeng@huawei.com>

From: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
Date: Fri, 12 Jul 2019 21:16:24 +0800

> Use devm_platform_ioremap_resource instead of
> devm_ioremap_resource. Make the code simpler.
> 
> Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>

Applied.


* Re: [PATCH] net: dsa: qca8k: replace legacy gpio include
From: David Miller @ 2019-07-12 22:38 UTC (permalink / raw)
  To: chunkeey; +Cc: netdev, f.fainelli, vivien.didelot, andrew, lkp
In-Reply-To: <20190712153336.5018-1-chunkeey@gmail.com>

From: Christian Lamparter <chunkeey@gmail.com>
Date: Fri, 12 Jul 2019 17:33:36 +0200

> This patch replaces the legacy bulk gpio.h include
> with the proper gpio/consumer.h variant. This was
> caught by the kbuild test robot that was running
> into an error because of this.
> 
> More information on why linux/gpio.h is bad can be found in:
> commit 56a46b6144e7 ("gpio: Clarify that <linux/gpio.h> is legacy")
> 
> Reported-by: kbuild test robot <lkp@intel.com>
> Link: https://www.spinics.net/lists/netdev/msg584447.html
> Fixes: a653f2f538f9 ("net: dsa: qca8k: introduce reset via gpio feature")
> Signed-off-by: Christian Lamparter <chunkeey@gmail.com>

Applied.


* Re: [PATCH net] net: neigh: fix multiple neigh timer scheduling
From: David Miller @ 2019-07-12 22:40 UTC (permalink / raw)
  To: lorenzo.bianconi; +Cc: netdev, dsahern, marek
In-Reply-To: <7b254317bcb84a33cdbe8eed96e510324d6eb97c.1562951883.git.lorenzo.bianconi@redhat.com>

From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Fri, 12 Jul 2019 19:22:51 +0200

> The neigh timer can be scheduled multiple times from userspace by adding
> multiple neigh entries and forcing neigh timer scheduling by passing
> NTF_USE in the netlink requests.
> This results in a refcount leak and in the following stack dump:
 ...
> Fix the issue by unscheduling neigh_timer if the selected entry is in
> 'IN_TIMER' when receiving a netlink request with the NTF_USE flag set.
> 
> Reported-by: Marek Majkowski <marek@cloudflare.com>
> Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table")
> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>

Applied and queued up for -stable, thanks.


* Re: [PATCH net] net: neigh: fix multiple neigh timer scheduling
From: David Miller @ 2019-07-12 22:42 UTC (permalink / raw)
  To: lorenzo.bianconi; +Cc: netdev, dsahern, marek
In-Reply-To: <20190712.154047.1787144778692165503.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 12 Jul 2019 15:40:47 -0700 (PDT)

> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Fri, 12 Jul 2019 19:22:51 +0200
> 
>> The neigh timer can be scheduled multiple times from userspace by adding
>> multiple neigh entries and forcing neigh timer scheduling by passing
>> NTF_USE in the netlink requests.
>> This results in a refcount leak and in the following stack dump:
>  ...
>> Fix the issue by unscheduling neigh_timer if the selected entry is in
>> 'IN_TIMER' when receiving a netlink request with the NTF_USE flag set.
>> 
>> Reported-by: Marek Majkowski <marek@cloudflare.com>
>> Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table")
>> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> 
> Applied and queued up for -stable, thanks.

Actually, reverted, you didn't test the build thoroughly as Infiniband
fails:

drivers/infiniband/core/addr.c: In function ‘dst_fetch_ha’:
drivers/infiniband/core/addr.c:337:3: error: too few arguments to function ‘neigh_event_send’
   neigh_event_send(n, NULL);
   ^~~~~~~~~~~~~~~~


* Re: [PATCH] be2net: fix adapter->big_page_size miscaculation
From: David Miller @ 2019-07-12 22:46 UTC (permalink / raw)
  To: cai
  Cc: sathya.perla, ajit.khaparde, sriharsha.basavapatna, somnath.kotur,
	arnd, dhowells, hpa, netdev, linux-arch, linux-kernel
In-Reply-To: <1562959401-19815-1-git-send-email-cai@lca.pw>

From: Qian Cai <cai@lca.pw>
Date: Fri, 12 Jul 2019 15:23:21 -0400

> The commit d66acc39c7ce ("bitops: Optimise get_order()") introduced a
> problem for the be2net driver as "rx_frag_size" could be a module
> parameter that can be changed while loading the module.

Why is this a problem?

> That commit checks __builtin_constant_p() first in get_order(), which
> causes "adapter->big_page_size" to be assigned a value based on the
> default "rx_frag_size" value at compilation time. It also
> generates a compilation warning,

rx_frag_size is not a constant, therefore the __builtin_constant_p()
test should not pass.

This explanation doesn't seem valid.
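
For reference, a minimal sketch of how __builtin_constant_p() treats a
literal versus a value read at run time, assuming GCC or Clang. The
variable demo_rx_frag_size is a hypothetical stand-in; it is declared
volatile purely so that the result does not depend on the optimization
level, which is an assumption of this example and not how the driver
declares its parameter.

```c
#include <assert.h>

/* Hypothetical stand-in for a value that can change when the module loads. */
static volatile unsigned short demo_rx_frag_size = 2048;

/* An integer literal is a compile-time constant, so the builtin yields 1. */
static int demo_literal_is_constant(void)
{
	return __builtin_constant_p(2048);
}

/* A volatile read cannot be folded at compile time, so the builtin yields 0. */
static int demo_variable_is_constant(void)
{
	return __builtin_constant_p(demo_rx_frag_size);
}
```

A macro that branches on the builtin would therefore take its run-time
path for such a variable and its compile-time path only for the literal.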


* Re: [PATCH net-next] net: openvswitch: do not update max_headroom if new headroom is equal to old headroom
From: Gregory Rose @ 2019-07-12 23:11 UTC (permalink / raw)
  To: David Miller, ap420073; +Cc: pshelar, netdev, dev
In-Reply-To: <20190712.151846.1093841226730573129.davem@davemloft.net>



On 7/12/2019 3:18 PM, David Miller wrote:
> From: Taehee Yoo <ap420073@gmail.com>
> Date: Sat,  6 Jul 2019 01:08:09 +0900
>
>> When a vport is deleted, the maximum headroom size would be changed.
>> If the vport which has the largest headroom is deleted,
>> the new max_headroom would be set.
>> But, if the new headroom size is equal to the old headroom size,
>> updating routine is unnecessary.
>>
>> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> I don't think Taehee should be punished because it took several days
> to get someone to look at and review and/or test this patch and
> meanwhile the net-next tree closed down.
>
> I ask for maintainer review as both a courtesy and a way to lessen
> my workload.  But if that means patches rot for days in patchwork
> I'm just going to apply them after my own review.
>
> So I'm applying this now.
>
My apologies Dave.  I did test and review the patch, perhaps you didn't 
see it.  In any case, you're right, Taehee was owed a more timely review 
and I missed it.

Thanks for applying the patch.

- Greg


* Re: [PATCH v1 1/6] rcu: Add support for consolidated-RCU reader checking
From: Paul E. McKenney @ 2019-07-12 23:27 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, linux-kernel, Alexey Kuznetsov, Bjorn Helgaas,
	Borislav Petkov, c0d1n61at3, David S. Miller, edumazet,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Josh Triplett, keescook, kernel-hardening,
	Lai Jiangshan, Len Brown, linux-acpi, linux-pci, linux-pm,
	Mathieu Desnoyers, neilb, netdev, oleg, Pavel Machek,
	Rafael J. Wysocki, Rasmus Villemoes, rcu, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, will,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190712194040.GA150253@google.com>

On Fri, Jul 12, 2019 at 03:40:40PM -0400, Joel Fernandes wrote:
> On Fri, Jul 12, 2019 at 10:46:30AM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 12, 2019 at 01:06:31PM -0400, Joel Fernandes wrote:
> > > On Fri, Jul 12, 2019 at 09:45:31AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Jul 12, 2019 at 11:10:51AM -0400, Joel Fernandes wrote:
> > > > > On Fri, Jul 12, 2019 at 01:11:25PM +0200, Peter Zijlstra wrote:
> > > > > > On Thu, Jul 11, 2019 at 07:43:56PM -0400, Joel Fernandes (Google) wrote:
> > > > > > > +int rcu_read_lock_any_held(void)
> > > > > > > +{
> > > > > > > +	int lockdep_opinion = 0;
> > > > > > > +
> > > > > > > +	if (!debug_lockdep_rcu_enabled())
> > > > > > > +		return 1;
> > > > > > > +	if (!rcu_is_watching())
> > > > > > > +		return 0;
> > > > > > > +	if (!rcu_lockdep_current_cpu_online())
> > > > > > > +		return 0;
> > > > > > > +
> > > > > > > +	/* Preemptible RCU flavor */
> > > > > > > +	if (lock_is_held(&rcu_lock_map))
> > > > > > 
> > > > > > you forgot debug_locks here.
> > > > > 
> > > > > Actually, it turns out debug_locks checking is not even needed. If
> > > > > debug_locks == 0, then debug_lockdep_rcu_enabled() returns 0 and we would not
> > > > > get to this point.
> > > > > 
> > > > > > > +		return 1;
> > > > > > > +
> > > > > > > +	/* BH flavor */
> > > > > > > +	if (in_softirq() || irqs_disabled())
> > > > > > 
> > > > > > I'm not sure I'd put irqs_disabled() under BH, also this entire
> > > > > > condition is superfluous, see below.
> > > > > > 
> > > > > > > +		return 1;
> > > > > > > +
> > > > > > > +	/* Sched flavor */
> > > > > > > +	if (debug_locks)
> > > > > > > +		lockdep_opinion = lock_is_held(&rcu_sched_lock_map);
> > > > > > > +	return lockdep_opinion || !preemptible();
> > > > > > 
> > > > > > that !preemptible() turns into:
> > > > > > 
> > > > > >   !(preempt_count()==0 && !irqs_disabled())
> > > > > > 
> > > > > > which is:
> > > > > > 
> > > > > >   preempt_count() != 0 || irqs_disabled()
> > > > > > 
> > > > > > and already includes irqs_disabled() and in_softirq().
> > > > > > 
> > > > > > > +}
> > > > > > 
> > > > > > > So maybe something like:
> > > > > > 
> > > > > > 	if (debug_locks && (lock_is_held(&rcu_lock_map) ||
> > > > > > 			    lock_is_held(&rcu_sched_lock_map)))
> > > > > > 		return true;
> > > > > 
> > > > > Agreed, I will do it this way (without the debug_locks) like:
> > > > > 
> > > > > ---8<-----------------------
> > > > > 
> > > > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > > > index ba861d1716d3..339aebc330db 100644
> > > > > --- a/kernel/rcu/update.c
> > > > > +++ b/kernel/rcu/update.c
> > > > > @@ -296,27 +296,15 @@ EXPORT_SYMBOL_GPL(rcu_read_lock_bh_held);
> > > > >  
> > > > >  int rcu_read_lock_any_held(void)
> > > > >  {
> > > > > -	int lockdep_opinion = 0;
> > > > > -
> > > > >  	if (!debug_lockdep_rcu_enabled())
> > > > >  		return 1;
> > > > >  	if (!rcu_is_watching())
> > > > >  		return 0;
> > > > >  	if (!rcu_lockdep_current_cpu_online())
> > > > >  		return 0;
> > > > > -
> > > > > -	/* Preemptible RCU flavor */
> > > > > -	if (lock_is_held(&rcu_lock_map))
> > > > > -		return 1;
> > > > > -
> > > > > -	/* BH flavor */
> > > > > -	if (in_softirq() || irqs_disabled())
> > > > > -		return 1;
> > > > > -
> > > > > -	/* Sched flavor */
> > > > > -	if (debug_locks)
> > > > > -		lockdep_opinion = lock_is_held(&rcu_sched_lock_map);
> > > > > -	return lockdep_opinion || !preemptible();
> > > > > +	if (lock_is_held(&rcu_lock_map) || lock_is_held(&rcu_sched_lock_map))
> > > > 
> > > > OK, I will bite...  Why not also lock_is_held(&rcu_bh_lock_map)?
> > > 
> > > Hmm, I was borrowing the strategy from rcu_read_lock_bh_held() which does not
> > > check for a lock held in this map.
> > > 
> > > Honestly, even  lock_is_held(&rcu_sched_lock_map) seems unnecessary per-se
> > > since !preemptible() will catch that? rcu_read_lock_sched() disables
> > > preemption already, so lockdep's opinion of the matter seems redundant there.
> > 
> > Good point!  At least as long as the lockdep splats list RCU-bh among
> > the locks held, which they did last I checked.
> > 
> > Of course, you could make the same argument for getting rid of
> > rcu_sched_lock_map.  Does it make sense to have the one without
> > the other?
> 
> It probably makes it inconsistent at the least. I will add the check for
> the rcu_bh_lock_map in a separate patch, if that's Ok with you - since I also
> want to update the rcu_read_lock_bh_held() logic in the same patch.
> 
> That rcu_read_lock_bh_held() could also just return !preemptible as Peter
> suggested for the bh case.

Although that seems reasonable, please check the call sites.

> > > Sorry I already sent out patches again before seeing your comment but I can
> > > rework and resend them based on any other suggestions.
> > 
> > Not a problem!
> 
> Thanks. Depending on whether there is any other feedback, I will work on the
> bh_ stuff as a separate patch on top of this series, or work it into the next
> series revision if I'm reposting. Hopefully that sounds Ok to you.

Agreed -- let's separate concerns.  And promote bisectability.

							Thanx, Paul
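
The equivalence quoted above, that !preemptible() already covers
in_softirq() and irqs_disabled(), can be checked with a small userspace
model. This is a sketch only: the demo_* functions are hypothetical, they
take the preempt count and the IRQ state as plain arguments, and the value
0x100 is assumed to represent one level of softirq nesting in the count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of preemptible(): zero preempt count and interrupts enabled. */
static bool demo_preemptible(unsigned int count, bool irqs_off)
{
	return count == 0 && !irqs_off;
}

/* The expanded form of !preemptible() after applying De Morgan's law. */
static bool demo_not_preemptible(unsigned int count, bool irqs_off)
{
	return count != 0 || irqs_off;
}

/* Returns true if both forms agree for every sampled input. */
static bool demo_forms_agree(void)
{
	/* 0x100 is assumed to model a softirq contribution to the count. */
	const unsigned int counts[] = { 0, 1, 0x100, 0x101 };
	const bool irq_states[] = { false, true };

	for (size_t i = 0; i < sizeof(counts) / sizeof(counts[0]); i++)
		for (size_t j = 0; j < 2; j++)
			if (!demo_preemptible(counts[i], irq_states[j]) !=
			    demo_not_preemptible(counts[i], irq_states[j]))
				return false;
	return true;
}
```

Under that assumption, any softirq context gives a nonzero count, so it
is reported as not preemptible without a separate in_softirq() test.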


* Re: [PATCH v2 3/9] rcu/sync: Remove custom check for reader-section
From: Paul E. McKenney @ 2019-07-12 23:32 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Oleg Nesterov, Alexey Kuznetsov, Bjorn Helgaas,
	Borislav Petkov, c0d1n61at3, David S. Miller, edumazet,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Josh Triplett, keescook,
	kernel-hardening, kernel-team, Lai Jiangshan, Len Brown,
	linux-acpi, linux-doc, linux-pci, linux-pm, Mathieu Desnoyers,
	neilb, netdev, Pavel Machek, peterz, Rafael J. Wysocki,
	Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	will, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190712213559.GA175138@google.com>

On Fri, Jul 12, 2019 at 05:35:59PM -0400, Joel Fernandes wrote:
> On Fri, Jul 12, 2019 at 01:00:18PM -0400, Joel Fernandes (Google) wrote:
> > The rcu/sync code was doing its own check whether we are in a reader
> > section. With RCU consolidating flavors and the generic helper added in
> > this series, this is no longer needed. We can just use the generic helper
> > and it results in a nice cleanup.
> > 
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Hi Oleg,
> Slightly unrelated to the patch,
> I tried hard to understand this comment below in percpu_down_read() but no dice.
> 
> I do understand how rcu sync and percpu rwsem work, however the comment
> below didn't make much sense to me. For one, there's no readers_fast anymore
> so I did not follow what readers_fast means. Could the comment be updated to
> reflect the latest changes?
> Also, could you help me understand how a writer is not able to change
> sem->state and count the per-cpu read counters at the same time, as the
> comment tries to say?
> 
> 	/*
> 	 * We are in an RCU-sched read-side critical section, so the writer
> 	 * cannot both change sem->state from readers_fast and start checking
> 	 * counters while we are here. So if we see !sem->state, we know that
> 	 * the writer won't be checking until we're past the preempt_enable()
> 	 * and that once the synchronize_rcu() is done, the writer will see
> 	 * anything we did within this RCU-sched read-size critical section.
> 	 */
> 
> Also,
> I guess we could get rid of all of the gp_ops struct stuff now that since all
> the callbacks are the same now. I will post that as a follow-up patch to this
> series.

Hello, Joel,

Oleg has a set of patches updating this code that just hit mainline
this week.  These patches get rid of the code that previously handled
RCU's multiple flavors.  Or are you looking at current mainline, and
am I just missing your point?

							Thanx, Paul

> thanks!
> 
>  - Joel
> 
> 
> > ---
> > Please note: Only build and boot tested this particular patch so far.
> > 
> >  include/linux/rcu_sync.h |  5 ++---
> >  kernel/rcu/sync.c        | 22 ----------------------
> >  2 files changed, 2 insertions(+), 25 deletions(-)
> > 
> > diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
> > index 6fc53a1345b3..c954f1efc919 100644
> > --- a/include/linux/rcu_sync.h
> > +++ b/include/linux/rcu_sync.h
> > @@ -39,9 +39,8 @@ extern void rcu_sync_lockdep_assert(struct rcu_sync *);
> >   */
> >  static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
> >  {
> > -#ifdef CONFIG_PROVE_RCU
> > -	rcu_sync_lockdep_assert(rsp);
> > -#endif
> > +	RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(),
> > +			 "suspicious rcu_sync_is_idle() usage");
> >  	return !rsp->gp_state; /* GP_IDLE */
> >  }
> >  
> > diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
> > index a8304d90573f..535e02601f56 100644
> > --- a/kernel/rcu/sync.c
> > +++ b/kernel/rcu/sync.c
> > @@ -10,37 +10,25 @@
> >  #include <linux/rcu_sync.h>
> >  #include <linux/sched.h>
> >  
> > -#ifdef CONFIG_PROVE_RCU
> > -#define __INIT_HELD(func)	.held = func,
> > -#else
> > -#define __INIT_HELD(func)
> > -#endif
> > -
> >  static const struct {
> >  	void (*sync)(void);
> >  	void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> >  	void (*wait)(void);
> > -#ifdef CONFIG_PROVE_RCU
> > -	int  (*held)(void);
> > -#endif
> >  } gp_ops[] = {
> >  	[RCU_SYNC] = {
> >  		.sync = synchronize_rcu,
> >  		.call = call_rcu,
> >  		.wait = rcu_barrier,
> > -		__INIT_HELD(rcu_read_lock_held)
> >  	},
> >  	[RCU_SCHED_SYNC] = {
> >  		.sync = synchronize_rcu,
> >  		.call = call_rcu,
> >  		.wait = rcu_barrier,
> > -		__INIT_HELD(rcu_read_lock_sched_held)
> >  	},
> >  	[RCU_BH_SYNC] = {
> >  		.sync = synchronize_rcu,
> >  		.call = call_rcu,
> >  		.wait = rcu_barrier,
> > -		__INIT_HELD(rcu_read_lock_bh_held)
> >  	},
> >  };
> >  
> > @@ -49,16 +37,6 @@ enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> >  
> >  #define	rss_lock	gp_wait.lock
> >  
> > -#ifdef CONFIG_PROVE_RCU
> > -void rcu_sync_lockdep_assert(struct rcu_sync *rsp)
> > -{
> > -	RCU_LOCKDEP_WARN(!gp_ops[rsp->gp_type].held(),
> > -			 "suspicious rcu_sync_is_idle() usage");
> > -}
> > -
> > -EXPORT_SYMBOL_GPL(rcu_sync_lockdep_assert);
> > -#endif
> > -
> >  /**
> >   * rcu_sync_init() - Initialize an rcu_sync structure
> >   * @rsp: Pointer to rcu_sync structure to be initialized
> > -- 
> > 2.22.0.510.g264f2c817a-goog
> > 
> 


* [PATCH net] ppp: mppe: Revert "ppp: mppe: Add softdep to arc4"
From: Eric Biggers @ 2019-07-12 23:39 UTC (permalink / raw)
  To: netdev, linux-ppp, David S . Miller, Paul Mackerras
  Cc: linux-crypto, Takashi Iwai, Ard Biesheuvel

From: Eric Biggers <ebiggers@google.com>

Commit 0e5a610b5ca5 ("ppp: mppe: switch to RC4 library interface"),
which was merged through the crypto tree for v5.3, changed ppp_mppe.c to
use the new arc4_crypt() library function rather than access RC4 through
the dynamic crypto_skcipher API.

Meanwhile commit aad1dcc4f011 ("ppp: mppe: Add softdep to arc4") was
merged through the net tree and added a module soft-dependency on "arc4".

The latter commit no longer makes sense because the code now uses the
"libarc4" module rather than "arc4", and also due to the direct use of
arc4_crypt(), no module soft-dependency is required.

So revert the latter commit.

Cc: Takashi Iwai <tiwai@suse.de>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 drivers/net/ppp/ppp_mppe.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ppp/ppp_mppe.c b/drivers/net/ppp/ppp_mppe.c
index bd3c80b0bc77d..de3b57d09d0cb 100644
--- a/drivers/net/ppp/ppp_mppe.c
+++ b/drivers/net/ppp/ppp_mppe.c
@@ -64,7 +64,6 @@ MODULE_AUTHOR("Frank Cusack <fcusack@fcusack.com>");
 MODULE_DESCRIPTION("Point-to-Point Protocol Microsoft Point-to-Point Encryption support");
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_ALIAS("ppp-compress-" __stringify(CI_MPPE));
-MODULE_SOFTDEP("pre: arc4");
 MODULE_VERSION("1.0.2");

 #define SHA1_PAD_SIZE 40
-- 
2.22.0


* Re: [PATCH] [net-next] cxgb4: reduce kernel stack usage in cudbg_collect_mem_region()
From: Joe Perches @ 2019-07-13  0:14 UTC (permalink / raw)
  To: David Miller, arnd
  Cc: vishal, rahul.lakkireddy, ganeshgr, alexios.zavras, arjun,
	surendra, netdev, linux-kernel, clang-built-linux
In-Reply-To: <20190712.153632.1007215196498198399.davem@davemloft.net>

On Fri, 2019-07-12 at 15:36 -0700, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Fri, 12 Jul 2019 11:06:33 +0200
> 
> > The cudbg_collect_mem_region() and cudbg_read_fw_mem() both use several
> > hundred kilobytes of kernel stack space.

Several hundred 'kilo' bytes?
I hope not.



* Re: [PATCH] be2net: fix adapter->big_page_size miscaculation
From: Qian Cai @ 2019-07-13  0:27 UTC (permalink / raw)
  To: David Miller
  Cc: sathya.perla, ajit.khaparde, sriharsha.basavapatna, somnath.kotur,
	arnd, dhowells, hpa, netdev, linux-arch, linux-kernel
In-Reply-To: <20190712.154606.493382088615011132.davem@davemloft.net>



> On Jul 12, 2019, at 6:46 PM, David Miller <davem@davemloft.net> wrote:
> 
> From: Qian Cai <cai@lca.pw>
> Date: Fri, 12 Jul 2019 15:23:21 -0400
> 
>> The commit d66acc39c7ce ("bitops: Optimise get_order()") introduced a
>> problem for the be2net driver as "rx_frag_size" could be a module
>> parameter that can be changed while loading the module.
> 
> Why is this a problem?

Well, for example, if rx_frag_size is set to 8096 when loading the module, the kernel has already used the default value of 2048 at compile time.

> 
>> That commit checks __builtin_constant_p() first in get_order() which
>> cause "adapter->big_page_size" to be assigned a value based on the
>> the default "rx_frag_size" value at the compilation time. It also
>> generate a compilation warning,
> 
> rx_frag_size is not a constant, therefore the __builtin_constant_p()
> test should not pass.
> 
> This explanation doesn't seem valid.

Actually, GCC would consider it a constant at the -O2 optimization level because it finds that the variable is never modified, and it does not understand that it is a module parameter. Consider the following code.

# cat const.c 
#include <stdio.h>

static int a = 1;

int main(void)
{
	if (__builtin_constant_p(a))
		printf("a is a const.\n");

	return 0;
}

# gcc -O2 const.c -o const

# ./const 
a is a const.
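
To illustrate the consequence, here is a minimal userspace sketch (not the kernel code). It assumes 4 KiB pages and that the driver derives big_page_size as (1 << get_order(rx_frag_size)) * PAGE_SIZE. The names get_order_sketch() and big_page_size_for() are made up for this example, and 8192 is only an example of a non-default value.

```c
/* Userspace sketch only; assumes a page shift of 12 (4 KiB pages). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/*
 * Hypothetical stand-in for get_order(): returns the smallest order
 * such that (SKETCH_PAGE_SIZE << order) >= size, for size >= 1.
 */
static int get_order_sketch(unsigned long size)
{
	int order = 0;

	size = (size - 1) >> SKETCH_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* Hypothetical helper mirroring the assumed driver computation. */
static unsigned long big_page_size_for(unsigned long rx_frag_size)
{
	return (1UL << get_order_sketch(rx_frag_size)) * SKETCH_PAGE_SIZE;
}
```

With the default of 2048, big_page_size_for() gives 4096; with 8192 it gives 8192. If the compiler folds in the default, the result stays at 4096 no matter what was passed at load time, which is smaller than a single 8192-byte fragment.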


* Re: [GIT PULL] 9p updates for 5.3
From: pr-tracker-bot @ 2019-07-13  0:40 UTC (permalink / raw)
  To: Dominique Martinet; +Cc: Linus Torvalds, v9fs-developer, linux-kernel, netdev
In-Reply-To: <20190712080446.GA19400@nautica>

The pull request you sent on Fri, 12 Jul 2019 10:04:46 +0200:

> git://github.com/martinetd/linux tags/9p-for-5.3

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/23bbbf5c1fb3ddf104c2ddbda4cc24ebe53a3453

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Neil Horman @ 2019-07-13  0:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Ido Schimmel, David Miller, netdev, jiri, mlxsw, dsahern, roopa,
	nikolay, andy, pablo, jakub.kicinski, pieter.jansenvanvuuren,
	andrew, f.fainelli, vivien.didelot, idosch
In-Reply-To: <871ryvv3dy.fsf@toke.dk>

On Fri, Jul 12, 2019 at 02:33:29PM +0200, Toke Høiland-Jørgensen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > On Fri, Jul 12, 2019 at 11:27:55AM +0200, Toke Høiland-Jørgensen wrote:
> >> Neil Horman <nhorman@tuxdriver.com> writes:
> >> 
> >> > On Thu, Jul 11, 2019 at 03:39:09PM +0300, Ido Schimmel wrote:
> >> >> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> >> >> > From: Ido Schimmel <idosch@idosch.org>
> >> >> > Date: Sun,  7 Jul 2019 10:58:17 +0300
> >> >> > 
> >> >> > > Users have several ways to debug the kernel and understand why a packet
> >> >> > > was dropped. For example, using "drop monitor" and "perf". Both
> >> >> > > utilities trace kfree_skb(), which is the function called when a packet
> >> >> > > is freed as part of a failure. The information provided by these tools
> >> >> > > is invaluable when trying to understand the cause of a packet loss.
> >> >> > > 
> >> >> > > In recent years, large portions of the kernel data path were offloaded
> >> >> > > to capable devices. Today, it is possible to perform L2 and L3
> >> >> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> >> >> > > Different TC classifiers and actions are also offloaded to capable
> >> >> > > devices, at both ingress and egress.
> >> >> > > 
> >> >> > > However, when the data path is offloaded it is not possible to achieve
> >> >> > > the same level of introspection as tools such "perf" and "drop monitor"
> >> >> > > become irrelevant.
> >> >> > > 
> >> >> > > This patchset aims to solve this by allowing users to monitor packets
> >> >> > > that the underlying device decided to drop along with relevant metadata
> >> >> > > such as the drop reason and ingress port.
> >> >> > 
> >> >> > We are now going to have 5 or so ways to capture packets passing through
> >> >> > the system, this is nonsense.
> >> >> > 
> >> >> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> >> >> > devlink thing.
> >> >> > 
> >> >> > This is insanity, too many ways to do the same thing and therefore the
> >> >> > worst possible user experience.
> >> >> > 
> >> >> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> >> >> > XDP perf events, and these taps there too.
> >> >> > 
> >> >> > I mean really, think about it from the average user's perspective.  To
> >> >> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> >> >> > listen on devlink but configure a special tap thing beforehand and then
> >> >> > if someone is using XDP I gotta setup another perf event buffer capture
> >> >> > thing too.
> >> >> 
> >> >> Dave,
> >> >> 
> >> >> Before I start working on v2, I would like to get your feedback on the
> >> >> high level plan. Also adding Neil who is the maintainer of drop_monitor
> >> >> (and counterpart DropWatch tool [1]).
> >> >> 
> >> >> IIUC, the problem you point out is that users need to use different
> >> >> tools to monitor packet drops based on where these drops occur
> >> >> (SW/HW/XDP).
> >> >> 
> >> >> Therefore, my plan is to extend the existing drop_monitor netlink
> >> >> channel to also cover HW drops. I will add a new message type and a new
> >> >> multicast group for HW drops and encode in the message what is currently
> >> >> encoded in the devlink events.
> >> >> 
> >> > A few things here:
> >> > IIRC we don't announce individual hardware drops, drivers record them in
> >> > internal structures, and they are retrieved on demand via ethtool calls, so you
> >> > will either need to include some polling (probably not a very performant idea),
> >> > or some sort of flagging mechanism to indicate that on the next message sent to
> >> > user space you should go retrieve hw stats from a given interface.  I certainly
> >> > wouldn't mind seeing this happen, but its more work than just adding a new
> >> > netlink message.
> >> >
> >> > Also, regarding XDP drops, we wont see them if the xdp program is offloaded to
> >> > hardware (you'll need your hw drop gathering mechanism for that), but for xdp
> >> > programs run on the cpu, dropwatch should alrady catch those.  I.e. if the xdp
> >> > program returns a DROP result for a packet being processed, the OS will call
> >> > kfree_skb on its behalf, and dropwatch wil call that.
> >> 
> >> There is no skb by the time an XDP program runs, so this is not true. As
> >> I mentioned upthread, there's a tracepoint that will get called if an
> >> error occurs (or the program returns XDP_ABORTED), but in most cases,
> >> XDP_DROP just means that the packet silently disappears...
> >> 
> > As I noted, thats only true for xdp programs that are offloaded to hardware, I
> > was only speaking for XDP programs that run on the cpu.  For the former case, we
> > obviously need some other mechanism to detect drops, but for cpu executed xdp
> > programs, the OS is responsible for freeing skbs associated with programs the
> > return XDP_DROP.
> 
> Ah, I think maybe you're thinking of generic XDP (also referred to as
> skb mode)? That is a separate mode; an XDP program loaded in "native
Yes, was I not clear about that?
Neil

> mode" (or "driver mode") runs on the CPU, but before the skb is created;
> this is the common case for XDP, and there is no skb and thus no drop
> notification in this mode.
> 
> There is *also* an offload mode for XDP programs, but that is only
> supported by netronome cards thus far, so not as commonly used...
> 
> -Toke
> 

