Netdev List


* [PATCH net-next v2 2/9] net: switchdev: Add PORT_PRE_BRIDGE_FLAGS
From: Florian Fainelli @ 2019-02-15 22:53 UTC
  To: netdev
  Cc: Florian Fainelli, David S. Miller, Ido Schimmel, open list,
	open list:STAGING SUBSYSTEM, moderated list:ETHERNET BRIDGE, jiri,
	andrew, vivien.didelot
In-Reply-To: <20190215225313.32303-1-f.fainelli@gmail.com>

In preparation for removing switchdev_port_attr_get(), introduce
PORT_PRE_BRIDGE_FLAGS which will be called through
switchdev_port_attr_set(), in the caller's context (possibly atomic) and
which must be checked by the switchdev driver in order to return whether
the operation is supported or not.

This is entirely analogous to how BRIDGE_FLAGS_SUPPORT works, except
that it goes through a set() instead of a get().
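
For illustration only (not part of this patch): a driver-side check could
look roughly like the sketch below. Every ex_/EX_ name is a hypothetical
stand-in rather than the kernel API, and the supported-flag set is an
assumption. The PRE_BRIDGE_FLAGS case only tests a mask, so it is safe in
atomic context.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's BR_* bridge port flag bits. */
#define EX_BR_LEARNING		(1UL << 0)
#define EX_BR_FLOOD		(1UL << 1)
#define EX_BR_MCAST_FLOOD	(1UL << 2)

/* Hypothetical subset of the switchdev attribute IDs. */
enum ex_attr_id {
	EX_ATTR_ID_PORT_PRE_BRIDGE_FLAGS,
	EX_ATTR_ID_PORT_BRIDGE_FLAGS,
	EX_ATTR_ID_PORT_MROUTER,
};

struct ex_attr {
	enum ex_attr_id id;
	unsigned long brport_flags;
};

struct ex_port {
	unsigned long brport_flags;
};

/* Assumption: this imaginary hardware offloads learning and flooding only. */
static const unsigned long ex_supported = EX_BR_LEARNING | EX_BR_FLOOD;

int ex_port_attr_set(struct ex_port *port, const struct ex_attr *attr)
{
	assert(port && attr);

	switch (attr->id) {
	case EX_ATTR_ID_PORT_PRE_BRIDGE_FLAGS:
		/* Pure mask check: no sleeping, no hardware access. */
		if (attr->brport_flags & ~ex_supported)
			return -EINVAL;
		return 0;
	case EX_ATTR_ID_PORT_BRIDGE_FLAGS:
		/* The actual programming step (deferred in the real flow). */
		port->brport_flags = attr->brport_flags;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

A caller would first issue the PRE_BRIDGE_FLAGS attribute and only send
BRIDGE_FLAGS if that returned 0.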

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/net/switchdev.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 5e87b54c5dc5..de72b0a3867f 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -46,6 +46,7 @@ enum switchdev_attr_id {
 	SWITCHDEV_ATTR_ID_PORT_STP_STATE,
 	SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
 	SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT,
+	SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS,
 	SWITCHDEV_ATTR_ID_PORT_MROUTER,
 	SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME,
 	SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
@@ -61,7 +62,7 @@ struct switchdev_attr {
 	void (*complete)(struct net_device *dev, int err, void *priv);
 	union {
 		u8 stp_state;				/* PORT_STP_STATE */
-		unsigned long brport_flags;		/* PORT_BRIDGE_FLAGS */
+		unsigned long brport_flags;		/* PORT_{PRE}_BRIDGE_FLAGS */
 		unsigned long brport_flags_support;	/* PORT_BRIDGE_FLAGS_SUPPORT */
 		bool mrouter;				/* PORT_MROUTER */
 		clock_t ageing_time;			/* BRIDGE_AGEING_TIME */
-- 
2.17.1



* [PATCH net-next v2 0/9] net: Get rid of switchdev_port_attr_get()
From: Florian Fainelli @ 2019-02-15 22:53 UTC
  To: netdev
  Cc: Florian Fainelli, David S. Miller, Ido Schimmel, open list,
	open list:STAGING SUBSYSTEM, moderated list:ETHERNET BRIDGE, jiri,
	andrew, vivien.didelot

Hi all,

This patch series splits the removal of the switchdev_ops that was
proposed a few times before and first tackles the easy part which is the
removal of the single call to switchdev_port_attr_get() within the
bridge code.

As suggested by Ido, this patch series adds
SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS, which is handled in the same
context as the caller of switchdev_port_attr_set() (so not deferred).
The operation itself is then carried out in deferred context by setting
the supported bridge port flags.
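
As a rough illustration of the caller side (a sketch under assumptions,
not the code in this series): the check goes out first with the mask, and
the flags are only programmed if the check passes. All ex_ names, the
sample driver hook and the ex_hw_flags variable are hypothetical. A NULL
hook models a port with no switchdev support at all, which is treated
differently from a driver that rejects specific flags.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum ex_attr_id { EX_PRE_BRIDGE_FLAGS, EX_BRIDGE_FLAGS };

struct ex_attr {
	enum ex_attr_id id;
	unsigned long brport_flags;
};

/* Hypothetical driver hook; NULL models "no switchdev support". */
typedef int (*ex_attr_set_fn)(const struct ex_attr *attr);

/* Stands in for state programmed into hardware by the sample driver. */
unsigned long ex_hw_flags;

/* Sample driver (assumption): only bit 0 can be offloaded. */
int ex_drv_learning_only(const struct ex_attr *attr)
{
	assert(attr);
	if (attr->id == EX_PRE_BRIDGE_FLAGS)
		return (attr->brport_flags & ~1UL) ? -EINVAL : 0;
	ex_hw_flags = attr->brport_flags;
	return 0;
}

static int ex_attr_set(ex_attr_set_fn fn, const struct ex_attr *attr)
{
	return fn ? fn(attr) : -EOPNOTSUPP;
}

int ex_set_port_flags(ex_attr_set_fn fn, unsigned long flags,
		      unsigned long mask)
{
	struct ex_attr attr = { .id = EX_PRE_BRIDGE_FLAGS,
				.brport_flags = mask };
	int err;

	/* Step 1: ask, in the caller's context, whether the mask is OK. */
	err = ex_attr_set(fn, &attr);
	if (err == -EOPNOTSUPP)
		return 0;	/* no offload at all: nothing to veto */
	if (err)
		return err;	/* driver cannot support these flags */

	/* Step 2: program the flags (deferred in the real flow). */
	attr.id = EX_BRIDGE_FLAGS;
	attr.brport_flags = flags;
	return ex_attr_set(fn, &attr);
}
```

The sketch runs both steps synchronously; the deferral of the second step
is not modeled here.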

Follow-up patches will do the switchdev_ops removal after introducing
the proper helpers for the switchdev blocking notifier to work across
stacked devices (unlike the previous submissions).

Changes in v2:

- differentiate callers not supporting switchdev_port_attr_set() from
  the driver not being able to support specific bridge flags

- pass "mask" instead of "flags" for the PRE_BRIDGE_FLAGS check

- skip prepare phase for PRE_BRIDGE_FLAGS

- corrected documentation a bit more

- tested bridge_vlan_aware.sh with veth/VRF

Florian Fainelli (9):
  Documentation: networking: switchdev: Update port parent ID section
  net: switchdev: Add PORT_PRE_BRIDGE_FLAGS
  mlxsw: spectrum: Handle PORT_PRE_BRIDGE_FLAGS
  staging: fsl-dpaa2: ethsw: Handle PORT_PRE_BRIDGE_FLAGS
  net: dsa: Add setter for SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS
  rocker: Check Handle PORT_PRE_BRIDGE_FLAGS
  net: bridge: Stop calling switchdev_port_attr_get()
  net: Remove SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT
  net: Get rid of switchdev_port_attr_get()

 Documentation/networking/switchdev.txt        | 16 ++--
 .../mellanox/mlxsw/spectrum_switchdev.c       | 38 +++++-----
 drivers/net/ethernet/rocker/rocker_main.c     | 75 +++++++++++--------
 drivers/staging/fsl-dpaa2/ethsw/ethsw.c       | 32 ++++----
 include/net/switchdev.h                       | 13 +---
 net/bridge/br_switchdev.c                     | 11 ++-
 net/dsa/dsa_priv.h                            |  6 ++
 net/dsa/port.c                                | 17 +++++
 net/dsa/slave.c                               | 24 +++---
 9 files changed, 128 insertions(+), 104 deletions(-)

-- 
2.17.1



* Re: [PATCH bpf-next v2] bpf: make LWTUNNEL_BPF dependent on INET
From: Randy Dunlap @ 2019-02-15 22:55 UTC
  To: Peter Oskolkov, Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov
In-Reply-To: <20190215175135.35549-1-posk@google.com>

On 2/15/19 9:51 AM, Peter Oskolkov wrote:
> Lightweight tunnels are L3 constructs that are used with IP/IP6.
> 
> For example, lwtunnel_xmit is called from ip_output.c and
> ip6_output.c only.
> 
> Make the dependency explicit at least for LWT-BPF, as now they
> call into IP routing.
> 
> V2: added "Reported-by" below.
> 
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Peter Oskolkov <posk@google.com>

Yes, that works.  Thanks.

Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested

> ---
>  net/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/Kconfig b/net/Kconfig
> index 5cb9de1aaf88..62da6148e9f8 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -403,7 +403,7 @@ config LWTUNNEL
>  
>  config LWTUNNEL_BPF
>  	bool "Execute BPF program as route nexthop action"
> -	depends on LWTUNNEL
> +	depends on LWTUNNEL && INET
>  	default y if LWTUNNEL=y
>  	---help---
>  	  Allows to run BPF programs as a nexthop action following a route
> 


-- 
~Randy


* [PATCH RFC 0/5] net/sched: validate the control action with all the other parameters
From: Davide Caratti @ 2019-02-15 23:06 UTC
  To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: David S. Miller, Vlad Buslov, Paolo Abeni, netdev

Currently, the kernel checks for bad values of the control action in
tcf_action_init_1(), after a successful call to the action's init()
function. This causes three bad behaviors:

1. the "half configuration"
   if the action is overwritten with an invalid control action, the
   command fails but the new configuration data are applied anyway

    # tc action add action gact drop index 100
    # tc action replace action gact goto chain 66 index 100 \
    > cookie aabbccdd
    Error: Failed to init TC action chain
    We have an error talking to the kernel
    # tc action show action gact
    total acts 1
     action order 0: gact action goto chain 66
     random type none pass val 0
     index 100 ref 1 bind 0
    cookie aabbccdd

2. a "refcount leak that makes kmemleak complain"
   when a valid 'goto chain' action is overwritten with another 'goto chain'
   action, the kernel leaks two refcounts.
   
   # tc chain add dev dd0 chain 42 ingress protocol ip flower \
   > ip_proto tcp action drop
   # tc chain add dev dd0 chain 43 ingress protocol ip flower \
   > ip_proto udp action drop
   # tc filter add dev dd0 ingress matchall \
   > action goto chain 42 index 66
   # tc action replace action gact goto chain 43 index 66
   Error: Failed to init TC action chain.     
   We have an error talking to the kernel

   # echo scan >/sys/kernel/debug/kmemleak
   <...>
   unreferenced object 0xffff93c0ee09f000 (size 1024):
   comm "tc", pid 2565, jiffies 4295339808 (age 65.426s)
   hex dump (first 32 bytes):
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     00 00 00 00 08 00 06 00 00 00 00 00 00 00 00 00  ................
   backtrace:
     [<000000009b63f92d>] tc_ctl_chain+0x3d2/0x4c0
     [<00000000683a8d72>] rtnetlink_rcv_msg+0x263/0x2d0
     [<00000000ddd88f8e>] netlink_rcv_skb+0x4a/0x110
     [<000000006126a348>] netlink_unicast+0x1a0/0x250
     [<00000000b3340877>] netlink_sendmsg+0x2c1/0x3c0
     [<00000000a25a2171>] sock_sendmsg+0x36/0x40
     [<00000000f19ee1ec>] ___sys_sendmsg+0x280/0x2f0
     [<00000000d0422042>] __sys_sendmsg+0x5e/0xa0
     [<000000007a6c61f9>] do_syscall_64+0x5b/0x180
     [<00000000ccd07542>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
     [<0000000013eaa334>] 0xffffffffffffffff

3. a "kernel crash in the traffic plane"
   when an action is overwritten with an invalid 'goto chain' action,
   packets hitting the new rule will trigger a NULL pointer dereference:

   # tc qdisc add dev crash0 clsact
   # tc filter add dev crash0 egress matchall action csum icmp index 6
   # tc action replace action csum icmp goto chain 42 index 6 cookie c1a0c1a0
   # ping 1.2.3.4 -Icrash0

   BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
   #PF error: [normal kernel read fault]
   PGD 0 P4D 0
   Oops: 0000 [#1] SMP PTI
   CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.0.0-rc4.splash #516
   Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
   RIP: 0010:tcf_action_exec+0xb5/0x100
   Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
   RSP: 0018:ffff99933da03be0 EFLAGS: 00010246
   RAX: 000000002000002a RBX: ffff99933c463300 RCX: 0000000000003a00
   RDX: 0000000000000000 RSI: ffff9992ebda2828 RDI: ffff9992ebda2818
   RBP: ffff99933da03c80 R08: 0000000032250000 R09: 0000000000000003
   R10: 000000000000003f R11: 0000000000000028 R12: ffff9992ed22c100
   R13: ffff9992ed22c108 R14: 0000000000000001 R15: ffff9993339919c0
   FS:  0000000000000000(0000) GS:ffff99933da00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000000000000000 CR3: 0000000005a0e004 CR4: 00000000001606f0
   Call Trace:
    <IRQ>
    tcf_classify+0x58/0x110
    __dev_queue_xmit+0x407/0x890
    ? ip6_finish_output2+0x366/0x590
    ip6_finish_output2+0x366/0x590
    ? ip6_output+0x68/0x110
    ip6_output+0x68/0x110
    ? nf_hook.constprop.35+0x79/0xc0
    mld_sendpack+0x16f/0x220
    mld_ifc_timer_expire+0x195/0x2c0
    ? igmp6_timer_handler+0x70/0x70
    call_timer_fn+0x2b/0x130
    run_timer_softirq+0x3e8/0x440
    ? tick_sched_timer+0x37/0x70
    __do_softirq+0xe3/0x2f5
    irq_exit+0xf0/0x100
    smp_apic_timer_interrupt+0x6c/0x130
    apic_timer_interrupt+0xf/0x20
    </IRQ>
   RIP: 0010:native_safe_halt+0x2/0x10
   Code: 74 ff ff ff 7f f3 c3 65 48 8b 04 25 00 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 74 8b eb c1 90 90 90 90 90 90 90 90 90 90 fb f4 <c3> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
   RSP: 0018:ffffffff8be03e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
   RAX: ffffffff8b417730 RBX: 0000000000000000 RCX: 7ffffb503267b27c
   RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff99933da1d240
   RBP: 0000000000000000 R08: 000c515677542b91 R09: 0000000000000000
   R10: ffffb312c0cffd98 R11: 0000000000000001 R12: 0000000000000000

All three of these problems can be fixed if we validate the control
action in the init() function, in the same way as we already do for all
the other parameters.

- patch 1 is a temporary fix for problem 2), but it's reverted at the
  end of the series
- we need to refcount and unrefcount chains within init(): patch 2
  extends the init() prototype to correctly handle 'goto chain' control
  actions.
- patch 3 to [n] fix all the actions, providing a proper error path in
  case of bad control action, thus providing a fix for problems 1), 2)
  and 3)
- patch [n+1] removes two if statements in tcf_action_init_1() causing
  the above problems:
    if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) {
     ....
    }
    
    and
    if (!tcf_action_valid(a->tcfa_action)) {
     ....
    }
  as they are not needed anymore, and for the same reason reverts
  patch 1.
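
To illustrate the intended ordering (a simplified userspace sketch, not
the kernel code): validate the control action first, and touch the
existing action only once validation has passed. The EX_ constants and
the ex_chain_exists() rule are hypothetical stand-ins for the real
definitions and for the chain lookup.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical encoding: opcode in the top bits, value in the low bits. */
#define EX_ACT_EXT_SHIFT	28
#define EX_ACT_EXT_VAL_MASK	((1u << EX_ACT_EXT_SHIFT) - 1)
#define EX_ACT_GOTO_CHAIN	(2u << EX_ACT_EXT_SHIFT)
#define EX_ACT_SHOT		2u	/* "drop" */
#define EX_ACT_VALUE_MAX	8u	/* hypothetical upper bound */

struct ex_action {
	unsigned int control;	/* committed control action */
	unsigned int cookie;	/* stands in for all other parameters */
};

/* Assumption for the sketch: only chains 42 and 43 exist. */
static int ex_chain_exists(unsigned int index)
{
	return index == 42 || index == 43;
}

int ex_check_control(unsigned int control)
{
	if ((control & ~EX_ACT_EXT_VAL_MASK) == EX_ACT_GOTO_CHAIN)
		return ex_chain_exists(control & EX_ACT_EXT_VAL_MASK) ?
		       0 : -ENOENT;
	return control <= EX_ACT_VALUE_MAX ? 0 : -EINVAL;
}

int ex_action_replace(struct ex_action *act, unsigned int control,
		      unsigned int cookie)
{
	int err;

	assert(act);

	/* Validate before committing anything, so a failure is a no-op. */
	err = ex_check_control(control);
	if (err)
		return err;

	act->control = control;
	act->cookie = cookie;
	return 0;
}
```

With this ordering, a failed 'replace' leaves both the control action and
the other parameters as they were, which avoids the "half configuration"
described in problem 1).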

This RFC series fixes only csum, bpf and gact, so it stops at n=5.
If there are no objections, I will prepare a series fixing all the
actions and send it targeting the 'net' tree (or also 'net-next'; this
behavior has been reproducible since the first introduction of 'goto
action').

Any feedback is appreciated, thank you in advance!

Davide Caratti (5):
  net/sched: fix refcount leak when 'goto_chain' is used
  net/sched: prepare TC actions to properly validate the control action
  net/sched: act_bpf: validate the control action inside init()
  net/sched: act_csum: validate the control action inside init()
  net/sched: act_gact: validate the control action inside init()

 include/net/act_api.h      |  9 +++++-
 net/sched/act_api.c        | 63 ++++++++++++++++++++++++++++++++++++--
 net/sched/act_bpf.c        | 12 ++++++--
 net/sched/act_connmark.c   |  1 +
 net/sched/act_csum.c       | 22 ++++++++++---
 net/sched/act_gact.c       | 17 ++++++++--
 net/sched/act_ife.c        |  2 +-
 net/sched/act_ipt.c        | 11 ++++---
 net/sched/act_mirred.c     |  1 +
 net/sched/act_nat.c        |  3 +-
 net/sched/act_pedit.c      |  2 +-
 net/sched/act_police.c     |  1 +
 net/sched/act_sample.c     |  2 +-
 net/sched/act_simple.c     |  2 +-
 net/sched/act_skbedit.c    |  1 +
 net/sched/act_skbmod.c     |  1 +
 net/sched/act_tunnel_key.c |  1 +
 net/sched/act_vlan.c       |  2 +-
 18 files changed, 131 insertions(+), 22 deletions(-)

-- 
2.20.1



* [PATCH RFC 1/5] net/sched: fix refcount leak when 'goto_chain' is used
From: Davide Caratti @ 2019-02-15 23:06 UTC
  To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: David S. Miller, Vlad Buslov, Paolo Abeni, netdev
In-Reply-To: <cover.1550271080.git.dcaratti@redhat.com>

When replacing a valid 'goto chain' action with another valid 'goto
chain' action, the kernel leaks chain->action_refcnt and chain->refcnt.
Since we unconditionally take the refcount again if the control action is
a 'goto chain', we can just drop them after ->init() has ended
successfully.

Fixes: db50514f9a9c ("net: sched: add termination action to allow goto chain")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_api.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d4b8355737d8..91d79fac8cb2 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -907,6 +907,11 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	if (err != ACT_P_CREATED)
 		module_put(a_o->owner);

+	if (a->goto_chain) {
+		tcf_action_goto_chain_fini(a);
+		a->goto_chain = NULL;
+	}
+
 	if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) {
 		err = tcf_action_goto_chain_init(a, tp);
 		if (err) {
-- 
2.20.1


* [PATCH RFC 2/5] net/sched: prepare TC actions to properly validate the control action
From: Davide Caratti @ 2019-02-15 23:06 UTC
  To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: David S. Miller, Vlad Buslov, Paolo Abeni, netdev
In-Reply-To: <cover.1550271080.git.dcaratti@redhat.com>

- add tcf_action_check_ctrlact(), and pass a pointer to struct tcf_proto
  to each action's init() function, to allow validation of the 'goto
  chain' control action.
- add tcf_action_set_ctrlact(), to set the control action, release the
  previous 'goto_chain' handle and replace it with the new one.
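
The handle replacement can be pictured with the userspace sketch below.
It is illustrative only: the ex_ names are hypothetical, the reference
count is a plain int, and a plain assignment stands in for the atomic
xchg() used in the patch, so it is single-threaded by assumption.

```c
#include <assert.h>
#include <stddef.h>

struct ex_chain {
	int refcnt;
};

struct ex_action {
	unsigned int control;
	struct ex_chain *goto_chain;
};

/* Take a reference on behalf of an action (NULL is passed through). */
struct ex_chain *ex_chain_get(struct ex_chain *chain)
{
	if (chain)
		chain->refcnt++;
	return chain;
}

/* Drop a reference previously taken with ex_chain_get(). */
void ex_chain_put(struct ex_chain *chain)
{
	if (chain)
		chain->refcnt--;
}

/*
 * Install the new control action and chain handle, then release the
 * handle that was there before, so replacing one 'goto chain' with
 * another does not leak the old reference.
 */
void ex_action_set_control(struct ex_action *act, unsigned int control,
			   struct ex_chain *newchain)
{
	struct ex_chain *old;

	assert(act);
	old = act->goto_chain;
	act->goto_chain = newchain;
	act->control = control;
	ex_chain_put(old);
}
```

The caller is expected to have taken the reference on the new chain
already (during validation); passing NULL clears the handle.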

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 include/net/act_api.h      |  9 +++++-
 net/sched/act_api.c        | 58 ++++++++++++++++++++++++++++++++++++--
 net/sched/act_bpf.c        |  2 +-
 net/sched/act_connmark.c   |  1 +
 net/sched/act_csum.c       |  2 +-
 net/sched/act_gact.c       |  2 +-
 net/sched/act_ife.c        |  2 +-
 net/sched/act_ipt.c        | 11 ++++----
 net/sched/act_mirred.c     |  1 +
 net/sched/act_nat.c        |  3 +-
 net/sched/act_pedit.c      |  2 +-
 net/sched/act_police.c     |  1 +
 net/sched/act_sample.c     |  2 +-
 net/sched/act_simple.c     |  2 +-
 net/sched/act_skbedit.c    |  1 +
 net/sched/act_skbmod.c     |  1 +
 net/sched/act_tunnel_key.c |  1 +
 net/sched/act_vlan.c       |  2 +-
 18 files changed, 86 insertions(+), 17 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index dbc795ec659e..55a8d44ac0c7 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -90,7 +90,7 @@ struct tc_action_ops {
 	int     (*lookup)(struct net *net, struct tc_action **a, u32 index);
 	int     (*init)(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **act, int ovr,
-			int bind, bool rtnl_held,
+			int bind, bool rtnl_held, struct tcf_proto *tp,
 			struct netlink_ext_ack *extack);
 	int     (*walk)(struct net *, struct sk_buff *,
 			struct netlink_callback *, int,
@@ -181,6 +181,13 @@ int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_copy_stats(struct sk_buff *, struct tc_action *, int);
 
+int tcf_action_check_ctrlact(int action, struct tcf_proto *tp,
+			     struct tcf_chain **handle,
+			     struct netlink_ext_ack *extack);
+
+void tcf_action_set_ctrlact(struct tc_action *p, int action,
+			    struct tcf_chain *goto_chain);
+
 #endif /* CONFIG_NET_CLS_ACT */
 
 static inline void tcf_action_stats_update(struct tc_action *a, u64 bytes,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 91d79fac8cb2..088b0d846bde 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -71,6 +71,60 @@ static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie,
 		call_rcu(&old->rcu, tcf_free_cookie_rcu);
 }
 
+int tcf_action_check_ctrlact(int action, struct tcf_proto *tp,
+			     struct tcf_chain **handle,
+			     struct netlink_ext_ack *extack)
+{
+	int opcode = TC_ACT_EXT_OPCODE(action), ret = -EINVAL;
+	u32 chain_index;
+
+	if (!opcode)
+		ret = action > TC_ACT_VALUE_MAX ? -EINVAL : 0;
+	else if (opcode <= TC_ACT_EXT_OPCODE_MAX || action == TC_ACT_UNSPEC)
+		ret = 0;
+	if (ret) {
+		NL_SET_ERR_MSG(extack, "invalid control action");
+		goto end;
+	}
+
+	if (TC_ACT_EXT_CMP(action, TC_ACT_GOTO_CHAIN)) {
+		chain_index = action & TC_ACT_EXT_VAL_MASK;
+		if (!tp) {
+			ret = -EINVAL;
+			NL_SET_ERR_MSG(extack,
+				       "can't use goto_chain with NULL proto");
+			goto end;
+		}
+		if (!handle) {
+			ret = -EINVAL;
+			NL_SET_ERR_MSG(extack,
+				       "can't put goto_chain on NULL handle");
+			goto end;
+		}
+		*handle = tcf_chain_get_by_act(tp->chain->block, chain_index);
+		if (!*handle) {
+			ret = -ENOMEM;
+			NL_SET_ERR_MSG(extack,
+				       "can't allocate goto_chain handle");
+		}
+	}
+end:
+	return ret;
+}
+EXPORT_SYMBOL(tcf_action_check_ctrlact);
+
+void tcf_action_set_ctrlact(struct tc_action *p, int action,
+			    struct tcf_chain *goto_chain)
+{
+	struct tcf_chain *old;
+
+	old = xchg(&p->goto_chain, goto_chain);
+	if (old)
+		tcf_chain_put_by_act(old);
+	p->tcfa_action = action;
+}
+EXPORT_SYMBOL(tcf_action_set_ctrlact);
+
 /* XXX: For standalone actions, we don't need a RCU grace period either, because
  * actions are always connected to filters and filters are already destroyed in
  * RCU callbacks, so after a RCU grace period actions are already disconnected
@@ -890,10 +944,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	/* backward compatibility for policer */
 	if (name == NULL)
 		err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind,
-				rtnl_held, extack);
+				rtnl_held, tp, extack);
 	else
 		err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held,
-				extack);
+				tp, extack);
 	if (err < 0)
 		goto err_mod;
 
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index c7633843e223..88a729bdab25 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -278,7 +278,7 @@ static void tcf_bpf_prog_fill_cfg(const struct tcf_bpf *prog,
 static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **act,
 			int replace, int bind, bool rtnl_held,
-			struct netlink_ext_ack *extack)
+			struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, bpf_net_id);
 	struct nlattr *tb[TCA_ACT_BPF_MAX + 1];
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 8475913f2070..30c4c109c80c 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -97,6 +97,7 @@ static const struct nla_policy connmark_policy[TCA_CONNMARK_MAX + 1] = {
 static int tcf_connmark_init(struct net *net, struct nlattr *nla,
 			     struct nlattr *est, struct tc_action **a,
 			     int ovr, int bind, bool rtnl_held,
+			     struct tcf_proto *tp,
 			     struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, connmark_net_id);
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 3dc25b7806d7..1ae120c9ab02 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -46,7 +46,7 @@ static struct tc_action_ops act_csum_ops;
 
 static int tcf_csum_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a, int ovr,
-			 int bind, bool rtnl_held,
+			 int bind, bool rtnl_held, struct tcf_proto *tp,
 			 struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, csum_net_id);
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index b61c20ebb314..727bbca9534b 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -57,7 +57,7 @@ static const struct nla_policy gact_policy[TCA_GACT_MAX + 1] = {
 static int tcf_gact_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a,
 			 int ovr, int bind, bool rtnl_held,
-			 struct netlink_ext_ack *extack)
+			 struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, gact_net_id);
 	struct nlattr *tb[TCA_GACT_MAX + 1];
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 30b63fa23ee2..9b2eb941e093 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -469,7 +469,7 @@ static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb,
 static int tcf_ife_init(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **a,
 			int ovr, int bind, bool rtnl_held,
-			struct netlink_ext_ack *extack)
+			struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, ife_net_id);
 	struct nlattr *tb[TCA_IFE_MAX + 1];
diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
index 8af6c11d2482..4b3c844ca988 100644
--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -97,7 +97,8 @@ static const struct nla_policy ipt_policy[TCA_IPT_MAX + 1] = {
 
 static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,
 			  struct nlattr *est, struct tc_action **a,
-			  const struct tc_action_ops *ops, int ovr, int bind)
+			  const struct tc_action_ops *ops, int ovr, int bind,
+			  struct tcf_proto *tp)
 {
 	struct tc_action_net *tn = net_generic(net, id);
 	struct nlattr *tb[TCA_IPT_MAX + 1];
@@ -206,20 +207,20 @@ static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,
 
 static int tcf_ipt_init(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **a, int ovr,
-			int bind, bool rtnl_held,
+			int bind, bool rtnl_held, struct tcf_proto *tp,
 			struct netlink_ext_ack *extack)
 {
 	return __tcf_ipt_init(net, ipt_net_id, nla, est, a, &act_ipt_ops, ovr,
-			      bind);
+			      bind, tp);
 }
 
 static int tcf_xt_init(struct net *net, struct nlattr *nla,
 		       struct nlattr *est, struct tc_action **a, int ovr,
-		       int bind, bool unlocked,
+		       int bind, bool unlocked, struct tcf_proto *tp,
 		       struct netlink_ext_ack *extack)
 {
 	return __tcf_ipt_init(net, xt_net_id, nla, est, a, &act_xt_ops, ovr,
-			      bind);
+			      bind, tp);
 }
 
 static int tcf_ipt_act(struct sk_buff *skb, const struct tc_action *a,
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index c8cf4d10c435..69dda57f1097 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -94,6 +94,7 @@ static struct tc_action_ops act_mirred_ops;
 static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a,
 			   int ovr, int bind, bool rtnl_held,
+			   struct tcf_proto *tp,
 			   struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, mirred_net_id);
diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
index c5c1e23add77..526c4c99bcce 100644
--- a/net/sched/act_nat.c
+++ b/net/sched/act_nat.c
@@ -38,7 +38,8 @@ static const struct nla_policy nat_policy[TCA_NAT_MAX + 1] = {
 
 static int tcf_nat_init(struct net *net, struct nlattr *nla, struct nlattr *est,
 			struct tc_action **a, int ovr, int bind,
-			bool rtnl_held, struct netlink_ext_ack *extack)
+			bool rtnl_held,	struct tcf_proto *tp,
+			struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, nat_net_id);
 	struct nlattr *tb[TCA_NAT_MAX + 1];
diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index 2b372a06b432..1c7a0db7b466 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -138,7 +138,7 @@ static int tcf_pedit_key_ex_dump(struct sk_buff *skb,
 static int tcf_pedit_init(struct net *net, struct nlattr *nla,
 			  struct nlattr *est, struct tc_action **a,
 			  int ovr, int bind, bool rtnl_held,
-			  struct netlink_ext_ack *extack)
+			  struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, pedit_net_id);
 	struct nlattr *tb[TCA_PEDIT_MAX + 1];
diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index ec8ec55e0fe8..a444dd78a244 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -83,6 +83,7 @@ static const struct nla_policy police_policy[TCA_POLICE_MAX + 1] = {
 static int tcf_police_init(struct net *net, struct nlattr *nla,
 			       struct nlattr *est, struct tc_action **a,
 			       int ovr, int bind, bool rtnl_held,
+			       struct tcf_proto *tp,
 			       struct netlink_ext_ack *extack)
 {
 	int ret = 0, tcfp_result = TC_ACT_OK, err, size;
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 1a0c682fd734..b2154edcb535 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -37,7 +37,7 @@ static const struct nla_policy sample_policy[TCA_SAMPLE_MAX + 1] = {
 
 static int tcf_sample_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a, int ovr,
-			   int bind, bool rtnl_held,
+			   int bind, bool rtnl_held, struct tcf_proto *tp,
 			   struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, sample_net_id);
diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index 902957beceb3..640ee5b785dc 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -80,7 +80,7 @@ static const struct nla_policy simple_policy[TCA_DEF_MAX + 1] = {
 static int tcf_simp_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a,
 			 int ovr, int bind, bool rtnl_held,
-			 struct netlink_ext_ack *extack)
+			 struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, simp_net_id);
 	struct nlattr *tb[TCA_DEF_MAX + 1];
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 64dba3708fce..9fc8cfdd35b1 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -96,6 +96,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
 static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
 			    struct nlattr *est, struct tc_action **a,
 			    int ovr, int bind, bool rtnl_held,
+			    struct tcf_proto *tp,
 			    struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, skbedit_net_id);
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
index 59710a183bd3..35572d0e4576 100644
--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -82,6 +82,7 @@ static const struct nla_policy skbmod_policy[TCA_SKBMOD_MAX + 1] = {
 static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a,
 			   int ovr, int bind, bool rtnl_held,
+			   struct tcf_proto *tp,
 			   struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 8b43fe0130f7..c4adc53e0fb4 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -209,6 +209,7 @@ static void tunnel_key_release_params(struct tcf_tunnel_key_params *p)
 static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a,
 			   int ovr, int bind, bool rtnl_held,
+			   struct tcf_proto *tp,
 			   struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, tunnel_key_net_id);
diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 93fdaf707313..80fd0e238a10 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -105,7 +105,7 @@ static const struct nla_policy vlan_policy[TCA_VLAN_MAX + 1] = {
 static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a,
 			 int ovr, int bind, bool rtnl_held,
-			 struct netlink_ext_ack *extack)
+			 struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, vlan_net_id);
 	struct nlattr *tb[TCA_VLAN_MAX + 1];
-- 
2.20.1



* [PATCH RFC 3/5] net/sched: act_bpf: validate the control action inside init()
From: Davide Caratti @ 2019-02-15 23:06 UTC
  To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: David S. Miller, Vlad Buslov, Paolo Abeni, netdev
In-Reply-To: <cover.1550271080.git.dcaratti@redhat.com>

Don't overwrite act_bpf data if the control action is not valid, to
prevent losing the previous configuration in case validation fails.
Not doing that caused a NULL dereference in the data path if 'goto chain'
is used.

Tested with:
 # ./tdc.py -c bpf

Fixes: db50514f9a9c ("net: sched: add termination action to allow goto chain")
Fixes: 97763dc0f401 ("net_sched: reject unknown tcfa_action values")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_bpf.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 88a729bdab25..e2c2ba5faeb3 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -17,6 +17,7 @@
 
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
 
 #include <linux/tc_act/tc_bpf.h>
 #include <net/tc_act/tc_bpf.h>
@@ -282,6 +283,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 {
 	struct tc_action_net *tn = net_generic(net, bpf_net_id);
 	struct nlattr *tb[TCA_ACT_BPF_MAX + 1];
+	struct tcf_chain *newchain = NULL;
 	struct tcf_bpf_cfg cfg, old;
 	struct tc_act_bpf *parm;
 	struct tcf_bpf *prog;
@@ -323,6 +325,10 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 		return ret;
 	}
 
+	ret = tcf_action_check_ctrlact(parm->action, tp, &newchain, extack);
+	if (ret < 0)
+		goto out;
+
 	is_bpf = tb[TCA_ACT_BPF_OPS_LEN] && tb[TCA_ACT_BPF_OPS];
 	is_ebpf = tb[TCA_ACT_BPF_FD];
 
@@ -350,7 +356,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 	if (cfg.bpf_num_ops)
 		prog->bpf_num_ops = cfg.bpf_num_ops;
 
-	prog->tcf_action = parm->action;
+	tcf_action_set_ctrlact(*act, parm->action, newchain);
 	rcu_assign_pointer(prog->filter, cfg.filter);
 	spin_unlock_bh(&prog->tcf_lock);
 
@@ -364,6 +370,8 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 
 	return res;
 out:
+	if (newchain)
+		tcf_chain_put_by_act(newchain);
 	tcf_idr_release(*act, bind);
 
 	return ret;
-- 
2.20.1



* [PATCH RFC 4/5] net/sched: act_csum: validate the control action inside init()
From: Davide Caratti @ 2019-02-15 23:06 UTC
  To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: David S. Miller, Vlad Buslov, Paolo Abeni, netdev
In-Reply-To: <cover.1550271080.git.dcaratti@redhat.com>

Don't overwrite act_csum data if the control action is not valid,
to prevent losing the previous configuration in case validation fails.
Not doing that caused a NULL dereference in the data path if 'goto chain'
is used.
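
The validate-before-overwrite idea described above can be sketched in plain
userspace C. This is only an illustration under stated assumptions: the
struct and function names (chain, action, check_ctrl, action_update) and the
CTRL_* values are hypothetical stand-ins, not the tc action API touched by
the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real tc API. */
struct chain {
	int refcnt;
};

struct action {
	int ctrl;                  /* current control action */
	struct chain *goto_chain;  /* target of 'goto chain', or NULL */
	unsigned int update_flags; /* action-specific configuration */
};

enum { CTRL_OK = 0, CTRL_GOTO_CHAIN = 1, CTRL_MAX = 1 };

/* Step 1: validate the requested control action without touching 'act'. */
static int check_ctrl(int ctrl, struct chain *target, struct chain **newchain)
{
	if (ctrl < 0 || ctrl > CTRL_MAX)
		return -EINVAL;
	if (ctrl == CTRL_GOTO_CHAIN) {
		if (!target)
			return -EINVAL;
		target->refcnt++;      /* take a reference for the new owner */
		*newchain = target;
	}
	return 0;
}

/* Step 2: commit the new configuration only after validation succeeded. */
static int action_update(struct action *act, int ctrl, struct chain *target,
			 unsigned int update_flags)
{
	struct chain *newchain = NULL;
	int err = check_ctrl(ctrl, target, &newchain);

	if (err)
		return err;            /* 'act' keeps its previous configuration */

	if (act->goto_chain)
		act->goto_chain->refcnt--; /* drop the reference to the old chain */
	act->ctrl = ctrl;
	act->goto_chain = newchain;
	act->update_flags = update_flags;
	return 0;
}
```

In this sketch a failed action_update() leaves every field of the action
untouched, so the data path never observes a 'goto chain' control action
with a NULL chain pointer.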

Tested with:
 # ./tdc.py -c csum

Fixes: db50514f9a9c ("net: sched: add termination action to allow goto chain")
Fixes: 97763dc0f401 ("net_sched: reject unknown tcfa_action values")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_csum.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 1ae120c9ab02..bf0940156886 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -33,6 +33,7 @@
 #include <net/sctp/checksum.h>
 
 #include <net/act_api.h>
+#include <net/pkt_cls.h>
 
 #include <linux/tc_act/tc_csum.h>
 #include <net/tc_act/tc_csum.h>
@@ -52,6 +53,7 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 	struct tc_action_net *tn = net_generic(net, csum_net_id);
 	struct tcf_csum_params *params_new;
 	struct nlattr *tb[TCA_CSUM_MAX + 1];
+	struct tcf_chain *newchain = NULL;
 	struct tc_csum *parm;
 	struct tcf_csum *p;
 	int ret = 0, err;
@@ -87,17 +89,23 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 		return err;
 	}
 
+	err = tcf_action_check_ctrlact(parm->action, tp, &newchain, extack);
+	if (unlikely(err)) {
+		ret = err;
+		goto error;
+	}
+
 	p = to_tcf_csum(*a);
 
 	params_new = kzalloc(sizeof(*params_new), GFP_KERNEL);
 	if (unlikely(!params_new)) {
-		tcf_idr_release(*a, bind);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto error;
 	}
 	params_new->update_flags = parm->update_flags;
 
 	spin_lock_bh(&p->tcf_lock);
-	p->tcf_action = parm->action;
+	tcf_action_set_ctrlact(*a, parm->action, newchain);
 	rcu_swap_protected(p->params, params_new,
 			   lockdep_is_held(&p->tcf_lock));
 	spin_unlock_bh(&p->tcf_lock);
@@ -108,7 +116,13 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 	if (ret == ACT_P_CREATED)
 		tcf_idr_insert(tn, *a);
 
+end:
 	return ret;
+error:
+	if (newchain)
+		tcf_chain_put_by_act(newchain);
+	tcf_idr_release(*a, bind);
+	goto end;
 }
 
 /**
-- 
2.20.1


^ permalink raw reply related

* [PATCH RFC 5/5] net/sched: act_gact: validate the control action inside init()
From: Davide Caratti @ 2019-02-15 23:06 UTC (permalink / raw)
  To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: David S. Miller, Vlad Buslov, Paolo Abeni, netdev
In-Reply-To: <cover.1550271080.git.dcaratti@redhat.com>

Don't overwrite act_gact data if the control action is not valid,
to prevent losing the previous configuration in case validation fails.
Not doing that caused a NULL dereference in the data path if 'goto chain'
is used.

Tested with:
 # ./tdc.py -c gact

Fixes: db50514f9a9c ("net: sched: add termination action to allow goto chain")
Fixes: 97763dc0f401 ("net_sched: reject unknown tcfa_action values")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_gact.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index 727bbca9534b..530e8bb8f94d 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
 #include <linux/tc_act/tc_gact.h>
 #include <net/tc_act/tc_gact.h>
 
@@ -61,6 +62,7 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
 {
 	struct tc_action_net *tn = net_generic(net, gact_net_id);
 	struct nlattr *tb[TCA_GACT_MAX + 1];
+	struct tcf_chain *newchain = NULL;
 	struct tc_gact *parm;
 	struct tcf_gact *gact;
 	int ret = 0;
@@ -116,10 +118,15 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
 		return err;
 	}
 
+	err = tcf_action_check_ctrlact(parm->action, tp, &newchain, extack);
+	if (unlikely(err)) {
+		ret = err;
+		goto error;
+	}
 	gact = to_gact(*a);
 
 	spin_lock_bh(&gact->tcf_lock);
-	gact->tcf_action = parm->action;
+	tcf_action_set_ctrlact(*a, parm->action, newchain);
 #ifdef CONFIG_GACT_PROB
 	if (p_parm) {
 		gact->tcfg_paction = p_parm->paction;
@@ -135,7 +142,13 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
 
 	if (ret == ACT_P_CREATED)
 		tcf_idr_insert(tn, *a);
+end:
 	return ret;
+error:
+	if (newchain)
+		tcf_chain_put_by_act(newchain);
+	tcf_idr_release(*a, bind);
+	goto end;
 }
 
 static int tcf_gact_act(struct sk_buff *skb, const struct tc_action *a,
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution
From: Cong Wang @ 2019-02-15 23:17 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20190211085548.7190-11-vladbu@mellanox.com>

On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> +static bool tcf_proto_is_empty(struct tcf_proto *tp)
> +{
> +       struct tcf_walker walker = { .fn = walker_noop, };
> +
> +       if (tp->ops->walk) {
> +               tp->ops->walk(tp, &walker);
> +               return !walker.stop;
> +       }
> +       return true;
> +}
> +
> +static bool tcf_proto_check_delete(struct tcf_proto *tp)
> +{
> +       spin_lock(&tp->lock);
> +       if (tcf_proto_is_empty(tp))
> +               tp->deleting = true;
> +       spin_unlock(&tp->lock);
> +       return tp->deleting;

If you use this spinlock for walking each tp data structure,
why is it not needed for adding filters to or deleting filters
from each tp?

^ permalink raw reply

* [PATCH bpf-next] selftests: bpf: test_lwt_ip_encap: add negative tests.
From: Peter Oskolkov @ 2019-02-15 23:49 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Peter Oskolkov

As requested by David Ahern:

- add negative tests (no routes, explicitly unreachable destinations)
  to exercise error handling code paths;
- do not exit on test failures, but instead print a summary of
  passed/failed tests at the end.

Future patches will add TSO and VRF tests.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 .../selftests/bpf/test_lwt_ip_encap.sh        | 111 ++++++++++++++----
 1 file changed, 88 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
index 4ca714e23ab0..612632c1425f 100755
--- a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -38,8 +38,6 @@
 #       ping: SRC->[encap at veth2:ingress]->GRE:decap->DST
 #       ping replies go DST->SRC directly
 
-set -e  # exit on error
-
 if [[ $EUID -ne 0 ]]; then
 	echo "This script must be run as root"
 	echo "FAIL"
@@ -76,8 +74,37 @@ readonly IPv6_GRE="fb10::1"
 readonly IPv6_SRC=$IPv6_1
 readonly IPv6_DST=$IPv6_4
 
-setup() {
-set -e  # exit on error
+TEST_STATUS=0
+TESTS_SUCCEEDED=0
+TESTS_FAILED=0
+
+process_test_results()
+{
+	if [[ "${TEST_STATUS}" -eq 0 ]] ; then
+		echo "PASS"
+		TESTS_SUCCEEDED=$((TESTS_SUCCEEDED+1))
+	else
+		echo "FAIL"
+		TESTS_FAILED=$((TESTS_FAILED+1))
+	fi
+}
+
+print_test_summary_and_exit()
+{
+	echo "passed tests: ${TESTS_SUCCEEDED}"
+	echo "failed tests: ${TESTS_FAILED}"
+	if [ "${TESTS_FAILED}" -eq "0" ] ; then
+		exit 0
+	else
+		exit 1
+	fi
+}
+
+setup()
+{
+	set -e  # exit on error
+	TEST_STATUS=0
+
 	# create devices and namespaces
 	ip netns add "${NS1}"
 	ip netns add "${NS2}"
@@ -178,7 +205,7 @@ set -e  # exit on error
 	# configure IPv4 GRE device in NS3, and a route to it via the "bottom" route
 	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
 	ip -netns ${NS3} link set gre_dev up
-	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
+	ip -netns ${NS3} addr add ${IPv4_GRE} nodad dev gre_dev
 	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
 	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
 
@@ -194,9 +221,13 @@ set -e  # exit on error
 	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
 	ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
 	ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
+
+	sleep 1  # reduce flakiness
+	set +e
 }
 
-cleanup() {
+cleanup()
+{
 	ip netns del ${NS1} 2> /dev/null
 	ip netns del ${NS2} 2> /dev/null
 	ip netns del ${NS3} 2> /dev/null
@@ -204,12 +235,28 @@ cleanup() {
 
 trap cleanup EXIT
 
-test_ping() {
+remove_routes_to_gredev()
+{
+	ip -netns ${NS1} route del ${IPv4_GRE} dev veth5
+	ip -netns ${NS2} route del ${IPv4_GRE} dev veth7
+	ip -netns ${NS1} -6 route del ${IPv6_GRE}/128 dev veth5
+	ip -netns ${NS2} -6 route del ${IPv6_GRE}/128 dev veth7
+}
+
+add_unreachable_routes_to_gredev()
+{
+	ip -netns ${NS1} route add unreachable ${IPv4_GRE}/32
+	ip -netns ${NS2} route add unreachable ${IPv4_GRE}/32
+	ip -netns ${NS1} -6 route add unreachable ${IPv6_GRE}/128
+	ip -netns ${NS2} -6 route add unreachable ${IPv6_GRE}/128
+}
+
+test_ping()
+{
 	local readonly PROTO=$1
 	local readonly EXPECTED=$2
 	local RET=0
 
-	set +e
 	if [ "${PROTO}" == "IPv4" ] ; then
 		ip netns exec ${NS1} ping  -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} 2>&1 > /dev/null
 		RET=$?
@@ -217,29 +264,26 @@ test_ping() {
 		ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} 2>&1 > /dev/null
 		RET=$?
 	else
-		echo "test_ping: unknown PROTO: ${PROTO}"
-		exit 1
+		echo "    test_ping: unknown PROTO: ${PROTO}"
+		TEST_STATUS=1
 	fi
-	set -e
 
 	if [ "0" != "${RET}" ]; then
 		RET=1
 	fi
 
 	if [ "${EXPECTED}" != "${RET}" ] ; then
-		echo "FAIL: test_ping: ${RET}"
-		exit 1
+		echo "    test_ping failed: expected: ${EXPECTED}; got ${RET}"
+		TEST_STATUS=1
 	fi
 }
 
-test_egress() {
+test_egress()
+{
 	local readonly ENCAP=$1
 	echo "starting egress ${ENCAP} encap test"
 	setup
 
-	# need to wait a bit for IPv6 to autoconf, otherwise
-	# ping6 sometimes fails with "unable to bind to address"
-
 	# by default, pings work
 	test_ping IPv4 0
 	test_ping IPv6 0
@@ -258,16 +302,28 @@ test_egress() {
 		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
 		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
 	else
-		echo "FAIL: unknown encap ${ENCAP}"
+		echo "    unknown encap ${ENCAP}"
+		TEST_STATUS=1
 	fi
 	test_ping IPv4 0
 	test_ping IPv6 0
 
+	# a negative test: remove routes to GRE devices: ping fails
+	remove_routes_to_gredev
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# another negative test
+	add_unreachable_routes_to_gredev
+	test_ping IPv4 1
+	test_ping IPv6 1
+
 	cleanup
-	echo "PASS"
+	process_test_results
 }
 
-test_ingress() {
+test_ingress()
+{
 	local readonly ENCAP=$1
 	echo "starting ingress ${ENCAP} encap test"
 	setup
@@ -298,14 +354,23 @@ test_ingress() {
 	test_ping IPv4 0
 	test_ping IPv6 0
 
+	# a negative test: remove routes to GRE devices: ping fails
+	remove_routes_to_gredev
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# another negative test
+	add_unreachable_routes_to_gredev
+	test_ping IPv4 1
+	test_ping IPv6 1
+
 	cleanup
-	echo "PASS"
+	process_test_results
 }
 
 test_egress IPv4
 test_egress IPv6
-
 test_ingress IPv4
 test_ingress IPv6
 
-echo "all tests passed"
+print_test_summary_and_exit
-- 
2.21.0.rc0.258.g878e2cd30e-goog


^ permalink raw reply related

* Re: [PATCH net-next v5] ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs
From: Nikolay Aleksandrov @ 2019-02-16  0:21 UTC (permalink / raw)
  To: Callum Sinclair, davem, kuznet, yoshfuji, netdev, linux-kernel
  Cc: nicolas.dichtel
In-Reply-To: <20190214024418.21490-2-callum.sinclair@alliedtelesis.co.nz>

On 14/02/2019 04:44, Callum Sinclair wrote:
> Currently the only way to clear the forwarding cache is to delete the
> entries one by one using the MRT_DEL_MFC socket option or to destroy and
> recreate the socket.
> 
> Create a new socket option which with the use of optional flags can
> clear any combination of multicast entries (static or not static) and
> multicast vifs (static or not static).
> 
> Calling the new socket option MRT_FLUSH with the flags MRT_FLUSH_MFC and
> MRT_FLUSH_VIFS will clear all entries and vifs on the socket except for
> static entries.
> 
> Signed-off-by: Callum Sinclair <callum.sinclair@alliedtelesis.co.nz>
> ---
> v1 -> v2:
>   Implemented additional flags for static entries
> v2 -> v3:
>   Cleaned up flag logic so any combination of routes can be cleared.
>   Fixed style errors
>   Fixed incorrect flag values
> v3 -> v4:
>   Fixed style errors
>   Fixed incorrect flag (MRT_FLUSH was used instead of MRT_FLUSH_VIFS)
> v4 -> v5:
>   Only clear the unresolved queue when MRT_FLUSH_MFC flag is set.
> 
>  include/uapi/linux/mroute.h  |  9 ++++-
>  include/uapi/linux/mroute6.h |  9 ++++-
>  net/ipv4/ipmr.c              | 75 +++++++++++++++++++++-------------
>  net/ipv6/ip6mr.c             | 78 +++++++++++++++++++++++-------------
>  4 files changed, 115 insertions(+), 56 deletions(-)
> 

+1 about Nicolas' comments, other than that looks good:
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
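
For readers following the flag semantics described in the commit message, the
per-entry decision made inside the flush loops can be restated as a small
predicate. This is a minimal userspace sketch: the flag values are copied
from the quoted uapi hunk, while the helper names (should_flush, flush_mfc,
flush_vif) are hypothetical and do not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in the quoted include/uapi/linux/mroute.h hunk. */
#define MRT_FLUSH_MFC         1 /* Flush multicast entries */
#define MRT_FLUSH_MFC_STATIC  2 /* Flush static multicast entries */
#define MRT_FLUSH_VIFS        4 /* Flush multicast vifs */
#define MRT_FLUSH_VIFS_STATIC 8 /* Flush static multicast vifs */

/*
 * Hypothetical helper: an entry is flushed when the flag matching its kind
 * (static or non-static) is set. This is the negation of the "continue"
 * condition used in the patched loops.
 */
static bool should_flush(bool is_static, int flags, int dyn_flag,
			 int static_flag)
{
	return is_static ? (flags & static_flag) != 0
			 : (flags & dyn_flag) != 0;
}

/* Decision for an mfc cache entry. */
static bool flush_mfc(bool is_static, int flags)
{
	return should_flush(is_static, flags, MRT_FLUSH_MFC,
			    MRT_FLUSH_MFC_STATIC);
}

/* Decision for a vif. */
static bool flush_vif(bool is_static, int flags)
{
	return should_flush(is_static, flags, MRT_FLUSH_VIFS,
			    MRT_FLUSH_VIFS_STATIC);
}
```

With flags set to MRT_FLUSH_MFC | MRT_FLUSH_VIFS, both helpers return true
for non-static entries and false for static ones, which matches the behaviour
the commit message describes for that combination.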

> diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
> index 5d37a9ccce63..11c8c1fc1124 100644
> --- a/include/uapi/linux/mroute.h
> +++ b/include/uapi/linux/mroute.h
> @@ -28,12 +28,19 @@
>  #define MRT_TABLE	(MRT_BASE+9)	/* Specify mroute table ID		*/
>  #define MRT_ADD_MFC_PROXY	(MRT_BASE+10)	/* Add a (*,*|G) mfc entry	*/
>  #define MRT_DEL_MFC_PROXY	(MRT_BASE+11)	/* Del a (*,*|G) mfc entry	*/
> -#define MRT_MAX		(MRT_BASE+11)
> +#define MRT_FLUSH	(MRT_BASE+12)	/* Flush all mfc entries and/or vifs	*/
> +#define MRT_MAX		(MRT_BASE+12)
>  
>  #define SIOCGETVIFCNT	SIOCPROTOPRIVATE	/* IP protocol privates */
>  #define SIOCGETSGCNT	(SIOCPROTOPRIVATE+1)
>  #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
>  
> +/* MRT_FLUSH optional flags */
> +#define MRT_FLUSH_MFC	1	/* Flush multicast entries */
> +#define MRT_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
> +#define MRT_FLUSH_VIFS	4	/* Flush multicast vifs */
> +#define MRT_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
> +
>  #define MAXVIFS		32
>  typedef unsigned long vifbitmap_t;	/* User mode code depends on this lot */
>  typedef unsigned short vifi_t;
> diff --git a/include/uapi/linux/mroute6.h b/include/uapi/linux/mroute6.h
> index 9999cc006390..ac84ef11b29c 100644
> --- a/include/uapi/linux/mroute6.h
> +++ b/include/uapi/linux/mroute6.h
> @@ -31,12 +31,19 @@
>  #define MRT6_TABLE	(MRT6_BASE+9)	/* Specify mroute table ID		*/
>  #define MRT6_ADD_MFC_PROXY	(MRT6_BASE+10)	/* Add a (*,*|G) mfc entry	*/
>  #define MRT6_DEL_MFC_PROXY	(MRT6_BASE+11)	/* Del a (*,*|G) mfc entry	*/
> -#define MRT6_MAX	(MRT6_BASE+11)
> +#define MRT6_FLUSH	(MRT6_BASE+12)	/* Flush all mfc entries and/or vifs	*/
> +#define MRT6_MAX	(MRT6_BASE+12)
>  
>  #define SIOCGETMIFCNT_IN6	SIOCPROTOPRIVATE	/* IP protocol privates */
>  #define SIOCGETSGCNT_IN6	(SIOCPROTOPRIVATE+1)
>  #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
>  
> +/* MRT6_FLUSH optional flags */
> +#define MRT6_FLUSH_MFC	1	/* Flush multicast entries */
> +#define MRT6_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
> +#define MRT6_FLUSH_VIFS	4	/* Flushing multicast vifs */
> +#define MRT6_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
> +
>  #define MAXMIFS		32
>  typedef unsigned long mifbitmap_t;	/* User mode code depends on this lot */
>  typedef unsigned short mifi_t;
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index e536970557dd..53869779af74 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -110,7 +110,7 @@ static int ipmr_cache_report(struct mr_table *mrt,
>  static void mroute_netlink_event(struct mr_table *mrt, struct mfc_cache *mfc,
>  				 int cmd);
>  static void igmpmsg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
> -static void mroute_clean_tables(struct mr_table *mrt, bool all);
> +static void mroute_clean_tables(struct mr_table *mrt, int flags);
>  static void ipmr_expire_process(struct timer_list *t);
>  
>  #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
> @@ -415,7 +415,8 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
>  static void ipmr_free_table(struct mr_table *mrt)
>  {
>  	del_timer_sync(&mrt->ipmr_expire_timer);
> -	mroute_clean_tables(mrt, true);
> +	mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC |
> +					  MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC);
>  	rhltable_destroy(&mrt->mfc_hash);
>  	kfree(mrt);
>  }
> @@ -1296,7 +1297,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
>  }
>  
>  /* Close the multicast socket, and clear the vif tables etc */
> -static void mroute_clean_tables(struct mr_table *mrt, bool all)
> +static void mroute_clean_tables(struct mr_table *mrt, int flags)
>  {
>  	struct net *net = read_pnet(&mrt->net);
>  	struct mr_mfc *c, *tmp;
> @@ -1305,35 +1306,44 @@ static void mroute_clean_tables(struct mr_table *mrt, bool all)
>  	int i;
>  
>  	/* Shut down all active vif entries */
> -	for (i = 0; i < mrt->maxvif; i++) {
> -		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
> -			continue;
> -		vif_delete(mrt, i, 0, &list);
> +	if (flags & (MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC)) {
> +		for (i = 0; i < mrt->maxvif; i++) {
> +			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
> +			     !(flags & MRT_FLUSH_VIFS_STATIC)) ||
> +			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT_FLUSH_VIFS)))
> +				continue;
> +			vif_delete(mrt, i, 0, &list);
> +		}
> +		unregister_netdevice_many(&list);
>  	}
> -	unregister_netdevice_many(&list);
>  
>  	/* Wipe the cache */
> -	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
> -		if (!all && (c->mfc_flags & MFC_STATIC))
> -			continue;
> -		rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
> -		list_del_rcu(&c->list);
> -		cache = (struct mfc_cache *)c;
> -		call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
> -					      mrt->id);
> -		mroute_netlink_event(mrt, cache, RTM_DELROUTE);
> -		mr_cache_put(c);
> -	}
> -
> -	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
> -		spin_lock_bh(&mfc_unres_lock);
> -		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
> -			list_del(&c->list);
> +	if (flags & (MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC)) {
> +		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
> +			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC_STATIC)) ||
> +			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC)))
> +				continue;
> +			rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
> +			list_del_rcu(&c->list);
>  			cache = (struct mfc_cache *)c;
> +			call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
> +						      mrt->id);
>  			mroute_netlink_event(mrt, cache, RTM_DELROUTE);
> -			ipmr_destroy_unres(mrt, cache);
> +			mr_cache_put(c);
> +		}
> +	}
> +
> +	if (flags & MRT_FLUSH_MFC) {
> +		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
> +			spin_lock_bh(&mfc_unres_lock);
> +			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
> +				list_del(&c->list);
> +				cache = (struct mfc_cache *)c;
> +				mroute_netlink_event(mrt, cache, RTM_DELROUTE);
> +				ipmr_destroy_unres(mrt, cache);
> +			}
> +			spin_unlock_bh(&mfc_unres_lock);
>  		}
> -		spin_unlock_bh(&mfc_unres_lock);
>  	}
>  }
>  
> @@ -1354,7 +1364,7 @@ static void mrtsock_destruct(struct sock *sk)
>  						    NETCONFA_IFINDEX_ALL,
>  						    net->ipv4.devconf_all);
>  			RCU_INIT_POINTER(mrt->mroute_sk, NULL);
> -			mroute_clean_tables(mrt, false);
> +			mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_MFC);
>  		}
>  	}
>  	rtnl_unlock();
> @@ -1479,6 +1489,17 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
>  					   sk == rtnl_dereference(mrt->mroute_sk),
>  					   parent);
>  		break;
> +	case MRT_FLUSH:
> +		if (optlen != sizeof(val)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +		if (get_user(val, (int __user *)optval)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		mroute_clean_tables(mrt, val);
> +		break;
>  	/* Control PIM assert. */
>  	case MRT_ASSERT:
>  		if (optlen != sizeof(val)) {
> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> index cc01aa3f2b5e..b67a7c1e3615 100644
> --- a/net/ipv6/ip6mr.c
> +++ b/net/ipv6/ip6mr.c
> @@ -97,7 +97,7 @@ static void mr6_netlink_event(struct mr_table *mrt, struct mfc6_cache *mfc,
>  static void mrt6msg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
>  static int ip6mr_rtm_dumproute(struct sk_buff *skb,
>  			       struct netlink_callback *cb);
> -static void mroute_clean_tables(struct mr_table *mrt, bool all);
> +static void mroute_clean_tables(struct mr_table *mrt, int flags);
>  static void ipmr_expire_process(struct timer_list *t);
>  
>  #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
> @@ -393,7 +393,8 @@ static struct mr_table *ip6mr_new_table(struct net *net, u32 id)
>  static void ip6mr_free_table(struct mr_table *mrt)
>  {
>  	del_timer_sync(&mrt->ipmr_expire_timer);
> -	mroute_clean_tables(mrt, true);
> +	mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC |
> +					  MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC);
>  	rhltable_destroy(&mrt->mfc_hash);
>  	kfree(mrt);
>  }
> @@ -1496,42 +1497,51 @@ static int ip6mr_mfc_add(struct net *net, struct mr_table *mrt,
>   *	Close the multicast socket, and clear the vif tables etc
>   */
>  
> -static void mroute_clean_tables(struct mr_table *mrt, bool all)
> +static void mroute_clean_tables(struct mr_table *mrt, int flags)
>  {
>  	struct mr_mfc *c, *tmp;
>  	LIST_HEAD(list);
>  	int i;
>  
>  	/* Shut down all active vif entries */
> -	for (i = 0; i < mrt->maxvif; i++) {
> -		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
> -			continue;
> -		mif6_delete(mrt, i, 0, &list);
> +	if (flags & (MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC)) {
> +		for (i = 0; i < mrt->maxvif; i++) {
> +			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
> +			     !(flags & MRT6_FLUSH_VIFS_STATIC)) ||
> +			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT6_FLUSH_VIFS)))
> +				continue;
> +			mif6_delete(mrt, i, 0, &list);
> +		}
> +		unregister_netdevice_many(&list);
>  	}
> -	unregister_netdevice_many(&list);
>  
>  	/* Wipe the cache */
> -	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
> -		if (!all && (c->mfc_flags & MFC_STATIC))
> -			continue;
> -		rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
> -		list_del_rcu(&c->list);
> -		call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
> -					       FIB_EVENT_ENTRY_DEL,
> -					       (struct mfc6_cache *)c, mrt->id);
> -		mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
> -		mr_cache_put(c);
> +	if (flags & (MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC)) {
> +		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
> +			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC_STATIC)) ||
> +			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC)))
> +				continue;
> +			rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
> +			list_del_rcu(&c->list);
> +			call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
> +						       FIB_EVENT_ENTRY_DEL,
> +										   (struct mfc6_cache *)c, mrt->id);
> +			mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
> +			mr_cache_put(c);
> +		}
>  	}
>  
> -	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
> -		spin_lock_bh(&mfc_unres_lock);
> -		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
> -			list_del(&c->list);
> -			mr6_netlink_event(mrt, (struct mfc6_cache *)c,
> -					  RTM_DELROUTE);
> -			ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
> +	if (flags & MRT6_FLUSH_MFC) {
> +		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
> +			spin_lock_bh(&mfc_unres_lock);
> +			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
> +				list_del(&c->list);
> +				mr6_netlink_event(mrt, (struct mfc6_cache *)c,
> +						  RTM_DELROUTE);
> +				ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
> +			}
> +			spin_unlock_bh(&mfc_unres_lock);
>  		}
> -		spin_unlock_bh(&mfc_unres_lock);
>  	}
>  }
>  
> @@ -1587,7 +1597,7 @@ int ip6mr_sk_done(struct sock *sk)
>  						     NETCONFA_IFINDEX_ALL,
>  						     net->ipv6.devconf_all);
>  
> -			mroute_clean_tables(mrt, false);
> +			mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_MFC);
>  			err = 0;
>  			break;
>  		}
> @@ -1703,6 +1713,20 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
>  		rtnl_unlock();
>  		return ret;
>  
> +	case MRT6_FLUSH:
> +	{
> +		int flags;
> +
> +		if (optlen != sizeof(flags))
> +			return -EINVAL;
> +		if (get_user(flags, (int __user *)optval))
> +			return -EFAULT;
> +		rtnl_lock();
> +		mroute_clean_tables(mrt, flags);
> +		rtnl_unlock();
> +		return 0;
> +	}
> +
>  	/*
>  	 *	Control PIM assert (to activate pim will activate assert)
>  	 */
> 


^ permalink raw reply

* Re: [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
From: Cong Wang @ 2019-02-16  0:24 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Ido Schimmel, Linux Kernel Network Developers, Jamal Hadi Salim,
	Jiri Pirko, David Miller
In-Reply-To: <20190215121120.4971-1-vladbu@mellanox.com>

On Fri, Feb 15, 2019 at 4:11 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Check that filter is not NULL before passing it to tcf_walker->fn()
> callback. This can happen when mall_change() failed to offload filter to
> hardware.
>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> ---
>  net/sched/cls_matchall.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
> index a37137430e61..1f9d481b0fbb 100644
> --- a/net/sched/cls_matchall.c
> +++ b/net/sched/cls_matchall.c
> @@ -247,6 +247,9 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg,
>
>         if (arg->count < arg->skip)
>                 goto skip;
> +
> +       if (!head)
> +               return;

So head==NULL still counts one given that you check NULL after
checking arg->count. Is this expected?
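
To make the ordering question concrete, here is a small userspace model of
the two possible orders of the checks. It is a sketch under assumptions: the
walker struct is a simplified, hypothetical stand-in for struct tcf_walker,
and the "skip" label incrementing the count is assumed from the usual walker
pattern, since that part of mall_walk() is outside the quoted hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct tcf_walker. */
struct walker {
	int skip;  /* number of entries to skip before calling the callback */
	int count; /* entries counted so far */
	int calls; /* times the callback would have run (illustration only) */
};

/* Order used in the quoted hunk: skip check first, then NULL check. */
static void walk_skip_first(struct walker *arg, const void *head)
{
	if (arg->count < arg->skip)
		goto skip;
	if (!head)
		return;
	arg->calls++;       /* stands in for arg->fn(tp, head, arg) */
skip:
	arg->count++;       /* assumed behaviour of the code after the hunk */
}

/* Alternative order: NULL check first, so a NULL head is never counted. */
static void walk_null_first(struct walker *arg, const void *head)
{
	if (!head)
		return;
	if (arg->count < arg->skip)
		goto skip;
	arg->calls++;
skip:
	arg->count++;
}
```

With skip = 1 and a NULL head, walk_skip_first() leaves count at 1 while
walk_null_first() leaves it at 0. Which of the two is intended is exactly the
question raised above; the sketch only shows that the results differ.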

^ permalink raw reply

* Re: [PATCH bpf-next v2] bpf: make LWTUNNEL_BPF dependent on INET
From: Daniel Borkmann @ 2019-02-16  0:16 UTC (permalink / raw)
  To: Randy Dunlap, Peter Oskolkov, Alexei Starovoitov, netdev; +Cc: Peter Oskolkov
In-Reply-To: <7bbc8f63-3033-44ba-5b8e-ae597677bbe6@infradead.org>

On 02/15/2019 11:55 PM, Randy Dunlap wrote:
> On 2/15/19 9:51 AM, Peter Oskolkov wrote:
>> Lightweight tunnels are L3 constructs that are used with IP/IP6.
>>
>> For example, lwtunnel_xmit is called from ip_output.c and
>> ip6_output.c only.
>>
>> Make the dependency explicit at least for LWT-BPF, as now they
>> call into IP routing.
>>
>> V2: added "Reported-by" below.
>>
>> Reported-by: Randy Dunlap <rdunlap@infradead.org>
>> Signed-off-by: Peter Oskolkov <posk@google.com>
> 
> Yes, that works.  Thanks.
> 
> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested

Applied, thanks!

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: stmmac: Fix NAPI poll in TX path when in multi-queue
From: Florian Fainelli @ 2019-02-16  0:35 UTC (permalink / raw)
  To: Jose Abreu, netdev, linux-kernel
  Cc: Joao Pinto, David S . Miller, Giuseppe Cavallaro,
	Alexandre Torgue
In-Reply-To: <76da803a662214b024bfdb95731c38e45aef426d.1550237884.git.joabreu@synopsys.com>

On 2/15/19 5:42 AM, Jose Abreu wrote:
> Commit 8fce33317023 introduced the concept of NAPI per-channel and
> independent cleaning of TX path.
> 
> This is currently breaking performance in some cases. The scenario
> happens when all packets are being received in Queue 0 but the TX is
> performed in Queue != 0.
> 
> Fix this by using different NAPI instances per each TX and RX queue, as
> suggested by Florian.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> ---

[snip]

> -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> -		int stat;
> +	priv->xstats.napi_poll++;
>  
> +	work_done = stmmac_tx_clean(priv, budget, chan);
> +	if (work_done < budget && napi_complete_done(napi, work_done))

You should not bound your TX queue cleaning by the NAPI budget; it
should run unbounded and clean as much as it can, which could be the
entire ring size if that is how many packets you pushed between
interrupts. That could be the cause of poor performance as well.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v2 0/9] net: Get rid of switchdev_port_attr_get()
From: Florian Fainelli @ 2019-02-16  0:37 UTC (permalink / raw)
  To: netdev, David S. Miller
  Cc: Ido Schimmel, open list, open list:STAGING SUBSYSTEM,
	moderated list:ETHERNET BRIDGE, jiri, andrew, vivien.didelot
In-Reply-To: <20190215225313.32303-1-f.fainelli@gmail.com>

On 2/15/19 2:53 PM, Florian Fainelli wrote:
> Hi all,
> 
> This patch series splits the removal of the switchdev_ops that was
> proposed a few times before and first tackles the easy part which is the
> removal of the single call to switchdev_port_attr_get() within the
> bridge code.
> 
> As suggested by Ido, this patch series adds a
> SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS which is used in the same
> context as the caller of switchdev_port_attr_set(), so not deferred, and
> then the operation is carried out in deferred context with setting a
> support bridge port flag.
> 
> Follow-up patches will do the switchdev_ops removal after introducing
> the proper helpers for the switchdev blocking notifier to work across
> stacked devices (unlike the previous submissions).

David, please ignore this version, I will repost one that actually
builds, need to keep mangling with my kernel configuration and keep
those drivers enabled...

> 
> Changes in v2:
> 
> - differentiate callers not supporting switchdev_port_attr_set() from
>   the driver not being able to support specific bridge flags
> 
> - pass "mask" instead of "flags" for the PRE_BRIDGE_FLAGS check
> 
> - skip prepare phase for PRE_BRIDGE_FLAGS
> 
> - corrected documentation a bit more
> 
> - tested bridge_vlan_aware.sh with veth/VRF
> 
> Florian Fainelli (9):
>   Documentation: networking: switchdev: Update port parent ID section
>   net: switchdev: Add PORT_PRE_BRIDGE_FLAGS
>   mlxsw: spectrum: Handle PORT_PRE_BRIDGE_FLAGS
>   staging: fsl-dpaa2: ethsw: Handle PORT_PRE_BRIDGE_FLAGS
>   net: dsa: Add setter for SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS
>   rocker: Check Handle PORT_PRE_BRIDGE_FLAGS
>   net: bridge: Stop calling switchdev_port_attr_get()
>   net: Remove SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT
>   net: Get rid of switchdev_port_attr_get()
> 
>  Documentation/networking/switchdev.txt        | 16 ++--
>  .../mellanox/mlxsw/spectrum_switchdev.c       | 38 +++++-----
>  drivers/net/ethernet/rocker/rocker_main.c     | 75 +++++++++++--------
>  drivers/staging/fsl-dpaa2/ethsw/ethsw.c       | 32 ++++----
>  include/net/switchdev.h                       | 13 +---
>  net/bridge/br_switchdev.c                     | 11 ++-
>  net/dsa/dsa_priv.h                            |  6 ++
>  net/dsa/port.c                                | 17 +++++
>  net/dsa/slave.c                               | 24 +++---
>  9 files changed, 128 insertions(+), 104 deletions(-)
> 


-- 
Florian

^ permalink raw reply

* [PATCH net-next] ip_tunnel: Fix DST_METADATA dst_entry handle in tnl_update_pmtu
From: wenxu @ 2019-02-16  0:58 UTC (permalink / raw)
  To: davem, rong.a.chen, netdev; +Cc: sfr, lkp

From: wenxu <wenxu@ucloud.cn>

BUG report in selftests: bpf: test_tunnel.sh

Testing IPIP tunnel...
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 0 P4D 0
Oops: 0010 [#1] SMP PTI
CPU: 0 PID: 16822 Comm: ping Not tainted 5.0.0-rc3-00352-gc8b34e6 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:          (null)
Code: Bad RIP value.
RSP: 0018:ffffc9000104f9c8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffe8ffffc071a8 RCX: 0000000000000000
RDX: ffff888054e33000 RSI: ffff88807796f500 RDI: ffffe8ffffc07130
RBP: ffff88807796f500 R08: ffff88806da4f0a0 R09: 0000000000000000
R10: 0000000000000004 R11: ffff888054e33000 R12: 0000000000000054
R13: ffff88805e714000 R14: ffff88806da4f0a0 R15: 0000000000000000
FS:  00007f4c00431500(0000) GS:ffff88813fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000008276e000 CR4: 00000000000406f0
Call Trace:
 ? tnl_update_pmtu+0x21b/0x250 [ip_tunnel]
 ? ip_md_tunnel_xmit+0x1b7/0xdc0 [ip_tunnel]
 ? ipip_tunnel_xmit+0x90/0xc0 [ipip]
 ? dev_hard_start_xmit+0x98/0x210
 ? __dev_queue_xmit+0x6a9/0x8e0

The bpf program sets tunnel_key through bpf_skb_set_tunnel_key(), which
drops the old dst_entry and creates a DST_METADATA dst_entry. As a result,
tnl_update_pmtu() operates on a dst_entry that is not valid for PMTU
handling. So check that the dst_entry is valid before using it.
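
For illustration only (not part of the patch), here is a rough user-space
sketch of the check described above. The mock_* types and MOCK_DST_METADATA
are hypothetical stand-ins for the kernel structures; the only thing taken
from the kernel is the idea that a dst counts as valid when it exists and is
not a metadata dst.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; names are illustrative only. */
#define MOCK_DST_METADATA 0x1

struct mock_dst {
	unsigned int flags;	/* MOCK_DST_METADATA marks a metadata dst */
	unsigned int mtu;	/* path MTU carried by a real route */
};

struct mock_skb {
	struct mock_dst *dst;	/* may be NULL */
};

/* A dst is usable only if it exists and is not a metadata dst. */
static bool mock_skb_valid_dst(const struct mock_skb *skb)
{
	return skb->dst && !(skb->dst->flags & MOCK_DST_METADATA);
}

/* Fall back to the device MTU whenever the dst cannot be trusted. */
static unsigned int mock_pick_mtu(const struct mock_skb *skb,
				  unsigned int dev_mtu)
{
	assert(skb != NULL);
	return mock_skb_valid_dst(skb) ? skb->dst->mtu : dev_mtu;
}
```

With this shape, a metadata dst and a missing dst are treated the same way:
both fall back to the device MTU instead of being dereferenced as a route.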

Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Signed-off-by: wenxu <wenxu@ucloud.cn>
---
 net/ipv4/ip_tunnel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 893f013..a665f11 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -515,7 +515,7 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
 		mtu = dst_mtu(&rt->dst) - dev->hard_header_len
 					- sizeof(struct iphdr) - tunnel_hlen;
 	else
-		mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
+		mtu = skb_valid_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
 
 	skb_dst_update_pmtu(skb, mtu);
 
@@ -530,7 +530,7 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
 	}
 #if IS_ENABLED(CONFIG_IPV6)
 	else if (skb->protocol == htons(ETH_P_IPV6)) {
-		struct rt6_info *rt6 = (struct rt6_info *)skb_dst(skb);
+		struct rt6_info *rt6 = skb_valid_dst(skb) ? (struct rt6_info *)skb_dst(skb) : NULL;
 		__be32 daddr;
 
 		daddr = md ? dst : tunnel->parms.iph.daddr;
-- 
1.8.3.1


^ permalink raw reply related

* Re: [MERGE HELP] cls_tcindex.c
From: Cong Wang @ 2019-02-16  1:31 UTC (permalink / raw)
  To: David Miller; +Cc: Linux Kernel Network Developers, Vlad Buslov
In-Reply-To: <20190215.124125.1781326145216021027.davem@davemloft.net>

On Fri, Feb 15, 2019 at 12:41 PM David Miller <davem@davemloft.net> wrote:
>
>
> I've merged net into net-next.
>
> The worst conflict was cls_tcindex.c as Cong's fixes collided heavily
> with Vlad's work.
>
> The interim solution I used for this merge was to revert back to RCU.
>
> Please take a look at what I did and send me followups because I am
> absolutely certain that some are necessary :-)))

Well, there is nothing fundamentally changed w.r.t. the race I have fixed.

In tcindex_delete(), after all of Vlad's patches, it still calls
tcf_queue_work() and therefore still races with call_rcu(). This has nothing
to do with whether it is RTNL or another mutex.

For the first memory leak, it is the same: nothing changes, tcindex_delete()
still skips r->res.class==0, so we still need to fix it.

For the remaining two memory leaks, it looks like you already merged the fix
(patch 3/3) correctly.

I will cherry-pick the first two patches for net-next and send them to you.

Thanks.

^ permalink raw reply

* Re: [PATCH bpf 1/2] bpf/test_run: fix unkillable BPF_PROG_TEST_RUN
From: Daniel Borkmann @ 2019-02-16  1:17 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev; +Cc: davem, ast, syzbot
In-Reply-To: <20190212234239.174386-1-sdf@google.com>

On 02/13/2019 12:42 AM, Stanislav Fomichev wrote:
> Syzbot found out that running BPF_PROG_TEST_RUN with repeat=0xffffffff
> makes process unkillable. The problem is that when CONFIG_PREEMPT is
> enabled, we never see need_resched() return true. This is due to the
> fact that preempt_enable() (which we do in bpf_test_run_one on each
> iteration) now handles resched if it's needed.
> 
> Let's disable preemption for the whole run, not per test. In this case
> we can properly see whether resched is needed.
> Let's also properly return -EINTR to the userspace in case of a signal
> interrupt.
> 
> See recent discussion:
> http://lore.kernel.org/netdev/CAH3MdRWHr4N8jei8jxDppXjmw-Nw=puNDLbu1dQOFQHxfU2onA@mail.gmail.com
> 
> I'll follow up with the same fix bpf_prog_test_run_flow_dissector in
> bpf-next.
> 
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  net/bpf/test_run.c | 45 ++++++++++++++++++++++++---------------------
>  1 file changed, 24 insertions(+), 21 deletions(-)
> 
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index fa2644d276ef..e31e1b20f7f4 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -13,27 +13,13 @@
>  #include <net/sock.h>
>  #include <net/tcp.h>
>  
> -static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx,
> -		struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE])
> -{
> -	u32 ret;
> -
> -	preempt_disable();
> -	rcu_read_lock();
> -	bpf_cgroup_storage_set(storage);
> -	ret = BPF_PROG_RUN(prog, ctx);
> -	rcu_read_unlock();
> -	preempt_enable();
> -
> -	return ret;
> -}
> -
> -static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *ret,
> -			u32 *time)
> +static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
> +			u32 *retval, u32 *time)
>  {
>  	struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = { 0 };
>  	enum bpf_cgroup_storage_type stype;
>  	u64 time_start, time_spent = 0;
> +	int ret = 0;
>  	u32 i;
>  
>  	for_each_cgroup_storage_type(stype) {
> @@ -48,25 +34,42 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *ret,
>  
>  	if (!repeat)
>  		repeat = 1;
> +
> +	rcu_read_lock();
> +	preempt_disable();
>  	time_start = ktime_get_ns();
>  	for (i = 0; i < repeat; i++) {
> -		*ret = bpf_test_run_one(prog, ctx, storage);
> +		bpf_cgroup_storage_set(storage);
> +		*retval = BPF_PROG_RUN(prog, ctx);
> +
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}

Wouldn't it be enough to just move the signal_pending() test to
the above as you did to actually fix the unkillable issue? For
CONFIG_PREEMPT the below need_resched() is never triggered as you
mention as preempt_enable() handles rescheduling internally in
this situation, so only moving it out should suffice.

The rationale for disabling preemption for the whole run is imho
a bit different, namely that you would not screw up the ktime
measurements due to rescheduling happening in between otherwise.

But then, once preemption is disabled for the whole run, is there
a need to move out the extra signal_pending() test (presumably as
need_resched() does not handle TIF_SIGPENDING but only TIF_NEED_RESCHED
but we still wouldn't get into an unkillable situation here, no)?

>  		if (need_resched()) {
> -			if (signal_pending(current))
> -				break;
>  			time_spent += ktime_get_ns() - time_start;
> +			preempt_enable();
> +			rcu_read_unlock();
> +
>  			cond_resched();
> +
> +			rcu_read_lock();
> +			preempt_disable();
>  			time_start = ktime_get_ns();
>  		}
>  	}
>  	time_spent += ktime_get_ns() - time_start;
> +	preempt_enable();
> +	rcu_read_unlock();
> +
>  	do_div(time_spent, repeat);
>  	*time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
>  
>  	for_each_cgroup_storage_type(stype)
>  		bpf_cgroup_storage_free(storage[stype]);
>  
> -	return 0;
> +	return ret;
>  }
>  
>  static int bpf_test_finish(const union bpf_attr *kattr,
> 


^ permalink raw reply

* [PATCH net-next] net: sgi: use GFP_ATOMIC under spin lock
From: Wei Yongjun @ 2019-02-16  1:48 UTC (permalink / raw)
  To: David S . Miller, Yang Wei, Luis Chamberlain, YueHaibing,
	Christoph Hellwig
  Cc: Wei Yongjun, netdev, kernel-janitors

The function meth_init_tx_ring() is called from meth_tx_timeout(),
in which spin_lock is held, so we should use GFP_ATOMIC instead.

Fixes: 8d4c28fbc284 ("meth: pass struct device to DMA API functions")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/net/ethernet/sgi/meth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sgi/meth.c b/drivers/net/ethernet/sgi/meth.c
index f425ab528224..f1271402ca21 100644
--- a/drivers/net/ethernet/sgi/meth.c
+++ b/drivers/net/ethernet/sgi/meth.c
@@ -214,7 +214,7 @@ static int meth_init_tx_ring(struct meth_private *priv)
 {
 	/* Init TX ring */
 	priv->tx_ring = dma_alloc_coherent(&priv->pdev->dev,
-			TX_RING_BUFFER_SIZE, &priv->tx_ring_dma, GFP_KERNEL);
+			TX_RING_BUFFER_SIZE, &priv->tx_ring_dma, GFP_ATOMIC);
 	if (!priv->tx_ring)
 		return -ENOMEM;




^ permalink raw reply related

* [pull request][net-next 00/13] Mellanox, BlueField SmartNIC
From: Saeed Mahameed @ 2019-02-16  1:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Hi Dave,

This series adds support for the Mellanox BlueField SmartNIC.
For more information please see tag log below.

Please note the merge commit of mlx5-next at the base of the pull request:
259fae5a2cff ("Merge branch 'mlx5-next' of git://git.kernel.org/.../mellanox/linux")

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit 259fae5a2cff72e19f82094fb73e2149f8d64396:

  Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (2019-02-15 16:45:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2019-02-15

for you to fetch changes up to c96692fb8f3d0a161c6892ab9cc51a5e9992ccf2:

  net/mlx5: E-Switch, Allow transition to offloads mode for ECPF (2019-02-15 17:25:58 -0800)

----------------------------------------------------------------
Support Mellanox BlueField SmartNIC (mlx5-updates-2019-02-15)

Bodong Wang says,

BlueField device is a multi-core ARM processor in a highly integrated
system on chip coupled with the ConnectX interconnect controller.
The BlueField device can be presented in one of two modes:

- SEPARATED_HOST: ARM processors as a separated and orthogonal host
  like any other external host in the multi-host virtualization model.
- EMBEDDED_CPU: ARM processors as Embedded CPU (EC) and part of the
  external hosts virtualization model.

While the existing driver already supports the device in separated_host
mode, this patch series focuses on the functionality of embedded_cpu
mode.

In embedded_cpu mode, the BlueField device exposes a regular network
controller PCI function on the BlueField host (e.g., x86). However, a
separate PCI function called Embedded CPU Physical Function (ECPF) is
also added on the ARM host side, where standard Linux distributions are
able to run on the ARM cores. Depending on the NV configuration from
firmware, the ECPF can be the e-switch manager and firmware pages supplier.
If the ECPF is configured as e-switch manager and page supplier, it takes
over the following responsibilities from the PF on the BlueField host:
- Owns, controls and manages all e-switch parts, and takes e-switch
  traffic by default. It also should perform ENABLE_HCA for the host
  PF just like a PF does for its VFs.
- Provides and manages the ICM host memory required for the HCA to
  store various contexts for itself, the PF and the VFs belonging to
  the e-switch it manages.

The PF on the BlueField host side is still responsible for:
- Controlling its own permanent MAC.
- PCI and SRIOV configuration, and performing ENABLE_HCA for its VFs.

The ECPF can also retrieve information about the external host it
controls, like host identifier, PCI BDF and number of virtual functions.
As these parameters may change dynamically, an event is sent to the
driver on the ECPF side when they do.

----------------------------------------------------------------
Bodong Wang (13):
      net/mlx5: Correctly set LAG mode for ECPF
      net/mlx5: E-Switch, Properly refer to the esw manager vport
      net/mlx5: E-Switch, Properly refer to host PF vport as other vport
      net/mlx5: E-Switch, Refactor offloads flow steering init/cleanup
      net/mlx5: E-Switch, Split VF and special vports for offloads mode
      net/mlx5: E-Switch, Use getter and iterator to access vport/rep
      net/mlx5: E-Switch, Add state to eswitch vport representors
      net/mlx5: E-Switch, Support load/unload reps of specific vport types
      net/mlx5: E-Switch, Centralize repersentor reg/unreg to eswitch driver
      net/mlx5: E-Switch, Assign a different position for uplink rep and vport
      net/mlx5: E-Switch, Consider ECPF vport depends on eswitch ownership
      net/mlx5: E-Switch, Load/unload VF reps according to event from host PF
      net/mlx5: E-Switch, Allow transition to offloads mode for ECPF

 drivers/infiniband/hw/mlx5/ib_rep.c                |  20 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  25 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  | 175 +++++---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  65 ++-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 470 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/lag.c      |   5 +
 drivers/net/ethernet/mellanox/mlx5/core/vport.c    |  10 +-
 include/linux/mlx5/driver.h                        |   5 +
 include/linux/mlx5/eswitch.h                       |  19 +-
 include/linux/mlx5/mlx5_ifc.h                      |   3 +-
 include/linux/mlx5/vport.h                         |  20 +-
 11 files changed, 617 insertions(+), 200 deletions(-)

^ permalink raw reply

* [net-next 01/13] net/mlx5: Correctly set LAG mode for ECPF
From: Saeed Mahameed @ 2019-02-16  1:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Bodong Wang, Saeed Mahameed
In-Reply-To: <20190216013452.21131-1-saeedm@mellanox.com>

From: Bodong Wang <bodong@mellanox.com>

When bonding is added, the driver assumes RoCE LAG if no VF is enabled.
This is not sufficient for ECPF, as VFs are enabled on the host PF
side. LAG should only choose RoCE mode when both slave devices meet the
conditions below:
 1. E-Switch offloads mode is NONE.
 2. No VF is enabled.
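
As a rough sketch of the two conditions above (user-space mock, not driver
code): struct mock_dev and the MOCK_SRIOV_* values are hypothetical
stand-ins for the mlx5 device and its eswitch mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the eswitch modes named in this patch. */
enum mock_esw_mode {
	MOCK_SRIOV_NONE,
	MOCK_SRIOV_LEGACY,
	MOCK_SRIOV_OFFLOADS,
};

struct mock_dev {
	bool sriov_enabled;		/* any VF enabled on this device */
	enum mock_esw_mode esw_mode;	/* current E-Switch mode */
};

/* One device qualifies only if both conditions hold for it. */
static bool mock_dev_allows_roce_lag(const struct mock_dev *dev)
{
	assert(dev != NULL);
	return !dev->sriov_enabled && dev->esw_mode == MOCK_SRIOV_NONE;
}

/* RoCE LAG is chosen only when both slave devices qualify. */
static bool mock_is_roce_lag(const struct mock_dev *dev0,
			     const struct mock_dev *dev1)
{
	return mock_dev_allows_roce_lag(dev0) &&
	       mock_dev_allows_roce_lag(dev1);
}
```

The point of the second condition is that an ECPF can have no VFs of its own
while its eswitch is still in a non-NONE mode, so checking VFs alone would
wrongly select RoCE LAG.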

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lag.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag.c
index 2d223385dc81..04c5aca7f8c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lag.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lag.c
@@ -343,6 +343,11 @@ static void mlx5_do_bond(struct mlx5_lag *ldev)
 		roce_lag = !mlx5_sriov_is_enabled(dev0) &&
 			   !mlx5_sriov_is_enabled(dev1);

+#ifdef CONFIG_MLX5_ESWITCH
+		roce_lag &= dev0->priv.eswitch->mode == SRIOV_NONE &&
+			    dev1->priv.eswitch->mode == SRIOV_NONE;
+#endif
+
 		if (roce_lag)
 			mlx5_lag_remove_ib_devices(ldev);

-- 
2.20.1

^ permalink raw reply related

* [net-next 03/13] net/mlx5: E-Switch, Properly refer to host PF vport as other vport
From: Saeed Mahameed @ 2019-02-16  1:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Bodong Wang, Eli Cohen, Saeed Mahameed
In-Reply-To: <20190216013452.21131-1-saeedm@mellanox.com>

From: Bodong Wang <bodong@mellanox.com>

Commands referring to vports use the following scheme:

1. When referring to my own vport, put 0 in vport and 0 in other_vport.
2. When referring to another vport, put the vport number of the
   referred vport and put 1 in other_vport. It was assumed that the
   driver is accessing another vport when the vport number is greater
   than 0.

With the above scheme, an ECPF eswitch manager trying to access the host
PF vport falls under case 1, since the vport number is 0. This is wrong,
as the driver is trying to refer to another vport.

As such usage can only happen in the eswitch context, change relevant
functions to provide other vport input properly.
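
A small user-space sketch of the difference, for illustration only: struct
mock_vport_cmd and both helpers are hypothetical and only model the two
command fields discussed above, not the real command layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two fields discussed in this patch. */
struct mock_vport_cmd {
	uint16_t vport_number;
	uint8_t other_vport;
};

/* Old scheme: other_vport is inferred from a non-zero vport number. */
static struct mock_vport_cmd mock_cmd_inferred(uint16_t vport)
{
	struct mock_vport_cmd cmd = {
		.vport_number = vport,
		.other_vport = vport ? 1 : 0,
	};

	return cmd;
}

/* New scheme: the caller states explicitly whether it means another vport. */
static struct mock_vport_cmd mock_cmd_explicit(uint16_t vport,
					       uint8_t other_vport)
{
	struct mock_vport_cmd cmd = {
		.vport_number = vport,
		.other_vport = other_vport,
	};

	assert(other_vport <= 1);
	return cmd;
}
```

With the inferred form, vport 0 can never be addressed as "another vport",
which is exactly the ECPF-to-host-PF case; the explicit form allows
vport_number 0 together with other_vport 1.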

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c  |  6 ++++--
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 13 ++++++-------
 drivers/net/ethernet/mellanox/mlx5/core/vport.c   | 10 ++++------
 include/linux/mlx5/vport.h                        |  4 ++--
 4 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 685f1975be58..f84889bbe2a0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1083,7 +1083,8 @@ static int mlx5e_vf_rep_open(struct net_device *dev)
 
 	if (!mlx5_modify_vport_admin_state(priv->mdev,
 					   MLX5_VPORT_STATE_OP_MOD_ESW_VPORT,
-					   rep->vport, MLX5_VPORT_ADMIN_STATE_UP))
+					   rep->vport, 1,
+					   MLX5_VPORT_ADMIN_STATE_UP))
 		netif_carrier_on(dev);
 
 unlock:
@@ -1101,7 +1102,8 @@ static int mlx5e_vf_rep_close(struct net_device *dev)
 	mutex_lock(&priv->state_lock);
 	mlx5_modify_vport_admin_state(priv->mdev,
 				      MLX5_VPORT_STATE_OP_MOD_ESW_VPORT,
-				      rep->vport, MLX5_VPORT_ADMIN_STATE_DOWN);
+				      rep->vport, 1,
+				      MLX5_VPORT_ADMIN_STATE_DOWN);
 	ret = mlx5e_close_locked(dev);
 	mutex_unlock(&priv->state_lock);
 	return ret;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 9c622749dbde..648c743cc947 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1462,7 +1462,7 @@ static void esw_apply_vport_conf(struct mlx5_eswitch *esw,
 
 	mlx5_modify_vport_admin_state(esw->dev,
 				      MLX5_VPORT_STATE_OP_MOD_ESW_VPORT,
-				      vport_num,
+				      vport_num, 1,
 				      vport->info.link_state);
 
 	/* Host PF has its own mac/guid. */
@@ -1581,10 +1581,10 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, int vport_num)
 	esw_vport_change_handle_locked(vport);
 	vport->enabled_events = 0;
 	esw_vport_disable_qos(esw, vport_num);
-	if (vport_num && esw->mode == SRIOV_LEGACY) {
+	if (esw->mode == SRIOV_LEGACY) {
 		mlx5_modify_vport_admin_state(esw->dev,
 					      MLX5_VPORT_STATE_OP_MOD_ESW_VPORT,
-					      vport_num,
+					      vport_num, 1,
 					      MLX5_VPORT_ADMIN_STATE_DOWN);
 		esw_vport_disable_egress_acl(esw, vport);
 		esw_vport_disable_ingress_acl(esw, vport);
@@ -1875,7 +1875,7 @@ int mlx5_eswitch_set_vport_state(struct mlx5_eswitch *esw,
 
 	err = mlx5_modify_vport_admin_state(esw->dev,
 					    MLX5_VPORT_STATE_OP_MOD_ESW_VPORT,
-					    vport, link_state);
+					    vport, 1, link_state);
 	if (err) {
 		mlx5_core_warn(esw->dev,
 			       "Failed to set vport %d link state, err = %d",
@@ -2137,7 +2137,7 @@ static int mlx5_eswitch_query_vport_drop_stats(struct mlx5_core_dev *dev,
 	    !MLX5_CAP_GEN(dev, transmit_discard_vport_down))
 		return 0;
 
-	err = mlx5_query_vport_down_stats(dev, vport_idx,
+	err = mlx5_query_vport_down_stats(dev, vport_idx, 1,
 					  &rx_discard_vport_down,
 					  &tx_discard_vport_down);
 	if (err)
@@ -2174,8 +2174,7 @@ int mlx5_eswitch_get_vport_stats(struct mlx5_eswitch *esw,
 		 MLX5_CMD_OP_QUERY_VPORT_COUNTER);
 	MLX5_SET(query_vport_counter_in, in, op_mod, 0);
 	MLX5_SET(query_vport_counter_in, in, vport_number, vport);
-	if (vport)
-		MLX5_SET(query_vport_counter_in, in, other_vport, 1);
+	MLX5_SET(query_vport_counter_in, in, other_vport, 1);
 
 	memset(out, 0, outlen);
 	err = mlx5_cmd_exec(esw->dev, in, sizeof(in), out, outlen);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 9a928eb48522..ef95feca9961 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -64,7 +64,7 @@ u8 mlx5_query_vport_state(struct mlx5_core_dev *mdev, u8 opmod, u16 vport)
 }
 
 int mlx5_modify_vport_admin_state(struct mlx5_core_dev *mdev, u8 opmod,
-				  u16 vport, u8 state)
+				  u16 vport, u8 other_vport, u8 state)
 {
 	u32 in[MLX5_ST_SZ_DW(modify_vport_state_in)]   = {0};
 	u32 out[MLX5_ST_SZ_DW(modify_vport_state_out)] = {0};
@@ -73,8 +73,7 @@ int mlx5_modify_vport_admin_state(struct mlx5_core_dev *mdev, u8 opmod,
 		 MLX5_CMD_OP_MODIFY_VPORT_STATE);
 	MLX5_SET(modify_vport_state_in, in, op_mod, opmod);
 	MLX5_SET(modify_vport_state_in, in, vport_number, vport);
-	if (vport)
-		MLX5_SET(modify_vport_state_in, in, other_vport, 1);
+	MLX5_SET(modify_vport_state_in, in, other_vport, other_vport);
 	MLX5_SET(modify_vport_state_in, in, admin_state, state);
 
 	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
@@ -1057,7 +1056,7 @@ int mlx5_core_query_vport_counter(struct mlx5_core_dev *dev, u8 other_vport,
 EXPORT_SYMBOL_GPL(mlx5_core_query_vport_counter);
 
 int mlx5_query_vport_down_stats(struct mlx5_core_dev *mdev, u16 vport,
-				u64 *rx_discard_vport_down,
+				u8 other_vport, u64 *rx_discard_vport_down,
 				u64 *tx_discard_vport_down)
 {
 	u32 out[MLX5_ST_SZ_DW(query_vnic_env_out)] = {0};
@@ -1068,8 +1067,7 @@ int mlx5_query_vport_down_stats(struct mlx5_core_dev *mdev, u16 vport,
 		 MLX5_CMD_OP_QUERY_VNIC_ENV);
 	MLX5_SET(query_vnic_env_in, in, op_mod, 0);
 	MLX5_SET(query_vnic_env_in, in, vport_number, vport);
-	if (vport)
-		MLX5_SET(query_vnic_env_in, in, other_vport, 1);
+	MLX5_SET(query_vnic_env_in, in, other_vport, other_vport);
 
 	err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
 	if (err)
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index b67bcc95ab5d..b7edcb1dadd8 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -59,7 +59,7 @@ enum {
 
 u8 mlx5_query_vport_state(struct mlx5_core_dev *mdev, u8 opmod, u16 vport);
 int mlx5_modify_vport_admin_state(struct mlx5_core_dev *mdev, u8 opmod,
-				  u16 vport, u8 state);
+				  u16 vport, u8 other_vport, u8 state);
 int mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev,
 				     u16 vport, u8 *addr);
 int mlx5_query_nic_vport_min_inline(struct mlx5_core_dev *mdev,
@@ -121,7 +121,7 @@ int mlx5_modify_nic_vport_vlans(struct mlx5_core_dev *dev,
 int mlx5_nic_vport_enable_roce(struct mlx5_core_dev *mdev);
 int mlx5_nic_vport_disable_roce(struct mlx5_core_dev *mdev);
 int mlx5_query_vport_down_stats(struct mlx5_core_dev *mdev, u16 vport,
-				u64 *rx_discard_vport_down,
+				u8 other_vport, u64 *rx_discard_vport_down,
 				u64 *tx_discard_vport_down);
 int mlx5_core_query_vport_counter(struct mlx5_core_dev *dev, u8 other_vport,
 				  int vf, u8 port_num, void *out,
-- 
2.20.1


^ permalink raw reply related

* [net-next 02/13] net/mlx5: E-Switch, Properly refer to the esw manager vport
From: Saeed Mahameed @ 2019-02-16  1:34 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Bodong Wang, Eli Cohen, Or Gerlitz, Saeed Mahameed
In-Reply-To: <20190216013452.21131-1-saeedm@mellanox.com>

From: Bodong Wang <bodong@mellanox.com>

In SmartNIC mode, the eswitch manager is not necessarily the PF
(vport 0). Use a helper function to get the correct eswitch manager
vport number and cache it on the eswitch instance for fast reference.
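
Roughly, the idea looks like the following user-space sketch. The mock_*
names are hypothetical; the two vport numbers (0x0 for the PF, 0xfffe for
the ECPF) are the ones this patch adds to vport.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Vport numbers as introduced by this patch; the MOCK_ prefix is ours. */
enum {
	MOCK_VPORT_PF   = 0x0,
	MOCK_VPORT_ECPF = 0xfffe,
};

/* Hypothetical stand-in for the eswitch instance. */
struct mock_eswitch {
	uint16_t manager_vport;	/* cached once at init time */
};

/* Resolve the manager vport once and cache it on the instance. */
static void mock_eswitch_init(struct mock_eswitch *esw,
			      bool ecpf_is_esw_manager)
{
	assert(esw != NULL);
	esw->manager_vport = ecpf_is_esw_manager ? MOCK_VPORT_ECPF
						 : MOCK_VPORT_PF;
}

/* Replaces the old "vport == 0 means manager" style of check. */
static bool mock_is_manager_vport(const struct mock_eswitch *esw,
				  uint16_t vport)
{
	return esw->manager_vport == vport;
}
```

Callers that used to test "!vport" would compare against the cached value
instead, so vport 0 is no longer treated as the manager when the ECPF
manages the eswitch.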

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 35 ++++++++++++-------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 10 ++++++
 .../mellanox/mlx5/core/eswitch_offloads.c     |  7 ++--
 include/linux/mlx5/vport.h                    |  2 ++
 4 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 05830696abd8..9c622749dbde 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -378,16 +378,16 @@ static int esw_add_uc_addr(struct mlx5_eswitch *esw, struct vport_addr *vaddr)
 	u16 vport = vaddr->vport;
 	int err;
 
-	/* Skip mlx5_mpfs_add_mac for PFs,
-	 * it is already done by the PF netdev in mlx5e_execute_l2_action
+	/* Skip mlx5_mpfs_add_mac for eswitch_managers,
+	 * it is already done by its netdev in mlx5e_execute_l2_action
 	 */
-	if (!vport)
+	if (esw->manager_vport == vport)
 		goto fdb_add;
 
 	err = mlx5_mpfs_add_mac(esw->dev, mac);
 	if (err) {
 		esw_warn(esw->dev,
-			 "Failed to add L2 table mac(%pM) for vport(%d), err(%d)\n",
+			 "Failed to add L2 table mac(%pM) for vport(0x%x), err(%d)\n",
 			 mac, vport, err);
 		return err;
 	}
@@ -410,10 +410,10 @@ static int esw_del_uc_addr(struct mlx5_eswitch *esw, struct vport_addr *vaddr)
 	u16 vport = vaddr->vport;
 	int err = 0;
 
-	/* Skip mlx5_mpfs_del_mac for PFs,
-	 * it is already done by the PF netdev in mlx5e_execute_l2_action
+	/* Skip mlx5_mpfs_del_mac for eswitch managers,
+	 * it is already done by its netdev in mlx5e_execute_l2_action
 	 */
-	if (!vport || !vaddr->mpfs)
+	if (!vaddr->mpfs || esw->manager_vport == vport)
 		goto fdb_del;
 
 	err = mlx5_mpfs_del_mac(esw->dev, mac);
@@ -1457,15 +1457,22 @@ static void esw_apply_vport_conf(struct mlx5_eswitch *esw,
 {
 	int vport_num = vport->vport;
 
-	if (!vport_num)
+	if (esw->manager_vport == vport_num)
 		return;
 
 	mlx5_modify_vport_admin_state(esw->dev,
 				      MLX5_VPORT_STATE_OP_MOD_ESW_VPORT,
 				      vport_num,
 				      vport->info.link_state);
-	mlx5_modify_nic_vport_mac_address(esw->dev, vport_num, vport->info.mac);
-	mlx5_modify_nic_vport_node_guid(esw->dev, vport_num, vport->info.node_guid);
+
+	/* Host PF has its own mac/guid. */
+	if (vport_num) {
+		mlx5_modify_nic_vport_mac_address(esw->dev, vport_num,
+						  vport->info.mac);
+		mlx5_modify_nic_vport_node_guid(esw->dev, vport_num,
+						vport->info.node_guid);
+	}
+
 	modify_esw_vport_cvlan(esw->dev, vport_num, vport->info.vlan, vport->info.qos,
 			       (vport->info.vlan || vport->info.qos));
 
@@ -1537,8 +1544,11 @@ static void esw_enable_vport(struct mlx5_eswitch *esw, int vport_num,
 	vport->enabled_events = enable_events;
 	vport->enabled = true;
 
-	/* only PF is trusted by default */
-	if (!vport_num)
+	/* Esw manager is trusted by default. Host PF (vport 0) is trusted as well
+	 * in smartNIC as it's a vport group manager.
+	 */
+	if (esw->manager_vport == vport_num ||
+	    (!vport_num && mlx5_core_is_ecpf(esw->dev)))
 		vport->info.trusted = true;
 
 	esw_vport_change_handle_locked(vport);
@@ -1733,6 +1743,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
 		return -ENOMEM;
 
 	esw->dev = dev;
+	esw->manager_vport = mlx5_eswitch_manager_vport(dev);
 
 	esw->work_queue = create_singlethread_workqueue("mlx5_esw_wq");
 	if (!esw->work_queue) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 0a3eee8746c1..959a9e28d08f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -38,6 +38,7 @@
 #include <net/devlink.h>
 #include <linux/mlx5/device.h>
 #include <linux/mlx5/eswitch.h>
+#include <linux/mlx5/vport.h>
 #include <linux/mlx5/fs.h>
 #include "lib/mpfs.h"
 
@@ -204,6 +205,7 @@ struct mlx5_eswitch {
 	struct mlx5_esw_offload offloads;
 	int                     mode;
 	int                     nvports;
+	u16                     manager_vport;
 };
 
 void esw_offloads_cleanup(struct mlx5_eswitch *esw, int nvports);
@@ -363,6 +365,14 @@ bool mlx5_esw_lag_prereq(struct mlx5_core_dev *dev0,
 
 #define esw_debug(dev, format, ...)				\
 	mlx5_core_dbg_mask(dev, MLX5_DEBUG_ESWITCH_MASK, format, ##__VA_ARGS__)
+
+/* The returned number is valid only when the dev is eswitch manager. */
+static inline u16 mlx5_eswitch_manager_vport(struct mlx5_core_dev *dev)
+{
+	return mlx5_core_is_ecpf_esw_manager(dev) ?
+		MLX5_VPORT_ECPF : MLX5_VPORT_PF;
+}
+
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 9128b45f3f37..af2c44d31357 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -522,7 +522,8 @@ mlx5_eswitch_add_send_to_vport_rule(struct mlx5_eswitch *esw, int vport, u32 sqn
 
 	misc = MLX5_ADDR_OF(fte_match_param, spec->match_value, misc_parameters);
 	MLX5_SET(fte_match_set_misc, misc, source_sqn, sqn);
-	MLX5_SET(fte_match_set_misc, misc, source_port, 0x0); /* source vport is 0 */
+	/* source vport is the esw manager */
+	MLX5_SET(fte_match_set_misc, misc, source_port, esw->manager_vport);
 
 	misc = MLX5_ADDR_OF(fte_match_param, spec->match_criteria, misc_parameters);
 	MLX5_SET_TO_ONES(fte_match_set_misc, misc, source_sqn);
@@ -567,7 +568,7 @@ static void peer_miss_rules_setup(struct mlx5_core_dev *peer_dev,
 			 source_eswitch_owner_vhca_id);
 
 	dest->type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-	dest->vport.num = 0;
+	dest->vport.num = peer_dev->priv.eswitch->manager_vport;
 	dest->vport.vhca_id = MLX5_CAP_GEN(peer_dev, vhca_id);
 	dest->vport.flags |= MLX5_FLOW_DEST_VPORT_VHCA_ID;
 }
@@ -666,7 +667,7 @@ static int esw_add_fdb_miss_rule(struct mlx5_eswitch *esw)
 	dmac_c[0] = 0x01;
 
 	dest.type = MLX5_FLOW_DESTINATION_TYPE_VPORT;
-	dest.vport.num = 0;
+	dest.vport.num = esw->manager_vport;
 	flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 
 	flow_rule = mlx5_add_flow_rules(esw->fdb_table.offloads.slow_fdb, spec,
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index 3bc05449ac39..b67bcc95ab5d 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -52,6 +52,8 @@ enum {
 };
 
 enum {
+	MLX5_VPORT_PF			= 0x0,
+	MLX5_VPORT_ECPF			= 0xfffe,
 	MLX5_VPORT_UPLINK		= 0xffff
 };
 
-- 
2.20.1


^ permalink raw reply related

* [net-next 05/13] net/mlx5: E-Switch, Split VF and special vports for offloads mode
From: Saeed Mahameed @ 2019-02-16  1:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Bodong Wang, Or Gerlitz, Saeed Mahameed
In-Reply-To: <20190216013452.21131-1-saeedm@mellanox.com>

From: Bodong Wang <bodong@mellanox.com>

When driver is entering offloads mode, there are two major tasks to
do: initialize flow steering and create representors. Flow steering
should make sure enough flow table/group spaces are reserved for all
reps. Representors will be created in a group, all or none.

With the introduction of ECPF, flow steering should still reserve the
same spaces. But the representors are not always loaded/unloaded all
at once. Once ECPF is in offloads mode, it will receive an event from
the host PF when the number of VFs changes. In such a scenario, only the
VF reps should be loaded/unloaded, not the reps for special vports (such
as the uplink vport).

Thus, when entering offloads mode, driver should specify the total
number of reps, and the number of VF reps separately. When leaving
offloads mode, the cleanup should use the information self-contained
in eswitch such as number of VFs.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  7 +--
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  5 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     | 57 +++++++++++++------
 include/linux/mlx5/vport.h                    |  1 +
 4 files changed, 48 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 648c743cc947..be6c2931d2a0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1641,7 +1641,8 @@ int mlx5_eswitch_enable_sriov(struct mlx5_eswitch *esw, int nvfs, int mode)
 	} else {
 		mlx5_reload_interface(esw->dev, MLX5_INTERFACE_PROTOCOL_ETH);
 		mlx5_reload_interface(esw->dev, MLX5_INTERFACE_PROTOCOL_IB);
-		err = esw_offloads_init(esw, nvfs + MLX5_SPECIAL_VPORTS);
+		err = esw_offloads_init(esw, nvfs,
+					nvfs + MLX5_SPECIAL_VPORTS);
 	}
 
 	if (err)
@@ -1683,7 +1684,6 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw)
 {
 	struct esw_mc_addr *mc_promisc;
 	int old_mode;
-	int nvports;
 	int i;
 
 	if (!ESW_ALLOWED(esw) || esw->mode == SRIOV_NONE)
@@ -1693,7 +1693,6 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw)
 		 esw->enabled_vports, esw->mode);
 
 	mc_promisc = &esw->mc_promisc;
-	nvports = esw->enabled_vports;
 
 	if (esw->mode == SRIOV_LEGACY)
 		mlx5_eq_notifier_unregister(esw->dev, &esw->nb);
@@ -1709,7 +1708,7 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw)
 	if (esw->mode == SRIOV_LEGACY)
 		esw_destroy_legacy_fdb_table(esw);
 	else if (esw->mode == SRIOV_OFFLOADS)
-		esw_offloads_cleanup(esw, nvports);
+		esw_offloads_cleanup(esw);
 
 	old_mode = esw->mode;
 	esw->mode = SRIOV_NONE;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 959a9e28d08f..fd845e6c44d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -208,8 +208,9 @@ struct mlx5_eswitch {
 	u16                     manager_vport;
 };
 
-void esw_offloads_cleanup(struct mlx5_eswitch *esw, int nvports);
-int esw_offloads_init(struct mlx5_eswitch *esw, int nvports);
+void esw_offloads_cleanup(struct mlx5_eswitch *esw);
+int esw_offloads_init(struct mlx5_eswitch *esw, int vf_nvports,
+		      int total_nvports);
 void esw_offloads_cleanup_reps(struct mlx5_eswitch *esw);
 int esw_offloads_init_reps(struct mlx5_eswitch *esw);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 19969d487a01..14f7ad67cfe4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -54,6 +54,8 @@ enum {
 #define fdb_prio_table(esw, chain, prio, level) \
 	(esw)->fdb_table.offloads.fdb_prio[(chain)][(prio)][(level)]
 
+#define UPLINK_REP_INDEX 0
+
 static struct mlx5_flow_table *
 esw_get_prio_table(struct mlx5_eswitch *esw, u32 chain, u16 prio, int level);
 static void
@@ -1239,19 +1241,28 @@ int esw_offloads_init_reps(struct mlx5_eswitch *esw)
 	return 0;
 }
 
+static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw,
+				      struct mlx5_eswitch_rep *rep, u8 rep_type)
+{
+	if (!rep->rep_if[rep_type].valid)
+		return;
+
+	rep->rep_if[rep_type].unload(rep);
+}
+
 static void esw_offloads_unload_reps_type(struct mlx5_eswitch *esw, int nvports,
 					  u8 rep_type)
 {
 	struct mlx5_eswitch_rep *rep;
 	int vport;
 
-	for (vport = nvports - 1; vport >= 0; vport--) {
+	for (vport = nvports; vport >= MLX5_VPORT_FIRST_VF; vport--) {
 		rep = &esw->offloads.vport_reps[vport];
-		if (!rep->rep_if[rep_type].valid)
-			continue;
-
-		rep->rep_if[rep_type].unload(rep);
+		__esw_offloads_unload_rep(esw, rep, rep_type);
 	}
+
+	rep = &esw->offloads.vport_reps[UPLINK_REP_INDEX];
+	__esw_offloads_unload_rep(esw, rep, rep_type);
 }
 
 static void esw_offloads_unload_reps(struct mlx5_eswitch *esw, int nvports)
@@ -1262,6 +1273,15 @@ static void esw_offloads_unload_reps(struct mlx5_eswitch *esw, int nvports)
 		esw_offloads_unload_reps_type(esw, nvports, rep_type);
 }
 
+static int __esw_offloads_load_rep(struct mlx5_eswitch *esw,
+				   struct mlx5_eswitch_rep *rep, u8 rep_type)
+{
+	if (!rep->rep_if[rep_type].valid)
+		return 0;
+
+	return rep->rep_if[rep_type].load(esw->dev, rep);
+}
+
 static int esw_offloads_load_reps_type(struct mlx5_eswitch *esw, int nvports,
 				       u8 rep_type)
 {
@@ -1269,12 +1289,14 @@ static int esw_offloads_load_reps_type(struct mlx5_eswitch *esw, int nvports,
 	int vport;
 	int err;
 
-	for (vport = 0; vport < nvports; vport++) {
-		rep = &esw->offloads.vport_reps[vport];
-		if (!rep->rep_if[rep_type].valid)
-			continue;
+	rep = &esw->offloads.vport_reps[UPLINK_REP_INDEX];
+	err = __esw_offloads_load_rep(esw, rep, rep_type);
+	if (err)
+		goto out;
 
-		err = rep->rep_if[rep_type].load(esw->dev, rep);
+	for (vport = MLX5_VPORT_FIRST_VF; vport <= nvports; vport++) {
+		rep = &esw->offloads.vport_reps[vport];
+		err = __esw_offloads_load_rep(esw, rep, rep_type);
 		if (err)
 			goto err_reps;
 	}
@@ -1283,6 +1305,7 @@ static int esw_offloads_load_reps_type(struct mlx5_eswitch *esw, int nvports,
 
 err_reps:
 	esw_offloads_unload_reps_type(esw, vport, rep_type);
+out:
 	return err;
 }
 
@@ -1440,17 +1463,18 @@ static void esw_offloads_steering_cleanup(struct mlx5_eswitch *esw)
 	esw_destroy_offloads_fdb_tables(esw);
 }
 
-int esw_offloads_init(struct mlx5_eswitch *esw, int nvports)
+int esw_offloads_init(struct mlx5_eswitch *esw, int vf_nvports,
+		      int total_nvports)
 {
 	int err;
 
 	mutex_init(&esw->fdb_table.offloads.fdb_prio_lock);
 
-	err = esw_offloads_steering_init(esw, nvports);
+	err = esw_offloads_steering_init(esw, total_nvports);
 	if (err)
 		return err;
 
-	err = esw_offloads_load_reps(esw, nvports);
+	err = esw_offloads_load_reps(esw, vf_nvports);
 	if (err)
 		goto err_reps;
 
@@ -1481,10 +1505,12 @@ static int esw_offloads_stop(struct mlx5_eswitch *esw,
 	return err;
 }
 
-void esw_offloads_cleanup(struct mlx5_eswitch *esw, int nvports)
+void esw_offloads_cleanup(struct mlx5_eswitch *esw)
 {
+	u16 num_vfs = esw->dev->priv.sriov.num_vfs;
+
 	esw_offloads_devcom_cleanup(esw);
-	esw_offloads_unload_reps(esw, nvports);
+	esw_offloads_unload_reps(esw, num_vfs);
 	esw_offloads_steering_cleanup(esw);
 }
 
@@ -1822,7 +1848,6 @@ EXPORT_SYMBOL(mlx5_eswitch_unregister_vport_rep);
 
 void *mlx5_eswitch_get_uplink_priv(struct mlx5_eswitch *esw, u8 rep_type)
 {
-#define UPLINK_REP_INDEX 0
 	struct mlx5_esw_offload *offloads = &esw->offloads;
 	struct mlx5_eswitch_rep *rep;
 
diff --git a/include/linux/mlx5/vport.h b/include/linux/mlx5/vport.h
index b7edcb1dadd8..755aeea19e1c 100644
--- a/include/linux/mlx5/vport.h
+++ b/include/linux/mlx5/vport.h
@@ -53,6 +53,7 @@ enum {
 
 enum {
 	MLX5_VPORT_PF			= 0x0,
+	MLX5_VPORT_FIRST_VF		= 0x1,
 	MLX5_VPORT_ECPF			= 0xfffe,
 	MLX5_VPORT_UPLINK		= 0xffff
 };
-- 
2.20.1


^ permalink raw reply related
