Netdev List
* Re: [Patch net-next] net_sched: add reverse binding for tc class
From: Daniel Borkmann @ 2017-08-30 22:22 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <CAM_iQpU9B71C5cWs969k464T7HvNhcwps7NYNQx8Q3KCijOfnA@mail.gmail.com>

On 08/31/2017 12:01 AM, Cong Wang wrote:
> On Wed, Aug 30, 2017 at 2:48 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 08/30/2017 11:30 PM, Cong Wang wrote:
>> [...]
>>>
>>> Note, we still can NOT totally get rid of those class lookup in
>>> ->enqueue() because cgroup and flow filters have no way to determine
>>> the classid at setup time, they still have to go through dynamic lookup.
>>
>> [...]
>>>
>>> ---
>>>    include/net/sch_generic.h |  1 +
>>>    net/sched/cls_basic.c     |  9 +++++++
>>>    net/sched/cls_bpf.c       |  9 +++++++
>>
>> Same is for cls_bpf as well, so bind_class wouldn't work there
>> either as we could return dynamic classids. bind_class cannot
>> be added here, too.
>
> I think you are probably right, but the following code is
> misleading there:
>
>          if (tb[TCA_BPF_CLASSID]) {
>                  prog->res.classid = nla_get_u32(tb[TCA_BPF_CLASSID]);
>                  tcf_bind_filter(tp, &prog->res, base);
>          }
>
> If the classid is dynamic, why this tb[TCA_BPF_CLASSID]?

The prog->res.classid is the default one, but can be overridden
later depending on the specified program. After the program returns,
cls_bpf_classify() does the following (filter_res holds the return code):

	[...]
		if (filter_res == 0)
			continue;
		if (filter_res != -1) {
			res->class   = 0;
			res->classid = filter_res;
		} else {
			*res = prog->res;
		}
	[...]

Meaning in case of a match (-1), we use the default bound one,
but prog may as well return an alternative found classid if it
wants to. So both versions are possible.
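
The branch above can be restated as a small userspace sketch. This is only
an illustration of the logic quoted in this mail: struct tcf_result_sketch,
resolve_classid() and the SKETCH_* constants are hypothetical names, and
the struct models just the two fields discussed here, not the kernel's
real struct tcf_result.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's result struct (assumption:
 * only the two fields discussed in the thread are modelled). */
struct tcf_result_sketch {
	unsigned long class;	/* bound class handle, 0 if not bound */
	uint32_t classid;
};

enum { SKETCH_NO_MATCH = 0, SKETCH_MATCH = 1 };

/*
 * filter_res is the program's return code:
 *   0         -> no match, the caller moves on to the next program
 *  -1         -> match, use the default classid bound at setup time
 *   otherwise -> match, the value itself is a dynamic classid
 */
static int resolve_classid(int filter_res,
			   const struct tcf_result_sketch *bound,
			   struct tcf_result_sketch *res)
{
	if (filter_res == 0)
		return SKETCH_NO_MATCH;

	if (filter_res != -1) {
		res->class = 0;		/* dynamic: no pre-bound class */
		res->classid = (uint32_t)filter_res;
	} else {
		*res = *bound;		/* default from TCA_BPF_CLASSID */
	}
	return SKETCH_MATCH;
}
```

Because the dynamic path always clears class, a pre-bound class pointer is
only ever handed out together with the default classid.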

* Re: [PATCH net 0/9] net/sched: init failure fixes
From: David Miller @ 2017-08-30 22:26 UTC (permalink / raw)
  To: jhs; +Cc: nikolay, netdev, edumazet, xiyou.wangcong, jiri, roopa, lucasb
In-Reply-To: <58c39495-8667-d983-94ba-b1a242f56945@mojatatu.com>

From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: Wed, 30 Aug 2017 08:15:37 -0400

> On 17-08-30 05:48 AM, Nikolay Aleksandrov wrote:
>> Hi all,
>> I went over all qdiscs' init, destroy and reset callbacks and found
>> the
>> issues fixed in each patch. Mostly they are null pointer dereferences
>> due
>> to uninitialized timer (qdisc watchdog) or double frees due to
>> ->destroy
>> cleaning up a second time. There's more information in each patch.
>> I've tested these by either sending wrong attributes from user-spaces,
>> no
>> attributes or by simulating memory alloc failure where
>> applicable. Also
>> tried all of the qdiscs as a default qdisc.
>> Most of these bugs were present before commit 87b60cfacf9f, I've tried
>> to
>> include proper fixes tags in each patch.
>> I haven't included individual patch acks in the set, I'd appreciate it
>> if
>> you take another look and resend them.
>> 
> 
> 
> Hi Nik,
> 
> For all patches:
> 
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

Series applied, thanks Nikolay.
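
For readers skimming the thread, the two bug classes named in the cover
letter (touching state that was never set up, and freeing the same memory
twice) can be illustrated with a toy userspace sketch. It assumes that a
destroy callback may run after a failed init; struct toy_qdisc, toy_init()
and toy_destroy() are hypothetical and are not taken from any of the nine
patches.

```c
#include <stdlib.h>

struct toy_qdisc {
	int *table;		/* allocated in init */
	int timer_ready;	/* stands in for an initialised watchdog timer */
};

/* Set up everything destroy will touch before the first failure point. */
static int toy_init(struct toy_qdisc *q, int fail_alloc)
{
	q->table = NULL;
	q->timer_ready = 1;

	if (fail_alloc)		/* simulated allocation failure */
		return -1;

	q->table = calloc(16, sizeof(*q->table));
	return q->table ? 0 : -1;
}

/* Safe after a failed init and safe to call a second time. */
static void toy_destroy(struct toy_qdisc *q)
{
	if (q->timer_ready)
		q->timer_ready = 0;	/* "cancel" only a timer that exists */

	free(q->table);			/* free(NULL) is a no-op */
	q->table = NULL;		/* a repeated call cannot double free */
}
```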

* Re: [PATCH net-next] net: hns3: Fixes the wrong IS_ERR check on the returned phydev value
From: David Miller @ 2017-08-30 22:30 UTC (permalink / raw)
  To: salil.mehta; +Cc: yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel, linuxarm
In-Reply-To: <20170830110603.12372-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Wed, 30 Aug 2017 12:06:03 +0100

> This patch removes the wrong check being done for the phy device being
> returned by the mdiobus_get_phy() function. This function never returns
> the error pointers.
> 
> Fixes: 256727da7395 ("net: hns3: Add MDIO support to HNS3 Ethernet
> Driver for hip08 SoC")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Salil Mehta <salil.mehta@huawei.com>

Applied.
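
Why an IS_ERR()-style check is the wrong test for a function that reports
failure with NULL can be shown in a few lines of userspace C. This is a
sketch under assumptions: sketch_is_err() imitates the usual error-pointer
convention (the top 4095 addresses encode an errno), and toy_get_phy() and
toy_connect() are hypothetical stand-ins, not the hns3 code.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* True only for pointers that encode a negative errno value. */
static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct toy_phy {
	int addr;
};

/* Hypothetical lookup that reports "not found" with NULL. */
static struct toy_phy *toy_get_phy(struct toy_phy *bus[], size_t n, size_t addr)
{
	return addr < n ? bus[addr] : NULL;
}

/* Returns the PHY address, or -1 if no PHY is present at addr. */
static int toy_connect(struct toy_phy *bus[], size_t n, size_t addr)
{
	struct toy_phy *phy = toy_get_phy(bus, n, addr);

	/* NULL is not an error pointer, so an IS_ERR-style check would
	 * let it through and the dereference below would crash. */
	if (!phy)
		return -1;

	return phy->addr;
}
```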

* Re: [PATCH][next] qed: fix spelling mistake: "calescing" -> "coalescing"
From: David Miller @ 2017-08-30 22:32 UTC (permalink / raw)
  To: colin.king
  Cc: Yuval.Mintz, Ariel.Elior, everest-linux-l2, netdev, linux-kernel
In-Reply-To: <20170830114012.28260-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed, 30 Aug 2017 12:40:12 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in DP_NOTICE message
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

* Re: [PATCH][net-next][V3] bpf: test_maps: fix typos, "conenct" and "listeen"
From: David Miller @ 2017-08-30 22:32 UTC (permalink / raw)
  To: colin.king; +Cc: ast, daniel, shuah, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20170830171525.5688-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed, 30 Aug 2017 18:15:25 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to typos in printf error messages:
> "conenct" -> "connect"
> "listeen" -> "listen"
> 
> thanks to Daniel Borkmann for spotting one of these mistakes
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

* Re: [Patch net-next] net_sched: add reverse binding for tc class
From: Daniel Borkmann @ 2017-08-30 22:45 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <59A73AAE.509@iogearbox.net>

On 08/31/2017 12:22 AM, Daniel Borkmann wrote:
> On 08/31/2017 12:01 AM, Cong Wang wrote:
>> On Wed, Aug 30, 2017 at 2:48 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> On 08/30/2017 11:30 PM, Cong Wang wrote:
>>> [...]
>>>>
>>>> Note, we still can NOT totally get rid of those class lookup in
>>>> ->enqueue() because cgroup and flow filters have no way to determine
>>>> the classid at setup time, they still have to go through dynamic lookup.
>>>
>>> [...]
>>>>
>>>> ---
>>>>    include/net/sch_generic.h |  1 +
>>>>    net/sched/cls_basic.c     |  9 +++++++
>>>>    net/sched/cls_bpf.c       |  9 +++++++
>>>
>>> Same is for cls_bpf as well, so bind_class wouldn't work there
>>> either as we could return dynamic classids. bind_class cannot
>>> be added here, too.
>>
>> I think you are probably right, but the following code is
>> misleading there:
>>
>>          if (tb[TCA_BPF_CLASSID]) {
>>                  prog->res.classid = nla_get_u32(tb[TCA_BPF_CLASSID]);
>>                  tcf_bind_filter(tp, &prog->res, base);
>>          }
>>
>> If the classid is dynamic, why this tb[TCA_BPF_CLASSID]?
>
> The prog->res.classid is the default one, but can be overridden
> later depending on the specified program. cls_bpf_classify() does
> after prog return (filter_res holds return code):
>
>      [...]
>          if (filter_res == 0)
>              continue;
>          if (filter_res != -1) {
>              res->class   = 0;
>              res->classid = filter_res;
>          } else {
>              *res = prog->res;
>          }
>      [...]
>
> Meaning in case of a match (-1), we use the default bound one,
> but prog may as well return an alternative found classid if it
> wants to. So both versions are possible.

But even for that case your patch looks fine to me actually, since
for dynamic classid we set class to 0. No objections from my side
then.

* Re: [PATCH net] kcm: do not attach PF_KCM sockets to avoid deadlock
From: David Miller @ 2017-08-30 22:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tom
In-Reply-To: <1504110571.11498.120.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 30 Aug 2017 09:29:31 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> syzkaller had no problem to trigger a deadlock, attaching a KCM socket
> to another one (or itself). (original syzkaller report was a very
> confusing lockdep splat during a sendmsg())
> 
> It seems KCM claims to only support TCP, but no enforcement is done,
> so we might need to add additional checks.
> 
> Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>

Applied and queued up for -stable, thanks.

* Re: [PATCH net-next] xen-netfront: be more drop monitor friendly
From: David Miller @ 2017-08-30 23:01 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1504114378.11498.124.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 30 Aug 2017 10:32:58 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> xennet_start_xmit() might copy skb with inappropriate layout
> into a fresh one.
> 
> Old skb is freed, and at this point it is not a drop, but
> a consume. New skb will then be either consumed or dropped. 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

* [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Hi Dave,

The following changes provide GRE tunnel offloads for the mlx5 ethernet netdevice driver.

For more details please see tag log message below.
Please pull and let me know if there's any problem.

Note: this series doesn't conflict with the ongoing net mlx5 submission.

Thanks,
Saeed.

---

The following changes since commit 90774a93ef075b39e55d31fe56fc286d71a046ac:

  bpf: test_maps: fix typos, "conenct" and "listeen" (2017-08-30 15:32:16 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-GRE-Offload

for you to fetch changes up to 7b3722fa9ef647eb1ae6a60a5d46f7c67ab09a33:

  net/mlx5e: Support RSS for GRE tunneled packets (2017-08-31 01:54:15 +0300)

----------------------------------------------------------------
mlx5-updates-2017-08-31 (GRE Offloads support)

This series provides support for MPLS RSS, GRE TX offloads and GRE RSS.

The first patch, from Gal and Ariel, adds mlx5 driver support for the
ConnectX capability to perform IP version identification and matching in
order to distinguish between IPv4 and IPv6 without the need to specify the
encapsulation type, and thus to perform RSS on MPLS traffic automatically
without specifying the MPLS ethertype. This patch will also serve inner GRE
IPv4/6 classification for inner GRE RSS.

The 2nd patch, from Gal, adds TX offload support for GRE tunneled packets
by reporting the needed netdev features.

The 3rd patch, from Gal, adds GRE inner RSS support by creating the needed
device resources (steering tables/rules and traffic classifiers) to match
GRE traffic and perform RSS hashing on the inner headers.

Improvement:
Testing 8 TCP streams bandwidth over GRE:
    System: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    Before: 21.3 Gbps (Single RQ)
    Now   : 90.5 Gbps (RSS spread on 8 RQs)

Thanks,
Saeed.

----------------------------------------------------------------
Gal Pressman (3):
      net/mlx5e: Use IP version matching to classify IP traffic
      net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels
      net/mlx5e: Support RSS for GRE tunneled packets

 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  18 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    | 281 ++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 108 ++++++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |   4 +-
 include/linux/mlx5/mlx5_ifc.h                      |   2 +-
 6 files changed, 384 insertions(+), 40 deletions(-)

* [net-next 1/3] net/mlx5e: Use IP version matching to classify IP traffic
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Ariel Levkovich, Saeed Mahameed
In-Reply-To: <20170830230409.15176-1-saeedm@mellanox.com>

From: Gal Pressman <galp@mellanox.com>

This change adds the ability for flow steering to classify IPv4/6
packets with MPLS tag (Ethertype 0x8847 and 0x8848) as standard IP
packets and hit IPv4/6 classification steering rules.

Since IP packets with MPLS tag header have MPLS ethertype, they
missed the IPv4/6 ethertype rule and ended up hitting the default
filter forwarding all the packets to the same single RQ (No RSS).

Since our device is able to look past the MPLS tag and identify the
next protocol, we introduce this solution, which replaces ethertype
matching with the device's capability to perform IP version
identification and matching in order to distinguish between IPv4 and
IPv6.
Therefore, when the driver performs flow steering configuration on the
device, it will use IP version matching in IP classified rules instead
of ethertype matching, which will cause relevant MPLS tagged packets to
hit these rules as well.

If the device doesn't support IP version matching the driver will fall back
to use legacy ethertype matching in the steering as before.
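
As an illustration of the difference described above (not part of the
patch), here is a minimal userspace sketch of the two matching modes.
struct toy_pkt, rule_hits() and the SK_* constants are hypothetical; the
ip_version field stands for whatever the device parses past the MPLS tag,
and an etype of 0 is treated as a wildcard rule.

```c
#include <stdint.h>

#define SK_ETH_P_IP	 0x0800
#define SK_ETH_P_IPV6	 0x86DD
#define SK_ETH_P_MPLS_UC 0x8847

struct toy_pkt {
	uint16_t ethertype;	/* as seen in the L2 header */
	uint8_t  ip_version;	/* parsed by the device, 0 if not IP */
};

static uint8_t etype_to_ipv(uint16_t ethertype)
{
	if (ethertype == SK_ETH_P_IP)
		return 4;
	if (ethertype == SK_ETH_P_IPV6)
		return 6;
	return 0;
}

/*
 * Returns 1 if a rule built for rule_etype matches pkt.
 * hw_ipv_match models the "device supports IP version matching"
 * capability; without it the rule falls back to ethertype matching.
 */
static int rule_hits(const struct toy_pkt *pkt, uint16_t rule_etype,
		     int hw_ipv_match)
{
	uint8_t ipv = etype_to_ipv(rule_etype);

	if (!rule_etype)
		return 1;				/* wildcard rule */
	if (hw_ipv_match && ipv)
		return pkt->ip_version == ipv;		/* sees past MPLS */
	return pkt->ethertype == rule_etype;		/* legacy matching */
}
```

With ethertype matching, an MPLS-tagged IPv4 packet (ethertype 0x8847)
misses the IPv4 rule; with IP version matching it hits.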

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c | 33 ++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index eecbc6d4f51f..85e6226dacfb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -660,6 +660,17 @@ static struct {
 	},
 };
 
+static u8 mlx5e_etype_to_ipv(u16 ethertype)
+{
+	if (ethertype == ETH_P_IP)
+		return 4;
+
+	if (ethertype == ETH_P_IPV6)
+		return 6;
+
+	return 0;
+}
+
 static struct mlx5_flow_handle *
 mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 			struct mlx5_flow_table *ft,
@@ -667,10 +678,12 @@ mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 			u16 etype,
 			u8 proto)
 {
+	int match_ipv_outer = MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, ft_field_support.outer_ip_version);
 	MLX5_DECLARE_FLOW_ACT(flow_act);
 	struct mlx5_flow_handle *rule;
 	struct mlx5_flow_spec *spec;
 	int err = 0;
+	u8 ipv;
 
 	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
 	if (!spec)
@@ -681,7 +694,13 @@ mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.ip_protocol);
 		MLX5_SET(fte_match_param, spec->match_value, outer_headers.ip_protocol, proto);
 	}
-	if (etype) {
+
+	ipv = mlx5e_etype_to_ipv(etype);
+	if (match_ipv_outer && ipv) {
+		spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.ip_version);
+		MLX5_SET(fte_match_param, spec->match_value, outer_headers.ip_version, ipv);
+	} else if (etype) {
 		spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
 		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, outer_headers.ethertype);
 		MLX5_SET(fte_match_param, spec->match_value, outer_headers.ethertype, etype);
@@ -739,7 +758,9 @@ static int mlx5e_generate_ttc_table_rules(struct mlx5e_priv *priv)
 #define MLX5E_TTC_TABLE_SIZE	(MLX5E_TTC_GROUP1_SIZE +\
 				 MLX5E_TTC_GROUP2_SIZE +\
 				 MLX5E_TTC_GROUP3_SIZE)
-static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc)
+
+static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc,
+					 bool use_ipv)
 {
 	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
 	struct mlx5e_flow_table *ft = &ttc->ft;
@@ -761,7 +782,10 @@ static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc)
 	/* L4 Group */
 	mc = MLX5_ADDR_OF(create_flow_group_in, in, match_criteria);
 	MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ip_protocol);
-	MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ethertype);
+	if (use_ipv)
+		MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ip_version);
+	else
+		MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ethertype);
 	MLX5_SET_CFG(in, match_criteria_enable, MLX5_MATCH_OUTER_HEADERS);
 	MLX5_SET_CFG(in, start_flow_index, ix);
 	ix += MLX5E_TTC_GROUP1_SIZE;
@@ -812,6 +836,7 @@ void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv)
 
 int mlx5e_create_ttc_table(struct mlx5e_priv *priv)
 {
+	bool match_ipv_outer = MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, ft_field_support.outer_ip_version);
 	struct mlx5e_ttc_table *ttc = &priv->fs.ttc;
 	struct mlx5_flow_table_attr ft_attr = {};
 	struct mlx5e_flow_table *ft = &ttc->ft;
@@ -828,7 +853,7 @@ int mlx5e_create_ttc_table(struct mlx5e_priv *priv)
 		return err;
 	}
 
-	err = mlx5e_create_ttc_table_groups(ttc);
+	err = mlx5e_create_ttc_table_groups(ttc, match_ipv_outer);
 	if (err)
 		goto err;
 
-- 
2.13.0

* [net-next 2/3] net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Saeed Mahameed
In-Reply-To: <20170830230409.15176-1-saeedm@mellanox.com>

From: Gal Pressman <galp@mellanox.com>

Add TX offloads support for GRE tunneled packets by reporting the needed
netdev features.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 51 +++++++++++++++--------
 include/linux/mlx5/mlx5_ifc.h                     |  2 +-
 2 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index fdc2b92f020b..9475fb89a744 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3499,13 +3499,13 @@ static void mlx5e_del_vxlan_port(struct net_device *netdev,
 	mlx5e_vxlan_queue_work(priv, ti->sa_family, be16_to_cpu(ti->port), 0);
 }
 
-static netdev_features_t mlx5e_vxlan_features_check(struct mlx5e_priv *priv,
-						    struct sk_buff *skb,
-						    netdev_features_t features)
+static netdev_features_t mlx5e_tunnel_features_check(struct mlx5e_priv *priv,
+						     struct sk_buff *skb,
+						     netdev_features_t features)
 {
 	struct udphdr *udph;
-	u16 proto;
-	u16 port = 0;
+	u8 proto;
+	u16 port;
 
 	switch (vlan_get_protocol(skb)) {
 	case htons(ETH_P_IP):
@@ -3518,14 +3518,17 @@ static netdev_features_t mlx5e_vxlan_features_check(struct mlx5e_priv *priv,
 		goto out;
 	}
 
-	if (proto == IPPROTO_UDP) {
+	switch (proto) {
+	case IPPROTO_GRE:
+		return features;
+	case IPPROTO_UDP:
 		udph = udp_hdr(skb);
 		port = be16_to_cpu(udph->dest);
-	}
 
-	/* Verify if UDP port is being offloaded by HW */
-	if (port && mlx5e_vxlan_lookup_port(priv, port))
-		return features;
+		/* Verify if UDP port is being offloaded by HW */
+		if (mlx5e_vxlan_lookup_port(priv, port))
+			return features;
+	}
 
 out:
 	/* Disable CSUM and GSO if the udp dport is not offloaded by HW */
@@ -3549,7 +3552,7 @@ static netdev_features_t mlx5e_features_check(struct sk_buff *skb,
 	/* Validate if the tunneled packet is being offloaded by HW */
 	if (skb->encapsulation &&
 	    (features & NETIF_F_CSUM_MASK || features & NETIF_F_GSO_MASK))
-		return mlx5e_vxlan_features_check(priv, skb, features);
+		return mlx5e_tunnel_features_check(priv, skb, features);
 
 	return features;
 }
@@ -4014,20 +4017,32 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_RX;
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
-	if (mlx5e_vxlan_allowed(mdev)) {
-		netdev->hw_features     |= NETIF_F_GSO_UDP_TUNNEL |
-					   NETIF_F_GSO_UDP_TUNNEL_CSUM |
-					   NETIF_F_GSO_PARTIAL;
+	if (mlx5e_vxlan_allowed(mdev) || MLX5_CAP_ETH(mdev, tunnel_stateless_gre)) {
+		netdev->hw_features     |= NETIF_F_GSO_PARTIAL;
 		netdev->hw_enc_features |= NETIF_F_IP_CSUM;
 		netdev->hw_enc_features |= NETIF_F_IPV6_CSUM;
 		netdev->hw_enc_features |= NETIF_F_TSO;
 		netdev->hw_enc_features |= NETIF_F_TSO6;
-		netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
-		netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM |
-					   NETIF_F_GSO_PARTIAL;
+		netdev->hw_enc_features |= NETIF_F_GSO_PARTIAL;
+	}
+
+	if (mlx5e_vxlan_allowed(mdev)) {
+		netdev->hw_features     |= NETIF_F_GSO_UDP_TUNNEL |
+					   NETIF_F_GSO_UDP_TUNNEL_CSUM;
+		netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL |
+					   NETIF_F_GSO_UDP_TUNNEL_CSUM;
 		netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
 	}
 
+	if (MLX5_CAP_ETH(mdev, tunnel_stateless_gre)) {
+		netdev->hw_features     |= NETIF_F_GSO_GRE |
+					   NETIF_F_GSO_GRE_CSUM;
+		netdev->hw_enc_features |= NETIF_F_GSO_GRE |
+					   NETIF_F_GSO_GRE_CSUM;
+		netdev->gso_partial_features |= NETIF_F_GSO_GRE |
+						NETIF_F_GSO_GRE_CSUM;
+	}
+
 	mlx5_query_port_fcs(mdev, &fcs_supported, &fcs_enabled);
 
 	if (fcs_supported)
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index ae7d09b9c52f..3d5d32e5446c 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -602,7 +602,7 @@ struct mlx5_ifc_per_protocol_networking_offload_caps_bits {
 	u8         reserved_at_1a[0x1];
 	u8         tunnel_lso_const_out_ip_id[0x1];
 	u8         reserved_at_1c[0x2];
-	u8         tunnel_statless_gre[0x1];
+	u8         tunnel_stateless_gre[0x1];
 	u8         tunnel_stateless_vxlan[0x1];
 
 	u8         swp[0x1];
-- 
2.13.0

* [net-next 3/3] net/mlx5e: Support RSS for GRE tunneled packets
From: Saeed Mahameed @ 2017-08-30 23:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Gal Pressman, Saeed Mahameed
In-Reply-To: <20170830230409.15176-1-saeedm@mellanox.com>

From: Gal Pressman <galp@mellanox.com>

Introduce a new flow table and indirect TIRs which are used to hash the
inner packet headers of GRE tunneled packets.

When a GRE tunneled packet is received, the TTC flow table will match
the new IPv4/6->GRE rules which will forward it to the inner TTC table.
The inner TTC is similar to its counterpart outer TTC table, but
matching the inner packet headers instead of the outer ones (and does
not include the new IPv4/6->GRE rules).
The new rules will not add steering hops since they are added to an
already existing flow group which will be matched regardless of this
patch. Non GRE traffic will not be affected.

The inner flow table will forward the packet to inner indirect TIRs
which hash the inner packet and thus result in RSS for the tunneled
packets.
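
The effect of the two-stage lookup can be pictured with a toy userspace
model (an illustration only, not driver code). It assumes that without the
inner table the receive queue is picked from the outer headers alone, so
all streams of one tunnel land on the same RQ. struct toy_hdr, toy_hash()
and pick_rq() are hypothetical, and the hash is a simple polynomial mix,
not the hash the device computes.

```c
#include <stdint.h>

#define SK_IPPROTO_TCP 6
#define SK_IPPROTO_GRE 47

struct toy_hdr {
	uint32_t saddr, daddr;
	uint16_t sport, dport;	/* 0 when the protocol has no ports */
	uint8_t  proto;
};

struct toy_pkt {
	struct toy_hdr outer;
	struct toy_hdr inner;	/* only meaningful for tunneled packets */
};

/* Toy stand-in for the RSS hash. */
static uint32_t toy_hash(const struct toy_hdr *h)
{
	uint32_t v = h->saddr;

	v = v * 31 + h->daddr;
	v = v * 31 + h->sport;
	v = v * 31 + h->dport;
	v = v * 31 + h->proto;
	return v;
}

/*
 * Outer table: GRE packets are forwarded to the inner table when it is
 * available, and the inner headers are hashed there. Everything else is
 * hashed on the outer headers.
 */
static unsigned int pick_rq(const struct toy_pkt *p, int inner_ft,
			    unsigned int nrq)
{
	if (inner_ft && p->outer.proto == SK_IPPROTO_GRE)
		return toy_hash(&p->inner) % nrq;

	return toy_hash(&p->outer) % nrq;
}
```

Two TCP streams inside the same GRE tunnel share their outer headers, so
they can only be spread across RQs once the inner headers are hashed.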

Testing 8 TCP streams bandwidth over GRE:
System: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Before: 21.3 Gbps (Single RQ)
Now   : 90.5 Gbps (RSS spread on 8 RQs)

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  18 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    | 248 ++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  57 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |   4 +-
 5 files changed, 321 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0039b4725405..a31912415264 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -620,6 +620,12 @@ enum mlx5e_traffic_types {
 	MLX5E_NUM_INDIR_TIRS = MLX5E_TT_ANY,
 };
 
+enum mlx5e_tunnel_types {
+	MLX5E_TT_IPV4_GRE,
+	MLX5E_TT_IPV6_GRE,
+	MLX5E_NUM_TUNNEL_TT,
+};
+
 enum {
 	MLX5E_STATE_ASYNC_EVENTS_ENABLED,
 	MLX5E_STATE_OPENED,
@@ -679,6 +685,7 @@ struct mlx5e_l2_table {
 struct mlx5e_ttc_table {
 	struct mlx5e_flow_table  ft;
 	struct mlx5_flow_handle	 *rules[MLX5E_NUM_TT];
+	struct mlx5_flow_handle  *tunnel_rules[MLX5E_NUM_TUNNEL_TT];
 };
 
 #define ARFS_HASH_SHIFT BITS_PER_BYTE
@@ -711,6 +718,7 @@ enum {
 	MLX5E_VLAN_FT_LEVEL = 0,
 	MLX5E_L2_FT_LEVEL,
 	MLX5E_TTC_FT_LEVEL,
+	MLX5E_INNER_TTC_FT_LEVEL,
 	MLX5E_ARFS_FT_LEVEL
 };
 
@@ -736,6 +744,7 @@ struct mlx5e_flow_steering {
 	struct mlx5e_vlan_table         vlan;
 	struct mlx5e_l2_table           l2;
 	struct mlx5e_ttc_table          ttc;
+	struct mlx5e_ttc_table          inner_ttc;
 	struct mlx5e_arfs_tables        arfs;
 };
 
@@ -769,6 +778,7 @@ struct mlx5e_priv {
 	u32                        tisn[MLX5E_MAX_NUM_TC];
 	struct mlx5e_rqt           indir_rqt;
 	struct mlx5e_tir           indir_tir[MLX5E_NUM_INDIR_TIRS];
+	struct mlx5e_tir           inner_indir_tir[MLX5E_NUM_INDIR_TIRS];
 	struct mlx5e_tir           direct_tir[MLX5E_MAX_NUM_CHANNELS];
 	u32                        tx_rates[MLX5E_MAX_NUM_SQS];
 	int                        hard_mtu;
@@ -903,7 +913,7 @@ int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, int sz,
 		       struct mlx5e_redirect_rqt_param rrp);
 void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params *params,
 				    enum mlx5e_traffic_types tt,
-				    void *tirc);
+				    void *tirc, bool inner);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
@@ -932,6 +942,12 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params,
 void mlx5e_set_rq_type_params(struct mlx5_core_dev *mdev,
 			      struct mlx5e_params *params, u8 rq_type);
 
+static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
+{
+	return (MLX5_CAP_ETH(mdev, tunnel_stateless_gre) &&
+		MLX5_CAP_FLOWTABLE_NIC_RX(mdev, ft_field_support.inner_ip_version));
+}
+
 static inline
 struct mlx5e_tx_wqe *mlx5e_post_nop(struct mlx5_wq_cyc *wq, u32 sqn, u16 *pc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 0dd7e9caf150..c6ec90e9c95b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1212,9 +1212,18 @@ static void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen)
 
 	for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
 		memset(tirc, 0, ctxlen);
-		mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc);
+		mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, false);
 		mlx5_core_modify_tir(mdev, priv->indir_tir[tt].tirn, in, inlen);
 	}
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return;
+
+	for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
+		memset(tirc, 0, ctxlen);
+		mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, true);
+		mlx5_core_modify_tir(mdev, priv->inner_indir_tir[tt].tirn, in, inlen);
+	}
 }
 
 static int mlx5e_set_rxfh(struct net_device *dev, const u32 *indir,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index 85e6226dacfb..f11fd07ac4dd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -608,12 +608,21 @@ static void mlx5e_cleanup_ttc_rules(struct mlx5e_ttc_table *ttc)
 			ttc->rules[i] = NULL;
 		}
 	}
+
+	for (i = 0; i < MLX5E_NUM_TUNNEL_TT; i++) {
+		if (!IS_ERR_OR_NULL(ttc->tunnel_rules[i])) {
+			mlx5_del_flow_rules(ttc->tunnel_rules[i]);
+			ttc->tunnel_rules[i] = NULL;
+		}
+	}
 }
 
-static struct {
+struct mlx5e_etype_proto {
 	u16 etype;
 	u8 proto;
-} ttc_rules[] = {
+};
+
+static struct mlx5e_etype_proto ttc_rules[] = {
 	[MLX5E_TT_IPV4_TCP] = {
 		.etype = ETH_P_IP,
 		.proto = IPPROTO_TCP,
@@ -660,6 +669,17 @@ static struct {
 	},
 };
 
+static struct mlx5e_etype_proto ttc_tunnel_rules[] = {
+	[MLX5E_TT_IPV4_GRE] = {
+		.etype = ETH_P_IP,
+		.proto = IPPROTO_GRE,
+	},
+	[MLX5E_TT_IPV6_GRE] = {
+		.etype = ETH_P_IPV6,
+		.proto = IPPROTO_GRE,
+	},
+};
+
 static u8 mlx5e_etype_to_ipv(u16 ethertype)
 {
 	if (ethertype == ETH_P_IP)
@@ -742,6 +762,20 @@ static int mlx5e_generate_ttc_table_rules(struct mlx5e_priv *priv)
 			goto del_rules;
 	}
 
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return 0;
+
+	rules     = ttc->tunnel_rules;
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+	dest.ft   = priv->fs.inner_ttc.ft.t;
+	for (tt = 0; tt < MLX5E_NUM_TUNNEL_TT; tt++) {
+		rules[tt] = mlx5e_generate_ttc_rule(priv, ft, &dest,
+						    ttc_tunnel_rules[tt].etype,
+						    ttc_tunnel_rules[tt].proto);
+		if (IS_ERR(rules[tt]))
+			goto del_rules;
+	}
+
 	return 0;
 
 del_rules:
@@ -752,13 +786,21 @@ static int mlx5e_generate_ttc_table_rules(struct mlx5e_priv *priv)
 }
 
 #define MLX5E_TTC_NUM_GROUPS	3
-#define MLX5E_TTC_GROUP1_SIZE	BIT(3)
-#define MLX5E_TTC_GROUP2_SIZE	BIT(1)
-#define MLX5E_TTC_GROUP3_SIZE	BIT(0)
+#define MLX5E_TTC_GROUP1_SIZE	(BIT(3) + MLX5E_NUM_TUNNEL_TT)
+#define MLX5E_TTC_GROUP2_SIZE	 BIT(1)
+#define MLX5E_TTC_GROUP3_SIZE	 BIT(0)
 #define MLX5E_TTC_TABLE_SIZE	(MLX5E_TTC_GROUP1_SIZE +\
 				 MLX5E_TTC_GROUP2_SIZE +\
 				 MLX5E_TTC_GROUP3_SIZE)
 
+#define MLX5E_INNER_TTC_NUM_GROUPS	3
+#define MLX5E_INNER_TTC_GROUP1_SIZE	BIT(3)
+#define MLX5E_INNER_TTC_GROUP2_SIZE	BIT(1)
+#define MLX5E_INNER_TTC_GROUP3_SIZE	BIT(0)
+#define MLX5E_INNER_TTC_TABLE_SIZE	(MLX5E_INNER_TTC_GROUP1_SIZE +\
+					 MLX5E_INNER_TTC_GROUP2_SIZE +\
+					 MLX5E_INNER_TTC_GROUP3_SIZE)
+
 static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc,
 					 bool use_ipv)
 {
@@ -826,6 +868,190 @@ static int mlx5e_create_ttc_table_groups(struct mlx5e_ttc_table *ttc,
 	return err;
 }
 
+static struct mlx5_flow_handle *
+mlx5e_generate_inner_ttc_rule(struct mlx5e_priv *priv,
+			      struct mlx5_flow_table *ft,
+			      struct mlx5_flow_destination *dest,
+			      u16 etype, u8 proto)
+{
+	MLX5_DECLARE_FLOW_ACT(flow_act);
+	struct mlx5_flow_handle *rule;
+	struct mlx5_flow_spec *spec;
+	int err = 0;
+	u8 ipv;
+
+	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
+	if (!spec)
+		return ERR_PTR(-ENOMEM);
+
+	ipv = mlx5e_etype_to_ipv(etype);
+	if (etype && ipv) {
+		spec->match_criteria_enable = MLX5_MATCH_INNER_HEADERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, inner_headers.ip_version);
+		MLX5_SET(fte_match_param, spec->match_value, inner_headers.ip_version, ipv);
+	}
+
+	if (proto) {
+		spec->match_criteria_enable = MLX5_MATCH_INNER_HEADERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria, inner_headers.ip_protocol);
+		MLX5_SET(fte_match_param, spec->match_value, inner_headers.ip_protocol, proto);
+	}
+
+	rule = mlx5_add_flow_rules(ft, spec, &flow_act, dest, 1);
+	if (IS_ERR(rule)) {
+		err = PTR_ERR(rule);
+		netdev_err(priv->netdev, "%s: add rule failed\n", __func__);
+	}
+
+	kvfree(spec);
+	return err ? ERR_PTR(err) : rule;
+}
+
+static int mlx5e_generate_inner_ttc_table_rules(struct mlx5e_priv *priv)
+{
+	struct mlx5_flow_destination dest;
+	struct mlx5_flow_handle **rules;
+	struct mlx5e_ttc_table *ttc;
+	struct mlx5_flow_table *ft;
+	int err;
+	int tt;
+
+	ttc =  &priv->fs.inner_ttc;
+	ft = ttc->ft.t;
+	rules = ttc->rules;
+
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+	for (tt = 0; tt < MLX5E_NUM_TT; tt++) {
+		if (tt == MLX5E_TT_ANY)
+			dest.tir_num = priv->direct_tir[0].tirn;
+		else
+			dest.tir_num = priv->inner_indir_tir[tt].tirn;
+
+		rules[tt] = mlx5e_generate_inner_ttc_rule(priv, ft, &dest,
+							  ttc_rules[tt].etype,
+							  ttc_rules[tt].proto);
+		if (IS_ERR(rules[tt]))
+			goto del_rules;
+	}
+
+	return 0;
+
+del_rules:
+	err = PTR_ERR(rules[tt]);
+	rules[tt] = NULL;
+	mlx5e_cleanup_ttc_rules(ttc);
+	return err;
+}
+
+static int mlx5e_create_inner_ttc_table_groups(struct mlx5e_ttc_table *ttc)
+{
+	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+	struct mlx5e_flow_table *ft = &ttc->ft;
+	int ix = 0;
+	u32 *in;
+	int err;
+	u8 *mc;
+
+	ft->g = kcalloc(MLX5E_INNER_TTC_NUM_GROUPS, sizeof(*ft->g), GFP_KERNEL);
+	if (!ft->g)
+		return -ENOMEM;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in) {
+		kfree(ft->g);
+		return -ENOMEM;
+	}
+
+	/* L4 Group */
+	mc = MLX5_ADDR_OF(create_flow_group_in, in, match_criteria);
+	MLX5_SET_TO_ONES(fte_match_param, mc, inner_headers.ip_protocol);
+	MLX5_SET_TO_ONES(fte_match_param, mc, inner_headers.ip_version);
+	MLX5_SET_CFG(in, match_criteria_enable, MLX5_MATCH_INNER_HEADERS);
+	MLX5_SET_CFG(in, start_flow_index, ix);
+	ix += MLX5E_INNER_TTC_GROUP1_SIZE;
+	MLX5_SET_CFG(in, end_flow_index, ix - 1);
+	ft->g[ft->num_groups] = mlx5_create_flow_group(ft->t, in);
+	if (IS_ERR(ft->g[ft->num_groups]))
+		goto err;
+	ft->num_groups++;
+
+	/* L3 Group */
+	MLX5_SET(fte_match_param, mc, inner_headers.ip_protocol, 0);
+	MLX5_SET_CFG(in, start_flow_index, ix);
+	ix += MLX5E_INNER_TTC_GROUP2_SIZE;
+	MLX5_SET_CFG(in, end_flow_index, ix - 1);
+	ft->g[ft->num_groups] = mlx5_create_flow_group(ft->t, in);
+	if (IS_ERR(ft->g[ft->num_groups]))
+		goto err;
+	ft->num_groups++;
+
+	/* Any Group */
+	memset(in, 0, inlen);
+	MLX5_SET_CFG(in, start_flow_index, ix);
+	ix += MLX5E_INNER_TTC_GROUP3_SIZE;
+	MLX5_SET_CFG(in, end_flow_index, ix - 1);
+	ft->g[ft->num_groups] = mlx5_create_flow_group(ft->t, in);
+	if (IS_ERR(ft->g[ft->num_groups]))
+		goto err;
+	ft->num_groups++;
+
+	kvfree(in);
+	return 0;
+
+err:
+	err = PTR_ERR(ft->g[ft->num_groups]);
+	ft->g[ft->num_groups] = NULL;
+	kvfree(in);
+
+	return err;
+}
+
+static int mlx5e_create_inner_ttc_table(struct mlx5e_priv *priv)
+{
+	struct mlx5e_ttc_table *ttc = &priv->fs.inner_ttc;
+	struct mlx5_flow_table_attr ft_attr = {};
+	struct mlx5e_flow_table *ft = &ttc->ft;
+	int err;
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return 0;
+
+	ft_attr.max_fte = MLX5E_INNER_TTC_TABLE_SIZE;
+	ft_attr.level   = MLX5E_INNER_TTC_FT_LEVEL;
+	ft_attr.prio    = MLX5E_NIC_PRIO;
+
+	ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr);
+	if (IS_ERR(ft->t)) {
+		err = PTR_ERR(ft->t);
+		ft->t = NULL;
+		return err;
+	}
+
+	err = mlx5e_create_inner_ttc_table_groups(ttc);
+	if (err)
+		goto err;
+
+	err = mlx5e_generate_inner_ttc_table_rules(priv);
+	if (err)
+		goto err;
+
+	return 0;
+
+err:
+	mlx5e_destroy_flow_table(ft);
+	return err;
+}
+
+static void mlx5e_destroy_inner_ttc_table(struct mlx5e_priv *priv)
+{
+	struct mlx5e_ttc_table *ttc = &priv->fs.inner_ttc;
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return;
+
+	mlx5e_cleanup_ttc_rules(ttc);
+	mlx5e_destroy_flow_table(&ttc->ft);
+}
+
 void mlx5e_destroy_ttc_table(struct mlx5e_priv *priv)
 {
 	struct mlx5e_ttc_table *ttc = &priv->fs.ttc;
@@ -1179,11 +1405,18 @@ int mlx5e_create_flow_steering(struct mlx5e_priv *priv)
 		priv->netdev->hw_features &= ~NETIF_F_NTUPLE;
 	}
 
+	err = mlx5e_create_inner_ttc_table(priv);
+	if (err) {
+		netdev_err(priv->netdev, "Failed to create inner ttc table, err=%d\n",
+			   err);
+		goto err_destroy_arfs_tables;
+	}
+
 	err = mlx5e_create_ttc_table(priv);
 	if (err) {
 		netdev_err(priv->netdev, "Failed to create ttc table, err=%d\n",
 			   err);
-		goto err_destroy_arfs_tables;
+		goto err_destroy_inner_ttc_table;
 	}
 
 	err = mlx5e_create_l2_table(priv);
@@ -1208,6 +1441,8 @@ int mlx5e_create_flow_steering(struct mlx5e_priv *priv)
 	mlx5e_destroy_l2_table(priv);
 err_destroy_ttc_table:
 	mlx5e_destroy_ttc_table(priv);
+err_destroy_inner_ttc_table:
+	mlx5e_destroy_inner_ttc_table(priv);
 err_destroy_arfs_tables:
 	mlx5e_arfs_destroy_tables(priv);
 
@@ -1219,6 +1454,7 @@ void mlx5e_destroy_flow_steering(struct mlx5e_priv *priv)
 	mlx5e_destroy_vlan_table(priv);
 	mlx5e_destroy_l2_table(priv);
 	mlx5e_destroy_ttc_table(priv);
+	mlx5e_destroy_inner_ttc_table(priv);
 	mlx5e_arfs_destroy_tables(priv);
 	mlx5e_ethtool_cleanup_steering(priv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9475fb89a744..111c7523d448 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2349,9 +2349,10 @@ static void mlx5e_build_tir_ctx_lro(struct mlx5e_params *params, void *tirc)
 
 void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params *params,
 				    enum mlx5e_traffic_types tt,
-				    void *tirc)
+				    void *tirc, bool inner)
 {
-	void *hfso = MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_outer);
+	void *hfso = inner ? MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_inner) :
+			     MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_outer);
 
 #define MLX5_HASH_IP            (MLX5_HASH_FIELD_SEL_SRC_IP   |\
 				 MLX5_HASH_FIELD_SEL_DST_IP)
@@ -2500,6 +2501,21 @@ static int mlx5e_modify_tirs_lro(struct mlx5e_priv *priv)
 	return err;
 }
 
+static void mlx5e_build_inner_indir_tir_ctx(struct mlx5e_priv *priv,
+					    enum mlx5e_traffic_types tt,
+					    u32 *tirc)
+{
+	MLX5_SET(tirc, tirc, transport_domain, priv->mdev->mlx5e_res.td.tdn);
+
+	mlx5e_build_tir_ctx_lro(&priv->channels.params, tirc);
+
+	MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT);
+	MLX5_SET(tirc, tirc, indirect_table, priv->indir_rqt.rqtn);
+	MLX5_SET(tirc, tirc, tunneled_offload_en, 0x1);
+
+	mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, true);
+}
+
 static int mlx5e_set_mtu(struct mlx5e_priv *priv, u16 mtu)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
@@ -2865,7 +2881,7 @@ static void mlx5e_build_indir_tir_ctx(struct mlx5e_priv *priv,
 
 	MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT);
 	MLX5_SET(tirc, tirc, indirect_table, priv->indir_rqt.rqtn);
-	mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc);
+	mlx5e_build_indir_tir_ctx_hash(&priv->channels.params, tt, tirc, false);
 }
 
 static void mlx5e_build_direct_tir_ctx(struct mlx5e_priv *priv, u32 rqtn, u32 *tirc)
@@ -2884,6 +2900,7 @@ int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
 	struct mlx5e_tir *tir;
 	void *tirc;
 	int inlen;
+	int i = 0;
 	int err;
 	u32 *in;
 	int tt;
@@ -2899,16 +2916,36 @@ int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv)
 		tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
 		mlx5e_build_indir_tir_ctx(priv, tt, tirc);
 		err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
-		if (err)
-			goto err_destroy_tirs;
+		if (err) {
+			mlx5_core_warn(priv->mdev, "create indirect tirs failed, %d\n", err);
+			goto err_destroy_inner_tirs;
+		}
 	}
 
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		goto out;
+
+	for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++) {
+		memset(in, 0, inlen);
+		tir = &priv->inner_indir_tir[i];
+		tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
+		mlx5e_build_inner_indir_tir_ctx(priv, i, tirc);
+		err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
+		if (err) {
+			mlx5_core_warn(priv->mdev, "create inner indirect tirs failed, %d\n", err);
+			goto err_destroy_inner_tirs;
+		}
+	}
+
+out:
 	kvfree(in);
 
 	return 0;
 
-err_destroy_tirs:
-	mlx5_core_warn(priv->mdev, "create indirect tirs failed, %d\n", err);
+err_destroy_inner_tirs:
+	for (i--; i >= 0; i--)
+		mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
+
 	for (tt--; tt >= 0; tt--)
 		mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[tt]);
 
@@ -2962,6 +2999,12 @@ void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv)
 
 	for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
 		mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[i]);
+
+	if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+		return;
+
+	for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
+		mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
 }
 
 void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index d731d57a996a..5a7bea688ec8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -83,8 +83,8 @@
 #define ETHTOOL_PRIO_NUM_LEVELS 1
 #define ETHTOOL_NUM_PRIOS 11
 #define ETHTOOL_MIN_LEVEL (KERNEL_MIN_LEVEL + ETHTOOL_NUM_PRIOS)
-/* Vlan, mac, ttc, aRFS */
-#define KERNEL_NIC_PRIO_NUM_LEVELS 4
+/* Vlan, mac, ttc, inner ttc, aRFS */
+#define KERNEL_NIC_PRIO_NUM_LEVELS 5
 #define KERNEL_NIC_NUM_PRIOS 1
 /* One more level for tc */
 #define KERNEL_MIN_LEVEL (KERNEL_NIC_PRIO_NUM_LEVELS + 1)
-- 
2.13.0
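
The index bookkeeping in mlx5e_create_inner_ttc_table_groups() above can be sketched in plain C. This is a standalone illustration, not driver code: struct group_range and layout_groups() are hypothetical names, and the group sizes are passed in by the caller because the MLX5E_INNER_TTC_GROUP*_SIZE values are not shown in this part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range of flow-table entries owned by one flow group. */
struct group_range {
	int start;	/* corresponds to start_flow_index */
	int end;	/* corresponds to end_flow_index (inclusive) */
};

/*
 * Lay out n consecutive groups the way the patch does: each group starts
 * at the running index ix, ix advances by the group size, and the group
 * ends at ix - 1.  Returns the total number of entries consumed.
 */
int layout_groups(const int *sizes, size_t n, struct group_range *out)
{
	int ix = 0;
	size_t i;

	assert(sizes && out);

	for (i = 0; i < n; i++) {
		out[i].start = ix;
		ix += sizes[i];
		out[i].end = ix - 1;
	}
	return ix;
}
```

With the L4, L3 and Any group sizes as input, the return value is the number of table entries the three groups need in total, which presumably has to fit within the max_fte value set for the table.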

^ permalink raw reply related

* Re: [PATCH net] net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278
From: David Miller @ 2017-08-30 23:04 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, andrew, vivien.didelot
In-Reply-To: <1504121973-8438-1-git-send-email-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Wed, 30 Aug 2017 12:39:33 -0700

> BCM7278 has only 128 entries while BCM7445 has the full 256 entries set,
> fix that.
> 
> Fixes: 7318166cacad ("net: dsa: bcm_sf2: Add support for ethtool::rxnfc")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied and queued up for -stable, thanks.

I hope we remember to increase CFP_NUM_RULES if we ever get a chip
that supports more than 256... :-/

^ permalink raw reply

* Re: [PATCH net-next] hv_netvsc: Fix typos in the document of UDP hashing
From: David Miller @ 2017-08-30 23:05 UTC (permalink / raw)
  To: haiyangz, haiyangz; +Cc: netdev, kys, olaf, vkuznets, linux-kernel
In-Reply-To: <20170830203722.18780-1-haiyangz@exchange.microsoft.com>

From: Haiyang Zhang <haiyangz@exchange.microsoft.com>
Date: Wed, 30 Aug 2017 13:37:22 -0700

> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> There are two typos in the document, netvsc.txt,
> regarding UDP hashing level. This patch fixes them.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>

Applied, thanks.

^ permalink raw reply

* Re: [Patch net-next] net_sched: add reverse binding for tc class
From: Cong Wang @ 2017-08-30 23:15 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <59A73FFC.5070803@iogearbox.net>

On Wed, Aug 30, 2017 at 3:45 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 08/31/2017 12:22 AM, Daniel Borkmann wrote:
>>
>> The prog->res.classid is the default one, but can be overridden
>> later depending on the specified program. cls_bpf_classify() does
>> after prog return (filter_res holds return code):
>>
>>      [...]
>>          if (filter_res == 0)
>>              continue;
>>          if (filter_res != -1) {
>>              res->class   = 0;
>>              res->classid = filter_res;
>>          } else {
>>              *res = prog->res;
>>          }
>>      [...]
>>
>> Meaning in case of a match (-1), we use the default bound one,
>> but prog may as well return an alternative found classid if it
>> wants to. So both versions are possible.
>
>
> But even for that case your patch looks fine to me actually, since
> for dynamic classid we set class to 0. No objections from my side
> then.

Sounds good. Then I will leave it as it is.

Thanks for the explanation.
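
The branch quoted above can be restated as a small standalone function, which may make the two outcomes easier to see. This is only a sketch: struct result is a simplified stand-in for the kernel's result structure, and pick_result() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the classifier result (hypothetical layout). */
struct result {
	unsigned long cls;	/* bound class handle, 0 if not bound */
	uint32_t classid;
};

/*
 * Mirror of the quoted branch.  Returns 0 when the program reported no
 * match (the caller would move on to the next program) and 1 when *res
 * was filled in.
 */
int pick_result(int filter_res, const struct result *bound, struct result *res)
{
	assert(bound && res);

	if (filter_res == 0)
		return 0;		/* no match, *res untouched */
	if (filter_res != -1) {
		res->cls = 0;		/* dynamic classid: no bound class */
		res->classid = (uint32_t)filter_res;
	} else {
		*res = *bound;		/* default classid bound at setup time */
	}
	return 1;
}
```

The dynamic case always clears the class handle, which is the property Daniel points to when he says the patch is fine even for programs that return their own classid.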

^ permalink raw reply

* [PATCH net-next] liquidio: fix crash in presence of zeroed-out base address regs
From: Felix Manlunas @ 2017-08-30 23:19 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	ricardo.farrington

From: Rick Farrington <ricardo.farrington@cavium.com>

Fix crash in linux PF driver when BARs have been cleared/de-programmed;
fail early init (prior to mapping BARs) if the BAR0 or
BAR1 registers are zero.

This situation can arise when the PF is added to a VM (PCI pass-through),
then a PF FLR is issued (in the VM).  After this occurs, the BAR registers
will be zero. If we attempt to load the PF driver in the host
(after the VM has been shut down), the host can reset.

Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 .../net/ethernet/cavium/liquidio/cn23xx_pf_device.c  | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 4b0ca9f..8705e23 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -1269,6 +1269,26 @@ static int cn23xx_sriov_config(struct octeon_device *oct)
 
 int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
 {
+	u32 data32;
+	u64 BAR0, BAR1;
+
+	pci_read_config_dword(oct->pci_dev, PCI_BASE_ADDRESS_0, &data32);
+	BAR0 = (u64)(data32 & ~0xf);
+	pci_read_config_dword(oct->pci_dev, PCI_BASE_ADDRESS_1, &data32);
+	BAR0 |= ((u64)data32 << 32);
+	pci_read_config_dword(oct->pci_dev, PCI_BASE_ADDRESS_2, &data32);
+	BAR1 = (u64)(data32 & ~0xf);
+	pci_read_config_dword(oct->pci_dev, PCI_BASE_ADDRESS_3, &data32);
+	BAR1 |= ((u64)data32 << 32);
+
+	if (!BAR0 || !BAR1) {
+		if (!BAR0)
+			dev_err(&oct->pci_dev->dev, "device BAR0 unassigned\n");
+		if (!BAR1)
+			dev_err(&oct->pci_dev->dev, "device BAR1 unassigned\n");
+		return 1;
+	}
+
 	if (octeon_map_pci_barx(oct, 0, 0))
 		return 1;
 
-- 
1.8.3.1
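
For readers unfamiliar with how the patch assembles the BAR values, the arithmetic can be sketched outside the kernel as below. This is an illustration under assumptions: bar64_from_dwords() and bars_unassigned() are hypothetical helper names, and the two dwords are assumed to have been read from config space already (the patch uses pci_read_config_dword() for that).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the two config-space dwords of a 64-bit memory BAR into an
 * address.  The low four bits of the first dword are flag bits (memory
 * vs I/O, type, prefetchable), so they are masked off as in the patch.
 */
uint64_t bar64_from_dwords(uint32_t lo, uint32_t hi)
{
	return (uint64_t)(lo & ~0xfu) | ((uint64_t)hi << 32);
}

/* Nonzero when either BAR has no address programmed. */
int bars_unassigned(uint64_t bar0, uint64_t bar1)
{
	return !bar0 || !bar1;
}
```

Masking matters for the zero test: a de-programmed BAR may still read back its flag bits, and without the mask such a value would not compare equal to zero.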

^ permalink raw reply related

* Re: multi-queue over IFF_NO_QUEUE "virtual" devices
From: Cong Wang @ 2017-08-30 23:37 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Linux Kernel Network Developers, Jiri Pirko, Jamal Hadi Salim,
	andrew, David Miller, Vivien Didelot
In-Reply-To: <85342893-b84f-4922-1e23-8c9fe0e5f1e0@gmail.com>

On Tue, Aug 29, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> Le 08/07/17 à 15:26, Florian Fainelli a écrit :
>> Hi,
>>
>> Most DSA supported Broadcom switches have multiple queues per ports
>> (usually 8) and each of these queues can be configured with different
>> pause, drop, hysteresis thresholds and so on in order to make use of the
>> switch's internal buffering scheme and have some queues achieve some
>> kind of lossless behavior (e.g: LAN to LAN traffic for Q7 has a higher
>> priority than LAN to WAN for Q0).
>>
>> This is obviously very workload specific, so I'd want maximum
>> programmability as much as possible.
>>
>> This brings me to a few questions:
>>
>> 1) If we have the DSA slave network devices currently flagged with
>> IFF_NO_QUEUE becoming multi-queue (on TX) aware such that an application
>> can control exactly which switch egress queue is used on a per-flow
>> basis, would that be a problem (this is the dynamic selection of the TX
>> queue)?
>
> So I have this part figured out, with a bunch of changes network devices
> created by DSA are now multiqueue aware and the Broadcom tag layer is
> capable of extracting the queue index, passing it in the tag where
> expected and having the switch forward to the appropriate switch port
> and queue within that port. It also sets the queue mapping in the SKB
> for later consumption by the master network device driver: bcmsysport.c
> because of 2).
>
>>
>> 2) The conduit interface (CPU) port network interface has a congestion
>> control scheme which requires each of its TX queues (32 or 16) to be
>> statically mapped to each of the underlying switch port queues because
>> the congestion/ HW needs to inspect the queue depths of the switch to
>> accept/reject a packet at the CPU's TX ring level. Do we have a good way
>> with tc to map a virtual/stacked device's queue(s) on-top of its
>> physical/underlying device's queues (this is the static queue mapping
>> necessary for congestion to work)?
>
> That part I have not figured out yet, with some static mapping I can
> obtain the results that I want and was even considering the possibility
> of doing something like this:
>
> - register a network device notifier with bcmsysport.c (master network
> device) for this setup
> - expose a helper function allowing me to obtain a given DSA network
> device port index
> - whenever DSA creates network devices reconfigure the ring and queue
> mapping of the TX queues managed by bcmsysport.c with the DSA network
> device port index that has just been registered and just do a 1-1
> mapping of the 8 queues
>
> You would end-up with something like:
>
> gphy (port 0) queues 0-7 mapped to systemport queues 0-7
> rgmii_1 (port 1) queues 0-7 mapped to systemport queues 8-15
> rgmii_2 (port 2) queues 0-7 mapped to systemport queues 16-23
> moca (port 7) queues 0-7 mapped to systemport queues 24-31
>
> This should be working because bcmsysport's TX queues are not under
> direct control by the user, they are used via DSA created network
> devices which indicate the queue they want to use. When the DSA
> interfaces are brought down, their respective systemport queues now
> become unused. This also works because the number of physical ports of
> the switch times the number of queues is matching the number of TX
> queues from systemport (like if someone designed it with that exact
> purpose in mind ;)).
>
> The only problem with that approach of course is that it embeds a policy
> within the systemport driver.
>
> Ideally I would really like to configure this via tc by setting up a
> mapping between queues of one network devices to queues of another
> network device, is that a possible thing, Jamal, Cong, Jiri, do you know?

I am not sure if I understand the mapping you are talking about here.

The TC layer rarely deals with hardware queues directly (except probably mq),
so this question probably doesn't belong to TC.

OTOH, TC can modify skb->hash, so you can redirect packets to a specific
queue, but this doesn't sound like what you are looking for.

Maybe Jiri has more thoughts here since he works on TC offloading things.
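
For reference, the static 1-1 mapping Florian describes in the quoted text can be written down as a few lines of C. This is a sketch under assumptions, not bcmsysport.c code: systemport_queue() is a hypothetical name, and "slot" is assumed to be the order in which a DSA port was registered, since in the example moca is switch port 7 yet takes the fourth block of queues.

```c
#include <assert.h>

#define QUEUES_PER_PORT	8	/* switch egress queues per port, per the thread */
#define NUM_TX_RINGS	32	/* systemport TX queues in the 32-queue case */

/*
 * Map (registration slot, switch queue) to a systemport TX queue.
 * Returns -1 if the slot/queue pair does not fit in the TX rings.
 */
int systemport_queue(int slot, int queue)
{
	assert(QUEUES_PER_PORT > 0);

	if (slot < 0 || queue < 0 || queue >= QUEUES_PER_PORT)
		return -1;
	if ((slot + 1) * QUEUES_PER_PORT > NUM_TX_RINGS)
		return -1;
	return slot * QUEUES_PER_PORT + queue;
}
```

With four ports of eight queues each, the mapping uses exactly the 32 TX queues, which matches the observation in the quoted text that the numbers line up.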

^ permalink raw reply

* Re: [pull request][net 00/11] Mellanox, mlx5 fixes 2017-08-30
From: David Miller @ 2017-08-31  0:06 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20170830222110.15737-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 31 Aug 2017 01:20:59 +0300

> This series contains some misc fixes to the mlx5 driver.
> 
> Please pull and let me know if there's any problem.

Series applied, thanks.

^ permalink raw reply

* [PATCH net-next] devlink: Maintain consistency in mac field name
From: David Ahern @ 2017-08-31  0:07 UTC (permalink / raw)
  To: netdev; +Cc: arkadis, David Ahern

IPv4 name uses "destination ip" as does the IPv6 patch set.
Make the mac field consistent.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/core/devlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/devlink.c b/net/core/devlink.c
index 47931a202a0c..7d430c1d9c3e 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -31,7 +31,7 @@
 
 static struct devlink_dpipe_field devlink_dpipe_fields_ethernet[] = {
 	{
-		.name = "destination_mac",
+		.name = "destination mac",
 		.id = DEVLINK_DPIPE_FIELD_ETHERNET_DST_MAC,
 		.bitwidth = 48,
 	},
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH net v2] net: phy: Correctly process PHY_HALTED in phy_stop_machine()
From: David Daney @ 2017-08-31  0:13 UTC (permalink / raw)
  To: David Miller, f.fainelli
  Cc: netdev, andrew, slash.tmp, marc_gonzalez, rmk+kernel
In-Reply-To: <20170731.172818.1505741655348122155.davem@davemloft.net>

On 07/31/2017 05:28 PM, David Miller wrote:
> From: Florian Fainelli <f.fainelli@gmail.com>
> Date: Fri, 28 Jul 2017 11:58:36 -0700
> 
>> Marc reported that he was not getting the PHY library adjust_link()
>> callback function to run when calling phy_stop() + phy_disconnect()
>> which does not indeed happen because we set the state machine to
>> PHY_HALTED but we don't get to run it to process this state past that
>> point.
>>
>> Fix this with a synchronous call to phy_state_machine() in order to have
>> the state machine actually act on PHY_HALTED, set the PHY device's link
>> down, turn the network device's carrier off and finally call the
>> adjust_link() function.
>>
>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>> ---
>> Changes in v2:
>>
>> - reword subject and commit message based on changes
>> - dropped flush_scheduled_work() since it is redundant
> 
> Applied and queued up for -stable, thanks.
> 


This is broken.  Please revert.

Upstream commit 7ad813f20853 and in the stable branches as well.

When ndo_stop() is called we call:


  phy_disconnect()
     +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
     +---> phy_stop_machine()
     |      +---> phy_stop_machine()
     |              +----> queue_delayed_work(): Work queued.
     +--->phy_detach() implies: phydev->attached_dev = NULL;

Now at a later time the queued work does:

  phy_state_machine()
     +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:


  CPU 12 Unable to handle kernel paging request at virtual address
0000000000000048, epc == ffffffff80de37ec, ra == ffffffff80c7c
Oops[#1]:
CPU: 12 PID: 1502 Comm: kworker/12:1 Not tainted 4.9.43-Cavium-Octeon+ #1
Workqueue: events_power_efficient phy_state_machine
task: 80000004021ed100 task.stack: 8000000409d70000
$ 0   : 0000000000000000 ffffffff84720060 0000000000000048 0000000000000004
$ 4   : 0000000000000000 0000000000000001 0000000000000004 0000000000000000
$ 8   : 0000000000000000 0000000000000000 00000000ffff98f3 0000000000000000
$12   : 8000000409d73fe0 0000000000009c00 ffffffff846547c8 000000000000af3b
$16   : 80000004096bab68 80000004096babd0 0000000000000000 80000004096ba800
$20   : 0000000000000000 0000000000000000 ffffffff81090000 0000000000000008
$24   : 0000000000000061 ffffffff808637b0
$28   : 8000000409d70000 8000000409d73cf0 80000000271bd300 ffffffff80c7804c
Hi    : 000000000000002a
Lo    : 000000000000003f
epc   : ffffffff80de37ec netif_carrier_off+0xc/0x58
ra    : ffffffff80c7804c phy_state_machine+0x48c/0x4f8
Status: 14009ce3        KX SX UX KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 0000000000000048
PrId  : 000d9501 (Cavium Octeon III)
Modules linked in:
Process kworker/12:1 (pid: 1502, threadinfo=8000000409d70000,
task=80000004021ed100, tls=0000000000000000)
Stack : 8000000409a54000 80000004096bab68 80000000271bd300 80000000271c1e00
         0000000000000000 ffffffff808a1708 8000000409a54000 80000000271bd300
         80000000271bd320 8000000409a54030 ffffffff80ff0f00 0000000000000001
         ffffffff81090000 ffffffff808a1ac0 8000000402182080 ffffffff84650000
         8000000402182080 ffffffff84650000 ffffffff80ff0000 8000000409a54000
         ffffffff808a1970 0000000000000000 80000004099e8000 8000000402099240
         0000000000000000 ffffffff808a8598 0000000000000000 8000000408eeeb00
         8000000409a54000 00000000810a1d00 0000000000000000 8000000409d73de8
         8000000409d73de8 0000000000000088 000000000c009c00 8000000409d73e08
         8000000409d73e08 8000000402182080 ffffffff808a84d0 8000000402182080
         ...
Call Trace:
[<ffffffff80de37ec>] netif_carrier_off+0xc/0x58
[<ffffffff80c7804c>] phy_state_machine+0x48c/0x4f8
[<ffffffff808a1708>] process_one_work+0x158/0x368
[<ffffffff808a1ac0>] worker_thread+0x150/0x4c0
[<ffffffff808a8598>] kthread+0xc8/0xe0
[<ffffffff808617f0>] ret_from_kernel_thread+0x14/0x1c
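
The ordering problem described above can be reproduced in a toy userspace model. Everything here is hypothetical (the model_* names, the two structs, the pending flag standing in for the delayed work); it only illustrates the sequence "work queued, device detached, work runs" and is not phylib code. The cancel helper shows one conventional way to close such a window and is not a claim about how this report should be resolved.

```c
#include <assert.h>
#include <stddef.h>

struct net_dev {
	int carrier_ok;
};

struct phy_dev {
	struct net_dev *attached_dev;
	int work_pending;	/* models the queued delayed work */
};

/* Models the state machine queuing work during teardown. */
void model_stop_machine(struct phy_dev *phy)
{
	phy->work_pending = 1;
}

/* Models detach: the link to the network device is dropped. */
void model_detach(struct phy_dev *phy)
{
	phy->attached_dev = NULL;
}

/* Models cancelling the queued work before detaching. */
void model_cancel_work(struct phy_dev *phy)
{
	phy->work_pending = 0;
}

/*
 * Models the deferred work.  Returns 0 if nothing was pending or the
 * work ran safely, -1 if it would have dereferenced a NULL attached_dev
 * (the situation in the oops above).
 */
int model_run_work(struct phy_dev *phy)
{
	if (!phy->work_pending)
		return 0;
	phy->work_pending = 0;
	if (!phy->attached_dev)
		return -1;
	phy->attached_dev->carrier_ok = 0;
	return 0;
}
```

Calling model_stop_machine(), model_detach() and then model_run_work() yields -1, while inserting model_cancel_work() before the detach makes the later run a no-op.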

^ permalink raw reply

* Re: [PATCH net v2] net: phy: Correctly process PHY_HALTED in phy_stop_machine()
From: Florian Fainelli @ 2017-08-31  0:16 UTC (permalink / raw)
  To: David Daney, David Miller
  Cc: netdev, andrew, slash.tmp, marc_gonzalez, rmk+kernel
In-Reply-To: <57dfe1c5-1816-cf94-7676-293a9dcd343c@gmail.com>

On 08/30/2017 05:13 PM, David Daney wrote:
> On 07/31/2017 05:28 PM, David Miller wrote:
>> From: Florian Fainelli <f.fainelli@gmail.com>
>> Date: Fri, 28 Jul 2017 11:58:36 -0700
>>
>>> Marc reported that he was not getting the PHY library adjust_link()
>>> callback function to run when calling phy_stop() + phy_disconnect()
>>> which does not indeed happen because we set the state machine to
>>> PHY_HALTED but we don't get to run it to process this state past that
>>> point.
>>>
>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>> down, turn the network device's carrier off and finally call the
>>> adjust_link() function.
>>>
>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>> ---
>>> Changes in v2:
>>>
>>> - reword subject and commit message based on changes
>>> - dropped flush_scheduled_work() since it is redundant
>>
>> Applied and queued up for -stable, thanks.
>>
> 
> 
> This is broken.  Please revert.

This has been causing problems for Geert as well, 2 vs 1, Marc, you lose,
I will send a revert for this shortly, sorry about that.

> 
> Upstream commit 7ad813f20853 and in the stable branches as well.
> 
> When ndo_stop() is called we call:
> 
> 
>  phy_disconnect()
>     +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
>     +---> phy_stop_machine()
>     |      +---> phy_stop_machine()
>     |              +----> queue_delayed_work(): Work queued.
>     +--->phy_detach() implies: phydev->attached_dev = NULL;
> 
> Now at a later time the queued work does:
> 
>  phy_state_machine()
>     +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
> 
> 
>  CPU 12 Unable to handle kernel paging request at virtual address
> 0000000000000048, epc == ffffffff80de37ec, ra == ffffffff80c7c
> Oops[#1]:
> CPU: 12 PID: 1502 Comm: kworker/12:1 Not tainted 4.9.43-Cavium-Octeon+ #1
> Workqueue: events_power_efficient phy_state_machine
> task: 80000004021ed100 task.stack: 8000000409d70000
> $ 0   : 0000000000000000 ffffffff84720060 0000000000000048 0000000000000004
> $ 4   : 0000000000000000 0000000000000001 0000000000000004 0000000000000000
> $ 8   : 0000000000000000 0000000000000000 00000000ffff98f3 0000000000000000
> $12   : 8000000409d73fe0 0000000000009c00 ffffffff846547c8 000000000000af3b
> $16   : 80000004096bab68 80000004096babd0 0000000000000000 80000004096ba800
> $20   : 0000000000000000 0000000000000000 ffffffff81090000 0000000000000008
> $24   : 0000000000000061 ffffffff808637b0
> $28   : 8000000409d70000 8000000409d73cf0 80000000271bd300 ffffffff80c7804c
> Hi    : 000000000000002a
> Lo    : 000000000000003f
> epc   : ffffffff80de37ec netif_carrier_off+0xc/0x58
> ra    : ffffffff80c7804c phy_state_machine+0x48c/0x4f8
> Status: 14009ce3        KX SX UX KERNEL EXL IE
> Cause : 00800008 (ExcCode 02)
> BadVA : 0000000000000048
> PrId  : 000d9501 (Cavium Octeon III)
> Modules linked in:
> Process kworker/12:1 (pid: 1502, threadinfo=8000000409d70000,
> task=80000004021ed100, tls=0000000000000000)
> Stack : 8000000409a54000 80000004096bab68 80000000271bd300 80000000271c1e00
>         0000000000000000 ffffffff808a1708 8000000409a54000 80000000271bd300
>         80000000271bd320 8000000409a54030 ffffffff80ff0f00 0000000000000001
>         ffffffff81090000 ffffffff808a1ac0 8000000402182080 ffffffff84650000
>         8000000402182080 ffffffff84650000 ffffffff80ff0000 8000000409a54000
>         ffffffff808a1970 0000000000000000 80000004099e8000 8000000402099240
>         0000000000000000 ffffffff808a8598 0000000000000000 8000000408eeeb00
>         8000000409a54000 00000000810a1d00 0000000000000000 8000000409d73de8
>         8000000409d73de8 0000000000000088 000000000c009c00 8000000409d73e08
>         8000000409d73e08 8000000402182080 ffffffff808a84d0 8000000402182080
>         ...
> Call Trace:
> [<ffffffff80de37ec>] netif_carrier_off+0xc/0x58
> [<ffffffff80c7804c>] phy_state_machine+0x48c/0x4f8
> [<ffffffff808a1708>] process_one_work+0x158/0x368
> [<ffffffff808a1ac0>] worker_thread+0x150/0x4c0
> [<ffffffff808a8598>] kthread+0xc8/0xe0
> [<ffffffff808617f0>] ret_from_kernel_thread+0x14/0x1c


-- 
Florian

^ permalink raw reply

* Re: [PATCH net v2] net: phy: Correctly process PHY_HALTED in phy_stop_machine()
From: David Daney @ 2017-08-31  0:16 UTC (permalink / raw)
  To: David Miller, f.fainelli
  Cc: netdev, andrew, slash.tmp, marc_gonzalez, rmk+kernel
In-Reply-To: <57dfe1c5-1816-cf94-7676-293a9dcd343c@gmail.com>

And of course I messed up my pretty picture, see below.

On 08/30/2017 05:13 PM, David Daney wrote:
> On 07/31/2017 05:28 PM, David Miller wrote:
>> From: Florian Fainelli <f.fainelli@gmail.com>
>> Date: Fri, 28 Jul 2017 11:58:36 -0700
>>
>>> Marc reported that he was not getting the PHY library adjust_link()
>>> callback function to run when calling phy_stop() + phy_disconnect()
>>> which does not indeed happen because we set the state machine to
>>> PHY_HALTED but we don't get to run it to process this state past that
>>> point.
>>>
>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>> down, turn the network device's carrier off and finally call the
>>> adjust_link() function.
>>>
>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>> ---
>>> Changes in v2:
>>>
>>> - reword subject and commit message based on changes
>>> - dropped flush_scheduled_work() since it is redundant
>>
>> Applied and queued up for -stable, thanks.
>>
> 
> 
> This is broken.  Please revert.
> 
> Upstream commit 7ad813f20853 and in the stable branches as well.
> 
> When ndo_stop() is called we call:
> 
> 
>   phy_disconnect()
>      +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
>      +---> phy_stop_machine()
>      |      +---> phy_stop_machine()

s/phy_stop_machine/phy_state_machine/

The call that the offending patch adds.


>      |              +----> queue_delayed_work(): Work queued.
>      +--->phy_detach() implies: phydev->attached_dev = NULL;
> 
> Now at a later time the queued work does:
> 
>   phy_state_machine()
>      +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
> 
> 
>   CPU 12 Unable to handle kernel paging request at virtual address
> 0000000000000048, epc == ffffffff80de37ec, ra == ffffffff80c7c
> Oops[#1]:
> CPU: 12 PID: 1502 Comm: kworker/12:1 Not tainted 4.9.43-Cavium-Octeon+ #1
> Workqueue: events_power_efficient phy_state_machine
> task: 80000004021ed100 task.stack: 8000000409d70000
> $ 0   : 0000000000000000 ffffffff84720060 0000000000000048 0000000000000004
> $ 4   : 0000000000000000 0000000000000001 0000000000000004 0000000000000000
> $ 8   : 0000000000000000 0000000000000000 00000000ffff98f3 0000000000000000
> $12   : 8000000409d73fe0 0000000000009c00 ffffffff846547c8 000000000000af3b
> $16   : 80000004096bab68 80000004096babd0 0000000000000000 80000004096ba800
> $20   : 0000000000000000 0000000000000000 ffffffff81090000 0000000000000008
> $24   : 0000000000000061 ffffffff808637b0
> $28   : 8000000409d70000 8000000409d73cf0 80000000271bd300 ffffffff80c7804c
> Hi    : 000000000000002a
> Lo    : 000000000000003f
> epc   : ffffffff80de37ec netif_carrier_off+0xc/0x58
> ra    : ffffffff80c7804c phy_state_machine+0x48c/0x4f8
> Status: 14009ce3        KX SX UX KERNEL EXL IE
> Cause : 00800008 (ExcCode 02)
> BadVA : 0000000000000048
> PrId  : 000d9501 (Cavium Octeon III)
> Modules linked in:
> Process kworker/12:1 (pid: 1502, threadinfo=8000000409d70000,
> task=80000004021ed100, tls=0000000000000000)
> Stack : 8000000409a54000 80000004096bab68 80000000271bd300 80000000271c1e00
>          0000000000000000 ffffffff808a1708 8000000409a54000 80000000271bd300
>          80000000271bd320 8000000409a54030 ffffffff80ff0f00 0000000000000001
>          ffffffff81090000 ffffffff808a1ac0 8000000402182080 ffffffff84650000
>          8000000402182080 ffffffff84650000 ffffffff80ff0000 8000000409a54000
>          ffffffff808a1970 0000000000000000 80000004099e8000 8000000402099240
>          0000000000000000 ffffffff808a8598 0000000000000000 8000000408eeeb00
>          8000000409a54000 00000000810a1d00 0000000000000000 8000000409d73de8
>          8000000409d73de8 0000000000000088 000000000c009c00 8000000409d73e08
>          8000000409d73e08 8000000402182080 ffffffff808a84d0 8000000402182080
>          ...
> Call Trace:
> [<ffffffff80de37ec>] netif_carrier_off+0xc/0x58
> [<ffffffff80c7804c>] phy_state_machine+0x48c/0x4f8
> [<ffffffff808a1708>] process_one_work+0x158/0x368
> [<ffffffff808a1ac0>] worker_thread+0x150/0x4c0
> [<ffffffff808a8598>] kthread+0xc8/0xe0
> [<ffffffff808617f0>] ret_from_kernel_thread+0x14/0x1c


* Re: DSA mv88e6xxx RX frame errors and TCP/IP RX failure
From: Tim Harvey @ 2017-08-31  0:22 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Vivien Didelot, linux-kernel@vger.kernel.org, Fugang Duan
In-Reply-To: <20170830220631.GM22289@lunn.ch>

On Wed, Aug 30, 2017 at 3:06 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Wed, Aug 30, 2017 at 12:53:56PM -0700, Tim Harvey wrote:
>> Greetings,
>>
>> I'm seeing RX frame errors when using the mv88e6xxx DSA driver on
>> 4.13-rc7. The board I'm using is a GW5904 [1] which has an IMX6 FEC
>> MAC (eth0) connected via RGMII to a MV88E6176 with its downstream
>> P0/P1/P2/P3 to front panel RJ45's (lan1-lan4).
>
> Hi Tim
>
> Can you confirm the counter is this one:
>
>                        /* Report late collisions as a frame error. */
>                         if (status & (BD_ENET_RX_NO | BD_ENET_RX_CL))
>                                 ndev->stats.rx_frame_errors++;
>
> I don't see anywhere else frame errors are counted, but it would be
> good to prove we are looking in the right place.
>

Andrew,

(adding IMX FEC driver maintainer to CC)

Yes, that's one of them being hit. It looks like ifconfig reports
'frame' as the accumulation of a few stats so here are some more
specifics from /sys/class/net/eth0/statistics:

root@xenial:/sys/devices/soc0/soc/2100000.aips-bus/2188000.ethernet/net/eth0/statistics# for i in `ls rx_*`; do echo $i:$(cat $i); done
rx_bytes:103229
rx_compressed:0
rx_crc_errors:22
rx_dropped:0
rx_errors:22
rx_fifo_errors:0
rx_frame_errors:22
rx_length_errors:22
rx_missed_errors:0
rx_nohandler:0
rx_over_errors:0
rx_packets:1174
root@xenial:/sys/devices/soc0/soc/2100000.aips-bus/2188000.ethernet/net/eth0/statistics# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:D0:12:41:F3:E7
          inet6 addr: fe80::2d0:12ff:fe41:f3e7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1207 errors:22 dropped:0 overruns:0 frame:66
          TX packets:42 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:106009 (103.5 KiB)  TX bytes:4604 (4.4 KiB)
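
For reference, the numbers above are consistent with ifconfig's 'frame' field being the sum of four of the sysfs counters (22 + 0 + 22 + 22 = 66). A minimal sketch of that aggregation follows, assuming the sum is taken the way the /proc/net/dev output combines them; the struct and function names are made up for illustration and are not kernel or net-tools identifiers.

```c
#include <stdint.h>

/* Subset of the RX error counters listed above; field names follow the
 * files under /sys/class/net/<dev>/statistics/. */
struct rx_err_counters {
	uint64_t rx_length_errors;
	uint64_t rx_over_errors;
	uint64_t rx_crc_errors;
	uint64_t rx_frame_errors;
};

/* Assumption: 'frame' is the plain sum of these four counters.
 * ifconfig_frame() is a hypothetical helper, not an existing API. */
static uint64_t ifconfig_frame(const struct rx_err_counters *c)
{
	return c->rx_length_errors + c->rx_over_errors +
	       c->rx_crc_errors + c->rx_frame_errors;
}
```

Feeding in the values from the listing (22, 0, 22, 22) gives 66, which matches the frame:66 shown by ifconfig, so the three flags hit in the driver each appear to account for 22 events.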

Instrumenting fec driver I see the following getting hit:

status & BD_ENET_RX_LG /* rx_length_errors: Frame too long */
status & BD_ENET_RX_CR  /* rx_crc_errors: CRC Error */
status & BD_ENET_RX_CL /* rx_frame_errors: Collision? */

Is this a frame size issue where the MV88E6176 is sending frames down
that exceed the MTU because of headers added?

Tim


* [RFC net-next 0/8] net: dsa: Multi-queue awareness
From: Florian Fainelli @ 2017-08-31  0:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, jhs, davem, xiyou.wangcong, andrew, vivien.didelot,
	Florian Fainelli

This patch series is sent for reference. The last patch in particular tries
to avoid creating too many layering violations, but it clearly still
introduces a few.

Essentially, what I am trying to achieve is a stacked device that is
multi-queue aware, that applications will use, and for which they can control
the queue selection (using mq) the way they want. One such stacked network
device is created for each port of the switch (this is what DSA does). When an
skb is submitted from, say, net_device X, we can derive its port number and
look at the queue_mapping value to determine which port and queue of the
switch we should be sending it to. This information is embedded in a
tag (4 bytes) and is used by the switch to steer the transmission.
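
To illustrate the idea of carrying port and queue in a 4-byte tag, here is a rough sketch. The byte layout, the EX_* constants and ex_build_tag() are all assumptions chosen for illustration; they are not the Broadcom tag format and not code from this series.

```c
#include <stdint.h>

#define EX_TAG_LEN	4	/* tag size as described above */
#define EX_QUEUE_MASK	0x7	/* assumption: at most 8 queues per port */

/* Hypothetical layout: byte 0 carries the egress queue, byte 3 carries a
 * one-hot destination port bitmap (so port must be below 8 here). */
static void ex_build_tag(uint8_t tag[EX_TAG_LEN], unsigned int port,
			 unsigned int queue)
{
	tag[0] = (uint8_t)(queue & EX_QUEUE_MASK);
	tag[1] = 0;
	tag[2] = 0;
	tag[3] = (uint8_t)(1u << port);
}
```

In a real tagger the queue would come from the skb's queue mapping and the port from the stacked device's private data; the point is only that both values fit in the tag and can be decoded by the switch.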

These stacked devices actually transmit through a "master" or conduit
network device, which has a number of queues as well. In one version of the
hardware that I work with, we have up to 4 ports, each with 8 queues, and the
master device has a total of 32 hardware queues, so a 1:1 mapping is easy.
Another version of the hardware has the same number of ports and queues but
only 16 hardware queues on the master device, so only a 2:1 mapping is possible.
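
One possible way to express the 1:1 and 2:1 cases is to flatten (port, queue) into a single index and scale it to the number of conduit queues. This is only a sketch of one mapping scheme under the port and queue counts given above; ex_conduit_queue() and the EX_* constants are hypothetical, and the series itself may pair queues differently.

```c
#define EX_NUM_PORTS		4	/* ports, as in the example hardware */
#define EX_QUEUES_PER_PORT	8	/* queues per port */

/* Map a stacked device's (port, queue) pair to a conduit queue index.
 * With 32 conduit queues this is the identity on the flat index (1:1);
 * with 16, adjacent pairs of stacked queues share one conduit queue (2:1). */
static unsigned int ex_conduit_queue(unsigned int port, unsigned int queue,
				     unsigned int num_conduit_queues)
{
	unsigned int total = EX_NUM_PORTS * EX_QUEUES_PER_PORT;
	unsigned int flat = port * EX_QUEUES_PER_PORT + queue;

	return flat * num_conduit_queues / total;
}
```

For example, port 3 queue 7 lands on conduit queue 31 when 32 queues are available and on queue 15 when only 16 are, while port 1 queues 0 and 1 both land on conduit queue 4 in the 16-queue case.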

In order for congestion information to work properly, I need to establish a
mapping between the stacked devices' queues and the conduit interface's
queues, preferably before transmission starts (although reconfiguration while
interfaces are running would be possible too).

Comments, flames, rotten tomatoes, anything!

Florian Fainelli (8):
  net: dsa: Allow switch drivers to indicate number of RX/TX queues
  net: dsa: tag_brcm: Set output queue from skb queue mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278
  net: dsa: Expose dsa_slave_dev_check and dsa_slave_dev_port_num
  net: dsa: tag_brcm: Indicate to master netdevice port + queue
  net: systemport: Establish DSA network device queue mapping

 drivers/net/dsa/bcm_sf2.c                  |  16 +++++
 drivers/net/dsa/bcm_sf2.h                  |   1 +
 drivers/net/dsa/bcm_sf2_cfp.c              |   8 +--
 drivers/net/ethernet/broadcom/bcmsysport.c | 100 +++++++++++++++++++++++++++--
 drivers/net/ethernet/broadcom/bcmsysport.h |  11 +++-
 include/net/dsa.h                          |  19 ++++++
 net/dsa/slave.c                            |  22 +++++--
 net/dsa/tag_brcm.c                         |   8 ++-
 8 files changed, 170 insertions(+), 15 deletions(-)

-- 
1.9.1


* [RFC net-next 1/8] net: dsa: Allow switch drivers to indicate number of RX/TX queues
From: Florian Fainelli @ 2017-08-31  0:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, jhs, davem, xiyou.wangcong, andrew, vivien.didelot,
	Florian Fainelli
In-Reply-To: <1504138732-65383-1-git-send-email-f.fainelli@gmail.com>

Let switch drivers indicate how many RX and TX queues they support. Some
switches, such as Broadcom Starfighter 2, are designed with 8 egress
queues. Future changes will allow us to leverage the queue mapping and
direct the transmission towards a particular queue.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/net/dsa.h |  4 ++++
 net/dsa/slave.c   | 10 ++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 398ca8d70ccd..b10e8da3f8d7 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -243,6 +243,10 @@ struct dsa_switch {
 	/* devlink used to represent this switch device */
 	struct devlink		*devlink;
 
+	/* Number of switch port queues */
+	unsigned int		num_rx_queues;
+	unsigned int		num_tx_queues;
+
 	/* Dynamically allocated ports, keep last */
 	size_t num_ports;
 	struct dsa_port ports[];
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 78e78a6e6833..bfd7173a3c6a 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1259,8 +1259,14 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
 	cpu_dp = ds->dst->cpu_dp;
 	master = cpu_dp->netdev;
 
-	slave_dev = alloc_netdev(sizeof(struct dsa_slave_priv), name,
-				 NET_NAME_UNKNOWN, ether_setup);
+	if (!ds->num_rx_queues)
+		ds->num_rx_queues = 1;
+	if (!ds->num_tx_queues)
+		ds->num_tx_queues = 1;
+
+	slave_dev = alloc_netdev_mqs(sizeof(struct dsa_slave_priv), name,
+				     NET_NAME_UNKNOWN, ether_setup,
+				     ds->num_tx_queues, ds->num_rx_queues);
 	if (slave_dev == NULL)
 		return -ENOMEM;
 
-- 
1.9.1

