Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH ipsec] xfrm interface: fix list corruption for x-netns
From: Nicolas Dichtel @ 2019-07-10 13:11 UTC (permalink / raw)
  To: steffen.klassert, davem; +Cc: netdev, Nicolas Dichtel, Julien Floret

dev_net(dev) is the netns of the device and xi->net is the link netns,
where the device has been linked.
changelink() must operate in the link netns to avoid a corruption of
the xfrm lists.

Note that xi->net and dev_net(xi->physdev) are always the same.

Before the patch, the xfrmi lists may be corrupted and can later trigger a
kernel panic.

Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces")
Reported-by: Julien Floret <julien.floret@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Julien Floret <julien.floret@6wind.com>
---
 net/xfrm/xfrm_interface.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index a60d391f7ebe..9c5bc8dcf608 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -503,7 +503,7 @@ static int xfrmi_change(struct xfrm_if *xi, const struct xfrm_if_parms *p)
 
 static int xfrmi_update(struct xfrm_if *xi, struct xfrm_if_parms *p)
 {
-	struct net *net = dev_net(xi->dev);
+	struct net *net = xi->net;
 	struct xfrmi_net *xfrmn = net_generic(net, xfrmi_net_id);
 	int err;
 
@@ -663,9 +663,9 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
 			   struct nlattr *data[],
 			   struct netlink_ext_ack *extack)
 {
-	struct net *net = dev_net(dev);
+	struct xfrm_if *xi = netdev_priv(dev);
+	struct net *net = xi->net;
 	struct xfrm_if_parms p;
-	struct xfrm_if *xi;
 
 	xfrmi_netlink_parms(data, &p);
 	xi = xfrmi_locate(net, &p);
@@ -707,7 +707,7 @@ static struct net *xfrmi_get_link_net(const struct net_device *dev)
 {
 	struct xfrm_if *xi = netdev_priv(dev);
 
-	return dev_net(xi->phydev);
+	return xi->net;
 }
 
 static const struct nla_policy xfrmi_policy[IFLA_XFRM_MAX + 1] = {
-- 
2.21.0


^ permalink raw reply related

* RE: [PATCH] tipc: ensure skb->lock is initialised
From: Jon Maloy @ 2019-07-10 13:10 UTC (permalink / raw)
  To: Eric Dumazet, Chris Packham, ying.xue@windriver.com,
	davem@davemloft.net
  Cc: netdev@vger.kernel.org, tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <e7606e76-8a0a-dab7-4561-f44f98d90164@gmail.com>



> -----Original Message-----
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Sent: 10-Jul-19 04:00
> To: Jon Maloy <jon.maloy@ericsson.com>; Eric Dumazet
> <eric.dumazet@gmail.com>; Chris Packham
> <Chris.Packham@alliedtelesis.co.nz>; ying.xue@windriver.com;
> davem@davemloft.net
> Cc: netdev@vger.kernel.org; tipc-discussion@lists.sourceforge.net; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] tipc: ensure skb->lock is initialised
> 
> 
> 
> On 7/9/19 10:15 PM, Jon Maloy wrote:
> >
> > It is not only for lockdep purposes, -it is essential.  But please provide details
> about where you see that more fixes are needed.
> >
> 
> Simple fact that you detect a problem only when skb_queue_purge() is called
> should talk by itself.
> 
> As I stated, there are many places where the list is manipulated _without_ its
> spinlock being held.

Yes, and that is the way it should be on the send path.

> 
> You want consistency, then
> 
> - grab the spinlock all the time.
> - Or do not ever use it.

That is exactly what we are doing. 
- The send path doesn't need the spinlock, and never grabs it.
- The receive path does need it, and always grabs it.

However, since we don't know from the beginning which path a created message will follow, we initialize the queue spinlock "just in case" when it is created, even though it may never be used later.
You can see this as a violation of the principle you are stating above, but it is a prize that is worth paying, given savings in code volume, complexity and performance.

> 
> Do not initialize the spinlock just in case a path will use skb_queue_purge()
> (instead of using __skb_queue_purge())

I am ok with that. I think we can agree that Chris goes for that solution, so we can get this bug fixed.

///jon



^ permalink raw reply

* [PATCH] [net-next] net/mlx5e: avoid uninitialized variable use
From: Arnd Bergmann @ 2019-07-10 13:06 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, David S. Miller
  Cc: Arnd Bergmann, Tariq Toukan, Eran Ben Elisha, Boris Pismenny,
	netdev, linux-rdma, linux-kernel, clang-built-linux

clang points to a variable being used in an unexpected
code path:

drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:251:2: warning: variable 'rec_seq_sz' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized]
        default:
        ^~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:255:46: note: uninitialized use occurs here
        skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
                                                    ^~~~~~~~~~

From looking at the function logic, it seems that there is no
sensible way to continue here, so just return early and hope
for the best.

Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index 3f5f4317a22b..5c08891806f0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -250,6 +250,7 @@ tx_post_resync_params(struct mlx5e_txqsq *sq,
 	}
 	default:
 		WARN_ON(1);
+		return;
 	}
 
 	skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
-- 
2.20.0


^ permalink raw reply related

* Re: [PATCH net-next v6 06/15] ethtool: netlink bitset handling
From: Jiri Pirko @ 2019-07-10 12:59 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: netdev, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, Johannes Berg,
	linux-kernel
In-Reply-To: <20190710123803.GB5700@unicorn.suse.cz>

Wed, Jul 10, 2019 at 02:38:03PM CEST, mkubecek@suse.cz wrote:
>On Tue, Jul 09, 2019 at 04:18:17PM +0200, Jiri Pirko wrote:
>> 
>> I understand. So how about avoid the bitfield all together and just
>> have array of either bits of strings or combinations?
>> 
>> ETHTOOL_CMD_SETTINGS_SET (U->K)
>>     ETHTOOL_A_HEADER
>>         ETHTOOL_A_DEV_NAME = "eth3"
>>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>>        ETHTOOL_A_SETTINGS_PRIV_FLAG
>>            ETHTOOL_A_FLAG_NAME = "legacy-rx"
>> 	   ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>> 
>> or the same with index instead of string
>> 
>> ETHTOOL_CMD_SETTINGS_SET (U->K)
>>     ETHTOOL_A_HEADER
>>         ETHTOOL_A_DEV_NAME = "eth3"
>>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>>             ETHTOOL_A_FLAG_INDEX = 0
>>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>> 
>> 
>> For set you can combine both when you want to set multiple bits:
>> 
>> ETHTOOL_CMD_SETTINGS_SET (U->K)
>>     ETHTOOL_A_HEADER
>>         ETHTOOL_A_DEV_NAME = "eth3"
>>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>>             ETHTOOL_A_FLAG_INDEX = 2
>>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>>             ETHTOOL_A_FLAG_INDEX = 8
>>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
>>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>> 
>> 
>> For get this might be a bit bigger message:
>> 
>> ETHTOOL_CMD_SETTINGS_GET_REPLY (K->U)
>>     ETHTOOL_A_HEADER
>>         ETHTOOL_A_DEV_NAME = "eth3"
>>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>>             ETHTOOL_A_FLAG_INDEX = 0
>>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
>>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>>             ETHTOOL_A_FLAG_INDEX = 1
>>             ETHTOOL_A_FLAG_NAME = "vf-ipsec"
>>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>>             ETHTOOL_A_FLAG_INDEX = 8
>>             ETHTOOL_A_FLAG_NAME = "something-else"
>>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>
>This is perfect for "one shot" applications but not so much for long
>running ones, either "ethtool --monitor" or management or monitoring
>daemons. Repeating the names in every notification message would be
>a waste, it's much more convenient to load the strings only once and

Yeah, for those aplications, the ETHTOOL_A_FLAG_NAME could be omitted


>cache them. Even if we omit the names in notifications (and possibly the
>GET replies if client opts for it), this format still takes 12-16 bytes
>per bit.
>
>So the problem I'm trying to address is that there are two types of
>clients with very different mode of work and different preferences.
>
>Looking at the bitset.c, I would rather say that most of the complexity
>and ugliness comes from dealing with both unsigned long based bitmaps
>and u32 based ones. Originally, there were functions working with
>unsigned long based bitmaps and the variants with "32" suffix were
>wrappers around them which converted u32 bitmaps to unsigned long ones
>and back. This became a problem when kernel started issuing warnings
>about variable length arrays as getting rid of them meant two kmalloc()
>and two kfree() for each u32 bitmap operation, even if most of the
>bitmaps are in rather short in practice.
>
>Maybe the wrapper could do something like
>
>int ethnl_put_bitset32(const u32 *value, const u32 *mask,
>		       unsigned int size,  ...)
>{
>	unsigned long fixed_value[2], fixed_mask[2];
>	unsigned long *tmp_value = fixed_value;
>	unsigned long *tmp_mask = fixed_mask;
>
>	if (size > sizeof(fixed_value) * BITS_PER_BYTE) {
>		tmp_value = bitmap_alloc(size);
>		if (!tmp_value)
>			return -ENOMEM;
>		tmp_mask = bitmap_alloc(size);
>		if (!tmp_mask) {
>			kfree(tmp_value);
>			return -ENOMEM;
>		}
>	}
>
>	bitmap_from_arr32(tmp_value, value, size);
>	bitmap_from_arr32(tmp_mask, mask, size);
>	ret = ethnl_put_bitset(tmp_value, tmp_mask, size, ...);
>}
>
>This way we would make bitset.c code cleaner while avoiding allocating
>short bitmaps (which is the most common case). 

I'm primarily concerned about the uapi. Plus if the uapi approach is united
for both index and string, we can omit this whole bitset abomination...


>
>Michal

^ permalink raw reply

* RE: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"
From: Jan Szewczyk @ 2019-07-10 12:59 UTC (permalink / raw)
  To: David Ahern, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <cb0674df-8593-f14b-f680-ce278042c88c@gmail.com>

Hi!
I digged up a little further and maybe it's not a problem with MTU itself. I checked every entry I get from RTM_GETROUTE netlink message and after triggering "too big packet" by pinging ipv6address I get exactly the same messages on 4.12 and 4.18, except that the one with that pinged ipv6address is missing on 4.18 at all. What is weird - it's visible when running "ip route get to ipv6address". Do you know why there is a mismatch there?

It's not easy for me to check this behavior on 4.19, because we have a pretty complex system here, but maybe I could try to reproduce it locally on some virtual box and check if it behaves the same.

Thanks for the tip about the testing tools, I'll try to use them.

BR,
Jan Szewczyk

-----Original Message-----
From: David Ahern <dsahern@gmail.com> 
Sent: Wednesday, July 10, 2019 14:28
To: Jan Szewczyk <jan.szewczyk@ericsson.com>; davem@davemloft.net
Cc: netdev@vger.kernel.org
Subject: Re: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"

[ adding netdev so others can chime in ]

On 7/10/19 2:28 AM, Jan Szewczyk wrote:
> Hi guys!
> 
> We can see different behavior of one of our commands that supposed to 
> show pmtu information.
> 
> It's using netlink message RTM_GETROUTE to get the information and in 
> Linux kernel version 4.12 after sending big packet (and triggering 
> "packet too big") there is an entry with PMTU and expiration time.
> 
> In the version 4.18 unfortunately the entry looks different and there 
> is no PMTU information.

Can you try with 4.19.58 (latest stable release for 4.19)? Perhaps there was a bugfix that is missing from 4.18.

The kernel has 2 commands under tools/testing/selftests/net -- pmtu.sh and icmp_redirect.sh -- that verify exceptions are created and use 'ip ro get' to verify the mtu.

> 
> I can see that in your commit
> https://protect2.fireeye.com/url?k=5be21a17-07361dbb-5be25a8c-8667c4af
> e13e-f99413291ecbed59&q=1&u=https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinu
> x%2Fcommit%2Fd4ead6b34b67fd711639324b6465a050bcb197d4,
> these lines disappeared from route.c:
> 
>  
> 
>      if (rt->rt6i_pmtu)
> 
>            metrics[RTAX_MTU - 1] = rt->rt6i_pmtu;
> 
>  
> 
> I'm very beginner in linux kernel code, can you help me and tell me if 
> that could cause this different behavior?
> 
>  
> 
>  
> 
> BR,
> 
> Jan Szewczyk
> 

^ permalink raw reply

* [PATCH iproute2-next v2 3/3] man: update man pages for TC MPLS actions
From: John Hurley @ 2019-07-10 12:40 UTC (permalink / raw)
  To: netdev
  Cc: davem, jiri, xiyou.wangcong, dsahern, willemdebruijn.kernel,
	stephen, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1562762440-25656-1-git-send-email-john.hurley@netronome.com>

Add a man page describing the newly added TC mpls manipulation actions.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 man/man8/tc-mpls.8 | 156 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 156 insertions(+)
 create mode 100644 man/man8/tc-mpls.8

diff --git a/man/man8/tc-mpls.8 b/man/man8/tc-mpls.8
new file mode 100644
index 0000000..84ef2ef
--- /dev/null
+++ b/man/man8/tc-mpls.8
@@ -0,0 +1,156 @@
+.TH "MPLS manipulation action in tc" 8 "22 May 2019" "iproute2" "Linux"
+
+.SH NAME
+mpls - mpls manipulation module
+.SH SYNOPSIS
+.in +8
+.ti -8
+.BR tc " ... " "action mpls" " { "
+.IR POP " | " PUSH " | " MODIFY " | "
+.BR dec_ttl " } [ "
+.IR CONTROL " ]"
+
+.ti -8
+.IR POP " := "
+.BR pop " " protocol
+.IR MPLS_PROTO
+
+.ti -8
+.IR PUSH " := "
+.BR push " [ " protocol
+.IR MPLS_PROTO " ]"
+.RB " [ " tc
+.IR MPLS_TC " ] "
+.RB " [ " ttl
+.IR MPLS_TTL " ] "
+.RB " [ " bos
+.IR MPLS_BOS " ] "
+.BI label " MPLS_LABEL"
+
+.ti -8
+.IR MODIFY " := "
+.BR modify " [ " label
+.IR MPLS_LABEL " ]"
+.RB " [ " tc
+.IR MPLS_TC " ] "
+.RB " [ " ttl
+.IR MPLS_TTL " ] "
+
+.ti -8
+.IR CONTROL " := { "
+.BR reclassify " | " pipe " | " drop " | " continue " | " pass " | " goto " " chain " " CHAIN_INDEX " }"
+.SH DESCRIPTION
+The
+.B mpls
+action performs mpls encapsulation or decapsulation on a packet, reflected by the
+operation modes
+.IR POP ", " PUSH ", " MODIFY " and " DEC_TTL .
+The
+.I POP
+mode requires the ethertype of the header that follows the MPLS header (e.g.
+IPv4 or another MPLS). It will remove the outer MPLS header and replace the
+ethertype in the MAC header with that passed. The
+.IR PUSH " and " MODIFY
+modes update the current MPLS header information or add a new header.
+.IR PUSH
+requires at least an
+.IR MPLS_LABEL ". "
+.I DEC_TTL
+requires no arguments and simply subtracts 1 from the MPLS header TTL field.
+
+.SH OPTIONS
+.TP
+.B pop
+Decapsulation mode. Requires the protocol of the next header.
+.TP
+.B push
+Encapsulation mode. Requires at least the
+.B label
+option.
+.TP
+.B modify
+Replace mode. Existing MPLS tag is replaced.
+.BR label ", "
+.BR tc ", "
+and
+.B ttl
+are all optional.
+.TP
+.B dec_ttl
+Decrement the TTL field on the outer most MPLS header.
+.TP
+.BI label " MPLS_LABEL"
+Specify the MPLS LABEL for the outer MPLS header.
+.I MPLS_LABEL
+is an unsigned 20bit integer, the format is detected automatically (e.g. prefix
+with
+.RB ' 0x '
+for hexadecimal interpretation, etc.).
+.TP
+.BI protocol " MPLS_PROTO"
+Choose the protocol to use. For push actions this must be
+.BR mpls_uc " or " mpls_mc " (" mpls_uc
+is the default). For pop actions it should be the protocol of the next header.
+This option cannot be used with modify.
+.TP
+.BI tc " MPLS_TC"
+Choose the TC value for the outer MPLS header. Decimal number in range of 0-7.
+Defaults to 0.
+.TP
+.BI ttl " MPLS_TTL"
+Choose the TTL value for the outer MPLS header. Number in range of 0-255. A
+non-zero default value will be selected if this is not explicitly set.
+.TP
+.BI bos " MPLS_BOS"
+Manually configure the bottom of stack bit for an MPLS header push. The default
+is for TC to automatically set (or unset) the bit based on the next header of
+the packet.
+.TP
+.I CONTROL
+How to continue after executing this action.
+.RS
+.TP
+.B reclassify
+Restarts classification by jumping back to the first filter attached to this
+action's parent.
+.TP
+.B pipe
+Continue with the next action, this is the default.
+.TP
+.B drop
+Packet will be dropped without running further actions.
+.TP
+.B continue
+Continue classification with next filter in line.
+.TP
+.B pass
+Return to calling qdisc for packet processing. This ends the classification
+process.
+.RE
+.SH EXAMPLES
+The following example encapsulates incoming IP packets on eth0 into MPLS with
+a label 123 and sends them out eth1:
+
+.RS
+.EX
+#tc qdisc add dev eth0 handle ffff: ingress
+#tc filter add dev eth0 protocol ip parent ffff: flower \\
+	action mpls push protocol mpls_uc label 123  \\
+	action mirred egress redirect dev eth1
+.EE
+.RE
+
+In this example, incoming MPLS unicast packets on eth0 are decapsulated and to
+ip packets and output to eth1:
+
+.RS
+.EX
+#tc qdisc add dev eth0 handle ffff: ingress
+#tc filter add dev eth0 protocol mpls_uc parent ffff: flower \\
+	action mpls pop protocol ipv4  \\
+	action mirred egress redirect dev eth0
+.EE
+.RE
+
+.SH SEE ALSO
+.BR tc (8)
-- 
2.7.4


^ permalink raw reply related

* [PATCH iproute2-next v2 2/3] tc: add mpls actions
From: John Hurley @ 2019-07-10 12:40 UTC (permalink / raw)
  To: netdev
  Cc: davem, jiri, xiyou.wangcong, dsahern, willemdebruijn.kernel,
	stephen, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1562762440-25656-1-git-send-email-john.hurley@netronome.com>

Create a new action type for TC that allows the pushing, popping, and
modifying of MPLS headers.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tc/Makefile |   1 +
 tc/m_mpls.c | 276 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 277 insertions(+)
 create mode 100644 tc/m_mpls.c

diff --git a/tc/Makefile b/tc/Makefile
index 60abdde..09ff369 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -39,6 +39,7 @@ TCMODULES += q_drr.o
 TCMODULES += q_qfq.o
 TCMODULES += m_gact.o
 TCMODULES += m_mirred.o
+TCMODULES += m_mpls.o
 TCMODULES += m_nat.o
 TCMODULES += m_pedit.o
 TCMODULES += m_ife.o
diff --git a/tc/m_mpls.c b/tc/m_mpls.c
new file mode 100644
index 0000000..0ae6d62
--- /dev/null
+++ b/tc/m_mpls.c
@@ -0,0 +1,276 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+/* Copyright (C) 2019 Netronome Systems, Inc. */
+
+#include <linux/if_ether.h>
+#include <linux/tc_act/tc_mpls.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "utils.h"
+#include "rt_names.h"
+#include "tc_util.h"
+
+static const char * const action_names[] = {
+	[TCA_MPLS_ACT_POP] = "pop",
+	[TCA_MPLS_ACT_PUSH] = "push",
+	[TCA_MPLS_ACT_MODIFY] = "modify",
+	[TCA_MPLS_ACT_DEC_TTL] = "dec_ttl",
+};
+
+static void explain(void)
+{
+	fprintf(stderr,
+		"Usage: mpls pop [ protocol MPLS_PROTO ]\n"
+		"       mpls push [ protocol MPLS_PROTO ] [ label MPLS_LABEL ] [ tc MPLS_TC ]\n"
+		"                 [ ttl MPLS_TTL ] [ bos MPLS_BOS ] [CONTROL]\n"
+		"       mpls modify [ label MPLS_LABEL ] [ tc MPLS_TC ] [ ttl MPLS_TTL ] [CONTROL]\n"
+		"           for pop MPLS_PROTO is next header of packet - e.g. ip or mpls_uc\n"
+		"           for push MPLS_PROTO is one of mpls_uc or mpls_mc\n"
+		"               with default: mpls_uc\n"
+		"       CONTROL := reclassify | pipe | drop | continue | pass |\n"
+		"                  goto chain <CHAIN_INDEX>\n");
+}
+
+static void usage(void)
+{
+	explain();
+	exit(-1);
+}
+
+static bool can_modify_mpls_fields(unsigned int action)
+{
+	return action == TCA_MPLS_ACT_PUSH || action == TCA_MPLS_ACT_MODIFY;
+}
+
+static bool can_modify_ethtype(unsigned int action)
+{
+	return action == TCA_MPLS_ACT_PUSH || action == TCA_MPLS_ACT_POP;
+}
+
+static bool is_valid_label(__u32 label)
+{
+	return label <= 0xfffff;
+}
+
+static bool check_double_action(unsigned int action, const char *arg)
+{
+	if (!action)
+		return false;
+
+	fprintf(stderr,
+		"Error: got \"%s\" but action already set to \"%s\"\n",
+		arg, action_names[action]);
+	explain();
+	return true;
+}
+
+static int parse_mpls(struct action_util *a, int *argc_p, char ***argv_p,
+		      int tca_id, struct nlmsghdr *n)
+{
+	struct tc_mpls parm = {};
+	__u32 label = 0xffffffff;
+	unsigned int action = 0;
+	char **argv = *argv_p;
+	struct rtattr *tail;
+	int argc = *argc_p;
+	__u16 proto = 0;
+	__u8 bos = 0xff;
+	__u8 tc = 0xff;
+	__u8 ttl = 0;
+
+	if (matches(*argv, "mpls") != 0)
+		return -1;
+
+	NEXT_ARG();
+
+	while (argc > 0) {
+		if (matches(*argv, "pop") == 0) {
+			if (check_double_action(action, *argv))
+				return -1;
+			action = TCA_MPLS_ACT_POP;
+		} else if (matches(*argv, "push") == 0) {
+			if (check_double_action(action, *argv))
+				return -1;
+			action = TCA_MPLS_ACT_PUSH;
+		} else if (matches(*argv, "modify") == 0) {
+			if (check_double_action(action, *argv))
+				return -1;
+			action = TCA_MPLS_ACT_MODIFY;
+		} else if (matches(*argv, "dec_ttl") == 0) {
+			if (check_double_action(action, *argv))
+				return -1;
+			action = TCA_MPLS_ACT_DEC_TTL;
+		} else if (matches(*argv, "label") == 0) {
+			if (!can_modify_mpls_fields(action))
+				invarg("only valid for push/modify", *argv);
+			NEXT_ARG();
+			if (get_u32(&label, *argv, 0) || !is_valid_label(label))
+				invarg("label must be <=0xFFFFF", *argv);
+		} else if (matches(*argv, "tc") == 0) {
+			if (!can_modify_mpls_fields(action))
+				invarg("only valid for push/modify", *argv);
+			NEXT_ARG();
+			if (get_u8(&tc, *argv, 0) || (tc & ~0x7))
+				invarg("tc field is 3 bits max", *argv);
+		} else if (matches(*argv, "ttl") == 0) {
+			if (!can_modify_mpls_fields(action))
+				invarg("only valid for push/modify", *argv);
+			NEXT_ARG();
+			if (get_u8(&ttl, *argv, 0) || !ttl)
+				invarg("ttl must be >0 and <=255", *argv);
+		} else if (matches(*argv, "bos") == 0) {
+			if (!can_modify_mpls_fields(action))
+				invarg("only valid for push/modify", *argv);
+			NEXT_ARG();
+			if (get_u8(&bos, *argv, 0) || (bos & ~0x1))
+				invarg("bos must be 0 or 1", *argv);
+		} else if (matches(*argv, "protocol") == 0) {
+			if (!can_modify_ethtype(action))
+				invarg("only valid for push/pop", *argv);
+			NEXT_ARG();
+			if (ll_proto_a2n(&proto, *argv))
+				invarg("protocol is invalid", *argv);
+		} else if (matches(*argv, "help") == 0) {
+			usage();
+		} else {
+			break;
+		}
+
+		NEXT_ARG_FWD();
+	}
+
+	if (!action)
+		incomplete_command();
+
+	parse_action_control_dflt(&argc, &argv, &parm.action,
+				  false, TC_ACT_PIPE);
+
+	if (argc) {
+		if (matches(*argv, "index") == 0) {
+			NEXT_ARG();
+			if (get_u32(&parm.index, *argv, 10))
+				invarg("illegal index", *argv);
+			NEXT_ARG_FWD();
+		}
+	}
+
+	if (action == TCA_MPLS_ACT_PUSH && !label)
+		missarg("label");
+
+	if (action == TCA_MPLS_ACT_PUSH && proto &&
+	    proto != htons(ETH_P_MPLS_UC) && proto != htons(ETH_P_MPLS_MC)) {
+		fprintf(stderr,
+			"invalid push protocol \"0x%04x\" - use mpls_(uc|mc)\n",
+			ntohs(proto));
+		return -1;
+	}
+
+	if (action == TCA_MPLS_ACT_POP && !proto)
+		missarg("protocol");
+
+	parm.m_action = action;
+	tail = addattr_nest(n, MAX_MSG, tca_id | NLA_F_NESTED);
+	addattr_l(n, MAX_MSG, TCA_MPLS_PARMS, &parm, sizeof(parm));
+	if (label != 0xffffffff)
+		addattr_l(n, MAX_MSG, TCA_MPLS_LABEL, &label, sizeof(label));
+	if (proto)
+		addattr_l(n, MAX_MSG, TCA_MPLS_PROTO, &proto, sizeof(proto));
+	if (tc != 0xff)
+		addattr8(n, MAX_MSG, TCA_MPLS_TC, tc);
+	if (ttl)
+		addattr8(n, MAX_MSG, TCA_MPLS_TTL, ttl);
+	if (bos != 0xff)
+		addattr8(n, MAX_MSG, TCA_MPLS_BOS, bos);
+	addattr_nest_end(n, tail);
+
+	*argc_p = argc;
+	*argv_p = argv;
+	return 0;
+}
+
+static int print_mpls(struct action_util *au, FILE *f, struct rtattr *arg)
+{
+	struct rtattr *tb[TCA_MPLS_MAX + 1];
+	struct tc_mpls *parm;
+	SPRINT_BUF(b1);
+	__u32 val;
+
+	if (!arg)
+		return -1;
+
+	parse_rtattr_nested(tb, TCA_MPLS_MAX, arg);
+
+	if (!tb[TCA_MPLS_PARMS]) {
+		fprintf(stderr, "[NULL mpls parameters]\n");
+		return -1;
+	}
+	parm = RTA_DATA(tb[TCA_MPLS_PARMS]);
+
+	print_string(PRINT_ANY, "kind", "%s ", "mpls");
+	print_string(PRINT_ANY, "mpls_action", " %s",
+		     action_names[parm->m_action]);
+
+	switch (parm->m_action) {
+	case TCA_MPLS_ACT_POP:
+		if (tb[TCA_MPLS_PROTO]) {
+			__u16 proto;
+
+			proto = rta_getattr_u16(tb[TCA_MPLS_PROTO]);
+			print_string(PRINT_ANY, "protocol", " protocol %s",
+				     ll_proto_n2a(proto, b1, sizeof(b1)));
+		}
+		break;
+	case TCA_MPLS_ACT_PUSH:
+		if (tb[TCA_MPLS_PROTO]) {
+			__u16 proto;
+
+			proto = rta_getattr_u16(tb[TCA_MPLS_PROTO]);
+			print_string(PRINT_ANY, "protocol", " protocol %s",
+				     ll_proto_n2a(proto, b1, sizeof(b1)));
+		}
+		/* Fallthrough */
+	case TCA_MPLS_ACT_MODIFY:
+		if (tb[TCA_MPLS_LABEL]) {
+			val = rta_getattr_u32(tb[TCA_MPLS_LABEL]);
+			print_uint(PRINT_ANY, "label", " label %u", val);
+		}
+		if (tb[TCA_MPLS_TC]) {
+			val = rta_getattr_u8(tb[TCA_MPLS_TC]);
+			print_uint(PRINT_ANY, "tc", " tc %u", val);
+		}
+		if (tb[TCA_MPLS_BOS]) {
+			val = rta_getattr_u8(tb[TCA_MPLS_BOS]);
+			print_uint(PRINT_ANY, "bos", " bos %u", val);
+		}
+		if (tb[TCA_MPLS_TTL]) {
+			val = rta_getattr_u8(tb[TCA_MPLS_TTL]);
+			print_uint(PRINT_ANY, "ttl", " ttl %u", val);
+		}
+		break;
+	}
+	print_action_control(f, " ", parm->action, "");
+
+	print_uint(PRINT_ANY, "index", "\n\t index %u", parm->index);
+	print_int(PRINT_ANY, "ref", " ref %d", parm->refcnt);
+	print_int(PRINT_ANY, "bind", " bind %d", parm->bindcnt);
+
+	if (show_stats) {
+		if (tb[TCA_MPLS_TM]) {
+			struct tcf_t *tm = RTA_DATA(tb[TCA_MPLS_TM]);
+
+			print_tm(f, tm);
+		}
+	}
+
+	print_string(PRINT_FP, NULL, "%s", _SL_);
+
+	return 0;
+}
+
+struct action_util mpls_action_util = {
+	.id = "mpls",
+	.parse_aopt = parse_mpls,
+	.print_aopt = print_mpls,
+};
-- 
2.7.4


^ permalink raw reply related

* [PATCH iproute2-next v2 1/3] lib: add mpls_uc and mpls_mc as link layer protocol names
From: John Hurley @ 2019-07-10 12:40 UTC (permalink / raw)
  To: netdev
  Cc: davem, jiri, xiyou.wangcong, dsahern, willemdebruijn.kernel,
	stephen, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1562762440-25656-1-git-send-email-john.hurley@netronome.com>

Update the llproto_names array to allow users to reference the mpls
protocol ids with the names 'mpls_uc' for unicast MPLS and 'mpls_mc' for
multicast.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 lib/ll_proto.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/ll_proto.c b/lib/ll_proto.c
index 78c3961..2a0c1cb 100644
--- a/lib/ll_proto.c
+++ b/lib/ll_proto.c
@@ -78,6 +78,8 @@ __PF(TIPC,tipc)
 __PF(AOE,aoe)
 __PF(8021Q,802.1Q)
 __PF(8021AD,802.1ad)
+__PF(MPLS_UC,mpls_uc)
+__PF(MPLS_MC,mpls_mc)
 
 { 0x8100, "802.1Q" },
 { 0x88cc, "LLDP" },
-- 
2.7.4


^ permalink raw reply related

* [PATCH iproute2-next v2 0/3] add interface to TC MPLS actions
From: John Hurley @ 2019-07-10 12:40 UTC (permalink / raw)
  To: netdev
  Cc: davem, jiri, xiyou.wangcong, dsahern, willemdebruijn.kernel,
	stephen, simon.horman, jakub.kicinski, oss-drivers, John Hurley

Recent kernel additions to TC allows the manipulation of MPLS headers as
filter actions.

The following patchset creates an iproute2 interface to the new actions
and includes documentation on how to use it.

v1->v2:
- change error from print_string() to fprintf(strerr,) (Stephen Hemminger)
- split long line in explain() message (David Ahern)
- use _SL_ instead of /n in print message (David Ahern)

John Hurley (3):
  lib: add mpls_uc and mpls_mc as link layer protocol names
  tc: add mpls actions
  man: update man pages for TC MPLS actions

 lib/ll_proto.c     |   2 +
 man/man8/tc-mpls.8 | 156 ++++++++++++++++++++++++++++++
 tc/Makefile        |   1 +
 tc/m_mpls.c        | 276 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 435 insertions(+)
 create mode 100644 man/man8/tc-mpls.8
 create mode 100644 tc/m_mpls.c

-- 
2.7.4


^ permalink raw reply

* [PATCH net-next iproute2 v2 2/2] devlink: Introduce PCI PF and VF port flavour and attribute
From: Parav Pandit @ 2019-07-10 12:39 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern, jiri, Parav Pandit
In-Reply-To: <20190710123952.6877-1-parav@mellanox.com>

Introduce PCI PF and VF port flavour and port attributes such as PF
number and VF number.

$ devlink port show
pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0
pci/0000:05:00.0/1: type eth netdev eth1 flavour pcivf pfnum 0 vfnum 0
pci/0000:05:00.0/2: type eth netdev eth2 flavour pcivf pfnum 0 vfnum 1

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
Changelog:
v1->v2:
 - Instead of if-else using switch-case.
 - Split patch to two patches to have kernel header update in dedicated
   patch.
---
 devlink/devlink.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index ac8c0fb1..d8197ea3 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -2794,11 +2794,29 @@ static const char *port_flavour_name(uint16_t flavour)
 		return "cpu";
 	case DEVLINK_PORT_FLAVOUR_DSA:
 		return "dsa";
+	case DEVLINK_PORT_FLAVOUR_PCI_PF:
+		return "pcipf";
+	case DEVLINK_PORT_FLAVOUR_PCI_VF:
+		return "pcivf";
 	default:
 		return "<unknown flavour>";
 	}
 }
 
+static void pr_out_port_pfvf_num(struct dl *dl, struct nlattr **tb)
+{
+	uint16_t fn_num;
+
+	if (tb[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) {
+		fn_num = mnl_attr_get_u16(tb[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]);
+		pr_out_uint(dl, "pfnum", fn_num);
+	}
+	if (tb[DEVLINK_ATTR_PORT_PCI_VF_NUMBER]) {
+		fn_num = mnl_attr_get_u16(tb[DEVLINK_ATTR_PORT_PCI_VF_NUMBER]);
+		pr_out_uint(dl, "vfnum", fn_num);
+	}
+}
+
 static void pr_out_port(struct dl *dl, struct nlattr **tb)
 {
 	struct nlattr *pt_attr = tb[DEVLINK_ATTR_PORT_TYPE];
@@ -2828,6 +2846,15 @@ static void pr_out_port(struct dl *dl, struct nlattr **tb)
 				mnl_attr_get_u16(tb[DEVLINK_ATTR_PORT_FLAVOUR]);
 
 		pr_out_str(dl, "flavour", port_flavour_name(port_flavour));
+
+		switch (port_flavour) {
+		case DEVLINK_PORT_FLAVOUR_PCI_PF:
+		case DEVLINK_PORT_FLAVOUR_PCI_VF:
+			pr_out_port_pfvf_num(dl, tb);
+			break;
+		default:
+			break;
+		}
 	}
 	if (tb[DEVLINK_ATTR_PORT_NUMBER]) {
 		uint32_t port_number;
-- 
2.19.2


^ permalink raw reply related

* [PATCH net-next iproute2 v2 1/2] devlink: Update kernel header to commit
From: Parav Pandit @ 2019-07-10 12:39 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern, jiri, Parav Pandit
In-Reply-To: <20190701183017.25407-1-parav@mellanox.com>

Update kernel header to commit:
e41b6bf3cdd4 ("devlink: Introduce PCI VF port flavour and port attribute")

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 include/uapi/linux/devlink.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 6544824a..fc195cbd 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -169,6 +169,14 @@ enum devlink_port_flavour {
 	DEVLINK_PORT_FLAVOUR_DSA, /* Distributed switch architecture
 				   * interconnect port.
 				   */
+	DEVLINK_PORT_FLAVOUR_PCI_PF, /* Represents eswitch port for
+				      * the PCI PF. It is an internal
+				      * port that faces the PCI PF.
+				      */
+	DEVLINK_PORT_FLAVOUR_PCI_VF, /* Represents eswitch port
+				      * for the PCI VF. It is an internal
+				      * port that faces the PCI VF.
+				      */
 };
 
 enum devlink_param_cmode {
@@ -337,6 +345,9 @@ enum devlink_attr {
 	DEVLINK_ATTR_FLASH_UPDATE_STATUS_DONE,	/* u64 */
 	DEVLINK_ATTR_FLASH_UPDATE_STATUS_TOTAL,	/* u64 */
 
+	DEVLINK_ATTR_PORT_PCI_PF_NUMBER,	/* u16 */
+	DEVLINK_ATTR_PORT_PCI_VF_NUMBER,	/* u16 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
-- 
2.19.2


^ permalink raw reply related

* Re: [PATCH net-next v6 06/15] ethtool: netlink bitset handling
From: Michal Kubecek @ 2019-07-10 12:38 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, Johannes Berg,
	linux-kernel
In-Reply-To: <20190709141817.GE2301@nanopsycho.orion>

On Tue, Jul 09, 2019 at 04:18:17PM +0200, Jiri Pirko wrote:
> 
> I understand. So how about avoid the bitfield all together and just
> have array of either bits of strings or combinations?
> 
> ETHTOOL_CMD_SETTINGS_SET (U->K)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>        ETHTOOL_A_SETTINGS_PRIV_FLAG
>            ETHTOOL_A_FLAG_NAME = "legacy-rx"
> 	   ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> 
> or the same with index instead of string
> 
> ETHTOOL_CMD_SETTINGS_SET (U->K)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 0
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> 
> 
> For set you can combine both when you want to set multiple bits:
> 
> ETHTOOL_CMD_SETTINGS_SET (U->K)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 2
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 8
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> 
> 
> For get this might be a bit bigger message:
> 
> ETHTOOL_CMD_SETTINGS_GET_REPLY (K->U)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 0
>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 1
>             ETHTOOL_A_FLAG_NAME = "vf-ipsec"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 8
>             ETHTOOL_A_FLAG_NAME = "something-else"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)

This is perfect for "one shot" applications but not so much for long
running ones, either "ethtool --monitor" or management or monitoring
daemons. Repeating the names in every notification message would be
a waste, it's much more convenient to load the strings only once and
cache them. Even if we omit the names in notifications (and possibly the
GET replies if client opts for it), this format still takes 12-16 bytes
per bit.

So the problem I'm trying to address is that there are two types of
clients with very different mode of work and different preferences.

Looking at the bitset.c, I would rather say that most of the complexity
and ugliness comes from dealing with both unsigned long based bitmaps
and u32 based ones. Originally, there were functions working with
unsigned long based bitmaps and the variants with "32" suffix were
wrappers around them which converted u32 bitmaps to unsigned long ones
and back. This became a problem when kernel started issuing warnings
about variable length arrays as getting rid of them meant two kmalloc()
and two kfree() for each u32 bitmap operation, even if most of the
bitmaps are in rather short in practice.

Maybe the wrapper could do something like

int ethnl_put_bitset32(const u32 *value, const u32 *mask,
		       unsigned int size,  ...)
{
	unsigned long fixed_value[2], fixed_mask[2];
	unsigned long *tmp_value = fixed_value;
	unsigned long *tmp_mask = fixed_mask;

	if (size > sizeof(fixed_value) * BITS_PER_BYTE) {
		tmp_value = bitmap_alloc(size);
		if (!tmp_value)
			return -ENOMEM;
		tmp_mask = bitmap_alloc(size);
		if (!tmp_mask) {
			kfree(tmp_value);
			return -ENOMEM;
		}
	}

	bitmap_from_arr32(tmp_value, value, size);
	bitmap_from_arr32(tmp_mask, mask, size);
	ret = ethnl_put_bitset(tmp_value, tmp_mask, size, ...);
}

This way we would make bitset.c code cleaner while avoiding allocating
short bitmaps (which is the most common case). 

Michal

^ permalink raw reply

* Re: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"
From: David Ahern @ 2019-07-10 12:27 UTC (permalink / raw)
  To: Jan Szewczyk, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <AM6PR07MB56397A8BC53D9A525BC9C489F2F00@AM6PR07MB5639.eurprd07.prod.outlook.com>

[ adding netdev so others can chime in ]

On 7/10/19 2:28 AM, Jan Szewczyk wrote:
> Hi guys!
> 
> We can see different behavior of one of our commands that supposed to
> show pmtu information.
> 
> It’s using netlink message RTM_GETROUTE to get the information and in
> Linux kernel version 4.12 after sending big packet (and triggering
> “packet too big”) there is an entry with PMTU and expiration time.
> 
> In the version 4.18 unfortunately the entry looks different and there is
> no PMTU information.

Can you try with 4.19.58 (latest stable release for 4.19)? Perhaps there
was a bugfix that is missing from 4.18.

The kernel has 2 commands under tools/testing/selftests/net -- pmtu.sh
and icmp_redirect.sh -- that verify exceptions are created and use 'ip
ro get' to verify the mtu.


> 
> I can see that in your commit
> https://github.com/torvalds/linux/commit/d4ead6b34b67fd711639324b6465a050bcb197d4,
> these lines disappeared from route.c:
> 
>  
> 
>      if (rt->rt6i_pmtu)
> 
>            metrics[RTAX_MTU - 1] = rt->rt6i_pmtu;
> 
>  
> 
> I’m very beginner in linux kernel code, can you help me and tell me if
> that could cause this different behavior?
> 
>  
> 
>  
> 
> BR,
> 
> Jan Szewczyk
> 


^ permalink raw reply

* Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Gal Pressman @ 2019-07-10 12:19 UTC (permalink / raw)
  To: Michal Kalderon, ariel.elior, jgg, dledford; +Cc: linux-rdma, davem, netdev
In-Reply-To: <20190709141735.19193-2-michal.kalderon@marvell.com>

On 09/07/2019 17:17, Michal Kalderon wrote:
> Create some common API's for adding entries to a xa_mmap.
> Searching for an entry and freeing one.
> 
> The code was copied from the efa driver almost as is, just renamed
> function to be generic and not efa specific.
> 
> Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>

Reviewed-by: Gal Pressman <galpress@amazon.com>

^ permalink raw reply

* Re: [PATCH net-next v6 04/15] ethtool: introduce ethtool netlink interface
From: Michal Kubecek @ 2019-07-10 12:12 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, Johannes Berg,
	linux-kernel
In-Reply-To: <20190709134212.GD2301@nanopsycho.orion>

On Tue, Jul 09, 2019 at 03:42:12PM +0200, Jiri Pirko wrote:
> Mon, Jul 08, 2019 at 10:22:19PM CEST, mkubecek@suse.cz wrote:
> >On Mon, Jul 08, 2019 at 09:26:29PM +0200, Jiri Pirko wrote:
> >> Mon, Jul 08, 2019 at 07:27:29PM CEST, mkubecek@suse.cz wrote:
> >> >
> >> >There are two reasons for this design. First is to reduce the number of
> >> >requests needed to get the information. This is not so much a problem of
> >> >ethtool itself; the only existing commands that would result in multiple
> >> >request messages would be "ethtool <dev>" and "ethtool -s <dev>". Maybe
> >> >also "ethtool -x/-X <dev>" but even if the indirection table and hash
> >> >key have different bits assigned now, they don't have to be split even
> >> >if we split other commands. It may be bigger problem for daemons wanting
> >> >to keep track of system configuration which would have to issue many
> >> >requests whenever a new device appears.
> >> >
> >> >Second reason is that with 8-bit genetlink command/message id, the space
> >> >is not as infinite as it might seem. I counted quickly, right now the
> >> >full series uses 14 ids for kernel messages, with split you propose it
> >> >would most likely grow to 44. For full implementation of all ethtool
> >> >functionality, we could get to ~60 ids. It's still only 1/4 of the
> >> >available space but it's not clear what the future development will look
> >> >like. We would certainly need to be careful not to start allocating new
> >> >commands for single parameters and try to be foreseeing about what can
> >> >be grouped together. But we will need to do that in any case.
> >> >
> >> >On kernel side, splitting existing messages would make some things a bit
> >> >easier. It would also reduce the number of scenarios where only part of
> >> >requested information is available or only part of a SET request fails.
> >> 
> >> Okay, I got your point. So why don't we look at if from the other angle.
> >> Why don't we have only single get/set command that would be in general
> >> used to get/set ALL info from/to the kernel. Where we can have these
> >> bits (perhaps rather varlen bitfield) to for user to indicate which data
> >> is he interested in? This scales. The other commands would be
> >> just for action.
> >> 
> >> Something like RTM_GETLINK/RTM_SETLINK. Makes sense?
> >
> >It's certainly an option but at the first glance it seems as just moving
> >what I tried to avoid one level lower. It would work around the u8 issue
> >(but as Johannes pointed out, we can handle it with genetlink when/if
> >the time comes). We would almost certainly have to split the replies
> >into multiple messages to keep the packet size reasonable. I'll have to
> >think more about the consequences for both kernel and userspace.
> >
> >My gut feeling is that out of the two extreme options (one universal
> >message type and message types corresponding to current infomask bits),
> >the latter is more appealing. After all, ethtool has been gathering
> >features that would need those ~60 message types for 20 years.
> 
> Yeah, but I think that we have to do one or another. Anything in between
> makes the code complex and uapi confusing. Let's start clean :)

I'll split the messages for v7.

Michal

^ permalink raw reply

* Re: [PATCH v6 rdma-next 2/6] RDMA/efa: Use the common mmap_xa helpers
From: Gal Pressman @ 2019-07-10 12:09 UTC (permalink / raw)
  To: Michal Kalderon, ariel.elior, jgg, dledford; +Cc: linux-rdma, davem, netdev
In-Reply-To: <20190709141735.19193-3-michal.kalderon@marvell.com>

On 09/07/2019 17:17, Michal Kalderon wrote:
> Remove the functions related to managing the mmap_xa database.
> This code was copied to the ib_core. Use the common API's instead.
> 
> Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>

Thanks Michal,
Acked-by: Gal Pressman <galpress@amazon.com>

^ permalink raw reply

* Re: [PATCH v12 1/5] can: m_can: Create a m_can platform framework
From: Dan Murphy @ 2019-07-10 12:08 UTC (permalink / raw)
  To: wg, mkl, davem; +Cc: linux-can, netdev, linux-kernel
In-Reply-To: <6fa79302-ad32-7f43-f9d5-af70aa789284@ti.com>

Hello

On 6/17/19 10:09 AM, Dan Murphy wrote:
> Marc
>
> On 6/10/19 11:35 AM, Dan Murphy wrote:
>> Bump
>>
>> On 6/6/19 8:16 AM, Dan Murphy wrote:
>>> Marc
>>>
>>> Bump
>>>
>>> On 5/31/19 6:51 AM, Dan Murphy wrote:
>>>> Marc
>>>>
>>>> On 5/15/19 3:54 PM, Dan Murphy wrote:
>>>>> Marc
>>>>>
>>>>> On 5/9/19 11:11 AM, Dan Murphy wrote:
>>>>>> Create a m_can platform framework that peripheral
>>>>>> devices can register to and use common code and register sets.
>>>>>> The peripheral devices may provide read/write and configuration
>>>>>> support of the IP.
>>>>>>
>>>>>> Acked-by: Wolfgang Grandegger <wg@grandegger.com>
>>>>>> Signed-off-by: Dan Murphy <dmurphy@ti.com>
>>>>>> ---
>>>>>>
>>>>>> v12 - Update the m_can_read/write functions to create a backtrace 
>>>>>> if the callback
>>>>>> pointer is NULL. - https://lore.kernel.org/patchwork/patch/1052302/
>>>>>>
>>>>> Is this able to be merged now?
>>>>
>>>> ping
>
> Wondering if there is anything else we need to do?
>
> The part has officially shipped and we had hoped to have driver 
> support in Linux as part of the announcement.
>
Is this being sent in a PR for 5.3?

Dan


> Dan
>
>
>>>>
>>>>
>>>>> Dan
>>>>>
>>>>> <snip>

^ permalink raw reply

* [PATCH v2 bpf] selftests/bpf: fix bpf_target_sparc check
From: Ilya Leoshkevich @ 2019-07-10 11:56 UTC (permalink / raw)
  To: bpf, netdev; +Cc: andrii.nakryiko, Ilya Leoshkevich

bpf_helpers.h fails to compile on sparc: the code should be checking
for defined(bpf_target_sparc), but checks simply for bpf_target_sparc.

Also change #ifdef bpf_target_powerpc to #if defined() for consistency.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---

v1->v2: bpf_target_powerpc change

 tools/testing/selftests/bpf/bpf_helpers.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 5f6f9e7aba2a..0214797518ce 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -440,10 +440,10 @@ static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
 
 #endif
 
-#ifdef bpf_target_powerpc
+#if defined(bpf_target_powerpc)
 #define BPF_KPROBE_READ_RET_IP(ip, ctx)		({ (ip) = (ctx)->link; })
 #define BPF_KRETPROBE_READ_RET_IP		BPF_KPROBE_READ_RET_IP
-#elif bpf_target_sparc
+#elif defined(bpf_target_sparc)
 #define BPF_KPROBE_READ_RET_IP(ip, ctx)		({ (ip) = PT_REGS_RET(ctx); })
 #define BPF_KRETPROBE_READ_RET_IP		BPF_KPROBE_READ_RET_IP
 #else
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH iproute2 master 1/3] devlink: Change devlink health dump show command to dumpit
From: Jiri Pirko @ 2019-07-10 11:47 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: Stephen Hemminger, netdev, moshe, ayal
In-Reply-To: <1562756601-19171-2-git-send-email-tariqt@mellanox.com>

Wed, Jul 10, 2019 at 01:03:19PM CEST, tariqt@mellanox.com wrote:
>From: Aya Levin <ayal@mellanox.com>
>
>Although devlink health dump show command is given per reporter, it
>returns large amounts of data. Trying to use the doit cb results in
>OUT-OF-BUFFER error. This complementary patch raises the DUMP flag in
>order to invoke the dumpit cb. We're safe as no existing drivers
>implement the dump health reporter option yet.
>
>Fixes: 041e6e651a8e ("devlink: Add devlink health dump show command")
>Signed-off-by: Aya Levin <ayal@mellanox.com>
>Signed-off-by: Tariq Toukan <tariqt@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH v3 bpf-next 3/4] selftests/bpf: make PT_REGS_* work in userspace
From: Ilya Leoshkevich @ 2019-07-10 11:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Stanislav Fomichev, Y Song, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Adrian Ratiu, david.daney
In-Reply-To: <CAEf4BzY0kO2si_ajouYNfsauaWdHkj042++bLaHe1W_G885i=g@mail.gmail.com>

> Am 09.07.2019 um 19:48 schrieb Andrii Nakryiko <andrii.nakryiko@gmail.com>:
> 
> On Tue, Jul 9, 2019 at 8:19 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>> 
>> Right now, on certain architectures, these macros are usable only with
>> kernel headers. This patch makes it possible to use them with userspace
>> headers and, as a consequence, not only in BPF samples, but also in BPF
>> selftests.
>> 
>> On s390, provide the forward declaration of struct pt_regs and cast it
>> to user_pt_regs in PT_REGS_* macros. This is necessary, because instead
>> of the full struct pt_regs, s390 exposes only its first member
>> user_pt_regs to userspace, and bpf_helpers.h is used with both userspace
>> (in selftests) and kernel (in samples) headers. It was added in commit
>> 466698e654e8 ("s390/bpf: correct broken uapi for
>> BPF_PROG_TYPE_PERF_EVENT program type").
>> 
>> Ditto on arm64.
>> 
>> On x86, provide userspace versions of PT_REGS_* macros. Unlike s390 and
>> arm64, x86 provides struct pt_regs to both userspace and kernel, however,
>> with different member names.
>> 
>> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
>> ---
> 
> Just curious, what did you use as a reference for which register
> corresponds to which PARM, RET, etc for different archs? I've tried to
> look it up the other day, and it wasn't as straightforward to find as
> I hoped for, so maybe I'm missing something obvious.

For this particular change I did not have to look it up, because it all
was already in the code, I just needed to adapt it to userspace headers.
Normally I would google for „abi supplement“ to find this information.
A lazy way would be to simply ask the (cross-)compiler:

cat <<HERE | aarch64-linux-gnu-gcc -x c -O3 -S - -o -
int f(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j);
int g() { return -f(1, 2, 3, 4, 5, 6, 7, 8, 9, 10); }
HERE

I’ve just double checked the supported arches, and noticed that:

#define PT_REGS_PARM5(x) ((x)->uregs[4])
for bpf_target_arm (arm-linux-gnueabihf) looks wrong:
the 5th parameter should be passed on stack. This observation matches
[1].

#define PT_REGS_RC(x) ((x)->regs[1])
for bpf_target_mips (mips64el-linux-gnuabi64) also looks wrong:
the return value should be in register 2. This observation matches [2].

Since I’m not an expert on those architectures, my conclusions could be
incorrect (e.g. becase a different ABI is normally used in practice).
Adrian and David, could you please correct me if I’m wrong?

[1] http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf
[2] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf

>> tools/testing/selftests/bpf/bpf_helpers.h | 61 +++++++++++++++--------
>> 1 file changed, 41 insertions(+), 20 deletions(-)
>> 
>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> index 73071a94769a..212ec564e5c3 100644
>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>> @@ -358,6 +358,7 @@ static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
>> 
>> #if defined(bpf_target_x86)
>> 
>> +#ifdef __KERNEL__
>> #define PT_REGS_PARM1(x) ((x)->di)
>> #define PT_REGS_PARM2(x) ((x)->si)
>> #define PT_REGS_PARM3(x) ((x)->dx)
>> @@ -368,19 +369,35 @@ static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
>> #define PT_REGS_RC(x) ((x)->ax)
>> #define PT_REGS_SP(x) ((x)->sp)
>> #define PT_REGS_IP(x) ((x)->ip)
>> +#else
>> +#define PT_REGS_PARM1(x) ((x)->rdi)
>> +#define PT_REGS_PARM2(x) ((x)->rsi)
>> +#define PT_REGS_PARM3(x) ((x)->rdx)
>> +#define PT_REGS_PARM4(x) ((x)->rcx)
>> +#define PT_REGS_PARM5(x) ((x)->r8)
>> +#define PT_REGS_RET(x) ((x)->rsp)
>> +#define PT_REGS_FP(x) ((x)->rbp)
>> +#define PT_REGS_RC(x) ((x)->rax)
>> +#define PT_REGS_SP(x) ((x)->rsp)
>> +#define PT_REGS_IP(x) ((x)->rip)
> 
> Will this also work for 32-bit x86?

Thanks, this is a good catch: this builds, but makes 64-bit accesses, as if it used the 64-bit
variant of pt_regs. I will fix this.

^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Toke Høiland-Jørgensen @ 2019-07-10 11:39 UTC (permalink / raw)
  To: Ido Schimmel, Jakub Kicinski
  Cc: David Miller, netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy,
	pablo, pieter.jansenvanvuuren, andrew, f.fainelli, vivien.didelot,
	idosch, Alexei Starovoitov
In-Reply-To: <20190710112011.GA552@splinter>

Ido Schimmel <idosch@idosch.org> writes:

> On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
>> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
>> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
>> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
>> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
>> > > > > From: Ido Schimmel <idosch@idosch.org>
>> > > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
>> > > > >     
>> > > > > > Users have several ways to debug the kernel and understand why a packet
>> > > > > > was dropped. For example, using "drop monitor" and "perf". Both
>> > > > > > utilities trace kfree_skb(), which is the function called when a packet
>> > > > > > is freed as part of a failure. The information provided by these tools
>> > > > > > is invaluable when trying to understand the cause of a packet loss.
>> > > > > > 
>> > > > > > In recent years, large portions of the kernel data path were offloaded
>> > > > > > to capable devices. Today, it is possible to perform L2 and L3
>> > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
>> > > > > > Different TC classifiers and actions are also offloaded to capable
>> > > > > > devices, at both ingress and egress.
>> > > > > > 
>> > > > > > However, when the data path is offloaded it is not possible to achieve
>> > > > > > the same level of introspection as tools such "perf" and "drop monitor"
>> > > > > > become irrelevant.
>> > > > > > 
>> > > > > > This patchset aims to solve this by allowing users to monitor packets
>> > > > > > that the underlying device decided to drop along with relevant metadata
>> > > > > > such as the drop reason and ingress port.    
>> > > > > 
>> > > > > We are now going to have 5 or so ways to capture packets passing through
>> > > > > the system, this is nonsense.
>> > > > > 
>> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
>> > > > > devlink thing.
>> > > > > 
>> > > > > This is insanity, too many ways to do the same thing and therefore the
>> > > > > worst possible user experience.
>> > > > > 
>> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
>> > > > > XDP perf events, and these taps there too.
>> > > > > 
>> > > > > I mean really, think about it from the average user's perspective.  To
>> > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
>> > > > > listen on devlink but configure a special tap thing beforehand and then
>> > > > > if someone is using XDP I gotta setup another perf event buffer capture
>> > > > > thing too.    
>> > > > 
>> > > > Let me try to explain again because I probably wasn't clear enough. The
>> > > > devlink-trap mechanism is not doing the same thing as other solutions.
>> > > > 
>> > > > The packets we are capturing in this patchset are packets that the
>> > > > kernel (the CPU) never saw up until now - they were silently dropped by
>> > > > the underlying device performing the packet forwarding instead of the
>> > > > CPU.  
>> > 
>> > Jakub,
>> > 
>> > It seems to me that most of the criticism is about consolidation of
>> > interfaces because you believe I'm doing something you can already do
>> > today, but this is not the case.
>> 
>> To be clear I'm not opposed to the patches, I'm just trying to
>> facilitate a discussion.
>
> Sure, sorry if it came out the wrong way. I appreciate your feedback and
> the time you have spent on this subject.
>
>> 
>> > Switch ASICs have dedicated traps for specific packets. Usually, these
>> > packets are control packets (e.g., ARP, BGP) which are required for the
>> > correct functioning of the control plane. You can see this in the SAI
>> > interface, which is an abstraction layer over vendors' SDKs:
>> > 
>> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
>> > 
>> > We need to be able to configure the hardware policers of these traps and
>> > read their statistics to understand how many packets they dropped. We
>> > currently do not have a way to do any of that and we rely on hardcoded
>> > defaults in the driver which do not fit every use case (from
>> > experience):
>> > 
>> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
>> > 
>> > We plan to extend devlink-trap mechanism to cover all these use cases. I
>> > hope you agree that this functionality belongs in devlink given it is a
>> > device-specific configuration and not a netdev-specific one.
>> 
>> No disagreement on providing knobs for traps.
>> 
>> > That being said, in its current form, this mechanism is focused on traps
>> > that correlate to packets the device decided to drop as this is very
>> > useful for debugging.
>> 
>> That'd be mixing two things - trap configuration and tracing exceptions
>> in one API. That's a little suboptimal but not too terrible, especially
>> if there is a higher level APIs users can default to.
>
> TBH, initially I was only focused on the drops, but then it occurred to
> me that this is a too narrow scope. These traps are only a subset of the
> complete list of traps we have and we have similar requirements for both
> (statistics, setting policers etc.). Therefore, I decided to design this
> interface in a more generic way, so that it could support the different
> use cases.
>
>> 
>> > Given that the entire configuration is done via devlink and that devlink
>> > stores all the information about these traps, it seems logical to also
>> > report these packets and their metadata to user space as devlink events.
>> > 
>> > If this is not desirable, we can try to call into drop_monitor from
>> > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
>> > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
>> > 
>> > IMO, this is less desirable, as instead of having one tool (devlink) to
>> > interact with this mechanism we will need two (devlink & dropwatch).
>> > 
>> > Below I tried to answer all your questions and refer to all the points
>> > you brought up.
>> > 
>> > > When you say silently dropped do you mean that mlxsw as of today
>> > > doesn't have any counters exposed for those events?  
>> > 
>> > Some of these packets are counted, but not all of them.
>> > 
>> > > If we wanted to consolidate this into something existing we can either
>> > >  (a) add similar traps in the kernel data path;
>> > >  (b) make these traps extension of statistics.
>> > > 
>> > > My knee jerk reaction to seeing the patches was that it adds a new
>> > > place where device statistics are reported.  
>> > 
>> > Not at all. This would be a step back. We can already count discards due
>> > to VLAN membership on ingress on a per-port basis. A software maintained
>> > global counter does not buy us anything.
>> > 
>> > By also getting the dropped packet - coupled with the drop reason and
>> > ingress port - you can understand exactly why and on which VLAN the
>> > packet was dropped. I wrote a Wireshark dissector for these netlink
>> > packets to make our life easier. You can see the details in my comment
>> > to the cover letter:
>> > 
>> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
>> > 
>> > In case you do not care about individual packets, but still want more
>> > fine-grained statistics for your monitoring application, you can use
>> > eBPF. For example, one thing we did is attaching a kprobe to
>> > devlink_trap_report() with an eBPF program that dissects the incoming
>> > skbs and maintains a counter per-{5 tuple, drop reason}. With
>> > ebpf_exporter you can export these statistics to Prometheus on which you
>> > can run queries and visualize the results with Grafana. This is
>> > especially useful for tail and early drops since it allows you to
>> > understand which flows contribute to most of the drops.
>> 
>> No question that the mechanism is useful.
>> 
>> > > Users who want to know why things are dropped will not get detailed
>> > > breakdown from ethtool -S which for better or worse is the one stop
>> > > shop for device stats today.  
>> > 
>> > I hope I managed to explain why counters are not enough, but I also want
>> > to point out that ethtool statistics are not properly documented and
>> > this hinders their effectiveness. I did my best to document the exposed
>> > traps in order to avoid the same fate:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128585/
>> > 
>> > In addition, there are selftests to show how each trap can be triggered
>> > to reduce the ambiguity even further:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128610/
>> > 
>> > And a note in the documentation to make sure future functionality is
>> > tested as well:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128608/
>> > 
>> > > Having thought about it some more, however, I think that having a
>> > > forwarding "exception" object and hanging statistics off of it is a
>> > > better design, even if we need to deal with some duplication to get
>> > > there.
>> > > 
>> > > IOW having an way to "trap all packets which would increment a
>> > > statistic" (option (b) above) is probably a bad design.
>> > > 
>> > > As for (a) I wonder how many of those events have a corresponding event
>> > > in the kernel stack?  
>> > 
>> > Generic packet drops all have a corresponding kfree_skb() calls in the
>> > kernel, but that does not mean that every packet dropped by the hardware
>> > would also be dropped by the kernel if it were to be injected to its Rx
>> > path.
>> 
>> The notion that all SW events get captured by kfree_skb() would not be
>> correct.
>
> I meant that the generic drop reasons I'm exposing with this patchset
> all correspond to reasons for which the kernel would drop packets.
>
>> We have the kfree_skb(), and xdp_exception(), and drivers can
>> drop packets if various allocations fail.. the situation is already not
>> great.
>> 
>> I think that having a single useful place where users can look to see
>> all traffic exception events would go a long way. 
>
> I believe this was Dave's point as well. We have one tool to monitor
> kfree_skb() drops and with this patchset we will have another to monitor
> HW drops. As I mentioned in my previous reply, I will look into sending
> the events via drop_monitor by calling into it from devlink.
>
> I'm not involved with XDP (as you might have noticed), but I assume
> drop_monitor could be extended for this use case as well by doing
> register_trace_xdp_exception(). Then you could monitor SW, HW and XDP
> events using a single netlink channel, potentially split into different
> multicast groups to allow user space programs to receive only the events
> they care about.

Yes, having XDP hook into this would be good! This ties in nicely with
the need for better statistics for XDP (see
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#consistency-for-statistics-with-xdp).

Unfortunately, it's not enough to just hook into the xdp_exception trace
point, as that is generally only triggered on XDP_ABORTED, not if the
XDP program just decides to drop the packet with XDP_DROP. So we may
need another mechanism to hook this if it's going to comprehensive; and
we'd probably want a way to do this that doesn't impose a performance
penalty when the drop monitor is not enabled...

-Toke

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: update BPF JIT S390 maintainers
From: Vasily Gorbik @ 2019-07-10 11:34 UTC (permalink / raw)
  To: David S. Miller, Alexei Starovoitov, Daniel Borkmann
  Cc: Heiko Carstens, Christian Borntraeger, Ilya Leoshkevich, netdev,
	bpf, linux-s390
In-Reply-To: <patch.git-d365382dfc69.your-ad-here.call-01562755343-ext-3127@work.hours>

On Wed, Jul 10, 2019 at 12:43:45PM +0200, Vasily Gorbik wrote:
> Ilya Leoshkevich is joining as s390 bpf maintainer. With his background
> as gcc developer he would be valuable for the team and community as a
> whole. Ilya, have fun!
> 
> Since there is now enough eyes on s390 bpf, relieve Christian Borntraeger,
> so that he could focus on his maintainer tasks for other components.
> 
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 558acf24ea1e..98e7411dfe56 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3066,9 +3066,9 @@ S:	Maintained
>  F:	arch/riscv/net/
>  
>  BPF JIT for S390
> +M:	Ilya Leoshkevich <iii@linux.ibm.com>
>  M:	Heiko Carstens <heiko.carstens@de.ibm.com>
>  M:	Vasily Gorbik <gor@linux.ibm.com>
> -M:	Christian Borntraeger <borntraeger@de.ibm.com>
>  L:	netdev@vger.kernel.org
>  L:	bpf@vger.kernel.org
>  S:	Maintained
> -- 
> 2.21.0

Dave, Alexei, Daniel,
would you take it via one of your trees? Or should I take it via s390?

Thanks,
Vasily


^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Ido Schimmel @ 2019-07-10 11:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy,
	pablo, pieter.jansenvanvuuren, andrew, f.fainelli, vivien.didelot,
	idosch, Alexei Starovoitov
In-Reply-To: <20190709153430.5f0f5295@cakuba.netronome.com>

On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
> > > > > From: Ido Schimmel <idosch@idosch.org>
> > > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > > > >     
> > > > > > Users have several ways to debug the kernel and understand why a packet
> > > > > > was dropped. For example, using "drop monitor" and "perf". Both
> > > > > > utilities trace kfree_skb(), which is the function called when a packet
> > > > > > is freed as part of a failure. The information provided by these tools
> > > > > > is invaluable when trying to understand the cause of a packet loss.
> > > > > > 
> > > > > > In recent years, large portions of the kernel data path were offloaded
> > > > > > to capable devices. Today, it is possible to perform L2 and L3
> > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > > > > Different TC classifiers and actions are also offloaded to capable
> > > > > > devices, at both ingress and egress.
> > > > > > 
> > > > > > However, when the data path is offloaded it is not possible to achieve
> > > > > > the same level of introspection as tools such "perf" and "drop monitor"
> > > > > > become irrelevant.
> > > > > > 
> > > > > > This patchset aims to solve this by allowing users to monitor packets
> > > > > > that the underlying device decided to drop along with relevant metadata
> > > > > > such as the drop reason and ingress port.    
> > > > > 
> > > > > We are now going to have 5 or so ways to capture packets passing through
> > > > > the system, this is nonsense.
> > > > > 
> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > > > > devlink thing.
> > > > > 
> > > > > This is insanity, too many ways to do the same thing and therefore the
> > > > > worst possible user experience.
> > > > > 
> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > > > > XDP perf events, and these taps there too.
> > > > > 
> > > > > I mean really, think about it from the average user's perspective.  To
> > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > > > > listen on devlink but configure a special tap thing beforehand and then
> > > > > if someone is using XDP I gotta setup another perf event buffer capture
> > > > > thing too.    
> > > > 
> > > > Let me try to explain again because I probably wasn't clear enough. The
> > > > devlink-trap mechanism is not doing the same thing as other solutions.
> > > > 
> > > > The packets we are capturing in this patchset are packets that the
> > > > kernel (the CPU) never saw up until now - they were silently dropped by
> > > > the underlying device performing the packet forwarding instead of the
> > > > CPU.  
> > 
> > Jakub,
> > 
> > It seems to me that most of the criticism is about consolidation of
> > interfaces because you believe I'm doing something you can already do
> > today, but this is not the case.
> 
> To be clear I'm not opposed to the patches, I'm just trying to
> facilitate a discussion.

Sure, sorry if it came out the wrong way. I appreciate your feedback and
the time you have spent on this subject.

> 
> > Switch ASICs have dedicated traps for specific packets. Usually, these
> > packets are control packets (e.g., ARP, BGP) which are required for the
> > correct functioning of the control plane. You can see this in the SAI
> > interface, which is an abstraction layer over vendors' SDKs:
> > 
> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
> > 
> > We need to be able to configure the hardware policers of these traps and
> > read their statistics to understand how many packets they dropped. We
> > currently do not have a way to do any of that and we rely on hardcoded
> > defaults in the driver which do not fit every use case (from
> > experience):
> > 
> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
> > 
> > We plan to extend devlink-trap mechanism to cover all these use cases. I
> > hope you agree that this functionality belongs in devlink given it is a
> > device-specific configuration and not a netdev-specific one.
> 
> No disagreement on providing knobs for traps.
> 
> > That being said, in its current form, this mechanism is focused on traps
> > that correlate to packets the device decided to drop as this is very
> > useful for debugging.
> 
> That'd be mixing two things - trap configuration and tracing exceptions
> in one API. That's a little suboptimal but not too terrible, especially
> if there is a higher level APIs users can default to.

TBH, initially I was only focused on the drops, but then it occurred to
me that this is a too narrow scope. These traps are only a subset of the
complete list of traps we have and we have similar requirements for both
(statistics, setting policers etc.). Therefore, I decided to design this
interface in a more generic way, so that it could support the different
use cases.

> 
> > Given that the entire configuration is done via devlink and that devlink
> > stores all the information about these traps, it seems logical to also
> > report these packets and their metadata to user space as devlink events.
> > 
> > If this is not desirable, we can try to call into drop_monitor from
> > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
> > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
> > 
> > IMO, this is less desirable, as instead of having one tool (devlink) to
> > interact with this mechanism we will need two (devlink & dropwatch).
> > 
> > Below I tried to answer all your questions and refer to all the points
> > you brought up.
> > 
> > > When you say silently dropped do you mean that mlxsw as of today
> > > doesn't have any counters exposed for those events?  
> > 
> > Some of these packets are counted, but not all of them.
> > 
> > > If we wanted to consolidate this into something existing we can either
> > >  (a) add similar traps in the kernel data path;
> > >  (b) make these traps extension of statistics.
> > > 
> > > My knee jerk reaction to seeing the patches was that it adds a new
> > > place where device statistics are reported.  
> > 
> > Not at all. This would be a step back. We can already count discards due
> > to VLAN membership on ingress on a per-port basis. A software maintained
> > global counter does not buy us anything.
> > 
> > By also getting the dropped packet - coupled with the drop reason and
> > ingress port - you can understand exactly why and on which VLAN the
> > packet was dropped. I wrote a Wireshark dissector for these netlink
> > packets to make our life easier. You can see the details in my comment
> > to the cover letter:
> > 
> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> > 
> > In case you do not care about individual packets, but still want more
> > fine-grained statistics for your monitoring application, you can use
> > eBPF. For example, one thing we did is attaching a kprobe to
> > devlink_trap_report() with an eBPF program that dissects the incoming
> > skbs and maintains a counter per-{5 tuple, drop reason}. With
> > ebpf_exporter you can export these statistics to Prometheus on which you
> > can run queries and visualize the results with Grafana. This is
> > especially useful for tail and early drops since it allows you to
> > understand which flows contribute to most of the drops.
> 
> No question that the mechanism is useful.
> 
> > > Users who want to know why things are dropped will not get detailed
> > > breakdown from ethtool -S which for better or worse is the one stop
> > > shop for device stats today.  
> > 
> > I hope I managed to explain why counters are not enough, but I also want
> > to point out that ethtool statistics are not properly documented and
> > this hinders their effectiveness. I did my best to document the exposed
> > traps in order to avoid the same fate:
> > 
> > https://patchwork.ozlabs.org/patch/1128585/
> > 
> > In addition, there are selftests to show how each trap can be triggered
> > to reduce the ambiguity even further:
> > 
> > https://patchwork.ozlabs.org/patch/1128610/
> > 
> > And a note in the documentation to make sure future functionality is
> > tested as well:
> > 
> > https://patchwork.ozlabs.org/patch/1128608/
> > 
> > > Having thought about it some more, however, I think that having a
> > > forwarding "exception" object and hanging statistics off of it is a
> > > better design, even if we need to deal with some duplication to get
> > > there.
> > > 
> > > IOW having an way to "trap all packets which would increment a
> > > statistic" (option (b) above) is probably a bad design.
> > > 
> > > As for (a) I wonder how many of those events have a corresponding event
> > > in the kernel stack?  
> > 
> > Generic packet drops all have a corresponding kfree_skb() calls in the
> > kernel, but that does not mean that every packet dropped by the hardware
> > would also be dropped by the kernel if it were to be injected to its Rx
> > path.
> 
> The notion that all SW events get captured by kfree_skb() would not be
> correct.

I meant that the generic drop reasons I'm exposing with this patchset
all correspond to reasons for which the kernel would drop packets.

> We have the kfree_skb(), and xdp_exception(), and drivers can
> drop packets if various allocations fail.. the situation is already not
> great.
> 
> I think that having a single useful place where users can look to see
> all traffic exception events would go a long way. 

I believe this was Dave's point as well. We have one tool to monitor
kfree_skb() drops and with this patchset we will have another to monitor
HW drops. As I mentioned in my previous reply, I will look into sending
the events via drop_monitor by calling into it from devlink.

I'm not involved with XDP (as you might have noticed), but I assume
drop_monitor could be extended for this use case as well by doing
register_trace_xdp_exception(). Then you could monitor SW, HW and XDP
events using a single netlink channel, potentially split into different
multicast groups to allow user space programs to receive only the events
they care about.

> Software side as I mentioned is pretty brutal, IDK how many users are
> actually willing to decode stack traces to figure out why their system
> is dropping packets :/
> 
> > In my reply to Dave I gave buffer drops as an example.
> 
> The example of buffer drops is also probably the case where having the
> packet is least useful, but yes, I definitely agree devices need a way
> of reporting events that can't happen in SW.
> 
> > There are also situations in which packets can be dropped due to
> > device-specific exceptions and these do not have a corresponding drop
> > reason in the kernel. See example here:
> > 
> > https://patchwork.ozlabs.org/patch/1128587/
> > 
> > > If we could add corresponding trace points and just feed those from
> > > the device driver, that'd obviously be a holy grail.  
> > 
> > Unlike tracepoints, netlink gives you a structured and extensible
> > interface. For example, in Spectrum-1 we cannot provide the Tx port for
> > early/tail drops, whereas for Spectrum-2 and later we can. With netlink,
> > we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for
> > Spectrum-1. You also get a programmatic interface that you can query for
> > this information:
> > 
> > # devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
> > netdevsim/netdevsim10:
> >   name ingress_vlan_filter type drop generic true report false action drop group l2_drops
> >     metadata:
> >        input_port
> 
> Right, you can set or not set skb fields to some extent but its
> definitely not as flexible as netlink.

^ permalink raw reply

* Re: [PATCH bpf-next] net: Don't uninstall an XDP program when none is installed
From: Maxim Mikityanskiy @ 2019-07-10 11:16 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller
  Cc: Daniel Borkmann, Björn Töpel, bpf@vger.kernel.org,
	netdev@vger.kernel.org, Saeed Mahameed, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song
In-Reply-To: <20190612161405.24064-1-maximmi@mellanox.com>

On 2019-06-12 19:14, Maxim Mikityanskiy wrote:
> dev_change_xdp_fd doesn't perform any checks in case it uninstalls an
> XDP program. It means that the driver's ndo_bpf can be called with
> XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This
> case happens if the user runs `ip link set eth0 xdp off` when there is
> no XDP program attached.
> 
> The drivers typically perform some heavy operations on XDP_SETUP_PROG,
> so they all have to handle this case internally to return early if it
> happens. This patch puts this check into the kernel code, so that all
> drivers will benefit from it.
> 
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> ---
> Björn, please take a look at this, Saeed told me you were doing
> something related, but I couldn't find it. If this fix is already
> covered by your work, please tell about that.
> 
>   net/core/dev.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 66f7508825bd..68b3e3320ceb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -8089,6 +8089,9 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
>   			bpf_prog_put(prog);
>   			return -EINVAL;
>   		}
> +	} else {
> +		if (!__dev_xdp_query(dev, bpf_op, query))
> +			return 0;
>   	}
>   
>   	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
> 

Alexei, so what about this patch? It's marked as "Changed Requested" in 
patchwork, but Jakub's point looks resolved - I don't see any changes 
required from my side.

^ permalink raw reply

* [PATCH iproute2 master 1/3] devlink: Change devlink health dump show command to dumpit
From: Tariq Toukan @ 2019-07-10 11:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, moshe, ayal, Tariq Toukan
In-Reply-To: <1562756601-19171-1-git-send-email-tariqt@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

Although devlink health dump show command is given per reporter, it
returns large amounts of data. Trying to use the doit cb results in
OUT-OF-BUFFER error. This complementary patch raises the DUMP flag in
order to invoke the dumpit cb. We're safe as no existing drivers
implement the dump health reporter option yet.

Fixes: 041e6e651a8e ("devlink: Add devlink health dump show command")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 devlink/devlink.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index ac8c0fb149b6..e3e1e27ab312 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -6078,13 +6078,13 @@ static int cmd_fmsg_object_cb(const struct nlmsghdr *nlh, void *data)
 	return MNL_CB_OK;
 }
 
-static int cmd_health_object_common(struct dl *dl, uint8_t cmd)
+static int cmd_health_object_common(struct dl *dl, uint8_t cmd, uint16_t flags)
 {
 	struct fmsg_cb_data data;
 	struct nlmsghdr *nlh;
 	int err;
 
-	nlh = mnlg_msg_prepare(dl->nlg, cmd,  NLM_F_REQUEST | NLM_F_ACK);
+	nlh = mnlg_msg_prepare(dl->nlg, cmd, flags | NLM_F_REQUEST | NLM_F_ACK);
 
 	err = dl_argv_parse_put(nlh, dl,
 				DL_OPT_HANDLE | DL_OPT_HEALTH_REPORTER_NAME, 0);
@@ -6099,12 +6099,16 @@ static int cmd_health_object_common(struct dl *dl, uint8_t cmd)
 
 static int cmd_health_dump_show(struct dl *dl)
 {
-	return cmd_health_object_common(dl, DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET);
+	return cmd_health_object_common(dl,
+					DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET,
+					NLM_F_DUMP);
 }
 
 static int cmd_health_diagnose(struct dl *dl)
 {
-	return cmd_health_object_common(dl, DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE);
+	return cmd_health_object_common(dl,
+					DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE,
+					0);
 }
 
 static int cmd_health_recover(struct dl *dl)
-- 
1.8.3.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox