Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-20  9:52 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190820093800.GN2588@breakpoint.cc>



> -----Original Message-----
> From: Florian Westphal <fw@strlen.de>
> Sent: Tuesday, August 20, 2019 3:08 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> Subject: Re: Help needed - Kernel lockup while running ipsec
> 
> Vakul Garg <vakul.garg@nxp.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Florian Westphal <fw@strlen.de>
> > > Sent: Tuesday, August 20, 2019 2:53 PM
> > > To: Vakul Garg <vakul.garg@nxp.com>
> > > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > > Subject: Re: Help needed - Kernel lockup while running ipsec
> > >
> > > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > > > running single
> > > > > static ipsec tunnel.
> > > > > > The problem reproduces mostly after running 8-10 hours of
> > > > > > ipsec encap
> > > > > test (on my dual core arm board).
> > > > > >
> > > > > > I found that in function xfrm_policy_lookup_bytype(), the
> > > > > > policy in variable
> > > > > 'ret' shows refcnt=0 under problem situation.
> > > > > > This creates an infinite loop in  xfrm_policy_lookup_bytype()
> > > > > > and hence the
> > > > > lockup.
> > > > > >
> > > > > > Can some body please provide me pointers about 'refcnt'?
> > > > > > Is it legitimate for 'refcnt' to become '0'? Under what
> > > > > > condition can it
> > > > > become '0'?
> > > > >
> > > > > Yes, when policy is destroyed and the last user calls
> > > > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> > > >
> > > > It seems that policy reference count never gets decremented during
> > > > packet
> > > ipsec encap.
> > > > It is getting incremented for every frame that hits the policy.
> > > > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> > >
> > > Thats a bug.  Does this affect 4.14 only or does this happen on
> > > current tree as well?
> >
> > I am yet to try it on 4.19.
> > Can you help me with the right fix? Which part of code should it get
> decremented?
> > I am not conversant with xfrm code.
> 
> Normally policy reference counts get decremented when the skb is free'd, via
> dst destruction (xfrm_dst_destroy()).
> 
> Do you see a dst leak as well?

Can you please guide me how to detect it?

(I am checking refcount on recent kernel and will let you know.)

^ permalink raw reply

* [PATCH bpf-next] bpf: add BTF ids in procfs for file descriptors to BTF objects
From: Quentin Monnet @ 2019-08-20  9:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Implement the show_fdinfo hook for BTF FDs file operations, and make it
print the id and the size of the BTF object. This allows for a quick
retrieval of the BTF id from its FD; or it can help understanding what
type of object (BTF) the file descriptor points to.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 kernel/bpf/btf.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 5fcc7a17eb5a..39e184f1b27c 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3376,6 +3376,19 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj,
 	btf_type_ops(t)->seq_show(btf, t, type_id, obj, 0, m);
 }
 
+#ifdef CONFIG_PROC_FS
+static void bpf_btf_show_fdinfo(struct seq_file *m, struct file *filp)
+{
+	const struct btf *btf = filp->private_data;
+
+	seq_printf(m,
+		   "btf_id:\t%u\n"
+		   "data_size:\t%u\n",
+		   btf->id,
+		   btf->data_size);
+}
+#endif
+
 static int btf_release(struct inode *inode, struct file *filp)
 {
 	btf_put(filp->private_data);
@@ -3383,6 +3396,9 @@ static int btf_release(struct inode *inode, struct file *filp)
 }
 
 const struct file_operations btf_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= bpf_btf_show_fdinfo,
+#endif
 	.release	= btf_release,
 };
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH] can: flexcan: free error skb if enqueueing failed
From: Sean Nyekjaer @ 2019-08-20  9:49 UTC (permalink / raw)
  To: Martin Hundebøll, Wolfgang Grandegger, Marc Kleine-Budde,
	linux-can
  Cc: David S . Miller, netdev, Joakim Zhang
In-Reply-To: <20190715185308.104333-1-martin@geanix.com>

CC'ing Joakim Zhang

On 15/07/2019 20.53, Martin Hundebøll wrote:
> If the call to can_rx_offload_queue_sorted() fails, the passed skb isn't
> consumed, so the caller must do so.
> 
> Fixes: 30164759db1b ("can: flexcan: make use of rx-offload's irq_offload_fifo")
> Signed-off-by: Martin Hundebøll <martin@geanix.com>
> ---
>   drivers/net/can/flexcan.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
> index 1c66fb2ad76b..21f39e805d42 100644
> --- a/drivers/net/can/flexcan.c
> +++ b/drivers/net/can/flexcan.c
> @@ -688,7 +688,8 @@ static void flexcan_irq_bus_err(struct net_device *dev, u32 reg_esr)
>   	if (tx_errors)
>   		dev->stats.tx_errors++;
>   
> -	can_rx_offload_queue_sorted(&priv->offload, skb, timestamp);
> +	if (can_rx_offload_queue_sorted(&priv->offload, skb, timestamp))
> +		kfree_skb(skb);
>   }
>   
>   static void flexcan_irq_state(struct net_device *dev, u32 reg_esr)
> @@ -732,7 +733,8 @@ static void flexcan_irq_state(struct net_device *dev, u32 reg_esr)
>   	if (unlikely(new_state == CAN_STATE_BUS_OFF))
>   		can_bus_off(dev);
>   
> -	can_rx_offload_queue_sorted(&priv->offload, skb, timestamp);
> +	if (can_rx_offload_queue_sorted(&priv->offload, skb, timestamp))
> +		kfree_skb(skb);
>   }
>   
>   static inline struct flexcan_priv *rx_offload_to_priv(struct can_rx_offload *offload)
> 

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Miroslav Lichvar @ 2019-08-20  9:49 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: netdev, linux-kernel, Andrew Lunn, Richard Cochran,
	Florian Fainelli, Heiner Kallweit, Vladimir Oltean,
	David S. Miller
In-Reply-To: <20190820084833.6019-3-hubert.feurstein@vahle.at>

On Tue, Aug 20, 2019 at 10:48:31AM +0200, Hubert Feurstein wrote:

> +	/* PTP offset compensation:
> +	 * After the MDIO access is completed (from the chip perspective), the
> +	 * switch chip will snapshot the PHC timestamp. To make sure our system
> +	 * timestamp corresponds to the PHC timestamp, we have to add the
> +	 * duration of this MDIO access to sts->post_ts. Linuxptp's phc2sys
> +	 * takes the average of pre_ts and post_ts to calculate the final
> +	 * system timestamp. With this in mind, we have to add ptp_sts_offset
> +	 * twice to post_ts, in order to not introduce an constant time offset.
> +	 */
> +	if (sts)
> +		timespec64_add_ns(&sts->post_ts, 2 * bus->ptp_sts_offset);

This correction looks good to me.

Is the MDIO write delay constant in reality, or does it at least have
an upper bound? That is, is it always true that the post_ts timestamp
does not point to a time before the PHC timestamp was actually taken?

This is important to not break the estimation of maximum error in the
measured offset. Applications using the ioctl may assume that the
maximum error is (post_ts-pre_ts)/2 (i.e. half of the delay printed by
phc2sys). That would not work if the delay could be occasionally 50
microseconds for instance, i.e. the post_ts timestamp would be earlier
than the PHC timestamp.

-- 
Miroslav Lichvar

^ permalink raw reply

* Re: [PATCH -next] bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()
From: Dan Carpenter @ 2019-08-20  9:44 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Björn Töpel, YueHaibing, Karlsson, Magnus,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, Netdev,
	bpf, kernel-janitors
In-Reply-To: <CAJ+HfNhRf+=yN6eOOZ1zp8=VicT-k6nHLO6r+f__O5X3M+N=ug@mail.gmail.com>

On Tue, Aug 20, 2019 at 11:25:29AM +0200, Björn Töpel wrote:
> On Tue, 20 Aug 2019 at 10:59, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> >
> > On Tue, Aug 20, 2019 at 09:28:26AM +0200, Björn Töpel wrote:
> > > For future patches: Prefix AF_XDP socket work with "xsk:" and use "PATCH
> > > bpf-next" to let the developers know what tree you're aiming for.
> >
> > There are over 300 trees in linux-next.  It impossible to try remember
> > everyone's trees.  No one else has this requirement.
> >
> 
> Net/bpf are different, and I wanted to point that out to lessen the
> burden for the maintainers. It's documented in:
> 
> Documentation/bpf/bpf_devel_QA.rst.
> Documentation/networking/netdev-FAQ.rst

Ah...  I hadn't realized that BPF patches were confusing to Dave.

I actually do keep track of net and net-next.  I do quite a bit of extra
stuff for netdev patches.  So what about if we used [PATCH] for bpf and
[PATCH net] and [PATCH net-next] for networking?

I will do that.

regards,
dan carpenter

^ permalink raw reply

* Re: Help needed - Kernel lockup while running ipsec
From: Florian Westphal @ 2019-08-20  9:38 UTC (permalink / raw)
  To: Vakul Garg; +Cc: Florian Westphal, netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620487074796FBC015AFD098BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>

Vakul Garg <vakul.garg@nxp.com> wrote:
> 
> 
> > -----Original Message-----
> > From: Florian Westphal <fw@strlen.de>
> > Sent: Tuesday, August 20, 2019 2:53 PM
> > To: Vakul Garg <vakul.garg@nxp.com>
> > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > Subject: Re: Help needed - Kernel lockup while running ipsec
> > 
> > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > > running single
> > > > static ipsec tunnel.
> > > > > The problem reproduces mostly after running 8-10 hours of ipsec
> > > > > encap
> > > > test (on my dual core arm board).
> > > > >
> > > > > I found that in function xfrm_policy_lookup_bytype(), the policy
> > > > > in variable
> > > > 'ret' shows refcnt=0 under problem situation.
> > > > > This creates an infinite loop in  xfrm_policy_lookup_bytype() and
> > > > > hence the
> > > > lockup.
> > > > >
> > > > > Can some body please provide me pointers about 'refcnt'?
> > > > > Is it legitimate for 'refcnt' to become '0'? Under what condition
> > > > > can it
> > > > become '0'?
> > > >
> > > > Yes, when policy is destroyed and the last user calls
> > > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> > >
> > > It seems that policy reference count never gets decremented during packet
> > ipsec encap.
> > > It is getting incremented for every frame that hits the policy.
> > > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> > 
> > Thats a bug.  Does this affect 4.14 only or does this happen on current tree
> > as well?
>  
> I am yet to try it on 4.19.
> Can you help me with the right fix? Which part of code should it get decremented?
> I am not conversant with xfrm code.

Normally policy reference counts get decremented when the skb is free'd, via dst
destruction (xfrm_dst_destroy()).

Do you see a dst leak as well?


^ permalink raw reply

* [PATCH bpf-next v2 5/5] tools: bpftool: implement "bpftool btf show|list"
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190820093154.14042-1-quentin.monnet@netronome.com>

Add a "btf list" (alias: "btf show") subcommand to bpftool in order to
dump all BTF objects loaded on a system.

When running the command, hash tables are built in bpftool to retrieve
all the associations between BTF objects and BPF maps and programs. This
allows for printing all such associations when listing the BTF objects.

The command is added at the top of the subcommands for "bpftool btf", so
that typing only "bpftool btf" also comes down to listing the programs.
We could not have this with the previous command ("dump"), which
required a BTF object id, so it should not break any previous behaviour.
This also makes the "btf" command behaviour consistent with "prog" or
"map".

Bash completion is updated to use "bpftool btf" instead of "bpftool
prog" to list the BTF ids, as it looks more consistent.

Example output (plain):

    # bpftool btf show
    9: size 2989B  prog_ids 21  map_ids 15
    17: size 2847B  prog_ids 36  map_ids 30,29,28
    26: size 2847B

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../bpf/bpftool/Documentation/bpftool-btf.rst |   7 +
 tools/bpf/bpftool/bash-completion/bpftool     |  20 +-
 tools/bpf/bpftool/btf.c                       | 342 +++++++++++++++++-
 3 files changed, 363 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-btf.rst b/tools/bpf/bpftool/Documentation/bpftool-btf.rst
index 6694a0fc8f99..39615f8e145b 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-btf.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-btf.rst
@@ -19,6 +19,7 @@ SYNOPSIS
 BTF COMMANDS
 =============
 
+|	**bpftool** **btf** { **show** | **list** } [**id** *BTF_ID*]
 |	**bpftool** **btf dump** *BTF_SRC* [**format** *FORMAT*]
 |	**bpftool** **btf help**
 |
@@ -29,6 +30,12 @@ BTF COMMANDS
 
 DESCRIPTION
 ===========
+	**bpftool btf { show | list }** [**id** *BTF_ID*]
+		  Show information about loaded BTF objects. If a BTF ID is
+		  specified, show information only about given BTF object,
+		  otherwise list all BTF objects currently loaded on the
+		  system.
+
 	**bpftool btf dump** *BTF_SRC*
 		  Dump BTF entries from a given *BTF_SRC*.
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 4549fd424069..2ffd351f9dbf 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -73,8 +73,8 @@ _bpftool_get_prog_tags()
 
 _bpftool_get_btf_ids()
 {
-    COMPREPLY+=( $( compgen -W "$( bpftool -jp prog 2>&1 | \
-        command sed -n 's/.*"btf_id": \(.*\),\?$/\1/p' )" -- "$cur" ) )
+    COMPREPLY+=( $( compgen -W "$( bpftool -jp btf 2>&1 | \
+        command sed -n 's/.*"id": \(.*\),$/\1/p' )" -- "$cur" ) )
 }
 
 _bpftool_get_obj_map_names()
@@ -670,7 +670,7 @@ _bpftool()
                                 map)
                                     _bpftool_get_map_ids
                                     ;;
-                                dump)
+                                $command)
                                     _bpftool_get_btf_ids
                                     ;;
                             esac
@@ -698,9 +698,21 @@ _bpftool()
                             ;;
                     esac
                     ;;
+                show|list)
+                    case $prev in
+                        $command)
+                            COMPREPLY+=( $( compgen -W "id" -- "$cur" ) )
+                            ;;
+                        id)
+                            _bpftool_get_btf_ids
+                            ;;
+                    esac
+                    return 0
+                    ;;
                 *)
                     [[ $prev == $object ]] && \
-                        COMPREPLY=( $( compgen -W 'dump help' -- "$cur" ) )
+                        COMPREPLY=( $( compgen -W 'dump help show list' \
+                            -- "$cur" ) )
                     ;;
             esac
             ;;
diff --git a/tools/bpf/bpftool/btf.c b/tools/bpf/bpftool/btf.c
index 8805637f1a7e..9a9376d1d3df 100644
--- a/tools/bpf/bpftool/btf.c
+++ b/tools/bpf/bpftool/btf.c
@@ -11,6 +11,7 @@
 #include <bpf.h>
 #include <libbpf.h>
 #include <linux/btf.h>
+#include <linux/hashtable.h>
 
 #include "btf.h"
 #include "json_writer.h"
@@ -35,6 +36,16 @@ static const char * const btf_kind_str[NR_BTF_KINDS] = {
 	[BTF_KIND_DATASEC]	= "DATASEC",
 };
 
+struct btf_attach_table {
+	DECLARE_HASHTABLE(table, 16);
+};
+
+struct btf_attach_point {
+	__u32 obj_id;
+	__u32 btf_id;
+	struct hlist_node hash;
+};
+
 static const char *btf_int_enc_str(__u8 encoding)
 {
 	switch (encoding) {
@@ -522,6 +533,330 @@ static int do_dump(int argc, char **argv)
 	return err;
 }
 
+static int btf_parse_fd(int *argc, char ***argv)
+{
+	unsigned int id;
+	char *endptr;
+	int fd;
+
+	if (!is_prefix(*argv[0], "id")) {
+		p_err("expected 'id', got: '%s'?", **argv);
+		return -1;
+	}
+	NEXT_ARGP();
+
+	id = strtoul(**argv, &endptr, 0);
+	if (*endptr) {
+		p_err("can't parse %s as ID", **argv);
+		return -1;
+	}
+	NEXT_ARGP();
+
+	fd = bpf_btf_get_fd_by_id(id);
+	if (fd < 0)
+		p_err("can't get BTF object by id (%u): %s",
+		      id, strerror(errno));
+
+	return fd;
+}
+
+static void delete_btf_table(struct btf_attach_table *tab)
+{
+	struct btf_attach_point *obj;
+	struct hlist_node *tmp;
+
+	unsigned int bkt;
+
+	hash_for_each_safe(tab->table, bkt, tmp, obj, hash) {
+		hash_del(&obj->hash);
+		free(obj);
+	}
+}
+
+static int
+build_btf_type_table(struct btf_attach_table *tab, enum bpf_obj_type type,
+		     void *info, __u32 *len)
+{
+	static const char * const names[] = {
+		[BPF_OBJ_UNKNOWN]	= "unknown",
+		[BPF_OBJ_PROG]		= "prog",
+		[BPF_OBJ_MAP]		= "map",
+	};
+	struct btf_attach_point *obj_node;
+	__u32 btf_id, id = 0;
+	int err;
+	int fd;
+
+	while (true) {
+		switch (type) {
+		case BPF_OBJ_PROG:
+			err = bpf_prog_get_next_id(id, &id);
+			break;
+		case BPF_OBJ_MAP:
+			err = bpf_map_get_next_id(id, &id);
+			break;
+		default:
+			err = -1;
+			p_err("unexpected object type: %d", type);
+			goto err_free;
+		}
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			p_err("can't get next %s: %s%s", names[type],
+			      strerror(errno),
+			      errno == EINVAL ? " -- kernel too old?" : "");
+			goto err_free;
+		}
+
+		switch (type) {
+		case BPF_OBJ_PROG:
+			fd = bpf_prog_get_fd_by_id(id);
+			break;
+		case BPF_OBJ_MAP:
+			fd = bpf_map_get_fd_by_id(id);
+			break;
+		default:
+			err = -1;
+			p_err("unexpected object type: %d", type);
+			goto err_free;
+		}
+		if (fd < 0) {
+			if (errno == ENOENT)
+				continue;
+			p_err("can't get %s by id (%u): %s", names[type], id,
+			      strerror(errno));
+			err = -1;
+			goto err_free;
+		}
+
+		memset(info, 0, *len);
+		err = bpf_obj_get_info_by_fd(fd, info, len);
+		close(fd);
+		if (err) {
+			p_err("can't get %s info: %s", names[type],
+			      strerror(errno));
+			goto err_free;
+		}
+
+		switch (type) {
+		case BPF_OBJ_PROG:
+			btf_id = ((struct bpf_prog_info *)info)->btf_id;
+			break;
+		case BPF_OBJ_MAP:
+			btf_id = ((struct bpf_map_info *)info)->btf_id;
+			break;
+		default:
+			err = -1;
+			p_err("unexpected object type: %d", type);
+			goto err_free;
+		}
+		if (!btf_id)
+			continue;
+
+		obj_node = calloc(1, sizeof(*obj_node));
+		if (!obj_node) {
+			p_err("failed to allocate memory: %s", strerror(errno));
+			goto err_free;
+		}
+
+		obj_node->obj_id = id;
+		obj_node->btf_id = btf_id;
+		hash_add(tab->table, &obj_node->hash, obj_node->btf_id);
+	}
+
+	return 0;
+
+err_free:
+	delete_btf_table(tab);
+	return err;
+}
+
+static int
+build_btf_tables(struct btf_attach_table *btf_prog_table,
+		 struct btf_attach_table *btf_map_table)
+{
+	struct bpf_prog_info prog_info;
+	__u32 prog_len = sizeof(prog_info);
+	struct bpf_map_info map_info;
+	__u32 map_len = sizeof(map_info);
+	int err = 0;
+
+	err = build_btf_type_table(btf_prog_table, BPF_OBJ_PROG, &prog_info,
+				   &prog_len);
+	if (err)
+		return err;
+
+	err = build_btf_type_table(btf_map_table, BPF_OBJ_MAP, &map_info,
+				   &map_len);
+	if (err) {
+		delete_btf_table(btf_prog_table);
+		return err;
+	}
+
+	return 0;
+}
+
+static void
+show_btf_plain(struct bpf_btf_info *info, int fd,
+	       struct btf_attach_table *btf_prog_table,
+	       struct btf_attach_table *btf_map_table)
+{
+	struct btf_attach_point *obj;
+	int n;
+
+	printf("%u: ", info->id);
+	printf("size %uB", info->btf_size);
+
+	n = 0;
+	hash_for_each_possible(btf_prog_table->table, obj, hash, info->id) {
+		if (obj->btf_id == info->id)
+			printf("%s%u", n++ == 0 ? "  prog_ids " : ",",
+			       obj->obj_id);
+	}
+
+	n = 0;
+	hash_for_each_possible(btf_map_table->table, obj, hash, info->id) {
+		if (obj->btf_id == info->id)
+			printf("%s%u", n++ == 0 ? "  map_ids " : ",",
+			       obj->obj_id);
+	}
+
+	printf("\n");
+}
+
+static void
+show_btf_json(struct bpf_btf_info *info, int fd,
+	      struct btf_attach_table *btf_prog_table,
+	      struct btf_attach_table *btf_map_table)
+{
+	struct btf_attach_point *obj;
+
+	jsonw_start_object(json_wtr);	/* btf object */
+	jsonw_uint_field(json_wtr, "id", info->id);
+	jsonw_uint_field(json_wtr, "size", info->btf_size);
+
+	jsonw_name(json_wtr, "prog_ids");
+	jsonw_start_array(json_wtr);	/* prog_ids */
+	hash_for_each_possible(btf_prog_table->table, obj, hash,
+			       info->id) {
+		if (obj->btf_id == info->id)
+			jsonw_uint(json_wtr, obj->obj_id);
+	}
+	jsonw_end_array(json_wtr);	/* prog_ids */
+
+	jsonw_name(json_wtr, "map_ids");
+	jsonw_start_array(json_wtr);	/* map_ids */
+	hash_for_each_possible(btf_map_table->table, obj, hash,
+			       info->id) {
+		if (obj->btf_id == info->id)
+			jsonw_uint(json_wtr, obj->obj_id);
+	}
+	jsonw_end_array(json_wtr);	/* map_ids */
+	jsonw_end_object(json_wtr);	/* btf object */
+}
+
+static int
+show_btf(int fd, struct btf_attach_table *btf_prog_table,
+	 struct btf_attach_table *btf_map_table)
+{
+	struct bpf_btf_info info = {};
+	__u32 len = sizeof(info);
+	int err;
+
+	err = bpf_obj_get_info_by_fd(fd, &info, &len);
+	if (err) {
+		p_err("can't get BTF object info: %s", strerror(errno));
+		return -1;
+	}
+
+	if (json_output)
+		show_btf_json(&info, fd, btf_prog_table, btf_map_table);
+	else
+		show_btf_plain(&info, fd, btf_prog_table, btf_map_table);
+
+	return 0;
+}
+
+static int do_show(int argc, char **argv)
+{
+	struct btf_attach_table btf_prog_table;
+	struct btf_attach_table btf_map_table;
+	int err, fd = -1;
+	__u32 id = 0;
+
+	if (argc == 2) {
+		fd = btf_parse_fd(&argc, &argv);
+		if (fd < 0)
+			return -1;
+	}
+
+	if (argc) {
+		if (fd >= 0)
+			close(fd);
+		return BAD_ARG();
+	}
+
+	hash_init(btf_prog_table.table);
+	hash_init(btf_map_table.table);
+	err = build_btf_tables(&btf_prog_table, &btf_map_table);
+	if (err) {
+		if (fd >= 0)
+			close(fd);
+		return err;
+	}
+
+	if (fd >= 0) {
+		err = show_btf(fd, &btf_prog_table, &btf_map_table);
+		close(fd);
+		goto exit_free;
+	}
+
+	if (json_output)
+		jsonw_start_array(json_wtr);	/* root array */
+
+	while (true) {
+		err = bpf_btf_get_next_id(id, &id);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			p_err("can't get next BTF object: %s%s",
+			      strerror(errno),
+			      errno == EINVAL ? " -- kernel too old?" : "");
+			err = -1;
+			break;
+		}
+
+		fd = bpf_btf_get_fd_by_id(id);
+		if (fd < 0) {
+			if (errno == ENOENT)
+				continue;
+			p_err("can't get BTF object by id (%u): %s",
+			      id, strerror(errno));
+			err = -1;
+			break;
+		}
+
+		err = show_btf(fd, &btf_prog_table, &btf_map_table);
+		close(fd);
+		if (err)
+			break;
+	}
+
+	if (json_output)
+		jsonw_end_array(json_wtr);	/* root array */
+
+exit_free:
+	delete_btf_table(&btf_prog_table);
+	delete_btf_table(&btf_map_table);
+
+	return err;
+}
+
 static int do_help(int argc, char **argv)
 {
 	if (json_output) {
@@ -530,7 +865,8 @@ static int do_help(int argc, char **argv)
 	}
 
 	fprintf(stderr,
-		"Usage: %s btf dump BTF_SRC [format FORMAT]\n"
+		"Usage: %s btf { show | list } [id BTF_ID]\n"
+		"       %s btf dump BTF_SRC [format FORMAT]\n"
 		"       %s btf help\n"
 		"\n"
 		"       BTF_SRC := { id BTF_ID | prog PROG | map MAP [{key | value | kv | all}] | file FILE }\n"
@@ -539,12 +875,14 @@ static int do_help(int argc, char **argv)
 		"       " HELP_SPEC_PROGRAM "\n"
 		"       " HELP_SPEC_OPTIONS "\n"
 		"",
-		bin_name, bin_name);
+		bin_name, bin_name, bin_name);
 
 	return 0;
 }
 
 static const struct cmd cmds[] = {
+	{ "show",	do_show },
+	{ "list",	do_show },
 	{ "help",	do_help },
 	{ "dump",	do_dump },
 	{ 0 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next v2 4/5] libbpf: add bpf_btf_get_next_id() to cycle through BTF objects
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190820093154.14042-1-quentin.monnet@netronome.com>

Add an API function taking a BTF object id and providing the id of the
next BTF object in the kernel. This can be used to list all BTF objects
loaded on the system.

v2:
- Rebase on top of Andrii's changes regarding libbpf versioning.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/lib/bpf/bpf.c      | 5 +++++
 tools/lib/bpf/bpf.h      | 1 +
 tools/lib/bpf/libbpf.map | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 1439e99c9be5..cbb933532981 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -593,6 +593,11 @@ int bpf_map_get_next_id(__u32 start_id, __u32 *next_id)
 	return bpf_obj_get_next_id(start_id, next_id, BPF_MAP_GET_NEXT_ID);
 }
 
+int bpf_btf_get_next_id(__u32 start_id, __u32 *next_id)
+{
+	return bpf_obj_get_next_id(start_id, next_id, BPF_BTF_GET_NEXT_ID);
+}
+
 int bpf_prog_get_fd_by_id(__u32 id)
 {
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ff42ca043dc8..0db01334740f 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -156,6 +156,7 @@ LIBBPF_API int bpf_prog_test_run(int prog_fd, int repeat, void *data,
 				 __u32 *retval, __u32 *duration);
 LIBBPF_API int bpf_prog_get_next_id(__u32 start_id, __u32 *next_id);
 LIBBPF_API int bpf_map_get_next_id(__u32 start_id, __u32 *next_id);
+LIBBPF_API int bpf_btf_get_next_id(__u32 start_id, __u32 *next_id);
 LIBBPF_API int bpf_prog_get_fd_by_id(__u32 id);
 LIBBPF_API int bpf_map_get_fd_by_id(__u32 id);
 LIBBPF_API int bpf_btf_get_fd_by_id(__u32 id);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 4e72df8e98ba..664ce8e7a60e 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -186,4 +186,6 @@ LIBBPF_0.0.4 {
 } LIBBPF_0.0.3;
 
 LIBBPF_0.0.5 {
+	global:
+		bpf_btf_get_next_id;
 } LIBBPF_0.0.4;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next v2 3/5] libbpf: refactor bpf_*_get_next_id() functions
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190820093154.14042-1-quentin.monnet@netronome.com>

In preparation for the introduction of a similar function for retrieving
the id of the next BTF object, consolidate the code from
bpf_prog_get_next_id() and bpf_map_get_next_id() in libbpf.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/lib/bpf/bpf.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index c7d7993c44bb..1439e99c9be5 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -568,7 +568,7 @@ int bpf_prog_test_run_xattr(struct bpf_prog_test_run_attr *test_attr)
 	return ret;
 }
 
-int bpf_prog_get_next_id(__u32 start_id, __u32 *next_id)
+static int bpf_obj_get_next_id(__u32 start_id, __u32 *next_id, int cmd)
 {
 	union bpf_attr attr;
 	int err;
@@ -576,26 +576,21 @@ int bpf_prog_get_next_id(__u32 start_id, __u32 *next_id)
 	memset(&attr, 0, sizeof(attr));
 	attr.start_id = start_id;
 
-	err = sys_bpf(BPF_PROG_GET_NEXT_ID, &attr, sizeof(attr));
+	err = sys_bpf(cmd, &attr, sizeof(attr));
 	if (!err)
 		*next_id = attr.next_id;
 
 	return err;
 }
 
-int bpf_map_get_next_id(__u32 start_id, __u32 *next_id)
+int bpf_prog_get_next_id(__u32 start_id, __u32 *next_id)
 {
-	union bpf_attr attr;
-	int err;
-
-	memset(&attr, 0, sizeof(attr));
-	attr.start_id = start_id;
-
-	err = sys_bpf(BPF_MAP_GET_NEXT_ID, &attr, sizeof(attr));
-	if (!err)
-		*next_id = attr.next_id;
+	return bpf_obj_get_next_id(start_id, next_id, BPF_PROG_GET_NEXT_ID);
+}
 
-	return err;
+int bpf_map_get_next_id(__u32 start_id, __u32 *next_id)
+{
+	return bpf_obj_get_next_id(start_id, next_id, BPF_MAP_GET_NEXT_ID);
 }
 
 int bpf_prog_get_fd_by_id(__u32 id)
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next v2 2/5] tools: bpf: synchronise BPF UAPI header with tools
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190820093154.14042-1-quentin.monnet@netronome.com>

Synchronise the bpf.h header under tools, to report the addition of the
new BPF_BTF_GET_NEXT_ID syscall command for bpf().

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 tools/include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0ef594ac3899..8aa6126f0b6e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_BTF_GET_NEXT_ID,
 };
 
 enum bpf_map_type {
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next v2 1/5] bpf: add new BPF_BTF_GET_NEXT_ID syscall command
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190820093154.14042-1-quentin.monnet@netronome.com>

Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used
to cycle through all BTF objects loaded on the system.

The motivation is to be able to inspect (list) all BTF objects presents
on the system.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/linux/bpf.h      | 3 +++
 include/uapi/linux/bpf.h | 1 +
 kernel/bpf/btf.c         | 4 ++--
 kernel/bpf/syscall.c     | 4 ++++
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 15ae49862b82..5b9d22338606 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -24,6 +24,9 @@ struct seq_file;
 struct btf;
 struct btf_type;
 
+extern struct idr btf_idr;
+extern spinlock_t btf_idr_lock;
+
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {
 	/* funcs callable from userspace (via syscall) */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0ef594ac3899..8aa6126f0b6e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_BTF_GET_NEXT_ID,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 5fcc7a17eb5a..e716a64b2f7f 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -195,8 +195,8 @@
 	     i < btf_type_vlen(struct_type);					\
 	     i++, member++)
 
-static DEFINE_IDR(btf_idr);
-static DEFINE_SPINLOCK(btf_idr_lock);
+DEFINE_IDR(btf_idr);
+DEFINE_SPINLOCK(btf_idr_lock);
 
 struct btf {
 	void *data;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cf8052b016e7..c0f62fd67c6b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2884,6 +2884,10 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 		err = bpf_obj_get_next_id(&attr, uattr,
 					  &map_idr, &map_idr_lock);
 		break;
+	case BPF_BTF_GET_NEXT_ID:
+		err = bpf_obj_get_next_id(&attr, uattr,
+					  &btf_idr, &btf_idr_lock);
+		break;
 	case BPF_PROG_GET_FD_BY_ID:
 		err = bpf_prog_get_fd_by_id(&attr);
 		break;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next v2 0/5] bpf: list BTF objects loaded on system
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Hi,
This set adds a new command BPF_BTF_GET_NEXT_ID to the bpf() system call,
adds the relevant API function in libbpf, and uses it in bpftool to list
all BTF objects loaded on the system (and to dump the ids of maps and
programs associated with them, if any).

The main motivation of listing BTF objects is introspection and debugging
purposes. By getting BPF program and map information, it should already be
possible to list all BTF objects associated to at least one map or one
program. But there may be unattached BTF objects, held by a file descriptor
from a user space process only, and we may want to list them too.

As a side note, it also turned useful for examining the BTF objects
attached to offloaded programs, which would not show in program information
because the BTF id is not copied when retrieving such info. A fix is in
progress on that side.

v2:
- Rebase patch with new libbpf function on top of Andrii's changes
  regarding libbpf versioning.

Quentin Monnet (5):
  bpf: add new BPF_BTF_GET_NEXT_ID syscall command
  tools: bpf: synchronise BPF UAPI header with tools
  libbpf: refactor bpf_*_get_next_id() functions
  libbpf: add bpf_btf_get_next_id() to cycle through BTF objects
  tools: bpftool: implement "bpftool btf show|list"

 include/linux/bpf.h                           |   3 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/btf.c                              |   4 +-
 kernel/bpf/syscall.c                          |   4 +
 .../bpf/bpftool/Documentation/bpftool-btf.rst |   7 +
 tools/bpf/bpftool/bash-completion/bpftool     |  20 +-
 tools/bpf/bpftool/btf.c                       | 342 +++++++++++++++++-
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/lib/bpf/bpf.c                           |  24 +-
 tools/lib/bpf/bpf.h                           |   1 +
 tools/lib/bpf/libbpf.map                      |   2 +
 11 files changed, 389 insertions(+), 20 deletions(-)

-- 
2.17.1

^ permalink raw reply

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-20  9:30 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620487074796FBC015AFD098BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>



> > -----Original Message-----
> > From: Florian Westphal <fw@strlen.de>
> > Sent: Tuesday, August 20, 2019 2:53 PM
> > To: Vakul Garg <vakul.garg@nxp.com>
> > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > Subject: Re: Help needed - Kernel lockup while running ipsec
> >
> > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > > running single
> > > > static ipsec tunnel.
> > > > > The problem reproduces mostly after running 8-10 hours of ipsec
> > > > > encap
> > > > test (on my dual core arm board).
> > > > >
> > > > > I found that in function xfrm_policy_lookup_bytype(), the policy
> > > > > in variable
> > > > 'ret' shows refcnt=0 under problem situation.
> > > > > This creates an infinite loop in  xfrm_policy_lookup_bytype() and
> > > > > hence the
> > > > lockup.
> > > > >
> > > > > Can some body please provide me pointers about 'refcnt'?
> > > > > Is it legitimate for 'refcnt' to become '0'? Under what condition
> > > > > can it
> > > > become '0'?
> > > >
> > > > Yes, when policy is destroyed and the last user calls
> > > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> > >
> > > It seems that policy reference count never gets decremented during
> packet
> > ipsec encap.
> > > It is getting incremented for every frame that hits the policy.
> > > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> >
> > Thats a bug.  Does this affect 4.14 only or does this happen on current tree
> > as well?
> 
> I am yet to try it on 4.19.

Correction: I am yet to try it on current tree.

> Can you help me with the right fix? Which part of code should it get
> decremented?
> I am not conversant with xfrm code.


^ permalink raw reply

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-20  9:26 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190820092303.GM2588@breakpoint.cc>



> -----Original Message-----
> From: Florian Westphal <fw@strlen.de>
> Sent: Tuesday, August 20, 2019 2:53 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> Subject: Re: Help needed - Kernel lockup while running ipsec
> 
> Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > running single
> > > static ipsec tunnel.
> > > > The problem reproduces mostly after running 8-10 hours of ipsec
> > > > encap
> > > test (on my dual core arm board).
> > > >
> > > > I found that in function xfrm_policy_lookup_bytype(), the policy
> > > > in variable
> > > 'ret' shows refcnt=0 under problem situation.
> > > > This creates an infinite loop in  xfrm_policy_lookup_bytype() and
> > > > hence the
> > > lockup.
> > > >
> > > > Can some body please provide me pointers about 'refcnt'?
> > > > Is it legitimate for 'refcnt' to become '0'? Under what condition
> > > > can it
> > > become '0'?
> > >
> > > Yes, when policy is destroyed and the last user calls
> > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> >
> > It seems that policy reference count never gets decremented during packet
> ipsec encap.
> > It is getting incremented for every frame that hits the policy.
> > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> 
> Thats a bug.  Does this affect 4.14 only or does this happen on current tree
> as well?
 
I am yet to try it on 4.19.
Can you help me with the right fix? Which part of code should it get decremented?
I am not conversant with xfrm code.


^ permalink raw reply

* Re: [PATCH -next] bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()
From: Björn Töpel @ 2019-08-20  9:25 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Björn Töpel, YueHaibing, Karlsson, Magnus,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, Netdev,
	bpf, kernel-janitors
In-Reply-To: <20190820085547.GE4451@kadam>

On Tue, 20 Aug 2019 at 10:59, Dan Carpenter <dan.carpenter@oracle.com> wrote:
>
> On Tue, Aug 20, 2019 at 09:28:26AM +0200, Björn Töpel wrote:
> > For future patches: Prefix AF_XDP socket work with "xsk:" and use "PATCH
> > bpf-next" to let the developers know what tree you're aiming for.
>
> There are over 300 trees in linux-next.  It impossible to try remember
> everyone's trees.  No one else has this requirement.
>

Net/bpf are different, and I wanted to point that out to lessen the
burden for the maintainers. It's documented in:

Documentation/bpf/bpf_devel_QA.rst.
Documentation/networking/netdev-FAQ.rst

> Maybe add it as an option to get_maintainer.pl --tree <hash> then that
> would be very easy.
>

Yes, improved tooling would help.


Cheers,
Björn

> regards,
> dan carpenter
>

^ permalink raw reply

* Re: Help needed - Kernel lockup while running ipsec
From: Florian Westphal @ 2019-08-20  9:23 UTC (permalink / raw)
  To: Vakul Garg; +Cc: Florian Westphal, netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620C6E770C97AB14A04A1D98BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>

Vakul Garg <vakul.garg@nxp.com> wrote:
> > > With kernel 4.14.122, I am getting a kernel softlockup while running single
> > static ipsec tunnel.
> > > The problem reproduces mostly after running 8-10 hours of ipsec encap
> > test (on my dual core arm board).
> > >
> > > I found that in function xfrm_policy_lookup_bytype(), the policy in variable
> > 'ret' shows refcnt=0 under problem situation.
> > > This creates an infinite loop in  xfrm_policy_lookup_bytype() and hence the
> > lockup.
> > >
> > > Can some body please provide me pointers about 'refcnt'?
> > > Is it legitimate for 'refcnt' to become '0'? Under what condition can it
> > become '0'?
> > 
> > Yes, when policy is destroyed and the last user calls
> > xfrm_pol_put() which will invoke call_rcu to free the structure.
> 
> It seems that policy reference count never gets decremented during packet ipsec encap.
> It is getting incremented for every frame that hits the policy.
> In setkey -DP output, I see refcnt to be wrapping around after '0'.

Thats a bug.  Does this affect 4.14 only or does this happen on current
tree as well?

^ permalink raw reply

* Re: general protection fault in xsk_poll
From: Björn Töpel @ 2019-08-20  9:17 UTC (permalink / raw)
  To: Hillf Danton, syzbot
  Cc: ast, bpf, daniel, davem, hawk, jakub.kicinski, john.fastabend,
	jonathan.lemon, kafai, linux-kernel, magnus.karlsson, netdev,
	songliubraving, syzkaller-bugs, xdp-newbies, yhs
In-Reply-To: <20190820033154.9112-1-hdanton@sina.com>

On 2019-08-20 05:31, Hillf Danton wrote:
> 
> On Mon, 19 Aug 2019 18:18:06 -0700
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    da657043 Add linux-next specific files for 20190819
>> git tree:       linux-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=16af124c600000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=739a9b3ab3d8c770
>> dashboard link: https://syzkaller.appspot.com/bug?extid=c82697e3043781e08802
>> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=109e1922600000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1445bf02600000
>>
>> The bug was bisected to:
>>
>> commit 77cd0d7b3f257fd0e3096b4fdcff1a7d38e99e10
>> Author: Magnus Karlsson <magnus.karlsson@intel.com>
>> Date:   Wed Aug 14 07:27:17 2019 +0000
>>
>>       xsk: add support for need_wakeup flag in AF_XDP rings
>>
>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15e1ea4c600000
>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=17e1ea4c600000
>> console output: https://syzkaller.appspot.com/x/log.txt?x=13e1ea4c600000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
>> Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
>>
>> kasan: CONFIG_KASAN_INLINE enabled
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault: 0000 [#1] PREEMPT SMP KASAN
>> CPU: 1 PID: 7959 Comm: syz-executor611 Not tainted 5.3.0-rc5-next-20190819 #68
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> RIP: 0010:xsk_poll+0x95/0x540 net/xdp/xsk.c:386
>> Code: 80 3c 02 00 0f 85 70 04 00 00 4c 8b a3 88 04 00 00 48 b8 00 00 00 00
>> 00 fc ff df 49 8d bc 24 96 00 00 00 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48
>> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 bf 03 00 00
>> RSP: 0018:ffff8880926f7850 EFLAGS: 00010207
>> RAX: dffffc0000000000 RBX: ffff88809a141700 RCX: ffffffff859b07aa
>> RDX: 0000000000000012 RSI: ffffffff859b07c4 RDI: 0000000000000096
>> RBP: ffff8880926f7880 R08: ffff88809698a580 R09: ffffed1013428329
>> R10: ffffed1013428328 R11: ffff88809a141947 R12: 0000000000000000
>> R13: 0000000000000304 R14: ffff888095d4d840 R15: ffff888092bdd020
>> FS:  0000555557529880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000020000280 CR3: 0000000098281000 CR4: 00000000001406e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>    sock_poll+0x15e/0x480 net/socket.c:1256
>>    vfs_poll include/linux/poll.h:90 [inline]
>>    do_pollfd fs/select.c:859 [inline]
>>    do_poll fs/select.c:907 [inline]
>>    do_sys_poll+0x7c2/0xde0 fs/select.c:1001
>>    __do_sys_ppoll fs/select.c:1101 [inline]
>>    __se_sys_ppoll fs/select.c:1081 [inline]
>>    __x64_sys_ppoll+0x259/0x310 fs/select.c:1081
>>    do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
>>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> RIP: 0033:0x440159
>> Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
>> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
>> ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
>> RSP: 002b:00007ffd9fbd16e8 EFLAGS: 00000246 ORIG_RAX: 000000000000010f
>> RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440159
>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000020000280
>> RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
>> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004019e0
>> R13: 0000000000401a70 R14: 0000000000000000 R15: 0000000000000000
>> Modules linked in:
>> ---[ end trace da907175426b4065 ]---
> 
> Add umem check.
> 
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -381,9 +381,9 @@ static unsigned int xsk_poll(struct file
>   	struct sock *sk = sock->sk;
>   	struct xdp_sock *xs = xdp_sk(sk);
>   	struct net_device *dev = xs->dev;
> -	struct xdp_umem *umem = xs->umem;
> +	struct xdp_umem *umem = READ_ONCE(xs->umem);
>   
> -	if (umem->need_wakeup)
> +	if (umem && umem->need_wakeup)
>   		dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
>   						umem->need_wakeup);
>   
> --
> 

Thanks!

What do you think about making it a bit more generic, like:

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ee4428a892fa..08bed5e92af4 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -356,13 +356,20 @@ static int xsk_generic_xmit(struct sock *sk, 
struct msghdr *m,
  	return err;
  }

+static bool xsk_is_bound(struct xdp_sock *xs)
+{
+	struct net_device *dev = READ_ONCE(xs->dev);
+
+	return dev && xs->state == XSK_BOUND;
+}
+
  static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t 
total_len)
  {
  	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
  	struct sock *sk = sock->sk;
  	struct xdp_sock *xs = xdp_sk(sk);

-	if (unlikely(!xs->dev))
+	if (unlikely(!xsk_is_bound(xs)))
  		return -ENXIO;
  	if (unlikely(!(xs->dev->flags & IFF_UP)))
  		return -ENETDOWN;
@@ -383,6 +390,9 @@ static unsigned int xsk_poll(struct file *file, 
struct socket *sock,
  	struct net_device *dev = xs->dev;
  	struct xdp_umem *umem = xs->umem;

+	if (unlikely(!xsk_is_bound(xs)))
+		return mask;
+
  	if (umem->need_wakeup)
  		dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
  						umem->need_wakeup);
@@ -417,7 +427,7 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
  {
  	struct net_device *dev = xs->dev;

-	if (!dev || xs->state != XSK_BOUND)
+	if (!xsk_is_bound(xs))
  		return;

  	xs->state = XSK_UNBOUND;

^ permalink raw reply related

* [PATCH v2 net] gve: Copy and paste bug in gve_get_stats()
From: Dan Carpenter @ 2019-08-20  9:11 UTC (permalink / raw)
  To: Catherine Sullivan
  Cc: Sagi Shahar, Jon Olson, David S. Miller, Willem de Bruijn,
	Luigi Rizzo, Chuhong Yuan, netdev, kernel-janitors
In-Reply-To: <20190820090053.GA24410@mwanda>

There is a copy and paste error so we have "rx" where "tx" was intended
in the priv->tx[] array.

Fixes: f5cedc84a30d ("gve: Add transmit and receive support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
v2: fix a typo in the subject: buy -> bug (Thanks Walter Harms)

 drivers/net/ethernet/google/gve/gve_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 497298752381..aca95f64bde8 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -50,7 +50,7 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
 				  u64_stats_fetch_begin(&priv->tx[ring].statss);
 				s->tx_packets += priv->tx[ring].pkt_done;
 				s->tx_bytes += priv->tx[ring].bytes_done;
-			} while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+			} while (u64_stats_fetch_retry(&priv->tx[ring].statss,
 						       start));
 		}
 	}
-- 
2.20.1

^ permalink raw reply related

* Re: [PATCH 2/2] dt-bindings: can: flexcan: add can wakeup property
From: Sean Nyekjaer @ 2019-08-20  9:10 UTC (permalink / raw)
  To: linux-can, mkl; +Cc: Rob Herring, netdev, robh+dt, devicetree
In-Reply-To: <20190429173930.GA11283@bogus>



On 29/04/2019 19.39, Rob Herring wrote:
> On Tue,  9 Apr 2019 10:39:49 +0200, Sean Nyekjaer wrote:
>> add wakeup-source boolean property.
>>
>> Signed-off-by: Sean Nyekjaer <sean@geanix.com>
>> ---
>>   Documentation/devicetree/bindings/net/can/fsl-flexcan.txt | 2 ++
>>   1 file changed, 2 insertions(+)
>>
> 
> Reviewed-by: Rob Herring <robh@kernel.org>
>
This doesn't seem to be applied. PATCH 1/2 in this series is applied.

In any case which repo does this patch belong to?

/Sean

^ permalink raw reply

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-20  9:10 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190819173810.GK2588@breakpoint.cc>

Thanks for your response.


> -----Original Message-----
> From: Florian Westphal <fw@strlen.de>
> Sent: Monday, August 19, 2019 11:08 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: netdev@vger.kernel.org
> Subject: Re: Help needed - Kernel lockup while running ipsec
> 
> Vakul Garg <vakul.garg@nxp.com> wrote:
> > Hi
> >
> > With kernel 4.14.122, I am getting a kernel softlockup while running single
> static ipsec tunnel.
> > The problem reproduces mostly after running 8-10 hours of ipsec encap
> test (on my dual core arm board).
> >
> > I found that in function xfrm_policy_lookup_bytype(), the policy in variable
> 'ret' shows refcnt=0 under problem situation.
> > This creates an infinite loop in  xfrm_policy_lookup_bytype() and hence the
> lockup.
> >
> > Can some body please provide me pointers about 'refcnt'?
> > Is it legitimate for 'refcnt' to become '0'? Under what condition can it
> become '0'?
> 
> Yes, when policy is destroyed and the last user calls
> xfrm_pol_put() which will invoke call_rcu to free the structure.

It seems that policy reference count never gets decremented during packet ipsec encap.
It is getting incremented for every frame that hits the policy.
In setkey -DP output, I see refcnt to be wrapping around after '0'.

Is this designed to be like this or is it weird?


^ permalink raw reply

* Re: [PATCH 2/3] macb: Update compatibility string for SiFive FU540-C000
From: Andreas Schwab @ 2019-08-20  9:10 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Nicolas Ferre, David Miller, Yash Shah, Rob Herring, netdev,
	devicetree, linux-kernel@vger.kernel.org List, linux-riscv,
	Mark Rutland, Palmer Dabbelt, Albert Ou, Petr Štetiar,
	Sachin Ghadi
In-Reply-To: <alpine.DEB.2.21.9999.1908131142150.5033@viisi.sifive.com>

On Aug 13 2019, Paul Walmsley <paul.walmsley@sifive.com> wrote:

> Dave, Nicolas,
>
> On Mon, 22 Jul 2019, Yash Shah wrote:
>
>> On Fri, Jul 19, 2019 at 5:36 PM <Nicolas.Ferre@microchip.com> wrote:
>> >
>> > On 19/07/2019 at 13:10, Yash Shah wrote:
>> > > Update the compatibility string for SiFive FU540-C000 as per the new
>> > > string updated in the binding doc.
>> > > Reference: https://lkml.org/lkml/2019/7/17/200
>> >
>> > Maybe referring to lore.kernel.org is better:
>> > https://lore.kernel.org/netdev/CAJ2_jOFEVZQat0Yprg4hem4jRrqkB72FKSeQj4p8P5KA-+rgww@mail.gmail.com/
>> 
>> Sure. Will keep that in mind for future reference.
>> 
>> >
>> > > Signed-off-by: Yash Shah <yash.shah@sifive.com>
>> >
>> > Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
>> 
>> Thanks.
>
> Am assuming you'll pick this up for the -net tree for v5.4-rc1 or earlier.
> If not, please let us know.

This is still missing in v5.4-rc5, which means that networking is broken.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply

* [PATCH net] gve: Copy and paste buy in gve_get_stats()
From: Dan Carpenter @ 2019-08-20  9:00 UTC (permalink / raw)
  To: Catherine Sullivan
  Cc: Sagi Shahar, Jon Olson, David S. Miller, Willem de Bruijn,
	Luigi Rizzo, Chuhong Yuan, netdev, kernel-janitors

There is a copy and paste error so we have "rx" where "tx" was intended
in the priv->tx[] array.

Fixes: f5cedc84a30d ("gve: Add transmit and receive support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/net/ethernet/google/gve/gve_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 497298752381..aca95f64bde8 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -50,7 +50,7 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
 				  u64_stats_fetch_begin(&priv->tx[ring].statss);
 				s->tx_packets += priv->tx[ring].pkt_done;
 				s->tx_bytes += priv->tx[ring].bytes_done;
-			} while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+			} while (u64_stats_fetch_retry(&priv->tx[ring].statss,
 						       start));
 		}
 	}
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH -next] bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()
From: Dan Carpenter @ 2019-08-20  8:55 UTC (permalink / raw)
  To: Björn Töpel
  Cc: YueHaibing, magnus.karlsson, jonathan.lemon, ast, daniel, kafai,
	songliubraving, yhs, john.fastabend, netdev, bpf, kernel-janitors
In-Reply-To: <93fafdab-8fb3-0f2b-8f36-0cf297db3cd9@intel.com>

On Tue, Aug 20, 2019 at 09:28:26AM +0200, Björn Töpel wrote:
> For future patches: Prefix AF_XDP socket work with "xsk:" and use "PATCH
> bpf-next" to let the developers know what tree you're aiming for.

There are over 300 trees in linux-next.  It impossible to try remember
everyone's trees.  No one else has this requirement.

Maybe add it as an option to get_maintainer.pl --tree <hash> then that
would be very easy.

regards,
dan carpenter


^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-20  8:58 UTC (permalink / raw)
  To: Parav Pandit, Alex Williamson, Jiri Pirko, David S . Miller,
	Kirti Wankhede, Cornelia Huck
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia@nvidia.com, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866148ABA3C4E48E73E95FCD1AD0@AM0PR05MB4866.eurprd05.prod.outlook.com>

+ Dave.

Hi Jiri, Dave, Alex, Kirti, Cornelia,

Please provide your feedback on it, how shall we proceed?

Short summary of requirements.
For a given mdev (mediated device [1]), there is one representor netdevice and devlink port in switchdev mode (similar to SR-IOV VF),
And there is one netdevice for the actual mdev when mdev is probed.

(a) representor netdev and devlink port should be able derive phys_port_name().
So that representor netdev name can be built deterministically across reboots.

(b) for mdev's netdevice, mdev's device should have an attribute.
This attribute can be used by udev rules/systemd or something else to rename netdev name deterministically.

(c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
A simple grep IFNAMSIZ in stack hints hundreds of users of IFNAMSIZ in drivers, uapi, netlink, boot config area and more.
Changing IFNAMSIZ for a mdev bus doesn't really look reasonable option to me.

Hence, I would like to discuss below options.

Option-1: mdev index
Introduce an optional mdev index/handle as u32 during mdev create time.
User passes mdev index/handle as input.

phys_port_name=mIndex=m%u
mdev_index will be available in sysfs as mdev attribute for udev to name the mdev's netdev.

example mdev create command:
UUID=$(uuidgen)
echo $UUID index=10 > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
example netdevs:
repnetdev=ens2f0_m10	/*ens2f0 is parent PF's netdevice */
mdev_netdev=enm10

Pros:
1. mdevctl and any other existing tools are unaffected.
2. netdev stack, ovs and other switching platforms are unaffected.
3. achieves unique phys_port_name for representor netdev
4. achieves unique mdev eth netdev name for the mdev using udev/systemd extension.
5. Aligns well with mdev and netdev subsystem and similar to existing sriov bdf's.

Option-2: shorter mdev name
Extend mdev to have shorter mdev device name in addition to UUID.
such as 'foo', 'bar'.
Mdev will continue to have UUID.
phys_port_name=mdev_name

Pros:
1. All same as option-1, except mdevctl needs upgrade for newer usage.
It is common practice to upgrade iproute2 package along with the kernel.
Similar practice to be done with mdevctl.
2. Newer users of mdevctl who wants to work with non_UUID names, will use newer mdevctl/tools.
Cons:
1. Dual naming scheme of mdev might affect some of the existing tools.
It's unclear how/if it actually affects.
mdevctl [2] is very recently developed and can be enhanced for dual naming scheme.

Option-3: mdev uuid alias
Instead of shorter mdev name or mdev index, have alpha-numeric name alias.
Alias is an optional mdev sysfs attribute such as 'foo', 'bar'.
example mdev create command:
UUID=$(uuidgen)
echo $UUID alias=foo > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
example netdevs:
examle netdevs:
repnetdev = ens2f0_mfoo
mdev_netdev=enmfoo

Pros:
1. All same as option-1.
2. Doesn't affect existing mdev naming scheme.
Cons:
1. Index scheme of option-1 is better which can number large number of mdevs with fewer characters, simplifying the management tool.

Option-4: extend IFNAMESZ to be 64 bytes Extended IFNAMESZ from 16 to 64 bytes phys_port_name=mdev_UUID_string mdev_netdev_name=enmUUID

Pros:
1. Doesn't require mdev extension
Cons:
1. netdev stack, driver, uapi, user space, boot config wide changes
2. Possible user space extensions who assumed name size being 16 characters
3. Single device type demands namesize change for all netdev types

[1] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
[2] https://github.com/mdevctl/mdevctl

Regards,
Parav Pandit

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> owner@vger.kernel.org> On Behalf Of Parav Pandit
> Sent: Wednesday, August 14, 2019 9:51 PM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia@nvidia.com; Jiri Pirko <jiri@mellanox.com>;
> netdev@vger.kernel.org
> Subject: RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> 
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, August 14, 2019 8:28 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; cjia@nvidia.com; Jiri Pirko
> > <jiri@mellanox.com>; netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> >
> > On Wed, 14 Aug 2019 13:45:49 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Cornelia Huck <cohuck@redhat.com>
> > > > Sent: Wednesday, August 14, 2019 6:39 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>; Kirti Wankhede
> > > > <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > > > kernel@vger.kernel.org; cjia@nvidia.com; Jiri Pirko
> > > > <jiri@mellanox.com>; netdev@vger.kernel.org
> > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >
> > > > On Wed, 14 Aug 2019 12:27:01 +0000 Parav Pandit
> > > > <parav@mellanox.com> wrote:
> > > >
> > > > > + Jiri, + netdev
> > > > > To get perspective on the ndo->phys_port_name for the
> > > > > representor netdev
> > > > of mdev.
> > > > >
> > > > > Hi Cornelia,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Cornelia Huck <cohuck@redhat.com>
> > > > > > Sent: Wednesday, August 14, 2019 1:32 PM
> > > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > > Cc: Alex Williamson <alex.williamson@redhat.com>; Kirti
> > > > > > Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > > > > > kernel@vger.kernel.org; cjia@nvidia.com
> > > > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > > > >
> > > > > > On Wed, 14 Aug 2019 05:54:36 +0000 Parav Pandit
> > > > > > <parav@mellanox.com> wrote:
> > > > > >
> > > > > > > > > I get that part. I prefer to remove the UUID itself from
> > > > > > > > > the structure and therefore removing this API makes lot
> > > > > > > > > more
> > sense?
> > > > > > > >
> > > > > > > > Mdev and support tools around mdev are based on UUIDs
> > > > > > > > because it's
> > > > > > defined
> > > > > > > > in the documentation.
> > > > > > > When we introduce newer device naming scheme, it will update
> > > > > > > the
> > > > > > documentation also.
> > > > > > > May be that is the time to move to .rst format too.
> > > > > >
> > > > > > You are aware that there are existing tools that expect a uuid
> > > > > > naming scheme, right?
> > > > > >
> > > > > Yes, Alex mentioned too.
> > > > > The good tool that I am aware of is [1], which is 4 months old.
> > > > > Not sure if it is
> > > > part of any distros yet.
> > > > >
> > > > > README also says, that it is in 'early in development. So we
> > > > > have scope to
> > > > improve it for non UUID names, but lets discuss that more below.
> > > >
> > > > The up-to-date reference for mdevctl is
> > > > https://github.com/mdevctl/mdevctl. There is currently an effort
> > > > to get this packaged in Fedora.
> > > >
> > > Awesome.
> > >
> > > > >
> > > > > > >
> > > > > > > > I don't think it's as simple as saying "voila, UUID
> > > > > > > > dependencies are removed, users are free to use arbitrary
> > > > > > > > strings".  We'd need to create some kind of naming policy,
> > > > > > > > what characters are allows so that we can potentially
> > > > > > > > expand the creation parameters as has been proposed a
> > > > > > > > couple times, how do we deal with collisions and races,
> > > > > > > > and why should we make such a change when a UUID is a
> > > > > > > > perfectly reasonable devices name.  Thanks,
> > > > > > > >
> > > > > > > Sure, we should define a policy on device naming to be more
> relaxed.
> > > > > > > We have enough examples in-kernel.
> > > > > > > Few that I am aware of are netdev (vxlan, macvlan, ipvlan,
> > > > > > > lot more), rdma
> > > > > > etc which has arbitrary device names and ID based device names.
> > > > > > >
> > > > > > > Collisions and race is already taken care today in the mdev core.
> > > > > > > Same
> > > > > > unique device names continue.
> > > > > >
> > > > > > I'm still completely missing a rationale _why_ uuids are
> > > > > > supposedly bad/restricting/etc.
> > > > > There is nothing bad about uuid based naming.
> > > > > Its just too long name to derive phys_port_name of a netdev.
> > > > > In details below.
> > > > >
> > > > > For a given mdev of networking type, we would like to have
> > > > > (a) representor netdevice [2]
> > > > > (b) associated devlink port [3]
> > > > >
> > > > > Currently these representor netdevice exist only for the PCIe SR-IOV
> VFs.
> > > > > It is further getting extended for mdev without SR-IOV.
> > > > >
> > > > > Each of the devlink port is attached to representor netdevice [4].
> > > > >
> > > > > This netdevice phys_port_name should be a unique derived from
> > > > > some
> > > > property of mdev.
> > > > > Udev/systemd uses phys_port_name to derive unique representor
> > > > > netdev
> > > > name.
> > > > > This netdev name is further use by orchestration and switching
> > > > > software in
> > > > user space.
> > > > > One such distro supported switching software is ovs [4], which
> > > > > relies on the
> > > > persistent device name of the representor netdevice.
> > > >
> > > > Ok, let me rephrase this to check that I understand this correctly.
> > > > I'm not sure about some of the terms you use here (even after
> > > > looking at the linked doc/code), but that's probably still ok.
> > > >
> > > > We want to derive an unique (and probably persistent?) netdev name
> > > > so that userspace can refer to a representor netdevice. Makes sense.
> > > > For generating that name, udev uses the phys_port_name (which
> > > > represents the devlink port, IIUC). Also makes sense.
> > > >
> > > You understood it correctly.
> > >
> > > > >
> > > > > phys_port_name has limitation to be only 15 characters long.
> > > > > UUID doesn't fit in phys_port_name.
> > > >
> > > > Understood. But why do we need to derive the phys_port_name from
> > > > the mdev device name? This netdevice use case seems to be just one
> > > > use case for using mdev devices? If this is a specialized mdev
> > > > type for this setup, why not just expose a shorter identifier via an extra
> attribute?
> > > >
> > > Representor netdev, represents mdev's switch port (like PCI SRIOV
> > > VF's switch
> > port).
> > > So user must be able to relate this two objects in similar manner as
> > > SRIOV
> > VFs.
> > > Phys_port_name is derived from the PCI PF and VF numbering scheme.
> > > Similarly mdev's such port should be derived from mdev's
> id/name/attribute.
> > >
> > > > > Longer UUID names are creating snow ball effect, not just in
> > > > > networking stack
> > > > but many user space tools too.
> > > >
> > > > This snowball effect mainly comes from the device name ->
> > > > phys_port_name setup, IIUC.
> > > >
> > > Right.
> > >
> > > > > (as opposed to recently introduced mdevctl, are they more mdev
> > > > > tools which has dependency on UUID name?)
> > > >
> > > > I am aware that people have written scripts etc. to manage their mdevs.
> > > > Given that the mdev infrastructure has been around for quite some
> > > > time, I'd say the chance of some of those scripts relying on uuid
> > > > names is
> > non-zero.
> > > >
> > > Ok. but those scripts have never managed networking devices.
> > > So those scripts won't break because they will always create mdev
> > > devices
> > using UUID.
> > > When they use these new networking devices, they need more things
> > > than
> > their scripts.
> > > So user space upgrade for such mixed mode case is reasonable.
> >
> > Tools like mdevctl are agnostic of the type of mdev device they're
> > managing, it shouldn't matter than they've never managed a networking
> > mdev previously, it follows the standards of mdev management.
> >
> > > > >
> > > > > Instead of mdev subsystem creating such effect, one option we
> > > > > are
> > > > considering is to have shorter mdev names.
> > > > > (Similar to netdev, rdma, nvme devices).
> > > > > Such as mdev1, mdev2000 etc.
> >
> > Note that these are kernel generated names, as are the other examples.
> No. I probably gave the wrong examples.
> Mdev user provided names can be 'foo', 'bar', 'foo1'.
> 
> > In the case of mdev, the user is providing the UUID, which becomes the
> > device name.  When a user writes to the create attribute, there needs
> > to be determinism that the user can identify the device they created
> > vs another that may have been created concurrently.  I don't see that
> > we can put users in the path of managing device instance numbers.
> No. Its just user provided names.
> 
> >
> > > > > Second option I was considering is to have an optional alias for
> > > > > UUID based
> > > > mdev.
> > > > > This name alias is given at time of mdev creation.
> > > > > Devlink port's phys_port_name is derived out of this shorter
> > > > > mdev name
> > > > alias.
> > > > > This way, mdev remains to be UUID based with optional extension.
> > > > > However, I prefer first option to relax mdev naming scheme.
> > > >
> > > > Actually, I think that second option makes much more sense, as you
> > > > avoid potentially breaking existing tooling.
> > > Let's first understand of what exactly will break with existing tool
> > > if they see non_uuid based device.
> >
> > Do we really want a mixed namespace of device names, some UUID, some...
> > something else?  That seems like a mess.
> >
> So you prefer alias as an attribute? If so, it should be an optional additional
> parameter during create time, because it is desired to not invent new callbacks
> for such attributes setting and (and rewrite them).
> 
> > > Existing tooling continue to work with UUID devices.
> > > Do you have example of what can break if they see non_uuid based
> > > device name? I think you are clear, but to be sure, UUID based
> > > creation will continue to be there. Optionally mdev will be created
> > > with alpha-numeric string, if we don't it as additional attribute.
> >
> > I'm not onboard with a UUID being just one of the possible naming
> > strings via which we can create mdev devices.  I think that becomes
> > untenable for userspace.  I don't think a sufficient argument has been
> > made against the alias approach, which seems to keep the UUID as a
> > canonical name, providing a consistent namespace, augmented with user or
> kernel provided short alias.
> > Thanks,
> >
> If I understand you correctly, you prefer alias name approach to keep UUID
> naming scheme intact in mdev?
> 
> > Alex
> >
> > > > >
> > > > > > We want to uniquely identify a device, across different types
> > > > > > of vendor drivers. An uuid is a unique identifier and even a
> > > > > > well-defined one. Tools (e.g. mdevctl) are relying on it for
> > > > > > mdev devices
> > > > today.
> > > > > >
> > > > > > What is the problem you're trying to solve?
> > > > > Unique device naming is still achieved without UUID scheme by
> > > > > various
> > > > subsystems in kernel using alpha-numeric string.
> > > > > Having such string based continue to provide unique names.
> > > > >
> > > > > I hope I described the problem and two solutions above.
> > > > >
> > > > > [1] https://github.com/awilliam/mdevctl
> > > > > [2]
> > > > > https://elixir.bootlin.com/linux/v5.3-rc4/source/drivers/net/eth
> > > > > er
> > > > > net/
> > > > > mellanox/mlx5/core/en_rep.c [3]
> > > > > http://man7.org/linux/man-pages/man8/devlink-port.8.html
> > > > > [4]
> > > > > https://elixir.bootlin.com/linux/v5.3-rc4/source/net/core/devlink.
> > > > > c#L6
> > > > > 921
> > > > > [5] https://www.openvswitch.org/
> > > > >
> > >


^ permalink raw reply

* [PATCH net-next v3 0/4] Improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Hubert Feurstein @ 2019-08-20  8:48 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Miroslav Lichvar,
	Vivien Didelot, Florian Fainelli, Heiner Kallweit,
	Vladimir Oltean, Fugang Duan, David S. Miller

From: Hubert Feurstein <h.feurstein@gmail.com>

Changelog:
 v3: mv88e6xxx_smi_indirect_write: forward ptp_sts only on the last write
     Copied Miroslav Lichvar because of PTP offset compensation patch
 v2: Added patch for PTP offset compensation
     Removed mdiobus_write_sts as there was no user
     Removed ptp_sts_supported-boolean and introduced flags instead

With this patchset the phc2sys synchronisation precision improved to +/-555ns on
an IMX6DL with an MV88E6220 switch attached.

This patchset takes into account the comments from the following discussions:
- https://lkml.org/lkml/2019/8/2/1364
- https://lkml.org/lkml/2019/8/5/169

Patch 01 adds the required infrastructure in the MDIO layer.
Patch 02 adds additional PTP offset compensation.
Patch 03 adds support for the PTP_SYS_OFFSET_EXTENDED ioctl in the mv88e6xxx driver.
Patch 04 adds support for the PTP system timestamping in the imx-fec driver.

The following tests show the improvement caused by each patch. The system clock 
precision was set to 15ns instead of 333ns (as described in https://lkml.org/lkml/2019/8/2/1364).

Without this patchset applied, the phc2sys synchronisation performance was very 
poor:

  offset: min -27120 max 28840 mean 2.44 stddev 8040.78 count 1236
  delay:  min 282103 max 386385 mean 352568.03 stddev 27814.27 count 1236
  (test runtime 20 minutes)

Results after appling patch 01-03:

  offset: min -12316 max 13314 mean -9.38 stddev 4274.82 count 1022
  delay:  min 69977 max 96266 mean 87939.04 stddev 6466.17 count 1022
  (test runtime 16 minutes)

Results after appling patch 04:

  offset: min -788 max 528 mean -0.06 stddev 185.02 count 7171
  delay:  min 1773 max 2031 mean 1909.43 stddev 33.74 count 7171
  (test runtime 119 minutes)

Hubert Feurstein (4):
  net: mdio: add support for passing a PTP system timestamp to the
    mii_bus driver
  net: mdio: add PTP offset compensation to mdiobus_write_sts
  net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
  net: fec: add support for PTP system timestamping for MDIO devices

 drivers/net/dsa/mv88e6xxx/chip.h          |  2 +
 drivers/net/dsa/mv88e6xxx/ptp.c           | 11 +--
 drivers/net/dsa/mv88e6xxx/smi.c           |  7 +-
 drivers/net/ethernet/freescale/fec_main.c |  7 +-
 drivers/net/phy/mdio_bus.c                | 88 +++++++++++++++++++++++
 include/linux/mdio.h                      |  5 ++
 include/linux/phy.h                       | 42 +++++++++++
 7 files changed, 156 insertions(+), 6 deletions(-)

-- 
2.22.1


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox