Netdev List
 help / color / mirror / Atom feed
* [v3,0/4] tools: bpftool: add net attach/detach command to attach XDP prog
From: Daniel T. Lee @ 2019-08-07  2:25 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev

Currently, bpftool net only supports dumping progs attached on the
interface. To attach XDP prog on interface, user must use other tool
(eg. iproute2). By this patch, with `bpftool net attach/detach`, user
can attach/detach XDP prog on interface.

    # bpftool prog
	16: xdp  name xdp_prog1  tag 539ec6ce11b52f98  gpl
        loaded_at 2019-08-07T08:30:17+0900  uid 0
    ...
	20: xdp  name xdp_fwd_prog  tag b9cb69f121e4a274  gpl
        loaded_at 2019-08-07T08:30:17+0900  uid 0
    
	# bpftool net attach xdpdrv id 16 dev enp6s0np0
    # bpftool net
    xdp:
	enp6s0np0(4) driver id 16
    
	# bpftool net attach xdpdrv id 20 dev enp6s0np0 overwrite
    # bpftool net
    xdp:
	enp6s0np0(4) driver id 20

	# bpftool net detach xdpdrv dev enp6s0np0
    # bpftool net
    xdp:


While this patch only contains support for XDP, through `net
attach/detach`, bpftool can further support other prog attach types.

XDP attach/detach tested on Mellanox ConnectX-4 and Netronome Agilio.

---
Changes in v3:
  - added 'overwrite' option for replacing previously attached XDP prog
  - command argument order has been changed ('ATTACH_TYPE' comes first)
  - add 'dev' keyword in front of <devname>
  - added bash-completion and documentation

Changes in v2:
  - command 'load/unload' changed to 'attach/detach' for the consistency

Daniel T. Lee (4):
  tools: bpftool: add net attach command to attach XDP on interface
  tools: bpftool: add net detach command to detach XDP on interface
  tools: bpftool: add bash-completion for net attach/detach
  tools: bpftool: add documentation for net attach/detach

 .../bpf/bpftool/Documentation/bpftool-net.rst |  51 ++++-
 tools/bpf/bpftool/bash-completion/bpftool     |  64 ++++++-
 tools/bpf/bpftool/net.c                       | 181 ++++++++++++++++--
 3 files changed, 273 insertions(+), 23 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [v3,1/4] tools: bpftool: add net attach command to attach XDP on interface
From: Daniel T. Lee @ 2019-08-07  2:25 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20190807022509.4214-1-danieltimlee@gmail.com>

By this commit, using `bpftool net attach`, user can attach XDP prog on
interface. New type of enum 'net_attach_type' has been made, as stated at
cover-letter, the meaning of 'attach' is, prog will be attached on interface.

With 'overwrite' option at argument, attached XDP program could be replaced.
Added new helper 'net_parse_dev' to parse the network device at argument.

BPF prog will be attached through libbpf 'bpf_set_link_xdp_fd'.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 tools/bpf/bpftool/net.c | 141 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 130 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index 67e99c56bc88..c05a3fac5cac 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -55,6 +55,35 @@ struct bpf_attach_info {
 	__u32 flow_dissector_id;
 };
 
+enum net_attach_type {
+	NET_ATTACH_TYPE_XDP,
+	NET_ATTACH_TYPE_XDP_GENERIC,
+	NET_ATTACH_TYPE_XDP_DRIVER,
+	NET_ATTACH_TYPE_XDP_OFFLOAD,
+};
+
+static const char * const attach_type_strings[] = {
+	[NET_ATTACH_TYPE_XDP]		= "xdp",
+	[NET_ATTACH_TYPE_XDP_GENERIC]	= "xdpgeneric",
+	[NET_ATTACH_TYPE_XDP_DRIVER]	= "xdpdrv",
+	[NET_ATTACH_TYPE_XDP_OFFLOAD]	= "xdpoffload",
+};
+
+const size_t max_net_attach_type = ARRAY_SIZE(attach_type_strings);
+
+static enum net_attach_type parse_attach_type(const char *str)
+{
+	enum net_attach_type type;
+
+	for (type = 0; type < max_net_attach_type; type++) {
+		if (attach_type_strings[type] &&
+		   is_prefix(str, attach_type_strings[type]))
+			return type;
+	}
+
+	return max_net_attach_type;
+}
+
 static int dump_link_nlmsg(void *cookie, void *msg, struct nlattr **tb)
 {
 	struct bpf_netdev_t *netinfo = cookie;
@@ -223,6 +252,97 @@ static int query_flow_dissector(struct bpf_attach_info *attach_info)
 	return 0;
 }
 
+static int net_parse_dev(int *argc, char ***argv)
+{
+	int ifindex;
+
+	if (is_prefix(**argv, "dev")) {
+		NEXT_ARGP();
+
+		ifindex = if_nametoindex(**argv);
+		if (!ifindex)
+			p_err("invalid devname %s", **argv);
+
+		NEXT_ARGP();
+	} else {
+		p_err("expected 'dev', got: '%s'?", **argv);
+		return -1;
+	}
+
+	return ifindex;
+}
+
+static int do_attach_detach_xdp(int progfd, enum net_attach_type attach_type,
+				int ifindex, bool overwrite)
+{
+	__u32 flags = 0;
+
+	if (!overwrite)
+		flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+	if (attach_type == NET_ATTACH_TYPE_XDP_GENERIC)
+		flags |= XDP_FLAGS_SKB_MODE;
+	if (attach_type == NET_ATTACH_TYPE_XDP_DRIVER)
+		flags |= XDP_FLAGS_DRV_MODE;
+	if (attach_type == NET_ATTACH_TYPE_XDP_OFFLOAD)
+		flags |= XDP_FLAGS_HW_MODE;
+
+	return bpf_set_link_xdp_fd(ifindex, progfd, flags);
+}
+
+static int do_attach(int argc, char **argv)
+{
+	enum net_attach_type attach_type;
+	int progfd, ifindex, err = 0;
+	bool overwrite = false;
+
+	/* parse attach args */
+	if (!REQ_ARGS(5))
+		return -EINVAL;
+
+	attach_type = parse_attach_type(*argv);
+	if (attach_type == max_net_attach_type) {
+		p_err("invalid net attach/detach type");
+		return -EINVAL;
+	}
+
+	NEXT_ARG();
+	progfd = prog_parse_fd(&argc, &argv);
+	if (progfd < 0)
+		return -EINVAL;
+
+	ifindex = net_parse_dev(&argc, &argv);
+	if (ifindex < 1) {
+		close(progfd);
+		return -EINVAL;
+	}
+
+	if (argc) {
+		if (is_prefix(*argv, "overwrite")) {
+			overwrite = true;
+		} else {
+			p_err("expected 'overwrite', got: '%s'?", *argv);
+			close(progfd);
+			return -EINVAL;
+		}
+	}
+
+	/* attach xdp prog */
+	if (is_prefix("xdp", attach_type_strings[attach_type]))
+		err = do_attach_detach_xdp(progfd, attach_type, ifindex,
+					   overwrite);
+
+	if (err < 0) {
+		p_err("interface %s attach failed",
+		      attach_type_strings[attach_type]);
+		return err;
+	}
+
+	if (json_output)
+		jsonw_null(json_wtr);
+
+	return 0;
+}
+
 static int do_show(int argc, char **argv)
 {
 	struct bpf_attach_info attach_info = {};
@@ -231,17 +351,10 @@ static int do_show(int argc, char **argv)
 	unsigned int nl_pid;
 	char err_buf[256];
 
-	if (argc == 2) {
-		if (strcmp(argv[0], "dev") != 0)
-			usage();
-		filter_idx = if_nametoindex(argv[1]);
-		if (filter_idx == 0) {
-			fprintf(stderr, "invalid dev name %s\n", argv[1]);
-			return -1;
-		}
-	} else if (argc != 0) {
+	if (argc == 2)
+		filter_idx = net_parse_dev(&argc, &argv);
+	else if (argc != 0)
 		usage();
-	}
 
 	ret = query_flow_dissector(&attach_info);
 	if (ret)
@@ -305,13 +418,18 @@ static int do_help(int argc, char **argv)
 
 	fprintf(stderr,
 		"Usage: %s %s { show | list } [dev <devname>]\n"
+		"       %s %s attach ATTACH_TYPE PROG dev <devname> [ overwrite ]\n"
 		"       %s %s help\n"
+		"\n"
+		"       " HELP_SPEC_PROGRAM "\n"
+		"       ATTACH_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
+		"\n"
 		"Note: Only xdp and tc attachments are supported now.\n"
 		"      For progs attached to cgroups, use \"bpftool cgroup\"\n"
 		"      to dump program attachments. For program types\n"
 		"      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
 		"      consult iproute2.\n",
-		bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
 
 	return 0;
 }
@@ -319,6 +437,7 @@ static int do_help(int argc, char **argv)
 static const struct cmd cmds[] = {
 	{ "show",	do_show },
 	{ "list",	do_show },
+	{ "attach",	do_attach },
 	{ "help",	do_help },
 	{ 0 }
 };
-- 
2.20.1


^ permalink raw reply related

* [v3,2/4] tools: bpftool: add net detach command to detach XDP on interface
From: Daniel T. Lee @ 2019-08-07  2:25 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20190807022509.4214-1-danieltimlee@gmail.com>

By this commit, using `bpftool net detach`, the attached XDP prog can
be detached. Detaching the BPF prog will be done through libbpf
'bpf_set_link_xdp_fd' with the progfd set to -1.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 tools/bpf/bpftool/net.c | 42 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index c05a3fac5cac..7be96acb08e0 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -343,6 +343,43 @@ static int do_attach(int argc, char **argv)
 	return 0;
 }
 
+static int do_detach(int argc, char **argv)
+{
+	enum net_attach_type attach_type;
+	int progfd, ifindex, err = 0;
+
+	/* parse detach args */
+	if (!REQ_ARGS(3))
+		return -EINVAL;
+
+	attach_type = parse_attach_type(*argv);
+	if (attach_type == max_net_attach_type) {
+		p_err("invalid net attach/detach type");
+		return -EINVAL;
+	}
+
+	NEXT_ARG();
+	ifindex = net_parse_dev(&argc, &argv);
+	if (ifindex < 1)
+		return -EINVAL;
+
+	/* detach xdp prog */
+	progfd = -1;
+	if (is_prefix("xdp", attach_type_strings[attach_type]))
+		err = do_attach_detach_xdp(progfd, attach_type, ifindex, NULL);
+
+	if (err < 0) {
+		p_err("interface %s detach failed",
+		      attach_type_strings[attach_type]);
+		return err;
+	}
+
+	if (json_output)
+		jsonw_null(json_wtr);
+
+	return 0;
+}
+
 static int do_show(int argc, char **argv)
 {
 	struct bpf_attach_info attach_info = {};
@@ -419,6 +456,7 @@ static int do_help(int argc, char **argv)
 	fprintf(stderr,
 		"Usage: %s %s { show | list } [dev <devname>]\n"
 		"       %s %s attach ATTACH_TYPE PROG dev <devname> [ overwrite ]\n"
+		"       %s %s detach ATTACH_TYPE dev <devname>\n"
 		"       %s %s help\n"
 		"\n"
 		"       " HELP_SPEC_PROGRAM "\n"
@@ -429,7 +467,8 @@ static int do_help(int argc, char **argv)
 		"      to dump program attachments. For program types\n"
 		"      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
 		"      consult iproute2.\n",
-		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2]);
 
 	return 0;
 }
@@ -438,6 +477,7 @@ static const struct cmd cmds[] = {
 	{ "show",	do_show },
 	{ "list",	do_show },
 	{ "attach",	do_attach },
+	{ "detach",	do_detach },
 	{ "help",	do_help },
 	{ 0 }
 };
-- 
2.20.1


^ permalink raw reply related

* [v3,3/4] tools: bpftool: add bash-completion for net attach/detach
From: Daniel T. Lee @ 2019-08-07  2:25 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20190807022509.4214-1-danieltimlee@gmail.com>

This commit adds bash-completion for new "net attach/detach"
subcommand for attaching XDP program on interface.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 tools/bpf/bpftool/bash-completion/bpftool | 64 +++++++++++++++++++----
 1 file changed, 55 insertions(+), 9 deletions(-)

diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index c8f42e1fcbc9..1d81cb09d478 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -201,6 +201,10 @@ _bpftool()
             _bpftool_get_prog_tags
             return 0
             ;;
+        dev)
+            _sysfs_get_netdevs
+            return 0
+            ;;
         file|pinned)
             _filedir
             return 0
@@ -399,10 +403,6 @@ _bpftool()
                             _filedir
                             return 0
                             ;;
-                        dev)
-                            _sysfs_get_netdevs
-                            return 0
-                            ;;
                         *)
                             COMPREPLY=( $( compgen -W "map" -- "$cur" ) )
                             _bpftool_once_attr 'type'
@@ -498,10 +498,6 @@ _bpftool()
                         key|value|flags|name|entries)
                             return 0
                             ;;
-                        dev)
-                            _sysfs_get_netdevs
-                            return 0
-                            ;;
                         *)
                             _bpftool_once_attr 'type'
                             _bpftool_once_attr 'key'
@@ -775,11 +771,61 @@ _bpftool()
             esac
             ;;
         net)
+            local PROG_TYPE='id pinned tag'
+            local ATTACH_TYPES='xdp xdpgeneric xdpdrv xdpoffload'
             case $command in
+                show|list)
+                    [[ $prev != "$command" ]] && return 0
+                    COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
+                    return 0
+                    ;;
+                attach)
+                    case $cword in
+                        3)
+                            COMPREPLY=( $( compgen -W "$ATTACH_TYPES" -- "$cur" ) )
+                            return 0
+                            ;;
+                        4)
+                            COMPREPLY=( $( compgen -W "$PROG_TYPE" -- "$cur" ) )
+                            return 0
+                            ;;
+                        5)
+                            case $prev in
+                                id)
+                                    _bpftool_get_prog_ids
+                                    ;;
+                                pinned)
+                                    _filedir
+                                    ;;
+                            esac
+                            return 0
+                            ;;
+                        6)
+                            COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
+                            return 0
+                            ;;
+                        8)
+                            _bpftool_once_attr 'overwrite'
+                            return 0
+                            ;;
+                    esac
+                    ;;
+                detach)
+                    case $cword in
+                        3)
+                            COMPREPLY=( $( compgen -W "$ATTACH_TYPES" -- "$cur" ) )
+                            return 0
+                            ;;
+                        4)
+                            COMPREPLY=( $( compgen -W 'dev' -- "$cur" ) )
+                            return 0
+                            ;;
+                    esac
+                    ;;
                 *)
                     [[ $prev == $object ]] && \
                         COMPREPLY=( $( compgen -W 'help \
-                            show list' -- "$cur" ) )
+                            show list attach detach' -- "$cur" ) )
                     ;;
             esac
             ;;
-- 
2.20.1


^ permalink raw reply related

* [v3,4/4] tools: bpftool: add documentation for net attach/detach
From: Daniel T. Lee @ 2019-08-07  2:25 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20190807022509.4214-1-danieltimlee@gmail.com>

Since, new sub-command 'net attach/detach' has been added for
attaching XDP program on interface,
this commit documents usage and sample output of `net attach/detach`.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
 .../bpf/bpftool/Documentation/bpftool-net.rst | 51 +++++++++++++++++--
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-net.rst b/tools/bpf/bpftool/Documentation/bpftool-net.rst
index d8e5237a2085..4ad1a380e186 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-net.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-net.rst
@@ -15,17 +15,22 @@ SYNOPSIS
 	*OPTIONS* := { [{ **-j** | **--json** }] [{ **-p** | **--pretty** }] }
 
 	*COMMANDS* :=
-	{ **show** | **list** } [ **dev** name ] | **help**
+	{ **show** | **list** | **attach** | **detach** | **help** }
 
 NET COMMANDS
 ============
 
-|	**bpftool** **net { show | list } [ dev name ]**
+|	**bpftool** **net { show | list }** [ **dev** *name* ]
+|	**bpftool** **net attach** *ATTACH_TYPE* *PROG* **dev** *name* [ **overwrite** ]
+|	**bpftool** **net detach** *ATTACH_TYPE* **dev** *name*
 |	**bpftool** **net help**
+|
+|	*PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
+|	*ATTACH_TYPE* := { **xdp** | **xdpgeneric** | **xdpdrv** | **xdpoffload** }
 
 DESCRIPTION
 ===========
-	**bpftool net { show | list } [ dev name ]**
+	**bpftool net { show | list }** [ **dev** *name* ]
                   List bpf program attachments in the kernel networking subsystem.
 
                   Currently, only device driver xdp attachments and tc filter
@@ -47,6 +52,18 @@ DESCRIPTION
                   all bpf programs attached to non clsact qdiscs, and finally all
                   bpf programs attached to root and clsact qdisc.
 
+	**bpftool** **net attach** *ATTACH_TYPE* *PROG* **dev** *name* [ **overwrite** ]
+                  Attach bpf program *PROG* to network interface *name* with
+                  type specified by *ATTACH_TYPE*. Previously attached bpf program
+                  can be replaced by the command used with **overwrite** option.
+                  Currently, *ATTACH_TYPE* only contains XDP programs.
+
+	**bpftool** **net detach** *ATTACH_TYPE* **dev** *name*
+                  Detach bpf program attached to network interface *name* with
+                  type specified by *ATTACH_TYPE*. To detach bpf program, same
+                  *ATTACH_TYPE* previously used for attach must be specified.
+                  Currently, *ATTACH_TYPE* only contains XDP programs.
+
 	**bpftool net help**
 		  Print short help message.
 
@@ -137,6 +154,34 @@ EXAMPLES
         }
     ]
 
+|
+| **# bpftool net attach xdpdrv id 16 dev enp6s0np0**
+| **# bpftool net**
+
+::
+
+      xdp:
+      enp6s0np0(4) driver id 16
+
+|
+| **# bpftool net attach xdpdrv id 16 dev enp6s0np0**
+| **# bpftool net attach xdpdrv id 20 dev enp6s0np0 overwrite**
+| **# bpftool net**
+
+::
+
+      xdp:
+      enp6s0np0(4) driver id 20
+
+|
+| **# bpftool net attach xdpdrv id 16 dev enp6s0np0**
+| **# bpftool net detach xdpdrv dev enp6s0np0**
+| **# bpftool net**
+
+::
+
+      xdp:
+
 
 SEE ALSO
 ========
-- 
2.20.1


^ permalink raw reply related

* RE: Realtek r8822be wireless card fails to work with new rtw88 kernel module
From: Tony Chuang @ 2019-08-07  2:33 UTC (permalink / raw)
  To: Brian Norris, 고준
  Cc: linux-wireless, <netdev@vger.kernel.org>, Linux Kernel
In-Reply-To: <CA+ASDXM6Jz7YY9XUj6QKv5VJCED-BnQ5K1UZHNApB9p6qTWtgg@mail.gmail.com>

> + yhchuang
> 
> On Tue, Aug 6, 2019 at 7:32 AM 고준 <gojun077@gmail.com> wrote:
> >
> > Hello,
> >
> > I recently reported a bug to Ubuntu regarding a regression in wireless
> > driver support for the Realtek r8822be wireless chipset. The issue
> > link on launchpad is:
> >
> > https://bugs.launchpad.net/bugs/1838133
> >
> > After Canonical developers triaged the bug they determined that the
> > problem lies upstream, and instructed me to send mails to the relevant
> > kernel module maintainers at Realtek and to the general kernel.org
> > mailing list.
> >
> > I built kernel 5.3.0-rc1+ with the latest realtek drivers from
> > wireless-drivers-next but my Realtek r8822be doesn't work with
> > rtw88/rtwpci kernel modules.
> >
> > Please let me know if there is any additional information I can
> > provide that would help in debugging this issue.
> 
> Any chance this would help you?
> 
> https://patchwork.kernel.org/patch/11065631/
> 
> Somebody else was complaining about 8822be regressions that were fixed
> with that.
> 

I hope it could fix it.

And as "r8822be" was dropped, it is preferred to use "rtw88" instead.
I have received two kinds of failures that cause driver stop working.
One is the MSI interrupt should be enabled on certain platforms.
Another is the RFE type of the card, could you send more dmesg to me?

Yan-Hsuan



^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: David Ahern @ 2019-08-07  2:33 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jiri Pirko, netdev, davem, mlxsw, jakub.kicinski, f.fainelli,
	vivien.didelot, mkubecek, stephen, daniel, brouer, eric.dumazet,
	Jakub Kicinski
In-Reply-To: <20190806180346.GD17072@lunn.ch>

Some time back supported was added for devlink 'resources'. The idea is
that hardware (mlxsw) has limited resources (e.g., memory) that can be
allocated in certain ways (e.g., kvd for mlxsw) thus implementing
restrictions on the number of programmable entries (e.g., routes,
neighbors) by userspace.

I contend:

1. The kernel is an analogy to the hardware: it is programmed by
userspace, has limited resources (e.g., memory), and that users want to
control (e.g., limit) the number of networking entities that can be
programmed - routes, rules, nexthop objects etc and by address family
(ipv4, ipv6).

2. A consistent operational model across use cases - s/w forwarding, XDP
forwarding and hardware forwarding - is good for users deploying systems
based on the Linux networking stack. This aligns with my basic point at
LPC last November about better integration of XDP and kernel tables.

The existing devlink API is the right one for all use cases. Most
notably that the kernel can mimic the hardware from a resource
management. Trying to say 'use cgroups for s/w forwarding and devlink
for h/w forwarding' is complicating the lives of users. It is just a
model and models can apply to more than some rigid definition.

As for the namespace piece of this, the kernel's tables for networking
are *per namespace*, and so the resource controller must be per
namespace. This aligns with another consistent theme I have promoted
over the years - the ability to divide up a single ASIC into multiple,
virtual switches which are managed per namespace. This is a very popular
feature from a certain legacy vendor and one that would be good for open
networking to achieve. This is the basis of my response last week about
the devlink instance per namespace, and I thought Jiri was moving in
that direction until our chat today. Jiri's intention is something
different; we can discuss that on the next version of his patches.

###

As for the current controller put into netdevsim...

When I started down this road 18-20 months ago, I was copying a lot of
netdevsim code to create a fake device from which I could have a devlink
instance to implement the devlink resources. At some point it was silly
to keep duplicating the code - just make it part of netdevsim. After all
it really mirrors mlxsw and the resource limits for fib notifier
handling, it allows testing of the userspace APIs and in kernel notifier
APIs which allow an entity to veto a change. This is all consistent with
the intent of netdevsim - s/w based implementation for testing of APIs
that otherwise require hardware.

^ permalink raw reply

* [PATCH v2 1/1] ixgbe: sync the first fragment unconditionally
From: Firo Yang @ 2019-08-07  2:49 UTC (permalink / raw)
  To: davem@davemloft.net
  Cc: alexander.h.duyck@linux.intel.com, jeffrey.t.kirsher@intel.com,
	netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	linux-kernel@vger.kernel.org, Firo Yang

In Xen environment, if Xen-swiotlb is enabled, ixgbe driver
could possibly allocate a page, DMA memory buffer, for the first
fragment which is not suitable for Xen-swiotlb to do DMA operations.
Xen-swiotlb have to internally allocate another page for doing DMA
operations. It requires syncing between those two pages. However,
since commit f3213d932173 ("ixgbe: Update driver to make use of DMA
attributes in Rx path"), the unmap operation is performed with
DMA_ATTR_SKIP_CPU_SYNC. As a result, the sync is not performed.

To fix this problem, always sync before possibly performing a page
unmap operation.

Fixes: f3213d932173 ("ixgbe: Update driver to make use of DMA
attributes in Rx path")
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Firo Yang <firo.yang@suse.com>
---

Changes from v1:
 * Imporved the patch description.
 * Added Reviewed-by: and Fixes: as suggested by Alexander Duyck

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index cbaf712d6529..200de9838096 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1825,13 +1825,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
 static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
 				struct sk_buff *skb)
 {
-	/* if the page was released unmap it, else just sync our portion */
-	if (unlikely(IXGBE_CB(skb)->page_released)) {
-		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
-				     ixgbe_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     IXGBE_RX_DMA_ATTR);
-	} else if (ring_uses_build_skb(rx_ring)) {
+	if (ring_uses_build_skb(rx_ring)) {
 		unsigned long offset = (unsigned long)(skb->data) & ~PAGE_MASK;
 
 		dma_sync_single_range_for_cpu(rx_ring->dev,
@@ -1848,6 +1842,14 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
 					      skb_frag_size(frag),
 					      DMA_FROM_DEVICE);
 	}
+
+	/* If the page was released, just unmap it. */
+	if (unlikely(IXGBE_CB(skb)->page_released)) {
+		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
+				     ixgbe_rx_pg_size(rx_ring),
+				     DMA_FROM_DEVICE,
+				     IXGBE_RX_DMA_ATTR);
+	}
 }
 
 /**
-- 
2.16.4


^ permalink raw reply related

* RE: Slowness forming TIPC cluster with explicit node addresses
From: Jon Maloy @ 2019-08-07  2:55 UTC (permalink / raw)
  To: Chris Packham, tipc-discussion@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1564959879.27215.18.camel@alliedtelesis.co.nz>



> -----Original Message-----
> From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
> Sent: 4-Aug-19 19:05
> To: Jon Maloy <jon.maloy@ericsson.com>; tipc-
> discussion@lists.sourceforge.net
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: Slowness forming TIPC cluster with explicit node addresses
> 
> On Sun, 2019-08-04 at 21:53 +0000, Jon Maloy wrote:
> >
> > >
> > > -----Original Message-----
> > > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> On
> > > Behalf Of Chris Packham
> > > Sent: 2-Aug-19 01:11
> > > To: Jon Maloy <jon.maloy@ericsson.com>; tipc-
> > > discussion@lists.sourceforge.net
> > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > > Subject: Re: Slowness forming TIPC cluster with explicit node
> > > addresses
> > >
> > > On Mon, 2019-07-29 at 09:04 +1200, Chris Packham wrote:
> > > >
> > > > On Fri, 2019-07-26 at 13:31 +0000, Jon Maloy wrote:
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: netdev-owner@vger.kernel.org <netdev-
> > > owner@vger.kernel.org>
> > > >
> > > > >
> > > > > >
> > > > > > On Behalf Of Chris Packham
> > > > > > Sent: 25-Jul-19 19:37
> > > > > > To: tipc-discussion@lists.sourceforge.net
> > > > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > > > > > Subject: Slowness forming TIPC cluster with explicit node
> > > > > > addresses
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm having problems forming a TIPC cluster between 2 nodes.
> > > > > >
> > > > > > This is the basic steps I'm going through on each node.
> > > > > >
> > > > > > modprobe tipc
> > > > > > ip link set eth2 up
> > > > > > tipc node set addr 1.1.5 # or 1.1.6 tipc bearer enable media
> > > > > > eth dev eth0
> > > > > eth2, I assume...
> > > > >
> > > > Yes sorry I keep switching between between Ethernet ports for
> > > > testing
> > > > so I hand edited the email.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Then to confirm if the cluster is formed I use tipc link list
> > > > > >
> > > > > > [root@node-5 ~]# tipc link list
> > > > > > broadcast-link: up
> > > > > > ...
> > > > > >
> > > > > > Looking at tcpdump the two nodes are sending packets
> > > > > >
> > > > > > 22:30:05.782320 TIPC v2.0 1.1.5 > 0.0.0, headerlength 60
> > > > > > bytes,
> > > > > > MessageSize
> > > > > > 76 bytes, Neighbor Detection Protocol internal, messageType
> > > > > > Link
> > > > > > request
> > > > > > 22:30:05.863555 TIPC v2.0 1.1.6 > 0.0.0, headerlength 60
> > > > > > bytes,
> > > > > > MessageSize
> > > > > > 76 bytes, Neighbor Detection Protocol internal, messageType
> > > > > > Link
> > > > > > request
> > > > > >
> > > > > > Eventually (after a few minutes) the link does come up
> > > > > >
> > > > > > [root@node-6 ~]# tipc link list
> > > > > > broadcast-link: up
> > > > > > 1001006:eth2-1001005:eth2: up
> > > > > >
> > > > > > [root@node-5 ~]# tipc link list
> > > > > > broadcast-link: up
> > > > > > 1001005:eth2-1001006:eth2: up
> > > > > >
> > > > > > When I remove the "tipc node set addr" things seem to kick
> > > > > > into
> > > > > > life straight away
> > > > > >
> > > > > > [root@node-5 ~]# tipc link list
> > > > > > broadcast-link: up
> > > > > > 0050b61bd2aa:eth2-0050b61e6dfa:eth2: up
> > > > > >
> > > > > > So there appears to be some difference in behaviour between
> > > > > > having
> > > > > > an explicit node address and using the default. Unfortunately
> > > > > > our
> > > > > > application relies on setting the node addresses.
> > > > > I do this many times a day, without any problems. If there
> > > > > would be
> > > > > any time difference, I would expect the 'auto configurable'
> > > > > version
> > > > > to be slower, because it involves a DAD step.
> > > > > Are you sure you don't have any other nodes running in your
> > > > > system?
> > > > >
> > > > > ///jon
> > > > >
> > > > Nope the two nodes are connected back to back. Does the number of
> > > > Ethernet interfaces make a difference? As you can see I've got 3
> > > > on
> > > > each node. One is completely disconnected, one is for booting
> > > > over
> > > > TFTP
> > > >  (only used by U-boot) and the other is the USB Ethernet I'm
> > > > using for
> > > > testing.
> > > >
> > > So I can still reproduce this on nodes that only have one network
> > > interface and
> > > are the only things connected.
> > >
> > > I did find one thing that helps
> > >
> > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index
> > > c138d68e8a69..49921dad404a 100644
> > > --- a/net/tipc/discover.c
> > > +++ b/net/tipc/discover.c
> > > @@ -358,10 +358,10 @@ int tipc_disc_create(struct net *net, struct
> > > tipc_bearer *b,
> > >         tipc_disc_init_msg(net, d->skb, DSC_REQ_MSG, b);
> > >
> > >         /* Do we need an address trial period first ? */
> > > -       if (!tipc_own_addr(net)) {
> > > +//     if (!tipc_own_addr(net)) {
> > >                 tn->addr_trial_end = jiffies +
> > > msecs_to_jiffies(1000);
> > >                 msg_set_type(buf_msg(d->skb), DSC_TRIAL_MSG);
> > > -       }
> > > +//     }
> > >         memcpy(&d->dest, dest, sizeof(*dest));
> > >         d->net = net;
> > >         d->bearer_id = b->identity;
> > >
> > > I think because with pre-configured addresses the duplicate address
> > > detection
> > > is skipped the shorter init phase is skipped. Would is make sense
> > > to
> > > unconditionally do the trial step? Or is there some better way to
> > > get things to
> > > transition with pre-assigned addresses.
> >
> > I am on vacation until the end of next-week, so I can't give you any
> > good analysis right now.
> 
> Thanks for taking the time to respond.
> 
> > To do the trial step doesn’t make much sense to me, -it would only
> > delay the setup unnecessarily (but with only 1 second).
> > Can you check the initial value of addr_trial_end when there a pre-
> > configured address?
> 
> I had the same thought. For both my devices 'addr_trial_end = 0' so I
> think tipc_disc_addr_trial_msg should end up with trial == false

I suggest you try initializing it to jiffies and see what happens.

///jon

> 
> >
> > ///jon
> >

^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: Andrew Lunn @ 2019-08-07  2:59 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, netdev, davem, mlxsw, jakub.kicinski, f.fainelli,
	vivien.didelot, mkubecek, stephen, daniel, brouer, eric.dumazet
In-Reply-To: <e0047c07-11a0-423c-9560-3806328a0d76@gmail.com>

On Tue, Aug 06, 2019 at 08:33:47PM -0600, David Ahern wrote:
> Some time back supported was added for devlink 'resources'. The idea is
> that hardware (mlxsw) has limited resources (e.g., memory) that can be
> allocated in certain ways (e.g., kvd for mlxsw) thus implementing
> restrictions on the number of programmable entries (e.g., routes,
> neighbors) by userspace.
> 
> I contend:
> 
> 1. The kernel is an analogy to the hardware: it is programmed by
> userspace, has limited resources (e.g., memory), and that users want to
> control (e.g., limit) the number of networking entities that can be
> programmed - routes, rules, nexthop objects etc and by address family
> (ipv4, ipv6).
> 
> 2. A consistent operational model across use cases - s/w forwarding, XDP
> forwarding and hardware forwarding - is good for users deploying systems
> based on the Linux networking stack. This aligns with my basic point at
> LPC last November about better integration of XDP and kernel tables.

Hi David

Nice arguments.

However, zoom out a bit, from networking to the whole kernel. In
general, across the kernel as a whole, resource management is done
with cgroups. cgroups is the consistent operational model across the
kernel as a whole.

So i think you need a second leg to your argument. You have said why
devlink is the right way to do this. But you should also be able to
say to Tejun Heo why cgroups is the wrong way to do this, going
against the kernel as a whole model. Why is networking special?

      Andrew

^ permalink raw reply

* [net-next v3] tipc: add loopback device tracking
From: john.rutherford @ 2019-08-07  2:52 UTC (permalink / raw)
  To: davem, netdev, tipc-discussion; +Cc: John Rutherford

From: John Rutherford <john.rutherford@dektech.com.au>

Since node internal messages are passed directly to the socket, it is not
possible to observe those messages via tcpdump or wireshark.

We now remedy this by making it possible to clone such messages and send
the clones to the loopback interface.  The clones are dropped at reception
and have no functional role except making the traffic visible.

The feature is enabled if network taps are active for the loopback device.
pcap filtering restrictions require the messages to be presented to the
receiving side of the loopback device.

v3 - Function dev_nit_active used to check for network taps.
   - Procedure netif_rx_ni used to send cloned messages to loopback device.

Signed-off-by: John Rutherford <john.rutherford@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
---
 net/tipc/bcast.c  |  4 +++-
 net/tipc/bearer.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/tipc/bearer.h | 10 +++++++++
 net/tipc/core.c   |  5 +++++
 net/tipc/core.h   |  3 +++
 net/tipc/node.c   |  1 +
 net/tipc/topsrv.c |  2 ++
 7 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 6c997d4..235331d 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -406,8 +406,10 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts,
 			rc = tipc_bcast_xmit(net, pkts, cong_link_cnt);
 	}
 
-	if (dests->local)
+	if (dests->local) {
+		tipc_loopback_trace(net, &localq);
 		tipc_sk_mcast_rcv(net, &localq, &inputq);
+	}
 exit:
 	/* This queue should normally be empty by now */
 	__skb_queue_purge(pkts);
diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 2bed658..93c9616 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -389,6 +389,11 @@ int tipc_enable_l2_media(struct net *net, struct tipc_bearer *b,
 		dev_put(dev);
 		return -EINVAL;
 	}
+	if (dev == net->loopback_dev) {
+		dev_put(dev);
+		pr_info("Enabling <%s> not permitted\n", b->name);
+		return -EINVAL;
+	}
 
 	/* Autoconfigure own node identity if needed */
 	if (!tipc_own_id(net) && hwaddr_len <= NODE_ID_LEN) {
@@ -674,6 +679,65 @@ void tipc_bearer_stop(struct net *net)
 	}
 }
 
+void tipc_clone_to_loopback(struct net *net, struct sk_buff_head *pkts)
+{
+	struct net_device *dev = net->loopback_dev;
+	struct sk_buff *skb, *_skb;
+	int exp;
+
+	skb_queue_walk(pkts, _skb) {
+		skb = pskb_copy(_skb, GFP_ATOMIC);
+		if (!skb)
+			continue;
+
+		exp = SKB_DATA_ALIGN(dev->hard_header_len - skb_headroom(skb));
+		if (exp > 0 && pskb_expand_head(skb, exp, 0, GFP_ATOMIC)) {
+			kfree_skb(skb);
+			continue;
+		}
+
+		skb_reset_network_header(skb);
+		dev_hard_header(skb, dev, ETH_P_TIPC, dev->dev_addr,
+				dev->dev_addr, skb->len);
+		skb->dev = dev;
+		skb->pkt_type = PACKET_HOST;
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+		skb->protocol = eth_type_trans(skb, dev);
+		netif_rx_ni(skb);
+	}
+}
+
+static int tipc_loopback_rcv_pkt(struct sk_buff *skb, struct net_device *dev,
+				 struct packet_type *pt, struct net_device *od)
+{
+	consume_skb(skb);
+	return NET_RX_SUCCESS;
+}
+
+int tipc_attach_loopback(struct net *net)
+{
+	struct net_device *dev = net->loopback_dev;
+	struct tipc_net *tn = tipc_net(net);
+
+	if (!dev)
+		return -ENODEV;
+
+	dev_hold(dev);
+	tn->loopback_pt.dev = dev;
+	tn->loopback_pt.type = htons(ETH_P_TIPC);
+	tn->loopback_pt.func = tipc_loopback_rcv_pkt;
+	dev_add_pack(&tn->loopback_pt);
+	return 0;
+}
+
+void tipc_detach_loopback(struct net *net)
+{
+	struct tipc_net *tn = tipc_net(net);
+
+	dev_remove_pack(&tn->loopback_pt);
+	dev_put(net->loopback_dev);
+}
+
 /* Caller should hold rtnl_lock to protect the bearer */
 static int __tipc_nl_add_bearer(struct tipc_nl_msg *msg,
 				struct tipc_bearer *bearer, int nlflags)
diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h
index 7f4c569..ea0f3c4 100644
--- a/net/tipc/bearer.h
+++ b/net/tipc/bearer.h
@@ -232,6 +232,16 @@ void tipc_bearer_xmit(struct net *net, u32 bearer_id,
 		      struct tipc_media_addr *dst);
 void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
 			 struct sk_buff_head *xmitq);
+void tipc_clone_to_loopback(struct net *net, struct sk_buff_head *pkts);
+int tipc_attach_loopback(struct net *net);
+void tipc_detach_loopback(struct net *net);
+
+static inline void tipc_loopback_trace(struct net *net,
+				       struct sk_buff_head *pkts)
+{
+	if (unlikely(dev_nit_active(net->loopback_dev)))
+		tipc_clone_to_loopback(net, pkts);
+}
 
 /* check if device MTU is too low for tipc headers */
 static inline bool tipc_mtu_bad(struct net_device *dev, unsigned int reserve)
diff --git a/net/tipc/core.c b/net/tipc/core.c
index c837072..23cb379 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -82,6 +82,10 @@ static int __net_init tipc_init_net(struct net *net)
 	if (err)
 		goto out_bclink;
 
+	err = tipc_attach_loopback(net);
+	if (err)
+		goto out_bclink;
+
 	return 0;
 
 out_bclink:
@@ -94,6 +98,7 @@ static int __net_init tipc_init_net(struct net *net)
 
 static void __net_exit tipc_exit_net(struct net *net)
 {
+	tipc_detach_loopback(net);
 	tipc_net_stop(net);
 	tipc_bcast_stop(net);
 	tipc_nametbl_stop(net);
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 7a68e1b..60d8295 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -125,6 +125,9 @@ struct tipc_net {
 
 	/* Cluster capabilities */
 	u16 capabilities;
+
+	/* Tracing of node internal messages */
+	struct packet_type loopback_pt;
 };
 
 static inline struct tipc_net *tipc_net(struct net *net)
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 550581d..16d251b 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1443,6 +1443,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list,
 	int rc;
 
 	if (in_own_node(net, dnode)) {
+		tipc_loopback_trace(net, list);
 		tipc_sk_rcv(net, list);
 		return 0;
 	}
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index f345662..e3a6ba1 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -40,6 +40,7 @@
 #include "socket.h"
 #include "addr.h"
 #include "msg.h"
+#include "bearer.h"
 #include <net/sock.h>
 #include <linux/module.h>
 
@@ -608,6 +609,7 @@ static void tipc_topsrv_kern_evt(struct net *net, struct tipc_event *evt)
 	memcpy(msg_data(buf_msg(skb)), evt, sizeof(*evt));
 	skb_queue_head_init(&evtq);
 	__skb_queue_tail(&evtq, skb);
+	tipc_loopback_trace(net, &evtq);
 	tipc_sk_rcv(net, &evtq);
 }
 
-- 
2.11.0


^ permalink raw reply related

* [PATCH] team: Add vlan tx offload to hw_enc_features
From: YueHaibing @ 2019-08-07  2:38 UTC (permalink / raw)
  To: j.vosburgh, vfalico, andy, davem, jiri, jay.vosburgh
  Cc: linux-kernel, netdev, YueHaibing

We should also enable bonding's vlan tx offload in hw_enc_features,
pass the vlan packets to the slave devices with vlan tci, let them
to handle vlan tunneling offload implementation.

Fixes: 3268e5cb494d ("team: Advertise tunneling offload features")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/team/team.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index abfa0da..e8089de 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1004,6 +1004,8 @@ static void __team_compute_features(struct team *team)
 
 	team->dev->vlan_features = vlan_features;
 	team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
+				     NETIF_F_HW_VLAN_CTAG_TX |
+				     NETIF_F_HW_VLAN_STAG_TX |
 				     NETIF_F_GSO_UDP_L4;
 	team->dev->hard_header_len = max_hard_header_len;
 
-- 
2.7.4



^ permalink raw reply related

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: David Ahern @ 2019-08-07  3:10 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jiri Pirko, netdev, davem, mlxsw, jakub.kicinski, f.fainelli,
	vivien.didelot, mkubecek, stephen, daniel, brouer, eric.dumazet
In-Reply-To: <20190807025933.GF20422@lunn.ch>

On 8/6/19 8:59 PM, Andrew Lunn wrote:
> However, zoom out a bit, from networking to the whole kernel. In
> general, across the kernel as a whole, resource management is done
> with cgroups. cgroups is the consistent operational model across the
> kernel as a whole.
> 
> So i think you need a second leg to your argument. You have said why
> devlink is the right way to do this. But you should also be able to
> say to Tejun Heo why cgroups is the wrong way to do this, going
> against the kernel as a whole model. Why is networking special?
> 

So you are saying mlxsw should be using a cgroups based API for its
resources? netdevsim is for testing kernel APIs sans hardware. Is that
not what the fib controller netdevsim is doing? It is from my perspective.

I am not the one arguing to change code and functionality that has
existed for 16 months. I am arguing that the existing resource
controller satisfies all existing goals (testing in kernel APIs) and
even satisfies additional ones - like a consistent user experience
managing networking resources. ie.., I see no reason to change what exists.

^ permalink raw reply

* Re: [PATCH v3] mlx5: Use refcount_t for refcount
From: Leon Romanovsky @ 2019-08-07  3:17 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: hslester96@gmail.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	davem@davemloft.net, jgg@ziepe.ca, dledford@redhat.com
In-Reply-To: <cbea99e74a1f70b1a67357aaf2afdb55655cd2bd.camel@mellanox.com>

On Tue, Aug 06, 2019 at 08:40:11PM +0000, Saeed Mahameed wrote:
> On Tue, 2019-08-06 at 09:59 +0800, Chuhong Yuan wrote:
> > Reference counters are preferred to use refcount_t instead of
> > atomic_t.
> > This is because the implementation of refcount_t can prevent
> > overflows and detect possible use-after-free.
> > So convert atomic_t ref counters to refcount_t.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> > Changes in v3:
> >   - Merge v2 patches together.
> >
> >  drivers/infiniband/hw/mlx5/srq_cmd.c         | 6 +++---
> >  drivers/net/ethernet/mellanox/mlx5/core/qp.c | 6 +++---
> >  include/linux/mlx5/driver.h                  | 3 ++-
> >  3 files changed, 8 insertions(+), 7 deletions(-)
> >
>
> LGTM, Leon, let me know if you are happy with this version,
> this should go to mlx5-next.

Thanks,
Acked-by: Leon Romanovsky <leonro@mellanox.com>

^ permalink raw reply

* WARNING in cgroup_rstat_updated
From: syzbot @ 2019-08-07  3:18 UTC (permalink / raw)
  To: linux-kernel, linux-mm, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    31cc088a Merge tag 'drm-next-2019-07-19' of git://anongit...
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=102db48c600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=4dba67bf8b8c9ad7
dashboard link: https://syzkaller.appspot.com/bug?extid=370e4739fa489334a4ef
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16dd57dc600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+370e4739fa489334a4ef@syzkaller.appspotmail.com

8021q: adding VLAN 0 to HW filter on device batadv0
WARNING: CPU: 1 PID: 9095 at mm/page_counter.c:62 page_counter_cancel  
mm/page_counter.c:62 [inline]
WARNING: CPU: 1 PID: 9095 at mm/page_counter.c:62  
page_counter_cancel+0x5a/0x70 mm/page_counter.c:55
Kernel panic - not syncing: panic_on_warn set ...
Shutting down cpus with NMI
Kernel Offset: disabled

======================================================
WARNING: possible circular locking dependency detected
5.2.0+ #67 Not tainted
------------------------------------------------------
syz-executor.2/9306 is trying to acquire lock:
00000000e4252251 ((console_sem).lock){-.-.}, at: down_trylock+0x13/0x70  
kernel/locking/semaphore.c:135

but task is already holding lock:
000000000fdb8781 (per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)){-...}, at:  
cgroup_rstat_updated+0x115/0x2f0 kernel/cgroup/rstat.c:49

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)){-...}:
        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
        _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
        cgroup_rstat_updated+0x115/0x2f0 kernel/cgroup/rstat.c:49
        cgroup_base_stat_cputime_account_end.isra.0+0x1d/0x60  
kernel/cgroup/rstat.c:361
        __cgroup_account_cputime+0x9e/0xd0 kernel/cgroup/rstat.c:371
        cgroup_account_cputime include/linux/cgroup.h:782 [inline]
        update_curr+0x3c8/0x8d0 kernel/sched/fair.c:862
        dequeue_entity+0x1e/0x1100 kernel/sched/fair.c:4014
        dequeue_task_fair+0x65/0x870 kernel/sched/fair.c:5306
        dequeue_task+0x77/0x2e0 kernel/sched/core.c:1195
        sched_move_task+0x1fb/0x350 kernel/sched/core.c:6847
        cpu_cgroup_attach+0x6d/0xb0 kernel/sched/core.c:6970
        cgroup_migrate_execute+0xc56/0x1350 kernel/cgroup/cgroup.c:2524
        cgroup_migrate+0x14f/0x1f0 kernel/cgroup/cgroup.c:2780
        cgroup_attach_task+0x57f/0x860 kernel/cgroup/cgroup.c:2817
        cgroup_procs_write+0x340/0x400 kernel/cgroup/cgroup.c:4777
        cgroup_file_write+0x241/0x790 kernel/cgroup/cgroup.c:3754
        kernfs_fop_write+0x2b8/0x480 fs/kernfs/file.c:315
        __vfs_write+0x8a/0x110 fs/read_write.c:494
        vfs_write+0x268/0x5d0 fs/read_write.c:558
        ksys_write+0x14f/0x290 fs/read_write.c:611
        __do_sys_write fs/read_write.c:623 [inline]
        __se_sys_write fs/read_write.c:620 [inline]
        __x64_sys_write+0x73/0xb0 fs/read_write.c:620
        do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
        entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #2 (&rq->lock){-.-.}:
        __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
        _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
        rq_lock kernel/sched/sched.h:1207 [inline]
        task_fork_fair+0x6a/0x520 kernel/sched/fair.c:9940
        sched_fork+0x3af/0x900 kernel/sched/core.c:2783
        copy_process+0x1b04/0x6b00 kernel/fork.c:1987
        _do_fork+0x146/0xfa0 kernel/fork.c:2369
        kernel_thread+0xbb/0xf0 kernel/fork.c:2456
        rest_init+0x28/0x37b init/main.c:417
        arch_call_rest_init+0xe/0x1b
        start_kernel+0x912/0x951 init/main.c:785
        x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:472
        x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:453
        secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243

-> #1 (&p->pi_lock){-.-.}:
        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
        _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
        try_to_wake_up+0xb0/0x1aa0 kernel/sched/core.c:2432
        wake_up_process+0x10/0x20 kernel/sched/core.c:2548
        __up.isra.0+0x136/0x1a0 kernel/locking/semaphore.c:261
        up+0x9c/0xe0 kernel/locking/semaphore.c:186
        __up_console_sem+0xb7/0x1c0 kernel/printk/printk.c:244
        console_unlock+0x695/0xf10 kernel/printk/printk.c:2481
        vprintk_emit+0x2a0/0x700 kernel/printk/printk.c:1986
        vprintk_default+0x28/0x30 kernel/printk/printk.c:2013
        vprintk_func+0x7e/0x189 kernel/printk/printk_safe.c:386
        printk+0xba/0xed kernel/printk/printk.c:2046
        check_stack_usage kernel/exit.c:765 [inline]
        do_exit.cold+0x18b/0x314 kernel/exit.c:927
        do_group_exit+0x135/0x360 kernel/exit.c:981
        __do_sys_exit_group kernel/exit.c:992 [inline]
        __se_sys_exit_group kernel/exit.c:990 [inline]
        __x64_sys_exit_group+0x44/0x50 kernel/exit.c:990
        do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
        entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 ((console_sem).lock){-.-.}:
        check_prev_add kernel/locking/lockdep.c:2405 [inline]
        check_prevs_add kernel/locking/lockdep.c:2507 [inline]
        validate_chain kernel/locking/lockdep.c:2897 [inline]
        __lock_acquire+0x25a9/0x4c30 kernel/locking/lockdep.c:3880
        lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4413
        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
        _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
        down_trylock+0x13/0x70 kernel/locking/semaphore.c:135
        __down_trylock_console_sem+0xa8/0x210 kernel/printk/printk.c:227
        console_trylock+0x15/0xa0 kernel/printk/printk.c:2297
        console_trylock_spinning kernel/printk/printk.c:1706 [inline]
        vprintk_emit+0x283/0x700 kernel/printk/printk.c:1985
        vprintk_default+0x28/0x30 kernel/printk/printk.c:2013
        vprintk_func+0x7e/0x189 kernel/printk/printk_safe.c:386
        printk+0xba/0xed kernel/printk/printk.c:2046
        kasan_die_handler arch/x86/mm/kasan_init_64.c:254 [inline]
        kasan_die_handler.cold+0x11/0x23 arch/x86/mm/kasan_init_64.c:249
        notifier_call_chain+0xc2/0x230 kernel/notifier.c:95
        __atomic_notifier_call_chain+0xa6/0x1a0 kernel/notifier.c:185
        atomic_notifier_call_chain kernel/notifier.c:195 [inline]
        notify_die+0xfb/0x180 kernel/notifier.c:551
        do_general_protection+0x13d/0x300 arch/x86/kernel/traps.c:558
        general_protection+0x1e/0x30 arch/x86/entry/entry_64.S:1181
        cgroup_rstat_updated+0x174/0x2f0 kernel/cgroup/rstat.c:64
        cgroup_base_stat_cputime_account_end.isra.0+0x1d/0x60  
kernel/cgroup/rstat.c:361
        __cgroup_account_cputime_field+0xd3/0x130 kernel/cgroup/rstat.c:395
        cgroup_account_cputime_field include/linux/cgroup.h:797 [inline]
        task_group_account_field kernel/sched/cputime.c:109 [inline]
        account_system_index_time+0x1f7/0x390 kernel/sched/cputime.c:172
        irqtime_account_process_tick.isra.0+0x386/0x490  
kernel/sched/cputime.c:389
        account_process_tick+0x27f/0x350 kernel/sched/cputime.c:484
        update_process_times+0x25/0x80 kernel/time/timer.c:1637
        tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:167
        tick_sched_timer+0x53/0x140 kernel/time/tick-sched.c:1296
        __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
        __hrtimer_run_queues+0x364/0xe40 kernel/time/hrtimer.c:1451
        hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
        local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1068 [inline]
        smp_apic_timer_interrupt+0x160/0x610 arch/x86/kernel/apic/apic.c:1093
        apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:828

other info that might help us debug this:

Chain exists of:
   (console_sem).lock --> &rq->lock --> per_cpu_ptr(&cgroup_rstat_cpu_lock,  
cpu)

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
                                lock(&rq->lock);
                                lock(per_cpu_ptr(&cgroup_rstat_cpu_lock,  
cpu));
   lock((console_sem).lock);

  *** DEADLOCK ***

6 locks held by syz-executor.2/9306:
  #0: 0000000032d2cedf (&sb->s_type->i_mutex_key#12){+.+.}, at: inode_lock  
include/linux/fs.h:778 [inline]
  #0: 0000000032d2cedf (&sb->s_type->i_mutex_key#12){+.+.}, at:  
__sock_release+0x89/0x280 net/socket.c:589
  #1: 000000002033d24d (sk_lock-AF_INET6){+.+.}, at: lock_sock  
include/net/sock.h:1522 [inline]
  #1: 000000002033d24d (sk_lock-AF_INET6){+.+.}, at: tcp_close+0x27/0x10e0  
net/ipv4/tcp.c:2329
  #2: 0000000067f2fc6a (rcu_read_lock){....}, at: tcp_bpf_unhash+0x0/0x390  
net/ipv4/tcp_bpf.c:480
  #3: 0000000067f2fc6a (rcu_read_lock){....}, at: arch_atomic64_add  
arch/x86/include/asm/atomic64_64.h:46 [inline]
  #3: 0000000067f2fc6a (rcu_read_lock){....}, at: atomic64_add  
include/asm-generic/atomic-instrumented.h:873 [inline]
  #3: 0000000067f2fc6a (rcu_read_lock){....}, at: account_group_system_time  
include/linux/sched/cputime.h:154 [inline]
  #3: 0000000067f2fc6a (rcu_read_lock){....}, at:  
account_system_index_time+0xf7/0x390 kernel/sched/cputime.c:169
  #4: 000000000fdb8781 (per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)){-...}, at:  
cgroup_rstat_updated+0x115/0x2f0 kernel/cgroup/rstat.c:49
  #5: 0000000067f2fc6a (rcu_read_lock){....}, at:  
__atomic_notifier_call_chain+0x0/0x1a0 kernel/notifier.c:404

stack backtrace:
CPU: 0 PID: 9306 Comm: syz-executor.2 Not tainted 5.2.0+ #67
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_circular_bug.cold+0x163/0x172 kernel/locking/lockdep.c:1617
  check_noncircular+0x345/0x3e0 kernel/locking/lockdep.c:1741
  check_prev_add kernel/locking/lockdep.c:2405 [inline]
  check_prevs_add kernel/locking/lockdep.c:2507 [inline]
  validate_chain kernel/locking/lockdep.c:2897 [inline]
  __lock_acquire+0x25a9/0x4c30 kernel/locking/lockdep.c:3880
  lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4413
  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
  _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
  down_trylock+0x13/0x70 kernel/locking/semaphore.c:135
  __down_trylock_console_sem+0xa8/0x210 kernel/printk/printk.c:227
  console_trylock+0x15/0xa0 kernel/printk/printk.c:2297
  console_trylock_spinning kernel/printk/printk.c:1706 [inline]
  vprintk_emit+0x283/0x700 kernel/printk/printk.c:1985
  vprintk_default+0x28/0x30 kernel/printk/printk.c:2013
  vprintk_func+0x7e/0x189 kernel/printk/printk_safe.c:386
  printk+0xba/0xed kernel/printk/printk.c:2046
  kasan_die_handler arch/x86/mm/kasan_init_64.c:254 [inline]
  kasan_die_handler.cold+0x11/0x23 arch/x86/mm/kasan_init_64.c:249
  notifier_call_chain+0xc2/0x230 kernel/notifier.c:95
  __atomic_notifier_call_chain+0xa6/0x1a0 kernel/notifier.c:185
  atomic_notifier_call_chain kernel/notifier.c:195 [inline]
  notify_die+0xfb/0x180 kernel/notifier.c:551
  do_general_protection+0x13d/0x300 arch/x86/kernel/traps.c:558
  general_protection+0x1e/0x30 arch/x86/entry/entry_64.S:1181
RIP: 0010:cgroup_rstat_updated+0x174/0x2f0 kernel/cgroup/rstat.c:64
Code: 00 fc ff df 48 8b 45 c0 48 c1 e8 03 4c 01 f8 48 89 45 c8 eb 60 e8 6c  
e1 05 00 49 8d 7c 24 30 48 8b 55 d0 49 89 f9 49 c1 e9 03 <43> 80 3c 39 00  
0f 85 00 01 00 00 49 8b 7c 24 30 48 89 7a 38 49 8d
RSP: 0018:ffff8880ae809c08 EFLAGS: 00010006
RAX: ffff88809378a480 RBX: 0000000000000000 RCX: ffffffff8159c5ca
RDX: ffff8880ae800000 RSI: ffffffff816ca374 RDI: 47ff8883313e8861
RBP: ffff8880ae809c58 R08: 0000000000000004 R09: 08fff1106627d10c
R10: ffffed1015d0136d R11: 0000000000000003 R12: 47ff8883313e8831
R13: ffff88807b60a280 R14: ffffffff8626cbf5 R15: dffffc0000000000
  cgroup_base_stat_cputime_account_end.isra.0+0x1d/0x60  
kernel/cgroup/rstat.c:361
  __cgroup_account_cputime_field+0xd3/0x130 kernel/cgroup/rstat.c:395
  cgroup_account_cputime_field include/linux/cgroup.h:797 [inline]
  task_group_account_field kernel/sched/cputime.c:109 [inline]
  account_system_index_time+0x1f7/0x390 kernel/sched/cputime.c:172
  irqtime_account_process_tick.isra.0+0x386/0x490 kernel/sched/cputime.c:389
  account_process_tick+0x27f/0x350 kernel/sched/cputime.c:484
  update_process_times+0x25/0x80 kernel/time/timer.c:1637
  tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:167
  tick_sched_timer+0x53/0x140 kernel/time/tick-sched.c:1296
  __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
  __hrtimer_run_queues+0x364/0xe40 kernel/time/hrtimer.c:1451
  hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1068 [inline]
  smp_apic_timer_interrupt+0x160/0x610 arch/x86/kernel/apic/apic.c:1093
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:828
  </IRQ>
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: Check existence of .port_mdb_add callback before calling it
From: Chen-Yu Tsai @ 2019-08-07  3:18 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Chen-Yu Tsai, Andrew Lunn, Florian Fainelli, David S. Miller,
	netdev, linux-kernel
In-Reply-To: <20190806163402.GB16656@t480s.localdomain>

On Wed, Aug 7, 2019 at 4:34 AM Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> Hi Chen-Yu,
>
> On Wed, 7 Aug 2019 01:49:37 +0800, Chen-Yu Tsai <wens@kernel.org> wrote:
> > On Wed, Aug 7, 2019 at 1:15 AM Vivien Didelot <vivien.didelot@gmail.com> wrote:
> > >
> > > Hi Chen-Yu,
> > >
> > > On Tue,  6 Aug 2019 15:53:25 +0800, Chen-Yu Tsai <wens@kernel.org> wrote:
> > > > From: Chen-Yu Tsai <wens@csie.org>
> > > >
> > > > With the recent addition of commit 75dad2520fc3 ("net: dsa: b53: Disable
> > > > all ports on setup"), users of b53 (BCM53125 on Lamobo R1 in my case)
> > > > are forced to use the dsa subsystem to enable the switch, instead of
> > > > having it in the default transparent "forward-to-all" mode.
> > > >
> > > > The b53 driver does not support mdb bitmap functions. However the dsa
> > > > layer does not check for the existence of the .port_mdb_add callback
> > > > before actually using it. This results in a NULL pointer dereference,
> > > > as shown in the kernel oops below.
> > > >
> > > > The other functions seem to be properly guarded. Do the same for
> > > > .port_mdb_add in dsa_switch_mdb_add_bitmap() as well.
> > > >
> > > > b53 is not the only driver that doesn't support mdb bitmap functions.
> > > > Others include bcm_sf2, dsa_loop, lantiq_gswip, mt7530, mv88e6060,
> > > > qca8k, realtek-smi, and vitesse-vsc73xx.
> > >
> > > I don't know what you mean by that, there's no "mdb bitmap function"
> > > support for drivers, only the port_mdb_{prepare,add,del} callbacks...
> >
> > The term was coined from commit e6db98db8a95 ("net: dsa: add switch mdb
> > bitmap functions"). But yeah, .port_mdb_* ops/callbacks would be more
> > appropriate.
> >
> > > >     8<--- cut here ---
> > > >     Unable to handle kernel NULL pointer dereference at virtual address 00000000
> > > >     pgd = (ptrval)
> > > >     [00000000] *pgd=00000000
> > > >     Internal error: Oops: 80000005 [#1] SMP ARM
> > > >     Modules linked in: rtl8xxxu rtl8192cu rtl_usb rtl8192c_common rtlwifi mac80211 cfg80211
> > > >     CPU: 1 PID: 134 Comm: kworker/1:2 Not tainted 5.3.0-rc1-00247-gd3519030752a #1
> > > >     Hardware name: Allwinner sun7i (A20) Family
> > > >     Workqueue: events switchdev_deferred_process_work
> > > >     PC is at 0x0
> > > >     LR is at dsa_switch_event+0x570/0x620
> > > >     pc : [<00000000>]    lr : [<c08533ec>]    psr: 80070013
> > > >     sp : ee871db8  ip : 00000000  fp : ee98d0a4
> > > >     r10: 0000000c  r9 : 00000008  r8 : ee89f710
> > > >     r7 : ee98d040  r6 : ee98d088  r5 : c0f04c48  r4 : ee98d04c
> > > >     r3 : 00000000  r2 : ee89f710  r1 : 00000008  r0 : ee98d040
> > > >     Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > > >     Control: 10c5387d  Table: 6deb406a  DAC: 00000051
> > > >     Process kworker/1:2 (pid: 134, stack limit = 0x(ptrval))
> > > >     Stack: (0xee871db8 to 0xee872000)
> > > >     1da0:                                                       ee871e14 103ace2d
> > > >     1dc0: 00000000 ffffffff 00000000 ee871e14 00000005 00000000 c08524a0 00000000
> > > >     1de0: ffffe000 c014bdfc c0f04c48 ee871e98 c0f04c48 ee9e5000 c0851120 c014bef0
> > > >     1e00: 00000000 b643aea2 ee9b4068 c08509a8 ee2bf940 ee89f710 ee871ecb 00000000
> > > >     1e20: 00000008 103ace2d 00000000 c087e248 ee29c868 103ace2d 00000001 ffffffff
> > > >     1e40: 00000000 ee871e98 00000006 00000000 c0fb2a50 c087e2d0 ffffffff c08523c4
> > > >     1e60: ffffffff c014bdfc 00000006 c0fad2d0 ee871e98 ee89f710 00000000 c014c500
> > > >     1e80: 00000000 ee89f3c0 c0f04c48 00000000 ee9e5000 c087dfb4 ee9e5000 00000000
> > > >     1ea0: ee89f710 ee871ecb 00000001 103ace2d 00000000 c0f04c48 00000000 c087e0a8
> > > >     1ec0: 00000000 efd9a3e0 0089f3c0 103ace2d ee89f700 ee89f710 ee9e5000 00000122
> > > >     1ee0: 00000100 c087e130 ee89f700 c0fad2c8 c1003ef0 c087de4c 2e928000 c0fad2ec
> > > >     1f00: c0fad2ec ee839580 ef7a62c0 ef7a9400 00000000 c087def8 c0fad2ec c01447dc
> > > >     1f20: ef315640 ef7a62c0 00000008 ee839580 ee839594 ef7a62c0 00000008 c0f03d00
> > > >     1f40: ef7a62d8 ef7a62c0 ffffe000 c0145b84 ffffe000 c0fb2420 c0bfaa8c 00000000
> > > >     1f60: ffffe000 ee84b600 ee84b5c0 00000000 ee870000 ee839580 c0145b40 ef0e5ea4
> > > >     1f80: ee84b61c c014a6f8 00000001 ee84b5c0 c014a5b0 00000000 00000000 00000000
> > > >     1fa0: 00000000 00000000 00000000 c01010e8 00000000 00000000 00000000 00000000
> > > >     1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > > >     1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
> > > >     [<c08533ec>] (dsa_switch_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> > > >     [<c014bdfc>] (notifier_call_chain) from [<c014bef0>] (raw_notifier_call_chain+0x18/0x20)
> > > >     [<c014bef0>] (raw_notifier_call_chain) from [<c08509a8>] (dsa_port_mdb_add+0x48/0x74)
> > > >     [<c08509a8>] (dsa_port_mdb_add) from [<c087e248>] (__switchdev_handle_port_obj_add+0x54/0xd4)
> > > >     [<c087e248>] (__switchdev_handle_port_obj_add) from [<c087e2d0>] (switchdev_handle_port_obj_add+0x8/0x14)
> > > >     [<c087e2d0>] (switchdev_handle_port_obj_add) from [<c08523c4>] (dsa_slave_switchdev_blocking_event+0x94/0xa4)
> > > >     [<c08523c4>] (dsa_slave_switchdev_blocking_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> > > >     [<c014bdfc>] (notifier_call_chain) from [<c014c500>] (blocking_notifier_call_chain+0x50/0x68)
> > > >     [<c014c500>] (blocking_notifier_call_chain) from [<c087dfb4>] (switchdev_port_obj_notify+0x44/0xa8)
> > > >     [<c087dfb4>] (switchdev_port_obj_notify) from [<c087e0a8>] (switchdev_port_obj_add_now+0x90/0x104)
> > > >     [<c087e0a8>] (switchdev_port_obj_add_now) from [<c087e130>] (switchdev_port_obj_add_deferred+0x14/0x5c)
> > > >     [<c087e130>] (switchdev_port_obj_add_deferred) from [<c087de4c>] (switchdev_deferred_process+0x64/0x104)
> > > >     [<c087de4c>] (switchdev_deferred_process) from [<c087def8>] (switchdev_deferred_process_work+0xc/0x14)
> > > >     [<c087def8>] (switchdev_deferred_process_work) from [<c01447dc>] (process_one_work+0x218/0x50c)
> > > >     [<c01447dc>] (process_one_work) from [<c0145b84>] (worker_thread+0x44/0x5bc)
> > > >     [<c0145b84>] (worker_thread) from [<c014a6f8>] (kthread+0x148/0x150)
> > > >     [<c014a6f8>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
> > > >     Exception stack(0xee871fb0 to 0xee871ff8)
> > > >     1fa0:                                     00000000 00000000 00000000 00000000
> > > >     1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > > >     1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > > >     Code: bad PC value
> > > >     ---[ end trace 1292c61abd17b130 ]---
> > > >
> > > >     [<c08533ec>] (dsa_switch_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> > > >     corresponds to
> > > >
> > > >       $ arm-linux-gnueabihf-addr2line -C -i -e vmlinux c08533ec
> > > >
> > > >       linux/net/dsa/switch.c:156
> > > >       linux/net/dsa/switch.c:178
> > > >       linux/net/dsa/switch.c:328
> > > >
> > > > Fixes: e6db98db8a95 ("net: dsa: add switch mdb bitmap functions")
> > > > Signed-off-by: Chen-Yu Tsai <wens@csie.org>
> > > > ---
> > > > Changes since v1:
> > > >
> > > >   - Moved the check to the beginning of dsa_switch_mdb_add()
> > > >
> > > > Looks like we could also move the ops check out of
> > > > dsa_switch_mdb_prepare_bitmap(), though I suppose keeping the code the
> > > > way it is now is clearer.
> > > >
> > > > ---
> > > >  net/dsa/switch.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/net/dsa/switch.c b/net/dsa/switch.c
> > > > index 4ec5b7f85d51..231af5268656 100644
> > > > --- a/net/dsa/switch.c
> > > > +++ b/net/dsa/switch.c
> > > > @@ -164,6 +164,9 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds,
> > > >       struct switchdev_trans *trans = info->trans;
> > > >       int port;
> > > >
> > > > +     if (!ds->ops->port_mdb_add)
> > > > +             return -EOPNOTSUPP;
> > > > +
> > > >       /* Build a mask of Multicast group members */
> > > >       bitmap_zero(ds->bitmap, ds->num_ports);
> > > >       if (ds->index == info->sw_index)
> > > > --
> > > > 2.20.1
> > > >
> > >
> > > I don't understand the crash here, nor the fix. dsa_switch_mdb_add()
> > > is supposed to be called through switchdev with a prepare phase,
> > > which checks for ds->ops->port_mdb_add. Do you mean that a switchdev
> > > MDB object is added somewhere without a prepare phase? If that's
> > > the case, this is what the commit message must say. Then the
> >
> > I had pretty much zero understanding of how switchdev and dsa work.
> > The symptom is a NULL pointer reference, resulting from an unsupported
> > callback that was not checked before being called, as described above.
> > And that is what I mean. A NULL pointer reference happened when it
> > should not have.
> >
> > Based on what you just mentioned, yes it does look like an object was
> > added without a prepare phase. Randomly looking through the net/dsa
> > code, it seems only dsa_port_vid_add() does a prepare phase, judging
> > by .ph_prepare being set. dsa_port_{vlan,mbd,fdb}_add directly call
> > the add phase, without the prepare phase. So I'm guessing "supposed
> > to be called with a prepare phase" is not quite accurate. This also
> > exceeds the scope of the simple fix I had in mind.
> >
> > > ds->ops->port_mdb_add check must go where it is used, that is to say
> > > at the beginning of dsa_switch_mdb_add_bitmap() (similarly to what
> > > dsa_switch_mdb_prepare_bitmap() does), not in dsa_switch_mdb_add.
> >
> > Andrew asked me to move it to where it is now. Please take a look at
> > v1 [2] if it's what you would like.
> >
> > I'm ok either way.
>
> I still cannot find in the code where a SWITCHDEV_OBJ_ID_PORT_MDB object
> gets added without a prepare phase or a trans object, but it wouldn't hurt to
> double check the presence of ds->ops->port_mdb_add before calling it anyway,
> since a patch may actually bypass this prepare phase.

I dug a bit more and I couldn't find it either. AFAICS the only place it
gets called is through the notification chain invoked in
switchdev_port_obj_add_now(),
which specifically has prepare and commit phases.

> Your v1 patch was a bit confusing and changed the signature of the function
> at the same time. Please check the callback where it is used, like this:
>
>     diff --git a/net/dsa/switch.c b/net/dsa/switch.c
>     index 4ec5b7f85d51..09d9286b27cc 100644
>     --- a/net/dsa/switch.c
>     +++ b/net/dsa/switch.c
>     @@ -153,6 +153,9 @@ static void dsa_switch_mdb_add_bitmap(struct dsa_switch *ds,
>      {
>             int port;
>
>     +       if (!ds->ops->port_mdb_add)
>     +               return;
>     +
>             for_each_set_bit(port, bitmap, ds->num_ports)
>                     ds->ops->port_mdb_add(ds, port, mdb);
>      }
>
>
> This will be easier to maintain. Please provide a simpler commit message,

OK. How about:

The dsa framework has optional .port_mdb_{prepare,add,delete} callback fields
for drivers to handle multicast database entries. When adding an entry, the
framework goes through a prepare phase, then a commit phase. Drivers not
providing these callbacks should be detected in the prepare phase.

For some unknown reason, the guard in the prepare phase is insufficient,
and the framework ends up calling an undefined .port_mdb_add callback.
This results in a NULL pointer dereference, as shown in the log below.

Add a check for .port_mdb_add before calling it in dsa_switch_mdb_add_bitmap().

<attach log>

> this one is not relevant. Are you actually able to reproduce this stack
> trace? If not, that is not necessary to add it to the commit message...

Yes I can reproduce it. Just did on a vanilla v5.3-rc3 kernel.

[   71.827837] br0: port 1(eth0.1) entered blocking state
[   71.844072] br0: port 1(eth0.1) entered disabled state
[   71.849669] device eth0.1 entered promiscuous mode
[   71.865848] device eth0 entered promiscuous mode
[   71.876535] br0: port 2(wan) entered blocking state
[   71.881475] br0: port 2(wan) entered disabled state
[   71.888631] device wan entered promiscuous mode
[   71.914341] bcm53xx stmmac-0:1e wan: configuring for phy/gmii link mode
[   71.921505] 8021q: adding VLAN 0 to HW filter on device wan
[   71.930146] bcm53xx stmmac-0:1e wan: Link is Up - 1Gbps/Full - flow
control rx/tx
[   71.932360] br0: port 3(lan1) entered blocking state
[   71.942751] br0: port 3(lan1) entered disabled state
[   71.950128] device lan1 entered promiscuous mode
[   71.970024] bcm53xx stmmac-0:1e lan1: configuring for phy/gmii link mode
[   71.977157] 8021q: adding VLAN 0 to HW filter on device lan1
[   71.988159] br0: port 4(lan2) entered blocking state
[   71.993158] br0: port 4(lan2) entered disabled state
[   72.001235] device lan2 entered promiscuous mode
[   72.020994] bcm53xx stmmac-0:1e lan2: configuring for phy/gmii link mode
[   72.028081] 8021q: adding VLAN 0 to HW filter on device lan2
[   72.035797] bcm53xx stmmac-0:1e lan2: Link is Up - 1Gbps/Full -
flow control rx/tx
[   72.038767] br0: port 5(lan3) entered blocking state
[   72.048450] br0: port 5(lan3) entered disabled state
[   72.057129] device lan3 entered promiscuous mode
[   72.076956] bcm53xx stmmac-0:1e lan3: configuring for phy/gmii link mode
[   72.084004] 8021q: adding VLAN 0 to HW filter on device lan3
[   72.091352] bcm53xx stmmac-0:1e lan3: Link is Up - 1Gbps/Full -
flow control rx/tx
[   72.095168] br0: port 6(lan4) entered blocking state
[   72.103926] br0: port 6(lan4) entered disabled state
[   72.113313] device lan4 entered promiscuous mode
[   72.133293] bcm53xx stmmac-0:1e lan4: configuring for phy/gmii link mode
[   72.140410] 8021q: adding VLAN 0 to HW filter on device lan4
[   72.147682] bcm53xx stmmac-0:1e lan4: Link is Up - 100Mbps/Full -
flow control rx/tx
[   72.155915] br0: port 6(lan4) entered blocking state
[   72.160905] br0: port 6(lan4) entered forwarding state
[   72.166135] br0: port 5(lan3) entered blocking state
[   72.171111] br0: port 5(lan3) entered forwarding state
[   72.176300] br0: port 4(lan2) entered blocking state
[   72.181277] br0: port 4(lan2) entered forwarding state
[   72.186472] br0: port 2(wan) entered blocking state
[   72.191361] br0: port 2(wan) entered forwarding state
[   72.196453] br0: port 1(eth0.1) entered blocking state
[   72.201601] br0: port 1(eth0.1) entered forwarding state

Waiting for br0 to get ready (MAXWAIT is 32 seconds).
[   72.244956] 8<--- cut here ---
[   72.248072] Unable to handle kernel NULL pointer dereference at
virtual address 00000000
[   72.256331] pgd = (ptrval)
[   72.259048] [00000000] *pgd=00000000
[   72.262642] Internal error: Oops: 80000005 [#1] SMP ARM
[   72.267869] Modules linked in:
[   72.270934] CPU: 1 PID: 134 Comm: kworker/1:2 Not tainted 5.3.0-rc3 #1
[   72.277458] Hardware name: Allwinner sun7i (A20) Family
[   72.282699] Workqueue: events switchdev_deferred_process_work
[   72.288447] PC is at 0x0
[   72.290988] LR is at dsa_switch_event+0x570/0x620
[   72.295691] pc : [<00000000>]    lr : [<c0853890>]    psr: 80070113
[   72.301955] sp : ef3cfdb8  ip : 00000000  fp : ef0ad0a4
[   72.307180] r10: 0000000c  r9 : 00000008  r8 : eeb3b450
[   72.312405] r7 : ef0ad040  r6 : ef0ad088  r5 : c0f04c48  r4 : ef0ad04c
[   72.318931] r3 : 00000000  r2 : eeb3b450  r1 : 00000008  r0 : ef0ad040
[   72.325459] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   72.332593] Control: 10c5387d  Table: 6de6406a  DAC: 00000051
[   72.338340] Process kworker/1:2 (pid: 134, stack limit = 0x(ptrval))
[   72.344692] Stack: (0xef3cfdb8 to 0xef3d0000)
[   72.349085] fda0:
    ef3cfe14 46893f30
[   72.357329] fdc0: 00000000 ffffffff 00000000 ef3cfe14 00000005
00000000 c0852944 00000000
[   72.365550] fde0: ffffe000 c014be24 c0f04c48 ef3cfe98 c0f04c48
ee9ea800 c08515d8 c014bf18
[   72.373739] fe00: 00000000 63a7c876 ee9b0068 c0850e60 ef26e800
eeb3b450 ef3cfecb 00000000
[   72.381921] fe20: 00000008 46893f30 00000000 c087e718 eebe2068
46893f30 c0e72400 ffffffff
[   72.390102] fe40: 00000000 ef3cfe98 00000006 00000000 c0fb2a50
c087e7a0 ffffffff c0852868
[   72.398283] fe60: ffffffff c014be24 00000006 c0fad2d0 ef3cfe98
eeb3b450 00000000 c014c528
[   72.406463] fe80: 00000000 c015dcf8 c0f04c48 00000000 ee9ea800
c087e484 ee9ea800 00000000
[   72.414644] fea0: eeb3b450 ef3cfecb 00000001 46893f30 00000000
c0f04c48 00000000 c087e578
[   72.422825] fec0: 00000000 ef26e780 00024400 46893f30 eeb3b440
eeb3b450 ee9ea800 00000122
[   72.430998] fee0: 00000100 c087e600 eeb3b440 c0fad2c8 c1003ef0
c087e31c 2e928000 c0fad2ec
[   72.439179] ff00: c0fad2ec ee858080 ef7a62c0 ef7a9400 00000000
c087e3c8 c0fad2ec c0144804
[   72.447360] ff20: ef312280 ef7a62c0 00000008 ee858080 ee858094
ef7a62c0 00000008 c0f03d00
[   72.455541] ff40: ef7a62d8 ef7a62c0 ffffe000 c0145bac ffffe000
c0fb2420 c0bfa1ec 00000000
[   72.463721] ff60: ffffe000 ee855c40 ee855c00 00000000 ef3ce000
ee858080 c0145b68 ef0e5ea4
[   72.471902] ff80: ee855c5c c014a720 0000000b ee855c00 c014a5d8
00000000 00000000 00000000
[   72.480082] ffa0: 00000000 00000000 00000000 c01010e8 00000000
00000000 00000000 00000000
[   72.488262] ffc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   72.496442] ffe0: 00000000 00000000 00000000 00000000 00000013
00000000 00000000 00000000
[   72.504652] [<c0853890>] (dsa_switch_event) from [<c014be24>]
(notifier_call_chain+0x48/0x84)
[   72.513184] [<c014be24>] (notifier_call_chain) from [<c014bf18>]
(raw_notifier_call_chain+0x18/0x20)
[   72.522321] [<c014bf18>] (raw_notifier_call_chain) from
[<c0850e60>] (dsa_port_mdb_add+0x48/0x74)
[   72.531200] [<c0850e60>] (dsa_port_mdb_add) from [<c087e718>]
(__switchdev_handle_port_obj_add+0x54/0xd4)
[   72.540773] [<c087e718>] (__switchdev_handle_port_obj_add) from
[<c087e7a0>] (switchdev_handle_port_obj_add+0x8/0x14)
[   72.551385] [<c087e7a0>] (switchdev_handle_port_obj_add) from
[<c0852868>] (dsa_slave_switchdev_blocking_event+0x94/0xa4)
[   72.562343] [<c0852868>] (dsa_slave_switchdev_blocking_event) from
[<c014be24>] (notifier_call_chain+0x48/0x84)
[   72.572434] [<c014be24>] (notifier_call_chain) from [<c014c528>]
(blocking_notifier_call_chain+0x50/0x68)
[   72.582004] [<c014c528>] (blocking_notifier_call_chain) from
[<c087e484>] (switchdev_port_obj_notify+0x44/0xa8)
[   72.592094] [<c087e484>] (switchdev_port_obj_notify) from
[<c087e578>] (switchdev_port_obj_add_now+0x90/0x104)
[   72.602098] [<c087e578>] (switchdev_port_obj_add_now) from
[<c087e600>] (switchdev_port_obj_add_deferred+0x14/0x5c)
[   72.612537] [<c087e600>] (switchdev_port_obj_add_deferred) from
[<c087e31c>] (switchdev_deferred_process+0x64/0x104)
[   72.623060] [<c087e31c>] (switchdev_deferred_process) from
[<c087e3c8>] (switchdev_deferred_process_work+0xc/0x14)
[   72.633412] [<c087e3c8>] (switchdev_deferred_process_work) from
[<c0144804>] (process_one_work+0x218/0x50c)
[   72.643156] [<c0144804>] (process_one_work) from [<c0145bac>]
(worker_thread+0x44/0x5bc)
[   72.651253] [<c0145bac>] (worker_thread) from [<c014a720>]
(kthread+0x148/0x150)
[   72.658657] [<c014a720>] (kthread) from [<c01010e8>]
(ret_from_fork+0x14/0x2c)
[   72.665879] Exception stack(0xef3cffb0 to 0xef3cfff8)
[   72.670932] ffa0:                                     00000000
00000000 00000000 00000000
[   72.679112] ffc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[   72.687291] ffe0: 00000000 00000000 00000000 00000000 00000013 00000000
[   72.693911] Code: bad PC value
[   72.697022] ---[ end trace c7626868564873c8 ]---


ChenYu

^ permalink raw reply

* Re: Slowness forming TIPC cluster with explicit node addresses
From: Chris Packham @ 2019-08-07  3:45 UTC (permalink / raw)
  To: jon.maloy@ericsson.com, tipc-discussion@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CH2PR15MB35759E27F2A01FAE59AB66809AD40@CH2PR15MB3575.namprd15.prod.outlook.com>

Hi Jon,

On Wed, 2019-08-07 at 02:55 +0000, Jon Maloy wrote:
> 
> > 
> > -----Original Message-----
> > From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
> > Sent: 4-Aug-19 19:05
> > To: Jon Maloy <jon.maloy@ericsson.com>; tipc-
> > discussion@lists.sourceforge.net
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Re: Slowness forming TIPC cluster with explicit node
> > addresses
> > 
> > On Sun, 2019-08-04 at 21:53 +0000, Jon Maloy wrote:
> > > 
> > > 
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.or
> > > > g>
> > On
> > > 
> > > > 
> > > > Behalf Of Chris Packham
> > > > Sent: 2-Aug-19 01:11
> > > > To: Jon Maloy <jon.maloy@ericsson.com>; tipc-
> > > > discussion@lists.sourceforge.net
> > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > > > Subject: Re: Slowness forming TIPC cluster with explicit node
> > > > addresses
> > > > 
> > > > On Mon, 2019-07-29 at 09:04 +1200, Chris Packham wrote:
> > > > > 
> > > > > 
> > > > > On Fri, 2019-07-26 at 13:31 +0000, Jon Maloy wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: netdev-owner@vger.kernel.org <netdev-
> > > > owner@vger.kernel.org>
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On Behalf Of Chris Packham
> > > > > > > Sent: 25-Jul-19 19:37
> > > > > > > To: tipc-discussion@lists.sourceforge.net
> > > > > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > > > > > > Subject: Slowness forming TIPC cluster with explicit node
> > > > > > > addresses
> > > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I'm having problems forming a TIPC cluster between 2
> > > > > > > nodes.
> > > > > > > 
> > > > > > > This is the basic steps I'm going through on each node.
> > > > > > > 
> > > > > > > modprobe tipc
> > > > > > > ip link set eth2 up
> > > > > > > tipc node set addr 1.1.5 # or 1.1.6 tipc bearer enable
> > > > > > > media
> > > > > > > eth dev eth0
> > > > > > eth2, I assume...
> > > > > > 
> > > > > Yes sorry I keep switching between between Ethernet ports for
> > > > > testing
> > > > > so I hand edited the email.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > Then to confirm if the cluster is formed I use tipc link
> > > > > > > list
> > > > > > > 
> > > > > > > [root@node-5 ~]# tipc link list
> > > > > > > broadcast-link: up
> > > > > > > ...
> > > > > > > 
> > > > > > > Looking at tcpdump the two nodes are sending packets
> > > > > > > 
> > > > > > > 22:30:05.782320 TIPC v2.0 1.1.5 > 0.0.0, headerlength 60
> > > > > > > bytes,
> > > > > > > MessageSize
> > > > > > > 76 bytes, Neighbor Detection Protocol internal,
> > > > > > > messageType
> > > > > > > Link
> > > > > > > request
> > > > > > > 22:30:05.863555 TIPC v2.0 1.1.6 > 0.0.0, headerlength 60
> > > > > > > bytes,
> > > > > > > MessageSize
> > > > > > > 76 bytes, Neighbor Detection Protocol internal,
> > > > > > > messageType
> > > > > > > Link
> > > > > > > request
> > > > > > > 
> > > > > > > Eventually (after a few minutes) the link does come up
> > > > > > > 
> > > > > > > [root@node-6 ~]# tipc link list
> > > > > > > broadcast-link: up
> > > > > > > 1001006:eth2-1001005:eth2: up
> > > > > > > 
> > > > > > > [root@node-5 ~]# tipc link list
> > > > > > > broadcast-link: up
> > > > > > > 1001005:eth2-1001006:eth2: up
> > > > > > > 
> > > > > > > When I remove the "tipc node set addr" things seem to
> > > > > > > kick
> > > > > > > into
> > > > > > > life straight away
> > > > > > > 
> > > > > > > [root@node-5 ~]# tipc link list
> > > > > > > broadcast-link: up
> > > > > > > 0050b61bd2aa:eth2-0050b61e6dfa:eth2: up
> > > > > > > 
> > > > > > > So there appears to be some difference in behaviour
> > > > > > > between
> > > > > > > having
> > > > > > > an explicit node address and using the default.
> > > > > > > Unfortunately
> > > > > > > our
> > > > > > > application relies on setting the node addresses.
> > > > > > I do this many times a day, without any problems. If there
> > > > > > would be
> > > > > > any time difference, I would expect the 'auto configurable'
> > > > > > version
> > > > > > to be slower, because it involves a DAD step.
> > > > > > Are you sure you don't have any other nodes running in your
> > > > > > system?
> > > > > > 
> > > > > > ///jon
> > > > > > 
> > > > > Nope the two nodes are connected back to back. Does the
> > > > > number of
> > > > > Ethernet interfaces make a difference? As you can see I've
> > > > > got 3
> > > > > on
> > > > > each node. One is completely disconnected, one is for booting
> > > > > over
> > > > > TFTP
> > > > >  (only used by U-boot) and the other is the USB Ethernet I'm
> > > > > using for
> > > > > testing.
> > > > > 
> > > > So I can still reproduce this on nodes that only have one
> > > > network
> > > > interface and
> > > > are the only things connected.
> > > > 
> > > > I did find one thing that helps
> > > > 
> > > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index
> > > > c138d68e8a69..49921dad404a 100644
> > > > --- a/net/tipc/discover.c
> > > > +++ b/net/tipc/discover.c
> > > > @@ -358,10 +358,10 @@ int tipc_disc_create(struct net *net,
> > > > struct
> > > > tipc_bearer *b,
> > > >         tipc_disc_init_msg(net, d->skb, DSC_REQ_MSG, b);
> > > > 
> > > >         /* Do we need an address trial period first ? */
> > > > -       if (!tipc_own_addr(net)) {
> > > > +//     if (!tipc_own_addr(net)) {
> > > >                 tn->addr_trial_end = jiffies +
> > > > msecs_to_jiffies(1000);
> > > >                 msg_set_type(buf_msg(d->skb), DSC_TRIAL_MSG);
> > > > -       }
> > > > +//     }
> > > >         memcpy(&d->dest, dest, sizeof(*dest));
> > > >         d->net = net;
> > > >         d->bearer_id = b->identity;
> > > > 
> > > > I think because with pre-configured addresses the duplicate
> > > > address
> > > > detection
> > > > is skipped the shorter init phase is skipped. Would is make
> > > > sense
> > > > to
> > > > unconditionally do the trial step? Or is there some better way
> > > > to
> > > > get things to
> > > > transition with pre-assigned addresses.
> > > I am on vacation until the end of next-week, so I can't give you
> > > any
> > > good analysis right now.
> > Thanks for taking the time to respond.
> > 
> > > 
> > > To do the trial step doesn’t make much sense to me, -it would
> > > only
> > > delay the setup unnecessarily (but with only 1 second).
> > > Can you check the initial value of addr_trial_end when there a
> > > pre-
> > > configured address?
> > I had the same thought. For both my devices 'addr_trial_end = 0' so
> > I
> > think tipc_disc_addr_trial_msg should end up with trial == false
> I suggest you try initializing it to jiffies and see what happens.
> 

Setting addr_trial_end to jiffies seems to do the trick. I'll prepare a
patch and send it through.

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: Check existence of .port_mdb_add callback before calling it
From: Vivien Didelot @ 2019-08-07  4:33 UTC (permalink / raw)
  To: Chen-Yu Tsai
  Cc: Chen-Yu Tsai, Andrew Lunn, Florian Fainelli, David S. Miller,
	netdev, linux-kernel
In-Reply-To: <CAGb2v666sNV41rejf=wTJB2gYeDGAHNHn1NyxyRh3E6y9C=11A@mail.gmail.com>

Hi Chen-Yu,

On Wed, 7 Aug 2019 11:18:28 +0800, Chen-Yu Tsai <wens@kernel.org> wrote:
> On Wed, Aug 7, 2019 at 4:34 AM Vivien Didelot <vivien.didelot@gmail.com> wrote:
> >
> > Hi Chen-Yu,
> >
> > On Wed, 7 Aug 2019 01:49:37 +0800, Chen-Yu Tsai <wens@kernel.org> wrote:
> > > On Wed, Aug 7, 2019 at 1:15 AM Vivien Didelot <vivien.didelot@gmail.com> wrote:
> > > >
> > > > Hi Chen-Yu,
> > > >
> > > > On Tue,  6 Aug 2019 15:53:25 +0800, Chen-Yu Tsai <wens@kernel.org> wrote:
> > > > > From: Chen-Yu Tsai <wens@csie.org>
> > > > >
> > > > > With the recent addition of commit 75dad2520fc3 ("net: dsa: b53: Disable
> > > > > all ports on setup"), users of b53 (BCM53125 on Lamobo R1 in my case)
> > > > > are forced to use the dsa subsystem to enable the switch, instead of
> > > > > having it in the default transparent "forward-to-all" mode.
> > > > >
> > > > > The b53 driver does not support mdb bitmap functions. However the dsa
> > > > > layer does not check for the existence of the .port_mdb_add callback
> > > > > before actually using it. This results in a NULL pointer dereference,
> > > > > as shown in the kernel oops below.
> > > > >
> > > > > The other functions seem to be properly guarded. Do the same for
> > > > > .port_mdb_add in dsa_switch_mdb_add_bitmap() as well.
> > > > >
> > > > > b53 is not the only driver that doesn't support mdb bitmap functions.
> > > > > Others include bcm_sf2, dsa_loop, lantiq_gswip, mt7530, mv88e6060,
> > > > > qca8k, realtek-smi, and vitesse-vsc73xx.
> > > >
> > > > I don't know what you mean by that, there's no "mdb bitmap function"
> > > > support for drivers, only the port_mdb_{prepare,add,del} callbacks...
> > >
> > > The term was coined from commit e6db98db8a95 ("net: dsa: add switch mdb
> > > bitmap functions"). But yeah, .port_mdb_* ops/callbacks would be more
> > > appropriate.
> > >
> > > > >     8<--- cut here ---
> > > > >     Unable to handle kernel NULL pointer dereference at virtual address 00000000
> > > > >     pgd = (ptrval)
> > > > >     [00000000] *pgd=00000000
> > > > >     Internal error: Oops: 80000005 [#1] SMP ARM
> > > > >     Modules linked in: rtl8xxxu rtl8192cu rtl_usb rtl8192c_common rtlwifi mac80211 cfg80211
> > > > >     CPU: 1 PID: 134 Comm: kworker/1:2 Not tainted 5.3.0-rc1-00247-gd3519030752a #1
> > > > >     Hardware name: Allwinner sun7i (A20) Family
> > > > >     Workqueue: events switchdev_deferred_process_work
> > > > >     PC is at 0x0
> > > > >     LR is at dsa_switch_event+0x570/0x620
> > > > >     pc : [<00000000>]    lr : [<c08533ec>]    psr: 80070013
> > > > >     sp : ee871db8  ip : 00000000  fp : ee98d0a4
> > > > >     r10: 0000000c  r9 : 00000008  r8 : ee89f710
> > > > >     r7 : ee98d040  r6 : ee98d088  r5 : c0f04c48  r4 : ee98d04c
> > > > >     r3 : 00000000  r2 : ee89f710  r1 : 00000008  r0 : ee98d040
> > > > >     Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > > > >     Control: 10c5387d  Table: 6deb406a  DAC: 00000051
> > > > >     Process kworker/1:2 (pid: 134, stack limit = 0x(ptrval))
> > > > >     Stack: (0xee871db8 to 0xee872000)
> > > > >     1da0:                                                       ee871e14 103ace2d
> > > > >     1dc0: 00000000 ffffffff 00000000 ee871e14 00000005 00000000 c08524a0 00000000
> > > > >     1de0: ffffe000 c014bdfc c0f04c48 ee871e98 c0f04c48 ee9e5000 c0851120 c014bef0
> > > > >     1e00: 00000000 b643aea2 ee9b4068 c08509a8 ee2bf940 ee89f710 ee871ecb 00000000
> > > > >     1e20: 00000008 103ace2d 00000000 c087e248 ee29c868 103ace2d 00000001 ffffffff
> > > > >     1e40: 00000000 ee871e98 00000006 00000000 c0fb2a50 c087e2d0 ffffffff c08523c4
> > > > >     1e60: ffffffff c014bdfc 00000006 c0fad2d0 ee871e98 ee89f710 00000000 c014c500
> > > > >     1e80: 00000000 ee89f3c0 c0f04c48 00000000 ee9e5000 c087dfb4 ee9e5000 00000000
> > > > >     1ea0: ee89f710 ee871ecb 00000001 103ace2d 00000000 c0f04c48 00000000 c087e0a8
> > > > >     1ec0: 00000000 efd9a3e0 0089f3c0 103ace2d ee89f700 ee89f710 ee9e5000 00000122
> > > > >     1ee0: 00000100 c087e130 ee89f700 c0fad2c8 c1003ef0 c087de4c 2e928000 c0fad2ec
> > > > >     1f00: c0fad2ec ee839580 ef7a62c0 ef7a9400 00000000 c087def8 c0fad2ec c01447dc
> > > > >     1f20: ef315640 ef7a62c0 00000008 ee839580 ee839594 ef7a62c0 00000008 c0f03d00
> > > > >     1f40: ef7a62d8 ef7a62c0 ffffe000 c0145b84 ffffe000 c0fb2420 c0bfaa8c 00000000
> > > > >     1f60: ffffe000 ee84b600 ee84b5c0 00000000 ee870000 ee839580 c0145b40 ef0e5ea4
> > > > >     1f80: ee84b61c c014a6f8 00000001 ee84b5c0 c014a5b0 00000000 00000000 00000000
> > > > >     1fa0: 00000000 00000000 00000000 c01010e8 00000000 00000000 00000000 00000000
> > > > >     1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > > > >     1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
> > > > >     [<c08533ec>] (dsa_switch_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> > > > >     [<c014bdfc>] (notifier_call_chain) from [<c014bef0>] (raw_notifier_call_chain+0x18/0x20)
> > > > >     [<c014bef0>] (raw_notifier_call_chain) from [<c08509a8>] (dsa_port_mdb_add+0x48/0x74)
> > > > >     [<c08509a8>] (dsa_port_mdb_add) from [<c087e248>] (__switchdev_handle_port_obj_add+0x54/0xd4)
> > > > >     [<c087e248>] (__switchdev_handle_port_obj_add) from [<c087e2d0>] (switchdev_handle_port_obj_add+0x8/0x14)
> > > > >     [<c087e2d0>] (switchdev_handle_port_obj_add) from [<c08523c4>] (dsa_slave_switchdev_blocking_event+0x94/0xa4)
> > > > >     [<c08523c4>] (dsa_slave_switchdev_blocking_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> > > > >     [<c014bdfc>] (notifier_call_chain) from [<c014c500>] (blocking_notifier_call_chain+0x50/0x68)
> > > > >     [<c014c500>] (blocking_notifier_call_chain) from [<c087dfb4>] (switchdev_port_obj_notify+0x44/0xa8)
> > > > >     [<c087dfb4>] (switchdev_port_obj_notify) from [<c087e0a8>] (switchdev_port_obj_add_now+0x90/0x104)
> > > > >     [<c087e0a8>] (switchdev_port_obj_add_now) from [<c087e130>] (switchdev_port_obj_add_deferred+0x14/0x5c)
> > > > >     [<c087e130>] (switchdev_port_obj_add_deferred) from [<c087de4c>] (switchdev_deferred_process+0x64/0x104)
> > > > >     [<c087de4c>] (switchdev_deferred_process) from [<c087def8>] (switchdev_deferred_process_work+0xc/0x14)
> > > > >     [<c087def8>] (switchdev_deferred_process_work) from [<c01447dc>] (process_one_work+0x218/0x50c)
> > > > >     [<c01447dc>] (process_one_work) from [<c0145b84>] (worker_thread+0x44/0x5bc)
> > > > >     [<c0145b84>] (worker_thread) from [<c014a6f8>] (kthread+0x148/0x150)
> > > > >     [<c014a6f8>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
> > > > >     Exception stack(0xee871fb0 to 0xee871ff8)
> > > > >     1fa0:                                     00000000 00000000 00000000 00000000
> > > > >     1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > > > >     1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > > > >     Code: bad PC value
> > > > >     ---[ end trace 1292c61abd17b130 ]---
> > > > >
> > > > >     [<c08533ec>] (dsa_switch_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> > > > >     corresponds to
> > > > >
> > > > >       $ arm-linux-gnueabihf-addr2line -C -i -e vmlinux c08533ec
> > > > >
> > > > >       linux/net/dsa/switch.c:156
> > > > >       linux/net/dsa/switch.c:178
> > > > >       linux/net/dsa/switch.c:328
> > > > >
> > > > > Fixes: e6db98db8a95 ("net: dsa: add switch mdb bitmap functions")
> > > > > Signed-off-by: Chen-Yu Tsai <wens@csie.org>
> > > > > ---
> > > > > Changes since v1:
> > > > >
> > > > >   - Moved the check to the beginning of dsa_switch_mdb_add()
> > > > >
> > > > > Looks like we could also move the ops check out of
> > > > > dsa_switch_mdb_prepare_bitmap(), though I suppose keeping the code the
> > > > > way it is now is clearer.
> > > > >
> > > > > ---
> > > > >  net/dsa/switch.c | 3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > >
> > > > > diff --git a/net/dsa/switch.c b/net/dsa/switch.c
> > > > > index 4ec5b7f85d51..231af5268656 100644
> > > > > --- a/net/dsa/switch.c
> > > > > +++ b/net/dsa/switch.c
> > > > > @@ -164,6 +164,9 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds,
> > > > >       struct switchdev_trans *trans = info->trans;
> > > > >       int port;
> > > > >
> > > > > +     if (!ds->ops->port_mdb_add)
> > > > > +             return -EOPNOTSUPP;
> > > > > +
> > > > >       /* Build a mask of Multicast group members */
> > > > >       bitmap_zero(ds->bitmap, ds->num_ports);
> > > > >       if (ds->index == info->sw_index)
> > > > > --
> > > > > 2.20.1
> > > > >
> > > >
> > > > I don't understand the crash here, nor the fix. dsa_switch_mdb_add()
> > > > is supposed to be called through switchdev with a prepare phase,
> > > > which checks for ds->ops->port_mdb_add. Do you mean that a switchdev
> > > > MDB object is added somewhere without a prepare phase? If that's
> > > > the case, this is what the commit message must say. Then the
> > >
> > > I had pretty much zero understanding of how switchdev and dsa work.
> > > The symptom is a NULL pointer reference, resulting from an unsupported
> > > callback that was not checked before being called, as described above.
> > > And that is what I mean. A NULL pointer reference happened when it
> > > should not have.
> > >
> > > Based on what you just mentioned, yes it does look like an object was
> > > added without a prepare phase. Randomly looking through the net/dsa
> > > code, it seems only dsa_port_vid_add() does a prepare phase, judging
> > > by .ph_prepare being set. dsa_port_{vlan,mbd,fdb}_add directly call
> > > the add phase, without the prepare phase. So I'm guessing "supposed
> > > to be called with a prepare phase" is not quite accurate. This also
> > > exceeds the scope of the simple fix I had in mind.
> > >
> > > > ds->ops->port_mdb_add check must go where it is used, that is to say
> > > > at the beginning of dsa_switch_mdb_add_bitmap() (similarly to what
> > > > dsa_switch_mdb_prepare_bitmap() does), not in dsa_switch_mdb_add.
> > >
> > > Andrew asked me to move it to where it is now. Please take a look at
> > > v1 [2] if it's what you would like.
> > >
> > > I'm ok either way.
> >
> > I still cannot find in the code where a SWITCHDEV_OBJ_ID_PORT_MDB object
> > gets added without a prepare phase or a trans object, but it wouldn't hurt to
> > double check the presence of ds->ops->port_mdb_add before calling it anyway,
> > since a patch may actually bypass this prepare phase.
> 
> I dug a bit more and I couldn't find it either. AFAICS the only place it
> gets called is through the notification chain invoked in
> switchdev_port_obj_add_now(),
> which specifically has prepare and commit phases.
> 
> > Your v1 patch was a bit confusing and changed the signature of the function
> > at the same time. Please check the callback where it is used, like this:
> >
> >     diff --git a/net/dsa/switch.c b/net/dsa/switch.c
> >     index 4ec5b7f85d51..09d9286b27cc 100644
> >     --- a/net/dsa/switch.c
> >     +++ b/net/dsa/switch.c
> >     @@ -153,6 +153,9 @@ static void dsa_switch_mdb_add_bitmap(struct dsa_switch *ds,
> >      {
> >             int port;
> >
> >     +       if (!ds->ops->port_mdb_add)
> >     +               return;
> >     +
> >             for_each_set_bit(port, bitmap, ds->num_ports)
> >                     ds->ops->port_mdb_add(ds, port, mdb);
> >      }
> >
> >
> > This will be easier to maintain. Please provide a simpler commit message,
> 
> OK. How about:
> 
> The dsa framework has optional .port_mdb_{prepare,add,delete} callback fields

s/delete/del/

> for drivers to handle multicast database entries. When adding an entry, the
> framework goes through a prepare phase, then a commit phase. Drivers not
> providing these callbacks should be detected in the prepare phase.
> 
> For some unknown reason, the guard in the prepare phase is insufficient,

I'd change the line above with something like:

    DSA core may still bypass the bridge layer and call the dsa_port_mdb_add
    function directly with no prepare phase or no switchdev trans object,

> and the framework ends up calling an undefined .port_mdb_add callback.
> This results in a NULL pointer dereference, as shown in the log below.

But yes you get the idea right ;-)

> Add a check for .port_mdb_add before calling it in dsa_switch_mdb_add_bitmap().
> 
> <attach log>
> 
> > this one is not relevant. Are you actually able to reproduce this stack
> > trace? If not, that is not necessary to add it to the commit message...
> 
> Yes I can reproduce it. Just did on a vanilla v5.3-rc3 kernel.
> 
> [   71.827837] br0: port 1(eth0.1) entered blocking state
> [   71.844072] br0: port 1(eth0.1) entered disabled state
> [   71.849669] device eth0.1 entered promiscuous mode
> [   71.865848] device eth0 entered promiscuous mode
> [   71.876535] br0: port 2(wan) entered blocking state
> [   71.881475] br0: port 2(wan) entered disabled state
> [   71.888631] device wan entered promiscuous mode
> [   71.914341] bcm53xx stmmac-0:1e wan: configuring for phy/gmii link mode
> [   71.921505] 8021q: adding VLAN 0 to HW filter on device wan
> [   71.930146] bcm53xx stmmac-0:1e wan: Link is Up - 1Gbps/Full - flow
> control rx/tx
> [   71.932360] br0: port 3(lan1) entered blocking state
> [   71.942751] br0: port 3(lan1) entered disabled state
> [   71.950128] device lan1 entered promiscuous mode
> [   71.970024] bcm53xx stmmac-0:1e lan1: configuring for phy/gmii link mode
> [   71.977157] 8021q: adding VLAN 0 to HW filter on device lan1
> [   71.988159] br0: port 4(lan2) entered blocking state
> [   71.993158] br0: port 4(lan2) entered disabled state
> [   72.001235] device lan2 entered promiscuous mode
> [   72.020994] bcm53xx stmmac-0:1e lan2: configuring for phy/gmii link mode
> [   72.028081] 8021q: adding VLAN 0 to HW filter on device lan2
> [   72.035797] bcm53xx stmmac-0:1e lan2: Link is Up - 1Gbps/Full -
> flow control rx/tx
> [   72.038767] br0: port 5(lan3) entered blocking state
> [   72.048450] br0: port 5(lan3) entered disabled state
> [   72.057129] device lan3 entered promiscuous mode
> [   72.076956] bcm53xx stmmac-0:1e lan3: configuring for phy/gmii link mode
> [   72.084004] 8021q: adding VLAN 0 to HW filter on device lan3
> [   72.091352] bcm53xx stmmac-0:1e lan3: Link is Up - 1Gbps/Full -
> flow control rx/tx
> [   72.095168] br0: port 6(lan4) entered blocking state
> [   72.103926] br0: port 6(lan4) entered disabled state
> [   72.113313] device lan4 entered promiscuous mode
> [   72.133293] bcm53xx stmmac-0:1e lan4: configuring for phy/gmii link mode
> [   72.140410] 8021q: adding VLAN 0 to HW filter on device lan4
> [   72.147682] bcm53xx stmmac-0:1e lan4: Link is Up - 100Mbps/Full -
> flow control rx/tx
> [   72.155915] br0: port 6(lan4) entered blocking state
> [   72.160905] br0: port 6(lan4) entered forwarding state
> [   72.166135] br0: port 5(lan3) entered blocking state
> [   72.171111] br0: port 5(lan3) entered forwarding state
> [   72.176300] br0: port 4(lan2) entered blocking state
> [   72.181277] br0: port 4(lan2) entered forwarding state
> [   72.186472] br0: port 2(wan) entered blocking state
> [   72.191361] br0: port 2(wan) entered forwarding state
> [   72.196453] br0: port 1(eth0.1) entered blocking state
> [   72.201601] br0: port 1(eth0.1) entered forwarding state
> 
> Waiting for br0 to get ready (MAXWAIT is 32 seconds).
> [   72.244956] 8<--- cut here ---
> [   72.248072] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000
> [   72.256331] pgd = (ptrval)
> [   72.259048] [00000000] *pgd=00000000
> [   72.262642] Internal error: Oops: 80000005 [#1] SMP ARM
> [   72.267869] Modules linked in:
> [   72.270934] CPU: 1 PID: 134 Comm: kworker/1:2 Not tainted 5.3.0-rc3 #1
> [   72.277458] Hardware name: Allwinner sun7i (A20) Family
> [   72.282699] Workqueue: events switchdev_deferred_process_work
> [   72.288447] PC is at 0x0
> [   72.290988] LR is at dsa_switch_event+0x570/0x620
> [   72.295691] pc : [<00000000>]    lr : [<c0853890>]    psr: 80070113
> [   72.301955] sp : ef3cfdb8  ip : 00000000  fp : ef0ad0a4
> [   72.307180] r10: 0000000c  r9 : 00000008  r8 : eeb3b450
> [   72.312405] r7 : ef0ad040  r6 : ef0ad088  r5 : c0f04c48  r4 : ef0ad04c
> [   72.318931] r3 : 00000000  r2 : eeb3b450  r1 : 00000008  r0 : ef0ad040
> [   72.325459] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [   72.332593] Control: 10c5387d  Table: 6de6406a  DAC: 00000051
> [   72.338340] Process kworker/1:2 (pid: 134, stack limit = 0x(ptrval))
> [   72.344692] Stack: (0xef3cfdb8 to 0xef3d0000)
> [   72.349085] fda0:
>     ef3cfe14 46893f30
> [   72.357329] fdc0: 00000000 ffffffff 00000000 ef3cfe14 00000005
> 00000000 c0852944 00000000
> [   72.365550] fde0: ffffe000 c014be24 c0f04c48 ef3cfe98 c0f04c48
> ee9ea800 c08515d8 c014bf18
> [   72.373739] fe00: 00000000 63a7c876 ee9b0068 c0850e60 ef26e800
> eeb3b450 ef3cfecb 00000000
> [   72.381921] fe20: 00000008 46893f30 00000000 c087e718 eebe2068
> 46893f30 c0e72400 ffffffff
> [   72.390102] fe40: 00000000 ef3cfe98 00000006 00000000 c0fb2a50
> c087e7a0 ffffffff c0852868
> [   72.398283] fe60: ffffffff c014be24 00000006 c0fad2d0 ef3cfe98
> eeb3b450 00000000 c014c528
> [   72.406463] fe80: 00000000 c015dcf8 c0f04c48 00000000 ee9ea800
> c087e484 ee9ea800 00000000
> [   72.414644] fea0: eeb3b450 ef3cfecb 00000001 46893f30 00000000
> c0f04c48 00000000 c087e578
> [   72.422825] fec0: 00000000 ef26e780 00024400 46893f30 eeb3b440
> eeb3b450 ee9ea800 00000122
> [   72.430998] fee0: 00000100 c087e600 eeb3b440 c0fad2c8 c1003ef0
> c087e31c 2e928000 c0fad2ec
> [   72.439179] ff00: c0fad2ec ee858080 ef7a62c0 ef7a9400 00000000
> c087e3c8 c0fad2ec c0144804
> [   72.447360] ff20: ef312280 ef7a62c0 00000008 ee858080 ee858094
> ef7a62c0 00000008 c0f03d00
> [   72.455541] ff40: ef7a62d8 ef7a62c0 ffffe000 c0145bac ffffe000
> c0fb2420 c0bfa1ec 00000000
> [   72.463721] ff60: ffffe000 ee855c40 ee855c00 00000000 ef3ce000
> ee858080 c0145b68 ef0e5ea4
> [   72.471902] ff80: ee855c5c c014a720 0000000b ee855c00 c014a5d8
> 00000000 00000000 00000000
> [   72.480082] ffa0: 00000000 00000000 00000000 c01010e8 00000000
> 00000000 00000000 00000000
> [   72.488262] ffc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [   72.496442] ffe0: 00000000 00000000 00000000 00000000 00000013
> 00000000 00000000 00000000
> [   72.504652] [<c0853890>] (dsa_switch_event) from [<c014be24>]
> (notifier_call_chain+0x48/0x84)
> [   72.513184] [<c014be24>] (notifier_call_chain) from [<c014bf18>]
> (raw_notifier_call_chain+0x18/0x20)
> [   72.522321] [<c014bf18>] (raw_notifier_call_chain) from
> [<c0850e60>] (dsa_port_mdb_add+0x48/0x74)
> [   72.531200] [<c0850e60>] (dsa_port_mdb_add) from [<c087e718>]
> (__switchdev_handle_port_obj_add+0x54/0xd4)
> [   72.540773] [<c087e718>] (__switchdev_handle_port_obj_add) from
> [<c087e7a0>] (switchdev_handle_port_obj_add+0x8/0x14)
> [   72.551385] [<c087e7a0>] (switchdev_handle_port_obj_add) from
> [<c0852868>] (dsa_slave_switchdev_blocking_event+0x94/0xa4)
> [   72.562343] [<c0852868>] (dsa_slave_switchdev_blocking_event) from
> [<c014be24>] (notifier_call_chain+0x48/0x84)
> [   72.572434] [<c014be24>] (notifier_call_chain) from [<c014c528>]
> (blocking_notifier_call_chain+0x50/0x68)
> [   72.582004] [<c014c528>] (blocking_notifier_call_chain) from
> [<c087e484>] (switchdev_port_obj_notify+0x44/0xa8)
> [   72.592094] [<c087e484>] (switchdev_port_obj_notify) from
> [<c087e578>] (switchdev_port_obj_add_now+0x90/0x104)
> [   72.602098] [<c087e578>] (switchdev_port_obj_add_now) from
> [<c087e600>] (switchdev_port_obj_add_deferred+0x14/0x5c)
> [   72.612537] [<c087e600>] (switchdev_port_obj_add_deferred) from
> [<c087e31c>] (switchdev_deferred_process+0x64/0x104)
> [   72.623060] [<c087e31c>] (switchdev_deferred_process) from
> [<c087e3c8>] (switchdev_deferred_process_work+0xc/0x14)
> [   72.633412] [<c087e3c8>] (switchdev_deferred_process_work) from
> [<c0144804>] (process_one_work+0x218/0x50c)
> [   72.643156] [<c0144804>] (process_one_work) from [<c0145bac>]
> (worker_thread+0x44/0x5bc)
> [   72.651253] [<c0145bac>] (worker_thread) from [<c014a720>]
> (kthread+0x148/0x150)
> [   72.658657] [<c014a720>] (kthread) from [<c01010e8>]
> (ret_from_fork+0x14/0x2c)
> [   72.665879] Exception stack(0xef3cffb0 to 0xef3cfff8)
> [   72.670932] ffa0:                                     00000000
> 00000000 00000000 00000000
> [   72.679112] ffc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [   72.687291] ffe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [   72.693911] Code: bad PC value
> [   72.697022] ---[ end trace c7626868564873c8 ]---

Good to know!


Thank you,

	Vivien

^ permalink raw reply

* RE: [PATCH net-next 2/5] r8152: replace array with linking list for rx information
From: Hayes Wang @ 2019-08-07  4:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, nic_swsd, linux-kernel@vger.kernel.org,
	linux-usb@vger.kernel.org
In-Reply-To: <20190806125342.4620a94f@cakuba.netronome.com>

Jakub Kicinski [mailto:jakub.kicinski@netronome.com]
[...]
> >  static int rtl_stop_rx(struct r8152 *tp)
> >  {
> > -	int i;
> > +	struct list_head *cursor, *next, tmp_list;
> > +	unsigned long flags;
> > +
> > +	INIT_LIST_HEAD(&tmp_list);
> >
> > -	for (i = 0; i < RTL8152_MAX_RX; i++)
> > -		usb_kill_urb(tp->rx_info[i].urb);
> > +	/* The usb_kill_urb() couldn't be used in atomic.
> > +	 * Therefore, move the list of rx_info to a tmp one.
> > +	 * Then, list_for_each_safe could be used without
> > +	 * spin lock.
> > +	 */
> 
> Would you mind explaining in a little more detail why taking the
> entries from the list for a brief period of time is safe?

Usually, it needs the spin lock before accessing the entry
of the list "tp->rx_info". However, for some reasons,
if we want to access the entry without spin lock, we
cloud move all entries to a local list temporally. Then,
we could make sure no other one could access the entries
included in the temporal local list.

For this case, when I move all entries to a temporal 
local list, no other one could access them. Therefore,
I could access the entries included in the temporal local
list without spin lock.


Best Regards,
Hayes



^ permalink raw reply

* RE: [PATCH net-next 4/5] r8152: support skb_add_rx_frag
From: Hayes Wang @ 2019-08-07  4:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, nic_swsd, linux-kernel@vger.kernel.org,
	linux-usb@vger.kernel.org
In-Reply-To: <20190806150802.72e0ef02@cakuba.netronome.com>

Jakub Kicinski [mailto:jakub.kicinski@netronome.com]
> Sent: Wednesday, August 07, 2019 6:08 AM
[...]
> >  #define RTL8152_REQT_READ	0xc0
> > @@ -720,7 +723,7 @@ struct r8152 {
> >  	struct net_device *netdev;
> >  	struct urb *intr_urb;
> >  	struct tx_agg tx_info[RTL8152_MAX_TX];
> > -	struct list_head rx_info;
> > +	struct list_head rx_info, rx_used;
> 
> I don't see where entries on the rx_used list get freed when driver is
> unloaded, could you explain how that's taken care of?

When the driver is unloaded, all rx_agg would be freed from
info_list list.

The info_list includes all rx_agg buffers which may be idle
or be busy. The rx_done and rx_use are used to determine
the status of rx_agg buffer included in info_list.

info_list: the rx_agg buffer would be inserted in this list
	   when it is allocated.
rx_done: the rx_agg buffer is ready (contains rx data). Or
	 it needs to be resubmitted.
rx_use: the rx_agg buffer is busy and couldn't be submitted
	yet.


Best Regards,
Hayes



^ permalink raw reply

* [PATCH] tipc: set addr_trail_end when using explicit node addresses
From: Chris Packham @ 2019-08-07  4:55 UTC (permalink / raw)
  To: jon.maloy, ying.xue, davem
  Cc: netdev, tipc-discussion, linux-kernel, Chris Packham

When tipc uses auto-generated node addresses it goes through a duplicate
address detection phase to ensure the address is unique.

When using explicitly configured node names the DAD phase is skipped.
However addr_trail_end was being left set to 0 which causes parts of the
tipc state machine to assume that the address is not yet valid and
unnecessarily delays the discovery phase. By setting addr_trail_end to
jiffies when using explicit addresses we ensure that we move straight to
discovery.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
---
 net/tipc/discover.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index c138d68e8a69..f83bfe8c9443 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -361,6 +361,8 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
 	if (!tipc_own_addr(net)) {
 		tn->addr_trial_end = jiffies + msecs_to_jiffies(1000);
 		msg_set_type(buf_msg(d->skb), DSC_TRIAL_MSG);
+	} else {
+		tn->addr_trial_end = jiffies;
 	}
 	memcpy(&d->dest, dest, sizeof(*dest));
 	d->net = net;
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH v2] net: mdio-octeon: Fix Kconfig warnings and build errors
From: Nathan Chancellor @ 2019-08-07  5:10 UTC (permalink / raw)
  To: David Miller
  Cc: andrew, broonie, devel, f.fainelli, gregkh, hkallweit1,
	kernel-build-reports, linux-arm-kernel, linux-next, lkp, netdev,
	rdunlap, willy
In-Reply-To: <20190806.141133.1365654857955536268.davem@davemloft.net>

On Tue, Aug 06, 2019 at 02:11:33PM -0700, David Miller wrote:
> From: Nathan Chancellor <natechancellor@gmail.com>
> Date: Fri,  2 Aug 2019 23:01:56 -0700
> 
> > After commit 171a9bae68c7 ("staging/octeon: Allow test build on
> > !MIPS"), the following combination of configs cause a few Kconfig
> > warnings and build errors (distilled from arm allyesconfig and Randy's
> > randconfig builds):
> > 
> >     CONFIG_NETDEVICES=y
> >     CONFIG_STAGING=y
> >     CONFIG_COMPILE_TEST=y
> > 
> > and CONFIG_OCTEON_ETHERNET as either a module or built-in.
> > 
> > WARNING: unmet direct dependencies detected for MDIO_OCTEON
> >   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y]
> > && 64BIT [=n] && HAS_IOMEM [=y] && OF_MDIO [=n]
> >   Selected by [y]:
> >   - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC ||
> > COMPILE_TEST [=y]) && NETDEVICES [=y]
> > 
> > In file included from ../drivers/net/phy/mdio-octeon.c:14:
> > ../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of
> > function ‘writeq’; did you mean ‘writel’?
> > [-Werror=implicit-function-declaration]
> >   111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
> >       |                                    ^~~~~~
> > 
> > CONFIG_64BIT is not strictly necessary if the proper readq/writeq
> > definitions are included from io-64-nonatomic-lo-hi.h.
> > 
> > CONFIG_OF_MDIO is not needed when CONFIG_COMPILE_TEST is enabled because
> > of commit f9dc9ac51610 ("of/mdio: Add dummy functions in of_mdio.h.").
> > 
> > Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
> > Reported-by: kbuild test robot <lkp@intel.com>
> > Reported-by: Mark Brown <broonie@kernel.org>
> > Reported-by: Randy Dunlap <rdunlap@infradead.org>
> > Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> 
> Applied to net-next.
> 
> Please make it clear what tree your changes are targetting in the future,
> thank you.

Sorry for the confusion, I'll do my best to add a patch suffix in the
future.

Thank you for picking this up!
Nathan

^ permalink raw reply

* Re: [PATCH -next] iwlwifi: dbg: work around clang bug by marking debug strings static
From: Nathan Chancellor @ 2019-08-07  5:15 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Johannes Berg, Michael Ellerman, Kalle Valo, Luca Coelho,
	Arnd Bergmann, Emmanuel Grumbach, Intel Linux Wireless,
	David S. Miller, Shahar S Matityahu, Sara Sharon, linux-wireless,
	netdev, LKML, clang-built-linux
In-Reply-To: <CAKwvOdmBeB1BezsGh=cK=U9m8goKzZnngDRzNM7B1voZfh8yWg@mail.gmail.com>

On Tue, Aug 06, 2019 at 03:37:42PM -0700, Nick Desaulniers wrote:
> On Thu, Aug 1, 2019 at 12:11 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> >
> >
> > > Luca, you said this was already fixed in your internal tree, and the fix
> > > would appear soon in next, but I don't see anything in linux-next?
> >
> > Luca is still on vacation, but I just sent out a version of the patch we
> > had applied internally.
> >
> > Also turns out it wasn't actually _fixed_, just _moved_, so those
> > internal patches wouldn't have helped anyway.
> 
> Thanks for the report. Do you have a link?
> I'll rebase my patch then.
> -- 
> Thanks,
> ~Nick Desaulniers

Just for everyone else (since I commented on our issue tracker), this is
now fixed in Linus's tree as of commit  1f6607250331 ("iwlwifi: dbg_ini:
fix compile time assert build errors").

Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH -next] iwlwifi: dbg: work around clang bug by marking debug strings static
From: Luciano Coelho @ 2019-08-07  5:24 UTC (permalink / raw)
  To: Nathan Chancellor, Nick Desaulniers
  Cc: Johannes Berg, Michael Ellerman, Kalle Valo, Arnd Bergmann,
	Emmanuel Grumbach, Intel Linux Wireless, David S. Miller,
	Shahar S Matityahu, Sara Sharon, linux-wireless, netdev, LKML,
	clang-built-linux
In-Reply-To: <20190807051516.GA117639@archlinux-threadripper>

On Tue, 2019-08-06 at 22:15 -0700, Nathan Chancellor wrote:
> On Tue, Aug 06, 2019 at 03:37:42PM -0700, Nick Desaulniers wrote:
> > On Thu, Aug 1, 2019 at 12:11 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > 
> > > > Luca, you said this was already fixed in your internal tree, and the fix
> > > > would appear soon in next, but I don't see anything in linux-next?
> > > 
> > > Luca is still on vacation, but I just sent out a version of the patch we
> > > had applied internally.
> > > 
> > > Also turns out it wasn't actually _fixed_, just _moved_, so those
> > > internal patches wouldn't have helped anyway.
> > 
> > Thanks for the report. Do you have a link?
> > I'll rebase my patch then.
> > -- 
> > Thanks,
> > ~Nick Desaulniers
> 
> Just for everyone else (since I commented on our issue tracker), this is
> now fixed in Linus's tree as of commit  1f6607250331 ("iwlwifi: dbg_ini:
> fix compile time assert build errors").

Yes, thanks Nathan! I was just digging for this patch to reply to you,
I'm still catching up with what happened during my vacations.

--
Cheers,
Luca.


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-07  5:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190806011134.p5baub5l3t5fkmou@ast-mbp>

On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> > It tries to make the kernel respect the access modes for fds.  Without
> > this patch, there seem to be some holes: nothing looked at program fds
> > and, unless I missed something, you could take a readonly fd for a
> > program, pin the program, and reopen it RW.
>
> I think it's by design. iirc Daniel had a use case for something like this.

That seems odd.  Daniel, can you elaborate?

> Hence unprivileged bpf is actually something that can be deprecated.

I hope not.  There are a couple setsockopt uses right now, and and
seccomp will surely want it someday.  And the bpf-inside-container use
case really is unprivileged bpf -- containers are, in many (most?)
cases, explicitly not trusted by the host.  But regardless, see below.

> > The ability to obtain an fd to a cgroup does *not* imply any right to
> > modify that cgroup.  The ability to write to a cgroup directory
> > already means something else -- it's the ability to create cgroups
> > under the group in question.  I'm suggesting that a new API be added
> > that allows attaching a bpf program to a cgroup without capabilities
> > and that instead requires write access to a new file in the cgroup
> > directory.  (It could be a single file for all bpf types or one file
> > per type.  I prefer the latter -- it gives the admin finer-grained
> > control.)
>
> This is something to discuss. I don't mind something like this,
> but in general bpf is not for untrusted users.
> Hence I don't want to overdesign.

I think the code would end up being extremely simple.  It would be an
ABI change, but I think that it could easily be arranged so that new
and old libbpf would be fully functional on new and old kernels except
that only new libbpf with a new kernel would be able to attach cgroup
programs without privilege.  I see no reason to remove the current
cgroup attach API.

> See https://github.com/systemd/systemd/blob/01234e1fe777602265ffd25ab6e73823c62b0efe/src/core/bpf-firewall.c#L671-L674
> bpf based IP sandboxing doesn't work in 'systemd --user'.
> That is just one of the problems that people complained about.

This isn't privilege *dropping* -- this is not having privilege in the
first place.  Giving systemd --user unrestricted use of bpf() is
tantamount to making the user root.  But again, see below.

>
> Inside containers and inside nested containers we need to start processes
> that will use bpf. All of the processes are trusted.

Trusted by whom?  In a non-nested container, the container manager
*might* be trusted by the outside world.  In a *nested* container,
unless the inner container management is controlled from outside the
outer container, it's not trusted.  I don't know much about how
Facebook's containers work, but the LXC/LXD/Podman world is moving
very strongly toward user namespaces and maximally-untrusted
containers, and I think bpf() should work in that context.

> To solve your concern of bypassing all capable checks...
> How about we do /dev/bpf/full_verifier first?
> It will replace capable() checks in the verifier only.

I'm not convinced that "in the verifier" is the right distinction.
Telling administrators that some setting lets certain users bypass
bpf() verifier checks doesn't have a clear enough meaning.  I propose,
instead, that the current capable() checks be divided into three
categories:

a) Those that, by design, control privileged operations.  This
includes most attach calls, but it also includes allow_ptr_leaks,
bpf_probe_read(), and quite a few other things.  It also includes all
of the by_id calls, I think, unless some clever modification to the
way they worked would isolate different users' objects.  I think that
persistent objects can do pretty much everything that by_id users
would need, so this isn't a big deal.

b) Those that protect code that is considered to be at higher risk of
exposing security bugs.  In other words, these are things that would
have no need for privilege if we could prove that the implementation
was perfect.  This includes bpf2bpf, LPM, etc.

c) Those that serve to prevent excessive resource usage.  This
includes the bounded loop code (I think -- this might also be category
(b)) and several explicit checks for program size.

I suppose that the speculative execution mitigation stuff could be
split out from (b) into its own category.

Based on this, how about a different division: /dev/bpf_dangerous for
(b) and /dev/bpf_resources for (c)?  And (a) is dealt with by
gradually reducing the privilege needed for the important operations.
/dev/bpf_dangerous isn't ideal in the long run though, since a lot of
the operations in that category will probably become less dangerous as
the code is better vetted, and people granting "dangerous" permission
might want to restrict it to only the specific parts that they need.
/dev/bpf_lpm, /dev/bpf2bpf, etc could work better.

This type of thing actually fits quite nicely into an idea I've been
thinking about for a while called "implicit rights". In very brief
summary, there would be objects called /dev/rights/xyz, where xyz is
the same of a "right".  If there is a readable object of the right
type at the literal path "/dev/rights/xyz", then you have right xyz.
There's a bit more flexibility on top of this.  BPF could use
/dev/rights/bpf/maptypes/lpm and
/dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
use cases include a biggie:
/dev/rights/namespace/create_unprivileged_userns.
/dev/rights/bind_port/80 would be nice, too.

I may have a bit of time over the next few days to actually prototype
this.  Would you be interested in using this for BPF?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox