Netdev List
 help / color / mirror / Atom feed
* Re: [RFC v2, net-next, PATCH 0/4] Add switchdev on TI-CPSW
From: Ilias Apalodimas @ 2018-06-18 16:46 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, grygorii.strashko, ivan.khoronzhuk, nsekhar, jiri,
	ivecera, f.fainelli, francois.ozog, yogeshs, spatton, Jose.Abreu
In-Reply-To: <20180618162836.GD5865@lunn.ch>

On Mon, Jun 18, 2018 at 06:28:36PM +0200, Andrew Lunn wrote:
> > Yes, if the CPU port is added on the VLAN then unregistered multicast traffic
> > (and thus IGMP joins) will reach the CPU port and everything will work as
> > expected. I think we should not consider this as a "problem" as long as it's
> > descibed properly in Documentation. This switch is excected to support this.
> 
> Back to the two e1000e. What would you expect to happen with them?
> Either IGMP snooping needs to work, or your don't do snooping at
> all.
That's a different use case, you don't have a CPU port on e1000 and it 
"just works". You can't do anything to the card and drop the packet. If you
want to have the same example imagine something like "i filter and drop IGMP
messages on one of the 2 e1000e's on the bridge but i expect IGMP to work".
It's a totally different hardware here were this is an option and for TI 
usecases a valid option.

> 
> > What you describe is essentially what happens on "example 2."
> > Enabling the unregistered multicat traffic to be directed to the CPU will cover
> > all use cases that require no user intervention for everything to work. MDBs
> > will automcatically be added/removed to the hardware and traffic will be
> > offloaded.
> > With the current code the user has that possibility, so it's up to him to 
> > decide what mode of configuration he prefers.
> 
> So by default, it just needs to work. You can give the user the option
> to shoot themselves in the foot, but they need to actively pull the
> trigger to blow their own foot off.
Yes it does by default. I don't consider it "foot shooting" though. 
If we stop thinking about switches connected to user environments and think of
industrial ones, my understanding is that this is a common scenarioa that needs
to be supported.

> 
> > > > Setting on/off and IFF_MULTICAST (on eth0/eth1/br0) will affect registered 
> > > > multicast masks programmed in the switch(for port1, port2, cpu port
> > > > respectively).
> > > 
> > > > This muct occur before adding VLANs on the interfaces. If you change the
> > > > flag after the VLAN configuration you need to re-issue the VLAN config 
> > > > commands. 
> > > 
> > > This you should fix. You should be able to get the stack to tell you
> > > about all the configured VLANs, so you can re-program the switch.
> > I was planning fixing these via bridge link commands which would get propagated
> > to the driver for port1/port2 and CPU port. I just didn't find anything
> > relevant to multicast on bridge commands apart from flooding (which is used 
> > properly). I think that the proper way to do this is follow the logic that was
> > introduced by VLANs i.e: 
> > bridge vlan add dev br0 vid 100 pvid untagged self <---- destined for CPU port
> > and apply it for multicast/flooding etc.
> > This requires iproute2 changes first though.
> 
> No, i think you can do this in the driver. The driver just needs to
> ask the stack to tell it about all the VLANs again. Or you walk the
> VLAN tables in the hardware and do the programming based on that. I
> don't see why user space should be involved at all. What would you
> expect with two e1000e's?
We are pretty much describing the same thing i just thought having a bridge
command for it was more appropriate.
After the user removes the multicast flag for an interface i'll just walk VLANs
and adjust accordingly. It's doable and i'll change it for the patch.


Thanks
Ilias

^ permalink raw reply

* [PATCH v3] bpf: attach type BPF_LIRC_MODE2 should not depend on CONFIG_CGROUP_BPF
From: Sean Young @ 2018-06-18 17:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Y Song, Matthias Reichl, linux-media, LKML, Alexei Starovoitov,
	Mauro Carvalho Chehab, netdev, Devin Heitmueller, Quentin Monnet
In-Reply-To: <d2613314-406e-dd7d-1cf0-b5a78a155e3b@iogearbox.net>

If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
possible to attach, detach or query IR BPF programs to /dev/lircN devices,
making them impossible to use. For embedded devices, it should be possible
to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.

This change requires some refactoring, since bpf_prog_{attach,detach,query}
functions are now always compiled, but their code paths for cgroups need
moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
require #ifdefs since that is already conditionally compiled.

Signed-off-by: Sean Young <sean@mess.org>
---
 drivers/media/rc/bpf-lirc.c | 14 +-----
 include/linux/bpf-cgroup.h  | 26 ++++++++++
 include/linux/bpf.h         |  8 +++
 include/linux/bpf_lirc.h    |  5 +-
 kernel/bpf/cgroup.c         | 52 ++++++++++++++++++++
 kernel/bpf/sockmap.c        | 18 +++++++
 kernel/bpf/syscall.c        | 98 ++++++++-----------------------------
 7 files changed, 130 insertions(+), 91 deletions(-)

diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c
index 40826bba06b6..fcfab6635f9c 100644
--- a/drivers/media/rc/bpf-lirc.c
+++ b/drivers/media/rc/bpf-lirc.c
@@ -207,29 +207,19 @@ void lirc_bpf_free(struct rc_dev *rcdev)
 	bpf_prog_array_free(rcdev->raw->progs);
 }
 
-int lirc_prog_attach(const union bpf_attr *attr)
+int lirc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
 {
-	struct bpf_prog *prog;
 	struct rc_dev *rcdev;
 	int ret;
 
 	if (attr->attach_flags)
 		return -EINVAL;
 
-	prog = bpf_prog_get_type(attr->attach_bpf_fd,
-				 BPF_PROG_TYPE_LIRC_MODE2);
-	if (IS_ERR(prog))
-		return PTR_ERR(prog);
-
 	rcdev = rc_dev_get_from_fd(attr->target_fd);
-	if (IS_ERR(rcdev)) {
-		bpf_prog_put(prog);
+	if (IS_ERR(rcdev))
 		return PTR_ERR(rcdev);
-	}
 
 	ret = lirc_bpf_attach(rcdev, prog);
-	if (ret)
-		bpf_prog_put(prog);
 
 	put_device(&rcdev->dev);
 
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 975fb4cf1bb7..79795c5fa7c3 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -188,12 +188,38 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 									      \
 	__ret;								      \
 })
+int cgroup_bpf_prog_attach(const union bpf_attr *attr,
+			   enum bpf_prog_type ptype, struct bpf_prog *prog);
+int cgroup_bpf_prog_detach(const union bpf_attr *attr,
+			   enum bpf_prog_type ptype);
+int cgroup_bpf_prog_query(const union bpf_attr *attr,
+			  union bpf_attr __user *uattr);
 #else
 
+struct bpf_prog;
 struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+static inline int cgroup_bpf_prog_attach(const union bpf_attr *attr,
+					 enum bpf_prog_type ptype,
+					 struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int cgroup_bpf_prog_detach(const union bpf_attr *attr,
+					 enum bpf_prog_type ptype)
+{
+	return -EINVAL;
+}
+
+static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
+					union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+
 #define cgroup_bpf_enabled (0)
 #define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0)
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 995c3b1e59bf..ac4c73d29f96 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -684,6 +684,8 @@ static inline void bpf_map_offload_map_free(struct bpf_map *map)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
 struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
+int sockmap_get_from_fd(const union bpf_attr *attr, int type,
+			struct bpf_prog *prog);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
 {
@@ -702,6 +704,12 @@ static inline int sock_map_prog(struct bpf_map *map,
 {
 	return -EOPNOTSUPP;
 }
+
+int sockmap_get_from_fd(const union bpf_attr *attr, int type,
+			struct bpf_prog *prog);
+{
+	return -EINVAL;
+}
 #endif
 
 #if defined(CONFIG_XDP_SOCKETS)
diff --git a/include/linux/bpf_lirc.h b/include/linux/bpf_lirc.h
index 5f8a4283092d..9d9ff755ec29 100644
--- a/include/linux/bpf_lirc.h
+++ b/include/linux/bpf_lirc.h
@@ -5,11 +5,12 @@
 #include <uapi/linux/bpf.h>
 
 #ifdef CONFIG_BPF_LIRC_MODE2
-int lirc_prog_attach(const union bpf_attr *attr);
+int lirc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 int lirc_prog_detach(const union bpf_attr *attr);
 int lirc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
 #else
-static inline int lirc_prog_attach(const union bpf_attr *attr)
+static inline int lirc_prog_attach(const union bpf_attr *attr,
+				   struct bpf_prog *prog)
 {
 	return -EINVAL;
 }
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index f7c00bd6f8e4..f0ae8a3d01f9 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -428,6 +428,58 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 	return ret;
 }
 
+int cgroup_bpf_prog_attach(const union bpf_attr *attr,
+			   enum bpf_prog_type ptype, struct bpf_prog *prog)
+{
+	struct cgroup *cgrp;
+	int ret;
+
+	cgrp = cgroup_get_from_fd(attr->target_fd);
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+
+	ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type,
+				attr->attach_flags);
+	cgroup_put(cgrp);
+
+	return ret;
+}
+
+int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret;
+
+	cgrp = cgroup_get_from_fd(attr->target_fd);
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	if (IS_ERR(prog))
+		prog = NULL;
+
+	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0);
+	if (prog)
+		bpf_prog_put(prog);
+	cgroup_put(cgrp);
+	return ret;
+}
+
+int cgroup_bpf_prog_query(const union bpf_attr *attr,
+			  union bpf_attr __user *uattr)
+{
+	struct cgroup *cgrp;
+	int ret;
+
+	cgrp = cgroup_get_from_fd(attr->query.target_fd);
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+	ret = cgroup_bpf_query(cgrp, attr, uattr);
+	cgroup_put(cgrp);
+	return ret;
+}
+
 /**
  * __cgroup_bpf_run_filter_skb() - Run a program for packet filtering
  * @sk: The socket sending or receiving traffic
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d816c0e..81d0c55a77aa 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1915,6 +1915,24 @@ int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
 	return 0;
 }
 
+int sockmap_get_from_fd(const union bpf_attr *attr, int type,
+			struct bpf_prog *prog)
+{
+	int ufd = attr->target_fd;
+	struct bpf_map *map;
+	struct fd f;
+	int err;
+
+	f = fdget(ufd);
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	err = sock_map_prog(map, prog, attr->attach_type);
+	fdput(f);
+	return err;
+}
+
 static void *sock_map_lookup(struct bpf_map *map, void *key)
 {
 	return NULL;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fa20624707f..93993c03c9ac 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1489,8 +1489,6 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	return err;
 }
 
-#ifdef CONFIG_CGROUP_BPF
-
 static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 					     enum bpf_attach_type attach_type)
 {
@@ -1505,40 +1503,6 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
 
-static int sockmap_get_from_fd(const union bpf_attr *attr,
-			       int type, bool attach)
-{
-	struct bpf_prog *prog = NULL;
-	int ufd = attr->target_fd;
-	struct bpf_map *map;
-	struct fd f;
-	int err;
-
-	f = fdget(ufd);
-	map = __bpf_map_get(f);
-	if (IS_ERR(map))
-		return PTR_ERR(map);
-
-	if (attach) {
-		prog = bpf_prog_get_type(attr->attach_bpf_fd, type);
-		if (IS_ERR(prog)) {
-			fdput(f);
-			return PTR_ERR(prog);
-		}
-	}
-
-	err = sock_map_prog(map, prog, attr->attach_type);
-	if (err) {
-		fdput(f);
-		if (prog)
-			bpf_prog_put(prog);
-		return err;
-	}
-
-	fdput(f);
-	return 0;
-}
-
 #define BPF_F_ATTACH_MASK \
 	(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI)
 
@@ -1546,7 +1510,6 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
 	struct bpf_prog *prog;
-	struct cgroup *cgrp;
 	int ret;
 
 	if (!capable(CAP_NET_ADMIN))
@@ -1583,12 +1546,15 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 		ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
 		break;
 	case BPF_SK_MSG_VERDICT:
-		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, true);
+		ptype = BPF_PROG_TYPE_SK_MSG;
+		break;
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
-		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, true);
+		ptype = BPF_PROG_TYPE_SK_SKB;
+		break;
 	case BPF_LIRC_MODE2:
-		return lirc_prog_attach(attr);
+		ptype = BPF_PROG_TYPE_LIRC_MODE2;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1602,17 +1568,20 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 		return -EINVAL;
 	}
 
-	cgrp = cgroup_get_from_fd(attr->target_fd);
-	if (IS_ERR(cgrp)) {
-		bpf_prog_put(prog);
-		return PTR_ERR(cgrp);
+	switch (ptype) {
+	case BPF_PROG_TYPE_SK_SKB:
+	case BPF_PROG_TYPE_SK_MSG:
+		ret = sockmap_get_from_fd(attr, ptype, prog);
+		break;
+	case BPF_PROG_TYPE_LIRC_MODE2:
+		ret = lirc_prog_attach(attr, prog);
+		break;
+	default:
+		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 	}
 
-	ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type,
-				attr->attach_flags);
 	if (ret)
 		bpf_prog_put(prog);
-	cgroup_put(cgrp);
 
 	return ret;
 }
@@ -1622,9 +1591,6 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 static int bpf_prog_detach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
-	struct bpf_prog *prog;
-	struct cgroup *cgrp;
-	int ret;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -1657,29 +1623,17 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
 		break;
 	case BPF_SK_MSG_VERDICT:
-		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, false);
+		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, NULL);
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
-		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, false);
+		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
 	case BPF_LIRC_MODE2:
 		return lirc_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
 
-	cgrp = cgroup_get_from_fd(attr->target_fd);
-	if (IS_ERR(cgrp))
-		return PTR_ERR(cgrp);
-
-	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
-	if (IS_ERR(prog))
-		prog = NULL;
-
-	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0);
-	if (prog)
-		bpf_prog_put(prog);
-	cgroup_put(cgrp);
-	return ret;
+	return cgroup_bpf_prog_detach(attr, ptype);
 }
 
 #define BPF_PROG_QUERY_LAST_FIELD query.prog_cnt
@@ -1687,9 +1641,6 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 static int bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr)
 {
-	struct cgroup *cgrp;
-	int ret;
-
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 	if (CHECK_ATTR(BPF_PROG_QUERY))
@@ -1717,14 +1668,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	default:
 		return -EINVAL;
 	}
-	cgrp = cgroup_get_from_fd(attr->query.target_fd);
-	if (IS_ERR(cgrp))
-		return PTR_ERR(cgrp);
-	ret = cgroup_bpf_query(cgrp, attr, uattr);
-	cgroup_put(cgrp);
-	return ret;
+
+	return cgroup_bpf_prog_query(attr, uattr);
 }
-#endif /* CONFIG_CGROUP_BPF */
 
 #define BPF_PROG_TEST_RUN_LAST_FIELD test.duration
 
@@ -2371,7 +2317,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
-#ifdef CONFIG_CGROUP_BPF
 	case BPF_PROG_ATTACH:
 		err = bpf_prog_attach(&attr);
 		break;
@@ -2381,7 +2326,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_PROG_QUERY:
 		err = bpf_prog_query(&attr, uattr);
 		break;
-#endif
 	case BPF_PROG_TEST_RUN:
 		err = bpf_prog_test_run(&attr, uattr);
 		break;
-- 
2.17.1

^ permalink raw reply related

* Re: [RFC v2, net-next, PATCH 0/4] Add switchdev on TI-CPSW
From: Andrew Lunn @ 2018-06-18 17:30 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: netdev, grygorii.strashko, ivan.khoronzhuk, nsekhar, jiri,
	ivecera, f.fainelli, francois.ozog, yogeshs, spatton, Jose.Abreu
In-Reply-To: <20180618164602.GA26411@apalos>

On Mon, Jun 18, 2018 at 07:46:02PM +0300, Ilias Apalodimas wrote:
> On Mon, Jun 18, 2018 at 06:28:36PM +0200, Andrew Lunn wrote:
> > > Yes, if the CPU port is added on the VLAN then unregistered multicast traffic
> > > (and thus IGMP joins) will reach the CPU port and everything will work as
> > > expected. I think we should not consider this as a "problem" as long as it's
> > > descibed properly in Documentation. This switch is excected to support this.
> > 
> > Back to the two e1000e. What would you expect to happen with them?
> > Either IGMP snooping needs to work, or your don't do snooping at
> > all.
> That's a different use case

I disagree. That is the exact same use case. I add ports to a bridge
and i expect the bridge to either do IGMP snooping, or just forward
all multicast. That is the users expectations. That is how the Linux
network stack works. If the hardware has limitations you want to try
to hide them from the user.

> > So by default, it just needs to work. You can give the user the option
> > to shoot themselves in the foot, but they need to actively pull the
> > trigger to blow their own foot off.

> Yes it does by default. I don't consider it "foot shooting" though. 
> If we stop thinking about switches connected to user environments 

I never think about switches. I think about a block of acceleration
hardware, which i try to offload Linux networking to. And if the
hardware cannot accelerate Linux network functions properly, i don't
try to offload it. That way it always operates in the same way, and
the user expectations are clear.

    Andrew

^ permalink raw reply

* Re: [RFC v2, net-next, PATCH 0/4] Add switchdev on TI-CPSW
From: Ilias Apalodimas @ 2018-06-18 17:49 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, grygorii.strashko, ivan.khoronzhuk, nsekhar, jiri,
	ivecera, f.fainelli, francois.ozog, yogeshs, spatton, Jose.Abreu
In-Reply-To: <20180618173025.GF5865@lunn.ch>

On Mon, Jun 18, 2018 at 07:30:25PM +0200, Andrew Lunn wrote:
> On Mon, Jun 18, 2018 at 07:46:02PM +0300, Ilias Apalodimas wrote:
> > On Mon, Jun 18, 2018 at 06:28:36PM +0200, Andrew Lunn wrote:
> > > > Yes, if the CPU port is added on the VLAN then unregistered multicast traffic
> > > > (and thus IGMP joins) will reach the CPU port and everything will work as
> > > > expected. I think we should not consider this as a "problem" as long as it's
> > > > descibed properly in Documentation. This switch is excected to support this.
> > > 
> > > Back to the two e1000e. What would you expect to happen with them?
> > > Either IGMP snooping needs to work, or your don't do snooping at
> > > all.
> > That's a different use case
> 
> I disagree. That is the exact same use case. I add ports to a bridge
> and i expect the bridge to either do IGMP snooping, or just forward
> all multicast. That is the users expectations. That is how the Linux
> network stack works. If the hardware has limitations you want to try
> to hide them from the user.
Why is this a limitation? Isn't it proper functionality?
If you configure the CPU port as "passthrough" (which again is
the default) then everything works just like e1000e. The user doesn't have to do
anything at all, MDBs are added/deleted based on proper timers/joins etc.
If the user chooses to use the cpu port as a kind of internal L-2 filter, 
that's up to him, but having hardware do the filtering for you isn't something 
i'd call a limitation.

I am not sure what's the case here though. The hardware operates as you want
by defaulti. As added functionality the user can, if he chooses to, add the 
MDBs manually instead of having some piece of code do that for him. 
This was clearly described in the first RFC since it was the only reason to add
a CPU port. Is there a problem with the user controlling these capabilities of
the hardware?
> > > So by default, it just needs to work. You can give the user the option
> > > to shoot themselves in the foot, but they need to actively pull the
> > > trigger to blow their own foot off.
> 
> > Yes it does by default. I don't consider it "foot shooting" though. 
> > If we stop thinking about switches connected to user environments 
> 
> I never think about switches. I think about a block of acceleration
> hardware, which i try to offload Linux networking to.  And if the
> hardware cannot accelerate Linux network functions properly, i don't
> try to offload it. That way it always operates in the same way, and
> the user expectations are clear.
> 
>     Andrew
The acceleration block is working properly here. The user has the ability to
instruct the acceleration block not to accelerate all the traffic but specific
cases he chooses to. Isn't that a valid use case since the hardware supports
that ?

Regards
Ilias

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Andreas Schwab @ 2018-06-18 17:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <e1a9b126-3226-72d0-82b2-b69ca5f59ccb@gmail.com>

On Jun 17 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Oh this is silly, please try :
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index c642304f178ce0a4e1358d59e45032a39f76fb3f..54dd9c18ecad817812898d6f335e1794a07dabbe 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1845,10 +1845,9 @@ EXPORT_SYMBOL(___pskb_trim);
>  int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
>  {
>         if (skb->ip_summed == CHECKSUM_COMPLETE) {
> -               int delta = skb->len - len;
> +               __wsum csumdiff = skb_checksum(skb, len, skb->len - len, 0);
>  
> -               skb->csum = csum_sub(skb->csum,
> -                                    skb_checksum(skb, len, delta, 0));
> +               skb->csum = csum_block_sub(skb->csum, csumdiff, len);
>         }
>         return __pskb_trim(skb, len);
>  }

That doesn't help either.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply

* array bounds warning in xfrm_output_resume
From: David Ahern @ 2018-06-18 18:10 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org

Florian:

I am seeing this warning:

$ make O=kbuild/perf -j 24 -s
In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
                 from /home/dsa/kernel-3.git/include/linux/list.h:9,
                 from /home/dsa/kernel-3.git/include/linux/module.h:9,
                 from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
/home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
‘xfrm_output_resume’:
/home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
subscript is above array bounds [-Warray-bounds]
   __read_once_size(&(x), __u.__c, sizeof(x));  \
                    ^~~~
/home/dsa/kernel-3.git/include/linux/compiler.h:258:22: note: in
expansion of macro ‘__READ_ONCE’
 #define READ_ONCE(x) __READ_ONCE(x, 1)
                      ^~~~~~~~~~~
/home/dsa/kernel-3.git/include/linux/rcupdate.h:350:48: note: in
expansion of macro ‘READ_ONCE’
  typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
                                                ^~~~~~~~~
/home/dsa/kernel-3.git/include/linux/rcupdate.h:487:2: note: in
expansion of macro ‘__rcu_dereference_check’
  __rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu)
  ^~~~~~~~~~~~~~~~~~~~~~~
/home/dsa/kernel-3.git/include/linux/rcupdate.h:545:28: note: in
expansion of macro ‘rcu_dereference_check’
 #define rcu_dereference(p) rcu_dereference_check(p, 0)
                            ^~~~~~~~~~~~~~~~~~~~~
/home/dsa/kernel-3.git/include/linux/netfilter.h:218:15: note: in
expansion of macro ‘rcu_dereference’
   hook_head = rcu_dereference(net->nf.hooks_arp[hook]);
               ^~~~~~~~~~~~~~~

Line in question is the nf_hook in xfrm_output_resume.
NF_INET_POST_ROUTING = 4 which is greater than NF_ARP_NUMHOOKS = 3

I believe ef57170bbfdd6 is the commit that introduced the warning

^ permalink raw reply

* Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: Martin KaFai Lau @ 2018-06-18 18:11 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, borkmann, ast, davem, David Ahern
In-Reply-To: <20180617151819.6824-1-dsahern@kernel.org>

On Sun, Jun 17, 2018 at 08:18:19AM -0700, dsahern@kernel.org wrote:
> From: David Ahern <dsahern@gmail.com>
> 
> For ACLs implemented using either FIB rules or FIB entries, the BPF
> program needs the FIB lookup status to be able to drop the packet.
Except BPF_FIB_LKUP_RET_SUCCESS and BPF_FIB_LKUP_RET_NO_NEIGH,  can you
give an example on how the xdp_prog may decide XDP_PASS vs XDP_DROP based
on other BPF_FIB_LKUP_RET_*?

> Since the bpf_fib_lookup API has not reached a released kernel yet,
> change the return code to contain an encoding of the FIB lookup
> result and return the nexthop device index in the params struct.
> 
> In addition, inform the BPF program of any post FIB lookup reason as
> to why the packet needs to go up the stack.
> 
> Update the sample program per the change in API.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
>  include/uapi/linux/bpf.h   | 28 ++++++++++++++----
>  net/core/filter.c          | 74 ++++++++++++++++++++++++++++++++--------------
>  samples/bpf/xdp_fwd_kern.c |  8 ++---
>  3 files changed, 78 insertions(+), 32 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..ceb80071c341 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1857,7 +1857,8 @@ union bpf_attr {
>   *		is resolved), the nexthop address is returned in ipv4_dst
>   *		or ipv6_dst based on family, smac is set to mac address of
>   *		egress device, dmac is set to nexthop mac address, rt_metric
> - *		is set to metric from route (IPv4/IPv6 only).
> + *		is set to metric from route (IPv4/IPv6 only), and ifindex
> + *		is set to the device index of the nexthop from the FIB lookup.
>   *
>   *             *plen* argument is the size of the passed in struct.
>   *             *flags* argument can be a combination of one or more of the
> @@ -1873,9 +1874,9 @@ union bpf_attr {
>   *             *ctx* is either **struct xdp_md** for XDP programs or
>   *             **struct sk_buff** tc cls_act programs.
>   *     Return
> - *             Egress device index on success, 0 if packet needs to continue
> - *             up the stack for further processing or a negative error in case
> - *             of failure.
> + *		< 0 if any input argument is invalid
> + *		  0 on success (packet is forwarded and nexthop neighbor exists)
> + *		> 0 one of BPF_FIB_LKUP_RET_ codes on FIB lookup response
>   *
>   * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
>   *	Description
> @@ -2612,6 +2613,19 @@ struct bpf_raw_tracepoint_args {
>  #define BPF_FIB_LOOKUP_DIRECT  BIT(0)
>  #define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
>  
> +enum {
> +	BPF_FIB_LKUP_RET_SUCCESS,      /* lookup successful */
> +	BPF_FIB_LKUP_RET_BLACKHOLE,    /* dest is blackholed */
> +	BPF_FIB_LKUP_RET_UNREACHABLE,  /* dest is unreachable */
> +	BPF_FIB_LKUP_RET_PROHIBIT,     /* dest not allowed */
> +	BPF_FIB_LKUP_RET_NOT_FWDED,    /* pkt is not forwardded */
BPF_FIB_LKUP_RET_NOT_FWDED is a catch all?

> +	BPF_FIB_LKUP_RET_FWD_DISABLED, /* fwding is not enabled on ingress */
> +	BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires unsupported encap */
> +	BPF_FIB_LKUP_RET_NO_NHDEV,     /* nh device does not exist */
> +	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neigh entry for nh */
> +	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* pkt too big to fwd */
> +};
> +
>  struct bpf_fib_lookup {
>  	/* input:  network family for lookup (AF_INET, AF_INET6)
>  	 * output: network family of egress nexthop
> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
>  
>  	/* total length of packet from network header - used for MTU check */
>  	__u16	tot_len;
> -	__u32	ifindex;  /* L3 device index for lookup */
> +
> +	/* input: L3 device index for lookup
> +	 * output: nexthop device index from FIB lookup
> +	 */
> +	__u32	ifindex;
>  
>  	union {
>  		/* inputs to lookup */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e7f12e9f598c..e758ca487878 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4073,8 +4073,9 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
>  	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
>  	params->h_vlan_TCI = 0;
>  	params->h_vlan_proto = 0;
> +	params->ifindex = dev->ifindex;
>  
> -	return dev->ifindex;
> +	return 0;
>  }
>  #endif
>  
> @@ -4098,7 +4099,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	/* verify forwarding is enabled on this interface */
>  	in_dev = __in_dev_get_rcu(dev);
>  	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
> -		return 0;
> +		return BPF_FIB_LKUP_RET_FWD_DISABLED;
>  
>  	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
>  		fl4.flowi4_iif = 1;
> @@ -4123,7 +4124,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  
>  		tb = fib_get_table(net, tbid);
>  		if (unlikely(!tb))
> -			return 0;
> +			return BPF_FIB_LKUP_RET_NOT_FWDED;
>  
>  		err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
>  	} else {
> @@ -4135,8 +4136,20 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  		err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
>  	}
>  
> -	if (err || res.type != RTN_UNICAST)
> -		return 0;
> +	if (err) {
> +		/* map fib lookup errors to RTN_ type */
> +		if (err == -EINVAL)
> +			return BPF_FIB_LKUP_RET_BLACKHOLE;
> +		if (err == -EHOSTUNREACH)
> +			return BPF_FIB_LKUP_RET_UNREACHABLE;
> +		if (err == -EACCES)
> +			return BPF_FIB_LKUP_RET_PROHIBIT;
> +
> +		return BPF_FIB_LKUP_RET_NOT_FWDED;
> +	}
> +
> +	if (res.type != RTN_UNICAST)
> +		return BPF_FIB_LKUP_RET_NOT_FWDED;
>  
>  	if (res.fi->fib_nhs > 1)
>  		fib_select_path(net, &res, &fl4, NULL);
> @@ -4144,18 +4157,18 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	if (check_mtu) {
>  		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
>  		if (params->tot_len > mtu)
> -			return 0;
> +			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
>  	}
>  
>  	nh = &res.fi->fib_nh[res.nh_sel];
>  
>  	/* do not handle lwt encaps right now */
>  	if (nh->nh_lwtstate)
> -		return 0;
> +		return BPF_FIB_LKUP_RET_UNSUPP_LWT;
>  
>  	dev = nh->nh_dev;
>  	if (unlikely(!dev))
> -		return 0;
> +		return BPF_FIB_LKUP_RET_NO_NHDEV;
>  
>  	if (nh->nh_gw)
>  		params->ipv4_dst = nh->nh_gw;
> @@ -4166,10 +4179,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	 * rcu_read_lock_bh is not needed here
>  	 */
>  	neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
> -	if (neigh)
> -		return bpf_fib_set_fwd_params(params, neigh, dev);
> +	if (!neigh)
> +		return BPF_FIB_LKUP_RET_NO_NEIGH;
>  
> -	return 0;
> +	return bpf_fib_set_fwd_params(params, neigh, dev);
>  }
>  #endif
>  
> @@ -4190,7 +4203,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  
>  	/* link local addresses are never forwarded */
>  	if (rt6_need_strict(dst) || rt6_need_strict(src))
> -		return 0;
> +		return BPF_FIB_LKUP_RET_NOT_FWDED;
>  
>  	dev = dev_get_by_index_rcu(net, params->ifindex);
>  	if (unlikely(!dev))
> @@ -4198,7 +4211,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  
>  	idev = __in6_dev_get_safely(dev);
>  	if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
> -		return 0;
> +		return BPF_FIB_LKUP_RET_FWD_DISABLED;
>  
>  	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
>  		fl6.flowi6_iif = 1;
> @@ -4225,7 +4238,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  
>  		tb = ipv6_stub->fib6_get_table(net, tbid);
>  		if (unlikely(!tb))
> -			return 0;
> +			return BPF_FIB_LKUP_RET_NOT_FWDED;
>  
>  		f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
>  	} else {
> @@ -4238,11 +4251,23 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	}
>  
>  	if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
> -		return 0;
> +		return BPF_FIB_LKUP_RET_NOT_FWDED;
> +
> +	if (unlikely(f6i->fib6_flags & RTF_REJECT)) {
> +		switch (f6i->fib6_type) {
> +		case RTN_BLACKHOLE:
> +			return BPF_FIB_LKUP_RET_BLACKHOLE;
> +		case RTN_UNREACHABLE:
> +			return BPF_FIB_LKUP_RET_UNREACHABLE;
> +		case RTN_PROHIBIT:
> +			return BPF_FIB_LKUP_RET_PROHIBIT;
> +		default:
> +			return BPF_FIB_LKUP_RET_NOT_FWDED;
> +		}
> +	}
>  
> -	if (unlikely(f6i->fib6_flags & RTF_REJECT ||
> -	    f6i->fib6_type != RTN_UNICAST))
> -		return 0;
> +	if (f6i->fib6_type != RTN_UNICAST)
> +		return BPF_FIB_LKUP_RET_NOT_FWDED;
>  
>  	if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
>  		f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
> @@ -4252,16 +4277,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	if (check_mtu) {
>  		mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
>  		if (params->tot_len > mtu)
> -			return 0;
> +			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
>  	}
>  
>  	if (f6i->fib6_nh.nh_lwtstate)
> -		return 0;
> +		return BPF_FIB_LKUP_RET_UNSUPP_LWT;
>  
>  	if (f6i->fib6_flags & RTF_GATEWAY)
>  		*dst = f6i->fib6_nh.nh_gw;
>  
>  	dev = f6i->fib6_nh.nh_dev;
> +	if (unlikely(!dev))
> +		return BPF_FIB_LKUP_RET_NO_NHDEV;
Is this a bug fix?

> +
>  	params->rt_metric = f6i->fib6_metric;
>  
>  	/* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
> @@ -4270,10 +4298,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>  	 */
>  	neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
>  				      ndisc_hashfn, dst, dev);
> -	if (neigh)
> -		return bpf_fib_set_fwd_params(params, neigh, dev);
> +	if (!neigh)
> +		return BPF_FIB_LKUP_RET_NO_NEIGH;
>  
> -	return 0;
> +	return bpf_fib_set_fwd_params(params, neigh, dev);
>  }
>  #endif
>  
> diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
> index 6673cdb9f55c..a7e94e7ff87d 100644
> --- a/samples/bpf/xdp_fwd_kern.c
> +++ b/samples/bpf/xdp_fwd_kern.c
> @@ -48,9 +48,9 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>  	struct ethhdr *eth = data;
>  	struct ipv6hdr *ip6h;
>  	struct iphdr *iph;
> -	int out_index;
>  	u16 h_proto;
>  	u64 nh_off;
> +	int rc;
>  
>  	nh_off = sizeof(*eth);
>  	if (data + nh_off > data_end)
> @@ -101,7 +101,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>  
>  	fib_params.ifindex = ctx->ingress_ifindex;
>  
> -	out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
> +	rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
>  
>  	/* verify egress index has xdp support
>  	 * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
> @@ -109,7 +109,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>  	 * NOTE: without verification that egress index supports XDP
>  	 *       forwarding packets are dropped.
>  	 */
> -	if (out_index > 0) {
> +	if (rc == 0) {
>  		if (h_proto == htons(ETH_P_IP))
>  			ip_decrease_ttl(iph);
>  		else if (h_proto == htons(ETH_P_IPV6))
> @@ -117,7 +117,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>  
>  		memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
>  		memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
> -		return bpf_redirect_map(&tx_port, out_index, 0);
> +		return bpf_redirect_map(&tx_port, fib_params.ifindex, 0);
>  	}
>  
>  	return XDP_PASS;
> -- 
> 2.11.0
> 

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Eric Dumazet @ 2018-06-18 18:18 UTC (permalink / raw)
  To: Andreas Schwab, Eric Dumazet
  Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <87o9g8geu0.fsf@igel.home>



On 06/18/2018 10:54 AM, Andreas Schwab wrote:
> On Jun 17 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> Oh this is silly, please try :
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index c642304f178ce0a4e1358d59e45032a39f76fb3f..54dd9c18ecad817812898d6f335e1794a07dabbe 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1845,10 +1845,9 @@ EXPORT_SYMBOL(___pskb_trim);
>>  int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
>>  {
>>         if (skb->ip_summed == CHECKSUM_COMPLETE) {
>> -               int delta = skb->len - len;
>> +               __wsum csumdiff = skb_checksum(skb, len, skb->len - len, 0);
>>  
>> -               skb->csum = csum_sub(skb->csum,
>> -                                    skb_checksum(skb, len, delta, 0));
>> +               skb->csum = csum_block_sub(skb->csum, csumdiff, len);
>>         }
>>         return __pskb_trim(skb, len);
>>  }
> 
> That doesn't help either.
> 
> Andreas.
> 

Then maybe NIC provided csum is not correct.

It does not compute a checksum on all the frame, but part of it.

You could use this patch to double check.

diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..277859ea73e35271a10b02011120fca248ec8e71 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -857,6 +857,11 @@ static int gem_rx(struct gem *gp, int work_to_do)
 
                csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
                skb->csum = csum_unfold(csum);
+               {
+               __wsum rsum = csum_partial(skb->data + ETH_HLEN, len - ETH_HLEN, 0);
+               if (csum != csum_fold(rsum))
+                       pr_err_ratelimited("sungem wrong csum : %x/%x, len %u bytes\n", csum, csum_fold(rsum), len);
+               }
                skb->ip_summed = CHECKSUM_COMPLETE;
                skb->protocol = eth_type_trans(skb, gp->dev);
 

^ permalink raw reply related

* Re: [PATCH] ptp: replace getnstimeofday64() with ktime_get_real_ts64()
From: Richard Cochran @ 2018-06-18 18:26 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Yangbo Lu, y2038, David S. Miller, Fabio Estevam, netdev,
	linux-kernel
In-Reply-To: <20180618142109.3445025-1-arnd@arndb.de>

On Mon, Jun 18, 2018 at 04:20:39PM +0200, Arnd Bergmann wrote:
> getnstimeofday64() is deprecated and getting replaced throughout
> the kernel with ktime_get_*() based helpers for a more consistent
> interface.
> 
> The two functions do the exact same thing, so this is just
> a cosmetic change.

Acked-by: Richard Cochran <richardcochran@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: David Ahern @ 2018-06-18 18:27 UTC (permalink / raw)
  To: Martin KaFai Lau, dsahern; +Cc: netdev, borkmann, ast, davem
In-Reply-To: <20180618181123.eczjeb3axd6sao57@kafai-mbp.dhcp.thefacebook.com>

On 6/18/18 12:11 PM, Martin KaFai Lau wrote:
> On Sun, Jun 17, 2018 at 08:18:19AM -0700, dsahern@kernel.org wrote:
>> From: David Ahern <dsahern@gmail.com>
>>
>> For ACLs implemented using either FIB rules or FIB entries, the BPF
>> program needs the FIB lookup status to be able to drop the packet.
> Except BPF_FIB_LKUP_RET_SUCCESS and BPF_FIB_LKUP_RET_NO_NEIGH,  can you
> give an example on how the xdp_prog may decide XDP_PASS vs XDP_DROP based
> on other BPF_FIB_LKUP_RET_*?
> 

	rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
	if (rc == 0)
		packet is forwarded, do the redirect

	/* the program is misconfigured -- wrong parameters in struct or flags */
	if (rc < 0)
		....

	/* rc > 0 case */
	switch(rc) {
	case BPF_FIB_LKUP_RET_BLACKHOLE:
	case BPF_FIB_LKUP_RET_UNREACHABLE:
	case BPF_FIB_LKUP_RET_PROHIBIT:
		return XDP_DROP;
	}

For the others it becomes a question of do we share why the stack needs
to be involved? Maybe the program wants to collect stats to show traffic
patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).

Arguably BPF_FIB_LKUP_RET_NO_NHDEV is not needed. See below.

>> @@ -2612,6 +2613,19 @@ struct bpf_raw_tracepoint_args {
>>  #define BPF_FIB_LOOKUP_DIRECT  BIT(0)
>>  #define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
>>  
>> +enum {
>> +	BPF_FIB_LKUP_RET_SUCCESS,      /* lookup successful */
>> +	BPF_FIB_LKUP_RET_BLACKHOLE,    /* dest is blackholed */
>> +	BPF_FIB_LKUP_RET_UNREACHABLE,  /* dest is unreachable */
>> +	BPF_FIB_LKUP_RET_PROHIBIT,     /* dest not allowed */
>> +	BPF_FIB_LKUP_RET_NOT_FWDED,    /* pkt is not forwardded */
> BPF_FIB_LKUP_RET_NOT_FWDED is a catch all?
> 

Destination is local. More precisely, the FIB lookup is not unicast so
not forwarded. It could be RTN_LOCAL, RTN_BROADCAST, RTN_ANYCAST, or
RTN_MULTICAST. The next ones -- blackhole, reachable, prohibit -- are
called out.

>> @@ -4252,16 +4277,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>>  	if (check_mtu) {
>>  		mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
>>  		if (params->tot_len > mtu)
>> -			return 0;
>> +			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
>>  	}
>>  
>>  	if (f6i->fib6_nh.nh_lwtstate)
>> -		return 0;
>> +		return BPF_FIB_LKUP_RET_UNSUPP_LWT;
>>  
>>  	if (f6i->fib6_flags & RTF_GATEWAY)
>>  		*dst = f6i->fib6_nh.nh_gw;
>>  
>>  	dev = f6i->fib6_nh.nh_dev;
>> +	if (unlikely(!dev))
>> +		return BPF_FIB_LKUP_RET_NO_NHDEV;
> Is this a bug fix?
> 

Difference between IPv4 and IPv6. Making them consistent.

It is a major BUG in the kernel to reach this point in either protocol
to have a unicast route not tied to a device. IPv4 has checks; v6 does
not. I figured this being new code, why not make bpf_ipv{4,6}_fib_lookup
as close to the same as possible.

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Mathieu Malaterre @ 2018-06-18 18:29 UTC (permalink / raw)
  To: schwab
  Cc: eric.dumazet, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <87o9g8geu0.fsf@igel.home>

On Mon, Jun 18, 2018 at 7:54 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Jun 17 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > Oh this is silly, please try :
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index c642304f178ce0a4e1358d59e45032a39f76fb3f..54dd9c18ecad817812898d6f335e1794a07dabbe 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1845,10 +1845,9 @@ EXPORT_SYMBOL(___pskb_trim);
> >  int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
> >  {
> >         if (skb->ip_summed == CHECKSUM_COMPLETE) {
> > -               int delta = skb->len - len;
> > +               __wsum csumdiff = skb_checksum(skb, len, skb->len - len, 0);
> >
> > -               skb->csum = csum_sub(skb->csum,
> > -                                    skb_checksum(skb, len, delta, 0));
> > +               skb->csum = csum_block_sub(skb->csum, csumdiff, len);
> >         }
> >         return __pskb_trim(skb, len);
> >  }
>
> That doesn't help either.

seconded (setup g4+sungem):

[  100.272676] eth0: hw csum failure
[  100.272710] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #6
[  100.272716] Call Trace:
[  100.272733] [dffedbd0] [c069ddb8]
__skb_checksum_complete+0xf0/0x108 (unreliable)
[  100.272752] [dffedbf0] [c078ea28] __udp4_lib_rcv+0x238/0xf98
[  100.272767] [dffedc70] [c0731630] ip_local_deliver_finish+0xa8/0x3c4
[  100.272777] [dffedcb0] [c073243c] ip_local_deliver+0xf0/0x154
[  100.272786] [dffedcf0] [c07328e8] ip_rcv+0x448/0x774
[  100.272800] [dffedd50] [c06aeaec] __netif_receive_skb_core+0x5e8/0x1184
[  100.272811] [dffedde0] [c06bba2c] napi_gro_receive+0x160/0x22c
[  100.272828] [dffede10] [e1571590] gem_poll+0x7fc/0x1ac0 [sungem]
[  100.272837] [dffedee0] [c06bacfc] net_rx_action+0x34c/0x618
[  100.272849] [dffedf60] [c07fd28c] __do_softirq+0x16c/0x5f0
[  100.272863] [dffedfd0] [c0064c7c] irq_exit+0x110/0x1a8
[  100.272877] [dffedff0] [c0016170] call_do_irq+0x24/0x3c
[  100.272890] [c0cf7e80] [c0009a84] do_IRQ+0x98/0x1a0
[  100.272900] [c0cf7eb0] [c001b474] ret_from_except+0x0/0x14
[  100.272911] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
                   LR = arch_cpu_idle+0x30/0x78
[  100.272920] [c0cf7f70] [c0cf6000] 0xc0cf6000 (unreliable)
[  100.272935] [c0cf7f80] [c00a3868] do_idle+0xc4/0x158
[  100.272944] [c0cf7fb0] [c00a3ab4] cpu_startup_entry+0x24/0x28
[  100.272955] [c0cf7fc0] [c0998820] start_kernel+0x47c/0x490
[  100.272963] [c0cf7ff0] [00003444] 0x3444


> Andreas.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Mathieu Malaterre @ 2018-06-18 18:45 UTC (permalink / raw)
  To: eric.dumazet
  Cc: schwab, David S. Miller, Eric Dumazet, LKML, Christophe LEROY,
	Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <816ef746-5278-1d51-1d9d-55593e377681@gmail.com>

On Mon, Jun 18, 2018 at 8:18 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/18/2018 10:54 AM, Andreas Schwab wrote:
> > On Jun 17 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >> Oh this is silly, please try :
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index c642304f178ce0a4e1358d59e45032a39f76fb3f..54dd9c18ecad817812898d6f335e1794a07dabbe 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -1845,10 +1845,9 @@ EXPORT_SYMBOL(___pskb_trim);
> >>  int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
> >>  {
> >>         if (skb->ip_summed == CHECKSUM_COMPLETE) {
> >> -               int delta = skb->len - len;
> >> +               __wsum csumdiff = skb_checksum(skb, len, skb->len - len, 0);
> >>
> >> -               skb->csum = csum_sub(skb->csum,
> >> -                                    skb_checksum(skb, len, delta, 0));
> >> +               skb->csum = csum_block_sub(skb->csum, csumdiff, len);
> >>         }
> >>         return __pskb_trim(skb, len);
> >>  }
> >
> > That doesn't help either.
> >
> > Andreas.
> >
>
> Then maybe NIC provided csum is not correct.
>
> It does not compute a checksum on all the frame, but part of it.
>
> You could use this patch to double check.
>
> diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
> index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..277859ea73e35271a10b02011120fca248ec8e71 100644
> --- a/drivers/net/ethernet/sun/sungem.c
> +++ b/drivers/net/ethernet/sun/sungem.c
> @@ -857,6 +857,11 @@ static int gem_rx(struct gem *gp, int work_to_do)
>
>                 csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
>                 skb->csum = csum_unfold(csum);
> +               {
> +               __wsum rsum = csum_partial(skb->data + ETH_HLEN, len - ETH_HLEN, 0);
> +               if (csum != csum_fold(rsum))
> +                       pr_err_ratelimited("sungem wrong csum : %x/%x, len %u bytes\n", csum, csum_fold(rsum), len);
> +               }
>                 skb->ip_summed = CHECKSUM_COMPLETE;
>                 skb->protocol = eth_type_trans(skb, gp->dev);
>
>

Here is what I get on my side

[   53.628847] sungem: sungem wrong csum : 4e04/f97, len 64 bytes
[   53.667063] sungem: sungem wrong csum : eea8/6eec, len 149 bytes
[   58.648952] sungem: sungem wrong csum : 2095/3d06, len 64 bytes
[   58.669414] sungem: sungem wrong csum : 5245/b50, len 149 bytes
[   63.674451] sungem: sungem wrong csum : 2d8/5abd, len 149 bytes
[   68.678233] sungem: sungem wrong csum : b8fc/a498, len 149 bytes
[   73.685771] sungem: sungem wrong csum : 374/5a21, len 149 bytes
[   78.689089] sungem: sungem wrong csum : d81/5014, len 149 bytes
[   83.683261] sungem: sungem wrong csum : 4e04/f97, len 64 bytes
[   83.690193] sungem: sungem wrong csum : c2f7/9a9d, len 149 bytes
[   88.692511] sungem: sungem wrong csum : f4d8/68bc, len 149 bytes
[   93.699915] sungem: sungem wrong csum : 1370/4a25, len 149 bytes
[   98.703136] sungem: sungem wrong csum : e1b5/7bdf, len 149 bytes
[  103.704230] sungem: sungem wrong csum : 5321/a74, len 149 bytes
[  108.688912] sungem: sungem wrong csum : 2095/3d06, len 64 bytes
[  108.706559] sungem: sungem wrong csum : ddbc/7fd8, len 149 bytes
[  113.713189] sungem: sungem wrong csum : 5a65/330, len 149 bytes
[  113.891697] sungem: sungem wrong csum : 4e04/f97, len 64 bytes
[  118.717151] sungem: sungem wrong csum : f7c8/65cc, len 149 bytes
[  123.722680] sungem: sungem wrong csum : 3d7a/201b, len 149 bytes
[  128.726524] sungem: sungem wrong csum : c8fd/9497, len 149 bytes
[  133.732045] sungem: sungem wrong csum : de0d/7f87, len 149 bytes
[  135.529163] sungem: sungem wrong csum : 3089/b6dd, len 96 bytes
[  135.529208] eth0: hw csum failure
[  135.529220] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #7
[  135.529226] Call Trace:
[  135.529243] [dffedbe0] [c069ddac]
__skb_checksum_complete+0xf0/0x108 (unreliable)


>
>
>

^ permalink raw reply

* Re: [PATCH v3] bpf: attach type BPF_LIRC_MODE2 should not depend on CONFIG_CGROUP_BPF
From: kbuild test robot @ 2018-06-18 18:46 UTC (permalink / raw)
  To: Sean Young
  Cc: kbuild-all, Daniel Borkmann, Y Song, Matthias Reichl, linux-media,
	LKML, Alexei Starovoitov, Mauro Carvalho Chehab, netdev,
	Devin Heitmueller, Quentin Monnet
In-Reply-To: <20180618171216.gearpr755pm3wot7@gofer.mess.org>

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

Hi Sean,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc1 next-20180618]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Sean-Young/bpf-attach-type-BPF_LIRC_MODE2-should-not-depend-on-CONFIG_CGROUP_BPF/20180619-023056
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from kernel///events/core.c:45:0:
>> include/linux/bpf.h:710:1: error: expected identifier or '(' before '{' token
    {
    ^

vim +710 include/linux/bpf.h

   707	
   708	int sockmap_get_from_fd(const union bpf_attr *attr, int type,
   709				struct bpf_prog *prog);
 > 710	{
   711		return -EINVAL;
   712	}
   713	#endif
   714	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6370 bytes --]

^ permalink raw reply

* [PATCH iproute2-next 1/1] tc: jsonify nat action
From: Keara Leibovitz @ 2018-06-18 18:57 UTC (permalink / raw)
  To: dsahern; +Cc: stephen, netdev, kernel, Keara Leibovitz

Add json output support for nat action

Example output:

~$ $TC actions add action nat egress 10.10.10.1 20.20.20.2 index 2
~$ $TC actions add action nat ingress 100.100.100.1/32 200.200.200.2 \
	continue index 99
~$ $TC -j actions ls action nat

[{
	"total acts": 2
}, {
	"actions": [{
		"order": 0,
		"type": "nat",
		"direction": "egress",
		"old_addr": "10.10.10.1/32",
		"new_addr": "20.20.20.2",
		"control_action": {
			"type": "pass"
		},
		"index": 2,
		"ref": 1,
		"bind": 0
	}, {
		"order": 1,
		"type": "nat",
		"direction": "ingress",
		"old_addr": "100.100.100.1/32",
		"new_addr": "200.200.200.2",
		"control_action": {
			"type": "continue"
		},
		"index": 99,
		"ref": 1,
		"bind": 0
	}]
}]

Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>
---
 tc/m_nat.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/tc/m_nat.c b/tc/m_nat.c
index 653792da91c0..ee0b7520a605 100644
--- a/tc/m_nat.c
+++ b/tc/m_nat.c
@@ -142,9 +142,8 @@ print_nat(struct action_util *au, FILE * f, struct rtattr *arg)
 {
 	struct tc_nat *sel;
 	struct rtattr *tb[TCA_NAT_MAX + 1];
-	char buf1[256];
-	char buf2[256];
-
+	SPRINT_BUF(buf1);
+	SPRINT_BUF(buf2);
 	int len;
 
 	if (arg == NULL)
@@ -153,7 +152,7 @@ print_nat(struct action_util *au, FILE * f, struct rtattr *arg)
 	parse_rtattr_nested(tb, TCA_NAT_MAX, arg);
 
 	if (tb[TCA_NAT_PARMS] == NULL) {
-		fprintf(f, "[NULL nat parameters]");
+		print_string(PRINT_FP, NULL, "%s", "[NULL nat parameters]");
 		return -1;
 	}
 	sel = RTA_DATA(tb[TCA_NAT_PARMS]);
@@ -161,15 +160,22 @@ print_nat(struct action_util *au, FILE * f, struct rtattr *arg)
 	len = ffs(sel->mask);
 	len = len ? 33 - len : 0;
 
-	fprintf(f, " nat %s %s/%d %s", sel->flags & TCA_NAT_FLAG_EGRESS ?
-				       "egress" : "ingress",
-		format_host_r(AF_INET, 4, &sel->old_addr, buf1, sizeof(buf1)),
-		len,
-		format_host_r(AF_INET, 4, &sel->new_addr, buf2, sizeof(buf2)));
-	print_action_control(f, " ", sel->action, "");
+	print_string(PRINT_ANY, "type", " %s ", "nat");
+	print_string(PRINT_ANY, "direction", "%s",
+		     sel->flags & TCA_NAT_FLAG_EGRESS ? "egress" : "ingress");
 
-	fprintf(f, "\n\t index %u ref %d bind %d",
-		sel->index, sel->refcnt, sel->bindcnt);
+	snprintf(buf2, sizeof(buf2), "%s/%d",
+		 format_host_r(AF_INET, 4, &sel->old_addr, buf1, sizeof(buf1)),
+		 len);
+	print_string(PRINT_ANY, "old_addr", " %s", buf2);
+	print_string(PRINT_ANY, "new_addr", " %s",
+		     format_host_r(AF_INET, 4, &sel->new_addr, buf1, sizeof(buf1)));
+
+	print_action_control(f, " ", sel->action, "");
+	print_string(PRINT_FP, NULL, "%s", _SL_);
+	print_uint(PRINT_ANY, "index", "\t index %u", sel->index);
+	print_int(PRINT_ANY, "ref", " ref %d", sel->refcnt);
+	print_int(PRINT_ANY, "bind", " bind %d", sel->bindcnt);
 
 	if (show_stats) {
 		if (tb[TCA_NAT_TM]) {
@@ -179,7 +185,7 @@ print_nat(struct action_util *au, FILE * f, struct rtattr *arg)
 		}
 	}
 
-	fprintf(f, "\n");
+	print_string(PRINT_FP, NULL, "%s", _SL_);
 
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related

* Re: 4.14.(44->48) IPv6 RA issue?
From: Ido Schimmel @ 2018-06-18 19:09 UTC (permalink / raw)
  To: valdis.kletnieks; +Cc: netdev, linux-kernel
In-Reply-To: <35958.1529334166@turing-police.cc.vt.edu>

On Mon, Jun 18, 2018 at 11:02:46AM -0400, valdis.kletnieks@vt.edu wrote:
> So I'm trying to troubleshoot an issue on an OpenWRT/Lede based
> router, where IPv6 connectivity totally fails. I've bisected it down to:
> 
> git log --oneline 187da94808a634477b5e5a69109ea0c566dfa64b..73d8a6ab7668173d70adbed45b61be5256c505e
> 73d8a6ab7668 (refs/bisect/bad) base-files: fix UCI config parsing and callback handling
> e52f3e9b1376 kernel: bump 4.14 to 4.14.48
> 7590c3c58f5e (HEAD) scripts: Replace obsolete POSIX tmpnam in slugimage.pl with File::Temp function
> 987900f2de76 hostapd: properly build hostapd-only SSL variants
> 
> and am pretty sure that it's the kernel bump (works with a 4.14.44 kernel,
> breaks with 4.14.48) as the other 3 commits don't go anywhere near IPv6 handling.

...

> Note that eth1 is the uplink towards my ISP.  I've pointed a 'tcpdump -n -i eth1 ip6'
> at it, and see plenty of RA packets come in, but neighbor discovery never completes.
> Looking at the Changelogs for .45->.50 don't show any smoking-gun patches.
> 
> This ring any bells, before I delve deeper into it?

If this is a router and forwarding is enabled, then you need to set
'accept_ra' to '2' [1]. Is this what you have configured? Seems that
OpenWRT recently started to explicitly set 'accept_ra' to '0' [2].

1. https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
2. https://github.com/openwrt/openwrt/commit/bb46520159c0119e829900e29681feea6f297fe0

^ permalink raw reply

* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Don Bollinger @ 2018-06-18 19:13 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Tom Lendacky, Arnd Bergmann, Greg Kroah-Hartman,
	linux-kernel, brandon_chuang, wally_wang, roy_lee, rick_burchett,
	quentin.chang, steven.noble, jeffrey.townsend, scotte, roopa,
	David Ahern, luke.williams, Guohan Lu, Russell King,
	netdev@vger.kernel.org, don
In-Reply-To: <ef0baf6d-1d1f-ba5b-708a-d8179bd1ea3b@gmail.com>

On Thu, Jun 14, 2018 at 08:24:34PM -0700, Florian Fainelli wrote:
> 
> 
> On 06/14/2018 07:26 PM, Don Bollinger wrote:
> > On Tue, Jun 12, 2018 at 08:11:09PM +0200, Andrew Lunn wrote:
> >>> There's an SFP driver under drivers/net/phy.  Can that driver be extended
> >>> to provide this support?  Adding Russel King who developed sfp.c, as well
> >>> at the netdev mailing list.
> >>
> >> I agree, the current SFP code should be used.
> >>
> >> My observations seem to be there are two different ways {Q}SFP are used:
> >>
> >> 1) The Linux kernel has full control, as assumed by the devlink/SFP
> >> frame work. We parse the SFP data to find the capabilities of the SFP
> >> and use it to program the MAC to use the correct mode. The MAC can be
> >> a NIC, but it can also be a switch. DSA is gaining support for
> >> PHYLINK, so SFP modules should just work with most switches which DSA
> >> support.  And there is no reason a plain switchdev switch can not use
> >> PHYLINK.
> >>
> >> 2) Firmware is in control of the PHY layer, but there is a wish to
> >> expose some of the data which is available via i2c from the {Q}SFP to
> >> linux.
> >>
> >> It appears this optoe supports this second case. It does not appear to
> >> support any in kernel API to actually make use of the SFP data in the
> >> kernel.
> >>
> >> We should not be duplicating code. We should share the SFP code for
> >> both use cases above. There is also a Linux standard API for getting
> >> access to this information. ethtool -m/--module-info. Anything which
> >> is exporting {Q}SFP data needs to use this API.
> >>
> >>    Andrew
> > 
> > Actually this is better described by a third use case.  The target
> > switches are PHY-less (see various designs at
> > www.compute.org/wiki/Networking/SpecsAndDesigns). The AS5712 for example
> > says "The AS5712-54X is a PHY-Less design with the SFP+ and QSFP+
> > connections directly attaching to the Serdes interfaces of the Broadcom
> > BCM56854 720G Trident 2 switching silicon..."
> > 
> > The electrical controls of the {Q}SFP devices (TxDisable for example)
> > are organized in a platform dependent way, through CPLD devices, and
> > managed by a platform specific CPLD driver.
> > 
> > The i2c bus is muxed from the CPU to all of the {Q}SFP devices, which
> > are set up as standard linux i2c devices
> > (/sys/bus/i2c/devices/i2c-xxxx).
> > 
> > There is no MDIO bus between the CPU and the {Q}SFP devices.
> > 
> >> 2) Firmware is in control of the PHY layer, but there is a wish to
> >> expose some of the data which is available via i2c from the {Q}SFP to
> >> linux.
> > 
> > So the switch silicon is in control of the PHY layer.  The platform
> > driver is in control of the electrical interfaces.  And the EEPROM data
> > is available via I2C.
> > 
> > And, there isn't actually 'a wish to expose' the EEPROM data to linux
> > (the kernel).  It turns out that none of the NOS partners I'm working
> > with use that data *in the kernel*.  It is all managed from user space.
> > 
> > More generally, I think sfp.c and optoe are not actually trying to
> > accomplish the same thing at all.  sfp.c combines all three functions
> > (PHY, electrical control, EEPROM access).  optoe is only providing EEPROM
> > access, and only to user space.  This is a real need in the white box
> > switch environment, and is not met by sfp.c.  optoe isn't better, sfp.c
> > isn't better, they're just different.
> 
> sfp exposes standard ethtool hooks such as get_module_info() which pass
> the whole data blob to user-space, e.g: ethtool where all of this is
> better interpreted.

Yes.  But ethtool depends on the underlying driver to access the data.
The current sfp.c implementation limits the amount of data that can be
accessed.  optoe makes the entire EEPROM accessible.  I think the right
solution is to call optoe for access to the EEPROM.  A couple of lines
of code in sfp_i2c_read could call optoe to get the data instead of
formatting the i2c_transfer directly.  That change would immediately
give the whole SFP framework access to all the address space of both SFP
and QSFP.  (Same change to sfp_i2c_write.)

> 
> > 
> > Finally, sfp.c does not recognize that SFP devices have data beyond 512
> > bytes, accessible via a page register.  It also does not recognize QSFP
> > devices at all.  QSFP devices have only 256 bytes accessible (one i2c
> > address) before switching to paged access for the remaining data.  The
> > first design requirement for optoe was to access all the available
> > pages, because there is information and controls that we (optics
> > vendors) want to make available to network management applications.
> 
> Patches welcome if you wish to extend sfp.c to support QSFP devices for
> instances.

I would love to collaborate on this.  I don't have an environment
(hardware or software) where I could build or test changes to the sfp
code.

> 
> > 
> > If sfp.c creates a standard linux i2c client for each SFP device, it
> > should be possible to create an optoe managed device 'under' sfp.c to
> > provide access to the full EEPROM address space:
> 
> It's the other way around, SFP relies on a standard Linux i2c bus master
> to exist such that it can read the EEPROM from the standard slave
> address location, same goes with a possibly present PHY.

Great.  Then plugging optoe in should be easy.

> 
> >   # echo optoe2 0x50 >/sys/bus/i2c/devices/i2c-xx/new_device
> > This might prove useful to user space consumers of that data.  We could
> > also easily add a kernel API (eg the nvmem framework) to optoe to provide
> > kernel access.  In other words, sfp.c could assign EEPROM management to
> > optoe, while managing the electrical interfaces.  (This is actually
> > pretty close to how the platfom drivers work in the switch environment.)
> > sfp.c would get SFP page support and QSFP EEPROM access 'for free'.
> 
> That sounds like a possibly good approach.

Thanks

> 
> > 
> >>                       There is also a Linux standard API for getting
> >> access to this information. ethtool -m/--module-info. Anything which
> >> is exporting {Q}SFP data needs to use this API.
> > 
> > optoe simply provides direct access from user space to the full EEPROM
> > data.  There is more data there than ethtool knows about, and in some
> > devices there are proprietary registers that ethtool will never know
> > about.  optoe does not interpret any of the EEPROM content (except the
> > bare minimum to access pages correctly).  optoe also does not get in the
> > way of ethtool.  It could prove to be a handy way for ethtool to access
> > new EEPROM fields in the future.  QSFP-DD/OSFP are coming soon, they
> > will have a different (incompatible) set of new fields to be decoded.
> 
> sfp is the same it only passes the EEPROM information to user-space and
> interprets just what it needs to get the work done.

I offer include/linux/sfp.h as a counter example.  Every byte, every bit
in the spec is spec'ed there.  That's great, but exceeds the mandate of
optoe.  Optoe is just there to get the data in and out of the EEPROM.

> 
> > 
> > Bottom Line:  sfp.c is not a useful starting point for the switch
> > environment I'm working in.  The underlying hardware architecture is
> > quite different.  optoe is not a competing alternative.  Its only
> > function is to provide user-space access to the EEPROM data in {Q}SFP
> > devices.
> 
> I just don't understand why would we want optoe when the standard way to
> expose both EEPROM and diagnostics modules has been through ethtool's
> get_module_info since the basic paradigm is that a network port is a
> net_device instance in the kernel. If that basic device driver model
> does not exist, then it is unclear to me what are the benefits.

We want optoe to access regions of the EEPROM which are paged, and to
access QSFP which only has one I2C address and is also paged.  It is
just the sfp_read/sfp_write function, but reading and writing the whole
EEPROM.  Plugging it into sfp gives that broader access to the ethtool
interface and the rest of the net-device model.

> 
> Would I be completely wrong if I wrote that you are likely working with
> a switch SDK which primarily runs in user-space and so with lack of a
> proper kernel-based device driver for your piece of hardware you are
> attempting to create a driver which is some sort of a "tap" for some
> specific portion of that larger hardware block?

Not completely wrong, but biased.  The switch ASIC and the switch SDK do
not connect to the i2c bus (at least in the architecture I'm working
with).  Whether or not it is in user-space isn't the point.  That it
doesn't access i2c, hence doesn't access the EEPROM, is the reason we
have a 'proper kernel-based device driver'.

> -- 
> Florian
> 

Don

^ permalink raw reply

* Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]
From: Cong Wang @ 2018-06-18 19:28 UTC (permalink / raw)
  To: Marco Berizzi; +Cc: Linux Kernel Network Developers
In-Reply-To: <1938577710.309243.1528180788286@mail.libero.it>

[-- Attachment #1: Type: text/plain, Size: 867 bytes --]

On Mon, Jun 4, 2018 at 11:39 PM, Marco Berizzi <pupilla@libero.it> wrote:
>> Il 8 marzo 2018 alle 17.02 Marco Berizzi <pupilla@libero.it> ha scritto:
>>
>> > Marco Berizzi wrote:
>> >
>> > Hello everyone,
>> >
>> > Yesterday I got this error on a slackware linux 4.16-rc4 system
>> > running as a traffic shaping gateway and netfilter nat.
>> > The error has been arisen after a partial ISP network outage,
>> > so unfortunately it will not trivial for me to reproduce it again.
>>
>> Hello everyone,
>>
>> I'm getting this error twice/day, so fortunately I'm able to
>> reproduce it.
>
> I'm getting the same error also with 4.17 (4.15.18 was
> the latest working version):

Can you test the attached patch?

It almost certainly fixes the warning, but I am not sure if
it papers out any other real problem. Please make sure
hfsc still works as expected. :)

Thanks.

[-- Attachment #2: hfsc.diff --]
[-- Type: application/octet-stream, Size: 715 bytes --]

commit 3df066c835cdb6dd5ee093292e1c41ae584fafa3
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Mon Jun 18 11:59:47 2018 -0700

    net_sched: remove a bogus warning in hfsc
    
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 3ae9877ea205..3278a76f6861 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1385,8 +1385,8 @@ hfsc_schedule_watchdog(struct Qdisc *sch)
 		if (next_time == 0 || next_time > q->root.cl_cfmin)
 			next_time = q->root.cl_cfmin;
 	}
-	WARN_ON(next_time == 0);
-	qdisc_watchdog_schedule(&q->watchdog, next_time);
+	if (next_time)
+		qdisc_watchdog_schedule(&q->watchdog, next_time);
 }
 
 static int

^ permalink raw reply related

* [PATCH net] net/tcp: Fix socket lookups with SO_BINDTODEVICE
From: dsahern @ 2018-06-18 19:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, lberger, renato, David Ahern

From: David Ahern <dsahern@gmail.com>

Similar to 69678bcd4d2d ("udp: fix SO_BINDTODEVICE"), TCP socket lookups
need to fail if dev_match is not true. Currently, a packet to a given port
can match a socket bound to device when it should not. In the VRF case,
this causes the lookup to hit a VRF socket and not a global socket
resulting in a response trying to go through the VRF when it should it.

Fixes: 3fa6f616a7a4d ("net: ipv4: add second dif to inet socket lookups")
Fixes: 4297a0ef08572 ("net: ipv6: add second dif to inet6 socket lookups")
Reported-by: Lou Berger <lberger@labn.net>
Diagnosed-by: Renato Westphal <renato@opensourcerouting.org>
Tested-by: Renato Westphal <renato@opensourcerouting.org>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv4/inet_hashtables.c  | 4 ++--
 net/ipv6/inet6_hashtables.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 31ff46daae97..3647167c8fa3 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -243,9 +243,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
 			bool dev_match = (sk->sk_bound_dev_if == dif ||
 					  sk->sk_bound_dev_if == sdif);
 
-			if (exact_dif && !dev_match)
+			if (!dev_match)
 				return -1;
-			if (sk->sk_bound_dev_if && dev_match)
+			if (sk->sk_bound_dev_if)
 				score += 4;
 		}
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 2febe26de6a1..595ad408dba0 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -113,9 +113,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
 			bool dev_match = (sk->sk_bound_dev_if == dif ||
 					  sk->sk_bound_dev_if == sdif);
 
-			if (exact_dif && !dev_match)
+			if (!dev_match)
 				return -1;
-			if (sk->sk_bound_dev_if && dev_match)
+			if (sk->sk_bound_dev_if)
 				score++;
 		}
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v3 06/14] mtd: rawnand: marvell: remove the dmaengine compat need
From: Robert Jarzmik @ 2018-06-18 19:31 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Ulf Hansson, alsa-devel, linux-kernel, linux-ide, linux-mtd,
	Miquel Raynal, Mauro Carvalho Chehab, Vinod Koul,
	Richard Weinberger, Takashi Iwai, Marek Vasut, linux-media,
	Bartlomiej Zolnierkiewicz, Haojian Zhuang, Boris Brezillon,
	Mark Brown, linux-arm-kernel, Nicolas Pitre, netdev, linux-mmc,
	Liam Girdwood, dmaengine, Tejun Heo, Brian Norris,
	David Woodhouse
In-Reply-To: <3a2a8951-f380-af99-bf97-6ff722404410@zonque.org>

Daniel Mack <daniel@zonque.org> writes:

> On Sunday, June 17, 2018 07:02 PM, Robert Jarzmik wrote:
>> As the pxa architecture switched towards the dmaengine slave map, the
>> old compatibility mechanism to acquire the dma requestor line number and
>> priority are not needed anymore.
>>
>> This patch simplifies the dma resource acquisition, using the more
>> generic function dma_request_slave_channel().
>>
>> Signed-off-by: Signed-off-by: Daniel Mack <daniel@zonque.org>
>
> Something went wrong here, but you can simply fix that when applying the series
> :)
Indeed, fixed before applying to the pxa/for-next tree.

--
Robert

^ permalink raw reply

* Re: [PATCH net-next v3 2/2] r8169: Reinstate ASPM Support
From: Heiner Kallweit @ 2018-06-18 19:33 UTC (permalink / raw)
  To: Kai-Heng Feng, davem
  Cc: ryankao, hayeswang, hau, romieu, bhelgaas, acelan.kao, netdev,
	linux-pci, linux-kernel
In-Reply-To: <20180615053219.14053-2-kai.heng.feng@canonical.com>

On 15.06.2018 07:32, Kai-Heng Feng wrote:
> On Intel Coffe Lake platforms, ASPM support in r8169 is the last missing
> puzzle to let CPU's Package C-State reaches PC8.
> Without ASPM support, the CPU cannot reach beyond PC3. PC8 can save
> additional ~3W in comparison with PC3 on my testing platform, Dell G3
> 3779.
> 
The patch itself looks good to me.
Still I think the commit message should be improved. Currently it sounds
like the patch affects Coffee Lake systems only. But disabled ASPM
prevents the system from reaching PC states beyond PC3 on more than
one platform. You just tested on Coffee Lake only.

And it would be good if you mention the RTL8168 chip version of
the system you tested on.

Finally one smaller remark below.

> This is based on the work from Chunhao Lin <hau@realtek.com>.
> 
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v3:
> - Change commit message wording.
> - Rename the function to rtl_hw_aspm_clkreq_enable().
> 
> v2:
> - Remove module parameter.
> - Remove pci_disable_link_state().
> 
>  drivers/net/ethernet/realtek/r8169.c | 40 +++++++++++++++++++---------
>  1 file changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 9b55ce513a36..269ac7561368 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -5289,6 +5289,17 @@ static void rtl_pcie_state_l2l3_enable(struct rtl8169_private *tp, bool enable)
>  	RTL_W8(tp, Config3, data);
>  }
>  
> +static void rtl_hw_aspm_clkreq_enable(struct rtl8169_private *tp, bool enable)
> +{
> +	if (enable) {
> +		RTL_W8(tp, Config2, RTL_R8(tp, Config2) | ClkReqEn);
> +		RTL_W8(tp, Config5, RTL_R8(tp, Config5) | ASPM_en);
> +	} else {
> +		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
> +		RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
> +	}
> +}
> +
>  static void rtl_hw_start_8168bb(struct rtl8169_private *tp)
>  {
>  	RTL_W8(tp, Config3, RTL_R8(tp, Config3) & ~Beacon_en);
> @@ -5645,9 +5656,9 @@ static void rtl_hw_start_8168g_1(struct rtl8169_private *tp)
>  	rtl_hw_start_8168g(tp);
>  
>  	/* disable aspm and clock request before access ephy */
> -	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
> -	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
> +	rtl_hw_aspm_clkreq_enable(tp, false);
>  	rtl_ephy_init(tp, e_info_8168g_1, ARRAY_SIZE(e_info_8168g_1));
> +	rtl_hw_aspm_clkreq_enable(tp, true);
>  }
>  
>  static void rtl_hw_start_8168g_2(struct rtl8169_private *tp)
> @@ -5680,9 +5691,9 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>  	rtl_hw_start_8168g(tp);
>  
>  	/* disable aspm and clock request before access ephy */
> -	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
> -	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
> +	rtl_hw_aspm_clkreq_enable(tp, false);
>  	rtl_ephy_init(tp, e_info_8411_2, ARRAY_SIZE(e_info_8411_2));
> +	rtl_hw_aspm_clkreq_enable(tp, true);
>  }
>  
>  static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
> @@ -5699,8 +5710,7 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
>  	};
>  
>  	/* disable aspm and clock request before access ephy */
> -	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
> -	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
> +	rtl_hw_aspm_clkreq_enable(tp, false);
>  	rtl_ephy_init(tp, e_info_8168h_1, ARRAY_SIZE(e_info_8168h_1));
>  
>  	RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
> @@ -5779,6 +5789,8 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
>  	r8168_mac_ocp_write(tp, 0xe63e, 0x0000);
>  	r8168_mac_ocp_write(tp, 0xc094, 0x0000);
>  	r8168_mac_ocp_write(tp, 0xc09e, 0x0000);
> +
> +	rtl_hw_aspm_clkreq_enable(tp, true);
>  }
>  
>  static void rtl_hw_start_8168ep(struct rtl8169_private *tp)
> @@ -5830,11 +5842,12 @@ static void rtl_hw_start_8168ep_1(struct rtl8169_private *tp)
>  	};
>  
>  	/* disable aspm and clock request before access ephy */
> -	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
> -	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
> +	rtl_hw_aspm_clkreq_enable(tp, false);
>  	rtl_ephy_init(tp, e_info_8168ep_1, ARRAY_SIZE(e_info_8168ep_1));
>  
>  	rtl_hw_start_8168ep(tp);
> +
> +	rtl_hw_aspm_clkreq_enable(tp, true);
>  }
>  
>  static void rtl_hw_start_8168ep_2(struct rtl8169_private *tp)
> @@ -5846,14 +5859,15 @@ static void rtl_hw_start_8168ep_2(struct rtl8169_private *tp)
>  	};
>  
>  	/* disable aspm and clock request before access ephy */
> -	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
> -	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
> +	rtl_hw_aspm_clkreq_enable(tp, false);
>  	rtl_ephy_init(tp, e_info_8168ep_2, ARRAY_SIZE(e_info_8168ep_2));
>  
>  	rtl_hw_start_8168ep(tp);
>  
>  	RTL_W8(tp, DLLPR, RTL_R8(tp, DLLPR) & ~PFM_EN);
>  	RTL_W8(tp, MISC_1, RTL_R8(tp, MISC_1) & ~PFM_D3COLD_EN);
> +
> +	rtl_hw_aspm_clkreq_enable(tp, true);
>  }
>  
>  static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
> @@ -5867,8 +5881,7 @@ static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
>  	};
>  
>  	/* disable aspm and clock request before access ephy */
> -	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
> -	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
> +	rtl_hw_aspm_clkreq_enable(tp, false);
>  	rtl_ephy_init(tp, e_info_8168ep_3, ARRAY_SIZE(e_info_8168ep_3));
>  
>  	rtl_hw_start_8168ep(tp);
> @@ -5888,6 +5901,8 @@ static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
>  	data = r8168_mac_ocp_read(tp, 0xe860);
>  	data |= 0x0080;
>  	r8168_mac_ocp_write(tp, 0xe860, data);
> +
> +	rtl_hw_aspm_clkreq_enable(tp, true);
>  }
>  
>  static void rtl_hw_start_8168(struct rtl8169_private *tp)
> @@ -7646,7 +7661,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	mii->reg_num_mask = 0x1f;
>  	mii->supports_gmii = cfg->has_gmii;
>  
> -

This should go to patch 1.

>  	/* enable device (incl. PCI PM wakeup and hotplug setup) */
>  	rc = pcim_enable_device(pdev);
>  	if (rc < 0) {
> 

^ permalink raw reply

* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Don Bollinger @ 2018-06-18 19:41 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Tom Lendacky, Arnd Bergmann, Greg Kroah-Hartman, linux-kernel,
	brandon_chuang, wally_wang, roy_lee, rick_burchett, quentin.chang,
	steven.noble, jeffrey.townsend, scotte, roopa, David Ahern,
	luke.williams, Guohan Lu, Russell King, netdev@vger.kernel.org,
	don
In-Reply-To: <20180615075417.GA28730@lunn.ch>

On Fri, Jun 15, 2018 at 09:54:17AM +0200, Andrew Lunn wrote:
> > Actually this is better described by a third use case.  The target
> > switches are PHY-less (see various designs at
> > www.compute.org/wiki/Networking/SpecsAndDesigns). The AS5712 for example
> > says "The AS5712-54X is a PHY-Less design with the SFP+ and QSFP+
> > connections directly attaching to the Serdes interfaces of the Broadcom
> > BCM56854 720G Trident 2 switching silicon..."
> 
> We consider the SFP+ and QSFP+ as being the PHY. You need something to
> control that PHY. Either it is firmware running in the switch, or it
> is the Linux kernel, via PHYLINK.

Actually in the environment I'm working in (at least 3 different NOS
vendors, and 3 different switch vendors), the {Q}SFP devices are
controlled by a platform specific driver that pokes CPLD registers.  I'm
not sure where the actual control logic is, but it doesn't appear to use
any of the frameworks you are describing.

> 
> > The i2c bus is muxed from the CPU to all of the {Q}SFP devices, which
> > are set up as standard linux i2c devices
> > (/sys/bus/i2c/devices/i2c-xxxx).
> 
> Having a standard i2c bus driver is correct. This is what PHYLINK
> assumes. It knows about the different addresses the SFP uses on the
> i2c bus.

Great.  Then plugging optoe into it should be easy.

> 
> > There is no MDIO bus between the CPU and the {Q}SFP devices.
> 
> There is no physical MDIO bus for SFP devices. If the SFP module
> implements copper 1G, there is often MDIO tunnelled over i2c. PHYLINK
> knows how to do this, and will instantiate a normal Linux MDIO bus
> driver, and then you can use the Linux kernel copper PHY state
> machines as normal.
> 
> > And, there isn't actually 'a wish to expose' the EEPROM data to linux
> > (the kernel).  It turns out that none of the NOS partners I'm working
> > with use that data *in the kernel*.  It is all managed from user space.
> 
> Ah. O.K. We can stop here then.
> 
> If you are using Linux as a boot loader, i doubt you will find any
> network kernel developers who are willing to consider this driver. The

It isn't a boot loader.  It is the kernel that is running on the switch
when it is doing its switch thing.  The kernel hosts the drivers and the
switch SDK and all the apps that configure and manage the networking.

> kernel community as decided switchdev is how the Linux kernel supports

I'm sure switchdev works very well.  It is not being used in the
environment I am trying to support.  I've checked all 3 NOSs that have
adopted optoe, none of them have switchdev configured in their .config
file:

# CONFIG_NET_SWITCHDEV is not set

> switches. We are unlikely to add drivers for supporting user space
> drivers of switches.

That's not the request.  I'm offering an improved driver to access {Q}SFP
EEPROMs.  It can be easily called by your framework, so the sfp.c users
can also get improved access to the EEPROMs.

As designed it fits the need in the linux based community I'm
working with.  It is in production in two NOSs on three switches.  Less
complete variants of this driver are in production on all three NOSs
I've worked with, on dozens of platforms.  This is real code, that fits
a real need, and would like the benefits of being maintained as part of
the mainline kernel.

> 
> NACK.
> 
> 	Andrew
> 

Don

^ permalink raw reply

* RE: [PATCH] net: fix e1000.rst Documentation build errors
From: Brown, Aaron F @ 2018-06-18 19:42 UTC (permalink / raw)
  To: Randy Dunlap, netdev@vger.kernel.org, David Miller,
	linux-doc@vger.kernel.org
  Cc: LKML, Kirsher, Jeffrey T
In-Reply-To: <4e7a8aa5-cd19-3f57-cc60-6e6224d02a69@infradead.org>

> From: Randy Dunlap [mailto:rdunlap@infradead.org]
> Sent: Saturday, June 16, 2018 5:36 PM
> To: netdev@vger.kernel.org; David Miller <davem@davemloft.net>; linux-
> doc@vger.kernel.org
> Cc: LKML <linux-kernel@vger.kernel.org>; Kirsher, Jeffrey T
> <jeffrey.t.kirsher@intel.com>; Brown, Aaron F <aaron.f.brown@intel.com>
> Subject: [PATCH] net: fix e1000.rst Documentation build errors
> 
> From: Randy Dunlap <rdunlap@infradead.org>
> 
> Fix Documentation build errors in e1000.rst.  Several section titles and
> their underlines should not be indented.
> 
> Documentation/networking/e1000.rst:358: (SEVERE/4) Unexpected section
> title.
> 
> Fixes: 228046e76189 ("Documentation: e1000: Update kernel
> documentation")
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Aaron Brown <aaron.f.brown@intel.com>
> ---
> Is there a Sphinx version problem here?  Tested-by: should indicate
> that there was no error like I am seeing. 

The "Tested-by:" for this (and the e100 variant) was entirely focused on correctness of the documentation text, parameters match the driver, URLs were correct, etc...  In retrospect "Tested-by:" is not exactly accurate and I probably should have left it at a simpler ack.

> 
>  Documentation/networking/e1000.rst |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> --- lnx-418-rc1.orig/Documentation/networking/e1000.rst
> +++ lnx-418-rc1/Documentation/networking/e1000.rst
> @@ -354,8 +354,8 @@ previously mentioned to force the adapte
>  Additional Configurations
>  =========================
> 
> -  Jumbo Frames
> -  ------------
> +Jumbo Frames
> +------------
>    Jumbo Frames support is enabled by changing the MTU to a value larger
> than
>    the default of 1500.  Use the ifconfig command to increase the MTU size.
>    For example::
> @@ -389,8 +389,8 @@ Additional Configurations
>       Intel(R) PRO/1000 Gigabit Server Adapter
>       Intel(R) PRO/1000 PM Network Connection
> 
> -  ethtool
> -  -------
> +ethtool
> +-------
>    The driver utilizes the ethtool interface for driver configuration and
>    diagnostics, as well as displaying statistical information.  The ethtool
>    version 1.6 or later is required for this functionality.
> @@ -398,8 +398,8 @@ Additional Configurations
>    The latest release of ethtool can be found from
>    https://www.kernel.org/pub/software/network/ethtool/
> 
> -  Enabling Wake on LAN* (WoL)
> -  ---------------------------
> +Enabling Wake on LAN* (WoL)
> +---------------------------
>    WoL is configured through the ethtool* utility.
> 
>    WoL will be enabled on the system during the next shut down or reboot.
> 


^ permalink raw reply

* [PATCH v3] bitfield: fix *_encode_bits()
From: Johannes Berg @ 2018-06-18 19:56 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Al Viro, Andy Shevchenko

There's a bug in *_encode_bits() in using ~field_multiplier() for
the check whether or not the constant value fits into the field,
this is wrong and clearly ~field_mask() was intended. This was
triggering for me for both constant and non-constant values.

Additionally, make this case actually into an compile error.
Declaring the extern function that will never exist with just a
warning is pointless as then later we'll just get a link error.

While at it, also fix the indentation in those lines I'm touching.

Finally, as suggested by Andy Shevchenko, add some tests and for
that introduce also u8 helpers. The tests don't compile without
the fix, showing that it's necessary.

Fixes: 00b0c9b82663 ("Add primitives for manipulating bitfields both in host- and fixed-endian.")
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
---
v2: replace stray tab by space
v3: u8 helpers, tests

If you don't mind, I'd like to take this through the networking
tree(s) as a fix since I have some pending code that depends on
it.
---
 include/linux/bitfield.h |   7 +-
 lib/Kconfig.debug        |   7 ++
 lib/Makefile             |   1 +
 lib/test_bitfield.c      | 163 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 175 insertions(+), 3 deletions(-)
 create mode 100644 lib/test_bitfield.c

diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
index cf2588d81148..65a6981eef7b 100644
--- a/include/linux/bitfield.h
+++ b/include/linux/bitfield.h
@@ -104,7 +104,7 @@
 		(typeof(_mask))(((_reg) & (_mask)) >> __bf_shf(_mask));	\
 	})
 
-extern void __compiletime_warning("value doesn't fit into mask")
+extern void __compiletime_error("value doesn't fit into mask")
 __field_overflow(void);
 extern void __compiletime_error("bad bitfield mask")
 __bad_mask(void);
@@ -121,8 +121,8 @@ static __always_inline u64 field_mask(u64 field)
 #define ____MAKE_OP(type,base,to,from)					\
 static __always_inline __##type type##_encode_bits(base v, base field)	\
 {									\
-        if (__builtin_constant_p(v) &&	(v & ~field_multiplier(field)))	\
-			    __field_overflow();				\
+	if (__builtin_constant_p(v) && (v & ~field_mask(field)))	\
+		__field_overflow();					\
 	return to((v & field_mask(field)) * field_multiplier(field));	\
 }									\
 static __always_inline __##type type##_replace_bits(__##type old,	\
@@ -143,6 +143,7 @@ static __always_inline base type##_get_bits(__##type v, base field)	\
 	____MAKE_OP(le##size,u##size,cpu_to_le##size,le##size##_to_cpu)	\
 	____MAKE_OP(be##size,u##size,cpu_to_be##size,be##size##_to_cpu)	\
 	____MAKE_OP(u##size,u##size,,)
+____MAKE_OP(u8,u8,,)
 __MAKE_OP(16)
 __MAKE_OP(32)
 __MAKE_OP(64)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index eb885942eb0f..b0870377b4d1 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1799,6 +1799,13 @@ config TEST_BITMAP
 
 	  If unsure, say N.
 
+config TEST_BITFIELD
+	tristate "Test bitfield functions at runtime"
+	help
+	  Enable this option to test the bitfield functions at boot.
+
+	  If unsure, say N.
+
 config TEST_UUID
 	tristate "Test functions located in the uuid module at runtime"
 
diff --git a/lib/Makefile b/lib/Makefile
index 84c6dcb31fbb..02ab4c1a64d1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
 obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
 obj-$(CONFIG_TEST_PRINTF) += test_printf.o
 obj-$(CONFIG_TEST_BITMAP) += test_bitmap.o
+obj-$(CONFIG_TEST_BITFIELD) += test_bitfield.o
 obj-$(CONFIG_TEST_UUID) += test_uuid.o
 obj-$(CONFIG_TEST_PARMAN) += test_parman.o
 obj-$(CONFIG_TEST_KMOD) += test_kmod.o
diff --git a/lib/test_bitfield.c b/lib/test_bitfield.c
new file mode 100644
index 000000000000..3b50067611d9
--- /dev/null
+++ b/lib/test_bitfield.c
@@ -0,0 +1,163 @@
+/*
+ * Test cases for bitfield helpers.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitfield.h>
+
+#define CHECK_ENC_GET_U(tp, v, field, res) do {				\
+		{							\
+			u##tp _res;					\
+									\
+			_res = u##tp##_encode_bits(v, field);		\
+			if (_res != res) {				\
+				pr_warn("u" #tp "_encode_bits(" #v ", " #field ") is 0x%llx != " #res "\n",\
+					(u64)_res);			\
+				return -EINVAL;				\
+			}						\
+			if (u##tp##_get_bits(_res, field) != v)		\
+				return -EINVAL;				\
+		}							\
+	} while (0)
+
+#define CHECK_ENC_GET_LE(tp, v, field, res) do {			\
+		{							\
+			__le##tp _res;					\
+									\
+			_res = le##tp##_encode_bits(v, field);		\
+			if (_res != cpu_to_le##tp(res)) {		\
+				pr_warn("le" #tp "_encode_bits(" #v ", " #field ") is 0x%llx != 0x%llx\n",\
+					(u64)le##tp##_to_cpu(_res),	\
+					(u64)(res));			\
+				return -EINVAL;				\
+			}						\
+			if (le##tp##_get_bits(_res, field) != v)	\
+				return -EINVAL;				\
+		}							\
+	} while (0)
+
+#define CHECK_ENC_GET_BE(tp, v, field, res) do {			\
+		{							\
+			__be##tp _res;					\
+									\
+			_res = be##tp##_encode_bits(v, field);		\
+			if (_res != cpu_to_be##tp(res)) {		\
+				pr_warn("be" #tp "_encode_bits(" #v ", " #field ") is 0x%llx != 0x%llx\n",\
+					(u64)be##tp##_to_cpu(_res),	\
+					(u64)(res));			\
+				return -EINVAL;				\
+			}						\
+			if (be##tp##_get_bits(_res, field) != v)	\
+				return -EINVAL;				\
+		}							\
+	} while (0)
+
+#define CHECK_ENC_GET(tp, v, field, res) do {				\
+		CHECK_ENC_GET_U(tp, v, field, res);			\
+		CHECK_ENC_GET_LE(tp, v, field, res);			\
+		CHECK_ENC_GET_BE(tp, v, field, res);			\
+	} while (0)
+
+static int test_constants(void)
+{
+	/*
+	 * NOTE
+	 * This whole function compiles (or at least should, if everything
+	 * is going according to plan) to nothing after optimisation.
+	 */
+
+	CHECK_ENC_GET(16,  1, 0x000f, 0x0001);
+	CHECK_ENC_GET(16,  3, 0x00f0, 0x0030);
+	CHECK_ENC_GET(16,  5, 0x0f00, 0x0500);
+	CHECK_ENC_GET(16,  7, 0xf000, 0x7000);
+	CHECK_ENC_GET(16, 14, 0x000f, 0x000e);
+	CHECK_ENC_GET(16, 15, 0x00f0, 0x00f0);
+/*
+ * This should fail compilation:
+ *	CHECK_ENC_GET(16, 16, 0x0f00, 0x1000);
+ */
+
+	CHECK_ENC_GET_U(8,  1, 0x0f, 0x01);
+	CHECK_ENC_GET_U(8,  3, 0xf0, 0x30);
+	CHECK_ENC_GET_U(8, 14, 0x0f, 0x0e);
+	CHECK_ENC_GET_U(8, 15, 0xf0, 0xf0);
+
+	CHECK_ENC_GET(32,  1, 0x00000f00, 0x00000100);
+	CHECK_ENC_GET(32,  3, 0x0000f000, 0x00003000);
+	CHECK_ENC_GET(32,  5, 0x000f0000, 0x00050000);
+	CHECK_ENC_GET(32,  7, 0x00f00000, 0x00700000);
+	CHECK_ENC_GET(32, 14, 0x0f000000, 0x0e000000);
+	CHECK_ENC_GET(32, 15, 0xf0000000, 0xf0000000);
+
+	CHECK_ENC_GET(64,  1, 0x00000f0000000000ull, 0x0000010000000000ull);
+	CHECK_ENC_GET(64,  3, 0x0000f00000000000ull, 0x0000300000000000ull);
+	CHECK_ENC_GET(64,  5, 0x000f000000000000ull, 0x0005000000000000ull);
+	CHECK_ENC_GET(64,  7, 0x00f0000000000000ull, 0x0070000000000000ull);
+	CHECK_ENC_GET(64, 14, 0x0f00000000000000ull, 0x0e00000000000000ull);
+	CHECK_ENC_GET(64, 15, 0xf000000000000000ull, 0xf000000000000000ull);
+
+	return 0;
+}
+
+#define CHECK(tp, mask) do {						\
+		u64 v;							\
+									\
+		for (v = 0; v < 1 << hweight32(mask); v++)		\
+			if (tp##_encode_bits(v, mask) != v << __ffs64(mask)) \
+				return -EINVAL;				\
+	} while (0)
+
+static int test_variables(void)
+{
+	CHECK(u8, 0x0f);
+	CHECK(u8, 0xf0);
+	CHECK(u8, 0x38);
+
+	CHECK(u16, 0x0038);
+	CHECK(u16, 0x0380);
+	CHECK(u16, 0x3800);
+	CHECK(u16, 0x8000);
+
+	CHECK(u32, 0x80000000);
+	CHECK(u32, 0x7f000000);
+	CHECK(u32, 0x07e00000);
+	CHECK(u32, 0x00018000);
+
+	CHECK(u64, 0x8000000000000000ull);
+	CHECK(u64, 0x7f00000000000000ull);
+	CHECK(u64, 0x0001800000000000ull);
+	CHECK(u64, 0x0000000080000000ull);
+	CHECK(u64, 0x000000007f000000ull);
+	CHECK(u64, 0x0000000018000000ull);
+	CHECK(u64, 0x0000001f8000000ull);
+
+	return 0;
+}
+
+static int __init test_bitfields(void)
+{
+	int ret = test_constants();
+
+	if (ret) {
+		pr_warn("constant tests failed!\n");
+		return ret;
+	}
+
+	ret = test_variables();
+	if (ret) {
+		pr_warn("variable tests failed!\n");
+		return ret;
+	}
+
+	pr_info("tests passed\n");
+
+	return 0;
+}
+module_init(test_bitfields)
+
+MODULE_AUTHOR("Johannes Berg <johannes@sipsolutions.net>");
+MODULE_LICENSE("GPL");
-- 
2.14.4

^ permalink raw reply related

* Re: [PATCH] tc, bpf: add option to dump bpf verifier as C program fragment
From: Jakub Kicinski @ 2018-06-18 20:18 UTC (permalink / raw)
  To: Ophir Munk
  Cc: netdev, Stephen Hemminger, David Ahern, Thomas Monjalon,
	Olga Shern
In-Reply-To: <1529225321-15429-1-git-send-email-ophirmu@mellanox.com>

On Sun, 17 Jun 2018 08:48:41 +0000, Ophir Munk wrote:
> Similar to cbpf used within tcpdump utility with a "-d" option to dump
> the compiled packet-matching code in a human readable form - tc has the
> "verbose" option to dump ebpf verifier output.
> Another useful option of cbpf using tcpdump "-dd" option is to dump
> packet-matching code a C program fragment. Similar to this - this commit
> adds a new tc ebpf option named "code" to dump ebpf verifier as C program
> fragment.
> 
> Existing "verbose" option sample output:
> 
> Verifier analysis:
> 0: (61) r2 = *(u32 *)(r1 +52)
> 1: (18) r3 = 0xdeadbeef
> 3: (63) *(u32 *)(r10 -4) = r3
> .
> .
> 11: (63) *(u32 *)(r1 +52) = r2
> 12: (18) r0 = 0xffffffff
> 14: (95) exit
> 
> New "code" option sample output:
> 
> /* struct bpf_insn cls_q_code[] = { */
> {0x61,    2,    1,       52, 0x00000000},
> {0x18,    3,    0,        0, 0xdeadbeef},
> {0x00,    0,    0,        0, 0x00000000},
> .
> .
> {0x63,    1,    2,       52, 0x00000000},
> {0x18,    0,    0,        0, 0xffffffff},
> {0x00,    0,    0,        0, 0x00000000},
> {0x95,    0,    0,        0, 0x00000000},
> 
> Signed-off-by: Ophir Munk <ophirmu@mellanox.com>

Hmm... printing C arrays looks like hacky integration with some C
code...  Would you not be better served by simply using libbpf in
whatever is consuming this output?

^ permalink raw reply

* Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver
From: Ilias Apalodimas @ 2018-06-18 20:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, grygorii.strashko, ivan.khoronzhuk, nsekhar, jiri,
	ivecera, f.fainelli, francois.ozog, yogeshs, spatton, Jose.Abreu
In-Reply-To: <20180618161627.GC5865@lunn.ch>

On Mon, Jun 18, 2018 at 06:16:27PM +0200, Andrew Lunn wrote:
> > @@ -2711,6 +2789,10 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
> >  	if (of_property_read_bool(node, "dual_emac"))
> >  		data->switch_mode = CPSW_DUAL_EMAC;
> >  
> > +	/* switchdev overrides DTS */
> > +	if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
> > +		data->switch_mode = CPSW_SWITCHDEV;
> > +
> 
> I know you discussed this a bit with Jiri, but i still think if
> 'dual_mac" is found, you should do dual mac. The DT clearly requests
> dual mac, and doing anything else is going to cause confusion.
> 
> The device tree binding is ambiguous what should happen when dual-mac
> is missing. So i would only enable swithdev mode in that case.
At the moment if no 'dual_emac;' is found on DTS the driver operates in "switch
mode". It only registers 1 ethernet interface with no ability (unless you patch
the kernel) to configure the switch. If we use DTS instead of a .config option
we should add parsing for something like 'switchdev;' in the DTS.
Jiri proposed using devlink, which makes sense, but i am not sure it's
applicable on this patchset. This will change the driver completely and will
totally break backwards compatibility.

Ideally i'd prefer something like:
1. Add a DTS option and continue the current behavior. I agree with you that
this will cause less confusion (in fact i prefer it for the current state of the
driver compared to the .config).
or
2. Keep the .config option which is better suited over DTS but might cause some
confusion.
> 
> But ideally, it should be a new driver with a new binding.
TI is better suited to comment on this. The work proposed here is mostly to
accomodate future TSN related configuration for switches. This patchset has
been tested against Ivan's patchset for CBS and is working as expected 
configuration wise.
(http://lkml.iu.edu/hypermail/linux/kernel/1806.1/05302.html)

Thanks
Ilias

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox