Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Russell King - ARM Linux admin @ 2020-06-18 22:21 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Ioana Ciornei, netdev@vger.kernel.org, davem@davemloft.net,
	Vladimir Oltean, Claudiu Manoil, Alexandru Marginean,
	michael@walle.cc, f.fainelli@gmail.com
In-Reply-To: <20200618220331.GA279339@lunn.ch>

On Fri, Jun 19, 2020 at 12:03:31AM +0200, Andrew Lunn wrote:
> > Are there really instances where the ethernet driver has to manage multiple
> > different types of PCSs? I am not sure this type of snippet of code is really
> > going to occur.
> 
> Hi Ioana
> 
> The Marvell mv88e6390 family has three PCS's, one for SGMII/1000BaseX,
> a 10Gbase-X4/X2 and a 10GBAse-R. So this sort of code could appear.

It in fact already does, but only for the 1G and 10GBase-R cases.
We don't have a PHY interface mode for 10Gbase-X4/X2, so I never
added that, but if it happens, we end up with a third case.

It may be tempting to suggest that moving the mv88e6390 PCS support
to common code would be a good idea, but it's really not - the PCS
don't follow IEEE 802.3 register layout.  The 1G registers are in
the PHYXS MMD, following Clause 22 layout at offset 0x2000.  The
10GBASE-R registers are in the PHYXS MMD at offset 0x1000.

So, I don't think it's sensible to compare the mv88e6390 switch
with this; the mv88e6390 serdes PCS is unlikely to be useful
outside of the DSA switch environment.

However, the Lynx PCS appears to be used in a range of different
situations, which include DSA switches and conventional ethernet
drivers.  That means we need some kind of solution where the code
driving the PCS does not rely on the code structures for the
device it is embedded with.

The solution I've suggested for DSA may help us get towards having
generic PCS drivers, but it doesn't fully solve the problem.  I
would ideally like DSA drivers to have access to "struct phylink"
so that they can attach the PCS themselves without having any of
the DSA layering veneers get in the way between phylink and the
PCS code - thereby allowing the PCS code to be re-used cleanly
with a conventional network driver.

Basically, I'm thinking is:

int phylink_attach_pcs(struct phylink *pl, const struct phylink_pcs_ops *ops,
		       void *pcs_priv);

which the PCS driver itself would call when requested to by the
DSA driver or ethernet driver.

Thoughts?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH v2] tc: qdisc: filter qdisc's by parent/handle specification
From: Stephen Hemminger @ 2020-06-18 22:17 UTC (permalink / raw)
  To: Anton Danilov; +Cc: netdev
In-Reply-To: <20200616063902.15605-1-littlesmilingcloud@gmail.com>

On Tue, 16 Jun 2020 09:39:02 +0300
Anton Danilov <littlesmilingcloud@gmail.com> wrote:

> +				fprintf(stderr,
> +					"only one of parent/handle/root/ingress can be specified\n");
> +				arg_error = true;
> +				break;

The concept is good, but it could be simplified.

You are repeating same error message several places and it is not
necessary to continue parsing after one bad argument.

See what invarg() does.


^ permalink raw reply

* Re: [PATCH v4 00/11] Add seccomp notifier ioctl that enables adding fds
From: Sargun Dhillon @ 2020-06-18 22:16 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Christian Brauner, David S. Miller,
	Christoph Hellwig, Tycho Andersen, Jakub Kicinski, Alexander Viro,
	Aleksa Sarai, Matt Denton, Jann Horn, Chris Palmer, Robert Sesek,
	Giuseppe Scrivano, Greg Kroah-Hartman, Andy Lutomirski,
	Will Drewry, Shuah Khan, netdev, containers, linux-api,
	linux-fsdevel, linux-kselftest
In-Reply-To: <20200616032524.460144-1-keescook@chromium.org>

On Mon, Jun 15, 2020 at 08:25:13PM -0700, Kees Cook wrote:
> Hello!
> 
> This is a bit of thread-merge between [1] and [2]. tl;dr: add a way for
> a seccomp user_notif process manager to inject files into the managed
> process in order to handle emulation of various fd-returning syscalls
> across security boundaries. Containers folks and Chrome are in need
> of the feature, and investigating this solution uncovered (and fixed)
> implementation issues with existing file sending routines.
> 
> I intend to carry this in the seccomp tree, unless someone has objections.
> :) Please review and test!
> 
> -Kees
> 
> [1] https://lore.kernel.org/lkml/20200603011044.7972-1-sargun@sargun.me/
> [2] https://lore.kernel.org/lkml/20200610045214.1175600-1-keescook@chromium.org/
> 
> Kees Cook (9):
>   net/scm: Regularize compat handling of scm_detach_fds()
>   fs: Move __scm_install_fd() to __fd_install_received()
>   fs: Add fd_install_received() wrapper for __fd_install_received()
>   pidfd: Replace open-coded partial fd_install_received()
>   fs: Expand __fd_install_received() to accept fd
>   selftests/seccomp: Make kcmp() less required
>   selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
>   seccomp: Switch addfd to Extensible Argument ioctl
>   seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
> 
This looks much cleaner than the original patchset. Thanks.

Reviewed-by: Sargun Dhillon <sargun@sargun.me>

on the pidfd, change fs* changes.

> Sargun Dhillon (2):
>   seccomp: Introduce addfd ioctl to seccomp user notifier
>   selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
> 
>  fs/file.c                                     |  65 ++++
>  include/linux/file.h                          |  16 +
>  include/uapi/linux/seccomp.h                  |  25 +-
>  kernel/pid.c                                  |  11 +-
>  kernel/seccomp.c                              | 181 ++++++++-
>  net/compat.c                                  |  55 ++-
>  net/core/scm.c                                |  50 +--
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 350 +++++++++++++++---
>  8 files changed, 618 insertions(+), 135 deletions(-)
> 
> -- 
> 2.25.1
> 

^ permalink raw reply

* [PATCH net-next 3/3] net/sched: cls_flower: Add hash info to flow classification
From: Ariel Levkovich @ 2020-06-18 22:15 UTC (permalink / raw)
  To: netdev
  Cc: jiri, kuba, jhs, xiyou.wangcong, ast, daniel, Ariel Levkovich,
	Jiri Pirko
In-Reply-To: <20200618221548.3805-1-lariel@mellanox.com>

Adding new cls flower keys for hash value and hash
mask and dissect the hash info from the skb into
the flow key towards flow classication.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 include/uapi/linux/pkt_cls.h |  3 +++
 net/sched/cls_flower.c       | 16 ++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 2fd93389d091..ef145320ee99 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -579,6 +579,9 @@ enum {
 
 	TCA_FLOWER_KEY_MPLS_OPTS,
 
+	TCA_FLOWER_KEY_HASH,		/* u32 */
+	TCA_FLOWER_KEY_HASH_MASK,	/* u32 */
+
 	__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index b2da37286082..ff739e0d86fc 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -64,6 +64,7 @@ struct fl_flow_key {
 		};
 	} tp_range;
 	struct flow_dissector_key_ct ct;
+	struct flow_dissector_key_hash hash;
 } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */
 
 struct fl_flow_mask_range {
@@ -318,6 +319,7 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		skb_flow_dissect_ct(skb, &mask->dissector, &skb_key,
 				    fl_ct_info_to_flower_map,
 				    ARRAY_SIZE(fl_ct_info_to_flower_map));
+		skb_flow_dissect_hash(skb, &mask->dissector, &skb_key);
 		skb_flow_dissect(skb, &mask->dissector, &skb_key, 0);
 
 		f = fl_mask_lookup(mask, &skb_key);
@@ -694,6 +696,9 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
 	[TCA_FLOWER_KEY_CT_LABELS_MASK]	= { .type = NLA_BINARY,
 					    .len = 128 / BITS_PER_BYTE },
 	[TCA_FLOWER_FLAGS]		= { .type = NLA_U32 },
+	[TCA_FLOWER_KEY_HASH]		= { .type = NLA_U32 },
+	[TCA_FLOWER_KEY_HASH_MASK]	= { .type = NLA_U32 },
+
 };
 
 static const struct nla_policy
@@ -1625,6 +1630,10 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 
 	fl_set_key_ip(tb, true, &key->enc_ip, &mask->enc_ip);
 
+	fl_set_key_val(tb, &key->hash.hash, TCA_FLOWER_KEY_HASH,
+		       &mask->hash.hash, TCA_FLOWER_KEY_HASH_MASK,
+		       sizeof(key->hash.hash));
+
 	if (tb[TCA_FLOWER_KEY_ENC_OPTS]) {
 		ret = fl_set_enc_opt(tb, key, mask, extack);
 		if (ret)
@@ -1739,6 +1748,8 @@ static void fl_init_dissector(struct flow_dissector *dissector,
 			     FLOW_DISSECTOR_KEY_ENC_OPTS, enc_opts);
 	FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 			     FLOW_DISSECTOR_KEY_CT, ct);
+	FL_KEY_SET_IF_MASKED(mask, keys, cnt,
+			     FLOW_DISSECTOR_KEY_HASH, hash);
 
 	skb_flow_dissector_init(dissector, keys, cnt);
 }
@@ -2959,6 +2970,11 @@ static int fl_dump_key(struct sk_buff *skb, struct net *net,
 	if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags))
 		goto nla_put_failure;
 
+	if (fl_dump_key_val(skb, &key->hash.hash, TCA_FLOWER_KEY_HASH,
+			     &mask->hash.hash, TCA_FLOWER_KEY_HASH_MASK,
+			     sizeof(key->hash.hash)))
+		goto nla_put_failure;
+
 	return 0;
 
 nla_put_failure:
-- 
2.25.2


^ permalink raw reply related

* [PATCH net-next 1/3] net/sched: Introduce action hash
From: Ariel Levkovich @ 2020-06-18 22:15 UTC (permalink / raw)
  To: netdev
  Cc: jiri, kuba, jhs, xiyou.wangcong, ast, daniel, Ariel Levkovich,
	Jiri Pirko
In-Reply-To: <20200618221548.3805-1-lariel@mellanox.com>

Allow setting a hash value to a packet for a future match.

The action will determine the packet's hash result according to
the selected hash type.

The first option is to select a basic asymmetric l4 hash calculation
on the packet headers which will either use the skb->hash value as
if such was already calculated and set by the device driver, or it
will perform the kernel jenkins hash function on the packet which will
generate the result otherwise.

The other option is for user to provide an BPF program which is
dedicated to calculate the hash. In such case the program is loaded
and used by tc to perform the hash calculation and provide it to
the hash action to be stored in skb->hash field.

The BPF option can be useful for future HW offload support of the hash
calculation by emulating the HW hash function when it's different than
the kernel's but yet we want to maintain consistency between the SW and
the HW.

Usage is as follows:

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 0 proto ip \
flower ip_proto tcp \
action hash bpf object-file <bpf file> \
action goto chain 2

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 0 proto ip \
flower ip_proto udp \
action hash asym_l4 basis <basis> \
action goto chain 2

Matching on the result:
$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower hash 0x0/0xf  \
action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower hash 0x1/0xf  \
action mirred egress redirect dev ens1f0_2

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/act_api.h               |   2 +
 include/net/tc_act/tc_hash.h        |  22 ++
 include/uapi/linux/pkt_cls.h        |   1 +
 include/uapi/linux/tc_act/tc_hash.h |  32 +++
 net/sched/Kconfig                   |  11 +
 net/sched/Makefile                  |   1 +
 net/sched/act_hash.c                | 376 ++++++++++++++++++++++++++++
 net/sched/cls_api.c                 |   1 +
 8 files changed, 446 insertions(+)
 create mode 100644 include/net/tc_act/tc_hash.h
 create mode 100644 include/uapi/linux/tc_act/tc_hash.h
 create mode 100644 net/sched/act_hash.c

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 8c3934880670..b7e5d060bd2f 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -12,6 +12,8 @@
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
+#define ACT_BPF_NAME_LEN	256
+
 struct tcf_idrinfo {
 	struct mutex	lock;
 	struct idr	action_idr;
diff --git a/include/net/tc_act/tc_hash.h b/include/net/tc_act/tc_hash.h
new file mode 100644
index 000000000000..f03bb19709e5
--- /dev/null
+++ b/include/net/tc_act/tc_hash.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __NET_TC_HASH_H
+#define __NET_TC_HASH_H
+
+#include <net/act_api.h>
+#include <uapi/linux/tc_act/tc_hash.h>
+
+struct tcf_hash_params {
+	enum tc_hash_alg alg;
+	u32 basis;
+	struct bpf_prog	*prog;
+	const char *bpf_name;
+	struct rcu_head	rcu;
+};
+
+struct tcf_hash {
+	struct tc_action common;
+	struct tcf_hash_params __rcu *hash_p;
+};
+#define to_hash(a) ((struct tcf_hash *)a)
+
+#endif /* __NET_TC_HASH_H */
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 7576209d96f9..2fd93389d091 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -135,6 +135,7 @@ enum tca_id {
 	TCA_ID_MPLS,
 	TCA_ID_CT,
 	TCA_ID_GATE,
+	TCA_ID_HASH,
 	/* other actions go here */
 	__TCA_ID_MAX = 255
 };
diff --git a/include/uapi/linux/tc_act/tc_hash.h b/include/uapi/linux/tc_act/tc_hash.h
new file mode 100644
index 000000000000..ff3063f70ee0
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_hash.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef __LINUX_TC_HASH_H
+#define __LINUX_TC_HASH_H
+
+#include <linux/pkt_cls.h>
+
+enum tc_hash_alg {
+	TCA_HASH_ALG_L4,
+	TCA_HASH_ALG_BPF,
+};
+
+enum {
+	TCA_HASH_UNSPEC,
+	TCA_HASH_TM,
+	TCA_HASH_PARMS,
+	TCA_HASH_ALG,
+	TCA_HASH_BASIS,
+	TCA_HASH_BPF_FD,
+	TCA_HASH_BPF_NAME,
+	TCA_HASH_BPF_ID,
+	TCA_HASH_BPF_TAG,
+	TCA_HASH_PAD,
+	__TCA_HASH_MAX,
+};
+#define TCA_HASH_MAX (__TCA_HASH_MAX - 1)
+
+struct tc_hash {
+	tc_gen;
+};
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 84badf00647e..e9725bb40f4f 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -993,6 +993,17 @@ config NET_ACT_GATE
 	  To compile this code as a module, choose M here: the
 	  module will be called act_gate.
 
+config NET_ACT_HASH
+	tristate "Hash calculation action"
+	depends on NET_CLS_ACT
+	help
+	  Say Y here to perform hash calculation on packet headers.
+
+	  If unsure, say N.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called act_hash.
+
 config NET_IFE_SKBMARK
 	tristate "Support to encoding decoding skb mark on IFE action"
 	depends on NET_ACT_IFE
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 66bbf9a98f9e..2d1415fb57db 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_NET_ACT_CONNMARK)	+= act_connmark.o
 obj-$(CONFIG_NET_ACT_CTINFO)	+= act_ctinfo.o
 obj-$(CONFIG_NET_ACT_SKBMOD)	+= act_skbmod.o
 obj-$(CONFIG_NET_ACT_IFE)	+= act_ife.o
+obj-$(CONFIG_NET_ACT_HASH)      += act_hash.o
 obj-$(CONFIG_NET_IFE_SKBMARK)	+= act_meta_mark.o
 obj-$(CONFIG_NET_IFE_SKBPRIO)	+= act_meta_skbprio.o
 obj-$(CONFIG_NET_IFE_SKBTCINDEX)	+= act_meta_skbtcindex.o
diff --git a/net/sched/act_hash.c b/net/sched/act_hash.c
new file mode 100644
index 000000000000..40a5c34f8745
--- /dev/null
+++ b/net/sched/act_hash.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* -
+ * net/sched/act_hash.c  Hash calculation action
+ *
+ * Author:   Ariel Levkovich <lariel@mellanox.com>
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/rtnetlink.h>
+#include <linux/skbuff.h>
+#include <linux/filter.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+#include <linux/tc_act/tc_hash.h>
+#include <net/tc_act/tc_hash.h>
+
+#define ACT_HASH_BPF_NAME_LEN	256
+
+static unsigned int hash_net_id;
+static struct tc_action_ops act_hash_ops;
+
+static int tcf_hash_act(struct sk_buff *skb, const struct tc_action *a,
+			struct tcf_result *res)
+{
+	struct tcf_hash *h = to_hash(a);
+	struct tcf_hash_params *p;
+	int action;
+	u32 hash;
+
+	tcf_lastuse_update(&h->tcf_tm);
+	tcf_action_update_bstats(&h->common, skb);
+
+	p = rcu_dereference_bh(h->hash_p);
+
+	action = READ_ONCE(h->tcf_action);
+
+	switch (p->alg) {
+	case TCA_HASH_ALG_L4:
+		hash = skb_get_hash(skb);
+		/* If a hash basis was provided, add it into
+		 * hash calculation here and re-set skb->hash
+		 * to the new result with sw_hash indication
+		 * and keeping the l4 hash indication.
+		 */
+		hash = jhash_1word(hash, p->basis);
+		__skb_set_sw_hash(skb, hash, skb->l4_hash);
+		break;
+	case TCA_HASH_ALG_BPF:
+		__skb_push(skb, skb->mac_len);
+		bpf_compute_data_pointers(skb);
+		hash = BPF_PROG_RUN(p->prog, skb);
+		__skb_pull(skb, skb->mac_len);
+		/* The BPF program hash function type is
+		 * unknown so only the sw hash bit is set.
+		 */
+		__skb_set_sw_hash(skb, hash, false);
+		break;
+	}
+
+	return action;
+}
+
+static const struct nla_policy hash_policy[TCA_HASH_MAX + 1] = {
+	[TCA_HASH_PARMS]	= { .type = NLA_EXACT_LEN, .len = sizeof(struct tc_hash) },
+	[TCA_HASH_ALG]		= { .type = NLA_U32 },
+	[TCA_HASH_BASIS]	= { .type = NLA_U32 },
+	[TCA_HASH_BPF_FD]	= { .type = NLA_U32 },
+	[TCA_HASH_BPF_NAME]	= { .type = NLA_NUL_STRING,
+				    .len = ACT_HASH_BPF_NAME_LEN },
+};
+
+static int tcf_hash_bpf_init(struct nlattr **tb, struct tcf_hash_params *params)
+{
+	struct bpf_prog *fp;
+	char *name = NULL;
+	u32 bpf_fd;
+
+	bpf_fd = nla_get_u32(tb[TCA_HASH_BPF_FD]);
+
+	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_SCHED_ACT);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	if (tb[TCA_HASH_BPF_NAME]) {
+		name = nla_memdup(tb[TCA_HASH_BPF_NAME], GFP_KERNEL);
+		if (!name) {
+			bpf_prog_put(fp);
+			return -ENOMEM;
+		}
+	}
+
+	params->bpf_name = name;
+	params->prog = fp;
+
+	return 0;
+}
+
+static void tcf_hash_bpf_cleanup(struct tcf_hash_params *params)
+{
+	if (params->prog)
+		bpf_prog_put(params->prog);
+
+	kfree(params->bpf_name);
+}
+
+static int tcf_hash_init(struct net *net, struct nlattr *nla,
+			 struct nlattr *est, struct tc_action **a,
+			 int replace, int bind, bool rtnl_held,
+			 struct tcf_proto *tp, u32 flags,
+			 struct netlink_ext_ack *extack)
+{
+	struct tc_action_net *tn = net_generic(net, hash_net_id);
+	struct tcf_hash_params *params, old;
+	struct nlattr *tb[TCA_HASH_MAX + 1];
+	struct tcf_chain *goto_ch = NULL;
+	struct tcf_hash_params *p = NULL;
+	struct tc_hash *parm;
+	struct tcf_hash *h;
+	int err, res = 0;
+	u32 index;
+
+	if (!nla) {
+		NL_SET_ERR_MSG_MOD(extack, "Hash requires attributes to be passed");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, TCA_HASH_MAX, nla, hash_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_HASH_PARMS]) {
+		NL_SET_ERR_MSG_MOD(extack, "Missing required hash parameters");
+		return -EINVAL;
+	}
+	parm = nla_data(tb[TCA_HASH_PARMS]);
+	index = parm->index;
+
+	err = tcf_idr_check_alloc(tn, &index, a, bind);
+	if (err < 0)
+		return err;
+
+	if (!err) {
+		err = tcf_idr_create_from_flags(tn, index, est, a,
+						&act_hash_ops, bind, flags);
+		if (err) {
+			tcf_idr_cleanup(tn, index);
+			return err;
+		}
+		res = ACT_P_CREATED;
+	} else {
+		if (bind)
+			return 0;
+
+		if (!replace) {
+			tcf_idr_release(*a, bind);
+			return -EEXIST;
+		}
+	}
+	err = tcf_action_check_ctrlact(parm->action, tp, &goto_ch, extack);
+	if (err < 0)
+		goto release_idr;
+
+	h = to_hash(*a);
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	if (unlikely(!p)) {
+		err = -ENOMEM;
+		goto cleanup;
+	}
+
+	if (!tb[TCA_HASH_ALG]) {
+		NL_SET_ERR_MSG_MOD(extack, "Missing hash algorithm selection");
+		err = -EINVAL;
+		goto cleanup;
+	}
+
+	p->alg = nla_get_u32(tb[TCA_HASH_ALG]);
+
+	spin_lock_bh(&h->tcf_lock);
+
+	switch (p->alg) {
+	case TCA_HASH_ALG_L4:
+		break;
+	case TCA_HASH_ALG_BPF:
+		if (res != ACT_P_CREATED) {
+			params = rcu_dereference_protected(h->hash_p, 1);
+			old.prog = params->prog;
+			old.bpf_name = params->bpf_name;
+		}
+
+		err = tcf_hash_bpf_init(tb, p);
+		if (err)
+			goto cleanup;
+
+		break;
+	default:
+		NL_SET_ERR_MSG_MOD(extack, "Hash type not supported");
+		err = -EINVAL;
+		goto cleanup;
+	}
+
+	if (tb[TCA_HASH_BASIS])
+		p->basis = nla_get_u32(tb[TCA_HASH_BASIS]);
+
+	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
+	p = rcu_replace_pointer(h->hash_p, p,
+				lockdep_is_held(&h->tcf_lock));
+	spin_unlock_bh(&h->tcf_lock);
+
+	if (goto_ch)
+		tcf_chain_put_by_act(goto_ch);
+	if (p)
+		kfree_rcu(p, rcu);
+
+	if (res == ACT_P_CREATED) {
+		tcf_idr_insert(tn, *a);
+	} else {
+		synchronize_rcu();
+		tcf_hash_bpf_cleanup(&old);
+	}
+
+	return res;
+
+cleanup:
+	if (goto_ch)
+		tcf_chain_put_by_act(goto_ch);
+	kfree(p);
+
+release_idr:
+	tcf_idr_release(*a, bind);
+	return err;
+}
+
+static void tcf_hash_cleanup(struct tc_action *a)
+{
+	struct tcf_hash *h = to_hash(a);
+	struct tcf_hash_params *p;
+
+	p = rcu_dereference_protected(h->hash_p, 1);
+	if (p) {
+		tcf_hash_bpf_cleanup(p);
+		kfree_rcu(p, rcu);
+	}
+}
+
+static int tcf_hash_dump(struct sk_buff *skb, struct tc_action *a,
+			 int bind, int ref)
+{
+	unsigned char *tp = skb_tail_pointer(skb);
+	struct tcf_hash *h = to_hash(a);
+	struct tcf_hash_params *p;
+	struct tc_hash opt = {
+		.index    = h->tcf_index,
+		.refcnt   = refcount_read(&h->tcf_refcnt) - ref,
+		.bindcnt  = atomic_read(&h->tcf_bindcnt) - bind,
+	};
+	struct nlattr *nla;
+	struct tcf_t tm;
+
+	spin_lock_bh(&h->tcf_lock);
+	opt.action = h->tcf_action;
+	p = rcu_dereference_protected(h->hash_p, lockdep_is_held(&h->tcf_lock));
+
+	if (nla_put(skb, TCA_HASH_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_HASH_ALG, p->alg))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_HASH_BASIS, p->basis))
+		goto nla_put_failure;
+
+	if (p->alg == TCA_HASH_ALG_BPF && p->bpf_name) {
+		if (nla_put_string(skb, TCA_HASH_BPF_NAME, p->bpf_name))
+			goto nla_put_failure;
+		if (nla_put_u32(skb, TCA_HASH_BPF_ID, p->prog->aux->id))
+			goto nla_put_failure;
+		nla = nla_reserve(skb, TCA_HASH_BPF_TAG, sizeof(p->prog->tag));
+		if (!nla)
+			goto nla_put_failure;
+
+		memcpy(nla_data(nla), p->prog->tag, nla_len(nla));
+	}
+
+	tcf_tm_dump(&tm, &h->tcf_tm);
+	if (nla_put_64bit(skb, TCA_HASH_TM, sizeof(tm), &tm,
+			  TCA_HASH_PAD))
+		goto nla_put_failure;
+
+	spin_unlock_bh(&h->tcf_lock);
+	return skb->len;
+
+nla_put_failure:
+	spin_unlock_bh(&h->tcf_lock);
+	nlmsg_trim(skb, tp);
+	return -1;
+}
+
+static int tcf_hash_walker(struct net *net, struct sk_buff *skb,
+			   struct netlink_callback *cb, int type,
+			   const struct tc_action_ops *ops,
+			   struct netlink_ext_ack *extack)
+{
+	struct tc_action_net *tn = net_generic(net, hash_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops, extack);
+}
+
+static int tcf_hash_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, hash_net_id);
+
+	return tcf_idr_search(tn, a, index);
+}
+
+static void tcf_hash_stats_update(struct tc_action *a, u64 bytes, u32 packets,
+				  u64 lastuse, bool hw)
+{
+	struct tcf_hash *h = to_hash(a);
+
+	tcf_action_update_stats(a, bytes, packets, false, hw);
+	h->tcf_tm.lastuse = max_t(u64, h->tcf_tm.lastuse, lastuse);
+}
+
+static struct tc_action_ops act_hash_ops = {
+	.kind		=       "hash",
+	.id		=       TCA_ID_HASH,
+	.owner		=       THIS_MODULE,
+	.act		=       tcf_hash_act,
+	.dump		=       tcf_hash_dump,
+	.init		=       tcf_hash_init,
+	.cleanup	=       tcf_hash_cleanup,
+	.walk		=       tcf_hash_walker,
+	.lookup		=       tcf_hash_search,
+	.stats_update	=       tcf_hash_stats_update,
+	.size		=       sizeof(struct tcf_hash),
+};
+
+static __net_init int hash_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, hash_net_id);
+
+	return tc_action_net_init(net, tn, &act_hash_ops);
+}
+
+static void __net_exit hash_exit_net(struct list_head *net_list)
+{
+	tc_action_net_exit(net_list, hash_net_id);
+}
+
+static struct pernet_operations hash_net_ops = {
+	.init = hash_init_net,
+	.exit_batch = hash_exit_net,
+	.id = &hash_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+static int __init hash_init_module(void)
+{
+	return tcf_register_action(&act_hash_ops, &hash_net_ops);
+}
+
+static void __exit hash_cleanup_module(void)
+{
+	tcf_unregister_action(&act_hash_ops, &hash_net_ops);
+}
+
+module_init(hash_init_module);
+module_exit(hash_cleanup_module);
+
+MODULE_AUTHOR("Ariel Levkovich <lariel@mellanox.com>");
+MODULE_DESCRIPTION("Packet hash action");
+MODULE_LICENSE("GPL v2");
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a00a203b2ef5..6d7eb249e557 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -40,6 +40,7 @@
 #include <net/tc_act/tc_ct.h>
 #include <net/tc_act/tc_mpls.h>
 #include <net/tc_act/tc_gate.h>
+#include <net/tc_act/tc_hash.h>
 #include <net/flow_offload.h>
 
 extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
-- 
2.25.2


^ permalink raw reply related

* [PATCH net-next 0/3] TC datapath hash api
From: Ariel Levkovich @ 2020-06-18 22:15 UTC (permalink / raw)
  To: netdev; +Cc: jiri, kuba, jhs, xiyou.wangcong, ast, daniel, Ariel Levkovich

Supporting datapath hash allows user to set up rules that provide
load balancing of traffic across multiple vports and for ECMP path
selection while keeping the number of rule at minimum.

Instead of matching on exact flow spec, which requires a rule per
flow, user can define rules based on hashing on the packet headers
and distribute the flows to different buckets. The number of rules
in this case will be constant and equal to the number of buckets.

The datapath hash functionality is achieved in two steps -
performing the hash action and then matching on the result, as
part of the packet's classification.

The api allows user to define a filter with a tc hash action
where the hash function can be standard asymetric hashing that Linux
offers or alternatively user can provide a bpf program that
performs hash calculation on a packet.

Usage is as follows:

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 0 proto ip \
flower ip_proto tcp \
action hash bpf object-file <file> \
action goto chain 2

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 0 proto ip \
flower ip_proto udp \
action hash bpf asym_l4 basis <basis> \
action goto chain 2

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower hash 0x0/0xf  \
action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_0 ingress \
prio 1 chain 2 proto ip \
flower hash 0x1/0xf  \
action mirred egress redirect dev ens1f0_2

Ariel Levkovich (3):
  net/sched: Introduce action hash
  net/flow_dissector: add packet hash dissection
  net/sched: cls_flower: Add hash info to flow classification

 include/linux/skbuff.h              |   4 +
 include/net/act_api.h               |   2 +
 include/net/flow_dissector.h        |   9 +
 include/net/tc_act/tc_hash.h        |  22 ++
 include/uapi/linux/pkt_cls.h        |   4 +
 include/uapi/linux/tc_act/tc_hash.h |  32 +++
 net/core/flow_dissector.c           |  17 ++
 net/sched/Kconfig                   |  11 +
 net/sched/Makefile                  |   1 +
 net/sched/act_hash.c                | 376 ++++++++++++++++++++++++++++
 net/sched/cls_api.c                 |   1 +
 net/sched/cls_flower.c              |  16 ++
 12 files changed, 495 insertions(+)
 create mode 100644 include/net/tc_act/tc_hash.h
 create mode 100644 include/uapi/linux/tc_act/tc_hash.h
 create mode 100644 net/sched/act_hash.c

-- 
2.25.2


^ permalink raw reply

* [PATCH net-next 2/3] net/flow_dissector: add packet hash dissection
From: Ariel Levkovich @ 2020-06-18 22:15 UTC (permalink / raw)
  To: netdev
  Cc: jiri, kuba, jhs, xiyou.wangcong, ast, daniel, Ariel Levkovich,
	Jiri Pirko
In-Reply-To: <20200618221548.3805-1-lariel@mellanox.com>

Retreive a hash value from the SKB and store it
in the dissector key for future matching.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 include/linux/skbuff.h       |  4 ++++
 include/net/flow_dissector.h |  9 +++++++++
 net/core/flow_dissector.c    | 17 +++++++++++++++++
 3 files changed, 30 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0c0377fc00c2..beb7fe2c7809 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1342,6 +1342,10 @@ skb_flow_dissect_tunnel_info(const struct sk_buff *skb,
 			     struct flow_dissector *flow_dissector,
 			     void *target_container);
 
+void skb_flow_dissect_hash(const struct sk_buff *skb,
+			   struct flow_dissector *flow_dissector,
+			   void *target_container);
+
 static inline __u32 skb_get_hash(struct sk_buff *skb)
 {
 	if (!skb->l4_hash && !skb->sw_hash)
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index a7eba43fe4e4..5cc0540ce3f7 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -243,6 +243,14 @@ struct flow_dissector_key_ct {
 	u32	ct_labels[4];
 };
 
+/**
+ * struct flow_dissector_key_hash:
+ * @hash: hash value
+ */
+struct flow_dissector_key_hash {
+	u32 hash;
+};
+
 enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
 	FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
@@ -271,6 +279,7 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_ENC_OPTS, /* struct flow_dissector_key_enc_opts */
 	FLOW_DISSECTOR_KEY_META, /* struct flow_dissector_key_meta */
 	FLOW_DISSECTOR_KEY_CT, /* struct flow_dissector_key_ct */
+	FLOW_DISSECTOR_KEY_HASH, /* struct flow_dissector_key_hash */
 
 	FLOW_DISSECTOR_KEY_MAX,
 };
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index d02df0b6d0d9..c114f0e3ef4f 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -392,6 +392,23 @@ skb_flow_dissect_tunnel_info(const struct sk_buff *skb,
 }
 EXPORT_SYMBOL(skb_flow_dissect_tunnel_info);
 
+void skb_flow_dissect_hash(const struct sk_buff *skb,
+			   struct flow_dissector *flow_dissector,
+			   void *target_container)
+{
+	struct flow_dissector_key_hash *key;
+
+	if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_HASH))
+		return;
+
+	key = skb_flow_dissector_target(flow_dissector,
+					FLOW_DISSECTOR_KEY_HASH,
+					target_container);
+
+	key->hash = skb_get_hash_raw(skb);
+}
+EXPORT_SYMBOL(skb_flow_dissect_hash);
+
 static enum flow_dissect_ret
 __skb_flow_dissect_mpls(const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
-- 
2.25.2


^ permalink raw reply related

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Andrew Lunn @ 2020-06-18 22:13 UTC (permalink / raw)
  To: Ioana Ciornei
  Cc: netdev, davem, vladimir.oltean, claudiu.manoil,
	alexandru.marginean, michael, linux, f.fainelli
In-Reply-To: <20200618120837.27089-5-ioana.ciornei@nxp.com>

>  MAINTAINERS                     |   7 +
>  drivers/net/phy/Kconfig         |   6 +
>  drivers/net/phy/Makefile        |   1 +
>  drivers/net/phy/mdio-lynx-pcs.c | 358 ++++++++++++++++++++++++++++++++
>  include/linux/mdio-lynx-pcs.h   |  43 ++++
>  5 files changed, 415 insertions(+)
>  create mode 100644 drivers/net/phy/mdio-lynx-pcs.c
>  create mode 100644 include/linux/mdio-lynx-pcs.h

Hi Ioana

We should think about naming convention here.

All MDIO bus driver, MDIO multiplexors etc use mdio- as a prefix.

This is not a bus driver, so i don't think it should use the mdio-
prefix. How about pcs-lynx.c?

In terms of Kconfig, MDIO_ prefix is used for MDIO bus drivers etc.  I
don't think it is appropriate here. How about PCS_LYNX? I don't think
any other subsystem is using PCS_ as a prefix.

> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -235,6 +235,12 @@ config MDIO_XPCS
>  	  This module provides helper functions for Synopsys DesignWare XPCS
>  	  controllers.
>  
> +config MDIO_LYNX_PCS
> +	bool
> +	help
> +	  This module provides helper functions for Lynx PCS enablement
> +	  representing the PCS as an MDIO device.
> +
>  endif
>  endif

Maybe add this at the end, and add a

comment "PCS device drivers"

before it? I'm assuming with time we will have more of these drivers.

       Andrew

^ permalink raw reply

* Re: [PATCH net-next] of: mdio: preserve phy dev_flags in of_phy_connect()
From: Florian Fainelli @ 2020-06-18 22:12 UTC (permalink / raw)
  To: rentao.bupt, Andrew Lunn, Heiner Kallweit, Russell King,
	Rob Herring, Frank Rowand, netdev, devicetree, linux-kernel,
	openbmc, taoren
In-Reply-To: <20200618220444.5064-1-rentao.bupt@gmail.com>



On 6/18/2020 3:04 PM, rentao.bupt@gmail.com wrote:
> From: Tao Ren <rentao.bupt@gmail.com>
> 
> Replace assignment "=" with OR "|=" for "phy->dev_flags" so "dev_flags"
> configured in phy probe() function can be preserved.
> 
> The idea is similar to commit e7312efbd5de ("net: phy: modify assignment
> to OR for dev_flags in phy_attach_direct").
> 
> Signed-off-by: Tao Ren <rentao.bupt@gmail.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH] bpf: Allow small structs to be type of function argument
From: John Fastabend @ 2020-06-18 22:06 UTC (permalink / raw)
  To: Jiri Olsa, John Fastabend, Andrii Nakryiko
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, netdev, bpf,
	Yonghong Song, Martin KaFai Lau, Jakub Kicinski, David Miller,
	Jesper Dangaard Brouer, KP Singh, Masanori Misono
In-Reply-To: <20200618114806.GA2369163@krava>

Jiri Olsa wrote:
> On Wed, Jun 17, 2020 at 04:20:54PM -0700, John Fastabend wrote:
> > Jiri Olsa wrote:
> > > This way we can have trampoline on function
> > > that has arguments with types like:
> > > 
> > >   kuid_t uid
> > >   kgid_t gid
> > > 
> > > which unwind into small structs like:
> > > 
> > >   typedef struct {
> > >         uid_t val;
> > >   } kuid_t;
> > > 
> > >   typedef struct {
> > >         gid_t val;
> > >   } kgid_t;
> > > 
> > > And we can use them in bpftrace like:
> > > (assuming d_path changes are in)
> > > 
> > >   # bpftrace -e 'lsm:path_chown { printf("uid %d, gid %d\n", args->uid, args->gid) }'
> > >   Attaching 1 probe...
> > >   uid 0, gid 0
> > >   uid 1000, gid 1000
> > >   ...
> > > 
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  kernel/bpf/btf.c | 12 +++++++++++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index 58c9af1d4808..f8fee5833684 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -362,6 +362,14 @@ static bool btf_type_is_struct(const struct btf_type *t)
> > >  	return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
> > >  }
> > >  
> > > +/* type is struct and its size is within 8 bytes
> > > + * and it can be value of function argument
> > > + */
> > > +static bool btf_type_is_struct_arg(const struct btf_type *t)
> > > +{
> > > +	return btf_type_is_struct(t) && (t->size <= sizeof(u64));
> > 
> > Can you comment on why sizeof(u64) here? The int types can be larger
> > than 64 for example and don't have a similar check, maybe the should
> > as well?
> > 
> > Here is an example from some made up program I ran through clang and
> > bpftool.
> > 
> > [2] INT '__int128' size=16 bits_offset=0 nr_bits=128 encoding=SIGNED
> > 
> > We also have btf_type_int_is_regular to decide if the int is of some
> > "regular" size but I don't see it used in these paths.
> 
> so this small structs are passed as scalars via function arguments,
> so the size limit is to fit teir value into register size which holds
> the argument
> 
> I'm not sure how 128bit numbers are passed to function as argument,
> but I think we can treat them separately if there's a need
> 

Moving Andrii up to the TO field ;)

Andrii, do we also need a guard on the int type with sizeof(u64)?
Otherwise the arg calculation might be incorrect? wdyt did I follow
along correctly.

^ permalink raw reply

* [PATCH net-next] of: mdio: preserve phy dev_flags in of_phy_connect()
From: rentao.bupt @ 2020-06-18 22:04 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Heiner Kallweit, Russell King,
	Rob Herring, Frank Rowand, netdev, devicetree, linux-kernel,
	openbmc, taoren
  Cc: Tao Ren

From: Tao Ren <rentao.bupt@gmail.com>

Replace assignment "=" with OR "|=" for "phy->dev_flags" so "dev_flags"
configured in phy probe() function can be preserved.

The idea is similar to commit e7312efbd5de ("net: phy: modify assignment
to OR for dev_flags in phy_attach_direct").

Signed-off-by: Tao Ren <rentao.bupt@gmail.com>
---
 drivers/of/of_mdio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index a04afe79529c..f5c46c72f4d3 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -380,7 +380,7 @@ struct phy_device *of_phy_connect(struct net_device *dev,
 	if (!phy)
 		return NULL;
 
-	phy->dev_flags = flags;
+	phy->dev_flags |= flags;
 
 	ret = phy_connect_direct(dev, phy, hndlr, iface);
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH] bpf: Allow small structs to be type of function argument
From: Alexei Starovoitov @ 2020-06-18 22:05 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: John Fastabend, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann,
	netdev, bpf, Yonghong Song, Martin KaFai Lau, Jakub Kicinski,
	David Miller, Jesper Dangaard Brouer, Andrii Nakryiko, KP Singh,
	Masanori Misono
In-Reply-To: <20200618114806.GA2369163@krava>

On Thu, Jun 18, 2020 at 01:48:06PM +0200, Jiri Olsa wrote:
> On Wed, Jun 17, 2020 at 04:20:54PM -0700, John Fastabend wrote:
> > Jiri Olsa wrote:
> > > This way we can have trampoline on function
> > > that has arguments with types like:
> > > 
> > >   kuid_t uid
> > >   kgid_t gid
> > > 
> > > which unwind into small structs like:
> > > 
> > >   typedef struct {
> > >         uid_t val;
> > >   } kuid_t;
> > > 
> > >   typedef struct {
> > >         gid_t val;
> > >   } kgid_t;
> > > 
> > > And we can use them in bpftrace like:
> > > (assuming d_path changes are in)

the patch doesn't seem to be related to d_path. Unless I'm missing something.

Please add a selftest. bpftrace example is nice, but selftest is still mandatory.

> > > 
> > >   # bpftrace -e 'lsm:path_chown { printf("uid %d, gid %d\n", args->uid, args->gid) }'
> > >   Attaching 1 probe...
> > >   uid 0, gid 0
> > >   uid 1000, gid 1000
> > >   ...
> > > 
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  kernel/bpf/btf.c | 12 +++++++++++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index 58c9af1d4808..f8fee5833684 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -362,6 +362,14 @@ static bool btf_type_is_struct(const struct btf_type *t)
> > >  	return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
> > >  }
> > >  
> > > +/* type is struct and its size is within 8 bytes
> > > + * and it can be value of function argument
> > > + */
> > > +static bool btf_type_is_struct_arg(const struct btf_type *t)
> > > +{
> > > +	return btf_type_is_struct(t) && (t->size <= sizeof(u64));

extra () are unnecessary.

the function needs different name. May btf_type_is_struct_by_value() ?

> > 
> > Can you comment on why sizeof(u64) here? The int types can be larger
> > than 64 for example and don't have a similar check, maybe the should
> > as well?
> > 
> > Here is an example from some made up program I ran through clang and
> > bpftool.
> > 
> > [2] INT '__int128' size=16 bits_offset=0 nr_bits=128 encoding=SIGNED
> > 
> > We also have btf_type_int_is_regular to decide if the int is of some
> > "regular" size but I don't see it used in these paths.
> 
> so this small structs are passed as scalars via function arguments,
> so the size limit is to fit teir value into register size which holds
> the argument
> 
> I'm not sure how 128bit numbers are passed to function as argument,
> but I think we can treat them separately if there's a need
> 
> jirka
> 
> > 
> > > +}
> > > +
> > >  static bool __btf_type_is_struct(const struct btf_type *t)
> > >  {
> > >  	return BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT;
> > > @@ -3768,7 +3776,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > >  	/* skip modifiers */
> > >  	while (btf_type_is_modifier(t))
> > >  		t = btf_type_by_id(btf, t->type);
> > > -	if (btf_type_is_int(t) || btf_type_is_enum(t))
> > > +	if (btf_type_is_int(t) || btf_type_is_enum(t) || btf_type_is_struct_arg(t))
> > >  		/* accessing a scalar */
> > >  		return true;

It probably needs to be x86 gated?
I don't think all archs do that for small structs.

What kind of code clang generates for bpf prog?
I don't remember what we told clang to do for struct by value.
That has to be carefully defined and tested.

^ permalink raw reply

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Andrew Lunn @ 2020-06-18 22:03 UTC (permalink / raw)
  To: Ioana Ciornei
  Cc: Russell King - ARM Linux admin, netdev@vger.kernel.org,
	davem@davemloft.net, Vladimir Oltean, Claudiu Manoil,
	Alexandru Marginean, michael@walle.cc, f.fainelli@gmail.com
In-Reply-To: <VI1PR0402MB38712F94BAAC32DB1C8AB7F8E09B0@VI1PR0402MB3871.eurprd04.prod.outlook.com>

> Are there really instances where the ethernet driver has to manage multiple
> different types of PCSs? I am not sure this type of snippet of code is really
> going to occur.

Hi Ioana

The Marvell mv88e6390 family has three PCS's, one for SGMII/1000BaseX,
a 10Gbase-X4/X2 and a 10GBAse-R. So this sort of code could appear.

  Andrew

^ permalink raw reply

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Russell King - ARM Linux admin @ 2020-06-18 22:01 UTC (permalink / raw)
  To: Ioana Ciornei
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Vladimir Oltean,
	Claudiu Manoil, Alexandru Marginean, michael@walle.cc,
	andrew@lunn.ch, f.fainelli@gmail.com
In-Reply-To: <VI1PR0402MB38712F94BAAC32DB1C8AB7F8E09B0@VI1PR0402MB3871.eurprd04.prod.outlook.com>

On Thu, Jun 18, 2020 at 05:34:49PM +0000, Ioana Ciornei wrote:
> I am not sure how this would work with Felix and DSA drivers in general
> since the DSA core is hiding the phylink_pcs_ops from the actual switch
> driver.

Here's an idea to work around DSA (untested):

diff --git a/include/linux/phylink.h b/include/linux/phylink.h
index 1f8e0023f4f4..f578a5bb8ad0 100644
--- a/include/linux/phylink.h
+++ b/include/linux/phylink.h
@@ -67,6 +67,7 @@ enum phylink_op_type {
 struct phylink_config {
 	struct device *dev;
 	enum phylink_op_type type;
+	void *pcs_private;
 	bool pcs_poll;
 	bool poll_fixed_state;
 	void (*get_fixed_state)(struct phylink_config *config,
diff --git a/include/net/dsa.h b/include/net/dsa.h
index a50d5007fd39..0fbb3a542b6e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -742,6 +742,9 @@ int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data);
 int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data);
 int dsa_port_get_phy_sset_count(struct dsa_port *dp);
 void dsa_port_phylink_mac_change(struct dsa_switch *ds, int port, bool up);
+void dsa_slave_attach_phylink_pcs(struct dsa_switch *ds, int port,
+				  const struct phylink_pcs_ops *ops,
+				  void *priv);
 
 struct dsa_tag_driver {
 	const struct dsa_device_ops *ops;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2ae17f95cb63..c80f62d88b63 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1619,6 +1619,17 @@ static int dsa_slave_phy_connect(struct net_device *slave_dev, int addr)
 	return phylink_connect_phy(dp->pl, slave_dev->phydev);
 }
 
+void dsa_slave_attach_phylink_pcs(struct dsa_switch *ds, int port,
+				  const struct phylink_pcs_ops *ops,
+				  void *priv)
+{
+	const struct dsa_port *dp = dsa_to_port(ds, port);
+
+	dp->pl_config.pcs_private = priv;
+	phylink_add_pcs(dp->pl, ops);
+}
+EXPORT_SYMBOL_GPL(dsa_slave_attach_phylink_pcs);
+
 static int dsa_slave_phy_setup(struct net_device *slave_dev)
 {
 	struct dsa_port *dp = dsa_slave_to_port(slave_dev);


dsa_slave_attach_phylink_pcs() can be passed anything as it's priv
argument - that could be "dp", or it could be a lynx PCS private
data structure, but that's up to the driver code to decide.

At least this gives a way for us to have some standardised PCS
code that can be bolted into either DSA or a MAC driver by the
higher levels without the PCS code having to know which it's
connected to, and without having to have veneers to bridge into
the PCS code.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply related

* Re: [PATCH bpf 2/2] selftests/bpf: add variable-length data concatenation pattern test
From: Andrii Nakryiko @ 2020-06-18 21:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Christoph Hellwig
In-Reply-To: <5eebbbef8f904_6d292ad5e7a285b883@john-XPS-13-9370.notmuch>

On Thu, Jun 18, 2020 at 12:09 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > Add selftest that validates variable-length data reading and concatentation
> > with one big shared data array. This is a common pattern in production use for
> > monitoring and tracing applications, that potentially can read a lot of data,
> > but usually reads much less. Such pattern allows to determine precisely what
> > amount of data needs to be sent over perfbuf/ringbuf and maximize efficiency.
> >
> > This is the first BPF selftest that at all looks at and tests
> > bpf_probe_read_str()-like helper's return value, closing a major gap in BPF
> > testing. It surfaced the problem with bpf_probe_read_kernel_str() returning
> > 0 on success, instead of amount of bytes successfully read.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
>
> [...]
>
> > +/* .data */
> > +int payload2_len1 = -1;
> > +int payload2_len2 = -1;
> > +int total2 = -1;
> > +char payload2[MAX_LEN + MAX_LEN] = { 1 };
> > +
> > +SEC("raw_tp/sys_enter")
> > +int handler64(void *regs)
> > +{
> > +     int pid = bpf_get_current_pid_tgid() >> 32;
> > +     void *payload = payload1;
> > +     u64 len;
> > +
> > +     /* ignore irrelevant invocations */
> > +     if (test_pid != pid || !capture)
> > +             return 0;
> > +
> > +     len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
> > +     if (len <= MAX_LEN) {
>
> Took me a bit grok this. You are relying on the fact that in errors,
> such as a page fault, will encode to a large u64 value and so you
> verifier is happy. But most of my programs actually want to distinguish
> between legitimate errors on the probe vs buffer overrun cases.

What buffer overrun? bpf_probe_read_str() family cannot return higher
value than MAX_LEN. If you want to detect truncated strings, then you
can attempt reading MAX_LEN + 1 and then check that the return result
is MAX_LEN exactly. But still, that would be something like:

u64 len;

len = bpf_probe_read_str(payload, MAX_LEN + 1, &buf);
if (len > MAX_LEN)
  return -1;
if (len == MAX_LEN) {
  /* truncated */
} else {
  /* full string */
}

>
> Can we make these tests do explicit check for errors. For example,
>
>   if (len < 0) goto abort;
>
> But this also breaks your types here. This is what I was trying to
> point out in the 1/2 patch thread. Wanted to make the point here as
> well in case it wasn't clear. Not sure I did the best job explaining.
>

I can write *a correct* C code in a lot of ways such that it will not
pass verifier verification, not sure what that will prove, though.

Have you tried using the pattern with two ifs with no-ALU32? Does it work?

Also you are cheating in your example (in patch #1 thread). You are
exiting on the first error and do not attempt to read any more data
after that. In practice, you want to get as much info as possible,
even if some of string reads fail (e.g., because argv might not be
paged in, but env is, or vice versa). So you'll end up doing this:

len = bpf_probe_read_str(...);
if (len >= 0 && len <= MAX_LEN) {
    payload += len;
}
...

... and of course it spectacularly fails in no-ALU32.

To be completely fair, this is a result of Clang optimization and
Yonghong is trying to deal with it as we speak. Switching int to long
for helpers doesn't help it either. But there are better code patterns
(unsigned len + single if check) that do work with both ALU32 and
no-ALU32.

And I just double-checked, this pattern keeps working for ALU32 with
both int and long types, so maybe there are unnecessary bit shifts,
but at least code is still verifiable.

So my point stands. int -> long helps in some cases and doesn't hurt
in others, so I argue that it's a good thing to do :)




> > +             payload += len;
> > +             payload1_len1 = len;
> > +     }
> > +
> > +     len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
> > +     if (len <= MAX_LEN) {
> > +             payload += len;
> > +             payload1_len2 = len;
> > +     }
> > +
> > +     total1 = payload - (void *)payload1;
> > +
> > +     return 0;
> > +}

^ permalink raw reply

* [PATCH net v2] ibmveth: Fix max MTU limit
From: Thomas Falcon @ 2020-06-18 21:55 UTC (permalink / raw)
  To: kuba; +Cc: netdev, linuxppc-dev, Thomas Falcon

The max MTU limit defined for ibmveth is not accounting for
virtual ethernet buffer overhead, which is twenty-two additional
bytes set aside for the ethernet header and eight additional bytes
of an opaque handle reserved for use by the hypervisor. Update the
max MTU to reflect this overhead.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more drivers")
Fixes: 110447f8269a ("ethernet: fix min/max MTU typos")
---
v2: Include Fixes tags suggested by Jakub Kicisnki
---
 drivers/net/ethernet/ibm/ibmveth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 96d36ae5049e..c5c732601e35 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1715,7 +1715,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 	}
 
 	netdev->min_mtu = IBMVETH_MIN_MTU;
-	netdev->max_mtu = ETH_MAX_MTU;
+	netdev->max_mtu = ETH_MAX_MTU - IBMVETH_BUFF_OH;
 
 	memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
 
-- 
2.26.2


^ permalink raw reply related

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
From: Jonathan Lemon @ 2020-06-18 21:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, kernel-team, axboe, Govindarajulu Varadarajan,
	Michal Kubecek
In-Reply-To: <4b0e0916-2910-373c-82cf-d912a82502a4@gmail.com>

On Thu, Jun 18, 2020 at 11:12:57AM -0700, Eric Dumazet wrote:
> 
> 
> On 6/18/20 9:09 AM, Jonathan Lemon wrote:
> > Adds a "rx_hd_split" private flag parameter to ethtool.
> > 
> > This enables header splitting, and sets up the fragment mappings.
> > The feature is currently only enabled for netgpu channels.
> 
> We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
> to enable TCP RX zerocopy on x86.
> 
> Patch for mlx4 has not been sent upstream yet.
> 
> For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
> and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.
> 
> (In our case, most of our data TCP packets only have 12 bytes of TCP options)
> 
> 
> I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
> so that the exact number of bytes can be tuned, instead of hard coded in the driver.

Sounds reasonable - in the long run, it would be ideal to have the
hardware actually perform header splitting, but for now using a tunable
fixed offset will work.  In the same vein, there should be a similar
setting for the TCP option padding on the sender side.
-- 
Jonathan

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: sk_storage: Prefer to get a free cache_idx
From: Alexei Starovoitov @ 2020-06-18 21:48 UTC (permalink / raw)
  To: John Fastabend
  Cc: Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Network Development
In-Reply-To: <5eeaf50fec904_38b82b28075185c44c@john-XPS-13-9370.notmuch>

On Wed, Jun 17, 2020 at 10:01 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Martin KaFai Lau wrote:
> > The cache_idx is currently picked by RR.  There is chance that
> > the same cache_idx will be picked by multiple sk_storage_maps while
> > other cache_idx is still unused.  e.g. It could happen when the
> > sk_storage_map is recreated during the restart of the user
> > space process.
> >
> > This patch tracks the usage count for each cache_idx.  There is
> > 16 of them now (defined in BPF_SK_STORAGE_CACHE_SIZE).
> > It will try to pick the free cache_idx.  If none was found,
> > it would pick one with the minimal usage count.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  net/core/bpf_sk_storage.c | 41 +++++++++++++++++++++++++++++++++++----
> >  1 file changed, 37 insertions(+), 4 deletions(-)
> >
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Applied. Thanks

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
From: Andrii Nakryiko @ 2020-06-18 21:41 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team
In-Reply-To: <5eebb95299a20_6d292ad5e7a285b835@john-XPS-13-9370.notmuch>

On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > On Wed, Jun 17, 2020 at 11:49 PM John Fastabend
> > <john.fastabend@gmail.com> wrote:
> > >
> > > Andrii Nakryiko wrote:
> > > > Switch most of BPF helper definitions from returning int to long. These
> > > > definitions are coming from comments in BPF UAPI header and are used to
> > > > generate bpf_helper_defs.h (under libbpf) to be later included and used from
> > > > BPF programs.
> > > >
> > > > In actual in-kernel implementation, all the helpers are defined as returning
> > > > u64, but due to some historical reasons, most of them are actually defined as
> > > > returning int in UAPI (usually, to return 0 on success, and negative value on
> > > > error).
> > >
> > > Could we change the helpers side to return correct types now? Meaning if the
> > > UAPI claims its an int lets actually return the int.
> >
> > I'm not sure how exactly you see this being done. BPF ABI dictates
> > that the helper's result is passed in a full 64-bit r0 register. Are
> > you suggesting that in addition to RET_ANYTHING we should add
> > RET_ANYTHING32 and teach verifier that higher 32 bits of r0 are
> > guaranteed to be zero? And then make helpers actually return 32-bit
> > values without up-casting them to u64?
>
> Yes this is what I was thinking, having a RET_ANYTHING32 but I would
> assume the upper 32-bits could be anything not zeroed. For +alu32
> and programmer using correct types I would expect clang to generate
> good code here and mostly not need to zero upper bits.
>
> I think this discussion can be independent of your changes though and
> its not at the top of my todo list so probably wont get to investigating
> more for awhile.

I'm confused. If the verifier doesn't make any assumptions about upper
32-bits for RET_ANYTHING32, how is it different from RET_ANYTHING and
today's logic? What you described is exactly what is happening when
bpf_helpers_def.h has BPF helpers defined as returning int.

>
> >
> > >
> > > >
> > > > This actually causes Clang to quite often generate sub-optimal code, because
> > > > compiler believes that return value is 32-bit, and in a lot of cases has to be
> > > > up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
> > > > before they can be used further in BPF code.
> > > >
> > > > Besides just "polluting" the code, these 32-bit shifts quite often cause
> > > > problems for cases in which return value matters. This is especially the case
> > > > for the family of bpf_probe_read_str() functions. There are few other similar
> > > > helpers (e.g., bpf_read_branch_records()), in which return value is used by
> > > > BPF program logic to record variable-length data and process it. For such
> > > > cases, BPF program logic carefully manages offsets within some array or map to
> > > > read variable-length data. For such uses, it's crucial for BPF verifier to
> > > > track possible range of register values to prove that all the accesses happen
> > > > within given memory bounds. Those extraneous zero-extending bit shifts,
> > > > inserted by Clang (and quite often interleaved with other code, which makes
> > > > the issues even more challenging and sometimes requires employing extra
> > > > per-variable compiler barriers), throws off verifier logic and makes it mark
> > > > registers as having unknown variable offset. We'll study this pattern a bit
> > > > later below.
> > >
> > > With latest verifier zext with alu32 support should be implemented as a
> > > MOV insn.
> >
> > Code generation is independent of verifier version or am I not getting
> > what you are saying? Also all this code was compiled with up-to-date
> > Clang.
>
> Agh sorry I read the example too fast and too late. The above is a typo
> what I meant is "With the latest _clang_ zext with alu32 support...". But,
> I also read your code example wrong and the left/right shift pattern here,
>
> > > >   19:   w1 = w0                              19:   r1 = 0 ll
> > > >   20:   r1 <<= 32
> > > >   21:   r1 s>>= 32
>
> above is a _signed_ zext to deal with the types. OK thanks for bearing with me
> on this one. Some more thoughts below.

No worries, just trying to ensure we are on the same page in discussion :)

>
> >
> > >
> > > >
> > > > Another common pattern is to check return of BPF helper for non-zero state to
> > > > detect error conditions and attempt alternative actions in such case. Even in
> > > > this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
> > > > quite often leads to sub-optimal and unnecessary extra code. We'll look at
> > > > this pattern as well.
> > > >
> > > > Clang's BPF target supports two modes of code generation: ALU32, in which it
> > > > is capable of using lower 32-bit parts of registers, and no-ALU32, in which
> > > > only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
> > > > above described problems, but not in all cases.
> > >
> > > A bit curious, do you see users running with no-ALU32 support? I have enabled
> > > it by default now. It seems to generate better code and with latest 32-bit
> > > bounds tracking I haven't hit any issues with verifier.
> >
> > Yes, all Facebook apps are built with no-ALU32. And those apps have to
> > run on quite old kernels as well, so relying on latest bug fixes in
> > kernel is not an option right now.
>
> OK got it.
>
> >
> > >
> > > >
> > > > This patch switches all the cases in which BPF helpers return 0 or negative
> > > > error from returning int to returning long. It is shown below that such change
> > > > in definition leads to equivalent or better code. No-ALU32 mode benefits more,
> > > > but ALU32 mode doesn't degrade or still gets improved code generation.
> > > >
> > > > Another class of cases switched from int to long are bpf_probe_read_str()-like
> > > > helpers, which encode successful case as non-negative values, while still
> > > > returning negative value for errors.
> > > >
> > > > In all of such cases, correctness is preserved due to two's complement
> > > > encoding of negative values and the fact that all helpers return values with
> > > > 32-bit absolute value. Two's complement ensures that for negative values
> > > > higher 32 bits are all ones and when truncated, leave valid negative 32-bit
> > > > value with the same value. Non-negative values have upper 32 bits set to zero
> > > > and similarly preserve value when high 32 bits are truncated. This means that
> > > > just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
> > > > require any extra shifts).
> > > >
> > > > To minimize the chances of regressions, two code patterns were investigated,
> > > > as mentioned above. For both patterns, BPF assembly was analyzed in
> > > > ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
> > > > new 64-bit long return type.
> > > >
> > > > Case 1. Variable-length data reading and concatenation. This is quite
> > > > ubiquitous pattern in tracing/monitoring applications, reading data like
> > > > process's environment variables, file path, etc. In such case, many pieces of
> > > > string-like variable-length data are read into a single big buffer, and at the
> > > > end of the process, only a part of array containing actual data is sent to
> > > > user-space for further processing. This case is tested in test_varlen.c
> > > > selftest (in the next patch). Code flow is roughly as follows:
> > > >
> > > >   void *payload = &sample->payload;
> > > >   u64 len;
> > > >
> > > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
> > > >   if (len <= MAX_SZ1) {
> > > >       payload += len;
> > > >       sample->len1 = len;
> > > >   }
> > > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
> > > >   if (len <= MAX_SZ2) {
> > > >       payload += len;
> > > >       sample->len2 = len;
> > > >   }
> > > >   /* and so on */
> > > >   sample->total_len = payload - &sample->payload;
> > > >   /* send over, e.g., perf buffer */
> > > >
> > > > There could be two variations with slightly different code generated: when len
> > > > is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
> > > > BPF assembly instructions between two successive invocations of
> > > > bpf_probe_read_kernel_str() were used to check code regressions. Results are
> > > > below, followed by short analysis. Left side is using helpers with int return
> > > > type, the right one is after the switch to long.
> > > >
> > > > ALU32 + INT                                ALU32 + LONG
> > > > ===========                                ============
> > > >
> > > > 64-BIT (13 insns):                         64-BIT (10 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
> > > >   19:   w1 = w0                              19:   r1 = 0 ll
> > > >   20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
> > > >   21:   r1 s>>= 32                           22:   r6 = 0 ll
> > >
> > > What version of clang is this? That is probably a zext in llvm-ir that in
> > > latest should be sufficient with the 'w1=w0'. I'm guessing (hoping?) you
> > > might not have latest?
> >
> > Just double-checked, very latest Clang, built today. Still generates
> > the same code.
> >
> > But I think this makes sense, because r1 is u64, and it gets assigned
> > from int, so int first has to be converted to s64, then casted to u64.
> > So sign extension is necessary. I've confirmed with this simple
> > program:
> >
> > $ cat bla.c
> > #include <stdio.h>
> >
> > int main() {
> >         int a = -1;
> >         unsigned long b = a;
> >         printf("%lx\n", b);
> >         return 0;
> > }
> > $ clang bla.c -o test && ./test
> > ffffffffffffffff
> >
> > ^^^^^^^^--- not zeroes
> >
> > So I don't think it's a bug or inefficiency, C language requires that.
>
> Agreed. Sorry for the confusion on my side. Poked at this a bit more this
> morning trying to see why I don't hit the same pattern when we have many
> cases very similar to above.
>
> In your C code you never check for negative return codes? Oops, this
> means you can walk backwards off the front of payload? This is probably
> not valid either from program logic side and/or verifier will probably
> yell. Commented where I think you want a <0 check here,

You are missing that I'm using unsigned u64. So (s64)-1 ==
(u64)0xFFFFFFFFFFFFFFFF. So negative errors are effectively turned
into too large length and I filter them out with the same (len >
MAX_SZ) check. This allows to do just one comparison instead of two,
and also helps avoid some Clang optimizations that Yonghong is trying
to undo right now (if (a > X && a < Y) turned into if (x < Y - X),
with assembly that verifier can't verify). So no bug there, very
deliberate choice of types.

>
>  void *payload = &sample->payload;
>  u64 len; // but how do you get a len < 0 check now with u64 type?
>
>  len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
>  // should insert a negative case check here
>  // if (len < 0) goto abort;
>  if (len <= MAX_SZ1) {
>         payload += len;
>         sample->len1 = len;
>  }
>  // without the <0 case you may have walked backwards in payload?
>  len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
>
> So in all of my C code I have something like this,
>
>  int myfunc(void *a, void *b, void *c) {
>         void *payload = a;
>         int len;
>
>         len = probe_read_str(payload, 1000, b);
>         if (len < 0) return len;
>         if (len <= 1000) { // yeah we have room
>                 ayload += len;
>         }
>         len = probe_read_str(payload, 1000, c);
>         [...]
>  }
>
> And then when I look at generated code I get this,
>
> ; int myfunc(void *a, void *b, void *c) {
>        4:       call 45
> ;       if (len < 0) return len;
>        5:       if w0 s< 0 goto +9 <LBB0_4>
> ;       if (len <= 1000) {
>        6:       w2 = w0
>        7:       r1 = r7
>        8:       r1 += r2
>        9:       if w0 s< 1001 goto +1 <LBB0_3>
>       10:       r1 = r7
>
> 0000000000000058 <LBB0_3>:
> ;       len = probe_read_str(payload, 1000, b);
>       11:       w2 = 1000
>       12:       r3 = r6
>       13:       call 45
>
>
> Here the <0 check means we can skip the sext and do a
> simple zext which is just a w2=w0 by bpf zero extension
> rules.

See above. In practice (it might be no-ALU32-only thing, don't know),
doing two ifs is both less efficient and quite often leads to
unverifiable code. Users have to do hacks to complicate control flow
enough to prevent Clang from doing Hi/Lo combining. I learned a new
inlined assembly trick recently to prevent this, but either way it's
unpleasant and unnecessary.

>
> >
> > >
> > > >   22:   r2 = 0 ll                            24:   r6 += r0
> > > >   24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
> > > >   25:   r6 = 0 ll                            25:   r1 = r6
> > > >   27:   r6 += r1                             26:   w2 = 256
> > > > 00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
> > > >   28:   r1 = r6                              29:   call 115
> > > >   29:   w2 = 256
> > > >   30:   r3 = 0 ll
> > > >   32:   call 115
> > > >
> > > > 32-BIT (11 insns):                         32-BIT (12 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
> > > >   19:   r1 = 0 ll                            19:   r1 = 0 ll
> > > >   21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
> > > >   22:   w1 = w0                              22:   r0 <<= 32
> > > >   23:   r6 = 0 ll                            23:   r0 >>= 32
> > > >   25:   r6 += r1                             24:   r6 = 0 ll
> > > > 00000000000000d0 <LBB1_4>:                   26:   r6 += r0
> > > >   26:   r1 = r6                            00000000000000d8 <LBB1_4>:
> > > >   27:   w2 = 256                             27:   r1 = r6
> > > >   28:   r3 = 0 ll                            28:   w2 = 256
> > > >   30:   call 115                             29:   r3 = 0 ll
> > > >                                              31:   call 115
> > > >
> > > > In ALU32 mode, the variant using 64-bit length variable clearly wins and
> > > > avoids unnecessary zero-extension bit shifts. In practice, this is even more
> > > > important and good, because BPF code won't need to do extra checks to "prove"
> > > > that payload/len are within good bounds.
> > >
> > > I bet with latest clang the shifts are removed. But if not we probably
> > > should fix clang regardless of if helpers return longs or ints.
> >
> > are we still talking about bit shifts for INT HELPER + U64 len case?
> > Or now about bit shifts in LONG HELPER + U32 len case?
>
> Forget I was confused. I put the <0 checks in there mentally and it messed up
> how I read your code.
>
> >
> > >
> > > >
> > > > 32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
> > > > with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
> > > > uses extra register. The latter might potentially lose some range information,
> > > > but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
> > > > after check at 18:, and shifting 32 bits left/right keeps that range intact.
> > > > We should probably look into Clang's logic and see why it chooses bitshifts
> > > > over sub-register assignments for this.
> > > >
> > > > NO-ALU32 + INT                             NO-ALU32 + LONG
> > > > ==============                             ===============
> > > >
> > > > 64-BIT (14 insns):                         64-BIT (10 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
> > > >   19:   r1 = r0                              19:   r1 = 0 ll
> > > >   20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
> > > >   21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
> > > >   22:   r0 s>>= 32                           24:   r6 += r0
> > > >   23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
> > > >   25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
> > > >   26:   r6 = 0 ll                            26:   r2 = 256
> > > >   28:   r6 += r0                             27:   r3 = 0 ll
> > > > 00000000000000e8 <LBB0_4>:                   29:   call 115
> > > >   29:   r1 = r6
> > > >   30:   r2 = 256
> > > >   31:   r3 = 0 ll
> > > >   33:   call 115
> > > >
> > > > 32-BIT (13 insns):                         32-BIT (13 insns):
> > > > ------------------------------------       ------------------------------------
> > > >   17:   call 115                             17:   call 115
> > > >   18:   r1 = r0                              18:   r1 = r0
> > > >   19:   r1 <<= 32                            19:   r1 <<= 32
> > > >   20:   r1 >>= 32                            20:   r1 >>= 32
> > > >   21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
> > > >   22:   r2 = 0 ll                            22:   r2 = 0 ll
> > > >   24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
> > > >   25:   r6 = 0 ll                            25:   r6 = 0 ll
> > > >   27:   r6 += r1                             27:   r6 += r1
> > > > 00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
> > > >   28:   r1 = r6                              28:   r1 = r6
> > > >   29:   r2 = 256                             29:   r2 = 256
> > > >   30:   r3 = 0 ll                            30:   r3 = 0 ll
> > > >   32:   call 115                             32:   call 115
> > > >
> > > > In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much
> > > > superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
> > > > len, code is identical.
> > >
> > > Right I can't think of any way clang can avoid it here. OTOH I fix this
> > > by enabling alu32 ;)
> > >
> > > >
> > > > So overall, only ALU-32 32-bit len case is more-or-less equivalent and the
> > > > difference stems from internal Clang decision, rather than compiler lacking
> > > > enough information about types.
> > > >
> > > > Case 2. Let's look at the simpler case of checking return result of BPF helper
> > > > for errors. The code is very simple:
> > > >
> > > >   long bla;
> > > >   if (bpf_probe_read_kenerl(&bla, sizeof(bla), 0))
> > > >       return 1;
> > > >   else
> > > >       return 0;
> > > >
> > > > ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
> > > > ====================================       ====================================
> > > >   0:    r1 = r10                             0:    r1 = r10
> > > >   1:    r1 += -8                             1:    r1 += -8
> > > >   2:    w2 = 8                               2:    w2 = 8
> > > >   3:    r3 = 0                               3:    r3 = 0
> > > >   4:    call 113                             4:    call 113
> > > >   5:    w1 = w0                              5:    r1 = r0
> > > >   6:    w0 = 1                               6:    w0 = 1
> > > >   7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
> > > >   8:    w0 = 0                               8:    w0 = 0
> > > > 0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
> > > >   9:    exit                                 9:    exit
> > > >
> > > > Almost identical code, the only difference is the use of full register
> > > > assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
> > > > architectures, new BPF assembly might be slightly less optimal, in theory. But
> > > > one can argue that's not a big issue, given that use of full registers is
> > > > still prevalent (e.g., for parameter passing).
> > > >
> > > > NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
> > > > ====================================       ====================================
> > > >   0:    r1 = r10                             0:    r1 = r10
> > > >   1:    r1 += -8                             1:    r1 += -8
> > > >   2:    r2 = 8                               2:    r2 = 8
> > > >   3:    r3 = 0                               3:    r3 = 0
> > > >   4:    call 113                             4:    call 113
> > > >   5:    r1 = r0                              5:    r1 = r0
> > > >   6:    r1 <<= 32                            6:    r0 = 1
> > > >   7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
> > > >   8:    r0 = 1                               8:    r0 = 0
> > > >   9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
> > > >  10:    r0 = 0                               9:    exit
> > > > 0000000000000058 <LBB2_2>:
> > > >  11:    exit
> > > >
> > > > NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
> > > > shifts.
> > >
> > > It seems a win for the NO-ALU32 case but for the +ALU32 case I think its
> > > the same with latest clang although I haven't tried yet. I was actually
> > > considering going the other way and avoiding always returning u64 on
> > > the other side. From a purely aesethetics point of view I prefer the
> > > int type because it seems more clear/standard C. I'm also not so interested
> > > in optimizing the no-alu32 case but curious if there is a use case for
> > > that?
> >
> > My point was that this int -> long switch doesn't degrade ALU32 and
> > helps no-ALU32, and thus is good :)
>
> With the long vs int I do see worse code when using the <0 check.
> Using C function below which I took from some real code and renamed
> variables.
>
> int myfunc(void *a, void *b, void *c) {
>         void *payload = a;
>         int len;
>
>         len = probe_read_str(payload, 1000, a);
>         if (len < 0) return len;
>         if (len <= 1000) {
>                 payload += len;
>         }
>         len = probe_read_str(payload, 1000, b);
>         if (len <= 1000) {
>                 payload += len;
>         }
>         return 1;
> }
>
> Then here is the side-by-side of generated code, with +ALU32.
>
>   int BPF_FUNC(probe_read, ...                  long BPF_FUNC(probe_read, ...
> -------------------------------                ---------------------------------
>        0:       r6 = r2                         0:      r6 = r2
>        1:       r7 = r1                         1:      r7 = r1
>        2:       w2 = 1000                       2:      w2 = 1000
>        3:       r3 = r7                         3:      r3 = r7
>        4:       call 45                         4:      call 45
>        5:       if w0 s< 0 goto +9 <LBB0_4>     5:      r2 = r0
>        6:       w2 = w0                         6:      if w0 s< 0 goto +10 <LBB0_4>
>        7:       r1 = r7                         7:      r2 <<= 32
>        8:       r1 += r2                        8:      r2 s>>= 32
>        9:       if w0 s< 1001 goto +1 <LBB0_3>  9:      r1 = r7
>       10:       r1 = r7                        10:      r1 += r2
>       11:       w2 = 1000                      11:      if w0 s< 1001 goto +1 <LBB0_3>
>       12:       r3 = r6                        12:      r1 = r7
>       13:       call 45                        13:      w2 = 1000
>       14:       w0 = 1                         14:      r3 = r6
>       15:       exit                           15:      call 45
>                                                16:      w0 = 1
>                                                17:      exit
>
> So a couple extra instruction, but more concerning we created a
> <<=,s>> pattern. I'll need to do some more tests but my concern
> is that could break verifier for real programs we have. I guess
> it didn't in the selftests? Surely, this thread has at least
> pointed out some gaps in our test cases. I guess I _could_ make
> len a u64 type to remove the sext but then <0 check on a u64?!

I addressed <0 check above. As for <<=,s>>=, I wish Clang was a bit
smarter and just did w2 = w2 or something like that, given we just
checked that w0 is non-negative. But then again, I wouldn't do two ifs
and wouldn't use signed int for len.

>
> >
> > Overall, long as a return type matches reality and BPF ABI
> > specification. BTW, one of the varlen programs from patch 2 doesn't
> > even validate successfully on latest kernel with latest Clang right
> > now, if helpers return int, even though it's completely correct code.
> > That's a real problem we have to deal with in few major BPF
> > applications right now, and we have to use inline assembly to enforce
> > Clang to do the right thing. A bunch of those problems are simply
> > avoided with correct return types for helpers.
>
> Do the real programs check <0? Did I miss something? I'll try
> applying your patch to our real code base and see what happens.

That would be great. Self-tests do work, but having more testing with
real-world application would certainly help as well.

^ permalink raw reply

* Re: [PATCH bpf-next v4 3/3] bpf: add SO_KEEPALIVE and related options to bpf_setsockopt
From: Alexei Starovoitov @ 2020-06-18 21:33 UTC (permalink / raw)
  To: Dmitry Yakunin
  Cc: Daniel Borkmann, David S. Miller, Lawrence Brakmo, Eric Dumazet,
	Martin KaFai Lau, bpf, Network Development
In-Reply-To: <20200617110217.35669-3-zeil@yandex-team.ru>

On Wed, Jun 17, 2020 at 4:02 AM Dmitry Yakunin <zeil@yandex-team.ru> wrote:
>
> This patch adds support of SO_KEEPALIVE flag and TCP related options
> to bpf_setsockopt() routine. This is helpful if we want to enable or tune
> TCP keepalive for applications which don't do it in the userspace code.
>
> v2:
>   - update kernel-doc (Nikita Vetoshkin <nekto0n@yandex-team.ru>)
>
> Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
> Acked-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/uapi/linux/bpf.h |  7 +++++--
>  net/core/filter.c        | 36 +++++++++++++++++++++++++++++++++++-
>  2 files changed, 40 insertions(+), 3 deletions(-)

Please update tools/include/uapi/linux/bpf.h as well.

Also please add a selftest for at least some new opts.

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Roman Gushchin @ 2020-06-18 21:26 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <CAM_iQpWuNnHqNHKz5FMgAXoqQ5qGDEtNbBKDXpmpeNSadCZ-1w@mail.gmail.com>

On Thu, Jun 18, 2020 at 02:09:43PM -0700, Cong Wang wrote:
> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > > >
> > > > Cc: Roman Gushchin <guro@fb.com>
> > > >
> > > > Thanks for fixing this.
> > > >
> > > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > > even when cgroup_sk_alloc is disabled.
> > > > >
> > > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > > would terminate this function if called there. And for sk_alloc()
> > > > > skcd->val is always zero. So it's safe to factor out the code
> > > > > to make it more readable.
> > > > >
> > > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > > >
> > > > but I don't think the bug was introduced by this commit, because there
> > > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > > write_classid(), which can be triggered by writing to ifpriomap or
> > > > classid in cgroupfs. This commit just made it much easier to happen
> > > > with systemd invovled.
> > > >
> > > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> > >
> > > Good point.
> > >
> > > I take a deeper look, it looks like commit d979a39d7242e06
> > > is the one to blame, because it is the first commit that began to
> > > hold cgroup refcnt in cgroup_sk_alloc().
> >
> > I agree, ut seems that the issue is not related to bpf and probably
> > can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > seems closer to the origin.
> 
> Yeah, I will update the Fixes tag and send V2.
> 
> >
> > Btw, based on the number of reported-by tags it seems that there was
> > a real issue which the patch is fixing. Maybe you'll a couple of words
> > about how it reveals itself in the real life?
> 
> I still have no idea how exactly this is triggered. According to the
> people who reported this bug, they just need to wait for some hours
> to trigger. So I am not sure what to add here, just the stack trace?

Yeah, stack trace is definitely useful. So at least if someone will encounter the same
error in the future, they can search for the stacktrace and find the fix.

Thanks!

^ permalink raw reply

* [PATCH net] r8169: fix firmware not resetting tp->ocp_base
From: Heiner Kallweit @ 2020-06-18 21:25 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller, Jakub Kicinski,
	Aaron Ma
  Cc: netdev@vger.kernel.org

Typically the firmware takes care that tp->ocp_base is reset to its
default value. That's not the case (at least) for RTL8117.
As a result subsequent PHY access reads/writes the wrong page and
the link is broken. Fix this be resetting tp->ocp_base explicitly.

Fixes: 229c1e0dfd3d ("r8169: load firmware for RTL8168fp/RTL8117")
Reported-by: Aaron Ma <mapengyu@gmail.com>
Tested-by: Aaron Ma <mapengyu@gmail.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index a3c4187d9..98391797b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -2116,8 +2116,11 @@ static void rtl_release_firmware(struct rtl8169_private *tp)
 void r8169_apply_firmware(struct rtl8169_private *tp)
 {
 	/* TODO: release firmware if rtl_fw_write_firmware signals failure. */
-	if (tp->rtl_fw)
+	if (tp->rtl_fw) {
 		rtl_fw_write_firmware(tp, tp->rtl_fw);
+		/* At least one firmware doesn't reset tp->ocp_base. */
+		tp->ocp_base = OCP_STD_PHY_BASE;
+	}
 }
 
 static void rtl8168_config_eee_mac(struct rtl8169_private *tp)
-- 
2.27.0


^ permalink raw reply related

* [net-next PATCH] net: Avoid overwriting valid skb->napi_id
From: Amritha Nambiar @ 2020-06-18 21:22 UTC (permalink / raw)
  To: netdev, davem
  Cc: edumazet, alexander.h.duyck, eliezer.tamir, sridhar.samudrala,
	amritha.nambiar

This will be useful to allow busy poll for tunneled traffic. In case of
busy poll for sessions over tunnels, the underlying physical device's
queues need to be polled.

Tunnels schedule NAPI either via netif_rx() for backlog queue or
schedule the gro_cell_poll(). netif_rx() propagates the valid skb->napi_id
to the socket. OTOH, gro_cell_poll() stamps the skb->napi_id again by
calling skb_mark_napi_id() with the tunnel NAPI which is not a busy poll
candidate. This was preventing tunneled traffic to use busy poll. A valid
NAPI ID in the skb indicates it was already marked for busy poll by a
NAPI driver and hence needs to be copied into the socket.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/net/busy_poll.h |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 86e028388bad..b001fa91c14e 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -114,7 +114,11 @@ static inline void skb_mark_napi_id(struct sk_buff *skb,
 				    struct napi_struct *napi)
 {
 #ifdef CONFIG_NET_RX_BUSY_POLL
-	skb->napi_id = napi->napi_id;
+	/* If the skb was already marked with a valid NAPI ID, avoid overwriting
+	 * it.
+	 */
+	if (skb->napi_id < MIN_NAPI_ID)
+		skb->napi_id = napi->napi_id;
 #endif
 }
 


^ permalink raw reply related

* RE: [PATCH 02/11] bpf: Compile btfid tool at kernel compilation start
From: John Fastabend @ 2020-06-18 21:17 UTC (permalink / raw)
  To: John Fastabend, Jiri Olsa, Alexei Starovoitov, Daniel Borkmann
  Cc: netdev, bpf, Song Liu, Yonghong Song, Martin KaFai Lau,
	David Miller, John Fastabend, Wenbo Zhang, KP Singh,
	Andrii Nakryiko, Brendan Gregg, Florent Revest, Al Viro
In-Reply-To: <5eebd1486e46b_6d292ad5e7a285b817@john-XPS-13-9370.notmuch>

John Fastabend wrote:
> Jiri Olsa wrote:
> > The btfid tool will be used during the vmlinux linking,
> > so it's necessary it's ready for it.
> > 
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  Makefile           | 22 ++++++++++++++++++----
> >  tools/Makefile     |  3 +++
> >  tools/bpf/Makefile |  5 ++++-
> >  3 files changed, 25 insertions(+), 5 deletions(-)
> 
> This breaks the build for me. I fixed it with this but then I get warnings,

Also maybe fix below is not good because now I get a segfault in btfid.

> 
> diff --git a/tools/bpf/btfid/btfid.c b/tools/bpf/btfid/btfid.c
> index 7cdf39bfb150..3697e8ae9efa 100644
> --- a/tools/bpf/btfid/btfid.c
> +++ b/tools/bpf/btfid/btfid.c
> @@ -48,7 +48,7 @@
>  #include <errno.h>
>  #include <linux/rbtree.h>
>  #include <linux/zalloc.h>
> -#include <btf.h>
> +#include <linux/btf.h>
>  #include <libbpf.h>
>  #include <parse-options.h>
> 
> Here is the error. Is it something about my setup? bpftool/btf.c uses
> <btf.h>. Because this in top-level Makefile we probably don't want to
> push extra setup onto folks.
> 
> In file included from btfid.c:51:
> /home/john/git/bpf-next/tools/lib/bpf/btf.h: In function ‘btf_is_var’:
> /home/john/git/bpf-next/tools/lib/bpf/btf.h:254:24: error: ‘BTF_KIND_VAR’ undeclared (first use in this function); did you mean ‘BTF_KIND_PTR’?
>   return btf_kind(t) == BTF_KIND_VAR;
>                         ^~~~~~~~~~~~
>                         BTF_KIND_PTR
> /home/john/git/bpf-next/tools/lib/bpf/btf.h:254:24: note: each undeclared identifier is reported only once for each function it appears in
> /home/john/git/bpf-next/tools/lib/bpf/btf.h: In function ‘btf_is_datasec’:
> /home/john/git/bpf-next/tools/lib/bpf/btf.h:259:24: error: ‘BTF_KIND_DATASEC’ undeclared (first use in this function); did you mean ‘BTF_KIND_PTR’?
>   return btf_kind(t) == BTF_KIND_DATASEC;
>                         ^~~~~~~~~~~~~~~~
>                         BTF_KIND_PTR
> mv: cannot stat '/home/john/git/bpf-next/tools/bpf/btfid/.btfid.o.tmp': No such file or directory
> make[3]: *** [/home/john/git/bpf-next/tools/build/Makefile.build:97: /home/john/git/bpf-next/tools/bpf/btfid/btfid.o] Error 1
> make[2]: *** [Makefile:59: /home/john/git/bpf-next/tools/bpf/btfid/btfid-in.o] Error 2
> make[1]: *** [Makefile:71: bpf/btfid] Error 2
> make: *** [Makefile:1894: tools/bpf/btfid] Error 2



^ permalink raw reply

* Re: [net-next 13/15] iecm: Add ethtool
From: Michal Kubecek @ 2020-06-18 21:17 UTC (permalink / raw)
  To: netdev
  Cc: Jeff Kirsher, davem, Alice Michael, nhorman, sassmann, Alan Brady,
	Phani Burra, Joshua Hay, Madhu Chittim, Pavan Kumar Linga,
	Donald Skidmore, Jesse Brandeburg, Sridhar Samudrala
In-Reply-To: <20200618051344.516587-14-jeffrey.t.kirsher@intel.com>

On Wed, Jun 17, 2020 at 10:13:42PM -0700, Jeff Kirsher wrote:
> From: Alice Michael <alice.michael@intel.com>
> 
> Implement ethtool interface for the common module.
> 
> Signed-off-by: Alice Michael <alice.michael@intel.com>
> Signed-off-by: Alan Brady <alan.brady@intel.com>
> Signed-off-by: Phani Burra <phani.r.burra@intel.com>
> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
> Signed-off-by: Madhu Chittim <madhu.chittim@intel.com>
> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
> Reviewed-by: Donald Skidmore <donald.c.skidmore@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
[...]
> +static int iecm_set_channels(struct net_device *netdev,
> +			     struct ethtool_channels *ch)
> +{
> +	struct iecm_vport *vport = iecm_netdev_to_vport(netdev);
> +	int num_req_q = ch->combined_count;
> +
> +	if (num_req_q == max(vport->num_txq, vport->num_rxq))
> +		return 0;
> +
> +	/* All of these should have already been checked by ethtool before this
> +	 * even gets to us, but just to be sure.
> +	 */
> +	if (num_req_q <= 0 || num_req_q > IECM_MAX_Q)
> +		return -EINVAL;
> +
> +	if (ch->rx_count || ch->tx_count || ch->other_count != IECM_MAX_NONQ)
> +		return -EINVAL;

I don't see much sense in duplicating the checks. Having the checks in
common code allows us to simplify driver callbacks.

> +	vport->adapter->config_data.num_req_qs = num_req_q;
> +
> +	return iecm_initiate_soft_reset(vport, __IECM_SR_Q_CHANGE);
> +}
[...]
> +static int iecm_set_ringparam(struct net_device *netdev,
> +			      struct ethtool_ringparam *ring)
> +{
> +	struct iecm_vport *vport = iecm_netdev_to_vport(netdev);
> +	u32 new_rx_count, new_tx_count;
> +
> +	if (ring->rx_mini_pending || ring->rx_jumbo_pending)
> +		return -EINVAL;
> +
> +	new_tx_count = clamp_t(u32, ring->tx_pending,
> +			       IECM_MIN_TXQ_DESC,
> +			       IECM_MAX_TXQ_DESC);
> +	new_tx_count = ALIGN(new_tx_count, IECM_REQ_DESC_MULTIPLE);
> +
> +	new_rx_count = clamp_t(u32, ring->rx_pending,
> +			       IECM_MIN_RXQ_DESC,
> +			       IECM_MAX_RXQ_DESC);
> +	new_rx_count = ALIGN(new_rx_count, IECM_REQ_DESC_MULTIPLE);

Same here. This is actually a bit misleading as it seems that count
exceeding maximum would be silently clamped to it but such value would
be rejected by common code.

> +	/* if nothing to do return success */
> +	if (new_tx_count == vport->txq_desc_count &&
> +	    new_rx_count == vport->rxq_desc_count)
> +		return 0;
> +
> +	vport->adapter->config_data.num_req_txq_desc = new_tx_count;
> +	vport->adapter->config_data.num_req_rxq_desc = new_rx_count;
> +
> +	return iecm_initiate_soft_reset(vport, __IECM_SR_Q_DESC_CHANGE);
> +}
[...]
> +/* For now we have one and only one private flag and it is only defined
> + * when we have support for the SKIP_CPU_SYNC DMA attribute.  Instead
> + * of leaving all this code sitting around empty we will strip it unless
> + * our one private flag is actually available.
> + */

The code below will always return 1 for ETH_SS_PRIV_FLAGS in
iecm_get_sset_count() and an array of one string in iecm_get_strings().
This would e.g. result in "ethtool -i" saying "supports-priv-flags: yes"
but then "ethtool --show-priv-flags" failing with -EOPNOTSUPP. IMHO you
should not return bogus string set if private flags are not implemented.

Michal

> +struct iecm_priv_flags {
> +	char flag_string[ETH_GSTRING_LEN];
> +	bool read_only;
> +	u32 flag;
> +};
> +
> +#define IECM_PRIV_FLAG(_name, _flag, _read_only) { \
> +	.read_only = _read_only, \
> +	.flag_string = _name, \
> +	.flag = _flag, \
> +}
> +
> +static const struct iecm_priv_flags iecm_gstrings_priv_flags[] = {
> +	IECM_PRIV_FLAG("", 0, 0),
> +};
> +
> +#define IECM_PRIV_FLAGS_STR_LEN ARRAY_SIZE(iecm_gstrings_priv_flags)
[...]
> +static void iecm_get_priv_flag_strings(struct net_device *netdev, u8 *data)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < IECM_PRIV_FLAGS_STR_LEN; i++) {
> +		snprintf((char *)data, ETH_GSTRING_LEN, "%s",
> +			 iecm_gstrings_priv_flags[i].flag_string);
> +		data += ETH_GSTRING_LEN;
> +	}
> +}

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox