Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v2 net-next 06/23] netlink: Add new socket option to enable strict checking on dumps
From: David Ahern @ 2018-10-08  3:16 UTC (permalink / raw)
  To: netdev, davem; +Cc: christian, jbenc, stephen, David Ahern
In-Reply-To: <20181008031644.15989-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Add a new socket option, NETLINK_DUMP_STRICT_CHK, that userspace
can use via setsockopt to request strict checking of headers and
attributes on dump requests.

To get dump features such as kernel side filtering based on data in
the header or attributes appended to the dump request, userspace
must call setsockopt() for NETLINK_DUMP_STRICT_CHK and a non-zero
value. Since the netlink sock and its flags are private to the
af_netlink code, the strict checking flag is passed to dump handlers
via a flag in the netlink_callback struct.

For old userspace on new kernel there is no impact as all of the data
checks in later patches are wrapped in a check on the new strict flag.

For new userspace on old kernel, the setsockopt will fail and even if
new userspace sets data in the headers and appended attributes the
kernel will silently ignore it. Moving forward when the setsockopt
succeeds, the new userspace on old kernel means the dump request can
pass an attribute the kernel does not understand. The dump will then
fail as the older kernel does not understand it.

New userspace on new kernel setting the socket option gets the benefit
of the improved data dump.

Kernel side the NETLINK_DUMP_STRICT_CHK uapi is converted to a generic
NETLINK_F_STRICT_CHK flag which can potentially be leveraged for tighter
checking on the NEW, DEL, and SET commands.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/linux/netlink.h      |  1 +
 include/uapi/linux/netlink.h |  1 +
 net/netlink/af_netlink.c     | 21 ++++++++++++++++++++-
 net/netlink/af_netlink.h     |  1 +
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 88c8a2d83eb3..72580f1a72a2 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -179,6 +179,7 @@ struct netlink_callback {
 	struct netlink_ext_ack	*extack;
 	u16			family;
 	u16			min_dump_alloc;
+	bool			strict_check;
 	unsigned int		prev_seq, seq;
 	long			args[6];
 };
diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 776bc92e9118..486ed1f0c0bc 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -155,6 +155,7 @@ enum nlmsgerr_attrs {
 #define NETLINK_LIST_MEMBERSHIPS	9
 #define NETLINK_CAP_ACK			10
 #define NETLINK_EXT_ACK			11
+#define NETLINK_DUMP_STRICT_CHK		12
 
 struct nl_pktinfo {
 	__u32	group;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7ac585f33a9e..e613a9f89600 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1706,6 +1706,13 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 			nlk->flags &= ~NETLINK_F_EXT_ACK;
 		err = 0;
 		break;
+	case NETLINK_DUMP_STRICT_CHK:
+		if (val)
+			nlk->flags |= NETLINK_F_STRICT_CHK;
+		else
+			nlk->flags &= ~NETLINK_F_STRICT_CHK;
+		err = 0;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 	}
@@ -1799,6 +1806,15 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 		err = 0;
 		break;
+	case NETLINK_DUMP_STRICT_CHK:
+		if (len < sizeof(int))
+			return -EINVAL;
+		len = sizeof(int);
+		val = nlk->flags & NETLINK_F_STRICT_CHK ? 1 : 0;
+		if (put_user(len, optlen) || put_user(val, optval))
+			return -EFAULT;
+		err = 0;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 	}
@@ -2282,9 +2298,9 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 			 const struct nlmsghdr *nlh,
 			 struct netlink_dump_control *control)
 {
+	struct netlink_sock *nlk, *nlk2;
 	struct netlink_callback *cb;
 	struct sock *sk;
-	struct netlink_sock *nlk;
 	int ret;
 
 	refcount_inc(&skb->users);
@@ -2318,6 +2334,9 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	cb->min_dump_alloc = control->min_dump_alloc;
 	cb->skb = skb;
 
+	nlk2 = nlk_sk(NETLINK_CB(skb).sk);
+	cb->strict_check = !!(nlk2->flags & NETLINK_F_STRICT_CHK);
+
 	if (control->start) {
 		ret = control->start(cb);
 		if (ret)
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 962de7b3c023..5f454c8de6a4 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -15,6 +15,7 @@
 #define NETLINK_F_LISTEN_ALL_NSID	0x10
 #define NETLINK_F_CAP_ACK		0x20
 #define NETLINK_F_EXT_ACK		0x40
+#define NETLINK_F_STRICT_CHK		0x80
 
 #define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
 #define NLGRPLONGS(x)	(NLGRPSZ(x)/sizeof(unsigned long))
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 net-next 05/23] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs
From: David Ahern @ 2018-10-08  3:16 UTC (permalink / raw)
  To: netdev, davem; +Cc: christian, jbenc, stephen, David Ahern
In-Reply-To: <20181008031644.15989-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Pull the inet6_fill_args arg up to in6_dump_addrs and move netnsid
into it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
---
 net/ipv6/addrconf.c | 57 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2f8aa4fd5e55..afa279170ba5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4793,12 +4793,19 @@ static inline int inet6_ifaddr_msgsize(void)
 	       + nla_total_size(4)  /* IFA_RT_PRIORITY */;
 }
 
+enum addr_type_t {
+	UNICAST_ADDR,
+	MULTICAST_ADDR,
+	ANYCAST_ADDR,
+};
+
 struct inet6_fill_args {
 	u32 portid;
 	u32 seq;
 	int event;
 	unsigned int flags;
 	int netnsid;
+	enum addr_type_t type;
 };
 
 static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
@@ -4930,39 +4937,28 @@ static int inet6_fill_ifacaddr(struct sk_buff *skb, struct ifacaddr6 *ifaca,
 	return 0;
 }
 
-enum addr_type_t {
-	UNICAST_ADDR,
-	MULTICAST_ADDR,
-	ANYCAST_ADDR,
-};
-
 /* called with rcu_read_lock() */
 static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
-			  struct netlink_callback *cb, enum addr_type_t type,
-			  int s_ip_idx, int *p_ip_idx, int netnsid)
+			  struct netlink_callback *cb,
+			  int s_ip_idx, int *p_ip_idx,
+			  struct inet6_fill_args *fillargs)
 {
-	struct inet6_fill_args fillargs = {
-		.portid = NETLINK_CB(cb->skb).portid,
-		.seq = cb->nlh->nlmsg_seq,
-		.flags = NLM_F_MULTI,
-		.netnsid = netnsid,
-	};
 	struct ifmcaddr6 *ifmca;
 	struct ifacaddr6 *ifaca;
 	int err = 1;
 	int ip_idx = *p_ip_idx;
 
 	read_lock_bh(&idev->lock);
-	switch (type) {
+	switch (fillargs->type) {
 	case UNICAST_ADDR: {
 		struct inet6_ifaddr *ifa;
-		fillargs.event = RTM_NEWADDR;
+		fillargs->event = RTM_NEWADDR;
 
 		/* unicast address incl. temp addr */
 		list_for_each_entry(ifa, &idev->addr_list, if_list) {
 			if (++ip_idx < s_ip_idx)
 				continue;
-			err = inet6_fill_ifaddr(skb, ifa, &fillargs);
+			err = inet6_fill_ifaddr(skb, ifa, fillargs);
 			if (err < 0)
 				break;
 			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
@@ -4970,26 +4966,26 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 		break;
 	}
 	case MULTICAST_ADDR:
-		fillargs.event = RTM_GETMULTICAST;
+		fillargs->event = RTM_GETMULTICAST;
 
 		/* multicast address */
 		for (ifmca = idev->mc_list; ifmca;
 		     ifmca = ifmca->next, ip_idx++) {
 			if (ip_idx < s_ip_idx)
 				continue;
-			err = inet6_fill_ifmcaddr(skb, ifmca, &fillargs);
+			err = inet6_fill_ifmcaddr(skb, ifmca, fillargs);
 			if (err < 0)
 				break;
 		}
 		break;
 	case ANYCAST_ADDR:
-		fillargs.event = RTM_GETANYCAST;
+		fillargs->event = RTM_GETANYCAST;
 		/* anycast address */
 		for (ifaca = idev->ac_list; ifaca;
 		     ifaca = ifaca->aca_next, ip_idx++) {
 			if (ip_idx < s_ip_idx)
 				continue;
-			err = inet6_fill_ifacaddr(skb, ifaca, &fillargs);
+			err = inet6_fill_ifacaddr(skb, ifaca, fillargs);
 			if (err < 0)
 				break;
 		}
@@ -5005,10 +5001,16 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 			   enum addr_type_t type)
 {
+	struct inet6_fill_args fillargs = {
+		.portid = NETLINK_CB(cb->skb).portid,
+		.seq = cb->nlh->nlmsg_seq,
+		.flags = NLM_F_MULTI,
+		.netnsid = -1,
+		.type = type,
+	};
 	struct net *net = sock_net(skb->sk);
 	struct nlattr *tb[IFA_MAX+1];
 	struct net *tgt_net = net;
-	int netnsid = -1;
 	int h, s_h;
 	int idx, ip_idx;
 	int s_idx, s_ip_idx;
@@ -5023,9 +5025,10 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
 			ifa_ipv6_policy, cb->extack) >= 0) {
 		if (tb[IFA_TARGET_NETNSID]) {
-			netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
+			fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
 
-			tgt_net = rtnl_get_net_ns_capable(skb->sk, netnsid);
+			tgt_net = rtnl_get_net_ns_capable(skb->sk,
+							  fillargs.netnsid);
 			if (IS_ERR(tgt_net))
 				return PTR_ERR(tgt_net);
 		}
@@ -5046,8 +5049,8 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 			if (!idev)
 				goto cont;
 
-			if (in6_dump_addrs(idev, skb, cb, type,
-					   s_ip_idx, &ip_idx, netnsid) < 0)
+			if (in6_dump_addrs(idev, skb, cb, s_ip_idx, &ip_idx,
+					   &fillargs) < 0)
 				goto done;
 cont:
 			idx++;
@@ -5058,7 +5061,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	cb->args[0] = h;
 	cb->args[1] = idx;
 	cb->args[2] = ip_idx;
-	if (netnsid >= 0)
+	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 
 	return skb->len;
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 net-next 04/23] netlink: Add strict version of nlmsg_parse and nla_parse
From: David Ahern @ 2018-10-08  3:16 UTC (permalink / raw)
  To: netdev, davem; +Cc: christian, jbenc, stephen, David Ahern
In-Reply-To: <20181008031644.15989-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

nla_parse is currently lenient on message parsing, allowing type to be 0
or greater than max expected and only logging a message

    "netlink: %d bytes leftover after parsing attributes in process `%s'."

if the netlink message has unknown data at the end after parsing. What this
could mean is that the header at the front of the attributes is actually
wrong and the parsing is shifted from what is expected.

Add a new strict version that actually fails with EINVAL if there are any
bytes remaining after the parsing loop completes, if the atttrbitue type
is 0 or greater than max expected.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/netlink.h | 17 +++++++++++++++++
 lib/nlattr.c          | 48 ++++++++++++++++++++++++++++++++++++------------
 2 files changed, 53 insertions(+), 12 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 9522a0bf1f3a..f1db8e594847 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -373,6 +373,9 @@ int nla_validate(const struct nlattr *head, int len, int maxtype,
 int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
 	      int len, const struct nla_policy *policy,
 	      struct netlink_ext_ack *extack);
+int nla_parse_strict(struct nlattr **tb, int maxtype, const struct nlattr *head,
+		     int len, const struct nla_policy *policy,
+		     struct netlink_ext_ack *extack);
 int nla_policy_len(const struct nla_policy *, int);
 struct nlattr *nla_find(const struct nlattr *head, int len, int attrtype);
 size_t nla_strlcpy(char *dst, const struct nlattr *nla, size_t dstsize);
@@ -525,6 +528,20 @@ static inline int nlmsg_parse(const struct nlmsghdr *nlh, int hdrlen,
 			 nlmsg_attrlen(nlh, hdrlen), policy, extack);
 }
 
+static inline int nlmsg_parse_strict(const struct nlmsghdr *nlh, int hdrlen,
+				     struct nlattr *tb[], int maxtype,
+				     const struct nla_policy *policy,
+				     struct netlink_ext_ack *extack)
+{
+	if (nlh->nlmsg_len < nlmsg_msg_size(hdrlen)) {
+		NL_SET_ERR_MSG(extack, "Invalid header length");
+		return -EINVAL;
+	}
+
+	return nla_parse_strict(tb, maxtype, nlmsg_attrdata(nlh, hdrlen),
+				nlmsg_attrlen(nlh, hdrlen), policy, extack);
+}
+
 /**
  * nlmsg_find_attr - find a specific attribute in a netlink message
  * @nlh: netlink message header
diff --git a/lib/nlattr.c b/lib/nlattr.c
index 1e900bb414ef..d26de6156b97 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -391,9 +391,10 @@ EXPORT_SYMBOL(nla_policy_len);
  *
  * Returns 0 on success or a negative error code.
  */
-int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
-	      int len, const struct nla_policy *policy,
-	      struct netlink_ext_ack *extack)
+static int __nla_parse(struct nlattr **tb, int maxtype,
+		       const struct nlattr *head, int len,
+		       bool strict, const struct nla_policy *policy,
+		       struct netlink_ext_ack *extack)
 {
 	const struct nlattr *nla;
 	int rem;
@@ -403,27 +404,50 @@ int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
 	nla_for_each_attr(nla, head, len, rem) {
 		u16 type = nla_type(nla);
 
-		if (type > 0 && type <= maxtype) {
-			if (policy) {
-				int err = validate_nla(nla, maxtype, policy,
-						       extack);
-
-				if (err < 0)
-					return err;
+		if (type == 0 || type > maxtype) {
+			if (strict) {
+				NL_SET_ERR_MSG(extack, "Unknown attribute type");
+				return -EINVAL;
 			}
+			continue;
+		}
+		if (policy) {
+			int err = validate_nla(nla, maxtype, policy, extack);
 
-			tb[type] = (struct nlattr *)nla;
+			if (err < 0)
+				return err;
 		}
+
+		tb[type] = (struct nlattr *)nla;
 	}
 
-	if (unlikely(rem > 0))
+	if (unlikely(rem > 0)) {
 		pr_warn_ratelimited("netlink: %d bytes leftover after parsing attributes in process `%s'.\n",
 				    rem, current->comm);
+		NL_SET_ERR_MSG(extack, "bytes leftover after parsing attributes");
+		if (strict)
+			return -EINVAL;
+	}
 
 	return 0;
 }
+
+int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
+	      int len, const struct nla_policy *policy,
+	      struct netlink_ext_ack *extack)
+{
+	return __nla_parse(tb, maxtype, head, len, false, policy, extack);
+}
 EXPORT_SYMBOL(nla_parse);
 
+int nla_parse_strict(struct nlattr **tb, int maxtype, const struct nlattr *head,
+		     int len, const struct nla_policy *policy,
+		     struct netlink_ext_ack *extack)
+{
+	return __nla_parse(tb, maxtype, head, len, true, policy, extack);
+}
+EXPORT_SYMBOL(nla_parse_strict);
+
 /**
  * nla_find - Find a specific attribute in a stream of attributes
  * @head: head of attribute stream
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 net-next 02/23] netlink: Add extack message to nlmsg_parse for invalid header length
From: David Ahern @ 2018-10-08  3:16 UTC (permalink / raw)
  To: netdev, davem; +Cc: christian, jbenc, stephen, David Ahern
In-Reply-To: <20181008031644.15989-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Give a user a reason why EINVAL is returned in nlmsg_parse.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
---
 include/net/netlink.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 589683091f16..9522a0bf1f3a 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -516,8 +516,10 @@ static inline int nlmsg_parse(const struct nlmsghdr *nlh, int hdrlen,
 			      const struct nla_policy *policy,
 			      struct netlink_ext_ack *extack)
 {
-	if (nlh->nlmsg_len < nlmsg_msg_size(hdrlen))
+	if (nlh->nlmsg_len < nlmsg_msg_size(hdrlen)) {
+		NL_SET_ERR_MSG(extack, "Invalid header length");
 		return -EINVAL;
+	}
 
 	return nla_parse(tb, maxtype, nlmsg_attrdata(nlh, hdrlen),
 			 nlmsg_attrlen(nlh, hdrlen), policy, extack);
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 net-next 01/23] netlink: Pass extack to dump handlers
From: David Ahern @ 2018-10-08  3:16 UTC (permalink / raw)
  To: netdev, davem; +Cc: christian, jbenc, stephen, David Ahern
In-Reply-To: <20181008031644.15989-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Declare extack in netlink_dump and pass to dump handlers via
netlink_callback. Add any extack message after the dump_done_errno
allowing error messages to be returned. This will be useful when
strict checking is done on dump requests, returning why the dump
fails EINVAL.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
---
 include/linux/netlink.h  |  1 +
 net/netlink/af_netlink.c | 12 +++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 71f121b66ca8..88c8a2d83eb3 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -176,6 +176,7 @@ struct netlink_callback {
 	void			*data;
 	/* the module that dump function belong to */
 	struct module		*module;
+	struct netlink_ext_ack	*extack;
 	u16			family;
 	u16			min_dump_alloc;
 	unsigned int		prev_seq, seq;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e3a0538ec0be..7ac585f33a9e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2171,6 +2171,7 @@ EXPORT_SYMBOL(__nlmsg_put);
 static int netlink_dump(struct sock *sk)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
+	struct netlink_ext_ack extack = {};
 	struct netlink_callback *cb;
 	struct sk_buff *skb = NULL;
 	struct nlmsghdr *nlh;
@@ -2222,8 +2223,11 @@ static int netlink_dump(struct sock *sk)
 	skb_reserve(skb, skb_tailroom(skb) - alloc_size);
 	netlink_skb_set_owner_r(skb, sk);
 
-	if (nlk->dump_done_errno > 0)
+	if (nlk->dump_done_errno > 0) {
+		cb->extack = &extack;
 		nlk->dump_done_errno = cb->dump(skb, cb);
+		cb->extack = NULL;
+	}
 
 	if (nlk->dump_done_errno > 0 ||
 	    skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
@@ -2246,6 +2250,12 @@ static int netlink_dump(struct sock *sk)
 	memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
 	       sizeof(nlk->dump_done_errno));
 
+	if (extack._msg && nlk->flags & NETLINK_F_EXT_ACK) {
+		nlh->nlmsg_flags |= NLM_F_ACK_TLVS;
+		if (!nla_put_string(skb, NLMSGERR_ATTR_MSG, extack._msg))
+			nlmsg_end(skb, nlh);
+	}
+
 	if (sk_filter(sk, skb))
 		kfree_skb(skb);
 	else
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 net-next 03/23] net: Add extack to nlmsg_parse
From: David Ahern @ 2018-10-08  3:16 UTC (permalink / raw)
  To: netdev, davem; +Cc: christian, jbenc, stephen, David Ahern
In-Reply-To: <20181008031644.15989-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Make sure extack is passed to nlmsg_parse where easy to do so.
Most of these are dump handlers and leveraging the extack in
the netlink_callback.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
---
 net/core/devlink.c             | 2 +-
 net/core/neighbour.c           | 3 ++-
 net/core/rtnetlink.c           | 4 ++--
 net/ipv4/devinet.c             | 9 +++++----
 net/ipv6/addrconf.c            | 2 +-
 net/ipv6/route.c               | 2 +-
 net/mpls/af_mpls.c             | 2 +-
 net/netfilter/ipvs/ip_vs_ctl.c | 2 +-
 net/sched/act_api.c            | 2 +-
 net/sched/cls_api.c            | 6 ++++--
 net/sched/sch_api.c            | 2 +-
 net/xfrm/xfrm_user.c           | 2 +-
 12 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/net/core/devlink.c b/net/core/devlink.c
index 938f68ee92f0..6dae81d65d5c 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -3504,7 +3504,7 @@ static int devlink_nl_cmd_region_read_dumpit(struct sk_buff *skb,
 	start_offset = *((u64 *)&cb->args[0]);
 
 	err = nlmsg_parse(cb->nlh, GENL_HDRLEN + devlink_nl_family.hdrsize,
-			  attrs, DEVLINK_ATTR_MAX, ops->policy, NULL);
+			  attrs, DEVLINK_ATTR_MAX, ops->policy, cb->extack);
 	if (err)
 		goto out;
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index fb023df48b83..b06f794bf91e 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2445,7 +2445,8 @@ static int neigh_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 	    ((struct ndmsg *)nlmsg_data(nlh))->ndm_flags == NTF_PROXY)
 		proxy = 1;
 
-	err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL, NULL);
+	err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL,
+			  cb->extack);
 	if (!err) {
 		if (tb[NDA_IFINDEX]) {
 			if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32))
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5564eee1e980..4486e8b7d9d0 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1909,7 +1909,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 		 sizeof(struct rtgenmsg) : sizeof(struct ifinfomsg);
 
 	if (nlmsg_parse(cb->nlh, hdrlen, tb, IFLA_MAX,
-			ifla_policy, NULL) >= 0) {
+			ifla_policy, cb->extack) >= 0) {
 		if (tb[IFLA_TARGET_NETNSID]) {
 			netnsid = nla_get_s32(tb[IFLA_TARGET_NETNSID]);
 			tgt_net = rtnl_get_net_ns_capable(skb->sk, netnsid);
@@ -3774,7 +3774,7 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	    (nlmsg_len(cb->nlh) != sizeof(struct ndmsg) +
 	     nla_attr_size(sizeof(u32)))) {
 		err = nlmsg_parse(cb->nlh, sizeof(struct ifinfomsg), tb,
-				  IFLA_MAX, ifla_policy, NULL);
+				  IFLA_MAX, ifla_policy, cb->extack);
 		if (err < 0) {
 			return -EINVAL;
 		} else if (err == 0) {
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 44d931a3cd50..ab2b11df5ea4 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -782,7 +782,8 @@ static void set_ifa_lifetime(struct in_ifaddr *ifa, __u32 valid_lft,
 }
 
 static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh,
-				       __u32 *pvalid_lft, __u32 *pprefered_lft)
+				       __u32 *pvalid_lft, __u32 *pprefered_lft,
+				       struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[IFA_MAX+1];
 	struct in_ifaddr *ifa;
@@ -792,7 +793,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh,
 	int err;
 
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy,
-			  NULL);
+			  extack);
 	if (err < 0)
 		goto errout;
 
@@ -897,7 +898,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	ASSERT_RTNL();
 
-	ifa = rtm_to_ifaddr(net, nlh, &valid_lft, &prefered_lft);
+	ifa = rtm_to_ifaddr(net, nlh, &valid_lft, &prefered_lft, extack);
 	if (IS_ERR(ifa))
 		return PTR_ERR(ifa);
 
@@ -1684,7 +1685,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	s_ip_idx = ip_idx = cb->args[2];
 
 	if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
-			ifa_ipv4_policy, NULL) >= 0) {
+			ifa_ipv4_policy, cb->extack) >= 0) {
 		if (tb[IFA_TARGET_NETNSID]) {
 			fillargs.netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9a317322388..2f8aa4fd5e55 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5021,7 +5021,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	s_ip_idx = ip_idx = cb->args[2];
 
 	if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
-			ifa_ipv6_policy, NULL) >= 0) {
+			ifa_ipv6_policy, cb->extack) >= 0) {
 		if (tb[IFA_TARGET_NETNSID]) {
 			netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 74d97addf1af..7c38e0e058ae 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4117,7 +4117,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 	int err;
 
 	err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv6_policy,
-			  NULL);
+			  extack);
 	if (err < 0)
 		goto errout;
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8fbe6cdbe255..55a30ee3d820 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1223,7 +1223,7 @@ static int mpls_netconf_get_devconf(struct sk_buff *in_skb,
 	int err;
 
 	err = nlmsg_parse(nlh, sizeof(*ncm), tb, NETCONFA_MAX,
-			  devconf_mpls_policy, NULL);
+			  devconf_mpls_policy, extack);
 	if (err < 0)
 		goto errout;
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 62eefea48973..83395bf6dc35 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3234,7 +3234,7 @@ static int ip_vs_genl_dump_dests(struct sk_buff *skb,
 
 	/* Try to find the service for which to dump destinations */
 	if (nlmsg_parse(cb->nlh, GENL_HDRLEN, attrs, IPVS_CMD_ATTR_MAX,
-			ip_vs_cmd_policy, NULL))
+			ip_vs_cmd_policy, cb->extack))
 		goto out_err;
 
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 55153da00278..9c1b0729aebf 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1452,7 +1452,7 @@ static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
 	u32 act_count = 0;
 
 	ret = nlmsg_parse(cb->nlh, sizeof(struct tcamsg), tb, TCA_ROOT_MAX,
-			  tcaa_policy, NULL);
+			  tcaa_policy, cb->extack);
 	if (ret < 0)
 		return ret;
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index d670d3066ebd..43c8559aca56 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1727,7 +1727,8 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	if (nlmsg_len(cb->nlh) < sizeof(*tcm))
 		return skb->len;
 
-	err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, NULL, NULL);
+	err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, NULL,
+			  cb->extack);
 	if (err)
 		return err;
 
@@ -2054,7 +2055,8 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 	if (nlmsg_len(cb->nlh) < sizeof(*tcm))
 		return skb->len;
 
-	err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, NULL, NULL);
+	err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, NULL,
+			  cb->extack);
 	if (err)
 		return err;
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index da1963b19dec..cf5c714ae786 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1671,7 +1671,7 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct netlink_callback *cb)
 	ASSERT_RTNL();
 
 	err = nlmsg_parse(nlh, sizeof(struct tcmsg), tca, TCA_MAX,
-			  rtm_tca_policy, NULL);
+			  rtm_tca_policy, cb->extack);
 	if (err < 0)
 		return err;
 
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index df7ca2dabc48..ca7a207b81a9 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1007,7 +1007,7 @@ static int xfrm_dump_sa(struct sk_buff *skb, struct netlink_callback *cb)
 		int err;
 
 		err = nlmsg_parse(cb->nlh, 0, attrs, XFRMA_MAX, xfrma_policy,
-				  NULL);
+				  cb->extack);
 		if (err < 0)
 			return err;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 net-next 00/23] rtnetlink: Add support for rigid checking of data in dump request
From: David Ahern @ 2018-10-08  3:16 UTC (permalink / raw)
  To: netdev, davem; +Cc: christian, jbenc, stephen, David Ahern

From: David Ahern <dsahern@gmail.com>

There are many use cases where a user wants to influence what is
returned in a dump for some rtnetlink command: one is wanting data
for a different namespace than the one the request is received and
another is limiting the amount of data returned in the dump to a
specific set of interest to userspace, reducing the cpu overhead of
both kernel and userspace. Unfortunately, the kernel has historically
not been strict with checking for the proper header or checking the
values passed in the header. This lenient implementation has allowed
iproute2 and other packages to pass any struct or data in the dump
request as long as the family is the first byte. For example, ifinfomsg
struct is used by iproute2 for all generic dump requests - links,
addresses, routes and rules when it is really only valid for link
requests.

There is 1 is example where the kernel deals with the wrong struct: link
dumps after VF support was added. Older iproute2 was sending rtgenmsg as
the header instead of ifinfomsg so a patch was added to try and detect
old userspace vs new:
e5eca6d41f53 ("rtnetlink: fix userspace API breakage for iproute2 < v3.9.0")

The latest example is Christian's patch set wanting to return addresses for
a target namespace. It guesses the header struct is an ifaddrmsg and if it
guesses wrong a netlink warning is generated in the kernel log on every
address dump which is unacceptable.

Another example where the kernel is a bit lenient is route dumps: iproute2
can send either a request with either ifinfomsg or a rtmsg as the header
struct, yet the kernel always treats the header as an rtmsg (see
inet_dump_fib and rtm_flags check). The header inconsistency impacts the
ability to add kernel side filters for route dumps - a necessary feature
for scale setups with 100k+ routes.

How to resolve the problem of not breaking old userspace yet be able to
move forward with new features such as kernel side filtering which are
crucial for efficient operation at high scale?

This patch set addresses the problem by adding a new socket flag,
NETLINK_DUMP_STRICT_CHK, that userspace can use with setsockopt to
request strict checking of headers and attributes on dump requests and
hence unlock the ability to use kernel side filters as they are added.

Kernel side, the dump handlers are updated to verify the message contains
at least the expected header struct:
    RTM_GETLINK:       ifinfomsg
    RTM_GETADDR:       ifaddrmsg
    RTM_GETMULTICAST:  ifaddrmsg
    RTM_GETANYCAST:    ifaddrmsg
    RTM_GETADDRLABEL:  ifaddrlblmsg
    RTM_GETROUTE:      rtmsg
    RTM_GETSTATS:      if_stats_msg
    RTM_GETNEIGH:      ndmsg
    RTM_GETNEIGHTBL:   ndtmsg
    RTM_GETNSID:       rtgenmsg
    RTM_GETRULE:       fib_rule_hdr
    RTM_GETNETCONF:    netconfmsg
    RTM_GETMDB:        br_port_msg

And then every field in the header struct should be 0 with the exception
of the family. There are a few exceptions to this rule where the kernel
already influences the data returned by values in the struct. Next the
message should not contain attributes unless the kernel implements
filtering for it. Any unexpected data causes the dump to fail with EINVAL.
If the new flag is honored by the kernel and the dump contents adjusted
by any data passed in the request, the dump handler can set the
NLM_F_DUMP_FILTERED flag in the netlink message header.

For old userspace on new kernel there is no impact as all checks are
wrapped in a check on the new strict flag. For new userspace on old
kernel, the data in the headers and any appended attributes are
silently ignored though the setsockopt failing is the clue to userspace
the feature is not supported. New userspace on new kernel gets the
requested data dump.

iproute2 patches can be found here:
    https://github.com/dsahern/iproute2 dump-enhancements

Major changes since v1
- inner header is supposed to be 4-bytes aligned. So for dumps that
  should not have attributes appended changed the check to use:
        if (nlmsg_attrlen(nlh, sizeof(hdr)))
  Only impacts patches with headers that are not multiples of 4-bytes
  (rtgenmsg, netconfmsg), but applied the change to all patches not
  calling nlmsg_parse for consistency.

- Added nlmsg_parse_strict and nla_parse_strict for tighter control on
  attribute parsing. There should be no unknown attribute types or extra
  bytes.

- Moved validation to a helper in most cases

Changes since rfc-v2
- dropped the NLM_F_DUMP_FILTERED flag from target nsid dumps per
  Jiri's objections
- changed the opt-in uapi from a netlink message flag to a socket
  flag. setsockopt provides an api for userspace to definitively
  know if the kernel supports strict checking on dumps.
- re-ordered patches to peel off the extack on dumps if needed to
  keep this set size within limits
- misc cleanups in patches based on testing

David Ahern (23):
  netlink: Pass extack to dump handlers
  netlink: Add extack message to nlmsg_parse for invalid header length
  net: Add extack to nlmsg_parse
  netlink: Add strict version of nlmsg_parse and nla_parse
  net/ipv6: Refactor address dump to push inet6_fill_args to
    in6_dump_addrs
  netlink: Add new socket option to enable strict checking on dumps
  net/ipv4: Update inet_dump_ifaddr for strict data checking
  net/ipv6: Update inet6_dump_addr for strict data checking
  rtnetlink: Update rtnl_dump_ifinfo for strict data checking
  rtnetlink: Update rtnl_bridge_getlink for strict data checking
  rtnetlink: Update rtnl_stats_dump for strict data checking
  rtnetlink: Update inet6_dump_ifinfo for strict data checking
  rtnetlink: Update ipmr_rtm_dumplink for strict data checking
  rtnetlink: Update fib dumps for strict data checking
  net/neighbor: Update neigh_dump_info for strict data checking
  net/neighbor: Update neightbl_dump_info for strict data checking
  net/namespace: Update rtnl_net_dumpid for strict data checking
  net/fib_rules: Update fib_nl_dumprule for strict data checking
  net/ipv6: Update ip6addrlbl_dump for strict data checking
  net: Update netconf dump handlers for strict data checking
  net/bridge: Update br_mdb_dump for strict data checking
  rtnetlink: Move input checking for rtnl_fdb_dump to helper
  rtnetlink: Update rtnl_fdb_dump for strict data checking

 include/linux/netlink.h        |   2 +
 include/net/ip_fib.h           |   2 +
 include/net/netlink.h          |  21 ++-
 include/uapi/linux/netlink.h   |   1 +
 lib/nlattr.c                   |  48 +++++--
 net/bridge/br_mdb.c            |  30 ++++
 net/core/devlink.c             |   2 +-
 net/core/fib_rules.c           |  36 ++++-
 net/core/neighbour.c           | 119 ++++++++++++---
 net/core/net_namespace.c       |   6 +
 net/core/rtnetlink.c           | 318 ++++++++++++++++++++++++++++++++---------
 net/ipv4/devinet.c             | 101 ++++++++++---
 net/ipv4/fib_frontend.c        |  42 +++++-
 net/ipv4/ipmr.c                |  39 +++++
 net/ipv6/addrconf.c            | 177 ++++++++++++++++++-----
 net/ipv6/addrlabel.c           |  34 ++++-
 net/ipv6/ip6_fib.c             |   8 ++
 net/ipv6/ip6mr.c               |   9 ++
 net/ipv6/route.c               |   2 +-
 net/mpls/af_mpls.c             |  28 +++-
 net/netfilter/ipvs/ip_vs_ctl.c |   2 +-
 net/netlink/af_netlink.c       |  33 ++++-
 net/netlink/af_netlink.h       |   1 +
 net/sched/act_api.c            |   2 +-
 net/sched/cls_api.c            |   6 +-
 net/sched/sch_api.c            |   2 +-
 net/xfrm/xfrm_user.c           |   2 +-
 27 files changed, 908 insertions(+), 165 deletions(-)

-- 
2.11.0

^ permalink raw reply

* Re: [RFC PATCH bpf-next v4 4/7] bpf: add bpf queue and stack maps
From: Mauricio Vasquez @ 2018-10-08  3:15 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev
In-Reply-To: <86a2498d-4e2e-7fb9-5139-9738e3796b01@polito.it>



On 10/04/2018 10:40 PM, Mauricio Vasquez wrote:
>
>
> On 10/04/2018 06:57 PM, Alexei Starovoitov wrote:
>> On Thu, Oct 04, 2018 at 07:12:44PM +0200, Mauricio Vasquez B wrote:
>>> Implement two new kind of maps that support the peek, push and pop
>>> operations.
>>>
>>> A use case for this is to keep track of a pool of elements, like
>>> network ports in a SNAT.
>>>
>>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>>> ---
>>>   include/linux/bpf.h           |    7 +
>>>   include/linux/bpf_types.h     |    2
>>>   include/uapi/linux/bpf.h      |   35 ++++-
>>>   kernel/bpf/Makefile           |    2
>>>   kernel/bpf/core.c             |    3
>>>   kernel/bpf/helpers.c          |   43 ++++++
>>>   kernel/bpf/queue_stack_maps.c |  300 
>>> +++++++++++++++++++++++++++++++++++++++++
>>>   kernel/bpf/syscall.c          |   31 +++-
>>>   kernel/bpf/verifier.c         |   14 +-
>>>   net/core/filter.c             |    6 +
>>>   10 files changed, 424 insertions(+), 19 deletions(-)
>>>   create mode 100644 kernel/bpf/queue_stack_maps.c
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 98c7eeb6d138..cad3bc5cffd1 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -40,6 +40,9 @@ struct bpf_map_ops {
>>>       int (*map_update_elem)(struct bpf_map *map, void *key, void 
>>> *value, u64 flags);
>>>       int (*map_delete_elem)(struct bpf_map *map, void *key);
>>>       void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void 
>>> *key);
>>> +    int (*map_push_elem)(struct bpf_map *map, void *value, u64 flags);
>>> +    int (*map_pop_elem)(struct bpf_map *map, void *value);
>>> +    int (*map_peek_elem)(struct bpf_map *map, void *value);
>>>         /* funcs called by prog_array and perf_event_array map */
>>>       void *(*map_fd_get_ptr)(struct bpf_map *map, struct file 
>>> *map_file,
>>> @@ -139,6 +142,7 @@ enum bpf_arg_type {
>>>       ARG_CONST_MAP_PTR,    /* const argument used as pointer to 
>>> bpf_map */
>>>       ARG_PTR_TO_MAP_KEY,    /* pointer to stack used as map key */
>>>       ARG_PTR_TO_MAP_VALUE,    /* pointer to stack used as map value */
>>> +    ARG_PTR_TO_UNINIT_MAP_VALUE,    /* pointer to valid memory used 
>>> to store a map value */
>>>         /* the following constraints used to prototype bpf_memcmp() 
>>> and other
>>>        * functions that access data on eBPF program stack
>>> @@ -825,6 +829,9 @@ static inline int 
>>> bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
>>>   extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
>>>   extern const struct bpf_func_proto bpf_map_update_elem_proto;
>>>   extern const struct bpf_func_proto bpf_map_delete_elem_proto;
>>> +extern const struct bpf_func_proto bpf_map_push_elem_proto;
>>> +extern const struct bpf_func_proto bpf_map_pop_elem_proto;
>>> +extern const struct bpf_func_proto bpf_map_peek_elem_proto;
>>>     extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
>>>   extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
>>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>>> index 658509daacd4..a2ec73aa1ec7 100644
>>> --- a/include/linux/bpf_types.h
>>> +++ b/include/linux/bpf_types.h
>>> @@ -69,3 +69,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>>>   BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>>>   #endif
>>>   #endif
>>> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
>>> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 3bb94aa2d408..bfa042273fad 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -129,6 +129,8 @@ enum bpf_map_type {
>>>       BPF_MAP_TYPE_CGROUP_STORAGE,
>>>       BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
>>>       BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
>>> +    BPF_MAP_TYPE_QUEUE,
>>> +    BPF_MAP_TYPE_STACK,
>>>   };
>>>     enum bpf_prog_type {
>>> @@ -463,6 +465,28 @@ union bpf_attr {
>>>    *     Return
>>>    *         0 on success, or a negative error in case of failure.
>>>    *
>>> + * int bpf_map_push_elem(struct bpf_map *map, const void *value, 
>>> u64 flags)
>>> + *     Description
>>> + *         Push an element *value* in *map*. *flags* is one of:
>>> + *
>>> + *         **BPF_EXIST**
>>> + *         If the queue/stack is full, the oldest element is 
>>> removed to
>>> + *         make room for this.
>>> + *     Return
>>> + *         0 on success, or a negative error in case of failure.
>>> + *
>>> + * int bpf_map_pop_elem(struct bpf_map *map, void *value)
>>> + *     Description
>>> + *         Pop an element from *map*.
>>> + * Return
>>> + *         0 on success, or a negative error in case of failure.
>>> + *
>>> + * int bpf_map_peek_elem(struct bpf_map *map, void *value)
>>> + *     Description
>>> + *         Get an element from *map* without removing it.
>>> + * Return
>>> + *         0 on success, or a negative error in case of failure.
>>> + *
>>>    * int bpf_probe_read(void *dst, u32 size, const void *src)
>>>    *     Description
>>>    *         For tracing programs, safely attempt to read *size* 
>>> bytes from
>>> @@ -790,14 +814,14 @@ union bpf_attr {
>>>    *
>>>    *             int ret;
>>>    *             struct bpf_tunnel_key key = {};
>>> - *
>>> + *
>>>    *             ret = bpf_skb_get_tunnel_key(skb, &key, 
>>> sizeof(key), 0);
>>>    *             if (ret < 0)
>>>    *                 return TC_ACT_SHOT;    // drop packet
>>> - *
>>> + *
>>>    *             if (key.remote_ipv4 != 0x0a000001)
>>>    *                 return TC_ACT_SHOT;    // drop packet
>>> - *
>>> + *
>>>    *             return TC_ACT_OK;        // accept packet
>>>    *
>>>    *         This interface can also be used with all encapsulation 
>>> devices
>>> @@ -2304,7 +2328,10 @@ union bpf_attr {
>>>       FN(skb_ancestor_cgroup_id),    \
>>>       FN(sk_lookup_tcp),        \
>>>       FN(sk_lookup_udp),        \
>>> -    FN(sk_release),
>>> +    FN(sk_release),            \
>>> +    FN(map_push_elem),        \
>>> +    FN(map_pop_elem),        \
>>> +    FN(map_peek_elem),
>>>     /* integer value in 'imm' field of BPF_CALL instruction selects 
>>> which helper
>>>    * function eBPF program intends to call
>>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>>> index 0488b8258321..17afae9e65f3 100644
>>> --- a/kernel/bpf/Makefile
>>> +++ b/kernel/bpf/Makefile
>>> @@ -3,7 +3,7 @@ obj-y := core.o
>>>     obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o 
>>> helpers.o tnum.o
>>>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o 
>>> percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
>>> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o
>>> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
>>>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
>>>   obj-$(CONFIG_BPF_SYSCALL) += btf.o
>>>   ifeq ($(CONFIG_NET),y)
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 3f5bf1af0826..8d2db076d123 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -1783,6 +1783,9 @@ BPF_CALL_0(bpf_user_rnd_u32)
>>>   const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
>>>   const struct bpf_func_proto bpf_map_update_elem_proto __weak;
>>>   const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
>>> +const struct bpf_func_proto bpf_map_push_elem_proto __weak;
>>> +const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
>>> +const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
>>>     const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
>>>   const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index 6502115e8f55..ab0d5e3f9892 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -76,6 +76,49 @@ const struct bpf_func_proto 
>>> bpf_map_delete_elem_proto = {
>>>       .arg2_type    = ARG_PTR_TO_MAP_KEY,
>>>   };
>>>   +BPF_CALL_3(bpf_map_push_elem, struct bpf_map *, map, void *, 
>>> value, u64, flags)
>>> +{
>>> +    return map->ops->map_push_elem(map, value, flags);
>>> +}
>>> +
>>> +const struct bpf_func_proto bpf_map_push_elem_proto = {
>>> +    .func        = bpf_map_push_elem,
>>> +    .gpl_only    = false,
>>> +    .pkt_access    = true,
>>> +    .ret_type    = RET_INTEGER,
>>> +    .arg1_type    = ARG_CONST_MAP_PTR,
>>> +    .arg2_type    = ARG_PTR_TO_MAP_VALUE,
>>> +    .arg3_type    = ARG_ANYTHING,
>>> +};
>>> +
>>> +BPF_CALL_2(bpf_map_pop_elem, struct bpf_map *, map, void *, value)
>>> +{
>>> +    return map->ops->map_pop_elem(map, value);
>>> +}
>>> +
>>> +const struct bpf_func_proto bpf_map_pop_elem_proto = {
>>> +    .func        = bpf_map_pop_elem,
>>> +    .gpl_only    = false,
>>> +    .pkt_access    = true,
>>> +    .ret_type    = RET_INTEGER,
>>> +    .arg1_type    = ARG_CONST_MAP_PTR,
>>> +    .arg2_type    = ARG_PTR_TO_UNINIT_MAP_VALUE,
>>> +};
>>> +
>>> +BPF_CALL_2(bpf_map_peek_elem, struct bpf_map *, map, void *, value)
>>> +{
>>> +    return map->ops->map_peek_elem(map, value);
>>> +}
>>> +
>>> +const struct bpf_func_proto bpf_map_peek_elem_proto = {
>>> +    .func        = bpf_map_pop_elem,
>>> +    .gpl_only    = false,
>>> +    .pkt_access    = true,
>>> +    .ret_type    = RET_INTEGER,
>>> +    .arg1_type    = ARG_CONST_MAP_PTR,
>>> +    .arg2_type    = ARG_PTR_TO_UNINIT_MAP_VALUE,
>>> +};
>>> +
>>>   const struct bpf_func_proto bpf_get_prandom_u32_proto = {
>>>       .func        = bpf_user_rnd_u32,
>>>       .gpl_only    = false,
>>> diff --git a/kernel/bpf/queue_stack_maps.c 
>>> b/kernel/bpf/queue_stack_maps.c
>>> new file mode 100644
>>> index 000000000000..a597c5ba68f6
>>> --- /dev/null
>>> +++ b/kernel/bpf/queue_stack_maps.c
>>> @@ -0,0 +1,300 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * queue_stack_maps.c: BPF queue and stack maps
>>> + *
>>> + * Copyright (c) 2018 Politecnico di Torino
>>> + */
>>> +#include <linux/bpf.h>
>>> +#include <linux/list.h>
>>> +#include <linux/slab.h>
>>> +#include "percpu_freelist.h"
>>> +
>>> +#define QUEUE_STACK_CREATE_FLAG_MASK \
>>> +    (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
>>> +
>>> +
>>> +struct bpf_queue_stack {
>>> +    struct bpf_map map;
>>> +    raw_spinlock_t lock;
>>> +    u32 head, tail;
>>> +    u32 index_mask;
>>> +    u32 count;
>>> +
>>> +    char elements[0] __aligned(8);
>>> +};
>>> +
>>> +static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map)
>>> +{
>>> +    return container_of(map, struct bpf_queue_stack, map);
>>> +}
>>> +
>>> +static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs)
>>> +{
>>> +    return qs->count == 0;
>>> +}
>>> +
>>> +static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
>>> +{
>>> +    return qs->count == qs->map.max_entries;
>>> +}
>>> +
>>> +/* Called from syscall */
>>> +static int queue_stack_map_alloc_check(union bpf_attr *attr)
>>> +{
>>> +    /* check sanity of attributes */
>>> +    if (attr->max_entries == 0 || attr->key_size != 0 ||
>>> +        attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
>>> +        return -EINVAL;
>>> +
>>> +    if (attr->value_size > KMALLOC_MAX_SIZE)
>>> +        /* if value_size is bigger, the user space won't be able to
>>> +         * access the elements.
>>> +         */
>>> +        return -E2BIG;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
>>> +{
>>> +    int ret, numa_node = bpf_map_attr_numa_node(attr);
>>> +    u32 max_entries, value_size, index_mask;
>>> +    u64 queue_size, cost, mask64;
>>> +    struct bpf_queue_stack *qs;
>>> +
>>> +    max_entries = attr->max_entries;
>>> +    value_size = attr->value_size;
>>> +
>>> +    /* From arraymap.c:
>>> +     * On 32 bit archs roundup_pow_of_two() with max_entries that has
>>> +     * upper most bit set in u32 space is undefined behavior due to
>>> +     * resulting 1U << 32, so do it manually here in u64 space.
>>> +     */
>>> +    mask64 = fls_long(max_entries - 1);
>>> +    mask64 = 1ULL << mask64;
>>> +    mask64 -= 1;
>>> +
>>> +    index_mask = mask64;
>>> +
>>> +    /* Round up queue size to nearest power of 2 */
>>> +    max_entries = index_mask + 1;
>> what's the point of roundup ?
>
> If the size of the buffer is power of two we can wrap the indexes with 
> an AND operation instead of MOD.
>
>> The memory waste becomes quite large when max_entries are high.
> Yes, you are right, we have the different choices described below.
>
>>
>> If queue/stack is sized to exact max_entries,
>> then 'count' can be removed too, right?
>
> If we don't use 'count' and we want to use the AND operation for 
> wrapping indexes, the max entries should be 2^ - 1  because a slot is 
> lost to distinguish between full/empty queue/stack.
>
> Just to summarize, we have these options:
> 1. Allow any size, round up, use the AND operation and 'count' (current).
> 2. Allow only power of 2 sizes, use the AND operation and 'count'.
> 3. Allow any size, no roundup, use the MOD operation and leaving an 
> empty slot.
>
> I prefer 1 or 2, but I don't have a strong opinion, maybe allowing 
> only power of two max entries could be too limiting.
> Another consideration: is this really too bad to waste memory when 
> user requires a size far away of the next power of 2?
>
>>> +    /* Check for overflows. */
>>> +    if (max_entries < attr->max_entries)
>>> +        return ERR_PTR(-E2BIG);
>>> +
>>> +    queue_size = sizeof(*qs) + (u64) value_size * max_entries;
>>> +
>>> +    cost = queue_size;
>>> +    if (cost >= U32_MAX - PAGE_SIZE)
>>> +        return ERR_PTR(-E2BIG);
>>> +
>>> +    cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
>>> +
>>> +    ret = bpf_map_precharge_memlock(cost);
>>> +    if (ret < 0)
>>> +        return ERR_PTR(ret);
>>> +
>>> +    qs = bpf_map_area_alloc(queue_size, numa_node);
>>> +    if (!qs)
>>> +        return ERR_PTR(-ENOMEM);
>>> +
>>> +    memset(qs, 0, sizeof(*qs));
>>> +
>>> +    bpf_map_init_from_attr(&qs->map, attr);
>>> +
>>> +    qs->map.pages = cost;
>>> +    qs->index_mask = index_mask;
>>> +
>>> +    raw_spin_lock_init(&qs->lock);
>>> +
>>> +    return &qs->map;
>>> +}
>>> +
>>> +/* Called when map->refcnt goes to zero, either from workqueue or 
>>> from syscall */
>>> +static void queue_stack_map_free(struct bpf_map *map)
>>> +{
>>> +    struct bpf_queue_stack *qs = bpf_queue_stack(map);
>>> +
>>> +    /* at this point bpf_prog->aux->refcnt == 0 and this 
>>> map->refcnt == 0,
>>> +     * so the programs (can be more than one that used this map) were
>>> +     * disconnected from events. Wait for outstanding critical 
>>> sections in
>>> +     * these programs to complete
>>> +     */
>>> +    synchronize_rcu();
>>> +
>>> +    bpf_map_area_free(qs);
>>> +}
>>> +
>>> +static int __queue_map_get(struct bpf_map *map, void *value, bool 
>>> delete)
>>> +{
>>> +    struct bpf_queue_stack *qs = bpf_queue_stack(map);
>>> +    unsigned long flags;
>>> +    int err = 0;
>>> +    void *ptr;
>>> +
>>> +    raw_spin_lock_irqsave(&qs->lock, flags);
>>> +
>>> +    if (queue_stack_map_is_empty(qs)) {
>>> +        err = -ENOENT;
>>> +        goto out;
>>> +    }
>>> +
>>> +    ptr = &qs->elements[qs->tail * qs->map.value_size];
>>> +    memcpy(value, ptr, qs->map.value_size);
>>> +
>>> +    if (delete) {
>>> +        qs->tail = (qs->tail + 1) & qs->index_mask;
>>> +        qs->count--;
>>> +    }
>>> +
>>> +out:
>>> +    raw_spin_unlock_irqrestore(&qs->lock, flags);
>>> +    return err;
>>> +}
>>> +
>>> +
>>> +static int __stack_map_get(struct bpf_map *map, void *value, bool 
>>> delete)
>>> +{
>>> +    struct bpf_queue_stack *qs = bpf_queue_stack(map);
>>> +    unsigned long flags;
>>> +    int err = 0;
>>> +    void *ptr;
>>> +    u32 index;
>>> +
>>> +    raw_spin_lock_irqsave(&qs->lock, flags);
>>> +
>>> +    if (queue_stack_map_is_empty(qs)) {
>>> +        err = -ENOENT;
>>> +        goto out;
>>> +    }
>>> +
>>> +    index = (qs->head - 1) & qs->index_mask;
>>> +    ptr = &qs->elements[index * qs->map.value_size];
>>> +    memcpy(value, ptr, qs->map.value_size);
>>> +
>>> +    if (delete) {
>>> +        qs->head = (qs->head - 1) & qs->index_mask;
>>> +        qs->count--;
>>> +    }
>>> +
>>> +out:
>>> +    raw_spin_unlock_irqrestore(&qs->lock, flags);
>>> +    return err;
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static int queue_map_peek_elem(struct bpf_map *map, void *value)
>>> +{
>>> +    return __queue_map_get(map, value, false);
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static int stack_map_peek_elem(struct bpf_map *map, void *value)
>>> +{
>>> +    return __stack_map_get(map, value, false);
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static int queue_map_pop_elem(struct bpf_map *map, void *value)
>>> +{
>>> +    return __queue_map_get(map, value, true);
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static int stack_map_pop_elem(struct bpf_map *map, void *value)
>>> +{
>>> +    return __stack_map_get(map, value, true);
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static int queue_stack_map_push_elem(struct bpf_map *map, void *value,
>>> +                     u64 flags)
>>> +{
>>> +    struct bpf_queue_stack *qs = bpf_queue_stack(map);
>>> +    unsigned long irq_flags;
>>> +    int err = 0;
>>> +    void *dst;
>>> +
>>> +    /* BPF_EXIST is used to force making room for a new element in 
>>> case the
>>> +     * map is full
>>> +     */
>>> +    bool replace = (flags & BPF_EXIST);
>>> +
>>> +    /* Check supported flags for queue and stack maps */
>>> +    if (flags & BPF_NOEXIST || flags > BPF_EXIST)
>>> +        return -EINVAL;
>>> +
>>> +    raw_spin_lock_irqsave(&qs->lock, irq_flags);
>>> +
>>> +    if (queue_stack_map_is_full(qs)) {
>>> +        if (!replace) {
>>> +            err = -E2BIG;
>> ENOSPC is probably more accurate ?
> Agree.

Well, actually I don't, I just realized that other maps are returning 
E2BIG when the max_entries is reached so probably we want to keep equal 
across all maps.

>>> +            goto out;
>>> +        }
>>> +        /* advance tail pointer to overwrite oldest element */
>> 'oldest' is ambiguous here.
>> For queue it's true, but for stack it's the last element.
>> Since stack is poping from the head, push w/exist flag will keep
>> overwriting the last element.
>
> No actually, pushing with exist flag advances the tail index, hence 
> overwrites the oldest element in a stack map.
> It is like shifting the whole stack a position.
>
>
>> Pls explain it more clearly in helper description in bpf.h
>
> Will do.
>>
>>> +        qs->tail = (qs->tail + 1) & qs->index_mask;
>>> +        qs->count--;
>>> +    }
>>> +
>>> +    dst = &qs->elements[qs->head * qs->map.value_size];
>>> +    memcpy(dst, value, qs->map.value_size);
>>> +
>>> +    qs->head = (qs->head + 1) & qs->index_mask;
>>> +    qs->count++;
>>> +
>>> +out:
>>> +    raw_spin_unlock_irqrestore(&qs->lock, irq_flags);
>>> +    return err;
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static void *queue_stack_map_lookup_elem(struct bpf_map *map, void 
>>> *key)
>>> +{
>>> +    return NULL;
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static int queue_stack_map_update_elem(struct bpf_map *map, void *key,
>>> +                       void *value, u64 flags)
>>> +{
>>> +    return -EINVAL;
>>> +}
>>> +
>>> +/* Called from syscall or from eBPF program */
>>> +static int queue_stack_map_delete_elem(struct bpf_map *map, void *key)
>>> +{
>>> +    return -EINVAL;
>>> +}
>>> +
>>> +/* Called from syscall */
>>> +static int queue_stack_map_get_next_key(struct bpf_map *map, void 
>>> *key,
>>> +                    void *next_key)
>>> +{
>>> +    return -EINVAL;
>>> +}
>>> +
>>> +const struct bpf_map_ops queue_map_ops = {
>>> +    .map_alloc_check = queue_stack_map_alloc_check,
>>> +    .map_alloc = queue_stack_map_alloc,
>>> +    .map_free = queue_stack_map_free,
>>> +    .map_lookup_elem = queue_stack_map_lookup_elem,
>>> +    .map_update_elem = queue_stack_map_update_elem,
>>> +    .map_delete_elem = queue_stack_map_delete_elem,
>>> +    .map_push_elem = queue_stack_map_push_elem,
>>> +    .map_pop_elem = queue_map_pop_elem,
>>> +    .map_peek_elem = queue_map_peek_elem,
>>> +    .map_get_next_key = queue_stack_map_get_next_key,
>>> +};
>>> +
>>> +const struct bpf_map_ops stack_map_ops = {
>>> +    .map_alloc_check = queue_stack_map_alloc_check,
>>> +    .map_alloc = queue_stack_map_alloc,
>>> +    .map_free = queue_stack_map_free,
>>> +    .map_lookup_elem = queue_stack_map_lookup_elem,
>>> +    .map_update_elem = queue_stack_map_update_elem,
>>> +    .map_delete_elem = queue_stack_map_delete_elem,
>>> +    .map_push_elem = queue_stack_map_push_elem,
>>> +    .map_pop_elem = stack_map_pop_elem,
>>> +    .map_peek_elem = stack_map_peek_elem,
>>> +    .map_get_next_key = queue_stack_map_get_next_key,
>>> +};
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 50957e243bfb..c46bf2d38be3 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -727,6 +727,9 @@ static int map_lookup_elem(union bpf_attr *attr)
>>>           err = bpf_fd_htab_map_lookup_elem(map, key, value);
>>>       } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
>>>           err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
>>> +    } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>> +           map->map_type == BPF_MAP_TYPE_STACK) {
>>> +        err = map->ops->map_peek_elem(map, value);
>>>       } else {
>>>           rcu_read_lock();
>>>           ptr = map->ops->map_lookup_elem(map, key);
>>> @@ -841,6 +844,9 @@ static int map_update_elem(union bpf_attr *attr)
>>>           /* rcu_read_lock() is not needed */
>>>           err = bpf_fd_reuseport_array_update_elem(map, key, value,
>>>                                attr->flags);
>>> +    } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>> +           map->map_type == BPF_MAP_TYPE_STACK) {
>>> +        err = map->ops->map_push_elem(map, value, attr->flags);
>>>       } else {
>>>           rcu_read_lock();
>>>           err = map->ops->map_update_elem(map, key, value, 
>>> attr->flags);
>>> @@ -1001,11 +1007,6 @@ static int map_lookup_and_delete_elem(union 
>>> bpf_attr *attr)
>>>           goto err_put;
>>>       }
>>>   -    if (!map->ops->map_lookup_and_delete_elem) {
>>> -        err = -ENOTSUPP;
>>> -        goto err_put;
>>> -    }
>>> -
>>>       key = __bpf_copy_key(ukey, map->key_size);
>>>       if (IS_ERR(key)) {
>>>           err = PTR_ERR(key);
>>> @@ -1028,12 +1029,22 @@ static int map_lookup_and_delete_elem(union 
>>> bpf_attr *attr)
>>>        */
>>>       preempt_disable();
>>>       __this_cpu_inc(bpf_prog_active);
>>> -    rcu_read_lock();
>>> -    ptr = map->ops->map_lookup_and_delete_elem(map, key);
>>> -    if (ptr)
>>> -        memcpy(value, ptr, value_size);
>>> -    rcu_read_unlock();
>>> +    if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>> +        map->map_type == BPF_MAP_TYPE_STACK) {
>>> +        err = map->ops->map_pop_elem(map, value);
>> please clean up the paches, so that patch 4 doesn't immediately
>> deletes the lines that were introduced in patch 3.
>> Otherwise what was the point of them in patch 3?
>
> Actually there are not deleted but moved, anyway, I will clean it up a 
> little bit to move to a closer place.
>>
>>> +    } else {
>>> +        if (!map->ops->map_lookup_and_delete_elem) {
>>> +            err = -ENOTSUPP;
>>> +            goto free_value;
>>> +        }
>>> +        rcu_read_lock();
>>> +        ptr = map->ops->map_lookup_and_delete_elem(map, key);
>>> +        if (ptr)
>>> +            memcpy(value, ptr, value_size);
>>> +        rcu_read_unlock();
>>>           err = ptr ? 0 : -ENOENT;
>>> +    }
>>> +
>>>       __this_cpu_dec(bpf_prog_active);
>>>       preempt_enable();
>>>   diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index 73c81bef6ae8..489667f93061 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -2121,7 +2121,8 @@ static int check_func_arg(struct 
>>> bpf_verifier_env *env, u32 regno,
>>>       }
>>>         if (arg_type == ARG_PTR_TO_MAP_KEY ||
>>> -        arg_type == ARG_PTR_TO_MAP_VALUE) {
>>> +        arg_type == ARG_PTR_TO_MAP_VALUE ||
>>> +        arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE) {
>>>           expected_type = PTR_TO_STACK;
>>>           if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
>>>               type != expected_type)
>>> @@ -2191,7 +2192,8 @@ static int check_func_arg(struct 
>>> bpf_verifier_env *env, u32 regno,
>>>           err = check_helper_mem_access(env, regno,
>>>                             meta->map_ptr->key_size, false,
>>>                             NULL);
>>> -    } else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
>>> +    } else if (arg_type == ARG_PTR_TO_MAP_VALUE ||
>>> +           arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE) {
>>>           /* bpf_map_xxx(..., map_ptr, ..., value) call:
>>>            * check [value, value + map->value_size) validity
>>>            */
>>> @@ -2200,9 +2202,10 @@ static int check_func_arg(struct 
>>> bpf_verifier_env *env, u32 regno,
>>>               verbose(env, "invalid map_ptr to access map->value\n");
>>>               return -EACCES;
>>>           }
>>> +        meta->raw_mode = (arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE);
>>>           err = check_helper_mem_access(env, regno,
>>>                             meta->map_ptr->value_size, false,
>>> -                          NULL);
>>> +                          meta);
>>>       } else if (arg_type_is_mem_size(arg_type)) {
>>>           bool zero_size_allowed = (arg_type == 
>>> ARG_CONST_SIZE_OR_ZERO);
>>>   @@ -2676,7 +2679,10 @@ record_func_map(struct bpf_verifier_env 
>>> *env, struct bpf_call_arg_meta *meta,
>>>       if (func_id != BPF_FUNC_tail_call &&
>>>           func_id != BPF_FUNC_map_lookup_elem &&
>>>           func_id != BPF_FUNC_map_update_elem &&
>>> -        func_id != BPF_FUNC_map_delete_elem)
>>> +        func_id != BPF_FUNC_map_delete_elem &&
>>> +        func_id != BPF_FUNC_map_push_elem &&
>>> +        func_id != BPF_FUNC_map_pop_elem &&
>>> +        func_id != BPF_FUNC_map_peek_elem)
>>>           return 0;
>>>         if (meta->map_ptr == NULL) {
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 591c698bc517..40736e0d9cff 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -4993,6 +4993,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>>>           return &bpf_map_update_elem_proto;
>>>       case BPF_FUNC_map_delete_elem:
>>>           return &bpf_map_delete_elem_proto;
>>> +    case BPF_FUNC_map_push_elem:
>>> +        return &bpf_map_push_elem_proto;
>>> +    case BPF_FUNC_map_pop_elem:
>>> +        return &bpf_map_pop_elem_proto;
>>> +    case BPF_FUNC_map_peek_elem:
>>> +        return &bpf_map_peek_elem_proto;
>>>       case BPF_FUNC_get_prandom_u32:
>>>           return &bpf_get_prandom_u32_proto;
>>>       case BPF_FUNC_get_smp_processor_id:
>>>
>

^ permalink raw reply

* Re: [PATCH] net/packet: fix packet drop as of virtio gso
From: Jason Wang @ 2018-10-08  3:14 UTC (permalink / raw)
  To: Jianfeng Tan, netdev; +Cc: davem, mst
In-Reply-To: <20180929154127.20867-1-jianfeng.tan@linux.alibaba.com>



On 2018年09月29日 23:41, Jianfeng Tan wrote:
> When we use raw socket as the vhost backend, a packet from virito with
> gso offloading information, cannot be sent out in later validaton at
> xmit path, as we did not set correct skb->protocol which is further used
> for looking up the gso function.

Hi:

May I ask the reason for using raw socket for vhost? It was not a common 
setup with little care in the past few years. And it was slow since it 
lacks some recent improvements. Can it be replaced with e.g macvtap?

Thanks

>
> To fix this, we set this field according to virito hdr information.
>
> Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jianfeng Tan <jianfeng.tan@linux.alibaba.com>
> ---
>   include/linux/virtio_net.h | 18 ++++++++++++++++++
>   net/packet/af_packet.c     | 11 +++++++----
>   2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 9397628a1967..cb462f9ab7dd 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -5,6 +5,24 @@
>   #include <linux/if_vlan.h>
>   #include <uapi/linux/virtio_net.h>
>   
> +static inline int virtio_net_hdr_set_proto(struct sk_buff *skb,
> +					   const struct virtio_net_hdr *hdr)
> +{
> +	switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
> +	case VIRTIO_NET_HDR_GSO_TCPV4:
> +	case VIRTIO_NET_HDR_GSO_UDP:
> +		skb->protocol = cpu_to_be16(ETH_P_IP);
> +		break;
> +	case VIRTIO_NET_HDR_GSO_TCPV6:
> +		skb->protocol = cpu_to_be16(ETH_P_IPV6);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>   static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>   					const struct virtio_net_hdr *hdr,
>   					bool little_endian)
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 75c92a87e7b2..d6e94dc7e290 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2715,10 +2715,12 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>   			}
>   		}
>   
> -		if (po->has_vnet_hdr && virtio_net_hdr_to_skb(skb, vnet_hdr,
> -							      vio_le())) {
> -			tp_len = -EINVAL;
> -			goto tpacket_error;
> +		if (po->has_vnet_hdr) {
> +			if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) {
> +				tp_len = -EINVAL;
> +				goto tpacket_error;
> +			}
> +			virtio_net_hdr_set_proto(skb, vnet_hdr);
>   		}
>   
>   		skb->destructor = tpacket_destruct_skb;
> @@ -2915,6 +2917,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>   		if (err)
>   			goto out_free;
>   		len += sizeof(vnet_hdr);
> +		virtio_net_hdr_set_proto(skb, &vnet_hdr);
>   	}
>   
>   	skb_probe_transport_header(skb, reserve);

^ permalink raw reply

* [RESEND PATCH v2 5/5] MIPS: mscc: add PCB120 to the ocelot fitImage
From: Quentin Schulz @ 2018-10-08 10:14 UTC (permalink / raw)
  To: alexandre.belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-mips, devicetree, linux-kernel, netdev,
	thomas.petazzoni, antoine.tenart, Quentin Schulz
In-Reply-To: <20181008101445.25946-1-quentin.schulz@bootlin.com>

PCB120 and PCB123 are both development boards based on Microsemi Ocelot
so let's use the same fitImage for both.

Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 arch/mips/generic/Kconfig                       |  6 +++---
 arch/mips/generic/Platform                      |  2 +-
 ...d-ocelot_pcb123.its.S => board-ocelot.its.S} | 17 +++++++++++++++++
 3 files changed, 21 insertions(+), 4 deletions(-)
 rename arch/mips/generic/{board-ocelot_pcb123.its.S => board-ocelot.its.S} (55%)

diff --git a/arch/mips/generic/Kconfig b/arch/mips/generic/Kconfig
index 08e33c6b2539..fd6019802657 100644
--- a/arch/mips/generic/Kconfig
+++ b/arch/mips/generic/Kconfig
@@ -65,11 +65,11 @@ config FIT_IMAGE_FDT_XILFPGA
 	  Enable this to include the FDT for the MIPSfpga platform
 	  from Imagination Technologies in the FIT kernel image.
 
-config FIT_IMAGE_FDT_OCELOT_PCB123
-	bool "Include FDT for Microsemi Ocelot PCB123"
+config FIT_IMAGE_FDT_OCELOT
+	bool "Include FDT for Microsemi Ocelot development platforms"
 	select MSCC_OCELOT
 	help
-	  Enable this to include the FDT for the Ocelot PCB123 platform
+	  Enable this to include the FDT for the Ocelot development platforms
 	  from Microsemi in the FIT kernel image.
 	  This requires u-boot on the platform.
 
diff --git a/arch/mips/generic/Platform b/arch/mips/generic/Platform
index 879cb80396c8..eaa19d189324 100644
--- a/arch/mips/generic/Platform
+++ b/arch/mips/generic/Platform
@@ -16,5 +16,5 @@ all-$(CONFIG_MIPS_GENERIC)	:= vmlinux.gz.itb
 its-y					:= vmlinux.its.S
 its-$(CONFIG_FIT_IMAGE_FDT_BOSTON)	+= board-boston.its.S
 its-$(CONFIG_FIT_IMAGE_FDT_NI169445)	+= board-ni169445.its.S
-its-$(CONFIG_FIT_IMAGE_FDT_OCELOT_PCB123) += board-ocelot_pcb123.its.S
+its-$(CONFIG_FIT_IMAGE_FDT_OCELOT)	+= board-ocelot.its.S
 its-$(CONFIG_FIT_IMAGE_FDT_XILFPGA)	+= board-xilfpga.its.S
diff --git a/arch/mips/generic/board-ocelot_pcb123.its.S b/arch/mips/generic/board-ocelot.its.S
similarity index 55%
rename from arch/mips/generic/board-ocelot_pcb123.its.S
rename to arch/mips/generic/board-ocelot.its.S
index 5a7d5e1c878a..3da23988149a 100644
--- a/arch/mips/generic/board-ocelot_pcb123.its.S
+++ b/arch/mips/generic/board-ocelot.its.S
@@ -11,6 +11,17 @@
 				algo = "sha1";
 			};
 		};
+
+		fdt@ocelot_pcb120 {
+			description = "MSCC Ocelot PCB120 Device Tree";
+			data = /incbin/("boot/dts/mscc/ocelot_pcb120.dtb");
+			type = "flat_dt";
+			arch = "mips";
+			compression = "none";
+			hash@0 {
+				algo = "sha1";
+			};
+		};
 	};
 
 	configurations {
@@ -19,5 +30,11 @@
 			kernel = "kernel@0";
 			fdt = "fdt@ocelot_pcb123";
 		};
+
+		conf@ocelot_pcb120 {
+			description = "Ocelot Linux kernel";
+			kernel = "kernel@0";
+			fdt = "fdt@ocelot_pcb120";
+		};
 	};
 };
-- 
2.17.1

^ permalink raw reply related

* [RESEND PATCH net-next v2 3/5] net: phy: mscc: add support for VSC8574 PHY
From: Quentin Schulz @ 2018-10-08 10:14 UTC (permalink / raw)
  To: alexandre.belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-mips, devicetree, linux-kernel, netdev,
	thomas.petazzoni, antoine.tenart, Quentin Schulz
In-Reply-To: <20181008101445.25946-1-quentin.schulz@bootlin.com>

The VSC8574 PHY is a 4-ports PHY that is 10/100/1000BASE-T, 100BASE-FX,
1000BASE-X and triple-speed copper SFP capable, can communicate with
the MAC via SGMII, QSGMII or 1000BASE-X, supports WOL, downshifting and
can set the blinking pattern of each of its 4 LEDs, supports SyncE as
well as HP Auto-MDIX detection.

This adds support for 10/100/1000BASE-T, SGMII/QSGMII link with the MAC,
WOL, downshifting, HP Auto-MDIX detection and blinking pattern for its 4
LEDs.

The VSC8574 has also an internal Intel 8051 microcontroller whose
firmware needs to be patched when the PHY is reset. If the 8051's
firmware has the expected CRC, its patching can be skipped. The
microcontroller can be accessed from any port of the PHY, though the CRC
function can only be done through the PHY that is the base PHY of the
package (internal address 0) due to a limitation of the firmware.

The GPIO register bank is a set of registers that are common to all PHYs
in the package. So any modification in any register of this bank affects
all PHYs of the package.

If the PHYs haven't been reset before booting the Linux kernel and were
configured to use interrupts for e.g. link status updates, it is
required to clear the interrupts mask register of all PHYs before being
able to use interrupts with any PHY. The first PHY of the package that
will be init will take care of clearing all PHYs interrupts mask
registers. Thus, we need to keep track of the init sequence in the
package, if it's already been done or if it's to be done.

Most of the init sequence of a PHY of the package is common to all PHYs
in the package, thus we use the SMI broadcast feature which enables us
to propagate a write in one register of one PHY to all PHYs in the same
package.

Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 drivers/net/phy/mscc.c | 320 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 319 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index 35292cfd4979..bffe077dc75f 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -65,6 +65,8 @@ enum rgmii_rx_clock_delay {
 #define MEDIA_OP_MODE_AMS_COPPER_100BASEFX	7
 #define MEDIA_OP_MODE_POS		  8
 
+#define MSCC_PHY_EXT_PHY_CNTL_2		  24
+
 #define MII_VSC85XX_INT_MASK		  25
 #define MII_VSC85XX_INT_MASK_MASK	  0xa000
 #define MII_VSC85XX_INT_MASK_WOL	  0x0040
@@ -151,6 +153,7 @@ enum rgmii_rx_clock_delay {
 #define DW8051_CLK_EN			  0x0010
 #define MICRO_CLK_EN			  0x0008
 #define MICRO_CLK_DIVIDE(x)		  ((x) >> 1)
+#define MSCC_DW8051_VLD_MASK		  0xf1ff
 
 /* x Address in range 1-4 */
 #define MSCC_TRAP_ROM_ADDR(x)		  ((x) * 2 + 1)
@@ -184,7 +187,9 @@ enum rgmii_rx_clock_delay {
 #define PROC_CMD_SGMII_MAC		  0x0030
 #define PROC_CMD_QSGMII_MAC		  0x0020
 #define PROC_CMD_NO_MAC_CONF		  0x0000
+#define PROC_CMD_1588_DEFAULT_INIT	  0x0010
 #define PROC_CMD_NOP			  0x000f
+#define PROC_CMD_PHY_INIT		  0x000a
 #define PROC_CMD_CRC16			  0x0008
 #define PROC_CMD_FIBER_MEDIA_CONF	  0x0001
 #define PROC_CMD_MCB_ACCESS_MAC_CONF	  0x0000
@@ -198,6 +203,9 @@ enum rgmii_rx_clock_delay {
 /* Test page Registers */
 #define MSCC_PHY_TEST_PAGE_5		  5
 #define MSCC_PHY_TEST_PAGE_8		  8
+#define MSCC_PHY_TEST_PAGE_9		  9
+#define MSCC_PHY_TEST_PAGE_20		  20
+#define MSCC_PHY_TEST_PAGE_24		  24
 
 /* Token ring page Registers */
 #define MSCC_PHY_TR_CNTL		  16
@@ -211,6 +219,7 @@ enum rgmii_rx_clock_delay {
 #define PHY_ID_VSC8531			  0x00070570
 #define PHY_ID_VSC8540			  0x00070760
 #define PHY_ID_VSC8541			  0x00070770
+#define PHY_ID_VSC8574			  0x000704a0
 #define PHY_ID_VSC8584			  0x000707c0
 
 #define MSCC_VDDMAC_1500		  1500
@@ -258,6 +267,10 @@ enum rgmii_rx_clock_delay {
 #define MSCC_VSC8584_REVB_INT8051_FW_START_ADDR	0xe800
 #define MSCC_VSC8584_REVB_INT8051_FW_CRC	0xfb48
 
+#define MSCC_VSC8574_REVB_INT8051_FW		"mscc_vsc8574_revb_int8051_29e8.bin"
+#define MSCC_VSC8574_REVB_INT8051_FW_START_ADDR	0x4000
+#define MSCC_VSC8574_REVB_INT8051_FW_CRC	0x29e8
+
 #define VSC8584_REVB				0x0001
 #define MSCC_DEV_REV_MASK			GENMASK(3, 0)
 
@@ -1087,6 +1100,250 @@ static int vsc8584_patch_fw(struct phy_device *phydev,
 	return 0;
 }
 
+/* bus->mdio_lock should be locked when using this function */
+static bool vsc8574_is_serdes_init(struct phy_device *phydev)
+{
+	u16 reg;
+	bool ret;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+		       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+	reg = phy_base_read(phydev, MSCC_TRAP_ROM_ADDR(1));
+	if (reg != 0x3eb7) {
+		ret = false;
+		goto out;
+	}
+
+	reg = phy_base_read(phydev, MSCC_PATCH_RAM_ADDR(1));
+	if (reg != 0x4012) {
+		ret = false;
+		goto out;
+	}
+
+	reg = phy_base_read(phydev, MSCC_INT_MEM_CNTL);
+	if (reg != EN_PATCH_RAM_TRAP_ADDR(1)) {
+		ret = false;
+		goto out;
+	}
+
+	reg = phy_base_read(phydev, MSCC_DW8051_CNTL_STATUS);
+	if ((MICRO_NSOFT_RESET | RUN_FROM_INT_ROM |  DW8051_CLK_EN |
+	     MICRO_CLK_EN) != (reg & MSCC_DW8051_VLD_MASK)) {
+		ret = false;
+		goto out;
+	}
+
+	ret = true;
+out:
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	return ret;
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static int vsc8574_config_pre_init(struct phy_device *phydev)
+{
+	const struct reg_val pre_init1[] = {
+		{0x0fae, 0x000401bd},
+		{0x0fac, 0x000f000f},
+		{0x17a0, 0x00a0f147},
+		{0x0fe4, 0x00052f54},
+		{0x1792, 0x0027303d},
+		{0x07fe, 0x00000704},
+		{0x0fe0, 0x00060150},
+		{0x0f82, 0x0012b00a},
+		{0x0f80, 0x00000d74},
+		{0x02e0, 0x00000012},
+		{0x03a2, 0x00050208},
+		{0x03b2, 0x00009186},
+		{0x0fb0, 0x000e3700},
+		{0x1688, 0x00049f81},
+		{0x0fd2, 0x0000ffff},
+		{0x168a, 0x00039fa2},
+		{0x1690, 0x0020640b},
+		{0x0258, 0x00002220},
+		{0x025a, 0x00002a20},
+		{0x025c, 0x00003060},
+		{0x025e, 0x00003fa0},
+		{0x03a6, 0x0000e0f0},
+		{0x0f92, 0x00001489},
+		{0x16a2, 0x00007000},
+		{0x16a6, 0x00071448},
+		{0x16a0, 0x00eeffdd},
+		{0x0fe8, 0x0091b06c},
+		{0x0fea, 0x00041600},
+		{0x16b0, 0x00eeff00},
+		{0x16b2, 0x00007000},
+		{0x16b4, 0x00000814},
+		{0x0f90, 0x00688980},
+		{0x03a4, 0x0000d8f0},
+		{0x0fc0, 0x00000400},
+		{0x07fa, 0x0050100f},
+		{0x0796, 0x00000003},
+		{0x07f8, 0x00c3ff98},
+		{0x0fa4, 0x0018292a},
+		{0x168c, 0x00d2c46f},
+		{0x17a2, 0x00000620},
+		{0x16a4, 0x0013132f},
+		{0x16a8, 0x00000000},
+		{0x0ffc, 0x00c0a028},
+		{0x0fec, 0x00901c09},
+		{0x0fee, 0x0004a6a1},
+		{0x0ffe, 0x00b01807},
+	};
+	const struct reg_val pre_init2[] = {
+		{0x0486, 0x0008a518},
+		{0x0488, 0x006dc696},
+		{0x048a, 0x00000912},
+		{0x048e, 0x00000db6},
+		{0x049c, 0x00596596},
+		{0x049e, 0x00000514},
+		{0x04a2, 0x00410280},
+		{0x04a4, 0x00000000},
+		{0x04a6, 0x00000000},
+		{0x04a8, 0x00000000},
+		{0x04aa, 0x00000000},
+		{0x04ae, 0x007df7dd},
+		{0x04b0, 0x006d95d4},
+		{0x04b2, 0x00492410},
+	};
+	struct device *dev = &phydev->mdio.dev;
+	const struct firmware *fw;
+	unsigned int i;
+	u16 crc, reg;
+	bool serdes_init;
+	int ret;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	/* all writes below are broadcasted to all PHYs in the same package */
+	reg = phy_base_read(phydev, MSCC_PHY_EXT_CNTL_STATUS);
+	reg |= SMI_BROADCAST_WR_EN;
+	phy_base_write(phydev, MSCC_PHY_EXT_CNTL_STATUS, reg);
+
+	phy_base_write(phydev, MII_VSC85XX_INT_MASK, 0);
+
+	/* The below register writes are tweaking analog and electrical
+	 * configuration that were determined through characterization by PHY
+	 * engineers. These don't mean anything more than "these are the best
+	 * values".
+	 */
+	phy_base_write(phydev, MSCC_PHY_EXT_PHY_CNTL_2, 0x0040);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TEST);
+
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_20, 0x4320);
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_24, 0x0c00);
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_9, 0x18ca);
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_5, 0x1b20);
+
+	reg = phy_base_read(phydev, MSCC_PHY_TEST_PAGE_8);
+	reg |= 0x8000;
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_8, reg);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TR);
+
+	for (i = 0; i < ARRAY_SIZE(pre_init1); i++)
+		vsc8584_csr_write(phydev, pre_init1[i].reg, pre_init1[i].val);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_EXTENDED_2);
+
+	phy_base_write(phydev, MSCC_PHY_CU_PMD_TX_CNTL, 0x028e);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TR);
+
+	for (i = 0; i < ARRAY_SIZE(pre_init2); i++)
+		vsc8584_csr_write(phydev, pre_init2[i].reg, pre_init2[i].val);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TEST);
+
+	reg = phy_base_read(phydev, MSCC_PHY_TEST_PAGE_8);
+	reg &= ~0x8000;
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_8, reg);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	/* end of write broadcasting */
+	reg = phy_base_read(phydev, MSCC_PHY_EXT_CNTL_STATUS);
+	reg &= ~SMI_BROADCAST_WR_EN;
+	phy_base_write(phydev, MSCC_PHY_EXT_CNTL_STATUS, reg);
+
+	ret = request_firmware(&fw, MSCC_VSC8574_REVB_INT8051_FW, dev);
+	if (ret) {
+		dev_err(dev, "failed to load firmware %s, ret: %d\n",
+			MSCC_VSC8574_REVB_INT8051_FW, ret);
+		return ret;
+	}
+
+	/* Add one byte to size for the one added by the patch_fw function */
+	ret = vsc8584_get_fw_crc(phydev,
+				 MSCC_VSC8574_REVB_INT8051_FW_START_ADDR,
+				 fw->size + 1, &crc);
+	if (ret)
+		goto out;
+
+	if (crc == MSCC_VSC8574_REVB_INT8051_FW_CRC) {
+		serdes_init = vsc8574_is_serdes_init(phydev);
+
+		if (!serdes_init) {
+			ret = vsc8584_micro_assert_reset(phydev);
+			if (ret) {
+				dev_err(dev,
+					"%s: failed to assert reset of micro\n",
+					__func__);
+				return ret;
+			}
+		}
+	} else {
+		dev_dbg(dev, "FW CRC is not the expected one, patching FW\n");
+
+		serdes_init = false;
+
+		if (vsc8584_patch_fw(phydev, fw))
+			dev_warn(dev,
+				 "failed to patch FW, expect non-optimal device\n");
+	}
+
+	if (!serdes_init) {
+		phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+			       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+		phy_base_write(phydev, MSCC_TRAP_ROM_ADDR(1), 0x3eb7);
+		phy_base_write(phydev, MSCC_PATCH_RAM_ADDR(1), 0x4012);
+		phy_base_write(phydev, MSCC_INT_MEM_CNTL,
+			       EN_PATCH_RAM_TRAP_ADDR(1));
+
+		vsc8584_micro_deassert_reset(phydev, false);
+
+		/* Add one byte to size for the one added by the patch_fw
+		 * function
+		 */
+		ret = vsc8584_get_fw_crc(phydev,
+					 MSCC_VSC8574_REVB_INT8051_FW_START_ADDR,
+					 fw->size + 1, &crc);
+		if (ret)
+			goto out;
+
+		if (crc != MSCC_VSC8574_REVB_INT8051_FW_CRC)
+			dev_warn(dev,
+				 "FW CRC after patching is not the expected one, expect non-optimal device\n");
+	}
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+		       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+	ret = vsc8584_cmd(phydev, PROC_CMD_1588_DEFAULT_INIT |
+			  PROC_CMD_PHY_INIT);
+
+out:
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	release_firmware(fw);
+
+	return ret;
+}
+
 /* bus->mdio_lock should be locked when using this function */
 static int vsc8584_config_pre_init(struct phy_device *phydev)
 {
@@ -1310,7 +1567,15 @@ static int vsc8584_config_init(struct phy_device *phydev)
 	 * in this pre-init function.
 	 */
 	if (!vsc8584_is_pkg_init(phydev, val & PHY_ADDR_REVERSED ? 1 : 0)) {
-		ret = vsc8584_config_pre_init(phydev);
+		if ((phydev->phy_id & phydev->drv->phy_id_mask) ==
+		    (PHY_ID_VSC8574 & phydev->drv->phy_id_mask))
+			ret = vsc8574_config_pre_init(phydev);
+		else if ((phydev->phy_id & phydev->drv->phy_id_mask) ==
+			 (PHY_ID_VSC8584 & phydev->drv->phy_id_mask))
+			ret = vsc8584_config_pre_init(phydev);
+		else
+			ret = -EINVAL;
+
 		if (ret)
 			goto err;
 	}
@@ -1476,6 +1741,31 @@ static int vsc85xx_read_status(struct phy_device *phydev)
 	return genphy_read_status(phydev);
 }
 
+static int vsc8574_probe(struct phy_device *phydev)
+{
+	struct vsc8531_private *vsc8531;
+	u32 default_mode[4] = {VSC8531_LINK_1000_ACTIVITY,
+	   VSC8531_LINK_100_ACTIVITY, VSC8531_LINK_ACTIVITY,
+	   VSC8531_DUPLEX_COLLISION};
+
+	vsc8531 = devm_kzalloc(&phydev->mdio.dev, sizeof(*vsc8531), GFP_KERNEL);
+	if (!vsc8531)
+		return -ENOMEM;
+
+	phydev->priv = vsc8531;
+
+	vsc8531->nleds = 4;
+	vsc8531->supp_led_modes = VSC8584_SUPP_LED_MODES;
+	vsc8531->hw_stats = vsc8584_hw_stats;
+	vsc8531->nstats = ARRAY_SIZE(vsc8584_hw_stats);
+	vsc8531->stats = devm_kmalloc_array(&phydev->mdio.dev, vsc8531->nstats,
+					    sizeof(u64), GFP_KERNEL);
+	if (!vsc8531->stats)
+		return -ENOMEM;
+
+	return vsc85xx_dt_led_modes_get(phydev, default_mode);
+}
+
 static int vsc8584_probe(struct phy_device *phydev)
 {
 	struct vsc8531_private *vsc8531;
@@ -1642,6 +1932,33 @@ static struct phy_driver vsc85xx_driver[] = {
 	.get_strings    = &vsc85xx_get_strings,
 	.get_stats      = &vsc85xx_get_stats,
 },
+{
+	.phy_id		= PHY_ID_VSC8574,
+	.name		= "Microsemi GE VSC8574 SyncE",
+	.phy_id_mask	= 0xfffffff0,
+	.features	= PHY_GBIT_FEATURES,
+	.flags		= PHY_HAS_INTERRUPT,
+	.soft_reset	= &genphy_soft_reset,
+	.config_init    = &vsc8584_config_init,
+	.config_aneg    = &vsc85xx_config_aneg,
+	.aneg_done	= &genphy_aneg_done,
+	.read_status	= &vsc85xx_read_status,
+	.ack_interrupt  = &vsc85xx_ack_interrupt,
+	.config_intr    = &vsc85xx_config_intr,
+	.did_interrupt  = &vsc8584_did_interrupt,
+	.suspend	= &genphy_suspend,
+	.resume		= &genphy_resume,
+	.probe		= &vsc8574_probe,
+	.set_wol	= &vsc85xx_wol_set,
+	.get_wol	= &vsc85xx_wol_get,
+	.get_tunable	= &vsc85xx_get_tunable,
+	.set_tunable	= &vsc85xx_set_tunable,
+	.read_page	= &vsc85xx_phy_read_page,
+	.write_page	= &vsc85xx_phy_write_page,
+	.get_sset_count = &vsc85xx_get_sset_count,
+	.get_strings    = &vsc85xx_get_strings,
+	.get_stats      = &vsc85xx_get_stats,
+},
 {
 	.phy_id		= PHY_ID_VSC8584,
 	.name		= "Microsemi GE VSC8584 SyncE",
@@ -1677,6 +1994,7 @@ static struct mdio_device_id __maybe_unused vsc85xx_tbl[] = {
 	{ PHY_ID_VSC8531, 0xfffffff0, },
 	{ PHY_ID_VSC8540, 0xfffffff0, },
 	{ PHY_ID_VSC8541, 0xfffffff0, },
+	{ PHY_ID_VSC8574, 0xfffffff0, },
 	{ PHY_ID_VSC8584, 0xfffffff0, },
 	{ }
 };
-- 
2.17.1

^ permalink raw reply related

* [RESEND PATCH net-next v2 2/5] net: phy: mscc: add support for VSC8584 PHY
From: Quentin Schulz @ 2018-10-08 10:14 UTC (permalink / raw)
  To: alexandre.belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-mips, devicetree, linux-kernel, netdev,
	thomas.petazzoni, antoine.tenart, Quentin Schulz
In-Reply-To: <20181008101445.25946-1-quentin.schulz@bootlin.com>

The VSC8584 PHY is a 4-ports PHY that is 10/100/1000BASE-T, 100BASE-FX,
1000BASE-X and triple-speed copper SFP capable, can communicate with the
MAC via SGMII, QSGMII or 1000BASE-X, supports downshifting and can set
the blinking pattern of each of its 4 LEDs, supports hardware offloading
of MACsec and supports SyncE as well as HP Auto-MDIX detection.

This adds support for 10/100/1000BASE-T, SGMII/QSGMII link with the MAC,
downshifting, HP Auto-MDIX detection and blinking pattern for its 4
LEDs.

The VSC8584 has also an internal Intel 8051 microcontroller whose
firmware needs to be patched when the PHY is reset. If the 8051's
firmware has the expected CRC, its patching can be skipped. The
microcontroller can be accessed from any port of the PHY, though the CRC
function can only be done through the PHY that is the base PHY of the
package (internal address 0) due to a limitation of the firmware.

The GPIO register bank is a set of registers that are common to all PHYs
in the package. So any modification in any register of this bank affects
all PHYs of the package.

If the PHYs haven't been reset before booting the Linux kernel and were
configured to use interrupts for e.g. link status updates, it is
required to clear the interrupts mask register of all PHYs before being
able to use interrupts with any PHY. The first PHY of the package that
will be init will take care of clearing all PHYs interrupts mask
registers. Thus, we need to keep track of the init sequence in the
package, if it's already been done or if it's to be done.

Most of the init sequence of a PHY of the package is common to all PHYs
in the package, thus we use the SMI broadcast feature which enables us
to propagate a write in one register of one PHY to all PHYs in the same
package.

The revA of the VSC8584 PHY (which is not and will not be publicly
released) should NOT patch the firmware of the microcontroller or it'll
make things worse, the easiest way is just to not support it.

Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 drivers/net/phy/mscc.c | 747 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 747 insertions(+)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index 7ae3e644a18f..35292cfd4979 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -6,6 +6,8 @@
  * Copyright (c) 2016 Microsemi Corporation
  */
 
+#include <linux/firmware.h>
+#include <linux/jiffies.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mdio.h>
@@ -32,6 +34,10 @@ enum rgmii_rx_clock_delay {
 #define DISABLE_HP_AUTO_MDIX_MASK	  0x0080
 #define DISABLE_PAIR_SWAP_CORR_MASK	  0x0020
 #define DISABLE_POLARITY_CORR_MASK	  0x0010
+#define PARALLEL_DET_IGNORE_ADVERTISED    0x0008
+
+#define MSCC_PHY_EXT_CNTL_STATUS          22
+#define SMI_BROADCAST_WR_EN		  0x0001
 
 #define MSCC_PHY_ERR_RX_CNT		  19
 #define MSCC_PHY_ERR_FALSE_CARRIER_CNT	  20
@@ -44,7 +50,20 @@ enum rgmii_rx_clock_delay {
 #define MAC_IF_SELECTION_RMII             1
 #define MAC_IF_SELECTION_RGMII            2
 #define MAC_IF_SELECTION_POS              11
+#define VSC8584_MAC_IF_SELECTION_MASK     0x1000
+#define VSC8584_MAC_IF_SELECTION_SGMII    0
+#define VSC8584_MAC_IF_SELECTION_1000BASEX 1
+#define VSC8584_MAC_IF_SELECTION_POS      12
 #define FAR_END_LOOPBACK_MODE_MASK        0x0008
+#define MEDIA_OP_MODE_MASK		  0x0700
+#define MEDIA_OP_MODE_COPPER		  0
+#define MEDIA_OP_MODE_SERDES		  1
+#define MEDIA_OP_MODE_1000BASEX		  2
+#define MEDIA_OP_MODE_100BASEFX		  3
+#define MEDIA_OP_MODE_AMS_COPPER_SERDES	  5
+#define MEDIA_OP_MODE_AMS_COPPER_1000BASEX	6
+#define MEDIA_OP_MODE_AMS_COPPER_100BASEFX	7
+#define MEDIA_OP_MODE_POS		  8
 
 #define MII_VSC85XX_INT_MASK		  25
 #define MII_VSC85XX_INT_MASK_MASK	  0xa000
@@ -67,6 +86,13 @@ enum rgmii_rx_clock_delay {
 #define MSCC_PHY_PAGE_STANDARD		  0x0000 /* Standard registers */
 #define MSCC_PHY_PAGE_EXTENDED		  0x0001 /* Extended registers */
 #define MSCC_PHY_PAGE_EXTENDED_2	  0x0002 /* Extended reg - page 2 */
+#define MSCC_PHY_PAGE_EXTENDED_3	  0x0003 /* Extended reg - page 3 */
+#define MSCC_PHY_PAGE_EXTENDED_4	  0x0004 /* Extended reg - page 4 */
+/* Extended reg - GPIO; this is a bank of registers that are shared for all PHYs
+ * in the same package.
+ */
+#define MSCC_PHY_PAGE_EXTENDED_GPIO	  0x0010 /* Extended reg - GPIO */
+#define MSCC_PHY_PAGE_TEST		  0x2a30 /* Test reg */
 #define MSCC_PHY_PAGE_TR		  0x52b5 /* Token ring registers */
 
 /* Extended Page 1 Registers */
@@ -79,13 +105,21 @@ enum rgmii_rx_clock_delay {
 #define FORCE_MDI_CROSSOVER_MDI		  0x0008
 
 #define MSCC_PHY_ACTIPHY_CNTL		  20
+#define PHY_ADDR_REVERSED		  0x0200
 #define DOWNSHIFT_CNTL_MASK		  0x001C
 #define DOWNSHIFT_EN			  0x0010
 #define DOWNSHIFT_CNTL_POS		  2
 
 #define MSCC_PHY_EXT_PHY_CNTL_4		  23
+#define PHY_CNTL_4_ADDR_POS		  11
+
+#define MSCC_PHY_VERIPHY_CNTL_2		  25
+
+#define MSCC_PHY_VERIPHY_CNTL_3		  26
 
 /* Extended Page 2 Registers */
+#define MSCC_PHY_CU_PMD_TX_CNTL		  16
+
 #define MSCC_PHY_RGMII_CNTL		  20
 #define RGMII_RX_CLK_DELAY_MASK		  0x0070
 #define RGMII_RX_CLK_DELAY_POS		  4
@@ -101,6 +135,70 @@ enum rgmii_rx_clock_delay {
 #define SECURE_ON_ENABLE		  0x8000
 #define SECURE_ON_PASSWD_LEN_4		  0x4000
 
+/* Extended Page 3 Registers */
+#define MSCC_PHY_SERDES_TX_VALID_CNT	  21
+#define MSCC_PHY_SERDES_TX_CRC_ERR_CNT	  22
+#define MSCC_PHY_SERDES_RX_VALID_CNT	  28
+#define MSCC_PHY_SERDES_RX_CRC_ERR_CNT	  29
+
+/* Extended page GPIO Registers */
+#define MSCC_DW8051_CNTL_STATUS		  0
+#define MICRO_NSOFT_RESET		  0x8000
+#define RUN_FROM_INT_ROM		  0x4000
+#define AUTOINC_ADDR			  0x2000
+#define PATCH_RAM_CLK			  0x1000
+#define MICRO_PATCH_EN			  0x0080
+#define DW8051_CLK_EN			  0x0010
+#define MICRO_CLK_EN			  0x0008
+#define MICRO_CLK_DIVIDE(x)		  ((x) >> 1)
+
+/* x Address in range 1-4 */
+#define MSCC_TRAP_ROM_ADDR(x)		  ((x) * 2 + 1)
+#define MSCC_PATCH_RAM_ADDR(x)		  (((x) + 1) * 2)
+#define MSCC_INT_MEM_ADDR		  11
+
+#define MSCC_INT_MEM_CNTL		  12
+#define READ_SFR			  0x6000
+#define READ_PRAM			  0x4000
+#define READ_ROM			  0x2000
+#define READ_RAM			  0x0000
+#define INT_MEM_WRITE_EN		  0x1000
+#define EN_PATCH_RAM_TRAP_ADDR(x)	  (0x0100 << ((x) - 1))
+#define INT_MEM_DATA_M			  0x00ff
+#define INT_MEM_DATA(x)			  (INT_MEM_DATA_M & (x))
+
+#define MSCC_PHY_PROC_CMD		  18
+#define PROC_CMD_NCOMPLETED		  0x8000
+#define PROC_CMD_FAILED			  0x4000
+#define PROC_CMD_SGMII_PORT(x)		  ((x) << 8)
+#define PROC_CMD_FIBER_PORT(x)		  (0x0100 << (x) % 4)
+#define PROC_CMD_QSGMII_PORT		  0x0c00
+#define PROC_CMD_RST_CONF_PORT		  0x0080
+#define PROC_CMD_RECONF_PORT		  0x0000
+#define PROC_CMD_READ_MOD_WRITE_PORT	  0x0040
+#define PROC_CMD_WRITE			  0x0040
+#define PROC_CMD_READ			  0x0000
+#define PROC_CMD_FIBER_DISABLE		  0x0020
+#define PROC_CMD_FIBER_100BASE_FX	  0x0010
+#define PROC_CMD_FIBER_1000BASE_X	  0x0000
+#define PROC_CMD_SGMII_MAC		  0x0030
+#define PROC_CMD_QSGMII_MAC		  0x0020
+#define PROC_CMD_NO_MAC_CONF		  0x0000
+#define PROC_CMD_NOP			  0x000f
+#define PROC_CMD_CRC16			  0x0008
+#define PROC_CMD_FIBER_MEDIA_CONF	  0x0001
+#define PROC_CMD_MCB_ACCESS_MAC_CONF	  0x0000
+#define PROC_CMD_NCOMPLETED_TIMEOUT_MS    500
+
+#define MSCC_PHY_MAC_CFG_FASTLINK	  19
+#define MAC_CFG_MASK			  0xc000
+#define MAC_CFG_SGMII			  0x0000
+#define MAC_CFG_QSGMII			  0x4000
+
+/* Test page Registers */
+#define MSCC_PHY_TEST_PAGE_5		  5
+#define MSCC_PHY_TEST_PAGE_8		  8
+
 /* Token ring page Registers */
 #define MSCC_PHY_TR_CNTL		  16
 #define TR_WRITE			  0x8000
@@ -113,6 +211,7 @@ enum rgmii_rx_clock_delay {
 #define PHY_ID_VSC8531			  0x00070570
 #define PHY_ID_VSC8540			  0x00070760
 #define PHY_ID_VSC8541			  0x00070770
+#define PHY_ID_VSC8584			  0x000707c0
 
 #define MSCC_VDDMAC_1500		  1500
 #define MSCC_VDDMAC_1800		  1800
@@ -122,6 +221,24 @@ enum rgmii_rx_clock_delay {
 #define DOWNSHIFT_COUNT_MAX		  5
 
 #define MAX_LEDS			  4
+
+#define VSC8584_SUPP_LED_MODES (BIT(VSC8531_LINK_ACTIVITY) | \
+				BIT(VSC8531_LINK_1000_ACTIVITY) | \
+				BIT(VSC8531_LINK_100_ACTIVITY) | \
+				BIT(VSC8531_LINK_10_ACTIVITY) | \
+				BIT(VSC8531_LINK_100_1000_ACTIVITY) | \
+				BIT(VSC8531_LINK_10_1000_ACTIVITY) | \
+				BIT(VSC8531_LINK_10_100_ACTIVITY) | \
+				BIT(VSC8584_LINK_100FX_1000X_ACTIVITY) | \
+				BIT(VSC8531_DUPLEX_COLLISION) | \
+				BIT(VSC8531_COLLISION) | \
+				BIT(VSC8531_ACTIVITY) | \
+				BIT(VSC8584_100FX_1000X_ACTIVITY) | \
+				BIT(VSC8531_AUTONEG_FAULT) | \
+				BIT(VSC8531_SERIAL_MODE) | \
+				BIT(VSC8531_FORCE_LED_OFF) | \
+				BIT(VSC8531_FORCE_LED_ON))
+
 #define VSC85XX_SUPP_LED_MODES (BIT(VSC8531_LINK_ACTIVITY) | \
 				BIT(VSC8531_LINK_1000_ACTIVITY) | \
 				BIT(VSC8531_LINK_100_ACTIVITY) | \
@@ -137,6 +254,13 @@ enum rgmii_rx_clock_delay {
 				BIT(VSC8531_FORCE_LED_OFF) | \
 				BIT(VSC8531_FORCE_LED_ON))
 
+#define MSCC_VSC8584_REVB_INT8051_FW		"mscc_vsc8584_revb_int8051_fb48.bin"
+#define MSCC_VSC8584_REVB_INT8051_FW_START_ADDR	0xe800
+#define MSCC_VSC8584_REVB_INT8051_FW_CRC	0xfb48
+
+#define VSC8584_REVB				0x0001
+#define MSCC_DEV_REV_MASK			GENMASK(3, 0)
+
 struct reg_val {
 	u16	reg;
 	u32	val;
@@ -178,6 +302,55 @@ static const struct vsc85xx_hw_stat vsc85xx_hw_stats[] = {
 	},
 };
 
+static const struct vsc85xx_hw_stat vsc8584_hw_stats[] = {
+	{
+		.string	= "phy_receive_errors",
+		.reg	= MSCC_PHY_ERR_RX_CNT,
+		.page	= MSCC_PHY_PAGE_STANDARD,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_false_carrier",
+		.reg	= MSCC_PHY_ERR_FALSE_CARRIER_CNT,
+		.page	= MSCC_PHY_PAGE_STANDARD,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_cu_media_link_disconnect",
+		.reg	= MSCC_PHY_ERR_LINK_DISCONNECT_CNT,
+		.page	= MSCC_PHY_PAGE_STANDARD,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_cu_media_crc_good_count",
+		.reg	= MSCC_PHY_CU_MEDIA_CRC_VALID_CNT,
+		.page	= MSCC_PHY_PAGE_EXTENDED,
+		.mask	= VALID_CRC_CNT_CRC_MASK,
+	}, {
+		.string	= "phy_cu_media_crc_error_count",
+		.reg	= MSCC_PHY_EXT_PHY_CNTL_4,
+		.page	= MSCC_PHY_PAGE_EXTENDED,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_serdes_tx_good_pkt_count",
+		.reg	= MSCC_PHY_SERDES_TX_VALID_CNT,
+		.page	= MSCC_PHY_PAGE_EXTENDED_3,
+		.mask	= VALID_CRC_CNT_CRC_MASK,
+	}, {
+		.string	= "phy_serdes_tx_bad_crc_count",
+		.reg	= MSCC_PHY_SERDES_TX_CRC_ERR_CNT,
+		.page	= MSCC_PHY_PAGE_EXTENDED_3,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_serdes_rx_good_pkt_count",
+		.reg	= MSCC_PHY_SERDES_RX_VALID_CNT,
+		.page	= MSCC_PHY_PAGE_EXTENDED_3,
+		.mask	= VALID_CRC_CNT_CRC_MASK,
+	}, {
+		.string	= "phy_serdes_rx_bad_crc_count",
+		.reg	= MSCC_PHY_SERDES_RX_CRC_ERR_CNT,
+		.page	= MSCC_PHY_PAGE_EXTENDED_3,
+		.mask	= ERR_CNT_MASK,
+	},
+};
+
 struct vsc8531_private {
 	int rate_magic;
 	u16 supp_led_modes;
@@ -186,6 +359,11 @@ struct vsc8531_private {
 	const struct vsc85xx_hw_stat *hw_stats;
 	u64 *stats;
 	int nstats;
+	bool pkg_init;
+	/* For multiple port PHYs; the MDIO address of the base PHY in the
+	 * package.
+	 */
+	unsigned int base_addr;
 };
 
 #ifdef CONFIG_OF_MDIO
@@ -706,6 +884,509 @@ static int vsc85xx_eee_init_seq_set(struct phy_device *phydev)
 	return oldpage;
 }
 
+/* phydev->bus->mdio_lock should be locked when using this function */
+static int phy_base_write(struct phy_device *phydev, u32 regnum, u16 val)
+{
+	struct vsc8531_private *priv = phydev->priv;
+
+	if (unlikely(!mutex_is_locked(&phydev->mdio.bus->mdio_lock))) {
+		dev_err(&phydev->mdio.dev, "MDIO bus lock not held!\n");
+		dump_stack();
+	}
+
+	return __mdiobus_write(phydev->mdio.bus, priv->base_addr, regnum, val);
+}
+
+/* phydev->bus->mdio_lock should be locked when using this function */
+static int phy_base_read(struct phy_device *phydev, u32 regnum)
+{
+	struct vsc8531_private *priv = phydev->priv;
+
+	if (unlikely(!mutex_is_locked(&phydev->mdio.bus->mdio_lock))) {
+		dev_err(&phydev->mdio.dev, "MDIO bus lock not held!\n");
+		dump_stack();
+	}
+
+	return __mdiobus_read(phydev->mdio.bus, priv->base_addr, regnum);
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static void vsc8584_csr_write(struct phy_device *phydev, u16 addr, u32 val)
+{
+	phy_base_write(phydev, MSCC_PHY_TR_MSB, val >> 16);
+	phy_base_write(phydev, MSCC_PHY_TR_LSB, val & GENMASK(15, 0));
+	phy_base_write(phydev, MSCC_PHY_TR_CNTL, TR_WRITE | TR_ADDR(addr));
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static int vsc8584_cmd(struct phy_device *phydev, u16 val)
+{
+	unsigned long deadline;
+	u16 reg_val;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+		       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+	phy_base_write(phydev, MSCC_PHY_PROC_CMD, PROC_CMD_NCOMPLETED | val);
+
+	deadline = jiffies + msecs_to_jiffies(PROC_CMD_NCOMPLETED_TIMEOUT_MS);
+	do {
+		reg_val = phy_base_read(phydev, MSCC_PHY_PROC_CMD);
+	} while (time_before(jiffies, deadline) &&
+		 (reg_val & PROC_CMD_NCOMPLETED) &&
+		 !(reg_val & PROC_CMD_FAILED));
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	if (reg_val & PROC_CMD_FAILED)
+		return -EIO;
+
+	if (reg_val & PROC_CMD_NCOMPLETED)
+		return -ETIMEDOUT;
+
+	return 0;
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static int vsc8584_micro_deassert_reset(struct phy_device *phydev,
+					bool patch_en)
+{
+	u32 enable, release;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+		       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+	enable = RUN_FROM_INT_ROM | MICRO_CLK_EN | DW8051_CLK_EN;
+	release = MICRO_NSOFT_RESET | RUN_FROM_INT_ROM | DW8051_CLK_EN |
+		MICRO_CLK_EN;
+
+	if (patch_en) {
+		enable |= MICRO_PATCH_EN;
+		release |= MICRO_PATCH_EN;
+
+		/* Clear all patches */
+		phy_base_write(phydev, MSCC_INT_MEM_CNTL, READ_RAM);
+	}
+
+	/* Enable 8051 Micro clock; CLEAR/SET patch present; disable PRAM clock
+	 * override and addr. auto-incr; operate at 125 MHz
+	 */
+	phy_base_write(phydev, MSCC_DW8051_CNTL_STATUS, enable);
+	/* Release 8051 Micro SW reset */
+	phy_base_write(phydev, MSCC_DW8051_CNTL_STATUS, release);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	return 0;
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static int vsc8584_micro_assert_reset(struct phy_device *phydev)
+{
+	int ret;
+	u16 reg;
+
+	ret = vsc8584_cmd(phydev, PROC_CMD_NOP);
+	if (ret)
+		return ret;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+		       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+	reg = phy_base_read(phydev, MSCC_INT_MEM_CNTL);
+	reg &= ~EN_PATCH_RAM_TRAP_ADDR(4);
+	phy_base_write(phydev, MSCC_INT_MEM_CNTL, reg);
+
+	phy_base_write(phydev, MSCC_TRAP_ROM_ADDR(4), 0x005b);
+	phy_base_write(phydev, MSCC_PATCH_RAM_ADDR(4), 0x005b);
+
+	reg = phy_base_read(phydev, MSCC_INT_MEM_CNTL);
+	reg |= EN_PATCH_RAM_TRAP_ADDR(4);
+	phy_base_write(phydev, MSCC_INT_MEM_CNTL, reg);
+
+	phy_base_write(phydev, MSCC_PHY_PROC_CMD, PROC_CMD_NOP);
+
+	reg = phy_base_read(phydev, MSCC_DW8051_CNTL_STATUS);
+	reg &= ~MICRO_NSOFT_RESET;
+	phy_base_write(phydev, MSCC_DW8051_CNTL_STATUS, reg);
+
+	phy_base_write(phydev, MSCC_PHY_PROC_CMD, PROC_CMD_MCB_ACCESS_MAC_CONF |
+		       PROC_CMD_SGMII_PORT(0) | PROC_CMD_NO_MAC_CONF |
+		       PROC_CMD_READ);
+
+	reg = phy_base_read(phydev, MSCC_INT_MEM_CNTL);
+	reg &= ~EN_PATCH_RAM_TRAP_ADDR(4);
+	phy_base_write(phydev, MSCC_INT_MEM_CNTL, reg);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	return 0;
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static int vsc8584_get_fw_crc(struct phy_device *phydev, u16 start, u16 size,
+			      u16 *crc)
+{
+	int ret;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_EXTENDED);
+
+	phy_base_write(phydev, MSCC_PHY_VERIPHY_CNTL_2, start);
+	phy_base_write(phydev, MSCC_PHY_VERIPHY_CNTL_3, size);
+
+	/* Start Micro command */
+	ret = vsc8584_cmd(phydev, PROC_CMD_CRC16);
+	if (ret)
+		goto out;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_EXTENDED);
+
+	*crc = phy_base_read(phydev, MSCC_PHY_VERIPHY_CNTL_2);
+
+out:
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	return ret;
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static int vsc8584_patch_fw(struct phy_device *phydev,
+			    const struct firmware *fw)
+{
+	int i, ret;
+
+	ret = vsc8584_micro_assert_reset(phydev);
+	if (ret) {
+		dev_err(&phydev->mdio.dev,
+			"%s: failed to assert reset of micro\n", __func__);
+		return ret;
+	}
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+		       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+	/* Hold 8051 Micro in SW Reset, Enable auto incr address and patch clock
+	 * Disable the 8051 Micro clock
+	 */
+	phy_base_write(phydev, MSCC_DW8051_CNTL_STATUS, RUN_FROM_INT_ROM |
+		       AUTOINC_ADDR | PATCH_RAM_CLK | MICRO_CLK_EN |
+		       MICRO_CLK_DIVIDE(2));
+	phy_base_write(phydev, MSCC_INT_MEM_CNTL, READ_PRAM | INT_MEM_WRITE_EN |
+		       INT_MEM_DATA(2));
+	phy_base_write(phydev, MSCC_INT_MEM_ADDR, 0x0000);
+
+	for (i = 0; i < fw->size; i++)
+		phy_base_write(phydev, MSCC_INT_MEM_CNTL, READ_PRAM |
+			       INT_MEM_WRITE_EN | fw->data[i]);
+
+	/* Clear internal memory access */
+	phy_base_write(phydev, MSCC_INT_MEM_CNTL, READ_RAM);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	return 0;
+}
+
+/* bus->mdio_lock should be locked when using this function */
+static int vsc8584_config_pre_init(struct phy_device *phydev)
+{
+	const struct reg_val pre_init1[] = {
+		{0x07fa, 0x0050100f},
+		{0x1688, 0x00049f81},
+		{0x0f90, 0x00688980},
+		{0x03a4, 0x0000d8f0},
+		{0x0fc0, 0x00000400},
+		{0x0f82, 0x0012b002},
+		{0x1686, 0x00000004},
+		{0x168c, 0x00d2c46f},
+		{0x17a2, 0x00000620},
+		{0x16a0, 0x00eeffdd},
+		{0x16a6, 0x00071448},
+		{0x16a4, 0x0013132f},
+		{0x16a8, 0x00000000},
+		{0x0ffc, 0x00c0a028},
+		{0x0fe8, 0x0091b06c},
+		{0x0fea, 0x00041600},
+		{0x0f80, 0x00fffaff},
+		{0x0fec, 0x00901809},
+		{0x0ffe, 0x00b01007},
+		{0x16b0, 0x00eeff00},
+		{0x16b2, 0x00007000},
+		{0x16b4, 0x00000814},
+	};
+	const struct reg_val pre_init2[] = {
+		{0x0486, 0x0008a518},
+		{0x0488, 0x006dc696},
+		{0x048a, 0x00000912},
+	};
+	const struct firmware *fw;
+	struct device *dev = &phydev->mdio.dev;
+	unsigned int i;
+	u16 crc, reg;
+	int ret;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	/* all writes below are broadcasted to all PHYs in the same package */
+	reg = phy_base_read(phydev, MSCC_PHY_EXT_CNTL_STATUS);
+	reg |= SMI_BROADCAST_WR_EN;
+	phy_base_write(phydev, MSCC_PHY_EXT_CNTL_STATUS, reg);
+
+	phy_base_write(phydev, MII_VSC85XX_INT_MASK, 0);
+
+	reg = phy_base_read(phydev,  MSCC_PHY_BYPASS_CONTROL);
+	reg |= PARALLEL_DET_IGNORE_ADVERTISED;
+	phy_base_write(phydev, MSCC_PHY_BYPASS_CONTROL, reg);
+
+	/* The below register writes are tweaking analog and electrical
+	 * configuration that were determined through characterization by PHY
+	 * engineers. These don't mean anything more than "these are the best
+	 * values".
+	 */
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_EXTENDED_3);
+
+	phy_base_write(phydev, MSCC_PHY_SERDES_TX_CRC_ERR_CNT, 0x2000);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TEST);
+
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_5, 0x1f20);
+
+	reg = phy_base_read(phydev, MSCC_PHY_TEST_PAGE_8);
+	reg |= 0x8000;
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_8, reg);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TR);
+
+	phy_base_write(phydev, MSCC_PHY_TR_CNTL, TR_WRITE | TR_ADDR(0x2fa4));
+
+	reg = phy_base_read(phydev, MSCC_PHY_TR_MSB);
+	reg &= ~0x007f;
+	reg |= 0x0019;
+	phy_base_write(phydev, MSCC_PHY_TR_MSB, reg);
+
+	phy_base_write(phydev, MSCC_PHY_TR_CNTL, TR_WRITE | TR_ADDR(0x0fa4));
+
+	for (i = 0; i < ARRAY_SIZE(pre_init1); i++)
+		vsc8584_csr_write(phydev, pre_init1[i].reg, pre_init1[i].val);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_EXTENDED_2);
+
+	phy_base_write(phydev, MSCC_PHY_CU_PMD_TX_CNTL, 0x028e);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TR);
+
+	for (i = 0; i < ARRAY_SIZE(pre_init2); i++)
+		vsc8584_csr_write(phydev, pre_init2[i].reg, pre_init2[i].val);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_TEST);
+
+	reg = phy_base_read(phydev, MSCC_PHY_TEST_PAGE_8);
+	reg &= ~0x8000;
+	phy_base_write(phydev, MSCC_PHY_TEST_PAGE_8, reg);
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	/* end of write broadcasting */
+	reg = phy_base_read(phydev, MSCC_PHY_EXT_CNTL_STATUS);
+	reg &= ~SMI_BROADCAST_WR_EN;
+	phy_base_write(phydev, MSCC_PHY_EXT_CNTL_STATUS, reg);
+
+	ret = request_firmware(&fw, MSCC_VSC8584_REVB_INT8051_FW, dev);
+	if (ret) {
+		dev_err(dev, "failed to load firmware %s, ret: %d\n",
+			MSCC_VSC8584_REVB_INT8051_FW, ret);
+		return ret;
+	}
+
+	/* Add one byte to size for the one added by the patch_fw function */
+	ret = vsc8584_get_fw_crc(phydev,
+				 MSCC_VSC8584_REVB_INT8051_FW_START_ADDR,
+				 fw->size + 1, &crc);
+	if (ret)
+		goto out;
+
+	if (crc != MSCC_VSC8584_REVB_INT8051_FW_CRC) {
+		dev_dbg(dev, "FW CRC is not the expected one, patching FW\n");
+		if (vsc8584_patch_fw(phydev, fw))
+			dev_warn(dev,
+				 "failed to patch FW, expect non-optimal device\n");
+	}
+
+	vsc8584_micro_deassert_reset(phydev, false);
+
+	/* Add one byte to size for the one added by the patch_fw function */
+	ret = vsc8584_get_fw_crc(phydev,
+				 MSCC_VSC8584_REVB_INT8051_FW_START_ADDR,
+				 fw->size + 1, &crc);
+	if (ret)
+		goto out;
+
+	if (crc != MSCC_VSC8584_REVB_INT8051_FW_CRC)
+		dev_warn(dev,
+			 "FW CRC after patching is not the expected one, expect non-optimal device\n");
+
+	ret = vsc8584_micro_assert_reset(phydev);
+	if (ret)
+		goto out;
+
+	vsc8584_micro_deassert_reset(phydev, true);
+
+out:
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	release_firmware(fw);
+
+	return ret;
+}
+
+/* Check if one PHY has already done the init of the parts common to all PHYs
+ * in the Quad PHY package.
+ */
+static bool vsc8584_is_pkg_init(struct phy_device *phydev, bool reversed)
+{
+	struct mdio_device **map = phydev->mdio.bus->mdio_map;
+	struct vsc8531_private *vsc8531;
+	struct phy_device *phy;
+	int i, addr;
+
+	/* VSC8584 is a Quad PHY */
+	for (i = 0; i < 4; i++) {
+		vsc8531 = phydev->priv;
+
+		if (reversed)
+			addr = vsc8531->base_addr - i;
+		else
+			addr = vsc8531->base_addr + i;
+
+		phy = container_of(map[addr], struct phy_device, mdio);
+
+		if ((phy->phy_id & phydev->drv->phy_id_mask) !=
+		    (phydev->drv->phy_id & phydev->drv->phy_id_mask))
+			continue;
+
+		vsc8531 = phy->priv;
+
+		if (vsc8531 && vsc8531->pkg_init)
+			return true;
+	}
+
+	return false;
+}
+
+static int vsc8584_config_init(struct phy_device *phydev)
+{
+	struct vsc8531_private *vsc8531 = phydev->priv;
+	u16 addr, val;
+	int ret, i;
+
+	phydev->mdix_ctrl = ETH_TP_MDI_AUTO;
+
+	mutex_lock(&phydev->mdio.bus->mdio_lock);
+
+	__mdiobus_write(phydev->mdio.bus, phydev->mdio.addr,
+			MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_EXTENDED);
+	addr = __mdiobus_read(phydev->mdio.bus, phydev->mdio.addr,
+			      MSCC_PHY_EXT_PHY_CNTL_4);
+	addr >>= PHY_CNTL_4_ADDR_POS;
+
+	val = __mdiobus_read(phydev->mdio.bus, phydev->mdio.addr,
+			     MSCC_PHY_ACTIPHY_CNTL);
+	if (val & PHY_ADDR_REVERSED)
+		vsc8531->base_addr = phydev->mdio.addr + addr;
+	else
+		vsc8531->base_addr = phydev->mdio.addr - addr;
+
+	/* Some parts of the init sequence are identical for every PHY in the
+	 * package. Some parts are modifying the GPIO register bank which is a
+	 * set of registers that are affecting all PHYs, a few resetting the
+	 * microprocessor common to all PHYs. The CRC check responsible of the
+	 * checking the firmware within the 8051 microprocessor can only be
+	 * accessed via the PHY whose internal address in the package is 0.
+	 * All PHYs' interrupts mask register has to be zeroed before enabling
+	 * any PHY's interrupt in this register.
+	 * For all these reasons, we need to do the init sequence once and only
+	 * once whatever is the first PHY in the package that is initialized and
+	 * do the correct init sequence for all PHYs that are package-critical
+	 * in this pre-init function.
+	 */
+	if (!vsc8584_is_pkg_init(phydev, val & PHY_ADDR_REVERSED ? 1 : 0)) {
+		ret = vsc8584_config_pre_init(phydev);
+		if (ret)
+			goto err;
+	}
+
+	vsc8531->pkg_init = true;
+
+	phy_base_write(phydev, MSCC_EXT_PAGE_ACCESS,
+		       MSCC_PHY_PAGE_EXTENDED_GPIO);
+
+	val = phy_base_read(phydev, MSCC_PHY_MAC_CFG_FASTLINK);
+	val &= ~MAC_CFG_MASK;
+	if (phydev->interface == PHY_INTERFACE_MODE_QSGMII)
+		val |= MAC_CFG_QSGMII;
+	else
+		val |= MAC_CFG_SGMII;
+
+	ret = phy_base_write(phydev, MSCC_PHY_MAC_CFG_FASTLINK, val);
+	if (ret)
+		goto err;
+
+	val = PROC_CMD_MCB_ACCESS_MAC_CONF | PROC_CMD_RST_CONF_PORT |
+		PROC_CMD_READ_MOD_WRITE_PORT;
+	if (phydev->interface == PHY_INTERFACE_MODE_QSGMII)
+		val |= PROC_CMD_QSGMII_MAC;
+	else
+		val |= PROC_CMD_SGMII_MAC;
+
+	ret = vsc8584_cmd(phydev, val);
+	if (ret)
+		goto err;
+
+	usleep_range(10000, 20000);
+
+	/* Disable SerDes for 100Base-FX */
+	ret = vsc8584_cmd(phydev, PROC_CMD_FIBER_MEDIA_CONF |
+			  PROC_CMD_FIBER_PORT(addr) | PROC_CMD_FIBER_DISABLE |
+			  PROC_CMD_READ_MOD_WRITE_PORT |
+			  PROC_CMD_RST_CONF_PORT | PROC_CMD_FIBER_100BASE_FX);
+	if (ret)
+		goto err;
+
+	/* Disable SerDes for 1000Base-X */
+	ret = vsc8584_cmd(phydev, PROC_CMD_FIBER_MEDIA_CONF |
+			  PROC_CMD_FIBER_PORT(addr) | PROC_CMD_FIBER_DISABLE |
+			  PROC_CMD_READ_MOD_WRITE_PORT |
+			  PROC_CMD_RST_CONF_PORT | PROC_CMD_FIBER_1000BASE_X);
+	if (ret)
+		goto err;
+
+	mutex_unlock(&phydev->mdio.bus->mdio_lock);
+
+	phy_write(phydev, MSCC_EXT_PAGE_ACCESS, MSCC_PHY_PAGE_STANDARD);
+
+	val = phy_read(phydev, MSCC_PHY_EXT_PHY_CNTL_1);
+	val &= ~(MEDIA_OP_MODE_MASK | VSC8584_MAC_IF_SELECTION_MASK);
+	val |= MEDIA_OP_MODE_COPPER | (VSC8584_MAC_IF_SELECTION_SGMII <<
+				       VSC8584_MAC_IF_SELECTION_POS);
+	ret = phy_write(phydev, MSCC_PHY_EXT_PHY_CNTL_1, val);
+
+	ret = genphy_soft_reset(phydev);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < vsc8531->nleds; i++) {
+		ret = vsc85xx_led_cntl_set(phydev, i, vsc8531->leds_mode[i]);
+		if (ret)
+			return ret;
+	}
+
+	return genphy_config_init(phydev);
+
+err:
+	mutex_unlock(&phydev->mdio.bus->mdio_lock);
+	return ret;
+}
+
 static int vsc85xx_config_init(struct phy_device *phydev)
 {
 	int rc, i;
@@ -736,6 +1417,16 @@ static int vsc85xx_config_init(struct phy_device *phydev)
 	return genphy_config_init(phydev);
 }
 
+static int vsc8584_did_interrupt(struct phy_device *phydev)
+{
+	int rc = 0;
+
+	if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
+		rc = phy_read(phydev, MII_VSC85XX_INT_STATUS);
+
+	return (rc < 0) ? 0 : rc & MII_VSC85XX_INT_MASK_MASK;
+}
+
 static int vsc85xx_ack_interrupt(struct phy_device *phydev)
 {
 	int rc = 0;
@@ -785,6 +1476,36 @@ static int vsc85xx_read_status(struct phy_device *phydev)
 	return genphy_read_status(phydev);
 }
 
+static int vsc8584_probe(struct phy_device *phydev)
+{
+	struct vsc8531_private *vsc8531;
+	u32 default_mode[4] = {VSC8531_LINK_1000_ACTIVITY,
+	   VSC8531_LINK_100_ACTIVITY, VSC8531_LINK_ACTIVITY,
+	   VSC8531_DUPLEX_COLLISION};
+
+	if ((phydev->phy_id & MSCC_DEV_REV_MASK) != VSC8584_REVB) {
+		dev_err(&phydev->mdio.dev, "Only VSC8584 revB is supported.\n");
+		return -ENOTSUPP;
+	}
+
+	vsc8531 = devm_kzalloc(&phydev->mdio.dev, sizeof(*vsc8531), GFP_KERNEL);
+	if (!vsc8531)
+		return -ENOMEM;
+
+	phydev->priv = vsc8531;
+
+	vsc8531->nleds = 4;
+	vsc8531->supp_led_modes = VSC8584_SUPP_LED_MODES;
+	vsc8531->hw_stats = vsc8584_hw_stats;
+	vsc8531->nstats = ARRAY_SIZE(vsc8584_hw_stats);
+	vsc8531->stats = devm_kmalloc_array(&phydev->mdio.dev, vsc8531->nstats,
+					    sizeof(u64), GFP_KERNEL);
+	if (!vsc8531->stats)
+		return -ENOMEM;
+
+	return vsc85xx_dt_led_modes_get(phydev, default_mode);
+}
+
 static int vsc85xx_probe(struct phy_device *phydev)
 {
 	struct vsc8531_private *vsc8531;
@@ -920,6 +1641,31 @@ static struct phy_driver vsc85xx_driver[] = {
 	.get_sset_count = &vsc85xx_get_sset_count,
 	.get_strings    = &vsc85xx_get_strings,
 	.get_stats      = &vsc85xx_get_stats,
+},
+{
+	.phy_id		= PHY_ID_VSC8584,
+	.name		= "Microsemi GE VSC8584 SyncE",
+	.phy_id_mask	= 0xfffffff0,
+	.features	= PHY_GBIT_FEATURES,
+	.flags		= PHY_HAS_INTERRUPT,
+	.soft_reset	= &genphy_soft_reset,
+	.config_init    = &vsc8584_config_init,
+	.config_aneg    = &vsc85xx_config_aneg,
+	.aneg_done	= &genphy_aneg_done,
+	.read_status	= &vsc85xx_read_status,
+	.ack_interrupt  = &vsc85xx_ack_interrupt,
+	.config_intr    = &vsc85xx_config_intr,
+	.did_interrupt  = &vsc8584_did_interrupt,
+	.suspend	= &genphy_suspend,
+	.resume		= &genphy_resume,
+	.probe		= &vsc8584_probe,
+	.get_tunable	= &vsc85xx_get_tunable,
+	.set_tunable	= &vsc85xx_set_tunable,
+	.read_page	= &vsc85xx_phy_read_page,
+	.write_page	= &vsc85xx_phy_write_page,
+	.get_sset_count = &vsc85xx_get_sset_count,
+	.get_strings    = &vsc85xx_get_strings,
+	.get_stats      = &vsc85xx_get_stats,
 }
 
 };
@@ -931,6 +1677,7 @@ static struct mdio_device_id __maybe_unused vsc85xx_tbl[] = {
 	{ PHY_ID_VSC8531, 0xfffffff0, },
 	{ PHY_ID_VSC8540, 0xfffffff0, },
 	{ PHY_ID_VSC8541, 0xfffffff0, },
+	{ PHY_ID_VSC8584, 0xfffffff0, },
 	{ }
 };
 
-- 
2.17.1

^ permalink raw reply related

* [RESEND PATCH net-next v2 1/5] dt-bindings: net: vsc8531: add two additional LED modes for VSC8584
From: Quentin Schulz @ 2018-10-08 10:14 UTC (permalink / raw)
  To: alexandre.belloni, ralf, paul.burton, jhogan, robh+dt,
	mark.rutland, davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-mips, devicetree, linux-kernel, netdev,
	thomas.petazzoni, antoine.tenart, Quentin Schulz
In-Reply-To: <20181008101445.25946-1-quentin.schulz@bootlin.com>

The VSC8584 (and most likely other PHYs in the same generation) has two
additional LED modes that can be picked, so let's add them.

Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 include/dt-bindings/net/mscc-phy-vsc8531.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/dt-bindings/net/mscc-phy-vsc8531.h b/include/dt-bindings/net/mscc-phy-vsc8531.h
index 697161f80eb5..9eb2ec2b2ea9 100644
--- a/include/dt-bindings/net/mscc-phy-vsc8531.h
+++ b/include/dt-bindings/net/mscc-phy-vsc8531.h
@@ -18,9 +18,11 @@
 #define VSC8531_LINK_100_1000_ACTIVITY  4
 #define VSC8531_LINK_10_1000_ACTIVITY   5
 #define VSC8531_LINK_10_100_ACTIVITY    6
+#define VSC8584_LINK_100FX_1000X_ACTIVITY	7
 #define VSC8531_DUPLEX_COLLISION        8
 #define VSC8531_COLLISION               9
 #define VSC8531_ACTIVITY                10
+#define VSC8584_100FX_1000X_ACTIVITY	11
 #define VSC8531_AUTONEG_FAULT           12
 #define VSC8531_SERIAL_MODE             13
 #define VSC8531_FORCE_LED_OFF           14
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v3 5/6] net: phy: mscc: shorten `x != 0` condition to `x`
From: Quentin Schulz @ 2018-10-08 10:07 UTC (permalink / raw)
  To: davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-kernel, netdev, thomas.petazzoni,
	alexandre.belloni, Quentin Schulz
In-Reply-To: <20181008100728.24959-1-quentin.schulz@bootlin.com>

`if (x != 0)` is basically a more verbose version of `if (x)` so let's
use the latter so it's consistent throughout the whole driver.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 drivers/net/phy/mscc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index 0ff7803ed16b..6bfdc168c62b 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -310,7 +310,7 @@ static int vsc85xx_mdix_set(struct phy_device *phydev, u8 mdix)
 			     DISABLE_HP_AUTO_MDIX_MASK);
 	}
 	rc = phy_write(phydev, MSCC_PHY_BYPASS_CONTROL, reg_val);
-	if (rc != 0)
+	if (rc)
 		return rc;
 
 	reg_val = 0;
@@ -425,14 +425,14 @@ static int vsc85xx_wol_set(struct phy_device *phydev,
 		reg_val = phy_read(phydev, MII_VSC85XX_INT_MASK);
 		reg_val |= MII_VSC85XX_INT_MASK_WOL;
 		rc = phy_write(phydev, MII_VSC85XX_INT_MASK, reg_val);
-		if (rc != 0)
+		if (rc)
 			goto out_unlock;
 	} else {
 		/* Disable the WOL interrupt */
 		reg_val = phy_read(phydev, MII_VSC85XX_INT_MASK);
 		reg_val &= (~MII_VSC85XX_INT_MASK_WOL);
 		rc = phy_write(phydev, MII_VSC85XX_INT_MASK, reg_val);
-		if (rc != 0)
+		if (rc)
 			goto out_unlock;
 	}
 	/* Clear WOL iterrupt status */
@@ -603,7 +603,7 @@ static int vsc85xx_mac_if_set(struct phy_device *phydev,
 		goto out_unlock;
 	}
 	rc = phy_write(phydev, MSCC_PHY_EXT_PHY_CNTL_1, reg_val);
-	if (rc != 0)
+	if (rc)
 		goto out_unlock;
 
 	rc = genphy_soft_reset(phydev);
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v3 4/6] net: phy: mscc: remove unneeded parenthesis
From: Quentin Schulz @ 2018-10-08 10:07 UTC (permalink / raw)
  To: davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-kernel, netdev, thomas.petazzoni,
	alexandre.belloni, Quentin Schulz
In-Reply-To: <20181008100728.24959-1-quentin.schulz@bootlin.com>

The == operator precedes the || operator, so we can remove the
parenthesis around (a == b) || (c == d).

The condition is rather explicit and short so removing the parenthesis
definitely does not make it harder to read.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 drivers/net/phy/mscc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index d304fb4df23c..0ff7803ed16b 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -300,7 +300,7 @@ static int vsc85xx_mdix_set(struct phy_device *phydev, u8 mdix)
 	u16 reg_val;
 
 	reg_val = phy_read(phydev, MSCC_PHY_BYPASS_CONTROL);
-	if ((mdix == ETH_TP_MDI) || (mdix == ETH_TP_MDI_X)) {
+	if (mdix == ETH_TP_MDI || mdix == ETH_TP_MDI_X) {
 		reg_val |= (DISABLE_PAIR_SWAP_CORR_MASK |
 			    DISABLE_POLARITY_CORR_MASK  |
 			    DISABLE_HP_AUTO_MDIX_MASK);
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v3 3/6] net: phy: mscc: Add EEE init sequence
From: Quentin Schulz @ 2018-10-08 10:07 UTC (permalink / raw)
  To: davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-kernel, netdev, thomas.petazzoni,
	alexandre.belloni, Raju Lakkaraju, Quentin Schulz
In-Reply-To: <20181008100728.24959-1-quentin.schulz@bootlin.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>

Microsemi PHYs (VSC 8530/31/40/41) need to update the Energy Efficient
Ethernet initialization sequence.
In order to avoid certain link state errors that could result in link
drops and packet loss, the physical coding sublayer (PCS) must be
updated with settings related to EEE in order to improve performance.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 drivers/net/phy/mscc.c | 65 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index 47fbab489287..d304fb4df23c 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -67,6 +67,7 @@ enum rgmii_rx_clock_delay {
 #define MSCC_PHY_PAGE_STANDARD		  0x0000 /* Standard registers */
 #define MSCC_PHY_PAGE_EXTENDED		  0x0001 /* Extended registers */
 #define MSCC_PHY_PAGE_EXTENDED_2	  0x0002 /* Extended reg - page 2 */
+#define MSCC_PHY_PAGE_TR		  0x52b5 /* Token ring registers */
 
 /* Extended Page 1 Registers */
 #define MSCC_PHY_CU_MEDIA_CRC_VALID_CNT	  18
@@ -100,6 +101,13 @@ enum rgmii_rx_clock_delay {
 #define SECURE_ON_ENABLE		  0x8000
 #define SECURE_ON_PASSWD_LEN_4		  0x4000
 
+/* Token ring page Registers */
+#define MSCC_PHY_TR_CNTL		  16
+#define TR_WRITE			  0x8000
+#define TR_ADDR(x)			  (0x7fff & (x))
+#define MSCC_PHY_TR_LSB			  17
+#define MSCC_PHY_TR_MSB			  18
+
 /* Microsemi PHY ID's */
 #define PHY_ID_VSC8530			  0x00070560
 #define PHY_ID_VSC8531			  0x00070570
@@ -129,6 +137,11 @@ enum rgmii_rx_clock_delay {
 				BIT(VSC8531_FORCE_LED_OFF) | \
 				BIT(VSC8531_FORCE_LED_ON))
 
+struct reg_val {
+	u16	reg;
+	u32	val;
+};
+
 struct vsc85xx_hw_stat {
 	const char *string;
 	u8 reg;
@@ -647,6 +660,54 @@ static int vsc85xx_set_tunable(struct phy_device *phydev,
 	}
 }
 
+/* mdiobus lock should be locked when using this function */
+static void vsc85xx_tr_write(struct phy_device *phydev, u16 addr, u32 val)
+{
+	__phy_write(phydev, MSCC_PHY_TR_MSB, val >> 16);
+	__phy_write(phydev, MSCC_PHY_TR_LSB, val & GENMASK(15, 0));
+	__phy_write(phydev, MSCC_PHY_TR_CNTL, TR_WRITE | TR_ADDR(addr));
+}
+
+static int vsc85xx_eee_init_seq_set(struct phy_device *phydev)
+{
+	const struct reg_val init_eee[] = {
+		{0x0f82, 0x0012b00a},
+		{0x1686, 0x00000004},
+		{0x168c, 0x00d2c46f},
+		{0x17a2, 0x00000620},
+		{0x16a0, 0x00eeffdd},
+		{0x16a6, 0x00071448},
+		{0x16a4, 0x0013132f},
+		{0x16a8, 0x00000000},
+		{0x0ffc, 0x00c0a028},
+		{0x0fe8, 0x0091b06c},
+		{0x0fea, 0x00041600},
+		{0x0f80, 0x00000af4},
+		{0x0fec, 0x00901809},
+		{0x0fee, 0x0000a6a1},
+		{0x0ffe, 0x00b01007},
+		{0x16b0, 0x00eeff00},
+		{0x16b2, 0x00007000},
+		{0x16b4, 0x00000814},
+	};
+	unsigned int i;
+	int oldpage;
+
+	mutex_lock(&phydev->lock);
+	oldpage = phy_select_page(phydev, MSCC_PHY_PAGE_TR);
+	if (oldpage < 0)
+		goto out_unlock;
+
+	for (i = 0; i < ARRAY_SIZE(init_eee); i++)
+		vsc85xx_tr_write(phydev, init_eee[i].reg, init_eee[i].val);
+
+out_unlock:
+	oldpage = phy_restore_page(phydev, oldpage, oldpage);
+	mutex_unlock(&phydev->lock);
+
+	return oldpage;
+}
+
 static int vsc85xx_config_init(struct phy_device *phydev)
 {
 	int rc, i;
@@ -664,6 +725,10 @@ static int vsc85xx_config_init(struct phy_device *phydev)
 	if (rc)
 		return rc;
 
+	rc = vsc85xx_eee_init_seq_set(phydev);
+	if (rc)
+		return rc;
+
 	for (i = 0; i < vsc8531->nleds; i++) {
 		rc = vsc85xx_led_cntl_set(phydev, i, vsc8531->leds_mode[i]);
 		if (rc)
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v3 2/6] net: phy: mscc: add ethtool statistics counters
From: Quentin Schulz @ 2018-10-08 10:07 UTC (permalink / raw)
  To: davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-kernel, netdev, thomas.petazzoni,
	alexandre.belloni, Raju Lakkaraju, Quentin Schulz
In-Reply-To: <20181008100728.24959-1-quentin.schulz@bootlin.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>

There are a few counters available in the PHY: receive errors, false
carriers, link disconnects, media CRC errors and valids counters.

So let's expose those in the PHY driver.

Use the priv structure as the next PHY to be supported has a few
additional counters.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
---
 drivers/net/phy/mscc.c | 119 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index 52198be46c68..47fbab489287 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -33,6 +33,11 @@ enum rgmii_rx_clock_delay {
 #define DISABLE_PAIR_SWAP_CORR_MASK	  0x0020
 #define DISABLE_POLARITY_CORR_MASK	  0x0010
 
+#define MSCC_PHY_ERR_RX_CNT		  19
+#define MSCC_PHY_ERR_FALSE_CARRIER_CNT	  20
+#define MSCC_PHY_ERR_LINK_DISCONNECT_CNT  21
+#define ERR_CNT_MASK			  GENMASK(7, 0)
+
 #define MSCC_PHY_EXT_PHY_CNTL_1           23
 #define MAC_IF_SELECTION_MASK             0x1800
 #define MAC_IF_SELECTION_GMII             0
@@ -64,6 +69,9 @@ enum rgmii_rx_clock_delay {
 #define MSCC_PHY_PAGE_EXTENDED_2	  0x0002 /* Extended reg - page 2 */
 
 /* Extended Page 1 Registers */
+#define MSCC_PHY_CU_MEDIA_CRC_VALID_CNT	  18
+#define VALID_CRC_CNT_CRC_MASK		  GENMASK(13, 0)
+
 #define MSCC_PHY_EXT_MODE_CNTL		  19
 #define FORCE_MDI_CROSSOVER_MASK	  0x000C
 #define FORCE_MDI_CROSSOVER_MDIX	  0x000C
@@ -74,6 +82,8 @@ enum rgmii_rx_clock_delay {
 #define DOWNSHIFT_EN			  0x0010
 #define DOWNSHIFT_CNTL_POS		  2
 
+#define MSCC_PHY_EXT_PHY_CNTL_4		  23
+
 /* Extended Page 2 Registers */
 #define MSCC_PHY_RGMII_CNTL		  20
 #define RGMII_RX_CLK_DELAY_MASK		  0x0070
@@ -119,11 +129,50 @@ enum rgmii_rx_clock_delay {
 				BIT(VSC8531_FORCE_LED_OFF) | \
 				BIT(VSC8531_FORCE_LED_ON))
 
+struct vsc85xx_hw_stat {
+	const char *string;
+	u8 reg;
+	u16 page;
+	u16 mask;
+};
+
+static const struct vsc85xx_hw_stat vsc85xx_hw_stats[] = {
+	{
+		.string	= "phy_receive_errors",
+		.reg	= MSCC_PHY_ERR_RX_CNT,
+		.page	= MSCC_PHY_PAGE_STANDARD,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_false_carrier",
+		.reg	= MSCC_PHY_ERR_FALSE_CARRIER_CNT,
+		.page	= MSCC_PHY_PAGE_STANDARD,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_cu_media_link_disconnect",
+		.reg	= MSCC_PHY_ERR_LINK_DISCONNECT_CNT,
+		.page	= MSCC_PHY_PAGE_STANDARD,
+		.mask	= ERR_CNT_MASK,
+	}, {
+		.string	= "phy_cu_media_crc_good_count",
+		.reg	= MSCC_PHY_CU_MEDIA_CRC_VALID_CNT,
+		.page	= MSCC_PHY_PAGE_EXTENDED,
+		.mask	= VALID_CRC_CNT_CRC_MASK,
+	}, {
+		.string	= "phy_cu_media_crc_error_count",
+		.reg	= MSCC_PHY_EXT_PHY_CNTL_4,
+		.page	= MSCC_PHY_PAGE_EXTENDED,
+		.mask	= ERR_CNT_MASK,
+	},
+};
+
 struct vsc8531_private {
 	int rate_magic;
 	u16 supp_led_modes;
 	u32 leds_mode[MAX_LEDS];
 	u8 nleds;
+	const struct vsc85xx_hw_stat *hw_stats;
+	u64 *stats;
+	int nstats;
 };
 
 #ifdef CONFIG_OF_MDIO
@@ -150,6 +199,58 @@ static int vsc85xx_phy_write_page(struct phy_device *phydev, int page)
 	return __phy_write(phydev, MSCC_EXT_PAGE_ACCESS, page);
 }
 
+static int vsc85xx_get_sset_count(struct phy_device *phydev)
+{
+	struct vsc8531_private *priv = phydev->priv;
+
+	if (!priv)
+		return 0;
+
+	return priv->nstats;
+}
+
+static void vsc85xx_get_strings(struct phy_device *phydev, u8 *data)
+{
+	struct vsc8531_private *priv = phydev->priv;
+	int i;
+
+	if (!priv)
+		return;
+
+	for (i = 0; i < priv->nstats; i++)
+		strlcpy(data + i * ETH_GSTRING_LEN, priv->hw_stats[i].string,
+			ETH_GSTRING_LEN);
+}
+
+static u64 vsc85xx_get_stat(struct phy_device *phydev, int i)
+{
+	struct vsc8531_private *priv = phydev->priv;
+	int val;
+
+	val = phy_read_paged(phydev, priv->hw_stats[i].page,
+			     priv->hw_stats[i].reg);
+	if (val < 0)
+		return U64_MAX;
+
+	val = val & priv->hw_stats[i].mask;
+	priv->stats[i] += val;
+
+	return priv->stats[i];
+}
+
+static void vsc85xx_get_stats(struct phy_device *phydev,
+			      struct ethtool_stats *stats, u64 *data)
+{
+	struct vsc8531_private *priv = phydev->priv;
+	int i;
+
+	if (!priv)
+		return;
+
+	for (i = 0; i < priv->nstats; i++)
+		data[i] = vsc85xx_get_stat(phydev, i);
+}
+
 static int vsc85xx_led_cntl_set(struct phy_device *phydev,
 				u8 led_num,
 				u8 mode)
@@ -643,6 +744,12 @@ static int vsc85xx_probe(struct phy_device *phydev)
 	vsc8531->rate_magic = rate_magic;
 	vsc8531->nleds = 2;
 	vsc8531->supp_led_modes = VSC85XX_SUPP_LED_MODES;
+	vsc8531->hw_stats = vsc85xx_hw_stats;
+	vsc8531->nstats = ARRAY_SIZE(vsc85xx_hw_stats);
+	vsc8531->stats = devm_kmalloc_array(&phydev->mdio.dev, vsc8531->nstats,
+					    sizeof(u64), GFP_KERNEL);
+	if (!vsc8531->stats)
+		return -ENOMEM;
 
 	return vsc85xx_dt_led_modes_get(phydev, default_mode);
 }
@@ -671,6 +778,9 @@ static struct phy_driver vsc85xx_driver[] = {
 	.set_tunable	= &vsc85xx_set_tunable,
 	.read_page	= &vsc85xx_phy_read_page,
 	.write_page	= &vsc85xx_phy_write_page,
+	.get_sset_count = &vsc85xx_get_sset_count,
+	.get_strings    = &vsc85xx_get_strings,
+	.get_stats      = &vsc85xx_get_stats,
 },
 {
 	.phy_id		= PHY_ID_VSC8531,
@@ -694,6 +804,9 @@ static struct phy_driver vsc85xx_driver[] = {
 	.set_tunable	= &vsc85xx_set_tunable,
 	.read_page	= &vsc85xx_phy_read_page,
 	.write_page	= &vsc85xx_phy_write_page,
+	.get_sset_count = &vsc85xx_get_sset_count,
+	.get_strings    = &vsc85xx_get_strings,
+	.get_stats      = &vsc85xx_get_stats,
 },
 {
 	.phy_id		= PHY_ID_VSC8540,
@@ -717,6 +830,9 @@ static struct phy_driver vsc85xx_driver[] = {
 	.set_tunable	= &vsc85xx_set_tunable,
 	.read_page	= &vsc85xx_phy_read_page,
 	.write_page	= &vsc85xx_phy_write_page,
+	.get_sset_count = &vsc85xx_get_sset_count,
+	.get_strings    = &vsc85xx_get_strings,
+	.get_stats      = &vsc85xx_get_stats,
 },
 {
 	.phy_id		= PHY_ID_VSC8541,
@@ -740,6 +856,9 @@ static struct phy_driver vsc85xx_driver[] = {
 	.set_tunable	= &vsc85xx_set_tunable,
 	.read_page	= &vsc85xx_phy_read_page,
 	.write_page	= &vsc85xx_phy_write_page,
+	.get_sset_count = &vsc85xx_get_sset_count,
+	.get_strings    = &vsc85xx_get_strings,
+	.get_stats      = &vsc85xx_get_stats,
 }
 
 };
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v3 0/6] net: phy: mscc: various improvements to Microsemi PHY driver
From: Quentin Schulz @ 2018-10-08 10:07 UTC (permalink / raw)
  To: davem, andrew, f.fainelli
  Cc: allan.nielsen, linux-kernel, netdev, thomas.petazzoni,
	alexandre.belloni, Quentin Schulz

The Microsemi PHYs have multiple banks of registers (called pages).
Registers can only be accessed from one page, if we need a register from
another page, we need to switch the page and the registers of all other
pages are not accessible anymore.

Basically, to read register 5 from page 0, 1, 2, etc., you do the same
phy_read(phydev, 5); but you need to set the desired page beforehand.

In order to guarantee that two concurrent functions do not change the
page, we need to do some locking per page. This can be achieved with the
use of phy_select_page and phy_restore_page functions but phy_write/read
calls in-between those two functions shall be replaced by their
lock-free alternative __phy_write/read.

The Microsemi PHYs have several counters so let's make them available as PHY
statistics.

The VSC 8530/31/40/41 also need to update their EEE init sequence in order to
avoid packet losses and improve performance.

This patch series also makes some minor cosmetic changes to the driver.

Thanks,
Quentin

v3:
  - add reviewed-by,
  - use phy_read/write/modify_paged whenever possible instead of the
  combo phy_select_page, __phy_read/write/modify, phy_restore_page when
  only one __phy_read/write/modify was executed,

v2:
  - add patch to migrate MSCC driver to use phy_restore/select_page,
  - migrate all patches from v1 to use those two functions,
  - put the multiple lines of constants writes in an array and iterate over
  it to write the values,
  - add reviewed-bys,

Quentin Schulz (4):
  net: phy: mscc: migrate to phy_select/restore_page functions
  net: phy: mscc: remove unneeded parenthesis
  net: phy: mscc: shorten `x != 0` condition to `x`
  net: phy: mscc: remove unneeded temporary variable

Raju Lakkaraju (2):
  net: phy: mscc: add ethtool statistics counters
  net: phy: mscc: Add EEE init sequence

 drivers/net/phy/mscc.c | 352 +++++++++++++++++++++++++++++------------
 1 file changed, 255 insertions(+), 97 deletions(-)

-- 
2.17.1

^ permalink raw reply

* can not sync master interface when bond1 box connected with another bond1 box
From: Yi Li @ 2018-10-08  2:45 UTC (permalink / raw)
  To: netdev

Hi guys,
          I encountered this problem when using bonding with mode1. I
have two linux box,
they both have two nics, and i setup these two nics with bond1 mode on
each linux box.
And then I connected these two linux box with each other.

          And then , I found , sometimes, Box A selects eth0 as
active,  eth1 as backup; and
at this moment, Box B auto selects eth1 as active, eth0 as backup.
        But my Box A's eth0 is connected with Box B's eth0, so they
are disconnected and can
not recovery from this situation, until i reboot or re-plug cables.

        So , guys, how can I make two linux boxes both with bonding
mode 1, connected with each
other steadily.
        Thanks.

^ permalink raw reply

* Re: [PATCH bpf-next 1/6] bpf: introduce BPF_PROG_TYPE_FILE_FILTER
From: Alexei Starovoitov @ 2018-10-08  2:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Andy Lutomirski, Al Viro, Network Development, kernel list,
	kernel-team, Kernel Hardening, Mickaël Salaün
In-Reply-To: <CAG48ez2zuJP0qOSU5QAER1xP56d5SvC0HTrM0Yh3wkKEQZ9R3A@mail.gmail.com>

On Mon, Oct 08, 2018 at 02:56:15AM +0200, Jann Horn wrote:
> +cc kernel-hardening because this is related to sandboxing
> +cc Mickaël Salaün because this looks related to his Landlock proposal

It may seem that this work overlaps with landlock, but the goals
are different. Landlock is LSM based to act as _security_ framework
with end goal being available to unpriv users.
While this cgroup-bpf hook is expected to stay root only,
since we're trying to restrict what containers can do in a trusted environment.
'sandboxing' is overloaded word.
Sandboxing for security and sandboxing of trusted root are different.

> On Mon, Oct 8, 2018 at 2:30 AM Alexei Starovoitov <ast@kernel.org> wrote:
> > Similar to networking sandboxing programs and cgroup-v2 based hooks
> > (BPF_CGROUP_INET_[INGRESS|EGRESS,] BPF_CGROUP_INET[4|6]_[BIND|CONNECT], etc)
> > introduce basic per-container sandboxing for file access via
> > new BPF_PROG_TYPE_FILE_FILTER program type that attaches after
> > security_file_open() LSM hook and works as additional file_open filter.
> > The new cgroup bpf hook is called BPF_CGROUP_FILE_OPEN.
> 
> Why do_dentry_open() specifically, and nothing else? If you want to
> filter open, wouldn't you also want to filter a bunch of other
> filesystem operations with LSM hooks, like rename, unlink, truncate
> and so on? Landlock benefits there from re-using the existing security
> hooks.

It may make sense to extend in the future, but we don't have clear
user cases for rename/unlink/truncate at this point.

As you can see the amount of pushback even for basic file access is high.
Hence I don't think the landlock can be upstreamed in the current form,
since it touches VFS layer a lot more than this patch.
It's intrusive to LSM, and adds new concepts to BPF as well.
This work fits into existing BPF machinery and minimally intrusive to VFS.
I hope we can find a common ground with Al regarding what file
access primitives are exposed to BPF side.
Once we agree on that the landlock can piggy back on this work
and extend it to all file-based LSM hooks.
The first step for everyone interested in bpf-based 'sandboxing'
is to figure out VFS<->BPF interface.
If you or Mickael have suggestions on what bpf progs should and should not
see at these hooks, it's a good time to discuss.
I believe the fields proposed are the obvious minimum.

> > Just like other cgroup-bpf programs new BPF_PROG_TYPE_FILE_FILTER type
> > is only available to root.
> >
> > This program type has access to single argument 'struct bpf_file_info'
> > that contains standard sys_stat fields:
> > struct bpf_file_info {
> >         __u64 inode;
> >         __u32 dev_major;
> >         __u32 dev_minor;
> >         __u32 fs_magic;
> >         __u32 mnt_id;
> >         __u32 nlink;
> >         __u32 mode;     /* file mode S_ISDIR, S_ISLNK, 0755, etc */
> >         __u32 flags;    /* open flags O_RDWR, O_CREAT, etc */
> > };
> > Other file attributes can be added in the future to the end of this struct
> > without breaking bpf programs.
> >
> > For debugging introduce bpf_get_file_path() helper that returns
> > NUL-terminated full path of the file. It should never be used for sandboxing.
> >
> > Use cases:
> > - disallow certain FS types within containers (fs_magic == CGROUP2_SUPER_MAGIC)
> > - restrict permissions in particular mount (mnt_id == X && (flags & O_RDWR))
> > - disallow access to hard linked sensitive files (nlink > 1 && mode == 0700)
> > - disallow access to world writeable files (mode == 0..7)
> > - disallow access to given set of files (dev_major == X && dev_minor == Y && inode == Z)
> 
> That last one sounds weird. It doesn't work if you want to ban access
> to a whole directory at once. And in general, highly specific
> blocklists make me nervous, because if you add anything new and forget
> to put it on the list, you have a problem.

In the upcoming V2 of the patches the direct exposure to dev and inode will be removed.
And instead the opaque 'struct file_handle' will be available to bpf progs.
The use case is indeed to restrict access to specific blacklist of files.
The user space will collect the set of files via sys_name_to_handle_at(),
then store the fhandles in bpf map, and bpf prog will consult the map
to deny the access.
It's not a replacement for ACLs, directory permissions, etc. The use case
is to prevent trusted containers messing up the environment.
The continuous integration system needs to run some containers (and tests inside them)
with root privs. When these jobs mess up the system the subsequent jobs may incorrectly
fail. We believe that this cgroup based container enforcement will solve this use case
and similar other use cases when containers are trusted, but could be buggy when
it comes to file access.

To recap what I'm implementing in V2:
1.
struct bpf_file_info {
        __u32 fs_magic; // file->f_inode->i_sb->s_magic
        __u32 mnt_id;   // real_mount(file->f_path.mnt)->mnt_id
        __u32 nlink;    // file->f_inode->i_nlink
        __u32 mode;     // file->f_inode->i_mode
        __u32 flags;    // file->f_flags
};
I double checked what VFS layer does with above fields and I think
there is no additional user space exposure will be made when such
fields are seen by bpf progs.
But since I'm not a VFS expert, I'd like Al to confirm.

2.
bpf_get_file_handle(struct bpf_file_info *ctx, struct file_handle *fh, int fh_size);
helper that bpf prog will use to obtain fh of the file about to be open.

3.
bpf_get_file_statx(struct bpf_file_info *ctx, struct statx *sx, int size, int flags);
Though struct statx is 256 bytes, and the helper would have to touch all bytes
I couldn't figure out the faster way to get to inode/dev/uid of the given file
that will work on all underlying FSes.

Thoughts?

^ permalink raw reply

* RE: [v3, 3/6] net: dpaa2: fix dependency of config FSL_DPAA2_ETH
From: Ioana Ciocoi Radulescu @ 2018-10-08  9:14 UTC (permalink / raw)
  To: Y.b. Lu, linux-kernel@vger.kernel.org, devel@driverdev.osuosl.org,
	netdev@vger.kernel.org, Richard Cochran, David S . Miller,
	Greg Kroah-Hartman, Andrew Lunn
  Cc: Y.b. Lu
In-Reply-To: <20181008074430.34379-3-yangbo.lu@nxp.com>

> -----Original Message-----
> From: Yangbo Lu <yangbo.lu@nxp.com>
> Sent: Monday, October 8, 2018 10:44 AM
> To: linux-kernel@vger.kernel.org; devel@driverdev.osuosl.org;
> netdev@vger.kernel.org; Richard Cochran <richardcochran@gmail.com>;
> David S . Miller <davem@davemloft.net>; Ioana Ciocoi Radulescu
> <ruxandra.radulescu@nxp.com>; Greg Kroah-Hartman
> <gregkh@linuxfoundation.org>; Andrew Lunn <andrew@lunn.ch>
> Cc: Y.b. Lu <yangbo.lu@nxp.com>
> Subject: [v3, 3/6] net: dpaa2: fix dependency of config FSL_DPAA2_ETH
> 
> The NETDEVICES dependency and ETHERNET dependency hadn't
> been required since dpaa2-eth was moved out of staging.
> Also allowed COMPILE_TEST for dpaa2-eth.
> 
> Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
> ---
> Changes for v3:
> 	- Added this patch.
> ---
>  drivers/net/ethernet/freescale/dpaa2/Kconfig |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/dpaa2/Kconfig
> b/drivers/net/ethernet/freescale/dpaa2/Kconfig
> index 67e6461..a7f365d 100644
> --- a/drivers/net/ethernet/freescale/dpaa2/Kconfig
> +++ b/drivers/net/ethernet/freescale/dpaa2/Kconfig
> @@ -1,7 +1,7 @@
>  config FSL_DPAA2_ETH
>  	tristate "Freescale DPAA2 Ethernet"
>  	depends on FSL_MC_BUS && FSL_MC_DPIO
> -	depends on NETDEVICES && ETHERNET
> +	depends on ARCH_LAYERSCAPE || COMPILE_TEST

Dependency on ARCH_LAYERSCAPE and COMPILE_TEST (for some architectures)
is already implied through FSL_MC_BUS.

Thanks,
Ioana

^ permalink raw reply

* Re: [PATCH V2 net-next 2/5] net: Introduce a new MII time stamping interface.
From: Richard Cochran @ 2018-10-08  2:04 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, devicetree, David Miller, Florian Fainelli, Jacob Keller,
	Mark Rutland, Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <20181007195400.GA25883@lunn.ch>

On Sun, Oct 07, 2018 at 09:54:00PM +0200, Andrew Lunn wrote:
> Sure, but things have moved on since then.

I was curious about this.  Based on your uses cases, I guess that you
mean phylib?  But not much has changed AFAICT. (There is one new
global function and two were removed, but that doesn't change the
picture WRT time stamping.)

Phylink now has two or three new users, one of which is dsa.  Is that
the big move?

The situation with MACs that handle their own PHYs without phylib is
unchanged, AFAICT.

So what exactly do you mean?

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: relicense libbpf as LGPL-2.1 OR BSD-2-Clause
From: Jesper Dangaard Brouer @ 2018-10-08  9:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, davem, acme, bjorn.topel, jakub.kicinski,
	tgraf, wangnan0, colin.king, kraig, daniel.diaz, eric, hekuang,
	jeremy, jolsa, joe, yamada.masahiro, mic, namhyung, ppenkov,
	netdev, linux-kernel, kernel-team, brouer
In-Reply-To: <7a8a3ccd-d78f-8e3d-4759-34297e8e5357@iogearbox.net>

On Mon, 8 Oct 2018 10:15:10 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/05/2018 11:24 PM, Alexei Starovoitov wrote:
> > libbpf is maturing as a library and gaining features that no other bpf libraries support
> > (BPF Type Format, bpf to bpf calls, etc)
> > Many Apache2 licensed projects (like bcc, bpftrace, gobpf, cilium, etc)
> > would like to use libbpf, but cannot do this yet, since Apache Foundation explicitly
> > states that LGPL is incompatible with Apache2.
> > Hence let's relicense libbpf as dual license LGPL-2.1 or BSD-2-Clause,
> > since BSD-2 is compatible with Apache2.
> > Dual LGPL or Apache2 is invalid combination.
> > Fix license mistake in Makefile as well.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > Acked-by: Andrey Ignatov <rdna@fb.com>
> > Acked-by: Arnaldo Carvalho de Melo <acme@kernel.org>
> > Acked-by: Björn Töpel <bjorn.topel@intel.com>
> > Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> > Acked-by: David Beckett <david.beckett@netronome.com>
> > Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > Acked-by: Joe Stringer <joe@ovn.org>
> > Acked-by: John Fastabend <john.fastabend@gmail.com>
> > Acked-by: Martin KaFai Lau <kafai@fb.com>
> > Acked-by: Quentin Monnet <quentin.monnet@netronome.com>
> > Acked-by: Thomas Graf <tgraf@suug.ch>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Wang Nan <wangnan0@huawei.com>
> > Acked-by: Yonghong Song <yhs@fb.com>  
> 
> Applied to bpf-next, thanks!

For the public record:
 Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

I do know, this ack will not reach the git-log, but given I have code
in the lib, I want to show that I support and ack this re-license.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH bpf-next 1/6] bpf: introduce BPF_PROG_TYPE_FILE_FILTER
From: Mickaël Salaün @ 2018-10-08  9:06 UTC (permalink / raw)
  To: Alexei Starovoitov, Jann Horn
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Andy Lutomirski, Al Viro, Network Development, kernel list,
	kernel-team, Kernel Hardening, linux-security-module, Linux API
In-Reply-To: <20181008022210.33ljgryhodzunf5l@ast-mbp>

[-- Attachment #1.1: Type: text/plain, Size: 9383 bytes --]

On 10/8/18 04:22, Alexei Starovoitov wrote:
> On Mon, Oct 08, 2018 at 02:56:15AM +0200, Jann Horn wrote:
>> +cc kernel-hardening because this is related to sandboxing
>> +cc Mickaël Salaün because this looks related to his Landlock proposal

Thanks for the CC. Because this patch series is an access-control
mechanism, it will undoubtedly interest LSM folks as well.

Here is the full thread:
https://lore.kernel.org/lkml/20181004025750.498303-1-ast@kernel.org/

> 
> It may seem that this work overlaps with landlock, but the goals
> are different. Landlock is LSM based to act as _security_ framework
> with end goal being available to unpriv users.

Right, one of the end goal of Landlock is to make most of its
access-control features available to unprivileged processes. However,
some of them may require higher privileges, but we should minimize them
as much as possible to not end up with the SUID-binary (or
CAP_SYS_ADMIN) syndrome.

> While this cgroup-bpf hook is expected to stay root only,
> since we're trying to restrict what containers can do in a trusted environment.
> 'sandboxing' is overloaded word.
> Sandboxing for security and sandboxing of trusted root are different.

Sure we must treat them in a different manner but we must take care of
which capability this sandboxing mechanism really need, and hopefully
not root-like ones, especially for a container.

As a reminder, using cgroups to sandbox processes is definitely a use
case for Landlock, as well as the process hierarchy one (seccomp-like).
I implemented sandboxing with cgroups in a previous patch series, but
put it aside for now to minimize the number of patches.

> 
>> On Mon, Oct 8, 2018 at 2:30 AM Alexei Starovoitov <ast@kernel.org> wrote:
>>> Similar to networking sandboxing programs and cgroup-v2 based hooks
>>> (BPF_CGROUP_INET_[INGRESS|EGRESS,] BPF_CGROUP_INET[4|6]_[BIND|CONNECT], etc)
>>> introduce basic per-container sandboxing for file access via
>>> new BPF_PROG_TYPE_FILE_FILTER program type that attaches after
>>> security_file_open() LSM hook and works as additional file_open filter.
>>> The new cgroup bpf hook is called BPF_CGROUP_FILE_OPEN.
>>
>> Why do_dentry_open() specifically, and nothing else? If you want to
>> filter open, wouldn't you also want to filter a bunch of other
>> filesystem operations with LSM hooks, like rename, unlink, truncate
>> and so on? Landlock benefits there from re-using the existing security
>> hooks.
> 
> It may make sense to extend in the future, but we don't have clear
> user cases for rename/unlink/truncate at this point.

I guess the users who ask for the current features will come with new
ones, which will move towards the same goal as Landlock.

> 
> As you can see the amount of pushback even for basic file access is high.
> Hence I don't think the landlock can be upstreamed in the current form,
> since it touches VFS layer a lot more than this patch.

That is not true for the main part. The only patch from Landlock
touching the FS subsystem is to add a new LSM hook. This was NAKed by Al
but I guess mostly because of misunderstanding. I'll send a new specific
patch to the FS subsystem soon to properly explain the impact on path
lookup and why it doesn't expose more that userspace have already access.

Regarding the upstreaming concern, if I remember correctly, the net/BPF
and the LSM maintainers are (mostly ?) OK. The feedback I got (you
included) was good.

I can definitely create a lighter version of Landlock with your use
cases in mind. There is already all the necessary building blocks: the
BPF (un-)privileged handling, the file checks (see the FS_PICK subtype)
and the cgroups handling. I may even be easier to upstream (and answer
Casey's request) to make the patch series smaller.

> It's intrusive to LSM, and adds new concepts to BPF as well.
> This work fits into existing BPF machinery and minimally intrusive to VFS.
> I hope we can find a common ground with Al regarding what file
> access primitives are exposed to BPF side.
> Once we agree on that the landlock can piggy back on this work
> and extend it to all file-based LSM hooks.

Can you explain your thought about piggy backing? If your patch is
included, Landlock will still need to add new BPF prog types, and will
still be an LSM (which is designed to handle access-control).

> The first step for everyone interested in bpf-based 'sandboxing'
> is to figure out VFS<->BPF interface.
> If you or Mickael have suggestions on what bpf progs should and should not
> see at these hooks, it's a good time to discuss.
> I believe the fields proposed are the obvious minimum.

In a nutshell, Landlock rely on an inode map, which contains references
to inodes. These references/handles can then be used to "compare" with a
requested inode. This way, we can limit the checks to what is already
available to the process which loaded the map. I suggest to not use
these hardcoded fields but use a BFP helper with a file handle as
argument and dedicated flags, as I did in a previous version of Landlock
(which I planned to upstream as a new future feature).

> 
>>> Just like other cgroup-bpf programs new BPF_PROG_TYPE_FILE_FILTER type
>>> is only available to root.
>>>
>>> This program type has access to single argument 'struct bpf_file_info'
>>> that contains standard sys_stat fields:
>>> struct bpf_file_info {
>>>         __u64 inode;
>>>         __u32 dev_major;
>>>         __u32 dev_minor;
>>>         __u32 fs_magic;
>>>         __u32 mnt_id;
>>>         __u32 nlink;
>>>         __u32 mode;     /* file mode S_ISDIR, S_ISLNK, 0755, etc */
>>>         __u32 flags;    /* open flags O_RDWR, O_CREAT, etc */
>>> };
>>> Other file attributes can be added in the future to the end of this struct
>>> without breaking bpf programs.
>>>
>>> For debugging introduce bpf_get_file_path() helper that returns
>>> NUL-terminated full path of the file. It should never be used for sandboxing.

Well, I'm convinced it *will* be used for sandboxing. Once a feature is
available, it is too late to forbid users to use it the way they want,
even if you warn them about security issues…

>>>
>>> Use cases:
>>> - disallow certain FS types within containers (fs_magic == CGROUP2_SUPER_MAGIC)
>>> - restrict permissions in particular mount (mnt_id == X && (flags & O_RDWR))
>>> - disallow access to hard linked sensitive files (nlink > 1 && mode == 0700)
>>> - disallow access to world writeable files (mode == 0..7)
>>> - disallow access to given set of files (dev_major == X && dev_minor == Y && inode == Z)

These use cases are interesting and should help to harden containers.
They are definitely in the scope of what Landlock can do.

However, I would like to see which capabilities are needed to access
these properties.

>>
>> That last one sounds weird. It doesn't work if you want to ban access
>> to a whole directory at once. And in general, highly specific
>> blocklists make me nervous, because if you add anything new and forget
>> to put it on the list, you have a problem.
> 
> In the upcoming V2 of the patches the direct exposure to dev and inode will be removed.
> And instead the opaque 'struct file_handle' will be available to bpf progs.
> The use case is indeed to restrict access to specific blacklist of files.
> The user space will collect the set of files via sys_name_to_handle_at(),
> then store the fhandles in bpf map, and bpf prog will consult the map
> to deny the access.
> It's not a replacement for ACLs, directory permissions, etc. The use case
> is to prevent trusted containers messing up the environment.
> The continuous integration system needs to run some containers (and tests inside them)
> with root privs. When these jobs mess up the system the subsequent jobs may incorrectly
> fail. We believe that this cgroup based container enforcement will solve this use case
> and similar other use cases when containers are trusted, but could be buggy when
> it comes to file access.
> 
> To recap what I'm implementing in V2:
> 1.
> struct bpf_file_info {
>         __u32 fs_magic; // file->f_inode->i_sb->s_magic
>         __u32 mnt_id;   // real_mount(file->f_path.mnt)->mnt_id
>         __u32 nlink;    // file->f_inode->i_nlink
>         __u32 mode;     // file->f_inode->i_mode
>         __u32 flags;    // file->f_flags
> };
> I double checked what VFS layer does with above fields and I think
> there is no additional user space exposure will be made when such
> fields are seen by bpf progs.
> But since I'm not a VFS expert, I'd like Al to confirm.
> 
> 2.
> bpf_get_file_handle(struct bpf_file_info *ctx, struct file_handle *fh, int fh_size);
> helper that bpf prog will use to obtain fh of the file about to be open.

This looks like Landlock. :)

> 
> 3.
> bpf_get_file_statx(struct bpf_file_info *ctx, struct statx *sx, int size, int flags);
> Though struct statx is 256 bytes, and the helper would have to touch all bytes
> I couldn't figure out the faster way to get to inode/dev/uid of the given file
> that will work on all underlying FSes.

This looks like a previous helper I implemented, but tried to avoid
because of security issues (mostly for unpriv processes).

> 
> Thoughts?
> 
> 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] rtlwifi: rtl8821ae: add in a missing break in switch statement
From: Colin Ian King @ 2018-10-08  8:55 UTC (permalink / raw)
  To: Larry Finger, Joe Perches, Kalle Valo
  Cc: Ping-Ke Shih, David S . Miller, Tsang-Shian Lin, linux-wireless,
	netdev, kernel-janitors, linux-kernel
In-Reply-To: <617cf0d8-a99a-1f8c-b1a2-7e3f07aa6d24@lwfinger.net>

On 07/10/18 01:48, Larry Finger wrote:
> On 10/6/18 5:03 PM, Joe Perches wrote:
>> On Sat, 2018-10-06 at 17:00 -0500, Larry Finger wrote:
>>> On 10/6/18 3:17 PM, Joe Perches wrote:
>>>> On Sat, 2018-10-06 at 15:05 -0500, Larry Finger wrote:
>>>>> On 10/6/18 2:30 PM, Kalle Valo wrote:
>>>>>> Colin King <colin.king@canonical.com> writes:
>>>>>>
>>>>>>> From: Colin Ian King <colin.king@canonical.com>
>>>>>>>
>>>>>>> The switch case RATR_INX_WIRELESS_MC has a missing break, this seems
>>>>>>> to be unintentional as the setting of variable ret gets overwritten
>>>>>>> when the case falls through to the following RATR_INX_WIRELESS_AC_5N
>>>>>>> case.  Fix this by adding in the missing break.
>>>>>>>
>>>>>>> Detected by CoverityScan, CID#1167237 ("Missing break in switch")
>>>>>>>
>>>>>>> Fixes: 3c05bedb5fef ("Staging: rtl8812ae: Add Realtek 8821 PCI
>>>>>>> WIFI driver")
>>>>>>> Signed-off-by: Colin Ian King <colin.king@canonical.com>
>>>>>>> ---
>>>>>>>     drivers/net/wireless/realtek/rtlwifi/rtl8821ae/hw.c | 1 +
>>>>>>
>>>>>> Is the fixes line correct? This patch is not for staging.
>>>>>
>>>>> No, the correct fixes commit is 21e4b0726dc67 (" rtlwifi:
>>>>> rtl8821ae: Move driver
>>>>> from staging to regular tree").
>>>>>
>>>>> This driver was initially placed in staging as it was needed for a
>>>>> special
>>>>> project, which is the commit that Colin used. As the patch subject
>>>>> states, the
>>>>> driver was later moved to the regular wireless tree.
>>>>>
>>>>> That break is required, thus ACKed-by: Larry Finger
>>>>> <Larry.Finger@lwfinger.net>
>>>>
>>>> Why not remove this entirely and use the generic routine in
>>>> drivers/net/wireless/realtek/rtlwifi/base.c?
>>>>
>>>> Is there a real difference?
>>>
>>> I did not see any difference other than the removal of a bunch of
>>> magic numbers
>>> and better formatting.
>>
>> Me neither.
> 
> Colin,
> 
> Do you want to push the new patch removing the duplicate routine from
> rtl8821ae?

Indeed. Sent.
> 
> Larry

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox