Netdev List
 help / color / mirror / Atom feed
* [PATCH bpf 0/2] tools: bpftool: fix unlikely race and JSON output on error path
From: Jakub Kicinski @ 2017-12-22 19:36 UTC (permalink / raw)
  To: netdev, daniel, alexei.starovoitov; +Cc: oss-drivers, Jakub Kicinski

Hi!

Two small fixes here to listing maps and programs.  The loop for showing
maps is written slightly differently to programs which was missed in JSON
output support, and output would be broken if any of the system calls
failed.  Second fix is in very unlikely case that program or map disappears
after we get its ID we should just skip over that object instead of failing.

Jakub Kicinski (2):
  tools: bpftool: maps: close json array on error paths of show
  tools: bpftool: protect against races with disappearing objects

 tools/bpf/bpftool/map.c  | 8 +++++---
 tools/bpf/bpftool/prog.c | 2 ++
 2 files changed, 7 insertions(+), 3 deletions(-)

-- 
2.15.1

^ permalink raw reply

* [PATCH bpf 1/2] tools: bpftool: maps: close json array on error paths of show
From: Jakub Kicinski @ 2017-12-22 19:36 UTC (permalink / raw)
  To: netdev, daniel, alexei.starovoitov; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20171222193606.19786-1-jakub.kicinski@netronome.com>

We can't return from the middle of do_show(), because
json_array will not be closed.  Break out of the loop.
Note that the error handling after the loop depends on
errno, so no need to set err.

Fixes: 831a0aafe5c3 ("tools: bpftool: add JSON output for `bpftool map *` commands")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 tools/bpf/bpftool/map.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index e2450c8e88e6..8368b7ea31b5 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -523,21 +523,21 @@ static int do_show(int argc, char **argv)
 				break;
 			p_err("can't get next map: %s%s", strerror(errno),
 			      errno == EINVAL ? " -- kernel too old?" : "");
-			return -1;
+			break;
 		}
 
 		fd = bpf_map_get_fd_by_id(id);
 		if (fd < 0) {
 			p_err("can't get map by id (%u): %s",
 			      id, strerror(errno));
-			return -1;
+			break;
 		}
 
 		err = bpf_obj_get_info_by_fd(fd, &info, &len);
 		if (err) {
 			p_err("can't get map info: %s", strerror(errno));
 			close(fd);
-			return -1;
+			break;
 		}
 
 		if (json_output)
-- 
2.15.1

^ permalink raw reply related

* [PATCH bpf 2/2] tools: bpftool: protect against races with disappearing objects
From: Jakub Kicinski @ 2017-12-22 19:36 UTC (permalink / raw)
  To: netdev, daniel, alexei.starovoitov; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20171222193606.19786-1-jakub.kicinski@netronome.com>

On program/map show we may get an ID of an object from GETNEXT,
but the object may disappear before we call GET_FD_BY_ID.  If
that happens, ignore the object and continue.

Fixes: 71bb428fe2c1 ("tools: bpf: add bpftool")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 tools/bpf/bpftool/map.c  | 2 ++
 tools/bpf/bpftool/prog.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 8368b7ea31b5..a8c3a33dd185 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -528,6 +528,8 @@ static int do_show(int argc, char **argv)
 
 		fd = bpf_map_get_fd_by_id(id);
 		if (fd < 0) {
+			if (errno == ENOENT)
+				continue;
 			p_err("can't get map by id (%u): %s",
 			      id, strerror(errno));
 			break;
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index ad619b96c276..dded77345bfb 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -382,6 +382,8 @@ static int do_show(int argc, char **argv)
 
 		fd = bpf_prog_get_fd_by_id(id);
 		if (fd < 0) {
+			if (errno == ENOENT)
+				continue;
 			p_err("can't get prog by id (%u): %s",
 			      id, strerror(errno));
 			err = -1;
-- 
2.15.1

^ permalink raw reply related

* [PATCH net-next 0/4] kcm: Fix two locking issues
From: Tom Herbert @ 2017-12-22 19:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, dvyukov, john.fastabend, rohit, Tom Herbert

One issue is lockdep warnings when sock_owned_by_user returns true
in strparser. Fix is to add and call sock_owned_by_user_nocheck since
the check for owned by user is not an error condition in this case.

The other issue is a potential deadlock between TX and RX paths

KCM socket lock and the psock socket lock are acquired in both
the RX and TX path, however they take the locks in opposite order
which can lead to deadlock. The fix is to add try_sock_lock to see
if psock socket lock can get acquired in the TX path with KCM lock
held. If not, then KCM socket is released and the psock socket lock
and KCM socket lock are acquired in the same order as the RX path.

Tested:

Ran KCM traffic without incident.

Tom Herbert (4):
  sock: Add sock_owned_by_user_nocheck
  strparser: Call sock_owned_by_user_nocheck
  sock_lock: Add try_sock_lock
  kcm: Address deadlock between TX and RX paths

 include/net/kcm.h         |  1 +
 include/net/sock.h        | 12 +++++++++
 net/core/sock.c           | 20 +++++++++++++++
 net/kcm/kcmsock.c         | 64 ++++++++++++++++++++++++++++++++++-------------
 net/strparser/strparser.c |  2 +-
 5 files changed, 81 insertions(+), 18 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net-next 1/4] sock: Add sock_owned_by_user_nocheck
From: Tom Herbert @ 2017-12-22 19:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, dvyukov, john.fastabend, rohit, Tom Herbert
In-Reply-To: <20171222194755.8544-1-tom@quantonium.net>

This allows checking socket lock ownership with producing lockdep
warnings.

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
 include/net/sock.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0a32f3ce381c..3b4ca2046f8c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1515,6 +1515,11 @@ static inline bool sock_owned_by_user(const struct sock *sk)
 	return sk->sk_lock.owned;
 }
 
+static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
+{
+	return sk->sk_lock.owned;
+}
+
 /* no reclassification while locks are held */
 static inline bool sock_allow_reclassification(const struct sock *csk)
 {
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 2/4] strparser: Call sock_owned_by_user_nocheck
From: Tom Herbert @ 2017-12-22 19:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, dvyukov, john.fastabend, rohit, Tom Herbert
In-Reply-To: <20171222194755.8544-1-tom@quantonium.net>

strparser wants to check socket ownership without producing any
warnings. As indicated by the comment in the code, it is permissible
for owned_by_user to return true.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Tom Herbert <tom@quantonium.net>
---
 net/strparser/strparser.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index c5fda15ba319..1fdab5c4eda8 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -401,7 +401,7 @@ void strp_data_ready(struct strparser *strp)
 	 * allows a thread in BH context to safely check if the process
 	 * lock is held. In this case, if the lock is held, queue work.
 	 */
-	if (sock_owned_by_user(strp->sk)) {
+	if (sock_owned_by_user_nocheck(strp->sk)) {
 		queue_work(strp_wq, &strp->work);
 		return;
 	}
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 3/4] sock_lock: Add try_sock_lock
From: Tom Herbert @ 2017-12-22 19:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, dvyukov, john.fastabend, rohit, Tom Herbert
In-Reply-To: <20171222194755.8544-1-tom@quantonium.net>

try_sock lock is an opportunistic attempt to acquire a socket lock
without blocking or sleeping. If the socket lock is acquired then
true is returned, else false is returned.

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
 include/net/sock.h |  7 +++++++
 net/core/sock.c    | 20 ++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3b4ca2046f8c..69fdd1a89591 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1462,6 +1462,13 @@ static inline void lock_sock(struct sock *sk)
 	lock_sock_nested(sk, 0);
 }
 
+bool try_lock_sock_nested(struct sock *sk, int subclass);
+
+static inline bool try_lock_sock(struct sock *sk)
+{
+	return try_lock_sock_nested(sk, 0);
+}
+
 void release_sock(struct sock *sk);
 
 /* BH context may only use the following locking interface. */
diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b221784..40fb772e2d52 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2782,6 +2782,26 @@ void lock_sock_nested(struct sock *sk, int subclass)
 }
 EXPORT_SYMBOL(lock_sock_nested);
 
+bool try_lock_sock_nested(struct sock *sk, int subclass)
+{
+	spin_lock_bh(&sk->sk_lock.slock);
+	if (sk->sk_lock.owned) {
+		spin_unlock_bh(&sk->sk_lock.slock);
+		return false;
+	}
+
+	sk->sk_lock.owned = 1;
+	spin_unlock(&sk->sk_lock.slock);
+
+	/* The sk_lock has mutex_lock() semantics here: */
+
+	mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
+	local_bh_enable();
+
+	return true;
+}
+EXPORT_SYMBOL(try_lock_sock_nested);
+
 void release_sock(struct sock *sk)
 {
 	spin_lock_bh(&sk->sk_lock.slock);
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 4/4] kcm: Address deadlock between TX and RX paths
From: Tom Herbert @ 2017-12-22 19:47 UTC (permalink / raw)
  To: davem; +Cc: netdev, dvyukov, john.fastabend, rohit, Tom Herbert
In-Reply-To: <20171222194755.8544-1-tom@quantonium.net>

Both the transmit and receive paths of KCM need to take both the
KCM socket lock and the psock socket lock, however they take the
locks in opposite order which can lead to deadlock.

This patch changes the transmit path (kcm_write_msgs to be specific)
so the locks are taken in the proper order. try_sock_lock is first used
to get the lower socket lock, if that is successful then sending data
can proceed with dropping KCM lock. If try_sock_lock fails then the KCM
lock is released and lock_sock is done on the lower socket followed by
the lock_sock on the KCM sock.

A doing_write_msgs flag has been added to kcm structure to prevent
multiple threads doing write_msgs when the KCM lock is dropped.
kernel_sendpage_locked is now called to do the send data with lock
already held.

Signed-off-by: Tom Herbert <tom@quantonium.net>
---
 include/net/kcm.h |  1 +
 net/kcm/kcmsock.c | 64 ++++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 48 insertions(+), 17 deletions(-)

diff --git a/include/net/kcm.h b/include/net/kcm.h
index 2a8965819db0..22bd7dd3eedb 100644
--- a/include/net/kcm.h
+++ b/include/net/kcm.h
@@ -78,6 +78,7 @@ struct kcm_sock {
 	/* Don't use bit fields here, these are set under different locks */
 	bool tx_wait;
 	bool tx_wait_more;
+	bool doing_write_msgs;
 
 	/* Receive */
 	struct kcm_psock *rx_psock;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index d4e98f20fc2a..3eb3179b96b3 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -574,13 +574,19 @@ static void kcm_report_tx_retry(struct kcm_sock *kcm)
 static int kcm_write_msgs(struct kcm_sock *kcm)
 {
 	struct sock *sk = &kcm->sk;
-	struct kcm_psock *psock;
-	struct sk_buff *skb, *head;
-	struct kcm_tx_msg *txm;
+	struct sk_buff *head = skb_peek(&sk->sk_write_queue);
 	unsigned short fragidx, frag_offset;
 	unsigned int sent, total_sent = 0;
+	struct kcm_psock *psock;
+	struct kcm_tx_msg *txm;
+	struct sk_buff *skb;
 	int ret = 0;
 
+	if (unlikely(kcm->doing_write_msgs))
+		return 0;
+
+	kcm->doing_write_msgs = true;
+
 	kcm->tx_wait_more = false;
 	psock = kcm->tx_psock;
 	if (unlikely(psock && psock->tx_stopped)) {
@@ -598,15 +604,36 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 		return 0;
 	}
 
+try_again:
+	psock = reserve_psock(kcm);
+	if (!psock)
+		goto out_no_release_psock;
+
+	/* Get lock for lower sock */
+	if (!try_lock_sock(psock->sk)) {
+		/* Someone  else is holding the lower sock lock. We need to
+		 * release the KCM lock and get the psock lock first. This is
+		 * needed since the receive path obtains the locks in reverse
+		 * order and we want to avoid deadlock. Note that
+		 * write_msgs can't be reentered when we drop the KCM lock
+		 * since doing_write_msgs is set.
+		 */
+		release_sock(&kcm->sk);
+
+		/* Take locks in order that receive path does */
+		lock_sock(psock->sk);
+		lock_sock(&kcm->sk);
+	}
+
+	/* At this point we have a reserved psock and its lower socket is
+	 * locked.
+	 */
+
 	head = skb_peek(&sk->sk_write_queue);
 	txm = kcm_tx_msg(head);
 
 	if (txm->sent) {
 		/* Send of first skbuff in queue already in progress */
-		if (WARN_ON(!psock)) {
-			ret = -EINVAL;
-			goto out;
-		}
 		sent = txm->sent;
 		frag_offset = txm->frag_offset;
 		fragidx = txm->fragidx;
@@ -615,11 +642,6 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 		goto do_frag;
 	}
 
-try_again:
-	psock = reserve_psock(kcm);
-	if (!psock)
-		goto out;
-
 	do {
 		skb = head;
 		txm = kcm_tx_msg(head);
@@ -643,11 +665,12 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 				goto out;
 			}
 
-			ret = kernel_sendpage(psock->sk->sk_socket,
-					      frag->page.p,
-					      frag->page_offset + frag_offset,
-					      frag->size - frag_offset,
-					      MSG_DONTWAIT);
+			ret = kernel_sendpage_locked(psock->sk, frag->page.p,
+						     frag->page_offset +
+							frag_offset,
+						     frag->size -
+							frag_offset,
+						     MSG_DONTWAIT);
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
 					/* Save state to try again when there's
@@ -669,6 +692,7 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 				 */
 				kcm_abort_tx_psock(psock, ret ? -ret : EPIPE,
 						   true);
+				release_sock(psock->sk);
 				unreserve_psock(kcm);
 
 				txm->sent = 0;
@@ -704,7 +728,13 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 		total_sent += sent;
 		KCM_STATS_INCR(psock->stats.tx_msgs);
 	} while ((head = skb_peek(&sk->sk_write_queue)));
+
 out:
+	release_sock(psock->sk);
+
+out_no_release_psock:
+	kcm->doing_write_msgs = false;
+
 	if (!head) {
 		/* Done with all queued messages. */
 		WARN_ON(!skb_queue_empty(&sk->sk_write_queue));
-- 
2.11.0

^ permalink raw reply related

* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
From: Cong Wang @ 2017-12-22 20:31 UTC (permalink / raw)
  To: John Fastabend; +Cc: Linux Kernel Network Developers, Jakub Kicinski
In-Reply-To: <7d126ce3-15c6-92ad-c9b1-da9a3149dd3e@gmail.com>

On Thu, Dec 21, 2017 at 7:06 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> __skb_array_empty() is only safe if array is never resized.
>> pfifo_fast_dequeue() is called in TX BH context and without
>> qdisc lock, so even after we disable BH on ->reset() path
>> we can still race with other CPU's.
>>
>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>> Cc: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>>  net/sched/sch_generic.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 00ddb5f8f430..9279258ce060 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>       for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>               struct skb_array *q = band2list(priv, band);
>>
>> -             if (__skb_array_empty(q))
>> -                     continue;
>> -
>>               skb = skb_array_consume_bh(q);
>>       }
>>       if (likely(skb)) {
>>
>
>
> So this is a performance thing we don't want to grab the consumer lock on
> empty bands. Which can be fairly common depending on traffic patterns.


I understand why you had it, but it is just not safe. You don't want
to achieve performance gain by crashing system, right?

>
> Although its not logical IMO to have both reset and dequeue running at
> the same time. Some skbs would get through others would get sent, sort
> of a mess. I don't see how it can be an issue. The never resized bit
> in the documentation is referring to resizing the ring size _not_ popping
> off elements of the ring. array_empty just reads the consumer head.
> The only ring resizing in pfifo fast should be at init and destroy where
> enqueue/dequeue should be disconnected by then. Although based on the
> trace I missed a case.


Both pfifo_fast_reset() and pfifo_fast_dequeue() call
skb_array_consume_bh(), so there is no difference w.r.t. resizing.

And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
concurrently.


>
> I think the right fix is to only call reset/destroy patterns after
> waiting a grace period and for all tx_action calls in-flight to
> complete. This is also better going forward for more complex qdiscs.

But we don't even have rcu read lock in TX BH, do we?

Also, people certainly don't like yet another synchronize_net()...

^ permalink raw reply

* [PATCH net v2] netns, rtnetlink: fix struct net reference leak
From: Craig Gallek @ 2017-12-22 20:36 UTC (permalink / raw)
  To: David Miller, Jiri Benc; +Cc: netdev, Nicolas Dichtel, Jason A . Donenfeld

From: Craig Gallek <kraig@google.com>

netns ids were added in commit 0c7aecd4bde4 and defined as signed
integers in both the kernel datastructures and the netlink interface.
However, the semantics of the implementation assume that the ids
are always greater than or equal to zero, except for an internal
sentinal value NETNSA_NSID_NOT_ASSIGNED.

Several subsequent patches carried this pattern forward.  This patch
updates all of the netlink input paths of this value to enforce the
'greater than or equal to zero' constraint.

This issue was discovered by syskaller.  It would set a negative
value for a netns id and then repeatedly call the RTM_GETLINK.
The put_net call in that path would not trigger for negative netns ids,
caused a reference count leak, and eventually overflowed.  There are
probably additional error paths that do not handle this situation
correctly, but this was the only one I was able to trigger a real
issue through.

Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Fixes: 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
CC: Jiri Benc <jbenc@redhat.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Craig Gallek <kraig@google.com>
---
 net/core/net_namespace.c |  2 ++
 net/core/rtnetlink.c     | 17 +++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 60a71be75aea..4b7ea33f5705 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -627,6 +627,8 @@ static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EINVAL;
 	}
 	nsid = nla_get_s32(tb[NETNSA_NSID]);
+	if (nsid < 0)
+		return -EINVAL;
 
 	if (tb[NETNSA_PID]) {
 		peer = get_net_ns_by_pid(nla_get_u32(tb[NETNSA_PID]));
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..a928b8f081b8 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1733,10 +1733,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 			ifla_policy, NULL) >= 0) {
 		if (tb[IFLA_IF_NETNSID]) {
 			netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-			tgt_net = get_target_net(skb, netnsid);
-			if (IS_ERR(tgt_net)) {
-				tgt_net = net;
-				netnsid = -1;
+			if (netnsid >= 0) {
+				tgt_net = get_target_net(skb, netnsid);
+				if (IS_ERR(tgt_net)) {
+					tgt_net = net;
+					netnsid = -1;
+				}
 			}
 		}
 
@@ -2792,6 +2794,11 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 		if (tb[IFLA_LINK_NETNSID]) {
 			int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
 
+			if (id < 0) {
+				err =  -EINVAL;
+				goto out;
+			}
+
 			link_net = get_net_ns_by_id(dest_net, id);
 			if (!link_net) {
 				err =  -EINVAL;
@@ -2883,6 +2890,8 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	if (tb[IFLA_IF_NETNSID]) {
 		netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
+		if (netnsid < 0)
+			return -EINVAL;
 		tgt_net = get_target_net(skb, netnsid);
 		if (IS_ERR(tgt_net))
 			return PTR_ERR(tgt_net);
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/2] tools/bpftool: use version from the kernel source tree
From: Jakub Kicinski @ 2017-12-22 21:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: netdev, linux-kernel, kernel-team, Alexei Starovoitov,
	Daniel Borkmann
In-Reply-To: <20171222161152.24715-1-guro@fb.com>

On Fri, 22 Dec 2017 16:11:51 +0000, Roman Gushchin wrote:
> Bpftool determines it's own version based on the kernel
> version, which is picked from the linux/version.h header.
> 
> It's strange to use the version of the installed kernel
> headers, and makes much more sense to use the version
> of the actual source tree, where bpftool sources are.
> 
> Fix this by building kernelversion target and use
> the resulting string as bpftool version.
> 
> Example:
> before:
> 
> $ bpftool version
> bpftool v4.14.6
> 
> after:
> $ bpftool version
> bpftool v4.15.0-rc3
> 
> $bpftool version --json
> {"version":"4.15.0-rc3"}
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thanks Roman!

^ permalink raw reply

* [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic
From: Alexei Starovoitov @ 2017-12-22 21:33 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, Jann Horn, netdev, kernel-team
In-Reply-To: <20171222213328.993019-1-ast@kernel.org>

instead of computing max stack depth for current call chain only
track the maximum possible stack depth of any function at given
frame position. Such algorithm is simple and fast, but conservative,
since it overestimates amount of stack used. Consider:
main() // stack 32
{
  A();
  B();
}

A(){} // stack 256

B()  // stack 64
{
  A();
}

since A() is called at frame[1] and frame[2], the algorithm
will estimate the max stack depth as 32 + 256 + 256 and will reject
such program, though real max is 32 + 64 + 256.

Fortunately the algorithm is good enough in practice. The alternative
would be to track max stack of every function in the fast pass through
the verifier and then do additional CFG walk just to compute max total.

Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_verifier.h | 17 +++++++++++++++++
 kernel/bpf/verifier.c        | 15 +++++++++------
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index aaac589e490c..adc65ef093d1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -194,7 +194,24 @@ struct bpf_verifier_env {
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	struct bpf_verifer_log log;
 	u32 subprog_starts[BPF_MAX_SUBPROGS];
+	/* computes the stack depth of each bpf function */
 	u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
+	/* gradually computes the maximum stack depth out of all functions
+	 * called at this frame position. Example:
+	 * main()
+	 * {
+	 *   a();
+	 *   b();
+	 * }
+	 * b()
+	 * {
+	 *   a();
+	 * }
+	 * frame_stack_depth[0] = stack depth of main()
+	 * frame_stack_depth[1] = max { stack depth of a(), stack depth of b() }
+	 * frame_stack_depth[2] = stack depth of a()
+	 */
+	u16 frame_stack_depth[MAX_CALL_FRAMES];
 	u32 subprog_cnt;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4ae46b2cba88..096681bcb312 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1434,15 +1434,18 @@ static int update_stack_depth(struct bpf_verifier_env *env,
 	struct bpf_verifier_state *cur = env->cur_state;
 	int i;
 
+	if (stack < -off)
+		/* update known max for given subprogram */
+		env->subprog_stack_depth[func->subprogno] = -off;
+
+	stack = env->frame_stack_depth[func->frameno];
 	if (stack >= -off)
 		return 0;
+	env->frame_stack_depth[func->frameno] = -off;
 
-	/* update known max for given subprogram */
-	env->subprog_stack_depth[func->subprogno] = -off;
-
-	/* compute the total for current call chain */
-	for (i = 0; i <= cur->curframe; i++) {
-		u32 depth = env->subprog_stack_depth[cur->frame[i]->subprogno];
+	/* some frame changed its max, recompute the total */
+	for (i = 0; i < MAX_CALL_FRAMES; i++) {
+		u32 depth = env->frame_stack_depth[i];
 
 		/* round up to 32-bytes, since this is granularity
 		 * of interpreter stack sizes
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 2/2] selftests/bpf: additional stack depth tests
From: Alexei Starovoitov @ 2017-12-22 21:33 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, Jann Horn, netdev, kernel-team
In-Reply-To: <20171222213328.993019-1-ast@kernel.org>

to test inner logic of stack depth tracking

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_verifier.c | 50 +++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 71fb0be81b78..e0b21c95c3bc 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -8764,6 +8764,56 @@ static struct bpf_test tests[] = {
 		.result = REJECT,
 	},
 	{
+		"calls: stack depth check using three frames. test1",
+		.insns = {
+			/* main */
+			BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 4), /* call A */
+			BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 5), /* call B */
+			BPF_ST_MEM(BPF_B, BPF_REG_10, -32, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+			/* A */
+			BPF_ST_MEM(BPF_B, BPF_REG_10, -256, 0),
+			BPF_EXIT_INSN(),
+			/* B */
+			BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, -3), /* call A */
+			BPF_ST_MEM(BPF_B, BPF_REG_10, -64, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_XDP,
+		/* though stack_main=32, stack_A=256, stack_B=64
+		 * and max(main+A, main+A+B) < 512 the program is rejected
+		 * since stack tracking is conservative
+		 * frame[0]=32, frame[1]=256, frame[2]=256
+		 */
+		.errstr = "combined stack size",
+		.result = REJECT,
+	},
+	{
+		"calls: stack depth check using three frames. test2",
+		.insns = {
+			/* main */
+			BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 4), /* call A */
+			BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 5), /* call B */
+			BPF_ST_MEM(BPF_B, BPF_REG_10, -32, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+			/* A */
+			BPF_ST_MEM(BPF_B, BPF_REG_10, -64, 0),
+			BPF_EXIT_INSN(),
+			/* B */
+			BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, -3), /* call A */
+			BPF_ST_MEM(BPF_B, BPF_REG_10, -256, 0),
+			BPF_EXIT_INSN(),
+		},
+		.prog_type = BPF_PROG_TYPE_XDP,
+		/* though stack_main=32, stack_A=64, stack_B=256
+		 * and max(main+A, main+A+B) < 512 the program is accepted
+		 * since frame[0]=32, frame[1]=256, frame[2]=64
+		 */
+		.result = ACCEPT,
+	},
+	{
 		"calls: spill into caller stack frame",
 		.insns = {
 			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 0/2] bpf: stack depth tracking fix
From: Alexei Starovoitov @ 2017-12-22 21:33 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, Jann Horn, netdev, kernel-team

Jann reported an issue with stack depth tracking.
Fix it and add tests.

Alexei Starovoitov (2):
  bpf: fix maximum stack depth tracking logic
  selftests/bpf: additional stack depth tests

 include/linux/bpf_verifier.h                | 17 ++++++++++
 kernel/bpf/verifier.c                       | 15 +++++----
 tools/testing/selftests/bpf/test_verifier.c | 50 +++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+), 6 deletions(-)

-- 
2.9.5

^ permalink raw reply

* Re: [Patch net-next] net_sched: call qdisc_reset() with qdisc lock
From: Cong Wang @ 2017-12-22 21:41 UTC (permalink / raw)
  To: John Fastabend; +Cc: Linux Kernel Network Developers, Jakub Kicinski
In-Reply-To: <bbbffbae-2c05-b07c-21f3-0371f02aa7f7@gmail.com>

On Thu, Dec 21, 2017 at 7:36 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> qdisc_reset() should always be called with qdisc spinlock
>> and with BH disabled, otherwise qdisc ->reset() could race
>> with TX BH.
>>
> hmm I don't see how this fixes the issue. pfifo_fast is no longer
> using the qdisc lock so that doesn't help.  And it is only a
> local_bh_disable.


First of all, this is to fix non-pfifo_fast which you seem totally forget.


>
>
>> Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
>> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>> Cc: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>>  net/sched/sch_generic.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 10aaa3b615ce..00ddb5f8f430 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
>>  {
>>       struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
>>
>> -     if (qdisc)
>> +     if (qdisc) {
>> +             spin_lock_bh(qdisc_lock(qdisc));
>>               qdisc_reset(qdisc);
>> +             spin_unlock_bh(qdisc_lock(qdisc));
>> +     }
>>  }
>>
>>  /**
>>
>
> OK first the cases to get to qdisc_reset that I've tracked
> down are,
>
>     dev_shutdown()
>       qdisc_destroy()
>
>     dev_deactivate_many()
>       dev_qdisc_reset() <- for each txq
>          qdisc_reset()
>
>     chained calls from qdisc_reset ops
>
> At the moment all the lockless qdiscs don't care about chained
> calls so we can ignore that, but would be nice to keep in mind.
>
> Next qdisc_reset() is doing a couple things calling the qdisc
> ops reset call but also walking gso_skb and skb_bad_txq. The
> 'unlink' operations there are not safe to be called while an
> enqueue/dequeue op is in-flight. Also pfifo_fast's reset op
> is not safe to be called with enqueue/dequeue ops in-flight.

This is why I sent two patches instead just this one.


>
> So I've made the assumption that qdisc_reset is _only_ ever
> called after a qdisc is no longer attached on the enqueue
> dev_xmit side and also any in-progress tx_action calls are
> completed. For what its worth this has always been the assumption
> AFAIK.
>
> So those are the assumptions what did I miss?


Speaking of this, qdisc_reset() is also called in dev_deactivate_queue()
even before synchronize_net()...

>
> The biggest gap I see is dev_deactivate_many() is supposed
> to wait for all tx_action calls to complete, this bit:

In some_qdisc_is_busy() you hold qdisc spinlock for non-lockless
ones, I don't understand why you don't want it in dev_qdisc_reset().

^ permalink raw reply

* [PATCH 2/3] qemu: use 64-bit values for feature flags in virtio-net
From: Jason Baron @ 2017-12-22 21:53 UTC (permalink / raw)
  To: netdev, virtualization, qemu-devel; +Cc: mst, jasowang
In-Reply-To: <cover.1513974243.git.jbaron@akamai.com>

In prepartion for using some of the high order feature bits, make sure that
virtio-net uses 64-bit values everywhere.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
---
 hw/net/virtio-net.c            | 54 +++++++++++++++++++++---------------------
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..adc20df 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
     (offsetof(container, field) + sizeof(((container *)0)->field))
 
 typedef struct VirtIOFeature {
-    uint32_t flags;
+    uint64_t flags;
     size_t end;
 } VirtIOFeature;
 
 static VirtIOFeature feature_sizes[] = {
-    {.flags = 1 << VIRTIO_NET_F_MAC,
+    {.flags = 1ULL << VIRTIO_NET_F_MAC,
      .end = endof(struct virtio_net_config, mac)},
-    {.flags = 1 << VIRTIO_NET_F_STATUS,
+    {.flags = 1ULL << VIRTIO_NET_F_STATUS,
      .end = endof(struct virtio_net_config, status)},
-    {.flags = 1 << VIRTIO_NET_F_MQ,
+    {.flags = 1ULL << VIRTIO_NET_F_MQ,
      .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-    {.flags = 1 << VIRTIO_NET_F_MTU,
+    {.flags = 1ULL << VIRTIO_NET_F_MTU,
      .end = endof(struct virtio_net_config, mtu)},
     {}
 };
@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     int i;
 
     if (n->net_conf.mtu) {
-        n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+        n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
     }
 
     virtio_net_set_config_size(n, n->host_features);
@@ -2109,45 +2109,45 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-    DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-    DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
+    DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_CSUM, true),
-    DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-    DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+    DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_TSO4, true),
-    DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_TSO6, true),
-    DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_ECN, true),
-    DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_UFO, true),
-    DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-    DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_TSO4, true),
-    DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_TSO6, true),
-    DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_ECN, true),
-    DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_UFO, true),
-    DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
                     VIRTIO_NET_F_MRG_RXBUF, true),
-    DEFINE_PROP_BIT("status", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("status", VirtIONet, host_features,
                     VIRTIO_NET_F_STATUS, true),
-    DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_VQ, true),
-    DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_RX, true),
-    DEFINE_PROP_BIT("ctrl_vlan", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_vlan", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_VLAN, true),
-    DEFINE_PROP_BIT("ctrl_rx_extra", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_rx_extra", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_RX_EXTRA, true),
-    DEFINE_PROP_BIT("ctrl_mac_addr", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_mac_addr", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_MAC_ADDR, true),
-    DEFINE_PROP_BIT("ctrl_guest_offloads", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
-    DEFINE_PROP_BIT("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
+    DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
     DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
     DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
                        TX_TIMER_INTERVAL),
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index b81b6a4..e7634c9 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -67,7 +67,7 @@ typedef struct VirtIONet {
     uint32_t has_vnet_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
-    uint32_t host_features;
+    uint64_t host_features;
     uint8_t has_ufo;
     uint32_t mergeable_rx_bufs;
     uint8_t promisc;
-- 
2.6.1

^ permalink raw reply related

* [PATCH 3/3] qemu: add linkspeed and duplex settings to virtio-net
From: Jason Baron @ 2017-12-22 21:53 UTC (permalink / raw)
  To: netdev, virtualization, qemu-devel; +Cc: mst, jasowang
In-Reply-To: <cover.1513974243.git.jbaron@akamai.com>

Although linkspeed and duplex can be set in a linux guest via 'ethtool -s',
this requires custom ethtool commands for virtio-net by default.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Linkspeed and duplex settings can be set as:
'-device virtio-net,speed=10000,duplex=full'

where speed is [-1...INT_MAX], and duplex is ["half"|"full"].

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
---
 hw/net/virtio-net.c                         | 29 +++++++++++++++++++++++++++++
 include/hw/virtio/virtio-net.h              |  3 +++
 include/standard-headers/linux/virtio_net.h |  4 ++++
 3 files changed, 36 insertions(+)
 create mode 160000 pixman

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index adc20df..7fafe6e 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -40,6 +40,12 @@
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+/* duplex and speed defines */
+#define DUPLEX_UNKNOWN          0xff
+#define DUPLEX_HALF             0x00
+#define DUPLEX_FULL             0x01
+#define SPEED_UNKNOWN           -1
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
      .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
     {.flags = 1ULL << VIRTIO_NET_F_MTU,
      .end = endof(struct virtio_net_config, mtu)},
+    {.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+     .end = endof(struct virtio_net_config, duplex)},
     {}
 };
 
@@ -88,6 +96,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
     virtio_stw_p(vdev, &netcfg.status, n->status);
     virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
     virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
+    virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+    netcfg.duplex = n->net_conf.duplex;
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, n->config_size);
 }
@@ -1941,6 +1951,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
         n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
     }
 
+    n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+    if (n->net_conf.duplex_str) {
+        if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+            n->net_conf.duplex = DUPLEX_HALF;
+        } else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+            n->net_conf.duplex = DUPLEX_FULL;
+        } else {
+            error_setg(errp, "'duplex' must be 'half' or 'full'");
+        }
+    } else {
+        n->net_conf.duplex = DUPLEX_UNKNOWN;
+    }
+    if (n->net_conf.speed < SPEED_UNKNOWN) {
+            error_setg(errp, "'speed' must be between -1 (SPEED_UNKOWN) and "
+                       "INT_MAX");
+    }
+
     virtio_net_set_config_size(n, n->host_features);
     virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2160,6 +2187,8 @@ static Property virtio_net_properties[] = {
     DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
     DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
                      true),
+    DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+    DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,9 @@ typedef struct virtio_net_conf
     uint16_t rx_queue_size;
     uint16_t tx_queue_size;
     uint16_t mtu;
+    int32_t speed;
+    char *duplex_str;
+    uint8_t duplex;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..0ff1447 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -36,6 +36,7 @@
 #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
 #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
 #define VIRTIO_NET_F_MTU	3	/* Initial MTU advice */
+#define VIRTIO_NET_F_SPEED_DUPLEX 4	/* Host set linkspeed and duplex */
 #define VIRTIO_NET_F_MAC	5	/* Host has given MAC address. */
 #define VIRTIO_NET_F_GUEST_TSO4	7	/* Guest can handle TSOv4 in. */
 #define VIRTIO_NET_F_GUEST_TSO6	8	/* Guest can handle TSOv6 in. */
@@ -76,6 +77,9 @@ struct virtio_net_config {
 	uint16_t max_virtqueue_pairs;
 	/* Default maximum transmit unit advice */
 	uint16_t mtu;
+	/* Host exported linkspeed and duplex */
+	uint32_t speed;
+	uint8_t duplex;
 } QEMU_PACKED;
 
 /*

^ permalink raw reply related

* [PATCH v2 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting
From: Jason Baron @ 2017-12-22 21:54 UTC (permalink / raw)
  To: netdev, virtualization, qemu-devel; +Cc: mst, jasowang
In-Reply-To: <1513979641-7999-1-git-send-email-jbaron@akamai.com>

We have found it useful to be able to set the linkspeed and duplex
settings from the host-side for virtio_net. This obviates the need
for guest changes and settings for these fields, and does not require
custom ethtool commands for virtio_net.
                                                                                                                          
The ability to set linkspeed and duplex is useful in various cases
as described here:                                                                                                        
                                                                                                                          
16032be virtio_net: add ethtool support for set and get of settings                                                       
                                                                                                                          
Using 'ethtool -s' continues to over-write the linkspeed/duplex                                                           
settings with this patch.                                                                                          
                                                                                                                          
The 1/3 patch is against net-next, while the 2-3/3 patch are the associated
qemu changes that would go in after as update-linux-headers.sh should
be run first. So the qemu patches are a demonstration of how I intend this
to work.
                                                                                                                          
Thanks,                                                                                                                   
                                                                                                                          
-Jason                                                                                                                    
                                                                                                                          
 linux changes:                                                                                                           

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c        | 17 ++++++++++++++++-
 include/uapi/linux/virtio_net.h |  5 +++++
 2 files changed, 21 insertions(+), 1 deletion(-)
                                                                                                                          
                                                                                                                          
 qemu changes:                                                                                                            

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c                         | 83 +++++++++++++++++++----------
 include/hw/virtio/virtio-net.h              |  5 +-
 include/standard-headers/linux/virtio_net.h |  4 ++
 3 files changed, 64 insertions(+), 28 deletions(-)

^ permalink raw reply

* [PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor
From: Jason Baron @ 2017-12-22 21:54 UTC (permalink / raw)
  To: netdev, virtualization, qemu-devel; +Cc: mst, jasowang
In-Reply-To: <cover.1513974243.git.jbaron@akamai.com>

The ability to set speed and duplex for virtio_net in useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c        | 17 ++++++++++++++++-
 include/uapi/linux/virtio_net.h |  5 +++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 511f833..4168d82 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2621,6 +2621,20 @@ static int virtnet_probe(struct virtio_device *vdev)
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
 	virtnet_init_settings(dev);
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
+		u32 speed;
+		u8 duplex;
+
+		speed = virtio_cread32(vdev, offsetof(struct virtio_net_config,
+				       speed));
+		if (ethtool_validate_speed(speed))
+			vi->speed = speed;
+		duplex = virtio_cread8(vdev,
+				       offsetof(struct virtio_net_config,
+				       duplex));
+		if (ethtool_validate_duplex(duplex))
+			vi->duplex = duplex;
+	}
 
 	err = register_netdev(dev);
 	if (err) {
@@ -2746,7 +2760,8 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
-	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+	VIRTIO_NET_F_SPEED_DUPLEX
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..0f1548e 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,8 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Host set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,9 @@ struct virtio_net_config {
 	__u16 max_virtqueue_pairs;
 	/* Default maximum transmit unit advice */
 	__u16 mtu;
+	/* Host exported linkspeed and duplex */
+	__u32 speed;
+	__u8 duplex;
 } __attribute__((packed));
 
 /*
-- 
2.6.1

^ permalink raw reply related

* RE: [Intel-wired-lan] [PATCH] e1000e: Fix e1000_check_for_copper_link_ich8lan return value.
From: Brown, Aaron F @ 2017-12-22 22:00 UTC (permalink / raw)
  To: Neftin, Sasha, Benjamin Poirier, Kirsher, Jeffrey T
  Cc: Ben Hutchings, Gabriel C, netdev@vger.kernel.org, Christian Hesse,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <441de43f-4234-7616-2879-7e2e31a8d522@intel.com>

> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Neftin, Sasha
> Sent: Wednesday, December 20, 2017 10:57 PM
> To: Benjamin Poirier <bpoirier@suse.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirsher@intel.com>
> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>; Gabriel C
> <nix.or.die@gmail.com>; netdev@vger.kernel.org; Christian Hesse
> <list@eworm.de>; stable@vger.kernel.org; linux-kernel@vger.kernel.org;
> intel-wired-lan@lists.osuosl.org
> Subject: Re: [Intel-wired-lan] [PATCH] e1000e: Fix
> e1000_check_for_copper_link_ich8lan return value.
> 
> On 11/12/2017 9:26, Benjamin Poirier wrote:
> > e1000e_check_for_copper_link() and
> e1000_check_for_copper_link_ich8lan()
> > are the two functions that may be assigned to mac.ops.check_for_link
> when
> > phy.media_type == e1000_media_type_copper. Commit 19110cfbb34d
> ("e1000e:
> > Separate signaling for link check/link up") changed the meaning of the
> > return value of check_for_link for copper media but only adjusted the first
> > function. This patch adjusts the second function likewise.
> >
> > Reported-by: Christian Hesse <list@eworm.de>
> > Reported-by: Gabriel C <nix.or.die@gmail.com>
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=198047
> > Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
> > Tested-by: Christian Hesse <list@eworm.de>
> > Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
> > ---
> >   drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 ++++++++---
> >   1 file changed, 8 insertions(+), 3 deletions(-)
> >

Tested-by: Aaron Brown <aaron.f.brown@intel.com>

> Acked by sasha.neftin@intel.com
> 
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply

* Re: thunderx sgmii interface hang
From: Tim Harvey @ 2017-12-22 22:19 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Sunil Goutham, netdev
In-Reply-To: <20171219205256.GB24156@lunn.ch>

On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:
>> On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>> >> The nic appears to work fine (pings, TCP etc) up until a performance
>> >> test is attempted.
>> >> When an iperf bandwidth test is attempted the nic ends up in a state
>> >> where truncated-ip packets are being sent out (per a tcpdump from
>> >> another board):
>> >
>> > Hi Tim
>> >
>> > Are pause frames supported? Have you tried turning them off?
>> >
>> > Can you reproduce the issue with UDP? Or is it TCP only?
>> >
>>
>> Andrew,
>>
>> Pause frames don't appear to be supported yet and the issue occurs
>> when using UDP as well as TCP. I'm not clear what the best way to
>> troubleshoot this is.
>
> Hi Tim
>
> Is pause being negotiated? In theory, it should not be. The PHY should
> not offer it, if the MAC has not enabled it. But some PHY drivers are
> probably broken and offer pause when they should not.
>
> Also, can you trigger the issue using UDP at say 75% the maximum
> bandwidth. That should be low enough that the peer never even tries to
> use pause.
>
> All this pause stuff is just a stab in the dark. Something else to try
> is to turn off various forms off acceleration, ethtook -K, and see if
> that makes a difference.
>

Andrew,

Currently I'm not using the DP83867_PHY driver (after verifying the
issue occurs with or without that driver).

It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
and the issue still occurs.

I have found that once the issue occurs I can recover to a working
state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
the issue and recover with that, I can never trigger the issue again.
If toggle that register bit upon power-up before the issue occurs it
will still occur.

The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
- when cleared all dedicated BGX context state for LMAC (state
machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
resources (data path, serdes lanes) is disabled
- when set LMAC operation is enabled (link bring-up, sync, and tx/rx
of idles and fault sequences)

I'm told that the particular Cavium reference board with an SGMII phy
doesn't show this issue (I don't have that specific board to do my own
testing or comparisons against our board) so I'm inclined to think it
has something to do with an interaction with the DP83867 PHY. I would
like to start poking at PHY registers to see if I can find anything
unusual. The best way to do that from userspace is via
SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
support ioctl's so I guess I'll have to add that support unless
there's a way to get at phy registers from userspace through a phy
driver?

Regards,

Tim

^ permalink raw reply

* Re: [PATCH RFC 09/18] r8168: use genphy_soft_reset instead of open coding the soft reset
From: Heiner Kallweit @ 2017-12-22 22:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Realtek linux nic maintainers, Chun-Hao Lin, David Miller,
	netdev@vger.kernel.org
In-Reply-To: <20171222095706.GF2431@lunn.ch>

Am 22.12.2017 um 10:57 schrieb Andrew Lunn:
> On Thu, Dec 21, 2017 at 09:50:28PM +0100, Heiner Kallweit wrote:
>> Use genphy_soft_reset instead of open coding the soft reset.
> 
> Hi Heiner
> 
> At this point, you have swapped over the phylib. Does one of the
> drivers in drivers/net/phy now take control of the PHY? Does the PHY
> ID match one of those in realtek.c?
> 
> The PHY driver and phylib should be responsible for resetting the
> PHY. The MAC driver should not need to do this.
> 
In my case the PHY ID matches the existing RTL8211E driver in
realtek.c and the MAC driver wouldn't have to do any soft reset.

However there may be chips with a PHY ID not yet being supported
by realtek.c. AFAICS in such a case phylib uses genphy_driver
which has a no-op soft reset.
Therefore I'd tend to say that we can remove the phy soft reset
from MAC driver only after being sure that all PHY ID's of
supported NIC's have a driver in realtek.c.

>      Andrew
> 
Heiner

By the way: Thanks for the review comments.

^ permalink raw reply

* Re: [PATCH v3 next-queue 08/10] ixgbe: process the Tx ipsec offload
From: Shannon Nelson @ 2017-12-22 22:33 UTC (permalink / raw)
  To: Yanjun Zhu, intel-wired-lan, jeffrey.t.kirsher
  Cc: steffen.klassert, sowmini.varadhan, netdev
In-Reply-To: <6dc35153-0615-2132-a8d7-ab6ab05f0ab5@oracle.com>

On 12/22/2017 12:24 AM, Yanjun Zhu wrote:
> On 2017/12/20 8:00, Shannon Nelson wrote:
>> If the skb has a security association referenced in the skb, then
>> set up the Tx descriptor with the ipsec offload bits.  While we're
>> here, we fix an oddly named field in the context descriptor struct.
>>

[...]

>> +int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
>> +           struct ixgbe_tx_buffer *first,
>> +           struct ixgbe_ipsec_tx_data *itd)
>> +{
>> +    struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
>> +    struct ixgbe_ipsec *ipsec = adapter->ipsec;
>> +    struct xfrm_state *xs;
>> +    struct tx_sa *tsa;
>> +
>> +    if (!first->skb->sp->len) {
> Hi, Nelson
> 
> The function ixgbe_ipsec_tx is called in tx fastpath. Can we add 
> unlikely as below:
> if (unlikely(!first->skb->sp->len)) ?
> 
> If I am wrong, please correct me.
> 
> Thanks a lot.
> Zhu Yanjun

Yes, we can probably throw those in.  I'll be working on this code in 
the new year to get the checksum and TSO bits working and can add these 
at that time.

sln

^ permalink raw reply

* Re: thunderx sgmii interface hang
From: Andrew Lunn @ 2017-12-22 22:45 UTC (permalink / raw)
  To: Tim Harvey; +Cc: Sunil Goutham, netdev
In-Reply-To: <CAJ+vNU3Gn2YXwMegzh+D2mTcAfwauzKoXhTWBzZyROWG+UxKQA@mail.gmail.com>

> Currently I'm not using the DP83867_PHY driver (after verifying the
> issue occurs with or without that driver).
> 
> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
> and the issue still occurs.

> I'm told that the particular Cavium reference board with an SGMII phy
> doesn't show this issue (I don't have that specific board to do my own
> testing or comparisons against our board) so I'm inclined to think it
> has something to do with an interaction with the DP83867 PHY. I would
> like to start poking at PHY registers to see if I can find anything
> unusual. The best way to do that from userspace is via
> SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
> support ioctl's so I guess I'll have to add that support unless
> there's a way to get at phy registers from userspace through a phy
> driver?

phy_mii_ioctl() does what you need, and is simple to use.

mii-tool will then give you access to the standard PHY registers.

	 Andre

^ permalink raw reply

* Re: thunderx sgmii interface hang
From: David Daney @ 2017-12-22 23:00 UTC (permalink / raw)
  To: Tim Harvey, Andrew Lunn; +Cc: Sunil Goutham, netdev
In-Reply-To: <CAJ+vNU3Gn2YXwMegzh+D2mTcAfwauzKoXhTWBzZyROWG+UxKQA@mail.gmail.com>

On 12/22/2017 02:19 PM, Tim Harvey wrote:
> On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:
>>> On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>>>>> The nic appears to work fine (pings, TCP etc) up until a performance
>>>>> test is attempted.
>>>>> When an iperf bandwidth test is attempted the nic ends up in a state
>>>>> where truncated-ip packets are being sent out (per a tcpdump from
>>>>> another board):
>>>>
>>>> Hi Tim
>>>>
>>>> Are pause frames supported? Have you tried turning them off?
>>>>
>>>> Can you reproduce the issue with UDP? Or is it TCP only?
>>>>
>>>
>>> Andrew,
>>>
>>> Pause frames don't appear to be supported yet and the issue occurs
>>> when using UDP as well as TCP. I'm not clear what the best way to
>>> troubleshoot this is.
>>
>> Hi Tim
>>
>> Is pause being negotiated? In theory, it should not be. The PHY should
>> not offer it, if the MAC has not enabled it. But some PHY drivers are
>> probably broken and offer pause when they should not.
>>
>> Also, can you trigger the issue using UDP at say 75% the maximum
>> bandwidth. That should be low enough that the peer never even tries to
>> use pause.
>>
>> All this pause stuff is just a stab in the dark. Something else to try
>> is to turn off various forms off acceleration, ethtook -K, and see if
>> that makes a difference.
>>
> 
> Andrew,
> 
> Currently I'm not using the DP83867_PHY driver (after verifying the
> issue occurs with or without that driver).
> 
> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
> and the issue still occurs.
> 
> I have found that once the issue occurs I can recover to a working
> state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
> the issue and recover with that, I can never trigger the issue again.
> If toggle that register bit upon power-up before the issue occurs it
> will still occur.
> 
> The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
> - when cleared all dedicated BGX context state for LMAC (state
> machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
> resources (data path, serdes lanes) is disabled
> - when set LMAC operation is enabled (link bring-up, sync, and tx/rx
> of idles and fault sequences)

You could try looking at
BGXX_GMP_PCS_INTX
BGXX_GMP_GMI_RXX_INT
BGXX_GMP_GMI_TXX_INT

Those are all W1C registers that should contain all zeros.  If they 
don't, just write back to them to clear before running a test.

If there are bits asserting in these when the thing gets wedged up, it 
might point to a possible cause.

You could also look at these RO registers:
BGXX_GMP_PCS_TXX_STATES
BGXX_GMP_PCS_RXX_STATES




> 
> I'm told that the particular Cavium reference board with an SGMII phy
> doesn't show this issue (I don't have that specific board to do my own
> testing or comparisons against our board) so I'm inclined to think it
> has something to do with an interaction with the DP83867 PHY. I would
> like to start poking at PHY registers to see if I can find anything
> unusual. The best way to do that from userspace is via
> SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
> support ioctl's so I guess I'll have to add that support unless
> there's a way to get at phy registers from userspace through a phy
> driver?
> 
> Regards,
> 
> Tim
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox