Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()
From: Stefan Hajnoczi @ 2017-09-22  9:07 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, virtualization, netdev, linux-kernel, kvm
In-Reply-To: <1506067355-5771-4-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:33PM +0800, Jason Wang wrote:
> This patch introduces a helper which just increase the used idx. This
> will be used in pair with vhost_prefetch_desc_indices() by batching
> code.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 33 +++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |  1 +
>  2 files changed, 34 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply

* Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index
From: Stefan Hajnoczi @ 2017-09-22  9:02 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1506067355-5771-3-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:32PM +0800, Jason Wang wrote:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f87ec75..8424166d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
>  
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> +				struct vring_used_elem *heads,
> +				u16 num, bool used_update)

Missing doc comment.

> +{
> +	int ret, ret2;
> +	u16 last_avail_idx, last_used_idx, total, copied;
> +	__virtio16 avail_idx;
> +	struct vring_used_elem __user *used;
> +	int i;

The following variable names are a little confusing:

last_avail_idx vs vq->last_avail_idx.  last_avail_idx is a wrapped
avail->ring[] index, vq->last_avail_idx is a free-running counter.  The
same for last_used_idx vs vq->last_used_idx.

num argument vs vq->num.  The argument could be called nheads instead to
make it clear that this is heads[] and not the virtqueue size.

Not a bug but it took me a while to figure out what was going on.

^ permalink raw reply

* Re: [net-next 1/2] dummy: add device MTU validation check
From: Sabrina Dubroca @ 2017-09-22  8:56 UTC (permalink / raw)
  To: Eric Dumazet, Jarod Wilson; +Cc: Zhang Shengju, davem, willemb, stephen, netdev
In-Reply-To: <1506006138.29839.132.camel@edumazet-glaptop3.roam.corp.google.com>

2017-09-21, 08:02:18 -0700, Eric Dumazet wrote:
> On Thu, 2017-09-21 at 21:32 +0800, Zhang Shengju wrote:
> > Currently, any mtu value can be assigned when adding a new dummy device:
> > [~]# ip link add name dummy1 mtu 100000 type dummy
> > [~]# ip link show dummy1
> > 15: dummy1: <BROADCAST,NOARP> mtu 100000 qdisc noop state DOWN mode DEFAULT group default qlen 1000
> >     link/ether 0a:61:6b:16:14:ce brd ff:ff:ff:ff:ff:ff
> > 
> > This patch adds device MTU validation check.
> 
> What is wrong with big MTU on dummy ?

It looks like the "centralize MTU checking" series broke that, but
only for changing the MTU on an existing dummy device. Commit
a52ad514fdf3 defined min_mtu/max_mtu in ether_setup, which dummy uses,
but there was no MTU check in dummy prior to that commit.


> If this is a generic rule, this check should belong in core network
> stack.
> 
> > 
> > Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> > ---
> >  drivers/net/dummy.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
> > index e31ab3b..0276b2b 100644
> > --- a/drivers/net/dummy.c
> > +++ b/drivers/net/dummy.c
> > @@ -365,6 +365,14 @@ static int dummy_validate(struct nlattr *tb[], struct nlattr *data[],
> >  		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> >  			return -EADDRNOTAVAIL;
> >  	}
> > +
> > +	if (tb[IFLA_MTU]) {
> > +		u32 mtu = nla_get_u32(tb[IFLA_MTU]);
> 
> You do not verify/validate nla_len(tb[IFLA_MTU]).

I think ifla_policy already performs that check:


static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[...]
	[IFLA_MTU]		= { .type = NLA_U32 },


static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
			struct netlink_ext_ack *extack)
{
[...]
	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy, extack);


-- 
Sabrina

^ permalink raw reply

* Re: [PATCH 2/2] ip_tunnel: add mpls over gre encapsulation
From: Amine Kherbouche @ 2017-09-22  8:39 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: netdev@vger.kernel.org, xeb, David Lamparter
In-Reply-To: <CAJieiUicOTaxeT4t-y6QzoPibi1i7WJ31X978OP6sqjMjVuVNw@mail.gmail.com>



On 09/22/2017 12:35 AM, Roopa Prabhu wrote:
>> > diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>> > index 36ea2ad..060ed07 100644
>> > --- a/net/mpls/af_mpls.c
>> > +++ b/net/mpls/af_mpls.c
>> > @@ -16,6 +16,7 @@
>> >  #include <net/arp.h>
>> >  #include <net/ip_fib.h>
>> >  #include <net/netevent.h>
>> > +#include <net/ip_tunnels.h>
>> >  #include <net/netns/generic.h>
>> >  #if IS_ENABLED(CONFIG_IPV6)
>> >  #include <net/ipv6.h>
>> > @@ -39,6 +40,40 @@ static int one = 1;
>> >  static int label_limit = (1 << 20) - 1;
>> >  static int ttl_max = 255;
>> >
>> > +size_t ipgre_mpls_encap_hlen(struct ip_tunnel_encap *e)
>> > +{
>> > +       return sizeof(struct mpls_shim_hdr);
>> > +}
>> > +
>> > +int ipgre_mpls_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
>> > +                           u8 *protocol, struct flowi4 *fl4)
>> > +{
>> > +       return 0;
>> > +}
>
> any reason you are supporting only rx ?

Tx path doesn't need changes, all gre hdr fields remain the same except 
for the protocol type which is loaded from skb->protocol this last is 
set by the mpls stack before entering gre device.

^ permalink raw reply

* Re: [patch net-next 07/12] mlxsw: spectrum: Add the multicast routing offloading logic
From: Yotam Gigi @ 2017-09-22  8:36 UTC (permalink / raw)
  To: Andrew Lunn, Jiri Pirko; +Cc: netdev, davem, idosch, mlxsw
In-Reply-To: <20170921152654.GF27589@lunn.ch>

On 09/21/2017 06:26 PM, Andrew Lunn wrote:
>> +static void mlxsw_sp_mr_route_stats_update(struct mlxsw_sp *mlxsw_sp,
>> +					   struct mlxsw_sp_mr_route *mr_route)
>> +{
>> +	struct mlxsw_sp_mr *mr = mlxsw_sp->mr;
>> +	u64 packets, bytes;
>> +
>> +	if (mr_route->route_action == MLXSW_SP_MR_ROUTE_ACTION_TRAP)
>> +		return;
>> +
>> +	mr->mr_ops->route_stats(mlxsw_sp, mr_route->route_priv, &packets,
>> +				&bytes);
>> +
>> +	switch (mr_route->mr_table->proto) {
>> +	case MLXSW_SP_L3_PROTO_IPV4:
>> +		mr_route->mfc4->mfc_un.res.pkt = packets;
>> +		mr_route->mfc4->mfc_un.res.bytes = bytes;
> What about wrong_if and lastuse? 

wronf_if is updated by ipmr as it is trapped to the CPU. We did not
address lastuse currently, though it can be easily addressed here.

>
> Is an mfc with iif on the host, not the switch, not offloaded?


I am not sure I followed. What do you mean MFC with iif on the host? you mean
MFC with iif that is an external NIC which is not part of the spectrum ASIC? in
this case, the route will not be offloaded and all traffic will pass in slowpath.


>
>    Andrew

^ permalink raw reply

* Re: [PATCH net-next RFC 1/5] vhost: split out ring head fetching logic
From: Stefan Hajnoczi @ 2017-09-22  8:31 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, virtualization, netdev, linux-kernel, kvm
In-Reply-To: <1506067355-5771-2-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:31PM +0800, Jason Wang wrote:
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which is
> + * never a valid descriptor number) if none was found.  A negative code is
> + * returned on error. */
> +int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
> +			struct iovec iov[], unsigned int iov_size,
> +			unsigned int *out_num, unsigned int *in_num,
> +			struct vhost_log *log, unsigned int *log_num,
> +			__virtio16 head)
[...]
> +int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> +		      struct iovec iov[], unsigned int iov_size,
> +		      unsigned int *out_num, unsigned int *in_num,
> +		      struct vhost_log *log, unsigned int *log_num)

Please document vhost_get_vq_desc().

Please also explain the difference between __vhost_get_vq_desc() and
vhost_get_vq_desc() in the documentation.

^ permalink raw reply

* [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing
From: Jason Wang @ 2017-09-22  8:02 UTC (permalink / raw)
  To: mst, jasowang, virtualization, netdev, linux-kernel; +Cc: kvm
In-Reply-To: <1506067355-5771-1-git-send-email-jasowang@redhat.com>

This patch implements basic batched processing of tx virtqueue by
prefetching desc indices and updating used ring in a batch. For
non-zerocopy case, vq->heads were used for storing the prefetched
indices and updating used ring. It is also a requirement for doing
more batching on top. For zerocopy case and for simplicity, batched
processing were simply disabled by only fetching and processing one
descriptor at a time, this could be optimized in the future.

XDP_DROP (without touching skb) on tun (with Moongen in guest) with
zercopy disabled:

Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
Before: 3.20Mpps
After:  3.90Mpps (+22%)

No differences were seen with zerocopy enabled.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   | 215 ++++++++++++++++++++++++++++----------------------
 drivers/vhost/vhost.c |   2 +-
 2 files changed, 121 insertions(+), 96 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c89640e..c439892 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -408,27 +408,25 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 	return vhost_poll_start(poll, sock->file);
 }
 
-static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
-				    struct vhost_virtqueue *vq,
-				    struct iovec iov[], unsigned int iov_size,
-				    unsigned int *out_num, unsigned int *in_num)
+static bool vhost_net_tx_avail(struct vhost_net *net,
+			       struct vhost_virtqueue *vq)
 {
 	unsigned long uninitialized_var(endtime);
-	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-				  out_num, in_num, NULL, NULL);
 
-	if (r == vq->num && vq->busyloop_timeout) {
-		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
-		while (vhost_can_busy_poll(vq->dev, endtime) &&
-		       vhost_vq_avail_empty(vq->dev, vq))
-			cpu_relax();
-		preempt_enable();
-		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-				      out_num, in_num, NULL, NULL);
-	}
+	if (!vq->busyloop_timeout)
+		return false;
 
-	return r;
+	if (!vhost_vq_avail_empty(vq->dev, vq))
+		return true;
+
+	preempt_disable();
+	endtime = busy_clock() + vq->busyloop_timeout;
+	while (vhost_can_busy_poll(vq->dev, endtime) &&
+		vhost_vq_avail_empty(vq->dev, vq))
+		cpu_relax();
+	preempt_enable();
+
+	return !vhost_vq_avail_empty(vq->dev, vq);
 }
 
 static bool vhost_exceeds_maxpend(struct vhost_net *net)
@@ -446,8 +444,9 @@ static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
+	struct vring_used_elem used, *heads = vq->heads;
 	unsigned out, in;
-	int head;
+	int avails, head;
 	struct msghdr msg = {
 		.msg_name = NULL,
 		.msg_namelen = 0,
@@ -461,6 +460,7 @@ static void handle_tx(struct vhost_net *net)
 	struct socket *sock;
 	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
 	bool zcopy, zcopy_used;
+	int i, batched = VHOST_NET_BATCH;
 
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
@@ -475,6 +475,12 @@ static void handle_tx(struct vhost_net *net)
 	hdr_size = nvq->vhost_hlen;
 	zcopy = nvq->ubufs;
 
+	/* Disable zerocopy batched fetching for simplicity */
+	if (zcopy) {
+		heads = &used;
+		batched = 1;
+	}
+
 	for (;;) {
 		/* Release DMAs done buffers first */
 		if (zcopy)
@@ -486,95 +492,114 @@ static void handle_tx(struct vhost_net *net)
 		if (unlikely(vhost_exceeds_maxpend(net)))
 			break;
 
-		head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
-						ARRAY_SIZE(vq->iov),
-						&out, &in);
+		avails = vhost_prefetch_desc_indices(vq, heads, batched, !zcopy);
 		/* On error, stop handling until the next kick. */
-		if (unlikely(head < 0))
+		if (unlikely(avails < 0))
 			break;
-		/* Nothing new?  Wait for eventfd to tell us they refilled. */
-		if (head == vq->num) {
+		/* Nothing new?  Busy poll for a while or wait for
+		 * eventfd to tell us they refilled. */
+		if (!avails) {
+			if (vhost_net_tx_avail(net, vq))
+				continue;
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				vhost_disable_notify(&net->dev, vq);
 				continue;
 			}
 			break;
 		}
-		if (in) {
-			vq_err(vq, "Unexpected descriptor format for TX: "
-			       "out %d, int %d\n", out, in);
-			break;
-		}
-		/* Skip header. TODO: support TSO. */
-		len = iov_length(vq->iov, out);
-		iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
-		iov_iter_advance(&msg.msg_iter, hdr_size);
-		/* Sanity check */
-		if (!msg_data_left(&msg)) {
-			vq_err(vq, "Unexpected header len for TX: "
-			       "%zd expected %zd\n",
-			       len, hdr_size);
-			break;
-		}
-		len = msg_data_left(&msg);
-
-		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
-				   && (nvq->upend_idx + 1) % UIO_MAXIOV !=
-				      nvq->done_idx
-				   && vhost_net_tx_select_zcopy(net);
-
-		/* use msg_control to pass vhost zerocopy ubuf info to skb */
-		if (zcopy_used) {
-			struct ubuf_info *ubuf;
-			ubuf = nvq->ubuf_info + nvq->upend_idx;
-
-			vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
-			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
-			ubuf->callback = vhost_zerocopy_callback;
-			ubuf->ctx = nvq->ubufs;
-			ubuf->desc = nvq->upend_idx;
-			refcount_set(&ubuf->refcnt, 1);
-			msg.msg_control = ubuf;
-			msg.msg_controllen = sizeof(ubuf);
-			ubufs = nvq->ubufs;
-			atomic_inc(&ubufs->refcount);
-			nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
-		} else {
-			msg.msg_control = NULL;
-			ubufs = NULL;
-		}
+		for (i = 0; i < avails; i++) {
+			head = __vhost_get_vq_desc(vq, vq->iov,
+						   ARRAY_SIZE(vq->iov),
+						   &out, &in, NULL, NULL,
+					       vhost16_to_cpu(vq, heads[i].id));
+			if (in) {
+				vq_err(vq, "Unexpected descriptor format for "
+					   "TX: out %d, int %d\n", out, in);
+				goto out;
+			}
 
-		total_len += len;
-		if (total_len < VHOST_NET_WEIGHT &&
-		    !vhost_vq_avail_empty(&net->dev, vq) &&
-		    likely(!vhost_exceeds_maxpend(net))) {
-			msg.msg_flags |= MSG_MORE;
-		} else {
-			msg.msg_flags &= ~MSG_MORE;
-		}
+			/* Skip header. TODO: support TSO. */
+			len = iov_length(vq->iov, out);
+			iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
+			iov_iter_advance(&msg.msg_iter, hdr_size);
+			/* Sanity check */
+			if (!msg_data_left(&msg)) {
+				vq_err(vq, "Unexpected header len for TX: "
+					"%zd expected %zd\n",
+					len, hdr_size);
+				goto out;
+			}
+			len = msg_data_left(&msg);
 
-		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(sock, &msg, len);
-		if (unlikely(err < 0)) {
+			zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
+				     && (nvq->upend_idx + 1) % UIO_MAXIOV !=
+					nvq->done_idx
+				     && vhost_net_tx_select_zcopy(net);
+
+			/* use msg_control to pass vhost zerocopy ubuf
+			 * info to skb
+			 */
 			if (zcopy_used) {
-				vhost_net_ubuf_put(ubufs);
-				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
-					% UIO_MAXIOV;
+				struct ubuf_info *ubuf;
+				ubuf = nvq->ubuf_info + nvq->upend_idx;
+
+				vq->heads[nvq->upend_idx].id =
+					cpu_to_vhost32(vq, head);
+				vq->heads[nvq->upend_idx].len =
+					VHOST_DMA_IN_PROGRESS;
+				ubuf->callback = vhost_zerocopy_callback;
+				ubuf->ctx = nvq->ubufs;
+				ubuf->desc = nvq->upend_idx;
+				refcount_set(&ubuf->refcnt, 1);
+				msg.msg_control = ubuf;
+				msg.msg_controllen = sizeof(ubuf);
+				ubufs = nvq->ubufs;
+				atomic_inc(&ubufs->refcount);
+				nvq->upend_idx =
+					(nvq->upend_idx + 1) % UIO_MAXIOV;
+			} else {
+				msg.msg_control = NULL;
+				ubufs = NULL;
+			}
+
+			total_len += len;
+			if (total_len < VHOST_NET_WEIGHT &&
+				!vhost_vq_avail_empty(&net->dev, vq) &&
+				likely(!vhost_exceeds_maxpend(net))) {
+				msg.msg_flags |= MSG_MORE;
+			} else {
+				msg.msg_flags &= ~MSG_MORE;
+			}
+
+			/* TODO: Check specific error and bomb out
+			 * unless ENOBUFS?
+			 */
+			err = sock->ops->sendmsg(sock, &msg, len);
+			if (unlikely(err < 0)) {
+				if (zcopy_used) {
+					vhost_net_ubuf_put(ubufs);
+					nvq->upend_idx =
+				   ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV;
+				}
+				vhost_discard_vq_desc(vq, 1);
+				goto out;
+			}
+			if (err != len)
+				pr_debug("Truncated TX packet: "
+					" len %d != %zd\n", err, len);
+			if (!zcopy) {
+				vhost_add_used_idx(vq, 1);
+				vhost_signal(&net->dev, vq);
+			} else if (!zcopy_used) {
+				vhost_add_used_and_signal(&net->dev,
+							  vq, head, 0);
+			} else
+				vhost_zerocopy_signal_used(net, vq);
+			vhost_net_tx_packet(net);
+			if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				goto out;
 			}
-			vhost_discard_vq_desc(vq, 1);
-			break;
-		}
-		if (err != len)
-			pr_debug("Truncated TX packet: "
-				 " len %d != %zd\n", err, len);
-		if (!zcopy_used)
-			vhost_add_used_and_signal(&net->dev, vq, head, 0);
-		else
-			vhost_zerocopy_signal_used(net, vq);
-		vhost_net_tx_packet(net);
-		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
-			vhost_poll_queue(&vq->poll);
-			break;
 		}
 	}
 out:
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6532cda..8764df5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -392,7 +392,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
 		vq->indirect = kmalloc(sizeof *vq->indirect * UIO_MAXIOV,
 				       GFP_KERNEL);
 		vq->log = kmalloc(sizeof *vq->log * UIO_MAXIOV, GFP_KERNEL);
-		vq->heads = kmalloc(sizeof *vq->heads * UIO_MAXIOV, GFP_KERNEL);
+		vq->heads = kzalloc(sizeof *vq->heads * UIO_MAXIOV, GFP_KERNEL);
 		if (!vq->indirect || !vq->log || !vq->heads)
 			goto err_nomem;
 	}
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next RFC 4/5] vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
From: Jason Wang @ 2017-09-22  8:02 UTC (permalink / raw)
  To: mst, jasowang, virtualization, netdev, linux-kernel; +Cc: kvm
In-Reply-To: <1506067355-5771-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 58585ec..c89640e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -87,7 +87,7 @@ struct vhost_net_ubuf_ref {
 	struct vhost_virtqueue *vq;
 };
 
-#define VHOST_RX_BATCH 64
+#define VHOST_NET_BATCH 64
 struct vhost_net_buf {
 	struct sk_buff **queue;
 	int tail;
@@ -159,7 +159,7 @@ static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
 
 	rxq->head = 0;
 	rxq->tail = skb_array_consume_batched(nvq->rx_array, rxq->queue,
-					      VHOST_RX_BATCH);
+					      VHOST_NET_BATCH);
 	return rxq->tail;
 }
 
@@ -912,7 +912,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		return -ENOMEM;
 	}
 
-	queue = kmalloc_array(VHOST_RX_BATCH, sizeof(struct sk_buff *),
+	queue = kmalloc_array(VHOST_NET_BATCH, sizeof(struct sk_buff *),
 			      GFP_KERNEL);
 	if (!queue) {
 		kfree(vqs);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()
From: Jason Wang @ 2017-09-22  8:02 UTC (permalink / raw)
  To: mst, jasowang, virtualization, netdev, linux-kernel; +Cc: kvm
In-Reply-To: <1506067355-5771-1-git-send-email-jasowang@redhat.com>

This patch introduces a helper which just increase the used idx. This
will be used in pair with vhost_prefetch_desc_indices() by batching
code.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 33 +++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h |  1 +
 2 files changed, 34 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8424166d..6532cda 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2178,6 +2178,39 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 }
 EXPORT_SYMBOL_GPL(vhost_add_used);
 
+int vhost_add_used_idx(struct vhost_virtqueue *vq, int n)
+{
+	u16 old, new;
+
+	old = vq->last_used_idx;
+	new = (vq->last_used_idx += n);
+	/* If the driver never bothers to signal in a very long while,
+	 * used index might wrap around. If that happens, invalidate
+	 * signalled_used index we stored. TODO: make sure driver
+	 * signals at least once in 2^16 and remove this.
+	 */
+	if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
+		vq->signalled_used_valid = false;
+
+	/* Make sure buffer is written before we update index. */
+	smp_wmb();
+	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+			   &vq->used->idx)) {
+		vq_err(vq, "Failed to increment used idx");
+		return -EFAULT;
+	}
+	if (unlikely(vq->log_used)) {
+		/* Log used index update. */
+		log_write(vq->log_base,
+			  vq->log_addr + offsetof(struct vring_used, idx),
+			  sizeof(vq->used->idx));
+		if (vq->log_ctx)
+			eventfd_signal(vq->log_ctx, 1);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_add_used_idx);
+
 static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			    struct vring_used_elem *heads,
 			    unsigned count)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 16c2cb6..5dd6c05 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_vq_init_access(struct vhost_virtqueue *);
+int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
 int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
 		     unsigned count);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index
From: Jason Wang @ 2017-09-22  8:02 UTC (permalink / raw)
  To: mst, jasowang, virtualization, netdev, linux-kernel; +Cc: kvm
In-Reply-To: <1506067355-5771-1-git-send-email-jasowang@redhat.com>

This patch introduces vhost_prefetch_desc_indices() which could batch
descriptor indices fetching and used ring updating. This intends to
reduce the cache misses of indices fetching and updating and reduce
cache line bounce when virtqueue is almost full. copy_to_user() was
used in order to benefit from modern cpus that support fast string
copy. Batched virtqueue processing will be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h |  3 +++
 2 files changed, 58 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f87ec75..8424166d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
 
+int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
+				struct vring_used_elem *heads,
+				u16 num, bool used_update)
+{
+	int ret, ret2;
+	u16 last_avail_idx, last_used_idx, total, copied;
+	__virtio16 avail_idx;
+	struct vring_used_elem __user *used;
+	int i;
+
+	if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
+		vq_err(vq, "Failed to access avail idx at %p\n",
+		       &vq->avail->idx);
+		return -EFAULT;
+	}
+	last_avail_idx = vq->last_avail_idx & (vq->num - 1);
+	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+	total = vq->avail_idx - vq->last_avail_idx;
+	ret = total = min(total, num);
+
+	for (i = 0; i < ret; i++) {
+		ret2 = vhost_get_avail(vq, heads[i].id,
+				      &vq->avail->ring[last_avail_idx]);
+		if (unlikely(ret2)) {
+			vq_err(vq, "Failed to get descriptors\n");
+			return -EFAULT;
+		}
+		last_avail_idx = (last_avail_idx + 1) & (vq->num - 1);
+	}
+
+	if (!used_update)
+		return ret;
+
+	last_used_idx = vq->last_used_idx & (vq->num - 1);
+	while (total) {
+		copied = min((u16)(vq->num - last_used_idx), total);
+		ret2 = vhost_copy_to_user(vq,
+					  &vq->used->ring[last_used_idx],
+					  &heads[ret - total],
+					  copied * sizeof(*used));
+
+		if (unlikely(ret2)) {
+			vq_err(vq, "Failed to update used ring!\n");
+			return -EFAULT;
+		}
+
+		last_used_idx = 0;
+		total -= copied;
+	}
+
+	/* Only get avail ring entries after they have been exposed by guest. */
+	smp_rmb();
+	return ret;
+}
+EXPORT_SYMBOL(vhost_prefetch_desc_indices);
 
 static int __init vhost_init(void)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 39ff897..16c2cb6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -228,6 +228,9 @@ ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 			     struct iov_iter *from);
 int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
+int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
+				struct vring_used_elem *heads,
+				u16 num, bool used_update);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next RFC 1/5] vhost: split out ring head fetching logic
From: Jason Wang @ 2017-09-22  8:02 UTC (permalink / raw)
  To: mst, jasowang, virtualization, netdev, linux-kernel; +Cc: kvm
In-Reply-To: <1506067355-5771-1-git-send-email-jasowang@redhat.com>

This patch splits out ring head fetching logic and leave the
descriptor fetching and translation logic. This makes it is possible
to batch fetching the descriptor indices.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 75 +++++++++++++++++++++++++++++++++------------------
 drivers/vhost/vhost.h |  5 ++++
 2 files changed, 54 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d6dbb28..f87ec75 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1984,41 +1984,23 @@ static int get_indirect(struct vhost_virtqueue *vq,
 	return 0;
 }
 
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access.  Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found.  A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
-		      struct iovec iov[], unsigned int iov_size,
-		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num)
+static unsigned int vhost_get_vq_head(struct vhost_virtqueue *vq, int *err)
 {
-	struct vring_desc desc;
-	unsigned int i, head, found = 0;
-	u16 last_avail_idx;
-	__virtio16 avail_idx;
-	__virtio16 ring_head;
-	int ret, access;
-
-	/* Check it isn't doing very strange things with descriptor numbers. */
-	last_avail_idx = vq->last_avail_idx;
+	u16 last_avail_idx = vq->last_avail_idx;
+	__virtio16 avail_idx, ring_head;
 
 	if (vq->avail_idx == vq->last_avail_idx) {
 		if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
 			vq_err(vq, "Failed to access avail idx at %p\n",
 				&vq->avail->idx);
-			return -EFAULT;
+			goto err;
 		}
 		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
 		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
 			vq_err(vq, "Guest moved used index from %u to %u",
 				last_avail_idx, vq->avail_idx);
-			return -EFAULT;
+			goto err;
 		}
 
 		/* If there's nothing new since last we looked, return
@@ -2040,13 +2022,35 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
 		       &vq->avail->ring[last_avail_idx % vq->num]);
-		return -EFAULT;
+		goto err;
 	}
 
-	head = vhost16_to_cpu(vq, ring_head);
+	return vhost16_to_cpu(vq, ring_head);
+err:
+	*err = -EFAULT;
+	return 0;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found.  A negative code is
+ * returned on error. */
+int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
+			struct iovec iov[], unsigned int iov_size,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num,
+			__virtio16 head)
+{
+	struct vring_desc desc;
+	unsigned int i, found = 0;
+	int ret = 0, access;
 
 	/* If their number is silly, that's an error. */
-	if (unlikely(head >= vq->num)) {
+	if (unlikely(head > vq->num)) {
 		vq_err(vq, "Guest says index %u > %u is available",
 		       head, vq->num);
 		return -EINVAL;
@@ -2133,6 +2137,25 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
 	return head;
 }
+EXPORT_SYMBOL_GPL(__vhost_get_vq_desc);
+
+int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+		      struct iovec iov[], unsigned int iov_size,
+		      unsigned int *out_num, unsigned int *in_num,
+		      struct vhost_log *log, unsigned int *log_num)
+{
+	int ret = 0;
+	unsigned int head = vhost_get_vq_head(vq, &ret);
+
+	if (ret)
+		return ret;
+
+	if (head == vq->num)
+		return head;
+
+	return __vhost_get_vq_desc(vq, iov, iov_size, out_num, in_num,
+				   log, log_num, head);
+}
 EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
 /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d59a9cc..39ff897 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -191,6 +191,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num);
+int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
+			struct iovec iov[], unsigned int iov_count,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num,
+			__virtio16 ring_head);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_vq_init_access(struct vhost_virtqueue *);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next RFC 0/5] batched tx processing in vhost_net
From: Jason Wang @ 2017-09-22  8:02 UTC (permalink / raw)
  To: mst, jasowang, virtualization, netdev, linux-kernel; +Cc: kvm

Hi:

This series tries to implement basic tx batched processing. This is
done by prefetching descriptor indices and update used ring in a
batch. This intends to speed up used ring updating and improve the
cache utilization. Test shows about ~22% improvement in tx pss.

Please review.

Jason Wang (5):
  vhost: split out ring head fetching logic
  vhost: introduce helper to prefetch desc index
  vhost: introduce vhost_add_used_idx()
  vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
  vhost_net: basic tx virtqueue batched processing

 drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
 drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
 drivers/vhost/vhost.h |   9 ++
 3 files changed, 270 insertions(+), 125 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH net-next] net: mvpp2: phylink support
From: Marcin Wojtas @ 2017-09-22  7:56 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: David S. Miller, Russell King - ARM Linux, Andrew Lunn,
	Gregory Clément, Thomas Petazzoni, Miquèl Raynal,
	nadavh, linux-kernel, Stefan Chulski, netdev
In-Reply-To: <20170921134522.10993-1-antoine.tenart@free-electrons.com>

Hi Antoine,

You can add
Tested-by: Marcin Wojtas <mw@semihalf.com>

Best regards,
Marcin

2017-09-21 15:45 GMT+02:00 Antoine Tenart <antoine.tenart@free-electrons.com>:
> Convert the PPv2 driver to use phylink, which models the MAC to PHY
> link. The phylink support is made such a way the GoP link IRQ can still
> be used: the two modes are incompatible and the GoP link IRQ will be
> used if no PHY is described in the device tree. This is the same
> behaviour as before.
>
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/Kconfig |   1 +
>  drivers/net/ethernet/marvell/mvpp2.c | 502 +++++++++++++++++++++--------------
>  2 files changed, 303 insertions(+), 200 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
> index da6fb825afea..991138b8dfba 100644
> --- a/drivers/net/ethernet/marvell/Kconfig
> +++ b/drivers/net/ethernet/marvell/Kconfig
> @@ -86,6 +86,7 @@ config MVPP2
>         depends on ARCH_MVEBU || COMPILE_TEST
>         depends on HAS_DMA
>         select MVMDIO
> +       select PHYLINK
>         ---help---
>           This driver supports the network interface units in the
>           Marvell ARMADA 375, 7K and 8K SoCs.
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index 8041d692db3c..5fb7e76ee128 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -28,6 +28,7 @@
>  #include <linux/of_address.h>
>  #include <linux/of_device.h>
>  #include <linux/phy.h>
> +#include <linux/phylink.h>
>  #include <linux/phy/phy.h>
>  #include <linux/clk.h>
>  #include <linux/hrtimer.h>
> @@ -341,6 +342,7 @@
>  #define     MVPP2_GMAC_FORCE_LINK_PASS         BIT(1)
>  #define     MVPP2_GMAC_IN_BAND_AUTONEG         BIT(2)
>  #define     MVPP2_GMAC_IN_BAND_AUTONEG_BYPASS  BIT(3)
> +#define     MVPP2_GMAC_IN_BAND_RESTART_AN      BIT(4)
>  #define     MVPP2_GMAC_CONFIG_MII_SPEED        BIT(5)
>  #define     MVPP2_GMAC_CONFIG_GMII_SPEED       BIT(6)
>  #define     MVPP2_GMAC_AN_SPEED_EN             BIT(7)
> @@ -350,6 +352,12 @@
>  #define     MVPP2_GMAC_AN_DUPLEX_EN            BIT(13)
>  #define MVPP2_GMAC_STATUS0                     0x10
>  #define     MVPP2_GMAC_STATUS0_LINK_UP         BIT(0)
> +#define            MVPP2_GMAC_STATUS0_GMII_SPEED       BIT(1)
> +#define            MVPP2_GMAC_STATUS0_MII_SPEED        BIT(2)
> +#define            MVPP2_GMAC_STATUS0_FULL_DUPLEX      BIT(3)
> +#define     MVPP2_GMAC_STATUS0_RX_PAUSE                BIT(6)
> +#define     MVPP2_GMAC_STATUS0_TX_PAUSE                BIT(7)
> +#define     MVPP2_GMAC_STATUS0_AN_COMPLETE     BIT(11)
>  #define MVPP2_GMAC_PORT_FIFO_CFG_1_REG         0x1c
>  #define     MVPP2_GMAC_TX_FIFO_MIN_TH_OFFS     6
>  #define     MVPP2_GMAC_TX_FIFO_MIN_TH_ALL_MASK 0x1fc0
> @@ -878,12 +886,11 @@ struct mvpp2_port {
>         u16 rx_ring_size;
>         struct mvpp2_pcpu_stats __percpu *stats;
>
> +       struct device_node *of_node;
> +
>         phy_interface_t phy_interface;
> -       struct device_node *phy_node;
> +       struct phylink *phylink;
>         struct phy *comphy;
> -       unsigned int link;
> -       unsigned int duplex;
> -       unsigned int speed;
>
>         struct mvpp2_bm_pool *pool_long;
>         struct mvpp2_bm_pool *pool_short;
> @@ -4716,13 +4723,14 @@ static void mvpp2_port_periodic_xon_disable(struct mvpp2_port *port)
>  }
>
>  /* Configure loopback port */
> -static void mvpp2_port_loopback_set(struct mvpp2_port *port)
> +static void mvpp2_port_loopback_set(struct mvpp2_port *port,
> +                                   const struct phylink_link_state *state)
>  {
>         u32 val;
>
>         val = readl(port->base + MVPP2_GMAC_CTRL_1_REG);
>
> -       if (port->speed == 1000)
> +       if (state->speed == 1000)
>                 val |= MVPP2_GMAC_GMII_LB_EN_MASK;
>         else
>                 val &= ~MVPP2_GMAC_GMII_LB_EN_MASK;
> @@ -4778,10 +4786,6 @@ static void mvpp2_defaults_set(struct mvpp2_port *port)
>         int tx_port_num, val, queue, ptxq, lrxq;
>
>         if (port->priv->hw_version == MVPP21) {
> -               /* Configure port to loopback if needed */
> -               if (port->flags & MVPP2_F_LOOPBACK)
> -                       mvpp2_port_loopback_set(port);
> -
>                 /* Update TX FIFO MIN Threshold */
>                 val = readl(port->base + MVPP2_GMAC_PORT_FIFO_CFG_1_REG);
>                 val &= ~MVPP2_GMAC_TX_FIFO_MIN_TH_ALL_MASK;
> @@ -5860,111 +5864,6 @@ static irqreturn_t mvpp2_link_status_isr(int irq, void *dev_id)
>         return IRQ_HANDLED;
>  }
>
> -static void mvpp2_gmac_set_autoneg(struct mvpp2_port *port,
> -                                  struct phy_device *phydev)
> -{
> -       u32 val;
> -
> -       if (port->phy_interface != PHY_INTERFACE_MODE_RGMII &&
> -           port->phy_interface != PHY_INTERFACE_MODE_RGMII_ID &&
> -           port->phy_interface != PHY_INTERFACE_MODE_RGMII_RXID &&
> -           port->phy_interface != PHY_INTERFACE_MODE_RGMII_TXID &&
> -           port->phy_interface != PHY_INTERFACE_MODE_SGMII)
> -               return;
> -
> -       val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> -       val &= ~(MVPP2_GMAC_CONFIG_MII_SPEED |
> -                MVPP2_GMAC_CONFIG_GMII_SPEED |
> -                MVPP2_GMAC_CONFIG_FULL_DUPLEX |
> -                MVPP2_GMAC_AN_SPEED_EN |
> -                MVPP2_GMAC_AN_DUPLEX_EN);
> -
> -       if (phydev->duplex)
> -               val |= MVPP2_GMAC_CONFIG_FULL_DUPLEX;
> -
> -       if (phydev->speed == SPEED_1000)
> -               val |= MVPP2_GMAC_CONFIG_GMII_SPEED;
> -       else if (phydev->speed == SPEED_100)
> -               val |= MVPP2_GMAC_CONFIG_MII_SPEED;
> -
> -       writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> -}
> -
> -/* Adjust link */
> -static void mvpp2_link_event(struct net_device *dev)
> -{
> -       struct mvpp2_port *port = netdev_priv(dev);
> -       struct phy_device *phydev = dev->phydev;
> -       bool link_reconfigured = false;
> -       u32 val;
> -
> -       if (phydev->link) {
> -               if (port->phy_interface != phydev->interface && port->comphy) {
> -                       /* disable current port for reconfiguration */
> -                       mvpp2_interrupts_disable(port);
> -                       netif_carrier_off(port->dev);
> -                       mvpp2_port_disable(port);
> -                       phy_power_off(port->comphy);
> -
> -                       /* comphy reconfiguration */
> -                       port->phy_interface = phydev->interface;
> -                       mvpp22_comphy_init(port);
> -
> -                       /* gop/mac reconfiguration */
> -                       mvpp22_gop_init(port);
> -                       mvpp2_port_mii_set(port);
> -
> -                       link_reconfigured = true;
> -               }
> -
> -               if ((port->speed != phydev->speed) ||
> -                   (port->duplex != phydev->duplex)) {
> -                       mvpp2_gmac_set_autoneg(port, phydev);
> -
> -                       port->duplex = phydev->duplex;
> -                       port->speed  = phydev->speed;
> -               }
> -       }
> -
> -       if (phydev->link != port->link || link_reconfigured) {
> -               port->link = phydev->link;
> -
> -               if (phydev->link) {
> -                       if (port->phy_interface == PHY_INTERFACE_MODE_RGMII ||
> -                           port->phy_interface == PHY_INTERFACE_MODE_RGMII_ID ||
> -                           port->phy_interface == PHY_INTERFACE_MODE_RGMII_RXID ||
> -                           port->phy_interface == PHY_INTERFACE_MODE_RGMII_TXID ||
> -                           port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
> -                               val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> -                               val |= (MVPP2_GMAC_FORCE_LINK_PASS |
> -                                       MVPP2_GMAC_FORCE_LINK_DOWN);
> -                               writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> -                       }
> -
> -                       mvpp2_interrupts_enable(port);
> -                       mvpp2_port_enable(port);
> -
> -                       mvpp2_egress_enable(port);
> -                       mvpp2_ingress_enable(port);
> -                       netif_carrier_on(dev);
> -                       netif_tx_wake_all_queues(dev);
> -               } else {
> -                       port->duplex = -1;
> -                       port->speed = 0;
> -
> -                       netif_tx_stop_all_queues(dev);
> -                       netif_carrier_off(dev);
> -                       mvpp2_ingress_disable(port);
> -                       mvpp2_egress_disable(port);
> -
> -                       mvpp2_port_disable(port);
> -                       mvpp2_interrupts_disable(port);
> -               }
> -
> -               phy_print_status(phydev);
> -       }
> -}
> -
>  static void mvpp2_timer_set(struct mvpp2_port_pcpu *port_pcpu)
>  {
>         ktime_t interval;
> @@ -6582,7 +6481,6 @@ static int mvpp2_poll(struct napi_struct *napi, int budget)
>  /* Set hw internals when starting port */
>  static void mvpp2_start_dev(struct mvpp2_port *port)
>  {
> -       struct net_device *ndev = port->dev;
>         int i;
>
>         if (port->gop_id == 0 &&
> @@ -6607,15 +6505,14 @@ static void mvpp2_start_dev(struct mvpp2_port *port)
>
>         mvpp2_port_mii_set(port);
>         mvpp2_port_enable(port);
> -       if (ndev->phydev)
> -               phy_start(ndev->phydev);
> +       if (port->phylink)
> +               phylink_start(port->phylink);
>         netif_tx_start_all_queues(port->dev);
>  }
>
>  /* Set hw internals when stopping port */
>  static void mvpp2_stop_dev(struct mvpp2_port *port)
>  {
> -       struct net_device *ndev = port->dev;
>         int i;
>
>         /* Stop new packets from arriving to RXQs */
> @@ -6634,8 +6531,8 @@ static void mvpp2_stop_dev(struct mvpp2_port *port)
>
>         mvpp2_egress_disable(port);
>         mvpp2_port_disable(port);
> -       if (ndev->phydev)
> -               phy_stop(ndev->phydev);
> +       if (port->phylink)
> +               phylink_stop(port->phylink);
>         phy_power_off(port->comphy);
>  }
>
> @@ -6688,40 +6585,6 @@ static void mvpp21_get_mac_address(struct mvpp2_port *port, unsigned char *addr)
>         addr[5] = (mac_addr_l >> MVPP2_GMAC_SA_LOW_OFFS) & 0xFF;
>  }
>
> -static int mvpp2_phy_connect(struct mvpp2_port *port)
> -{
> -       struct phy_device *phy_dev;
> -
> -       /* No PHY is attached */
> -       if (!port->phy_node)
> -               return 0;
> -
> -       phy_dev = of_phy_connect(port->dev, port->phy_node, mvpp2_link_event, 0,
> -                                port->phy_interface);
> -       if (!phy_dev) {
> -               netdev_err(port->dev, "cannot connect to phy\n");
> -               return -ENODEV;
> -       }
> -       phy_dev->supported &= PHY_GBIT_FEATURES;
> -       phy_dev->advertising = phy_dev->supported;
> -
> -       port->link    = 0;
> -       port->duplex  = 0;
> -       port->speed   = 0;
> -
> -       return 0;
> -}
> -
> -static void mvpp2_phy_disconnect(struct mvpp2_port *port)
> -{
> -       struct net_device *ndev = port->dev;
> -
> -       if (!ndev->phydev)
> -               return;
> -
> -       phy_disconnect(ndev->phydev);
> -}
> -
>  static int mvpp2_irqs_init(struct mvpp2_port *port)
>  {
>         int err, i;
> @@ -6765,7 +6628,6 @@ static void mvpp2_irqs_deinit(struct mvpp2_port *port)
>  static int mvpp2_open(struct net_device *dev)
>  {
>         struct mvpp2_port *port = netdev_priv(dev);
> -       struct mvpp2 *priv = port->priv;
>         unsigned char mac_bcast[ETH_ALEN] = {
>                         0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
>         int err;
> @@ -6811,7 +6673,16 @@ static int mvpp2_open(struct net_device *dev)
>                 goto err_cleanup_txqs;
>         }
>
> -       if (priv->hw_version == MVPP22 && !port->phy_node && port->link_irq) {
> +       /* In default link is down */
> +       netif_carrier_off(port->dev);
> +
> +       if (port->phylink) {
> +               err = phylink_of_phy_connect(port->phylink, port->of_node);
> +               if (err) {
> +                       netdev_err(port->dev, "could not attach PHY (%d)\n", err);
> +                       goto err_free_irq;
> +               }
> +       } else if (port->link_irq) {
>                 err = request_irq(port->link_irq, mvpp2_link_status_isr, 0,
>                                   dev->name, port);
>                 if (err) {
> @@ -6821,15 +6692,11 @@ static int mvpp2_open(struct net_device *dev)
>                 }
>
>                 mvpp22_gop_setup_irq(port);
> +       } else {
> +               netdev_err(dev, "cannot use phylink or GoP link IRQ\n");
> +               goto err_free_irq;
>         }
>
> -       /* In default link is down */
> -       netif_carrier_off(port->dev);
> -
> -       err = mvpp2_phy_connect(port);
> -       if (err < 0)
> -               goto err_free_link_irq;
> -
>         /* Unmask interrupts on all CPUs */
>         on_each_cpu(mvpp2_interrupts_unmask, port, 1);
>         mvpp2_shared_interrupt_mask_unmask(port, false);
> @@ -6838,9 +6705,6 @@ static int mvpp2_open(struct net_device *dev)
>
>         return 0;
>
> -err_free_link_irq:
> -       if (priv->hw_version == MVPP22 && !port->phy_node && port->link_irq)
> -               free_irq(port->link_irq, port);
>  err_free_irq:
>         mvpp2_irqs_deinit(port);
>  err_cleanup_txqs:
> @@ -6854,17 +6718,17 @@ static int mvpp2_stop(struct net_device *dev)
>  {
>         struct mvpp2_port *port = netdev_priv(dev);
>         struct mvpp2_port_pcpu *port_pcpu;
> -       struct mvpp2 *priv = port->priv;
>         int cpu;
>
>         mvpp2_stop_dev(port);
> -       mvpp2_phy_disconnect(port);
> +       if (port->phylink)
> +               phylink_disconnect_phy(port->phylink);
>
>         /* Mask interrupts on all CPUs */
>         on_each_cpu(mvpp2_interrupts_mask, port, 1);
>         mvpp2_shared_interrupt_mask_unmask(port, true);
>
> -       if (priv->hw_version == MVPP22 && !port->phy_node && port->link_irq)
> +       if (port->link_irq)
>                 free_irq(port->link_irq, port);
>
>         mvpp2_irqs_deinit(port);
> @@ -7029,20 +6893,26 @@ mvpp2_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
>
>  static int mvpp2_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>  {
> -       int ret;
> +       struct mvpp2_port *port = netdev_priv(dev);
>
> -       if (!dev->phydev)
> +       if (!port->phylink)
>                 return -ENOTSUPP;
>
> -       ret = phy_mii_ioctl(dev->phydev, ifr, cmd);
> -       if (!ret)
> -               mvpp2_link_event(dev);
> -
> -       return ret;
> +       return phylink_mii_ioctl(port->phylink, ifr, cmd);
>  }
>
>  /* Ethtool methods */
>
> +static int mvpp2_ethtool_nway_reset(struct net_device *dev)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +
> +       if (!port->phylink)
> +               return -ENOTSUPP;
> +
> +       return phylink_ethtool_nway_reset(port->phylink);
> +}
> +
>  /* Set interrupt coalescing for ethtools */
>  static int mvpp2_ethtool_set_coalesce(struct net_device *dev,
>                                       struct ethtool_coalesce *c)
> @@ -7170,6 +7040,50 @@ static int mvpp2_ethtool_set_ringparam(struct net_device *dev,
>         return err;
>  }
>
> +static void mvpp2_ethtool_get_pause_param(struct net_device *dev,
> +                                         struct ethtool_pauseparam *pause)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +
> +       if (!port->phylink)
> +               return;
> +
> +       phylink_ethtool_get_pauseparam(port->phylink, pause);
> +}
> +
> +static int mvpp2_ethtool_set_pause_param(struct net_device *dev,
> +                                        struct ethtool_pauseparam *pause)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +
> +       if (!port->phylink)
> +               return -ENOTSUPP;
> +
> +       return phylink_ethtool_set_pauseparam(port->phylink, pause);
> +}
> +
> +static int mvpp2_ethtool_get_link_ksettings(struct net_device *dev,
> +                                           struct ethtool_link_ksettings *cmd)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +
> +       if (!port->phylink)
> +               return -ENOTSUPP;
> +
> +       return phylink_ethtool_ksettings_get(port->phylink, cmd);
> +}
> +
> +static int mvpp2_ethtool_set_link_ksettings(struct net_device *dev,
> +                                           const struct ethtool_link_ksettings *cmd)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +
> +       if (!port->phylink)
> +               return -ENOTSUPP;
> +
> +       return phylink_ethtool_ksettings_set(port->phylink, cmd);
> +}
> +
>  /* Device ops */
>
>  static const struct net_device_ops mvpp2_netdev_ops = {
> @@ -7184,15 +7098,17 @@ static const struct net_device_ops mvpp2_netdev_ops = {
>  };
>
>  static const struct ethtool_ops mvpp2_eth_tool_ops = {
> -       .nway_reset     = phy_ethtool_nway_reset,
> +       .nway_reset     = mvpp2_ethtool_nway_reset,
>         .get_link       = ethtool_op_get_link,
>         .set_coalesce   = mvpp2_ethtool_set_coalesce,
>         .get_coalesce   = mvpp2_ethtool_get_coalesce,
>         .get_drvinfo    = mvpp2_ethtool_get_drvinfo,
>         .get_ringparam  = mvpp2_ethtool_get_ringparam,
>         .set_ringparam  = mvpp2_ethtool_set_ringparam,
> -       .get_link_ksettings = phy_ethtool_get_link_ksettings,
> -       .set_link_ksettings = phy_ethtool_set_link_ksettings,
> +       .get_pauseparam = mvpp2_ethtool_get_pause_param,
> +       .set_pauseparam = mvpp2_ethtool_set_pause_param,
> +       .get_link_ksettings = mvpp2_ethtool_get_link_ksettings,
> +       .set_link_ksettings = mvpp2_ethtool_set_link_ksettings,
>  };
>
>  /* Used for PPv2.1, or PPv2.2 with the old Device Tree binding that
> @@ -7492,17 +7408,185 @@ static void mvpp2_port_copy_mac_addr(struct net_device *dev, struct mvpp2 *priv,
>         eth_hw_addr_random(dev);
>  }
>
> +static void mvpp2_phylink_validate(struct net_device *dev,
> +                                  unsigned long *supported,
> +                                  struct phylink_link_state *state)
> +{
> +       __ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
> +
> +       phylink_set_port_modes(mask);
> +
> +       phylink_set(mask, Autoneg);
> +       phylink_set(mask, Pause);
> +       phylink_set(mask, Asym_Pause);
> +
> +       phylink_set(mask, 10baseT_Half);
> +       phylink_set(mask, 10baseT_Full);
> +       phylink_set(mask, 100baseT_Half);
> +       phylink_set(mask, 100baseT_Full);
> +       phylink_set(mask, 1000baseT_Half);
> +       phylink_set(mask, 1000baseT_Full);
> +       phylink_set(mask, 1000baseX_Full);
> +       phylink_set(mask, 10000baseKR_Full);
> +
> +       bitmap_and(supported, supported, mask, __ETHTOOL_LINK_MODE_MASK_NBITS);
> +       bitmap_and(state->advertising, state->advertising, mask,
> +                  __ETHTOOL_LINK_MODE_MASK_NBITS);
> +}
> +
> +static int mvpp2_phylink_mac_link_state(struct net_device *dev,
> +                                       struct phylink_link_state *state)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +       u32 val;
> +
> +       if (!phy_interface_mode_is_rgmii(port->phy_interface) &&
> +           port->phy_interface != PHY_INTERFACE_MODE_SGMII)
> +               return 0;
> +
> +       val = readl(port->base + MVPP2_GMAC_STATUS0);
> +
> +       state->an_complete = !!(val & MVPP2_GMAC_STATUS0_AN_COMPLETE);
> +       state->link = !!(val & MVPP2_GMAC_STATUS0_LINK_UP);
> +       state->duplex = !!(val & MVPP2_GMAC_STATUS0_FULL_DUPLEX);
> +
> +       if (val & MVPP2_GMAC_STATUS0_GMII_SPEED)
> +               state->speed = SPEED_1000;
> +       else
> +               state->speed = (val & MVPP2_GMAC_STATUS0_MII_SPEED) ?
> +                              SPEED_100 : SPEED_10;
> +
> +       state->pause = 0;
> +       if (val & MVPP2_GMAC_STATUS0_RX_PAUSE)
> +               state->pause |= MLO_PAUSE_RX;
> +       if (val & MVPP2_GMAC_STATUS0_TX_PAUSE)
> +               state->pause |= MLO_PAUSE_TX;
> +
> +       return 1;
> +}
> +
> +static void mvpp2_mac_an_restart(struct net_device *dev)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +       u32 val;
> +
> +       if (!phy_interface_mode_is_rgmii(port->phy_interface) &&
> +           port->phy_interface != PHY_INTERFACE_MODE_SGMII)
> +               return;
> +
> +       val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +       val |= MVPP2_GMAC_IN_BAND_RESTART_AN;
> +       writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +}
> +
> +static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
> +                            const struct phylink_link_state *state)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +       u32 val;
> +
> +       /* disable current port for reconfiguration */
> +       mvpp2_interrupts_disable(port);
> +       netif_carrier_off(port->dev);
> +       mvpp2_port_disable(port);
> +       phy_power_off(port->comphy);
> +
> +       /* comphy reconfiguration */
> +       port->phy_interface = state->interface;
> +       mvpp22_comphy_init(port);
> +
> +       /* gop/mac reconfiguration */
> +       mvpp22_gop_init(port);
> +       mvpp2_port_mii_set(port);
> +
> +       if (!phy_interface_mode_is_rgmii(port->phy_interface) &&
> +           port->phy_interface != PHY_INTERFACE_MODE_SGMII)
> +               return;
> +
> +       val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +       val &= ~(MVPP2_GMAC_CONFIG_MII_SPEED |
> +                MVPP2_GMAC_CONFIG_GMII_SPEED |
> +                MVPP2_GMAC_CONFIG_FULL_DUPLEX |
> +                MVPP2_GMAC_AN_SPEED_EN |
> +                MVPP2_GMAC_AN_DUPLEX_EN);
> +
> +       if (state->duplex)
> +               val |= MVPP2_GMAC_CONFIG_FULL_DUPLEX;
> +
> +       if (state->speed == SPEED_1000)
> +               val |= MVPP2_GMAC_CONFIG_GMII_SPEED;
> +       else if (state->speed == SPEED_100)
> +               val |= MVPP2_GMAC_CONFIG_MII_SPEED;
> +
> +       writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +
> +       if (port->priv->hw_version == MVPP21 && port->flags & MVPP2_F_LOOPBACK)
> +               mvpp2_port_loopback_set(port, state);
> +}
> +
> +static void mvpp2_mac_link_down(struct net_device *dev, unsigned int mode)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +       u32 val;
> +
> +       netif_tx_stop_all_queues(dev);
> +       netif_carrier_off(dev);
> +       mvpp2_ingress_disable(port);
> +       mvpp2_egress_disable(port);
> +
> +       mvpp2_port_disable(port);
> +       mvpp2_interrupts_disable(port);
> +
> +       if (!phylink_autoneg_inband(mode)) {
> +               val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +               val &= ~MVPP2_GMAC_FORCE_LINK_PASS;
> +               val |= MVPP2_GMAC_FORCE_LINK_DOWN;
> +               writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +       }
> +}
> +
> +static void mvpp2_mac_link_up(struct net_device *dev, unsigned int mode,
> +                             struct phy_device *phy)
> +{
> +       struct mvpp2_port *port = netdev_priv(dev);
> +       u32 val;
> +
> +       if (!phylink_autoneg_inband(mode)) {
> +               val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +               val &= ~MVPP2_GMAC_FORCE_LINK_DOWN;
> +               val |= MVPP2_GMAC_FORCE_LINK_PASS;
> +               writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> +       }
> +
> +       mvpp2_interrupts_enable(port);
> +       mvpp2_port_enable(port);
> +
> +       mvpp2_egress_enable(port);
> +       mvpp2_ingress_enable(port);
> +       netif_carrier_on(dev);
> +       netif_tx_wake_all_queues(dev);
> +}
> +
> +static const struct phylink_mac_ops mvpp2_phylink_ops = {
> +       .validate = mvpp2_phylink_validate,
> +       .mac_link_state = mvpp2_phylink_mac_link_state,
> +       .mac_an_restart = mvpp2_mac_an_restart,
> +       .mac_config = mvpp2_mac_config,
> +       .mac_link_down = mvpp2_mac_link_down,
> +       .mac_link_up = mvpp2_mac_link_up,
> +};
> +
>  /* Ports initialization */
>  static int mvpp2_port_probe(struct platform_device *pdev,
>                             struct device_node *port_node,
>                             struct mvpp2 *priv)
>  {
> -       struct device_node *phy_node;
>         struct phy *comphy;
>         struct mvpp2_port *port;
>         struct mvpp2_port_pcpu *port_pcpu;
>         struct net_device *dev;
>         struct resource *res;
> +       struct phylink *phylink;
>         char *mac_from = "";
>         unsigned int ntxqs, nrxqs;
>         bool has_tx_irqs;
> @@ -7526,7 +7610,6 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>         if (!dev)
>                 return -ENOMEM;
>
> -       phy_node = of_parse_phandle(port_node, "phy", 0);
>         phy_mode = of_get_phy_mode(port_node);
>         if (phy_mode < 0) {
>                 dev_err(&pdev->dev, "incorrect phy mode\n");
> @@ -7565,15 +7648,6 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>         if (err)
>                 goto err_free_netdev;
>
> -       port->link_irq = of_irq_get_byname(port_node, "link");
> -       if (port->link_irq == -EPROBE_DEFER) {
> -               err = -EPROBE_DEFER;
> -               goto err_deinit_qvecs;
> -       }
> -       if (port->link_irq <= 0)
> -               /* the link irq is optional */
> -               port->link_irq = 0;
> -
>         if (of_property_read_bool(port_node, "marvell,loopback"))
>                 port->flags |= MVPP2_F_LOOPBACK;
>
> @@ -7583,7 +7657,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>         else
>                 port->first_rxq = port->id * priv->max_port_rxqs;
>
> -       port->phy_node = phy_node;
> +       port->of_node = port_node;
>         port->phy_interface = phy_mode;
>         port->comphy = comphy;
>
> @@ -7592,7 +7666,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>                 port->base = devm_ioremap_resource(&pdev->dev, res);
>                 if (IS_ERR(port->base)) {
>                         err = PTR_ERR(port->base);
> -                       goto err_free_irq;
> +                       goto err_deinit_qvecs;
>                 }
>         } else {
>                 if (of_property_read_u32(port_node, "gop-port-id",
> @@ -7609,7 +7683,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>         port->stats = netdev_alloc_pcpu_stats(struct mvpp2_pcpu_stats);
>         if (!port->stats) {
>                 err = -ENOMEM;
> -               goto err_free_irq;
> +               goto err_deinit_qvecs;
>         }
>
>         mvpp2_port_copy_mac_addr(dev, priv, port_node, &mac_from);
> @@ -7662,16 +7736,47 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>         /* 9676 == 9700 - 20 and rounding to 8 */
>         dev->max_mtu = 9676;
>
> +       /* The PHY node is optional. If not present the GoP link IRQ will be
> +        * used to handle link updates. Otherwise use phylink.
> +        */
> +       if (of_find_property(port_node, "phy", NULL)) {
> +               phylink = phylink_create(dev, port_node, phy_mode,
> +                                        &mvpp2_phylink_ops);
> +               if (IS_ERR(phylink)) {
> +                       err = PTR_ERR(phylink);
> +                       goto err_free_port_pcpu;
> +               }
> +               port->phylink = phylink;
> +               port->link_irq = 0;
> +       } else {
> +               port->phylink = NULL;
> +               if (priv->hw_version == MVPP22) {
> +                       port->link_irq = of_irq_get_byname(port_node, "link");
> +                       if (port->link_irq == -EPROBE_DEFER) {
> +                               err = -EPROBE_DEFER;
> +                               goto err_free_port_pcpu;
> +                       }
> +                       if (port->link_irq <= 0)
> +                               /* the link irq is optional */
> +                               port->link_irq = 0;
> +               }
> +       }
> +
>         err = register_netdev(dev);
>         if (err < 0) {
>                 dev_err(&pdev->dev, "failed to register netdev\n");
> -               goto err_free_port_pcpu;
> +               goto err_phylink_irq;
>         }
>         netdev_info(dev, "Using %s mac address %pM\n", mac_from, dev->dev_addr);
>
>         priv->port_list[id] = port;
>         return 0;
>
> +err_phylink_irq:
> +       if (port->phylink)
> +               phylink_destroy(port->phylink);
> +       else if (port->link_irq)
> +               irq_dispose_mapping(port->link_irq);
>  err_free_port_pcpu:
>         free_percpu(port->pcpu);
>  err_free_txq_pcpu:
> @@ -7679,13 +7784,9 @@ static int mvpp2_port_probe(struct platform_device *pdev,
>                 free_percpu(port->txqs[i]->pcpu);
>  err_free_stats:
>         free_percpu(port->stats);
> -err_free_irq:
> -       if (port->link_irq)
> -               irq_dispose_mapping(port->link_irq);
>  err_deinit_qvecs:
>         mvpp2_queue_vectors_deinit(port);
>  err_free_netdev:
> -       of_node_put(phy_node);
>         free_netdev(dev);
>         return err;
>  }
> @@ -7696,7 +7797,8 @@ static void mvpp2_port_remove(struct mvpp2_port *port)
>         int i;
>
>         unregister_netdev(port->dev);
> -       of_node_put(port->phy_node);
> +       if (port->phylink)
> +               phylink_destroy(port->phylink);
>         free_percpu(port->pcpu);
>         free_percpu(port->stats);
>         for (i = 0; i < port->ntxqs; i++)
> --
> 2.13.5
>

^ permalink raw reply

* Re: tg3 pxe weirdness
From: Berend De Schouwer @ 2017-09-22  7:28 UTC (permalink / raw)
  To: Siva Reddy Kallam; +Cc: Linux Netdev List
In-Reply-To: <CAMet4B4Jdgjs42KAWQ4HJr1t_Grn1_+Hv8sipeLb_tvZ_SFrMw@mail.gmail.com>

On Fri, 2017-09-22 at 11:51 +0530, Siva Reddy Kallam wrote:
> On Thu, Sep 21, 2017 at 7:53 PM, Berend De Schouwer
> <berend.de.schouwer@gmail.com> wrote:
> > Hi,
> > 
> > I've got a machine with a Broadcom bcm5762c, using the tg3 driver,
> > that
> > fails to receive network packets under some very specific
> > conditions.
> > 
> > 
> > Berend
> 
> Can you please share below details?
> 1) Model and Manufacturer of the system

HP EliteDesk 705 G3 SFF
lspci -n on the network card: 01:00.0 0200: 14e4:1687 (rev 10)

dmesg from tg3: 
tg3 0000:01:00.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
tg3 0000:01:00.0: setting latency timer to 64
tg3 0000:01:00.0: eth0: Tigon3 [partno(none) rev 5762100] (PCI Express)
MAC address 70:5a:0f:3d:9f:4f
tg3 0000:01:00.0: eth0: attached PHY is 5762C (10/100/1000Base-T
Ethernet) (WireSpeed[1], EEE[1])
tg3 0000:01:00.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0]
TSOcap[1]
tg3 0000:01:00.0: eth0: dma_rwctrl[00000001] dma_mask[64-bit]

The same motherboard with add-on network cards does boot using PXE. 
It's an EFI board but Secureboot is currently disabled.


> 2) Linux distro/kernel used?

CentOS 6 + updates.  I've tried all the CentOS 6 kernels up to 2.6.32-
696.10.2.el6.x86_64

I've tried updating the tg3 driver from 3.137 to both 3.137h and
3.137o, with the same result.

I'm currently in the process of making 4.13.2 boot.

^ permalink raw reply

* Re: [PATCH net 1/4] net:ethernet:aquantia: Setup max_mtu in ndev to enable jumbo frames
From: Igor Russkikh @ 2017-09-22  7:24 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David S . Miller, netdev, David Arcari, Pavel Belous,
	Nadezhda Krupnina, Simon Edelhaus
In-Reply-To: <20170921153619.GG27589@lunn.ch>


>>  	self->ndev->mtu = aq_nic_cfg->mtu - ETH_HLEN;
>> +	self->ndev->min_mtu = ETH_MIN_MTU;
> This is not required. It will default to ETH_MIN_MTU.

Thanks Andrew, true.

>> +	self->ndev->max_mtu = self->aq_hw_caps.mtu - ETH_FCS_LEN - ETH_HLEN;
>>  
>>  	return 0;
>>  }
>> @@ -695,7 +696,7 @@ int aq_nic_set_mtu(struct aq_nic_s *self, int new_mtu)
>>  {
>>  	int err = 0;
>>  
>> -	if (new_mtu > self->aq_hw_caps.mtu) {
>> +	if (new_mtu + ETH_FCS_LEN > self->aq_hw_caps.mtu) {
> If you have set max_mtu correctly, this cannot happen.  But it seems
> odd you don't have ETH_HLEN here, where as when setting max_mtu you
> do.

I agree, thats an extra check. Will remove that.
Regarding ETH_HLEN - the baseline code invokes this function with (new_mtu + ETH_HLEN)  - thats why this check does not add it. In general, that extra level of storage (aq_nic_cfg.mtu field) is almost useless, I'll cleanup this under a separate patchset.

-- 
BR,
  Igor

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: dsa: lan9303: Add basic offloading of unicast traffic
From: Egil Hjelmeland @ 2017-09-22  7:23 UTC (permalink / raw)
  To: Vivien Didelot, andrew, f.fainelli, netdev, linux-kernel
In-Reply-To: <87k20sqg9d.fsf@weeman.i-did-not-set--mail-host-address--so-tickle-me>

Den 21. sep. 2017 16:26, skrev Vivien Didelot:
> Hi Egil,
> 
> Egil Hjelmeland <privat@egil-hjelmeland.no> writes:
> 
>> When both user ports are joined to the same bridge, the normal
>> HW MAC learning is enabled. This means that unicast traffic is forwarded
>> in HW.
>>
>> If one of the user ports leave the bridge,
>> the ports goes back to the initial separated operation.
>>
>> Port separation relies on disabled HW MAC learning. Hence the condition
>> that both ports must join same bridge.
>>
>> Add brigde methods port_bridge_join, port_bridge_leave and
>> port_stp_state_set.
>>
>> Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
> 
> Styling nitpicks below, other than that, the patch LGTM:
> 
> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
> 
>>   #include <linux/mutex.h>
>>   #include <linux/mii.h>
>>   #include <linux/phy.h>
>> +#include <linux/if_bridge.h>
> 
> It's nice to order header inclusions alphabetically.
> 
>>   
>>   #include "lan9303.h"
>>   
>> @@ -146,6 +147,7 @@
>>   # define LAN9303_SWE_PORT_STATE_FORWARDING_PORT0 (0)
>>   # define LAN9303_SWE_PORT_STATE_LEARNING_PORT0 BIT(1)
>>   # define LAN9303_SWE_PORT_STATE_BLOCKING_PORT0 BIT(0)
>> +# define LAN9303_SWE_PORT_STATE_DISABLED_PORT0 (3)
>>   #define LAN9303_SWE_PORT_MIRROR 0x1846
>>   # define LAN9303_SWE_PORT_MIRROR_SNIFF_ALL BIT(8)
>>   # define LAN9303_SWE_PORT_MIRROR_SNIFFER_PORT2 BIT(7)
>> @@ -431,6 +433,20 @@ static int lan9303_read_switch_reg(struct lan9303 *chip, u16 regnum, u32 *val)
>>   	return ret;
>>   }
>>   
>> +static int lan9303_write_switch_reg_mask(
>> +	struct lan9303 *chip, u16 regnum, u32 val, u32 mask)
> 
> That is unlikely to respect the Kernel Coding Style. Please fill the
> line as much as possible and align with the opening parenthesis:
> 
>      static int lan9303_write_switch_reg_mask(struct lan9303 *chip, u16 regnum,
>                                               u32 val, u32 mask)
> 
OK. Probably this function will go away in v2 due to Andrews comment.

>> +{
>> +	int ret;
>> +	u32 reg;
>> +
>> +	ret = lan9303_read_switch_reg(chip, regnum, &reg);
>> +	if (ret)
>> +		return ret;
>> +	reg = (reg & ~mask) | val;
>> +
>> +	return lan9303_write_switch_reg(chip, regnum, reg);
>> +}
> 
> ...
> 
>> +
>> +	portmask = 0x3 << (port * 2);
>> +	portstate     <<= (port * 2);
> 
> Unnecessary alignment, please remove the extra space characters.
> 

I think the alignment makes the logic more clear.


>> +	lan9303_write_switch_reg_mask(chip, LAN9303_SWE_PORT_STATE,
>> +				      portstate, portmask);
>> +}
>> +
>>   static const struct dsa_switch_ops lan9303_switch_ops = {
>>   	.get_tag_protocol = lan9303_get_tag_protocol,
>>   	.setup = lan9303_setup,
>> @@ -855,6 +940,9 @@ static const struct dsa_switch_ops lan9303_switch_ops = {
>>   	.get_sset_count = lan9303_get_sset_count,
>>   	.port_enable = lan9303_port_enable,
>>   	.port_disable = lan9303_port_disable,
>> +	.port_bridge_join       = lan9303_port_bridge_join,
>> +	.port_bridge_leave      = lan9303_port_bridge_leave,
>> +	.port_stp_state_set     = lan9303_port_stp_state_set,
> 
> Same here, alignment unnecessary, especially since the rest of the
> structure does not do that.
> 
>>   };
>>   
>>   static int lan9303_register_switch(struct lan9303 *chip)
>> diff --git a/drivers/net/dsa/lan9303.h b/drivers/net/dsa/lan9303.h
>> index 4d8be555ff4d..5be246f05965 100644
>> --- a/drivers/net/dsa/lan9303.h
>> +++ b/drivers/net/dsa/lan9303.h
>> @@ -21,6 +21,7 @@ struct lan9303 {
>>   	struct dsa_switch *ds;
>>   	struct mutex indirect_mutex; /* protect indexed register access */
>>   	const struct lan9303_phy_ops *ops;
>> +	bool is_bridged; /* true if port 1 and 2 is bridged */
>>   };
>>   
>>   extern const struct regmap_access_table lan9303_register_set;
> 
> Please use the checkpatch.pl script to ensure your patch respects the
> kernel conventions before submitting, it can spot nice stuffs!

I have checked _every_ patch with checkpatch.pl and weeded all warnings
before I submitted them.

> 
> I use a Git alias(*) to check a commit which does basically this:
> 
>      git format-patch --stdout -1 | ./scripts/checkpatch.pl -
> 
> 
> (*) in details, especially convenient during interactive rebases:
> 
>      $ git config alias.checkcommit '!f () { git format-patch --stdout -1 ${1:-HEAD} | ./scripts/checkpatch.pl - ; }; f'
>      $ git checkcommit # i.e. current one
>      $ git checkcommit HEAD^^
>      $ git checkcommit d329ac88eb21
>      ...
> 
> 
> Thanks,
> 
>          Vivien
> 

^ permalink raw reply

* Re: [PATCH] wireless: iwlegacy: make const array static to shink object code size Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit
From: Julia Lawall @ 2017-09-22  7:23 UTC (permalink / raw)
  To: Colin King
  Cc: Stanislaw Gruszka, Kalle Valo, linux-wireless, netdev,
	kernel-janitors, linux-kernel
In-Reply-To: <20170921225630.21916-1-colin.king@canonical.com>



On Thu, 21 Sep 2017, Colin King wrote:

> From: Colin Ian King <colin.king@canonical.com>
>
> Don't populate const array ac_to_fifo on the stack in an inlined
> function, instead make it static.  Makes the object code smaller
> by over 800 bytes:
>
>    text	   data	    bss	    dec	    hex	filename
>  159029	  33154	   1216	 193399	  2f377	4965-mac.o
>
>    text	   data	    bss	    dec	    hex	filename
>  158122	  33250	   1216	 192588	  2f04c	4965-mac.o
>
> (gcc version 7.2.0 x86_64)

Gcc 7 must be much more aggressive about inlining than gcc 4.  Here is the
change that I got on this file:

     text          data     bss     dec     hex filename
-   76346          1494     152   77992   130a8 drivers/net/wireless/intel/iwlegacy/4965-mac.o
+   76298          1494     152   77944   13078 drivers/net/wireless/intel/iwlegacy/4965-mac.o
decrease of 48

julia

>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/wireless/intel/iwlegacy/4965-mac.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/intel/iwlegacy/4965-mac.c b/drivers/net/wireless/intel/iwlegacy/4965-mac.c
> index de9b6522c43f..65eba2c24292 100644
> --- a/drivers/net/wireless/intel/iwlegacy/4965-mac.c
> +++ b/drivers/net/wireless/intel/iwlegacy/4965-mac.c
> @@ -1480,7 +1480,7 @@ il4965_get_ac_from_tid(u16 tid)
>  static inline int
>  il4965_get_fifo_from_tid(u16 tid)
>  {
> -	const u8 ac_to_fifo[] = {
> +	static const u8 ac_to_fifo[] = {
>  		IL_TX_FIFO_VO,
>  		IL_TX_FIFO_VI,
>  		IL_TX_FIFO_BE,
> --
> 2.14.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: Paolo Abeni @ 2017-09-22  7:10 UTC (permalink / raw)
  To: David Miller, sfr; +Cc: netdev, linux-next, linux-kernel
In-Reply-To: <20170921.183759.289331373985144347.davem@davemloft.net>

On Thu, 2017-09-21 at 18:37 -0700, David Miller wrote:
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Fri, 22 Sep 2017 11:03:55 +1000
> 
> > After merging the net-next tree, today's linux-next build (arm
> > multi_v7_defconfig) failed like this:
> > 
> > net/ipv4/fib_frontend.c: In function 'fib_validate_source':
> > net/ipv4/fib_frontend.c:411:16: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
> >    if (net->ipv4.fib_has_custom_local_routes)
> >                 ^
> > net/ipv4/fib_frontend.c: In function 'inet_rtm_newroute':
> > net/ipv4/fib_frontend.c:773:12: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
> >    net->ipv4.fib_has_custom_local_routes = true;
> >             ^
> > 
> > Caused by commit
> > 
> >   6e617de84e87 ("net: avoid a full fib lookup when rp_filter is disabled.")
> 
> Paolo, it seems this struct member should be placed outside of
> IP_MULTIPLE_TABLES protection, since users can insert custom
> local routes even without that set.
> 
> So I'm installing the following fix for this:
> 
> ====================
> [PATCH] ipv4: Move fib_has_custom_local_routes outside of IP_MULTIPLE_TABLES.
> 
> > net/ipv4/fib_frontend.c: In function 'fib_validate_source':
> > net/ipv4/fib_frontend.c:411:16: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
> >    if (net->ipv4.fib_has_custom_local_routes)
> >                 ^
> > net/ipv4/fib_frontend.c: In function 'inet_rtm_newroute':
> > net/ipv4/fib_frontend.c:773:12: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
> >    net->ipv4.fib_has_custom_local_routes = true;
> >             ^
> 
> Fixes: 6e617de84e87 ("net: avoid a full fib lookup when rp_filter is disabled.")
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  include/net/netns/ipv4.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 20720721da4b..8387f099115e 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -49,10 +49,10 @@ struct netns_ipv4 {
>  #ifdef CONFIG_IP_MULTIPLE_TABLES
>         struct fib_rules_ops    *rules_ops;
>         bool                    fib_has_custom_rules;
> -       bool                    fib_has_custom_local_routes;
>         struct fib_table __rcu  *fib_main;
>         struct fib_table __rcu  *fib_default;
>  #endif
> +       bool                    fib_has_custom_local_routes;
>  #ifdef CONFIG_IP_ROUTE_CLASSID
>         int                     fib_num_tclassid_users;
>  #endif
> -- 
> 2.13.5

My fault, I should have fuzzed the configuration before posting.

Fix looks good, thanks David!

Paolo

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: dsa: lan9303: Add basic offloading of unicast traffic
From: Egil Hjelmeland @ 2017-09-22  7:06 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: vivien.didelot, f.fainelli, netdev, linux-kernel
In-Reply-To: <20170921142127.GB27589@lunn.ch>

Den 21. sep. 2017 16:21, skrev Andrew Lunn:
> Hi Egil
> 
>> +static void lan9303_bridge_ports(struct lan9303 *chip)
>> +{
>> +	/* ports bridged: remove mirroring */
>> +	lan9303_write_switch_reg(chip, LAN9303_SWE_PORT_MIRROR, 0);
>> +}
> 
> Could you replace the 0 with something symbolic which makes this
> easier to understand.
> 
> #define LAN9303_SWE_PORT_MIRROR_DISABLED 0
> 

OK

>> +
>>   static int lan9303_handle_reset(struct lan9303 *chip)
>>   {
>>   	if (!chip->reset_gpio)
>> @@ -844,6 +866,69 @@ static void lan9303_port_disable(struct dsa_switch *ds, int port,
>>   	}
>>   }
>>   
>> +static int lan9303_port_bridge_join(struct dsa_switch *ds, int port,
>> +				    struct net_device *br)
>> +{
>> +	struct lan9303 *chip = ds->priv;
>> +
>> +	dev_dbg(chip->dev, "%s(port %d)\n", __func__, port);
>> +	if (ds->ports[1].bridge_dev ==  ds->ports[2].bridge_dev) {
>> +		lan9303_bridge_ports(chip);
>> +		chip->is_bridged = true;  /* unleash stp_state_set() */
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void lan9303_port_bridge_leave(struct dsa_switch *ds, int port,
>> +				      struct net_device *br)
>> +{
>> +	struct lan9303 *chip = ds->priv;
>> +
>> +	dev_dbg(chip->dev, "%s(port %d)\n", __func__, port);
>> +	if (chip->is_bridged) {
>> +		lan9303_separate_ports(chip);
>> +		chip->is_bridged = false;
>> +	}
>> +}
>> +
>> +static void lan9303_port_stp_state_set(struct dsa_switch *ds, int port,
>> +				       u8 state)
>> +{
>> +	int portmask, portstate;
>> +	struct lan9303 *chip = ds->priv;
>> +
>> +	dev_dbg(chip->dev, "%s(port %d, state %d)\n",
>> +		__func__, port, state);
>> +	if (!chip->is_bridged)
>> +		return; /* touching SWE_PORT_STATE will break port separation */
> 
> I'm wondering how this is supposed to work. Please add a good comment
> here, since the hardware is forcing you to do something odd.
> 
> Maybe it would be a good idea to save the STP state in chip.  And then
> when chip->is_bridged is set true, change the state in the hardware to
> the saved value?
> 
> What happens when port 0 is added to the bridge, there is then a
> minute pause and then port 1 is added? I would expect that as soon as
> port 0 is added, the STP state machine for port 0 will start and move
> into listening and then forwarding. Due to hardware limitations it
> looks like you cannot do this. So what state is the hardware
> effectively in? Blocking? Forwarding?
> 
> Then port 1 is added. You can then can respect the states. port 1 will
> do blocking->listening->forwarding, but what about port 0? The calls
> won't get repeated? How does it transition to forwarding?
> 
>    Andrew
> 

I see your point with the "minute pause" argument. Although a bit
contrived use case, it is easy to fix by caching the STP state, as
you suggest. So I can do that.

The port separation method is from the original version of the driver,
not by me.

I have read through the datasheet to see if there are other ways
to make port separation, but I could not find anything.
If anybody care to read through the 300+ page lan9303.pdf and prove
me wrong, I am happy to do it differently.

How does other DSA HW chips handle port separation? Knowing that
could perhaps help me know what to look for.

Egil
'

>> +
>> +	switch (state) {
>> +	case BR_STATE_DISABLED:
>> +		portstate = LAN9303_SWE_PORT_STATE_DISABLED_PORT0;
>> +		break;
>> +	case BR_STATE_BLOCKING:
>> +	case BR_STATE_LISTENING:
>> +		portstate = LAN9303_SWE_PORT_STATE_BLOCKING_PORT0;
>> +		break;
>> +	case BR_STATE_LEARNING:
>> +		portstate = LAN9303_SWE_PORT_STATE_LEARNING_PORT0;
>> +		break;
>> +	case BR_STATE_FORWARDING:
>> +		portstate = LAN9303_SWE_PORT_STATE_FORWARDING_PORT0;
>> +		break;
>> +	default:
>> +		portstate = LAN9303_SWE_PORT_STATE_DISABLED_PORT0;
>> +		dev_err(chip->dev, "unknown stp state: port %d, state %d\n",
>> +			port, state);
>> +	}
>> +
>> +	portmask = 0x3 << (port * 2);
>> +	portstate     <<= (port * 2);
>> +	lan9303_write_switch_reg_mask(chip, LAN9303_SWE_PORT_STATE,
>> +				      portstate, portmask);
>> +}
> 

^ permalink raw reply

* [PATCH net-next] virtio-net: correctly set xdp_xmit for mergeable buffer
From: Jason Wang @ 2017-09-22  6:38 UTC (permalink / raw)
  To: mst, jasowang, virtualization, netdev, linux-kernel; +Cc: John Fastabend

We should set xdp_xmit only when xdp_do_redirect() succeed.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f6c1f13..dd14a45 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -721,7 +721,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			goto xdp_xmit;
 		case XDP_REDIRECT:
 			err = xdp_do_redirect(dev, &xdp, xdp_prog);
-			if (err)
+			if (!err)
 				*xdp_xmit = true;
 			rcu_read_unlock();
 			goto xdp_xmit;
-- 
2.7.4

^ permalink raw reply related

* Re: [patch net-next 03/12] ipmr: Add FIB notification access functions
From: Yotam Gigi @ 2017-09-22  6:35 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Jiri Pirko, netdev; +Cc: davem, idosch, mlxsw
In-Reply-To: <24470459-704b-89a2-648a-bbd4d3a1aebb@cumulusnetworks.com>

On 09/21/2017 02:19 PM, Nikolay Aleksandrov wrote:
> On 21/09/17 09:43, Jiri Pirko wrote:
>> From: Yotam Gigi <yotamg@mellanox.com>
>>
>> Make the ipmr module register as a FIB notifier. To do that, implement both
>> the ipmr_seq_read and ipmr_dump ops.
>>
>> The ipmr_seq_read op returns a sequence counter that is incremented on
>> every notification related operation done by the ipmr. To implement that,
>> add a sequence counter in the netns_ipv4 struct and increment it whenever a
>> new MFC route or VIF are added or deleted. The sequence operations are
>> protected by the RTNL lock.
>>
>> The ipmr_dump iterates the list of MFC routes and the list of VIF entries
>> and sends notifications about them. The entries dump is done under RCU.
>>
>> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
>> Reviewed-by: Ido Schimmel <idosch@mellanox.com>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  include/linux/mroute.h   |  15 ++++++
>>  include/net/netns/ipv4.h |   3 ++
>>  net/ipv4/ipmr.c          | 135 ++++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 151 insertions(+), 2 deletions(-)
>>
> [snip]
>> +
>> +static int ipmr_dump(struct net *net, struct notifier_block *nb)
>> +{
>> +	struct mr_table *mrt;
>> +	int err;
>> +
>> +	err = ipmr_rules_dump(net, nb);
>> +	if (err)
>> +		return err;
>> +
>> +	ipmr_for_each_table(mrt, net) {
>> +		struct vif_device *v = &mrt->vif_table[0];
>> +		struct mfc_cache *mfc;
>> +		int vifi;
>> +
>> +		/* Notifiy on table VIF entries */
>> +		for (vifi = 0; vifi < mrt->maxvif; vifi++, v++) {
>> +			if (!v->dev)
>> +				continue;
>> +
>> +			call_ipmr_vif_entry_notifier(nb, net, FIB_EVENT_VIF_ADD,
>> +						     v, vifi, mrt->id);
>> +		}
> The VIF table is protected by mrt_lock (rwlock), here with RCU only
> you're not guaranteed to keep v->dev. It can become NULL after the check above.
> For details you can see vif_delete() in net/ipv4/ipmr.c. You need at least
> mrt_lock for reading.

Hmm, that's interesting. That situation would lead the dump to be inconsistent,
thus eventually discarded, right? So I can also just make sure that the driver
ignores dev=NULL notifications and that would be enough to solve it.

I have to think about it a bit more, but anyway, maybe taking the mrt_lock for
the VIF dump is the right solution here.

Anyway, thanks for the review!


>> +
>> +		/* Notify on table MFC entries */
>> +		list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list)
>> +			call_ipmr_mfc_entry_notifier(nb, net,
>> +						     FIB_EVENT_ENTRY_ADD, mfc,
>> +						     mrt->id);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct fib_notifier_ops ipmr_notifier_ops_template = {
>> +	.family		= RTNL_FAMILY_IPMR,
>> +	.fib_seq_read	= ipmr_seq_read,
>> +	.fib_dump	= ipmr_dump,
>> +	.owner		= THIS_MODULE,
>> +};
>> +
>> +int __net_init ipmr_notifier_init(struct net *net)
>> +{
>> +	struct fib_notifier_ops *ops;
>> +
>> +	net->ipv4.ipmr_seq = 0;
>> +
>> +	ops = fib_notifier_ops_register(&ipmr_notifier_ops_template, net);
>> +	if (IS_ERR(ops))
>> +		return PTR_ERR(ops);
>> +	net->ipv4.ipmr_notifier_ops = ops;
>> +
>> +	return 0;
>> +}
>> +
>> +static void __net_exit ipmr_notifier_exit(struct net *net)
>> +{
>> +	fib_notifier_ops_unregister(net->ipv4.ipmr_notifier_ops);
>> +	net->ipv4.ipmr_notifier_ops = NULL;
>> +}
>> +
>>  /* Setup for IP multicast routing */
>>  static int __net_init ipmr_net_init(struct net *net)
>>  {
>>  	int err;
>>  
>> +	err = ipmr_notifier_init(net);
>> +	if (err)
>> +		goto ipmr_notifier_fail;
>> +
>>  	err = ipmr_rules_init(net);
>>  	if (err < 0)
>> -		goto fail;
>> +		goto ipmr_rules_fail;
>>  
>>  #ifdef CONFIG_PROC_FS
>>  	err = -ENOMEM;
>> @@ -3074,7 +3202,9 @@ static int __net_init ipmr_net_init(struct net *net)
>>  proc_vif_fail:
>>  	ipmr_rules_exit(net);
>>  #endif
>> -fail:
>> +ipmr_rules_fail:
>> +	ipmr_notifier_exit(net);
>> +ipmr_notifier_fail:
>>  	return err;
>>  }
>>  
>> @@ -3084,6 +3214,7 @@ static void __net_exit ipmr_net_exit(struct net *net)
>>  	remove_proc_entry("ip_mr_cache", net->proc_net);
>>  	remove_proc_entry("ip_mr_vif", net->proc_net);
>>  #endif
>> +	ipmr_notifier_exit(net);
>>  	ipmr_rules_exit(net);
>>  }
>>  
>>

^ permalink raw reply

* Re: [PATCH 00/64] use setup_timer() helper function.
From: Allen @ 2017-09-22  6:22 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-kernel, m.grzeschik, dmitry.tarnyagin, wg, mkl,
	Mark Einon, linux, pcnet32, Florian Fainelli,
	bcm-kernel-feedback-list, venza, ajk, jes, romieu, khc,
	Simon Kelley, linux-can, adi-buildroot-devel
In-Reply-To: <20170921.113927.1684139706973891956.davem@davemloft.net>

>
>>  This series uses setup_timer() helper function. The series
>> addresses the files under drivers/net/*.
>
> I've reviewed this series and will apply it to net-next.
>
> But please send out smaller chunks next time, maybe 10-15
> in a bunch?  64 patches at once makes it really hard for
> reviewiers.

 My apologies. I'll make sure I break it into smaller chunks
in the future.


-- 
       - Allen

^ permalink raw reply

* Re: tg3 pxe weirdness
From: Siva Reddy Kallam @ 2017-09-22  6:21 UTC (permalink / raw)
  To: Berend De Schouwer; +Cc: Linux Netdev List

On Thu, Sep 21, 2017 at 7:53 PM, Berend De Schouwer
<berend.de.schouwer@gmail.com> wrote:
> Hi,
>
> I've got a machine with a Broadcom bcm5762c, using the tg3 driver, that
> fails to receive network packets under some very specific conditions.
>
> It works perfectly using a 1Gbps switch.  If, however, it first uses
> PXE and then loads the Linux tg3 driver, and the switch is 100Mbps, it
> no longer receives packets larger than ICMP.
>
> It does do ARP and ping.
>
> If it boots using PXE on a 1Gbps switch, boots into Linux, and then
> it's plugged into 100 Mbps it also stops receiving packets.
>
> mii-diag and dmesg confirm auto-negotiated speed and flow control, and
> confirm temporary disconnect as the cables are moved.
>
> PXE boots using UNDI, which then transfers a kernel using TFTP, which
> transfers correctly.  The kernel boots, loads the tg3 driver, connects
> the network.  Up to this point everything works.  Ping will work too.
> Any other network traffic fails.
>
> Booting from a harddrive works fine.  I assume the UNDI driver
> somewhere breaks auto-negotiation.  I've tried using mii-tool and
> ethtool, but I haven't managed to make it work yet.
>
> Is it possible to get negotiation working after PXE boot?  Are there
> any tg3 driver flags that might make a difference?
>
>
> Berend

Can you please share below details?
1) Model and Manufacturer of the system
2) Linux distro/kernel used?

^ permalink raw reply

* [PATCH net-next v2 6/6] rtnetlink: rtnl_have_link_slave_info doesn't need rtnl
From: Florian Westphal @ 2017-09-22  6:10 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20170922061008.14723-1-fw@strlen.de>

Switch it to rcu.

rtnl_link_slave_info_fill on to other hand does need rtnl mutex for now so
add an annotation to its caller as a reminder.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Change since v1:
  - don't add ASSERT_RTNL to rtnl_link_slave_info_fill and
  rtnl_link_info_fill they are only called via rtnl_link_fill.

 net/core/rtnetlink.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 590823f70cc3..e6f9e36d9d5a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -522,11 +522,15 @@ static size_t rtnl_link_get_af_size(const struct net_device *dev,
 static bool rtnl_have_link_slave_info(const struct net_device *dev)
 {
 	struct net_device *master_dev;
+	bool ret = false;
 
-	master_dev = netdev_master_upper_dev_get((struct net_device *) dev);
+	rcu_read_lock();
+
+	master_dev = netdev_master_upper_dev_get_rcu((struct net_device *)dev);
 	if (master_dev && master_dev->rtnl_link_ops)
-		return true;
-	return false;
+		ret = true;
+	rcu_read_unlock();
+	return ret;
 }
 
 static int rtnl_link_slave_info_fill(struct sk_buff *skb,
@@ -598,6 +602,8 @@ static int rtnl_link_fill(struct sk_buff *skb, const struct net_device *dev)
 	struct nlattr *linkinfo;
 	int err = -EMSGSIZE;
 
+	ASSERT_RTNL();
+
 	linkinfo = nla_nest_start(skb, IFLA_LINKINFO);
 	if (linkinfo == NULL)
 		goto out;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next v2 5/6] rtnetlink: add helpers to dump vf and netnsid information
From: Florian Westphal @ 2017-09-22  6:10 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20170922061008.14723-1-fw@strlen.de>

similar to earlier patches, split out more parts of this function to
better see what is happening and where we assume rtnl is locked.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/core/rtnetlink.c | 80 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 50 insertions(+), 30 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 42ff582a010e..590823f70cc3 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1211,6 +1211,36 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb,
 	return -EMSGSIZE;
 }
 
+static noinline_for_stack int rtnl_fill_vf(struct sk_buff *skb,
+					   struct net_device *dev,
+					   u32 ext_filter_mask)
+{
+	struct nlattr *vfinfo;
+	int i, num_vfs;
+
+	if (!dev->dev.parent || ((ext_filter_mask & RTEXT_FILTER_VF) == 0))
+		return 0;
+
+	num_vfs = dev_num_vf(dev->dev.parent);
+	if (nla_put_u32(skb, IFLA_NUM_VF, num_vfs))
+		return -EMSGSIZE;
+
+	if (!dev->netdev_ops->ndo_get_vf_config)
+		return 0;
+
+	vfinfo = nla_nest_start(skb, IFLA_VFINFO_LIST);
+	if (!vfinfo)
+		return -EMSGSIZE;
+
+	for (i = 0; i < num_vfs; i++) {
+		if (rtnl_fill_vfinfo(skb, dev, i, vfinfo))
+			return -EMSGSIZE;
+	}
+
+	nla_nest_end(skb, vfinfo);
+	return 0;
+}
+
 static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 {
 	struct rtnl_link_ifmap map;
@@ -1355,6 +1385,23 @@ static noinline int nla_put_ifalias(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static noinline int rtnl_fill_link_netnsid(struct sk_buff *skb,
+					   const struct net_device *dev)
+{
+	if (dev->rtnl_link_ops && dev->rtnl_link_ops->get_link_net) {
+		struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
+
+		if (!net_eq(dev_net(dev), link_net)) {
+			int id = peernet2id_alloc(dev_net(dev), link_net);
+
+			if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
+				return -EMSGSIZE;
+		}
+	}
+
+	return 0;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask,
@@ -1428,27 +1475,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_fill_stats(skb, dev))
 		goto nla_put_failure;
 
-	if (dev->dev.parent && (ext_filter_mask & RTEXT_FILTER_VF) &&
-	    nla_put_u32(skb, IFLA_NUM_VF, dev_num_vf(dev->dev.parent)))
+	if (rtnl_fill_vf(skb, dev, ext_filter_mask))
 		goto nla_put_failure;
 
-	if (dev->netdev_ops->ndo_get_vf_config && dev->dev.parent &&
-	    ext_filter_mask & RTEXT_FILTER_VF) {
-		int i;
-		struct nlattr *vfinfo;
-		int num_vfs = dev_num_vf(dev->dev.parent);
-
-		vfinfo = nla_nest_start(skb, IFLA_VFINFO_LIST);
-		if (!vfinfo)
-			goto nla_put_failure;
-		for (i = 0; i < num_vfs; i++) {
-			if (rtnl_fill_vfinfo(skb, dev, i, vfinfo))
-				goto nla_put_failure;
-		}
-
-		nla_nest_end(skb, vfinfo);
-	}
-
 	if (rtnl_port_fill(skb, dev, ext_filter_mask))
 		goto nla_put_failure;
 
@@ -1460,17 +1489,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			goto nla_put_failure;
 	}
 
-	if (dev->rtnl_link_ops &&
-	    dev->rtnl_link_ops->get_link_net) {
-		struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
-
-		if (!net_eq(dev_net(dev), link_net)) {
-			int id = peernet2id_alloc(dev_net(dev), link_net);
-
-			if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
-				goto nla_put_failure;
-		}
-	}
+	if (rtnl_fill_link_netnsid(skb, dev))
+		goto nla_put_failure;
 
 	if (!(af_spec = nla_nest_start(skb, IFLA_AF_SPEC)))
 		goto nla_put_failure;
-- 
2.13.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox