Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()
From: Michael S. Tsirkin @ 2017-09-26 19:13 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel, kvm
In-Reply-To: <1506067355-5771-4-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:33PM +0800, Jason Wang wrote:
> This patch introduces a helper which just increase the used idx. This
> will be used in pair with vhost_prefetch_desc_indices() by batching
> code.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 33 +++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |  1 +
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 8424166d..6532cda 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2178,6 +2178,39 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
>  }
>  EXPORT_SYMBOL_GPL(vhost_add_used);
>  
> +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n)
> +{
> +	u16 old, new;
> +
> +	old = vq->last_used_idx;
> +	new = (vq->last_used_idx += n);
> +	/* If the driver never bothers to signal in a very long while,
> +	 * used index might wrap around. If that happens, invalidate
> +	 * signalled_used index we stored. TODO: make sure driver
> +	 * signals at least once in 2^16 and remove this.
> +	 */
> +	if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
> +		vq->signalled_used_valid = false;
> +
> +	/* Make sure buffer is written before we update index. */
> +	smp_wmb();
> +	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> +			   &vq->used->idx)) {
> +		vq_err(vq, "Failed to increment used idx");
> +		return -EFAULT;
> +	}
> +	if (unlikely(vq->log_used)) {
> +		/* Log used index update. */
> +		log_write(vq->log_base,
> +			  vq->log_addr + offsetof(struct vring_used, idx),
> +			  sizeof(vq->used->idx));
> +		if (vq->log_ctx)
> +			eventfd_signal(vq->log_ctx, 1);
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(vhost_add_used_idx);
> +
>  static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>  			    struct vring_used_elem *heads,
>  			    unsigned count)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 16c2cb6..5dd6c05 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>  
>  int vhost_vq_init_access(struct vhost_virtqueue *);
> +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);
>  int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
>  int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
>  		     unsigned count);

Please change the API to hide the fact that there's an index that needs
to be updated.

> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Jesper Dangaard Brouer @ 2017-09-26 19:13 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: brouer, davem, alexei.starovoitov, john.fastabend,
	peter.waskiewicz.jr, jakub.kicinski, netdev, Andy Gospodarek
In-Reply-To: <458f9c13ab58abb1a15627906d03c33c42b02a7c.1506297988.git.daniel@iogearbox.net>

On Mon, 25 Sep 2017 02:25:51 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> This work enables generic transfer of metadata from XDP into skb. The
> basic idea is that we can make use of the fact that the resulting skb
> must be linear and already comes with a larger headroom for supporting
> bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
> on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
> for adjusting a new pointer called xdp->data_meta. Thus, the packet has
> a flexible and programmable room for meta data, followed by the actual
> packet data. struct xdp_buff is therefore laid out that we first point
> to data_hard_start, then data_meta directly prepended to data followed
> by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
> account whether we have meta data already prepended and if so, memmove()s
> this along with the given offset provided there's enough room.
> 
> [...] The scratch space at the head
> of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
> yet supporting xdp->data_meta can simply be set up with xdp->data_meta
> as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
> such that the subsequent match against xdp->data for later access is
> guaranteed to fail.

So, xdp->meta_data is placed just before the packet xdp->data starts.

I'm currently implementing a cpumap type, that transfers raw XDP frames
to another CPU, and the SKB is allocated on the remote CPU.  (It
actually works extremely well).  

For transferring info I need, I'm currently using xdp->data_hard_start
(the top/start of the xdp page).  Which should be compatible with your
approach, right?

The info I need:

 struct xdp_pkt {
	void *data;
	u16 len;
	u16 headroom;
	struct net_device *dev_rx;
 };

When I enqueue the xdp packet I do the following:

 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
	struct net_device *dev_rx)
 {
	struct xdp_pkt *xdp_pkt;
	int headroom;

	/* Convert xdp_buff to xdp_pkt */
	headroom = xdp->data - xdp->data_hard_start;
	if (headroom < sizeof(*xdp_pkt))
		return -EOVERFLOW;
	xdp_pkt = xdp->data_hard_start;
	xdp_pkt->data = xdp->data;
	xdp_pkt->len  = xdp->data_end - xdp->data;
	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);

	/* Info needed when constructing SKB on remote CPU */
	xdp_pkt->dev_rx = dev_rx;

	bq_enqueue(rcpu, xdp_pkt);
	return 0;
 }

On the remote CPU dequeueing the packet, I'm doing the following.  As
you can see I'm still lacking some meta-data, that would be nice to
also transfer.  Could I use your infrastructure for that?

 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
					  struct xdp_pkt *xdp_pkt)
 {
	unsigned int truesize;
	void *pkt_data_start;
	struct sk_buff *skb;

	/* TODO: rcpu could provide truesize, it's static per RX-ring */
	truesize = 2048;

	// pkt_data_start = xdp_pkt + sizeof(*xdp_pkt);
	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;

	/* Need to adjust "truesize" for skb_shared_info to get proper
	 * placed, to take into account that xdp_pkt is using part of
	 * headroom
	 */
	skb = build_skb(pkt_data_start, truesize - sizeof(*xdp_pkt));
	if (!skb)
		return NULL;

	skb_reserve(skb, xdp_pkt->headroom);
	__skb_put(skb, xdp_pkt->len);

	// skb_record_rx_queue(skb, rx_ring->queue_index);
	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);

	// How much does csum matter? 
 //	skb->ip_summed = CHECKSUM_UNNECESSARY; // Try to fake it...

	// Does setting skb_set_hash()) matter?
 //	__skb_set_hash(skb, 42, true, false); // Say it is software
 //	__skb_set_hash(skb, 42, false, true); // Say it is hardware

	// Do we lack setting rx_queue... it doesn't seem to matter
 //	skb_record_rx_queue(skb, 0);

	return skb;
 }

(I'll send out some patches soonish, hopefully tomorrow... to show in
more details what I'm doing)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH iproute2 1/1] ip: initialize FILE pointer in ip-monitor
From: Roman Mashak @ 2017-09-26 19:13 UTC (permalink / raw)
  To: stephen; +Cc: netdev, Roman Mashak

Since FILE *_fp was not explicitly initialized, all the consequent print_*()
calls were failing.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 ip/ipmonitor.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 3171d47..f4d502a 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -284,6 +284,9 @@ int do_ipmonitor(int argc, char **argv)
 	if (lnsid) {
 		groups |= nl_mgrp(RTNLGRP_NSID);
 	}
+
+	new_json_obj(json, stdout);
+
 	if (file) {
 		FILE *fp;
 		int err;
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index
From: Michael S. Tsirkin @ 2017-09-26 19:19 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel, kvm
In-Reply-To: <1506067355-5771-3-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:32PM +0800, Jason Wang wrote:
> This patch introduces vhost_prefetch_desc_indices() which could batch
> descriptor indices fetching and used ring updating. This intends to
> reduce the cache misses of indices fetching and updating and reduce
> cache line bounce when virtqueue is almost full. copy_to_user() was
> used in order to benefit from modern cpus that support fast string
> copy. Batched virtqueue processing will be the first user.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |  3 +++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f87ec75..8424166d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
>  
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> +				struct vring_used_elem *heads,
> +				u16 num, bool used_update)

why do you need to combine used update with prefetch?

> +{
> +	int ret, ret2;
> +	u16 last_avail_idx, last_used_idx, total, copied;
> +	__virtio16 avail_idx;
> +	struct vring_used_elem __user *used;
> +	int i;
> +
> +	if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
> +		vq_err(vq, "Failed to access avail idx at %p\n",
> +		       &vq->avail->idx);
> +		return -EFAULT;
> +	}
> +	last_avail_idx = vq->last_avail_idx & (vq->num - 1);
> +	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +	total = vq->avail_idx - vq->last_avail_idx;
> +	ret = total = min(total, num);
> +
> +	for (i = 0; i < ret; i++) {
> +		ret2 = vhost_get_avail(vq, heads[i].id,
> +				      &vq->avail->ring[last_avail_idx]);
> +		if (unlikely(ret2)) {
> +			vq_err(vq, "Failed to get descriptors\n");
> +			return -EFAULT;
> +		}
> +		last_avail_idx = (last_avail_idx + 1) & (vq->num - 1);
> +	}
> +
> +	if (!used_update)
> +		return ret;
> +
> +	last_used_idx = vq->last_used_idx & (vq->num - 1);
> +	while (total) {
> +		copied = min((u16)(vq->num - last_used_idx), total);
> +		ret2 = vhost_copy_to_user(vq,
> +					  &vq->used->ring[last_used_idx],
> +					  &heads[ret - total],
> +					  copied * sizeof(*used));
> +
> +		if (unlikely(ret2)) {
> +			vq_err(vq, "Failed to update used ring!\n");
> +			return -EFAULT;
> +		}
> +
> +		last_used_idx = 0;
> +		total -= copied;
> +	}
> +
> +	/* Only get avail ring entries after they have been exposed by guest. */
> +	smp_rmb();

Barrier before return is a very confusing API. I guess it's designed to
be used in a specific way to make it necessary - but what is it?


> +	return ret;
> +}
> +EXPORT_SYMBOL(vhost_prefetch_desc_indices);
>  
>  static int __init vhost_init(void)
>  {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 39ff897..16c2cb6 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -228,6 +228,9 @@ ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>  			     struct iov_iter *from);
>  int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> +				struct vring_used_elem *heads,
> +				u16 num, bool used_update);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
>  		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing
From: Michael S. Tsirkin @ 2017-09-26 19:25 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel, kvm
In-Reply-To: <1506067355-5771-6-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
> This patch implements basic batched processing of tx virtqueue by
> prefetching desc indices and updating used ring in a batch. For
> non-zerocopy case, vq->heads were used for storing the prefetched
> indices and updating used ring. It is also a requirement for doing
> more batching on top. For zerocopy case and for simplicity, batched
> processing were simply disabled by only fetching and processing one
> descriptor at a time, this could be optimized in the future.
> 
> XDP_DROP (without touching skb) on tun (with Moongen in guest) with
> zercopy disabled:
> 
> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
> Before: 3.20Mpps
> After:  3.90Mpps (+22%)
> 
> No differences were seen with zerocopy enabled.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

So where is the speedup coming from? I'd guess the ring is
hot in cache, it's faster to access it in one go, then
pass many packets to net stack. Is that right?

Another possibility is better code cache locality.

So how about this patchset is refactored:

1. use existing APIs just first get packets then
   transmit them all then use them all
2. add new APIs and move the loop into vhost core
   for more speedups



> ---
>  drivers/vhost/net.c   | 215 ++++++++++++++++++++++++++++----------------------
>  drivers/vhost/vhost.c |   2 +-
>  2 files changed, 121 insertions(+), 96 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c89640e..c439892 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -408,27 +408,25 @@ static int vhost_net_enable_vq(struct vhost_net *n,
>  	return vhost_poll_start(poll, sock->file);
>  }
>  
> -static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
> -				    struct vhost_virtqueue *vq,
> -				    struct iovec iov[], unsigned int iov_size,
> -				    unsigned int *out_num, unsigned int *in_num)
> +static bool vhost_net_tx_avail(struct vhost_net *net,
> +			       struct vhost_virtqueue *vq)
>  {
>  	unsigned long uninitialized_var(endtime);
> -	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> -				  out_num, in_num, NULL, NULL);
>  
> -	if (r == vq->num && vq->busyloop_timeout) {
> -		preempt_disable();
> -		endtime = busy_clock() + vq->busyloop_timeout;
> -		while (vhost_can_busy_poll(vq->dev, endtime) &&
> -		       vhost_vq_avail_empty(vq->dev, vq))
> -			cpu_relax();
> -		preempt_enable();
> -		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> -				      out_num, in_num, NULL, NULL);
> -	}
> +	if (!vq->busyloop_timeout)
> +		return false;
>  
> -	return r;
> +	if (!vhost_vq_avail_empty(vq->dev, vq))
> +		return true;
> +
> +	preempt_disable();
> +	endtime = busy_clock() + vq->busyloop_timeout;
> +	while (vhost_can_busy_poll(vq->dev, endtime) &&
> +		vhost_vq_avail_empty(vq->dev, vq))
> +		cpu_relax();
> +	preempt_enable();
> +
> +	return !vhost_vq_avail_empty(vq->dev, vq);
>  }
>  
>  static bool vhost_exceeds_maxpend(struct vhost_net *net)
> @@ -446,8 +444,9 @@ static void handle_tx(struct vhost_net *net)
>  {
>  	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>  	struct vhost_virtqueue *vq = &nvq->vq;
> +	struct vring_used_elem used, *heads = vq->heads;
>  	unsigned out, in;
> -	int head;
> +	int avails, head;
>  	struct msghdr msg = {
>  		.msg_name = NULL,
>  		.msg_namelen = 0,
> @@ -461,6 +460,7 @@ static void handle_tx(struct vhost_net *net)
>  	struct socket *sock;
>  	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>  	bool zcopy, zcopy_used;
> +	int i, batched = VHOST_NET_BATCH;
>  
>  	mutex_lock(&vq->mutex);
>  	sock = vq->private_data;
> @@ -475,6 +475,12 @@ static void handle_tx(struct vhost_net *net)
>  	hdr_size = nvq->vhost_hlen;
>  	zcopy = nvq->ubufs;
>  
> +	/* Disable zerocopy batched fetching for simplicity */
> +	if (zcopy) {
> +		heads = &used;
> +		batched = 1;
> +	}
> +
>  	for (;;) {
>  		/* Release DMAs done buffers first */
>  		if (zcopy)
> @@ -486,95 +492,114 @@ static void handle_tx(struct vhost_net *net)
>  		if (unlikely(vhost_exceeds_maxpend(net)))
>  			break;
>  
> -		head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
> -						ARRAY_SIZE(vq->iov),
> -						&out, &in);
> +		avails = vhost_prefetch_desc_indices(vq, heads, batched, !zcopy);
>  		/* On error, stop handling until the next kick. */
> -		if (unlikely(head < 0))
> +		if (unlikely(avails < 0))
>  			break;
> -		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> -		if (head == vq->num) {
> +		/* Nothing new?  Busy poll for a while or wait for
> +		 * eventfd to tell us they refilled. */
> +		if (!avails) {
> +			if (vhost_net_tx_avail(net, vq))
> +				continue;
>  			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>  				vhost_disable_notify(&net->dev, vq);
>  				continue;
>  			}
>  			break;
>  		}
> -		if (in) {
> -			vq_err(vq, "Unexpected descriptor format for TX: "
> -			       "out %d, int %d\n", out, in);
> -			break;
> -		}
> -		/* Skip header. TODO: support TSO. */
> -		len = iov_length(vq->iov, out);
> -		iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
> -		iov_iter_advance(&msg.msg_iter, hdr_size);
> -		/* Sanity check */
> -		if (!msg_data_left(&msg)) {
> -			vq_err(vq, "Unexpected header len for TX: "
> -			       "%zd expected %zd\n",
> -			       len, hdr_size);
> -			break;
> -		}
> -		len = msg_data_left(&msg);
> -
> -		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
> -				   && (nvq->upend_idx + 1) % UIO_MAXIOV !=
> -				      nvq->done_idx
> -				   && vhost_net_tx_select_zcopy(net);
> -
> -		/* use msg_control to pass vhost zerocopy ubuf info to skb */
> -		if (zcopy_used) {
> -			struct ubuf_info *ubuf;
> -			ubuf = nvq->ubuf_info + nvq->upend_idx;
> -
> -			vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
> -			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
> -			ubuf->callback = vhost_zerocopy_callback;
> -			ubuf->ctx = nvq->ubufs;
> -			ubuf->desc = nvq->upend_idx;
> -			refcount_set(&ubuf->refcnt, 1);
> -			msg.msg_control = ubuf;
> -			msg.msg_controllen = sizeof(ubuf);
> -			ubufs = nvq->ubufs;
> -			atomic_inc(&ubufs->refcount);
> -			nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
> -		} else {
> -			msg.msg_control = NULL;
> -			ubufs = NULL;
> -		}
> +		for (i = 0; i < avails; i++) {
> +			head = __vhost_get_vq_desc(vq, vq->iov,
> +						   ARRAY_SIZE(vq->iov),
> +						   &out, &in, NULL, NULL,
> +					       vhost16_to_cpu(vq, heads[i].id));
> +			if (in) {
> +				vq_err(vq, "Unexpected descriptor format for "
> +					   "TX: out %d, int %d\n", out, in);
> +				goto out;
> +			}
>  
> -		total_len += len;
> -		if (total_len < VHOST_NET_WEIGHT &&
> -		    !vhost_vq_avail_empty(&net->dev, vq) &&
> -		    likely(!vhost_exceeds_maxpend(net))) {
> -			msg.msg_flags |= MSG_MORE;
> -		} else {
> -			msg.msg_flags &= ~MSG_MORE;
> -		}
> +			/* Skip header. TODO: support TSO. */
> +			len = iov_length(vq->iov, out);
> +			iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
> +			iov_iter_advance(&msg.msg_iter, hdr_size);
> +			/* Sanity check */
> +			if (!msg_data_left(&msg)) {
> +				vq_err(vq, "Unexpected header len for TX: "
> +					"%zd expected %zd\n",
> +					len, hdr_size);
> +				goto out;
> +			}
> +			len = msg_data_left(&msg);
>  
> -		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> -		err = sock->ops->sendmsg(sock, &msg, len);
> -		if (unlikely(err < 0)) {
> +			zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
> +				     && (nvq->upend_idx + 1) % UIO_MAXIOV !=
> +					nvq->done_idx
> +				     && vhost_net_tx_select_zcopy(net);
> +
> +			/* use msg_control to pass vhost zerocopy ubuf
> +			 * info to skb
> +			 */
>  			if (zcopy_used) {
> -				vhost_net_ubuf_put(ubufs);
> -				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
> -					% UIO_MAXIOV;
> +				struct ubuf_info *ubuf;
> +				ubuf = nvq->ubuf_info + nvq->upend_idx;
> +
> +				vq->heads[nvq->upend_idx].id =
> +					cpu_to_vhost32(vq, head);
> +				vq->heads[nvq->upend_idx].len =
> +					VHOST_DMA_IN_PROGRESS;
> +				ubuf->callback = vhost_zerocopy_callback;
> +				ubuf->ctx = nvq->ubufs;
> +				ubuf->desc = nvq->upend_idx;
> +				refcount_set(&ubuf->refcnt, 1);
> +				msg.msg_control = ubuf;
> +				msg.msg_controllen = sizeof(ubuf);
> +				ubufs = nvq->ubufs;
> +				atomic_inc(&ubufs->refcount);
> +				nvq->upend_idx =
> +					(nvq->upend_idx + 1) % UIO_MAXIOV;
> +			} else {
> +				msg.msg_control = NULL;
> +				ubufs = NULL;
> +			}
> +
> +			total_len += len;
> +			if (total_len < VHOST_NET_WEIGHT &&
> +				!vhost_vq_avail_empty(&net->dev, vq) &&
> +				likely(!vhost_exceeds_maxpend(net))) {
> +				msg.msg_flags |= MSG_MORE;
> +			} else {
> +				msg.msg_flags &= ~MSG_MORE;
> +			}
> +
> +			/* TODO: Check specific error and bomb out
> +			 * unless ENOBUFS?
> +			 */
> +			err = sock->ops->sendmsg(sock, &msg, len);
> +			if (unlikely(err < 0)) {
> +				if (zcopy_used) {
> +					vhost_net_ubuf_put(ubufs);
> +					nvq->upend_idx =
> +				   ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV;
> +				}
> +				vhost_discard_vq_desc(vq, 1);
> +				goto out;
> +			}
> +			if (err != len)
> +				pr_debug("Truncated TX packet: "
> +					" len %d != %zd\n", err, len);
> +			if (!zcopy) {
> +				vhost_add_used_idx(vq, 1);
> +				vhost_signal(&net->dev, vq);
> +			} else if (!zcopy_used) {
> +				vhost_add_used_and_signal(&net->dev,
> +							  vq, head, 0);
> +			} else
> +				vhost_zerocopy_signal_used(net, vq);
> +			vhost_net_tx_packet(net);
> +			if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> +				vhost_poll_queue(&vq->poll);
> +				goto out;
>  			}
> -			vhost_discard_vq_desc(vq, 1);
> -			break;
> -		}
> -		if (err != len)
> -			pr_debug("Truncated TX packet: "
> -				 " len %d != %zd\n", err, len);
> -		if (!zcopy_used)
> -			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> -		else
> -			vhost_zerocopy_signal_used(net, vq);
> -		vhost_net_tx_packet(net);
> -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> -			vhost_poll_queue(&vq->poll);
> -			break;
>  		}
>  	}
>  out:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 6532cda..8764df5 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -392,7 +392,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
>  		vq->indirect = kmalloc(sizeof *vq->indirect * UIO_MAXIOV,
>  				       GFP_KERNEL);
>  		vq->log = kmalloc(sizeof *vq->log * UIO_MAXIOV, GFP_KERNEL);
> -		vq->heads = kmalloc(sizeof *vq->heads * UIO_MAXIOV, GFP_KERNEL);
> +		vq->heads = kzalloc(sizeof *vq->heads * UIO_MAXIOV, GFP_KERNEL);
>  		if (!vq->indirect || !vq->log || !vq->heads)
>  			goto err_nomem;
>  	}
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next RFC 0/5] batched tx processing in vhost_net
From: Michael S. Tsirkin @ 2017-09-26 19:26 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel, kvm
In-Reply-To: <1506067355-5771-1-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to implement basic tx batched processing. This is
> done by prefetching descriptor indices and update used ring in a
> batch. This intends to speed up used ring updating and improve the
> cache utilization. Test shows about ~22% improvement in tx pss.

> Please review.
> 
> Jason Wang (5):
>   vhost: split out ring head fetching logic
>   vhost: introduce helper to prefetch desc index
>   vhost: introduce vhost_add_used_idx()

Please squash these new APIs into where they are used.
This split-up just makes review harder for me as
I can't figure out how the new APIs are used.

>   vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH

This is ok as a separate patch.

>   vhost_net: basic tx virtqueue batched processing
> 
>  drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
>  drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
>  drivers/vhost/vhost.h |   9 ++
>  3 files changed, 270 insertions(+), 125 deletions(-)
> 
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH v2 2/2] ip_tunnel: add mpls over gre encapsulation
From: Roopa Prabhu @ 2017-09-26 19:31 UTC (permalink / raw)
  To: Amine Kherbouche; +Cc: netdev@vger.kernel.org, xeb, David Lamparter
In-Reply-To: <9f0675b0-cbda-9053-250f-19f3f837b124@6wind.com>

On Tue, Sep 26, 2017 at 10:58 AM, Amine Kherbouche
<amine.kherbouche@6wind.com> wrote:
> Hi Roopa,
>
> Thanks for the feedback, I have just one question:
>
>
> On 09/26/2017 05:15 PM, Roopa Prabhu wrote:
>>>
>>> +static int ipgre_tunnel_encap_add_mpls_ops(void)
>>> > +{
>>> > +       int ret = -1;
>>> > +
>>> > +#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
>>> > +       ret = ip_tunnel_encap_add_ops(&mpls_iptun_ops,
>>> > TUNNEL_ENCAP_MPLS);
>>> > +#endif
>>> > +
>>> > +       return ret;
>>> > +}
>>> > +
>>> > +static void ipgre_tunnel_encap_del_mpls_ops(void)
>>> > +{
>>> > +#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
>>> > +       ip_tunnel_encap_del_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
>>> > +#endif
>>> > +}
>>> > +
>>> >  static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
>>> >                        struct nlmsghdr *nlh, struct net *net, u32
>>> > portid,
>>> >                        unsigned int nlm_flags);
>>> > @@ -2486,6 +2521,10 @@ static int __init mpls_init(void)
>>> >                       0);
>>> >         rtnl_register(PF_MPLS, RTM_GETNETCONF,
>>> > mpls_netconf_get_devconf,
>>> >                       mpls_netconf_dump_devconf, 0);
>>> > +       err = ipgre_tunnel_encap_add_mpls_ops();
>>> > +       if (err)
>>> > +               pr_err("Can't add mpls over gre tunnel ops\n");
>>> > +
>>
>> This will throw an error  if CONFIG_NET_IP_TUNNEL is not enabled.
>> Can you pls put the CONFIG_NET_IP_TUNNEL around
>> ipgre_tunnel_encap_add_mpls_ops ?
>
>
> You want the CONFIG_NET_IP_TUNNEL definition around the declaration of the
> function or when it's called ? or both ? (I can adjust the code for any
> case).

around the declaration like below....,

#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
static int ipgre_tunnel_encap_add_mpls_ops(void)
{
      return ip_tunnel_encap_add_ops(&mpls_iptun_ops,
                                                            TUNNEL_ENCAP_MPLS);
}

static void ipgre_tunnel_encap_del_mpls_ops(void)
{
      ip_tunnel_encap_del_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
}
#else
static int ipgre_tunnel_encap_add_mpls_ops(void)
{
      return 0
}

static void ipgre_tunnel_encap_del_mpls_ops(void)
{
}
#endif

thanks.

^ permalink raw reply

* Re: [PATCH v4 9/9] kasan: rework Kconfig settings
From: Andrey Ryabinin @ 2017-09-26 19:36 UTC (permalink / raw)
  To: Arnd Bergmann, Masahiro Yamada, Michal Marek, Andrew Morton
  Cc: Mauro Carvalho Chehab, Jiri Pirko, Arend van Spriel, Kalle Valo,
	David S. Miller, Alexander Potapenko, Dmitry Vyukov, Kees Cook,
	Geert Uytterhoeven, Greg Kroah-Hartman, linux-media, linux-kernel,
	netdev, linux-wireless, brcm80211-dev-list.pdl,
	brcm80211-dev-list, kasan-dev, linux-kbuild, Jakub Jelinek,
	Martin Liška
In-Reply-To: <20170922212930.620249-10-arnd@arndb.de>

On 09/23/2017 12:29 AM, Arnd Bergmann wrote:
> We get a lot of very large stack frames using gcc-7.0.1 with the default
> -fsanitize-address-use-after-scope --param asan-stack=1 options, which
> can easily cause an overflow of the kernel stack, e.g.
> 
> drivers/gpu/drm/i915/gvt/handlers.c:2407:1: error: the frame size of 31216 bytes is larger than 2048 bytes
> drivers/net/wireless/ralink/rt2x00/rt2800lib.c:5650:1: error: the frame size of 23632 bytes is larger than 2048 bytes
> drivers/scsi/fnic/fnic_trace.c:451:1: error: the frame size of 5152 bytes is larger than 2048 bytes
> fs/btrfs/relocation.c:1202:1: error: the frame size of 4256 bytes is larger than 2048 bytes
> fs/fscache/stats.c:287:1: error: the frame size of 6552 bytes is larger than 2048 bytes
> lib/atomic64_test.c:250:1: error: the frame size of 12616 bytes is larger than 2048 bytes
> mm/vmscan.c:1367:1: error: the frame size of 5080 bytes is larger than 2048 bytes
> net/wireless/nl80211.c:1905:1: error: the frame size of 4232 bytes is larger than 2048 bytes
> 
> To reduce this risk, -fsanitize-address-use-after-scope is now split
> out into a separate CONFIG_KASAN_EXTRA Kconfig option, leading to stack
> frames that are smaller than 2 kilobytes most of the time on x86_64. An
> earlier version of this patch also prevented combining KASAN_EXTRA with
> KASAN_INLINE, but that is no longer necessary with gcc-7.0.1.
> 
> A lot of warnings with KASAN_EXTRA go away if we disable KMEMCHECK,
> as -fsanitize-address-use-after-scope seems to understand the builtin
> memcpy, but adds checking code around an extern memcpy call. I had to work
> around a circular dependency, as DEBUG_SLAB/SLUB depended on !KMEMCHECK,
> while KASAN did it the other way round. Now we handle both the same way
> and make KASAN and KMEMCHECK mutually exclusive.
> 
> All patches to get the frame size below 2048 bytes with CONFIG_KASAN=y
> and CONFIG_KASAN_EXTRA=n have been submitted along with this patch, so
> we can bring back that default now. KASAN_EXTRA=y still causes lots of
> warnings but now defaults to !COMPILE_TEST to disable it in allmodconfig,
> and it remains disabled in all other defconfigs since it is a new option.
> I arbitrarily raise the warning limit for KASAN_EXTRA to 3072 to reduce
> the noise, but an allmodconfig kernel still has around 50 warnings
> on gcc-7.
> 
> I experimented a bit more with smaller stack frames and have another
> follow-up series that reduces the warning limit for 64-bit architectures
> to 1280 bytes (without CONFIG_KASAN).
> 
> With earlier versions of this patch series, I also had patches to
> address the warnings we get with KASAN and/or KASAN_EXTRA, using a
> "noinline_if_stackbloat" annotation. That annotation now got replaced with
> a gcc-8 bugfix (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715)
> and a workaround for older compilers, which means that KASAN_EXTRA is
> now just as bad as before and will lead to an instant stack overflow in
> a few extreme cases.
> 
> This reverts parts of commit commit 3f181b4 ("lib/Kconfig.debug: disable
> -Wframe-larger-than warnings with KASAN=y").
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Daniel Borkmann @ 2017-09-26 19:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: davem, alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev, Andy Gospodarek
In-Reply-To: <20170926211342.0c8e72b0@redhat.com>

On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
[...]
> I'm currently implementing a cpumap type, that transfers raw XDP frames
> to another CPU, and the SKB is allocated on the remote CPU.  (It
> actually works extremely well).

Meaning you let all the XDP_PASS packets get processed on a
different CPU, so you can reserve the whole CPU just for
prefiltering, right? Do you have some numbers to share at
this point, just curious when you mention it works extremely
well.

> For transferring info I need, I'm currently using xdp->data_hard_start
> (the top/start of the xdp page).  Which should be compatible with your
> approach, right?

Should be possible, yes. More below.

> The info I need:
>
>   struct xdp_pkt {
> 	void *data;
> 	u16 len;
> 	u16 headroom;
> 	struct net_device *dev_rx;
>   };
>
> When I enqueue the xdp packet I do the following:
>
>   int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
> 	struct net_device *dev_rx)
>   {
> 	struct xdp_pkt *xdp_pkt;
> 	int headroom;
>
> 	/* Convert xdp_buff to xdp_pkt */
> 	headroom = xdp->data - xdp->data_hard_start;
> 	if (headroom < sizeof(*xdp_pkt))
> 		return -EOVERFLOW;
> 	xdp_pkt = xdp->data_hard_start;
> 	xdp_pkt->data = xdp->data;
> 	xdp_pkt->len  = xdp->data_end - xdp->data;
> 	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
>
> 	/* Info needed when constructing SKB on remote CPU */
> 	xdp_pkt->dev_rx = dev_rx;
>
> 	bq_enqueue(rcpu, xdp_pkt);
> 	return 0;
>   }
>
> On the remote CPU dequeueing the packet, I'm doing the following.  As
> you can see I'm still lacking some meta-data, that would be nice to
> also transfer.  Could I use your infrastructure for that?

There could be multiple options to use it, in case you have a
helper where you look up the CPU in the map and would also store
the meta data, you could use a per-CPU scratch buffer similarly
as we do with struct redirect_info, and move that later e.g.
after program return into xdp->data_hard_start pointer. You
could also reserve that upfront potentially, so it's hidden from
the beginning from the program unless you want the program itself
to fill it out (modulo the pointers). Not all drivers currently
leave room though, I've also seen where xdp->data_hard_start
points directly to xdp->data, so there's 0 headroom available to
use; in such case it could either be treated as a hint and for
those drivers where they just pass the skb up the current CPU or
you would need some other means to move the meta data to the
remote CPU, or potentially just use tail room.

Thanks,
Daniel

^ permalink raw reply

* [PATCH] arp: make arp_hdr_len() return unsigned int
From: Alexey Dobriyan @ 2017-09-26 20:12 UTC (permalink / raw)
  To: davem; +Cc: netdev

Negative ARP header length are not a thing.

Constify arguments while I'm at it.

Space savings:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
	function                        old     new   delta
	arpt_do_table                  1163    1160      -3

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 drivers/net/bonding/bond_main.c |    3 ++-
 include/linux/if_arp.h          |    2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2491,7 +2491,8 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
 	struct slave *curr_active_slave, *curr_arp_slave;
 	unsigned char *arp_ptr;
 	__be32 sip, tip;
-	int alen, is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP);
+	int is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP);
+	unsigned int alen;
 
 	if (!slave_do_arp_validate(bond, slave)) {
 		if ((slave_do_arp_validate_only(bond) && is_arp) ||
--- a/include/linux/if_arp.h
+++ b/include/linux/if_arp.h
@@ -31,7 +31,7 @@ static inline struct arphdr *arp_hdr(const struct sk_buff *skb)
 	return (struct arphdr *)skb_network_header(skb);
 }
 
-static inline int arp_hdr_len(struct net_device *dev)
+static inline unsigned int arp_hdr_len(const struct net_device *dev)
 {
 	switch (dev->type) {
 #if IS_ENABLED(CONFIG_FIREWIRE_NET)

^ permalink raw reply

* [RFC PATCH 10/11] IP: early demux can return an error code
From: Paolo Abeni @ 2017-09-26 20:18 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1506114055.git.pabeni@redhat.com>

it will used by later patch to cope with unconnected sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This patch and the next one did not land on the ML previously
due to PEBKAC, appending now to give the complete picture of
this RFC series.
Side note: currently the early demux lookup for mcast sockets
does not perform source address validation and we need (also)
something like this commit to fix the issue without causing
large performance regressions.
---
 include/net/protocol.h |  4 ++--
 include/net/tcp.h      |  2 +-
 include/net/udp.h      |  2 +-
 net/ipv4/ip_input.c    | 25 +++++++++++++++----------
 net/ipv4/tcp_ipv4.c    |  9 +++++----
 net/ipv4/udp.c         | 11 ++++++-----
 6 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/include/net/protocol.h b/include/net/protocol.h
index 65ba335b0e7e..4fc75f7ae23b 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -39,8 +39,8 @@
 
 /* This is used to register protocols. */
 struct net_protocol {
-	void			(*early_demux)(struct sk_buff *skb);
-	void                    (*early_demux_handler)(struct sk_buff *skb);
+	int			(*early_demux)(struct sk_buff *skb);
+	int			(*early_demux_handler)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
 	void			(*err_handler)(struct sk_buff *skb, u32 info);
 	unsigned int		no_policy:1,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 49a8a46466f3..cf0bb918c52d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -345,7 +345,7 @@ void tcp_v4_err(struct sk_buff *skb, u32);
 
 void tcp_shutdown(struct sock *sk, int how);
 
-void tcp_v4_early_demux(struct sk_buff *skb);
+int tcp_v4_early_demux(struct sk_buff *skb);
 int tcp_v4_rcv(struct sk_buff *skb);
 
 int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
diff --git a/include/net/udp.h b/include/net/udp.h
index 12dfbfe2e2d7..6c759c8594e2 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -259,7 +259,7 @@ static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int flags,
 	return __skb_recv_udp(sk, flags, noblock, &peeked, &off, err);
 }
 
-void udp_v4_early_demux(struct sk_buff *skb);
+int udp_v4_early_demux(struct sk_buff *skb);
 bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 5690ef09da28..f172be87674f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -311,9 +311,10 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
-	struct rtable *rt;
+	int (*edemux)(struct sk_buff *skb);
 	struct net_device *dev = skb->dev;
-	void (*edemux)(struct sk_buff *skb);
+	struct rtable *rt;
+	int err;
 
 	/* if ingress device is enslaved to an L3 master device pass the
 	 * skb to its handler for processing
@@ -331,7 +332,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 
 		ipprot = rcu_dereference(inet_protos[protocol]);
 		if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) {
-			edemux(skb);
+			err = edemux(skb);
+			if (unlikely(err))
+				goto drop_error;
 			/* must reload iph, skb->head might have changed */
 			iph = ip_hdr(skb);
 		}
@@ -342,13 +345,10 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	 *	how the packet travels inside Linux networking.
 	 */
 	if (!skb_valid_dst(skb)) {
-		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					       iph->tos, dev);
-		if (unlikely(err)) {
-			if (err == -EXDEV)
-				__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
-			goto drop;
-		}
+		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					   iph->tos, dev);
+		if (unlikely(err))
+			goto drop_error;
 	}
 
 	/* Since the sk has no reference to the socket, we must
@@ -407,6 +407,11 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 drop:
 	kfree_skb(skb);
 	return NET_RX_DROP;
+
+drop_error:
+	if (err == -EXDEV)
+		__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
+	goto drop;
 }
 
 /*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d9416b5162bc..85164d4d3e53 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1503,23 +1503,23 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_v4_do_rcv);
 
-void tcp_v4_early_demux(struct sk_buff *skb)
+int tcp_v4_early_demux(struct sk_buff *skb)
 {
 	const struct iphdr *iph;
 	const struct tcphdr *th;
 	struct sock *sk;
 
 	if (skb->pkt_type != PACKET_HOST)
-		return;
+		return 0;
 
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct tcphdr)))
-		return;
+		return 0;
 
 	iph = ip_hdr(skb);
 	th = tcp_hdr(skb);
 
 	if (th->doff < sizeof(struct tcphdr) / 4)
-		return;
+		return 0;
 
 	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
 				       iph->saddr, th->source,
@@ -1538,6 +1538,7 @@ void tcp_v4_early_demux(struct sk_buff *skb)
 				skb_dst_set_noref(skb, dst);
 		}
 	}
+	return 0;
 }
 
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5cbbd78024dc..b7202a15f360 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2215,7 +2215,7 @@ void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie)
 }
 EXPORT_SYMBOL_GPL(udp_set_skb_rx_dst);
 
-void udp_v4_early_demux(struct sk_buff *skb)
+int udp_v4_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
 	int dif = skb->dev->ifindex;
@@ -2227,7 +2227,7 @@ void udp_v4_early_demux(struct sk_buff *skb)
 
 	/* validate the packet */
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct udphdr)))
-		return;
+		return 0;
 
 	iph = ip_hdr(skb);
 	uh = udp_hdr(skb);
@@ -2237,14 +2237,14 @@ void udp_v4_early_demux(struct sk_buff *skb)
 		struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
 
 		if (!in_dev)
-			return;
+			return 0;
 
 		/* we are supposed to accept bcast packets */
 		if (skb->pkt_type == PACKET_MULTICAST) {
 			ours = ip_check_mc_rcu(in_dev, iph->daddr, iph->saddr,
 					       iph->protocol);
 			if (!ours)
-				return;
+				return 0;
 		}
 
 		sk = __udp4_lib_mcast_demux_lookup(net, uh->dest, iph->daddr,
@@ -2256,11 +2256,12 @@ void udp_v4_early_demux(struct sk_buff *skb)
 	}
 
 	if (!sk)
-		return;
+		return 0;
 
 	skb_set_noref_sk(skb, sk);
 	if (udp_use_rx_dst_cache(sk, skb))
 		udp_set_skb_rx_dst(sk, skb, 0);
+	return 0;
 }
 
 int udp_rcv(struct sk_buff *skb)
-- 
2.13.5

^ permalink raw reply related

* [RFC PATCH 11/11] udp: dst lookup in early demux for unconnected sockets
From: Paolo Abeni @ 2017-09-26 20:18 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1506114055.git.pabeni@redhat.com>

Now that all the helper are in place we can leverage them to
fetch the appropriate destination for unconnected sockets,
looking up the ifaddr cache if not local bind are not allowed.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 10 ++++++++--
 net/ipv6/udp.c |  9 ++++++++-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b7202a15f360..87d48101efe2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2259,9 +2259,15 @@ int udp_v4_early_demux(struct sk_buff *skb)
 		return 0;
 
 	skb_set_noref_sk(skb, sk);
-	if (udp_use_rx_dst_cache(sk, skb))
+	if (udp_use_rx_dst_cache(sk, skb)) {
 		udp_set_skb_rx_dst(sk, skb, 0);
-	return 0;
+		return 0;
+	}
+
+	if (net->ipv4.sysctl_ip_nonlocal_bind)
+		return 0;
+
+	return ip_route_try_local_rcu(net, skb, iph);
 }
 
 int udp_rcv(struct sk_buff *skb)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 67d340679c3a..360d54bc2359 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -935,8 +935,15 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	if (udp6_use_rx_dst_cache(sk))
+	if (udp6_use_rx_dst_cache(sk)) {
 		udp_set_skb_rx_dst(sk, skb, inet6_sk(sk)->rx_dst_cookie);
+		return;
+	}
+
+	if (net->ipv6.sysctl.ip_nonlocal_bind)
+		return;
+
+	ip6_route_try_local_rcu_bh(net, skb);
 }
 
 static __inline__ int udpv6_rcv(struct sk_buff *skb)
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH] net: stmmac: Meet alignment requirements for DMA
From: David Miller @ 2017-09-26 20:33 UTC (permalink / raw)
  To: paul.burton
  Cc: matt.redfearn, netdev, alexandre.torgue, peppe.cavallaro,
	linux-kernel, linux-mips, james.hogan
In-Reply-To: <71740323.RRBhnhefTi@np-p-burton>

From: Paul Burton <paul.burton@imgtec.com>
Date: Tue, 26 Sep 2017 11:48:19 -0700

>> The device writes into only the bytes it was given access to, which
>> starts at the DMA address.
> 
> OK - still fine but *only* if we don't write to anything that happens to be 
> part of the cache lines that are only partially covered by the DMA mapping & 
> make them dirty. If we do then any writeback of those lines will clobber data 
> the device wrote, and any invalidation of them will discard data the CPU 
> wrote.
> 
> How would you have us ensure that such writes don't occur?
> 
> This really does not feel to me like something to leave up to drivers without 
> any means of checking or enforcing correctness. The requirement to align DMA 
> mappings at cache-line boundaries, as outlined in Documentation/DMA-API.txt, 
> allows this to be enforced. If you want to ignore this requirement then there 
> is no clear way to verify that we aren't writing to cache lines involved in a 
> DMA transfer whilst the transfer is occurring, and no sane way to handle those 
> cache lines if we do.

The memory immediately before skb->data and immediately after
skb->data+skb->len will not be accessed.  This is guaranteed.

I will request something exactly one more time to give you the benfit
of the doubt that you want to show me why this is _really_ a problem
and not a problem only in theory.

Please show me an exact sequence, with current code, in a current driver
that uses the DMA API properly, where corruption really can occur.

The new restriction is absolutely not reasonable for this use case.
It it furthermore impractical to require the 200+ drivers the use this
technique to allocate and map networking buffers to abide by this new
rule.  Many platforms with even worse cache problems that MIPS handle
this just fine.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 0/6] BPF metadata for direct access
From: David Miller @ 2017-09-26 20:37 UTC (permalink / raw)
  To: daniel
  Cc: alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev
In-Reply-To: <cover.1506297988.git.daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 25 Sep 2017 02:25:49 +0200

> This work enables generic transfer of metadata from XDP into skb,
> meaning the packet has a flexible and programmable room for meta
> data, which can later be used by BPF to set various skb members
> when passing up the stack. For details, please see second patch.
> Support has been implemented and tested with two drivers, and
> should be straight forward to add to other drivers as well which
> properly support head adjustment already.

Series applied, thanks.

We're accumulating features again and leaving some drivers behind,
let's not allow this to get out of hand for XDP_REDIRECT and this
stuff.

Thanks.

^ permalink raw reply

* Re: [PATCH net v2 0/4] net:ethernet:aquantia: Atlantic driver bugfixes und improvements
From: David Miller @ 2017-09-26 20:39 UTC (permalink / raw)
  To: igor.russkikh
  Cc: netdev, darcari, pavel.belous, Nadezhda.Krupnina, simon.edelhaus
In-Reply-To: <cover.1506324091.git.igor.russkikh@aquantia.com>

From: Igor Russkikh <igor.russkikh@aquantia.com>
Date: Mon, 25 Sep 2017 10:48:46 +0300

> This series contains bugfixes for aQuantia Atlantic driver.
> 
> Changes in v2:
> Review comments applied:
> - min_mtu set removed
> - extra mtu range check is removed
> - err codes handling improved

I'll apply this series but simply "aquantia: " is a sufficient subsystem
prefix all by itself so I have edited your Subject lines to use this
before applying.

Even if multiple subsystem tags were appropriate here, they should be
separated by spaces like "net: ethernet: ...".

Thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/6] BPF metadata for direct access
From: Daniel Borkmann @ 2017-09-26 20:44 UTC (permalink / raw)
  To: David Miller
  Cc: alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev
In-Reply-To: <20170926.133732.1980264928194037130.davem@davemloft.net>

On 09/26/2017 10:37 PM, David Miller wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Mon, 25 Sep 2017 02:25:49 +0200
>
>> This work enables generic transfer of metadata from XDP into skb,
>> meaning the packet has a flexible and programmable room for meta
>> data, which can later be used by BPF to set various skb members
>> when passing up the stack. For details, please see second patch.
>> Support has been implemented and tested with two drivers, and
>> should be straight forward to add to other drivers as well which
>> properly support head adjustment already.
>
> Series applied, thanks.
>
> We're accumulating features again and leaving some drivers behind,
> let's not allow this to get out of hand for XDP_REDIRECT and this
> stuff.

Thanks, I'll see to get it implemented for further drivers as well.

^ permalink raw reply

* Re: [ovs-dev] [PATCH net-next v9] openvswitch: enable NSH support
From: Eric Garver @ 2017-09-26 20:59 UTC (permalink / raw)
  To: Yang, Yi
  Cc: dev@openvswitch.org, netdev@vger.kernel.org, jbenc@redhat.com,
	davem@davemloft.net
In-Reply-To: <20170926050215.GB5896@localhost.localdomain>

On Tue, Sep 26, 2017 at 01:02:15PM +0800, Yang, Yi wrote:
> On Tue, Sep 26, 2017 at 03:28:42AM +0800, Eric Garver wrote:
> > On Mon, Sep 25, 2017 at 10:16:09PM +0800, Yi Yang wrote:
> > > +
> > > +	length = nsh_hdr_len(nsh_hdr);
> > > +	skb_pull(skb, length);
> > 
> > Do you need to verify you can actually pull length bytes? I don't see
> > any guarantee.
> 
> I have added skb length check in pop_nsh, so that can verify this.

That doesn't help other code that may call skb_pop_nsh(). skb_vlan_pop()
calls skb_ensure_writable() which seems like the right thing to do.

^ permalink raw reply

* Re: [PATCH net-next 1/2] tools: rename tools/net directory to tools/bpf
From: Daniel Borkmann @ 2017-09-26 21:04 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: alexei.starovoitov, davem, hannes, dsahern, oss-drivers
In-Reply-To: <20170926153522.31500-2-jakub.kicinski@netronome.com>

On 09/26/2017 05:35 PM, Jakub Kicinski wrote:
> We currently only have BPF tools in the tools/net directory.
> We are about to add more BPF tools there, not necessarily
> networking related, rename the directory and related Makefile
> targets to bpf.
>
> Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Simon Horman <simon.horman@netronome.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH net-next 2/2] tools: bpf: add bpftool
From: Daniel Borkmann @ 2017-09-26 21:06 UTC (permalink / raw)
  To: Jakub Kicinski, netdev
  Cc: alexei.starovoitov, davem, hannes, dsahern, oss-drivers
In-Reply-To: <20170926153522.31500-3-jakub.kicinski@netronome.com>

On 09/26/2017 05:35 PM, Jakub Kicinski wrote:
> Add a simple tool for querying and updating BPF objects on the system.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Simon Horman <simon.horman@netronome.com>

LGTM, thanks!

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* [PATCH net-next 0/5] net: dsa: use generic slave phydev
From: Vivien Didelot @ 2017-09-26 21:15 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot

DSA currently stores a phy_device pointer in each slave private
structure. This requires to implement our own ethtool ksettings
accessors and such.

This patchset removes the private phy_device in favor of the one
provided in the net_device structure, and thus allows us to use the
generic phy_ethtool_* functions.

Vivien Didelot (5):
  net: dsa: return -ENODEV is there is no slave PHY
  net: dsa: use slave device phydev
  net: dsa: use phy_ethtool_get_link_ksettings
  net: dsa: use phy_ethtool_set_link_ksettings
  net: dsa: use phy_ethtool_nway_reset

 net/dsa/dsa_priv.h |   1 -
 net/dsa/slave.c    | 143 +++++++++++++++++++----------------------------------
 2 files changed, 52 insertions(+), 92 deletions(-)

-- 
2.14.1

^ permalink raw reply

* [PATCH net-next 1/5] net: dsa: return -ENODEV is there is no slave PHY
From: Vivien Didelot @ 2017-09-26 21:15 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20170926211535.21273-1-vivien.didelot@savoirfairelinux.com>

Instead of returning -EOPNOTSUPP when a slave device has no PHY,
directly return -ENODEV as ethtool and phylib do.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 net/dsa/slave.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index bd51ef56ec5b..79c5a0cd9923 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -266,10 +266,10 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 
-	if (p->phy != NULL)
-		return phy_mii_ioctl(p->phy, ifr, cmd);
+	if (!p->phy)
+		return -ENODEV;
 
-	return -EOPNOTSUPP;
+	return phy_mii_ioctl(p->phy, ifr, cmd);
 }
 
 static int dsa_slave_port_attr_set(struct net_device *dev,
@@ -429,7 +429,7 @@ dsa_slave_get_link_ksettings(struct net_device *dev,
 	struct dsa_slave_priv *p = netdev_priv(dev);
 
 	if (!p->phy)
-		return -EOPNOTSUPP;
+		return -ENODEV;
 
 	phy_ethtool_ksettings_get(p->phy, cmd);
 
@@ -442,10 +442,10 @@ dsa_slave_set_link_ksettings(struct net_device *dev,
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 
-	if (p->phy != NULL)
-		return phy_ethtool_ksettings_set(p->phy, cmd);
+	if (!p->phy)
+		return -ENODEV;
 
-	return -EOPNOTSUPP;
+	return phy_ethtool_ksettings_set(p->phy, cmd);
 }
 
 static void dsa_slave_get_drvinfo(struct net_device *dev,
@@ -481,22 +481,22 @@ static int dsa_slave_nway_reset(struct net_device *dev)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 
-	if (p->phy != NULL)
-		return genphy_restart_aneg(p->phy);
+	if (!p->phy)
+		return -ENODEV;
 
-	return -EOPNOTSUPP;
+	return genphy_restart_aneg(p->phy);
 }
 
 static u32 dsa_slave_get_link(struct net_device *dev)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
 
-	if (p->phy != NULL) {
-		genphy_update_link(p->phy);
-		return p->phy->link;
-	}
+	if (!p->phy)
+		return -ENODEV;
 
-	return -EOPNOTSUPP;
+	genphy_update_link(p->phy);
+
+	return p->phy->link;
 }
 
 static int dsa_slave_get_eeprom_len(struct net_device *dev)
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next 4/5] net: dsa: use phy_ethtool_set_link_ksettings
From: Vivien Didelot @ 2017-09-26 21:15 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20170926211535.21273-1-vivien.didelot@savoirfairelinux.com>

Use phy_ethtool_set_link_ksettings now that dsa_slave_set_link_ksettings
does exactly the same.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 net/dsa/slave.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index bb0f64f47ae7..d2b632cae468 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -421,16 +421,6 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
 
 /* ethtool operations *******************************************************/
 
-static int
-dsa_slave_set_link_ksettings(struct net_device *dev,
-			     const struct ethtool_link_ksettings *cmd)
-{
-	if (!dev->phydev)
-		return -ENODEV;
-
-	return phy_ethtool_ksettings_set(dev->phydev, cmd);
-}
-
 static void dsa_slave_get_drvinfo(struct net_device *dev,
 				  struct ethtool_drvinfo *drvinfo)
 {
@@ -910,8 +900,8 @@ static const struct ethtool_ops dsa_slave_ethtool_ops = {
 	.get_wol		= dsa_slave_get_wol,
 	.set_eee		= dsa_slave_set_eee,
 	.get_eee		= dsa_slave_get_eee,
-	.set_link_ksettings	= dsa_slave_set_link_ksettings,
 	.get_link_ksettings	= phy_ethtool_get_link_ksettings,
+	.set_link_ksettings	= phy_ethtool_set_link_ksettings,
 	.get_rxnfc		= dsa_slave_get_rxnfc,
 	.set_rxnfc		= dsa_slave_set_rxnfc,
 };
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next 5/5] net: dsa: use phy_ethtool_nway_reset
From: Vivien Didelot @ 2017-09-26 21:15 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20170926211535.21273-1-vivien.didelot@savoirfairelinux.com>

Use phy_ethtool_nway_reset now that dsa_slave_nway_reset does exactly
the same.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 net/dsa/slave.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index d2b632cae468..bf8800de13c1 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -450,14 +450,6 @@ dsa_slave_get_regs(struct net_device *dev, struct ethtool_regs *regs, void *_p)
 		ds->ops->get_regs(ds, p->dp->index, regs, _p);
 }
 
-static int dsa_slave_nway_reset(struct net_device *dev)
-{
-	if (!dev->phydev)
-		return -ENODEV;
-
-	return genphy_restart_aneg(dev->phydev);
-}
-
 static u32 dsa_slave_get_link(struct net_device *dev)
 {
 	if (!dev->phydev)
@@ -888,7 +880,7 @@ static const struct ethtool_ops dsa_slave_ethtool_ops = {
 	.get_drvinfo		= dsa_slave_get_drvinfo,
 	.get_regs_len		= dsa_slave_get_regs_len,
 	.get_regs		= dsa_slave_get_regs,
-	.nway_reset		= dsa_slave_nway_reset,
+	.nway_reset		= phy_ethtool_nway_reset,
 	.get_link		= dsa_slave_get_link,
 	.get_eeprom_len		= dsa_slave_get_eeprom_len,
 	.get_eeprom		= dsa_slave_get_eeprom,
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next 2/5] net: dsa: use slave device phydev
From: Vivien Didelot @ 2017-09-26 21:15 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20170926211535.21273-1-vivien.didelot@savoirfairelinux.com>

There is no need to store a phy_device in dsa_slave_priv since
net_device already provides one. Simply s/p->phy/dev->phydev/.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 net/dsa/dsa_priv.h |   1 -
 net/dsa/slave.c    | 116 ++++++++++++++++++++++++-----------------------------
 2 files changed, 53 insertions(+), 64 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 0298a0f6a349..eccc62776283 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -79,7 +79,6 @@ struct dsa_slave_priv {
 	 * The phylib phy_device pointer for the PHY connected
 	 * to this port.
 	 */
-	struct phy_device	*phy;
 	phy_interface_t		phy_interface;
 	int			old_link;
 	int			old_pause;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 79c5a0cd9923..4ea1c6eb0da8 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -96,12 +96,12 @@ static int dsa_slave_open(struct net_device *dev)
 			goto clear_allmulti;
 	}
 
-	err = dsa_port_enable(dp, p->phy);
+	err = dsa_port_enable(dp, dev->phydev);
 	if (err)
 		goto clear_promisc;
 
-	if (p->phy)
-		phy_start(p->phy);
+	if (dev->phydev)
+		phy_start(dev->phydev);
 
 	return 0;
 
@@ -124,10 +124,10 @@ static int dsa_slave_close(struct net_device *dev)
 	struct net_device *master = dsa_master_netdev(p);
 	struct dsa_port *dp = p->dp;
 
-	if (p->phy)
-		phy_stop(p->phy);
+	if (dev->phydev)
+		phy_stop(dev->phydev);
 
-	dsa_port_disable(dp, p->phy);
+	dsa_port_disable(dp, dev->phydev);
 
 	dev_mc_unsync(master, dev);
 	dev_uc_unsync(master, dev);
@@ -264,12 +264,10 @@ dsa_slave_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 
 static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
-	struct dsa_slave_priv *p = netdev_priv(dev);
-
-	if (!p->phy)
+	if (!dev->phydev)
 		return -ENODEV;
 
-	return phy_mii_ioctl(p->phy, ifr, cmd);
+	return phy_mii_ioctl(dev->phydev, ifr, cmd);
 }
 
 static int dsa_slave_port_attr_set(struct net_device *dev,
@@ -426,12 +424,10 @@ static int
 dsa_slave_get_link_ksettings(struct net_device *dev,
 			     struct ethtool_link_ksettings *cmd)
 {
-	struct dsa_slave_priv *p = netdev_priv(dev);
-
-	if (!p->phy)
+	if (!dev->phydev)
 		return -ENODEV;
 
-	phy_ethtool_ksettings_get(p->phy, cmd);
+	phy_ethtool_ksettings_get(dev->phydev, cmd);
 
 	return 0;
 }
@@ -440,12 +436,10 @@ static int
 dsa_slave_set_link_ksettings(struct net_device *dev,
 			     const struct ethtool_link_ksettings *cmd)
 {
-	struct dsa_slave_priv *p = netdev_priv(dev);
-
-	if (!p->phy)
+	if (!dev->phydev)
 		return -ENODEV;
 
-	return phy_ethtool_ksettings_set(p->phy, cmd);
+	return phy_ethtool_ksettings_set(dev->phydev, cmd);
 }
 
 static void dsa_slave_get_drvinfo(struct net_device *dev,
@@ -479,24 +473,20 @@ dsa_slave_get_regs(struct net_device *dev, struct ethtool_regs *regs, void *_p)
 
 static int dsa_slave_nway_reset(struct net_device *dev)
 {
-	struct dsa_slave_priv *p = netdev_priv(dev);
-
-	if (!p->phy)
+	if (!dev->phydev)
 		return -ENODEV;
 
-	return genphy_restart_aneg(p->phy);
+	return genphy_restart_aneg(dev->phydev);
 }
 
 static u32 dsa_slave_get_link(struct net_device *dev)
 {
-	struct dsa_slave_priv *p = netdev_priv(dev);
-
-	if (!p->phy)
+	if (!dev->phydev)
 		return -ENODEV;
 
-	genphy_update_link(p->phy);
+	genphy_update_link(dev->phydev);
 
-	return p->phy->link;
+	return dev->phydev->link;
 }
 
 static int dsa_slave_get_eeprom_len(struct net_device *dev)
@@ -631,7 +621,7 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 	int ret;
 
 	/* Port's PHY and MAC both need to be EEE capable */
-	if (!p->phy)
+	if (!dev->phydev)
 		return -ENODEV;
 
 	if (!ds->ops->set_mac_eee)
@@ -642,12 +632,12 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 		return ret;
 
 	if (e->eee_enabled) {
-		ret = phy_init_eee(p->phy, 0);
+		ret = phy_init_eee(dev->phydev, 0);
 		if (ret)
 			return ret;
 	}
 
-	return phy_ethtool_set_eee(p->phy, e);
+	return phy_ethtool_set_eee(dev->phydev, e);
 }
 
 static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
@@ -657,7 +647,7 @@ static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
 	int ret;
 
 	/* Port's PHY and MAC both need to be EEE capable */
-	if (!p->phy)
+	if (!dev->phydev)
 		return -ENODEV;
 
 	if (!ds->ops->get_mac_eee)
@@ -667,7 +657,7 @@ static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
 	if (ret)
 		return ret;
 
-	return phy_ethtool_get_eee(p->phy, e);
+	return phy_ethtool_get_eee(dev->phydev, e);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -976,26 +966,26 @@ static void dsa_slave_adjust_link(struct net_device *dev)
 	struct dsa_switch *ds = p->dp->ds;
 	unsigned int status_changed = 0;
 
-	if (p->old_link != p->phy->link) {
+	if (p->old_link != dev->phydev->link) {
 		status_changed = 1;
-		p->old_link = p->phy->link;
+		p->old_link = dev->phydev->link;
 	}
 
-	if (p->old_duplex != p->phy->duplex) {
+	if (p->old_duplex != dev->phydev->duplex) {
 		status_changed = 1;
-		p->old_duplex = p->phy->duplex;
+		p->old_duplex = dev->phydev->duplex;
 	}
 
-	if (p->old_pause != p->phy->pause) {
+	if (p->old_pause != dev->phydev->pause) {
 		status_changed = 1;
-		p->old_pause = p->phy->pause;
+		p->old_pause = dev->phydev->pause;
 	}
 
 	if (ds->ops->adjust_link && status_changed)
-		ds->ops->adjust_link(ds, p->dp->index, p->phy);
+		ds->ops->adjust_link(ds, p->dp->index, dev->phydev);
 
 	if (status_changed)
-		phy_print_status(p->phy);
+		phy_print_status(dev->phydev);
 }
 
 static int dsa_slave_fixed_link_update(struct net_device *dev,
@@ -1020,17 +1010,18 @@ static int dsa_slave_phy_connect(struct net_device *slave_dev, int addr)
 	struct dsa_slave_priv *p = netdev_priv(slave_dev);
 	struct dsa_switch *ds = p->dp->ds;
 
-	p->phy = mdiobus_get_phy(ds->slave_mii_bus, addr);
-	if (!p->phy) {
+	slave_dev->phydev = mdiobus_get_phy(ds->slave_mii_bus, addr);
+	if (!slave_dev->phydev) {
 		netdev_err(slave_dev, "no phy at %d\n", addr);
 		return -ENODEV;
 	}
 
 	/* Use already configured phy mode */
 	if (p->phy_interface == PHY_INTERFACE_MODE_NA)
-		p->phy_interface = p->phy->interface;
-	return phy_connect_direct(slave_dev, p->phy, dsa_slave_adjust_link,
-				  p->phy_interface);
+		p->phy_interface = slave_dev->phydev->interface;
+
+	return phy_connect_direct(slave_dev, slave_dev->phydev,
+				  dsa_slave_adjust_link, p->phy_interface);
 }
 
 static int dsa_slave_phy_setup(struct net_device *slave_dev)
@@ -1082,22 +1073,23 @@ static int dsa_slave_phy_setup(struct net_device *slave_dev)
 				return ret;
 			}
 		} else {
-			p->phy = of_phy_connect(slave_dev, phy_dn,
-						dsa_slave_adjust_link,
-						phy_flags,
-						p->phy_interface);
+			slave_dev->phydev = of_phy_connect(slave_dev, phy_dn,
+							   dsa_slave_adjust_link,
+							   phy_flags,
+							   p->phy_interface);
 		}
 
 		of_node_put(phy_dn);
 	}
 
-	if (p->phy && phy_is_fixed)
-		fixed_phy_set_link_update(p->phy, dsa_slave_fixed_link_update);
+	if (slave_dev->phydev && phy_is_fixed)
+		fixed_phy_set_link_update(slave_dev->phydev,
+					  dsa_slave_fixed_link_update);
 
 	/* We could not connect to a designated PHY, so use the switch internal
 	 * MDIO bus instead
 	 */
-	if (!p->phy) {
+	if (!slave_dev->phydev) {
 		ret = dsa_slave_phy_connect(slave_dev, p->dp->index);
 		if (ret) {
 			netdev_err(slave_dev, "failed to connect to port %d: %d\n",
@@ -1108,7 +1100,7 @@ static int dsa_slave_phy_setup(struct net_device *slave_dev)
 		}
 	}
 
-	phy_attached_info(p->phy);
+	phy_attached_info(slave_dev->phydev);
 
 	return 0;
 }
@@ -1128,12 +1120,12 @@ int dsa_slave_suspend(struct net_device *slave_dev)
 
 	netif_device_detach(slave_dev);
 
-	if (p->phy) {
-		phy_stop(p->phy);
+	if (slave_dev->phydev) {
+		phy_stop(slave_dev->phydev);
 		p->old_pause = -1;
 		p->old_link = -1;
 		p->old_duplex = -1;
-		phy_suspend(p->phy);
+		phy_suspend(slave_dev->phydev);
 	}
 
 	return 0;
@@ -1141,13 +1133,11 @@ int dsa_slave_suspend(struct net_device *slave_dev)
 
 int dsa_slave_resume(struct net_device *slave_dev)
 {
-	struct dsa_slave_priv *p = netdev_priv(slave_dev);
-
 	netif_device_attach(slave_dev);
 
-	if (p->phy) {
-		phy_resume(p->phy);
-		phy_start(p->phy);
+	if (slave_dev->phydev) {
+		phy_resume(slave_dev->phydev);
+		phy_start(slave_dev->phydev);
 	}
 
 	return 0;
@@ -1240,8 +1230,8 @@ void dsa_slave_destroy(struct net_device *slave_dev)
 	port_dn = p->dp->dn;
 
 	netif_carrier_off(slave_dev);
-	if (p->phy) {
-		phy_disconnect(p->phy);
+	if (slave_dev->phydev) {
+		phy_disconnect(slave_dev->phydev);
 
 		if (of_phy_is_fixed_link(port_dn))
 			of_phy_deregister_fixed_link(port_dn);
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next 3/5] net: dsa: use phy_ethtool_get_link_ksettings
From: Vivien Didelot @ 2017-09-26 21:15 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot
In-Reply-To: <20170926211535.21273-1-vivien.didelot@savoirfairelinux.com>

Use phy_ethtool_get_link_ksettings now that dsa_slave_get_link_ksettings
does exactly the same.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 net/dsa/slave.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 4ea1c6eb0da8..bb0f64f47ae7 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -420,17 +420,6 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
 }
 
 /* ethtool operations *******************************************************/
-static int
-dsa_slave_get_link_ksettings(struct net_device *dev,
-			     struct ethtool_link_ksettings *cmd)
-{
-	if (!dev->phydev)
-		return -ENODEV;
-
-	phy_ethtool_ksettings_get(dev->phydev, cmd);
-
-	return 0;
-}
 
 static int
 dsa_slave_set_link_ksettings(struct net_device *dev,
@@ -921,8 +910,8 @@ static const struct ethtool_ops dsa_slave_ethtool_ops = {
 	.get_wol		= dsa_slave_get_wol,
 	.set_eee		= dsa_slave_set_eee,
 	.get_eee		= dsa_slave_get_eee,
-	.get_link_ksettings	= dsa_slave_get_link_ksettings,
 	.set_link_ksettings	= dsa_slave_set_link_ksettings,
+	.get_link_ksettings	= phy_ethtool_get_link_ksettings,
 	.get_rxnfc		= dsa_slave_get_rxnfc,
 	.set_rxnfc		= dsa_slave_set_rxnfc,
 };
-- 
2.14.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox