Netdev List
 help / color / mirror / Atom feed
* [PATCH v3 1/2] pegasus: fixes URB buffer allocation size;
From: Petko Manolov @ 2016-04-27 11:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, a1291762, johannes, Petko Manolov
In-Reply-To: <1461756290-27421-1-git-send-email-petkan@mip-labs.com>

usb_fill_bulk_urb() receives buffer length parameter 8 bytes larger
than what's allocated by alloc_skb(); This seems to be a problem with
older (pegasus usb-1.1) devices, which may silently return more data
than the maximal packet length.

Reported-by: Lincoln Ramsay <a1291762@gmail.com>
Signed-off-by: Petko Manolov <petkan@mip-labs.com>
---
 drivers/net/usb/pegasus.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
index f840802..f919e20 100644
--- a/drivers/net/usb/pegasus.c
+++ b/drivers/net/usb/pegasus.c
@@ -528,7 +528,7 @@ static void read_bulk_callback(struct urb *urb)
 goon:
 	usb_fill_bulk_urb(pegasus->rx_urb, pegasus->usb,
 			  usb_rcvbulkpipe(pegasus->usb, 1),
-			  pegasus->rx_skb->data, PEGASUS_MTU + 8,
+			  pegasus->rx_skb->data, PEGASUS_MTU,
 			  read_bulk_callback, pegasus);
 	rx_status = usb_submit_urb(pegasus->rx_urb, GFP_ATOMIC);
 	if (rx_status == -ENODEV)
@@ -569,7 +569,7 @@ static void rx_fixup(unsigned long data)
 	}
 	usb_fill_bulk_urb(pegasus->rx_urb, pegasus->usb,
 			  usb_rcvbulkpipe(pegasus->usb, 1),
-			  pegasus->rx_skb->data, PEGASUS_MTU + 8,
+			  pegasus->rx_skb->data, PEGASUS_MTU,
 			  read_bulk_callback, pegasus);
 try_again:
 	status = usb_submit_urb(pegasus->rx_urb, GFP_ATOMIC);
@@ -823,7 +823,7 @@ static int pegasus_open(struct net_device *net)
 
 	usb_fill_bulk_urb(pegasus->rx_urb, pegasus->usb,
 			  usb_rcvbulkpipe(pegasus->usb, 1),
-			  pegasus->rx_skb->data, PEGASUS_MTU + 8,
+			  pegasus->rx_skb->data, PEGASUS_MTU,
 			  read_bulk_callback, pegasus);
 	if ((res = usb_submit_urb(pegasus->rx_urb, GFP_KERNEL))) {
 		if (res == -ENODEV)
-- 
2.8.0.rc3

^ permalink raw reply related

* [PATCH v3 0/2] pegasus: correct buffer & packet sizes
From: Petko Manolov @ 2016-04-27 11:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, a1291762, johannes, Petko Manolov

As noticed by Lincoln Ramsay <a1291762@gmail.com> some old (usb 1.1) Pegasus
based devices may actually return more bytes than the specified in the datasheet
amount.  That would not be a problem if the allocated space for the SKB was
equal to the parameter passed to usb_fill_bulk_urb().  Some poor bugger (i
really hope it was not me, but 'git blame' is useless in this case, so anyway)
decided to add '+ 8' to the buffer length parameter.  Sometimes the usb transfer
overflows and corrupts the socket structure, leading to kernel panic.

The above doesn't seem to happen for newer (Pegasus2 based) devices which did
help this bug to hide for so long.

The new default is to not include the CRC at the end of each received package.  
So far CRC has been ignored which makes no sense to do it in a first place.

The patch is against v4.6-rc5 and was tested on ADM8515 device by transferring
multiple gigabytes of data over a couple of days without any complaints from the
kernel.  Please apply it to whatever net tree you deem fit.

Changes since v1:

 - split the patch in two parts;
 - corrected the subject lines;

Changes since v2:

 - do not append CRC by default (based on a discussion with Johannes Berg);

Petko Manolov (2):
  pegasus: fixes URB buffer allocation size;
  pegasus: fixes reported packet length

 drivers/net/usb/pegasus.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

-- 
2.8.0.rc3

^ permalink raw reply

* [PATCH v3 2/2] pegasus: fixes reported packet length
From: Petko Manolov @ 2016-04-27 11:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, a1291762, johannes, Petko Manolov
In-Reply-To: <1461756290-27421-1-git-send-email-petkan@mip-labs.com>

The default Pegasus setup was to append the status and CRC at the end of each
received packet.  The status bits are used to update various stats, but CRC has
been ignored.  The new default is to not append CRC at the end of RX packets.

Signed-off-by: Petko Manolov <petkan@mip-labs.com>
---
 drivers/net/usb/pegasus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/pegasus.c b/drivers/net/usb/pegasus.c
index f919e20..82129ee 100644
--- a/drivers/net/usb/pegasus.c
+++ b/drivers/net/usb/pegasus.c
@@ -411,7 +411,7 @@ static int enable_net_traffic(struct net_device *dev, struct usb_device *usb)
 	int ret;
 
 	read_mii_word(pegasus, pegasus->phy, MII_LPA, &linkpart);
-	data[0] = 0xc9;
+	data[0] = 0xc8; /* TX & RX enable, append status, no CRC */
 	data[1] = 0;
 	if (linkpart & (ADVERTISE_100FULL | ADVERTISE_10FULL))
 		data[1] |= 0x20;	/* set full duplex */
@@ -497,7 +497,7 @@ static void read_bulk_callback(struct urb *urb)
 		pkt_len = buf[count - 3] << 8;
 		pkt_len += buf[count - 4];
 		pkt_len &= 0xfff;
-		pkt_len -= 8;
+		pkt_len -= 4;
 	}
 
 	/*
-- 
2.8.0.rc3

^ permalink raw reply related

* Re: [PATCH] netem: Segment GSO packets on enqueue.
From: Neil Horman @ 2016-04-27 11:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jamal Hadi Salim, David S. Miller, netem
In-Reply-To: <1461701955.5535.38.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, Apr 26, 2016 at 01:19:15PM -0700, Eric Dumazet wrote:
> On Tue, 2016-04-26 at 15:00 -0400, Neil Horman wrote:
> > I can understand that, but that raises two questions in my mind:
> > 
> > 1)  Doesn't that make all the statistical manipulation for netem wrong?  That is
> > to say, if netem drops 5% of packets, and it happens to drop a GSO packet, its
> > actually dropping several, instead of the single one required.
> 
> 
> Please take a look at tbf_segment(), where you can find a proper way to
> handle this.
> 
> Note that for the case q->corrupt is 0, we definitely do not want to
> segment TSO packets.
> 
> > 2) How are you getting netem to work with GSO at all?  The warning is triggered
> > for me on every GSO packet, which I think would impact throughput :)
> 
> I mostly use netem to add delays and drops.
> I never had this bug, since q->corrupt = 0
> 

I see what you're saying now, I should only be segmenting the packet if the
qdisc needs to compute the checksum on each packet.  Other packets that aren't
selected to be mangled can pass through un-segmented (assuming they meet any
other queue constraints).

Ok, thanks.  Self-nak.  I'll respin/test and post a new version

Best
Neil

> 
> 
> 

^ permalink raw reply

* Re: [PATCH] vhost_net: stop polling socket during rx processing
From: Michael S. Tsirkin @ 2016-04-27 11:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1461656153-24074-1-git-send-email-jasowang@redhat.com>

On Tue, Apr 26, 2016 at 03:35:53AM -0400, Jason Wang wrote:
> We don't stop polling socket during rx processing, this will lead
> unnecessary wakeups from under layer net devices (E.g
> sock_def_readable() form tun). Rx will be slowed down in this
> way. This patch avoids this by stop polling socket during rx
> processing. A small drawback is that this introduces some overheads in
> light load case because of the extra start/stop polling, but single
> netperf TCP_RR does not notice any change. In a super heavy load case,
> e.g using pktgen to inject packet to guest, we get about ~17%
> improvement on pps:
> 
> before: ~1370000 pkt/s
> after:  ~1500000 pkt/s
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

There is one other possible enhancement: we actually have the wait queue
lock taken in _wake_up, but we give it up only to take it again in the
handler.

It would be nicer to just remove the entry when we wake
the vhost thread. Re-add it if required.
I think that something like the below would give you the necessary API.
Pls feel free to use it if you are going to implement a patch on top
doing this - that's not a reason not to include this simple patch
though.

--->

wait: add API to drop a wait_queue_t entry from wake up handler

A wake up handler might want to remove its own wait queue entry to avoid
future wakeups.  In particular, vhost has such a need.  As wait queue
lock is already taken, all we need is an API to remove the entry without
wait_queue_head_t which isn't currently accessible to wake up handlers.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27d7a0a..9c6604b 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -191,11 +191,17 @@ __add_wait_queue_tail_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 }
 
 static inline void
-__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
+__remove_wait_queue_entry(wait_queue_t *old)
 {
 	list_del(&old->task_list);
 }
 
+static inline void
+__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
+{
+	__remove_wait_queue_entry(old);
+}
+
 typedef int wait_bit_action_f(struct wait_bit_key *, int mode);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);

^ permalink raw reply related

* Re: [RFC PATCH V2 1/2] vhost: convert pre sorted vhost memory array to interval tree
From: Michael S. Tsirkin @ 2016-04-27 11:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
	pbonzini
In-Reply-To: <1458873274-13961-2-git-send-email-jasowang@redhat.com>

On Fri, Mar 25, 2016 at 10:34:33AM +0800, Jason Wang wrote:
> Current pre-sorted memory region array has some limitations for future
> device IOTLB conversion:
> 
> 1) need extra work for adding and removing a single region, and it's
>    expected to be slow because of sorting or memory re-allocation.
> 2) need extra work of removing a large range which may intersect
>    several regions with different size.
> 3) need trick for a replacement policy like LRU
> 
> To overcome the above shortcomings, this patch convert it to interval
> tree which can easily address the above issue with almost no extra
> work.
> 
> The patch could be used for:
> 
> - Extend the current API and only let the userspace to send diffs of
>   memory table.
> - Simplify Device IOTLB implementation.

Does this affect performance at all?

> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/net.c   |   8 +--
>  drivers/vhost/vhost.c | 182 ++++++++++++++++++++++++++++----------------------
>  drivers/vhost/vhost.h |  27 ++++++--
>  3 files changed, 128 insertions(+), 89 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e..481db96 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -968,20 +968,20 @@ static long vhost_net_reset_owner(struct vhost_net *n)
>  	struct socket *tx_sock = NULL;
>  	struct socket *rx_sock = NULL;
>  	long err;
> -	struct vhost_memory *memory;
> +	struct vhost_umem *umem;
>  
>  	mutex_lock(&n->dev.mutex);
>  	err = vhost_dev_check_owner(&n->dev);
>  	if (err)
>  		goto done;
> -	memory = vhost_dev_reset_owner_prepare();
> -	if (!memory) {
> +	umem = vhost_dev_reset_owner_prepare();
> +	if (!umem) {
>  		err = -ENOMEM;
>  		goto done;
>  	}
>  	vhost_net_stop(n, &tx_sock, &rx_sock);
>  	vhost_net_flush(n);
> -	vhost_dev_reset_owner(&n->dev, memory);
> +	vhost_dev_reset_owner(&n->dev, umem);
>  	vhost_net_vq_reset(n);
>  done:
>  	mutex_unlock(&n->dev.mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index ad2146a..32c35a9 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -27,6 +27,7 @@
>  #include <linux/cgroup.h>
>  #include <linux/module.h>
>  #include <linux/sort.h>
> +#include <linux/interval_tree_generic.h>
>  
>  #include "vhost.h"
>  
> @@ -42,6 +43,10 @@ enum {
>  #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
>  #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
>  
> +INTERVAL_TREE_DEFINE(struct vhost_umem_node,
> +		     rb, __u64, __subtree_last,
> +		     START, LAST, , vhost_umem_interval_tree);
> +
>  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
>  static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
>  {
> @@ -275,7 +280,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> -	vq->memory = NULL;
> +	vq->umem = NULL;
>  	vq->is_le = virtio_legacy_is_little_endian();
>  	vhost_vq_reset_user_be(vq);
>  }
> @@ -381,7 +386,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	mutex_init(&dev->mutex);
>  	dev->log_ctx = NULL;
>  	dev->log_file = NULL;
> -	dev->memory = NULL;
> +	dev->umem = NULL;
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
> @@ -486,27 +491,36 @@ err_mm:
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
>  
> -struct vhost_memory *vhost_dev_reset_owner_prepare(void)
> +static void *vhost_kvzalloc(unsigned long size)
> +{
> +	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> +
> +	if (!n)
> +		n = vzalloc(size);
> +	return n;
> +}
> +
> +struct vhost_umem *vhost_dev_reset_owner_prepare(void)
>  {
> -	return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL);
> +	return vhost_kvzalloc(sizeof(struct vhost_umem));
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
>  
>  /* Caller should have device mutex */
> -void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory)
> +void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
>  {
>  	int i;
>  
>  	vhost_dev_cleanup(dev, true);
>  
>  	/* Restore memory to default empty mapping. */
> -	memory->nregions = 0;
> -	dev->memory = memory;
> +	INIT_LIST_HEAD(&umem->umem_list);
> +	dev->umem = umem;
>  	/* We don't need VQ locks below since vhost_dev_cleanup makes sure
>  	 * VQs aren't running.
>  	 */
>  	for (i = 0; i < dev->nvqs; ++i)
> -		dev->vqs[i]->memory = memory;
> +		dev->vqs[i]->umem = umem;
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
>  
> @@ -523,6 +537,21 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +static void vhost_umem_clean(struct vhost_umem *umem)
> +{
> +	struct vhost_umem_node *node, *tmp;
> +
> +	if (!umem)
> +		return;
> +
> +	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
> +		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> +		list_del(&node->link);
> +		kvfree(node);
> +	}
> +	kvfree(umem);
> +}
> +
>  /* Caller should have device mutex if and only if locked is set */
>  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  {
> @@ -549,8 +578,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  		fput(dev->log_file);
>  	dev->log_file = NULL;
>  	/* No one will access memory at this point */
> -	kvfree(dev->memory);
> -	dev->memory = NULL;
> +	vhost_umem_clean(dev->umem);
> +	dev->umem = NULL;
>  	WARN_ON(!list_empty(&dev->work_list));
>  	if (dev->worker) {
>  		kthread_stop(dev->worker);
> @@ -576,25 +605,25 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
>  }
>  
>  /* Caller should have vq mutex and device mutex. */
> -static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
> +static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
>  			       int log_all)
>  {
> -	int i;
> +	struct vhost_umem_node *node;
>  
> -	if (!mem)
> +	if (!umem)
>  		return 0;
>  
> -	for (i = 0; i < mem->nregions; ++i) {
> -		struct vhost_memory_region *m = mem->regions + i;
> -		unsigned long a = m->userspace_addr;
> -		if (m->memory_size > ULONG_MAX)
> +	list_for_each_entry(node, &umem->umem_list, link) {
> +		unsigned long a = node->userspace_addr;
> +
> +		if (node->size > ULONG_MAX)
>  			return 0;
>  		else if (!access_ok(VERIFY_WRITE, (void __user *)a,
> -				    m->memory_size))
> +				    node->size))
>  			return 0;
>  		else if (log_all && !log_access_ok(log_base,
> -						   m->guest_phys_addr,
> -						   m->memory_size))
> +						   node->start,
> +						   node->size))
>  			return 0;
>  	}
>  	return 1;
> @@ -602,7 +631,7 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
>  
>  /* Can we switch to this memory table? */
>  /* Caller should have device mutex but not vq mutex */
> -static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
> +static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
>  			    int log_all)
>  {
>  	int i;
> @@ -615,7 +644,8 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
>  		log = log_all || vhost_has_feature(d->vqs[i], VHOST_F_LOG_ALL);
>  		/* If ring is inactive, will check when it's enabled. */
>  		if (d->vqs[i]->private_data)
> -			ok = vq_memory_access_ok(d->vqs[i]->log_base, mem, log);
> +			ok = vq_memory_access_ok(d->vqs[i]->log_base,
> +						 umem, log);
>  		else
>  			ok = 1;
>  		mutex_unlock(&d->vqs[i]->mutex);
> @@ -642,7 +672,7 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
>  /* Caller should have device mutex but not vq mutex */
>  int vhost_log_access_ok(struct vhost_dev *dev)
>  {
> -	return memory_access_ok(dev, dev->memory, 1);
> +	return memory_access_ok(dev, dev->umem, 1);
>  }
>  EXPORT_SYMBOL_GPL(vhost_log_access_ok);
>  
> @@ -653,7 +683,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
>  {
>  	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
>  
> -	return vq_memory_access_ok(log_base, vq->memory,
> +	return vq_memory_access_ok(log_base, vq->umem,
>  				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
>  		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
>  					sizeof *vq->used +
> @@ -669,28 +699,12 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>  
> -static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
> -{
> -	const struct vhost_memory_region *r1 = p1, *r2 = p2;
> -	if (r1->guest_phys_addr < r2->guest_phys_addr)
> -		return 1;
> -	if (r1->guest_phys_addr > r2->guest_phys_addr)
> -		return -1;
> -	return 0;
> -}
> -
> -static void *vhost_kvzalloc(unsigned long size)
> -{
> -	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -
> -	if (!n)
> -		n = vzalloc(size);
> -	return n;
> -}
> -
>  static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  {
> -	struct vhost_memory mem, *newmem, *oldmem;
> +	struct vhost_memory mem, *newmem;
> +	struct vhost_memory_region *region;
> +	struct vhost_umem_node *node;
> +	struct vhost_umem *newumem, *oldumem;
>  	unsigned long size = offsetof(struct vhost_memory, regions);
>  	int i;
>  
> @@ -710,24 +724,52 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  		kvfree(newmem);
>  		return -EFAULT;
>  	}
> -	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
> -		vhost_memory_reg_sort_cmp, NULL);
>  
> -	if (!memory_access_ok(d, newmem, 0)) {
> +	newumem = vhost_kvzalloc(sizeof(*newumem));
> +	if (!newumem) {
>  		kvfree(newmem);
> -		return -EFAULT;
> +		return -ENOMEM;
> +	}
> +
> +	newumem->umem_tree = RB_ROOT;
> +	INIT_LIST_HEAD(&newumem->umem_list);
> +
> +	for (region = newmem->regions;
> +	     region < newmem->regions + mem.nregions;
> +	     region++) {
> +		node = vhost_kvzalloc(sizeof(*node));
> +		if (!node)
> +			goto err;
> +		node->start = region->guest_phys_addr;
> +		node->size = region->memory_size;
> +		node->last = node->start + node->size - 1;
> +		node->userspace_addr = region->userspace_addr;
> +		INIT_LIST_HEAD(&node->link);
> +		list_add_tail(&node->link, &newumem->umem_list);
> +		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
>  	}
> -	oldmem = d->memory;
> -	d->memory = newmem;
> +
> +	if (!memory_access_ok(d, newumem, 0))
> +		goto err;
> +
> +	oldumem = d->umem;
> +	d->umem = newumem;
>  
>  	/* All memory accesses are done under some VQ mutex. */
>  	for (i = 0; i < d->nvqs; ++i) {
>  		mutex_lock(&d->vqs[i]->mutex);
> -		d->vqs[i]->memory = newmem;
> +		d->vqs[i]->umem = newumem;
>  		mutex_unlock(&d->vqs[i]->mutex);
>  	}
> -	kvfree(oldmem);
> +
> +	kvfree(newmem);
> +	vhost_umem_clean(oldumem);
>  	return 0;
> +
> +err:
> +	vhost_umem_clean(newumem);
> +	kvfree(newmem);
> +	return -EFAULT;
>  }
>  
>  long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
> @@ -1017,28 +1059,6 @@ done:
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
>  
> -static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> -						     __u64 addr, __u32 len)
> -{
> -	const struct vhost_memory_region *reg;
> -	int start = 0, end = mem->nregions;
> -
> -	while (start < end) {
> -		int slot = start + (end - start) / 2;
> -		reg = mem->regions + slot;
> -		if (addr >= reg->guest_phys_addr)
> -			end = slot;
> -		else
> -			start = slot + 1;
> -	}
> -
> -	reg = mem->regions + start;
> -	if (addr >= reg->guest_phys_addr &&
> -		reg->guest_phys_addr + reg->memory_size > addr)
> -		return reg;
> -	return NULL;
> -}
> -
>  /* TODO: This is really inefficient.  We need something like get_user()
>   * (instruction directly accesses the data, with an exception table entry
>   * returning -EFAULT). See Documentation/x86/exception-tables.txt.
> @@ -1180,29 +1200,29 @@ EXPORT_SYMBOL_GPL(vhost_init_used);
>  static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>  			  struct iovec iov[], int iov_size)
>  {
> -	const struct vhost_memory_region *reg;
> -	struct vhost_memory *mem;
> +	const struct vhost_umem_node *node;
> +	struct vhost_umem *umem = vq->umem;
>  	struct iovec *_iov;
>  	u64 s = 0;
>  	int ret = 0;
>  
> -	mem = vq->memory;
>  	while ((u64)len > s) {
>  		u64 size;
>  		if (unlikely(ret >= iov_size)) {
>  			ret = -ENOBUFS;
>  			break;
>  		}
> -		reg = find_region(mem, addr, len);
> -		if (unlikely(!reg)) {
> +		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
> +							addr, addr + len - 1);
> +		if (node == NULL || node->start > addr) {
>  			ret = -EFAULT;
>  			break;
>  		}
>  		_iov = iov + ret;
> -		size = reg->memory_size - addr + reg->guest_phys_addr;
> +		size = node->size - addr + node->start;
>  		_iov->iov_len = min((u64)len - s, size);
>  		_iov->iov_base = (void __user *)(unsigned long)
> -			(reg->userspace_addr + addr - reg->guest_phys_addr);
> +			(node->userspace_addr + addr - node->start);
>  		s += size;
>  		addr += size;
>  		++ret;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index d3f7674..5d64393 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -52,6 +52,25 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> +#define START(node) ((node)->start)
> +#define LAST(node) ((node)->last)
> +
> +struct vhost_umem_node {
> +	struct rb_node rb;
> +	struct list_head link;
> +	__u64 start;
> +	__u64 last;
> +	__u64 size;
> +	__u64 userspace_addr;
> +	__u64 flags_padding;
> +	__u64 __subtree_last;
> +};
> +
> +struct vhost_umem {
> +	struct rb_root umem_tree;
> +	struct list_head umem_list;
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -100,7 +119,7 @@ struct vhost_virtqueue {
>  	struct iovec *indirect;
>  	struct vring_used_elem *heads;
>  	/* Protected by virtqueue mutex. */
> -	struct vhost_memory *memory;
> +	struct vhost_umem *umem;
>  	void *private_data;
>  	u64 acked_features;
>  	/* Log write descriptors */
> @@ -117,7 +136,6 @@ struct vhost_virtqueue {
>  };
>  
>  struct vhost_dev {
> -	struct vhost_memory *memory;
>  	struct mm_struct *mm;
>  	struct mutex mutex;
>  	struct vhost_virtqueue **vqs;
> @@ -127,14 +145,15 @@ struct vhost_dev {
>  	spinlock_t work_lock;
>  	struct list_head work_list;
>  	struct task_struct *worker;
> +	struct vhost_umem *umem;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
>  long vhost_dev_set_owner(struct vhost_dev *dev);
>  bool vhost_dev_has_owner(struct vhost_dev *dev);
>  long vhost_dev_check_owner(struct vhost_dev *);
> -struct vhost_memory *vhost_dev_reset_owner_prepare(void);
> -void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_memory *);
> +struct vhost_umem *vhost_dev_reset_owner_prepare(void);
> +void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_umem *);
>  void vhost_dev_cleanup(struct vhost_dev *, bool locked);
>  void vhost_dev_stop(struct vhost_dev *);
>  long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
> -- 
> 2.5.0

^ permalink raw reply

* [PATCH] net: phy: at803x: Register 'link_change_notify' only for AT8030
From: Sebastian Frias @ 2016-04-27 11:34 UTC (permalink / raw)
  To: Daniel Mack, David S. Miller, netdev; +Cc: lkml, mason, Sergei Shtylyov

There is no need to register the callback introduced by
commit 13a56b449325 ("net: phy: at803x: Add support for hardware reset")
for non faulty PHYs.

The check on the PHY ID is not necessary anymore and thus has been
removed from the callback implementation as well.

Fixes: 13a56b449325 ("net: phy: at803x: Add support for hardware reset")

Signed-off-by: Sebastian Frias <sf84@laposte.net>
---
 drivers/net/phy/at803x.c | 43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index b3ffaee..7fdc676 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -353,33 +353,32 @@ static void at803x_link_change_notify(struct phy_device *phydev)
 	struct at803x_priv *priv = phydev->priv;
 
 	/*
-	 * Conduct a hardware reset for AT8030 every time a link loss is
+	 * Conduct a hardware reset for AT8030 (this callback is only
+	 * registered for AT8030 at the moment) every time a link loss is
 	 * signalled. This is necessary to circumvent a hardware bug that
 	 * occurs when the cable is unplugged while TX packets are pending
 	 * in the FIFO. In such cases, the FIFO enters an error mode it
 	 * cannot recover from by software.
 	 */
-	if (phydev->drv->phy_id == ATH8030_PHY_ID) {
-		if (phydev->state == PHY_NOLINK) {
-			if (priv->gpiod_reset && !priv->phy_reset) {
-				struct at803x_context context;
-
-				at803x_context_save(phydev, &context);
-
-				gpiod_set_value(priv->gpiod_reset, 1);
-				msleep(1);
-				gpiod_set_value(priv->gpiod_reset, 0);
-				msleep(1);
-
-				at803x_context_restore(phydev, &context);
-
-				phydev_dbg(phydev, "%s(): phy was reset\n",
-					   __func__);
-				priv->phy_reset = true;
-			}
-		} else {
-			priv->phy_reset = false;
+	if (phydev->state == PHY_NOLINK) {
+		if (priv->gpiod_reset && !priv->phy_reset) {
+			struct at803x_context context;
+
+			at803x_context_save(phydev, &context);
+
+			gpiod_set_value(priv->gpiod_reset, 1);
+			msleep(1);
+			gpiod_set_value(priv->gpiod_reset, 0);
+			msleep(1);
+
+			at803x_context_restore(phydev, &context);
+
+			phydev_dbg(phydev, "%s(): phy was reset\n",
+				   __func__);
+			priv->phy_reset = true;
 		}
+	} else {
+		priv->phy_reset = false;
 	}
 }
 
@@ -391,7 +390,6 @@ static struct phy_driver at803x_driver[] = {
 	.phy_id_mask		= 0xffffffef,
 	.probe			= at803x_probe,
 	.config_init		= at803x_config_init,
-	.link_change_notify	= at803x_link_change_notify,
 	.set_wol		= at803x_set_wol,
 	.get_wol		= at803x_get_wol,
 	.suspend		= at803x_suspend,
@@ -427,7 +425,6 @@ static struct phy_driver at803x_driver[] = {
 	.phy_id_mask		= 0xffffffef,
 	.probe			= at803x_probe,
 	.config_init		= at803x_config_init,
-	.link_change_notify	= at803x_link_change_notify,
 	.set_wol		= at803x_set_wol,
 	.get_wol		= at803x_get_wol,
 	.suspend		= at803x_suspend,
-- 
2.1.4

^ permalink raw reply related

* Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Michael S. Tsirkin @ 2016-04-27 11:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
	pbonzini
In-Reply-To: <1458873274-13961-3-git-send-email-jasowang@redhat.com>

On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
> This patch tries to implement an device IOTLB for vhost. This could be
> used with for co-operation with userspace(qemu) implementation of DMA
> remapping.
> 
> The idea is simple. When vhost meets an IOTLB miss, it will request
> the assistance of userspace to do the translation, this is done
> through:
> 
> - Fill the translation request in a preset userspace address (This
>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
> - Notify userspace through eventfd (This eventfd was set through ioctl
>   VHOST_SET_IOTLB_FD).

Why use an eventfd for this? We use them for interrupts because
that happens to be what kvm wants, but here - why don't we
just add a generic support for reading out events
on the vhost fd itself?

> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
> 
> When userspace finishes the translation, it will update the vhost
> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
> snooping the IOTLB invalidation of IOMMU IOTLB and use
> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.

There's one problem here, and that is that VQs still do not undergo
translation.  In theory VQ could be mapped in such a way
that it's not contigious in userspace memory.


> Signed-off-by: Jason Wang <jasowang@redhat.com>

What limits amount of entries that kernel keeps around?
Do we want at least a mod parameter for this?


> ---
>  drivers/vhost/net.c        |   6 +-
>  drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
>  drivers/vhost/vhost.h      |  17 ++-
>  fs/eventfd.c               |   3 +-
>  include/uapi/linux/vhost.h |  35 ++++++
>  5 files changed, 320 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 481db96..7cbdeed 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -334,7 +334,7 @@ static void handle_tx(struct vhost_net *net)
>  		head = vhost_get_vq_desc(vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
>  					 &out, &in,
> -					 NULL, NULL);
> +					 NULL, NULL, VHOST_ACCESS_RO);
>  		/* On error, stop handling until the next kick. */
>  		if (unlikely(head < 0))
>  			break;
> @@ -470,7 +470,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  		}
>  		r = vhost_get_vq_desc(vq, vq->iov + seg,
>  				      ARRAY_SIZE(vq->iov) - seg, &out,
> -				      &in, log, log_num);
> +				      &in, log, log_num, VHOST_ACCESS_WO);
>  		if (unlikely(r < 0))
>  			goto err;
>  
> @@ -1083,7 +1083,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
>  		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
>  		if (r == -ENOIOCTLCMD)
>  			r = vhost_vring_ioctl(&n->dev, ioctl, argp);
> -		else
> +		else if (ioctl != VHOST_UPDATE_IOTLB)
>  			vhost_net_flush(n);
>  		mutex_unlock(&n->dev.mutex);
>  		return r;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 32c35a9..1dd43e8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -280,6 +280,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	vq->iotlb_call = NULL;
> +	vq->iotlb_call_ctx = NULL;
> +	vq->iotlb_request = NULL;
> +	vq->pending_request.flags.type = VHOST_IOTLB_INVALIDATE;
>  	vq->umem = NULL;
>  	vq->is_le = virtio_legacy_is_little_endian();
>  	vhost_vq_reset_user_be(vq);
> @@ -387,8 +391,10 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->log_ctx = NULL;
>  	dev->log_file = NULL;
>  	dev->umem = NULL;
> +	dev->iotlb = NULL;
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
> +	spin_lock_init(&dev->iotlb_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
>  	dev->worker = NULL;
>  
> @@ -537,6 +543,15 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +static void vhost_umem_free(struct vhost_umem *umem,
> +			    struct vhost_umem_node *node)
> +{
> +	vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> +	list_del(&node->link);
> +	kfree(node);
> +	umem->numem--;
> +}
> +
>  static void vhost_umem_clean(struct vhost_umem *umem)
>  {
>  	struct vhost_umem_node *node, *tmp;
> @@ -544,11 +559,9 @@ static void vhost_umem_clean(struct vhost_umem *umem)
>  	if (!umem)
>  		return;
>  
> -	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
> -		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
> -		list_del(&node->link);
> -		kvfree(node);
> -	}
> +	list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
> +		vhost_umem_free(umem, node);
> +
>  	kvfree(umem);
>  }
>  
> @@ -580,6 +593,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  	/* No one will access memory at this point */
>  	vhost_umem_clean(dev->umem);
>  	dev->umem = NULL;
> +	vhost_umem_clean(dev->iotlb);
> +	dev->iotlb = NULL;
>  	WARN_ON(!list_empty(&dev->work_list));
>  	if (dev->worker) {
>  		kthread_stop(dev->worker);
> @@ -699,11 +714,61 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>  
> +static int vhost_new_umem_range(struct vhost_umem *umem,
> +				u64 start, u64 size, u64 end,
> +				u64 userspace_addr, int perm)
> +{
> +	struct vhost_umem_node *tmp, *node = kmalloc(sizeof(*node), GFP_ATOMIC);
> +
> +	if (!node)
> +		return -ENOMEM;
> +
> +	if (umem->numem == VHOST_IOTLB_SIZE) {
> +		tmp = list_last_entry(&umem->umem_list, typeof(*tmp), link);
> +		vhost_umem_free(umem, tmp);
> +	}
> +
> +	node->start = start;
> +	node->size = size;
> +	node->last = end;
> +	node->userspace_addr = userspace_addr;
> +	node->perm = perm;
> +	INIT_LIST_HEAD(&node->link);
> +	list_add_tail(&node->link, &umem->umem_list);
> +	vhost_umem_interval_tree_insert(node, &umem->umem_tree);
> +	umem->numem++;
> +
> +	return 0;
> +}
> +
> +static void vhost_del_umem_range(struct vhost_umem *umem,
> +				 u64 start, u64 end)
> +{
> +	struct vhost_umem_node *node;
> +
> +	while ((node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
> +							   start, end)))
> +		vhost_umem_free(umem, node);
> +}
> +
> +static struct vhost_umem *vhost_umem_alloc(void)
> +{
> +	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
> +
> +	if (!umem)
> +		return NULL;
> +
> +	umem->umem_tree = RB_ROOT;
> +	umem->numem = 0;
> +	INIT_LIST_HEAD(&umem->umem_list);
> +
> +	return umem;
> +}
> +
>  static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  {
>  	struct vhost_memory mem, *newmem;
>  	struct vhost_memory_region *region;
> -	struct vhost_umem_node *node;
>  	struct vhost_umem *newumem, *oldumem;
>  	unsigned long size = offsetof(struct vhost_memory, regions);
>  	int i;
> @@ -725,28 +790,23 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  		return -EFAULT;
>  	}
>  
> -	newumem = vhost_kvzalloc(sizeof(*newumem));
> +	newumem = vhost_umem_alloc();
>  	if (!newumem) {
>  		kvfree(newmem);
>  		return -ENOMEM;
>  	}
>  
> -	newumem->umem_tree = RB_ROOT;
> -	INIT_LIST_HEAD(&newumem->umem_list);
> -
>  	for (region = newmem->regions;
>  	     region < newmem->regions + mem.nregions;
>  	     region++) {
> -		node = vhost_kvzalloc(sizeof(*node));
> -		if (!node)
> +		if (vhost_new_umem_range(newumem,
> +					 region->guest_phys_addr,
> +					 region->memory_size,
> +					 region->guest_phys_addr +
> +					 region->memory_size - 1,
> +					 region->userspace_addr,
> +				         VHOST_ACCESS_RW))
>  			goto err;
> -		node->start = region->guest_phys_addr;
> -		node->size = region->memory_size;
> -		node->last = node->start + node->size - 1;
> -		node->userspace_addr = region->userspace_addr;
> -		INIT_LIST_HEAD(&node->link);
> -		list_add_tail(&node->link, &newumem->umem_list);
> -		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
>  	}
>  
>  	if (!memory_access_ok(d, newumem, 0))
> @@ -782,6 +842,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  	struct vhost_vring_state s;
>  	struct vhost_vring_file f;
>  	struct vhost_vring_addr a;
> +	struct vhost_vring_iotlb_entry e;
>  	u32 idx;
>  	long r;
>  
> @@ -910,6 +971,35 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  		} else
>  			filep = eventfp;
>  		break;
> +	case VHOST_SET_VRING_IOTLB_REQUEST:
> +		r = -EFAULT;
> +		if (copy_from_user(&e, argp, sizeof e))
> +			break;
> +		if (!access_ok(VERIFY_WRITE, e.userspace_addr,
> +				sizeof(*vq->iotlb_request)))
> +			break;
> +		r = 0;
> +		vq->iotlb_request = (struct vhost_iotlb_entry __user *)e.userspace_addr;
> +		break;
> +	case VHOST_SET_VRING_IOTLB_CALL:
> +		if (copy_from_user(&f, argp, sizeof f)) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp)) {
> +			r = PTR_ERR(eventfp);
> +			break;
> +		}
> +		if (eventfp != vq->iotlb_call) {
> +			filep = vq->iotlb_call;
> +			ctx = vq->iotlb_call_ctx;
> +			vq->iotlb_call = eventfp;
> +			vq->iotlb_call_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
>  	case VHOST_SET_VRING_CALL:
>  		if (copy_from_user(&f, argp, sizeof f)) {
>  			r = -EFAULT;
> @@ -977,11 +1067,55 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>  }
>  EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
>  
> +static int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
> +{
> +	struct vhost_umem *niotlb, *oiotlb;
> +
> +	if (enabled) {
> +		niotlb = vhost_umem_alloc();
> +		if (!niotlb)
> +			return -ENOMEM;
> +	} else
> +		niotlb = NULL;
> +
> +	spin_lock(&d->iotlb_lock);
> +	oiotlb = d->iotlb;
> +	d->iotlb = niotlb;
> +	spin_unlock(&d->iotlb_lock);
> +
> +	vhost_umem_clean(oiotlb);
> +
> +	return 0;
> +}
> +
> +static void vhost_complete_iotlb_update(struct vhost_dev *d,
> +					struct vhost_iotlb_entry *entry)
> +{
> +	struct vhost_iotlb_entry *req;
> +	struct vhost_virtqueue *vq;
> +	int i;
> +
> +
> +	for (i = 0; i < d->nvqs; i++) {
> +		vq = d->vqs[i];
> +		mutex_lock(&vq->mutex);
> +		req = &vq->pending_request;
> +		if (entry->iova <= req->iova &&
> +		    entry->iova + entry->size - 1 > req->iova &&
> +		    req->flags.type == VHOST_IOTLB_MISS) {
> +			*req = *entry;
> +			vhost_poll_queue(&vq->poll);
> +		}
> +		mutex_unlock(&vq->mutex);
> +	}
> +}
> +
>  /* Caller must have device mutex */
>  long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>  {
>  	struct file *eventfp, *filep = NULL;
>  	struct eventfd_ctx *ctx = NULL;
> +	struct vhost_iotlb_entry entry;
>  	u64 p;
>  	long r;
>  	int i, fd;
> @@ -1050,6 +1184,52 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>  		if (filep)
>  			fput(filep);
>  		break;
> +	case VHOST_RUN_IOTLB:
> +		/* FIXME: enable and disabled */
> +		vhost_init_device_iotlb(d, true);
> +		break;
> +	case VHOST_UPDATE_IOTLB:
> +		r = copy_from_user(&entry, argp, sizeof(entry));
> +		if (r < 0) {
> +			r = -EFAULT;
> +			goto done;
> +		}
> +
> +		spin_lock(&d->iotlb_lock);
> +		if (!d->iotlb) {
> +			spin_unlock(&d->iotlb_lock);
> +			r = -EFAULT;
> +			goto done;
> +		}
> +		switch (entry.flags.type) {
> +		case VHOST_IOTLB_UPDATE:
> +			if (entry.flags.valid != VHOST_IOTLB_VALID) {
> +				break;
> +			}
> +			if (vhost_new_umem_range(d->iotlb,
> +						 entry.iova,
> +						 entry.size,
> +						 entry.iova + entry.size - 1,
> +                                                 entry.userspace_addr,
> +                                                 entry.flags.perm)) {
> +				r = -ENOMEM;
> +				break;
> +			}
> +			break;
> +		case VHOST_IOTLB_INVALIDATE:
> +			vhost_del_umem_range(d->iotlb,
> +					     entry.iova,
> +					     entry.iova + entry.size - 1);
> +			break;
> +		default:
> +			r = -EINVAL;
> +		}
> +		spin_unlock(&d->iotlb_lock);
> +
> +		if (!r && entry.flags.type != VHOST_IOTLB_INVALIDATE)
> +			vhost_complete_iotlb_update(d, &entry);
> +
> +		break;
>  	default:
>  		r = -ENOIOCTLCMD;
>  		break;
> @@ -1197,27 +1377,69 @@ int vhost_init_used(struct vhost_virtqueue *vq)
>  }
>  EXPORT_SYMBOL_GPL(vhost_init_used);
>  
> +static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova)
> +{
> +	struct vhost_iotlb_entry *pending = &vq->pending_request;
> +	int ret;
> +
> +	if (!vq->iotlb_call_ctx)
> +		return -EOPNOTSUPP;
> +
> +	if (!vq->iotlb_request)
> +		return -EOPNOTSUPP;
> +
> +	if (pending->flags.type == VHOST_IOTLB_MISS) {
> +		return -EEXIST;
> +	}
> +
> +	pending->iova = iova;
> +	pending->flags.type = VHOST_IOTLB_MISS;
> +
> +	ret = __copy_to_user(vq->iotlb_request, pending,
> +			     sizeof(struct vhost_iotlb_entry));
> +	if (ret) {
> +		goto err;
> +	}
> +
> +	if (vq->iotlb_call_ctx)
> +		eventfd_signal(vq->iotlb_call_ctx, 1);
> +err:
> +	return ret;
> +}
> +
>  static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> -			  struct iovec iov[], int iov_size)
> +			  struct iovec iov[], int iov_size, int access)
>  {
>  	const struct vhost_umem_node *node;
> -	struct vhost_umem *umem = vq->umem;
> +	struct vhost_dev *dev = vq->dev;
> +	struct vhost_umem *umem = dev->iotlb ? dev->iotlb : dev->umem;
>  	struct iovec *_iov;
>  	u64 s = 0;
>  	int ret = 0;
>  
> +	spin_lock(&dev->iotlb_lock);
> +
>  	while ((u64)len > s) {
>  		u64 size;
>  		if (unlikely(ret >= iov_size)) {
>  			ret = -ENOBUFS;
>  			break;
>  		}
> +
>  		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
>  							addr, addr + len - 1);
>  		if (node == NULL || node->start > addr) {
> -			ret = -EFAULT;
> +			if (umem != dev->iotlb) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +			ret = -EAGAIN;
> +			break;
> +		} else if (!(node->perm & access)) {
> +			ret = -EPERM;
>  			break;
>  		}
> +
>  		_iov = iov + ret;
>  		size = node->size - addr + node->start;
>  		_iov->iov_len = min((u64)len - s, size);
> @@ -1228,6 +1450,10 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>  		++ret;
>  	}
>  
> +	spin_unlock(&dev->iotlb_lock);
> +
> +	if (ret == -EAGAIN)
> +		vhost_iotlb_miss(vq, addr);
>  	return ret;
>  }
>  
> @@ -1256,7 +1482,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  			struct iovec iov[], unsigned int iov_size,
>  			unsigned int *out_num, unsigned int *in_num,
>  			struct vhost_log *log, unsigned int *log_num,
> -			struct vring_desc *indirect)
> +			struct vring_desc *indirect, int access)
>  {
>  	struct vring_desc desc;
>  	unsigned int i = 0, count, found = 0;
> @@ -1274,9 +1500,10 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  	}
>  
>  	ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
> -			     UIO_MAXIOV);
> +			     UIO_MAXIOV, access);
>  	if (unlikely(ret < 0)) {
> -		vq_err(vq, "Translation failure %d in indirect.\n", ret);
> +		if (ret != -EAGAIN)
> +			vq_err(vq, "Translation failure %d in indirect.\n", ret);
>  		return ret;
>  	}
>  	iov_iter_init(&from, READ, vq->indirect, ret, len);
> @@ -1316,10 +1543,11 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  
>  		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
>  				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
> -				     iov_size - iov_count);
> +				     iov_size - iov_count, access);
>  		if (unlikely(ret < 0)) {
> -			vq_err(vq, "Translation failure %d indirect idx %d\n",
> -			       ret, i);
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Translation failure %d indirect idx %d\n",
> +					ret, i);
>  			return ret;
>  		}
>  		/* If this is an input descriptor, increment that count. */
> @@ -1355,7 +1583,8 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  		      struct iovec iov[], unsigned int iov_size,
>  		      unsigned int *out_num, unsigned int *in_num,
> -		      struct vhost_log *log, unsigned int *log_num)
> +		      struct vhost_log *log, unsigned int *log_num,
> +		      int access)
>  {
>  	struct vring_desc desc;
>  	unsigned int i, head, found = 0;
> @@ -1433,10 +1662,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
>  			ret = get_indirect(vq, iov, iov_size,
>  					   out_num, in_num,
> -					   log, log_num, &desc);
> +					   log, log_num, &desc, access);
>  			if (unlikely(ret < 0)) {
> -				vq_err(vq, "Failure detected "
> -				       "in indirect descriptor at idx %d\n", i);
> +				if (ret != -EAGAIN)
> +					vq_err(vq, "Failure detected "
> +						"in indirect descriptor at idx %d\n", i);
>  				return ret;
>  			}
>  			continue;
> @@ -1444,10 +1674,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>  		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
>  				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
> -				     iov_size - iov_count);
> +				     iov_size - iov_count, access);
>  		if (unlikely(ret < 0)) {
> -			vq_err(vq, "Translation failure %d descriptor idx %d\n",
> -			       ret, i);
> +			if (ret != -EAGAIN)
> +				vq_err(vq, "Translation failure %d descriptor idx %d\n",
> +					ret, i);
>  			return ret;
>  		}
>  		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE)) {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 5d64393..4365104 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -62,13 +62,15 @@ struct vhost_umem_node {
>  	__u64 last;
>  	__u64 size;
>  	__u64 userspace_addr;
> -	__u64 flags_padding;
> +	__u32 perm;
> +	__u32 flags_padding;
>  	__u64 __subtree_last;
>  };
>  
>  struct vhost_umem {
>  	struct rb_root umem_tree;
>  	struct list_head umem_list;
> +	int numem;
>  };
>  
>  /* The virtqueue structure describes a queue attached to a device. */
> @@ -84,9 +86,13 @@ struct vhost_virtqueue {
>  	struct file *kick;
>  	struct file *call;
>  	struct file *error;
> +	struct file *iotlb_call;
>  	struct eventfd_ctx *call_ctx;
>  	struct eventfd_ctx *error_ctx;
>  	struct eventfd_ctx *log_ctx;
> +	struct eventfd_ctx *iotlb_call_ctx;
> +	struct vhost_iotlb_entry __user *iotlb_request;
> +	struct vhost_iotlb_entry pending_request;
>  
>  	struct vhost_poll poll;
>  
> @@ -135,6 +141,8 @@ struct vhost_virtqueue {
>  #endif
>  };
>  
> +#define VHOST_IOTLB_SIZE 2048
> +
>  struct vhost_dev {
>  	struct mm_struct *mm;
>  	struct mutex mutex;
> @@ -146,6 +154,8 @@ struct vhost_dev {
>  	struct list_head work_list;
>  	struct task_struct *worker;
>  	struct vhost_umem *umem;
> +	struct vhost_umem *iotlb;
> +	spinlock_t iotlb_lock;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> @@ -164,7 +174,8 @@ int vhost_log_access_ok(struct vhost_dev *);
>  int vhost_get_vq_desc(struct vhost_virtqueue *,
>  		      struct iovec iov[], unsigned int iov_count,
>  		      unsigned int *out_num, unsigned int *in_num,
> -		      struct vhost_log *log, unsigned int *log_num);
> +		      struct vhost_log *log, unsigned int *log_num,
> +		      int access);
>  void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>  
>  int vhost_init_used(struct vhost_virtqueue *);
> @@ -183,7 +194,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
> -		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> +		printk(pr_fmt(fmt), ##__VA_ARGS__);       \
>  		if ((vq)->error_ctx)                               \
>  				eventfd_signal((vq)->error_ctx, 1);\
>  	} while (0)
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 8d0c0df..5c0a22f 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -59,8 +59,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	if (ULLONG_MAX - ctx->count < n)
>  		n = ULLONG_MAX - ctx->count;
>  	ctx->count += n;
> -	if (waitqueue_active(&ctx->wqh))
> +	if (waitqueue_active(&ctx->wqh)) {
>  		wake_up_locked_poll(&ctx->wqh, POLLIN);
> +	}
>  	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
>  
>  	return n;
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index ab373191..5c35ab4 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -47,6 +47,32 @@ struct vhost_vring_addr {
>  	__u64 log_guest_addr;
>  };
>  
> +struct vhost_iotlb_entry {
> +	__u64 iova;
> +	__u64 size;
> +	__u64 userspace_addr;

Alignment requirements?

> +	struct {
> +#define VHOST_ACCESS_RO      0x1
> +#define VHOST_ACCESS_WO      0x2
> +#define VHOST_ACCESS_RW      0x3
> +		__u8  perm;
> +#define VHOST_IOTLB_MISS           1
> +#define VHOST_IOTLB_UPDATE         2
> +#define VHOST_IOTLB_INVALIDATE     3
> +		__u8  type;
> +#define VHOST_IOTLB_INVALID        0x1
> +#define VHOST_IOTLB_VALID          0x2
> +		__u8  valid;

why do we need this flag?

> +		__u8  u8_padding;
> +		__u32 padding;
> +	} flags;
> +};
> +
> +struct vhost_vring_iotlb_entry {
> +	unsigned int index;
> +	__u64 userspace_addr;
> +};
> +
>  struct vhost_memory_region {
>  	__u64 guest_phys_addr;
>  	__u64 memory_size; /* bytes */
> @@ -127,6 +153,15 @@ struct vhost_memory {
>  /* Set eventfd to signal an error */
>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>  
> +/* IOTLB */
> +/* Specify an eventfd file descriptor to signle on IOTLB miss */

typo

> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct      \
> +                                        vhost_vring_file)
> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct   \
> +                                           vhost_vring_iotlb_entry)
> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
> +

Is the assumption that userspace must dedicate a thread to running the iotlb? 
I dislike this one.
Please support asynchronous APIs at least optionally to make
userspace make its own threading decisions.

>  /* VHOST_NET specific defines */
>  
>  /* Attach virtio net ring to a raw socket, or tap device.

Don't we need a feature bit for this?
Are we short on feature bits? If yes maybe it's time to add
something like PROTOCOL_FEATURES that we have in vhost-user.

> -- 
> 2.5.0

^ permalink raw reply

* [PATCH net] gre: reject GUE and FOU in collect metadata mode
From: Jiri Benc @ 2016-04-27 12:08 UTC (permalink / raw)
  To: netdev; +Cc: Pravin Shelar, Thomas Graf, Tom Herbert

The collect metadata mode does not support GUE nor FOU. This might be
implemented later; until then, we should reject such config.

I think this is okay to be changed. It's unlikely anyone has such
configuration (as it doesn't work anyway) and we may need a way to
distinguish whether it's supported or not by the kernel later.

For backwards compatibility with iproute2, it's not possible to just check
the attribute presence (iproute2 always includes the attribute), the actual
value has to be checked, too.

Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
Discovered this only after I already sent v3 of the previous gre set.
Submitting this patch on its own, it's an indepent fix anyway (though fixing
the same commit).
---
 net/ipv4/ip_gre.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index f973e0a58993..f502d34bcb40 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -948,6 +948,11 @@ static int ipgre_tunnel_validate(struct nlattr *tb[], struct nlattr *data[])
 	if (flags & (GRE_VERSION|GRE_ROUTING))
 		return -EINVAL;
 
+	if (data[IFLA_GRE_COLLECT_METADATA] &&
+	    data[IFLA_GRE_ENCAP_TYPE] &&
+	    nla_get_u16(data[IFLA_GRE_ENCAP_TYPE]) != TUNNEL_ENCAP_NONE)
+		return -EINVAL;
+
 	return 0;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] ip: add udp_csum, udp6_csum_tx, udp6_csum_rx control flags to ip l2tp add tunnel
From: James Chapman @ 2016-04-27 12:21 UTC (permalink / raw)
  To: Wang Shanker; +Cc: netdev

On 26 April 2016 at 15:15, Wang Shanker <shankerwangmiao@gmail.com> wrote:
> Hi, all
>
> It’s my first time to contribute to such an important open source project. Things began when I upgraded my server, called "Server A", form ubuntu 14.04 to 16.04, which is shipped with new kernel version, 4.4. After upgrade, I soon found a l2tp tunnel between this server and another linux server, called "Server B", via ipv6 broke down. Here is the network topology:
>
>   +----------+                    +----------+
>   | Server A | -- IPV6 Network -- | Server B |
>   +----------+                    +----------+
>
> The l2tp tunnel was encapsulated in udp datagrams. All the configuration was normal and could work after I reverted the kernel on Server A to original version.
>
> Here is what i did to create the tunnel:
>
> ```
> on Server A:
>
> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 remote 2001:db8::aaaa local 2001:db8::bbbb udp_sport 1086 udp_dport 1086
> ip l2tp add session name l2tpeth0 tunnel_id 86 session_id 86 peer_session_id 86
> ip l s l2tpeth0 up
>
> on Server B:
>
> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 local 2001:db8::aaaa remote 2001:db8::bbbb udp_sport 1086 udp_dport 1086
> ip l2tp add session name l2tpeth0 tunnel_id 86 session_id 86 peer_session_id 86
> ip l s l2tpeth0 up
>
> ```
>
> When I used tcpdump to diagnose the problem, I got such result:
>
> ```
> on Server A:
>
> arping -i l2tpeth0 -0 1.2.3.4
>
> on Server B:
>
> tcpdump -i eth0 -n port 1086  -v
>
> 21:35:57.818810 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
> 21:35:58.820572 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
> 21:35:59.822216 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [bad udp cksum 0x0000 -> 0x1140!] UDP, length 54
>
> ```
>
> After looking into kernel source, I found out that in this commit a new feature to set udp6 checksum to zero in commit 6b649fe, which added `L2TP_ATTR_UDP_ZERO_CSUM6_TX` and `L2TP_ATTR_UDP_ZERO_CSUM6_TX`.
>
> As a result, I added `udp_csum`, `udp6_csum_tx`, `udp6_csum_rx` control flags to `ip l2tp add tunnel` to control those attributes about checksum.
>
> Using this to create the tunnel instead on Server A:
>
> ```
> ip l2tp add tunnel tunnel_id 86 peer_tunnel_id 86 remote 2001:db8::aaaa local 2001:db8::bbbb udp_sport 1086 udp_dport 1086 udp6_csum_tx on udp6_csum_rx on
> ```
>
> I finally got:
>
> ```
> on Server A:
>
> arping -i l2tpeth0 -0 1.2.3.4
>
> on Server B:
>
> tcpdump -i eth0 -n port 1086  -v
>
> 22:07:03.844297 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
> 22:07:04.845717 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
> 22:07:05.847965 IP6 (flowlabel 0x8f028, hlim 64, next-header UDP (17) payload length: 62) 2001:db8::aaaa.1086 > 2001:db8::bbbb.1086: [udp sum ok] UDP, length 54
>
> tcpdump -i l2tpeth0 -v
>
> 22:10:35.691326 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
> 22:10:36.693627 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
> 22:10:37.695010 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
> 22:10:38.697121 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.2.3.4 tell 0.0.0.0, length 28
>
> ```
>
> It seems to work. However, is it the real point that should be fixed and does my patch fix it well? I need your suggestion.

This seems reasonable to me. It is good to provide user control of
L2TP checksum options.

However, there is a problem with your patch. The netlink attributes
you are accessing to control checksums are flags, not u8 values.

Maybe the default checksum setting for such l2tp tunnels should be
changed in the l2tp kernel code to match the previous behaviour where
IPv6 checksums were disabled?

>
>
> ---
>  ip/ipl2tp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
> index 3c8ee93..67a6482 100644
> --- a/ip/ipl2tp.c
> +++ b/ip/ipl2tp.c
> @@ -56,6 +56,8 @@ struct l2tp_parm {
>
>         uint16_t pw_type;
>         uint16_t mtu;
> +       int udp6_csum_tx:1;
> +       int udp6_csum_rx:1;
>         int udp_csum:1;
>         int recv_seq:1;
>         int send_seq:1;
> @@ -117,6 +119,9 @@ static int create_tunnel(struct l2tp_parm *p)
>         if (p->encap == L2TP_ENCAPTYPE_UDP) {
>                 addattr16(&req.n, 1024, L2TP_ATTR_UDP_SPORT, p->local_udp_port);
>                 addattr16(&req.n, 1024, L2TP_ATTR_UDP_DPORT, p->peer_udp_port);
> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_CSUM, p->udp_csum);
> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_ZERO_CSUM6_TX, p->udp6_csum_tx);
> +               addattr8 (&req.n, 1024, L2TP_ATTR_UDP_ZERO_CSUM6_RX, p->udp6_csum_rx);
>         }
>
>         if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
> @@ -282,6 +287,14 @@ static int get_response(struct nlmsghdr *n, void *arg)
>                 p->l2spec_len = rta_getattr_u8(attrs[L2TP_ATTR_L2SPEC_LEN]);
>
>         p->udp_csum = !!attrs[L2TP_ATTR_UDP_CSUM];
> +       if (attrs[L2TP_ATTR_UDP_ZERO_CSUM6_TX])
> +         p->udp6_csum_tx = rta_getattr_u8(attrs[L2TP_ATTR_UDP_ZERO_CSUM6_TX]);
> +       else
> +         p->udp6_csum_tx = 1;
> +       if (attrs[L2TP_ATTR_UDP_ZERO_CSUM6_RX])
> +         p->udp6_csum_rx = rta_getattr_u8(attrs[L2TP_ATTR_UDP_ZERO_CSUM6_RX]);
> +       else
> +         p->udp6_csum_rx = 1;
>         if (attrs[L2TP_ATTR_COOKIE])
>                 memcpy(p->cookie, RTA_DATA(attrs[L2TP_ATTR_COOKIE]),
>                        p->cookie_len = RTA_PAYLOAD(attrs[L2TP_ATTR_COOKIE]));
> @@ -470,6 +483,9 @@ static void usage(void)
>         fprintf(stderr, "          tunnel_id ID peer_tunnel_id ID\n");
>         fprintf(stderr, "          [ encap { ip | udp } ]\n");
>         fprintf(stderr, "          [ udp_sport PORT ] [ udp_dport PORT ]\n");
> +       fprintf(stderr, "          [ udp_csum { on | off } ]\n");
> +       fprintf(stderr, "          [ udp6_csum_tx { on | off } ]\n");
> +       fprintf(stderr, "          [ udp6_csum_rx { on | off } ]\n");
>         fprintf(stderr, "Usage: ip l2tp add session [ name NAME ]\n");
>         fprintf(stderr, "          tunnel_id ID\n");
>         fprintf(stderr, "          session_id ID peer_session_id ID\n");
> @@ -500,6 +516,8 @@ static int parse_args(int argc, char **argv, int cmd, struct l2tp_parm *p)
>         /* Defaults */
>         p->l2spec_type = L2TP_L2SPECTYPE_DEFAULT;
>         p->l2spec_len = 4;
> +       p->udp6_csum_rx = 1;
> +       p->udp6_csum_tx = 1;
>
>         while (argc > 0) {
>                 if (strcmp(*argv, "encap") == 0) {
> @@ -569,6 +587,33 @@ static int parse_args(int argc, char **argv, int cmd, struct l2tp_parm *p)
>                         if (get_u16(&uval, *argv, 0))
>                                 invarg("invalid port\n", *argv);
>                         p->peer_udp_port = uval;
> +               } else if (strcmp(*argv, "udp_csum") == 0) {
> +                       NEXT_ARG();
> +                       if (strcmp(*argv, "on") == 0) {
> +                               p->udp_csum = 1;
> +                       } else if (strcmp(*argv, "off") == 0) {
> +                               p->udp_csum = 0;
> +                       } else {
> +                               invarg("invalid option for udp_csum\n", *argv);
> +                       }
> +               } else if (strcmp(*argv, "udp6_csum_rx") == 0) {
> +                       NEXT_ARG();
> +                       if (strcmp(*argv, "on") == 0) {
> +                               p->udp6_csum_rx = 1;
> +                       } else if (strcmp(*argv, "off") == 0) {
> +                               p->udp6_csum_rx = 0;
> +                       } else {
> +                               invarg("invalid option for udp6_csum_rx\n", *argv);
> +                       }
> +               } else if (strcmp(*argv, "udp6_csum_tx") == 0) {
> +                       NEXT_ARG();
> +                       if (strcmp(*argv, "on") == 0) {
> +                               p->udp6_csum_tx = 1;
> +                       } else if (strcmp(*argv, "off") == 0) {
> +                               p->udp6_csum_tx = 0;
> +                       } else {
> +                               invarg("invalid option for udp6_csum_tx\n", *argv);
> +                       }
>                 } else if (strcmp(*argv, "offset") == 0) {
>                         __u8 uval;
>
> --
> 2.5.2
>

^ permalink raw reply

* Re: [PATCH net-next 9/9] taskstats: use the libnl API to align nlattr on 64-bit
From: Balbir Singh @ 2016-04-27 12:29 UTC (permalink / raw)
  To: nicolas.dichtel, netdev
  Cc: davem, linux-kernel, linux-wpan, aar, pablo, kaber, kadlec,
	pshelar, kuznet, jmorris, yoshfuji, netfilter-devel, dev,
	steffen.klassert, herbert
In-Reply-To: <57206A47.2010906@6wind.com>



On 27/04/16 17:29, Nicolas Dichtel wrote:
> Le 27/04/2016 03:14, Balbir Singh a écrit :
>>
>>
>> On 23/04/16 01:31, Nicolas Dichtel wrote:
>>> Goal of this patch is to use the new libnl API to align netlink attribute
>>> when needed.
>>> The layout of the netlink message will be a bit different after the patch,
>>> because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
>>> attribute instead of before it.
>>>
>>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>
>> The layout will change/break user space -- I've not tested the patch though..
> Sigh.
> 
> I quote David:
> "All userspace components using netlink should always ignore attributes
> they do not recognize in dumps.
> 
> This is one of the most basic principles of netlink"
> 
> Do you have some pointers so I can made some tests?
> 

Please try

https://www.kernel.org/doc/Documentation/accounting/getdelays.c

iotop uses it as well. My concern is ABI breakage of user space.

Balbir Singh
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH nf-next] netfilter: allow logging from non-init namespaces
From: Michal Kubecek @ 2016-04-27 12:48 UTC (permalink / raw)
  To: netfilter-devel
  Cc: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	Jonathan Corbet, coreteam, netdev, linux-doc, linux-kernel,
	bridge

Commit 69b34fb996b2 ("netfilter: xt_LOG: add net namespace support for
xt_LOG") disabled logging packets using the LOG target from non-init
namespaces. The motivation was to prevent containers from flooding
kernel log of the host. The plan was to keep it that way until syslog
namespace implementation allows containers to log in a safe way.

However, the work on syslog namespace seems to have hit a dead end
somewhere in 2013 and there are users who want to use xt_LOG in all
network namespaces. This patch allows to do so by setting

  /proc/sys/net/netfilter/nf_log_all_netns

to a nonzero value. This sysctl is only accessible from init_net so that
one cannot switch the behaviour from inside a container.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
---
 Documentation/networking/netfilter-sysctl.txt | 10 ++++++++++
 include/net/netfilter/nf_log.h                |  3 +++
 net/bridge/netfilter/ebt_log.c                |  2 +-
 net/ipv4/netfilter/nf_log_arp.c               |  2 +-
 net/ipv4/netfilter/nf_log_ipv4.c              |  2 +-
 net/ipv6/netfilter/nf_log_ipv6.c              |  2 +-
 net/netfilter/nf_log.c                        | 22 ++++++++++++++++++++++
 7 files changed, 39 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/networking/netfilter-sysctl.txt

diff --git a/Documentation/networking/netfilter-sysctl.txt b/Documentation/networking/netfilter-sysctl.txt
new file mode 100644
index 000000000000..55791e50e169
--- /dev/null
+++ b/Documentation/networking/netfilter-sysctl.txt
@@ -0,0 +1,10 @@
+/proc/sys/net/netfilter/* Variables:
+
+nf_log_all_netns - BOOLEAN
+	0 - disabled (default)
+	not 0 - enabled
+
+	By default, only init_net namespace can log packets into kernel log
+	with LOG target; this aims to prevent containers from flooding host
+	kernel log. If enabled, this target also works in other network
+	namespaces. This variable is only accessible from init_net.
diff --git a/include/net/netfilter/nf_log.h b/include/net/netfilter/nf_log.h
index 57639fca223a..8c4b018eef72 100644
--- a/include/net/netfilter/nf_log.h
+++ b/include/net/netfilter/nf_log.h
@@ -49,6 +49,9 @@ struct nf_logger {
 	struct module		*me;
 };
 
+/* sysctl_nf_log_all_netns - allow LOG target in all network namespaces */
+extern int sysctl_nf_log_all_netns;
+
 /* Function to register/unregister log function. */
 int nf_log_register(u_int8_t pf, struct nf_logger *logger);
 void nf_log_unregister(struct nf_logger *logger);
diff --git a/net/bridge/netfilter/ebt_log.c b/net/bridge/netfilter/ebt_log.c
index 152300d164ac..735230ec0e49 100644
--- a/net/bridge/netfilter/ebt_log.c
+++ b/net/bridge/netfilter/ebt_log.c
@@ -78,7 +78,7 @@ ebt_log_packet(struct net *net, u_int8_t pf, unsigned int hooknum,
 	unsigned int bitmask;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	spin_lock_bh(&ebt_log_lock);
diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c
index e7ad950cf9ef..39e1348dfe45 100644
--- a/net/ipv4/netfilter/nf_log_arp.c
+++ b/net/ipv4/netfilter/nf_log_arp.c
@@ -87,7 +87,7 @@ static void nf_log_arp_packet(struct net *net, u_int8_t pf,
 	struct nf_log_buf *m;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	m = nf_log_buf_open();
diff --git a/net/ipv4/netfilter/nf_log_ipv4.c b/net/ipv4/netfilter/nf_log_ipv4.c
index 076aadda0473..2b0083112ed8 100644
--- a/net/ipv4/netfilter/nf_log_ipv4.c
+++ b/net/ipv4/netfilter/nf_log_ipv4.c
@@ -319,7 +319,7 @@ static void nf_log_ip_packet(struct net *net, u_int8_t pf,
 	struct nf_log_buf *m;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	m = nf_log_buf_open();
diff --git a/net/ipv6/netfilter/nf_log_ipv6.c b/net/ipv6/netfilter/nf_log_ipv6.c
index 8dd869642f45..04960486d0e2 100644
--- a/net/ipv6/netfilter/nf_log_ipv6.c
+++ b/net/ipv6/netfilter/nf_log_ipv6.c
@@ -351,7 +351,7 @@ static void nf_log_ip6_packet(struct net *net, u_int8_t pf,
 	struct nf_log_buf *m;
 
 	/* FIXME: Disabled from containers until syslog ns is supported */
-	if (!net_eq(net, &init_net))
+	if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
 		return;
 
 	m = nf_log_buf_open();
diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index a5d41dfa9f05..a5f4c57b14c5 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -16,6 +16,9 @@
 #define NF_LOG_PREFIXLEN		128
 #define NFLOGGER_NAME_LEN		64
 
+int sysctl_nf_log_all_netns __read_mostly;
+EXPORT_SYMBOL(sysctl_nf_log_all_netns);
+
 static struct nf_logger __rcu *loggers[NFPROTO_NUMPROTO][NF_LOG_TYPE_MAX] __read_mostly;
 static DEFINE_MUTEX(nf_log_mutex);
 
@@ -392,6 +395,18 @@ static const struct file_operations nflog_file_ops = {
 #ifdef CONFIG_SYSCTL
 static char nf_log_sysctl_fnames[NFPROTO_NUMPROTO-NFPROTO_UNSPEC][3];
 static struct ctl_table nf_log_sysctl_table[NFPROTO_NUMPROTO+1];
+static struct ctl_table_header *nf_log_sysctl_fhdr;
+
+static struct ctl_table nf_log_sysctl_ftable[] = {
+	{
+		.procname	= "nf_log_all_netns",
+		.data		= &sysctl_nf_log_all_netns,
+		.maxlen		= sizeof(sysctl_nf_log_all_netns),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{ }
+};
 
 static int nf_log_proc_dostring(struct ctl_table *table, int write,
 			 void __user *buffer, size_t *lenp, loff_t *ppos)
@@ -461,6 +476,10 @@ static int netfilter_log_sysctl_init(struct net *net)
 			nf_log_sysctl_table[i].extra1 =
 				(void *)(unsigned long) i;
 		}
+		nf_log_sysctl_fhdr = register_net_sysctl(net, "net/netfilter",
+							 nf_log_sysctl_ftable);
+		if (!nf_log_sysctl_fhdr)
+			goto err_freg;
 	}
 
 	net->nf.nf_log_dir_header = register_net_sysctl(net,
@@ -474,6 +493,7 @@ static int netfilter_log_sysctl_init(struct net *net)
 err_reg:
 	if (!net_eq(net, &init_net))
 		kfree(table);
+err_freg:
 err_alloc:
 	return -ENOMEM;
 }
@@ -486,6 +506,8 @@ static void netfilter_log_sysctl_exit(struct net *net)
 	unregister_net_sysctl_table(net->nf.nf_log_dir_header);
 	if (!net_eq(net, &init_net))
 		kfree(table);
+	else
+		unregister_net_sysctl_table(nf_log_sysctl_fhdr);
 }
 #else
 static int netfilter_log_sysctl_init(struct net *net)
-- 
2.8.1


^ permalink raw reply related

* Re: [PATCH] wcn36xx: Set SMD timeout to 10 seconds
From: Kalle Valo @ 2016-04-27 12:55 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Eugene Krasnikov, linux-wireless, netdev, linux-kernel,
	John Stultz
In-Reply-To: <1461272982-7233-1-git-send-email-bjorn.andersson@linaro.org>

Bjorn Andersson <bjorn.andersson@linaro.org> writes:

> After booting the wireless subsystem and uploading the NV blob to the
> WCNSS_CTRL service the remote continues to do things and will not start
> servicing wlan-requests for another 2-5 seconds (measured).
>
> The downstream code does not have any special handling for this case,
> but has a timeout of 10 seconds for the communication layer. By
> extending the wcn36xx timeout to match this we follows the same flow for
> the boot procedure and can successfully configure WiFi as wlan0 is
> registered.
>
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Applied, thanks.

-- 
Kalle Valo

^ permalink raw reply

* Re: linux-next: manual merge of the net-next tree with the net tree
From: Saeed Mahameed @ 2016-04-27 13:13 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Linux Netdev List, linux-next, linux-kernel,
	Saeed Mahameed, Gal Pressman
In-Reply-To: <20160427120109.37449920@canb.auug.org.au>

On Wed, Apr 27, 2016 at 5:01 AM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>
> between commit:
>
>   d8edd2469ace ("et/mlx5e: Fix minimum MTU")
>
> from the net tree and commit:
>
>   0e405443e803 ("net/mlx5e: Improve set features ndo resiliency")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> --
> Cheers,
> Stephen Rothwell
>
> diff --cc drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 67d548b70e14,5bad17d37d7b..000000000000
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@@ -2025,9 -2214,49 +2240,52 @@@ static int set_feature_rx_vlan(struct n
>         return err;
>   }
>
> + static int mlx5e_handle_feature(struct net_device *netdev,
> +                               netdev_features_t wanted_features,
> +                               netdev_features_t feature,
> +                               mlx5e_feature_handler feature_handler)
> + {
> +       netdev_features_t changes = wanted_features ^ netdev->features;
> +       bool enable = !!(wanted_features & feature);
> +       int err;
> +
> +       if (!(changes & feature))
> +               return 0;
> +
> +       err = feature_handler(netdev, enable);
> +       if (err) {
> +               netdev_err(netdev, "%s feature 0x%llx failed err %d\n",
> +                          enable ? "Enable" : "Disable", feature, err);
> +               return err;
> +       }
> +
> +       MLX5E_SET_FEATURE(netdev, feature, enable);
> +       return 0;
> + }
> +
> + static int mlx5e_set_features(struct net_device *netdev,
> +                             netdev_features_t features)
> + {
> +       int err;
> +
> +       err  = mlx5e_handle_feature(netdev, features, NETIF_F_LRO,
> +                                   set_feature_lro);
> +       err |= mlx5e_handle_feature(netdev, features,
> +                                   NETIF_F_HW_VLAN_CTAG_FILTER,
> +                                   set_feature_vlan_filter);
> +       err |= mlx5e_handle_feature(netdev, features, NETIF_F_HW_TC,
> +                                   set_feature_tc_num_filters);
> +       err |= mlx5e_handle_feature(netdev, features, NETIF_F_RXALL,
> +                                   set_feature_rx_all);
> +       err |= mlx5e_handle_feature(netdev, features, NETIF_F_HW_VLAN_CTAG_RX,
> +                                   set_feature_rx_vlan);
> +
> +       return err ? -EINVAL : 0;
> + }
> +
>  +#define MXL5_HW_MIN_MTU 64
>  +#define MXL5E_MIN_MTU (MXL5_HW_MIN_MTU + ETH_FCS_LEN)
>  +
>   static int mlx5e_change_mtu(struct net_device *netdev, int new_mtu)
>   {
>         struct mlx5e_priv *priv = netdev_priv(netdev);
> @@@ -2633,18 -2966,10 +2997,19 @@@ static void mlx5e_destroy_netdev(struc
>         schedule_work(&priv->set_rx_mode_work);
>         mlx5e_disable_async_events(priv);
>         flush_scheduled_work();
>  -      unregister_netdev(netdev);
>  +      if (test_bit(MLX5_INTERFACE_STATE_SHUTDOWN, &mdev->intf_state)) {
>  +              netif_device_detach(netdev);
>  +              mutex_lock(&priv->state_lock);
>  +              if (test_bit(MLX5E_STATE_OPENED, &priv->state))
>  +                      mlx5e_close_locked(netdev);
>  +              mutex_unlock(&priv->state_lock);
>  +      } else {
>  +              unregister_netdev(netdev);
>  +      }
>  +
>         mlx5e_tc_cleanup(priv);
>         mlx5e_vxlan_cleanup(priv);
> +       mlx5e_destroy_q_counter(priv);
>         mlx5e_destroy_flow_tables(priv);
>         mlx5e_destroy_tirs(priv);
>         mlx5e_destroy_rqt(priv, MLX5E_SINGLE_RQ_RQT);

Looks good, and it is pretty straightforward.
Dave will have to take all hunks from both patches, same as you did.

Thank you Stephen.

Saeed.

^ permalink raw reply

* RE: [net-next PATCH 4/4] samples/bpf: allow make to be run from samples/bpf/ directory
From: David Laight @ 2016-04-27 13:15 UTC (permalink / raw)
  To: 'Alexei Starovoitov', Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, bblanco@plumgrid.com,
	borkmann@iogearbox.net, linux-kbuild@vger.kernel.org
In-Reply-To: <20160426143524.GB39797@ast-mbp.thefacebook.com>

From: Alexei Starovoitov
> Sent: 26 April 2016 15:35
> On Tue, Apr 26, 2016 at 01:09:32PM +0200, Jesper Dangaard Brouer wrote:
> > It is not intuitive that 'make' must be run from the top level
> > directory with argument "samples/bpf/" to compile these eBPF samples.
> >
> > Introduce a kbuild make file trick that allow make to be run from the
> > "samples/bpf/" directory itself.  It basically change to the top level
> > directory and call "make samples/bpf/" with the "/" slash after the
> > directory name.
> >
> > Also add a clean target that only cleans this directory, by taking
> > advantage of the kbuild external module setting M=$PWD.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  samples/bpf/Makefile   |    8 ++++++++
> >  samples/bpf/README.rst |    3 +++
> >  2 files changed, 11 insertions(+)
> >
> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > index 7ac66f5bbbf5..7370dfee482f 100644
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -85,6 +85,14 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
> >  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
> >  LLC ?= llc
> >
> > +# Trick to allow make to be run from this directory
> > +all:
> > +	$(MAKE) -C ../../ $$PWD/
> > +
> > +clean:
> > +	$(MAKE) -C ../../ M=$$PWD clean
> > +	@rm -f *~
> > +
> 
> nice trick!

Another nice 'trick' is to write a GNUmakefie containing the local rules.
GNU make will read it in preference to Makefile and makefile.

	David

^ permalink raw reply

* [PATCH v2 1/2] net: nps_enet: Sync access to packet sent flag
From: Elad Kanfi @ 2016-04-27 13:18 UTC (permalink / raw)
  To: davem; +Cc: noamca, linux-kernel, abrodkin, eladkan, talz, netdev
In-Reply-To: <1461763110-15263-1-git-send-email-eladkan@mellanox.com>

From: Elad Kanfi <eladkan@mellanox.com>

Below is a description of a possible problematic
sequence. CPU-A is sending a frame and CPU-B handles
the interrupt that indicates the frame was sent. CPU-B
reads an invalid value of tx_packet_sent.

	CPU-A				CPU-B
	-----				-----
	nps_enet_send_frame
	.
	.
	tx_packet_sent = true
	order HW to start tx
	.
	.
	HW complete tx
			    ------> 	get tx complete interrupt
					.
					.
					if(tx_packet_sent == true)

	end memory transaction
	(tx_packet_sent actually
	 written)

Problem solution:

Add a memory barrier after setting tx_packet_sent,
in order to make sure that it is written before
the packet is sent.

Signed-off-by: Elad Kanfi <eladkan@mellanox.com>

Acked-by: Noam Camus <noamca@mellanox.com>
---
 drivers/net/ethernet/ezchip/nps_enet.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ezchip/nps_enet.c b/drivers/net/ethernet/ezchip/nps_enet.c
index 1f23845..1373236 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.c
+++ b/drivers/net/ethernet/ezchip/nps_enet.c
@@ -389,6 +389,13 @@ static void nps_enet_send_frame(struct net_device *ndev,
 
 	/* Indicate SW is done */
 	priv->tx_packet_sent = true;
+
+	/* before the frame is sent we have to make
+	 * sure that priv->tx_packet_sent will be valid
+	 * for the CPU'S that handles the ISR and NAPI poll
+	 */
+	smp_wmb();
+
 	tx_ctrl_value |= NPS_ENET_ENABLE << TX_CTL_CT_SHIFT;
 	/* Send Frame */
 	nps_enet_reg_set(priv, NPS_ENET_REG_TX_CTL, tx_ctrl_value);
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] net: phy: at803x: Register 'link_change_notify' only for AT8030
From: Sebastian Frias @ 2016-04-27 13:50 UTC (permalink / raw)
  To: Daniel Mack, David S. Miller, netdev
  Cc: lkml, mason, Sergei Shtylyov, timur, Florian Fainelli
In-Reply-To: <5720A3C1.2040401@laposte.net>

Hi,

Sergei pointed out that the same patch was submitted yesterday by Timur (what are the chances?! :-) ) https://patchwork.ozlabs.org/patch/615019/

Regards,

Sebastian

On 04/27/2016 01:34 PM, Sebastian Frias wrote:
> There is no need to register the callback introduced by
> commit 13a56b449325 ("net: phy: at803x: Add support for hardware reset")
> for non faulty PHYs.
> 
> The check on the PHY ID is not necessary anymore and thus has been
> removed from the callback implementation as well.
> 
> Fixes: 13a56b449325 ("net: phy: at803x: Add support for hardware reset")
> 
> Signed-off-by: Sebastian Frias <sf84@laposte.net>
> ---
>  drivers/net/phy/at803x.c | 43 ++++++++++++++++++++-----------------------
>  1 file changed, 20 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> index b3ffaee..7fdc676 100644
> --- a/drivers/net/phy/at803x.c
> +++ b/drivers/net/phy/at803x.c
> @@ -353,33 +353,32 @@ static void at803x_link_change_notify(struct phy_device *phydev)
>  	struct at803x_priv *priv = phydev->priv;
>  
>  	/*
> -	 * Conduct a hardware reset for AT8030 every time a link loss is
> +	 * Conduct a hardware reset for AT8030 (this callback is only
> +	 * registered for AT8030 at the moment) every time a link loss is
>  	 * signalled. This is necessary to circumvent a hardware bug that
>  	 * occurs when the cable is unplugged while TX packets are pending
>  	 * in the FIFO. In such cases, the FIFO enters an error mode it
>  	 * cannot recover from by software.
>  	 */
> -	if (phydev->drv->phy_id == ATH8030_PHY_ID) {
> -		if (phydev->state == PHY_NOLINK) {
> -			if (priv->gpiod_reset && !priv->phy_reset) {
> -				struct at803x_context context;
> -
> -				at803x_context_save(phydev, &context);
> -
> -				gpiod_set_value(priv->gpiod_reset, 1);
> -				msleep(1);
> -				gpiod_set_value(priv->gpiod_reset, 0);
> -				msleep(1);
> -
> -				at803x_context_restore(phydev, &context);
> -
> -				phydev_dbg(phydev, "%s(): phy was reset\n",
> -					   __func__);
> -				priv->phy_reset = true;
> -			}
> -		} else {
> -			priv->phy_reset = false;
> +	if (phydev->state == PHY_NOLINK) {
> +		if (priv->gpiod_reset && !priv->phy_reset) {
> +			struct at803x_context context;
> +
> +			at803x_context_save(phydev, &context);
> +
> +			gpiod_set_value(priv->gpiod_reset, 1);
> +			msleep(1);
> +			gpiod_set_value(priv->gpiod_reset, 0);
> +			msleep(1);
> +
> +			at803x_context_restore(phydev, &context);
> +
> +			phydev_dbg(phydev, "%s(): phy was reset\n",
> +				   __func__);
> +			priv->phy_reset = true;
>  		}
> +	} else {
> +		priv->phy_reset = false;
>  	}
>  }
>  
> @@ -391,7 +390,6 @@ static struct phy_driver at803x_driver[] = {
>  	.phy_id_mask		= 0xffffffef,
>  	.probe			= at803x_probe,
>  	.config_init		= at803x_config_init,
> -	.link_change_notify	= at803x_link_change_notify,
>  	.set_wol		= at803x_set_wol,
>  	.get_wol		= at803x_get_wol,
>  	.suspend		= at803x_suspend,
> @@ -427,7 +425,6 @@ static struct phy_driver at803x_driver[] = {
>  	.phy_id_mask		= 0xffffffef,
>  	.probe			= at803x_probe,
>  	.config_init		= at803x_config_init,
> -	.link_change_notify	= at803x_link_change_notify,
>  	.set_wol		= at803x_set_wol,
>  	.get_wol		= at803x_get_wol,
>  	.suspend		= at803x_suspend,
> 

^ permalink raw reply

* [PATCH v2 0/2] Net driver bugs fix
From: Elad Kanfi @ 2016-04-27 13:18 UTC (permalink / raw)
  To: davem; +Cc: noamca, linux-kernel, abrodkin, eladkan, talz, netdev

From: Elad Kanfi <eladkan@mellanox.com>

v2:
Remove code style commit for now.
Code style commit will be added after the bugs fix will be approved.

Summary:
 1. Bug description: TX done interrupts that arrives while interrupts
    are masked, during NAPI poll, will not trigger an interrupt handling.
    Since TX interrupt is of level edge we will lose the TX done interrupt.
    As a result all pending tx frames will get no service.

    Solution: Check if there is a pending tx request after unmasking the
    interrupt and if answer is yes then re-add ourselves to
    the NAPI poll list.

 2. Bug description: CPU-A before sending a frame will set a variable
    to true. CPU-B that executes the tx done interrupt service routine
    might read a non valid value of that variable.

    Solution: Add a memory barrier at the tx sending function.

Elad Kanfi (2):
  net: nps_enet: Sync access to packet sent flag
  net: nps_enet: bug fix - handle lost tx interrupts

 drivers/net/ethernet/ezchip/nps_enet.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

^ permalink raw reply

* RE: [net-next PATCH V2 2/5] samples/bpf: Makefile verify LLVM compiler avail and bpf target is supported
From: David Laight @ 2016-04-27 13:52 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', netdev@vger.kernel.org
  Cc: linux-kbuild@vger.kernel.org, bblanco@plumgrid.com,
	naveen.n.rao@linux.vnet.ibm.com, borkmann@iogearbox.net,
	alexei.starovoitov@gmail.com
In-Reply-To: <20160426162716.22962.36473.stgit@firesoul>

From: Jesper Dangaard Brouer
> Sent: 26 April 2016 17:27
> Make compiling samples/bpf more user friendly, by detecting if LLVM
> compiler tool 'llc' is available, and also detect if the 'bpf' target
> is available in this version of LLVM.
...
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 5bae9536f100..45859c99f573 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -85,6 +85,24 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
>  LLC ?= llc
> 
> +# Verify LLVM compiler is available and bpf target is supported
> +.PHONY: verify_cmd_llc verify_target_bpf
> +
> +verify_cmd_llc:
> +	@if ! (which "${LLC}" > /dev/null 2>&1); then \

You should use 'type' not 'which'.
'type' is a posix shell builtin, 'which' is a script/program
that tries to emulate a 'csh' builtin.
You want to know whether the shell that make runs can execute ${LLC}
not whether a csh would be able to run it.

You might also want to worry about:
    LLC="/path_to_llc/llc -extra_arg" make fubar

	David


^ permalink raw reply

* Re: [PATCH v2 1/2] net: nps_enet: Sync access to packet sent flag
From: Lino Sanfilippo @ 2016-04-27 13:56 UTC (permalink / raw)
  To: Elad Kanfi, davem; +Cc: noamca, linux-kernel, abrodkin, talz, netdev
In-Reply-To: <1461763110-15263-2-git-send-email-eladkan@mellanox.com>

Hi,

On 27.04.2016 15:18, Elad Kanfi wrote:
> From: Elad Kanfi <eladkan@mellanox.com>
>
> Below is a description of a possible problematic
> sequence. CPU-A is sending a frame and CPU-B handles
> the interrupt that indicates the frame was sent. CPU-B
> reads an invalid value of tx_packet_sent.
>
> 	CPU-A				CPU-B
> 	-----				-----
> 	nps_enet_send_frame
> 	.
> 	.
> 	tx_packet_sent = true
> 	order HW to start tx
> 	.
> 	.
> 	HW complete tx
> 			    ------> 	get tx complete interrupt
> 					.
> 					.
> 					if(tx_packet_sent == true)
>
> 	end memory transaction
> 	(tx_packet_sent actually
> 	 written)
>
> Problem solution:
>
> Add a memory barrier after setting tx_packet_sent,
> in order to make sure that it is written before
> the packet is sent.

Should not those SMP memory barriers be paired? AFAIK you do not only have to make sure
that the value written by CPU-A actually is written to memory but also that CPU-B
reads that value from memory. At least this is what I have understood from memory-barriers.txt...

Regards,
Lino

^ permalink raw reply

* [PATCH iproute2 0/2] ip link gre: fix external mode handling
From: Jiri Benc @ 2016-04-27 14:11 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Paolo Abeni, Pravin Shelar

Fix two bugs with handling of the 'external' keyword for GRE.

Jiri Benc (2):
  ip link gre: create interfaces in external mode correctly
  ip link gre: print only relevant info in external mode

 ip/link_gre.c | 43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH iproute2 1/2] ip link gre: create interfaces in external mode correctly
From: Jiri Benc @ 2016-04-27 14:11 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Paolo Abeni, Pravin Shelar
In-Reply-To: <cover.1461766016.git.jbenc@redhat.com>

For GRE interfaces in 'external' mode, the kernel ignores all manual
settings like remote IP address or TTL. However, for some of those
attributes, kernel checks their value and does not allow them to be zero
(even though they're ignored later).

Currently, 'ip link' always includes all attributes in the netlink message.
This leads to problem with creating interfaces in 'external' mode. For
example, this command does not work:

ip link add gre1 type gretap external

and needs a bogus remote IP address to be specified, as the kernel enforces
remote IP address to be either not present, or not null.

Ignore the parameters that do not make sense in 'external' mode.
Unfortunately, we cannot error out, as there may be existing deployments
that workarounded the bug by specifying bogus values.

Fixes: 926b39e1feffd ("gre: add support for collect metadata flag")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 ip/link_gre.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index bcf003aaa5d7..36ce1252675b 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -315,24 +315,26 @@ get_failed:
 		return -1;
 	}
 
-	addattr32(n, 1024, IFLA_GRE_IKEY, ikey);
-	addattr32(n, 1024, IFLA_GRE_OKEY, okey);
-	addattr_l(n, 1024, IFLA_GRE_IFLAGS, &iflags, 2);
-	addattr_l(n, 1024, IFLA_GRE_OFLAGS, &oflags, 2);
-	addattr_l(n, 1024, IFLA_GRE_LOCAL, &saddr, 4);
-	addattr_l(n, 1024, IFLA_GRE_REMOTE, &daddr, 4);
-	addattr_l(n, 1024, IFLA_GRE_PMTUDISC, &pmtudisc, 1);
-	if (link)
-		addattr32(n, 1024, IFLA_GRE_LINK, link);
-	addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
-	addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
+	if (!metadata) {
+		addattr32(n, 1024, IFLA_GRE_IKEY, ikey);
+		addattr32(n, 1024, IFLA_GRE_OKEY, okey);
+		addattr_l(n, 1024, IFLA_GRE_IFLAGS, &iflags, 2);
+		addattr_l(n, 1024, IFLA_GRE_OFLAGS, &oflags, 2);
+		addattr_l(n, 1024, IFLA_GRE_LOCAL, &saddr, 4);
+		addattr_l(n, 1024, IFLA_GRE_REMOTE, &daddr, 4);
+		addattr_l(n, 1024, IFLA_GRE_PMTUDISC, &pmtudisc, 1);
+		if (link)
+			addattr32(n, 1024, IFLA_GRE_LINK, link);
+		addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
+		addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
+	} else {
+		addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
+	}
 
 	addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype);
 	addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags);
 	addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport));
 	addattr16(n, 1024, IFLA_GRE_ENCAP_DPORT, htons(encapdport));
-	if (metadata)
-		addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
 
 	return 0;
 }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH iproute2 2/2] ip link gre: print only relevant info in external mode
From: Jiri Benc @ 2016-04-27 14:11 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Paolo Abeni, Pravin Shelar
In-Reply-To: <cover.1461766016.git.jbenc@redhat.com>

Display only attributes that are relevant when a GRE interface is in
'external' mode instead of the default values (which are ignored by the
kernel even if passed back).

Fixes: 926b39e1feffd ("gre: add support for collect metadata flag")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 ip/link_gre.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index 36ce1252675b..492c22053b89 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -339,7 +339,7 @@ get_failed:
 	return 0;
 }
 
-static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+static void gre_print_direct_opt(FILE *f, struct rtattr *tb[])
 {
 	char s2[64];
 	const char *local = "any";
@@ -347,9 +347,6 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 	unsigned int iflags = 0;
 	unsigned int oflags = 0;
 
-	if (!tb)
-		return;
-
 	if (tb[IFLA_GRE_REMOTE]) {
 		unsigned int addr = rta_getattr_u32(tb[IFLA_GRE_REMOTE]);
 
@@ -421,8 +418,16 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 		fputs("icsum ", f);
 	if (oflags & GRE_CSUM)
 		fputs("ocsum ", f);
+}
 
-	if (tb[IFLA_GRE_COLLECT_METADATA])
+static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	if (!tb)
+		return;
+
+	if (!tb[IFLA_GRE_COLLECT_METADATA])
+		gre_print_direct_opt(f, tb);
+	else
 		fputs("external ", f);
 
 	if (tb[IFLA_GRE_ENCAP_TYPE] &&
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v2 2/2] net: nps_enet: bug fix - handle lost tx interrupts
From: Elad Kanfi @ 2016-04-27 13:18 UTC (permalink / raw)
  To: davem; +Cc: noamca, linux-kernel, abrodkin, eladkan, talz, netdev
In-Reply-To: <1461763110-15263-1-git-send-email-eladkan@mellanox.com>

From: Elad Kanfi <eladkan@mellanox.com>

The tx interrupt is of edge type, and in case such interrupt is triggered
while it is masked it will not be handled even after tx interrupts are
re-enabled in the end of NAPI poll.
This will cause tx network to stop in the following scenario:
 * Rx is being handled, hence interrupts are masked.
 * Tx interrupt is triggered after checking if there is some tx to handle
   and before re-enabling the interrupts.
In this situation only rx transaction will release tx requests.

In order to handle the tx that was missed( if there was one ),
a NAPI reschdule was added after enabling the interrupts.

Signed-off-by: Elad Kanfi <eladkan@mellanox.com>

Acked-by: Noam Camus <noamca@mellanox.com>
---
 drivers/net/ethernet/ezchip/nps_enet.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ezchip/nps_enet.c b/drivers/net/ethernet/ezchip/nps_enet.c
index 1373236..31e27a7 100644
--- a/drivers/net/ethernet/ezchip/nps_enet.c
+++ b/drivers/net/ethernet/ezchip/nps_enet.c
@@ -183,6 +183,8 @@ static int nps_enet_poll(struct napi_struct *napi, int budget)
 	work_done = nps_enet_rx_handler(ndev);
 	if (work_done < budget) {
 		u32 buf_int_enable_value = 0;
+		u32 tx_ctrl_value = nps_enet_reg_get(priv, NPS_ENET_REG_TX_CTL);
+		u32 tx_ctrl_ct = (tx_ctrl_value & TX_CTL_CT_MASK) >> TX_CTL_CT_SHIFT;
 
 		napi_complete(napi);
 
@@ -192,6 +194,18 @@ static int nps_enet_poll(struct napi_struct *napi, int budget)
 
 		nps_enet_reg_set(priv, NPS_ENET_REG_BUF_INT_ENABLE,
 				 buf_int_enable_value);
+
+		/* in case we will get a tx interrupt while interrupts
+		 * are masked, we will lose it since the tx is edge interrupt.
+		 * specifically, while executing the code section above,
+		 * between nps_enet_tx_handler and the interrupts enable, all
+		 * tx requests will be stuck until we will get an rx interrupt.
+		 * the two code lines below will solve this situation by
+		 * re-adding ourselves to the poll list.
+		 */
+
+		if (priv->tx_packet_sent && !tx_ctrl_ct)
+			napi_reschedule(napi);
 	}
 
 	return work_done;
-- 
1.7.1

^ permalink raw reply related

* [PATCH] MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
From: Andreas Larsson @ 2016-04-27 14:46 UTC (permalink / raw)
  To: David S. Miller, netdev; +Cc: linux-kernel, Sam Ravnborg, software

Signed-off-by: Andreas Larsson <andreas@gaisler.com>
---
 MAINTAINERS |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d5b4be..08ffebd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4903,7 +4903,7 @@ F:	net/ipv4/gre_offload.c
 F:	include/net/gre.h
 
 GRETH 10/100/1G Ethernet MAC device driver
-M:	Kristoffer Glembo <kristoffer@gaisler.com>
+M:	Andreas Larsson <andreas@gaisler.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/aeroflex/
-- 
1.7.10.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox