Netdev List


* [PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers
From: Tejun Heo @ 2010-05-30 20:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100530112925.GB27611@redhat.com>

Apply the cpumask and cgroup of the initializing task to the created
vhost poller.

Based on Sridhar Samudrala's patch.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
---
 drivers/vhost/vhost.c |   36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)

Index: work/drivers/vhost/vhost.c
===================================================================
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/cgroup.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -176,12 +177,30 @@ repeat:
 long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
-	struct task_struct *poller;
-	int i;
+	struct task_struct *poller = NULL;
+	cpumask_var_t mask;
+	int i, ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out;

 	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
-	if (IS_ERR(poller))
-		return PTR_ERR(poller);
+	if (IS_ERR(poller)) {
+		ret = PTR_ERR(poller);
+		goto out;
+	}
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out;
+
+	ret = sched_setaffinity(poller->pid, mask);
+	if (ret)
+		goto out;
+
+	ret = cgroup_attach_task_current_cg(poller);
+	if (ret)
+		goto out;

 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
@@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
 			vhost_poll_init(&dev->vqs[i].poll,
 					dev->vqs[i].handle_kick, POLLIN, dev);
 	}
-	return 0;
+
+	wake_up_process(poller);	/* avoid contributing to loadavg */
+	ret = 0;
+out:
+	if (ret && !IS_ERR_OR_NULL(poller))
+		kthread_stop(poller);
+	free_cpumask_var(mask);
+	return ret;
 }

 /* Caller should have device mutex */


* [PATCH 2/3] cgroups: Add an API to attach a task to current task's cgroup
From: Tejun Heo @ 2010-05-30 20:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100530112925.GB27611@redhat.com>

From: Sridhar Samudrala <samudrala.sridhar@gmail.com>

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
---
 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |   23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

Index: work/include/linux/cgroup.h
===================================================================
--- work.orig/include/linux/cgroup.h
+++ work/include/linux/cgroup.h
@@ -570,6 +570,7 @@ struct task_struct *cgroup_iter_next(str
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+int cgroup_attach_task_current_cg(struct task_struct *);

 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
Index: work/kernel/cgroup.c
===================================================================
--- work.orig/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -1788,6 +1788,29 @@ out:
 	return retval;
 }

+/**
+ * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * @tsk: the task to be attached
+ */
+int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+	struct cgroupfs_root *root;
+	struct cgroup *cur_cg;
+	int retval = 0;
+
+	cgroup_lock();
+	for_each_active_root(root) {
+		cur_cg = task_cgroup_from_root(current, root);
+		retval = cgroup_attach_task(cur_cg, tsk);
+		if (retval)
+			break;
+	}
+	cgroup_unlock();
+
+	return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+
 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
  * held. May take task_lock of task


* [PATCH 1/3] vhost: replace vhost_workqueue with per-vhost kthread
From: Tejun Heo @ 2010-05-30 20:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100530112925.GB27611@redhat.com>

Replace vhost_workqueue with per-vhost kthread.  Other than callback
argument change from struct work_struct * to struct vhost_poll *,
there's no visible change to vhost_poll_*() interface.

This conversion is to make each vhost use a dedicated kthread so that
resource control via cgroup can be applied.

Partially based on Sridhar Samudrala's patch.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
---
Okay, here is a three-patch series to convert vhost to use a per-vhost
kthread, add cgroup_attach_task_current_cg() and apply it to the vhost
kthreads.  The conversion is mostly straightforward, although flush is
slightly tricky.

The problem is that I have no idea how to test this.  It builds fine
and I read it several times but it's entirely possible / likely that I
missed something.  Please proceed with caution (so, no sign off yet).

Thanks.

 drivers/vhost/net.c   |   58 +++++++++++++----------------
 drivers/vhost/vhost.c |   99 ++++++++++++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.h |   32 +++++++++-------
 3 files changed, 117 insertions(+), 72 deletions(-)

Index: work/drivers/vhost/net.c
===================================================================
--- work.orig/drivers/vhost/net.c
+++ work/drivers/vhost/net.c
@@ -294,54 +294,60 @@ static void handle_rx(struct vhost_net *
 	unuse_mm(net->dev.mm);
 }

-static void handle_tx_kick(struct work_struct *work)
+static void handle_tx_kick(struct vhost_poll *poll)
 {
-	struct vhost_virtqueue *vq;
-	struct vhost_net *net;
-	vq = container_of(work, struct vhost_virtqueue, poll.work);
-	net = container_of(vq->dev, struct vhost_net, dev);
+	struct vhost_virtqueue *vq =
+		container_of(poll, struct vhost_virtqueue, poll);
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	handle_tx(net);
 }

-static void handle_rx_kick(struct work_struct *work)
+static void handle_rx_kick(struct vhost_poll *poll)
 {
-	struct vhost_virtqueue *vq;
-	struct vhost_net *net;
-	vq = container_of(work, struct vhost_virtqueue, poll.work);
-	net = container_of(vq->dev, struct vhost_net, dev);
+	struct vhost_virtqueue *vq =
+		container_of(poll, struct vhost_virtqueue, poll);
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	handle_rx(net);
 }

-static void handle_tx_net(struct work_struct *work)
+static void handle_tx_net(struct vhost_poll *poll)
 {
-	struct vhost_net *net;
-	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
+	struct vhost_net *net =
+		container_of(poll, struct vhost_net, poll[VHOST_NET_VQ_TX]);
+
 	handle_tx(net);
 }

-static void handle_rx_net(struct work_struct *work)
+static void handle_rx_net(struct vhost_poll *poll)
 {
-	struct vhost_net *net;
-	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
+	struct vhost_net *net =
+		container_of(poll, struct vhost_net, poll[VHOST_NET_VQ_RX]);
+
 	handle_rx(net);
 }

 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_dev *dev;
 	int r;
+
 	if (!n)
 		return -ENOMEM;
+
+	dev = &n->dev;
 	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
 	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
 	}

-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;

 	f->private_data = n;
@@ -644,25 +650,13 @@ static struct miscdevice vhost_net_misc

 static int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
 module_init(vhost_net_init);

 static void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);

Index: work/drivers/vhost/vhost.c
===================================================================
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -17,12 +17,12 @@
 #include <linux/mm.h>
 #include <linux/miscdevice.h>
 #include <linux/mutex.h>
-#include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/poll.h>
 #include <linux/file.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
+#include <linux/kthread.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -37,8 +37,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };

-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -52,23 +50,27 @@ static void vhost_poll_func(struct file
 static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 			     void *key)
 {
-	struct vhost_poll *poll;
-	poll = container_of(wait, struct vhost_poll, wait);
+	struct vhost_poll *poll = container_of(wait, struct vhost_poll, wait);
+
 	if (!((unsigned long)key & poll->mask))
 		return 0;

-	queue_work(vhost_workqueue, &poll->work);
+	vhost_poll_queue(poll);
 	return 0;
 }

 /* Init poll structure */
-void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask)
+void vhost_poll_init(struct vhost_poll *poll, vhost_poll_fn_t fn,
+		     unsigned long mask, struct vhost_dev *dev)
 {
-	INIT_WORK(&poll->work, func);
+	poll->fn = fn;
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
+	INIT_LIST_HEAD(&poll->node);
+	init_waitqueue_head(&poll->done);
 	poll->mask = mask;
+	poll->dev = dev;
+	poll->queue_seq = poll->done_seq = 0;
 }

 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -88,16 +90,28 @@ void vhost_poll_stop(struct vhost_poll *
 	remove_wait_queue(poll->wqh, &poll->wait);
 }

-/* Flush any work that has been scheduled. When calling this, don't hold any
+/* Flush any poll that has been scheduled. When calling this, don't hold any
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	flush_work(&poll->work);
+	int seq = poll->queue_seq;
+
+	if (seq - poll->done_seq > 0)
+		wait_event(poll->done, seq - poll->done_seq <= 0);
+	smp_rmb();	/* paired with wmb in vhost_poller() */
 }

 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	queue_work(vhost_workqueue, &poll->work);
+	struct vhost_dev *dev = poll->dev;
+
+	spin_lock(&dev->poller_lock);
+	if (list_empty(&poll->node)) {
+		list_add_tail(&poll->node, &dev->poll_list);
+		poll->queue_seq++;
+		wake_up_process(dev->poller);
+	}
+	spin_unlock(&dev->poller_lock);
 }

 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -125,10 +139,50 @@ static void vhost_vq_reset(struct vhost_
 	vq->log_ctx = NULL;
 }

+static int vhost_poller(void *data)
+{
+	struct vhost_dev *dev = data;
+	struct vhost_poll *poll;
+
+repeat:
+	set_current_state(TASK_INTERRUPTIBLE);	/* mb paired w/ kthread_stop */
+
+	if (kthread_should_stop()) {
+		__set_current_state(TASK_RUNNING);
+		return 0;
+	}
+
+	poll = NULL;
+	spin_lock(&dev->poller_lock);
+	if (!list_empty(&dev->poll_list)) {
+		poll = list_first_entry(&dev->poll_list,
+					struct vhost_poll, node);
+		list_del_init(&poll->node);
+	}
+	spin_unlock(&dev->poller_lock);
+
+	if (poll) {
+		__set_current_state(TASK_RUNNING);
+		poll->fn(poll);
+		smp_wmb();	/* paired with rmb in vhost_poll_flush() */
+		poll->done_seq = poll->queue_seq;
+		wake_up_all(&poll->done);
+	} else
+		schedule();
+
+	goto repeat;
+}
+
 long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
+	struct task_struct *poller;
 	int i;
+
+	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
+	if (IS_ERR(poller))
+		return PTR_ERR(poller);
+
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
@@ -136,6 +190,9 @@ long vhost_dev_init(struct vhost_dev *de
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
+	spin_lock_init(&dev->poller_lock);
+	INIT_LIST_HEAD(&dev->poll_list);
+	dev->poller = poller;

 	for (i = 0; i < dev->nvqs; ++i) {
 		dev->vqs[i].dev = dev;
@@ -143,8 +200,7 @@ long vhost_dev_init(struct vhost_dev *de
 		vhost_vq_reset(dev, dev->vqs + i);
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick,
-					POLLIN);
+					dev->vqs[i].handle_kick, POLLIN, dev);
 	}
 	return 0;
 }
@@ -217,6 +273,8 @@ void vhost_dev_cleanup(struct vhost_dev
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
+
+	kthread_stop(dev->poller);
 }

 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1113,16 +1171,3 @@ void vhost_disable_notify(struct vhost_v
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-	vhost_workqueue = create_singlethread_workqueue("vhost");
-	if (!vhost_workqueue)
-		return -ENOMEM;
-	return 0;
-}
-
-void vhost_cleanup(void)
-{
-	destroy_workqueue(vhost_workqueue);
-}
Index: work/drivers/vhost/vhost.h
===================================================================
--- work.orig/drivers/vhost/vhost.h
+++ work/drivers/vhost/vhost.h
@@ -5,7 +5,6 @@
 #include <linux/vhost.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
-#include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <linux/file.h>
 #include <linux/skbuff.h>
@@ -20,19 +19,26 @@ enum {
 	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
 };

+struct vhost_poll;
+typedef void (*vhost_poll_fn_t)(struct vhost_poll *poll);
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
+	vhost_poll_fn_t		  fn;
 	poll_table                table;
 	wait_queue_head_t        *wqh;
 	wait_queue_t              wait;
-	/* struct which will handle all actual work. */
-	struct work_struct        work;
+	struct list_head	  node;
+	wait_queue_head_t	  done;
 	unsigned long		  mask;
+	struct vhost_dev	 *dev;
+	int			  queue_seq;
+	int			  done_seq;
 };

-void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask);
+void vhost_poll_init(struct vhost_poll *poll, vhost_poll_fn_t fn,
+		     unsigned long mask, struct vhost_dev *dev);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -63,7 +69,7 @@ struct vhost_virtqueue {
 	struct vhost_poll poll;

 	/* The routine to call when the Guest pings us, or timeout. */
-	work_func_t handle_kick;
+	vhost_poll_fn_t handle_kick;

 	/* Last available index we saw. */
 	u16 last_avail_idx;
@@ -86,11 +92,11 @@ struct vhost_virtqueue {
 	struct iovec hdr[VHOST_NET_MAX_SG];
 	size_t hdr_size;
 	/* We use a kind of RCU to access private pointer.
-	 * All readers access it from workqueue, which makes it possible to
-	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
+	 * All readers access it from poller, which makes it possible to
+	 * flush the vhost_poll instead of synchronize_rcu. Therefore readers do
 	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
-	 * work item execution acts instead of rcu_read_lock() and the end of
-	 * work item execution acts instead of rcu_read_lock().
+	 * vhost_poll execution acts instead of rcu_read_lock() and the end of
+	 * vhost_poll execution acts instead of rcu_read_unlock().
 	 * Writers use virtqueue mutex. */
 	void *private_data;
 	/* Log write descriptors */
@@ -110,6 +116,9 @@ struct vhost_dev {
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
+	spinlock_t poller_lock;
+	struct list_head poll_list;
+	struct task_struct *poller;
 };

 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
@@ -136,9 +145,6 @@ bool vhost_enable_notify(struct vhost_vi
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);

-int vhost_init(void);
-void vhost_cleanup(void);
-
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
 		if ((vq)->error_ctx)                               \

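The flush in patch 1/3 no longer uses flush_work(); it compares two per-poll sequence counters instead. Below is a minimal userspace sketch of that comparison, not the kernel code itself. It assumes unsigned counters and a hypothetical helper name, flush_pending(); the patch uses plain int fields and open-codes the test. The sketch only illustrates why the difference is tested against zero rather than comparing the counters directly: the test keeps working after the counters wrap around.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters kept in struct vhost_poll. */
struct seq_pair {
	unsigned int queue_seq;	/* bumped each time the poll is queued */
	unsigned int done_seq;	/* set to queue_seq after the callback ran */
};

/*
 * Returns true while work queued up to 'seq' has not completed yet.
 * The subtraction is done on unsigned values, so wraparound is well
 * defined.  Converting the result to int is implementation-defined in
 * ISO C; this sketch assumes the usual two's complement behaviour.
 */
static bool flush_pending(const struct seq_pair *p, unsigned int seq)
{
	assert(p != NULL);
	return (int)(seq - p->done_seq) > 0;
}
```

A flusher would sample queue_seq once and then wait until flush_pending() turns false for that sampled value, which mirrors the wait_event() condition in vhost_poll_flush().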

* Re: Subject: [PATCH] net/ipv6: Use GFP_ATOMIC when a lock is held
From: Eric Dumazet @ 2010-05-30 20:11 UTC (permalink / raw)
  To: Julia Lawall
  Cc: David S. Miller, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev,
	linux-kernel, kernel-janitors
In-Reply-To: <Pine.LNX.4.64.1005302147310.19253@ask.diku.dk>

Le dimanche 30 mai 2010 à 21:48 +0200, Julia Lawall a écrit :
> From: Julia Lawall <julia@diku.dk>
> 
> A spin lock is taken near the beginning of the enclosing function.
> 
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // <smpl>
> @@
> @@
> 
> spin_lock(...)
> ... when != spin_unlock(...)
> -GFP_KERNEL
> +GFP_ATOMIC
> // </smpl>
> 
> Signed-off-by: Julia Lawall <julia@diku.dk>
> 
> ---
>  net/ipv6/sit.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff -u -p a/net/ipv6/sit.c b/net/ipv6/sit.c
> --- a/net/ipv6/sit.c
> +++ b/net/ipv6/sit.c
> @@ -358,7 +358,7 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t
>  		goto out;
>  	}
>  
> -	p = kzalloc(sizeof(struct ip_tunnel_prl_entry), GFP_KERNEL);
> +	p = kzalloc(sizeof(struct ip_tunnel_prl_entry), GFP_ATOMIC);
>  	if (!p) {
>  		err = -ENOBUFS;
>  		goto out;

Nice catch, but what about allocating this outside of the locked
section?

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index e51e650..ff3dd84 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -340,6 +340,10 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a, int chg)
 	if (a->addr == htonl(INADDR_ANY))
 		return -EINVAL;
 
+	p = kzalloc(sizeof(struct ip_tunnel_prl_entry), GFP_KERNEL);
+	if (!p)
+		return -ENOBUFS;
+
 	spin_lock(&ipip6_prl_lock);
 
 	for (p = t->prl; p; p = p->next) {
@@ -358,19 +362,16 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a, int chg)
 		goto out;
 	}
 
-	p = kzalloc(sizeof(struct ip_tunnel_prl_entry), GFP_KERNEL);
-	if (!p) {
-		err = -ENOBUFS;
-		goto out;
-	}
 
 	p->next = t->prl;
 	p->addr = a->addr;
 	p->flags = a->flags;
 	t->prl_count++;
 	rcu_assign_pointer(t->prl, p);
+	p = NULL;
 out:
 	spin_unlock(&ipip6_prl_lock);
+	kfree(p);
 	return err;
 }
 



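The idea of allocating before the lock is taken, and freeing the entry afterwards if it turned out not to be needed, can be sketched in userspace C. This is an illustration under stated assumptions, not the sit.c code: struct prl_entry, struct prl_list, prl_add() and the toy spinlock built on a C11 atomic_flag are all hypothetical stand-ins, and calloc()/free() take the place of kzalloc()/kfree(). Note that the sketch walks the list with a separate cursor so that the preallocated entry is not overwritten by the duplicate search.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct prl_entry {
	struct prl_entry *next;
	uint32_t addr;
};

struct prl_list {
	atomic_flag lock;	/* toy spinlock standing in for ipip6_prl_lock */
	struct prl_entry *head;
	int count;
};

static void toy_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the flag was clear */
}

static void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Adds 'addr' to the list; returns 0, -ENOBUFS or -EEXIST. */
static int prl_add(struct prl_list *list, uint32_t addr)
{
	struct prl_entry *new, *cur;
	int err = 0;

	assert(list != NULL);

	/* The allocation may block, so it happens before the lock is taken. */
	new = calloc(1, sizeof(*new));
	if (!new)
		return -ENOBUFS;

	toy_lock(&list->lock);

	/* Separate cursor: 'new' must survive the duplicate search. */
	for (cur = list->head; cur; cur = cur->next) {
		if (cur->addr == addr) {
			err = -EEXIST;
			goto out;
		}
	}

	new->addr = addr;
	new->next = list->head;
	list->head = new;
	list->count++;
	new = NULL;	/* ownership moved to the list */
out:
	toy_unlock(&list->lock);
	free(new);	/* no-op on success, releases the unused entry on error */
	return err;
}
```

The cost of this design is one wasted allocation on the error paths; in exchange the allocation can sleep and nothing is allocated while the lock is held.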

* Subject: [PATCH] net/ipv6: Use GFP_ATOMIC when a lock is held
From: Julia Lawall @ 2010-05-30 19:48 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Pekka Savola (ipv6),
	James Morris, Hideaki YOSHIFUJI <yosh

From: Julia Lawall <julia@diku.dk>

A spin lock is taken near the beginning of the enclosing function.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
@@

spin_lock(...)
... when != spin_unlock(...)
-GFP_KERNEL
+GFP_ATOMIC
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>

---
 net/ipv6/sit.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -u -p a/net/ipv6/sit.c b/net/ipv6/sit.c
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -358,7 +358,7 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t
 		goto out;
 	}
 
-	p = kzalloc(sizeof(struct ip_tunnel_prl_entry), GFP_KERNEL);
+	p = kzalloc(sizeof(struct ip_tunnel_prl_entry), GFP_ATOMIC);
 	if (!p) {
 		err = -ENOBUFS;
 		goto out;


* Re: [Patch]r8169: remove unnecessary cast of readl()'s return value
From: Jeff Garzik @ 2010-05-30 17:36 UTC (permalink / raw)
  To: davem, romieu, netdev
In-Reply-To: <20100530122606.GC1146@host-a-55.ustcsz.edu.cn>

On 05/30/2010 08:26 AM, Junchang Wang wrote:
> readl() returns a 32-bit integer on all platforms.
> There is no need to cast its return value.
>
> Signed-off-by: Junchang Wang<junchangwang@gmail.com>
> ---
>   drivers/net/r8169.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
> index 217e709..ca93cdf 100644
> --- a/drivers/net/r8169.c
> +++ b/drivers/net/r8169.c
> @@ -88,7 +88,7 @@ static const int multicast_filter_limit = 32;
>   #define RTL_W32(reg, val32)	writel ((val32), ioaddr + (reg))
>   #define RTL_R8(reg)		readb (ioaddr + (reg))
>   #define RTL_R16(reg)		readw (ioaddr + (reg))
> -#define RTL_R32(reg)		((unsigned long) readl (ioaddr + (reg)))
> +#define RTL_R32(reg)		readl (ioaddr + (reg))

Ditto last email:  have you verified this matches every architecture's
definition of readl()?

	Jeff






* Re: [Patch]8139too: remove unnecessary cast of ioread32()'s return value
From: Jeff Garzik @ 2010-05-30 17:35 UTC (permalink / raw)
  To: davem, romieu, netdev
In-Reply-To: <20100530122213.GB1146@host-a-55.ustcsz.edu.cn>

On 05/30/2010 08:22 AM, Junchang Wang wrote:
> ioread32() returns a 32-bit integer on all platforms.
> There is no need to cast its return value.
>
> Signed-off-by: Junchang Wang<junchangwang@gmail.com>
> ---
>   drivers/net/8139too.c |    8 ++++----
>   1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
> index 4ba7293..cc7d462 100644
> --- a/drivers/net/8139too.c
> +++ b/drivers/net/8139too.c
> @@ -662,7 +662,7 @@ static const struct ethtool_ops rtl8139_ethtool_ops;
>   /* read MMIO register */
>   #define RTL_R8(reg)		ioread8 (ioaddr + (reg))
>   #define RTL_R16(reg)		ioread16 (ioaddr + (reg))
> -#define RTL_R32(reg)		((unsigned long) ioread32 (ioaddr + (reg)))
> +#define RTL_R32(reg)		ioread32 (ioaddr + (reg))

Have you verified this matches every architecture's definition of readl()?

	Jeff





* Re: [PATCH] bnx2: Fix IRQ failures during kdump.
From: Andi Kleen @ 2010-05-30 17:30 UTC (permalink / raw)
  To: Michael Chan
  Cc: 'Andi Kleen', 'davem@davemloft.net',
	'netdev@vger.kernel.org',
	'linux-pci@vger.kernel.org'
In-Reply-To: <C27F8246C663564A84BB7AB3439772421B78147574@IRVEXCHCCR01.corp.ad.broadcom.com>

On Sun, May 30, 2010 at 09:12:15AM -0700, Michael Chan wrote:
> Andi Kleen wrote:
> 
> > "Michael Chan" <mchan@broadcom.com> writes:
> > 
> > > When switching from the crashed kernel to the kdump kernel without going
> > > through PCI reset, IRQs may not work if a different IRQ mode is used on
> > 
> > PCIe with AER actually does support per link root port reset
> > (e.g. used for AER)
> 
> Do you mean the slot_reset function in the pci_error_handlers?  This

Well the fallback code in the PCIE root port driver 
that does the actual resets.

It could be called directly before kexec.

> needs to be called in the context of the crashed kernel, right?

It could be done on kexec, however of course you would rely
on PCI root port data structures still being intact on a crash
(I guess that's reasonable, they are not very complicated)

> 
> > 
> > I've been wondering for some time if kexec should not simply
> > > use that to reset all the devices, instead of adding hacks
> > around this to all drivers.
> > 
> > That would fix your problems too, right?
> 
> If it is called in the context of the crashed kernel, it won't work.
> We would reset it and put it back into the same IRQ mode.

Who would put it back? Your driver wouldn't be called anymore.

> 
> > 
> > The question is just if AER is widely enough supported for this.
> > 
> 
> Some newer PCIe devices support Function Level Reset, and that would
> be ideal.  But most existing devices including bnx2 devices don't have
> this feature.

Root port reset should be fine for this case. Even if some
innocent device on the same root port gets reset too that shouldn't matter. 
Only drawback for the NIC would be that you have to renegotiate links I think. 

Also there are systems without AER support.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] bnx2: Fix IRQ failures during kdump.
From: Michael Chan @ 2010-05-30 16:32 UTC (permalink / raw)
  To: 'David Miller', 'matthew@wil.cx'
  Cc: 'grundler@parisc-linux.org',
	'netdev@vger.kernel.org',
	'linux-pci@vger.kernel.org'
In-Reply-To: <20100529.204906.55850229.davem@davemloft.net>

David Miller wrote:

> From: Matthew Wilcox <matthew@wil.cx>
> Date: Sat, 29 May 2010 19:24:01 -0600
> 
> > We should probably set the interrupt type back to pin-based before the
> > kexec kernel starts, right?  Or do we expect drivers to handle being
> > initialised with the device still set to MSI mode?
> 
> The expectation is that the device comes up in INTX mode, which is the
> default after a PCI reset.

We need to be very careful because the device may still be active as I
said earlier.  Turning INTX on may lead to an IRQ storm that nobody will
handle.  Some older devices don't have the INTX enable bit, and INTX will
automatically be enabled when MSI is disabled.

> 
> Basically all of these issues tend to be about the fact that unlike on
> a normal boot, after a kexec an intermediate PCI reset has not occurred.


* Re: [PATCH] bnx2: Fix IRQ failures during kdump.
From: Michael Chan @ 2010-05-30 16:12 UTC (permalink / raw)
  To: 'Andi Kleen'
  Cc: 'davem@davemloft.net', 'netdev@vger.kernel.org',
	'linux-pci@vger.kernel.org'
In-Reply-To: <87ocfxzpvf.fsf@basil.nowhere.org>

Andi Kleen wrote:

> "Michael Chan" <mchan@broadcom.com> writes:
> 
> > When switching from the crashed kernel to the kdump kernel without going
> > through PCI reset, IRQs may not work if a different IRQ mode is used on
> 
> PCIe with AER actually does support per link root port reset
> (e.g. used for AER)

Do you mean the slot_reset function in the pci_error_handlers?  This
needs to be called in the context of the crashed kernel, right?

> 
> I've been wondering for some time if kexec should not simply
> use that to reset all the devices, instead of adding hacks
> around this to all drivers.
> 
> That would fix your problems too, right?

If it is called in the context of the crashed kernel, it won't work.
We would reset it and put it back into the same IRQ mode.

> 
> The question is just if AER is widely enough supported for this.
> 

Some newer PCIe devices support Function Level Reset, and that would
be ideal.  But most existing devices including bnx2 devices don't have
this feature.


* Re: MDNS is broken in latest -git
From: Maxim Levitsky @ 2010-05-30 14:44 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; +Cc: linux-wireless
In-Reply-To: <1275066591.3390.4.camel@maxim-laptop>

On Fri, 2010-05-28 at 20:09 +0300, Maxim Levitsky wrote: 
> On Fri, 2010-05-28 at 18:02 +0300, Maxim Levitsky wrote: 
> > On latest git, it became impossible to use hostname.local alias to
> > access my network hosts.
> > 
> > In fact when I look at 'avahi-discover' I see nothing but local
> > services.
> > 
> > I did a bisect, but unfortunately ended with a merge commit, although
> > bisection seems to be normal (and I didn't do any shortcuts).
> 
> Since starting 'wireshark' magically temporarily fixes this, I suspect that
> multicast packets don't get through.
> 
> This smells like iwl3945 bug.
> 
> I use linus' master tree now.

Anybody?

Best regards,
Maxim Levitsky

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v2] act_nat: fix the wrong checksum when addr isn't in old_addr/mask
From: Changli Gao @ 2010-05-30 14:11 UTC (permalink / raw)
  To: Herbert Xu; +Cc: jamal, David S. Miller, netdev
In-Reply-To: <AANLkTinDe-AluGZx87q3nvxCfushfeH0jS35Fav95IXk@mail.gmail.com>

On Sun, May 30, 2010 at 9:33 PM, Changli Gao <xiaosuo@gmail.com> wrote:
> On Sun, May 30, 2010 at 8:58 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> Yes the patch is correct.
>>
>> However, the fact that you need this patch means that your act_nat
>> setup isn't perfect.  Ideally all the unNATed packets should be
>> filtered out before you hit act_nat.
>
> Consider this topology:
>
> client -> DNAT -> router -> server.
>
> DNAT is used to map a public IP to server's private IP. If a
> DEST_UNREACH ICMP packet is sent out by router, in order to handle
> this ICMP packet correctly, I have to pass it to act_nat.c. How can I
> filter out the other packets? By inspecting the inner IP destination
> address of this ICMP packet? Maybe I can use u32 with complicated
> parameters.
>

Oh, I can pass all the ICMP packets, and the packets to the public IP
and the packets from the private IP.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

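The filter described in the last paragraph can be written down as a simple predicate. This is only a sketch under stated assumptions: struct nat_rule, should_hit_nat() and PROTO_ICMP are hypothetical names, and a real setup would presumably express the same three conditions as tc filter matches in front of act_nat rather than in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_ICMP 1	/* IP protocol number assigned to ICMP */

/* Hypothetical description of one DNAT mapping. */
struct nat_rule {
	uint32_t public_ip;	/* address the clients connect to */
	uint32_t private_ip;	/* server address behind the NAT */
};

/*
 * Returns true for the three packet classes named in the mail:
 * every ICMP packet, packets addressed to the public IP, and
 * packets coming from the private IP.
 */
static bool should_hit_nat(const struct nat_rule *r, uint8_t proto,
			   uint32_t saddr, uint32_t daddr)
{
	assert(r != NULL);
	return proto == PROTO_ICMP ||
	       daddr == r->public_ip ||
	       saddr == r->private_ip;
}
```

Everything that fails all three tests would bypass the NAT action, which is the pre-filtering Herbert Xu asks for.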

* Re: [PATCH] Exclude DAHDI devices from being probed by netjet
From: Alan Cox @ 2010-05-30 14:07 UTC (permalink / raw)
  To: Tzafrir Cohen; +Cc: netdev, linux-kernel
In-Reply-To: <20100529232946.GA1748@xorcom.com>

> 2. I don't have more precise data than the PCI ID tables in the drivers.
>    Does this run the risk of excluding some actual Netjet ISDN cards?

Is there a reason you are seeing so many different vendor values - surely
you should see a single 'Digium' subvendor, and various subdevice values ?


> +	switch (pdev->subsystem_vendor) {
> +	/* Fall-through */
> +	case 0x2151: /* Yeastart YSTDM8xx (ystdm8xx) */
> +	case 0xe16b: /* Zapata Project PCI-Radio (pciradio) */
> +	case 0x6159: /* Digium Wildcard T100/E100 (wct1xxp) */
> +	case 0x71fe: /* Digium Wildcard TE110P (wcte1xp) */
> +	case 0x795e: /* Digium Wildcard TE110P (wcte1xp) */
> +	case 0x797e: /* Digium Wildcard TE110P (wcte1xp) */
> +	case 0x79de: /* Digium Wildcard TE110P (wcte1xp) */
> +	case 0x79df: /* Digium Wildcard TE110P (wcte1xp) */
> +	case 0x8084: /* Digium Wildcard X101P clone (wcfxo) */
> +	case 0x8085: /* Digium Wildcard X101P (wcfxo) */
> +	case 0x8086: /* Digium Wildcard X101P clone (wcfxo) */
> +	case 0x8087: /* Digium Wildcard X101P clone (wcfxo) */
> +	case 0xa800: /* Digium Wildcard TDM400P Rev H (wctdm) */
> +	case 0xa801: /* Digium Wildcard TDM400P Rev H (wctdm) */
> +	case 0xa8fd: /* Digium Wildcard TDM400P Rev H (wctdm) */
> +	case 0xa901: /* Digium Wildcard TDM400P Rev H (wctdm) */
> +	case 0xa908: /* Digium Wildcard TDM400P Rev H (wctdm) */
> +	case 0xa9fd: /* Digium Wildcard TDM400P Rev H (wctdm) */
> +	case 0xb100: /* Digium Wildcard TDM400P Rev E/F (wctdm) */
> +	case 0xb118: /* Digium Wildcard TDM400P Rev I (wctdm) */
> +	case 0xb119: /* Digium Wildcard TDM400P Rev I (wctdm) */
> +	case 0xb1d9: /* Digium Wildcard TDM400P Rev I (wctdm) */
> +	case 0xa159: /* Digium Wildcard S400P Prototype (wctdm) */
> +	case 0xe159: /* Digium Wildcard S400P Prototype (wctdm) */

That might be better as a table. You can then also include the name and
match details in the report, which will help diagnose problems with it, e.g.

	struct whatever {
		u16 svid, sdid;
		const char *name;
	};

And then print

	"netjet: %s card is not supported by this driver (%04X, %04X).\n",
		name, svid, sdid

		
Then again, we don't seem to have a kernel driver for these other devices,
so perhaps the safe default would be to warn and continue unless a module
option is set. At least initially.

Alan

^ permalink raw reply

* [PATCH 2/2] drivers/isdn/hardware/mISDN: Use GFP_ATOMIC when a lock is held
From: Julia Lawall @ 2010-05-30 13:49 UTC (permalink / raw)
  To: Karsten Keil, netdev, linux-kernel, kernel-janitors

From: Julia Lawall <julia@diku.dk>

The function inittiger is only called from nj_init_card, where a lock is held.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@gfp exists@
identifier fn;
position p;
@@

fn(...) {
 ... when != spin_unlock_irqrestore
     when any
 GFP_KERNEL@p
 ... when any
}

@locked@
identifier gfp.fn;
@@

spin_lock_irqsave(...)
...  when != spin_unlock_irqrestore
fn(...)

@depends on locked@
position gfp.p;
@@

- GFP_KERNEL@p
+ GFP_ATOMIC
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>

---
 drivers/isdn/hardware/mISDN/netjet.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -u -p a/drivers/isdn/hardware/mISDN/netjet.c b/drivers/isdn/hardware/mISDN/netjet.c
--- a/drivers/isdn/hardware/mISDN/netjet.c
+++ b/drivers/isdn/hardware/mISDN/netjet.c
@@ -320,12 +320,12 @@ inittiger(struct tiger_hw *card)
 		return -ENOMEM;
 	}
 	for (i = 0; i < 2; i++) {
-		card->bc[i].hsbuf = kmalloc(NJ_DMA_TXSIZE, GFP_KERNEL);
+		card->bc[i].hsbuf = kmalloc(NJ_DMA_TXSIZE, GFP_ATOMIC);
 		if (!card->bc[i].hsbuf) {
 			pr_info("%s: no B%d send buffer\n", card->name, i + 1);
 			return -ENOMEM;
 		}
-		card->bc[i].hrbuf = kmalloc(NJ_DMA_RXSIZE, GFP_KERNEL);
+		card->bc[i].hrbuf = kmalloc(NJ_DMA_RXSIZE, GFP_ATOMIC);
 		if (!card->bc[i].hrbuf) {
 			pr_info("%s: no B%d recv buffer\n", card->name, i + 1);
 			return -ENOMEM;

^ permalink raw reply

* Re: [PATCH v2] act_nat: fix the wrong checksum when addr isn't in old_addr/mask
From: Changli Gao @ 2010-05-30 13:33 UTC (permalink / raw)
  To: Herbert Xu; +Cc: jamal, David S. Miller, netdev
In-Reply-To: <20100530125811.GA8120@gondor.apana.org.au>

On Sun, May 30, 2010 at 8:58 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Sun, May 30, 2010 at 08:43:41AM -0400, jamal wrote:
>>
>> Copying Herbert, taking linux-kernel off...
>
> Thanks Jamal.
>
>> On Sun, 2010-05-30 at 08:26 +0800, Changli Gao wrote:
>> > fix the wrong checksum when addr isn't in old_addr/mask
>> >
>> > For TCP and UDP packets, when addr isn't in old_addr/mask we don't do SNAT or
>> > DNAT, and we should not update layer 4 checksum.
>> >
>> > Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>> > ----
>> >  net/sched/act_nat.c |    4 ++++
>> >  1 file changed, 4 insertions(+)
>> > diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
>> > index d885ba3..5709494 100644
>> > --- a/net/sched/act_nat.c
>> > +++ b/net/sched/act_nat.c
>> > @@ -159,6 +159,9 @@ static int tcf_nat(struct sk_buff *skb, struct tc_action *a,
>> >                     iph->daddr = new_addr;
>> >
>> >             csum_replace4(&iph->check, addr, new_addr);
>> > +   } else if ((iph->frag_off & htons(IP_OFFSET)) ||
>> > +              iph->protocol != IPPROTO_ICMP) {
>> > +           goto out;
>> >     }
>
> Yes the patch is correct.
>
> However, the fact that you need this patch means that your act_nat
> setup isn't perfect.  Ideally all the unNATed packets should be
> filtered out before you hit act_nat.

Consider this topology:

client -> DNAT -> router -> server.

DNAT is used to map a public IP to the server's private IP. If the
router sends out a DEST_UNREACH ICMP packet, I have to pass it to
act_nat.c so that it is handled correctly. How can I filter out the
other packets? By inspecting the inner IP destination address of this
ICMP packet? Maybe I can use u32 with complicated parameters.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: Question about an assignment in handle_ing()
From: jamal @ 2010-05-30 13:29 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Jiri Pirko, netdev, davem, kaber
In-Reply-To: <1274873881.3878.988.camel@bigi>

[-- Attachment #1: Type: text/plain, Size: 1593 bytes --]

On Wed, 2010-05-26 at 07:38 -0400, jamal wrote:
> On Wed, 2010-05-26 at 09:13 +1000, Herbert Xu wrote:
> 
> > If it did happen like you said then it would be a serious bug
> > in our stack as everything else (including the TCP stack) relies
> > on this.
> 
> It could have been a bug. Note this was not a simple test, so there
> may be other factors involved. If you or Jiri are willing to run the
> test i will construct a scenario which will test this out. It will need
> a compile of the kernel and a small check in pedit to see if we see
> cloned skbs when we run the two tcpdumps (and to make sure the tcpdumps
> see the correct bytes). Otherwise i will get to it by weekend.

I have constructed a test case (attached) and my fear is unfortunately
still there;-< What am i doing wrong?

The packet path is:
-->eth0-->tcpdump eth0-->pedit-->mirror to dummy0-->tcpdump dummy0

I expect pedit to see a cloned packet. It doesn't. The check is in
tcf_pedit(); just before "if (!(skb->tc_verd & TC_OK2MUNGE))" I
added:
printk("pedit: skb-%p is %s\n",skb,skb_cloned(skb)?"cloned":"!cloned");

Is PF_PACKET not cloning, etc.? Sorry, I don't have much time today
to dig into the code, but I figure you'd know the answer.

> > But how can the caller make that decision when you return exactly
> > the same value in the error case as the normal case?
> 
> Ok - i see your point Herbert ;-> 
> it makes sense to have pedit have an error action code like some of the
> others actions which defaults to a drop.
> I will do a proper patch sometime this weekend.

I will get it done this week.

cheers,
jamal

[-- Attachment #2: jiri-q-test --]
[-- Type: text/plain, Size: 1855 bytes --]


machine running script is 10.0.0.111 receiving on eth0.
We are pinging from 10.0.0.26 to 10.0.0.111.
On 10.0.0.111:
1)Edit the packet when it comes in to change src/dst mac addresses
2)Mirror copy to dummy0 

mirror to dummy0 is useful for debugging (ifconfig shows you stats and you
can run tcpdump to log the copies as we do)

run tcpdump before #1 and after #1 - this way we see the original
packet at eth0 and the modified packet at dummy0.

-------------- start script on 10.0.0.111 -----
tc qdisc del dev eth0 ingress     
tc qdisc add dev eth0 ingress     
ifconfig dummy0 up
tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match ip protocol 1 0xff flowid 1:2 \
action pedit \
munge offset -12 u32 set 0x00010100 \
munge offset -8 u32 set 0x0aaf0100 \
munge offset -4 u32 set 0x00080800 pipe \
action mirred egress mirror dev dummy0
------

To validate you did this right, dumping should look as follows:
----
filter protocol ip pref 10 u32 
filter protocol ip pref 10 u32 fh 800: ht divisor 1 
filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 
  match 00010000/00ff0000 at 8
	action order 1:  pedit action pipe keys 3
 	 index 1 ref 1 bind 1
	 key #0  at -12: val 00010100 mask 00000000
	 key #1  at -8: val 0aaf0100 mask 00000000
	 key #2  at -4: val 00080800 mask 00000000
 
	action order 2: mirred (Egress Mirror to device dummy0) pipe
 	index 1 ref 1 bind 1
-----

tcpdump on dummy0 (showing modified macs):
 
0a:af:01:00:00:08 > 52:54:00:01:01:00, ethertype IPv4 (0x0800), length 98: 10.0.0.26 > 10.0.0.111: ICMP echo request, id 5981, seq 1, length 64
	0x0000:  4500 0054 0000 4000 4001 2621 0a00 001a
	0x0010:  0a00 006f 0800 a951 175d 0001 d3c8 fa4b
	0x0020:  0000 0000 9d68 0d00 0000 0000 1011 1213
	0x0030:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223
	0x0040:  2425 2627 2829 2a2b 2c2d 2e2f


^ permalink raw reply

* Re: [PATCH v2] act_nat: fix the wrong checksum when addr isn't in old_addr/mask
From: Herbert Xu @ 2010-05-30 12:58 UTC (permalink / raw)
  To: jamal; +Cc: Changli Gao, David S. Miller, netdev
In-Reply-To: <1275223421.3587.0.camel@bigi>

On Sun, May 30, 2010 at 08:43:41AM -0400, jamal wrote:
> 
> Copying Herbert, taking linux-kernel off...

Thanks Jamal.

> On Sun, 2010-05-30 at 08:26 +0800, Changli Gao wrote:
> > fix the wrong checksum when addr isn't in old_addr/mask
> > 
> > For TCP and UDP packets, when addr isn't in old_addr/mask we don't do SNAT or
> > DNAT, and we should not update layer 4 checksum.
> > 
> > Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> > ----
> >  net/sched/act_nat.c |    4 ++++
> >  1 file changed, 4 insertions(+)
> > diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
> > index d885ba3..5709494 100644
> > --- a/net/sched/act_nat.c
> > +++ b/net/sched/act_nat.c
> > @@ -159,6 +159,9 @@ static int tcf_nat(struct sk_buff *skb, struct tc_action *a,
> >  			iph->daddr = new_addr;
> >  
> >  		csum_replace4(&iph->check, addr, new_addr);
> > +	} else if ((iph->frag_off & htons(IP_OFFSET)) ||
> > +		   iph->protocol != IPPROTO_ICMP) {
> > +		goto out;
> >  	}

Yes the patch is correct.

However, the fact that you need this patch means that your act_nat
setup isn't perfect.  Ideally all the unNATed packets should be
filtered out before you hit act_nat.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v2] act_nat: fix the wrong checksum when addr isn't in old_addr/mask
From: jamal @ 2010-05-30 12:43 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev, Herbert Xu
In-Reply-To: <1275179219-10424-1-git-send-email-xiaosuo@gmail.com>


Copying Herbert, taking linux-kernel off...

On Sun, 2010-05-30 at 08:26 +0800, Changli Gao wrote:
> fix the wrong checksum when addr isn't in old_addr/mask
> 
> For TCP and UDP packets, when addr isn't in old_addr/mask we don't do SNAT or
> DNAT, and we should not update layer 4 checksum.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----
>  net/sched/act_nat.c |    4 ++++
>  1 file changed, 4 insertions(+)
> diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
> index d885ba3..5709494 100644
> --- a/net/sched/act_nat.c
> +++ b/net/sched/act_nat.c
> @@ -159,6 +159,9 @@ static int tcf_nat(struct sk_buff *skb, struct tc_action *a,
>  			iph->daddr = new_addr;
>  
>  		csum_replace4(&iph->check, addr, new_addr);
> +	} else if ((iph->frag_off & htons(IP_OFFSET)) ||
> +		   iph->protocol != IPPROTO_ICMP) {
> +		goto out;
>  	}
>  
>  	ihl = iph->ihl * 4;
> @@ -247,6 +250,7 @@ static int tcf_nat(struct sk_buff *skb, struct tc_action *a,
>  		break;
>  	}
>  
> +out:
>  	return action;
>  
>  drop:


^ permalink raw reply

* [Patch]r8169: remove unnecessary cast of readl()'s return value
From: Junchang Wang @ 2010-05-30 12:26 UTC (permalink / raw)
  To: davem, romieu; +Cc: netdev

readl() returns a 32-bit integer on all platforms.
There is no need to cast its return value.

Signed-off-by: Junchang Wang <junchangwang@gmail.com>
---
 drivers/net/r8169.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 217e709..ca93cdf 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -88,7 +88,7 @@ static const int multicast_filter_limit = 32;
 #define RTL_W32(reg, val32)	writel ((val32), ioaddr + (reg))
 #define RTL_R8(reg)		readb (ioaddr + (reg))
 #define RTL_R16(reg)		readw (ioaddr + (reg))
-#define RTL_R32(reg)		((unsigned long) readl (ioaddr + (reg)))
+#define RTL_R32(reg)		readl (ioaddr + (reg))
 
 enum mac_version {
 	RTL_GIGA_MAC_NONE   = 0x00,
--

^ permalink raw reply related

* [Patch]8139too: remove unnecessary cast of ioread32()'s return value
From: Junchang Wang @ 2010-05-30 12:22 UTC (permalink / raw)
  To: davem, romieu; +Cc: netdev

ioread32() returns a 32-bit integer on all platforms.
There is no need to cast its return value.

Signed-off-by: Junchang Wang <junchangwang@gmail.com>
---
 drivers/net/8139too.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
index 4ba7293..cc7d462 100644
--- a/drivers/net/8139too.c
+++ b/drivers/net/8139too.c
@@ -662,7 +662,7 @@ static const struct ethtool_ops rtl8139_ethtool_ops;
 /* read MMIO register */
 #define RTL_R8(reg)		ioread8 (ioaddr + (reg))
 #define RTL_R16(reg)		ioread16 (ioaddr + (reg))
-#define RTL_R32(reg)		((unsigned long) ioread32 (ioaddr + (reg)))
+#define RTL_R32(reg)		ioread32 (ioaddr + (reg))
 
 
 static const u16 rtl8139_intr_mask =
@@ -861,7 +861,7 @@ retry:
 
 	/* if unknown chip, assume array element #0, original RTL-8139 in this case */
 	dev_dbg(&pdev->dev, "unknown chip version, assuming RTL-8139\n");
-	dev_dbg(&pdev->dev, "TxConfig = 0x%lx\n", RTL_R32 (TxConfig));
+	dev_dbg(&pdev->dev, "TxConfig = 0x%x\n", RTL_R32 (TxConfig));
 	tp->chipset = 0;
 
 match:
@@ -1642,7 +1642,7 @@ static void rtl8139_tx_timeout_task (struct work_struct *work)
 	netdev_dbg(dev, "Tx queue start entry %ld  dirty entry %ld\n",
 		   tp->cur_tx, tp->dirty_tx);
 	for (i = 0; i < NUM_TX_DESC; i++)
-		netdev_dbg(dev, "Tx descriptor %d is %08lx%s\n",
+		netdev_dbg(dev, "Tx descriptor %d is %08x%s\n",
 			   i, RTL_R32(TxStatus0 + (i * 4)),
 			   i == tp->dirty_tx % NUM_TX_DESC ?
 			   " (queue head)" : "");
@@ -2486,7 +2486,7 @@ static void __set_rx_mode (struct net_device *dev)
 	int rx_mode;
 	u32 tmp;
 
-	netdev_dbg(dev, "rtl8139_set_rx_mode(%04x) done -- Rx config %08lx\n",
+	netdev_dbg(dev, "rtl8139_set_rx_mode(%04x) done -- Rx config %08x\n",
 		   dev->flags, RTL_R32(RxConfig));
 
 	/* Note: do not reorder, GCC is clever about common statements. */
--

^ permalink raw reply related

* Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
From: Michael S. Tsirkin @ 2010-05-30 11:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <4BFFE742.2060205@kernel.org>

On Fri, May 28, 2010 at 05:54:42PM +0200, Tejun Heo wrote:
> Hello,
> 
> On 05/28/2010 05:08 PM, Michael S. Tsirkin wrote:
> > Well, we have create_singlethread_workqueue, right?
> > This is not very different ... is it?
> > 
> > Just copying structures and code from workqueue.c,
> > adding vhost_ in front of it will definitely work:
> 
> Sure it will, but you'll probably be able to get away with much less.
> 
> > there is nothing magic about the workqueue library.
> > But this just involves cut and paste which might be best avoided.
> 
> What I'm saying is that some magic needs to be added to workqueue and
> if you add this single(!) exception, it will have to be backed out
> pretty soon, so it would be better to do it properly now.
> 
> > One final idea before we go the cut and paste way: how about
> > 'create_workqueue_from_task' that would get a thread and have workqueue
> > run there?
> 
> You can currently depend on that implementation detail but it's not
> what the workqueue interface is meant to do.  The single threadedness is
> there as execution ordering and concurrency specification and it
> doesn't (or rather won't) necessarily mean that a specific single
> thread is bound to certain workqueue.
> 
> Can you please direct me to have a look at the code.  I'll be happy to
> do the conversion for you.

Great, thanks! The code in question is in drivers/vhost/vhost.c.
It is used from drivers/vhost/net.c.

On top of this, we have a patchset from Sridhar Samudrala,
titled '0/3 Make vhost multi-threaded and associate each thread to its
guest's cgroup':

cgroups: Add an API to attach a task to current task's cgroup
workqueue: Add an API to create a singlethread workqueue attached to the
current task's cgroup
vhost: make it more scalable by creating a vhost thread per device

I have bounced the last three your way.


> Thanks.
> 
> -- 
> tejun

^ permalink raw reply

* Re: [PATCH] bnx2: Fix IRQ failures during kdump.
From: Andi Kleen @ 2010-05-30  9:44 UTC (permalink / raw)
  To: David Miller; +Cc: matthew, mchan, grundler, netdev, linux-pci
In-Reply-To: <20100529.204906.55850229.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:
>
> Basically all of these issues tend to be about the fact that unlike on
> a normal boot, after a kexec an intermediate PCI reset has not occurred.

This could be fixed, assuming the system has AER capability ...

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH] bnx2: Fix IRQ failures during kdump.
From: Andi Kleen @ 2010-05-30  9:43 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev, linux-pci
In-Reply-To: <1275103462-8527-1-git-send-email-mchan@broadcom.com>

"Michael Chan" <mchan@broadcom.com> writes:

> When switching from the crashed kernel to the kdump kernel without going
> through PCI reset, IRQs may not work if a different IRQ mode is used on

PCIe with AER actually does support per link root port reset
(e.g. used for AER)

I've been wondering for some time if kexec should not simply
use that to reset all the devices, instead of adding hacks
around this to all drivers.

That would fix your problems too, right?

The question is just if AER is widely enough supported for this.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH] bnx2: Fix IRQ failures during kdump.
From: David Miller @ 2010-05-30  3:49 UTC (permalink / raw)
  To: matthew; +Cc: mchan, grundler, netdev, linux-pci
In-Reply-To: <20100530012401.GC9132@parisc-linux.org>

From: Matthew Wilcox <matthew@wil.cx>
Date: Sat, 29 May 2010 19:24:01 -0600

> We should probably set the interrupt type back to pin-based before the
> kexec kernel starts, right?  Or do we expect drivers to handle being
> initialised with the device still set to MSI mode?

The expectation is that the device comes up in INTX mode, which is the
default after a PCI reset.

Basically all of these issues tend to be about the fact that unlike on
a normal boot, after a kexec an intermediate PCI reset has not occurred.

^ permalink raw reply

* IAMT broken by commit 82776a4bcd7aa5fbcd2e6339b3ce88b727dd40ab
From: Aurelien Jarno @ 2010-05-30  1:02 UTC (permalink / raw)
  To: Bruce Allan, Jeff Kirsher; +Cc: netdev

Hi,

I have recently upgraded my kernel, and found that Intel AMT support is
not working anymore as expected. I have configured IAMT so that it is
always available, even when the machine is off ("Desktop: ON in S0, S3,
S4-5").

On recent kernels, IAMT support does not work after the machine has
been powered off. Even worse, it also goes into this state when I try
to reboot it.

I have done a bisect and got this commit:

| commit 82776a4bcd7aa5fbcd2e6339b3ce88b727dd40ab
| Author: Bruce Allan <bruce.w.allan@intel.com>
| Date:   Fri Aug 14 14:35:33 2009 +0000
| 
|     e1000e: WoL does not work on 82577/82578 with manageability enabled
|     
|     With manageability (Intel AMT) enabled via BIOS, PHY wakeup does not get
|     configured on newer parts which use PHY wakeup vs. MAC wakeup which causes
|     WoL to not work.  The driver should configure PHY wakeup whether or not
|     manageability is enabled.
|     
|     Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
|     Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|     Signed-off-by: David S. Miller <davem@davemloft.net>

I have tried to revert it on recent kernels (2.6.34), and IAMT is then
working as expected. My machine is using a Gigabyte EQ45M-S2 motherboard
with an 82567LM-3 ethernet chip (8086:10de), which is a different model
from the ones in the original problem.

I wonder whether the changes in the patch should be applied only to some
chip models, and I would appreciate any help in fixing this issue.

Thanks,
Aurelien

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net

^ permalink raw reply
