Netdev List
 help / color / mirror / Atom feed
* Re: [RFC] random SYN drops causing connect() delays
From: Lennart Schulte @ 2010-04-14 11:37 UTC (permalink / raw)
  To: tgraf; +Cc: netdev
In-Reply-To: <20100412080633.GA27418@bombadil.infradead.org>

Hi,
this is very similar to what i have noticed, but up to now I couldn't figure out where it came from. 
Thanks very much for clearing it up!

> I have been tracking down an issue commonly referred to as the 3-sec
> connect() delay. It exists since recent 2.6.x kernels and has never
> been fixed even though it disappeared in recent releases unless
> sched_child_runs_first is set to 1 again.
>
> What happens is that if a client attemps to open many connections to
> a socket with only minimal delay inbetween attemps some SYNs are
> randomly dropped on the server side causing the client to resend after
> the 3 sec TCP timeout and thus causing connect()s to be randomly delayed.
>
> Facts:
>  - Issue can be reproduced over loopback or real networks.
>  - Enabling SO_LINGER on the client side will make the issue disappear!!
>  - While the issue is appearing, the acceptq seems to be overflowing. Both
>    LISTENOVERFLOWS and LISTENDROPS are increasing although not by the exact
>    number of delay occurences. inetdiag reports sk_max_ack_backlog to be 0
>    therefore one possibility that comes to mind is that sk_ack_backlog
>    underflows due to a race.
>  - The issue disappeared in recent kernels, I bisected it down to the following
>    commit:
> 	commit 2bba22c50b06abe9fd0d23933b1e64d35b419262
> 	Author: Mike Galbraith <efault@gmx.de>
> 	Date:   Wed Sep 9 15:41:37 2009 +0200
>
> 	    sched: Turn off child_runs_first
> 	    
> 	    Set child_runs_first default to off.
>
>    Setting kernel.sched_child_runs_first=1 makes the isssue reappear in recent
>    kernels.  This hardens the theory of a race condition.
>  - It looks like that the issue can only be reproduced if the server
>    socket sends out data immediately after the connection has been established
>    but I cannot proof this theory.

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-14 12:13 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, Eric Dumazet, netdev
In-Reply-To: <1271246304.3943.60.camel@bigi>

On Wed, Apr 14, 2010 at 7:58 PM, jamal <hadi@cyberus.ca> wrote:
>
> It seems we are now going to generate a lot more IPIs with such a
> change. At least this is what i am imagining.
> CPU0: packet comes in,queue empty, generate an IPI to CPU1
> CPU0: second packet comes in, enqueue
> CPU1: grab two packets to process and run with them
> CPU0: packet comes in,queue empty, generate an IPI to CPU1

No extra IPI is needed.

+       qlen = queue->input_pkt_queue.qlen + queue->processing_queue.qlen;
+       if (qlen <= netdev_max_backlog) {
+               if (qlen) {

the packets in processing_queue are counted too.

>
> IPIs add to latency (refer to my other email). Did you test this
> to reach some conclusion that it improves thing or was it just by
> inspection?
>

:( only insepection.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: jamal @ 2010-04-14 12:28 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, Eric Dumazet, netdev
In-Reply-To: <o2k412e6f7f1004140513h8de62790tb775bb357e2db6b1@mail.gmail.com>

On Wed, 2010-04-14 at 20:13 +0800, Changli Gao wrote:
> On Wed, Apr 14, 2010 at 7:58 PM, jamal <hadi@cyberus.ca> wrote:

> No extra IPI is needed.
> 
> +       qlen = queue->input_pkt_queue.qlen + queue->processing_queue.qlen;
> +       if (qlen <= netdev_max_backlog) {
> +               if (qlen) {
> 
> the packets in processing_queue are counted too.

Ok - Looks reasonable.

> > IPIs add to latency (refer to my other email). Did you test this
> > to reach some conclusion that it improves thing or was it just by
> > inspection?
> >
> 
> :( only insepection.

I am probably being pushy, but one simple test for latency of single
flow is: 

from machine 1, send ping -f

on rps machine:
Base test: no rps on ( a fresh boot with no sysctls should do fine)
Test 1: irq affinity on cpuX, rps to cpuY 
Test 2: repeat test1 with your change.

It should show no difference between test1 and 2. If it shows
improvement better - but showing worse latency is bad.

cheers,
jamal


^ permalink raw reply

* [GIT PULL] vhost-net fix for 2.6.34-rc4
From: Michael S. Tsirkin @ 2010-04-14 14:17 UTC (permalink / raw)
  To: David Miller, netdev, Christoph Hellwig

David,
The following tree includes a patch fixing an issue with vhost-net in
2.6.34-rc4.  Please pull for 2.6.34.

Thanks!

The following changes since commit 2ba3abd8186f24c7fb418927025b4e2120e3a362:

  Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 (2010-04-13 17:49:48 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost

Christoph Hellwig (1):
      vhost: fix sparse warnings

 drivers/vhost/net.c   |    4 ++--
 drivers/vhost/vhost.c |   11 ++++++-----
 2 files changed, 8 insertions(+), 7 deletions(-)

^ permalink raw reply

* Re: forcedeth driver hangs under heavy load
From: stephen mulcahy @ 2010-04-14 14:30 UTC (permalink / raw)
  To: Ayaz Abdulla
  Cc: Eric Dumazet, David Miller, bhutchings@solarflare.com,
	netdev@vger.kernel.org, ben@decadent.org.uk,
	572201@bugs.debian.org
In-Reply-To: <4BC5539B.6050908@nvidia.com>

Ayaz Abdulla wrote:
> Attached fix has been submitted to netdev.

I've run my reproducer with this patch applied to be Debian 2.6.32 
kernel and so far the problem with nodes becoming unresponsive hasn't 
occurred.

NIC settings were left the default so this looks positive

root@node23:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

Thanks!

-stephen

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 14:55 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, jdike
In-Reply-To: <1270805865-16901-2-git-send-email-xiaohui.xin@intel.com>

On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> Add a device to utilize the vhost-net backend driver for
> copy-less data transfer between guest FE and host NIC.
> It pins the guest user space to the host memory and
> provides proto_ops as sendmsg/recvmsg to vhost-net.

Sorry for taking so long before finding the time to look
at your code in more detail.

It seems that you are duplicating a lot of functionality that
is already in macvtap. I've asked about this before but then
didn't look at your newer versions. Can you explain the value
of introducing another interface to user land?

I'm still planning to add zero-copy support to macvtap,
hopefully reusing parts of your code, but do you think there
is value in having both?

> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 0000000..86d2525
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
> @@ -0,0 +1,1264 @@
> +
> +#ifdef MPASSTHRU_DEBUG
> +static int debug;
> +
> +#define DBG  if (mp->debug) printk
> +#define DBG1 if (debug == 2) printk
> +#else
> +#define DBG(a...)
> +#define DBG1(a...)
> +#endif

This should probably just use the existing dev_dbg/pr_debug infrastructure.

> [... skipping buffer management code for now]

> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> +			struct msghdr *m, size_t total_len)
> +{
> [...]

This function looks like we should be able to easily include it into
macvtap and get zero-copy transmits without introducing the new
user-level interface.

> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> +			struct msghdr *m, size_t total_len,
> +			int flags)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor;
> +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);

It smells like a layering violation to look at the iocb->private field
from a lower-level driver. I would have hoped that it's possible to implement
this without having this driver know about the higher-level vhost driver
internals. Can you explain why this is needed?

> +	spin_lock_irqsave(&ctor->read_lock, flag);
> +	list_add_tail(&info->list, &ctor->readq);
> +	spin_unlock_irqrestore(&ctor->read_lock, flag);
> +
> +	if (!vq->receiver) {
> +		vq->receiver = mp_recvmsg_notify;
> +		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> +				   vq->num * 4096,
> +				   vq->num * 4096);
> +	}
> +
> +	return 0;
> +}

Not sure what I'm missing, but who calls the vq->receiver? This seems
to be neither in the upstream version of vhost nor introduced by your
patch.

> +static void __mp_detach(struct mp_struct *mp)
> +{
> +	mp->mfile = NULL;
> +
> +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> +	page_ctor_detach(mp);
> +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +
> +	/* Drop the extra count on the net device */
> +	dev_put(mp->dev);
> +}
> +
> +static DEFINE_MUTEX(mp_mutex);
> +
> +static void mp_detach(struct mp_struct *mp)
> +{
> +	mutex_lock(&mp_mutex);
> +	__mp_detach(mp);
> +	mutex_unlock(&mp_mutex);
> +}
> +
> +static void mp_put(struct mp_file *mfile)
> +{
> +	if (atomic_dec_and_test(&mfile->count))
> +		mp_detach(mfile->mp);
> +}
> +
> +static int mp_release(struct socket *sock)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct mp_file *mfile = mp->mfile;
> +
> +	mp_put(mfile);
> +	sock_put(mp->socket.sk);
> +	put_net(mfile->net);
> +
> +	return 0;
> +}

Doesn't this prevent the underlying interface from going away while the chardev
is open? You also have logic to handle that case, so why do you keep the extra
reference on the netdev?

> +/* Ops structure to mimic raw sockets with mp device */
> +static const struct proto_ops mp_socket_ops = {
> +	.sendmsg = mp_sendmsg,
> +	.recvmsg = mp_recvmsg,
> +	.release = mp_release,
> +};

> +static int mp_chr_open(struct inode *inode, struct file * file)
> +{
> +	struct mp_file *mfile;
> +	cycle_kernel_lock();

I don't think you really want to use the BKL here, just kill that line.

> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> +		unsigned long arg)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp;
> +	struct net_device *dev;
> +	void __user* argp = (void __user *)arg;
> +	struct ifreq ifr;
> +	struct sock *sk;
> +	int ret;
> +
> +	ret = -EINVAL;
> +
> +	switch (cmd) {
> +	case MPASSTHRU_BINDDEV:
> +		ret = -EFAULT;
> +		if (copy_from_user(&ifr, argp, sizeof ifr))
> +			break;

This is broken for 32 bit compat mode ioctls, because struct ifreq
is different between 32 and 64 bit systems. Since you are only
using the device name anyway, a fixed length string or just the
interface index would be simpler and work better.

> +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> +
> +		ret = -EBUSY;
> +
> +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> +			break;

Your current use of the IFF_MPASSTHRU* flags does not seem to make
any sense whatsoever. You check that this flag is never set, but set
it later yourself and then ignore all flags.

> +		ret = -ENODEV;
> +		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
> +		if (!dev)
> +			break;

There is no permission checking on who can access what device, which
seems a bit simplistic. Any user that has access to the mpassthru device
seems to be able to bind to any network interface in the namespace.
This is one point where the macvtap model seems more appropriate, it
separates the permissions for creating logical interfaces and using them.

> +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> +				unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct mp_struct *mp = mp_get(file->private_data);
> +	struct sock *sk = mp->socket.sk;
> +	struct sk_buff *skb;
> +	int len, err;
> +	ssize_t result;

Can you explain what this function is even there for? AFAICT, vhost-net
doesn't call it, the interface is incompatible with the existing
tap interface, and you don't provide a read function.

> diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
> new file mode 100644
> index 0000000..2be21c5
> --- /dev/null
> +++ b/include/linux/mpassthru.h
> @@ -0,0 +1,29 @@
> +#ifndef __MPASSTHRU_H
> +#define __MPASSTHRU_H
> +
> +#include <linux/types.h>
> +#include <linux/if_ether.h>
> +
> +/* ioctl defines */
> +#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
> +#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)

These definitions are slightly wrong, because you pass more than just an 'int'.

> +/* MPASSTHRU ifc flags */
> +#define IFF_MPASSTHRU		0x0001
> +#define IFF_MPASSTHRU_EXCL	0x0002

As mentioned above, these flags don't make any sense with your current code.

	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: Eric Dumazet @ 2010-04-14 14:57 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: David Miller, Paul E. McKenney, netdev, linux-kernel
In-Reply-To: <4BC57E7D.9060706@cn.fujitsu.com>

Le mercredi 14 avril 2010 à 16:36 +0800, Lai Jiangshan a écrit :

> Since rcu_read_lock() protects fasync_struct *fa for us, we can access
> to @fa safely even fasync_remove_entry() is just called.
> 
> But this patch does not ensure 'fa->fa_file is not freed' nor
> 'fa->fa_fd is not released', so kill_fasync_rcu() may do wrong thing
> if there is no other code ensure it.

You are 100% right, I forgot my old attempt to RCUified struct files
failed...

Maybe its time to finally move f_owner out of struct file, and use RCU
to free it.

In the mean time, adding a lock in fasync_struct is more than enough.

Thanks !

[PATCH net-next-2.6 v2] fasync: fine grained locking

kill_fasync() uses a central rwlock, candidate for RCU conversion, to
avoid cache line ping pongs on SMP.

fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
section instead during whole list scan.

Use a spinlock per fasync_struct to synchronize fasync_{remove|
add}_entry() and kill_fasync_rcu()

We can remove __kill_fasync() direct use in net, and rename it to
kill_fasync_rcu().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
v2: As Lai Jiangshan noticed, we need a mutual exclusion between
fasync_{remove|add}_entry() and kill_fasync_rcu().

 fs/fcntl.c         |   66 +++++++++++++++++++++++++++----------------
 include/linux/fs.h |   12 +++----
 net/socket.c       |    4 +-
 3 files changed, 50 insertions(+), 32 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 452d02f..0a14074 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -614,9 +614,15 @@ int send_sigurg(struct fown_struct *fown)
 	return ret;
 }
 
-static DEFINE_RWLOCK(fasync_lock);
+static DEFINE_SPINLOCK(fasync_lock);
 static struct kmem_cache *fasync_cache __read_mostly;
 
+static void fasync_free_rcu(struct rcu_head *head)
+{
+	kmem_cache_free(fasync_cache,
+			container_of(head, struct fasync_struct, fa_rcu));
+}
+
 /*
  * Remove a fasync entry. If successfully removed, return
  * positive and clear the FASYNC flag. If no entry exists,
@@ -625,8 +631,6 @@ static struct kmem_cache *fasync_cache __read_mostly;
  * NOTE! It is very important that the FASYNC flag always
  * match the state "is the filp on a fasync list".
  *
- * We always take the 'filp->f_lock', in since fasync_lock
- * needs to be irq-safe.
  */
 static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 {
@@ -634,17 +638,22 @@ static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 	int result = 0;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
+		fa->fa_file = NULL;
+		spin_unlock_irq(&fa->fa_lock);
+
 		*fp = fa->fa_next;
-		kmem_cache_free(fasync_cache, fa);
+		call_rcu(&fa->fa_rcu, fasync_free_rcu);
 		filp->f_flags &= ~FASYNC;
 		result = 1;
 		break;
 	}
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -666,25 +675,30 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
 		return -ENOMEM;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
 		fa->fa_fd = fd;
+		spin_unlock_irq(&fa->fa_lock);
+
 		kmem_cache_free(fasync_cache, new);
 		goto out;
 	}
 
+	spin_lock_init(&new->fa_lock);
 	new->magic = FASYNC_MAGIC;
 	new->fa_file = filp;
 	new->fa_fd = fd;
 	new->fa_next = *fapp;
-	*fapp = new;
+	rcu_assign_pointer(*fapp, new);
 	result = 1;
 	filp->f_flags |= FASYNC;
 
 out:
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -704,37 +718,41 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
 
 EXPORT_SYMBOL(fasync_helper);
 
-void __kill_fasync(struct fasync_struct *fa, int sig, int band)
+/*
+ * rcu_read_lock() is held
+ */
+static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
 {
 	while (fa) {
-		struct fown_struct * fown;
+		struct fown_struct *fown;
 		if (fa->magic != FASYNC_MAGIC) {
 			printk(KERN_ERR "kill_fasync: bad magic number in "
 			       "fasync_struct!\n");
 			return;
 		}
-		fown = &fa->fa_file->f_owner;
-		/* Don't send SIGURG to processes which have not set a
-		   queued signum: SIGURG has its own default signalling
-		   mechanism. */
-		if (!(sig == SIGURG && fown->signum == 0))
-			send_sigio(fown, fa->fa_fd, band);
-		fa = fa->fa_next;
+		spin_lock(&fa->fa_lock);
+		if (fa->fa_file) {
+			fown = &fa->fa_file->f_owner;
+			/* Don't send SIGURG to processes which have not set a
+			   queued signum: SIGURG has its own default signalling
+			   mechanism. */
+			if (!(sig == SIGURG && fown->signum == 0))
+				send_sigio(fown, fa->fa_fd, band);
+		}
+		spin_unlock(&fa->fa_lock);
+		fa = rcu_dereference(fa->fa_next);
 	}
 }
 
-EXPORT_SYMBOL(__kill_fasync);
-
 void kill_fasync(struct fasync_struct **fp, int sig, int band)
 {
 	/* First a quick test without locking: usually
 	 * the list is empty.
 	 */
 	if (*fp) {
-		read_lock(&fasync_lock);
-		/* reread *fp after obtaining the lock */
-		__kill_fasync(*fp, sig, band);
-		read_unlock(&fasync_lock);
+		rcu_read_lock();
+		kill_fasync_rcu(rcu_dereference(*fp), sig, band);
+		rcu_read_unlock();
 	}
 }
 EXPORT_SYMBOL(kill_fasync);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..018d382 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1280,10 +1280,12 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
 
 
 struct fasync_struct {
-	int	magic;
-	int	fa_fd;
-	struct	fasync_struct	*fa_next; /* singly linked list */
-	struct	file 		*fa_file;
+	spinlock_t		fa_lock;
+	int			magic;
+	int			fa_fd;
+	struct fasync_struct	*fa_next; /* singly linked list */
+	struct file		*fa_file;
+	struct rcu_head		fa_rcu;
 };
 
 #define FASYNC_MAGIC 0x4601
@@ -1292,8 +1294,6 @@ struct fasync_struct {
 extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
 /* can be called from interrupts */
 extern void kill_fasync(struct fasync_struct **, int, int);
-/* only for net: no internal synchronization */
-extern void __kill_fasync(struct fasync_struct *, int, int);
 
 extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
 extern int f_setown(struct file *filp, unsigned long arg, int force);
diff --git a/net/socket.c b/net/socket.c
index 35bc198..846739c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1159,10 +1159,10 @@ int sock_wake_async(struct socket *sock, int how, int band)
 		/* fall through */
 	case SOCK_WAKE_IO:
 call_kill:
-		__kill_fasync(sock->fasync_list, SIGIO, band);
+		kill_fasync(sock->fasync_list, SIGIO, band);
 		break;
 	case SOCK_WAKE_URG:
-		__kill_fasync(sock->fasync_list, SIGURG, band);
+		kill_fasync(sock->fasync_list, SIGURG, band);
 	}
 	return 0;
 }

^ permalink raw reply related

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-14 15:20 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1271238738-8386-1-git-send-email-xiaosuo@gmail.com>

Le mercredi 14 avril 2010 à 17:52 +0800, Changli Gao a écrit :
> batch skb dequeueing from softnet input_pkt_queue
> 
> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention and irq disabling/enabling.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Adding stop_machine() with no explanation ?

No ack from my previous comments, suggestions, and still same logic ?

Are we supposed to read patch, test it, make some benches, correct bugs,
say Amen ?

This is becoming silly, if you ask me.

This is a NACK of this patch, obviously.




^ permalink raw reply

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 15:25 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, jdike, davem
In-Reply-To: <1270193100-6769-1-git-send-email-xiaohui.xin@intel.com>

On Fri, Apr 02, 2010 at 03:25:00PM +0800, xiaohui.xin@intel.com wrote:
> The idea is simple, just to pin the guest VM user space and then
> let host NIC driver has the chance to directly DMA to it. 
> The patches are based on vhost-net backend driver. We add a device
> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> send/recv directly to/from the NIC driver. KVM guest who use the
> vhost-net backend may bind any ethX interface in the host side to
> get copyless data transfer thru guest virtio-net frontend.
> 
> The scenario is like this:
> 
> The guest virtio-net driver submits multiple requests thru vhost-net
> backend driver to the kernel. And the requests are queued and then
> completed after corresponding actions in h/w are done.
> 
> For read, user space buffers are dispensed to NIC driver for rx when
> a page constructor API is invoked. Means NICs can allocate user buffers
> from a page constructor. We add a hook in netif_receive_skb() function
> to intercept the incoming packets, and notify the zero-copy device.
> 
> For write, the zero-copy deivce may allocates a new host skb and puts
> payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
> The request remains pending until the skb is transmitted by h/w.
> 
> Here, we have ever considered 2 ways to utilize the page constructor
> API to dispense the user buffers.
> 
> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> 	structure of sk_buff, and the data pointer is pointing to a 
> 	user buffer which is coming from a page constructor API.
> 	Then the shinfo of the skb is also from guest.
> 	When packet is received from hardware, the skb->data is filled
> 	directly by h/w. What we have done is in this way.
> 
> 	Pros:	We can avoid any copy here.
> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> 		the same method with the host NIC drivers, say the size
> 		of netdev_alloc_skb() and the same reserved space in the
> 		head of skb. Many NIC drivers are the same with guest and
> 		ok for this. But some lastest NIC drivers reserves special
> 		room in skb head. To deal with it, we suggest to provide
> 		a method in guest virtio-net driver to ask for parameter
> 		we interest from the NIC driver when we know which device 
> 		we have bind to do zero-copy. Then we ask guest to do so.
> 		Is that reasonable?

Unfortunately, this would break compatibility with existing virtio.
This also complicates migration. What is the room in skb head used for?

> Two:	Modify driver to get user buffer allocated from a page constructor
> 	API(to substitute alloc_page()), the user buffer are used as payload
> 	buffers and filled by h/w directly when packet is received. Driver
> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> 	the head buffer side, let host allocates skb, and h/w fills it. 
> 	After that, the data filled in host skb header will be copied into
> 	guest header buffer which is submitted together with the payload buffer.
> 
> 	Pros:	We could less care the way how guest or host allocates their
> 		buffers.
> 	Cons:	We still need a bit copy here for the skb header.
> 
> We are not sure which way is the better here.

The obvious question would be whether you see any speed difference
with the two approaches. If no, then the second approach would be
better.

> This is the first thing we want
> to get comments from the community. We wish the modification to the network
> part will be generic which not used by vhost-net backend only, but a user
> application may use it as well when the zero-copy device may provides async
> read/write operations later.
> 
> Please give comments especially for the network part modifications.
> 
> 
> We provide multiple submits and asynchronous notifiicaton to 
> vhost-net too.
> 
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. But for simple
> test with netperf, we found bindwidth up and CPU % up too,
> but the bindwidth up ratio is much more than CPU % up ratio.
> 
> What we have not done yet:
> 	packet split support

What does this mean, exactly?

> 	To support GRO

And TSO/GSO?

> 	Performance tuning
> 
> what we have done in v1:
> 	polish the RCU usage
> 	deal with write logging in asynchroush mode in vhost
> 	add notifier block for mp device
> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> 	add mp_dev_change_flags() for mp device to change NIC state
> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> 	a small fix for missing dev_put when fail
> 	using dynamic minor instead of static minor number
> 	a __KERNEL__ protect to mp_get_sock()
> 
> what we have done in v2:
> 	
> 	remove most of the RCU usage, since the ctor pointer is only
> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> 	stopped to get good cleanup(all outstanding requests are finished),
> 	so the ctor pointer cannot be raced into wrong situation.
> 
> 	Remove the struct vhost_notifier with struct kiocb.
> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> 	via sendmsg/recvmsg.
> 
> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> 
> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> 
> 
> Comments not addressed yet in this time:
> 	the async write logging is not satified by vhost-net
> 	Qemu needs a sync write
> 	a limit for locked pages from get_user_pages_fast()
> 	
> 		
> performance:
> 	using netperf with GSO/TSO disabled, 10G NIC, 
> 	disabled packet split mode, with raw socket case compared to vhost.
> 
> 	bindwidth will be from 1.1Gbps to 1.7Gbps
> 	CPU % from 120%-140% to 140%-160%

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 15:26 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004141655.21885.arnd@arndb.de>

On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> > From: Xin Xiaohui <xiaohui.xin@intel.com>
> > 
> > Add a device to utilize the vhost-net backend driver for
> > copy-less data transfer between guest FE and host NIC.
> > It pins the guest user space to the host memory and
> > provides proto_ops as sendmsg/recvmsg to vhost-net.
> 
> Sorry for taking so long before finding the time to look
> at your code in more detail.
> 
> It seems that you are duplicating a lot of functionality that
> is already in macvtap. I've asked about this before but then
> didn't look at your newer versions. Can you explain the value
> of introducing another interface to user land?

Hmm, I have not noticed a lot of duplication.
BTW macvtap also duplicates tun code, it might be
a good idea for tun to export some functionality.

> I'm still planning to add zero-copy support to macvtap,
> hopefully reusing parts of your code, but do you think there
> is value in having both?

If macvtap would get zero copy tx and rx, maybe not. But
it's not immediately obvious whether zero-copy support
for macvtap might work, though, especially for zero copy rx.
The approach with mpassthru is much simpler in that
it takes complete control of the device.

> > diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> > new file mode 100644
> > index 0000000..86d2525
> > --- /dev/null
> > +++ b/drivers/vhost/mpassthru.c
> > @@ -0,0 +1,1264 @@
> > +
> > +#ifdef MPASSTHRU_DEBUG
> > +static int debug;
> > +
> > +#define DBG  if (mp->debug) printk
> > +#define DBG1 if (debug == 2) printk
> > +#else
> > +#define DBG(a...)
> > +#define DBG1(a...)
> > +#endif
> 
> This should probably just use the existing dev_dbg/pr_debug infrastructure.
> 
> > [... skipping buffer management code for now]
> 
> > +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> > +			struct msghdr *m, size_t total_len)
> > +{
> > [...]
> 
> This function looks like we should be able to easily include it into
> macvtap and get zero-copy transmits without introducing the new
> user-level interface.
> 
> > +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> > +			struct msghdr *m, size_t total_len,
> > +			int flags)
> > +{
> > +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> > +	struct page_ctor *ctor;
> > +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
> 
> It smells like a layering violation to look at the iocb->private field
> from a lower-level driver. I would have hoped that it's possible to implement
> this without having this driver know about the higher-level vhost driver
> internals.

I agree.

> Can you explain why this is needed?
> 
> > +	spin_lock_irqsave(&ctor->read_lock, flag);
> > +	list_add_tail(&info->list, &ctor->readq);
> > +	spin_unlock_irqrestore(&ctor->read_lock, flag);
> > +
> > +	if (!vq->receiver) {
> > +		vq->receiver = mp_recvmsg_notify;
> > +		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> > +				   vq->num * 4096,
> > +				   vq->num * 4096);
> > +	}
> > +
> > +	return 0;
> > +}
> 
> Not sure what I'm missing, but who calls the vq->receiver? This seems
> to be neither in the upstream version of vhost nor introduced by your
> patch.
> 
> > +static void __mp_detach(struct mp_struct *mp)
> > +{
> > +	mp->mfile = NULL;
> > +
> > +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> > +	page_ctor_detach(mp);
> > +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> > +
> > +	/* Drop the extra count on the net device */
> > +	dev_put(mp->dev);
> > +}
> > +
> > +static DEFINE_MUTEX(mp_mutex);
> > +
> > +static void mp_detach(struct mp_struct *mp)
> > +{
> > +	mutex_lock(&mp_mutex);
> > +	__mp_detach(mp);
> > +	mutex_unlock(&mp_mutex);
> > +}
> > +
> > +static void mp_put(struct mp_file *mfile)
> > +{
> > +	if (atomic_dec_and_test(&mfile->count))
> > +		mp_detach(mfile->mp);
> > +}
> > +
> > +static int mp_release(struct socket *sock)
> > +{
> > +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> > +	struct mp_file *mfile = mp->mfile;
> > +
> > +	mp_put(mfile);
> > +	sock_put(mp->socket.sk);
> > +	put_net(mfile->net);
> > +
> > +	return 0;
> > +}
> 
> Doesn't this prevent the underlying interface from going away while the chardev
> is open? You also have logic to handle that case, so why do you keep the extra
> reference on the netdev?
> 
> > +/* Ops structure to mimic raw sockets with mp device */
> > +static const struct proto_ops mp_socket_ops = {
> > +	.sendmsg = mp_sendmsg,
> > +	.recvmsg = mp_recvmsg,
> > +	.release = mp_release,
> > +};
> 
> > +static int mp_chr_open(struct inode *inode, struct file * file)
> > +{
> > +	struct mp_file *mfile;
> > +	cycle_kernel_lock();
> 
> I don't think you really want to use the BKL here, just kill that line.
> 
> > +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> > +		unsigned long arg)
> > +{
> > +	struct mp_file *mfile = file->private_data;
> > +	struct mp_struct *mp;
> > +	struct net_device *dev;
> > +	void __user* argp = (void __user *)arg;
> > +	struct ifreq ifr;
> > +	struct sock *sk;
> > +	int ret;
> > +
> > +	ret = -EINVAL;
> > +
> > +	switch (cmd) {
> > +	case MPASSTHRU_BINDDEV:
> > +		ret = -EFAULT;
> > +		if (copy_from_user(&ifr, argp, sizeof ifr))
> > +			break;
> 
> This is broken for 32 bit compat mode ioctls, because struct ifreq
> is different between 32 and 64 bit systems. Since you are only
> using the device name anyway, a fixed length string or just the
> interface index would be simpler and work better.
> 
> > +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> > +
> > +		ret = -EBUSY;
> > +
> > +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> > +			break;
> 
> Your current use of the IFF_MPASSTHRU* flags does not seem to make
> any sense whatsoever. You check that this flag is never set, but set
> it later yourself and then ignore all flags.
> 
> > +		ret = -ENODEV;
> > +		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
> > +		if (!dev)
> > +			break;
> 
> There is no permission checking on who can access what device, which
> seems a bit simplistic. Any user that has access to the mpassthru device
> seems to be able to bind to any network interface in the namespace.
> This is one point where the macvtap model seems more appropriate, it
> separates the permissions for creating logical interfaces and using them.
> 
> > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > +				unsigned long count, loff_t pos)
> > +{
> > +	struct file *file = iocb->ki_filp;
> > +	struct mp_struct *mp = mp_get(file->private_data);
> > +	struct sock *sk = mp->socket.sk;
> > +	struct sk_buff *skb;
> > +	int len, err;
> > +	ssize_t result;
> 
> Can you explain what this function is even there for? AFAICT, vhost-net
> doesn't call it, the interface is incompatible with the existing
> tap interface, and you don't provide a read function.

qemu needs the ability to inject raw packets into device
from userspace, bypassing vhost/virtio (for live migration).

> > diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
> > new file mode 100644
> > index 0000000..2be21c5
> > --- /dev/null
> > +++ b/include/linux/mpassthru.h
> > @@ -0,0 +1,29 @@
> > +#ifndef __MPASSTHRU_H
> > +#define __MPASSTHRU_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/if_ether.h>
> > +
> > +/* ioctl defines */
> > +#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
> > +#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)
> 
> These definitions are slightly wrong, because you pass more than just an 'int'.
> 
> > +/* MPASSTHRU ifc flags */
> > +#define IFF_MPASSTHRU		0x0001
> > +#define IFF_MPASSTHRU_EXCL	0x0002
> 
> As mentioned above, these flags don't make any sense with your current code.
> 
> 	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: Eric Dumazet @ 2010-04-14 15:34 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: David Miller, Paul E. McKenney, netdev, linux-kernel
In-Reply-To: <1271257027.16881.1663.camel@edumazet-laptop>

Le mercredi 14 avril 2010 à 16:57 +0200, Eric Dumazet a écrit :
> Le mercredi 14 avril 2010 à 16:36 +0800, Lai Jiangshan a écrit :
> 
> > Since rcu_read_lock() protects fasync_struct *fa for us, we can access
> > to @fa safely even fasync_remove_entry() is just called.
> > 
> > But this patch does not ensure 'fa->fa_file is not freed' nor
> > 'fa->fa_fd is not released', so kill_fasync_rcu() may do wrong thing
> > if there is no other code ensure it.
> 
> You are 100% right, I forgot my old attempt to RCUified struct files
> failed...
> 
> Maybe its time to finally move f_owner out of struct file, and use RCU
> to free it.
> 
> In the mean time, adding a lock in fasync_struct is more than enough.
> 
> Thanks !
> 
> [PATCH net-next-2.6 v2] fasync: fine grained locking
> 
> kill_fasync() uses a central rwlock, candidate for RCU conversion, to
> avoid cache line ping pongs on SMP.
> 
> fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
> section instead during whole list scan.
> 
> Use a spinlock per fasync_struct to synchronize fasync_{remove|
> add}_entry() and kill_fasync_rcu()
> 
> We can remove __kill_fasync() direct use in net, and rename it to
> kill_fasync_rcu().
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Please wait for a v3 version, as net/socket.c sock_fasync() should be
updated too...

^ permalink raw reply

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: Paul E. McKenney @ 2010-04-14 15:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, linux-kernel
In-Reply-To: <1271230961.16881.630.camel@edumazet-laptop>

On Wed, Apr 14, 2010 at 09:42:41AM +0200, Eric Dumazet wrote:
> Paul, could you please check this patch, I am not sure
> of the IRQ safety thing...
> 
> Is call_rcu() the right method to use in this case ?

It looks like all the read-side critical sections are protected by
rcu_read_lock(), so call_rcu() should be OK.  And it is OK to invoke
call_rcu() with irqs disabled.  (Just don't try it in an NMI handler.)

Or am I missing some code path that tries to use disabling of irqs
instead of using rcu_read_lock()?  That happens to work in the current
implementation, but...

							Thanx, Paul

> Thanks
> 
> [PATCH net-next-2.6] fasync: RCU locking
> 
> kill_fasync() uses a central rwlock, candidate for RCU conversion.
> 
> We can remove __kill_fasync() direct use in net, and rename it to
> kill_fasync_rcu()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  fs/fcntl.c         |   36 +++++++++++++++++++++---------------
>  include/linux/fs.h |   11 +++++------
>  net/socket.c       |    4 ++--
>  3 files changed, 28 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 452d02f..33cb3ee 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -614,9 +614,15 @@ int send_sigurg(struct fown_struct *fown)
>  	return ret;
>  }
> 
> -static DEFINE_RWLOCK(fasync_lock);
> +static DEFINE_SPINLOCK(fasync_lock);
>  static struct kmem_cache *fasync_cache __read_mostly;
> 
> +static void fasync_free_rcu(struct rcu_head *head)
> +{
> +	kmem_cache_free(fasync_cache,
> +			container_of(head, struct fasync_struct, fa_rcu));
> +}
> +
>  /*
>   * Remove a fasync entry. If successfully removed, return
>   * positive and clear the FASYNC flag. If no entry exists,
> @@ -634,17 +640,17 @@ static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
>  	int result = 0;
> 
>  	spin_lock(&filp->f_lock);
> -	write_lock_irq(&fasync_lock);
> +	spin_lock_irq(&fasync_lock);
>  	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
>  		if (fa->fa_file != filp)
>  			continue;
>  		*fp = fa->fa_next;
> -		kmem_cache_free(fasync_cache, fa);
> +		call_rcu(&fa->fa_rcu, fasync_free_rcu);
>  		filp->f_flags &= ~FASYNC;
>  		result = 1;
>  		break;
>  	}
> -	write_unlock_irq(&fasync_lock);
> +	spin_unlock_irq(&fasync_lock);
>  	spin_unlock(&filp->f_lock);
>  	return result;
>  }
> @@ -666,7 +672,7 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
>  		return -ENOMEM;
> 
>  	spin_lock(&filp->f_lock);
> -	write_lock_irq(&fasync_lock);
> +	spin_lock_irq(&fasync_lock);
>  	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
>  		if (fa->fa_file != filp)
>  			continue;
> @@ -679,12 +685,12 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
>  	new->fa_file = filp;
>  	new->fa_fd = fd;
>  	new->fa_next = *fapp;
> -	*fapp = new;
> +	rcu_assign_pointer(*fapp, new);
>  	result = 1;
>  	filp->f_flags |= FASYNC;
> 
>  out:
> -	write_unlock_irq(&fasync_lock);
> +	spin_unlock_irq(&fasync_lock);
>  	spin_unlock(&filp->f_lock);
>  	return result;
>  }
> @@ -704,7 +710,10 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
> 
>  EXPORT_SYMBOL(fasync_helper);
> 
> -void __kill_fasync(struct fasync_struct *fa, int sig, int band)
> +/*
> + * rcu_read_lock() is held
> + */
> +static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
>  {
>  	while (fa) {
>  		struct fown_struct * fown;
> @@ -719,22 +728,19 @@ void __kill_fasync(struct fasync_struct *fa, int sig, int band)
>  		   mechanism. */
>  		if (!(sig == SIGURG && fown->signum == 0))
>  			send_sigio(fown, fa->fa_fd, band);
> -		fa = fa->fa_next;
> +		fa = rcu_dereference(fa->fa_next);
>  	}
>  }
> 
> -EXPORT_SYMBOL(__kill_fasync);
> -
>  void kill_fasync(struct fasync_struct **fp, int sig, int band)
>  {
>  	/* First a quick test without locking: usually
>  	 * the list is empty.
>  	 */
>  	if (*fp) {
> -		read_lock(&fasync_lock);
> -		/* reread *fp after obtaining the lock */
> -		__kill_fasync(*fp, sig, band);
> -		read_unlock(&fasync_lock);
> +		rcu_read_lock();
> +		kill_fasync_rcu(rcu_dereference(*fp), sig, band);
> +		rcu_read_unlock();
>  	}
>  }
>  EXPORT_SYMBOL(kill_fasync);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 39d57bc..158b2cc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1280,10 +1280,11 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
> 
> 
>  struct fasync_struct {
> -	int	magic;
> -	int	fa_fd;
> -	struct	fasync_struct	*fa_next; /* singly linked list */
> -	struct	file 		*fa_file;
> +	int			magic;
> +	int			fa_fd;
> +	struct fasync_struct	*fa_next; /* singly linked list */
> +	struct file		*fa_file;
> +	struct rcu_head		fa_rcu;
>  };
> 
>  #define FASYNC_MAGIC 0x4601
> @@ -1292,8 +1293,6 @@ struct fasync_struct {
>  extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
>  /* can be called from interrupts */
>  extern void kill_fasync(struct fasync_struct **, int, int);
> -/* only for net: no internal synchronization */
> -extern void __kill_fasync(struct fasync_struct *, int, int);
> 
>  extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
>  extern int f_setown(struct file *filp, unsigned long arg, int force);
> diff --git a/net/socket.c b/net/socket.c
> index 35bc198..846739c 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1159,10 +1159,10 @@ int sock_wake_async(struct socket *sock, int how, int band)
>  		/* fall through */
>  	case SOCK_WAKE_IO:
>  call_kill:
> -		__kill_fasync(sock->fasync_list, SIGIO, band);
> +		kill_fasync(sock->fasync_list, SIGIO, band);
>  		break;
>  	case SOCK_WAKE_URG:
> -		__kill_fasync(sock->fasync_list, SIGURG, band);
> +		kill_fasync(sock->fasync_list, SIGURG, band);
>  	}
>  	return 0;
>  }
> 
> 

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 15:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414152615.GA8079@redhat.com>

On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> > On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> > > From: Xin Xiaohui <xiaohui.xin@intel.com>
>
> > It seems that you are duplicating a lot of functionality that
> > is already in macvtap. I've asked about this before but then
> > didn't look at your newer versions. Can you explain the value
> > of introducing another interface to user land?
> 
> Hmm, I have not noticed a lot of duplication.

The code is indeed quite distinct, but the idea of adding another
character device to pass into vhost for direct device access is.

> BTW macvtap also duplicates tun code, it might be
> a good idea for tun to export some functionality.

Yes, that's something I plan to look into.

> > I'm still planning to add zero-copy support to macvtap,
> > hopefully reusing parts of your code, but do you think there
> > is value in having both?
> 
> If macvtap would get zero copy tx and rx, maybe not. But
> it's not immediately obvious whether zero-copy support
> for macvtap might work, though, especially for zero copy rx.
> The approach with mpassthru is much simpler in that
> it takes complete control of the device.

As far as I can tell, the most significant limitation of mpassthru
is that there can only ever be a single guest on a physical NIC.

Given that limitation, I believe we can do the same on macvtap,
and simply disable zero-copy RX when you want to use more than one
guest, or both guest and host on the same NIC.

The logical next step here would be to allow VMDq and similar
technologies to separate out the RX traffic in the hardware.
We don't have a configuration interface for that yet, but
since this is logically the same as macvlan, I think we should
use the same interfaces for both, essentially treating VMDq
as a hardware acceleration for macvlan. We can probably handle
it in similar ways to how we handle hardware support for vlan.

At that stage, macvtap would be the logical interface for
connecting a VMDq (hardware macvlan) device to a guest!

> > > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > > +				unsigned long count, loff_t pos)
> > > +{
> > > +	struct file *file = iocb->ki_filp;
> > > +	struct mp_struct *mp = mp_get(file->private_data);
> > > +	struct sock *sk = mp->socket.sk;
> > > +	struct sk_buff *skb;
> > > +	int len, err;
> > > +	ssize_t result;
> > 
> > Can you explain what this function is even there for? AFAICT, vhost-net
> > doesn't call it, the interface is incompatible with the existing
> > tap interface, and you don't provide a read function.
> 
> qemu needs the ability to inject raw packets into device
> from userspace, bypassing vhost/virtio (for live migration).

Ok, but since there is only a write callback and no read, it won't
actually be able to do this with the current code, right?

Moreover, it seems weird to have a new type of interface here that
duplicates tap/macvtap with less functionality. Coming back
to your original comment, this means that while mpassthru is currently
not duplicating the actual code from macvtap, it would need to do
exactly that to get the qemu interface right!

	Arnd

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 16:16 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004141757.54829.arnd@arndb.de>

On Wed, Apr 14, 2010 at 05:57:54PM +0200, Arnd Bergmann wrote:
> On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> > > On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> > > > From: Xin Xiaohui <xiaohui.xin@intel.com>
> >
> > > It seems that you are duplicating a lot of functionality that
> > > is already in macvtap. I've asked about this before but then
> > > didn't look at your newer versions. Can you explain the value
> > > of introducing another interface to user land?
> > 
> > Hmm, I have not noticed a lot of duplication.
> 
> The code is indeed quite distinct, but the idea of adding another
> character device to pass into vhost for direct device access is.

All backends besides tap seem to do this, btw :)

> > BTW macvtap also duplicates tun code, it might be
> > a good idea for tun to export some functionality.
> 
> Yes, that's something I plan to look into.
> 
> > > I'm still planning to add zero-copy support to macvtap,
> > > hopefully reusing parts of your code, but do you think there
> > > is value in having both?
> > 
> > If macvtap would get zero copy tx and rx, maybe not. But
> > it's not immediately obvious whether zero-copy support
> > for macvtap might work, though, especially for zero copy rx.
> > The approach with mpassthru is much simpler in that
> > it takes complete control of the device.
> 
> As far as I can tell, the most significant limitation of mpassthru
> is that there can only ever be a single guest on a physical NIC.
> 
> Given that limitation, I believe we can do the same on macvtap,
> and simply disable zero-copy RX when you want to use more than one
> guest, or both guest and host on the same NIC.
> 
> The logical next step here would be to allow VMDq and similar
> technologies to separate out the RX traffic in the hardware.
> We don't have a configuration interface for that yet, but
> since this is logically the same as macvlan, I think we should
> use the same interfaces for both, essentially treating VMDq
> as a hardware acceleration for macvlan. We can probably handle
> it in similar ways to how we handle hardware support for vlan.
> 
> At that stage, macvtap would be the logical interface for
> connecting a VMDq (hardware macvlan) device to a guest!

I won't object to that but ... code walks.

> > > > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > > > +				unsigned long count, loff_t pos)
> > > > +{
> > > > +	struct file *file = iocb->ki_filp;
> > > > +	struct mp_struct *mp = mp_get(file->private_data);
> > > > +	struct sock *sk = mp->socket.sk;
> > > > +	struct sk_buff *skb;
> > > > +	int len, err;
> > > > +	ssize_t result;
> > > 
> > > Can you explain what this function is even there for? AFAICT, vhost-net
> > > doesn't call it, the interface is incompatible with the existing
> > > tap interface, and you don't provide a read function.
> > 
> > qemu needs the ability to inject raw packets into device
> > from userspace, bypassing vhost/virtio (for live migration).
> 
> Ok, but since there is only a write callback and no read, it won't
> actually be able to do this with the current code, right?

I think it'll work as is, with vhost qemu only ever writes,
never reads from device. We'll also never need GSO etc
which is a large part of what tap does (and macvtap will
have to do).

> Moreover, it seems weird to have a new type of interface here that
> duplicates tap/macvtap with less functionality. Coming back
> to your original comment, this means that while mpassthru is currently
> not duplicating the actual code from macvtap, it would need to do
> exactly that to get the qemu interface right!
> 
> 	Arnd

I don't think so, see above. anyway, both can reuse tun.c :)

-- 
MST

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 16:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414161610.GA10897@redhat.com>

On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > 
> > > qemu needs the ability to inject raw packets into device
> > > from userspace, bypassing vhost/virtio (for live migration).
> > 
> > Ok, but since there is only a write callback and no read, it won't
> > actually be able to do this with the current code, right?
> 
> I think it'll work as is, with vhost qemu only ever writes,
> never reads from device. We'll also never need GSO etc
> which is a large part of what tap does (and macvtap will
> have to do).

Ah, I see. I didn't realize that qemu needs to write to the
device even if vhost is used. But for the case of migration to
another machine without vhost, wouldn't qemu also need to read?

> > Moreover, it seems weird to have a new type of interface here that
> > duplicates tap/macvtap with less functionality. Coming back
> > to your original comment, this means that while mpassthru is currently
> > not duplicating the actual code from macvtap, it would need to do
> > exactly that to get the qemu interface right!
> > 
> I don't think so, see above. anyway, both can reuse tun.c :)

There is one significant difference between macvtap/mpassthru and
tun/tap in that the directions are reversed. While macvtap and
mpassthru forward data from write into dev_queue_xmit and from
skb_receive into read, tun/tap forwards data from write into
skb_receive and from start_xmit into read.

Also, I'm not really objecting to duplicating code between
macvtap and mpassthru, as the implementation can always be merged.
My main objection is instead to having two different _user_interfaces_
for doing the same thing.

	Arnd

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-14 17:31 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, netdev, robert, David Miller, Changli Gao,
	Andi Kleen
In-Reply-To: <1271245986.3943.55.camel@bigi>

The point of RPS is to increase parallelism, but the cost of that is
more overhead per packet.  If you are running a single flow, then
you'll see latency increase for that flow.  With more concurrent flows
the benefits of parallelism kick in and latency gets better.-- we've
seen the break even point around ten connections in our tests.  Also,
I don't think we've made the claim that RPS should generally perform
better than multi-queue, the primary motivation for RPS is make single
queue NICs give reasonable performance.


On Wed, Apr 14, 2010 at 4:53 AM, jamal <hadi@cyberus.ca> wrote:
> Following up like promised:
>
> On Mon, 2010-02-08 at 10:09 -0500, jamal wrote:
>> On Sun, 2010-02-07 at 21:58 -0800, Tom Herbert wrote:
>>
>> > I don't have specific numbers, although we are using this on
>> > application doing forwarding and numbers seem in line with what we see
>> > for an end host.
>> >
>>
>> When i get the chance i will give it a run. I have access to an i7
>> somewhere. It seems like i need some specific nics?
>
> I did step #0 last night on an i7 (single Nehalem). I think more than
> anything i was impressed by the Nehalem's excellent caching system.
> Robert, I am almost tempted to say skb recycling performance will be
> excellent on this  machine given the cost of a cache miss is much lower
> than previous generation hardware.
>
> My test was simple: irq affinity on cpu0(core0) and rps redirection to
> cpu1(core 1); tried also to redirect to different SMT threads (aka CPUs)
> on different cores with similar results. I base tested against no rps
> being used and a kernel which didnt have any RPS config on.
> [BTW, I had to hand-edit the .config since i couldnt do it from
> menuconfig (Is there any reason for it to be so?)]
>
> Traffic was sent from another machine into the i7 via an el-cheapo sky2
> (dont know how shitty this NIC is, but it seems to know how to do MSI so
> probably capable of multiqueueing); the test was several sets of
> a ping first and then a ping -f (I will get more sophisticated in my
> next test likely this weekend).
>
> Results:
> CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
> cpu was being chewed highly by sky2_poll and on the redirected-to-core
> it was always smp_call_function_single.
> Latency was (consistently) on average 5 microseconds.
> So if i sent 1M ping -f packets, without RPS it took on average
> 176 seconds and with RPS it took 181 seconds to do a round-trip.
> Throughput didnt change but this could be attributed to the low amounts
> of data i was sending.
> I observed that we were generating, on average, an IPI per packet even
> with ping -f. (added an extra stat to record when we sent an IPI and
> counted against the number of packets sent).
> In my opinion it is these IPIs that contribute the most to the latency
> and i think it happens that the Nehalem is just highly improved in this
> area. I wish i had a more commonly used machine to test rps on.
> I expect that rps will perform worse on currently cheaper/older hardware
> for the traffic characteristic i tested.
>
> On IPIs:
> Is anyone familiar with what is going on with Nehalem? Why is it this
> good? I expect things will get a lot nastier with other hardware like
> xeon based or even Nehalem with rps going across QPI.
> Here's why i think IPIs are bad, please correct me if i am wrong:
> - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> is in the form of an IPI).
> - data cache has to be synced to main memory
> - the instruction pipeline is flushed
> - what else did i miss? Andi?
>
> So my question to Tom, Eric and Changli or anyone else who has been
> running RPS:
> What hardware did you use? Is there anyone using older hardware than
> say AMD Opteron or Intel Nehalem?
>
> My impressions of rps so far:
> I think i may end up being impressed when i generate a lot more traffic
> since the cost of IPI will be amortized.
> At this point multiqueue seems a lot more impressive alternative and it
> seems to me multiqueu hardware is a lot more commodity (price-point)
> than a Nehalem.
>
> Plan:
> I plan to still attack the app space (and write a basic udp app that
> binds to one or more rps cpus and try blasting a lot of UDP traffic to
> see what happens) my step after that is to move to forwarding tests..
>
> cheers,
> jamal
>
>

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 18:04 UTC (permalink / raw)
  To: Tom Herbert; +Cc: hadi, netdev, robert, David Miller, Changli Gao, Andi Kleen
In-Reply-To: <t2p65634d661004141031xf80f62e7sb64362ea1ce10a1f@mail.gmail.com>

Le mercredi 14 avril 2010 à 10:31 -0700, Tom Herbert a écrit :
> The point of RPS is to increase parallelism, but the cost of that is
> more overhead per packet.  If you are running a single flow, then
> you'll see latency increase for that flow.  With more concurrent flows
> the benefits of parallelism kick in and latency gets better.-- we've
> seen the break even point around ten connections in our tests.  Also,
> I don't think we've made the claim that RPS should generally perform
> better than multi-queue, the primary motivation for RPS is make single
> queue NICs give reasonable performance.
> 

Yes, multiqueue is far better of course, but in case of hardware lacking
multiqueue, RPS can help many workloads, where application has _some_
work to do, not only counting frames or so...

RPS overhead (IPI, cache misses, ...) must be amortized by
parallelization or we lose.

A ping test is not an ideal candidate for RPS, since everything is done
at softirq level, and should be faster without RPS...




^ permalink raw reply

* Re: [E1000-devel] e1000e/netdev.c patch -- tx_ring->next_to_use
From: Brandeburg, Jesse @ 2010-04-14 18:12 UTC (permalink / raw)
  To: Charles Slivkoff
  Cc: terry.loftin@hp.com, e1000-devel@lists.sourceforge.net,
	davem@davemloft.net, Kirsher, Jeffrey T, netdev, emil.s.tantilov
In-Reply-To: <4BC5E563.9050206@cmu.edu>



On Wed, 14 Apr 2010, Charles Slivkoff wrote:
> I have been experiencing a number of system hangs which I believe are 
> due to the e1000e driver. I have a Dell Optiplex 760, Intel Core 2 Duo, 
> 4GB RAM, and I'm running Ubuntu 9.10 (32-bit).

have you filed a bug at launchpad?  if so what is the number?  I just want 
to unite all the information we have.

>  From the stack included in the kernel oops output, I decided to apply 
> the patch you provided, which I found posted here:
> 
> 	http://patchwork.ozlabs.org/patch/49175/

Hi Charles, I copied netdev for you.  I agree the panic you're seeing is 
from something inside the e1000e driver.  The question becomes why is the 
driver getting a null pointer dereference in transmit cleanup.

> This morning, I attempted an rsync operation which caused a hang once again.
> 
> I am attaching the oops output from 04/08/2010 and 04/14/2010.

I see you also filed a bug at e1000's sourceforge, thank you.

As a workaround you can try disabling TSO using ethtool to see if that 
helps.  We need to reproduce this here if possible.

ethtool -K eth0 tso off

Do you happen to *not* have irqbalance installed or enabled?  I was 
confused by the move_irq in one of the stack traces.  In any case it 
probably doesn't matter but I was not expecting to see that there.

For others, I've included the panic traces inline here...

[603636.169243] BUG: unable to handle kernel NULL pointer dereference at 000000ac
[603636.172898] IP: [<f82ee88f>] e1000_clean_tx_irq+0x8f/0x330 [e1000e]
[603636.172898] *pdpt = 000000002c954001 *pde = 0000000000000000
[603636.172898] Oops: 0000 [#1] SMP
[603636.172898] last sysfs file: /sys/devices/virtual/block/ram9/uevent
[603636.172898] Modules linked in: isofs udf crc_itu_t ppp_async crc_ccitt 
vmnet vmci vmmon binfmt_misc cisco_ipsec(P) openafs(P) deflate 
zlib_deflate ctr twofish twofish_common camellia serpent blowfish cast5 
des_generic cbc aes_i586 aes_generic xcbc rmd160 sha256_generic 
sha1_generic crypto_null af_key nfsd exportfsnfs lockd nfs_acl auth_rpcgss 
sunrpc snd_hda_codec_analog ipt_REJECT ipt_LOG xt_limit xt_tcpudp xt_state 
ipt_addrtype snd_hda_intel snd_hda_codec snd_usb_audio snd_pcm_oss 
snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_usb_lib snd_seq_midi 
ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc snd_rawmidi 
snd_seq_midi_event snd_seq nf_nat_ftp nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack snd_hwdep coretemp 
iptable_filter uvcvideo videodev snd_timer snd_seq_device v4l1_compat 
ip_tables x_tables psmouse serio_raw ppdev dell_wmi dcdbas parport_pc 
fglrx(P) snd soundcore lp snd_page_alloc parport heci(C) usbhid intel_agp 
e1000e agpgart
[603636.172898]
[603636.172898] Pid: 4368, comm: chrome Tainted: P         C  (2.6.31-20-generic-pae #58-Ubuntu) OptiPlex 760
[603636.172898] EIP: 0060:[<f82ee88f>] EFLAGS: 00010246 CPU: 0
[603636.172898] EIP is at e1000_clean_tx_irq+0x8f/0x330 [e1000e]
[603636.172898] EAX: 00000000 EBX: 00000024 ECX: 00000240 EDX: f902e360
[603636.172898] ESI: f6e741e0 EDI: ef412000 EBP: ebcbbd84 ESP: ebcbbd24
[603636.172898]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[603636.172898] Process chrome (pid: 4368, ti=ebcba000 task=ee5cd7f0 task.ti=ebcba000)
[603636.172898] Stack:
[603636.172898]  fffb2668 ec9b6d50 00000001 00000014 d7938e00 ebcbbeb4 0000000000000000
[603636.172898] <0> 00000000 00000000 ec9f01b8 00000001 f64dc000 b54cd000 00000086 00000001
[603636.172898] <0> f64dc340 00000024 00000086 c057bf01 00000001 f64dc340 f64dc340 00000040
[603636.172898] Call Trace:
[603636.172898]  [<c057bf01>] ? do_page_fault+0x141/0x380
[603636.172898]  [<f82f0a54>] ? e1000_clean+0x54/0x270 [e1000e]
[603636.172898]  [<c04a7795>] ? net_rx_action+0xe5/0x1c0
[603636.172898]  [<c014cb30>] ? __do_softirq+0x90/0x1a0
[603636.172898]  [<c019189c>] ? handle_IRQ_event+0x4c/0x140
[603636.172898]  [<c01fcb42>] ? __d_lookup+0x102/0x110
[603636.172898]  [<c0194544>] ? move_native_irq+0x14/0x50
[603636.172898]  [<c014cc7d>] ? do_softirq+0x3d/0x40
[603636.172898]  [<c014cdbd>] ? irq_exit+0x5d/0x70
[603636.172898]  [<c0104f50>] ? do_IRQ+0x50/0xc0
[603636.172898]  [<c01e6ec2>] ? __mem_cgroup_uncharge_common+0xa2/0xf0
[603636.172898]  [<c01039f0>] ? common_interrupt+0x30/0x40
[603636.172898]  [<c048007b>] ? hidinput_configure_usage+0xcab/0x2290
[603636.172898]  [<c05700d8>] ? hlt_loop+0x3/0xb
[603636.172898]  [<c04edbd1>] ? udp_v4_get_port+0x1/0x20
[603636.172898]  [<c04f6421>] ? inet_autobind+0x21/0x60
[603636.172898]  [<c04f659d>] ? inet_dgram_connect+0x5d/0x70
[603636.172898]  [<c049684e>] ? sys_connect+0xae/0xd0
[603636.172898]  [<c02d03b3>] ? security_d_instantiate+0x13/0x30
[603636.172898]  [<c01fc690>] ? d_instantiate+0x40/0x50
[603636.172898]  [<c0495178>] ? sock_attach_fd+0x78/0xc0
[603636.172898]  [<c0579a88>] ? _spin_lock+0x8/0x10
[603636.172898]  [<c01e9207>] ? fd_install+0x47/0x60
[603636.172898]  [<c04951fd>] ? sock_map_fd+0x3d/0x60
[603636.172898]  [<c0497578>] ? sys_socketcall+0x248/0x270
[603636.172898]  [<c01032c3>] ? sysenter_do_call+0x12/0x28
[603636.1: lost 7 rtc interrupts
[603636.538819] hpet1: lost 7 rtc interrupts
[603636.542825] hpet1: lost 7 rtc interrupts
[603636.546830] hpet1: lost 8 rtc interrupts
[603636.550836] hpet1: lost 7 rtc interrupts
[603636.554842] hpet1: lost 7 rtc interrupts
[603636.558848] hpet1: lost 7 rtc interrupts
[603636.562854] hpet1: lost 8 rtc interrupts
[603636.566865] ---[ end trace f3dd0b8abcd2bca2 ]---
[603636.571563] Kernel panic - not syncing: Fatal exception in interrupt
[603636.577995] Pid: 4368, comm: chrome Tainted: P      D  C 2.6.31-20-generic-pae #58-Ubuntu
[603636.586245] Call Trace:
[603636.588779]  [<c05775ee>] ? printk+0x18/0x1a
[603636.593132]  [<c0577532>] panic+0x43/0xe7
[603636.597226]  [<c057a935>] oops_end+0xc5/0xd0
[603636.601580]  [<c0129084>] no_context+0xb4/0xd0
[603636.606107]  [<c01290dd>] __bad_area_nosemaphore+0x3d/0x1a0
[603636.611760]  [<c012eb2e>] ? kmap_atomic_prot+0xde/0x100
[603636.617067]  [<c012e972>] ? kunmap_atomic+0x52/0x70


and....


[496797.222642] BUG: unable to handle kernel NULL pointer dereference at 000000ac
[496797.229405] IP: [<f922d50b>] e1000_clean_tx_irq+0xcb/0x320 [e1000e]
[496797.232626] *pdpt = 000000002b5cd001 *pde = 0000000000000000
[496797.232626] Oops: 0000 [#1] SMP
[496797.232626] last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/class
[496797.232626] Modules linked in: e1000e vmnet vmci vmmon binfmt_misc 
cisco_ipsec(P) openafs(P) deflate zlib_deflate ctr twofish twofish_common 
camellia serpent blowfish cast5 des_generic cbc aes_i586 aes_generic xcbc 
rmd160 sha256_generic sha1_generic crypto_null af_key nfsd exportfs nfs 
lockd nfs_acl auth_rpcgss sunrpc snd_hda_codec_analog ipt_REJECT ipt_LOG 
xt_limit xt_tcpudp xt_state ipt_addrtype ip6table_filter ip6_tables 
nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 snd_usb_audio snd_hda_intel snd_hda_codec snd_seq_dummy 
snd_pcm_oss nf_conntrack_ftp snd_mixer_oss nf_conntrack iptable_filter 
ppdev ip_tables x_tables snd_pcm snd_usb_lib snd_hwdep snd_seq_oss 
dell_wmi dcdbas uvcvideo videodev v4l1_compat psmouse serio_raw fglrx(P) 
snd_seq_midi parport_pc snd_rawmidi snd_seq_midi_event snd_seq snd_timer 
snd_seq_device snd soundcoresnd_page_alloc heci(C) coretemp lp parport 
usbhid intel_agp agpgart [last unloaded: e1000e]
[496797.232626]
[496797.232626] Pid: 11466, comm: ssh Tainted: P         C (2.6.31-20-generic-pae #58-Ubuntu) OptiPlex 760
[496797.232626] EIP: 0060:[<f922d50b>] EFLAGS: 00210246 CPU: 1
[496797.232626] EIP is at e1000_clean_tx_irq+0xcb/0x320 [e1000e]
[496797.232626] EAX: 00000000 EBX: 00000056 ECX: 00000560 EDX: f8554810
[496797.232626] ESI: e9cba360 EDI: e9c6c000 EBP: efc8fcfc ESP: efc8fc9c
[496797.232626]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[496797.232626] Process ssh (pid: 11466, ti=efc8e000 task=f0554b60 task.ti=efc8e000)
[496797.232626] Stack:
[496797.232626]  00000020 20dd855d 00000005 f0014c00 e49ea5a0 efc8fcf8 c04e2387efc8fce4
[496797.232626] <0> c04dfc04 000252d0 00002b00 00000001 f6074000 e49ea580 00004912 0000000f
[496797.232626] <0> f6074340 00000056 0000058e c0127f01 0000000f f6074340 f6074340 00000040
[496797.232626] Call Trace:
[496797.232626]  [<c04e2387>] ? tcp_transmit_skb+0x397/0x650
[496797.232626]  [<c04dfc04>] ? tcp_clean_rtx_queue+0x3f4/0x7b0
[496797.232626]  [<c0127f01>] ? native_patch+0xf1/0x110
[496797.232626]  [<f922f504>] ? e1000_clean+0x54/0x270 [e1000e]
[496797.232626]  [<c0152227>] ? lock_timer_base+0x27/0x50
[496797.232626]  [<c04a7795>] ? net_rx_action+0xe5/0x1c0
[496797.232626]  [<c014cb30>] ? __do_softirq+0x90/0x1a0
[496797.232626]  [<c04db9df>] ? __tcp_ack_snd_check+0x5f/0x80
[496797.232626]  [<c04e0dfe>] ? tcp_rcv_established+0x32e/0x5f0
[496797.232626]  [<c014cc7d>] ? do_softirq+0x3d/0x40
[496797.232626]  [<c014d805>] ? local_bh_enable_ip+0x75/0x90
[496797.232626]  [<c0579c51>] ? _spin_unlock_bh+0x11/0x20
[496797.232626]  [<c04983d4>] ? release_sock+0x94/0xa0
[496797.232626]  [<c04d5535>] ? tcp_push+0x75/0xb0
[496797.488251]  [<c04d86bd>] ? tcp_sendmsg+0x67d/0x900



^ permalink raw reply

* Re: e1000e/netdev.c patch -- tx_ring->next_to_use
From: Charles Slivkoff @ 2010-04-14 18:47 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: emil.s.tantilov, e1000-devel@lists.sourceforge.net, netdev,
	terry.loftin@hp.com, Kirsher, Jeffrey T, davem@davemloft.net
In-Reply-To: <alpine.WNT.2.00.1004141105060.4368@jbrandeb-desk1.amr.corp.intel.com>

On 04/14/2010 02:12 PM, Brandeburg, Jesse wrote:
...

> have you filed a bug at launchpad?  if so what is the number?  I just want 
> to unite all the information we have.

I just submitted one, using "ubuntu-bug", so a number of associated 
system details are included as attachments.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/563267

...

> As a workaround you can try disabling TSO using ethtool to see if that 
> helps.  We need to reproduce this here if possible.
> 
> ethtool -K eth0 tso off

I will give this a try.

> Do you happen to *not* have irqbalance installed or enabled?  I was 
> confused by the move_irq in one of the stack traces.  In any case it 
> probably doesn't matter but I was not expecting to see that there.

irqbalance is *not* installed.

...


Thanks,

-Charles


------------------------------------------------------------------------------
Download Intel&#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-14 18:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, netdev, robert, David Miller, Changli Gao,
	Andi Kleen
In-Reply-To: <1271268242.16881.1719.camel@edumazet-laptop>

On Wed, 2010-04-14 at 20:04 +0200, Eric Dumazet wrote:

> Yes, multiqueue is far better of course, but in case of hardware lacking
> multiqueue, RPS can help many workloads, where application has _some_
> work to do, not only counting frames or so...

Agreed. So to enumerate, the benefits come in if:
a) you have many processors
b) you have single-queue nic
c) at sub-threshold traffic you dont care about a little latency
d) you have a specific cache hierachy
e) app is working hard to process incoming messages

> RPS overhead (IPI, cache misses, ...) must be amortized by
> parallelization or we lose.

Indeed. 
How well they can be amortized seems very cpu or board specific.

I think the main challenge for my pedantic mind is missing details. Is
there a paper on rps? Example for #d above, the commit log mentions that
rps benefits if you have certain types of "cache hierachy". Probably
some arch with large shared L2/3 (maybe inclusive) cache will benefit.
example: it does well on Nehalem and probably opterons as long (as you
dont start stacking these things on some interconnect like QPI or HT).
But what happens when you have FSB sharing across cores (still a very
common setup)? etc etc

Can I ask what hardware you run this on?

> A ping test is not an ideal candidate for RPS, since everything is done
> at softirq level, and should be faster without RPS...

ping wont do justice to the possible potential of rps mostly because it
generates very little traffic i.e the part #c above. But it helps me at
least boot a machine with proper setup - but it is not totally useless
because i think the cost of IPI can be deduced from the results.
I am going to put together some udp app with variable think-time to see
what happens. Would that be a reasonable thing to test on?

It would be valuable to have something like Documentation/networking/rps
to detail things a little more. 

cheers,
jamal


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Stephen Hemminger @ 2010-04-14 18:53 UTC (permalink / raw)
  To: hadi
  Cc: Tom Herbert, Eric Dumazet, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271245986.3943.55.camel@bigi>

On Wed, 14 Apr 2010 07:53:06 -0400
jamal <hadi@cyberus.ca> wrote:

> Results:
> CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
> cpu was being chewed highly by sky2_poll and on the redirected-to-core
> it was always smp_call_function_single

I posted a patch to use sky2 hardware hash (RSS) which should lower the
cost per packet.

-- 

^ permalink raw reply

* Över 300 spel
From: King Spin @ 2010-04-14 20:04 UTC (permalink / raw)
  To: newbieb

Kan du bevara en hemlighet? Det här stället är er helt fantastisk och jag vann stort här igår kväll.

http://www.bestgames-lux.net/sv/


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Stephen Hemminger @ 2010-04-14 19:44 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271271222.4567.51.camel@bigi>

On Wed, 14 Apr 2010 14:53:42 -0400
jamal <hadi@cyberus.ca> wrote:

> Agreed. So to enumerate, the benefits come in if:
> a) you have many processors
> b) you have single-queue nic
> c) at sub-threshold traffic you dont care about a little latency

There probably needs to be better autotuning for this, there is no reason
that RPS to be steering packets unless the queue is getting backed up.
Some kind of high / low water mark mechanism is needed.

RPS might also interact with the core turbo boost functionality on Intel chips.
Newer chips will make a single core faster if other core can be kept idle.


-- 

^ permalink raw reply

* [BUG net-next-2.6] fib: Some rcu warning
From: Eric Dumazet @ 2010-04-14 19:54 UTC (permalink / raw)
  To: netdev; +Cc: Paul E. McKenney

[   27.756998] IPv4 FIB: Using LC-trie version 0.409
[   27.757121] 
[   27.757121] ===================================================
[   27.757228] [ INFO: suspicious rcu_dereference_check() usage. ]
[   27.757285] ---------------------------------------------------
[   27.757342] net/ipv4/fib_trie.c:212 invoked rcu_dereference_check()
without protection!
[   27.757417] 
[   27.757417] other info that might help us debug this:
[   27.757418] 
[   27.757569] 
[   27.757570] rcu_scheduler_active = 1, debug_locks = 0
[   27.757674] 2 locks held by ip/5686:
[   27.757727]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8134d247>]
rtnl_lock+0x17/0x20
[   27.757936]  #1:  ((inetaddr_chain).rwsem){.+.+.+}, at:
[<ffffffff810654a7>] __blocking_notifier_call_chain+0x47/0x90
[   27.758148] 
[   27.758149] stack backtrace:
[   27.758249] Pid: 5686, comm: ip Not tainted
2.6.34-rc3-03164-gb4bf665-dirty #10
[   27.758323] Call Trace:
[   27.758377]  [<ffffffff81071b4f>] lockdep_rcu_dereference+0xaf/0xc0
[   27.758437]  [<ffffffff813ae32c>] fib_find_node+0x19c/0x220
[   27.758495]  [<ffffffff813b0bdc>] fib_table_insert+0xac/0x760
[   27.758552]  [<ffffffff813aa121>] ? fib_get_table+0x91/0xd0
[   27.758609]  [<ffffffff813aa0b8>] ? fib_get_table+0x28/0xd0
[   27.758667]  [<ffffffff813aa561>] fib_magic+0x111/0x120
[   27.758723]  [<ffffffff813aa6a0>] fib_add_ifaddr+0x130/0x170
[   27.758780]  [<ffffffff813aad3c>] fib_inetaddr_event+0x5c/0x290
[   27.758838]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
[   27.758896]  [<ffffffff810654bd>] __blocking_notifier_call_chain
+0x5d/0x90
[   27.758956]  [<ffffffff81065506>] blocking_notifier_call_chain
+0x16/0x20
[   27.759016]  [<ffffffff813a0c44>] __inet_insert_ifa+0xd4/0x170
[   27.759089]  [<ffffffff813a0cf2>] inet_insert_ifa+0x12/0x20
[   27.759146]  [<ffffffff813a2140>] inetdev_event+0x400/0x430
[   27.759204]  [<ffffffff81362472>] ? netlink_broadcast+0x262/0x3f0
[   27.759263]  [<ffffffff81351391>] ? fib_rules_event+0x21/0x180
[   27.759320]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
[   27.759378]  [<ffffffff8106512e>] __raw_notifier_call_chain+0xe/0x10
[   27.759436]  [<ffffffff81065146>] raw_notifier_call_chain+0x16/0x20
[   27.761102]  [<ffffffff8133fc1b>] call_netdevice_notifiers+0x1b/0x20
[   27.761161]  [<ffffffff8133ffad>] __dev_notify_flags+0x7d/0x90
[   27.761219]  [<ffffffff81340005>] dev_change_flags+0x45/0x70
[   27.761278]  [<ffffffff813a2e79>] devinet_ioctl+0x5e9/0x770
[   27.761336]  [<ffffffff813a36b1>] inet_ioctl+0x61/0x80
[   27.761392]  [<ffffffff8132c532>] sock_do_ioctl+0x32/0x60
[   27.761449]  [<ffffffff8132d06c>] compat_sock_ioctl+0x89c/0xb60
[   27.761507]  [<ffffffff81075816>] ? __lock_acquire+0x486/0xaf0
[   27.761567]  [<ffffffff810cbfeb>] ? __do_fault+0x12b/0x480
[   27.761624]  [<ffffffff810b4a93>] ? filemap_fault+0xd3/0x3d0
[   27.761683]  [<ffffffff810f7104>] ? fget_light+0x174/0x360
[   27.761740]  [<ffffffff810cc1f9>] ? __do_fault+0x339/0x480
[   27.761798]  [<ffffffff811370e3>] compat_sys_ioctl+0xd3/0x1900
[   27.761856]  [<ffffffff810ce28a>] ? handle_mm_fault+0x19a/0x780
[   27.761915]  [<ffffffff81023832>] ? do_page_fault+0xe2/0x3b0
[   27.761972]  [<ffffffff81064703>] ? up_read+0x23/0x40
[   27.762028]  [<ffffffff810238f5>] ? do_page_fault+0x1a5/0x3b0
[   27.762087]  [<ffffffff813f36cb>] ? _raw_spin_unlock+0x2b/0x40
[   27.762145]  [<ffffffff810f3508>] ? fd_install+0xa8/0xe0
[   27.762202]  [<ffffffff8132eced>] ? sock_map_fd+0x2d/0x40
[   27.762259]  [<ffffffff813f3a69>] ? retint_swapgs+0xe/0x13
[   27.762331]  [<ffffffff8107292d>] ? trace_hardirqs_on_caller
+0x10d/0x190
[   27.762391]  [<ffffffff813f2d21>] ? trace_hardirqs_off_thunk
+0x3a/0x3c
[   27.762450]  [<ffffffff81028b64>] sysenter_dispatch+0x7/0x30
[   27.762507]  [<ffffffff813f2ce2>] ? trace_hardirqs_on_thunk+0x3a/0x3f



^ permalink raw reply

* [PATCH net-next-2.6 v3] fasync: RCU and fine grained locking
From: Eric Dumazet @ 2010-04-14 19:55 UTC (permalink / raw)
  To: David Miller; +Cc: Paul E. McKenney, netdev, linux-kernel, Lai Jiangshan
In-Reply-To: <1271259264.16881.1703.camel@edumazet-laptop>

Here is V3 of the patch. This patch is a preliminary work to full RCU
conversion of sock_def_readable() & sock_def_write_space() functions,
called nearly for each packet...

I based it against David net-next-2.6 tree.

Thanks

[PATCH net-next-2.6 v3] fasync: RCU and fine grained locking

kill_fasync() uses a central rwlock, candidate for RCU conversion, to
avoid cache line ping pongs on SMP.

fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
section instead during whole list scan.

Use a spinlock per fasync_struct to synchronize kill_fasync_rcu() and
fasync_{remove|add}_entry(). This spinlock is IRQ safe, so sock_fasync()
doesnt need its own implementation and can use fasync_helper(), to
reduce code size and complexity.

We can remove __kill_fasync() direct use in net/socket.c, and rename it
to kill_fasync_rcu().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
---
v3: sock_fasync() can use generic fasync_helper(), this gives a nice
cleanup.

v2: As Lai Jiangshan noticed, we need a mutual exclusion between
fasync_{remove|add}_entry() and kill_fasync_rcu().

 fs/fcntl.c         |   66 ++++++++++++++++++++++++--------------
 include/linux/fs.h |   12 +++----
 net/socket.c       |   73 ++++++-------------------------------------
 3 files changed, 59 insertions(+), 92 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 452d02f..0a14074 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -614,9 +614,15 @@ int send_sigurg(struct fown_struct *fown)
 	return ret;
 }
 
-static DEFINE_RWLOCK(fasync_lock);
+static DEFINE_SPINLOCK(fasync_lock);
 static struct kmem_cache *fasync_cache __read_mostly;
 
+static void fasync_free_rcu(struct rcu_head *head)
+{
+	kmem_cache_free(fasync_cache,
+			container_of(head, struct fasync_struct, fa_rcu));
+}
+
 /*
  * Remove a fasync entry. If successfully removed, return
  * positive and clear the FASYNC flag. If no entry exists,
@@ -625,8 +631,6 @@ static struct kmem_cache *fasync_cache __read_mostly;
  * NOTE! It is very important that the FASYNC flag always
  * match the state "is the filp on a fasync list".
  *
- * We always take the 'filp->f_lock', in since fasync_lock
- * needs to be irq-safe.
  */
 static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 {
@@ -634,17 +638,22 @@ static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 	int result = 0;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
+		fa->fa_file = NULL;
+		spin_unlock_irq(&fa->fa_lock);
+
 		*fp = fa->fa_next;
-		kmem_cache_free(fasync_cache, fa);
+		call_rcu(&fa->fa_rcu, fasync_free_rcu);
 		filp->f_flags &= ~FASYNC;
 		result = 1;
 		break;
 	}
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -666,25 +675,30 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
 		return -ENOMEM;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
 		fa->fa_fd = fd;
+		spin_unlock_irq(&fa->fa_lock);
+
 		kmem_cache_free(fasync_cache, new);
 		goto out;
 	}
 
+	spin_lock_init(&new->fa_lock);
 	new->magic = FASYNC_MAGIC;
 	new->fa_file = filp;
 	new->fa_fd = fd;
 	new->fa_next = *fapp;
-	*fapp = new;
+	rcu_assign_pointer(*fapp, new);
 	result = 1;
 	filp->f_flags |= FASYNC;
 
 out:
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -704,37 +718,41 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
 
 EXPORT_SYMBOL(fasync_helper);
 
-void __kill_fasync(struct fasync_struct *fa, int sig, int band)
+/*
+ * rcu_read_lock() is held
+ */
+static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
 {
 	while (fa) {
-		struct fown_struct * fown;
+		struct fown_struct *fown;
 		if (fa->magic != FASYNC_MAGIC) {
 			printk(KERN_ERR "kill_fasync: bad magic number in "
 			       "fasync_struct!\n");
 			return;
 		}
-		fown = &fa->fa_file->f_owner;
-		/* Don't send SIGURG to processes which have not set a
-		   queued signum: SIGURG has its own default signalling
-		   mechanism. */
-		if (!(sig == SIGURG && fown->signum == 0))
-			send_sigio(fown, fa->fa_fd, band);
-		fa = fa->fa_next;
+		spin_lock(&fa->fa_lock);
+		if (fa->fa_file) {
+			fown = &fa->fa_file->f_owner;
+			/* Don't send SIGURG to processes which have not set a
+			   queued signum: SIGURG has its own default signalling
+			   mechanism. */
+			if (!(sig == SIGURG && fown->signum == 0))
+				send_sigio(fown, fa->fa_fd, band);
+		}
+		spin_unlock(&fa->fa_lock);
+		fa = rcu_dereference(fa->fa_next);
 	}
 }
 
-EXPORT_SYMBOL(__kill_fasync);
-
 void kill_fasync(struct fasync_struct **fp, int sig, int band)
 {
 	/* First a quick test without locking: usually
 	 * the list is empty.
 	 */
 	if (*fp) {
-		read_lock(&fasync_lock);
-		/* reread *fp after obtaining the lock */
-		__kill_fasync(*fp, sig, band);
-		read_unlock(&fasync_lock);
+		rcu_read_lock();
+		kill_fasync_rcu(rcu_dereference(*fp), sig, band);
+		rcu_read_unlock();
 	}
 }
 EXPORT_SYMBOL(kill_fasync);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..018d382 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1280,10 +1280,12 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
 
 
 struct fasync_struct {
-	int	magic;
-	int	fa_fd;
-	struct	fasync_struct	*fa_next; /* singly linked list */
-	struct	file 		*fa_file;
+	spinlock_t		fa_lock;
+	int			magic;
+	int			fa_fd;
+	struct fasync_struct	*fa_next; /* singly linked list */
+	struct file		*fa_file;
+	struct rcu_head		fa_rcu;
 };
 
 #define FASYNC_MAGIC 0x4601
@@ -1292,8 +1294,6 @@ struct fasync_struct {
 extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
 /* can be called from interrupts */
 extern void kill_fasync(struct fasync_struct **, int, int);
-/* only for net: no internal synchronization */
-extern void __kill_fasync(struct fasync_struct *, int, int);
 
 extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
 extern int f_setown(struct file *filp, unsigned long arg, int force);
diff --git a/net/socket.c b/net/socket.c
index 35bc198..9822081 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1067,78 +1067,27 @@ static int sock_close(struct inode *inode, struct file *filp)
  *	1. fasync_list is modified only under process context socket lock
  *	   i.e. under semaphore.
  *	2. fasync_list is used under read_lock(&sk->sk_callback_lock)
- *	   or under socket lock.
- *	3. fasync_list can be used from softirq context, so that
- *	   modification under socket lock have to be enhanced with
- *	   write_lock_bh(&sk->sk_callback_lock).
- *							--ANK (990710)
+ *	   or under socket lock
  */
 
 static int sock_fasync(int fd, struct file *filp, int on)
 {
-	struct fasync_struct *fa, *fna = NULL, **prev;
-	struct socket *sock;
-	struct sock *sk;
-
-	if (on) {
-		fna = kmalloc(sizeof(struct fasync_struct), GFP_KERNEL);
-		if (fna == NULL)
-			return -ENOMEM;
-	}
-
-	sock = filp->private_data;
+	struct socket *sock = filp->private_data;
+	struct sock *sk = sock->sk;
 
-	sk = sock->sk;
-	if (sk == NULL) {
-		kfree(fna);
+	if (sk == NULL)
 		return -EINVAL;
-	}
 
 	lock_sock(sk);
 
-	spin_lock(&filp->f_lock);
-	if (on)
-		filp->f_flags |= FASYNC;
-	else
-		filp->f_flags &= ~FASYNC;
-	spin_unlock(&filp->f_lock);
-
-	prev = &(sock->fasync_list);
+	fasync_helper(fd, filp, on, &sock->fasync_list);
 
-	for (fa = *prev; fa != NULL; prev = &fa->fa_next, fa = *prev)
-		if (fa->fa_file == filp)
-			break;
-
-	if (on) {
-		if (fa != NULL) {
-			write_lock_bh(&sk->sk_callback_lock);
-			fa->fa_fd = fd;
-			write_unlock_bh(&sk->sk_callback_lock);
-
-			kfree(fna);
-			goto out;
-		}
-		fna->fa_file = filp;
-		fna->fa_fd = fd;
-		fna->magic = FASYNC_MAGIC;
-		fna->fa_next = sock->fasync_list;
-		write_lock_bh(&sk->sk_callback_lock);
-		sock->fasync_list = fna;
+	if (!sock->fasync_list)
+		sock_reset_flag(sk, SOCK_FASYNC);
+	else
 		sock_set_flag(sk, SOCK_FASYNC);
-		write_unlock_bh(&sk->sk_callback_lock);
-	} else {
-		if (fa != NULL) {
-			write_lock_bh(&sk->sk_callback_lock);
-			*prev = fa->fa_next;
-			if (!sock->fasync_list)
-				sock_reset_flag(sk, SOCK_FASYNC);
-			write_unlock_bh(&sk->sk_callback_lock);
-			kfree(fa);
-		}
-	}
 
-out:
-	release_sock(sock->sk);
+	release_sock(sk);
 	return 0;
 }
 
@@ -1159,10 +1108,10 @@ int sock_wake_async(struct socket *sock, int how, int band)
 		/* fall through */
 	case SOCK_WAKE_IO:
 call_kill:
-		__kill_fasync(sock->fasync_list, SIGIO, band);
+		kill_fasync(&sock->fasync_list, SIGIO, band);
 		break;
 	case SOCK_WAKE_URG:
-		__kill_fasync(sock->fasync_list, SIGURG, band);
+		kill_fasync(&sock->fasync_list, SIGURG, band);
 	}
 	return 0;
 }

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox