Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 18:04 UTC (permalink / raw)
  To: Tom Herbert; +Cc: hadi, netdev, robert, David Miller, Changli Gao, Andi Kleen
In-Reply-To: <t2p65634d661004141031xf80f62e7sb64362ea1ce10a1f@mail.gmail.com>

Le mercredi 14 avril 2010 à 10:31 -0700, Tom Herbert a écrit :
> The point of RPS is to increase parallelism, but the cost of that is
> more overhead per packet.  If you are running a single flow, then
> you'll see latency increase for that flow.  With more concurrent flows
> the benefits of parallelism kick in and latency gets better.-- we've
> seen the break even point around ten connections in our tests.  Also,
> I don't think we've made the claim that RPS should generally perform
> better than multi-queue, the primary motivation for RPS is make single
> queue NICs give reasonable performance.
> 

Yes, multiqueue is far better of course, but in case of hardware lacking
multiqueue, RPS can help many workloads, where application has _some_
work to do, not only counting frames or so...

RPS overhead (IPI, cache misses, ...) must be amortized by
parallelization or we lose.

A ping test is not an ideal candidate for RPS, since everything is done
at softirq level, and should be faster without RPS...

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-14 17:31 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, netdev, robert, David Miller, Changli Gao,
	Andi Kleen
In-Reply-To: <1271245986.3943.55.camel@bigi>

The point of RPS is to increase parallelism, but the cost of that is
more overhead per packet.  If you are running a single flow, then
you'll see latency increase for that flow.  With more concurrent flows
the benefits of parallelism kick in and latency gets better.-- we've
seen the break even point around ten connections in our tests.  Also,
I don't think we've made the claim that RPS should generally perform
better than multi-queue, the primary motivation for RPS is make single
queue NICs give reasonable performance.


On Wed, Apr 14, 2010 at 4:53 AM, jamal <hadi@cyberus.ca> wrote:
> Following up like promised:
>
> On Mon, 2010-02-08 at 10:09 -0500, jamal wrote:
>> On Sun, 2010-02-07 at 21:58 -0800, Tom Herbert wrote:
>>
>> > I don't have specific numbers, although we are using this on
>> > application doing forwarding and numbers seem in line with what we see
>> > for an end host.
>> >
>>
>> When i get the chance i will give it a run. I have access to an i7
>> somewhere. It seems like i need some specific nics?
>
> I did step #0 last night on an i7 (single Nehalem). I think more than
> anything i was impressed by the Nehalem's excellent caching system.
> Robert, I am almost tempted to say skb recycling performance will be
> excellent on this  machine given the cost of a cache miss is much lower
> than previous generation hardware.
>
> My test was simple: irq affinity on cpu0(core0) and rps redirection to
> cpu1(core 1); tried also to redirect to different SMT threads (aka CPUs)
> on different cores with similar results. I base tested against no rps
> being used and a kernel which didnt have any RPS config on.
> [BTW, I had to hand-edit the .config since i couldnt do it from
> menuconfig (Is there any reason for it to be so?)]
>
> Traffic was sent from another machine into the i7 via an el-cheapo sky2
> (dont know how shitty this NIC is, but it seems to know how to do MSI so
> probably capable of multiqueueing); the test was several sets of
> a ping first and then a ping -f (I will get more sophisticated in my
> next test likely this weekend).
>
> Results:
> CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
> cpu was being chewed highly by sky2_poll and on the redirected-to-core
> it was always smp_call_function_single.
> Latency was (consistently) on average 5 microseconds.
> So if i sent 1M ping -f packets, without RPS it took on average
> 176 seconds and with RPS it took 181 seconds to do a round-trip.
> Throughput didnt change but this could be attributed to the low amounts
> of data i was sending.
> I observed that we were generating, on average, an IPI per packet even
> with ping -f. (added an extra stat to record when we sent an IPI and
> counted against the number of packets sent).
> In my opinion it is these IPIs that contribute the most to the latency
> and i think it happens that the Nehalem is just highly improved in this
> area. I wish i had a more commonly used machine to test rps on.
> I expect that rps will perform worse on currently cheaper/older hardware
> for the traffic characteristic i tested.
>
> On IPIs:
> Is anyone familiar with what is going on with Nehalem? Why is it this
> good? I expect things will get a lot nastier with other hardware like
> xeon based or even Nehalem with rps going across QPI.
> Here's why i think IPIs are bad, please correct me if i am wrong:
> - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> is in the form of an IPI).
> - data cache has to be synced to main memory
> - the instruction pipeline is flushed
> - what else did i miss? Andi?
>
> So my question to Tom, Eric and Changli or anyone else who has been
> running RPS:
> What hardware did you use? Is there anyone using older hardware than
> say AMD Opteron or Intel Nehalem?
>
> My impressions of rps so far:
> I think i may end up being impressed when i generate a lot more traffic
> since the cost of IPI will be amortized.
> At this point multiqueue seems a lot more impressive alternative and it
> seems to me multiqueu hardware is a lot more commodity (price-point)
> than a Nehalem.
>
> Plan:
> I plan to still attack the app space (and write a basic udp app that
> binds to one or more rps cpus and try blasting a lot of UDP traffic to
> see what happens) my step after that is to move to forwarding tests..
>
> cheers,
> jamal
>
>

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 16:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414161610.GA10897@redhat.com>

On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > 
> > > qemu needs the ability to inject raw packets into device
> > > from userspace, bypassing vhost/virtio (for live migration).
> > 
> > Ok, but since there is only a write callback and no read, it won't
> > actually be able to do this with the current code, right?
> 
> I think it'll work as is, with vhost qemu only ever writes,
> never reads from device. We'll also never need GSO etc
> which is a large part of what tap does (and macvtap will
> have to do).

Ah, I see. I didn't realize that qemu needs to write to the
device even if vhost is used. But for the case of migration to
another machine without vhost, wouldn't qemu also need to read?

> > Moreover, it seems weird to have a new type of interface here that
> > duplicates tap/macvtap with less functionality. Coming back
> > to your original comment, this means that while mpassthru is currently
> > not duplicating the actual code from macvtap, it would need to do
> > exactly that to get the qemu interface right!
> > 
> I don't think so, see above. anyway, both can reuse tun.c :)

There is one significant difference between macvtap/mpassthru and
tun/tap in that the directions are reversed. While macvtap and
mpassthru forward data from write into dev_queue_xmit and from
skb_receive into read, tun/tap forwards data from write into
skb_receive and from start_xmit into read.

Also, I'm not really objecting to duplicating code between
macvtap and mpassthru, as the implementation can always be merged.
My main objection is instead to having two different _user_interfaces_
for doing the same thing.

	Arnd

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 16:16 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004141757.54829.arnd@arndb.de>

On Wed, Apr 14, 2010 at 05:57:54PM +0200, Arnd Bergmann wrote:
> On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> > > On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> > > > From: Xin Xiaohui <xiaohui.xin@intel.com>
> >
> > > It seems that you are duplicating a lot of functionality that
> > > is already in macvtap. I've asked about this before but then
> > > didn't look at your newer versions. Can you explain the value
> > > of introducing another interface to user land?
> > 
> > Hmm, I have not noticed a lot of duplication.
> 
> The code is indeed quite distinct, but the idea of adding another
> character device to pass into vhost for direct device access is.

All backends besides tap seem to do this, btw :)

> > BTW macvtap also duplicates tun code, it might be
> > a good idea for tun to export some functionality.
> 
> Yes, that's something I plan to look into.
> 
> > > I'm still planning to add zero-copy support to macvtap,
> > > hopefully reusing parts of your code, but do you think there
> > > is value in having both?
> > 
> > If macvtap would get zero copy tx and rx, maybe not. But
> > it's not immediately obvious whether zero-copy support
> > for macvtap might work, though, especially for zero copy rx.
> > The approach with mpassthru is much simpler in that
> > it takes complete control of the device.
> 
> As far as I can tell, the most significant limitation of mpassthru
> is that there can only ever be a single guest on a physical NIC.
> 
> Given that limitation, I believe we can do the same on macvtap,
> and simply disable zero-copy RX when you want to use more than one
> guest, or both guest and host on the same NIC.
> 
> The logical next step here would be to allow VMDq and similar
> technologies to separate out the RX traffic in the hardware.
> We don't have a configuration interface for that yet, but
> since this is logically the same as macvlan, I think we should
> use the same interfaces for both, essentially treating VMDq
> as a hardware acceleration for macvlan. We can probably handle
> it in similar ways to how we handle hardware support for vlan.
> 
> At that stage, macvtap would be the logical interface for
> connecting a VMDq (hardware macvlan) device to a guest!

I won't object to that but ... code walks.

> > > > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > > > +				unsigned long count, loff_t pos)
> > > > +{
> > > > +	struct file *file = iocb->ki_filp;
> > > > +	struct mp_struct *mp = mp_get(file->private_data);
> > > > +	struct sock *sk = mp->socket.sk;
> > > > +	struct sk_buff *skb;
> > > > +	int len, err;
> > > > +	ssize_t result;
> > > 
> > > Can you explain what this function is even there for? AFAICT, vhost-net
> > > doesn't call it, the interface is incompatible with the existing
> > > tap interface, and you don't provide a read function.
> > 
> > qemu needs the ability to inject raw packets into device
> > from userspace, bypassing vhost/virtio (for live migration).
> 
> Ok, but since there is only a write callback and no read, it won't
> actually be able to do this with the current code, right?

I think it'll work as is, with vhost qemu only ever writes,
never reads from device. We'll also never need GSO etc
which is a large part of what tap does (and macvtap will
have to do).

> Moreover, it seems weird to have a new type of interface here that
> duplicates tap/macvtap with less functionality. Coming back
> to your original comment, this means that while mpassthru is currently
> not duplicating the actual code from macvtap, it would need to do
> exactly that to get the qemu interface right!
> 
> 	Arnd

I don't think so, see above. anyway, both can reuse tun.c :)

-- 
MST

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 15:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414152615.GA8079@redhat.com>

On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> > On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> > > From: Xin Xiaohui <xiaohui.xin@intel.com>
>
> > It seems that you are duplicating a lot of functionality that
> > is already in macvtap. I've asked about this before but then
> > didn't look at your newer versions. Can you explain the value
> > of introducing another interface to user land?
> 
> Hmm, I have not noticed a lot of duplication.

The code is indeed quite distinct, but the idea of adding another
character device to pass into vhost for direct device access is.

> BTW macvtap also duplicates tun code, it might be
> a good idea for tun to export some functionality.

Yes, that's something I plan to look into.

> > I'm still planning to add zero-copy support to macvtap,
> > hopefully reusing parts of your code, but do you think there
> > is value in having both?
> 
> If macvtap would get zero copy tx and rx, maybe not. But
> it's not immediately obvious whether zero-copy support
> for macvtap might work, though, especially for zero copy rx.
> The approach with mpassthru is much simpler in that
> it takes complete control of the device.

As far as I can tell, the most significant limitation of mpassthru
is that there can only ever be a single guest on a physical NIC.

Given that limitation, I believe we can do the same on macvtap,
and simply disable zero-copy RX when you want to use more than one
guest, or both guest and host on the same NIC.

The logical next step here would be to allow VMDq and similar
technologies to separate out the RX traffic in the hardware.
We don't have a configuration interface for that yet, but
since this is logically the same as macvlan, I think we should
use the same interfaces for both, essentially treating VMDq
as a hardware acceleration for macvlan. We can probably handle
it in similar ways to how we handle hardware support for vlan.

At that stage, macvtap would be the logical interface for
connecting a VMDq (hardware macvlan) device to a guest!

> > > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > > +				unsigned long count, loff_t pos)
> > > +{
> > > +	struct file *file = iocb->ki_filp;
> > > +	struct mp_struct *mp = mp_get(file->private_data);
> > > +	struct sock *sk = mp->socket.sk;
> > > +	struct sk_buff *skb;
> > > +	int len, err;
> > > +	ssize_t result;
> > 
> > Can you explain what this function is even there for? AFAICT, vhost-net
> > doesn't call it, the interface is incompatible with the existing
> > tap interface, and you don't provide a read function.
> 
> qemu needs the ability to inject raw packets into device
> from userspace, bypassing vhost/virtio (for live migration).

Ok, but since there is only a write callback and no read, it won't
actually be able to do this with the current code, right?

Moreover, it seems weird to have a new type of interface here that
duplicates tap/macvtap with less functionality. Coming back
to your original comment, this means that while mpassthru is currently
not duplicating the actual code from macvtap, it would need to do
exactly that to get the qemu interface right!

	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: Paul E. McKenney @ 2010-04-14 15:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, linux-kernel
In-Reply-To: <1271230961.16881.630.camel@edumazet-laptop>

On Wed, Apr 14, 2010 at 09:42:41AM +0200, Eric Dumazet wrote:
> Paul, could you please check this patch, I am not sure
> of the IRQ safety thing...
> 
> Is call_rcu() the right method to use in this case ?

It looks like all the read-side critical sections are protected by
rcu_read_lock(), so call_rcu() should be OK.  And it is OK to invoke
call_rcu() with irqs disabled.  (Just don't try it in an NMI handler.)

Or am I missing some code path that tries to use disabling of irqs
instead of using rcu_read_lock()?  That happens to work in the current
implementation, but...

							Thanx, Paul

> Thanks
> 
> [PATCH net-next-2.6] fasync: RCU locking
> 
> kill_fasync() uses a central rwlock, candidate for RCU conversion.
> 
> We can remove __kill_fasync() direct use in net, and rename it to
> kill_fasync_rcu()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  fs/fcntl.c         |   36 +++++++++++++++++++++---------------
>  include/linux/fs.h |   11 +++++------
>  net/socket.c       |    4 ++--
>  3 files changed, 28 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 452d02f..33cb3ee 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -614,9 +614,15 @@ int send_sigurg(struct fown_struct *fown)
>  	return ret;
>  }
> 
> -static DEFINE_RWLOCK(fasync_lock);
> +static DEFINE_SPINLOCK(fasync_lock);
>  static struct kmem_cache *fasync_cache __read_mostly;
> 
> +static void fasync_free_rcu(struct rcu_head *head)
> +{
> +	kmem_cache_free(fasync_cache,
> +			container_of(head, struct fasync_struct, fa_rcu));
> +}
> +
>  /*
>   * Remove a fasync entry. If successfully removed, return
>   * positive and clear the FASYNC flag. If no entry exists,
> @@ -634,17 +640,17 @@ static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
>  	int result = 0;
> 
>  	spin_lock(&filp->f_lock);
> -	write_lock_irq(&fasync_lock);
> +	spin_lock_irq(&fasync_lock);
>  	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
>  		if (fa->fa_file != filp)
>  			continue;
>  		*fp = fa->fa_next;
> -		kmem_cache_free(fasync_cache, fa);
> +		call_rcu(&fa->fa_rcu, fasync_free_rcu);
>  		filp->f_flags &= ~FASYNC;
>  		result = 1;
>  		break;
>  	}
> -	write_unlock_irq(&fasync_lock);
> +	spin_unlock_irq(&fasync_lock);
>  	spin_unlock(&filp->f_lock);
>  	return result;
>  }
> @@ -666,7 +672,7 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
>  		return -ENOMEM;
> 
>  	spin_lock(&filp->f_lock);
> -	write_lock_irq(&fasync_lock);
> +	spin_lock_irq(&fasync_lock);
>  	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
>  		if (fa->fa_file != filp)
>  			continue;
> @@ -679,12 +685,12 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
>  	new->fa_file = filp;
>  	new->fa_fd = fd;
>  	new->fa_next = *fapp;
> -	*fapp = new;
> +	rcu_assign_pointer(*fapp, new);
>  	result = 1;
>  	filp->f_flags |= FASYNC;
> 
>  out:
> -	write_unlock_irq(&fasync_lock);
> +	spin_unlock_irq(&fasync_lock);
>  	spin_unlock(&filp->f_lock);
>  	return result;
>  }
> @@ -704,7 +710,10 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
> 
>  EXPORT_SYMBOL(fasync_helper);
> 
> -void __kill_fasync(struct fasync_struct *fa, int sig, int band)
> +/*
> + * rcu_read_lock() is held
> + */
> +static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
>  {
>  	while (fa) {
>  		struct fown_struct * fown;
> @@ -719,22 +728,19 @@ void __kill_fasync(struct fasync_struct *fa, int sig, int band)
>  		   mechanism. */
>  		if (!(sig == SIGURG && fown->signum == 0))
>  			send_sigio(fown, fa->fa_fd, band);
> -		fa = fa->fa_next;
> +		fa = rcu_dereference(fa->fa_next);
>  	}
>  }
> 
> -EXPORT_SYMBOL(__kill_fasync);
> -
>  void kill_fasync(struct fasync_struct **fp, int sig, int band)
>  {
>  	/* First a quick test without locking: usually
>  	 * the list is empty.
>  	 */
>  	if (*fp) {
> -		read_lock(&fasync_lock);
> -		/* reread *fp after obtaining the lock */
> -		__kill_fasync(*fp, sig, band);
> -		read_unlock(&fasync_lock);
> +		rcu_read_lock();
> +		kill_fasync_rcu(rcu_dereference(*fp), sig, band);
> +		rcu_read_unlock();
>  	}
>  }
>  EXPORT_SYMBOL(kill_fasync);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 39d57bc..158b2cc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1280,10 +1280,11 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
> 
> 
>  struct fasync_struct {
> -	int	magic;
> -	int	fa_fd;
> -	struct	fasync_struct	*fa_next; /* singly linked list */
> -	struct	file 		*fa_file;
> +	int			magic;
> +	int			fa_fd;
> +	struct fasync_struct	*fa_next; /* singly linked list */
> +	struct file		*fa_file;
> +	struct rcu_head		fa_rcu;
>  };
> 
>  #define FASYNC_MAGIC 0x4601
> @@ -1292,8 +1293,6 @@ struct fasync_struct {
>  extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
>  /* can be called from interrupts */
>  extern void kill_fasync(struct fasync_struct **, int, int);
> -/* only for net: no internal synchronization */
> -extern void __kill_fasync(struct fasync_struct *, int, int);
> 
>  extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
>  extern int f_setown(struct file *filp, unsigned long arg, int force);
> diff --git a/net/socket.c b/net/socket.c
> index 35bc198..846739c 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1159,10 +1159,10 @@ int sock_wake_async(struct socket *sock, int how, int band)
>  		/* fall through */
>  	case SOCK_WAKE_IO:
>  call_kill:
> -		__kill_fasync(sock->fasync_list, SIGIO, band);
> +		kill_fasync(sock->fasync_list, SIGIO, band);
>  		break;
>  	case SOCK_WAKE_URG:
> -		__kill_fasync(sock->fasync_list, SIGURG, band);
> +		kill_fasync(sock->fasync_list, SIGURG, band);
>  	}
>  	return 0;
>  }
> 
> 

^ permalink raw reply

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: Eric Dumazet @ 2010-04-14 15:34 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: David Miller, Paul E. McKenney, netdev, linux-kernel
In-Reply-To: <1271257027.16881.1663.camel@edumazet-laptop>

Le mercredi 14 avril 2010 à 16:57 +0200, Eric Dumazet a écrit :
> Le mercredi 14 avril 2010 à 16:36 +0800, Lai Jiangshan a écrit :
> 
> > Since rcu_read_lock() protects fasync_struct *fa for us, we can access
> > to @fa safely even fasync_remove_entry() is just called.
> > 
> > But this patch does not ensure 'fa->fa_file is not freed' nor
> > 'fa->fa_fd is not released', so kill_fasync_rcu() may do wrong thing
> > if there is no other code ensure it.
> 
> You are 100% right, I forgot my old attempt to RCUified struct files
> failed...
> 
> Maybe its time to finally move f_owner out of struct file, and use RCU
> to free it.
> 
> In the mean time, adding a lock in fasync_struct is more than enough.
> 
> Thanks !
> 
> [PATCH net-next-2.6 v2] fasync: fine grained locking
> 
> kill_fasync() uses a central rwlock, candidate for RCU conversion, to
> avoid cache line ping pongs on SMP.
> 
> fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
> section instead during whole list scan.
> 
> Use a spinlock per fasync_struct to synchronize fasync_{remove|
> add}_entry() and kill_fasync_rcu()
> 
> We can remove __kill_fasync() direct use in net, and rename it to
> kill_fasync_rcu().
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Please wait for a v3 version, as net/socket.c sock_fasync() should be
updated too...

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 15:26 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004141655.21885.arnd@arndb.de>

On Wed, Apr 14, 2010 at 04:55:21PM +0200, Arnd Bergmann wrote:
> On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> > From: Xin Xiaohui <xiaohui.xin@intel.com>
> > 
> > Add a device to utilize the vhost-net backend driver for
> > copy-less data transfer between guest FE and host NIC.
> > It pins the guest user space to the host memory and
> > provides proto_ops as sendmsg/recvmsg to vhost-net.
> 
> Sorry for taking so long before finding the time to look
> at your code in more detail.
> 
> It seems that you are duplicating a lot of functionality that
> is already in macvtap. I've asked about this before but then
> didn't look at your newer versions. Can you explain the value
> of introducing another interface to user land?

Hmm, I have not noticed a lot of duplication.
BTW macvtap also duplicates tun code, it might be
a good idea for tun to export some functionality.

> I'm still planning to add zero-copy support to macvtap,
> hopefully reusing parts of your code, but do you think there
> is value in having both?

If macvtap would get zero copy tx and rx, maybe not. But
it's not immediately obvious whether zero-copy support
for macvtap might work, though, especially for zero copy rx.
The approach with mpassthru is much simpler in that
it takes complete control of the device.

> > diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> > new file mode 100644
> > index 0000000..86d2525
> > --- /dev/null
> > +++ b/drivers/vhost/mpassthru.c
> > @@ -0,0 +1,1264 @@
> > +
> > +#ifdef MPASSTHRU_DEBUG
> > +static int debug;
> > +
> > +#define DBG  if (mp->debug) printk
> > +#define DBG1 if (debug == 2) printk
> > +#else
> > +#define DBG(a...)
> > +#define DBG1(a...)
> > +#endif
> 
> This should probably just use the existing dev_dbg/pr_debug infrastructure.
> 
> > [... skipping buffer management code for now]
> 
> > +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> > +			struct msghdr *m, size_t total_len)
> > +{
> > [...]
> 
> This function looks like we should be able to easily include it into
> macvtap and get zero-copy transmits without introducing the new
> user-level interface.
> 
> > +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> > +			struct msghdr *m, size_t total_len,
> > +			int flags)
> > +{
> > +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> > +	struct page_ctor *ctor;
> > +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);
> 
> It smells like a layering violation to look at the iocb->private field
> from a lower-level driver. I would have hoped that it's possible to implement
> this without having this driver know about the higher-level vhost driver
> internals.

I agree.

> Can you explain why this is needed?
> 
> > +	spin_lock_irqsave(&ctor->read_lock, flag);
> > +	list_add_tail(&info->list, &ctor->readq);
> > +	spin_unlock_irqrestore(&ctor->read_lock, flag);
> > +
> > +	if (!vq->receiver) {
> > +		vq->receiver = mp_recvmsg_notify;
> > +		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> > +				   vq->num * 4096,
> > +				   vq->num * 4096);
> > +	}
> > +
> > +	return 0;
> > +}
> 
> Not sure what I'm missing, but who calls the vq->receiver? This seems
> to be neither in the upstream version of vhost nor introduced by your
> patch.
> 
> > +static void __mp_detach(struct mp_struct *mp)
> > +{
> > +	mp->mfile = NULL;
> > +
> > +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> > +	page_ctor_detach(mp);
> > +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> > +
> > +	/* Drop the extra count on the net device */
> > +	dev_put(mp->dev);
> > +}
> > +
> > +static DEFINE_MUTEX(mp_mutex);
> > +
> > +static void mp_detach(struct mp_struct *mp)
> > +{
> > +	mutex_lock(&mp_mutex);
> > +	__mp_detach(mp);
> > +	mutex_unlock(&mp_mutex);
> > +}
> > +
> > +static void mp_put(struct mp_file *mfile)
> > +{
> > +	if (atomic_dec_and_test(&mfile->count))
> > +		mp_detach(mfile->mp);
> > +}
> > +
> > +static int mp_release(struct socket *sock)
> > +{
> > +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> > +	struct mp_file *mfile = mp->mfile;
> > +
> > +	mp_put(mfile);
> > +	sock_put(mp->socket.sk);
> > +	put_net(mfile->net);
> > +
> > +	return 0;
> > +}
> 
> Doesn't this prevent the underlying interface from going away while the chardev
> is open? You also have logic to handle that case, so why do you keep the extra
> reference on the netdev?
> 
> > +/* Ops structure to mimic raw sockets with mp device */
> > +static const struct proto_ops mp_socket_ops = {
> > +	.sendmsg = mp_sendmsg,
> > +	.recvmsg = mp_recvmsg,
> > +	.release = mp_release,
> > +};
> 
> > +static int mp_chr_open(struct inode *inode, struct file * file)
> > +{
> > +	struct mp_file *mfile;
> > +	cycle_kernel_lock();
> 
> I don't think you really want to use the BKL here, just kill that line.
> 
> > +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> > +		unsigned long arg)
> > +{
> > +	struct mp_file *mfile = file->private_data;
> > +	struct mp_struct *mp;
> > +	struct net_device *dev;
> > +	void __user* argp = (void __user *)arg;
> > +	struct ifreq ifr;
> > +	struct sock *sk;
> > +	int ret;
> > +
> > +	ret = -EINVAL;
> > +
> > +	switch (cmd) {
> > +	case MPASSTHRU_BINDDEV:
> > +		ret = -EFAULT;
> > +		if (copy_from_user(&ifr, argp, sizeof ifr))
> > +			break;
> 
> This is broken for 32 bit compat mode ioctls, because struct ifreq
> is different between 32 and 64 bit systems. Since you are only
> using the device name anyway, a fixed length string or just the
> interface index would be simpler and work better.
> 
> > +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> > +
> > +		ret = -EBUSY;
> > +
> > +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> > +			break;
> 
> Your current use of the IFF_MPASSTHRU* flags does not seem to make
> any sense whatsoever. You check that this flag is never set, but set
> it later yourself and then ignore all flags.
> 
> > +		ret = -ENODEV;
> > +		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
> > +		if (!dev)
> > +			break;
> 
> There is no permission checking on who can access what device, which
> seems a bit simplistic. Any user that has access to the mpassthru device
> seems to be able to bind to any network interface in the namespace.
> This is one point where the macvtap model seems more appropriate, it
> separates the permissions for creating logical interfaces and using them.
> 
> > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > +				unsigned long count, loff_t pos)
> > +{
> > +	struct file *file = iocb->ki_filp;
> > +	struct mp_struct *mp = mp_get(file->private_data);
> > +	struct sock *sk = mp->socket.sk;
> > +	struct sk_buff *skb;
> > +	int len, err;
> > +	ssize_t result;
> 
> Can you explain what this function is even there for? AFAICT, vhost-net
> doesn't call it, the interface is incompatible with the existing
> tap interface, and you don't provide a read function.

qemu needs the ability to inject raw packets into device
from userspace, bypassing vhost/virtio (for live migration).

> > diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
> > new file mode 100644
> > index 0000000..2be21c5
> > --- /dev/null
> > +++ b/include/linux/mpassthru.h
> > @@ -0,0 +1,29 @@
> > +#ifndef __MPASSTHRU_H
> > +#define __MPASSTHRU_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/if_ether.h>
> > +
> > +/* ioctl defines */
> > +#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
> > +#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)
> 
> These definitions are slightly wrong, because you pass more than just an 'int'.
> 
> > +/* MPASSTHRU ifc flags */
> > +#define IFF_MPASSTHRU		0x0001
> > +#define IFF_MPASSTHRU_EXCL	0x0002
> 
> As mentioned above, these flags don't make any sense with your current code.
> 
> 	Arnd

^ permalink raw reply

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 15:25 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, jdike, davem
In-Reply-To: <1270193100-6769-1-git-send-email-xiaohui.xin@intel.com>

On Fri, Apr 02, 2010 at 03:25:00PM +0800, xiaohui.xin@intel.com wrote:
> The idea is simple, just to pin the guest VM user space and then
> let host NIC driver has the chance to directly DMA to it. 
> The patches are based on vhost-net backend driver. We add a device
> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> send/recv directly to/from the NIC driver. KVM guest who use the
> vhost-net backend may bind any ethX interface in the host side to
> get copyless data transfer thru guest virtio-net frontend.
> 
> The scenario is like this:
> 
> The guest virtio-net driver submits multiple requests thru vhost-net
> backend driver to the kernel. And the requests are queued and then
> completed after corresponding actions in h/w are done.
> 
> For read, user space buffers are dispensed to NIC driver for rx when
> a page constructor API is invoked. Means NICs can allocate user buffers
> from a page constructor. We add a hook in netif_receive_skb() function
> to intercept the incoming packets, and notify the zero-copy device.
> 
> For write, the zero-copy deivce may allocates a new host skb and puts
> payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
> The request remains pending until the skb is transmitted by h/w.
> 
> Here, we have ever considered 2 ways to utilize the page constructor
> API to dispense the user buffers.
> 
> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> 	structure of sk_buff, and the data pointer is pointing to a 
> 	user buffer which is coming from a page constructor API.
> 	Then the shinfo of the skb is also from guest.
> 	When packet is received from hardware, the skb->data is filled
> 	directly by h/w. What we have done is in this way.
> 
> 	Pros:	We can avoid any copy here.
> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> 		the same method with the host NIC drivers, say the size
> 		of netdev_alloc_skb() and the same reserved space in the
> 		head of skb. Many NIC drivers are the same with guest and
> 		ok for this. But some lastest NIC drivers reserves special
> 		room in skb head. To deal with it, we suggest to provide
> 		a method in guest virtio-net driver to ask for parameter
> 		we interest from the NIC driver when we know which device 
> 		we have bind to do zero-copy. Then we ask guest to do so.
> 		Is that reasonable?

Unfortunately, this would break compatibility with existing virtio.
This also complicates migration. What is the room in skb head used for?

> Two:	Modify driver to get user buffer allocated from a page constructor
> 	API(to substitute alloc_page()), the user buffer are used as payload
> 	buffers and filled by h/w directly when packet is received. Driver
> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> 	the head buffer side, let host allocates skb, and h/w fills it. 
> 	After that, the data filled in host skb header will be copied into
> 	guest header buffer which is submitted together with the payload buffer.
> 
> 	Pros:	We could less care the way how guest or host allocates their
> 		buffers.
> 	Cons:	We still need a bit copy here for the skb header.
> 
> We are not sure which way is the better here.

The obvious question would be whether you see any speed difference
with the two approaches. If no, then the second approach would be
better.

> This is the first thing we want
> to get comments from the community. We wish the modification to the network
> part will be generic which not used by vhost-net backend only, but a user
> application may use it as well when the zero-copy device may provides async
> read/write operations later.
> 
> Please give comments especially for the network part modifications.
> 
> 
> We provide multiple submits and asynchronous notifiicaton to 
> vhost-net too.
> 
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. But for simple
> test with netperf, we found bindwidth up and CPU % up too,
> but the bindwidth up ratio is much more than CPU % up ratio.
> 
> What we have not done yet:
> 	packet split support

What does this mean, exactly?

> 	To support GRO

And TSO/GSO?

> 	Performance tuning
> 
> what we have done in v1:
> 	polish the RCU usage
> 	deal with write logging in asynchroush mode in vhost
> 	add notifier block for mp device
> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> 	add mp_dev_change_flags() for mp device to change NIC state
> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> 	a small fix for missing dev_put when fail
> 	using dynamic minor instead of static minor number
> 	a __KERNEL__ protect to mp_get_sock()
> 
> what we have done in v2:
> 	
> 	remove most of the RCU usage, since the ctor pointer is only
> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> 	stopped to get good cleanup(all outstanding requests are finished),
> 	so the ctor pointer cannot be raced into wrong situation.
> 
> 	Remove the struct vhost_notifier with struct kiocb.
> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> 	via sendmsg/recvmsg.
> 
> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> 
> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> 
> 
> Comments not addressed yet in this time:
> 	the async write logging is not satified by vhost-net
> 	Qemu needs a sync write
> 	a limit for locked pages from get_user_pages_fast()
> 	
> 		
> performance:
> 	using netperf with GSO/TSO disabled, 10G NIC, 
> 	disabled packet split mode, with raw socket case compared to vhost.
> 
> 	bindwidth will be from 1.1Gbps to 1.7Gbps
> 	CPU % from 120%-140% to 140%-160%

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-14 15:20 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1271238738-8386-1-git-send-email-xiaosuo@gmail.com>

Le mercredi 14 avril 2010 à 17:52 +0800, Changli Gao a écrit :
> batch skb dequeueing from softnet input_pkt_queue
> 
> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention and irq disabling/enabling.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Adding stop_machine() with no explanation ?

No ack from my previous comments, suggestions, and still same logic ?

Are we supposed to read patch, test it, make some benches, correct bugs,
say Amen ?

This is becoming silly, if you ask me.

This is a NACK of this patch, obviously.

^ permalink raw reply

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: Eric Dumazet @ 2010-04-14 14:57 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: David Miller, Paul E. McKenney, netdev, linux-kernel
In-Reply-To: <4BC57E7D.9060706@cn.fujitsu.com>

Le mercredi 14 avril 2010 à 16:36 +0800, Lai Jiangshan a écrit :

> Since rcu_read_lock() protects fasync_struct *fa for us, we can access
> to @fa safely even fasync_remove_entry() is just called.
> 
> But this patch does not ensure 'fa->fa_file is not freed' nor
> 'fa->fa_fd is not released', so kill_fasync_rcu() may do wrong thing
> if there is no other code ensure it.

You are 100% right, I forgot my old attempt to RCUified struct files
failed...

Maybe its time to finally move f_owner out of struct file, and use RCU
to free it.

In the mean time, adding a lock in fasync_struct is more than enough.

Thanks !

[PATCH net-next-2.6 v2] fasync: fine grained locking

kill_fasync() uses a central rwlock, candidate for RCU conversion, to
avoid cache line ping pongs on SMP.

fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
section instead during whole list scan.

Use a spinlock per fasync_struct to synchronize fasync_{remove|
add}_entry() and kill_fasync_rcu()

We can remove __kill_fasync() direct use in net, and rename it to
kill_fasync_rcu().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
v2: As Lai Jiangshan noticed, we need a mutual exclusion between
fasync_{remove|add}_entry() and kill_fasync_rcu().

 fs/fcntl.c         |   66 +++++++++++++++++++++++++++----------------
 include/linux/fs.h |   12 +++----
 net/socket.c       |    4 +-
 3 files changed, 50 insertions(+), 32 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 452d02f..0a14074 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -614,9 +614,15 @@ int send_sigurg(struct fown_struct *fown)
 	return ret;
 }
 
-static DEFINE_RWLOCK(fasync_lock);
+static DEFINE_SPINLOCK(fasync_lock);
 static struct kmem_cache *fasync_cache __read_mostly;
 
+static void fasync_free_rcu(struct rcu_head *head)
+{
+	kmem_cache_free(fasync_cache,
+			container_of(head, struct fasync_struct, fa_rcu));
+}
+
 /*
  * Remove a fasync entry. If successfully removed, return
  * positive and clear the FASYNC flag. If no entry exists,
@@ -625,8 +631,6 @@ static struct kmem_cache *fasync_cache __read_mostly;
  * NOTE! It is very important that the FASYNC flag always
  * match the state "is the filp on a fasync list".
  *
- * We always take the 'filp->f_lock', in since fasync_lock
- * needs to be irq-safe.
  */
 static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 {
@@ -634,17 +638,22 @@ static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 	int result = 0;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
+		fa->fa_file = NULL;
+		spin_unlock_irq(&fa->fa_lock);
+
 		*fp = fa->fa_next;
-		kmem_cache_free(fasync_cache, fa);
+		call_rcu(&fa->fa_rcu, fasync_free_rcu);
 		filp->f_flags &= ~FASYNC;
 		result = 1;
 		break;
 	}
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -666,25 +675,30 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
 		return -ENOMEM;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
 		fa->fa_fd = fd;
+		spin_unlock_irq(&fa->fa_lock);
+
 		kmem_cache_free(fasync_cache, new);
 		goto out;
 	}
 
+	spin_lock_init(&new->fa_lock);
 	new->magic = FASYNC_MAGIC;
 	new->fa_file = filp;
 	new->fa_fd = fd;
 	new->fa_next = *fapp;
-	*fapp = new;
+	rcu_assign_pointer(*fapp, new);
 	result = 1;
 	filp->f_flags |= FASYNC;
 
 out:
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -704,37 +718,41 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
 
 EXPORT_SYMBOL(fasync_helper);
 
-void __kill_fasync(struct fasync_struct *fa, int sig, int band)
+/*
+ * rcu_read_lock() is held
+ */
+static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
 {
 	while (fa) {
-		struct fown_struct * fown;
+		struct fown_struct *fown;
 		if (fa->magic != FASYNC_MAGIC) {
 			printk(KERN_ERR "kill_fasync: bad magic number in "
 			       "fasync_struct!\n");
 			return;
 		}
-		fown = &fa->fa_file->f_owner;
-		/* Don't send SIGURG to processes which have not set a
-		   queued signum: SIGURG has its own default signalling
-		   mechanism. */
-		if (!(sig == SIGURG && fown->signum == 0))
-			send_sigio(fown, fa->fa_fd, band);
-		fa = fa->fa_next;
+		spin_lock(&fa->fa_lock);
+		if (fa->fa_file) {
+			fown = &fa->fa_file->f_owner;
+			/* Don't send SIGURG to processes which have not set a
+			   queued signum: SIGURG has its own default signalling
+			   mechanism. */
+			if (!(sig == SIGURG && fown->signum == 0))
+				send_sigio(fown, fa->fa_fd, band);
+		}
+		spin_unlock(&fa->fa_lock);
+		fa = rcu_dereference(fa->fa_next);
 	}
 }
 
-EXPORT_SYMBOL(__kill_fasync);
-
 void kill_fasync(struct fasync_struct **fp, int sig, int band)
 {
 	/* First a quick test without locking: usually
 	 * the list is empty.
 	 */
 	if (*fp) {
-		read_lock(&fasync_lock);
-		/* reread *fp after obtaining the lock */
-		__kill_fasync(*fp, sig, band);
-		read_unlock(&fasync_lock);
+		rcu_read_lock();
+		kill_fasync_rcu(rcu_dereference(*fp), sig, band);
+		rcu_read_unlock();
 	}
 }
 EXPORT_SYMBOL(kill_fasync);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..018d382 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1280,10 +1280,12 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
 
 
 struct fasync_struct {
-	int	magic;
-	int	fa_fd;
-	struct	fasync_struct	*fa_next; /* singly linked list */
-	struct	file 		*fa_file;
+	spinlock_t		fa_lock;
+	int			magic;
+	int			fa_fd;
+	struct fasync_struct	*fa_next; /* singly linked list */
+	struct file		*fa_file;
+	struct rcu_head		fa_rcu;
 };
 
 #define FASYNC_MAGIC 0x4601
@@ -1292,8 +1294,6 @@ struct fasync_struct {
 extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
 /* can be called from interrupts */
 extern void kill_fasync(struct fasync_struct **, int, int);
-/* only for net: no internal synchronization */
-extern void __kill_fasync(struct fasync_struct *, int, int);
 
 extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
 extern int f_setown(struct file *filp, unsigned long arg, int force);
diff --git a/net/socket.c b/net/socket.c
index 35bc198..846739c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1159,10 +1159,10 @@ int sock_wake_async(struct socket *sock, int how, int band)
 		/* fall through */
 	case SOCK_WAKE_IO:
 call_kill:
-		__kill_fasync(sock->fasync_list, SIGIO, band);
+		kill_fasync(sock->fasync_list, SIGIO, band);
 		break;
 	case SOCK_WAKE_URG:
-		__kill_fasync(sock->fasync_list, SIGURG, band);
+		kill_fasync(sock->fasync_list, SIGURG, band);
 	}
 	return 0;
 }

^ permalink raw reply related

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 14:55 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, jdike
In-Reply-To: <1270805865-16901-2-git-send-email-xiaohui.xin@intel.com>

On Friday 09 April 2010, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> Add a device to utilize the vhost-net backend driver for
> copy-less data transfer between guest FE and host NIC.
> It pins the guest user space to the host memory and
> provides proto_ops as sendmsg/recvmsg to vhost-net.

Sorry for taking so long before finding the time to look
at your code in more detail.

It seems that you are duplicating a lot of functionality that
is already in macvtap. I've asked about this before but then
didn't look at your newer versions. Can you explain the value
of introducing another interface to user land?

I'm still planning to add zero-copy support to macvtap,
hopefully reusing parts of your code, but do you think there
is value in having both?

> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 0000000..86d2525
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
> @@ -0,0 +1,1264 @@
> +
> +#ifdef MPASSTHRU_DEBUG
> +static int debug;
> +
> +#define DBG  if (mp->debug) printk
> +#define DBG1 if (debug == 2) printk
> +#else
> +#define DBG(a...)
> +#define DBG1(a...)
> +#endif

This should probably just use the existing dev_dbg/pr_debug infrastructure.

> [... skipping buffer management code for now]

> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> +			struct msghdr *m, size_t total_len)
> +{
> [...]

This function looks like we should be able to easily include it into
macvtap and get zero-copy transmits without introducing the new
user-level interface.

> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> +			struct msghdr *m, size_t total_len,
> +			int flags)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct page_ctor *ctor;
> +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);

It smells like a layering violation to look at the iocb->private field
from a lower-level driver. I would have hoped that it's possible to implement
this without having this driver know about the higher-level vhost driver
internals. Can you explain why this is needed?

> +	spin_lock_irqsave(&ctor->read_lock, flag);
> +	list_add_tail(&info->list, &ctor->readq);
> +	spin_unlock_irqrestore(&ctor->read_lock, flag);
> +
> +	if (!vq->receiver) {
> +		vq->receiver = mp_recvmsg_notify;
> +		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> +				   vq->num * 4096,
> +				   vq->num * 4096);
> +	}
> +
> +	return 0;
> +}

Not sure what I'm missing, but who calls the vq->receiver? This seems
to be neither in the upstream version of vhost nor introduced by your
patch.

> +static void __mp_detach(struct mp_struct *mp)
> +{
> +	mp->mfile = NULL;
> +
> +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> +	page_ctor_detach(mp);
> +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +
> +	/* Drop the extra count on the net device */
> +	dev_put(mp->dev);
> +}
> +
> +static DEFINE_MUTEX(mp_mutex);
> +
> +static void mp_detach(struct mp_struct *mp)
> +{
> +	mutex_lock(&mp_mutex);
> +	__mp_detach(mp);
> +	mutex_unlock(&mp_mutex);
> +}
> +
> +static void mp_put(struct mp_file *mfile)
> +{
> +	if (atomic_dec_and_test(&mfile->count))
> +		mp_detach(mfile->mp);
> +}
> +
> +static int mp_release(struct socket *sock)
> +{
> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> +	struct mp_file *mfile = mp->mfile;
> +
> +	mp_put(mfile);
> +	sock_put(mp->socket.sk);
> +	put_net(mfile->net);
> +
> +	return 0;
> +}

Doesn't this prevent the underlying interface from going away while the chardev
is open? You also have logic to handle that case, so why do you keep the extra
reference on the netdev?

> +/* Ops structure to mimic raw sockets with mp device */
> +static const struct proto_ops mp_socket_ops = {
> +	.sendmsg = mp_sendmsg,
> +	.recvmsg = mp_recvmsg,
> +	.release = mp_release,
> +};

> +static int mp_chr_open(struct inode *inode, struct file * file)
> +{
> +	struct mp_file *mfile;
> +	cycle_kernel_lock();

I don't think you really want to use the BKL here, just kill that line.

> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> +		unsigned long arg)
> +{
> +	struct mp_file *mfile = file->private_data;
> +	struct mp_struct *mp;
> +	struct net_device *dev;
> +	void __user* argp = (void __user *)arg;
> +	struct ifreq ifr;
> +	struct sock *sk;
> +	int ret;
> +
> +	ret = -EINVAL;
> +
> +	switch (cmd) {
> +	case MPASSTHRU_BINDDEV:
> +		ret = -EFAULT;
> +		if (copy_from_user(&ifr, argp, sizeof ifr))
> +			break;

This is broken for 32 bit compat mode ioctls, because struct ifreq
is different between 32 and 64 bit systems. Since you are only
using the device name anyway, a fixed length string or just the
interface index would be simpler and work better.

> +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> +
> +		ret = -EBUSY;
> +
> +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> +			break;

Your current use of the IFF_MPASSTHRU* flags does not seem to make
any sense whatsoever. You check that this flag is never set, but set
it later yourself and then ignore all flags.

> +		ret = -ENODEV;
> +		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
> +		if (!dev)
> +			break;

There is no permission checking on who can access what device, which
seems a bit simplistic. Any user that has access to the mpassthru device
seems to be able to bind to any network interface in the namespace.
This is one point where the macvtap model seems more appropriate, it
separates the permissions for creating logical interfaces and using them.

> +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> +				unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct mp_struct *mp = mp_get(file->private_data);
> +	struct sock *sk = mp->socket.sk;
> +	struct sk_buff *skb;
> +	int len, err;
> +	ssize_t result;

Can you explain what this function is even there for? AFAICT, vhost-net
doesn't call it, the interface is incompatible with the existing
tap interface, and you don't provide a read function.

> diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
> new file mode 100644
> index 0000000..2be21c5
> --- /dev/null
> +++ b/include/linux/mpassthru.h
> @@ -0,0 +1,29 @@
> +#ifndef __MPASSTHRU_H
> +#define __MPASSTHRU_H
> +
> +#include <linux/types.h>
> +#include <linux/if_ether.h>
> +
> +/* ioctl defines */
> +#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
> +#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)

These definitions are slightly wrong, because you pass more than just an 'int'.

> +/* MPASSTHRU ifc flags */
> +#define IFF_MPASSTHRU		0x0001
> +#define IFF_MPASSTHRU_EXCL	0x0002

As mentioned above, these flags don't make any sense with your current code.

	Arnd

^ permalink raw reply

* Re: forcedeth driver hangs under heavy load
From: stephen mulcahy @ 2010-04-14 14:30 UTC (permalink / raw)
  To: Ayaz Abdulla
  Cc: Eric Dumazet, David Miller, bhutchings@solarflare.com,
	netdev@vger.kernel.org, ben@decadent.org.uk,
	572201@bugs.debian.org
In-Reply-To: <4BC5539B.6050908@nvidia.com>

Ayaz Abdulla wrote:
> Attached fix has been submitted to netdev.

I've run my reproducer with this patch applied to be Debian 2.6.32 
kernel and so far the problem with nodes becoming unresponsive hasn't 
occurred.

NIC settings were left the default so this looks positive

root@node23:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

Thanks!

-stephen

^ permalink raw reply

* [GIT PULL] vhost-net fix for 2.6.34-rc4
From: Michael S. Tsirkin @ 2010-04-14 14:17 UTC (permalink / raw)
  To: David Miller, netdev, Christoph Hellwig

David,
The following tree includes a patch fixing an issue with vhost-net in
2.6.34-rc4.  Please pull for 2.6.34.

Thanks!

The following changes since commit 2ba3abd8186f24c7fb418927025b4e2120e3a362:

  Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 (2010-04-13 17:49:48 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost

Christoph Hellwig (1):
      vhost: fix sparse warnings

 drivers/vhost/net.c   |    4 ++--
 drivers/vhost/vhost.c |   11 ++++++-----
 2 files changed, 8 insertions(+), 7 deletions(-)

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: jamal @ 2010-04-14 12:28 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, Eric Dumazet, netdev
In-Reply-To: <o2k412e6f7f1004140513h8de62790tb775bb357e2db6b1@mail.gmail.com>

On Wed, 2010-04-14 at 20:13 +0800, Changli Gao wrote:
> On Wed, Apr 14, 2010 at 7:58 PM, jamal <hadi@cyberus.ca> wrote:

> No extra IPI is needed.
> 
> +       qlen = queue->input_pkt_queue.qlen + queue->processing_queue.qlen;
> +       if (qlen <= netdev_max_backlog) {
> +               if (qlen) {
> 
> the packets in processing_queue are counted too.

Ok - Looks reasonable.

> > IPIs add to latency (refer to my other email). Did you test this
> > to reach some conclusion that it improves thing or was it just by
> > inspection?
> >
> 
> :( only insepection.

I am probably being pushy, but one simple test for latency of single
flow is: 

from machine 1, send ping -f

on rps machine:
Base test: no rps on ( a fresh boot with no sysctls should do fine)
Test 1: irq affinity on cpuX, rps to cpuY 
Test 2: repeat test1 with your change.

It should show no difference between test1 and 2. If it shows
improvement better - but showing worse latency is bad.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-14 12:13 UTC (permalink / raw)
  To: hadi; +Cc: David S. Miller, Eric Dumazet, netdev
In-Reply-To: <1271246304.3943.60.camel@bigi>

On Wed, Apr 14, 2010 at 7:58 PM, jamal <hadi@cyberus.ca> wrote:
>
> It seems we are now going to generate a lot more IPIs with such a
> change. At least this is what i am imagining.
> CPU0: packet comes in,queue empty, generate an IPI to CPU1
> CPU0: second packet comes in, enqueue
> CPU1: grab two packets to process and run with them
> CPU0: packet comes in,queue empty, generate an IPI to CPU1

No extra IPI is needed.

+       qlen = queue->input_pkt_queue.qlen + queue->processing_queue.qlen;
+       if (qlen <= netdev_max_backlog) {
+               if (qlen) {

the packets in processing_queue are counted too.

>
> IPIs add to latency (refer to my other email). Did you test this
> to reach some conclusion that it improves thing or was it just by
> inspection?
>

:( only insepection.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [RFC] random SYN drops causing connect() delays
From: Lennart Schulte @ 2010-04-14 11:37 UTC (permalink / raw)
  To: tgraf; +Cc: netdev
In-Reply-To: <20100412080633.GA27418@bombadil.infradead.org>

Hi,
this is very similar to what i have noticed, but up to now I couldn't figure out where it came from. 
Thanks very much for clearing it up!

> I have been tracking down an issue commonly referred to as the 3-sec
> connect() delay. It exists since recent 2.6.x kernels and has never
> been fixed even though it disappeared in recent releases unless
> sched_child_runs_first is set to 1 again.
>
> What happens is that if a client attemps to open many connections to
> a socket with only minimal delay inbetween attemps some SYNs are
> randomly dropped on the server side causing the client to resend after
> the 3 sec TCP timeout and thus causing connect()s to be randomly delayed.
>
> Facts:
>  - Issue can be reproduced over loopback or real networks.
>  - Enabling SO_LINGER on the client side will make the issue disappear!!
>  - While the issue is appearing, the acceptq seems to be overflowing. Both
>    LISTENOVERFLOWS and LISTENDROPS are increasing although not by the exact
>    number of delay occurences. inetdiag reports sk_max_ack_backlog to be 0
>    therefore one possibility that comes to mind is that sk_ack_backlog
>    underflows due to a race.
>  - The issue disappeared in recent kernels, I bisected it down to the following
>    commit:
> 	commit 2bba22c50b06abe9fd0d23933b1e64d35b419262
> 	Author: Mike Galbraith <efault@gmx.de>
> 	Date:   Wed Sep 9 15:41:37 2009 +0200
>
> 	    sched: Turn off child_runs_first
> 	    
> 	    Set child_runs_first default to off.
>
>    Setting kernel.sched_child_runs_first=1 makes the isssue reappear in recent
>    kernels.  This hardens the theory of a race condition.
>  - It looks like that the issue can only be reproduced if the server
>    socket sends out data immediately after the connection has been established
>    but I cannot proof this theory.

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: jamal @ 2010-04-14 11:58 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, Eric Dumazet, netdev
In-Reply-To: <1271238738-8386-1-git-send-email-xiaosuo@gmail.com>

On Wed, 2010-04-14 at 17:52 +0800, Changli Gao wrote:
> batch skb dequeueing from softnet input_pkt_queue
> 
> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention and irq disabling/enabling.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

It seems we are now going to generate a lot more IPIs with such a
change. At least this is what i am imagining.
CPU0: packet comes in,queue empty, generate an IPI to CPU1
CPU0: second packet comes in, enqueue
CPU1: grab two packets to process and run with them
CPU0: packet comes in,queue empty, generate an IPI to CPU1
..
...
.....

IPIs add to latency (refer to my other email). Did you test this
to reach some conclusion that it improves thing or was it just by
inspection?

cheers,
jamal

^ permalink raw reply

* Re: [PATCH] tun: orphan an skb on tx
From: David Miller @ 2010-04-14 11:55 UTC (permalink / raw)
  To: herbert
  Cc: eric.dumazet, mst, jan.kiszka, paul.moore, David.Woodhouse,
	netdev, linux-kernel, qemu-devel
In-Reply-To: <20100414005822.GD18044@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Apr 2010 08:58:22 +0800

> On Tue, Apr 13, 2010 at 08:31:03PM +0200, Eric Dumazet wrote:
>>
>> Herbert Acked your patch, so I guess its OK, but I think it can be
>> dangerous.
> 
> The tun socket accounting was never designed to stop it from
> flooding another tun interface.  It's there to stop it from
> transmitting above a destination interface TX bandwidth and
> cause unnecessary packet drops.  It also limits the total amount
> of kernel memory that can be pinned down by a single tun interface.
> 
> In this case, all we're doing is shifting the accounting from the
> "hardware" queue to the qdisc queue.
> 
> So your ability to flood a tun interface is essentially unchanged.
> 
> BTW we do the same thing in a number of hardware drivers, as well
> as virtio-net.

Right.  Although this reminds me about the whole SKB
orphaning on xmit issue that keeps coming back to haunt
us.

If there weren't odd references to the SKB's socket in
the packet scheduler et al. we could just orphan these
things right upon entry to the qdisc and not have to
add hacks like this to every driver.

In fact... maybe we can just do it in dev_hard_queue_xmit()
since we are out of the qdisc at that point.... but I guess
there might be weird drivers that want the SKB socket in
their ->xmit routine...  Ho hum.

In any event that's net-next-2.6 exploratory material, and I've
applied this patch to net-2.6, thanks!

^ permalink raw reply

* rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-14 11:53 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, netdev, robert, David Miller, Changli Gao,
	Andi Kleen
In-Reply-To: <1265641748.3688.56.camel@bigi>

Following up like promised:

On Mon, 2010-02-08 at 10:09 -0500, jamal wrote:
> On Sun, 2010-02-07 at 21:58 -0800, Tom Herbert wrote:
> 
> > I don't have specific numbers, although we are using this on
> > application doing forwarding and numbers seem in line with what we see
> > for an end host.
> > 
> 
> When i get the chance i will give it a run. I have access to an i7
> somewhere. It seems like i need some specific nics?

I did step #0 last night on an i7 (single Nehalem). I think more than
anything i was impressed by the Nehalem's excellent caching system.
Robert, I am almost tempted to say skb recycling performance will be
excellent on this  machine given the cost of a cache miss is much lower
than previous generation hardware.

My test was simple: irq affinity on cpu0(core0) and rps redirection to
cpu1(core 1); tried also to redirect to different SMT threads (aka CPUs)
on different cores with similar results. I base tested against no rps
being used and a kernel which didnt have any RPS config on.
[BTW, I had to hand-edit the .config since i couldnt do it from
menuconfig (Is there any reason for it to be so?)]

Traffic was sent from another machine into the i7 via an el-cheapo sky2
(dont know how shitty this NIC is, but it seems to know how to do MSI so
probably capable of multiqueueing); the test was several sets of 
a ping first and then a ping -f (I will get more sophisticated in my
next test likely this weekend).

Results:
CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
cpu was being chewed highly by sky2_poll and on the redirected-to-core
it was always smp_call_function_single.
Latency was (consistently) on average 5 microseconds. 
So if i sent 1M ping -f packets, without RPS it took on average
176 seconds and with RPS it took 181 seconds to do a round-trip.
Throughput didnt change but this could be attributed to the low amounts
of data i was sending.
I observed that we were generating, on average, an IPI per packet even
with ping -f. (added an extra stat to record when we sent an IPI and
counted against the number of packets sent).
In my opinion it is these IPIs that contribute the most to the latency
and i think it happens that the Nehalem is just highly improved in this 
area. I wish i had a more commonly used machine to test rps on.
I expect that rps will perform worse on currently cheaper/older hardware
for the traffic characteristic i tested.

On IPIs:
Is anyone familiar with what is going on with Nehalem? Why is it this
good? I expect things will get a lot nastier with other hardware like
xeon based or even Nehalem with rps going across QPI.
Here's why i think IPIs are bad, please correct me if i am wrong:
- they are synchronous. i.e an IPI issuer has to wait for an ACK (which
is in the form of an IPI).
- data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did i miss? Andi?

So my question to Tom, Eric and Changli or anyone else who has been
running RPS:
What hardware did you use? Is there anyone using older hardware than
say AMD Opteron or Intel Nehalem?

My impressions of rps so far:
I think i may end up being impressed when i generate a lot more traffic
since the cost of IPI will be amortized. 
At this point multiqueue seems a lot more impressive alternative and it
seems to me multiqueu hardware is a lot more commodity (price-point)
than a Nehalem.

Plan:
I plan to still attack the app space (and write a basic udp app that
binds to one or more rps cpus and try blasting a lot of UDP traffic to
see what happens) my step after that is to move to forwarding tests..

cheers,
jamal

^ permalink raw reply

* Re: HTB - What's the minimal value for 'rate' parameter?
From: Antonio Almeida @ 2010-04-14 10:22 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev, kaber, davem, devik
In-Reply-To: <20100409212657.GA3560@del.dom.local>

What do you mean with "1:2 has grandchildren with overflown rate tables"?
I couldn't understand your idea. Is there any mistake in the
configuration I sent?
How would you set rates for this particular example?

Regards
  Antonio Almeida

On Fri, Apr 9, 2010 at 10:26 PM, Jarek Poplawski wrote:
> On Fri, Apr 09, 2010 at 04:40:44PM +0100, Antonio Almeida wrote:
>> So, what about the rate limit miss?
>> As you can see the ceil of class 1:2 is set to 4096Kbit but its
>> sending rate is actually 8071Kbit!
>> It looks like classes 1:10 and 1:11 are ignoring hierarchical rate
>> restrictions of class 1:2
>> Here:
>> class htb 1:2 parent 1:1 rate 4096Kbit ceil 4096Kbit burst 3655b cburst 3655b
>>  Sent 84285894 bytes 55671 pkt (dropped 0, overlimits 0 requeues 0)
>>  rate 8071Kbit 666pps backlog 0b 0p requeues 0
>>  lended: 0 borrowed: 0 giants: 0
>>  tokens: -937499999 ctokens: -937499999
>
> Yes, since 1:2 has grandchildren with overflown rate tables, they
> could behave as if they had set rates higher than their parents or
> grandparent (and HTB doesn't restrict it hierarchically).
>
> Jarek P.
>

^ permalink raw reply

* Re: [PATCH] forcedeth: fix tx limit2 flag check
From: stephen mulcahy @ 2010-04-14 10:14 UTC (permalink / raw)
  To: Ayaz Abdulla
  Cc: David Miller, eric.dumazet@gmail.com, bhutchings@solarflare.com,
	netdev@vger.kernel.org, ben@decadent.org.uk,
	572201@bugs.debian.org
In-Reply-To: <4BC5532E.7000302@nvidia.com>

Ayaz Abdulla wrote:
> This patch fixes the TX_LIMIT feature flag. The previous logic check for 
> TX_LIMIT2 also took into account a device that only had TX_LIMIT set.
> 
> Signed-off-by: Ayaz Abdulla <aabdulla@nvidia.com>
> 
> This is a fix for bug 572201 @ bugs.debian.org

Hi,

Thanks! I'll rebuild my Debian kernel with this and run a test today.

-stephen

^ permalink raw reply

* [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-14  9:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, netdev, Changli Gao

batch skb dequeueing from softnet input_pkt_queue

batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention and irq disabling/enabling.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |    1 
 net/core/dev.c            |   56 ++++++++++++++++++++++++++++++++--------------
 2 files changed, 40 insertions(+), 17 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..898bc62 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1335,6 +1335,7 @@ struct softnet_data {
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
 #endif
 	struct sk_buff_head	input_pkt_queue;
+	struct sk_buff_head	processing_queue;
 	struct napi_struct	backlog;
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index a10a216..c635a71 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -131,6 +131,7 @@
 #include <linux/random.h>
 #include <trace/events/napi.h>
 #include <linux/pci.h>
+#include <linux/stop_machine.h>
 
 #include "net-sysfs.h"
 
@@ -2332,6 +2333,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 {
 	struct softnet_data *queue;
 	unsigned long flags;
+	u32 qlen;
 
 	queue = &per_cpu(softnet_data, cpu);
 
@@ -2339,8 +2341,9 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 	__get_cpu_var(netdev_rx_stat).total++;
 
 	rps_lock(queue);
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
+	qlen = queue->input_pkt_queue.qlen + queue->processing_queue.qlen;
+	if (qlen <= netdev_max_backlog) {
+		if (qlen) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
 			rps_unlock(queue);
@@ -2791,19 +2794,31 @@ int netif_receive_skb(struct sk_buff *skb)
 EXPORT_SYMBOL(netif_receive_skb);
 
 /* Network device is going away, flush any packets still pending  */
-static void flush_backlog(void *arg)
+static void __flush_backlog(struct sk_buff_head *head, struct net_device *dev)
 {
-	struct net_device *dev = arg;
-	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	struct sk_buff *skb, *tmp;
 
-	rps_lock(queue);
-	skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
+	skb_queue_walk_safe(head, skb, tmp) {
 		if (skb->dev == dev) {
-			__skb_unlink(skb, &queue->input_pkt_queue);
+			__skb_unlink(skb, head);
 			kfree_skb(skb);
 		}
-	rps_unlock(queue);
+	}
+}
+
+static int flush_backlog(void *arg)
+{
+	struct net_device *dev = arg;
+	struct softnet_data *queue;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		queue = &per_cpu(softnet_data, cpu);
+		__flush_backlog(&queue->input_pkt_queue, dev);
+		__flush_backlog(&queue->processing_queue, dev);
+	}
+
+	return 0;
 }
 
 static int napi_gro_complete(struct sk_buff *skb)
@@ -3118,20 +3133,23 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 		local_irq_disable();
 		rps_lock(queue);
-		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
+		skb_queue_splice_tail_init(&queue->input_pkt_queue,
+					   &queue->processing_queue);
+		if (skb_queue_empty(&queue->processing_queue)) {
 			__napi_complete(napi);
 			rps_unlock(queue);
 			local_irq_enable();
-			break;
+			return work;
 		}
 		rps_unlock(queue);
 		local_irq_enable();
 
-		__netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
-
-	return work;
+		while ((skb = __skb_dequeue(&queue->processing_queue))) {
+			__netif_receive_skb(skb);
+			if (++work >= quota || jiffies != start_time)
+				return work;
+		}
+	} while (1);
 }
 
 /**
@@ -5027,7 +5045,7 @@ void netdev_run_todo(void)
 
 		dev->reg_state = NETREG_UNREGISTERED;
 
-		on_each_cpu(flush_backlog, dev, 1);
+		stop_machine(flush_backlog, dev, NULL);
 
 		netdev_wait_allrefs(dev);
 
@@ -5487,6 +5505,9 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	raise_softirq_irqoff(NET_TX_SOFTIRQ);
 	local_irq_enable();
 
+	while ((skb = __skb_dequeue(&oldsd->processing_queue)))
+		netif_rx(skb);
+
 	/* Process offline CPU's input_pkt_queue */
 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
 		netif_rx(skb);
@@ -5709,6 +5730,7 @@ static int __init net_dev_init(void)
 
 		queue = &per_cpu(softnet_data, i);
 		skb_queue_head_init(&queue->input_pkt_queue);
+		skb_queue_head_init(&queue->processing_queue);
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 

^ permalink raw reply related

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: Lai Jiangshan @ 2010-04-14  8:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Paul E. McKenney, netdev, linux-kernel
In-Reply-To: <1271230961.16881.630.camel@edumazet-laptop>

Eric Dumazet wrote:
> -void __kill_fasync(struct fasync_struct *fa, int sig, int band)
> +/*
> + * rcu_read_lock() is held
> + */
> +static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
>  {
>  	while (fa) {
>  		struct fown_struct * fown;
> @@ -719,22 +728,19 @@ void __kill_fasync(struct fasync_struct *fa, int sig, int band)
>  		   mechanism. */
>  		if (!(sig == SIGURG && fown->signum == 0))
>  			send_sigio(fown, fa->fa_fd, band);
> -		fa = fa->fa_next;
> +		fa = rcu_dereference(fa->fa_next);
>  	}
>  }
>  

Since rcu_read_lock() protects fasync_struct *fa for us, we can access
to @fa safely even fasync_remove_entry() is just called.

But this patch does not ensure 'fa->fa_file is not freed' nor
'fa->fa_fd is not released', so kill_fasync_rcu() may do wrong thing
if there is no other code ensure it.

^ permalink raw reply

* Re: usb-sound circular locking again?
From: Richard Zidlicky @ 2010-04-14  8:26 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Andrew Morton, linux-kernel, netdev
In-Reply-To: <s5h4oje5zl9.wl%tiwai@suse.de>

Hi,

> > is this the same old issue?
> 
> I think so.  It appears relatively new since a sysfs lockdep check was
> introduced.

you are right, it was definitely my impression that this particular instance is 
a new (last previously tested 2.6.32.8).
After a few more tests it appears to be 100% repeatable in pm-hibernate. Simply 
doing "sync" right now does nothing.

Richard

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox