Netdev List


* Re: [NET] Add proc file to display the state of all qdiscs.
From: Jesper Dangaard Brouer @ 2009-09-03 17:30 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Christoph Lameter, Eric Dumazet, Jarek Poplawski, David Miller,
	netdev
In-Reply-To: <4A9FD2DC.7070807@trash.net>


On Thu, 3 Sep 2009, Patrick McHardy wrote:

> Jesper Dangaard Brouer wrote:
>>
>> On Wed, 2 Sep 2009, Christoph Lameter wrote:
>>> On Wed, 2 Sep 2009, Eric Dumazet wrote:
>>>
>>>> Same name "eth0" is displayed, that might confuse parsers...
>>>>
>>>> What naming convention should we choose for multiqueue devices ?
>>>
>>> eth0/tx<number> ?
>>
>> Remember that we already have a naming convention in /proc/interrupts
>>
>>  eth0-tx-<number>
>>
>> Let's not introduce too many new ones ;-)
>
> The approach I'm currently working on will present multiqueue root
> qdiscs as children of a dummy classful qdisc. This avoids handle
> clashes and the need for new identifiers and allows addressing each
> qdisc separately, similar to how it works with other classful qdiscs:

I like your approach. It's well suited for the qdiscs :-)

I especially like the possibility to access each qdisc separately.  Does
it then support having a separate qdisc per TX queue?  (I'm toying with the
idea of transmitting our multicast traffic into/via a separate TX hardware
queue, and making a special qdisc for IPTV MPEG2-TS shaping)

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

* Re: [PATCH] slub: fix slab_pad_check()
From: Paul E. McKenney @ 2009-09-03 17:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Pekka Enberg, Zdenek Kabelac, Patrick McHardy,
	Robin Holt, Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers
In-Reply-To: <alpine.DEB.1.10.0909031414310.29881@V090114053VZO-1>

On Thu, Sep 03, 2009 at 02:24:17PM -0500, Christoph Lameter wrote:
> On Thu, 3 Sep 2009, Eric Dumazet wrote:
> 
> > Point is we cannot deal with RCU quietness before disposing the slab cache,
> > (if SLAB_DESTROY_BY_RCU was set on the cache) since this disposing *will*
> > make call_rcu() calls when a full slab is freed/purged.
> 
> There is no need to do call_rcu calls for frees at that point since
> objects are no longer in use. We could simply disable SLAB_DESTROY_BY_RCU
> for the final clearing of caches.

Suppose we have the following sequence of events:

1.	CPU 0 is running a task that is using the slab cache.

	This CPU does kmem_cache_free(), which happens to free up
	some memory to the system.  Because SLAB_DESTROY_BY_RCU is
	set, an RCU callback is posted to do the actual freeing.

	Please note that this RCU callback is internal to the slab,
	so that the slab user cannot be aware of it.  In fact, the
	slab user isn't doing any call_rcu()s whatever.

2.	CPU 0 discovers that the slab cache can now be destroyed.

	It determines that there are no users, and has guaranteed
	that there will be no future users.  So it knows that it
	can safely do kmem_cache_destroy().

3.	In absence of rcu_barrier(), kmem_cache_destroy() would
	immediately tear down the slab data structures.

4.	At the end of the next grace period, the RCU callback posted
	(again, internally by the slab cache) is invoked.  It has a
	coronary due to the slab data structures having already been
	freed, and (worse yet) possibly reallocated for other uses.

Hence the need for the rcu_barrier() when tearing down SLAB_DESTROY_BY_RCU
slab caches.
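
The four steps above can be replayed in a small userspace model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, nothing here is a kernel API, and the "grace period" is just a pending-callback array that gets drained. It is meant to show the ordering problem, not how slub is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_cache {
	bool destroy_by_rcu;	/* stands in for SLAB_DESTROY_BY_RCU */
	bool torn_down;		/* set once the cache structures are gone */
	int use_after_free;	/* callbacks that ran against a dead cache */
};

#define TOY_MAX_PENDING 8
static struct toy_cache *toy_pending[TOY_MAX_PENDING];
static size_t toy_npending;

/* Model of call_rcu(): queue work to run after the grace period. */
static void toy_call_rcu(struct toy_cache *c)
{
	assert(toy_npending < TOY_MAX_PENDING);
	toy_pending[toy_npending++] = c;
}

/* Model of a grace period ending: run every queued callback. */
static void toy_run_callbacks(void)
{
	for (size_t i = 0; i < toy_npending; i++) {
		if (toy_pending[i]->torn_down)
			toy_pending[i]->use_after_free++;	/* step 4 */
	}
	toy_npending = 0;
}

/* Model of rcu_barrier(): return only after queued callbacks ran. */
static void toy_rcu_barrier(void)
{
	toy_run_callbacks();
}

/* Step 1: freeing a slab posts a callback internal to the cache. */
static void toy_free_slab(struct toy_cache *c)
{
	if (c->destroy_by_rcu)
		toy_call_rcu(c);
}

/* Steps 2 and 3: destroy the cache, with or without the barrier. */
static void toy_cache_destroy(struct toy_cache *c, bool use_barrier)
{
	if (use_barrier && c->destroy_by_rcu)
		toy_rcu_barrier();
	c->torn_down = true;
}

/* Returns how many callbacks touched an already torn-down cache. */
static int toy_scenario(bool use_barrier)
{
	struct toy_cache c = { .destroy_by_rcu = true };

	toy_npending = 0;
	toy_free_slab(&c);
	toy_cache_destroy(&c, use_barrier);
	toy_run_callbacks();	/* the next grace period ends */
	return c.use_after_free;
}
```

In this model toy_scenario(false) reports one callback running against a dead cache, and toy_scenario(true) reports none, because the barrier drains the callback before teardown.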

> > And when RCU grace period is elapsed, the callback *will* need access to
> > the cache we want to dismantle. Better to not have kfreed()/poisoned it...
> 
> But going through the RCU period is pointless since no user of the cache
> remains.

Which is irrelevant.  The outstanding RCU callback was posted by the
slab cache itself, -not- by the user of the slab cache.

> > I believe you mix two RCU uses here.
> >
> > 1) The one we all know, is use normal caches (!SLAB_DESTROY_BY_RCU)
> > (or kmalloc()), and use call_rcu(... kfree_something)
> >
> >    In this case, you are 100% right that the subsystem itself has
> >    to call rcu_barrier() (or respect whatever self-synchro) itself,
> >    before calling kmem_cache_destroy()
> >
> > 2) The SLAB_DESTROY_BY_RCU one.
> >
> >    Part of cache dismantle needs to call rcu_barrier() itself.
> >    Caller doesnt have to use rcu_barrier(). It would be a waste of time,
> >    as kmem_cache_destroy() will refill rcu wait queues with its own stuff.
> 
> The dismantling does not need RCU since there are no operations on the
> objects in progress. So simply switch DESTROY_BY_RCU off for close.

Unless I am missing something, this patch re-introduces the bug that
the rcu_barrier() was added to prevent.  So, in absence of a better
explanation of what I am missing:

NACK.

							Thanx, Paul

> ---
>  mm/slub.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2009-09-03 10:14:51.000000000 -0500
> +++ linux-2.6/mm/slub.c	2009-09-03 10:18:32.000000000 -0500
> @@ -2594,9 +2594,9 @@ static inline int kmem_cache_close(struc
>   */
>  void kmem_cache_destroy(struct kmem_cache *s)
>  {
> -	if (s->flags & SLAB_DESTROY_BY_RCU)
> -		rcu_barrier();
>  	down_write(&slub_lock);
> +	/* Stop deferring frees so that we can immediately free structures */
> +	s->flags &= ~SLAB_DESTROY_BY_RCU;
>  	s->refcount--;
>  	if (!s->refcount) {
>  		list_del(&s->list);

* Re: [PATCH] slub: fix slab_pad_check() and SLAB_DESTROY_BY_RCU
From: Christoph Lameter @ 2009-09-03 17:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Pekka Enberg, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck
In-Reply-To: <4A9F7283.1090306@gmail.com>

On Thu, 3 Sep 2009, Eric Dumazet wrote:

> on a SLAB_DESTROY_BY_RCU cache, there is no need to try to optimize this
> rcu_barrier() call, unless we want superfast reboot/halt sequences...

I still think that the action to quiesce rcu is something that the caller
of kmem_cache_destroy must take care of.

Could you split this into two patches: One that addresses the poison and
another that deals with rcu?

* Re: [PATCH] slub: fix slab_pad_check() and SLAB_DESTROY_BY_RCU
From: Christoph Lameter @ 2009-09-03 17:50 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Eric Dumazet, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck
In-Reply-To: <84144f020909030051u6cf6ae01he25c268f718ff3af@mail.gmail.com>

On Thu, 3 Sep 2009, Pekka Enberg wrote:

> Oh, sure, the fix looks sane to me. It's just that I am a complete
> coward when it comes to merging RCU related patches so I always try to
> fish an Acked-by from Paul or Christoph ;).

I am fine with acking the poison piece.

I did not ack the patch that added rcu to kmem_cache_destroy() and I
likely won't ack that piece either.

* Re: [PATCH net-next-2.6] macvlan: add multiqueue capability
From: Patrick McHardy @ 2009-09-03 17:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <4A9F9661.7020301@gmail.com>

Eric Dumazet wrote:
> macvlan devices are currently not multi-queue capable.
> 
> We can do that defining rtnl_link_ops method,
> get_tx_queues(), called from rtnl_create_link()
> 
> This new method gets num_tx_queues/real_num_tx_queues
> from lower device.
> 
> macvlan_get_tx_queues() is a copy of vlan_get_tx_queues().
> 
> Because macvlan_start_xmit() has to update netdev_queue
> stats only (and not dev->stats), I chose to change
> tx_errors/tx_aborted_errors accounting to tx_dropped,
> since netdev_queue structure doesnt define tx_errors /
> tx_aborted_errors.

The patch looks fine, but it just occurred to me that this won't
have any effect since both VLAN and macvlan use a tx_queue_len of 0,
so they will by default have queueing disabled. In fact this
will increase costs for the default case since we're now hashing
every packet.

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Patrick McHardy @ 2009-09-03 17:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Christoph Lameter, Eric Dumazet, Jarek Poplawski, David Miller,
	netdev
In-Reply-To: <Pine.LNX.4.64.0909031911400.3490@ask.diku.dk>

Jesper Dangaard Brouer wrote:
> 
> On Thu, 3 Sep 2009, Patrick McHardy wrote:
> 
>> The approach I'm currently working on will present multiqueue root
>> qdiscs as children of a dummy classful qdisc. This avoids handle
>> clashes and the need for new identifiers and allows addressing each
>> qdisc separately, similar to how it works with other classful qdiscs:
> 
> I like your approach. It's well suited for the qdiscs :-)
> 
> I especially like the possibility to access each qdisc separately.  Does
> it then support having a separate qdisc per TX queue?  (I'm toying with
> the idea of transmitting our multicast traffic into/via a separate TX
> hardware queue, and making a special qdisc for IPTV MPEG2-TS shaping)

Yes, you can attach qdiscs to the classes representing the queues.
At least it should work :) It would probably also be possible to
use TC classifiers for queue selection.


* Re: [PATCH] slub: fix slab_pad_check()
From: Eric Dumazet @ 2009-09-03 17:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck
In-Reply-To: <alpine.DEB.1.10.0909031414310.29881@V090114053VZO-1>

Christoph Lameter a écrit :
> On Thu, 3 Sep 2009, Eric Dumazet wrote:
> 
>> Point is we cannot deal with RCU quietness before disposing the slab cache,
>> (if SLAB_DESTROY_BY_RCU was set on the cache) since this disposing *will*
>> make call_rcu() calls when a full slab is freed/purged.
> 
> There is no need to do call_rcu calls for frees at that point since
> objects are no longer in use. We could simply disable SLAB_DESTROY_BY_RCU
> for the final clearing of caches.
> 
>> And when RCU grace period is elapsed, the callback *will* need access to
>> the cache we want to dismantle. Better to not have kfreed()/poisoned it...
> 
> But going through the RCU period is pointless since no user of the cache
> remains.
> 
>> I believe you mix two RCU uses here.
>>
>> 1) The one we all know, is use normal caches (!SLAB_DESTROY_BY_RCU)
>> (or kmalloc()), and use call_rcu(... kfree_something)
>>
>>    In this case, you are 100% right that the subsystem itself has
>>    to call rcu_barrier() (or respect whatever self-synchro) itself,
>>    before calling kmem_cache_destroy()
>>
>> 2) The SLAB_DESTROY_BY_RCU one.
>>
>>    Part of cache dismantle needs to call rcu_barrier() itself.
>>    Caller doesnt have to use rcu_barrier(). It would be a waste of time,
>>    as kmem_cache_destroy() will refill rcu wait queues with its own stuff.
> 
> The dismantling does not need RCU since there are no operations on the
> objects in progress. So simply switch DESTROY_BY_RCU off for close.
> 
> 
> ---
>  mm/slub.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2009-09-03 10:14:51.000000000 -0500
> +++ linux-2.6/mm/slub.c	2009-09-03 10:18:32.000000000 -0500
> @@ -2594,9 +2594,9 @@ static inline int kmem_cache_close(struc
>   */
>  void kmem_cache_destroy(struct kmem_cache *s)
>  {
> -	if (s->flags & SLAB_DESTROY_BY_RCU)
> -		rcu_barrier();
>  	down_write(&slub_lock);
> +	/* Stop deferring frees so that we can immediately free structures */
> +	s->flags &= ~SLAB_DESTROY_BY_RCU;
>  	s->refcount--;
>  	if (!s->refcount) {
>  		list_del(&s->list);

It seems very smart, but it needs a review of all callers to make sure no
slabs are waiting for final freeing in a call_rcu queue on some cpu.

I suspect most of them will then have to use rcu_barrier() before calling
kmem_cache_destroy(), so why not factor that code into one place?

net/dccp/ipv6.c:1145:   .slab_flags        = SLAB_DESTROY_BY_RCU,
net/dccp/ipv4.c:941:    .slab_flags             = SLAB_DESTROY_BY_RCU,
net/ipv4/udp.c:1593:    .slab_flags        = SLAB_DESTROY_BY_RCU,
net/ipv4/udplite.c:54:  .slab_flags        = SLAB_DESTROY_BY_RCU,
net/ipv4/tcp_ipv4.c:2446:       .slab_flags             = SLAB_DESTROY_BY_RCU,
net/ipv4/udp.c.orig:1587:       .slab_flags        = SLAB_DESTROY_BY_RCU,
net/ipv6/udp.c:1274:    .slab_flags        = SLAB_DESTROY_BY_RCU,
net/ipv6/udplite.c:52:  .slab_flags        = SLAB_DESTROY_BY_RCU,
net/ipv6/tcp_ipv6.c:2085:       .slab_flags             = SLAB_DESTROY_BY_RCU,
net/netfilter/nf_conntrack_core.c:1269:                                         0, SLAB_DESTROY_BY_RCU, NULL);

* Re: [PATCH net-next-2.6] macvlan: add multiqueue capability
From: Eric Dumazet @ 2009-09-03 18:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <4AA002BC.3050507@trash.net>

Patrick McHardy a écrit :
> Eric Dumazet wrote:
>> macvlan devices are currently not multi-queue capable.
>>
>> We can do that defining rtnl_link_ops method,
>> get_tx_queues(), called from rtnl_create_link()
>>
>> This new method gets num_tx_queues/real_num_tx_queues
>> from lower device.
>>
>> macvlan_get_tx_queues() is a copy of vlan_get_tx_queues().
>>
>> Because macvlan_start_xmit() has to update netdev_queue
>> stats only (and not dev->stats), I chose to change
>> tx_errors/tx_aborted_errors accounting to tx_dropped,
>> since netdev_queue structure doesnt define tx_errors /
>> tx_aborted_errors.
> 
> The patch looks fine, but it just occurred to me that this won't
> have any effect since both VLAN and macvlan use a tx_queue_len of 0,
> so they will by default have queueing disabled. In fact this
> will increase costs for the default case since we're now hashing
> every packet.

Good point !

We'll have to hash the packet later when hitting the lower device,
which is multiqueue. No ?

Also, what's wrong with

ip link add link eth0 eth0.103 txqueuelen 100 type vlan id 103

;)



* Re: System freeze on reboot - general protection fault
From: Paul E. McKenney @ 2009-09-03 18:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Zdenek Kabelac, Patrick McHardy, Christoph Lameter, Robin Holt,
	Linux Kernel Mailing List, Pekka Enberg, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers
In-Reply-To: <4A9EEF07.5070800@gmail.com>

On Thu, Sep 03, 2009 at 12:17:43AM +0200, Eric Dumazet wrote:
> Zdenek Kabelac a écrit :
> > 2009/8/17 Patrick McHardy <kaber@trash.net>:
> >> Eric Dumazet wrote:
> >>> Zdenek Kabelac a écrit :
> >>>>  [<ffffffffa02c502f>] nf_conntrack_ftp_fini+0x2f/0x70 [nf_conntrack_ftp]
> >>>>  [<ffffffff8027bcc5>] sys_delete_module+0x1a5/0x270
> >>>>  [<ffffffff8020d329>] ? retint_swapgs+0xe/0x13
> >>>>  [<ffffffff80271bf2>] ? trace_hardirqs_on_caller+0x162/0x1b0
> >>>>  [<ffffffff80292121>] ? audit_syscall_entry+0x191/0x1c0
> >>>>  [<ffffffff80526dae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >>>>  [<ffffffff8020c84b>] system_call_fastpath+0x16/0x1b
> >>>> Code: c6 00 00 0f 82 66 ff ff ff 49 8b 9e d8 05 00 00 48 85 db 75 16
> >>>> e9 8e 00 00 00 0f 1f 44 00 00 48 85 c0 0f 84 80 00 00 00 48 89 c3 <0f>
> >>>> b6 4b 37 48 8b 03 48 8d 14 cd 00 00 00 00 0f 18 08 48 29 ca
> >>>> RIP  [<ffffffffa02b2c2c>] nf_conntrack_helper_unregister+0x16c/0x320
> >>>> [nf_conntrack]
> >>>>  RSP <ffff88013982fe68>
> >>>> CR2: 0000000000000038
> >>>> ---[ end trace bc3a0ede3d0084db ]---
> >>>>
> >>> I am currently traveling and wont be able to help you before next week.
> >>>
> >>> I added netdev, Patrick, and netfilter-devel in CC so that more eyes can take a look.
> >> Thanks for the report, I'll have a look at this. Zdenek, please
> >> send me the nf_conntrack.ko file used in the above oops. Thanks.
> >>
> > 
> > Ok
> > 
> > I've found the solution for my problem.
> > 
> > http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/30483
> > 
> > I've made this small fix from this thread:
> > 
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core
> > index b5869b9..68488f8 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1108,6 +1108,7 @@ static void nf_conntrack_cleanup_init_net(void)
> >  {
> >         nf_conntrack_helper_fini();
> >         nf_conntrack_proto_fini();
> > +       rcu_barrier();
> >         kmem_cache_destroy(nf_conntrack_cachep);
> >  }
> > 
> > @@ -1266,7 +1267,7 @@ static int nf_conntrack_init_init_net(void)
> > 
> >         nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
> >                                                 sizeof(struct nf_conn),
> > -                                               0, SLAB_DESTROY_BY_RCU, NULL);
> > +                                               0, 0, NULL);
> >         if (!nf_conntrack_cachep) {
> >                 printk(KERN_ERR "Unable to create nf_conn slab cache\n");
> >                 ret = -ENOMEM;
> > 
> > 
> > As the thread nf_conntrack: Use rcu_barrier() and fix kmem_cache_create flags
> > seems to be somewhat 'unfinished'  and already a bit old and I've no
> > idea whether it actually fixes problem completely or just hides it in
> > my case - I'm leaving it to some RCU gurus to fix this issue.
> > 
> > All I could say is - this extra rcu_barrier() and removal of
> > SLAB_DESTROY removes my GPF on reboot.
> > 
> > Zdenek
> 
> Ouch..
> 
> Don't think such a patch makes your kernel better; it'll crash too.
> 
> You cannot remove SLAB_DESTROY_BY_RCU like this, it's there for very good reasons.

And if I understand correctly, this is more evidence that
kmem_cache_destroy() needs to do an rcu_barrier() in the
SLAB_DESTROY_BY_RCU case.

							Thanx, Paul

* Re: [PATCH net-next-2.6] macvlan: add multiqueue capability
From: Eric Dumazet @ 2009-09-03 18:19 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <4AA002BC.3050507@trash.net>

Patrick McHardy a écrit :
> Eric Dumazet wrote:
>> macvlan devices are currently not multi-queue capable.
>>
>> We can do that defining rtnl_link_ops method,
>> get_tx_queues(), called from rtnl_create_link()
>>
>> This new method gets num_tx_queues/real_num_tx_queues
>> from lower device.
>>
>> macvlan_get_tx_queues() is a copy of vlan_get_tx_queues().
>>
>> Because macvlan_start_xmit() has to update netdev_queue
>> stats only (and not dev->stats), I chose to change
>> tx_errors/tx_aborted_errors accounting to tx_dropped,
>> since netdev_queue structure doesnt define tx_errors /
>> tx_aborted_errors.
> 
> The patch looks fine, but it just occurred to me that this won't
> have any effect since both VLAN and macvlan use a tx_queue_len of 0,
> so they will by default have queueing disabled. In fact this
> will increase costs for the default case since we're now hashing
> every packet.

I just read dev_queue_xmit() again, for the case where we have no queueing
on macvlan/vlan.

Having multiple txqs should help multi-flow / multi-cpu setups,
since hashing will provide more chances to hit different txqs/locks,
and let several cpus run concurrently, each one on a different queue.

So I don't understand why you think it'll increase costs...
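
To make the hashing argument concrete, here is a minimal userspace sketch of mapping a flow to a TX queue index. The struct, the function names, and the FNV-1a hash are all assumptions chosen for illustration; this is not the hash or the selection code the kernel uses. It only shows that the same flow always lands on the same queue, while distinct flows can land on different queues and therefore on different locks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key, for illustration only. */
struct toy_flow {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* FNV-1a over the key bytes: a stand-in, not the kernel's hash. */
static uint32_t toy_flow_hash(const struct toy_flow *f)
{
	uint32_t words[3] = {
		f->saddr,
		f->daddr,
		((uint32_t)f->sport << 16) | f->dport,
	};
	uint32_t h = 2166136261u;

	for (int i = 0; i < 3; i++) {
		for (int b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xffu;
			h *= 16777619u;
		}
	}
	return h;
}

/* Pick a queue index in [0, real_num_tx_queues) for this flow. */
static unsigned int toy_pick_txq(const struct toy_flow *f,
				 unsigned int real_num_tx_queues)
{
	if (real_num_tx_queues <= 1)
		return 0;	/* single-queue device: nothing to choose */
	return toy_flow_hash(f) % real_num_tx_queues;
}
```

With several queues, two cpus sending different flows have a chance of taking different per-queue locks; with one queue they always contend on the same one.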

* Re: [PATCH net-next-2.6] macvlan: add multiqueue capability
From: Patrick McHardy @ 2009-09-03 18:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <4AA00634.7050103@gmail.com>

Eric Dumazet wrote:
> Patrick McHardy a écrit :
>> The patch looks fine, but it just occurred to me that this won't
>> have any effect since both VLAN and macvlan use a tx_queue_len of 0,
>> so they will by default have queueing disabled. In fact this
>> will increase costs for the default case since we're now hashing
>> every packet.
> 
> Good point !
> 
> We'll have to hash the packet later when hitting the lowerdevice,
> which is multiqueue. No ?

Right. But we don't reuse that decision from what I can tell.

> Also, what's wrong with
> 
> ip link add link eth0 eth0.103 txqueuelen 100 type vlan id 103

There's nothing wrong, but it's kind of pointless since with the
default qdisc the queue will be bypassed, and other qdiscs are shared
between the queues and defeat multiqueue.

I guess it could make sense if you want to apply TC actions
or something like that once we support using different (non-shared)
qdiscs for each queue.

* Re: [PATCH net-next-2.6] macvlan: add multiqueue capability
From: Patrick McHardy @ 2009-09-03 18:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <4AA008C9.4000805@gmail.com>

Eric Dumazet wrote:
> Patrick McHardy a écrit :
>> The patch looks fine, but it just occurred to me that this won't
>> have any effect since both VLAN and macvlan use a tx_queue_len of 0,
>> so they will by default have queueing disabled. In fact this
>> will increase costs for the default case since we're now hashing
>> every packet.
> 
> Just read again dev_queue_xmit(), in case we have no queueing
> on macvlan/vlan
> 
> Having multiple txq should help multi flow / multi cpus setups,
> since hashing will provide more chances to hit different txq/locks,
> and let several cpus run concurrently, each one on a different queue.

You're right, I missed that we're also performing locking in the
noqueue case. Sorry :)

* Re: [PATCH] slub: fix slab_pad_check()
From: Christoph Lameter @ 2009-09-03 18:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Pekka Enberg, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck
In-Reply-To: <4A9FCDC6.3060003@gmail.com>

On Thu, 3 Sep 2009, Eric Dumazet wrote:

> Christoph Lameter a écrit :
> > On Thu, 3 Sep 2009, Eric Dumazet wrote:
> >
> >> on a SLAB_DESTROY_BY_RCU cache, there is no need to try to optimize this
> >> rcu_barrier() call, unless we want superfast reboot/halt sequences...
> >
> > I still think that the action to quiesce rcu is something that the caller
> > of kmem_cache_destroy must take care of.
>
> Do you mean :
>
> if (kmem_cache_shrink(s) == 0) {
> 	rcu_barrier();
> 	kmem_cache_destroy_no_rcu_barrier(s);
> } else {
> 	kmem_cache_destroy_with_rcu_barrier_because_SLAB_DESTROY_BY_RCU_cache(s);
> }
>
> What would be the point ?

The above is part of slub?

I mean that (in this case) the net subsystem would have to deal with RCU quietness
before disposing of the slab cache. There may be multiple ways of dealing
with RCU. The RCU barrier may be unnecessary for future uses. Typically
one would expect that all deferred handling of structures must be complete
for correctness before disposing of the whole cache.

> [PATCH] slub: fix slab_pad_check()

Acked-by: Christoph Lameter <cl@linux-foundation.org>

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Ira W. Snyder @ 2009-09-03 18:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, virtualization, kvm, linux-kernel, mingo, linux-mm, akpm,
	hpa, gregory.haskins, Rusty Russell, s.hetze
In-Reply-To: <20090827160750.GD23722@redhat.com>

On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.
> Existing virtio net code is used in the guest without modification.
> 
> There's similarity with vringfd, with some differences and reduced scope
> - uses eventfd for signalling
> - structures can be moved around in memory at any time (good for migration)
> - support memory table and not just an offset (needed for kvm)
> 
> common virtio related code has been put in a separate file vhost.c and
> can be made into a separate module if/when more backends appear.  I used
> Rusty's lguest.c as the source for developing this part : this supplied
> me with witty comments I wouldn't be able to write myself.
> 
> What it is not: vhost net is not a bus, and not a generic new system
> call. No assumptions are made on how guest performs hypercalls.
> Userspace hypervisors are supported as well as kvm.
> 
> How it works: Basically, we connect virtio frontend (configured by
> userspace) to a backend. The backend could be a network device, or a
> tun-like device. In this version I only support raw socket as a backend,
> which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> also configured by userspace, including vlan/mac etc.
> 
> Status:
> This works for me, and I haven't seen any crashes.
> I have done some light benchmarking (with v4), compared to userspace, I
> see improved latency (as I save up to 4 system calls per packet) but not
> bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> ping benchmark (where there's no TSO) throughput is also improved.
> 
> Features that I plan to look at in the future:
> - tap support
> - TSO
> - interrupt mitigation
> - zero copy
> 

Hello Michael,

I've started looking at vhost with the intention of using it over PCI to
connect physical machines together.

The part that I am struggling with the most is figuring out which parts
of the rings are in the host's memory, and which parts are in the
guest's memory.

If I understand everything correctly, the rings are all userspace
addresses, which means that they can be moved around in physical memory,
and get pushed out to swap. AFAIK, this is impossible to handle when
connecting two physical systems, you'd need the rings available in IO
memory (PCI memory), so you can ioreadXX() them instead. To the best of
my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
Also, having them migrate around in memory would be a bad thing.

Also, I'm having trouble figuring out how the packet contents are
actually copied from one system to the other. Could you point this out
for me?

Is there somewhere I can find the userspace code (kvm, qemu, lguest,
etc.) needed for interacting with the vhost misc device so I can
get a better idea of how userspace is supposed to work? (Feature
negotiation, etc.)

Thanks,
Ira

> Acked-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
>  MAINTAINERS                |   10 +
>  arch/x86/kvm/Kconfig       |    1 +
>  drivers/Makefile           |    1 +
>  drivers/vhost/Kconfig      |   11 +
>  drivers/vhost/Makefile     |    2 +
>  drivers/vhost/net.c        |  475 ++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.c      |  688 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h      |  122 ++++++++
>  include/linux/Kbuild       |    1 +
>  include/linux/miscdevice.h |    1 +
>  include/linux/vhost.h      |  101 +++++++
>  11 files changed, 1413 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vhost/Kconfig
>  create mode 100644 drivers/vhost/Makefile
>  create mode 100644 drivers/vhost/net.c
>  create mode 100644 drivers/vhost/vhost.c
>  create mode 100644 drivers/vhost/vhost.h
>  create mode 100644 include/linux/vhost.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b1114cf..de4587f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5431,6 +5431,16 @@ S:	Maintained
>  F:	Documentation/filesystems/vfat.txt
>  F:	fs/fat/
>  
> +VIRTIO HOST (VHOST)
> +P:	Michael S. Tsirkin
> +M:	mst@redhat.com
> +L:	kvm@vger.kernel.org
> +L:	virtualization@lists.osdl.org
> +L:	netdev@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vhost/
> +F:	include/linux/vhost.h
> +
>  VIA RHINE NETWORK DRIVER
>  M:	Roger Luethi <rl@hellgate.ch>
>  S:	Maintained
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index b84e571..94f44d9 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -64,6 +64,7 @@ config KVM_AMD
>  
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
> +source drivers/vhost/Kconfig
>  source drivers/lguest/Kconfig
>  source drivers/virtio/Kconfig
>  
> diff --git a/drivers/Makefile b/drivers/Makefile
> index bc4205d..1551ae1 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -105,6 +105,7 @@ obj-$(CONFIG_HID)		+= hid/
>  obj-$(CONFIG_PPC_PS3)		+= ps3/
>  obj-$(CONFIG_OF)		+= of/
>  obj-$(CONFIG_SSB)		+= ssb/
> +obj-$(CONFIG_VHOST_NET)		+= vhost/
>  obj-$(CONFIG_VIRTIO)		+= virtio/
>  obj-$(CONFIG_VLYNQ)		+= vlynq/
>  obj-$(CONFIG_STAGING)		+= staging/
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> new file mode 100644
> index 0000000..d955406
> --- /dev/null
> +++ b/drivers/vhost/Kconfig
> @@ -0,0 +1,11 @@
> +config VHOST_NET
> +	tristate "Host kernel accelerator for virtio net"
> +	depends on NET && EVENTFD
> +	---help---
> +	  This kernel module can be loaded in host kernel to accelerate
> +	  guest networking with virtio_net. Not to be confused with virtio_net
> +	  module itself which needs to be loaded in guest kernel.
> +
> +	  To compile this driver as a module, choose M here: the module will
> +	  be called vhost_net.
> +
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> new file mode 100644
> index 0000000..72dd020
> --- /dev/null
> +++ b/drivers/vhost/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_VHOST_NET) += vhost_net.o
> +vhost_net-y := vhost.o net.o
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> new file mode 100644
> index 0000000..2210eaa
> --- /dev/null
> +++ b/drivers/vhost/net.c
> @@ -0,0 +1,475 @@
> +/* Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * virtio-net server in host kernel.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mmu_context.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/rcupdate.h>
> +#include <linux/file.h>
> +
> +#include <linux/net.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_arp.h>
> +
> +#include <net/sock.h>
> +
> +#include "vhost.h"
> +
> +enum {
> +	VHOST_NET_VQ_RX = 0,
> +	VHOST_NET_VQ_TX = 1,
> +	VHOST_NET_VQ_MAX = 2,
> +};
> +
> +struct vhost_net {
> +	struct vhost_dev dev;
> +	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> +	/* We use a kind of RCU to access sock pointer.
> +	 * All readers access it from workqueue, which makes it possible to
> +	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
> +	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
> +	 * work item execution acts instead of rcu_read_lock() and the end of
> +	 * work item execution acts instead of rcu_read_unlock().
> +	 * Writers use device mutex. */
> +	struct socket *sock;
> +	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> +};
> +
> +/* Pop first len bytes from iovec. Return number of segments used. */
> +static int move_iovec_hdr(struct iovec *from, struct iovec *to,
> +			  size_t len, int iov_count)
> +{
> +       int seg = 0;
> +       size_t size;
> +       while (len && seg < iov_count) {
> +               size = min(from->iov_len, len);
> +               to->iov_base = from->iov_base;
> +               to->iov_len = size;
> +               from->iov_len -= size;
> +               from->iov_base += size;
> +               len -= size;
> +               ++from;
> +               ++to;
> +               ++seg;
> +       }
> +       return seg;
> +}
> +
> +/* Expects to be always run from workqueue - which acts as
> + * read-side critical section for our kind of RCU. */
> +static void handle_tx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	unsigned head, out, in, s;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL,
> +		.msg_controllen = 0,
> +		.msg_iov = vq->iov,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || !sock_writeable(sock->sk))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		if (in) {
> +			vq_err(vq, "Unexpected descriptor format for TX: "
> +			       "out %d, in %d\n", out, in);
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr,
> +				   sizeof(struct virtio_net_hdr), out);
> +		msg.msg_iovlen = out;
> +		len = iov_length(vq->iov, out);
> +		/* Sanity check */
> +		if (!len) {
> +			vq_err(vq, "Unexpected header len for TX: "
> +			       "%zd expected %zd\n",
> +			       iov_length(vq->hdr, s),
> +			       sizeof(struct virtio_net_hdr));
> +			break;
> +		}
> +		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> +		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		if (err < 0) {
> +			vhost_discard_vq_desc(vq);
> +			break;
> +		}
> +		if (err != len)
> +			pr_err("Truncated TX packet: "
> +			       " len %d != %zd\n", err, len);
> +		vhost_add_used_and_trigger(&net->dev, vq, head, 0);
> +	}
> +
> +	mutex_unlock(&vq->mutex);
> +	unuse_mm(net->dev.mm);
> +}
> +
> +/* Expects to be always run from workqueue - which acts as
> + * read-side critical section for our kind of RCU. */
> +static void handle_rx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> +	unsigned head, out, in, s;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
> +		.msg_controllen = 0,
> +		.msg_iov = vq->iov,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +
> +	struct virtio_net_hdr hdr = {
> +		.flags = 0,
> +		.gso_type = VIRTIO_NET_HDR_GSO_NONE
> +	};
> +
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	vhost_no_notify(vq);
> +
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		/* OK, now we need to know about added descriptors. */
> +		if (head == vq->num && vhost_notify(vq))
> +			/* They could have slipped one in as we were doing that:
> +			 * check again. */
> +			continue;
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		/* We don't need to be notified again. */
> +		vhost_no_notify(vq);
> +		if (out) {
> +			vq_err(vq, "Unexpected descriptor format for RX: "
> +			       "out %d, in %d\n",
> +			       out, in);
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO/mergeable rx buffers. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr, sizeof hdr, in);
> +		msg.msg_iovlen = in;
> +		len = iov_length(vq->iov, in);
> +		/* Sanity check */
> +		if (!len) {
> +			vq_err(vq, "Unexpected header len for RX: "
> +			       "%zd expected %zd\n",
> +			       iov_length(vq->hdr, s), sizeof hdr);
> +			break;
> +		}
> +		err = sock->ops->recvmsg(NULL, sock, &msg,
> +					 len, MSG_DONTWAIT | MSG_TRUNC);
> +		/* TODO: Check specific error and bomb out unless EAGAIN? */
> +		if (err < 0) {
> +			vhost_discard_vq_desc(vq);
> +			break;
> +		}
> +		/* TODO: Should check and handle checksum. */
> +		if (err > len) {
> +			pr_err("Discarded truncated rx packet: "
> +			       " len %d > %zd\n", err, len);
> +			vhost_discard_vq_desc(vq);
> +			continue;
> +		}
> +		len = err;
> +		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, sizeof hdr);
> +		if (err) {
> +			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
> +			       vq->iov->iov_base, err);
> +			break;
> +		}
> +		vhost_add_used_and_trigger(&net->dev, vq, head,
> +					   len + sizeof hdr);
> +	}
> +
> +	mutex_unlock(&vq->mutex);
> +	unuse_mm(net->dev.mm);
> +}
> +
> +static void handle_tx_kick(struct work_struct *work)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct vhost_net *net;
> +	vq = container_of(work, struct vhost_virtqueue, poll.work);
> +	net = container_of(vq->dev, struct vhost_net, dev);
> +	handle_tx(net);
> +}
> +
> +static void handle_rx_kick(struct work_struct *work)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct vhost_net *net;
> +	vq = container_of(work, struct vhost_virtqueue, poll.work);
> +	net = container_of(vq->dev, struct vhost_net, dev);
> +	handle_rx(net);
> +}
> +
> +static void handle_tx_net(struct work_struct *work)
> +{
> +	struct vhost_net *net;
> +	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
> +	handle_tx(net);
> +}
> +
> +static void handle_rx_net(struct work_struct *work)
> +{
> +	struct vhost_net *net;
> +	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
> +	handle_rx(net);
> +}
> +
> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> +	int r;
> +	if (!n)
> +		return -ENOMEM;
> +	f->private_data = n;
> +	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> +	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
> +	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
> +	if (r < 0) {
> +		kfree(n);
> +		return r;
> +	}
> +
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> +	return 0;
> +}
> +
> +static struct socket *vhost_net_stop(struct vhost_net *n)
> +{
> +	struct socket *sock = n->sock;
> +	rcu_assign_pointer(n->sock, NULL);
> +	if (sock) {
> +		vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
> +		vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
> +	}
> +	return sock;
> +}
> +
> +static int vhost_net_release(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = f->private_data;
> +	struct socket *sock;
> +
> +	sock = vhost_net_stop(n);
> +	vhost_dev_cleanup(&n->dev);
> +	if (sock)
> +		fput(sock->file);
> +	kfree(n);
> +	return 0;
> +}
> +
> +static void vhost_net_flush(struct vhost_net *n)
> +{
> +	vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
> +	vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
> +	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_TX].poll);
> +	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_RX].poll);
> +}
> +
> +static long vhost_net_set_socket(struct vhost_net *n, int fd)
> +{
> +	struct {
> +		struct sockaddr_ll sa;
> +		char  buf[MAX_ADDR_LEN];
> +	} uaddr;
> +	struct socket *sock, *oldsock = NULL;
> +	int uaddr_len = sizeof uaddr, r;
> +
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +
> +	if (fd == -1) {
> +		/* Disconnect from socket and device. */
> +		oldsock = vhost_net_stop(n);
> +		goto done;
> +	}
> +
> +	sock = sockfd_lookup(fd, &r);
> +	if (!sock) {
> +		r = -ENOTSOCK;
> +		goto done;
> +	}
> +
> +	/* Parameter checking */
> +	if (sock->sk->sk_type != SOCK_RAW) {
> +		r = -ESOCKTNOSUPPORT;
> +		goto done;
> +	}
> +
> +	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
> +			       &uaddr_len, 0);
> +	if (r)
> +		goto done;
> +
> +	if (uaddr.sa.sll_family != AF_PACKET) {
> +		r = -EPFNOSUPPORT;
> +		goto done;
> +	}
> +
> +	/* start polling new socket */
> +	if (sock == oldsock)
> +		goto done;
> +
> +	if (oldsock) {
> +		vhost_poll_stop(n->poll + VHOST_NET_VQ_TX);
> +		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
> +	}
> +	oldsock = n->sock;
> +	rcu_assign_pointer(n->sock, sock);
> +	vhost_poll_start(n->poll + VHOST_NET_VQ_TX, sock->file);
> +	vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (oldsock) {
> +		vhost_net_flush(n);
> +		fput(oldsock->file);
> +	}
> +	return r;
> +}
> +
> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *sock = NULL;
> +	long r;
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +	sock = vhost_net_stop(n);
> +	r = vhost_dev_reset_owner(&n->dev);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (sock)
> +		fput(sock->file);
> +	return r;
> +}
> +
> +static void vhost_net_set_features(struct vhost_net *n, u64 features)
> +{
> +	mutex_lock(&n->dev.mutex);
> +	n->dev.acked_features = features;
> +	mutex_unlock(&n->dev.mutex);
> +	vhost_net_flush(n);
> +}
> +
> +static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
> +			    unsigned long arg)
> +{
> +	struct vhost_net *n = f->private_data;
> +	void __user *argp = (void __user *)arg;
> +	u32 __user *featurep = argp;
> +	int __user *fdp = argp;
> +	u64 features;
> +	int fd, r;
> +	switch (ioctl) {
> +	case VHOST_NET_SET_SOCKET:
> +		r = get_user(fd, fdp);
> +		if (r < 0)
> +			return r;
> +		return vhost_net_set_socket(n, fd);
> +	case VHOST_GET_FEATURES:
> +		features = VHOST_FEATURES;
> +		return put_user(features, featurep);
> +	case VHOST_ACK_FEATURES:
> +		r = get_user(features, featurep);
> +		/* No features for now */
> +		if (r < 0)
> +			return r;
> +		if (features & ~VHOST_FEATURES)
> +			return -EOPNOTSUPP;
> +		vhost_net_set_features(n, features);
> +		return 0;
> +	case VHOST_RESET_OWNER:
> +		return vhost_net_reset_owner(n);
> +	default:
> +		return vhost_dev_ioctl(&n->dev, ioctl, arg);
> +	}
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
> +				   unsigned long arg)
> +{
> +	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct file_operations vhost_net_fops = {
> +	.owner          = THIS_MODULE,
> +	.release        = vhost_net_release,
> +	.unlocked_ioctl = vhost_net_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl   = vhost_net_compat_ioctl,
> +#endif
> +	.open           = vhost_net_open,
> +};
> +
> +static struct miscdevice vhost_net_misc = {
> +	VHOST_NET_MINOR,
> +	"vhost-net",
> +	&vhost_net_fops,
> +};
> +
> +int vhost_net_init(void)
> +{
> +	int r = vhost_init();
> +	if (r)
> +		goto err_init;
> +	r = misc_register(&vhost_net_misc);
> +	if (r)
> +		goto err_reg;
> +	return 0;
> +err_reg:
> +	vhost_cleanup();
> +err_init:
> +	return r;
> +
> +}
> +module_init(vhost_net_init);
> +
> +void vhost_net_exit(void)
> +{
> +	misc_deregister(&vhost_net_misc);
> +	vhost_cleanup();
> +}
> +module_exit(vhost_net_exit);
> +
> +MODULE_VERSION("0.0.1");
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Michael S. Tsirkin");
> +MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> new file mode 100644
> index 0000000..6925cc1
> --- /dev/null
> +++ b/drivers/vhost/vhost.c
> @@ -0,0 +1,688 @@
> +/* Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * Generic code for virtio server in host kernel.
> + */
> +
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mm.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/rcupdate.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +
> +#include <linux/net.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_arp.h>
> +
> +#include <net/sock.h>
> +
> +#include "vhost.h"
> +
> +enum {
> +	VHOST_MEMORY_MAX_NREGIONS = 64,
> +};
> +
> +static struct workqueue_struct *vhost_workqueue;
> +
> +static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
> +			    poll_table *pt)
> +{
> +	struct vhost_poll *poll;
> +	poll = container_of(pt, struct vhost_poll, table);
> +
> +	poll->wqh = wqh;
> +	add_wait_queue(wqh, &poll->wait);
> +}
> +
> +static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> +			     void *key)
> +{
> +	struct vhost_poll *poll;
> +	poll = container_of(wait, struct vhost_poll, wait);
> +	if (!((unsigned long)key & poll->mask))
> +		return 0;
> +
> +	queue_work(vhost_workqueue, &poll->work);
> +	return 0;
> +}
> +
> +/* Init poll structure */
> +void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> +		     unsigned long mask)
> +{
> +	INIT_WORK(&poll->work, func);
> +	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> +	init_poll_funcptr(&poll->table, vhost_poll_func);
> +	poll->mask = mask;
> +}
> +
> +/* Start polling a file. We add ourselves to file's wait queue. The caller must
> + * keep a reference to a file until after vhost_poll_stop is called. */
> +void vhost_poll_start(struct vhost_poll *poll, struct file *file)
> +{
> +	unsigned long mask;
> +	mask = file->f_op->poll(file, &poll->table);
> +	if (mask)
> +		vhost_poll_wakeup(&poll->wait, 0, 0, (void *)mask);
> +}
> +
> +/* Stop polling a file. After this function returns, it becomes safe to drop the
> + * file reference. You must also flush afterwards. */
> +void vhost_poll_stop(struct vhost_poll *poll)
> +{
> +	remove_wait_queue(poll->wqh, &poll->wait);
> +}
> +
> +/* Flush any work that has been scheduled. When calling this, don't hold any
> + * locks that are also used by the callback. */
> +void vhost_poll_flush(struct vhost_poll *poll)
> +{
> +	flush_work(&poll->work);
> +}
> +
> +long vhost_dev_init(struct vhost_dev *dev,
> +		    struct vhost_virtqueue *vqs, int nvqs)
> +{
> +	int i;
> +	dev->vqs = vqs;
> +	dev->nvqs = nvqs;
> +	mutex_init(&dev->mutex);
> +
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		dev->vqs[i].dev = dev;
> +		mutex_init(&dev->vqs[i].mutex);
> +		if (dev->vqs[i].handle_kick)
> +			vhost_poll_init(&dev->vqs[i].poll,
> +					dev->vqs[i].handle_kick,
> +					POLLIN);
> +	}
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +long vhost_dev_check_owner(struct vhost_dev *dev)
> +{
> +	/* Are you the owner? If not, I don't think you mean to do that */
> +	return dev->mm == current->mm ? 0 : -EPERM;
> +}
> +
> +/* Caller should have device mutex */
> +static long vhost_dev_set_owner(struct vhost_dev *dev)
> +{
> +	/* Is there an owner already? */
> +	if (dev->mm)
> +		return -EBUSY;
> +	/* No owner, become one */
> +	dev->mm = get_task_mm(current);
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +long vhost_dev_reset_owner(struct vhost_dev *dev)
> +{
> +	struct vhost_memory *memory;
> +
> +	/* Restore memory to default 1:1 mapping. */
> +	memory = kmalloc(offsetof(struct vhost_memory, regions) +
> +			 2 * sizeof *memory->regions, GFP_KERNEL);
> +	if (!memory)
> +		return -ENOMEM;
> +
> +	vhost_dev_cleanup(dev);
> +
> +	memory->nregions = 2;
> +	memory->regions[0].guest_phys_addr = 1;
> +	memory->regions[0].userspace_addr = 1;
> +	memory->regions[0].memory_size = ~0ULL;
> +	memory->regions[1].guest_phys_addr = 0;
> +	memory->regions[1].userspace_addr = 0;
> +	memory->regions[1].memory_size = 1;
> +	dev->memory = memory;
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +void vhost_dev_cleanup(struct vhost_dev *dev)
> +{
> +	int i;
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
> +			vhost_poll_stop(&dev->vqs[i].poll);
> +			vhost_poll_flush(&dev->vqs[i].poll);
> +		}
> +		if (dev->vqs[i].error_ctx)
> +			eventfd_ctx_put(dev->vqs[i].error_ctx);
> +		if (dev->vqs[i].error)
> +			fput(dev->vqs[i].error);
> +		if (dev->vqs[i].kick)
> +			fput(dev->vqs[i].kick);
> +		if (dev->vqs[i].call_ctx)
> +			eventfd_ctx_put(dev->vqs[i].call_ctx);
> +		if (dev->vqs[i].call)
> +			fput(dev->vqs[i].call);
> +		dev->vqs[i].error_ctx = NULL;
> +		dev->vqs[i].error = NULL;
> +		dev->vqs[i].kick = NULL;
> +		dev->vqs[i].call_ctx = NULL;
> +		dev->vqs[i].call = NULL;
> +	}
> +	/* No one will access memory at this point */
> +	kfree(dev->memory);
> +	dev->memory = NULL;
> +	if (dev->mm)
> +		mmput(dev->mm);
> +	dev->mm = NULL;
> +}
> +
> +static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> +{
> +	struct vhost_memory mem, *newmem, *oldmem;
> +	unsigned long size = offsetof(struct vhost_memory, regions);
> +	long r;
> +	r = copy_from_user(&mem, m, size);
> +	if (r)
> +		return r;
> +	if (mem.padding)
> +		return -EOPNOTSUPP;
> +	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
> +		return -E2BIG;
> +	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
> +	if (!newmem)
> +		return -ENOMEM;
> +
> +	memcpy(newmem, &mem, size);
> +	r = copy_from_user(newmem->regions, m->regions,
> +			   mem.nregions * sizeof *m->regions);
> +	if (r) {
> +		kfree(newmem);
> +		return r;
> +	}
> +	oldmem = d->memory;
> +	rcu_assign_pointer(d->memory, newmem);
> +	synchronize_rcu();
> +	kfree(oldmem);
> +	return 0;
> +}
> +
> +static int init_used(struct vhost_virtqueue *vq)
> +{
> +	int r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		return r;
> +	return get_user(vq->last_used_idx, &vq->used->idx);
> +}
> +
> +static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
> +{
> +	struct file *eventfp, *filep = NULL,
> +		    *pollstart = NULL, *pollstop = NULL;
> +	struct eventfd_ctx *ctx = NULL;
> +	u32 __user *idxp = argp;
> +	struct vhost_virtqueue *vq;
> +	struct vhost_vring_state s;
> +	struct vhost_vring_file f;
> +	struct vhost_vring_addr a;
> +	u32 idx;
> +	long r;
> +
> +	r = get_user(idx, idxp);
> +	if (r < 0)
> +		return r;
> +	if (idx >= d->nvqs)
> +		return -ENOBUFS;
> +
> +	vq = d->vqs + idx;
> +
> +	mutex_lock(&vq->mutex);
> +
> +	switch (ioctl) {
> +	case VHOST_SET_VRING_NUM:
> +		r = copy_from_user(&s, argp, sizeof s);
> +		if (r < 0)
> +			break;
> +		if (s.num > 0xffff) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->num = s.num;
> +		break;
> +	case VHOST_SET_VRING_BASE:
> +		r = copy_from_user(&s, argp, sizeof s);
> +		if (r < 0)
> +			break;
> +		if (s.num > 0xffff) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->avail_idx = vq->last_avail_idx = s.num;
> +		break;
> +	case VHOST_GET_VRING_BASE:
> +		s.index = idx;
> +		s.num = vq->last_avail_idx;
> +		r = copy_to_user(argp, &s, sizeof s);
> +		break;
> +	case VHOST_SET_VRING_DESC:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->desc = (void __user *)(long)a.user_addr;
> +		break;
> +	case VHOST_SET_VRING_AVAIL:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->avail = (void __user *)(long)a.user_addr;
> +		/* Forget the cached index value. */
> +		vq->avail_idx = vq->last_avail_idx;
> +		break;
> +	case VHOST_SET_VRING_USED:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->used = (void __user *)(long)a.user_addr;
> +		r = init_used(vq);
> +		if (r)
> +			break;
> +		break;
> +	case VHOST_SET_VRING_KICK:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp))
> +			return PTR_ERR(eventfp);
> +		if (eventfp != vq->kick) {
> +			pollstop = filep = vq->kick;
> +			pollstart = vq->kick = eventfp;
> +		} else
> +			filep = eventfp;
> +		break;
> +	case VHOST_SET_VRING_CALL:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp))
> +			return PTR_ERR(eventfp);
> +		if (eventfp != vq->call) {
> +			filep = vq->call;
> +			ctx = vq->call_ctx;
> +			vq->call = eventfp;
> +			vq->call_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
> +	case VHOST_SET_VRING_ERR:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp))
> +			return PTR_ERR(eventfp);
> +		if (eventfp != vq->error) {
> +			filep = vq->error;
> +			vq->error = eventfp;
> +			ctx = vq->error_ctx;
> +			vq->error_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
> +	default:
> +		r = -ENOIOCTLCMD;
> +	}
> +
> +	if (pollstop && vq->handle_kick)
> +		vhost_poll_stop(&vq->poll);
> +
> +	if (ctx)
> +		eventfd_ctx_put(ctx);
> +	if (filep)
> +		fput(filep);
> +
> +	if (pollstart && vq->handle_kick)
> +		vhost_poll_start(&vq->poll, vq->kick);
> +
> +	mutex_unlock(&vq->mutex);
> +
> +	if (pollstop && vq->handle_kick)
> +		vhost_poll_flush(&vq->poll);
> +	return r;
> +}
> +
> +long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	long r;
> +
> +	mutex_lock(&d->mutex);
> +	/* If you are not the owner, you can become one */
> +	if (ioctl == VHOST_SET_OWNER) {
> +		r = vhost_dev_set_owner(d);
> +		goto done;
> +	}
> +
> +	/* You must be the owner to do anything else */
> +	r = vhost_dev_check_owner(d);
> +	if (r)
> +		goto done;
> +
> +	switch (ioctl) {
> +	case VHOST_SET_MEM_TABLE:
> +		r = vhost_set_memory(d, argp);
> +		break;
> +	default:
> +		r = vhost_set_vring(d, ioctl, argp);
> +		break;
> +	}
> +done:
> +	mutex_unlock(&d->mutex);
> +	return r;
> +}
> +
> +static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> +						     __u64 addr, __u32 len)
> +{
> +	struct vhost_memory_region *reg;
> +	int i;
> +	/* linear search is not brilliant, but we really have on the order of 6
> +	 * regions in practice */
> +	for (i = 0; i < mem->nregions; ++i) {
> +		reg = mem->regions + i;
> +		if (reg->guest_phys_addr <= addr &&
> +		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
> +			return reg;
> +	}
> +	return NULL;
> +}
> +
> +int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
> +		   struct iovec iov[], int iov_size)
> +{
> +	const struct vhost_memory_region *reg;
> +	struct vhost_memory *mem;
> +	struct iovec *_iov;
> +	u64 s = 0;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	mem = rcu_dereference(dev->memory);
> +	while ((u64)len > s) {
> +		u64 size;
> +		if (ret >= iov_size) {
> +			ret = -ENOBUFS;
> +			break;
> +		}
> +		reg = find_region(mem, addr, len);
> +		if (!reg) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		_iov = iov + ret;
> +		size = reg->memory_size - addr + reg->guest_phys_addr;
> +		_iov->iov_len = min((u64)len, size);
> +		_iov->iov_base = (void *)
> +			(reg->userspace_addr + addr - reg->guest_phys_addr);
> +		s += size;
> +		addr += size;
> +		++ret;
> +	}
> +
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/* Each buffer in the virtqueues is actually a chain of descriptors.  This
> + * function returns the next descriptor in the chain, or vq->num if we're
> + * at the end. */
> +static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
> +{
> +	unsigned int next;
> +
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(desc->flags & VRING_DESC_F_NEXT))
> +		return vq->num;
> +
> +	/* Check they're not leading us off end of descriptors. */
> +	next = desc->next;
> +	/* Make sure compiler knows to grab that: we don't want it changing! */
> +	/* We will use the result as an index in an array, so most
> +	 * architectures only need a compiler barrier here. */
> +	read_barrier_depends();
> +
> +	if (next >= vq->num) {
> +		vq_err(vq, "Desc next is %u > %u", next, vq->num);
> +		return vq->num;
> +	}
> +
> +	return next;
> +}
> +
> +/* This looks in the virtqueue for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which
> + * is never a valid descriptor number) if none was found. */
> +unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> +			   struct iovec iov[],
> +			   unsigned int *out_num, unsigned int *in_num)
> +{
> +	struct vring_desc desc;
> +	unsigned int i, head;
> +	u16 last_avail_idx;
> +	int ret;
> +
> +	/* Check it isn't doing very strange things with descriptor numbers. */
> +	last_avail_idx = vq->last_avail_idx;
> +	if (get_user(vq->avail_idx, &vq->avail->idx)) {
> +		vq_err(vq, "Failed to access avail idx at %p\n",
> +		       &vq->avail->idx);
> +		return vq->num;
> +	}
> +
> +	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
> +		vq_err(vq, "Guest moved used index from %u to %u",
> +		       last_avail_idx, vq->avail_idx);
> +		return vq->num;
> +	}
> +
> +	/* If there's nothing new since last we looked, return invalid. */
> +	if (vq->avail_idx == last_avail_idx)
> +		return vq->num;
> +
> +	/* Grab the next descriptor number they're advertising, and increment
> +	 * the index we've seen. */
> +	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
> +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> +		       last_avail_idx,
> +		       &vq->avail->ring[last_avail_idx % vq->num]);
> +		return vq->num;
> +	}
> +
> +	/* If their number is silly, that's an error. */
> +	if (head >= vq->num) {
> +		vq_err(vq, "Guest says index %u > %u is available",
> +		       head, vq->num);
> +		return vq->num;
> +	}
> +
> +	vq->last_avail_idx++;
> +
> +	/* When we start there are none of either input nor output. */
> +	*out_num = *in_num = 0;
> +
> +	i = head;
> +	do {
> +		unsigned iov_count = *in_num + *out_num;
> +		if (copy_from_user(&desc, vq->desc + i, sizeof desc)) {
> +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> +			       i, vq->desc + i);
> +			return vq->num;
> +		}
> +		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
> +				     VHOST_NET_MAX_SG - iov_count);
> +		if (ret < 0) {
> +			vq_err(vq, "Translation failure %d descriptor idx %d\n",
> +			       ret, i);
> +			return vq->num;
> +		}
> +		/* If this is an input descriptor, increment that count. */
> +		if (desc.flags & VRING_DESC_F_WRITE)
> +			*in_num += ret;
> +		else {
> +			/* If it's an output descriptor, they're all supposed
> +			 * to come before any input descriptors. */
> +			if (*in_num) {
> +				vq_err(vq, "Descriptor has out after in: "
> +				       "idx %d\n", i);
> +				return vq->num;
> +			}
> +			*out_num += ret;
> +		}
> +	} while ((i = next_desc(vq, &desc)) != vq->num);
> +	return head;
> +}
> +
> +/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> +void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
> +{
> +	vq->last_avail_idx--;
> +}
> +
> +/* After we've used one of their buffers, we tell them about it.  We'll then
> + * want to send them an interrupt, using vq->call. */
> +int vhost_add_used(struct vhost_virtqueue *vq,
> +			  unsigned int head, int len)
> +{
> +	struct vring_used_elem *used;
> +
> +	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
> +	 * next entry in that used ring. */
> +	used = &vq->used->ring[vq->last_used_idx % vq->num];
> +	if (put_user(head, &used->id)) {
> +		vq_err(vq, "Failed to write used id");
> +		return -EFAULT;
> +	}
> +	if (put_user(len, &used->len)) {
> +		vq_err(vq, "Failed to write used len");
> +		return -EFAULT;
> +	}
> +	/* Make sure buffer is written before we update index. */
> +	wmb();
> +	if (put_user(vq->last_used_idx + 1, &vq->used->idx)) {
> +		vq_err(vq, "Failed to increment used idx");
> +		return -EFAULT;
> +	}
> +	vq->last_used_idx++;
> +	return 0;
> +}
> +
> +/* This actually sends the interrupt for this virtqueue */
> +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> +	__u16 flags = 0;
> +	if (get_user(flags, &vq->avail->flags)) {
> +		vq_err(vq, "Failed to get flags");
> +		return;
> +	}
> +
> +	/* If they don't want an interrupt, don't send one, unless empty. */
> +	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
> +	    (!vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) ||
> +	     vq->avail_idx != vq->last_avail_idx))
> +		return;
> +
> +	/* Send the Guest an interrupt to tell them we used something up. */
> +	if (vq->call_ctx)
> +		eventfd_signal(vq->call_ctx, 1);
> +}
> +
> +/* And here's the combo meal deal.  Supersize me! */
> +void vhost_add_used_and_trigger(struct vhost_dev *dev,
> +				struct vhost_virtqueue *vq,
> +				unsigned int head, int len)
> +{
> +	vhost_add_used(vq, head, len);
> +	vhost_trigger_irq(dev, vq);
> +}
> +
> +/* OK, now we need to know about added descriptors. */
> +bool vhost_notify(struct vhost_virtqueue *vq)
> +{
> +	int r;
> +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> +		return false;
> +	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> +	r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		vq_err(vq, "Failed to enable notification: %d\n", r);
> +	/* They could have slipped one in as we were doing that: make
> +	 * sure it's written, tell caller it needs to check again. */
> +	mb();
> +	return true;
> +}
> +
> +/* We don't need to be notified again. */
> +void vhost_no_notify(struct vhost_virtqueue *vq)
> +{
> +	int r;
> +	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> +		return;
> +	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> +	r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		vq_err(vq, "Failed to disable notification: %d\n", r);
> +}
> +
> +int vhost_init(void)
> +{
> +	vhost_workqueue = create_workqueue("vhost");
> +	if (!vhost_workqueue)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +void vhost_cleanup(void)
> +{
> +	destroy_workqueue(vhost_workqueue);
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> new file mode 100644
> index 0000000..8e13d06
> --- /dev/null
> +++ b/drivers/vhost/vhost.h
> @@ -0,0 +1,122 @@
> +#ifndef _VHOST_H
> +#define _VHOST_H
> +
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/mm.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/skbuff.h>
> +#include <linux/uio.h>
> +#include <linux/virtio_config.h>
> +
> +struct vhost_dev;
> +
> +enum {
> +	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
> +};
> +
> +/* Poll a file (eventfd or socket) */
> +/* Note: there's nothing vhost specific about this structure. */
> +struct vhost_poll {
> +	poll_table                table;
> +	wait_queue_head_t        *wqh;
> +	wait_queue_t              wait;
> +	/* struct which will handle all actual work. */
> +	struct work_struct        work;
> +	unsigned long		  mask;
> +};
> +
> +void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> +		     unsigned long mask);
> +void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> +void vhost_poll_stop(struct vhost_poll *poll);
> +void vhost_poll_flush(struct vhost_poll *poll);
> +
> +/* The virtqueue structure describes a queue attached to a device. */
> +struct vhost_virtqueue {
> +	struct vhost_dev *dev;
> +
> +	/* The actual ring of buffers. */
> +	struct mutex mutex;
> +	unsigned int num;
> +	struct vring_desc __user *desc;
> +	struct vring_avail __user *avail;
> +	struct vring_used __user *used;
> +	struct file *kick;
> +	struct file *call;
> +	struct file *error;
> +	struct eventfd_ctx *call_ctx;
> +	struct eventfd_ctx *error_ctx;
> +
> +	struct vhost_poll poll;
> +
> +	/* The routine to call when the Guest pings us, or timeout. */
> +	work_func_t handle_kick;
> +
> +	/* Last available index we saw. */
> +	u16 last_avail_idx;
> +
> +	/* Caches available index value from user. */
> +	u16 avail_idx;
> +
> +	/* Last index we used. */
> +	u16 last_used_idx;
> +
> +	/* Used flags */
> +	u16 used_flags;
> +
> +	struct iovec iov[VHOST_NET_MAX_SG];
> +	struct iovec hdr[VHOST_NET_MAX_SG];
> +};
> +
> +struct vhost_dev {
> +	/* Readers use RCU to access memory table pointer.
> +	 * Writers use mutex below.*/
> +	struct vhost_memory *memory;
> +	struct mm_struct *mm;
> +	struct vhost_virtqueue *vqs;
> +	int nvqs;
> +	struct mutex mutex;
> +	unsigned acked_features;
> +};
> +
> +long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> +long vhost_dev_check_owner(struct vhost_dev *);
> +long vhost_dev_reset_owner(struct vhost_dev *);
> +void vhost_dev_cleanup(struct vhost_dev *);
> +long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
> +
> +unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
> +			   struct iovec iov[],
> +			   unsigned int *out_num, unsigned int *in_num);
> +void vhost_discard_vq_desc(struct vhost_virtqueue *);
> +
> +int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
> +void vhost_trigger_irq(struct vhost_dev *, struct vhost_virtqueue *);
> +void vhost_add_used_and_trigger(struct vhost_dev *, struct vhost_virtqueue *,
> +				unsigned int head, int len);
> +void vhost_no_notify(struct vhost_virtqueue *);
> +bool vhost_notify(struct vhost_virtqueue *);
> +
> +int vhost_init(void);
> +void vhost_cleanup(void);
> +
> +#define vq_err(vq, fmt, ...) do {                                  \
> +		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> +		if ((vq)->error_ctx)                               \
> +				eventfd_signal((vq)->error_ctx, 1);\
> +	} while (0)
> +
> +enum {
> +	VHOST_FEATURES = 1 << VIRTIO_F_NOTIFY_ON_EMPTY,
> +};
> +
> +static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> +{
> +	return dev->acked_features & (1 << bit);
> +}
> +
> +#endif
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index dec2f18..975df9a 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -360,6 +360,7 @@ unifdef-y += uio.h
>  unifdef-y += unistd.h
>  unifdef-y += usbdevice_fs.h
>  unifdef-y += utsname.h
> +unifdef-y += vhost.h
>  unifdef-y += videodev2.h
>  unifdef-y += videodev.h
>  unifdef-y += virtio_config.h
> diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> index 0521177..781a8bb 100644
> --- a/include/linux/miscdevice.h
> +++ b/include/linux/miscdevice.h
> @@ -30,6 +30,7 @@
>  #define HPET_MINOR		228
>  #define FUSE_MINOR		229
>  #define KVM_MINOR		232
> +#define VHOST_NET_MINOR		233
>  #define MISC_DYNAMIC_MINOR	255
>  
>  struct device;
> diff --git a/include/linux/vhost.h b/include/linux/vhost.h
> new file mode 100644
> index 0000000..3f441a9
> --- /dev/null
> +++ b/include/linux/vhost.h
> @@ -0,0 +1,101 @@
> +#ifndef _LINUX_VHOST_H
> +#define _LINUX_VHOST_H
> +/* Userspace interface for in-kernel virtio accelerators. */
> +
> +/* vhost is used to reduce the number of system calls involved in virtio.
> + *
> + * Existing virtio net code is used in the guest without modification.
> + *
> + * This header includes interface used by userspace hypervisor for
> + * device configuration.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/compiler.h>
> +#include <linux/ioctl.h>
> +#include <linux/virtio_config.h>
> +#include <linux/virtio_ring.h>
> +
> +struct vhost_vring_state {
> +	unsigned int index;
> +	unsigned int num;
> +};
> +
> +struct vhost_vring_file {
> +	unsigned int index;
> +	int fd;
> +};
> +
> +struct vhost_vring_addr {
> +	unsigned int index;
> +	unsigned int padding;
> +	__u64 user_addr;
> +};
> +
> +struct vhost_memory_region {
> +	__u64 guest_phys_addr;
> +	__u64 memory_size; /* bytes */
> +	__u64 userspace_addr;
> +	__u64 padding; /* read/write protection? */
> +};
> +
> +struct vhost_memory {
> +	__u32 nregions;
> +	__u32 padding;
> +	struct vhost_memory_region regions[0];
> +};
> +
> +/* ioctls */
> +
> +#define VHOST_VIRTIO 0xAF
> +
> +/* Features bitmask for forward compatibility.  Transport bits are used for
> + * vhost specific features. */
> +#define VHOST_GET_FEATURES	_IOR(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_ACK_FEATURES	_IOW(VHOST_VIRTIO, 0x00, __u64)
> +
> +/* Set current process as the (exclusive) owner of this file descriptor.  This
> + * must be called before any other vhost command.  Further calls to
> + * VHOST_SET_OWNER fail until VHOST_RESET_OWNER is called. */
> +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
> +/* Give up ownership, and reset the device to default values.
> + * Allows subsequent call to VHOST_SET_OWNER to succeed. */
> +#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> +
> +/* Set up/modify memory layout */
> +#define VHOST_SET_MEM_TABLE	_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> +
> +/* Ring setup. These parameters can not be modified while ring is running
> + * (bound to a device). */
> +/* Set number of descriptors in ring */
> +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> +/* Start of array of descriptors (virtually contiguous) */
> +#define VHOST_SET_VRING_DESC _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> +/* Used structure address */
> +#define VHOST_SET_VRING_USED _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_addr)
> +/* Available structure address */
> +#define VHOST_SET_VRING_AVAIL _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_addr)
> +/* Base value where queue looks for available descriptors */
> +#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* Get accessor: reads index, writes value in num */
> +#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +
> +/* The following ioctls use eventfd file descriptors to signal and poll
> + * for events. */
> +
> +/* Set eventfd to poll for added buffers */
> +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> +/* Set eventfd to signal when buffers have been used */
> +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> +/* Set eventfd to signal an error */
> +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> +
> +/* VHOST_NET specific defines */
> +
> +/* Attach virtio net device to a raw socket. The socket must be already
> + * bound to an ethernet device, this device will be used for transmit.
> + * Pass -1 to unbind from the socket and the transmit device.
> + * This can be used to stop the device (e.g. for migration). */
> +#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, int)
> +
> +#endif
> -- 
> 1.6.2.5
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
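
The quoted include/linux/vhost.h reuses command number 0x00 for
VHOST_GET_FEATURES and VHOST_ACK_FEATURES, and 0x14 for
VHOST_SET_VRING_BASE and VHOST_GET_VRING_BASE. A minimal userspace
sketch of why such pairs still yield distinct request codes, assuming a
Linux system where <sys/ioctl.h> provides the _IOR/_IOW/_IOWR and _IOC_*
macros. Every DEMO_/demo_ name is a hypothetical copy made for
illustration, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical local copies of the definitions quoted above. */
#define DEMO_VHOST_VIRTIO 0xAF

struct demo_vring_state {
	unsigned int index;
	unsigned int num;
};

#define DEMO_GET_FEATURES   _IOR(DEMO_VHOST_VIRTIO, 0x00, uint64_t)
#define DEMO_ACK_FEATURES   _IOW(DEMO_VHOST_VIRTIO, 0x00, uint64_t)
#define DEMO_SET_VRING_BASE _IOW(DEMO_VHOST_VIRTIO, 0x14, struct demo_vring_state)
#define DEMO_GET_VRING_BASE _IOWR(DEMO_VHOST_VIRTIO, 0x14, struct demo_vring_state)

/*
 * Returns 1 when two request codes share the same type and command
 * number but differ in their direction bits, and are therefore
 * different values as seen by an ioctl handler's switch statement.
 */
int demo_same_nr_distinct_request(unsigned long a, unsigned long b)
{
	return _IOC_TYPE(a) == _IOC_TYPE(b) &&
	       _IOC_NR(a) == _IOC_NR(b) &&
	       _IOC_DIR(a) != _IOC_DIR(b) &&
	       a != b;
}
```

The direction (and size) fields are encoded into the request value, so
sharing a command number is not a collision by itself, although separate
numbers are arguably easier to read.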

^ permalink raw reply

* Re: [PATCH] slub: fix slab_pad_check()
From: Pekka Enberg @ 2009-09-03 19:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck
In-Reply-To: <4AA00400.1030005@gmail.com>

On Thu, Sep 3, 2009 at 8:59 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
> Christoph Lameter wrote:
>> On Thu, 3 Sep 2009, Eric Dumazet wrote:
>>
>>> Point is we cannot deal with RCU quietness before disposing the slab cache,
>>> (if SLAB_DESTROY_BY_RCU was set on the cache) since this disposing *will*
>>> make call_rcu() calls when a full slab is freed/purged.
>>
>> There is no need to do call_rcu calls for frees at that point since
>> objects are no longer in use. We could simply disable SLAB_DESTROY_BY_RCU
>> for the final clearing of caches.
>>
>>> And when RCU grace period is elapsed, the callback *will* need access to
>>> the cache we want to dismantle. Better to not have kfreed()/poisoned it...
>>
>> But going through the RCU period is pointless since no user of the cache
>> remains.
>>
>>> I believe you mix two RCU uses here.
>>>
>>> 1) The one we all know, is use normal caches (!SLAB_DESTROY_BY_RCU)
>>> (or kmalloc()), and use call_rcu(... kfree_something)
>>>
>>>    In this case, you are 100% right that the subsystem itself has
>>>    to call rcu_barrier() (or respect whatever self-synchro) itself,
>>>    before calling kmem_cache_destroy()
>>>
>>> 2) The SLAB_DESTROY_BY_RCU one.
>>>
>>>    Part of cache dismantle needs to call rcu_barrier() itself.
>>>    Caller doesnt have to use rcu_barrier(). It would be a waste of time,
>>>    as kmem_cache_destroy() will refill rcu wait queues with its own stuff.
>>
>> The dismantling does not need RCU since there are no operations on the
>> objects in progress. So simply switch DESTROY_BY_RCU off for close.
>>
>>
>> ---
>>  mm/slub.c |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Index: linux-2.6/mm/slub.c
>> ===================================================================
>> --- linux-2.6.orig/mm/slub.c  2009-09-03 10:14:51.000000000 -0500
>> +++ linux-2.6/mm/slub.c       2009-09-03 10:18:32.000000000 -0500
>> @@ -2594,9 +2594,9 @@ static inline int kmem_cache_close(struc
>>   */
>>  void kmem_cache_destroy(struct kmem_cache *s)
>>  {
>> -     if (s->flags & SLAB_DESTROY_BY_RCU)
>> -             rcu_barrier();
>>       down_write(&slub_lock);
>> +     /* Stop deferring frees so that we can immediately free structures */
>> +     s->flags &= ~SLAB_DESTROY_BY_RCU;
>>       s->refcount--;
>>       if (!s->refcount) {
>>               list_del(&s->list);
>
> It seems very smart, but needs review of all callers to make sure no slabs
> are waiting for final freeing in call_rcu queue on some cpu.
>
> I suspect most of them will then have to use rcu_barrier() before calling
> kmem_cache_destroy(), so why not factorizing code in one place ?

[snip]

Can someone please explain what's the upside in Christoph's approach?
Performance? Correctness? Something else entirely? We're looking at a
tested bug fix here and I don't understand why I shouldn't just go
ahead and merge it. Hmm?

                        Pekka

^ permalink raw reply

* Re: Network hangs with 2.6.30.5
From: Holger Hoffstaette @ 2009-09-03 19:20 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20090903074610.GA6000@ff.dom.local>

Problem found! At least for me..

On Thu, 03 Sep 2009 07:46:10 +0000, Jarek Poplawski wrote:

> On 01-09-2009 17:32, Holger Hoffstaette wrote:
>> On Tue, 01 Sep 2009 16:17:08 +0200, Holger Hoffstaette wrote:
>> 
>> [network regressions in .30]
>> 
>>> I do have an older Intel Gbit card identified thusly: 00:0b.0 Ethernet
>>> controller: Intel Corporation 82545GM Gigabit Ethernet Controller (rev
>>> 04)
>>>
>>> and enabled all sorts of offloading:
>>>
>>> $ethtool -k eth0
>>> Offload parameters for eth0:
>>> rx-checksumming: on
>>> tx-checksumming: on
>>> scatter-gather: on
>>> tcp segmentation offload: on
>>> udp fragmentation offload: off
>>> generic segmentation offload: on
>>>
>>> Maybe that is the culprit, as Eric Dumazet suspected in his mail..I
>>> will try the latest .30 stable again without that, but in any case
>>> something is indeed very broken in there.
>> 
>> So I just tried .30.5 again. Indeed the offloading seems to play a role:
>> with everything enabled I cannot even reliably ssh into the machine
>> (only "sometimes"?); however without any offloading things get "a bit
>> better" and squid even serves up some pages..for a while. Then it seems
>> to hang, swallow requests or not finish them. The tested sites reliably
>> work for the Windows client when it bypasses squid, as does DNS (also
>> served from the box). It *seems* to affect incoming traffic more than
>> outgoing - e.g. mail or news polling seemed to kick off and finish just
>> fine. Rebooting back into .29 fixes everything. Last time I tried
>> .31rc-something (4 IIRC) it exhibited the same problems.
>> 
>> I'm open to suggestions and willing to help fix this but need this
>> machine for actual work. :/
> 
> It seems, you and Clifford, use e1000 so it would be interesting to find
> out if it matters. Does your friend with working .30 use another card? If
> you can't try with another NIC, we could probably try to revert most of
> the driver's changes after .29 (except maybe 3) to check this driver only.
> 
> Clifford, if it still doesn't work for you, could you try 2.6.29?

I got the git .30.y stable tree and reverted various e1000 commits that
seemed to coincide with the various .30-rc releases but nothing helped.
Also no relation to offloads etc.

However I did notice that the "stuck squid" problem seemed to magically
fix itself after a few seconds - then hang again, fix itself after
timeouts etc. So I suspected something TCP related and BINGO!

Turns out I had both tcp_tw_recycle and tcp_tw_reuse set to 1 for reasons
I don't want to explain. :)

I can now arbitrarily fix the hanging behaviour by setting
tcp_tw_recycle to 0, and cause hangs by setting it to 1 again. For obvious
reasons this seems to affect squid more than other tasks with more long-lived
connections. What is the right behaviour? Beats me.

tcp_tw_reuse does not appear to play a role, so the real culprit at least
in my case seems to be tcp_tw_recycle. In previous releases this (and
tw_reuse) was necessary for various server tasks.
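
A minimal sketch in C of checking the two knobs programmatically,
assuming they live under /proc/sys/net/ipv4/ on the running kernel (a
kernel without one of them simply reports it as unreadable). The helper
names read_sysctl_long and print_tw_settings are made up for
illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read one integer value from a /proc/sys style file.
 * Returns 0 and stores the value in *out on success; returns -1 and leaves
 * *out untouched if the file is missing, unreadable or not a number.
 */
int read_sysctl_long(const char *path, long *out)
{
	char buf[64];
	char *end;
	long val;
	FILE *f;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;

	*out = val;
	return 0;
}

/* Print the two TIME-WAIT settings discussed above. */
void print_tw_settings(void)
{
	static const char *const paths[] = {
		"/proc/sys/net/ipv4/tcp_tw_recycle",
		"/proc/sys/net/ipv4/tcp_tw_reuse",
	};
	size_t i;

	for (i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
		long v;

		if (read_sysctl_long(paths[i], &v) == 0)
			printf("%s = %ld\n", paths[i], v);
		else
			printf("%s: not readable\n", paths[i]);
	}
}
```

Calling print_tw_settings() before and after toggling tcp_tw_recycle
makes it easy to record which combination was active during a hang.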

Nevertheless, something has changed between .29 and .30 that "broke" the
previous behaviour. Whether this is progress or a regression I cannot
say. Maybe someone else has an idea?

Holger

^ permalink raw reply

* Re: [PATCH] slub: fix slab_pad_check()
From: Christoph Lameter @ 2009-09-03 19:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Pekka Enberg, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck
In-Reply-To: <4A9FDA72.8060001@gmail.com>

On Thu, 3 Sep 2009, Eric Dumazet wrote:

> Point is we cannot deal with RCU quietness before disposing the slab cache,
> (if SLAB_DESTROY_BY_RCU was set on the cache) since this disposing *will*
> make call_rcu() calls when a full slab is freed/purged.

There is no need to do call_rcu calls for frees at that point since
objects are no longer in use. We could simply disable SLAB_DESTROY_BY_RCU
for the final clearing of caches.

> And when RCU grace period is elapsed, the callback *will* need access to
> the cache we want to dismantle. Better to not have kfreed()/poisoned it...

But going through the RCU period is pointless since no user of the cache
remains.

> I believe you mix two RCU uses here.
>
> 1) The one we all know, is use normal caches (!SLAB_DESTROY_BY_RCU)
> (or kmalloc()), and use call_rcu(... kfree_something)
>
>    In this case, you are 100% right that the subsystem itself has
>    to call rcu_barrier() (or respect whatever self-synchro) itself,
>    before calling kmem_cache_destroy()
>
> 2) The SLAB_DESTROY_BY_RCU one.
>
>    Part of cache dismantle needs to call rcu_barrier() itself.
>    Caller doesnt have to use rcu_barrier(). It would be a waste of time,
>    as kmem_cache_destroy() will refill rcu wait queues with its own stuff.

The dismantling does not need RCU since there are no operations on the
objects in progress. So simply switch DESTROY_BY_RCU off for close.


---
 mm/slub.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-03 10:14:51.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-03 10:18:32.000000000 -0500
@@ -2594,9 +2594,9 @@ static inline int kmem_cache_close(struc
  */
 void kmem_cache_destroy(struct kmem_cache *s)
 {
-	if (s->flags & SLAB_DESTROY_BY_RCU)
-		rcu_barrier();
 	down_write(&slub_lock);
+	/* Stop deferring frees so that we can immediately free structures */
+	s->flags &= ~SLAB_DESTROY_BY_RCU;
 	s->refcount--;
 	if (!s->refcount) {
 		list_del(&s->list);

^ permalink raw reply

* Re: Network hangs with 2.6.30.5
From: Eric Dumazet @ 2009-09-03 19:27 UTC (permalink / raw)
  To: Holger Hoffstaette; +Cc: netdev
In-Reply-To: <pan.2009.09.03.19.20.44.736875@googlemail.com>

Holger Hoffstaette wrote:
> Problem found! At least for me..
> 
> On Thu, 03 Sep 2009 07:46:10 +0000, Jarek Poplawski wrote:
> 
>> On 01-09-2009 17:32, Holger Hoffstaette wrote:
>>> On Tue, 01 Sep 2009 16:17:08 +0200, Holger Hoffstaette wrote:
>>>
>>> [network regressions in .30]
>>>
>>>> I do have an older Intel Gbit card identified thusly: 00:0b.0 Ethernet
>>>> controller: Intel Corporation 82545GM Gigabit Ethernet Controller (rev
>>>> 04)
>>>>
>>>> and enabled all sorts of offloading:
>>>>
>>>> $ethtool -k eth0
>>>> Offload parameters for eth0:
>>>> rx-checksumming: on
>>>> tx-checksumming: on
>>>> scatter-gather: on
>>>> tcp segmentation offload: on
>>>> udp fragmentation offload: off
>>>> generic segmentation offload: on
>>>>
>>>> Maybe that is the culprit, as Eric Dumazet suspected in his mail..I
>>>> will try the latest .30 stable again without that, but in any case
>>>> something is indeed very broken in there.
>>> So I just tried .30.5 again. Indeed the offloading seems to play a role:
>>> with everything enabled I cannot even reliably ssh into the machine
>>> (only "sometimes"?); however without any offloading things get "a bit
>>> better" and squid even serves up some pages..for a while. Then it seems
>>> to hang, swallow requests or not finish them. The tested sites reliably
>>> work for the Windows client when it bypasses squid, as does DNS (also
>>> served from the box). It *seems* to affect incoming traffic more than
>>> outgoing - e.g. mail or news polling seemed to kick off and finish just
>>> fine. Rebooting back into .29 fixes everything. Last time I tried
>>> .31rc-something (4 IIRC) it exhibited the same problems.
>>>
>>> I'm open to suggestions and willing to help fix this but need this
>>> machine for actual work. :/
>> It seems, you and Clifford, use e1000 so it would be interesting to find
>> out if it matters. Does your friend with working .30 use another card? If
>> you can't try with another NIC, we could probably try to revert most of
>> the driver's changes after .29 (except maybe 3) to check this driver only.
>>
>> Clifford, if it still doesn't work for you, could you try 2.6.29?
> 
> I got the git .30.y stable tree and reverted various e1000 commits that
> seemed to coincide with the various .30-rc releases but nothing helped.
> Also no relation to offloads etc.
> 
> However I did notice that the "stuck squid" problem seemed to magically
> fix itself after a few seconds - then hang again, fix itself after
> timeouts etc. So I suspected something TCP related and BINGO!
> 
> Turns out I had both tcp_tw_recycle and tcp_tw_reuse set to 1 for reasons
> I don't want to explain. :)
> 
> I can now arbitrarily fix the hanging behaviour by setting
> tcp_tw_recycle to 0, and cause hangs by setting it to 1 again. For obvious
> reasons this seems to affect squid more than other tasks with more long-lived
> connections. What is the right behaviour? beats me.
> 
> tcp_tw_reuse does not appear to play a role, so the real culprit at least
> in my case seems to be tcp_tw_recycle. In previous releases this (and
> tw_reuse) was necessary for various server tasks.
> 
> Nevertheless, something has changed between .29 and .30 that "broke" the
> previous behaviour. Whether this is progress or a regression I cannot
> say. Maybe someone else has an idea?
> 

Well... not yet :)

We probably can reproduce this problem with any NIC...

Could you send the output of this from the 'buggy' setup:

$ grep . /proc/sys/net/ipv4/*


When you say squid is stuck, does it mean it doesn't accept new connections?

It could help to strace it and check what it is doing.

^ permalink raw reply

* Re: [PATCH] slub: fix slab_pad_check()
From: Pekka Enberg @ 2009-09-03 19:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck
In-Reply-To: <4A9FCDC6.3060003@gmail.com>

On Thu, Sep 3, 2009 at 5:08 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
> Sure, here is the poison thing
>
> [PATCH] slub: fix slab_pad_check()
>
> When SLAB_POISON is used and slab_pad_check() finds an overwrite of the
> slab padding, we call restore_bytes() on the whole slab, not only
> on the padding.
>
> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks!

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Jarek Poplawski @ 2009-09-03 19:35 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: eric.dumazet, netdev, David Miller, Patrick McHardy
In-Reply-To: <alpine.DEB.1.10.0909031302190.27903@V090114053VZO-1>

On Thu, Sep 03, 2009 at 01:03:37PM -0500, Christoph Lameter wrote:
> On Wed, 2 Sep 2009, Jarek Poplawski wrote:
> 
> > Btw, Patrick's comments reminded me this is probably not what you want
> > in case of non-default qdiscs: the root qdisc like prio will be
> > repeated for each tx queue with the same stats. I guess you need to do
> > here an additional query e.g. by comparing dev_queue->qdisc_sleeping
> > with that of i = 0.
> 
> Hmmm.. Maybe I better leave it to the experts then.
> 

This is certainly not what I wanted... The experts missed this problem
long enough. It might only need to wait a bit for Patrick's changes.
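
The check suggested in the quoted text (compare each queue's
qdisc_sleeping with that of queue 0) can be sketched in plain userspace C.
This is only a toy model: struct toy_qdisc, struct toy_txq and
toy_count_reported are hypothetical stand-ins, not the kernel's
struct Qdisc or struct netdev_queue:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the pointer matters. */
struct toy_qdisc {
	int handle;
};

struct toy_txq {
	struct toy_qdisc *qdisc_sleeping;
};

/*
 * Count the queues whose qdisc would be printed: queue 0 always, and a
 * later queue only if its qdisc differs from the one attached to queue 0.
 * A shared root qdisc (e.g. prio) is therefore reported once instead of
 * once per tx queue.
 */
size_t toy_count_reported(const struct toy_txq *txq, size_t nqueues)
{
	size_t reported = 0;
	size_t i;

	assert(txq != NULL || nqueues == 0);

	for (i = 0; i < nqueues; i++) {
		if (i > 0 && txq[i].qdisc_sleeping == txq[0].qdisc_sleeping)
			continue;	/* same root qdisc as queue 0, skip */
		reported++;
	}
	return reported;
}
```

With one root qdisc shared by four queues this reports a single entry;
with a separate qdisc per queue it reports all of them.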

Jarek P.

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Eric Dumazet @ 2009-09-03 19:38 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Christoph Lameter, netdev, David Miller, Patrick McHardy
In-Reply-To: <20090903193555.GA3138@ami.dom.local>

Jarek Poplawski wrote:
> On Thu, Sep 03, 2009 at 01:03:37PM -0500, Christoph Lameter wrote:
>> On Wed, 2 Sep 2009, Jarek Poplawski wrote:
>>
>>> Btw, Patrick's comments reminded me this is probably not what you want
>>> in case of non-default qdiscs: the root qdisc like prio will be
>>> repeated for each tx queue with the same stats. I guess you need to do
>>> here an additional query e.g. by comparing dev_queue->qdisc_sleeping
>>> with that of i = 0.
>> Hmmm.. Maybe I better leave it to the experts then.
>>
> 
> This is certainly not what I wanted... The experts missed this problem
> long enough. It might only need to wait a bit for Patrick's changes.

The experts ? Anyway, you can say "the expert", there is only one :)

^ permalink raw reply

* Re: [PATCH] slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU
From: Pekka Enberg @ 2009-09-03 19:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck,
	stable@kernel.org
In-Reply-To: <4A9FD047.9000002@gmail.com>

On Thu, Sep 3, 2009 at 5:18 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
> Here is the second patch (RCU thing). Stable candidate
>
> [PATCH] slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU
>
> kmem_cache_destroy() should call rcu_barrier() *after* kmem_cache_close()
> and *before* sysfs_slab_remove() or risk rcu_free_slab()
> being called after kmem_cache is deleted (kfreed).
>
> rmmod nf_conntrack can crash the machine because it has to
> kmem_cache_destroy() a SLAB_DESTROY_BY_RCU enabled cache.

Do we have a bugzilla URL for this?

> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

OK, this is in for-next now and queued for 2.6.31. If you guys want to
fix this in a different way, let's do that in 2.6.32.

                        Pekka

^ permalink raw reply

* Re: [PATCH 2/3] IPVS: make friends with nf_conntrack
From: Julian Anastasov @ 2009-09-03 19:50 UTC (permalink / raw)
  To: Hannes Eder
  Cc: lvs-devel, linux-kernel, netdev, netfilter-devel,
	Fabien Duchêne, Jan Engelhardt, Jean-Luc Fortemaison,
	Julius Volz, Laurent Grawet, Patrick McHardy, Simon Horman,
	Wensong Zhang
In-Reply-To: <20090902101538.11561.11911.stgit@jazzy.zrh.corp.google.com>


	Hello,

On Wed, 2 Sep 2009, Hannes Eder wrote:

> Update the nf_conntrack tuple in reply direction, as we will see
> traffic from the real server (RIP) to the client (CIP).  Once this is
> done we can use netfilters SNAT in POSTROUTING, especially with
> xt_ipvs, to do source NAT, e.g.:
> 
> % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 8080 \
> > -j SNAT --to-source 192.168.10.10
> 
> Signed-off-by: Hannes Eder <heder@google.com>
> ---

	The following changes in ip_vs_core.c may break normal
ip_vs_ftp users. Somehow you decided that this POST_ROUTING code is not
needed and deleted it. This code should be present by default.

	From http://www.ssi.bg/~ja/LVS.txt:

===
	Now after  many changes in  latest kernels  I'm not sure
	what  happens if  netfilter sees  IPVS traffic  in POST_ROUTING.
	Such  change require  testing of  ip_vs_ftp in  both passive and
	active  LVS-NAT mode,  with different length  of IP address:port
	representation  in FTP  commands, to check  if resulting packets
	survive double NAT when payload size is changed.  It is the best
	test  for  IPVS to  see  if netfilter  additionally  changes FTP
	packets  leading to  wrong payload.	
===

	So, you have to check the ip_vs_ftp case because double
NAT for IPs and Ports usually works but double changing of SEQs
and payload may not.

	You can also check NFCT for IPVS (http://www.ssi.bg/~ja/nfct/)
for using netfilter functions and structures (ip_vs_nfct.c)

	most recent rediff:
http://www.ssi.bg/~ja/nfct/ipvs-nfct-2.6.28-1.diff

> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index b227750..27bd002 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -521,26 +521,6 @@ int ip_vs_leave(struct ip_vs_service *svc, struct sk_buff *skb,
>  	return NF_DROP;
>  }
>  
> -
> -/*
> - *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
> - *      chain, and is used for VS/NAT.
> - *      It detects packets for VS/NAT connections and sends the packets
> - *      immediately. This can avoid that iptable_nat mangles the packets
> - *      for VS/NAT.
> - */
> -static unsigned int ip_vs_post_routing(unsigned int hooknum,
> -				       struct sk_buff *skb,
> -				       const struct net_device *in,
> -				       const struct net_device *out,
> -				       int (*okfn)(struct sk_buff *))
> -{
> -	if (!skb->ipvs_property)
> -		return NF_ACCEPT;
> -	/* The packet was sent from IPVS, exit this chain */
> -	return NF_STOP;
> -}
> -
>  __sum16 ip_vs_checksum_complete(struct sk_buff *skb, int offset)
>  {
>  	return csum_fold(skb_checksum(skb, offset, skb->len - offset, 0));
> @@ -1431,14 +1411,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
>  		.hooknum        = NF_INET_FORWARD,
>  		.priority       = 99,
>  	},
> -	/* Before the netfilter connection tracking, exit from POST_ROUTING */
> -	{
> -		.hook		= ip_vs_post_routing,
> -		.owner		= THIS_MODULE,
> -		.pf		= PF_INET,
> -		.hooknum        = NF_INET_POST_ROUTING,
> -		.priority       = NF_IP_PRI_NAT_SRC-1,
> -	},
>  #ifdef CONFIG_IP_VS_IPV6
>  	/* After packet filtering, forward packet through VS/DR, VS/TUN,
>  	 * or VS/NAT(change destination), so that filtering rules can be
> @@ -1467,14 +1439,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
>  		.hooknum        = NF_INET_FORWARD,
>  		.priority       = 99,
>  	},
> -	/* Before the netfilter connection tracking, exit from POST_ROUTING */
> -	{
> -		.hook		= ip_vs_post_routing,
> -		.owner		= THIS_MODULE,
> -		.pf		= PF_INET6,
> -		.hooknum        = NF_INET_POST_ROUTING,
> -		.priority       = NF_IP6_PRI_NAT_SRC-1,
> -	},
>  #endif
>  };

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Jarek Poplawski @ 2009-09-03 19:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Lameter, netdev, David Miller, Patrick McHardy
In-Reply-To: <4AA01B4E.9060501@gmail.com>

On Thu, Sep 03, 2009 at 09:38:54PM +0200, Eric Dumazet wrote:
> Jarek Poplawski wrote:
> > On Thu, Sep 03, 2009 at 01:03:37PM -0500, Christoph Lameter wrote:
> >> On Wed, 2 Sep 2009, Jarek Poplawski wrote:
> >>
> >>> Btw, Patrick's comments reminded me this is probably not what you want
> >>> in case of non-default qdiscs: the root qdisc like prio will be
> >>> repeated for each tx queue with the same stats. I guess you need to do
> >>> here an additional query e.g. by comparing dev_queue->qdisc_sleeping
> >>> with that of i = 0.
> >> Hmmm.. Maybe I better leave it to the experts then.
> >>
> > 
> > This is certainly not what I wanted... The experts missed this problem
> > long enough. It might only need to wait a bit for Patrick's changes.
> 
> The experts ? Anyway, you can say "the expert", there is only one :)

Right! I admire Davem too... ;-)

Jarek P.

^ permalink raw reply

* Re: [PATCH] slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU
From: Eric Dumazet @ 2009-09-03 19:56 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Zdenek Kabelac, Patrick McHardy, Robin Holt,
	Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers, paulmck,
	stable@kernel.org
In-Reply-To: <84144f020909031248s56c16205j7992930c413b1bbe@mail.gmail.com>

Pekka Enberg wrote:
> On Thu, Sep 3, 2009 at 5:18 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
>> Here is the second patch (RCU thing). Stable candidate
>>
>> [PATCH] slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU
>>
>> kmem_cache_destroy() should call rcu_barrier() *after* kmem_cache_close()
>> and *before* sysfs_slab_remove() or risk rcu_free_slab()
>> being called after kmem_cache is deleted (kfreed).
>>
>> rmmod nf_conntrack can crash the machine because it has to
>> kmem_cache_destroy() a SLAB_DESTROY_BY_RCU enabled cache.
> 
> Do we have a bugzilla URL for this?

Well, I can crash my 2.6.30.5 machine just doing rmmod nf_conntrack

(You'll need CONFIG_SLUB_DEBUG_ON or equivalent)

Original Zdenek report : http://thread.gmane.org/gmane.linux.kernel/876016/focus=876086
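
The ordering problem from the patch description can be modelled in
userspace. This is a toy sketch only: toy_call_rcu, toy_rcu_barrier and
toy_destroy are hypothetical stand-ins for call_rcu(), rcu_barrier() and
kmem_cache_destroy(), and the "kfree" of the cache is simulated with a
flag so the demo itself never touches freed memory:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct kmem_cache. */
struct toy_cache {
	int alive;		/* cleared when the descriptor is "kfreed" */
	int slabs_freed;	/* slabs released by deferred callbacks */
};

/* Toy stand-in for a callback queued with call_rcu(). */
struct toy_cb {
	struct toy_cache *cache;
	struct toy_cb *next;
};

static struct toy_cb *pending;	/* callbacks waiting for a grace period */
static int use_after_free;	/* callbacks that ran on a dead cache */

static void toy_call_rcu(struct toy_cache *c)
{
	struct toy_cb *cb = malloc(sizeof(*cb));

	if (!cb)
		abort();
	cb->cache = c;
	cb->next = pending;
	pending = cb;
}

/* Run every pending callback, as rcu_barrier() waits for them to finish. */
static void toy_rcu_barrier(void)
{
	while (pending) {
		struct toy_cb *cb = pending;

		pending = cb->next;
		if (cb->cache->alive)
			cb->cache->slabs_freed++;
		else
			use_after_free++;
		free(cb);
	}
}

/*
 * Destroy a toy cache holding nslabs slabs. With barrier_after_close == 0
 * the barrier runs before the close step (the old order); with 1 it runs
 * after the close step and before the descriptor is released.
 * Returns the number of callbacks that saw an already released cache.
 */
int toy_destroy(int nslabs, int barrier_after_close)
{
	struct toy_cache cache = { .alive = 1, .slabs_freed = 0 };
	int i;

	assert(nslabs >= 0);
	use_after_free = 0;

	if (!barrier_after_close)
		toy_rcu_barrier();	/* nothing queued yet, so a no-op */

	for (i = 0; i < nslabs; i++)	/* close step queues deferred frees */
		toy_call_rcu(&cache);

	if (barrier_after_close)
		toy_rcu_barrier();

	cache.alive = 0;		/* descriptor is "kfreed" here */
	toy_rcu_barrier();		/* the grace period eventually ends */
	return use_after_free;
}
```

In this model toy_destroy(3, 0) counts three callbacks running against a
released cache, while toy_destroy(3, 1) counts none.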

> 
>> Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> OK, this is in for-next now and queued for 2.6.31. If you guys want to
> fix this in a different way, lets do that in 2.6.32.

Seems the right thing IMHO

^ permalink raw reply
