Netdev List


* Re: [PATCH] net: make ctl_path local and const
From: Joe Perches @ 2010-10-20  3:28 UTC (permalink / raw)
  To: Changli Gao
  Cc: Andy Grover, James Morris, linux-sctp, rds-devel,
	Pekka Savola (ipv6), linux-x25, dccp, bridge, Andrew, coreteam,
	Arnaldo Carvalho de Melo, Alexey Kuznetsov, Joerg Reuter,
	Sridhar Samudrala, Samuel Ortiz, Vlad Yasevich, netfilter,
	Remi Denis-Courmont, linux-hams, Hideaki YOSHIFUJI, netdev,
	linux-decnet-user, linux-kernel
In-Reply-To: <AANLkTi=pz4zYnnUSvb9FjVvCAURruHhJTz7pBKDi8Pdw@mail.gmail.com>

On Wed, 2010-10-20 at 11:10 +0800, Changli Gao wrote:
> On Wed, Oct 20, 2010 at 11:01 AM, Joe Perches <joe@perches.com> wrote:
> > On Wed, 2010-10-20 at 10:54 +0800, Changli Gao wrote:
> >> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> > []
> >> diff --git a/net/appletalk/sysctl_net_atalk.c b/net/appletalk/sysctl_net_atalk.c
> >> index 04e9c0d..b92f269 100644
> >> --- a/net/appletalk/sysctl_net_atalk.c
> >> +++ b/net/appletalk/sysctl_net_atalk.c
> >> @@ -42,16 +42,16 @@ static struct ctl_table atalk_table[] = {
> >>       { },
> >>  };
> >> -static struct ctl_path atalk_path[] = {
> >> -     { .procname = "net", },
> >> -     { .procname = "appletalk", },
> >> -     { }
> >> -};
> >> -
> >>  static struct ctl_table_header *atalk_table_header;
> >>
> >>  void atalk_register_sysctl(void)
> >>  {
> >> +     const struct ctl_path atalk_path[] = {
> > Shouldn't all of these be static const struct ?
> They needn't. And some variables are specified __net_initdata currently.

At least some objects are smaller with static.

$ size net/appletalk/sysctl_net_atalk.o.*
   text	   data	    bss	    dec	    hex	filename
    324	    236	     48	    608	    260	net/appletalk/sysctl_net_atalk.o.withstatic
    344	    236	     48	    628	    274	net/appletalk/sysctl_net_atalk.o.withoutstatic

* Re: [Ksummit-2010-discuss] [v2] Remaining BKL users, what to do
From: Dave Young @ 2010-10-20  4:43 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Greg KH, Oliver Neukum, Valdis.Kletnieks, Dave Airlie, codalist,
	ksummit-2010-discuss, autofs, Jan Harkes, Samuel Ortiz, Jan Kara,
	Arnaldo Carvalho de Melo, netdev, Anders Larsen, linux-kernel,
	dri-devel, Bryan Schumaker, Christoph Hellwig, Petr Vandrovec,
	Mikulas Patocka, linux-fsdevel, Evgeniy Dushistov, Ingo Molnar,
	Andrew Hendry, linux-media
In-Reply-To: <201010192244.41913.arnd@arndb.de>

On Wed, Oct 20, 2010 at 4:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 19 October 2010 22:29:12 Greg KH wrote:
>> On Tue, Oct 19, 2010 at 09:40:47PM +0200, Oliver Neukum wrote:
>> > Am Dienstag, 19. Oktober 2010, 21:37:35 schrieb Greg KH:
>> > > > So no need to clean it up for multiprocessor support.
>> > > >
>> > > > http://download.intel.com/design/chipsets/datashts/29067602.pdf
>> > > > http://www.intel.com/design/chipsets/specupdt/29069403.pdf
>> > >
>> > > Great, we can just drop all calls to lock_kernel() and the like in the
>> > > driver and be done with it, right?
>> >
>> > No,
>> >
>> > you still need to switch off preemption.
>>
>> Hm, how would you do that from within a driver?
>
> I think this would do:
> ---
> drm/i810: remove SMP support and BKL
>
> The i810 and i815 chipsets supported by the i810 drm driver were not
> officially designed for SMP operation, so the big kernel lock is
> only required for kernel preemption. This disables the driver if
> preemption is enabled and removes all calls to lock_kernel in it.
>
> If you own an Acorp 6A815EPD mainboard with a i815 chipset and
> two Pentium-III sockets, and want to run recent kernels on it,
> tell me about it.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index b755301..e071bc8 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -73,8 +73,8 @@ source "drivers/gpu/drm/radeon/Kconfig"
>
>  config DRM_I810
>        tristate "Intel I810"
> -       # BKL usage in order to avoid AB-BA deadlocks, may become BROKEN_ON_SMP
> -       depends on DRM && AGP && AGP_INTEL && BKL
> +       # PREEMPT requires BKL support here, which was removed
> +       depends on DRM && AGP && AGP_INTEL && !PREEMPT

Just curious: why can't we simply fix the lock_kernel logic of i810? Is
fixing it too hard?

Finding i810 hardware should be possible. Even if the hardware does not
support SMP, can't we test the fix with preemption?

-- 
Regards
dave

* Re: [PATCH] net: make ctl_path local and const
From: Changli Gao @ 2010-10-20  4:52 UTC (permalink / raw)
  To: Joe Perches
  Cc: Andy Grover, linux-sctp, rds-devel, Pekka Savola (ipv6),
	linux-x25, dccp, bridge, James Morris, coreteam,
	Arnaldo Carvalho de Melo, Alexey Kuznetsov, Joerg Reuter,
	Sridhar Samudrala, Samuel Ortiz, Vlad Yasevich, netfilter,
	Remi Denis-Courmont, linux-hams, Hideaki YOSHIFUJI, netdev,
	linux-decnet-user, linux-kernel, Ralf Baechle <ralf
In-Reply-To: <1287545337.10409.602.camel@Joe-Laptop>

On Wed, Oct 20, 2010 at 11:28 AM, Joe Perches <joe@perches.com> wrote:
>
> At least some objects are smaller with static.
>
> $ size net/appletalk/sysctl_net_atalk.o.*
>   text    data     bss     dec     hex filename
>    324     236      48     608     260 net/appletalk/sysctl_net_atalk.o.withstatic
>    344     236      48     628     274 net/appletalk/sysctl_net_atalk.o.withoutstatic
>
>

I got the opposite result for the size of the whole kernel image with
allyesconfig.

original: 33531456
patched: 33531424

It seems random.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

* Re: [PATCH] net: make ctl_path local and const
From: Joe Perches @ 2010-10-20  4:59 UTC (permalink / raw)
  To: Changli Gao
  Cc: Andy Grover, James Morris, linux-sctp, rds-devel,
	Pekka Savola (ipv6), linux-x25, dccp, bridge, Andrew, coreteam,
	Arnaldo Carvalho de Melo, Alexey Kuznetsov, Joerg Reuter,
	Sridhar Samudrala, Samuel Ortiz, Vlad Yasevich, netfilter,
	Remi Denis-Courmont, linux-hams, Hideaki YOSHIFUJI, netdev,
	linux-decnet-user, linux-kernel
In-Reply-To: <AANLkTimGUB-+TTsje1kc3q9PyoB2jJdgCK2Z7n9u7N1J@mail.gmail.com>

On Wed, 2010-10-20 at 12:52 +0800, Changli Gao wrote:
> On Wed, Oct 20, 2010 at 11:28 AM, Joe Perches <joe@perches.com> wrote:
> > At least some objects are smaller with static.
> > $ size net/appletalk/sysctl_net_atalk.o.*
> >   text    data     bss     dec     hex filename
> >    324     236      48     608     260 net/appletalk/sysctl_net_atalk.o.withstatic
> >    344     236      48     628     274 net/appletalk/sysctl_net_atalk.o.withoutstatic
> I got the opposite result for the size of the whole kernel image with
> allyesconfig.
> original: 33531456
> patched: 33531424
> It seems random.

Not using static requires the compiler to emit
initialization code for any use of the routine
that otherwise would only be done once.
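
As a rough illustration of that difference, here is a minimal userspace
sketch, not taken from the patch. `struct path_entry`, `depth_static()` and
`depth_local()` are hypothetical names standing in for `struct ctl_path` and
the register functions. The static array is placed in read-only data once at
build time, while the automatic array is a stack object that the compiler
typically has to fill in on every call (unless it can optimize it away).

```c
#include <stddef.h>

/* Hypothetical stand-in for struct ctl_path; not the kernel definition. */
struct path_entry {
	const char *procname;
};

/* Count entries up to the terminator whose procname is NULL. */
static size_t path_depth(const struct path_entry *path)
{
	size_t n = 0;

	while (path[n].procname)
		n++;
	return n;
}

/* static const: one copy in read-only data, no per-call setup code. */
size_t depth_static(void)
{
	static const struct path_entry path[] = {
		{ .procname = "net" },
		{ .procname = "appletalk" },
		{ .procname = NULL }
	};

	return path_depth(path);
}

/* Automatic const: the array lives on the stack, so the compiler usually
 * emits stores to initialize it each time the function runs. */
size_t depth_local(void)
{
	const struct path_entry path[] = {
		{ .procname = "net" },
		{ .procname = "appletalk" },
		{ .procname = NULL }
	};

	return path_depth(path);
}
```

Both functions return the same result; comparing the two object files with
`size` is one way to see whether the extra initialization code shows up for
a given compiler and optimization level.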

* Re: [PATCH] net: make ctl_path local and const
From: Changli Gao @ 2010-10-20  5:10 UTC (permalink / raw)
  To: Joe Perches
  Cc: Andy Grover, linux-sctp, rds-devel, Pekka Savola (ipv6),
	linux-x25, dccp, bridge, James Morris, coreteam,
	Arnaldo Carvalho de Melo, Alexey Kuznetsov, Joerg Reuter,
	Sridhar Samudrala, Samuel Ortiz, Vlad Yasevich, netfilter,
	Remi Denis-Courmont, linux-hams, Hideaki YOSHIFUJI, netdev,
	linux-decnet-user, linux-kernel, Ralf Baechle <ralf
In-Reply-To: <1287550779.10409.620.camel@Joe-Laptop>

On Wed, Oct 20, 2010 at 12:59 PM, Joe Perches <joe@perches.com> wrote:
>
> Not using static requires the compiler to emit
> initialization code for any use of the routine
> that otherwise would only be done once.
>

If the code isn't performance critical, I think we can afford it.
Otherwise, unless we mark it as init data, we have to keep the memory
that holds the static data reserved.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

* Re: 2.6.36-rc7: net/bridge causes temporary network I/O lockups [2]
From: Herbert Xu @ 2010-10-20  6:16 UTC (permalink / raw)
  To: Patrick Ringl; +Cc: netdev, linux-kernel, bridge
In-Reply-To: <4CBCB014.9080108@freenet.de>

On Mon, Oct 18, 2010 at 10:37:40PM +0200, Patrick Ringl wrote:
>
> Anything else I could possibly provide? :-)

Yes, testing :)

First of all I'd like to rule out (or in) the IPv6 query code,
which is clearly generating a bogus packet (wrong payload_len).

So can you apply this patch and see if it makes the problem
go away? Please take packet dumps so we know that the IPv6 query
is no longer being sent.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index eb5b256..66f39d7 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -832,11 +832,6 @@ static void br_multicast_send_query(struct net_bridge *br,
 	br_group.proto = htons(ETH_P_IP);
 	__br_multicast_send_query(br, port, &br_group);
 
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
-	br_group.proto = htons(ETH_P_IPV6);
-	__br_multicast_send_query(br, port, &br_group);
-#endif
-
 	time = jiffies;
 	time += sent < br->multicast_startup_query_count ?
 		br->multicast_startup_query_interval :

* Re: [Ksummit-2010-discuss] [v2] Remaining BKL users, what to do
From: Arnd Bergmann @ 2010-10-20  6:50 UTC (permalink / raw)
  To: Dave Young
  Cc: Greg KH, Oliver Neukum, Valdis.Kletnieks, Dave Airlie, codalist,
	ksummit-2010-discuss, autofs, Jan Harkes, Samuel Ortiz, Jan Kara,
	Arnaldo Carvalho de Melo, netdev, Anders Larsen, linux-kernel,
	dri-devel, Bryan Schumaker, Christoph Hellwig, Petr Vandrovec,
	Mikulas Patocka, linux-fsdevel, Evgeniy Dushistov, Ingo Molnar,
	Andrew Hendry, linux-media
In-Reply-To: <AANLkTimRFxKT5p1K=Rd1MxXZymonx_t6rHKBhn=8CsW=@mail.gmail.com>

On Wednesday 20 October 2010, Dave Young wrote:
> Just curious: why can't we simply fix the lock_kernel logic of i810? Is
> fixing it too hard?
> 
> Finding i810 hardware should be possible. Even if the hardware does not
> support SMP, can't we test the fix with preemption?

Yes, that should work too. My usual approach for removing the BKL without
having the hardware myself was to make locking stricter, i.e. replace
the BKL with a new spinlock or mutex. This way all the code would still
be serialized and if I did something wrong, lockdep would complain about
it, but there would be no risk of silent data corruption.
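
A minimal userspace sketch of that conversion pattern, assuming a pthread
mutex in place of a kernel mutex. `drv_lock`, `drv_open_count` and
`drv_open()` are hypothetical names, not code from any driver. It is
"stricter" in the sense that a mutex, unlike the BKL, is neither recursive
nor dropped when the holder sleeps.

```c
#include <pthread.h>

/* Hypothetical driver-private lock that takes over from the BKL. */
static pthread_mutex_t drv_lock = PTHREAD_MUTEX_INITIALIZER;
static int drv_open_count;

/* Before: lock_kernel(); drv_open_count++; unlock_kernel();
 * After:  the same critical section, serialized by the private mutex. */
int drv_open(void)
{
	int count;

	pthread_mutex_lock(&drv_lock);
	count = ++drv_open_count;
	pthread_mutex_unlock(&drv_lock);
	return count;
}
```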

In case of i810, locking across DRM is rather complicated and there is no
way of doing this without making changes to other DRM code.

In fact, the only critical section that is actually protected by the BKL
is the few lines in i810_mmap_buffers. They look like they might not even
need the BKL to start with and we can just remove it even on SMP/PREEMPT,
except for perhaps the assignment to buf_priv->currently_mapped.
Someone who understands more about the driver than I do can probably figure
this out easily, but I couldn't come up with a way that doesn't risk
breaking in corner cases.

	Arnd

* Re: [PATCH 1/2] Remove netpoll blocking from uninit path
From: Cong Wang @ 2010-10-20  7:47 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, bonding-devel, fubar, davem, andy
In-Reply-To: <1287507866-25156-2-git-send-email-nhorman@tuxdriver.com>

On 10/20/10 01:04, nhorman@tuxdriver.com wrote:
> From: Neil Horman<nhorman@tuxdriver.com>
>
> Some recent testing in netpoll with bonding showed this backtrace
>
>   ------------[ cut here ]------------
>   kernel BUG at drivers/net/bonding/bonding.h:134!
>   invalid opcode: 0000 [#1] SMP
>   last sysfs file: /sys/devices/pci0000:00/0000:00:1d.2/usb7/devnum
>   CPU 0
>   Pid: 1876, comm: rmmod Not tainted 2.6.36-rc3+ #10 D26928/
>   RIP: 0010:[<ffffffffa0514ba4>]  [<ffffffffa0514ba4>] bond_uninit+0x6f4/0x7a0
>   RSP: 0018:ffff88003b1b5d58  EFLAGS: 00010296
>   RAX: ffff88003b9b6200 RBX: ffff8800373e8e00 RCX: 00000000000f4240
>   RDX: 00000000ffffffff RSI: 0000000000000286 RDI: 0000000000000286
>   RBP: ffff88003b1b5dc8 R08: 0000000000000000 R09: 00000001af7de920
>   R10: 0000000000000000 R11: ffff880002495e98 R12: ffff880037922700
>   R13: ffff880038c31000 R14: ffff880037922730 R15: 0000000000000286
>   FS:  00007f90e6d72700(0000) GS:ffff880002400000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>   CR2: 000000346f0d9ad0 CR3: 000000003b263000 CR4: 00000000000006f0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>   Process rmmod (pid: 1876, threadinfo ffff88003b1b4000, task ffff88003b36aa80)
>   Stack:
>   00000000ffffffff ffff88003b1b5d7a ffff8800379221e8 ffff880037922000
>   <0>  ffff88003b1b5dc8 ffffffff813eb5fb ffff88003b1b5da8 0000000031b177a3
>   <0>  ffff88003b1b5da8 ffff880037922000 ffff88003b1b5e48 ffff88003b1b5e48
>   Call Trace:
>   [<ffffffff813eb5fb>] ? rtmsg_ifinfo+0xcb/0xf0
>   [<ffffffff813daad8>] rollback_registered_many+0x168/0x280
>   [<ffffffff813dac09>] unregister_netdevice_many+0x19/0x80
>   [<ffffffff813e97b3>] __rtnl_kill_links+0x63/0x90
>   [<ffffffff813e980b>] __rtnl_link_unregister+0x2b/0x60
>   [<ffffffff813e9bde>] rtnl_link_unregister+0x1e/0x30
>   [<ffffffffa052124b>] bonding_exit+0x37/0x51 [bonding]
>   [<ffffffff81098b2e>] sys_delete_module+0x19e/0x270
>   [<ffffffff810bb2b2>] ? audit_syscall_entry+0x252/0x280
>   [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
>   RIP  [<ffffffffa0514ba4>] bond_uninit+0x6f4/0x7a0 [bonding]
>   RSP<ffff88003b1b5d58>
>   ---[ end trace 1395ad691cea24d1 ]---
>
> It occurs because of my recent netpoll blocking patches, which I added to avoid
> recursive deadlock in the bonding driver.  It relies on some per cpu bits, but
> the shutdown path forces some rescheduling as we cancel workqueues for the
> driver and wait for some device refcounts.  If after the forced reschedule, we
> wind up on a different cpu we trigger the bughalt in unblock_netpoll_tx.
>
> The fix is to remove the netpoll block/unblock calls from bond_release_all.
> This is safe to do because bond_uninit, which is called via ndo_uninit in
> rollback_registered_many, doesn't occur until we send a NETDEV_UNREGISTER event,
> which triggers netconsole to remove us as a netpoll client, so we are guaranteed
> not to recurse into our own tx path here.

Also bond_release_all() is called after bond_netpoll_cleanup()
in bond_uninit().

>
> Signed-off-by: Neil Horman<nhorman@tuxdriver.com>

Reviewed-by: WANG Cong <amwang@redhat.com>

Thanks.

* Re: [PATCH 2/2] Revert napi_poll fix for bonding driver
From: Cong Wang @ 2010-10-20  7:52 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, bonding-devel, fubar, davem, andy
In-Reply-To: <1287507866-25156-3-git-send-email-nhorman@tuxdriver.com>

On 10/20/10 01:04, nhorman@tuxdriver.com wrote:
> From: Neil Horman<nhorman@tuxdriver.com>
>
> In an earlier patch I modified napi_poll so that devices with IFF_MASTER polled
> the per_cpu list instead of the device list for napi.  I did this because the
> bonding driver has no napi instances to poll; it instead expects to check the
> slave devices' napi instances, which napi_poll was unaware of.  Looking at this
> more closely however, I now see this isn't strictly needed, as the bond driver's
> poll_controller calls the slaves' poll_controller via netpoll_poll_dev, which
> recursively calls poll_napi on each slave, allowing those napi instances to get
> serviced.  The earlier patch isn't at all harmful, it's just not needed, so let's
> revert it to make the code cleaner.  Sorry for the noise,
>
> Signed-off-by: Neil Horman<nhorman@tuxdriver.com>

Looks reasonable to me,

Reviewed-by: WANG Cong <amwang@redhat.com>

Thanks.

* [PATCH] phonet: remove the unused variable pn
From: Changli Gao @ 2010-10-20  7:51 UTC (permalink / raw)
  To: Remi Denis-Courmont; +Cc: David S. Miller, netdev, Changli Gao

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/phonet/pep.c |    1 -
 1 file changed, 1 deletion(-)
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 9c903f9..3e60f2e 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -300,7 +300,6 @@ static int pipe_handler_send_ind(struct sock *sk, u8 utid, u8 msg_id)
 
 static int pipe_handler_enable_pipe(struct sock *sk, int enable)
 {
-	struct pep_sock *pn = pep_sk(sk);
 	int utid, req;
 
 	if (enable) {

* Re: [RFC PATCH 1/9] ipvs network name space aware
From: Hans Schillstrom @ 2010-10-20  8:25 UTC (permalink / raw)
  To: paulmck@linux.vnet.ibm.com
  Cc: Daniel Lezcano, lvs-devel@vger.kernel.org, netdev@vger.kernel.org,
	netfilter-devel@vger.kernel.org, horms@verge.net.au, ja@ssi.bg,
	wensong@linux-vs.org
In-Reply-To: <20101019184436.GG2362@linux.vnet.ibm.com>

On Tuesday 19 October 2010 20:44:36 Paul E. McKenney wrote:
> On Mon, Oct 18, 2010 at 03:23:48PM +0200, Hans Schillstrom wrote:
> > On Monday 18 October 2010 13:37:38 Daniel Lezcano wrote:
> > > On 10/18/2010 11:54 AM, Hans Schillstrom wrote:
> > > > On Monday 18 October 2010 10:59:25 Daniel Lezcano wrote:
> > > >
> > > >> On 10/08/2010 01:16 PM, Hans Schillstrom wrote:
> > > >>
> > > >>> This part contains the include files
> > > >>> where include/net/netns/ip_vs.h is new and contains all moved vars.
> > > >>>
> > > >>> SUMMARY
> > > >>>
> > > >>>    include/net/ip_vs.h                     |  136 ++++---
> > > >>>    include/net/net_namespace.h             |    2 +
> > > >>>    include/net/netns/ip_vs.h               |  112 +++++
> > > >>>
> > > >>> Signed-off-by:Hans Schillstrom<hans.schillstrom@ericsson.com>
> > > >>> ---
> > > >>>
> > > >>>
> > > >>>
> > > >> [ ... ]
> > > >>
> > > >>
> > > >>>    #ifdef CONFIG_IP_VS_IPV6
> > > >>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > > >>> index bd10a79..b59cdc5 100644
> > > >>> --- a/include/net/net_namespace.h
> > > >>> +++ b/include/net/net_namespace.h
> > > >>> @@ -15,6 +15,7 @@
> > > >>>    #include<net/netns/ipv4.h>
> > > >>>    #include<net/netns/ipv6.h>
> > > >>>    #include<net/netns/dccp.h>
> > > >>> +#include<net/netns/ip_vs.h>
> > > >>>    #include<net/netns/x_tables.h>
> > > >>>    #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
> > > >>>    #include<net/netns/conntrack.h>
> > > >>> @@ -91,6 +92,7 @@ struct net {
> > > >>>    	struct sk_buff_head	wext_nlevents;
> > > >>>    #endif
> > > >>>    	struct net_generic	*gen;
> > > >>> +	struct netns_ipvs       *ipvs;
> > > >>>    };
> > > >>>
> > > >>>
> > > >> IMHO, it would be better to use the net_generic infra-structure instead
> > > >> of adding a new field in the netns structure.
> > > >>
> > > >>
> > > >>
> > > > I realized that too, but the performance penalty is quite high with net_generic :-(
> > > > But on the other hand, if you are going to backport it (without recompiling the kernel),
> > > > you are going to need it!
> > > >
> > >
> > > Hmm, yes. We don't want to have the init_net_ns performances to be impacted.
> > >
> > > You use here a pointer which will be dereferenced like the net_generic,
> > > I don't think there will be
> > > a big difference between using net_generic and using a pointer in the
> > > net namespace structure.
> > >
> > > The difference is the id usage, but this one is based on the idr which
> > > is quite fast.
> > >
> >
> > I'm not so sure about that, have a look at net_generic and rcu_read_lock
> > and compare
> >  ipvs = net->ipvs;
> > vs.
> >  ipvs = net_generic(net, id)
> >
> > static inline void *net_generic(struct net *net, int id)
> > {
> > 	struct net_generic *ng;
> > 	void *ptr;
> >
> > 	rcu_read_lock();
> > 	ng = rcu_dereference(net->gen);
> > 	BUG_ON(id == 0 || id > ng->len);
> > 	ptr = ng->ptr[id - 1];
> > 	rcu_read_unlock();
> >
> > 	return ptr;
> > }
> > ...
> > static inline void rcu_read_lock(void)
> > {
> >         __rcu_read_lock();
> >         __acquire(RCU);
> >         rcu_read_acquire();
> > }
> >
> > Another way of doing it is to pass the ipvs ptr instead of the net ptr,
> > and add *net to the ipvs struct.
> >
> > > We should experiment a bit here to compare both solutions.
> > Agreed.
> > >
> > I single-stepped through rcu_read_lock() on x86_64,
> > and it takes quite a few "stepi" commands to get through :-(
>
> Was this by chance with lockdep enabled?  If not, could you please send
> your .config?
>
> 							Thanx, Paul

No lockdep, but what I meant is that net_generic is not as fast as a plain ptr->xxx.
IPVS has hooks in the netfilter chain and gets a huge number of packets.

I don't think IPVS is a candidate for net_generic; it should have its own part in "struct net".
That was my point.
(No criticism of locking or net_generic.)
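
To make the comparison concrete, here is a simplified userspace sketch of
the two access paths. All type and function names (`net_sketch`,
`generic_lookup()`, `direct_lookup()` and so on) are hypothetical stand-ins,
not the kernel definitions, and the RCU read-side section used by the real
net_generic() is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel structures. */
struct netns_ipvs_sketch {
	int conn_count;
};

struct net_generic_sketch {
	unsigned int len;
	void *ptr[4];
};

struct net_sketch {
	struct net_generic_sketch *gen;  /* indexed per-subsystem slots */
	struct netns_ipvs_sketch *ipvs;  /* dedicated field, as in the RFC */
};

/* Generic path: load the slot table, bounds-check the id, index into it.
 * The real helper additionally wraps this in rcu_read_lock/unlock. */
void *generic_lookup(const struct net_sketch *net, unsigned int id)
{
	const struct net_generic_sketch *ng = net->gen;

	assert(id != 0 && id <= ng->len);
	return ng->ptr[id - 1];
}

/* Direct path: a single pointer load from the namespace structure. */
struct netns_ipvs_sketch *direct_lookup(const struct net_sketch *net)
{
	return net->ipvs;
}
```

Whether the extra loads and the check matter in practice depends on how
often the lookup runs per packet, which is the experiment suggested earlier
in the thread.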

--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

* Re: [PATCH 1/2] Remove netpoll blocking from uninit path
From: David Miller @ 2010-10-20  8:45 UTC (permalink / raw)
  To: amwang; +Cc: nhorman, netdev, bonding-devel, fubar, andy
In-Reply-To: <4CBE9E7F.60107@redhat.com>

From: Cong Wang <amwang@redhat.com>
Date: Wed, 20 Oct 2010 15:47:11 +0800

> On 10/20/10 01:04, nhorman@tuxdriver.com wrote:
>> From: Neil Horman<nhorman@tuxdriver.com>
>>
>> Some recent testing in netpoll with bonding showed this backtrace
...
>> Signed-off-by: Neil Horman<nhorman@tuxdriver.com>
> 
> Reviewed-by: WANG Cong <amwang@redhat.com>

Applied.

Neil, please add proper subsystem prefixes to your subject
lines, I've had to add them by hand as I add your patches.

Thanks.

* Re: [PATCH 2/2] Revert napi_poll fix for bonding driver
From: David Miller @ 2010-10-20  8:45 UTC (permalink / raw)
  To: amwang; +Cc: nhorman, netdev, bonding-devel, fubar, andy
In-Reply-To: <4CBE9FA0.6010900@redhat.com>

From: Cong Wang <amwang@redhat.com>
Date: Wed, 20 Oct 2010 15:52:00 +0800

> On 10/20/10 01:04, nhorman@tuxdriver.com wrote:
>> Signed-off-by: Neil Horman<nhorman@tuxdriver.com>
 ...
> Reviewed-by: WANG Cong <amwang@redhat.com>

Also applied, thanks.

* [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar @ 2010-10-20  8:54 UTC (permalink / raw)
  To: rusty, davem, mst
  Cc: arnd, eric.dumazet, netdev, avi, anthony, kvm, Krishna Kumar

The following set of patches implements transmit MQ in virtio-net.  The
userspace qemu changes are also included.  MQ is disabled by default unless
qemu specifies it.

                  Changes from rev2:
                  ------------------
1. Define (in virtio_net.h) the maximum send txqs; and use in
   virtio-net and vhost-net.
2. vi->sq[i] is allocated individually, resulting in cache line
   aligned sq[0] to sq[n].  Another option was to define
   'send_queue' as:
       struct send_queue {
               struct virtqueue *svq;
               struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
       } ____cacheline_aligned_in_smp;
   and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
   the submitted method is preferable.
3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
   handles TX[0-n].
4. Further change TX handling such that vhost[0] handles both RX/TX
   for single stream case.

                  Enabling MQ on virtio:
                  -----------------------
When the following options are passed to qemu:
        - smp > 1
        - vhost=on
        - mq=on (new option, default:off)
then #txqueues = #cpus.  The #txqueues can be changed by using an
optional 'numtxqs' option.  e.g. for a smp=4 guest:
        vhost=on                   ->   #txqueues = 1
        vhost=on,mq=on             ->   #txqueues = 4
        vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
        vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
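
The option table above can be sketched as a small helper. This is only an
illustration of the rule as described, not qemu code: `struct mq_opts` and
`num_txqueues()` are hypothetical names, falling back to a single queue when
smp is 1 is an assumption, and no clamping to VIRTIO_MAX_SQ is modelled.

```c
/* Hypothetical option set mirroring the qemu options listed above. */
struct mq_opts {
	int smp;      /* number of guest cpus */
	int vhost;    /* 1 if vhost=on */
	int mq;       /* 1 if mq=on (default off) */
	int numtxqs;  /* 0 means "not specified" */
};

/* Returns the number of TX queues implied by the options. */
int num_txqueues(const struct mq_opts *o)
{
	/* MQ needs smp > 1, vhost=on and mq=on; otherwise a single queue. */
	if (!o->vhost || !o->mq || o->smp <= 1)
		return 1;
	/* An explicit numtxqs overrides the default of one queue per cpu. */
	if (o->numtxqs > 0)
		return o->numtxqs;
	return o->smp;
}
```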


                   Performance (guest -> local host):
                   -----------------------------------
System configuration:
        Host:  8 Intel Xeon, 8 GB memory
        Guest: 4 cpus, 2 GB memory
Test: Each test case runs for 60 secs, summed over three runs (except
when the number of netperf sessions is 1, which has 10 runs of 12 secs
each).  No tuning (default netperf) other than using taskset to pin the
vhosts to cpus 0-3.  numtxqs=32 gave the best results though the guest had
only 4 vcpus (I haven't tried beyond that).

______________ numtxqs=2, vhosts=3  ____________________
#sessions  BW%      CPU%    RCPU%    SD%      RSD%
________________________________________________________
1          4.46    -1.96     .19     -12.50   -6.06
2          4.93    -1.16    2.10      0       -2.38
4          46.17    64.77   33.72     19.51   -2.48
8          47.89    70.00   36.23     41.46    13.35
16         48.97    80.44   40.67     21.11   -5.46
24         49.03    78.78   41.22     20.51   -4.78
32         51.11    77.15   42.42     15.81   -6.87
40         51.60    71.65   42.43     9.75    -8.94
48         50.10    69.55   42.85     11.80   -5.81
64         46.24    68.42   42.67     14.18   -3.28
80         46.37    63.13   41.62     7.43    -6.73
96         46.40    63.31   42.20     9.36    -4.78
128        50.43    62.79   42.16     13.11   -1.23
________________________________________________________
BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%

______________ numtxqs=8, vhosts=5  ____________________
#sessions   BW%      CPU%     RCPU%     SD%      RSD%
________________________________________________________
1           -.76    -1.56     2.33      0        3.03
2           17.41    11.11    11.41     0       -4.76
4           42.12    55.11    30.20     19.51    .62
8           54.69    80.00    39.22     24.39    -3.88
16          54.77    81.62    40.89     20.34    -6.58
24          54.66    79.68    41.57     15.49    -8.99
32          54.92    76.82    41.79     17.59    -5.70
40          51.79    68.56    40.53     15.31    -3.87
48          51.72    66.40    40.84     9.72     -7.13
64          51.11    63.94    41.10     5.93     -8.82
80          46.51    59.50    39.80     9.33     -4.18
96          47.72    57.75    39.84     4.20     -7.62
128         54.35    58.95    40.66     3.24     -8.63
________________________________________________________
BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%

______________ numtxqs=16, vhosts=5  ___________________
#sessions   BW%      CPU%     RCPU%     SD%      RSD%
________________________________________________________
1           -1.43    -3.52    1.55      0          3.03
2           33.09     21.63   20.12    -10.00     -9.52
4           67.17     94.60   44.28     19.51     -11.80
8           75.72     108.14  49.15     25.00     -10.71
16          80.34     101.77  52.94     25.93     -4.49
24          70.84     93.12   43.62     27.63     -5.03
32          69.01     94.16   47.33     29.68     -1.51
40          58.56     63.47   25.91    -3.92      -25.85
48          61.16     74.70   34.88     .89       -22.08
64          54.37     69.09   26.80    -6.68      -30.04
80          36.22     22.73   -2.97    -8.25      -27.23
96          41.51     50.59   13.24     9.84      -16.77
128         48.98     38.15   6.41     -.33       -22.80
________________________________________________________
BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%

______________ numtxqs=32, vhosts=5  ___________________
#sessions    BW%       CPU%    RCPU%    SD%     RSD%
________________________________________________________
1            7.62     -38.03   -26.26  -50.00   -33.33
2            28.95     20.46    21.62   0       -7.14
4            84.05     60.79    45.74  -2.43    -12.42
8            86.43     79.57    50.32   15.85   -3.10
16           88.63     99.48    58.17   9.47    -13.10
24           74.65     80.87    41.99  -1.81    -22.89
32           63.86     59.21    23.58  -18.13   -36.37
40           64.79     60.53    22.23  -15.77   -35.84
48           49.68     26.93    .51    -36.40   -49.61
64           54.69     36.50    5.41   -26.59   -43.23
80           45.06     12.72   -13.25  -37.79   -52.08
96           40.21    -3.16    -24.53  -39.92   -52.97
128          36.33    -33.19   -43.66  -5.68    -20.49
________________________________________________________
BW: 49.3%,  CPU/RCPU: 15.5%,-8.2%,  SD/RSD: -22.2%,-37.0%


Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---

* [v3 RFC PATCH 2/4] Changes for virtio-net
From: Krishna Kumar @ 2010-10-20  8:55 UTC (permalink / raw)
  To: rusty, davem, mst
  Cc: kvm, arnd, netdev, avi, anthony, eric.dumazet, Krishna Kumar
In-Reply-To: <20101020085452.15579.76002.sendpatchset@krkumar2.in.ibm.com>

Implement mq virtio-net driver. 

Though struct virtio_net_config changes, it works with old
qemu's since the last element is not accessed, unless qemu
sets VIRTIO_NET_F_NUMTXQS.  Patch also adds a macro for the
maximum number of TX vq's (VIRTIO_MAX_SQ) that the user can
specify.
        
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---     
 drivers/net/virtio_net.c   |  234 ++++++++++++++++++++++++++---------
 include/linux/virtio_net.h |    6 
 2 files changed, 185 insertions(+), 55 deletions(-)

diff -ruNp org/include/linux/virtio_net.h new.dynamic.optimize_vhost/include/linux/virtio_net.h
--- org/include/linux/virtio_net.h	2010-10-11 10:20:22.000000000 +0530
+++ new.dynamic.optimize_vhost/include/linux/virtio_net.h	2010-10-19 13:24:38.000000000 +0530
@@ -7,6 +7,9 @@
 #include <linux/virtio_config.h>
 #include <linux/if_ether.h>
 
+/* Maximum number of TX queues supported */
+#define VIRTIO_MAX_SQ 32
+
 /* The feature bitmap for virtio net */
 #define VIRTIO_NET_F_CSUM	0	/* Host handles pkts w/ partial csum */
 #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
@@ -26,6 +29,7 @@
 #define VIRTIO_NET_F_CTRL_RX	18	/* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queue */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -34,6 +38,8 @@ struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* number of transmit queues */
+	__u16 numtxqs;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org/drivers/net/virtio_net.c new.dynamic.optimize_vhost/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c	2010-10-11 10:20:02.000000000 +0530
+++ new.dynamic.optimize_vhost/drivers/net/virtio_net.c	2010-10-19 17:01:53.000000000 +0530
@@ -40,11 +40,24 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 
+/* Our representation of a send virtqueue */
+struct send_queue {
+	struct virtqueue *svq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
+};
+
 struct virtnet_info {
+	struct send_queue **sq;
+	struct napi_struct napi ____cacheline_aligned_in_smp;
+
+	/* read-mostly variables */
+	int numtxqs ____cacheline_aligned_in_smp;
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	struct virtqueue *rvq;
+	struct virtqueue *cvq;
 	struct net_device *dev;
-	struct napi_struct napi;
 	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
@@ -62,9 +75,8 @@ struct virtnet_info {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
-	/* fragments + linear part + virtio header */
+	/* RX: fragments + linear part + virtio header */
 	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -120,12 +132,13 @@ static struct page *get_a_page(struct vi
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;
+	int qnum = svq->queue_index - 1;	/* 0 is RX vq */
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, qnum);
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -495,12 +508,13 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+				       struct virtqueue *svq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
@@ -510,7 +524,8 @@ static unsigned int free_old_xmit_skbs(s
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+		    struct virtqueue *svq, struct scatterlist *tx_sg)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -548,12 +563,12 @@ static int xmit_skb(struct virtnet_info 
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, tx_sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(svq, tx_sg, hdr->num_sg,
 					0, skb);
 }
 
@@ -561,31 +576,34 @@ static netdev_tx_t start_xmit(struct sk_
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
+	int qnum = skb_get_queue_mapping(skb);
+	struct virtqueue *svq = vi->sq[qnum]->svq;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, svq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb, svq, vi->sq[qnum]->tx_sg);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
 		if (net_ratelimit()) {
 			if (likely(capacity == -ENOMEM)) {
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					 "TXQ (%d) failure: out of memory\n",
+					 qnum);
 			} else {
 				dev->stats.tx_fifo_errors++;
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					 "Unexpected TXQ (%d) failure: %d\n",
+					 qnum, capacity);
 			}
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(svq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -594,13 +612,13 @@ static netdev_tx_t start_xmit(struct sk_
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
+		netif_stop_subqueue(dev, qnum);
+		if (unlikely(!virtqueue_enable_cb(svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				netif_start_subqueue(dev, qnum);
+				virtqueue_disable_cb(svq);
 			}
 		}
 	}
@@ -871,10 +889,10 @@ static void virtnet_update_status(struct
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 }
 
@@ -885,18 +903,122 @@ static void virtnet_config_changed(struc
 	virtnet_update_status(vi);
 }
 
+#define MAX_DEVICE_NAME		16
+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int i, err = -ENOMEM;
+	int totalvqs;
+	char **names;
+
+	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
+	if (!vi->sq)
+		goto out;
+	for (i = 0; i < numtxqs; i++) {
+		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
+		if (!vi->sq[i])
+			goto out;
+	}
+
+	/* setup initial send queue parameters */
+	for (i = 0; i < numtxqs; i++)
+		sg_init_table(vi->sq[i]->tx_sg, ARRAY_SIZE(vi->sq[i]->tx_sg));
+
+	/*
+	 * We expect 1 RX virtqueue followed by 'numtxqs' TX virtqueues, and
+	 * optionally one control virtqueue.
+	 */
+	totalvqs = 1 + numtxqs +
+		   virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Setup parameters for find_vqs */
+	vqs = kmalloc(totalvqs * sizeof(*vqs), GFP_KERNEL);
+	callbacks = kmalloc(totalvqs * sizeof(*callbacks), GFP_KERNEL);
+	names = kzalloc(totalvqs * sizeof(*names), GFP_KERNEL);
+	if (!vqs || !callbacks || !names)
+		goto free_mem;
+
+	/* Parameters for recv virtqueue */
+	callbacks[0] = skb_recv_done;
+	names[0] = "input";
+
+	/* Parameters for send virtqueues */
+	for (i = 1; i <= numtxqs; i++) {
+		callbacks[i] = skb_xmit_done;
+		names[i] = kmalloc(MAX_DEVICE_NAME * sizeof(*names[i]),
+				   GFP_KERNEL);
+		if (!names[i])
+			goto free_mem;
+		sprintf(names[i], "output.%d", i - 1);
+	}
+
+	/* Parameters for control virtqueue, if any */
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "control";
+	}
+
+	err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
+					 (const char **)names);
+	if (err)
+		goto free_mem;
+
+	vi->rvq = vqs[0];
+	for (i = 0; i < numtxqs; i++)
+		vi->sq[i]->svq = vqs[i + 1];
+
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		vi->cvq = vqs[i + 1];
+
+		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+	}
+
+free_mem:
+	if (names) {
+		for (i = 1; i <= numtxqs; i++)
+			kfree(names[i]);
+		kfree(names);
+	}
+
+	kfree(callbacks);
+	kfree(vqs);
+
+out:
+	if (err && vi->sq) {
+		for (i = 0; i < numtxqs; i++)
+			kfree(vi->sq[i]);
+		kfree(vi->sq);
+	}
+
+	return err;
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
-	int err;
+	int i, err;
+	u16 numtxqs;
 	struct net_device *dev;
 	struct virtnet_info *vi;
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs;
+
+	/*
+	 * Find if host passed the number of transmit queues supported
+	 * by the device
+	 */
+	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
+				offsetof(struct virtio_net_config, numtxqs),
+				&numtxqs);
+
+	/* We need at least one txq */
+	if (err || !numtxqs)
+		numtxqs = 1;
+
+	if (numtxqs > VIRTIO_MAX_SQ)
+		return -EINVAL;
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), numtxqs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -940,9 +1062,9 @@ static int virtnet_probe(struct virtio_d
 	vi->vdev = vdev;
 	vdev->priv = vi;
 	vi->pages = NULL;
+	vi->numtxqs = numtxqs;
 	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -953,23 +1075,10 @@ static int virtnet_probe(struct virtio_d
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
-
-	err = vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names);
+	/* Initialize our rx/tx queue parameters, and invoke find_vqs */
+	err = initialize_vqs(vi, numtxqs);
 	if (err)
-		goto free;
-
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
-
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
-
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
-			dev->features |= NETIF_F_HW_VLAN_FILTER;
-	}
+		goto free_netdev;
 
 	err = register_netdev(dev);
 	if (err) {
@@ -986,6 +1095,9 @@ static int virtnet_probe(struct virtio_d
 		goto unregister;
 	}
 
+	dev_info(&dev->dev, "(virtio-net) Allocated 1 RX and %d TX vq's\n",
+		 numtxqs);
+
 	vi->status = VIRTIO_NET_S_LINK_UP;
 	virtnet_update_status(vi);
 	netif_carrier_on(dev);
@@ -998,7 +1110,10 @@ unregister:
 	cancel_delayed_work_sync(&vi->refill);
 free_vqs:
 	vdev->config->del_vqs(vdev);
-free:
+	for (i = 0; i < numtxqs; i++)
+		kfree(vi->sq[i]);
+	kfree(vi->sq);
+free_netdev:
 	free_netdev(dev);
 	return err;
 }
@@ -1006,12 +1121,21 @@ free:
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
+	int i;
+
+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *svq = vi->sq[i]->svq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
+			dev_kfree_skb(buf);
+		}
+		kfree(vi->sq[i]);
 	}
+	kfree(vi->sq);
+
 	while (1) {
 		buf = virtqueue_detach_unused_buf(vi->rvq);
 		if (!buf)
@@ -1059,7 +1183,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
-	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
+	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_NUMTXQS,
 };
 
 static struct virtio_driver virtio_net_driver = {


* [v3 RFC PATCH 1/4] Change virtqueue structure
From: Krishna Kumar @ 2010-10-20  8:54 UTC (permalink / raw)
  To: rusty, davem, mst
  Cc: eric.dumazet, kvm, netdev, arnd, avi, anthony, Krishna Kumar
In-Reply-To: <20101020085452.15579.76002.sendpatchset@krkumar2.in.ibm.com>

Move queue_index from virtio_pci_vq_info to virtqueue.  This
allows callback handlers to figure out the queue number for
the vq that needs attention.
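
As a rough illustration of what this enables, here is a minimal
userspace sketch.  The struct below is a simplified stand-in for the
kernel's struct virtqueue, and tx_qnum_from_vq() is a hypothetical
helper name; the only thing taken from the series is the convention
that vq 0 is RX and the TX vqs follow it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct virtqueue; only the field that this
 * patch adds is modelled here. */
struct virtqueue {
	int queue_index;	/* the index of the queue */
};

/*
 * Hypothetical helper: map a TX virtqueue to its TX queue number,
 * assuming vq 0 is the RX queue and vqs 1..n are the TX queues.
 */
int tx_qnum_from_vq(const struct virtqueue *vq)
{
	assert(vq != NULL && vq->queue_index >= 1);
	return vq->queue_index - 1;
}
```

A TX completion callback can then pass the result to something like
netif_wake_subqueue() instead of waking every queue on the device.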

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/virtio/virtio_pci.c |   10 +++-------
 include/linux/virtio.h      |    1 +
 2 files changed, 4 insertions(+), 7 deletions(-)

diff -ruNp org/include/linux/virtio.h new.dynamic.optimize_vhost/include/linux/virtio.h
--- org/include/linux/virtio.h	2010-10-11 10:20:22.000000000 +0530
+++ new.dynamic.optimize_vhost/include/linux/virtio.h	2010-10-15 13:25:42.000000000 +0530
@@ -22,6 +22,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	int queue_index;	/* the index of the queue */
 	void *priv;
 };
 
diff -ruNp org/drivers/virtio/virtio_pci.c new.dynamic.optimize_vhost/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c	2010-10-11 10:20:15.000000000 +0530
+++ new.dynamic.optimize_vhost/drivers/virtio/virtio_pci.c	2010-10-15 13:25:42.000000000 +0530
@@ -75,9 +75,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -185,11 +182,10 @@ static void vp_reset(struct virtio_devic
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -385,7 +381,6 @@ static struct virtqueue *setup_vq(struct
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -408,6 +403,7 @@ static struct virtqueue *setup_vq(struct
 		goto out_activate_queue;
 	}
 
+	vq->queue_index = index;
 	vq->priv = info;
 	info->vq = vq;
 
@@ -446,7 +442,7 @@ static void vp_del_vq(struct virtqueue *
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,


* [v3 RFC PATCH 3/4] Changes for vhost
From: Krishna Kumar @ 2010-10-20  8:55 UTC (permalink / raw)
  To: rusty, davem, mst
  Cc: arnd, eric.dumazet, netdev, avi, anthony, kvm, Krishna Kumar
In-Reply-To: <20101020085452.15579.76002.sendpatchset@krkumar2.in.ibm.com>

Changes for mq vhost.

vhost_net_open is changed to allocate a vhost_net and
return.  The remaining initializations are delayed till
SET_OWNER.  SET_OWNER is changed so that the argument
is used to determine how many txqs to use.  An unmodified
qemu passes NULL, which is recognized and handled as
numtxqs=1.
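
The argument handling can be sketched in userspace as follows.  This
is only a model of the checks described above: EXAMPLE_MAX_SQ is an
assumed placeholder (the real limit is VIRTIO_MAX_SQ, whose value is
not shown in this mail), and both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Assumed value for illustration only; stands in for VIRTIO_MAX_SQ. */
#define EXAMPLE_MAX_SQ	16

/*
 * Turn the SET_OWNER argument into a txq count.  Returns -EINVAL when
 * the request is out of range; 0 (what an old qemu passes) means 1.
 */
int normalize_numtxqs(int requested)
{
	if (requested < 0 || requested > EXAMPLE_MAX_SQ)
		return -EINVAL;
	if (requested == 0)
		return 1;
	return requested;
}

/* Total virtqueues: one RX queue followed by numtxqs TX queues. */
int total_vqs(int numtxqs)
{
	assert(numtxqs >= 1);
	return numtxqs + 1;
}
```

So an old qemu ends up with 2 vqs (1 RX + 1 TX), exactly as before
this patch.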

Besides changing handle_tx to use 'vq', this patch also
changes handle_rx to take vq as parameter.  The mq RX
patch requires this change, but till then it is consistent
(and less confusing) to make the interfaces for handling
rx and tx similar.

vhost thread handling for RX and TX is as follows.  The
first vhost thread handles RX traffic, while the remaining
threads handle TX.  The number of TX threads is <= #txqs, and
a thread handles more than one txq when #txqs is more than
MAX_TXQ_THREADS (4).  When the guest is started with >1 txqs
and there is only one stream of traffic from the guest,
that is recognized and handled such that vhost[0] processes
both RX and TX.  This can change dynamically.  vhost_poll
has a new element - find_vq(), which allows optimizing some
code for cases where numtxqs=1 or a packet on vhost[0]
needs processing.
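
The thread count and the single-stream optimization can be modelled
in userspace roughly as below.  This is a sketch, not the kernel
code: select_vq() is a hypothetical name, the caller supplies the
current time instead of reading jiffies, and the return value is a vq
index (0 meaning vhost[0]) rather than a vq pointer.  It follows the
logic of get_nvhosts() and __vhost_mq_find_vq() in this patch.

```c
#include <assert.h>

#define MAX_TXQ_THREADS		4
#define MAX_VHOST_THREADS	(MAX_TXQ_THREADS + 1)
#define RECENTLY_USED		5	/* ticks a txq still counts as used */

/* Number of vhost threads for nvqs virtqueues (1 RX + the TX queues). */
int get_nvhosts(int nvqs)
{
	if (nvqs <= 2)
		return nvqs - 1;
	return nvqs < MAX_VHOST_THREADS ? nvqs : MAX_VHOST_THREADS;
}

/*
 * Pick the vq whose worker should be woken for TX queue qnum
 * (1..numtxqs).  last_used[] holds one timestamp per txq.  If more
 * than one txq was used in the last RECENTLY_USED ticks, return qnum
 * itself; otherwise return 0 so that vhost[0] handles RX and TX.
 */
int select_vq(unsigned long *last_used, int numtxqs, int qnum,
	      unsigned long now)
{
	unsigned long oldest = now - RECENTLY_USED;
	int i, used = 0, target = 0;

	assert(qnum >= 1 && qnum <= numtxqs);

	for (i = 0; i < numtxqs; i++) {
		/* Wrap-safe "last_used[i] >= oldest", like time_after_eq() */
		if ((long)(last_used[i] - oldest) >= 0 && ++used > 1) {
			target = qnum;
			break;
		}
	}

	/* Record this use after the scan, as the patch does. */
	last_used[qnum - 1] = now;
	return target;
}
```

With two txqs and traffic on txq 1 only, select_vq() keeps returning
0.  Once both txqs have been used within RECENTLY_USED ticks, later
calls return the caller's own qnum, and the result drops back to 0
after the queues have been idle for longer than that.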

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/vhost/net.c   |  284 ++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.c |  275 ++++++++++++++++++++++++++++----------
 drivers/vhost/vhost.h |   42 +++++
 3 files changed, 430 insertions(+), 171 deletions(-)

diff -ruNp org/drivers/vhost/vhost.h new/drivers/vhost/vhost.h
--- org/drivers/vhost/vhost.h	2010-10-11 10:21:14.000000000 +0530
+++ new/drivers/vhost/vhost.h	2010-10-20 14:11:23.000000000 +0530
@@ -35,11 +35,13 @@ struct vhost_poll {
 	wait_queue_t              wait;
 	struct vhost_work	  work;
 	unsigned long		  mask;
-	struct vhost_dev	 *dev;
+	struct vhost_virtqueue	  *(*find_vq)(struct vhost_poll *poll);
+	struct vhost_virtqueue	  *vq;  /* points back to vq */
 };
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue *vq,
+		     int single_queue);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -108,6 +110,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	struct task_struct *worker; /* vhost for this vq, can be shared */
+	spinlock_t *work_lock;
+	struct list_head *work_list;
+	int qnum;		/* 0 for RX, 1 -> n-1 for TX */
 };
 
 struct vhost_dev {
@@ -119,15 +125,39 @@ struct vhost_dev {
 	struct mutex mutex;
 	unsigned acked_features;
 	struct vhost_virtqueue *vqs;
+	unsigned long *jiffies;
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
-	spinlock_t work_lock;
-	struct list_head work_list;
-	struct task_struct *worker;
+	spinlock_t *work_lock;
+	struct list_head *work_list;
 };
 
-long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
+/*
+ * Define maximum number of TX threads, and use that to have a maximum
+ * number of vhost threads to handle RX & TX. First thread handles RX.
+ * If guest is started with #txqs=1, only one vhost thread is started.
+ * Else, up to MAX_VHOST_THREADS are started where th[0] handles RX and
+ * the remaining handle TX. However, vhost_poll_queue has an optimization
+ * where th[0] is selected for both RX & TX if there is only one flow.
+ */
+#define MAX_TXQ_THREADS		4
+#define MAX_VHOST_THREADS	(MAX_TXQ_THREADS + 1)
+
+static inline int get_nvhosts(int nvqs)
+{
+	int num_vhosts = nvqs - 1;
+
+	if (nvqs > 2)
+		num_vhosts = min_t(int, nvqs, MAX_VHOST_THREADS);
+
+	return num_vhosts;
+}
+
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs);
+void vhost_free_vqs(struct vhost_dev *dev);
+long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs,
+		    int nvhosts);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
 void vhost_dev_cleanup(struct vhost_dev *);
diff -ruNp org/drivers/vhost/net.c new/drivers/vhost/net.c
--- org/drivers/vhost/net.c	2010-10-11 10:21:14.000000000 +0530
+++ new/drivers/vhost/net.c	2010-10-20 14:20:10.000000000 +0530
@@ -33,12 +33,6 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
-enum {
-	VHOST_NET_VQ_RX = 0,
-	VHOST_NET_VQ_TX = 1,
-	VHOST_NET_VQ_MAX = 2,
-};
-
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -47,12 +41,13 @@ enum vhost_net_poll_state {
 
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
-	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct vhost_virtqueue *vqs;
+	struct vhost_poll *poll;
+	struct socket **socks;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
-	enum vhost_net_poll_state tx_poll_state;
+	enum vhost_net_poll_state *tx_poll_state;
 };
 
 /* Pop first len bytes from iovec. Return number of segments used. */
@@ -92,28 +87,28 @@ static void copy_iovec_hdr(const struct 
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
+static void tx_poll_stop(struct vhost_net *net, int qnum)
 {
-	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+	if (likely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STARTED))
 		return;
-	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
-	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
+	vhost_poll_stop(&net->poll[qnum]);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
+static void tx_poll_start(struct vhost_net *net, struct socket *sock, int qnum)
 {
-	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
+	if (unlikely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STOPPED))
 		return;
-	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
-	net->tx_poll_state = VHOST_NET_POLL_STARTED;
+	vhost_poll_start(&net->poll[qnum], sock->file);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STARTED;
 }
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 	unsigned out, in, s;
 	int head;
 	struct msghdr msg = {
@@ -134,7 +129,7 @@ static void handle_tx(struct vhost_net *
 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 	if (wmem >= sock->sk->sk_sndbuf) {
 		mutex_lock(&vq->mutex);
-		tx_poll_start(net, sock);
+		tx_poll_start(net, sock, vq->qnum);
 		mutex_unlock(&vq->mutex);
 		return;
 	}
@@ -144,7 +139,7 @@ static void handle_tx(struct vhost_net *
 	vhost_disable_notify(vq);
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
-		tx_poll_stop(net);
+		tx_poll_stop(net, vq->qnum);
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
@@ -159,7 +154,7 @@ static void handle_tx(struct vhost_net *
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
+				tx_poll_start(net, sock, vq->qnum);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
@@ -189,7 +184,7 @@ static void handle_tx(struct vhost_net *
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			vhost_discard_vq_desc(vq, 1);
-			tx_poll_start(net, sock);
+			tx_poll_start(net, sock, vq->qnum);
 			break;
 		}
 		if (err != len)
@@ -282,9 +277,9 @@ err:
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_big(struct vhost_net *net)
+static void handle_rx_big(struct vhost_virtqueue *vq,
+			  struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned out, in, log, s;
 	int head;
 	struct vhost_log *vq_log;
@@ -393,9 +388,9 @@ static void handle_rx_big(struct vhost_n
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_mergeable(struct vhost_net *net)
+static void handle_rx_mergeable(struct vhost_virtqueue *vq,
+				struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned uninitialized_var(in), log;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -500,96 +495,184 @@ static void handle_rx_mergeable(struct v
 	unuse_mm(net->dev.mm);
 }
 
-static void handle_rx(struct vhost_net *net)
+static void handle_rx(struct vhost_virtqueue *vq)
 {
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
-		handle_rx_mergeable(net);
+		handle_rx_mergeable(vq, net);
 	else
-		handle_rx_big(net);
+		handle_rx_big(vq, net);
 }
 
 static void handle_tx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_tx(net);
+	handle_tx(vq);
 }
 
 static void handle_rx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_rx(net);
+	handle_rx(vq);
 }
 
 static void handle_tx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_TX].work);
-	handle_tx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_tx(vq);
 }
 
 static void handle_rx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_RX].work);
-	handle_rx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_rx(vq);
 }
 
-static int vhost_net_open(struct inode *inode, struct file *f)
+void vhost_free_vqs(struct vhost_dev *dev)
 {
-	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
-	struct vhost_dev *dev;
-	int r;
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
+
+	kfree(dev->work_list);
+	kfree(dev->work_lock);
+	kfree(dev->jiffies);
+	kfree(n->socks);
+	kfree(n->tx_poll_state);
+	kfree(n->poll);
+	kfree(n->vqs);
+
+	/*
+	 * Reset so that vhost_net_release (after vhost_dev_set_owner call)
+	 * will notice.
+	 */
+	n->vqs = NULL;
+	n->poll = NULL;
+	n->socks = NULL;
+	n->tx_poll_state = NULL;
+	dev->jiffies = NULL;
+	dev->work_lock = NULL;
+	dev->work_list = NULL;
+}
+
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs)
+{
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
+	int nvhosts;
+	int i, nvqs;
+	int ret;
+
+	if (numtxqs < 0 || numtxqs > VIRTIO_MAX_SQ)
+		return -EINVAL;
+
+	if (numtxqs == 0) {
+		/* Old qemu doesn't pass arguments to set_owner, use 1 txq */
+		numtxqs = 1;
+	}
+
+	/* Get total number of virtqueues */
+	nvqs = numtxqs + 1;
+
+	/* Get total number of vhost threads */
+	nvhosts = get_nvhosts(nvqs);
+
+	n->vqs = kmalloc(nvqs * sizeof(*n->vqs), GFP_KERNEL);
+	n->poll = kmalloc(nvqs * sizeof(*n->poll), GFP_KERNEL);
+	n->socks = kmalloc(nvqs * sizeof(*n->socks), GFP_KERNEL);
+	n->tx_poll_state = kmalloc(nvqs * sizeof(*n->tx_poll_state),
+				   GFP_KERNEL);
+	dev->jiffies = kzalloc(numtxqs * sizeof(*dev->jiffies), GFP_KERNEL);
+	dev->work_lock = kmalloc(nvhosts * sizeof(*dev->work_lock),
+				 GFP_KERNEL);
+	dev->work_list = kmalloc(nvhosts * sizeof(*dev->work_list),
+				 GFP_KERNEL);
+
+	if (!n->vqs || !n->poll || !n->socks || !n->tx_poll_state ||
+	    !dev->jiffies || !dev->work_lock || !dev->work_list) {
+		ret = -ENOMEM;
+		goto err;
+	}
 
-	if (!n)
-		return -ENOMEM;
+	/* 1 RX, followed by 'numtxqs' TX queues */
+	n->vqs[0].handle_kick = handle_rx_kick;
 
-	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
-	if (r < 0) {
-		kfree(n);
-		return r;
-	}
+	for (i = 1; i < nvqs; i++)
+		n->vqs[i].handle_kick = handle_tx_kick;
+
+	ret = vhost_dev_init(dev, n->vqs, nvqs, nvhosts);
+	if (ret < 0)
+		goto err;
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
-	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	vhost_poll_init(&n->poll[0], handle_rx_net, POLLIN, &n->vqs[0], 1);
 
-	f->private_data = n;
+	for (i = 1; i < nvqs; i++) {
+		vhost_poll_init(&n->poll[i], handle_tx_net, POLLOUT,
+				&n->vqs[i], (nvqs == 2));
+		n->tx_poll_state[i] = VHOST_NET_POLL_DISABLED;
+	}
 
 	return 0;
+
+err:
+	/* Free all pointers that may have been allocated */
+	vhost_free_vqs(dev);
+
+	return ret;
+}
+
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int ret = -ENOMEM;
+
+	if (n) {
+		struct vhost_dev *dev = &n->dev;
+
+		f->private_data = n;
+		mutex_init(&dev->mutex);
+
+		/* Defer all other initialization till user does SET_OWNER */
+		ret = 0;
+	}
+
+	return ret;
 }
 
 static void vhost_net_disable_vq(struct vhost_net *n,
 				 struct vhost_virtqueue *vq)
 {
+	int qnum = vq->qnum;
+
 	if (!vq->private_data)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		tx_poll_stop(n);
-		n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	if (qnum) {	/* TX */
+		tx_poll_stop(n, qnum);
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_DISABLED;
 	} else
-		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+		vhost_poll_stop(&n->poll[qnum]);
 }
 
 static void vhost_net_enable_vq(struct vhost_net *n,
 				struct vhost_virtqueue *vq)
 {
 	struct socket *sock = vq->private_data;
+	int qnum = vq->qnum;
+
 	if (!sock)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		n->tx_poll_state = VHOST_NET_POLL_STOPPED;
-		tx_poll_start(n, sock);
+
+	if (qnum) {	/* TX */
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
+		tx_poll_start(n, sock, qnum);
 	} else
-		vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+		vhost_poll_start(&n->poll[qnum], sock->file);
 }
 
 static struct socket *vhost_net_stop_vq(struct vhost_net *n,
@@ -605,11 +688,12 @@ static struct socket *vhost_net_stop_vq(
 	return sock;
 }
 
-static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
-			   struct socket **rx_sock)
+static void vhost_net_stop(struct vhost_net *n)
 {
-	*tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
-	*rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		n->socks[i] = vhost_net_stop_vq(n, &n->vqs[i]);
 }
 
 static void vhost_net_flush_vq(struct vhost_net *n, int index)
@@ -620,26 +704,33 @@ static void vhost_net_flush_vq(struct vh
 
 static void vhost_net_flush(struct vhost_net *n)
 {
-	vhost_net_flush_vq(n, VHOST_NET_VQ_TX);
-	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		vhost_net_flush_vq(n, i);
 }
 
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
-	struct socket *tx_sock;
-	struct socket *rx_sock;
+	struct vhost_dev *dev = &n->dev;
+	int i;
 
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n);
 	vhost_net_flush(n);
-	vhost_dev_cleanup(&n->dev);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+	vhost_dev_cleanup(dev);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (n->socks[i])
+			fput(n->socks[i]->file);
+
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+
+	/* Free all old pointers */
+	vhost_free_vqs(dev);
+
 	kfree(n);
 	return 0;
 }
@@ -717,7 +808,7 @@ static long vhost_net_set_backend(struct
 	if (r)
 		goto err;
 
-	if (index >= VHOST_NET_VQ_MAX) {
+	if (index >= n->dev.nvqs) {
 		r = -ENOBUFS;
 		goto err;
 	}
@@ -738,9 +829,9 @@ static long vhost_net_set_backend(struct
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock != oldsock) {
-                vhost_net_disable_vq(n, vq);
-                rcu_assign_pointer(vq->private_data, sock);
-                vhost_net_enable_vq(n, vq);
+		vhost_net_disable_vq(n, vq);
+		rcu_assign_pointer(vq->private_data, sock);
+		vhost_net_enable_vq(n, vq);
 	}
 
 	mutex_unlock(&vq->mutex);
@@ -762,22 +853,25 @@ err:
 
 static long vhost_net_reset_owner(struct vhost_net *n)
 {
-	struct socket *tx_sock = NULL;
-	struct socket *rx_sock = NULL;
 	long err;
+	int i;
+
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
-	if (err)
-		goto done;
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	if (err) {
+		mutex_unlock(&n->dev.mutex);
+		return err;
+	}
+
+	vhost_net_stop(n);
 	vhost_net_flush(n);
 	err = vhost_dev_reset_owner(&n->dev);
-done:
 	mutex_unlock(&n->dev.mutex);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (n->socks[i])
+			fput(n->socks[i]->file);
+
 	return err;
 }
 
@@ -806,7 +900,7 @@ static int vhost_net_set_features(struct
 	}
 	n->dev.acked_features = features;
 	smp_wmb();
-	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+	for (i = 0; i < n->dev.nvqs; ++i) {
 		mutex_lock(&n->vqs[i].mutex);
 		n->vqs[i].vhost_hlen = vhost_hlen;
 		n->vqs[i].sock_hlen = sock_hlen;
diff -ruNp org/drivers/vhost/vhost.c new/drivers/vhost/vhost.c
--- org/drivers/vhost/vhost.c	2010-10-11 10:21:14.000000000 +0530
+++ new/drivers/vhost/vhost.c	2010-10-20 14:20:04.000000000 +0530
@@ -69,16 +69,70 @@ static void vhost_work_init(struct vhost
 	work->queue_seq = work->done_seq = 0;
 }
 
+/*
+ * __vhost_sq_find_vq: This is the poll->find_vq() handler for cases:
+ *	- #numtxqs == 1; or
+ *	- this is an RX vq
+ */
+static struct vhost_virtqueue *__vhost_sq_find_vq(struct vhost_poll *poll)
+{
+	return poll->vq;
+}
+
+/* Define how recently a txq was used, beyond this it is considered unused */
+#define RECENTLY_USED  5
+
+/*
+ * __vhost_mq_find_vq: This is the poll->find_vq() handler for cases:
+ *	- #numtxqs > 1, and
+ *	- this is a TX vq
+ *
+ * Algorithm for selecting vq:
+ *
+ *	Condition:					Return:
+ *	If all txqs unused				vq[0]
+ *	If one txq used, and new txq is same		vq[0]
+ *	If one txq used, and new txq is different	vq[vq->qnum]
+ *	If > 1 txqs used				vq[vq->qnum]
+ * Where "used" means the txq was used in the last RECENTLY_USED jiffies.
+ *
+ * Note: locking is not required as an update race will only result in
+ * a different worker being woken up.
+ */
+static struct vhost_virtqueue *__vhost_mq_find_vq(struct vhost_poll *poll)
+{
+	struct vhost_dev *dev = poll->vq->dev;
+	struct vhost_virtqueue *vq = &dev->vqs[0];
+	unsigned long max_time = jiffies - RECENTLY_USED;
+	unsigned long *table = dev->jiffies;
+	int i, used = 0;
+
+	for (i = 0; i < dev->nvqs - 1; i++) {
+		if (time_after_eq(table[i], max_time) && ++used > 1) {
+			vq = poll->vq;
+			break;
+		}
+	}
+
+	table[poll->vq->qnum - 1] = jiffies;
+	return vq;
+}
+
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq,
+		     int single_queue)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->vq = vq;
 
 	vhost_work_init(&poll->work, fn);
+	if (single_queue)
+		poll->find_vq = __vhost_sq_find_vq;
+	else
+		poll->find_vq = __vhost_mq_find_vq;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -98,25 +152,25 @@ void vhost_poll_stop(struct vhost_poll *
 	remove_wait_queue(poll->wqh, &poll->wait);
 }
 
-static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+static void vhost_work_flush(struct vhost_poll *poll, struct vhost_work *work)
 {
 	unsigned seq;
 	int left;
 	int flushing;
 
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	seq = work->queue_seq;
 	work->flushing++;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	wait_event(work->done, ({
-		   spin_lock_irq(&dev->work_lock);
+		   spin_lock_irq(poll->vq->work_lock);
 		   left = seq - work->done_seq <= 0;
-		   spin_unlock_irq(&dev->work_lock);
+		   spin_unlock_irq(poll->vq->work_lock);
 		   left;
 	}));
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	flushing = --work->flushing;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	BUG_ON(flushing < 0);
 }
 
@@ -124,26 +178,28 @@ static void vhost_work_flush(struct vhos
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_flush(poll, &poll->work);
 }
 
-static inline void vhost_work_queue(struct vhost_dev *dev,
+static inline void vhost_work_queue(struct vhost_virtqueue *vq,
 				    struct vhost_work *work)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&dev->work_lock, flags);
+	spin_lock_irqsave(vq->work_lock, flags);
 	if (list_empty(&work->node)) {
-		list_add_tail(&work->node, &dev->work_list);
+		list_add_tail(&work->node, vq->work_list);
 		work->queue_seq++;
-		wake_up_process(dev->worker);
+		wake_up_process(vq->worker);
 	}
-	spin_unlock_irqrestore(&dev->work_lock, flags);
+	spin_unlock_irqrestore(vq->work_lock, flags);
 }
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	struct vhost_virtqueue *vq = poll->find_vq(poll);
+
+	vhost_work_queue(vq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -174,7 +230,7 @@ static void vhost_vq_reset(struct vhost_
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_virtqueue *vq = data;
 	struct vhost_work *work = NULL;
 	unsigned uninitialized_var(seq);
 
@@ -182,7 +238,7 @@ static int vhost_worker(void *data)
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		spin_lock_irq(&dev->work_lock);
+		spin_lock_irq(vq->work_lock);
 		if (work) {
 			work->done_seq = seq;
 			if (work->flushing)
@@ -190,18 +246,18 @@ static int vhost_worker(void *data)
 		}
 
 		if (kthread_should_stop()) {
-			spin_unlock_irq(&dev->work_lock);
+			spin_unlock_irq(vq->work_lock);
 			__set_current_state(TASK_RUNNING);
 			return 0;
 		}
-		if (!list_empty(&dev->work_list)) {
-			work = list_first_entry(&dev->work_list,
+		if (!list_empty(vq->work_list)) {
+			work = list_first_entry(vq->work_list,
 						struct vhost_work, node);
 			list_del_init(&work->node);
 			seq = work->queue_seq;
 		} else
 			work = NULL;
-		spin_unlock_irq(&dev->work_lock);
+		spin_unlock_irq(vq->work_lock);
 
 		if (work) {
 			__set_current_state(TASK_RUNNING);
@@ -251,8 +307,19 @@ static void vhost_dev_free_iovecs(struct
 	}
 }
 
+/* Get index of an existing thread that will handle this txq */
+static int vhost_get_buddy_thread(int index, int nvhosts)
+{
+	int buddy = 0;
+
+	if (nvhosts > 1)
+		buddy = (index - 1) % MAX_TXQ_THREADS + 1;
+
+	return buddy;
+}
+
 long vhost_dev_init(struct vhost_dev *dev,
-		    struct vhost_virtqueue *vqs, int nvqs)
+		    struct vhost_virtqueue *vqs, int nvqs, int nvhosts)
 {
 	int i;
 
@@ -263,20 +330,37 @@ long vhost_dev_init(struct vhost_dev *de
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
-	spin_lock_init(&dev->work_lock);
-	INIT_LIST_HEAD(&dev->work_list);
-	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].log = NULL;
-		dev->vqs[i].indirect = NULL;
-		dev->vqs[i].heads = NULL;
-		dev->vqs[i].dev = dev;
-		mutex_init(&dev->vqs[i].mutex);
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+		int single_queue = (!i || dev->nvqs == 2);
+
+		if (i < nvhosts) {
+			spin_lock_init(&dev->work_lock[i]);
+			INIT_LIST_HEAD(&dev->work_list[i]);
+
+			vq->work_lock = &dev->work_lock[i];
+			vq->work_list = &dev->work_list[i];
+		} else {
+			/* Share work with another thread */
+			int j = vhost_get_buddy_thread(i, nvhosts);
+
+			vq->work_lock = &dev->work_lock[j];
+			vq->work_list = &dev->work_list[j];
+		}
+
+		vq->worker = NULL;
+		vq->qnum = i;
+		vq->log = NULL;
+		vq->indirect = NULL;
+		vq->heads = NULL;
+		vq->dev = dev;
+		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, dev->vqs + i);
-		if (dev->vqs[i].handle_kick)
-			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick, POLLIN, dev);
+		if (vq->handle_kick)
+			vhost_poll_init(&vq->poll,
+					vq->handle_kick, POLLIN, vq,
+					single_queue);
 	}
 
 	return 0;
@@ -290,61 +374,116 @@ long vhost_dev_check_owner(struct vhost_
 }
 
 struct vhost_attach_cgroups_struct {
-        struct vhost_work work;
-        struct task_struct *owner;
-        int ret;
+	struct vhost_work work;
+	struct task_struct *owner;
+	int ret;
 };
 
 static void vhost_attach_cgroups_work(struct vhost_work *work)
 {
-        struct vhost_attach_cgroups_struct *s;
-        s = container_of(work, struct vhost_attach_cgroups_struct, work);
-        s->ret = cgroup_attach_task_all(s->owner, current);
+	struct vhost_attach_cgroups_struct *s;
+	s = container_of(work, struct vhost_attach_cgroups_struct, work);
+	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-        struct vhost_attach_cgroups_struct attach;
-        attach.owner = current;
-        vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-        vhost_work_queue(dev, &attach.work);
-        vhost_work_flush(dev, &attach.work);
-        return attach.ret;
+static int vhost_attach_cgroups(struct vhost_virtqueue *vq)
+{
+	struct vhost_attach_cgroups_struct attach;
+	attach.owner = current;
+	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_queue(vq, &attach.work);
+	vhost_work_flush(&vq->poll, &attach.work);
+	return attach.ret;
+}
+
+static void __vhost_stop_workers(struct vhost_dev *dev, int nvhosts)
+{
+	int i;
+
+	for (i = 0; i < nvhosts; i++) {
+		WARN_ON(!list_empty(dev->vqs[i].work_list));
+		if (dev->vqs[i].worker) {
+			kthread_stop(dev->vqs[i].worker);
+			dev->vqs[i].worker = NULL;
+		}
+	}
+}
+
+static void vhost_stop_workers(struct vhost_dev *dev)
+{
+	__vhost_stop_workers(dev, get_nvhosts(dev->nvqs));
+}
+
+static int vhost_start_workers(struct vhost_dev *dev)
+{
+	int nvhosts = get_nvhosts(dev->nvqs);
+	int i, err;
+
+	for (i = 0; i < dev->nvqs; ++i) {
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+
+		if (i < nvhosts) {
+			/* Start a new thread */
+			vq->worker = kthread_create(vhost_worker, vq,
+						    "vhost-%d-%d",
+						    current->pid, i);
+			if (IS_ERR(vq->worker)) {
+				i--;	/* no thread to clean at this index */
+				err = PTR_ERR(vq->worker);
+				goto err;
+			}
+
+			wake_up_process(vq->worker);
+
+			/* avoid contributing to loadavg */
+			err = vhost_attach_cgroups(vq);
+			if (err)
+				goto err;
+		} else {
+			/* Share work with an existing thread */
+			int j = vhost_get_buddy_thread(i, nvhosts);
+			struct vhost_virtqueue *share_vq = &dev->vqs[j];
+
+			vq->worker = share_vq->worker;
+		}
+	}
+	return 0;
+
+err:
+	__vhost_stop_workers(dev, i);
+	return err;
 }
 
 /* Caller should have device mutex */
-static long vhost_dev_set_owner(struct vhost_dev *dev)
+static long vhost_dev_set_owner(struct vhost_dev *dev, int numtxqs)
 {
-	struct task_struct *worker;
 	int err;
 	/* Is there an owner already? */
 	if (dev->mm) {
 		err = -EBUSY;
 		goto err_mm;
 	}
+
+	err = vhost_setup_vqs(dev, numtxqs);
+	if (err)
+		goto err_mm;
+
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
-	}
-
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
 
-	err = vhost_attach_cgroups(dev);
+	/* Start threads */
+	err = vhost_start_workers(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_worker;
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovec;
 
 	return 0;
-err_cgroup:
-	kthread_stop(worker);
-	dev->worker = NULL;
+err_iovec:
+	vhost_stop_workers(dev);
+	vhost_free_vqs(dev);
 err_worker:
 	if (dev->mm)
 		mmput(dev->mm);
@@ -405,11 +544,7 @@ void vhost_dev_cleanup(struct vhost_dev 
 		mmput(dev->mm);
 	dev->mm = NULL;
 
-	WARN_ON(!list_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+	vhost_stop_workers(dev);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -760,7 +895,7 @@ long vhost_dev_ioctl(struct vhost_dev *d
 
 	/* If you are not the owner, you can become one */
 	if (ioctl == VHOST_SET_OWNER) {
-		r = vhost_dev_set_owner(d);
+		r = vhost_dev_set_owner(d, arg);
 		goto done;
 	}
 
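Note for readers: the queue-selection rule in __vhost_mq_find_vq() above
can be modelled in userspace. The sketch below is not part of the posted
patch. It assumes a fixed NUM_TXQS, a plain tick counter in place of
jiffies, and the hypothetical names select_vq() and tick_after_eq(). It
mirrors the loop as posted: the table is scanned before the caller's own
entry is refreshed, so the first request from a second queue still returns
index 0, and per-queue workers are chosen only once two entries are recent.

```c
#include <assert.h>

#define RECENTLY_USED 5   /* same window as the patch, counted in ticks */
#define NUM_TXQS      4   /* hypothetical number of TX queues */

/* Wrap-safe "a is at or after b", modelled on the kernel's time_after_eq(). */
static int tick_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/*
 * last_used[i] is the tick at which TX queue i + 1 last requested work.
 * Returns the index of the vq whose worker should be woken: 0 unless more
 * than one TX queue was used within the last RECENTLY_USED ticks, in which
 * case the caller's own queue number is returned.
 */
static int select_vq(unsigned long *last_used, int qnum, unsigned long now)
{
    unsigned long oldest = now - RECENTLY_USED;
    int i, used = 0, pick = 0;

    assert(qnum >= 1 && qnum <= NUM_TXQS);

    for (i = 0; i < NUM_TXQS; i++) {
        if (tick_after_eq(last_used[i], oldest) && ++used > 1) {
            pick = qnum;
            break;
        }
    }

    /* Record this use after the scan, as the patch does. */
    last_used[qnum - 1] = now;
    return pick;
}
```

With a zeroed table and ticks starting near 1000, repeated calls for queue 1
keep returning 0. Once queue 2 has also been used within the window, both
queues get their own index back, and after a long idle gap the next caller
is sent to index 0 again.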

^ permalink raw reply

* [v3 RFC PATCH 4/4] qemu changes
From: Krishna Kumar @ 2010-10-20  8:55 UTC (permalink / raw)
  To: rusty, davem, mst
  Cc: eric.dumazet, kvm, netdev, arnd, avi, anthony, Krishna Kumar
In-Reply-To: <20101020085452.15579.76002.sendpatchset@krkumar2.in.ibm.com>

Changes in qemu to support mq TX.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 hw/vhost.c      |    7 ++++--
 hw/vhost.h      |    2 -
 hw/vhost_net.c  |   16 +++++++++----
 hw/vhost_net.h  |    2 -
 hw/virtio-net.c |   53 ++++++++++++++++++++++++++++++++--------------
 hw/virtio-net.h |    2 +
 hw/virtio-pci.c |    2 +
 net.c           |   17 ++++++++++++++
 net.h           |    1 
 net/tap.c       |   34 ++++++++++++++++++++++++++---
 10 files changed, 107 insertions(+), 29 deletions(-)

diff -ruNp org3/hw/vhost.c new3/hw/vhost.c
--- org3/hw/vhost.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/hw/vhost.c	2010-10-20 12:44:21.000000000 +0530
@@ -580,7 +580,7 @@ static void vhost_virtqueue_cleanup(stru
                               0, virtio_queue_get_desc_size(vdev, idx));
 }
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int numtxqs)
 {
     uint64_t features;
     int r;
@@ -592,11 +592,14 @@ int vhost_dev_init(struct vhost_dev *hde
             return -errno;
         }
     }
-    r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+
+    r = ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);
     if (r < 0) {
         goto fail;
     }
 
+    hdev->nvqs = numtxqs + 1;
+
     r = ioctl(hdev->control, VHOST_GET_FEATURES, &features);
     if (r < 0) {
         goto fail;
diff -ruNp org3/hw/vhost.h new3/hw/vhost.h
--- org3/hw/vhost.h	2010-07-01 11:42:09.000000000 +0530
+++ new3/hw/vhost.h	2010-10-20 12:47:10.000000000 +0530
@@ -40,7 +40,7 @@ struct vhost_dev {
     unsigned long long log_size;
 };
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int numtxqs);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff -ruNp org3/hw/vhost_net.c new3/hw/vhost_net.c
--- org3/hw/vhost_net.c	2010-09-28 10:07:31.000000000 +0530
+++ new3/hw/vhost_net.c	2010-10-19 19:46:52.000000000 +0530
@@ -36,7 +36,8 @@
 
 struct vhost_net {
     struct vhost_dev dev;
-    struct vhost_virtqueue vqs[2];
+    struct vhost_virtqueue *vqs;
+    int nvqs;
     int backend;
     VLANClientState *vc;
 };
@@ -81,7 +82,8 @@ static int vhost_net_get_fd(VLANClientSt
     }
 }
 
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int numtxqs)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,10 +100,14 @@ struct vhost_net *vhost_net_init(VLANCli
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;
 
-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, numtxqs);
     if (r < 0) {
         goto fail;
     }
+
+    net->nvqs = numtxqs + 1;
+    net->vqs = qemu_malloc(net->nvqs * (sizeof *net->vqs));
+
     if (!tap_has_vnet_hdr_len(backend,
                               sizeof(struct virtio_net_hdr_mrg_rxbuf))) {
         net->dev.features &= ~(1 << VIRTIO_NET_F_MRG_RXBUF);
@@ -131,7 +137,6 @@ int vhost_net_start(struct vhost_net *ne
                              sizeof(struct virtio_net_hdr_mrg_rxbuf));
     }
 
-    net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
     r = vhost_dev_start(&net->dev, dev);
     if (r < 0) {
@@ -188,7 +193,8 @@ void vhost_net_cleanup(struct vhost_net 
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int nvqs)
 {
 	return NULL;
 }
diff -ruNp org3/hw/vhost_net.h new3/hw/vhost_net.h
--- org3/hw/vhost_net.h	2010-07-01 11:42:09.000000000 +0530
+++ new3/hw/vhost_net.h	2010-10-19 19:46:52.000000000 +0530
@@ -6,7 +6,7 @@
 struct vhost_net;
 typedef struct vhost_net VHostNetState;
 
-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, int nvqs);
 
 int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
 void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
diff -ruNp org3/hw/virtio-net.c new3/hw/virtio-net.c
--- org3/hw/virtio-net.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/hw/virtio-net.c	2010-10-19 21:02:33.000000000 +0530
@@ -32,7 +32,7 @@ typedef struct VirtIONet
     uint8_t mac[ETH_ALEN];
     uint16_t status;
     VirtQueue *rx_vq;
-    VirtQueue *tx_vq;
+    VirtQueue **tx_vq;
     VirtQueue *ctrl_vq;
     NICState *nic;
     QEMUTimer *tx_timer;
@@ -65,6 +65,7 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    uint16_t numtxqs;
 } VirtIONet;
 
 /* TODO
@@ -82,6 +83,7 @@ static void virtio_net_get_config(VirtIO
     struct virtio_net_config netcfg;
 
     netcfg.status = n->status;
+    netcfg.numtxqs = n->numtxqs;
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -196,6 +198,8 @@ static uint32_t virtio_net_get_features(
     VirtIONet *n = to_virtio_net(vdev);
 
     features |= (1 << VIRTIO_NET_F_MAC);
+    if (n->numtxqs > 1)
+        features |= (1 << VIRTIO_NET_F_NUMTXQS);
 
     if (peer_has_vnet_hdr(n)) {
         tap_using_vnet_hdr(n->nic->nc.peer, 1);
@@ -659,13 +663,16 @@ static void virtio_net_tx_complete(VLANC
 {
     VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
 
-    virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
-    virtio_notify(&n->vdev, n->tx_vq);
+    /*
+     * If this function executes, we are single TX and hence use only txq[0]
+     */
+    virtqueue_push(n->tx_vq[0], &n->async_tx.elem, n->async_tx.len);
+    virtio_notify(&n->vdev, n->tx_vq[0]);
 
     n->async_tx.elem.out_num = n->async_tx.len = 0;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 /* TX */
@@ -679,7 +686,7 @@ static int32_t virtio_net_flush_tx(VirtI
     }
 
     if (n->async_tx.elem.out_num) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+        virtio_queue_set_notification(n->tx_vq[0], 0);
         return num_packets;
     }
 
@@ -714,7 +721,7 @@ static int32_t virtio_net_flush_tx(VirtI
         ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
                                       virtio_net_tx_complete);
         if (ret == 0) {
-            virtio_queue_set_notification(n->tx_vq, 0);
+            virtio_queue_set_notification(n->tx_vq[0], 0);
             n->async_tx.elem = elem;
             n->async_tx.len  = len;
             return -EBUSY;
@@ -771,8 +778,8 @@ static void virtio_net_tx_timer(void *op
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 static void virtio_net_tx_bh(void *opaque)
@@ -786,7 +793,7 @@ static void virtio_net_tx_bh(void *opaqu
     if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)))
         return;
 
-    ret = virtio_net_flush_tx(n, n->tx_vq);
+    ret = virtio_net_flush_tx(n, n->tx_vq[0]);
     if (ret == -EBUSY) {
         return; /* Notification re-enable handled by tx_complete */
     }
@@ -802,9 +809,9 @@ static void virtio_net_tx_bh(void *opaqu
     /* If less than a full burst, re-enable notification and flush
      * anything that may have come in while we weren't looking.  If
      * we find something, assume the guest is still active and reschedule */
-    virtio_queue_set_notification(n->tx_vq, 1);
-    if (virtio_net_flush_tx(n, n->tx_vq) > 0) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    if (virtio_net_flush_tx(n, n->tx_vq[0]) > 0) {
+        virtio_queue_set_notification(n->tx_vq[0], 0);
         qemu_bh_schedule(n->tx_bh);
         n->tx_waiting = 1;
     }
@@ -820,6 +827,7 @@ static void virtio_net_save(QEMUFile *f,
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
+    qemu_put_be16(f, n->numtxqs);
     qemu_put_be32(f, n->tx_waiting);
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
@@ -849,6 +857,7 @@ static int virtio_net_load(QEMUFile *f, 
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, ETH_ALEN);
+    n->numtxqs = qemu_get_be16(f);
     n->tx_waiting = qemu_get_be32(f);
     n->mergeable_rx_bufs = qemu_get_be32(f);
 
@@ -966,11 +975,14 @@ VirtIODevice *virtio_net_init(DeviceStat
                               virtio_net_conf *net)
 {
     VirtIONet *n;
+    int i;
 
     n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
                                         sizeof(struct virtio_net_config),
                                         sizeof(VirtIONet));
 
+    n->numtxqs = conf->peer->numtxqs;
+
     n->vdev.get_config = virtio_net_get_config;
     n->vdev.set_config = virtio_net_set_config;
     n->vdev.get_features = virtio_net_get_features;
@@ -978,8 +990,8 @@ VirtIODevice *virtio_net_init(DeviceStat
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
     n->vdev.set_status = virtio_net_set_status;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
 
+    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
     if (net->tx && strcmp(net->tx, "timer") && strcmp(net->tx, "bh")) {
         fprintf(stderr, "virtio-net: "
                 "Unknown option tx=%s, valid options: \"timer\" \"bh\"\n",
@@ -987,12 +999,21 @@ VirtIODevice *virtio_net_init(DeviceStat
         fprintf(stderr, "Defaulting to \"bh\"\n");
     }
 
+    /* Allocate per tx vq's */
+    n->tx_vq = qemu_mallocz(n->numtxqs * sizeof(*n->tx_vq));
+    for (i = 0; i < n->numtxqs; i++) {
+        if (net->tx && !strcmp(net->tx, "timer")) {
+            n->tx_vq[i] = virtio_add_queue(&n->vdev, 256,
+                                           virtio_net_handle_tx_timer);
+        } else {
+            n->tx_vq[i] = virtio_add_queue(&n->vdev, 256,
+                                           virtio_net_handle_tx_bh);
+        }
+    }
     if (net->tx && !strcmp(net->tx, "timer")) {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_timer);
         n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
         n->tx_timeout = net->txtimer;
     } else {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_bh);
         n->tx_bh = qemu_bh_new(virtio_net_tx_bh, n);
     }
     n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
diff -ruNp org3/hw/virtio-net.h new3/hw/virtio-net.h
--- org3/hw/virtio-net.h	2010-09-28 10:07:31.000000000 +0530
+++ new3/hw/virtio-net.h	2010-10-19 19:46:52.000000000 +0530
@@ -44,6 +44,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS    21      /* Supports multiple TX queues */
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -72,6 +73,7 @@ struct virtio_net_config
     uint8_t mac[ETH_ALEN];
     /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
     uint16_t status;
+    uint16_t numtxqs;	/* number of transmit queues */
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org3/hw/virtio-pci.c new3/hw/virtio-pci.c
--- org3/hw/virtio-pci.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/hw/virtio-pci.c	2010-10-19 19:46:52.000000000 +0530
@@ -99,6 +99,7 @@ typedef struct {
     uint32_t addr;
     uint32_t class_code;
     uint32_t nvectors;
+    uint32_t mq;
     BlockConf block;
     NICConf nic;
     uint32_t host_features;
@@ -788,6 +789,7 @@ static PCIDeviceInfo virtio_info[] = {
         .romfile    = "pxe-virtio.bin",
         .qdev.props = (Property[]) {
             DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
+	    DEFINE_PROP_UINT32("mq", VirtIOPCIProxy, mq, 1),
             DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
             DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
             DEFINE_PROP_UINT32("x-txtimer", VirtIOPCIProxy,
diff -ruNp org3/net/tap.c new3/net/tap.c
--- org3/net/tap.c	2010-09-28 10:07:31.000000000 +0530
+++ new3/net/tap.c	2010-10-20 12:39:56.000000000 +0530
@@ -320,13 +320,14 @@ static NetClientInfo net_tap_info = {
 static TAPState *net_tap_fd_init(VLANState *vlan,
                                  const char *model,
                                  const char *name,
-                                 int fd,
+                                 int fd, int numtxqs,
                                  int vnet_hdr)
 {
     VLANClientState *nc;
     TAPState *s;
 
     nc = qemu_new_net_client(&net_tap_info, vlan, NULL, model, name);
+    nc->numtxqs = numtxqs;
 
     s = DO_UPCAST(TAPState, nc, nc);
 
@@ -424,6 +425,27 @@ int net_init_tap(QemuOpts *opts, Monitor
 {
     TAPState *s;
     int fd, vnet_hdr = 0;
+    int vhost;
+    int numtxqs = 1;
+
+    vhost = qemu_opt_get_bool(opts, "vhost", 0);
+
+    /*
+     * We support multiple tx queues if:
+     *      1. smp > 1
+     *      2. vhost=on
+     *      3. mq=on
+     * In this case, #txqueues = #cpus. This value can be changed by
+     * using the "numtxqs" option.
+     */
+    if (vhost && smp_cpus > 1) {
+        if (qemu_opt_get_bool(opts, "mq", 0)) {
+#define VIRTIO_MAX_TXQS         32
+            int dflt = MIN(smp_cpus, VIRTIO_MAX_TXQS);
+
+            numtxqs = qemu_opt_get_number(opts, "numtxqs", dflt);
+        }
+    }
 
     if (qemu_opt_get(opts, "fd")) {
         if (qemu_opt_get(opts, "ifname") ||
@@ -457,7 +479,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    s = net_tap_fd_init(vlan, "tap", name, fd, vnet_hdr);
+    s = net_tap_fd_init(vlan, "tap", name, fd, numtxqs, vnet_hdr);
     if (!s) {
         close(fd);
         return -1;
@@ -486,7 +508,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
+    if (vhost) {
         int vhostfd, r;
         if (qemu_opt_get(opts, "vhostfd")) {
             r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
@@ -497,9 +519,13 @@ int net_init_tap(QemuOpts *opts, Monitor
         } else {
             vhostfd = -1;
         }
-        s->vhost_net = vhost_net_init(&s->nc, vhostfd);
+        s->vhost_net = vhost_net_init(&s->nc, vhostfd, numtxqs);
         if (!s->vhost_net) {
             error_report("vhost-net requested but could not be initialized");
+            if (numtxqs > 1) {
+                error_report("Need vhost support for numtxqs > 1, exiting...");
+                exit(1);
+            }
             return -1;
         }
     } else if (qemu_opt_get(opts, "vhostfd")) {
diff -ruNp org3/net.c new3/net.c
--- org3/net.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/net.c	2010-10-19 19:46:52.000000000 +0530
@@ -849,6 +849,15 @@ static int net_init_nic(QemuOpts *opts,
         return -1;
     }
 
+    if (nd->netdev->numtxqs > 1 && nd->nvectors == DEV_NVECTORS_UNSPECIFIED) {
+        /*
+         * User specified mq for guest, but no "vectors=", tune
+         * it automatically to 'numtxqs' TX + 1 RX + 1 controlq.
+         */
+        nd->nvectors = nd->netdev->numtxqs + 1 + 1;
+        monitor_printf(mon, "nvectors tuned to %d\n", nd->nvectors);
+    }
+
     nd->used = 1;
     nb_nics++;
 
@@ -992,6 +1001,14 @@ static const struct {
             },
 #ifndef _WIN32
             {
+                .name = "mq",
+                .type = QEMU_OPT_BOOL,
+                .help = "enable multiqueue on network i/f",
+            }, {
+                .name = "numtxqs",
+                .type = QEMU_OPT_NUMBER,
+                .help = "optional number of TX queues, if mq is enabled",
+            }, {
                 .name = "fd",
                 .type = QEMU_OPT_STRING,
                 .help = "file descriptor of an already opened tap",
diff -ruNp org3/net.h new3/net.h
--- org3/net.h	2010-10-19 19:38:11.000000000 +0530
+++ new3/net.h	2010-10-19 19:46:52.000000000 +0530
@@ -62,6 +62,7 @@ struct VLANClientState {
     struct VLANState *vlan;
     VLANClientState *peer;
     NetQueue *send_queue;
+    int numtxqs;
     char *model;
     char *name;
     char info_str[256];
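
Note for readers: the defaults introduced by this patch (the TX queue count
in net_init_tap() and the automatic "vectors" value in net_init_nic()) come
down to a few lines of arithmetic. The sketch below is not part of the
posted patch. pick_numtxqs() and auto_nvectors() are hypothetical helper
names, and treating requested <= 0 as "no numtxqs option given" is an
assumption made for illustration. As in the patch, a user-supplied value is
not capped; only the default is.

```c
#include <assert.h>

#define VIRTIO_MAX_TXQS 32   /* cap the patch applies to the default */

/*
 * Hypothetical helper: number of TX queues the tap backend would settle on.
 * Multiple queues need vhost=on, mq=on and more than one guest CPU;
 * otherwise the device stays single-queue.
 */
static int pick_numtxqs(int vhost, int mq, int smp_cpus, int requested)
{
    int dflt;

    if (!vhost || !mq || smp_cpus <= 1)
        return 1;

    /* Default is one TX queue per CPU, limited to VIRTIO_MAX_TXQS. */
    dflt = smp_cpus < VIRTIO_MAX_TXQS ? smp_cpus : VIRTIO_MAX_TXQS;

    /* Assumption: requested <= 0 stands for "option not given". */
    return requested > 0 ? requested : dflt;
}

/* Vectors chosen when "vectors=" is unspecified: numtxqs TX + 1 RX + 1 ctrl. */
static int auto_nvectors(int numtxqs)
{
    assert(numtxqs >= 1);
    return numtxqs + 1 + 1;
}
```

For example, a 4-CPU guest with vhost=on,mq=on and no numtxqs option would
get 4 TX queues and 6 vectors under this model.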

^ permalink raw reply

* Re: [PATCH] phonet: remove the unused variable pn
From: David Miller @ 2010-10-20  8:56 UTC (permalink / raw)
  To: xiaosuo; +Cc: remi.denis-courmont, netdev
In-Reply-To: <1287561063-24548-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 20 Oct 2010 15:51:03 +0800

> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: pull request: wireless-next-2.6 2010-10-15
From: David Miller @ 2010-10-20  9:00 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev
In-Reply-To: <20101015204048.GC2438@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Fri, 15 Oct 2010 16:40:50 -0400

> Here is the latest batch of updates intended for 2.6.37.  As usual,
> it is primarily a bunch of driver updates and fixes to code already
> in -next.  This also includes a batch of Bluetooth updates courtesy
> of Gustavo Padovan.  There is also the movement of the wl1251 driver
> out of the wl12xx directory.  This is a prelude to the expansion of
> the wl1271 code to cover some new hardware, all of which is actually
> largely unrelated to wl1251.

Pulled, thanks John.

^ permalink raw reply

* Re: [PATCH 1/4] net - Add AF_ALG macros
From: David Miller @ 2010-10-20  9:01 UTC (permalink / raw)
  To: herbert; +Cc: linux-crypto, netdev, linux-kernel
In-Reply-To: <E1P8CVt-0003Yv-Ml@gondolin.me.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 19 Oct 2010 21:46:01 +0800

> net - Add AF_ALG macros
> 
> This patch adds the socket family/level macros for the yet-to-be-born
> AF_ALG family.  The AF_ALG family provides the user-space interface
> for the kernel crypto API.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [RFC PATCH 0/9] ipvs network name space (netns) aware
From: Simon Horman @ 2010-10-20  9:17 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: lvs-devel@vger.kernel.org, netdev@vger.kernel.org,
	netfilter-devel@vger.kernel.org, ja@ssi.bg, wensong@linux-vs.org,
	daniel.lezcano@free.fr
In-Reply-To: <201010181355.59763.hans.schillstrom@ericsson.com>

On Mon, Oct 18, 2010 at 01:55:58PM +0200, Hans Schillstrom wrote:
> On Sunday 17 October 2010 08:47:31 Simon Horman wrote:
> > On Fri, Oct 08, 2010 at 01:16:36PM +0200, Hans Schillstrom wrote:
> > > This patch series adds network name space (netns) support to the LVS.
> > > 
> > > REVISION
> > > 
> > > This is version 1
> > > 
> > > OVERVIEW
> > > 
> > > The patch doesn't remove or add any functionality except for netns.
> > > For users that don't use network name space (netns) this patch is
> > > completely transparent.
> > > 
> > > > Now it's possible to run LVS in a Linux container (see lxc-tools),
> > > > i.e. lightweight virtualization. For example, it's possible to run
> > > > one or several LVS instances on a real server in their own network name spaces.
> > > > From the LVS point of view it looks like it runs on its own machine.
> > > 
> > > IMPLEMENTATION
> > > Basic requirements for netns awareness
> > > >  - Global variables have to be moved to dynamically allocated memory.
> > > > 
> > > > Most global variables now reside in a struct ipvs { } in netns/ip_vs.h.
> > > > What is moved and what is not?
> > > > 
> > > > Some cache-aligned locks are still global, as are module init params and some debug_level.
> > > > 
> > > > The algorithm files are untouched.
> > > 
> > > QUESTIONS
> > > Drop rate in ip_vs_ctl per netns or grand total ?
> 
> This is a tricky one (I think), 
> if the interface is shared with root name-space and/or other name-spaces
>  - use grand total
> if it's an "own interface"  
>  - drop rate can/should be in netns...

I hadn't thought about shared devices - yes that is tricky.

> > My gut-feeling is that per netns makes more sense.
> > 
> > > Should more lock variables be moved (or less) ?
> > 
> > I'm unsure what you are asking here but I will make a general statement
> > about locking in IPVS: it needs work.
> 
> Some locks still reside as global variables, and others in the netns_ipvs struct.
> Since you have a lot of experience with IPVS locks, 
> you might have ideas about what to move and what not to move.

My basic thought is that locks tend to be related either to a connection
or to the configuration of a service. And it seems to me that if you
have a per-namespace connection hash table then both of these categories
of locks are good candidates to be made per-namespace.

Do you have any particular locks that you are worried about?

> > > PATCH SET
> > > This patch set is based upon net-next-2.6 (2.6.36-rc3) from 4 oct 2010
> > > and [patch v4] ipvs: IPv6 tunnel mode
> > > 
> > > Note: ip_vs_xmit.c will not work without "[patch v4] ipvs: IPv6 tunnel mode"
> > 
> > Unfortunately the patches don't apply with the persistence engine
> > patches which were recently merged into nf-next-2.6 (although
> > "[patch v4.1 ]ipvs: IPv6 tunnel mode" is still unmerged).
> > 
> I do have a patch based on the nf-next without the SIP/PE patch
> 
> > I'm happy to work with you to make the required changes there.
> 
> I would appreciate that.

No problem. I am a bit busy this week as I am attending the Netfilter
Workshop. But I will try to find some time to rebase your changes soon.

> > (I realise those patches weren't merged when you made your post.
> >  But regardless, either you or I will need to update the patches).
> > 
> > Another issue is that your patches seem to be split in a way
> > where the build breaks along the way. E.g. after applying
> > patch 1, the build breaks. Could you please split things up
> > in a manner such that this doesn't happen. The reason being
> > that it breaks bisection.
> > 
> Hmm, Daniel also pointed this out.
> The patch is quite large, and will become even larger with PE and SIP.
> My idea was to have the patch reviewed in pieces and then put it together into one or two large patches when submitting it.
> I don't know, that might be a stupid idea?
> It's hard to break it up; making the code reentrant causes changes everywhere.
> 
> Daniel L. had another approach: break it into many, many tiny patches.

I would prefer the tiny patch approach.

> > Lastly, could you provide a unique subject for each patch.
> > I know it's a bit tedious, but it does make a difference when
> > browsing the changelog.
> > 
> Yepp, no problem

^ permalink raw reply

* Re: SCTP AUTO-ASCONF patch
From: Shan Wei @ 2010-10-20  9:30 UTC (permalink / raw)
  To: Michio Honda; +Cc: Vlad Yasevich, netdev, hadi
In-Reply-To: <EB46A57B-530A-4F52-8A29-120192604816@sfc.wide.ad.jp>

Michio Honda wrote, at 10/12/2010 08:27 PM:
> Hi, 
> 
> I'm resubmitting a patch to enable AUTO_ASCONF for Linux SCTP. (version is for 2.6.36-rc7).

For such a big patch, please provide more documentation to help reviewers. ;)
See some trivial comments inline below.

> 
> Thanks,
> - Michio
> 
> Only in linux-2.6/include: asm-arm
> Only in linux-2.6/include: asm-mn10300
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/include/linux/sysctl.h linux-2.6/include/linux/sysctl.h
> --- linux-2.6.orig/include/linux/sysctl.h	2010-10-11 08:24:33.000000000 +0900
> +++ linux-2.6/include/linux/sysctl.h	2010-10-11 07:21:40.000000000 +0900
> @@ -767,6 +767,7 @@ enum {
>  	NET_SCTP_SNDBUF_POLICY		 = 15,
>  	NET_SCTP_SACK_TIMEOUT		 = 16,
>  	NET_SCTP_RCVBUF_POLICY		 = 17,
> +	NET_SCTP_AUTO_ASCONF_ENABLE	 = 18,
>  };
>  
>  /* /proc/sys/net/bridge */
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/include/net/sctp/sctp.h linux-2.6/include/net/sctp/sctp.h
> --- linux-2.6.orig/include/net/sctp/sctp.h	2010-10-11 08:24:33.000000000 +0900
> +++ linux-2.6/include/net/sctp/sctp.h	2010-10-11 07:21:40.000000000 +0900
> @@ -121,6 +121,8 @@ extern int sctp_copy_local_addr_list(str
>  				     int flags);
>  extern struct sctp_pf *sctp_get_pf_specific(sa_family_t family);
>  extern int sctp_register_pf(struct sctp_pf *, sa_family_t);
> +void sctp_addr_wq_mgmt(union sctp_addr *, int);
> +void sctp_path_check_and_react(struct sctp_association *, struct sockaddr *);
>  
>  /*
>   * sctp/socket.c
> @@ -135,6 +137,9 @@ void sctp_sock_rfree(struct sk_buff *skb
>  void sctp_copy_sock(struct sock *newsk, struct sock *sk,
>  		    struct sctp_association *asoc);
>  extern struct percpu_counter sctp_sockets_allocated;
> +int sctp_asconf_mgmt(struct sctp_endpoint *, struct sock *sk);
> +void sctp_add_addr_to_laddr(struct sockaddr *, struct sctp_association *);
> +void sctp_trans_immediate_retrans(struct sctp_transport *);
>  
>  /*
>   * sctp/primitive.c
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/include/net/sctp/sm.h linux-2.6/include/net/sctp/sm.h
> --- linux-2.6.orig/include/net/sctp/sm.h	2010-10-11 08:24:33.000000000 +0900
> +++ linux-2.6/include/net/sctp/sm.h	2010-10-11 07:21:40.000000000 +0900
> @@ -295,6 +295,8 @@ int sctp_addip_addr_config(struct sctp_a
>  __u32 sctp_generate_tag(const struct sctp_endpoint *);
>  __u32 sctp_generate_tsn(const struct sctp_endpoint *);
>  
> +void sctp_path_check_and_react(struct sctp_association *, struct sockaddr *);
> +
>  /* Extern declarations for major data structures.  */
>  extern sctp_timer_event_t *sctp_timer_events[SCTP_NUM_TIMEOUT_TYPES];
>  
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/include/net/sctp/structs.h linux-2.6/include/net/sctp/structs.h
> --- linux-2.6.orig/include/net/sctp/structs.h	2010-10-11 08:24:33.000000000 +0900
> +++ linux-2.6/include/net/sctp/structs.h	2010-10-11 07:21:40.000000000 +0900
> @@ -205,6 +205,10 @@ extern struct sctp_globals {
>  	 * It is a list of sctp_sockaddr_entry.
>  	 */
>  	struct list_head local_addr_list;
> +	int auto_asconf_enable;
> +	struct list_head addr_waitq;
> +	struct timer_list addr_wq_timer;
> +	spinlock_t addr_wq_lock;
>  
>  	/* Lock that protects the local_addr_list writers */
>  	spinlock_t addr_list_lock;
> @@ -265,6 +269,10 @@ extern struct sctp_globals {
>  #define sctp_port_alloc_lock		(sctp_globals.port_alloc_lock)
>  #define sctp_port_hashtable		(sctp_globals.port_hashtable)
>  #define sctp_local_addr_list		(sctp_globals.local_addr_list)
> +#define sctp_addr_waitq			(sctp_globals.addr_waitq)

Please keep this aligned with the neighboring defines.

> +#define sctp_addr_wq_timer		(sctp_globals.addr_wq_timer)
> +#define sctp_addr_wq_lock		(sctp_globals.addr_wq_lock)
> +#define sctp_auto_asconf_enable		(sctp_globals.auto_asconf_enable)

Please keep this aligned with the neighboring defines.

>  #define sctp_local_addr_lock		(sctp_globals.addr_list_lock)
>  #define sctp_scope_policy		(sctp_globals.ipv4_scope_policy)
>  #define sctp_addip_enable		(sctp_globals.addip_enable)
> @@ -798,6 +806,16 @@ struct sctp_sockaddr_entry {
>  	__u8 valid;
>  };
>  
> +#define SCTP_NEWADDR	1
> +#define SCTP_DELADDR	2
> +#define SCTP_ADDRESS_TICK_DELAY	500
> +struct sctp_addr_wait {
> +	struct list_head list;
> +	struct rcu_head rcu;
> +	union sctp_addr a;
> +	int	cmd;
> +};
> +
>  typedef struct sctp_chunk *(sctp_packet_phandler_t)(struct sctp_association *);
>  
>  /* This structure holds lists of chunks as we are assembling for
> @@ -1241,6 +1259,7 @@ sctp_scope_t sctp_scope(const union sctp
>  int sctp_in_scope(const union sctp_addr *addr, const sctp_scope_t scope);
>  int sctp_is_any(struct sock *sk, const union sctp_addr *addr);
>  int sctp_addr_is_valid(const union sctp_addr *addr);
> +int sctp_is_ep_boundall(struct sock *sk);
>  
>  
>  /* What type of endpoint?  */
> @@ -1903,6 +1922,11 @@ struct sctp_association {
>  	 * after reaching 4294967295.
>  	 */
>  	__u32 addip_serial;
> +	/* list of valid address in association local */
> +	struct list_head asoc_laddr_list; 
> +	union sctp_addr *asconf_addr_del_pending;
> +	__u32 asconf_del_pending_cid;
> +	int src_out_of_asoc_ok;
>  
>  	/* SCTP AUTH: list of the endpoint shared keys.  These
>  	 * keys are provided out of band by the user applicaton
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/kernel/sysctl_check.c linux-2.6/kernel/sysctl_check.c
> --- linux-2.6.orig/kernel/sysctl_check.c	2010-04-22 17:55:41.000000000 +0900
> +++ linux-2.6/kernel/sysctl_check.c	2010-04-21 18:40:22.000000000 +0900
> @@ -5,7 +5,6 @@
>  #include <linux/string.h>
>  #include <net/ip_vs.h>
>  
> -

This is a redundant change.

>  static int sysctl_depth(struct ctl_table *table)
>  {
>  	struct ctl_table *tmp;
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/associola.c linux-2.6/net/sctp/associola.c
> --- linux-2.6.orig/net/sctp/associola.c	2010-10-11 08:24:34.000000000 +0900
> +++ linux-2.6/net/sctp/associola.c	2010-10-11 07:21:40.000000000 +0900
> @@ -278,6 +278,10 @@ static struct sctp_association *sctp_ass
>  	if (sctp_addip_noauth)
>  		asoc->peer.asconf_capable = 1;
>  
> +	asoc->asconf_addr_del_pending = NULL;
> +	asoc->asconf_del_pending_cid = 0;
> +	asoc->src_out_of_asoc_ok = 0;
> +	INIT_LIST_HEAD(&asoc->asoc_laddr_list);
>  	/* Create an input queue.  */
>  	sctp_inq_init(&asoc->base.inqueue);
>  	sctp_inq_set_th_handler(&asoc->base.inqueue, sctp_assoc_bh_rcv);
> @@ -444,6 +448,17 @@ void sctp_association_free(struct sctp_a
>  	/* Free any cached ASCONF_ACK chunk. */
>  	sctp_assoc_free_asconf_acks(asoc);
>  
> +	/* Free pending address space being deleted */
> +	if (asoc->asconf_addr_del_pending != NULL) 
> +		kfree(asoc->asconf_addr_del_pending);
> +	if (!list_empty(&asoc->asoc_laddr_list)) {
> +		struct sctp_sockaddr_entry *laddr = NULL;
> +		list_for_each_entry(laddr, &asoc->asoc_laddr_list, list) {
> +			list_del(&laddr->list);
> +			kfree(laddr);
> +		}
> +	}
> +
>  	/* Free any cached ASCONF chunk. */
>  	if (asoc->addip_last_asconf)
>  		sctp_chunk_free(asoc->addip_last_asconf);
> @@ -618,6 +633,7 @@ void sctp_assoc_rm_peer(struct sctp_asso
>  			if (!mod_timer(&active->T3_rtx_timer,
>  					jiffies + active->rto))
>  				sctp_transport_hold(active);
> +		active->flight_size += peer->flight_size;
>  	}
>  
>  	asoc->peer.transport_count--;
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/bind_addr.c linux-2.6/net/sctp/bind_addr.c
> --- linux-2.6.orig/net/sctp/bind_addr.c	2010-04-22 17:55:41.000000000 +0900
> +++ linux-2.6/net/sctp/bind_addr.c	2010-06-24 02:33:20.000000000 +0900
> @@ -536,6 +536,24 @@ int sctp_in_scope(const union sctp_addr 
>  	return 0;
>  }
>  
> +int sctp_is_ep_boundall(struct sock *sk)
> +{
> +	struct sctp_bind_addr *bp;
> +	struct sctp_sockaddr_entry *addr;
> +
> +	if (!sk) 
> +		return 0;
> +       
> +	bp = &sctp_sk(sk)->ep->base.bind_addr;
> +	if (sctp_list_single_entry(&bp->address_list)) {
> +		addr = list_entry(bp->address_list.next,
> +				  struct sctp_sockaddr_entry, list);
> +		if (sctp_is_any(sk, &addr->a)) 
> +			return 1;
> +	}
> +	return 0;
> +}
> +
>  /********************************************************************
>   * 3rd Level Abstractions
>   ********************************************************************/
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/ipv6.c linux-2.6/net/sctp/ipv6.c
> --- linux-2.6.orig/net/sctp/ipv6.c	2010-10-11 08:24:34.000000000 +0900
> +++ linux-2.6/net/sctp/ipv6.c	2010-10-11 07:21:40.000000000 +0900
> @@ -103,6 +103,7 @@ static int sctp_inet6addr_event(struct n
>  			addr->valid = 1;
>  			spin_lock_bh(&sctp_local_addr_lock);
>  			list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
> +			sctp_addr_wq_mgmt(&addr->a, SCTP_NEWADDR);
>  			spin_unlock_bh(&sctp_local_addr_lock);
>  		}
>  		break;
> @@ -113,6 +114,7 @@ static int sctp_inet6addr_event(struct n
>  			if (addr->a.sa.sa_family == AF_INET6 &&
>  					ipv6_addr_equal(&addr->a.v6.sin6_addr,
>  						&ifa->addr)) {
> +				sctp_addr_wq_mgmt(&addr->a, SCTP_DELADDR);
>  				found = 1;
>  				addr->valid = 0;
>  				list_del_rcu(&addr->list);
> @@ -330,6 +332,25 @@ static void sctp_v6_get_saddr(struct sct
>  				matchlen = bmatchlen;
>  			}
>  		}
> +		if ((laddr->state == SCTP_ADDR_NEW) && asoc->src_out_of_asoc_ok) {
> +			bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
> +			if (!baddr || (matchlen < bmatchlen)) {
> +				baddr = &laddr->a;
> +				matchlen = bmatchlen;
> +			}
> +		}
> +	}
> +	if (baddr == NULL) {
> +		/* We don't have a valid src addr in "endpoint-wide".  
> +		 * Looking up in assoc-locally valid address list.  
> +		 */
> +		list_for_each_entry(laddr, &asoc->asoc_laddr_list, list) {
> +			bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
> +			if (!baddr || (matchlen < bmatchlen)) {
> +				baddr = &laddr->a;
> +				matchlen = bmatchlen;
> +			}
> +		}
>  	}
>  
>  	if (baddr) {
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/outqueue.c linux-2.6/net/sctp/outqueue.c
> --- linux-2.6.orig/net/sctp/outqueue.c	2010-10-11 08:24:34.000000000 +0900
> +++ linux-2.6/net/sctp/outqueue.c	2010-10-11 07:21:40.000000000 +0900
> @@ -342,7 +342,13 @@ int sctp_outq_tail(struct sctp_outq *q, 
>  			break;
>  		}
>  	} else {
> -		list_add_tail(&chunk->list, &q->control_chunk_list);
> +		/* We add the ASCONF for the only one newly added address at 
> +		 * the front of the queue 
> +		 */
> +		if (q->asoc->src_out_of_asoc_ok && (chunk->chunk_hdr->type == SCTP_CID_ASCONF))
> +			list_add(&chunk->list, &q->control_chunk_list);
> +		else
> +			list_add_tail(&chunk->list, &q->control_chunk_list);
>  		SCTP_INC_STATS(SCTP_MIB_OUTCTRLCHUNKS);
>  	}
>  
> @@ -850,6 +856,24 @@ static int sctp_outq_flush(struct sctp_o
>  		case SCTP_CID_SHUTDOWN:
>  		case SCTP_CID_ECN_ECNE:
>  		case SCTP_CID_ASCONF:
> +			/* RFC 5061, 5.3
> +			 * F1) This means that until such time as the ASCONF 
> +			 * containing the add is acknowledged, the sender MUST 
> +			 * NOT use the new IP address as a source for ANY SCTP 
> +			 * packet except on carrying an ASCONF Chunk.
> +			 */
> +			if (asoc->src_out_of_asoc_ok) {
> +				SCTP_DEBUG_PRINTK("outq_flush: out_of_asoc_ok, transmit chunk type %d\n", chunk->chunk_hdr->type);
> +				packet = &transport->packet;
> +				sctp_packet_config(packet, vtag, 
> +						asoc->peer.ecn_capable);
> +				sctp_packet_append_chunk(packet, chunk);
> +				error = sctp_packet_transmit(packet);
> +				if (error < 0) {
> +					return error;
> +				}
> +				goto sctp_flush_out;
> +			}
>  		case SCTP_CID_FWD_TSN:
>  			status = sctp_packet_transmit_chunk(packet, chunk,
>  							    one_packet);
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/protocol.c linux-2.6/net/sctp/protocol.c
> --- linux-2.6.orig/net/sctp/protocol.c	2010-10-11 08:24:34.000000000 +0900
> +++ linux-2.6/net/sctp/protocol.c	2010-10-11 07:21:40.000000000 +0900
> @@ -508,12 +508,19 @@ static struct dst_entry *sctp_v4_get_dst
>  		sctp_v4_dst_saddr(&dst_saddr, dst, htons(bp->port));
>  		rcu_read_lock();
>  		list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> -			if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
> +			if (!laddr->valid || ((laddr->state != SCTP_ADDR_SRC) && (asoc->src_out_of_asoc_ok == 0)))
>  				continue;
>  			if (sctp_v4_cmp_addr(&dst_saddr, &laddr->a))
>  				goto out_unlock;
>  		}
>  		rcu_read_unlock();
> +		/* We don't have a valid src addr in "endpoint-wide".  
> +		 * Looking up in assoc-locally valid address list.  
> +		 */
> +		list_for_each_entry(laddr, &asoc->asoc_laddr_list, list) {
> +			if (sctp_v4_cmp_addr(&dst_saddr, &laddr->a))
> +				goto out_unlock;
> +		}
>  
>  		/* None of the bound addresses match the source address of the
>  		 * dst. So release it.
> @@ -633,6 +640,184 @@ static void sctp_v4_ecn_capable(struct s
>  	INET_ECN_xmit(sk);
>  }
>  
> +void sctp_addr_wq_timeout_handler(unsigned long arg)
> +{
> +	struct sctp_addr_wait *addrw = NULL;
> +	union sctp_addr *addr = NULL;
> +	struct sctp_ep_common *epb = NULL;
> +	struct sctp_endpoint *ep = NULL;
> +	struct hlist_node *node = NULL;
> +	struct sctp_hashbucket *head = NULL;
> +	int cnt=0;
> +	int i; 
> +
> +	spin_lock_bh(&sctp_addr_wq_lock);
> +retry_wq:
> +	if (list_empty(&sctp_addr_waitq)) {
> +		SCTP_DEBUG_PRINTK("sctp_addrwq_timo_handler: nothing in addr waitq\n");
> +		spin_unlock_bh(&sctp_addr_wq_lock);
> +		return;
> +	}
> +	addrw = list_first_entry(&sctp_addr_waitq, struct sctp_addr_wait, list);
> +	if ((addrw->cmd != SCTP_NEWADDR) && (addrw->cmd != SCTP_DELADDR)) {
> +		SCTP_DEBUG_PRINTK("sctp_addrwq_timo_handler: Huh, cmd is neither NEWADDR nor DELADDR\n");
> +		list_del(&addrw->list);
> +		kfree(addrw);
> +		goto retry_wq;
> +	}
> +
> +	addr = &addrw->a;
> +	SCTP_DEBUG_PRINTK_IPADDR("sctp_addrwq_timo_handler: the first ent in wq %p is "," for cmd %d at entry %p\n", &sctp_addr_waitq, addr, addrw->cmd, addrw);
> +
> +	/* Now we send an ASCONF for each association */
> +	/* Note. we currently don't handle link local IPv6 addressees */
> +	if (addr->sa.sa_family == AF_INET6) {
> +		struct in6_addr *in6 = (struct in6_addr *)&addr->v6.sin6_addr;
> +
> +		if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) {
> +			SCTP_DEBUG_PRINTK("sctp_timo_handler: link local, hence don't tell eps\n");
> +			list_del(&addrw->list);
> +			kfree(addrw);
> +			goto retry_wq;
> +		}
> +		if ((ipv6_chk_addr(&init_net, in6, NULL, 0) == 0) && (addrw->cmd == SCTP_NEWADDR)) {
> +			unsigned long timeo_val;
> +
> +			SCTP_DEBUG_PRINTK("sctp_timo_handler: this is on DAD, trying %d sec later\n", SCTP_ADDRESS_TICK_DELAY);
> +			timeo_val = jiffies;
> +			timeo_val += msecs_to_jiffies(SCTP_ADDRESS_TICK_DELAY);
> +			(void)mod_timer(&sctp_addr_wq_timer, timeo_val);
> +			spin_unlock_bh(&sctp_addr_wq_lock);
> +			return;
> +		}
> +	}
> +	for (i = 0; i < sctp_ep_hashsize; ++i) {
> +		head = &sctp_ep_hashtable[i];
> +		if (head == NULL) {
> +			SCTP_DEBUG_PRINTK("addrwq_timo_handler: no head in hash\n");
> +			continue;
> +		}
> +		write_lock(&head->lock);
> +		epb = NULL;
> +		sctp_for_each_hentry(epb, node, &head->chain) {
> +
> +			if (epb == NULL) {
> +				SCTP_DEBUG_PRINTK("addrwq_timo_handler: no epb\n");
> +				continue;
> +			}
> +			if (!sctp_is_ep_boundall(epb->sk)) {
> +				/* ignore bound-specific endpoints */
> +				continue;
> +			}
> +			ep = sctp_ep(epb);
> +			if (sctp_asconf_mgmt(ep, epb->sk) < 0) {
> +				SCTP_DEBUG_PRINTK("sctp_addrwq_timo_handler: sctp_asconf_mgmt failed\n");
> +				continue;
> +			}
> +			++cnt;
> +		}
> +		write_unlock(&head->lock);
> +	}
> +
> +	list_del(&addrw->list);
> +	kfree(addrw);
> +
> +	if (list_empty(&sctp_addr_waitq)) {
> +		spin_unlock_bh(&sctp_addr_wq_lock);
> +		return;
> +	} else {
> +		goto retry_wq;
> +	}
> +	spin_unlock_bh(&sctp_addr_wq_lock);
> +}
> +
> +void sctp_addr_wq_mgmt(union sctp_addr *reqaddr, int cmd)
> +{
> +	struct sctp_addr_wait *addrw = NULL;
> +	struct sctp_addr_wait *addrw_new = NULL;
> +	unsigned long timeo_val;
> +	union sctp_addr *tmpaddr; // for debugging
> +
> +	/* first, we check if an opposite message already exist in the queue.  
> +	 * If we found such message, it is removed.  
> +	 * This operation is a bit stupid, but the DHCP client attaches the 
> +	 * new address after a couple of addition and deletion of that address
> +	 */
> +
> +	if (reqaddr == NULL) {
> +		SCTP_DEBUG_PRINTK("sctp_addr_wq_mgmt: no address message?\n");
> +		return;
> +	}
> +	
> +	spin_lock_bh(&sctp_addr_wq_lock);
> +	/* Offsets existing events in addr_wq */
> +	list_for_each_entry(addrw, &sctp_addr_waitq, list) {
> +		if (addrw->a.sa.sa_family != reqaddr->sa.sa_family) {
> +			continue;
> +		}
> +		if (reqaddr->sa.sa_family == AF_INET) {
> +			if (reqaddr->v4.sin_addr.s_addr == addrw->a.v4.sin_addr.s_addr) {
> +				if (cmd != addrw->cmd) {
> +					tmpaddr = &addrw->a;
> +					SCTP_DEBUG_PRINTK_IPADDR("sctp_addr_wq_mgmt offsets existing entry for %d "," in waitq %p\n", addrw->cmd, tmpaddr, &sctp_addr_waitq);
> +					list_del(&addrw->list);
> +					kfree(addrw);
> +					/* nothing to do anymore */
> +					spin_unlock_bh(&sctp_addr_wq_lock);
> +					return;
> +				}
> +			}
> +		}
> +		else if (reqaddr->sa.sa_family == AF_INET6) {
> +			if (memcmp(&reqaddr->v6.sin6_addr, &addrw->a.v6.sin6_addr, sizeof(struct in6_addr)) == 0) {

Try ipv6_addr_equal() here.
ipv6_addr_equal() checks whether two IPv6 addresses are equal.
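
A minimal userspace sketch of the pattern, assuming nothing about the
kernel internals: addr6_equal_sketch() is a hypothetical stand-in for the
kernel helper, written so that the call site states its intent instead of
spelling out memcmp() and sizeof() each time.

```c
#include <assert.h>
#include <string.h>
#include <netinet/in.h>

/* Hypothetical stand-in for ipv6_addr_equal(): returns 1 if equal, else 0. */
static int addr6_equal_sketch(const struct in6_addr *a1,
			      const struct in6_addr *a2)
{
	assert(a1 && a2);
	return memcmp(a1, a2, sizeof(struct in6_addr)) == 0;
}
```

The comparison in the loop would then read as a single named call on the
two sin6_addr fields.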

> +				if (cmd != addrw->cmd) {
> +					tmpaddr = &addrw->a;
> +					SCTP_DEBUG_PRINTK_IPADDR("sctp_addr_wq_mgmt: offsets existing entry for %d "," in waitq %p\n", addrw->cmd, tmpaddr, &sctp_addr_waitq);
> +					list_del(&addrw->list);
> +					kfree(addrw);
> +					spin_unlock_bh(&sctp_addr_wq_lock);
> +					return;
> +				}
> +			}
> +		}
> +	}
> +				
> +	/* OK, we have to add the new address to the wait queue */
> +	addrw_new = kmalloc(sizeof(struct sctp_addr_wait), GFP_ATOMIC);
> +	if (addrw_new == NULL) {
> +		SCTP_DEBUG_PRINTK("sctp_addr_weitq_mgmt no memory? return\n");
> +		spin_unlock_bh(&sctp_addr_wq_lock);
> +		return;
> +	}
> +	memset(addrw_new, 0, sizeof(struct sctp_addr_wait));

What about kzalloc() instead of kmalloc() followed by memset()?
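
As a userspace analogy only: calloc() below plays the role of kzalloc(),
doing the allocation and the zeroing in one call. The struct
addr_wait_sketch and addr_wait_alloc() are hypothetical, trimmed-down
stand-ins for the patch's struct sctp_addr_wait and its allocation site.

```c
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct sctp_addr_wait. */
struct addr_wait_sketch {
	int cmd;
	unsigned char addr[16];
};

/* One call allocates and zero-fills, so no separate memset() is needed. */
static struct addr_wait_sketch *addr_wait_alloc(void)
{
	return calloc(1, sizeof(struct addr_wait_sketch));
}
```

The caller still has to handle a NULL return, exactly as the patch already
does after kmalloc().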

> +	if (reqaddr->sa.sa_family == AF_INET) {
> +		addrw_new->a.v4.sin_family = AF_INET;
> +		memcpy(&addrw_new->a.v4.sin_addr, &reqaddr->v4.sin_addr, sizeof(struct in_addr));

A plain assignment is enough here:
addrw_new->a.v4.sin_addr.s_addr = reqaddr->v4.sin_addr.s_addr;
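
A small userspace sketch of the same point, using the standard
struct sockaddr_in: s_addr is a 32-bit scalar, so it can be copied by
assignment without memcpy(). The function name copy_v4_sketch() is
hypothetical.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Copy an IPv4 address by plain assignment; no memcpy() or sizeof needed. */
static void copy_v4_sketch(struct sockaddr_in *dst,
			   const struct sockaddr_in *src)
{
	dst->sin_family = AF_INET;
	dst->sin_addr.s_addr = src->sin_addr.s_addr;
}
```

Note that no address-of operator appears on either side of the
assignment.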

> +	} else if (reqaddr->sa.sa_family == AF_INET6) {
> +		addrw_new->a.v6.sin6_family = AF_INET6;
> +		memcpy(&addrw_new->a.v6.sin6_addr, &reqaddr->v6.sin6_addr, sizeof(struct in6_addr));

Use ipv6_addr_copy() here.
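
Again as a userspace sketch rather than kernel code: addr6_copy_sketch()
is a hypothetical stand-in for the kernel helper. It copies the whole
struct in6_addr with a structure assignment, which keeps the size
implicit in the type instead of repeating sizeof() at the call site.

```c
#include <netinet/in.h>

/* Hypothetical stand-in for ipv6_addr_copy(): copies all 16 address bytes. */
static void addr6_copy_sketch(struct in6_addr *dst,
			      const struct in6_addr *src)
{
	*dst = *src;
}
```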

> +	} else {
> +		SCTP_DEBUG_PRINTK("sctp_addr_waitq_mgmt: Unknown family of request addr, return\n");
> +		kfree(addrw_new);
> +		spin_unlock_bh(&sctp_addr_wq_lock);
> +		return;
> +	}
> +	addrw_new->cmd = cmd;
> +	list_add_tail(&addrw_new->list, &sctp_addr_waitq);
> +	tmpaddr = &addrw_new->a;
> +	SCTP_DEBUG_PRINTK_IPADDR("sctp_addr_wq_mgmt add new entry for cmd:%d "," in waitq %p, start a timer\n", addrw_new->cmd, tmpaddr, &sctp_addr_waitq);
> +
> +	if (timer_pending(&sctp_addr_wq_timer)) {
> +		SCTP_DEBUG_PRINTK("sctp_addr_wq_mgmt: addr_wq timer is already running\n");
> +		spin_unlock_bh(&sctp_addr_wq_lock);
> +		return;
> +	}
> +	timeo_val = jiffies;
> +	timeo_val += msecs_to_jiffies(SCTP_ADDRESS_TICK_DELAY);
> +	(void)mod_timer(&sctp_addr_wq_timer, timeo_val);
> +	spin_unlock_bh(&sctp_addr_wq_lock);
> +}
> +
>  /* Event handler for inet address addition/deletion events.
>   * The sctp_local_addr_list needs to be protocted by a spin lock since
>   * multiple notifiers (say IPv4 and IPv6) may be running at the same
> @@ -660,6 +845,7 @@ static int sctp_inetaddr_event(struct no
>  			addr->valid = 1;
>  			spin_lock_bh(&sctp_local_addr_lock);
>  			list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
> +			sctp_addr_wq_mgmt(&addr->a, SCTP_NEWADDR);
>  			spin_unlock_bh(&sctp_local_addr_lock);
>  		}
>  		break;
> @@ -670,6 +856,7 @@ static int sctp_inetaddr_event(struct no
>  			if (addr->a.sa.sa_family == AF_INET &&
>  					addr->a.v4.sin_addr.s_addr ==
>  					ifa->ifa_local) {
> +				sctp_addr_wq_mgmt(&addr->a, SCTP_DELADDR);
>  				found = 1;
>  				addr->valid = 0;
>  				list_del_rcu(&addr->list);
> @@ -1276,6 +1463,10 @@ SCTP_STATIC __init int sctp_init(void)
>  
>  	/* Initialize the local address list. */
>  	INIT_LIST_HEAD(&sctp_local_addr_list);
> +	INIT_LIST_HEAD(&sctp_addr_waitq);
> +	spin_lock_init(&sctp_addr_wq_lock);
> +	sctp_addr_wq_timer.expires = 0;
> +	setup_timer(&sctp_addr_wq_timer, sctp_addr_wq_timeout_handler, (unsigned long)NULL); 
>  	spin_lock_init(&sctp_local_addr_lock);
>  	sctp_get_local_addr_list();
>  
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/sm_make_chunk.c linux-2.6/net/sctp/sm_make_chunk.c
> --- linux-2.6.orig/net/sctp/sm_make_chunk.c	2010-10-11 08:24:34.000000000 +0900
> +++ linux-2.6/net/sctp/sm_make_chunk.c	2010-10-11 07:21:40.000000000 +0900
> @@ -2649,6 +2649,78 @@ __u32 sctp_generate_tsn(const struct sct
>  	return retval;
>  }
>  
> +void
> +sctp_trans_immediate_retrans(struct sctp_transport *trans)
> +{
> +	struct sctp_association *asoc = trans->asoc;
> +
> +	/* Stop pending T3_rtx_timer on this transport */
> +	if (timer_pending(&trans->T3_rtx_timer)) {
> +		(void)del_timer(&trans->T3_rtx_timer);
> +		sctp_transport_put(trans);
> +	} 
> +
> +	/* We consider the event as if the T3RTX timer expires */
> +	sctp_retransmit(&asoc->outqueue, trans, SCTP_RTXR_T3_RTX);
> +	if (!timer_pending(&trans->T3_rtx_timer)) {
> +		if (!mod_timer(&trans->T3_rtx_timer, jiffies + trans->rto)) 
> +			sctp_transport_hold(trans);
> +	}
> +
> +	return;
> +}
> +
> +void
> +sctp_path_check_and_react(struct sctp_association *asoc, struct sockaddr *sa)
> +{
> +	struct sctp_transport *trans;
> +	int addrnum, family;
> +	struct sctp_sockaddr_entry *saddr;
> +	struct sctp_bind_addr *bp;
> +	union sctp_addr *tmpaddr;
> +
> +	family = sa->sa_family;
> +	bp = &asoc->base.bind_addr;
> +	addrnum = 0;
> +	/* count up the number of local addresses in the same family */
> +	list_for_each_entry(saddr, &bp->address_list, list) {
> +		if (saddr->a.sa.sa_family == family) {
> +			tmpaddr = &saddr->a;
> +			if (family == AF_INET6 && 
> +			    ipv6_addr_type(&tmpaddr->v6.sin6_addr) & 
> +			    IPV6_ADDR_LINKLOCAL) {
> +				continue;
> +			}
> +			addrnum++;
> +		}
> +	}
> +	if (addrnum == 1) {
> +		union sctp_addr *tmpaddr;
> +		tmpaddr = (union sctp_addr *)sa;
> +		SCTP_DEBUG_PRINTK_IPADDR("pcheck_react: only 1 local addr in asoc %p "," family %d\n", asoc, tmpaddr, family);
> +		list_for_each_entry(trans, &asoc->peer.transport_addr_list, transports) {
> +			/* reset path information and release refcount to the 
> +			 * dst_entry  based on the src change */
> +			sctp_transport_hold(trans);
> +			trans->cwnd = min(4*asoc->pathmtu, max_t(__u32, 2*asoc->pathmtu, 4380));
> +			trans->ssthresh = asoc->peer.i.a_rwnd;
> +			trans->rtt = 0;
> +			trans->srtt = 0;
> +			trans->rttvar = 0;
> +			trans->rto = asoc->rto_initial;
> +			dst_release(trans->dst);
> +			trans->dst = NULL;
> +			memset(&trans->saddr, 0, sizeof(union sctp_addr));
> +			sctp_transport_route(trans, NULL, sctp_sk(asoc->base.sk));
> +			SCTP_DEBUG_PRINTK_IPADDR("we freed dst_entry (asoc: %p dst: "," trans: %p)\n", asoc, (&trans->ipaddr), trans);
> +			trans->rto_pending = 1;
> +			sctp_trans_immediate_retrans(trans);
> +			sctp_transport_put(trans);
> +		}
> +	}
> +	return;
> +}
> +
>  /*
>   * ADDIP 3.1.1 Address Configuration Change Chunk (ASCONF)
>   *      0                   1                   2                   3
> @@ -2742,11 +2814,30 @@ struct sctp_chunk *sctp_make_asconf_upda
>  	int			addr_param_len = 0;
>  	int 			totallen = 0;
>  	int 			i;
> +	sctp_addip_param_t del_param; // 8 Bytes (Type(0xC002), Len and CrrID)
> +	sctp_addip_param_t spr_param;
> +	struct sctp_af *del_af;
> +	struct sctp_af *spr_af;
> +	int del_addr_param_len = 0;
> +	int spr_addr_param_len = 0;
> +	int del_paramlen = sizeof(sctp_addip_param_t);
> +	int spr_paramlen = sizeof(sctp_addip_param_t);
> +	union sctp_addr_param del_addr_param; // (v4) 8 Bytes, (v6) 20 Bytes
> +	union sctp_addr_param spr_addr_param;
> +	int			v4 = 0;
> +	int			v6 = 0;
>  
>  	/* Get total length of all the address parameters. */
>  	addr_buf = addrs;
>  	for (i = 0; i < addrcnt; i++) {
>  		addr = (union sctp_addr *)addr_buf;
> +		if (addr != NULL) {
> +			if (addr->sa.sa_family == AF_INET) {
> +				v4 = 1;
> +			} else if (addr->sa.sa_family == AF_INET6) {
> +				v6 = 1;
> +			}
> +		}
>  		af = sctp_get_af_specific(addr->v4.sin_family);
>  		addr_param_len = af->to_addr_param(addr, &addr_param);
>  
> @@ -2755,6 +2846,35 @@ struct sctp_chunk *sctp_make_asconf_upda
>  
>  		addr_buf += af->sockaddr_len;
>  	}
> +	/* Add the length of a pending address being deleted */
> +	if ((flags == SCTP_PARAM_ADD_IP) && 
> +	    (asoc->asconf_addr_del_pending != NULL)) {
> +		if (((asoc->asconf_addr_del_pending->sa.sa_family == AF_INET) 
> +		    && v4) || 
> +		    ((asoc->asconf_addr_del_pending->sa.sa_family == AF_INET6)
> +		    && v6)) {
> +			del_af = sctp_get_af_specific(asoc->asconf_addr_del_pending->sa.sa_family);
> +			del_addr_param_len = del_af->to_addr_param(asoc->asconf_addr_del_pending, &del_addr_param);
> +			totallen += del_paramlen;
> +			totallen += del_addr_param_len;
> +			SCTP_DEBUG_PRINTK("mkasconf_update_ip: now we picked del_pending addr, totallen for all addresses is %d\n", totallen);
> +			/* for Set Primary (equal size as del parameters */
> +			totallen += del_paramlen;
> +			totallen += del_addr_param_len;
> +		}
> +		if (v4) {
> +			if ((totallen != 32) && (totallen != 48)) {
> +				SCTP_DEBUG_PRINTK("mkasconf_update_ip: incorrect total length of ASCONF parameters, del + add MUST be 32 bytes, but %d bytes\n", totallen);
> +			return NULL;
> +			}
> +		} else if (v6) {
> +			if ((totallen != 56) && (totallen != 84)) {
> +				SCTP_DEBUG_PRINTK("mkasconf_update_ip: incorrect total length of ASCONF parameters, del + add MUST be 56 bytes, but %d bytes\n", totallen);
> +			return NULL;
> +			}
> +		}
> +	}
> +	SCTP_DEBUG_PRINTK("mkasconf_update_ip: call mkasconf() for %d bytes\n", totallen);
>  
>  	/* Create an asconf chunk with the required length. */
>  	retval = sctp_make_asconf(asoc, laddr, totallen);
> @@ -2776,6 +2896,29 @@ struct sctp_chunk *sctp_make_asconf_upda
>  
>  		addr_buf += af->sockaddr_len;
>  	}
> +	if ((flags == SCTP_PARAM_ADD_IP) && 
> +	    (asoc->asconf_addr_del_pending != NULL)) {
> +		addr = asoc->asconf_addr_del_pending;
> +		del_af = sctp_get_af_specific(addr->v4.sin_family);
> +		del_addr_param_len = del_af->to_addr_param(addr, &del_addr_param);
> +		del_param.param_hdr.type = SCTP_PARAM_DEL_IP;
> +		del_param.param_hdr.length = htons(del_paramlen + del_addr_param_len);
> +		del_param.crr_id = i;
> +		asoc->asconf_del_pending_cid = i;
> +
> +		sctp_addto_chunk(retval, del_paramlen, &del_param);
> +		sctp_addto_chunk(retval, del_addr_param_len, &del_addr_param);
> +		/* For SET_PRIMARY */
> +		addr_buf = addrs;
> +		addr = (union sctp_addr *)addr_buf;
> +		spr_af = sctp_get_af_specific(addr->v4.sin_family);
> +		spr_addr_param_len = spr_af->to_addr_param(addr, &spr_addr_param);
> +		spr_param.param_hdr.type = SCTP_PARAM_SET_PRIMARY;
> +		spr_param.param_hdr.length = htons(spr_paramlen + spr_addr_param_len);
> +		spr_param.crr_id = (i+1);
> +		sctp_addto_chunk(retval, spr_paramlen, &spr_param);
> +		sctp_addto_chunk(retval, spr_addr_param_len, &spr_addr_param);
> +	}
>  	return retval;
>  }
>  
> @@ -2988,7 +3131,7 @@ static __be16 sctp_process_asconf_param(
>  		 * an Error Cause TLV set to the new error code 'Request to
>  		 * Delete Source IP Address'
>  		 */
> -		if (sctp_cmp_addr_exact(sctp_source(asconf), &addr))
> +		if (sctp_cmp_addr_exact(&asconf->source, &addr))
>  			return SCTP_ERROR_DEL_SRC_IP;
>  
>  		/* Section 4.2.2
> @@ -3169,7 +3312,6 @@ static void sctp_asconf_param_success(st
>  	struct sctp_bind_addr *bp = &asoc->base.bind_addr;
>  	union sctp_addr_param *addr_param;
>  	struct sctp_transport *transport;
> -	struct sctp_sockaddr_entry *saddr;
>  
>  	addr_param = (union sctp_addr_param *)
>  			((void *)asconf_param + sizeof(sctp_addip_param_t));
> @@ -3184,9 +3326,16 @@ static void sctp_asconf_param_success(st
>  		 * held, so the list can not change.
>  		 */
>  		local_bh_disable();
> -		list_for_each_entry(saddr, &bp->address_list, list) {
> -			if (sctp_cmp_addr_exact(&saddr->a, &addr))
> -				saddr->state = SCTP_ADDR_SRC;
> +		/* Until this ASCONF is acked on all associations, we cannot 
> +		 * consider this address as ADDR_SRC
> +		 */
> +		asoc->src_out_of_asoc_ok = 0;
> +		sctp_add_addr_to_laddr(&addr.sa, asoc);
> +		list_for_each_entry(transport, &asoc->peer.transport_addr_list,
> +				transports) {
> +			dst_release(transport->dst);
> +			sctp_transport_route(transport, NULL,
> +					     sctp_sk(asoc->base.sk));
>  		}
>  		local_bh_enable();
>  		list_for_each_entry(transport, &asoc->peer.transport_addr_list,
> @@ -3201,6 +3350,26 @@ static void sctp_asconf_param_success(st
>  	case SCTP_PARAM_DEL_IP:
>  		local_bh_disable();
>  		sctp_del_bind_addr(bp, &addr);
> +		if (asoc->asconf_addr_del_pending != NULL) {
> +			if ((addr.sa.sa_family == AF_INET) && 
> +			    (asoc->asconf_addr_del_pending->sa.sa_family == 
> +			     AF_INET)) {
> +				if (asoc->asconf_addr_del_pending->v4.sin_addr.s_addr == addr.v4.sin_addr.s_addr) {
> +					kfree(asoc->asconf_addr_del_pending);
> +					asoc->asconf_del_pending_cid = 0;
> +					asoc->asconf_addr_del_pending = NULL;
> +				}
> +			} 
> +			else if ((addr.sa.sa_family == AF_INET6) && 
> +				(asoc->asconf_addr_del_pending->sa.sa_family == 
> +				 AF_INET6)) {
> +				if (memcmp(&asoc->asconf_addr_del_pending->v6.sin6_addr, &addr.v6.sin6_addr, sizeof(struct in6_addr)) == 0) {
> +					kfree(asoc->asconf_addr_del_pending);
> +					asoc->asconf_del_pending_cid = 0;
> +					asoc->asconf_addr_del_pending = NULL;
> +				}
> +			}
> +		}
>  		local_bh_enable();
>  		list_for_each_entry(transport, &asoc->peer.transport_addr_list,
>  				transports) {
> @@ -3291,6 +3460,8 @@ int sctp_process_asconf_ack(struct sctp_
>  	int	no_err = 1;
>  	int	retval = 0;
>  	__be16	err_code = SCTP_ERROR_NO_ERROR;
> +	sctp_addip_param_t *first_asconf_param = NULL;
> +	int first_asconf_paramlen;
>  
>  	/* Skip the chunkhdr and addiphdr from the last asconf sent and store
>  	 * a pointer to address parameter.
> @@ -3305,6 +3476,8 @@ int sctp_process_asconf_ack(struct sctp_
>  	length = ntohs(addr_param->v4.param_hdr.length);
>  	asconf_param = (sctp_addip_param_t *)((void *)addr_param + length);
>  	asconf_len -= length;
> +	first_asconf_param = asconf_param;
> +	first_asconf_paramlen = ntohs(first_asconf_param->param_hdr.length);
>  
>  	/* ADDIP 4.1
>  	 * A8) If there is no response(s) to specific TLV parameter(s), and no
> @@ -3359,6 +3532,34 @@ int sctp_process_asconf_ack(struct sctp_
>  		asconf_len -= length;
>  	}
>  
> +	/* When the source address obviously changes to the newly added one,
> +	 * we reset the cwnd to re-probe the path condition.
> +	 */
> +	if (no_err && (first_asconf_param->param_hdr.type == SCTP_PARAM_ADD_IP)) {
> +		if (first_asconf_paramlen == 16) {
> +			struct sockaddr_in sin;
> +
> +			memset(&sin, 0, sizeof(struct sockaddr_in));
> +			sin.sin_family = AF_INET;
> +			memcpy(&sin.sin_addr.s_addr, first_asconf_param + 1, 
> +					sizeof(struct in_addr));
> +			sctp_path_check_and_react(asoc, 
> +					(struct sockaddr *)&sin);
> +
> +		} else if (first_asconf_paramlen == 28) {
> +			struct sockaddr_in6 sin6;
> +
> +			memset(&sin6, 0, sizeof(struct sockaddr_in6));
> +			sin6.sin6_family = AF_INET6;
> +			memcpy(&sin6.sin6_addr, first_asconf_param + 1, 
> +					sizeof(struct in6_addr));
> +			sctp_path_check_and_react(asoc, 
> +					(struct sockaddr *)&sin6);
> +		} else {
> +			SCTP_DEBUG_PRINTK("funny asconf_paramlen? (%d)\n", first_asconf_paramlen);
> +		}
> +	}
> +
>  	/* Free the cached last sent asconf chunk. */
>  	list_del_init(&asconf->transmitted_list);
>  	sctp_chunk_free(asconf);
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/socket.c linux-2.6/net/sctp/socket.c
> --- linux-2.6.orig/net/sctp/socket.c	2010-10-11 08:24:34.000000000 +0900
> +++ linux-2.6/net/sctp/socket.c	2010-10-11 07:21:40.000000000 +0900
> @@ -525,6 +525,7 @@ static int sctp_send_asconf_add_ip(struc
>  	struct list_head		*p;
>  	int 				i;
>  	int 				retval = 0;
> +	struct sctp_transport 		*trans = NULL;
>  
>  	if (!sctp_addip_enable)
>  		return retval;
> @@ -581,13 +582,11 @@ static int sctp_send_asconf_add_ip(struc
>  			goto out;
>  		}
>  
> -		retval = sctp_send_asconf(asoc, chunk);
> -		if (retval)
> -			goto out;
>  
>  		/* Add the new addresses to the bind address list with
>  		 * use_as_src set to 0.
>  		 */
> +		SCTP_DEBUG_PRINTK("snd_asconf_addip: next, add_bind_addr with ADDR_NEW flag\n");
>  		addr_buf = addrs;
>  		for (i = 0; i < addrcnt; i++) {
>  			addr = (union sctp_addr *)addr_buf;
> @@ -597,6 +596,26 @@ static int sctp_send_asconf_add_ip(struc
>  						    SCTP_ADDR_NEW, GFP_ATOMIC);
>  			addr_buf += af->sockaddr_len;
>  		}
> +		list_for_each_entry(trans, &asoc->peer.transport_addr_list, transports) {
> +			if (asoc->asconf_addr_del_pending != NULL) {
> +				/* This ADDIP ASCONF piggybacks DELIP for the 
> +				 * last address, so need to select src addr 
> +				 * from the out_of_asoc addrs 
> +				 */
> +				asoc->src_out_of_asoc_ok = 1;
> +			}
> +			/* Clear the source and route cache in the path */
> +			memset(&trans->saddr, 0, sizeof(union sctp_addr));
> +			dst_release(trans->dst);
> +			trans->cwnd = min(4*asoc->pathmtu, max_t(__u32, 2*asoc->pathmtu, 4380));
> +			trans->ssthresh = asoc->peer.i.a_rwnd;
> +			trans->rto = asoc->rto_initial;
> +			trans->rtt = 0;
> +			trans->srtt = 0;
> +			trans->rttvar = 0;
> +			sctp_transport_route(trans, NULL, sctp_sk(asoc->base.sk));
> +		}
> +		retval = sctp_send_asconf(asoc, chunk);
>  	}
>  
>  out:
> @@ -638,6 +657,7 @@ static int sctp_bindx_rem(struct sock *s
>  		 * bind address, there is nothing more to be removed (we need
>  		 * at least one address here).
>  		 */
> +		
>  		if (list_empty(&bp->address_list) ||
>  		    (sctp_list_single_entry(&bp->address_list))) {
>  			retval = -EBUSY;
> @@ -709,7 +729,9 @@ static int sctp_send_asconf_del_ip(struc
>  	struct sctp_sockaddr_entry *saddr;
>  	int 			i;
>  	int 			retval = 0;
> +	int			stored = 0;
>  
> +	chunk = NULL;
>  	if (!sctp_addip_enable)
>  		return retval;
>  
> @@ -760,8 +782,36 @@ static int sctp_send_asconf_del_ip(struc
>  		bp = &asoc->base.bind_addr;
>  		laddr = sctp_find_unmatch_addr(bp, (union sctp_addr *)addrs,
>  					       addrcnt, sp);
> -		if (!laddr)
> -			continue;
> +		if ((laddr == NULL) && (addrcnt == 1)) {
> +			union sctp_addr *sa_addr = NULL;
> +
> +			if (asoc->asconf_addr_del_pending == NULL) {
> +				asoc->asconf_addr_del_pending = kmalloc(sizeof(union sctp_addr), GFP_ATOMIC);
> +				memset(asoc->asconf_addr_del_pending, 0, 
> +						sizeof(union sctp_addr));
> +				if (addrs->sa_family == AF_INET) {
> +					struct sockaddr_in *sin;
> +
> +					sin = (struct sockaddr_in *)addrs;
> +					asoc->asconf_addr_del_pending->v4.sin_family = AF_INET;
> +					memcpy(&asoc->asconf_addr_del_pending->v4.sin_addr, &sin->sin_addr, sizeof(struct in_addr));
> +				} else if (addrs->sa_family == AF_INET6) {
> +					struct sockaddr_in6 *sin6;
> +
> +					sin6 = (struct sockaddr_in6 *)addrs;
> +					asoc->asconf_addr_del_pending->v6.sin6_family = AF_INET6;
> +					memcpy(&asoc->asconf_addr_del_pending->v6.sin6_addr, &sin6->sin6_addr, sizeof(struct in6_addr));
> +				}
> +				sa_addr = (union sctp_addr *)addrs;
> +				SCTP_DEBUG_PRINTK_IPADDR("send_asconf_del_ip: keep the last address asoc: %p "," at %p\n", asoc, sa_addr, asoc->asconf_addr_del_pending);
> +				stored = 1;
> +				goto skip_mkasconf;
> +			} else {
> +				SCTP_DEBUG_PRINTK_IPADDR("send_asconf_del_ip: asoc %p, deleting last address "," is already stored at %p\n", asoc, asoc->asconf_addr_del_pending, asoc->asconf_addr_del_pending);
> +				continue;
> +			}
> +		}
> +
>  
>  		/* We do not need RCU protection throughout this loop
>  		 * because this is done under a socket lock from the
> @@ -774,6 +824,7 @@ static int sctp_send_asconf_del_ip(struc
>  			goto out;
>  		}
>  
> +skip_mkasconf:
>  		/* Reset use_as_src flag for the addresses in the bind address
>  		 * list that are to be deleted.
>  		 */
> @@ -795,16 +846,222 @@ static int sctp_send_asconf_del_ip(struc
>  		list_for_each_entry(transport, &asoc->peer.transport_addr_list,
>  					transports) {
>  			dst_release(transport->dst);
> +			/* Clear source address cache */
> +			memset(&transport->saddr, 0, sizeof(union sctp_addr));
>  			sctp_transport_route(transport, NULL,
>  					     sctp_sk(asoc->base.sk));
>  		}
>  
> +		if (stored) {
> +			/* We don't need to transmit ASCONF */
> +			continue;
> +		}
>  		retval = sctp_send_asconf(asoc, chunk);
>  	}
>  out:
>  	return retval;
>  }
>  
> +/* Add a new address to the list of addresses that are available only in this
> + * association.  If the new address is also available on all other associations
> + * on the endpoint, it is marked as SCTP_ADDR_SRC in the bind address list on
> + * the endpoint.  This situation is possible when only some of the associations
> + * at the endpoint have received the ASCONF-ACK for ADD_IP.
> + */
> +void
> +sctp_add_addr_to_laddr(struct sockaddr *sa, struct sctp_association *asoc)
> +{
> +	struct sctp_endpoint *ep = asoc->ep;
> +	struct sctp_association *tmp = NULL;
> +	struct sctp_bind_addr *bp;
> +	struct sctp_sockaddr_entry *addr;
> +	struct sockaddr_in *sin = NULL;
> +	struct sockaddr_in6 *sin6 = NULL;
> +	int local;
> +	int found;
> +
> +	union sctp_addr *tmpaddr = NULL;
> +	tmpaddr = (union sctp_addr *)sa;
> +	SCTP_DEBUG_PRINTK_IPADDR("add_addr_to_laddr: asoc: %p "," ep: %p", asoc, tmpaddr, ep);
> +	if (sa->sa_family == AF_INET) {
> +		sin = (struct sockaddr_in *)sa;
> +	} else if (sa->sa_family == AF_INET6) {
> +		sin6 = (struct sockaddr_in6 *)sa;
> +	}
> +
> +	/* Check if this address is locally available in the other asocs */
> +	local = 0;
> +	list_for_each_entry(tmp, &ep->asocs, asocs) {
> +		if (tmp == asoc) {
> +			continue;
> +		}
> +		found = 0;
> +		list_for_each_entry(addr, &tmp->asoc_laddr_list, list) {
> +			tmpaddr = &addr->a;
> +			if (sa->sa_family != addr->a.sa.sa_family) {
> +				continue;
> +			}
> +			if (sa->sa_family == AF_INET) {
> +				if (sin->sin_addr.s_addr == addr->a.v4.sin_addr.s_addr) {
> +					found = 1;
> +				}
> +			} else if (sa->sa_family == AF_INET6) {
> +				if (memcmp(&sin6->sin6_addr, &addr->a.v6.sin6_addr, sizeof(struct in6_addr)) == 0) {
> +					found = 1;
> +
> +				}
> +			}
> +		}
> +		if (!found) {
> +			SCTP_DEBUG_PRINTK("add_addr_to_laddr: not found in asoc %p\n", tmp);
> +			local = 1;
> +			break;
> +		}
> +	}
> +	addr = NULL;
> +
> +	if (local) {
> +		/* this address is not available in some of the other
> +		 * associations.  So add as locally-available in this
> +		 * association
> +		 */
> +		addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC);
> +		if  (addr == NULL) {
> +			SCTP_DEBUG_PRINTK("add_addr_to_laddr: failed to allocate memory for this address\n");
> +			return;
> +		}
> +		memset(addr, 0, sizeof(struct sctp_sockaddr_entry));
> +		if (sa->sa_family == AF_INET) {
> +			addr->a.sa.sa_family = AF_INET;
> +			addr->a.v4.sin_port = sin->sin_port;
> +			addr->a.v4.sin_addr.s_addr = sin->sin_addr.s_addr;
> +		} else if (sa->sa_family == AF_INET6) {
> +			addr->a.sa.sa_family = AF_INET6;
> +			addr->a.v6.sin6_port = sin6->sin6_port;
> +			memcpy(&addr->a.v6.sin6_addr, &sin6->sin6_addr, sizeof(struct in6_addr));
> +		}
> +		list_add_tail(&addr->list, &asoc->asoc_laddr_list);
> +		SCTP_DEBUG_PRINTK("add_addr_to_laddr: now we added this address to the local list on asoc %p\n", asoc);
> +	} else {
> +		/* this address is also available in all other asocs.  So set 
> +		 * it as ADDR_SRC in the bind-addr list in the endpoint, then 
> +		 * remove from the asoc_laddr_list on the associations.  
> +		 */
> +		SCTP_DEBUG_PRINTK("add_addr_to_laddr: this address is available in all other asocs\n");
> +		bp = &asoc->base.bind_addr;
> +
> +		/* change state of the new address in the bind list */
> +		list_for_each_entry(addr, &bp->address_list, list) {
> +			if (addr->state != SCTP_ADDR_NEW) {
> +				continue;
> +			}
> +			if (addr->a.sa.sa_family != sa->sa_family) {
> +				continue;
> +			}
> +			if (addr->a.sa.sa_family == AF_INET) {
> +				if (sin->sin_port != addr->a.v4.sin_port) {
> +					continue;
> +				}
> +				if (sin->sin_addr.s_addr != 
> +				    addr->a.v4.sin_addr.s_addr) {
> +					continue;
> +				}
> +			} else if (addr->a.sa.sa_family == AF_INET6) {
> +				if (sin6->sin6_port != addr->a.v6.sin6_port) {
> +					continue;
> +				}
> +				if (memcmp(&sin6->sin6_addr, 
> +				    &addr->a.v6.sin6_addr, 
> +				    sizeof(struct in6_addr)) != 0) {
> +					continue;
> +				}
> +			}
> +			SCTP_DEBUG_PRINTK("add_addr_to_laddr: found the entry for this address with ADDR_NEW flag, set to ADDR_SRC\n");
> +			addr->state = SCTP_ADDR_SRC;
> +		}
> +
> +		/* remove the entry of this address from the asoc-local list */
> +		list_for_each_entry(tmp, &ep->asocs, asocs) {
> +			if (tmp == asoc) {
> +				continue;
> +			}
> +			addr = NULL;
> +			list_for_each_entry(addr, &tmp->asoc_laddr_list, list) {
> +				if (sa->sa_family != addr->a.sa.sa_family) {
> +					continue;
> +				}
> +				if (sa->sa_family == AF_INET) {
> +					if (sin->sin_addr.s_addr != addr->a.v4.sin_addr.s_addr) {
> +						continue;
> +					}
> +				} else if (sa->sa_family == AF_INET6) {
> +					if (memcmp(&sin6->sin6_addr, &addr->a.v6.sin6_addr, sizeof(struct in6_addr)) != 0) {
> +						continue;
> +					}
> +				}
> +				break;
> +			}
> +			if (addr == NULL) {
> +				SCTP_DEBUG_PRINTK("add_addr_to_laddr: Huh, asoc %p doesn't have the entry for this address?\n", tmp);
> +				continue;
> +			}
> +			list_del(&addr->list);
> +			kfree(addr);
> +		}
> +	}
> +}
> +
> +/* set address events to associations in the given endpoint.  We assume the ep 
> + * is write-locked, and addr_wq is read-locked.  
> + */
> +int
> +sctp_asconf_mgmt(struct sctp_endpoint *ep, struct sock *sk)
> +{
> +	struct sctp_addr_wait *addrw = NULL;
> +	union sctp_addr *addr = NULL;
> +	int cmd;
> +	int error = 0;
> +
> +	if (!sctp_auto_asconf_enable) {
> +		return (0);
> +	}
> +	if ((ep == NULL) || (sk == NULL)) {
> +		return(-EINVAL);
> +	}
> +	if (list_empty(&sctp_addr_waitq)) {
> +		SCTP_DEBUG_PRINTK("asconf_mgmt: nothing in the wq\n");
> +		return(-EINVAL);
> +	}
> +	addrw = list_first_entry(&sctp_addr_waitq, struct sctp_addr_wait, list);
> +	if (addrw->cmd != SCTP_NEWADDR && addrw->cmd != SCTP_DELADDR) {
> +		return(-EINVAL);
> +	}
> +	addr = &addrw->a;
> +	cmd = addrw->cmd;
> +
> +	if (addr->sa.sa_family == AF_INET) {
> +		addr->v4.sin_port = htons(ep->base.bind_addr.port);
> +	} else if (addr->sa.sa_family == AF_INET6) {
> +		addr->v6.sin6_port = htons(ep->base.bind_addr.port);
> +	}
> +
> +	if (cmd == SCTP_NEWADDR) {
> +		error = sctp_send_asconf_add_ip(sk, (struct sockaddr *)addr, 1);
> +		if (error) {
> +			SCTP_DEBUG_PRINTK("asconf_mgmt: send_asconf_add_ip returns %d\n", error);
> +			return(error);
> +		}
> +	} else if (cmd == SCTP_DELADDR) {
> +		error = sctp_send_asconf_del_ip(sk, (struct sockaddr *)addr, 1);
> +		if (error) {
> +			SCTP_DEBUG_PRINTK("asconf_mgmt: send_asconf_del_ip returns %d\n", error);
> +			return(error);
> +		}
> +	}
> +
> +	return(0);
> +}
> +
>  /* Helper for tunneling sctp_bindx() requests through sctp_setsockopt()
>   *
>   * API 8.1
> @@ -1146,6 +1403,7 @@ static int __sctp_connect(struct sock* s
>  	if ((err == 0 || err == -EINPROGRESS) && assoc_id)
>  		*assoc_id = asoc->assoc_id;
>  
> +	sctp_hash_endpoint(ep);
>  	/* Don't free association on exit. */
>  	asoc = NULL;
>  
> @@ -3559,6 +3817,8 @@ SCTP_STATIC struct sock *sctp_accept(str
>  	struct sctp_association *asoc;
>  	long timeo;
>  	int error = 0;
> +	struct sctp_sock *newsp = NULL;
> +	struct sctp_endpoint *newep = NULL;
>  
>  	sctp_lock_sock(sk);
>  
> @@ -3596,6 +3856,9 @@ SCTP_STATIC struct sock *sctp_accept(str
>  	 * asoc to the newsk.
>  	 */
>  	sctp_sock_migrate(sk, newsk, asoc, SCTP_SOCKET_TCP);
> +	newsp = sctp_sk(newsk);
> +	newep = newsp->ep;
> +	sctp_hash_endpoint(newep);
>  
>  out:
>  	sctp_release_sock(sk);
> diff -ru -x '\.git' -x arch -x drivers -x fs -x asm -x Documentation -p linux-2.6.orig/net/sctp/sysctl.c linux-2.6/net/sctp/sysctl.c
> --- linux-2.6.orig/net/sctp/sysctl.c	2010-04-22 17:55:41.000000000 +0900
> +++ linux-2.6/net/sctp/sysctl.c	2010-06-23 09:11:02.000000000 +0900
> @@ -183,6 +183,13 @@ static ctl_table sctp_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.procname	= "auto_asconf_enable",
> +		.data		= &sctp_auto_asconf_enable,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
>  		.procname	= "prsctp_enable",
>  		.data		= &sctp_prsctp_enable,
>  		.maxlen		= sizeof(int),
> 
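For readers following the cwnd reset in the quoted sctp_send_asconf_add_ip()
hunk: the expression min(4*pathmtu, max(2*pathmtu, 4380)) appears to be the
initial cwnd rule from RFC 4960, section 7.2.1. Below is a minimal user-space
sketch of that arithmetic only. The name initial_cwnd() is hypothetical and is
not a kernel function; the kernel code uses min() and max_t() on
asoc->pathmtu directly.

```c
#include <stdint.h>

/* Hypothetical helper for illustration (not part of the patch or the kernel):
 * computes min(4*MTU, max(2*MTU, 4380)), the same expression the quoted hunk
 * assigns to trans->cwnd when the source address changes.
 */
static uint32_t initial_cwnd(uint32_t pathmtu)
{
	uint32_t lower = 2 * pathmtu;
	uint32_t upper = 4 * pathmtu;

	if (lower < 4380)
		lower = 4380;	/* max(2*MTU, 4380) */

	return upper < lower ? upper : lower;	/* min(4*MTU, ...) */
}
```

With a 1500-byte path MTU this yields 4380 bytes, and with a 9000-byte path
MTU it yields 18000 bytes, so the reset puts the transport back at its
start-up window rather than at zero.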


-- 

Best Regards
-----
Shan Wei



^ permalink raw reply

* Re: [PATCH 0/3] net: RX and TX queue allocation cleanup
From: David Miller @ 2010-10-20  9:30 UTC (permalink / raw)
  To: therbert; +Cc: netdev, eric.dumazet, bhutchings
In-Reply-To: <alpine.DEB.1.00.1010181053200.15812@pokey.mtv.corp.google.com>

From: Tom Herbert <therbert@google.com>
Date: Mon, 18 Oct 2010 11:02:06 -0700 (PDT)

> Move TX queue allocation to register_netdevice.  Enforce
> alloc_netdev_mq & netif_set_real_num_[rt]x_queues netif_alloc_rx_queues
> are called with at least one queue to allocate.

All applied, and I fixed the indentation issue Eric noticed
while integrating.

Thanks.

^ permalink raw reply

* Re: [PATCH 0/3] net: RX and TX queue allocation cleanup
From: David Miller @ 2010-10-20  9:38 UTC (permalink / raw)
  To: therbert; +Cc: netdev, eric.dumazet, bhutchings
In-Reply-To: <20101020.023041.112586469.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 20 Oct 2010 02:30:41 -0700 (PDT)

> From: Tom Herbert <therbert@google.com>
> Date: Mon, 18 Oct 2010 11:02:06 -0700 (PDT)
> 
>> Move TX queue allocation to register_netdevice.  Enforce
>> alloc_netdev_mq & netif_set_real_num_[rt]x_queues netif_alloc_rx_queues
>> are called with at least one queue to allocate.
> 
> All applied, and I fixed the indentation issue Eric noticed
> while integrating.

Just to be clear, I did make sure I used the "v2" series
of the patches, not the original. :-)


^ permalink raw reply
