Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH] net: Fix references to out-of-scope variables in put_cmsg_compat()
From: Jesper Juhl @ 2012-07-22 21:37 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-kernel

In net/compat.c::put_cmsg_compat() we may assign 'data' the address of
either the 'ctv' or 'cts' local variables inside the 'if
(!COMPAT_USE_64BIT_TIME)' branch.

Those variables go out of scope at the end of the 'if' statement, so
when we use 'data' further down in 'copy_to_user(CMSG_COMPAT_DATA(cm),
data, cmlen - sizeof(struct compat_cmsghdr))' there's no telling what
it may be refering to - not good.

Fix the problem by simply giving 'ctv' and 'cts' function scope.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
 net/compat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/compat.c b/net/compat.c
index 1b96281..74ed1d7 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -221,6 +221,8 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
 {
 	struct compat_cmsghdr __user *cm = (struct compat_cmsghdr __user *) kmsg->msg_control;
 	struct compat_cmsghdr cmhdr;
+	struct compat_timeval ctv;
+	struct compat_timespec cts[3];
 	int cmlen;

 	if (cm == NULL || kmsg->msg_controllen < sizeof(*cm)) {
@@ -229,8 +231,6 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
 	}

 	if (!COMPAT_USE_64BIT_TIME) {
-		struct compat_timeval ctv;
-		struct compat_timespec cts[3];
 		if (level == SOL_SOCKET && type == SCM_TIMESTAMP) {
 			struct timeval *tv = (struct timeval *)data;
 			ctv.tv_sec = tv->tv_sec;
-- 
1.7.11.2

-- 
Jesper Juhl <jj@chaosbits.net>       http://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.

^ permalink raw reply related

* Re: [net-next 00/13][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-07-22 21:39 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, gospo, sassmann
In-Reply-To: <20120722.123733.1898891964192153293.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 768 bytes --]

On Sun, 2012-07-22 at 12:37 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 22 Jul 2012 12:24:05 -0700 (PDT)
> 
> > From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > Date: Sat, 21 Jul 2012 16:08:49 -0700
> > 
> >> This series contains updates to ixgbe and ixgbevf.
> >> 
> >> The following are changes since commit
> 186e868786f97c8026f0a81400b451ace306b3a4:
> >>   forcedeth: spin_unlock_irq in interrupt handler fix
> >> and are available in the git repository at:
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
> master
> > 
> > Pulled, thanks Jeff.
> 
> Can you guys actually build test this stuff? 

I did, but it appears I did not have PCI_IOV enabled.  That was my bad,
sorry.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH RFC] tcp: use seqlock for all cached tcp_metrics
From: David Miller @ 2012-07-22 21:50 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1207222314230.2458@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Sun, 22 Jul 2012 23:34:38 +0300 (EEST)

> So, I can just send patch (or 2) for the tcpm_stamp and
> tcp_tw_remember_stamp problems, now or after 2 weeks?

You can send it now.

^ permalink raw reply

* Re: [net-next 00/13][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2012-07-22 21:53 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann
In-Reply-To: <1342993193.23226.0.camel@jtkirshe-mobl>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Sun, 22 Jul 2012 14:39:53 -0700

> On Sun, 2012-07-22 at 12:37 -0700, David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Sun, 22 Jul 2012 12:24:05 -0700 (PDT)
>> 
>> > From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> > Date: Sat, 21 Jul 2012 16:08:49 -0700
>> > 
>> >> This series contains updates to ixgbe and ixgbevf.
>> >> 
>> >> The following are changes since commit
>> 186e868786f97c8026f0a81400b451ace306b3a4:
>> >>   forcedeth: spin_unlock_irq in interrupt handler fix
>> >> and are available in the git repository at:
>> >>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
>> master
>> > 
>> > Pulled, thanks Jeff.
>> 
>> Can you guys actually build test this stuff? 
> 
> I did, but it appears I did not have PCI_IOV enabled.  That was my bad,
> sorry.

If you're not doing "allmodconfig" builds, there are by definition
parts you are not testing.  It's the first thing I do with any change
I apply.

^ permalink raw reply

* Re: [PATCH] net: Fix references to out-of-scope variables in put_cmsg_compat()
From: David Miller @ 2012-07-22 21:54 UTC (permalink / raw)
  To: jj; +Cc: netdev, linux-kernel
In-Reply-To: <alpine.LNX.2.00.1207222335200.31033@swampdragon.chaosbits.net>

From: Jesper Juhl <jj@chaosbits.net>
Date: Sun, 22 Jul 2012 23:37:20 +0200 (CEST)

> In net/compat.c::put_cmsg_compat() we may assign 'data' the address of
> either the 'ctv' or 'cts' local variables inside the 'if
> (!COMPAT_USE_64BIT_TIME)' branch.
> 
> Those variables go out of scope at the end of the 'if' statement, so
> when we use 'data' further down in 'copy_to_user(CMSG_COMPAT_DATA(cm),
> data, cmlen - sizeof(struct compat_cmsghdr))' there's no telling what
> it may be refering to - not good.
> 
> Fix the problem by simply giving 'ctv' and 'cts' function scope.
> 
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH] ppp: add 64 bit stats
From: David Miller @ 2012-07-22 21:54 UTC (permalink / raw)
  To: kgroeneveld; +Cc: netdev
In-Reply-To: <1342988397-3077-1-git-send-email-kgroeneveld@gmail.com>

From: Kevin Groeneveld <kgroeneveld@gmail.com>
Date: Sun, 22 Jul 2012 16:19:56 -0400

> Add 64 bit stats to ppp driver.  The 64 bit stats include tx_bytes,
> rx_bytes, tx_packets and rx_packets.  Other stats are still 32 bit.
> The 64 bit stats can be retrieved via the ndo_get_stats operation.  The
> SIOCGPPPSTATS ioctl is still 32 bit stats only.
> 
> Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com>

I don't see this as being very practical nor justified.

^ permalink raw reply

* Re: [PATCH] mlx4: Add support for EEH error recovery
From: David Miller @ 2012-07-23  0:15 UTC (permalink / raw)
  To: ogerlitz; +Cc: klebers, netdev, jackm, yevgenyp, cascardo, brking, shlomop
In-Reply-To: <500BD558.2060803@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Sun, 22 Jul 2012 13:26:32 +0300

> is there anything in the code you added which maybe implicitly
> assumes PPC arch?

He implemented support for a standard PCI API in the kernel, he
happened to test it on a particular platform, and I think that's
the long and short of it.

^ permalink raw reply

* Re: [PATCH] ppp: add 64 bit stats
From: Kevin Groeneveld @ 2012-07-23  0:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120722.145454.943592825984584407.davem@davemloft.net>

On Sun, Jul 22, 2012 at 5:54 PM, David Miller <davem@davemloft.net> wrote:
> From: Kevin Groeneveld <kgroeneveld@gmail.com>
> Date: Sun, 22 Jul 2012 16:19:56 -0400
>
>> Add 64 bit stats to ppp driver.  The 64 bit stats include tx_bytes,
>
> I don't see this as being very practical nor justified.

It is obviously not up to me to decide what is practical or justified
in this case.  I am just always annoyed when I log into my Linksys
router and check the network stats for my PPP interface and have no
idea how many times the transfer stats have rolled over at 4GB.

Kevin

^ permalink raw reply

* Re: net-next compilation failure since 76ff5cc9
From: Neil Horman @ 2012-07-23  0:37 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, Jiri Pirko, David S. Miller
In-Reply-To: <20120722162107.GA23219@netboy.at.omicron.at>

On Sun, Jul 22, 2012 at 06:21:07PM +0200, Richard Cochran wrote:
> Without CONFIG_RPS, net-next fails to compile with
> 
>   CC      net/core/rtnetlink.o
> /linux/net/core/rtnetlink.c: In function ‘rtnl_fill_ifinfo’:
> /linux/net/core/rtnetlink.c:895: error: ‘struct net_device’ has no member named ‘num_rx_queues’
> make[3]: *** [net/core/rtnetlink.o] Error 1
> 
> Looks like this was caused by
> 
> 76ff5cc9  rtnl: allow to specify number of rx and tx queues on device creation
> 
> Not sure how to fix.
> 
> Thanks,
> Richard
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Its fixed already by 1d69c2b3.  Just pull and you should be good to go.
Neil

^ permalink raw reply

* Re: [PATCH 00/16] Remove the ipv4 routing cache
From: David Miller @ 2012-07-23  0:39 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20120720.142502.1144557295933737451.davem@davemloft.net>


Just FYI, I'm pushing this work out to net-next now.

^ permalink raw reply

* Re: [PATCH] ppp: add 64 bit stats
From: David Miller @ 2012-07-23  0:40 UTC (permalink / raw)
  To: kgroeneveld; +Cc: netdev
In-Reply-To: <CABF+-6XVWpaw6zQe2XNaz4y3oj-KdDm63vbut5jAxs3RWRC6LQ@mail.gmail.com>

From: Kevin Groeneveld <kgroeneveld@gmail.com>
Date: Sun, 22 Jul 2012 20:32:26 -0400

> On Sun, Jul 22, 2012 at 5:54 PM, David Miller <davem@davemloft.net> wrote:
>> From: Kevin Groeneveld <kgroeneveld@gmail.com>
>> Date: Sun, 22 Jul 2012 16:19:56 -0400
>>
>>> Add 64 bit stats to ppp driver.  The 64 bit stats include tx_bytes,
>>
>> I don't see this as being very practical nor justified.
> 
> It is obviously not up to me to decide what is practical or justified
> in this case.  I am just always annoyed when I log into my Linksys
> router and check the network stats for my PPP interface and have no
> idea how many times the transfer stats have rolled over at 4GB.

I guess that's a good point.

^ permalink raw reply

* Re: [PATCH] sctp: Make "Invalid Stream Identifier" ERROR follows SACK when bundling
From: Neil Horman @ 2012-07-23  0:49 UTC (permalink / raw)
  To: xufengzhang.main; +Cc: vyasevich, sri, davem, linux-sctp, netdev, linux-kernel
In-Reply-To: <1342677450-21810-1-git-send-email-xufengzhang.main@gmail.com>

On Thu, Jul 19, 2012 at 01:57:30PM +0800, xufengzhang.main@gmail.com wrote:
> When "Invalid Stream Identifier" ERROR happens after process the
> received DATA chunks, this ERROR chunk is enqueued into outqueue
> before SACK chunk, so when bundling ERROR chunk with SACK chunk,
> the ERROR chunk is always placed first in the packet because of
> the chunk's position in the outqueue.
> This violates sctp specification:
>     RFC 4960 6.5. Stream Identifier and Stream Sequence Number
>     ...The endpoint may bundle the ERROR chunk in the same
>     packet as the SACK as long as the ERROR follows the SACK.
> So we must place SACK first when bundling "Invalid Stream Identifier"
> ERROR and SACK in one packet.
> Although we can do that by enqueue SACK chunk into outqueue before
> ERROR chunk, it will violate the side-effect interpreter processing.
> It's easy to do this job when dequeue chunks from the outqueue,
> by this way, we introduce a flag 'has_isi_err' which indicate
> whether or not the "Invalid Stream Identifier" ERROR happens.
> 
> Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>

Not sure I understand how you came into this error.  If we get an invalid
stream, we issue an SCTP_REPORT_TSN side effect, followed by an SCTP_CMD_REPLY
which sends the error chunk.  The reply goes through
sctp_outq_tail->sctp_outq_chunk->sctp_outq_transmit_chunk->sctp_outq_append_chunk.
That last function checks to see if a sack is already part of the packet, and if
there isn't one, appends one, using the updated tsn map.  So Can you explain in
some more detail how you're getting into this situation?

Thanks!
Neil

^ permalink raw reply

* Re: [PATCH] Fix divide zero crash when xmit with no enabled port
From: ISHIKAWA Mutsumi @ 2012-07-23  0:49 UTC (permalink / raw)
  To: jpirko; +Cc: netdev
In-Reply-To: <20120722135956.GA12129@minipsycho.orion>

>>>>> In <20120722135956.GA12129@minipsycho.orion> 
>>>>>	Jiri Pirko <jpirko@redhat.com> wrote:
>> Sat, Jul 21, 2012 at 10:30:35PM CEST, ishikawa@hanzubon.jp wrote:
>> >
>> >hash calculation in lb_transmit() cause divide zero crash when
>> >xmit on teaming loadbalance mode with no team member port is enabled
>> >(this situation means team->en_port_count = 0). Add check
>> >team->en_port_count is not 0.
>>
>> What kernel are you see the issue one?
>> I believe this is fixed in net-next already:

 Oops, sorry. I've see it on linus's linux-2.6 git tree only.

-- 
ISHIKAWA Mutsumi
 <ishikawa@debian.org>, <ishikawa@hanzubon.jp>, <ishikawa@osdn.jp>

^ permalink raw reply

* many subscribers to this list have been removed
From: David Miller @ 2012-07-23  0:59 UTC (permalink / raw)
  To: netdev

You absolutely cannot bounce back at me, the list owner, simply
because your site is too damn stupid to accept ISO-2022-JP-2
encoded postings.

I refuse to get bombed with bounces from your site just because
your broken email installation refuses to accept postings encoded
in that completely valid character set.

I sort of quietly tolerated this garbage in the past, but it's
a losing game, and people simply need to fix their kit.

If you want to not accept any email encoding in that character
set, that is YOUR business, but once you make such emails
generate a bounce back to the sender (that's me) then it's MY
business.

So people at these sites have two choices 1) turn off the filter or
2) make it quietly drop and not emit a bounce back to the sender.

^ permalink raw reply

* Re: [PATCH V2 resend] ipv6: fix incorrect route 'expires' value passed to userspace
From: Li Wei @ 2012-07-23  1:02 UTC (permalink / raw)
  To: David Laight; +Cc: David Miller, netdev, shemminger
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B6F9C@saturn3.aculab.com>

On 07/20/2012 06:32 PM, David Laight wrote:
>> -	else if (rt->dst.expires - jiffies < INT_MAX)
>> -		expires = rt->dst.expires - jiffies;
>> +	else if ((long)rt->dst.expires - (long)jiffies > INT_MIN
>> +			&& (long)rt->dst.expires - (long)jiffies <
> INT_MAX)
>> +		expires = (long)rt->dst.expires - (long)jiffies;
>>  	else
>> -		expires = INT_MAX;
>> +		expires = time_is_after_jiffies(rt->dst.expires) ?
> INT_MAX : INT_MIN;
> 
> I can't help feeling there is a better way to do this.
> Maybe:
> 	long expires = rt->dst.expires - jiffies;
> 	if (expires != (int)expires)
> 		expires = expires > 0 ? INT_MAX : INT_MIN;
> Although maybe -INT_MAX instead of INT_MIN.
> 
> 	David
> 

Thanks David, your code looks much cleaner and can archieve the same
function except we should use
	long expires = (long)rt->dst.expires - (long)jiffies;
to avoid the wrapping of jiffies.

Thanks, 
Wei

^ permalink raw reply

* Re: [PATCH V2 resend] ipv6: fix incorrect route 'expires' value passed to userspace
From: Li Wei @ 2012-07-23  1:05 UTC (permalink / raw)
  To: David Miller; +Cc: David.Laight, netdev, shemminger
In-Reply-To: <20120720.112241.2111041227435292899.davem@davemloft.net>

On 07/21/2012 02:22 AM, David Miller wrote:
> From: "David Laight" <David.Laight@ACULAB.COM>
> Date: Fri, 20 Jul 2012 11:32:05 +0100
> 
>>> -	else if (rt->dst.expires - jiffies < INT_MAX)
>>> -		expires = rt->dst.expires - jiffies;
>>> +	else if ((long)rt->dst.expires - (long)jiffies > INT_MIN
>>> +			&& (long)rt->dst.expires - (long)jiffies <
>> INT_MAX)
>>> +		expires = (long)rt->dst.expires - (long)jiffies;
>>>  	else
>>> -		expires = INT_MAX;
>>> +		expires = time_is_after_jiffies(rt->dst.expires) ?
>> INT_MAX : INT_MIN;
>>
>> I can't help feeling there is a better way to do this.
>> Maybe:
>> 	long expires = rt->dst.expires - jiffies;
>> 	if (expires != (int)expires)
>> 		expires = expires > 0 ? INT_MAX : INT_MIN;
>> Although maybe -INT_MAX instead of INT_MIN.
> 
> This patch also does not apply at all to net-next, so needs to be
> redone regardless.

I'll redone this patch base on 'net-next'.

Thanks

> 
> 

^ permalink raw reply

* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Gao feng @ 2012-07-23  1:15 UTC (permalink / raw)
  To: Srivatsa S. Bhat, eric.dumazet, davem
  Cc: nhorman, linux-kernel, netdev, mark.d.rustad, john.r.fastabend,
	lizefan
In-Reply-To: <20120719162532.23505.85946.stgit@srivatsabhat.in.ibm.com>

于 2012年07月20日 00:27, Srivatsa S. Bhat 写道:
> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
> netprio cgroup), boot fails with the following NULL pointer dereference:
> 
> Initializing cgroup subsys devices
> Initializing cgroup subsys freezer
> Initializing cgroup subsys net_cls
> Initializing cgroup subsys blkio
> Initializing cgroup subsys perf_event
> Initializing cgroup subsys net_prio
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> PGD 0
> Oops: 0000 [#1] SMP
> CPU 0
> Modules linked in:
> 
> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
> RIP: 0010:[<ffffffff8145e8d6>]  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> RSP: 0000:ffffffff81a01ea8  EFLAGS: 00010213
> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
> FS:  0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
> Stack:
>  ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
>  ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
>  0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
> Call Trace:
>  [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
>  [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
>  [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
>  [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
>  [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
>  [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
> RIP  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>  RSP <ffffffff81a01ea8>
> CR2: 0000000000000698
> ---[ end trace a7919e7f17c0a725 ]---
> Kernel panic - not syncing: Attempted to kill the idle task!
> 
> The code corresponds to:
> 
> update_netdev_tables():
>         for_each_netdev(&init_net, dev) {
>                 map = rtnl_dereference(dev->priomap);  <---- HERE
> 
> 
> The list head is initialized in netdev_init(), which is called much
> later than cgrp_create(). So the problem is that we are calling
> update_netdev_tables() way too early (in cgrp_create()), which will
> end up traversing the not-yet-circular linked list. So at some point,
> the dev pointer will become NULL and hence dev->priomap becomes an
> invalid access.
> 
> To fix this, just remove the update_netdev_tables() function entirely,
> since it appears that write_update_netdev_table() will handle things
> just fine.

The reason I add update_netdev_tables in cgrp_create is to avoid additional
bound checkings when we accessing the dev->priomap.priomap.

Eric,can we revert this commit 91c68ce2b26319248a32d7baa1226f819d283758 now?
I think it's safe enough to access priomap without bound check.

Thanks

^ permalink raw reply

* Re: [PATCH] sctp: Make "Invalid Stream Identifier" ERROR follows SACK when bundling
From: xufeng zhang @ 2012-07-23  2:30 UTC (permalink / raw)
  To: Neil Horman
  Cc: xufengzhang.main, vyasevich, sri, davem, linux-sctp, netdev,
	linux-kernel
In-Reply-To: <20120723004932.GB8040@neilslaptop.think-freely.org>

On 07/23/2012 08:49 AM, Neil Horman wrote:
>
> Not sure I understand how you came into this error.  If we get an invalid
> stream, we issue an SCTP_REPORT_TSN side effect, followed by an SCTP_CMD_REPLY
> which sends the error chunk.  The reply goes through
> sctp_outq_tail->sctp_outq_chunk->sctp_outq_transmit_chunk->sctp_outq_append_chunk.
> That last function checks to see if a sack is already part of the packet, and if
> there isn't one, appends one, using the updated tsn map.
Yes, you are right, but consider the invalid stream identifier's DATA 
chunk is the first
DATA chunk in the association which will need SACK immediately.
Here is what I thought of the scenario:
     sctp_sf_eat_data_6_2()
         -->sctp_eat_data()
             -->sctp_make_op_error()
             -->sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(err))
             -->sctp_outq_tail()          /* First enqueue ERROR chunk */
         -->sctp_add_cmd_sf(commands, SCTP_CMD_GEN_SACK, SCTP_FORCE())
             -->sctp_gen_sack()
                 -->sctp_make_sack()
                 -->sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, 
SCTP_CHUNK(sack))
                 -->sctp_outq_tail()          /* Then enqueue SACK chunk */

So SACK chunk is enqueued after ERROR chunk.
> So Can you explain in
> some more detail how you're getting into this situation?
>    
Actually it's triggered by a customer's test case, but we can also 
reproduce this problem
easily by explicitly contaminating the sctp stack:
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -701,7 +701,7 @@ struct sctp_chunk *sctp_make_datafrag_empty(struct 
sctp_association *asoc,
          * creating the chunk.
          */
         dp.tsn = 0;
-       dp.stream = htons(sinfo->sinfo_stream);
+       dp.stream = htons(sinfo->sinfo_stream) + 10;
         dp.ppid   = sinfo->sinfo_ppid;

         /* Set the flags for an unordered send.  */


Then run sctp_darn application and capture the sctp packet at the same time.



Thanks,
Xufeng Zhang
> Thanks!
> Neil
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
>    

^ permalink raw reply

* [PATCH net] rds: set correct msg_namelen
From: Weiping Pan @ 2012-07-23  2:37 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Jay Fenlason (fenlason@redhat.com) found a bug,
that recvfrom() on an RDS socket can return the contents of random kernel
memory to userspace if it was called with a address length larger than
sizeof(struct sockaddr_in).
rds_recvmsg() also fails to set the addr_len paramater properly before
returning, but that's just a bug.
There are also a number of cases wher recvfrom() can return an entirely bogus
address. Anything in rds_recvmsg() that returns a non-negative value but does
not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
at the end of the while(1) loop will return up to 128 bytes of kernel memory
to userspace.

And I write two test programs to reproduce this bug, you will see that in
rds_server, fromAddr will be overwritten and the following sock_fd will be
destroyed.
Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is
better to make the kernel copy the real length of address to user space in
such case.

How to run the test programs ?
I test them on 32bit x86 system, 3.5.0-rc7.

1 compile
gcc -o rds_client rds_client.c
gcc -o rds_server rds_server.c

2 run ./rds_server on one console

3 run ./rds_client on another console

4 you will see something like:
server is waiting to receive data...
old socket fd=3
server received data from client:data from client
msg.msg_namelen=32
new socket fd=-1067277685
sendmsg()
: Bad file descriptor

/***************** rds_client.c ********************/

int main(void)
{
	int sock_fd;
	struct sockaddr_in serverAddr;
	struct sockaddr_in toAddr;
	char recvBuffer[128] = "data from client";
	struct msghdr msg;
	struct iovec iov;

	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if (sock_fd < 0) {
		perror("create socket error\n");
		exit(1);
	}

	memset(&serverAddr, 0, sizeof(serverAddr));
	serverAddr.sin_family = AF_INET;
	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	serverAddr.sin_port = htons(4001);

	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
		perror("bind() error\n");
		close(sock_fd);
		exit(1);
	}

	memset(&toAddr, 0, sizeof(toAddr));
	toAddr.sin_family = AF_INET;
	toAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	toAddr.sin_port = htons(4000);
	msg.msg_name = &toAddr;
	msg.msg_namelen = sizeof(toAddr);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = strlen(recvBuffer) + 1;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	if (sendmsg(sock_fd, &msg, 0) == -1) {
		perror("sendto() error\n");
		close(sock_fd);
		exit(1);
	}

	printf("client send data:%s\n", recvBuffer);

	memset(recvBuffer, '\0', 128);

	msg.msg_name = &toAddr;
	msg.msg_namelen = sizeof(toAddr);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = 128;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;
	if (recvmsg(sock_fd, &msg, 0) == -1) {
		perror("recvmsg() error\n");
		close(sock_fd);
		exit(1);
	}

	printf("receive data from server:%s\n", recvBuffer);

	close(sock_fd);

	return 0;
}

/***************** rds_server.c ********************/

int main(void)
{
	struct sockaddr_in fromAddr;
	int sock_fd;
	struct sockaddr_in serverAddr;
	unsigned int addrLen;
	char recvBuffer[128];
	struct msghdr msg;
	struct iovec iov;

	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if(sock_fd < 0) {
		perror("create socket error\n");
		exit(0);
	}

	memset(&serverAddr, 0, sizeof(serverAddr));
	serverAddr.sin_family = AF_INET;
	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	serverAddr.sin_port = htons(4000);
	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
		perror("bind error\n");
		close(sock_fd);
		exit(1);
	}

	printf("server is waiting to receive data...\n");
	msg.msg_name = &fromAddr;

	/*
	 * I add 16 to sizeof(fromAddr), ie 32,
	 * and pay attention to the definition of fromAddr,
	 * recvmsg() will overwrite sock_fd,
	 * since kernel will copy 32 bytes to userspace.
	 *
	 * If you just use sizeof(fromAddr), it works fine.
	 * */
	msg.msg_namelen = sizeof(fromAddr) + 16;
	/* msg.msg_namelen = sizeof(fromAddr); */
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = 128;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	while (1) {
		printf("old socket fd=%d\n", sock_fd);
		if (recvmsg(sock_fd, &msg, 0) == -1) {
			perror("recvmsg() error\n");
			close(sock_fd);
			exit(1);
		}
		printf("server received data from client:%s\n", recvBuffer);
		printf("msg.msg_namelen=%d\n", msg.msg_namelen);
		printf("new socket fd=%d\n", sock_fd);
		strcat(recvBuffer, "--data from server");
		if (sendmsg(sock_fd, &msg, 0) == -1) {
			perror("sendmsg()\n");
			close(sock_fd);
			exit(1);
		}
	}

	close(sock_fd);
	return 0;
}

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 net/rds/recv.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index 5c6e9f1..9f0f17c 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -410,6 +410,8 @@ int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 
 	rdsdebug("size %zu flags 0x%x timeo %ld\n", size, msg_flags, timeo);
 
+	msg->msg_namelen = 0;
+
 	if (msg_flags & MSG_OOB)
 		goto out;
 
@@ -485,6 +487,7 @@ int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 			sin->sin_port = inc->i_hdr.h_sport;
 			sin->sin_addr.s_addr = inc->i_saddr;
 			memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+			msg->msg_namelen = sizeof(*sin);
 		}
 		break;
 	}
-- 
1.7.4

^ permalink raw reply related

* Re: [PATCH] sctp: Make "Invalid Stream Identifier" ERROR follows SACK when bundling
From: xufeng zhang @ 2012-07-23  5:16 UTC (permalink / raw)
  To: vyasevich, sri, davem; +Cc: xufengzhang.main, linux-sctp, netdev, linux-kernel
In-Reply-To: <1342677450-21810-1-git-send-email-xufengzhang.main@gmail.com>

On 07/19/2012 01:57 PM, xufengzhang.main@gmail.com wrote:
> When "Invalid Stream Identifier" ERROR happens after process the
> received DATA chunks, this ERROR chunk is enqueued into outqueue
> before SACK chunk, so when bundling ERROR chunk with SACK chunk,
> the ERROR chunk is always placed first in the packet because of
> the chunk's position in the outqueue.
> This violates sctp specification:
>      RFC 4960 6.5. Stream Identifier and Stream Sequence Number
>      ...The endpoint may bundle the ERROR chunk in the same
>      packet as the SACK as long as the ERROR follows the SACK.
> So we must place SACK first when bundling "Invalid Stream Identifier"
> ERROR and SACK in one packet.
> Although we can do that by enqueue SACK chunk into outqueue before
> ERROR chunk, it will violate the side-effect interpreter processing.
> It's easy to do this job when dequeue chunks from the outqueue,
> by this way, we introduce a flag 'has_isi_err' which indicate
> whether or not the "Invalid Stream Identifier" ERROR happens.
>
> Signed-off-by: Xufeng Zhang<xufeng.zhang@windriver.com>
> ---
>   include/net/sctp/structs.h |    2 ++
>   net/sctp/output.c          |   26 ++++++++++++++++++++++++++
>   2 files changed, 28 insertions(+), 0 deletions(-)
>
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 88949a9..5adf4de 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -842,6 +842,8 @@ struct sctp_packet {
>   	    has_sack:1,		/* This packet contains a SACK chunk. */
>   	    has_auth:1,		/* This packet contains an AUTH chunk */
>   	    has_data:1,		/* This packet contains at least 1 DATA chunk */
> +	    has_isi_err:1,	/* This packet contains a "Invalid Stream
> +				 * Identifier" ERROR chunk */
>   	    ipfragok:1,		/* So let ip fragment this packet */
>   	    malloced:1;		/* Is it malloced? */
>   };
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 817174e..77fb1ae 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -79,6 +79,7 @@ static void sctp_packet_reset(struct sctp_packet *packet)
>   	packet->has_sack = 0;
>   	packet->has_data = 0;
>   	packet->has_auth = 0;
> +	packet->has_isi_err = 0;
>   	packet->ipfragok = 0;
>   	packet->auth = NULL;
>   }
> @@ -267,6 +268,7 @@ static sctp_xmit_t sctp_packet_bundle_sack(struct sctp_packet *pkt,
>   sctp_xmit_t sctp_packet_append_chunk(struct sctp_packet *packet,
>   				     struct sctp_chunk *chunk)
>   {
> +	struct sctp_chunk *lchunk;
>   	sctp_xmit_t retval = SCTP_XMIT_OK;
>   	__u16 chunk_len = WORD_ROUND(ntohs(chunk->chunk_hdr->length));
>
> @@ -316,7 +318,31 @@ sctp_xmit_t sctp_packet_append_chunk(struct sctp_packet *packet,
>   		packet->has_cookie_echo = 1;
>   		break;
>
> +	    case SCTP_CID_ERROR:
> +		if (chunk->subh.err_hdr->cause&  SCTP_ERROR_INV_STRM)
> +			packet->has_isi_err = 1;
> +		break;
> +
>   	    case SCTP_CID_SACK:
> +		/* RFC 4960
> +		 * 6.5 Stream Identifier and Stream Sequence Number
> +		 * The endpoint may bundle the ERROR chunk in the same
> +		 * packet as the SACK as long as the ERROR follows the SACK.
> +		 */
> +		if (packet->has_isi_err) {
> +			if (list_is_singular(&packet->chunk_list))
> +				list_add(&chunk->list,&packet->chunk_list);
> +			else {
> +				lchunk = list_first_entry(&packet->chunk_list,
> +						struct sctp_chunk, list);
> +				list_add(&chunk->list,&lchunk->list);
> +			}
>    
And I should clarify the above judgment code.
AFAIK, there should be two cases for the bundling when invalid stream 
identifier error happens:
1). COOKIE_ACK ERROR SACK
2). ERROR SACK
So I need to deal with the two cases differently.


Thanks,
Xufeng Zhang
> +			packet->size += chunk_len;
> +			chunk->transport = packet->transport;
> +			packet->has_sack = 1;
> +			goto finish;
> +		}
> +
>   		packet->has_sack = 1;
>   		break;
>
>    

^ permalink raw reply

* Re: [PATCH] ppp: add 64 bit stats
From: Eric Dumazet @ 2012-07-23  5:16 UTC (permalink / raw)
  To: Kevin Groeneveld; +Cc: netdev
In-Reply-To: <1342988397-3077-1-git-send-email-kgroeneveld@gmail.com>

On Sun, 2012-07-22 at 16:19 -0400, Kevin Groeneveld wrote:
> Add 64 bit stats to ppp driver.  The 64 bit stats include tx_bytes,
> rx_bytes, tx_packets and rx_packets.  Other stats are still 32 bit.
> The 64 bit stats can be retrieved via the ndo_get_stats operation.  The
> SIOCGPPPSTATS ioctl is still 32 bit stats only.
> 
> Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com>
> ---
>  drivers/net/ppp/ppp_generic.c |  110 +++++++++++++++++++++++++++++++++++------
>  1 file changed, 95 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
> index 5c05572..210238c 100644
> --- a/drivers/net/ppp/ppp_generic.c
> +++ b/drivers/net/ppp/ppp_generic.c
> @@ -94,6 +94,19 @@ struct ppp_file {
>  #define PF_TO_CHANNEL(pf)	PF_TO_X(pf, struct channel)
>  
>  /*
> + * Data structure to hold primary network stats for which
> + * we want to use 64 bit storage.  Other network stats
> + * are stored in dev->stats of the ppp strucute.
> + */
> +struct ppp_link_stats {
> +	struct u64_stats_sync syncp;
> +	u64 tx_bytes;
> +	u64 tx_packets;
> +	u64 rx_bytes;
> +	u64 rx_packets;
> +};

Hmm. This patches adds races, but also adds a good amount of memory on
these servers with thousand of ppp devices, and 64 cpus.

I really doubt ppp is performance sensitive, it so doesnt need percpu
counter.

If you really want 64bits stats on ppp, use proper synchronization
around u64 counters (but shared ones)

^ permalink raw reply

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Jason Wang @ 2012-07-23  5:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, habanero, mashirle, kvm, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <20120720123320.GC16550@redhat.com>

On 07/20/2012 08:33 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 05, 2012 at 06:29:54PM +0800, Jason Wang wrote:
>> This patch let the virtio_net driver can negotiate the number of queues it
>> wishes to use through control virtqueue and export an ethtool interface to let
>> use tweak it.
>>
>> As current multiqueue virtio-net implementation has optimizations on per-cpu
>> virtuqueues, so only two modes were support:
>>
>> - single queue pair mode
>> - multiple queue paris mode, the number of queues matches the number of vcpus
>>
>> The single queue mode were used by default currently due to regression of
>> multiqueue mode in some test (especially in stream test).
>>
>> Since virtio core does not support paritially deleting virtqueues, so during
>> mode switching the whole virtqueue were deleted and the driver would re-create
>> the virtqueues it would used.
>>
>> btw. The queue number negotiating were defered to .ndo_open(), this is because
>> only after feature negotitaion could we send the command to control virtqueue
>> (as it may also use event index).
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/net/virtio_net.c   |  171 ++++++++++++++++++++++++++++++++++---------
>>   include/linux/virtio_net.h |    7 ++
>>   2 files changed, 142 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 7410187..3339eeb 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -88,6 +88,7 @@ struct receive_queue {
>>
>>   struct virtnet_info {
>>   	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
>> +	u16 total_queue_pairs;
>>
>>   	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>>   	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>> @@ -137,6 +138,8 @@ struct padded_vnet_hdr {
>>   	char padding[6];
>>   };
>>
>> +static const struct ethtool_ops virtnet_ethtool_ops;
>> +
>>   static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>>   {
>>   	int ret = virtqueue_get_queue_index(vq);
>> @@ -802,22 +805,6 @@ static void virtnet_netpoll(struct net_device *dev)
>>   }
>>   #endif
>>
>> -static int virtnet_open(struct net_device *dev)
>> -{
>> -	struct virtnet_info *vi = netdev_priv(dev);
>> -	int i;
>> -
>> -	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> -		/* Make sure we have some buffers: if oom use wq. */
>> -		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> -			queue_delayed_work(system_nrt_wq,
>> -					&vi->rq[i]->refill, 0);
>> -		virtnet_napi_enable(vi->rq[i]);
>> -	}
>> -
>> -	return 0;
>> -}
>> -
>>   /*
>>    * Send command via the control virtqueue and check status.  Commands
>>    * supported by the hypervisor, as indicated by feature bits, should
>> @@ -873,6 +860,43 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>>   	rtnl_unlock();
>>   }
>>
>> +static int virtnet_set_queues(struct virtnet_info *vi)
>> +{
>> +	struct scatterlist sg;
>> +	struct net_device *dev = vi->dev;
>> +	sg_init_one(&sg,&vi->num_queue_pairs, sizeof(vi->num_queue_pairs));
>> +
>> +	if (!vi->has_cvq)
>> +		return -EINVAL;
>> +
>> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MULTIQUEUE,
>> +				  VIRTIO_NET_CTRL_MULTIQUEUE_QNUM,&sg, 1, 0)){
>> +		dev_warn(&dev->dev, "Fail to set the number of queue pairs to"
>> +			 " %d\n", vi->num_queue_pairs);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int virtnet_open(struct net_device *dev)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		/* Make sure we have some buffers: if oom use wq. */
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>> +		virtnet_napi_enable(vi->rq[i]);
>> +	}
>> +
>> +	virtnet_set_queues(vi);
>> +
>> +	return 0;
>> +}
>> +
>>   static int virtnet_close(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> @@ -1013,12 +1037,6 @@ static void virtnet_get_drvinfo(struct net_device *dev,
>>
>>   }
>>
>> -static const struct ethtool_ops virtnet_ethtool_ops = {
>> -	.get_drvinfo = virtnet_get_drvinfo,
>> -	.get_link = ethtool_op_get_link,
>> -	.get_ringparam = virtnet_get_ringparam,
>> -};
>> -
>>   #define MIN_MTU 68
>>   #define MAX_MTU 65535
>>
>> @@ -1235,7 +1253,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>>
>>   err:
>>   	if (ret&&  names)
>> -		for (i = 0; i<  vi->num_queue_pairs * 2; i++)
>> +		for (i = 0; i<  total_vqs * 2; i++)
>>   			kfree(names[i]);
>>
>>   	kfree(names);
>> @@ -1373,7 +1391,6 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	mutex_init(&vi->config_lock);
>>   	vi->config_enable = true;
>>   	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
>> -	vi->num_queue_pairs = num_queue_pairs;
>>
>>   	/* If we can receive ANY GSO packets, we must allocate large ones. */
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>   		vi->has_cvq = true;
>>
>> +	/* Use single tx/rx queue pair as default */
>> +	vi->num_queue_pairs = 1;
>> +	vi->total_queue_pairs = num_queue_pairs;
>> +
>>   	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>>   	err = virtnet_setup_vqs(vi);
>>   	if (err)
>> @@ -1396,6 +1417,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>>   		dev->features |= NETIF_F_HW_VLAN_FILTER;
>>
>> +	netif_set_real_num_tx_queues(dev, 1);
>> +	netif_set_real_num_rx_queues(dev, 1);
>> +
>>   	err = register_netdev(dev);
>>   	if (err) {
>>   		pr_debug("virtio_net: registering device failed\n");
>> @@ -1403,7 +1427,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	}
>>
>>   	/* Last of all, set up some receive buffers. */
>> -	for (i = 0; i<  num_queue_pairs; i++) {
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>>   		try_fill_recv(vi->rq[i], GFP_KERNEL);
>>
>>   		/* If we didn't even get one input buffer, we're useless. */
>> @@ -1474,10 +1498,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>>   	free_netdev(vi->dev);
>>   }
>>
>> -#ifdef CONFIG_PM
>> -static int virtnet_freeze(struct virtio_device *vdev)
>> +static void virtnet_stop(struct virtnet_info *vi)
>>   {
>> -	struct virtnet_info *vi = vdev->priv;
>>   	int i;
>>
>>   	/* Prevent config work handler from accessing the device */
>> @@ -1493,17 +1515,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   		for (i = 0; i<  vi->num_queue_pairs; i++)
>>   			napi_disable(&vi->rq[i]->napi);
>>
>> -
>> -	remove_vq_common(vi);
>> -
>> -	flush_work(&vi->config_work);
>> -
>> -	return 0;
>>   }
>>
>> -static int virtnet_restore(struct virtio_device *vdev)
>> +static int virtnet_start(struct virtnet_info *vi)
>>   {
>> -	struct virtnet_info *vi = vdev->priv;
>>   	int err, i;
>>
>>   	err = virtnet_setup_vqs(vi);
>> @@ -1527,6 +1542,29 @@ static int virtnet_restore(struct virtio_device *vdev)
>>
>>   	return 0;
>>   }
>> +
>> +#ifdef CONFIG_PM
>> +static int virtnet_freeze(struct virtio_device *vdev)
>> +{
>> +	struct virtnet_info *vi = vdev->priv;
>> +
>> +	virtnet_stop(vi);
>> +
>> +	remove_vq_common(vi);
>> +
>> +	flush_work(&vi->config_work);
>> +
>> +	return 0;
>> +}
>> +
>> +static int virtnet_restore(struct virtio_device *vdev)
>> +{
>> +	struct virtnet_info *vi = vdev->priv;
>> +
>> +	virtnet_start(vi);
>> +
>> +	return 0;
>> +}
>>   #endif
>>
>>   static struct virtio_device_id id_table[] = {
>> @@ -1560,6 +1598,67 @@ static struct virtio_driver virtio_net_driver = {
>>   #endif
>>   };
>>
>> +static int virtnet_set_channels(struct net_device *dev,
>> +				struct ethtool_channels *channels)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +	u16 queues = channels->rx_count;
>> +	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
>> +
>> +	if (channels->rx_count != channels->tx_count)
>> +		return -EINVAL;
>> +	/* Only two modes were support currently */
> s/were/are/ ?

Ok.
>
>> +	if (queues != vi->total_queue_pairs&&  queues != 1)
>> +		return -EINVAL;
> So userspace has to get queue number right. How does it know
> what the valid value is?

Usespace could query the number through ethtool -l (virtnet_get_channels()).
>
>> +	if (!vi->has_cvq)
>> +		return -EINVAL;
>> +
>> +	virtnet_stop(vi);
>> +
>> +	netif_set_real_num_tx_queues(dev, queues);
>> +	netif_set_real_num_rx_queues(dev, queues);
>> +
>> +	remove_vq_common(vi);
>> +	flush_work(&vi->config_work);
>> +
>> +	vi->num_queue_pairs = queues;
>> +	virtnet_start(vi);
>> +
>> +	vi->vdev->config->finalize_features(vi->vdev);
>> +
>> +	if (virtnet_set_queues(vi))
>> +		status |= VIRTIO_CONFIG_S_FAILED;
>> +	else
>> +		status |= VIRTIO_CONFIG_S_DRIVER_OK;
>> +
>> +	vi->vdev->config->set_status(vi->vdev, status);
>> +
> Why do we need to tweak status like that?

Because remove_vq_common() reset the device. Since virtio core api does 
not support remove a specified number of virtqueues, it's the only 
method when we change the number of queues.
> Can we maybe just roll changes back on error?

Not easy, we reset and detroy previous virtqueues and create new ones.
>> +	return 0;
>> +}
>> +
>> +static void virtnet_get_channels(struct net_device *dev,
>> +				 struct ethtool_channels *channels)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +
>> +	channels->max_rx = vi->total_queue_pairs;
>> +	channels->max_tx = vi->total_queue_pairs;
>> +	channels->max_other = 0;
>> +	channels->max_combined = 0;
>> +	channels->rx_count = vi->num_queue_pairs;
>> +	channels->tx_count = vi->num_queue_pairs;
>> +	channels->other_count = 0;
>> +	channels->combined_count = 0;
>> +}
>> +
>> +static const struct ethtool_ops virtnet_ethtool_ops = {
>> +	.get_drvinfo = virtnet_get_drvinfo,
>> +	.get_link = ethtool_op_get_link,
>> +	.get_ringparam = virtnet_get_ringparam,
>> +	.set_channels = virtnet_set_channels,
>> +	.get_channels = virtnet_get_channels,
>> +};
>> +
>>   static int __init init(void)
>>   {
>>   	return register_virtio_driver(&virtio_net_driver);
>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>> index 60f09ff..0d21e08 100644
>> --- a/include/linux/virtio_net.h
>> +++ b/include/linux/virtio_net.h
>> @@ -169,4 +169,11 @@ struct virtio_net_ctrl_mac {
>>   #define VIRTIO_NET_CTRL_ANNOUNCE       3
>>    #define VIRTIO_NET_CTRL_ANNOUNCE_ACK         0
>>
>> +/*
>> + * Control multiqueue
>> + *
>> + */
>> +#define VIRTIO_NET_CTRL_MULTIQUEUE       4
>> + #define VIRTIO_NET_CTRL_MULTIQUEUE_QNUM         0
>> +
>>   #endif /* _LINUX_VIRTIO_NET_H */
>> -- 
>> 1.7.1
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
From: Jason Wang @ 2012-07-23  5:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, habanero, mashirle, kvm, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <20120720134014.GD16550@redhat.com>

On 07/20/2012 09:40 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 05, 2012 at 06:29:53PM +0800, Jason Wang wrote:
>> This patch converts virtio_net to a multi queue device. After negotiated
>> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
>> and driver could read the number from config space.
>>
>> The driver expects the number of rx/tx queue paris is equal to the number of
>> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
>> optimization were introduced:
>>
>> - Txq selection is based on the processor id in order to avoid contending a lock
>>    whose owner may exits to host.
>> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>>    the queue pairs.
>>
>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Overall fine. I think it is best to smash the following patch
> into this one, so that default behavior does not
> jump to mq then back. some comments below: mostly nits, and a minor bug.

Sure, thanks for the reviewing.
>
> If you are worried the patch is too big, it can be split
> differently
> 	- rework to use send_queue/receive_queue structures, no
> 	  functional changes.
> 	- add multiqueue
>
> but this is not a must.
>
>> ---
>>   drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
>>   include/linux/virtio_net.h |    2 +
>>   2 files changed, 452 insertions(+), 195 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 1db445b..7410187 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -26,6 +26,7 @@
>>   #include<linux/scatterlist.h>
>>   #include<linux/if_vlan.h>
>>   #include<linux/slab.h>
>> +#include<linux/interrupt.h>
>>
>>   static int napi_weight = 128;
>>   module_param(napi_weight, int, 0444);
>> @@ -41,6 +42,8 @@ module_param(gso, bool, 0444);
>>   #define VIRTNET_SEND_COMMAND_SG_MAX    2
>>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>>
>> +#define MAX_QUEUES 256
>> +
>>   struct virtnet_stats {
>>   	struct u64_stats_sync tx_syncp;
>>   	struct u64_stats_sync rx_syncp;
> Would be a bit better not to have artificial limits like that.
> Maybe allocate arrays at probe time, then we can
> take whatever the device gives us?

Sure.
>> @@ -51,43 +54,69 @@ struct virtnet_stats {
>>   	u64 rx_packets;
>>   };
>>
>> -struct virtnet_info {
>> -	struct virtio_device *vdev;
>> -	struct virtqueue *rvq, *svq, *cvq;
>> -	struct net_device *dev;
>> +/* Internal representation of a send virtqueue */
>> +struct send_queue {
>> +	/* Virtqueue associated with this send _queue */
>> +	struct virtqueue *vq;
>> +
>> +	/* TX: fragments + linear part + virtio header */
>> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>> +};
>> +
>> +/* Internal representation of a receive virtqueue */
>> +struct receive_queue {
>> +	/* Virtqueue associated with this receive_queue */
>> +	struct virtqueue *vq;
>> +
>> +	/* Back pointer to the virtnet_info */
>> +	struct virtnet_info *vi;
>> +
>>   	struct napi_struct napi;
>> -	unsigned int status;
>>
>>   	/* Number of input buffers, and max we've ever had. */
>>   	unsigned int num, max;
>>
>> +	/* Work struct for refilling if we run low on memory. */
>> +	struct delayed_work refill;
>> +
>> +	/* Chain pages by the private ptr. */
>> +	struct page *pages;
>> +
>> +	/* RX: fragments + linear part + virtio header */
>> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>> +};
>> +
>> +struct virtnet_info {
>> +	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
>> +
>> +	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>> +	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> The assumption is a tx/rx pair is handled on the same cpu, yes?
> If yes maybe make it a single array to improve cache locality
> a bit?
> 	struct queue_pair {
> 		struct send_queue sq;
> 		struct receive_queue rq;
> 	};

Ok.
>> +	struct virtqueue *cvq;
>> +
>> +	struct virtio_device *vdev;
>> +	struct net_device *dev;
>> +	unsigned int status;
>> +
>>   	/* I like... big packets and I cannot lie! */
>>   	bool big_packets;
>>
>>   	/* Host will merge rx buffers for big packets (shake it! shake it!) */
>>   	bool mergeable_rx_bufs;
>>
>> +	/* Has control virtqueue */
>> +	bool has_cvq;
>> +
> won't checking (cvq != NULL) be enough?

Enough, so has_cvq is dupliated with vi->cvq. I will remove it in next 
version.
>
>>   	/* enable config space updates */
>>   	bool config_enable;
>>
>>   	/* Active statistics */
>>   	struct virtnet_stats __percpu *stats;
>>
>> -	/* Work struct for refilling if we run low on memory. */
>> -	struct delayed_work refill;
>> -
>>   	/* Work struct for config space updates */
>>   	struct work_struct config_work;
>>
>>   	/* Lock for config space updates */
>>   	struct mutex config_lock;
>> -
>> -	/* Chain pages by the private ptr. */
>> -	struct page *pages;
>> -
>> -	/* fragments + linear part + virtio header */
>> -	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
>> -	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>>   };
>>
>>   struct skb_vnet_hdr {
>> @@ -108,6 +137,22 @@ struct padded_vnet_hdr {
>>   	char padding[6];
>>   };
>>
>> +static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>> +{
>> +	int ret = virtqueue_get_queue_index(vq);
>> +
>> +	/* skip ctrl vq */
>> +	if (vi->has_cvq)
>> +		return (ret - 1) / 2;
>> +	else
>> +		return ret / 2;
>> +}
>> +
>> +static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>> +{
>> +	return virtqueue_get_queue_index(vq) / 2;
>> +}
>> +
>>   static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>>   {
>>   	return (struct skb_vnet_hdr *)skb->cb;
>> @@ -117,22 +162,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>>    * private is used to chain pages for big packets, put the whole
>>    * most recent used list in the beginning for reuse
>>    */
>> -static void give_pages(struct virtnet_info *vi, struct page *page)
>> +static void give_pages(struct receive_queue *rq, struct page *page)
>>   {
>>   	struct page *end;
>>
>>   	/* Find end of list, sew whole thing into vi->pages. */
>>   	for (end = page; end->private; end = (struct page *)end->private);
>> -	end->private = (unsigned long)vi->pages;
>> -	vi->pages = page;
>> +	end->private = (unsigned long)rq->pages;
>> +	rq->pages = page;
>>   }
>>
>> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>>   {
>> -	struct page *p = vi->pages;
>> +	struct page *p = rq->pages;
>>
>>   	if (p) {
>> -		vi->pages = (struct page *)p->private;
>> +		rq->pages = (struct page *)p->private;
>>   		/* clear private here, it is used to chain pages */
>>   		p->private = 0;
>>   	} else
>> @@ -140,15 +185,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>>   	return p;
>>   }
>>
>> -static void skb_xmit_done(struct virtqueue *svq)
>> +static void skb_xmit_done(struct virtqueue *vq)
>>   {
>> -	struct virtnet_info *vi = svq->vdev->priv;
>> +	struct virtnet_info *vi = vq->vdev->priv;
>>
>>   	/* Suppress further interrupts. */
>> -	virtqueue_disable_cb(svq);
>> +	virtqueue_disable_cb(vq);
>>
>>   	/* We were probably waiting for more output buffers. */
>> -	netif_wake_queue(vi->dev);
>> +	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
>>   }
>>
>>   static void set_skb_frag(struct sk_buff *skb, struct page *page,
>> @@ -167,9 +212,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
>>   }
>>
>>   /* Called from bottom half context */
>> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
>>   				   struct page *page, unsigned int len)
>>   {
>> +	struct virtnet_info *vi = rq->vi;
>>   	struct sk_buff *skb;
>>   	struct skb_vnet_hdr *hdr;
>>   	unsigned int copy, hdr_len, offset;
>> @@ -225,12 +271,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>   	}
>>
>>   	if (page)
>> -		give_pages(vi, page);
>> +		give_pages(rq, page);
>>
>>   	return skb;
>>   }
>>
>> -static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
>>   {
>>   	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>>   	struct page *page;
>> @@ -244,7 +290,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>>   			skb->dev->stats.rx_length_errors++;
>>   			return -EINVAL;
>>   		}
>> -		page = virtqueue_get_buf(vi->rvq,&len);
>> +		page = virtqueue_get_buf(rq->vq,&len);
>>   		if (!page) {
>>   			pr_debug("%s: rx error: %d buffers missing\n",
>>   				 skb->dev->name, hdr->mhdr.num_buffers);
>> @@ -257,13 +303,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>>
>>   		set_skb_frag(skb, page, 0,&len);
>>
>> -		--vi->num;
>> +		--rq->num;
>>   	}
>>   	return 0;
>>   }
>>
>> -static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>> +static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>>   {
>> +	struct net_device *dev = rq->vi->dev;
>>   	struct virtnet_info *vi = netdev_priv(dev);
>>   	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>   	struct sk_buff *skb;
>> @@ -274,7 +321,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>>   		pr_debug("%s: short packet %i\n", dev->name, len);
>>   		dev->stats.rx_length_errors++;
>>   		if (vi->mergeable_rx_bufs || vi->big_packets)
>> -			give_pages(vi, buf);
>> +			give_pages(rq, buf);
>>   		else
>>   			dev_kfree_skb(buf);
>>   		return;
>> @@ -286,14 +333,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>>   		skb_trim(skb, len);
>>   	} else {
>>   		page = buf;
>> -		skb = page_to_skb(vi, page, len);
>> +		skb = page_to_skb(rq, page, len);
>>   		if (unlikely(!skb)) {
>>   			dev->stats.rx_dropped++;
>> -			give_pages(vi, page);
>> +			give_pages(rq, page);
>>   			return;
>>   		}
>>   		if (vi->mergeable_rx_bufs)
>> -			if (receive_mergeable(vi, skb)) {
>> +			if (receive_mergeable(rq, skb)) {
>>   				dev_kfree_skb(skb);
>>   				return;
>>   			}
>> @@ -363,90 +410,91 @@ frame_err:
>>   	dev_kfree_skb(skb);
>>   }
>>
>> -static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct sk_buff *skb;
>>   	struct skb_vnet_hdr *hdr;
>>   	int err;
>>
>> -	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
>> +	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
>>   	if (unlikely(!skb))
>>   		return -ENOMEM;
>>
>>   	skb_put(skb, MAX_PACKET_LEN);
>>
>>   	hdr = skb_vnet_hdr(skb);
>> -	sg_set_buf(vi->rx_sg,&hdr->hdr, sizeof hdr->hdr);
>> +	sg_set_buf(rq->sg,&hdr->hdr, sizeof hdr->hdr);
>> +
>> +	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
>>
>> -	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
>>
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
>>   	if (err<  0)
>>   		dev_kfree_skb(skb);
>>
>>   	return err;
>>   }
>>
>> -static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct page *first, *list = NULL;
>>   	char *p;
>>   	int i, err, offset;
>>
>> -	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
>> +	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
>>   	for (i = MAX_SKB_FRAGS + 1; i>  1; --i) {
>> -		first = get_a_page(vi, gfp);
>> +		first = get_a_page(rq, gfp);
>>   		if (!first) {
>>   			if (list)
>> -				give_pages(vi, list);
>> +				give_pages(rq, list);
>>   			return -ENOMEM;
>>   		}
>> -		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
>> +		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
>>
>>   		/* chain new page in list head to match sg */
>>   		first->private = (unsigned long)list;
>>   		list = first;
>>   	}
>>
>> -	first = get_a_page(vi, gfp);
>> +	first = get_a_page(rq, gfp);
>>   	if (!first) {
>> -		give_pages(vi, list);
>> +		give_pages(rq, list);
>>   		return -ENOMEM;
>>   	}
>>   	p = page_address(first);
>>
>> -	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
>> -	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
>> -	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
>> +	/* rq->sg[0], rq->sg[1] share the same page */
>> +	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
>> +	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
>>
>> -	/* vi->rx_sg[1] for data packet, from offset */
>> +	/* rq->sg[1] for data packet, from offset */
>>   	offset = sizeof(struct padded_vnet_hdr);
>> -	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
>> +	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
>>
>>   	/* chain first in list head */
>>   	first->private = (unsigned long)list;
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
>>   				first, gfp);
>>   	if (err<  0)
>> -		give_pages(vi, first);
>> +		give_pages(rq, first);
>>
>>   	return err;
>>   }
>>
>> -static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct page *page;
>>   	int err;
>>
>> -	page = get_a_page(vi, gfp);
>> +	page = get_a_page(rq, gfp);
>>   	if (!page)
>>   		return -ENOMEM;
>>
>> -	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
>> +	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
>>
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
>>   	if (err<  0)
>> -		give_pages(vi, page);
>> +		give_pages(rq, page);
>>
>>   	return err;
>>   }
>> @@ -458,97 +506,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>>    * before we're receiving packets, or from refill_work which is
>>    * careful to disable receiving (using napi_disable).
>>    */
>> -static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
>> +static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>>   {
>> +	struct virtnet_info *vi = rq->vi;
>>   	int err;
>>   	bool oom;
>>
>>   	do {
>>   		if (vi->mergeable_rx_bufs)
>> -			err = add_recvbuf_mergeable(vi, gfp);
>> +			err = add_recvbuf_mergeable(rq, gfp);
>>   		else if (vi->big_packets)
>> -			err = add_recvbuf_big(vi, gfp);
>> +			err = add_recvbuf_big(rq, gfp);
>>   		else
>> -			err = add_recvbuf_small(vi, gfp);
>> +			err = add_recvbuf_small(rq, gfp);
>>
>>   		oom = err == -ENOMEM;
>>   		if (err<  0)
>>   			break;
>> -		++vi->num;
>> +		++rq->num;
>>   	} while (err>  0);
>> -	if (unlikely(vi->num>  vi->max))
>> -		vi->max = vi->num;
>> -	virtqueue_kick(vi->rvq);
>> +	if (unlikely(rq->num>  rq->max))
>> +		rq->max = rq->num;
>> +	virtqueue_kick(rq->vq);
>>   	return !oom;
>>   }
>>
>> -static void skb_recv_done(struct virtqueue *rvq)
>> +static void skb_recv_done(struct virtqueue *vq)
>>   {
>> -	struct virtnet_info *vi = rvq->vdev->priv;
>> +	struct virtnet_info *vi = vq->vdev->priv;
>> +	struct napi_struct *napi =&vi->rq[rxq_get_qnum(vi, vq)]->napi;
>> +
>>   	/* Schedule NAPI, Suppress further interrupts if successful. */
>> -	if (napi_schedule_prep(&vi->napi)) {
>> -		virtqueue_disable_cb(rvq);
>> -		__napi_schedule(&vi->napi);
>> +	if (napi_schedule_prep(napi)) {
>> +		virtqueue_disable_cb(vq);
>> +		__napi_schedule(napi);
>>   	}
>>   }
>>
>> -static void virtnet_napi_enable(struct virtnet_info *vi)
>> +static void virtnet_napi_enable(struct receive_queue *rq)
>>   {
>> -	napi_enable(&vi->napi);
>> +	napi_enable(&rq->napi);
>>
>>   	/* If all buffers were filled by other side before we napi_enabled, we
>>   	 * won't get another interrupt, so process any outstanding packets
>>   	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
>>   	 * We synchronize against interrupts via NAPI_STATE_SCHED */
>> -	if (napi_schedule_prep(&vi->napi)) {
>> -		virtqueue_disable_cb(vi->rvq);
>> +	if (napi_schedule_prep(&rq->napi)) {
>> +		virtqueue_disable_cb(rq->vq);
>>   		local_bh_disable();
>> -		__napi_schedule(&vi->napi);
>> +		__napi_schedule(&rq->napi);
>>   		local_bh_enable();
>>   	}
>>   }
>>
>>   static void refill_work(struct work_struct *work)
>>   {
>> -	struct virtnet_info *vi;
>> +	struct napi_struct *napi;
>> +	struct receive_queue *rq;
>>   	bool still_empty;
>>
>> -	vi = container_of(work, struct virtnet_info, refill.work);
>> -	napi_disable(&vi->napi);
>> -	still_empty = !try_fill_recv(vi, GFP_KERNEL);
>> -	virtnet_napi_enable(vi);
>> +	rq = container_of(work, struct receive_queue, refill.work);
>> +	napi =&rq->napi;
>> +
>> +	napi_disable(napi);
>> +	still_empty = !try_fill_recv(rq, GFP_KERNEL);
>> +	virtnet_napi_enable(rq);
>>
>>   	/* In theory, this can happen: if we don't get any buffers in
>>   	 * we will *never* try to fill again. */
>>   	if (still_empty)
>> -		queue_delayed_work(system_nrt_wq,&vi->refill, HZ/2);
>> +		queue_delayed_work(system_nrt_wq,&rq->refill, HZ/2);
>>   }
>>
>>   static int virtnet_poll(struct napi_struct *napi, int budget)
>>   {
>> -	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
>> +	struct receive_queue *rq = container_of(napi, struct receive_queue,
>> +						napi);
>>   	void *buf;
>>   	unsigned int len, received = 0;
>>
>>   again:
>>   	while (received<  budget&&
>> -	       (buf = virtqueue_get_buf(vi->rvq,&len)) != NULL) {
>> -		receive_buf(vi->dev, buf, len);
>> -		--vi->num;
>> +	       (buf = virtqueue_get_buf(rq->vq,&len)) != NULL) {
>> +		receive_buf(rq, buf, len);
>> +		--rq->num;
>>   		received++;
>>   	}
>>
>> -	if (vi->num<  vi->max / 2) {
>> -		if (!try_fill_recv(vi, GFP_ATOMIC))
>> -			queue_delayed_work(system_nrt_wq,&vi->refill, 0);
>> +	if (rq->num<  rq->max / 2) {
>> +		if (!try_fill_recv(rq, GFP_ATOMIC))
>> +			queue_delayed_work(system_nrt_wq,&rq->refill, 0);
>>   	}
>>
>>   	/* Out of packets? */
>>   	if (received<  budget) {
>>   		napi_complete(napi);
>> -		if (unlikely(!virtqueue_enable_cb(vi->rvq))&&
>> +		if (unlikely(!virtqueue_enable_cb(rq->vq))&&
>>   		napi_schedule_prep(napi)) {
>> -			virtqueue_disable_cb(vi->rvq);
>> +			virtqueue_disable_cb(rq->vq);
>>   			__napi_schedule(napi);
>>   			goto again;
>>   		}
>> @@ -557,13 +612,14 @@ again:
>>   	return received;
>>   }
>>
>> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
>> +				       struct virtqueue *vq)
>>   {
>>   	struct sk_buff *skb;
>>   	unsigned int len, tot_sgs = 0;
>>   	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>
>> -	while ((skb = virtqueue_get_buf(vi->svq,&len)) != NULL) {
>> +	while ((skb = virtqueue_get_buf(vq,&len)) != NULL) {
>>   		pr_debug("Sent skb %p\n", skb);
>>
>>   		u64_stats_update_begin(&stats->tx_syncp);
>> @@ -577,7 +633,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>>   	return tot_sgs;
>>   }
>>
>> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
>> +		    struct virtqueue *vq, struct scatterlist *sg)
>>   {
>>   	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>>   	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
>> @@ -615,44 +672,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>>
>>   	/* Encode metadata header at front. */
>>   	if (vi->mergeable_rx_bufs)
>> -		sg_set_buf(vi->tx_sg,&hdr->mhdr, sizeof hdr->mhdr);
>> +		sg_set_buf(sg,&hdr->mhdr, sizeof hdr->mhdr);
>>   	else
>> -		sg_set_buf(vi->tx_sg,&hdr->hdr, sizeof hdr->hdr);
>> +		sg_set_buf(sg,&hdr->hdr, sizeof hdr->hdr);
>>
>> -	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
>> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
>> +	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
>> +	return virtqueue_add_buf(vq, sg, hdr->num_sg,
>>   				 0, skb, GFP_ATOMIC);
>>   }
>>
>>   static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int qnum = skb_get_queue_mapping(skb);
>> +	struct virtqueue *vq = vi->sq[qnum]->vq;
>>   	int capacity;
>>
>>   	/* Free up any pending old buffers before queueing new ones. */
>> -	free_old_xmit_skbs(vi);
>> +	free_old_xmit_skbs(vi, vq);
>>
>>   	/* Try to transmit */
>> -	capacity = xmit_skb(vi, skb);
>> +	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
>>
>>   	/* This can happen with OOM and indirect buffers. */
>>   	if (unlikely(capacity<  0)) {
>>   		if (likely(capacity == -ENOMEM)) {
>>   			if (net_ratelimit())
>>   				dev_warn(&dev->dev,
>> -					 "TX queue failure: out of memory\n");
>> +					"TXQ (%d) failure: out of memory\n",
>> +					qnum);
>>   		} else {
>>   			dev->stats.tx_fifo_errors++;
>>   			if (net_ratelimit())
>>   				dev_warn(&dev->dev,
>> -					 "Unexpected TX queue failure: %d\n",
>> -					 capacity);
>> +					"Unexpected TXQ (%d) failure: %d\n",
>> +					qnum, capacity);
>>   		}
>>   		dev->stats.tx_dropped++;
>>   		kfree_skb(skb);
>>   		return NETDEV_TX_OK;
>>   	}
>> -	virtqueue_kick(vi->svq);
>> +	virtqueue_kick(vq);
>>
>>   	/* Don't wait up for transmitted skbs to be freed. */
>>   	skb_orphan(skb);
>> @@ -661,13 +721,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   	/* Apparently nice girls don't return TX_BUSY; stop the queue
>>   	 * before it gets out of hand.  Naturally, this wastes entries. */
>>   	if (capacity<  2+MAX_SKB_FRAGS) {
>> -		netif_stop_queue(dev);
>> -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
>> +		netif_stop_subqueue(dev, qnum);
>> +		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
>>   			/* More just got used, free them then recheck. */
>> -			capacity += free_old_xmit_skbs(vi);
>> +			capacity += free_old_xmit_skbs(vi, vq);
>>   			if (capacity>= 2+MAX_SKB_FRAGS) {
>> -				netif_start_queue(dev);
>> -				virtqueue_disable_cb(vi->svq);
>> +				netif_start_subqueue(dev, qnum);
>> +				virtqueue_disable_cb(vq);
>>   			}
>>   		}
>>   	}
>> @@ -700,7 +760,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>>   	unsigned int start;
>>
>>   	for_each_possible_cpu(cpu) {
>> -		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
>> +		struct virtnet_stats __percpu *stats
>> +			= per_cpu_ptr(vi->stats, cpu);
>>   		u64 tpackets, tbytes, rpackets, rbytes;
>>
>>   		do {
>> @@ -734,20 +795,26 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>>   static void virtnet_netpoll(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>> -	napi_schedule(&vi->napi);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++)
>> +		napi_schedule(&vi->rq[i]->napi);
>>   }
>>   #endif
>>
>>   static int virtnet_open(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>> -	/* Make sure we have some buffers: if oom use wq. */
>> -	if (!try_fill_recv(vi, GFP_KERNEL))
>> -		queue_delayed_work(system_nrt_wq,&vi->refill, 0);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		/* Make sure we have some buffers: if oom use wq. */
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>> +		virtnet_napi_enable(vi->rq[i]);
>> +	}
>>
>> -	virtnet_napi_enable(vi);
>>   	return 0;
>>   }
>>
>> @@ -809,10 +876,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>>   static int virtnet_close(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>>   	/* Make sure refill_work doesn't re-enable napi! */
>> -	cancel_delayed_work_sync(&vi->refill);
>> -	napi_disable(&vi->napi);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>> +		napi_disable(&vi->rq[i]->napi);
>> +	}
>>
>>   	return 0;
>>   }
>> @@ -924,11 +994,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>>
>> -	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
>> -	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
>> +	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
>> +	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
>>   	ring->rx_pending = ring->rx_max_pending;
>>   	ring->tx_pending = ring->tx_max_pending;
>> -
>>   }
>>
>>
>> @@ -961,6 +1030,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>>   	return 0;
>>   }
>>
>> +/* To avoid contending a lock hold by a vcpu who would exit to host, select the
>> + * txq based on the processor id.
>> + */
>> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
>> +{
>> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>> +		  smp_processor_id();
>> +
>> +	while (unlikely(txq>= dev->real_num_tx_queues))
>> +		txq -= dev->real_num_tx_queues;
>> +	return txq;
>> +}
>> +
>>   static const struct net_device_ops virtnet_netdev = {
>>   	.ndo_open            = virtnet_open,
>>   	.ndo_stop   	     = virtnet_close,
>> @@ -972,6 +1054,7 @@ static const struct net_device_ops virtnet_netdev = {
>>   	.ndo_get_stats64     = virtnet_stats,
>>   	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
>>   	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
>> +	.ndo_select_queue     = virtnet_select_queue,
>>   #ifdef CONFIG_NET_POLL_CONTROLLER
>>   	.ndo_poll_controller = virtnet_netpoll,
>>   #endif
>> @@ -1007,10 +1090,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
>>
>>   	if (vi->status&  VIRTIO_NET_S_LINK_UP) {
>>   		netif_carrier_on(vi->dev);
>> -		netif_wake_queue(vi->dev);
>> +		netif_tx_wake_all_queues(vi->dev);
>>   	} else {
>>   		netif_carrier_off(vi->dev);
>> -		netif_stop_queue(vi->dev);
>> +		netif_tx_stop_all_queues(vi->dev);
>>   	}
>>   done:
>>   	mutex_unlock(&vi->config_lock);
>> @@ -1023,41 +1106,217 @@ static void virtnet_config_changed(struct virtio_device *vdev)
>>   	queue_work(system_nrt_wq,&vi->config_work);
>>   }
>>
>> -static int init_vqs(struct virtnet_info *vi)
>> +static void free_receive_bufs(struct virtnet_info *vi)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		while (vi->rq[i]->pages)
>> +			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
>> +	}
>> +}
>> +
>> +/* Free memory allocated for send and receive queues */
>> +static void virtnet_free_queues(struct virtnet_info *vi)
>>   {
>> -	struct virtqueue *vqs[3];
>> -	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
>> -	const char *names[] = { "input", "output", "control" };
>> -	int nvqs, err;
>> +	int i;
>>
>> -	/* We expect two virtqueues, receive then send,
>> -	 * and optionally control. */
>> -	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		kfree(vi->rq[i]);
>> +		vi->rq[i] = NULL;
>> +		kfree(vi->sq[i]);
>> +		vi->sq[i] = NULL;
>> +	}
>> +}
>>
>> -	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
>> -	if (err)
>> -		return err;
>> +static void free_unused_bufs(struct virtnet_info *vi)
>> +{
>> +	void *buf;
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		struct virtqueue *vq = vi->sq[i]->vq;
>> +
>> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
>> +			dev_kfree_skb(buf);
>> +	}
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		struct virtqueue *vq = vi->rq[i]->vq;
>> +
>> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
>> +			if (vi->mergeable_rx_bufs || vi->big_packets)
>> +				give_pages(vi->rq[i], buf);
>> +			else
>> +				dev_kfree_skb(buf);
>> +			--vi->rq[i]->num;
>> +		}
>> +		BUG_ON(vi->rq[i]->num != 0);
>> +	}
>> +}
>> +
>> +static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
>> +{
>> +	int i;
>> +
>> +	if (vi->num_queue_pairs == 1)
>> +		return;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		int cpu = set ? i : -1;
>> +		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
>> +		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
>> +	}
>> +	return;
>> +}
>> +
>> +static void virtnet_del_vqs(struct virtnet_info *vi)
>> +{
>> +	struct virtio_device *vdev = vi->vdev;
>> +
>> +	virtnet_set_affinity(vi, false);
>> +
>> +	vdev->config->del_vqs(vdev);
>> +
>> +	virtnet_free_queues(vi);
>> +}
>> +
>> +static int virtnet_find_vqs(struct virtnet_info *vi)
>> +{
>> +	vq_callback_t **callbacks;
>> +	struct virtqueue **vqs;
>> +	int ret = -ENOMEM;
>> +	int i, total_vqs;
>> +	char **names;
>>
>> -	vi->rvq = vqs[0];
>> -	vi->svq = vqs[1];
>> +	/*
>> +	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followd by
>> +	 * possible control virtqueue and followed by the same
>> +	 * 'vi->num_queue_pairs-1' more times
>> +	 */
>> +	total_vqs = vi->num_queue_pairs * 2 +
>> +		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
>> +
>> +	/* Allocate space for find_vqs parameters */
>> +	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
>> +	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
>> +	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
> so this needs to be kzalloc otherwise on an error cleanup will
> get uninitialized data and crash?

Yes, will change it to use kzalloc.
>
>> +	if (!vqs || !callbacks || !names)
>> +		goto err;
>> +
>> +	/* Parameters for control virtqueue, if any */
>> +	if (vi->has_cvq) {
>> +		callbacks[2] = NULL;
>> +		names[2] = "control";
>> +	}
>> +
>> +	/* Allocate/initialize parameters for send/receive virtqueues */
>> +	for (i = 0; i<  vi->num_queue_pairs * 2; i += 2) {
>> +		int j = (i == 0 ? i : i + vi->has_cvq);
>> +		callbacks[j] = skb_recv_done;
>> +		callbacks[j + 1] = skb_xmit_done;
>> +		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
>> +		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
> This needs wrappers. E.g. virtnet_rx_vq(int queue_pair), virtnet_tx_vq(int queue_pair);
> Then you would just scan 0 to num_queue_pairs, and i is queue pair
> number.

Ok.
>> +	}
>>
>> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
>> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
>> +					 (const char **)names);
>> +	if (ret)
>> +		goto err;
>> +
>> +	if (vi->has_cvq)
>>   		vi->cvq = vqs[2];
>>
>> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
>> +	for (i = 0; i<  vi->num_queue_pairs * 2; i += 2) {
>> +		int j = i == 0 ? i : i + vi->has_cvq;
>> +		vi->rq[i / 2]->vq = vqs[j];
>> +		vi->sq[i / 2]->vq = vqs[j + 1];
> Same here.

Consider the code is really simple, seem no need to use helpers.
>
>>   	}
>> -	return 0;
>> +
>> +err:
>> +	if (ret&&  names)
> If we are here ret != 0. For names, just add another label, don't
> complicate cleanup.

Ok.
>> +		for (i = 0; i<  vi->num_queue_pairs * 2; i++)
>> +			kfree(names[i]);
>> +
>> +	kfree(names);
>> +	kfree(callbacks);
>> +	kfree(vqs);
>> +
>> +	return ret;
>> +}
>> +
>> +static int virtnet_alloc_queues(struct virtnet_info *vi)
>> +{
>> +	int ret = -ENOMEM;
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
>> +		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
>> +		if (!vi->rq[i] || !vi->sq[i])
>> +			goto err;
>> +	}
>> +
>> +	ret = 0;
>> +
>> +	/* setup initial receive and send queue parameters */
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		vi->rq[i]->vi = vi;
>> +		vi->rq[i]->pages = NULL;
>> +		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
>> +		netif_napi_add(vi->dev,&vi->rq[i]->napi, virtnet_poll,
>> +			       napi_weight);
>> +
>> +		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
>> +		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
>> +	}
>> +
> Add return 0 here, then ret = 0 will not be needed
> above and if (ret) below.
>

Ok.
>> +err:
>> +	if (ret)
>> +		virtnet_free_queues(vi);
>> +
>> +	return ret;
>> +}
>> +
>> +static int virtnet_setup_vqs(struct virtnet_info *vi)
>> +{
>> +	int ret;
>> +
>> +	/* Allocate send&  receive queues */
>> +	ret = virtnet_alloc_queues(vi);
>> +	if (!ret) {
>> +		ret = virtnet_find_vqs(vi);
>> +		if (ret)
>> +			virtnet_free_queues(vi);
>> +		else
>> +			virtnet_set_affinity(vi, true);
>> +	}
>> +
>> +	return ret;
> Add some labels for error handling, this if nesting is messy.

Ok.
>>   }
>>
>>   static int virtnet_probe(struct virtio_device *vdev)
>>   {
>> -	int err;
>> +	int i, err;
>>   	struct net_device *dev;
>>   	struct virtnet_info *vi;
>> +	u16 num_queues, num_queue_pairs;
>> +
>> +	/* Find if host supports multiqueue virtio_net device */
>> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
>> +				offsetof(struct virtio_net_config,
>> +				num_queues),&num_queues);
>> +
>> +	/* We need atleast 2 queue's */
> typo

Will fix this, thanks.
>> +	if (err || num_queues<  2)
>> +		num_queues = 2;
>> +	if (num_queues>  MAX_QUEUES * 2)
>> +		num_queues = MAX_QUEUES;
>> +
>> +	num_queue_pairs = num_queues / 2;
>>
>>   	/* Allocate ourselves a network device with room for our info */
>> -	dev = alloc_etherdev(sizeof(struct virtnet_info));
>> +	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
>>   	if (!dev)
>>   		return -ENOMEM;
>>
>> @@ -1103,22 +1362,18 @@ static int virtnet_probe(struct virtio_device *vdev)
>>
>>   	/* Set up our device-specific information */
>>   	vi = netdev_priv(dev);
>> -	netif_napi_add(dev,&vi->napi, virtnet_poll, napi_weight);
>>   	vi->dev = dev;
>>   	vi->vdev = vdev;
>>   	vdev->priv = vi;
>> -	vi->pages = NULL;
>>   	vi->stats = alloc_percpu(struct virtnet_stats);
>>   	err = -ENOMEM;
>>   	if (vi->stats == NULL)
>> -		goto free;
>> +		goto free_netdev;
>>
>> -	INIT_DELAYED_WORK(&vi->refill, refill_work);
>>   	mutex_init(&vi->config_lock);
>>   	vi->config_enable = true;
>>   	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
>> -	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
>> -	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
>> +	vi->num_queue_pairs = num_queue_pairs;
>>
>>   	/* If we can receive ANY GSO packets, we must allocate large ones. */
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> @@ -1129,9 +1384,17 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>>   		vi->mergeable_rx_bufs = true;
>>
>> -	err = init_vqs(vi);
>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>> +		vi->has_cvq = true;
>> +
> How about we disable multiqueue if there's no cvq?
> Will make logic a bit simpler, won't it?

We can, but as you said, just let the logic simpler a bit.
>> +	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>> +	err = virtnet_setup_vqs(vi);
>>   	if (err)
>> -		goto free_stats;
>> +		goto free_netdev;
>> +
>> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)&&
>> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>> +		dev->features |= NETIF_F_HW_VLAN_FILTER;
>>
>>   	err = register_netdev(dev);
>>   	if (err) {
>> @@ -1140,12 +1403,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	}
>>
>>   	/* Last of all, set up some receive buffers. */
>> -	try_fill_recv(vi, GFP_KERNEL);
>> -
>> -	/* If we didn't even get one input buffer, we're useless. */
>> -	if (vi->num == 0) {
>> -		err = -ENOMEM;
>> -		goto unregister;
>> +	for (i = 0; i<  num_queue_pairs; i++) {
>> +		try_fill_recv(vi->rq[i], GFP_KERNEL);
>> +
>> +		/* If we didn't even get one input buffer, we're useless. */
>> +		if (vi->rq[i]->num == 0) {
>> +			free_unused_bufs(vi);
>> +			err = -ENOMEM;
>> +			goto free_recv_bufs;
>> +		}
>>   	}
>>
>>   	/* Assume link up if device can't report link status,
>> @@ -1158,42 +1424,25 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   		netif_carrier_on(dev);
>>   	}
>>
>> -	pr_debug("virtnet: registered device %s\n", dev->name);
>> +	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
>> +		 dev->name, num_queue_pairs);
>> +
>>   	return 0;
>>
>> -unregister:
>> +free_recv_bufs:
>> +	free_receive_bufs(vi);
>>   	unregister_netdev(dev);
>> +
>>   free_vqs:
>> -	vdev->config->del_vqs(vdev);
>> -free_stats:
>> -	free_percpu(vi->stats);
>> -free:
>> +	for (i = 0; i<  num_queue_pairs; i++)
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>> +	virtnet_del_vqs(vi);
>> +
>> +free_netdev:
>>   	free_netdev(dev);
>>   	return err;
>>   }
>>
>> -static void free_unused_bufs(struct virtnet_info *vi)
>> -{
>> -	void *buf;
>> -	while (1) {
>> -		buf = virtqueue_detach_unused_buf(vi->svq);
>> -		if (!buf)
>> -			break;
>> -		dev_kfree_skb(buf);
>> -	}
>> -	while (1) {
>> -		buf = virtqueue_detach_unused_buf(vi->rvq);
>> -		if (!buf)
>> -			break;
>> -		if (vi->mergeable_rx_bufs || vi->big_packets)
>> -			give_pages(vi, buf);
>> -		else
>> -			dev_kfree_skb(buf);
>> -		--vi->num;
>> -	}
>> -	BUG_ON(vi->num != 0);
>> -}
>> -
>>   static void remove_vq_common(struct virtnet_info *vi)
>>   {
>>   	vi->vdev->config->reset(vi->vdev);
>> @@ -1201,10 +1450,9 @@ static void remove_vq_common(struct virtnet_info *vi)
>>   	/* Free unused buffers in both send and recv, if any. */
>>   	free_unused_bufs(vi);
>>
>> -	vi->vdev->config->del_vqs(vi->vdev);
>> +	free_receive_bufs(vi);
>>
>> -	while (vi->pages)
>> -		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
>> +	virtnet_del_vqs(vi);
>>   }
>>
>>   static void __devexit virtnet_remove(struct virtio_device *vdev)
>> @@ -1230,6 +1478,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>>   static int virtnet_freeze(struct virtio_device *vdev)
>>   {
>>   	struct virtnet_info *vi = vdev->priv;
>> +	int i;
>>
>>   	/* Prevent config work handler from accessing the device */
>>   	mutex_lock(&vi->config_lock);
>> @@ -1237,10 +1486,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   	mutex_unlock(&vi->config_lock);
>>
>>   	netif_device_detach(vi->dev);
>> -	cancel_delayed_work_sync(&vi->refill);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++)
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>>
>>   	if (netif_running(vi->dev))
>> -		napi_disable(&vi->napi);
>> +		for (i = 0; i<  vi->num_queue_pairs; i++)
>> +			napi_disable(&vi->rq[i]->napi);
>> +
>>
>>   	remove_vq_common(vi);
>>
>> @@ -1252,19 +1504,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   static int virtnet_restore(struct virtio_device *vdev)
>>   {
>>   	struct virtnet_info *vi = vdev->priv;
>> -	int err;
>> +	int err, i;
>>
>> -	err = init_vqs(vi);
>> +	err = virtnet_setup_vqs(vi);
>>   	if (err)
>>   		return err;
>>
>>   	if (netif_running(vi->dev))
>> -		virtnet_napi_enable(vi);
>> +		for (i = 0; i<  vi->num_queue_pairs; i++)
>> +			virtnet_napi_enable(vi->rq[i]);
>>
>>   	netif_device_attach(vi->dev);
>>
>> -	if (!try_fill_recv(vi, GFP_KERNEL))
>> -		queue_delayed_work(system_nrt_wq,&vi->refill, 0);
>> +	for (i = 0; i<  vi->num_queue_pairs; i++)
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>>
>>   	mutex_lock(&vi->config_lock);
>>   	vi->config_enable = true;
>> @@ -1287,7 +1542,7 @@ static unsigned int features[] = {
>>   	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
>>   	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
>>   	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
>> -	VIRTIO_NET_F_GUEST_ANNOUNCE,
>> +	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
>>   };
>>
>>   static struct virtio_driver virtio_net_driver = {
>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>> index 1bc7e30..60f09ff 100644
>> --- a/include/linux/virtio_net.h
>> +++ b/include/linux/virtio_net.h
>> @@ -61,6 +61,8 @@ struct virtio_net_config {
>>   	__u8 mac[6];
>>   	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
>>   	__u16 status;
>> +	/* Total number of RX/TX queues */
>> +	__u16 num_queues;
>>   } __attribute__((packed));
>>
>>   /* This is the first element of the scatter-gather list.  If you don't
>> -- 
>> 1.7.1

^ permalink raw reply

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
From: Jason Wang @ 2012-07-23  5:54 UTC (permalink / raw)
  To: Sasha Levin
  Cc: krkumar2, habanero, mashirle, kvm, Michael S. Tsirkin, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <500A9A72.20507@gmail.com>

On 07/21/2012 08:02 PM, Sasha Levin wrote:
> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>> -	err = init_vqs(vi);
>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>> +		vi->has_cvq = true;
>>>> +
>> How about we disable multiqueue if there's no cvq?
>> Will make logic a bit simpler, won't it?
> multiqueues don't really depend on cvq. Does this added complexity really justifies adding an artificial limit?
>

Yes, it does not depends on cvq. Cvq were just used to negotiate the 
number of queues a guest wishes to use which is really useful (at least 
for now). Since multiqueue can not out-perform for single queue in every 
kinds of workloads or benchmark, so we want to let guest driver use 
single queue by default even when multiqueue were enabled by management 
software and let use to enalbe it through ethtool. So user could not 
feel regression when it switch to use a multiqueue capable driver and 
backend.

So the only difference is the user experiences.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH RFT RESEND] net: Fix Neptune ethernet driver to check dma mapping error
From: David Miller @ 2012-07-23  6:34 UTC (permalink / raw)
  To: shuah.khan
  Cc: mcarlson, bhutchings, eric.dumazet, mchan, netdev, linux-kernel,
	shuahkhan, stable
In-Reply-To: <1342821035.5434.60.camel@lorien2>

From: Shuah Khan <shuah.khan@hp.com>
Date: Fri, 20 Jul 2012 15:50:35 -0600

> Fix Neptune ethernet driver to check dma mapping error after map_page()
> interface returns.
> 
> Signed-off-by: Shuah Khan <shuah.khan@hp.com>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox