Netdev List
* RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers
From: David Laight @ 2012-12-05 15:22 UTC (permalink / raw)
  To: Nicolas Ferre
  Cc: David S. Miller, netdev, linux-arm-kernel, linux-kernel,
	Joachim Eastwood, Jean-Christophe PLAGNIOL-VILLARD,
	Havard Skinnemoen
In-Reply-To: <50BF6467.5060701@atmel.com>

> Well, for the 10/100 MACB interface, I am stuck with 128 Bytes buffers!
> So this use of pages seems sensible.

If you have dma coherent memory you can make the rx buffer space
be an array of short buffers referenced by adjacent ring entries
(possibly with the last one slightly short to allow for the
2 byte offset).
Then, when a big frame ends up in multiple buffers, you need
a maximum of two copies to extract the data.
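
A minimal sketch of that layout, assuming the rx buffers are carved from one contiguous area so that ring entry i points at offset i * RX_BUF_SIZE. The ring length and the names rx_area and rx_copy_frame() are made up for illustration, and the 2 byte offset is ignored:

```c
#include <stddef.h>
#include <string.h>

#define RX_BUF_SIZE 128	/* MACB rx buffer size quoted in this thread */
#define RX_RING_LEN 8	/* hypothetical ring length for this sketch */

/* One contiguous area; ring entry i refers to rx_area + i * RX_BUF_SIZE. */
static unsigned char rx_area[RX_RING_LEN * RX_BUF_SIZE];

/*
 * Copy a frame of 'len' bytes that starts in ring entry 'first' into 'dst'.
 * Returns the number of memcpy() calls used: 1 normally, 2 when the frame
 * wraps past the end of the ring, 0 if the arguments are out of range.
 */
static int rx_copy_frame(unsigned char *dst, unsigned int first, size_t len)
{
	size_t start, tail;

	if (first >= RX_RING_LEN || len > sizeof(rx_area))
		return 0;

	start = (size_t)first * RX_BUF_SIZE;
	tail = sizeof(rx_area) - start;	/* bytes left before the wrap */

	if (len <= tail) {
		memcpy(dst, rx_area + start, len);
		return 1;
	}
	memcpy(dst, rx_area + start, tail);
	memcpy(dst + tail, rx_area, len - tail);
	return 2;
}
```

Because adjacent ring entries are adjacent in memory, a frame spread over several 128 byte buffers is still one contiguous run, except where it crosses the end of the ring.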

	David


* iputils-s20121205
From: YOSHIFUJI Hideaki @ 2012-12-05 15:13 UTC (permalink / raw)
  To: netdev; +Cc: yoshfuji

Hello,

iputils-s20121205 has been released.  Diffstat and changelog below.

Files:
        https://sourceforge.net/projects/iputils/files/
        http://www.skbuff.net/iputils/
Tree:
        http://www.linux-ipv6.org/gitweb/gitweb.cgi?p=gitroot/iputils.git
        https://sourceforge.net/p/iputils/code/ci/HEAD/tree/

Regards,

--yoshfuji

[Diffstat]

 Makefile                |   82 +++++++++++++++++++---------
 RELNOTES                |   43 +++++++++++++++
 SNAPSHOT.h              |    2 -
 arping.c                |  138 +++++++++++++++++++++++++++++------------------
 doc/docbook2man-spec.pl |    4 +
 doc/ping.sgml           |   14 +++--
 doc/snapshot.db         |    2 -
 doc/tracepath.sgml      |    3 +
 iputils.spec            |   97 +++++++++++++++++++--------------
 ping.c                  |   42 ++++++++++++++
 ping6.c                 |   31 ++++++++++-
 ping_common.c           |   74 ++++++++++++++-----------
 rdisc.c                 |   10 +++
 tracepath.c             |   20 ++++---
 tracepath6.c            |   46 +++++++++++-----
 15 files changed, 421 insertions(+), 187 deletions(-)

[Changelog]
Jan Synacek (1):
      ping,tracepath doc: Fix missing end tags.

YOSHIFUJI Hideaki (36):
      tracepath6: packet length option (-l) did not have any effect.
      tracepath,tracepath6: Fix pktlen message.
      tracepath,tracepath6: Use calloc(3) instead of using stack.
      tracepath6: Ignore families other than IPv4 and IPv6.
      ping6: Improve randomness of NI Nonce.
      tracepath,tracepath6 doc: Fix default pktlen.
      ping,rdisc: Optimize checksumming.
      makefile: Static link support for crypto, resolv, cap and sysfs.
      doc: Ajdust spaces around sqare brackets.
      ping,rdisc: Use macro to get odd byte when checksumming.
      ping6: Do not try to free memory pointed by uninitialized variable on error path.
      arping: Allow building without default interface.
      arping: No default interface by default.
      arping: Allow printing usage without permission errors.
      ping,ping6: Allow printing usage without permission errors.
      ping,ping6: Fix cap_t leakage.
      arping,ping,ping6: Do not ideologically check return value from cap_free,cap_{set,get}_flag().
      arping: Fix sysfs_class leakage on error path.
      arping: Some comments for new functions for finding devices support.
      arping: Typo in type declaration.
      makefile: Use call function for external libraries.
      makefile: Add more comments.
      arping: Ensure to fail if no appropriate device found with sysfs.
      arping: Enforce user to specify device (-I) if multiple devices found.
      Makefile: parameterize options for linking libraries.
      Makefile: Use shell function instead if backquotes.
      Makefile: Ensure to have same date when making snapshot.
      spec: Maintainer does not use ipsec.spec.
      spec: partially sync with fedora.
      Makefile: Bump date in iputils.spec as well.
      spec: Add exmple lines for suid-root installation
      spec: Sort changelog.
      ping: Exit on SO_BINDTODEVICE failure.
      ping: Warn if kernel has selected source address from other interface.
      ping: Clarify difference between -I device and -I addr.
      iputils-s20121205


* Re: [PATCH v2] net/macb: Use non-coherent memory for rx buffers
From: Nicolas Ferre @ 2012-12-05 15:12 UTC (permalink / raw)
  To: David Laight
  Cc: David S. Miller, netdev, linux-arm-kernel, linux-kernel,
	Joachim Eastwood, Jean-Christophe PLAGNIOL-VILLARD,
	Havard Skinnemoen
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B70D9@saturn3.aculab.com>

On 12/05/2012 10:35 AM, David Laight :
>> If I understand well, you mean that the call to:
>>
>> 		dma_sync_single_range_for_device(&bp->pdev->dev, phys,
>> 				pg_offset, frag_len, DMA_FROM_DEVICE);
>>
>> in the rx path after having copied the data to skb is not needed?
>> That is also the conclusion that I reached after having thought about
>> this again... I will check this.
> 
> You need to make sure that the memory isn't in the data cache
> when you give the rx buffer back to the MAC.
> (and ensure the cpu doesn't read it until the rx is complete.)
> I've NFI what that dma_sync call does - you need to invalidate
> the cache lines.

The cache lines are invalidated by
dma_sync_single_range_for_device(..., DMA_FROM_DEVICE), so I need to keep it.

>> For the CRC, my driver is not using the CRC offloading feature for the
>> moment. So no CRC is written by the device.
> 
> I was thinking it would matter if the MAC wrote the CRC into the
> buffer (even though it was excluded from the length).
> It doesn't - you only need to worry about data you've read.
> 
>>> I was wondering if the code needs to do per page allocations?
>>> Perhaps that is necessary to avoid needing a large block of
>>> contiguous physical memory (and virtual addresses)?
>>
>> The page management seems interesting for future management of RX
>> buffers as skb fragments: that will make it possible to avoid copying
>> received data.
> 
> Dunno - the complexities of such buffer loaning schemes often
> exceed the gain of avoiding the data copy.
> Using buffers allocated to the skb is a bit different - since
> you completely forget about the memory once you pass the skb
> upstream.
> 
> Some quick sums indicate you might want to allocate 8k memory
> blocks and split into 5 buffers.

Well, for the 10/100 MACB interface, I am stuck with 128 Bytes buffers!
So this use of pages seems sensible.
On the other hand, it is true that I may have to reconsider the GEM
memory management (this one is able to cover 128-byte to 10KB rx DMA buffers)...

Best regards,
-- 
Nicolas Ferre


* Re: [PATCH] net/macb: increase RX buffer size for GEM
From: Nicolas Ferre @ 2012-12-05 15:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-arm-kernel, linux-kernel, manabian, plagnioj
In-Reply-To: <20121204.132227.1430662061932892582.davem@davemloft.net>

On 12/04/2012 07:22 PM, David Miller :
> From: Nicolas Ferre <nicolas.ferre@atmel.com>
> Date: Mon, 3 Dec 2012 13:15:43 +0100
> 
>> Macb Ethernet controller requires a RX buffer of 128 bytes. It is
>> highly sub-optimal for Gigabit-capable GEM that is able to use
>> a bigger DMA buffer. Change this constant and associated macros
>> with data stored in the private structure.
>> I also kept the result of buffers per page calculation to lower the
>> impact of this move to a variable rx buffer size on rx hot path.
>> RX DMA buffer size has to be multiple of 64 bytes as indicated in
>> DMA Configuration Register specification.
>>
>> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
> 
> This looks like it will waste a couple hundred bytes for 1500 MTU
> frames, am I right?

Yep! But buffers get recycled, and with the current memory management by
pages, it seems that I have to rework some part of it to optimize this
memory usage (8KB memory blocks split into 5 buffers each as David said...).
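
A quick check of the "8KB block split into 5 buffers" sums, as a sketch. It assumes only the multiple-of-64 constraint on the rx DMA buffer size mentioned earlier in this thread; the function names are made up for illustration:

```c
#include <stddef.h>

#define RX_BUF_ALIGN 64	/* rx DMA buffer size must be a multiple of 64 bytes */

/*
 * Largest buffer size that is a multiple of RX_BUF_ALIGN such that
 * 'nbufs' buffers fit into a block of 'block' bytes.
 */
static size_t rx_buf_size_for(size_t block, size_t nbufs)
{
	size_t sz;

	if (nbufs == 0)
		return 0;
	sz = block / nbufs;
	return sz - (sz % RX_BUF_ALIGN);
}

/* Bytes of the block left unused once it is split into 'nbufs' buffers. */
static size_t rx_block_slack(size_t block, size_t nbufs)
{
	return block - nbufs * rx_buf_size_for(block, nbufs);
}
```

With these definitions, rx_buf_size_for(8192, 5) gives 1600 byte buffers and rx_block_slack(8192, 5) gives 192 unused bytes per 8KB block.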

Do you think it is worth digging in this direction, or should I rework the
rx buffer management for the GEM interface? If I implement a separate path
for the GEM interface, I will be able to tailor rx DMA buffers
from 1500 bytes up to 10KB jumbo frames...

Best regards,
-- 
Nicolas Ferre


* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: chas williams - CONTRACTOR @ 2012-12-05 14:55 UTC (permalink / raw)
  To: Chen Gang; +Cc: David Miller, netdev
In-Reply-To: <50BEE2BE.2030704@asianux.com>

On Wed, 05 Dec 2012 13:59:26 +0800
Chen Gang <gang.chen@asianux.com> wrote:

> On 2012-12-05 13:40, Chen Gang wrote:
> > On 2012-12-05 12:56, Chen Gang wrote:
> >>>>>>>> -		pos += sprintf(pos, "\n");
> >>>>>>>> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
> >>>> ..
> >>>>>>  need we judge whether count >= PAGE_SIZE ?
> >>>>
> >>>> count will eventually make PAGE_SIZE - count reach 0 at which point,
> >>>> scnprintf() won't be able to write into the buffer.
> >>   I also think so.
> >>
> >>   I think it may be better to break out of the loop once we already
> >> know that "count >= PAGE_SIZE" (it saves wasted iterations, although it
> >> seems unlikely to happen; for example, by using unlikely(...)).

it doesn't seem like optimizing for this corner case is a huge
concern.  the list cannot be infinitely long.

> >>
> >> By the way:
> >>   would it be better to always leave room for "\n" at the end?
> >>   (if count == PAGE_SIZE in a loop, we cannot put "\n" at the end).
> > 
> >    oh, sorry! count will never be >= PAGE_SIZE.
> > 
> >    I think we should use "PAGE_SIZE - 2" instead of "PAGE_SIZE" in the loop,
> > so that we leave room for the final "\n".
> > 
> > 
> > 
>    sorry, "PAGE_SIZE - 1" is enough, no need for "PAGE_SIZE - 2".

did you mean '\0' instead of '\n'?  scnprintf() considers the trailing
'\0' when formatting.
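
To illustrate the accounting, here is a userspace stand-in built on vsnprintf(). It is a sketch of the behaviour described above, not the kernel's implementation; sketch_scnprintf() and fill() are names invented for the example:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Userspace stand-in for scnprintf(): returns the number of characters
 * actually stored in buf, not counting the trailing '\0'.
 */
static int sketch_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;
	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);
	if (n < 0)
		return 0;
	/* On truncation only size - 1 characters plus '\0' were stored. */
	return (size_t)n < size ? n : (int)(size - 1);
}

/* Append "item<i>\n" for i in [0, items) the way the sysfs loop does. */
static size_t fill(char *buf, size_t size, int items)
{
	size_t count = 0;
	int i;

	for (i = 0; i < items; i++)
		count += sketch_scnprintf(buf + count, size - count,
					  "item%d\n", i);
	return count;
}
```

In this sketch count stops growing at size - 1, so the remaining space never drops below the one byte reserved for '\0', and further iterations write nothing.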


* Webmail Limit
From: CORREO @ 2012-12-05 12:34 UTC (permalink / raw)


Your webmail quota has exceeded the set quota / limit, which is 20GB. You are currently running at 23SE due to hidden files and folders in your mailbox. Please fill in the link below to confirm your mailbox and increase your quota.
Username:
Old password:
New password:


* [PATCH] net: fixup tx time stamping for uml vde driver.
From: Paul Chavent @ 2012-12-05 14:20 UTC (permalink / raw)
  To: jdike, richard, user-mode-linux-devel, netdev; +Cc: Paul Chavent

Call skb_tx_timestamp after write completion.

Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
---
 arch/um/drivers/vde_kern.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/um/drivers/vde_kern.c b/arch/um/drivers/vde_kern.c
index 6a365fa..38fea2f 100644
--- a/arch/um/drivers/vde_kern.c
+++ b/arch/um/drivers/vde_kern.c
@@ -52,9 +52,13 @@ static int vde_write(int fd, struct sk_buff *skb, struct uml_net_private *lp)
 {
 	struct vde_data *pri = (struct vde_data *) &lp->user;
 
-	if (pri->conn != NULL)
-		return vde_user_write((void *)pri->conn, skb->data,
-				      skb->len);
+	if (pri->conn != NULL) {
+		int count;
+		count = vde_user_write((void *)pri->conn, skb->data,
+				       skb->len);
+		skb_tx_timestamp(skb);
+		return count;
+	}
 
 	printk(KERN_ERR "vde_write - we have no VDECONN to write to");
 	return -EBADF;
-- 
1.7.12.1


* [PATCH] 3com: make 3c59x depend on HAS_IOPORT
From: Jan Glauber @ 2012-12-05 14:04 UTC (permalink / raw)
  To: netdev

From: Jan Glauber <jang@linux.vnet.ibm.com>

The 3com driver for 3c59x requires ioport_map. Since not all
architectures support IO port mapping, make 3c59x depend on HAS_IOPORT.

Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
---
 drivers/net/ethernet/3com/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/3com/Kconfig b/drivers/net/ethernet/3com/Kconfig
index bad4fa6..eb56174 100644
--- a/drivers/net/ethernet/3com/Kconfig
+++ b/drivers/net/ethernet/3com/Kconfig
@@ -80,7 +80,7 @@ config PCMCIA_3C589
 
 config VORTEX
 	tristate "3c590/3c900 series (592/595/597) \"Vortex/Boomerang\" support"
-	depends on (PCI || EISA)
+	depends on (PCI || EISA) && HAS_IOPORT
 	select NET_CORE
 	select MII
 	---help---


* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Jason Wang @ 2012-12-05 14:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paul Moore, netdev, linux-security-module, selinux
In-Reply-To: <20121205114455.GB26649@redhat.com>

On Wednesday, December 05, 2012 01:44:55 PM Michael S. Tsirkin wrote:
> On Wed, Dec 05, 2012 at 02:19:22PM +0800, Jason Wang wrote:
> > On 12/05/2012 02:17 AM, Paul Moore wrote:
> > > On Tuesday, December 04, 2012 07:36:26 PM Michael S. Tsirkin wrote:
> > >> On Tue, Dec 04, 2012 at 11:18:57AM -0500, Paul Moore wrote:
> > >>> Okay, based on your explanation of TUNSETQUEUE, the steps below are
> > >>> what I
> > >>> believe we need to do ... if you disagree speak up quickly please.
> > >>> 
> > >>> A. TUNSETIFF (new, non-persistent device)
> > >>> 
> > >>> [Allocate and initialize the tun_struct LSM state based on the calling
> > >>> process, use this state to label the TUN socket.]
> > >>> 
> > >>> 1. Call security_tun_dev_create() which authorizes the action.
> > >>> 2. Call security_tun_dev_alloc_security() which allocates the
> > >>> tun_struct
> > >>> LSM blob and SELinux sets some internal blob state to record the label
> > >>> of
> > >>> the calling process.
> > >>> 3. Call security_tun_dev_attach() which sets the label of the TUN
> > >>> socket
> > >>> to match the label stored in the tun_struct LSM blob during A2.  No
> > >>> authorization is done at this point since the socket is new/unlabeled.
> > >>> 
> > >>> B. TUNSETIFF (existing, persistent device)
> > >>> 
> > >>> [Relabel the existing tun_struct LSM state based on the calling
> > >>> process,
> > >>> use this state to label the TUN socket.]
> > >>> 
> > >>> 1. Attempt to relabel/reset the tun_struct LSM blob from the currently
> > >>> stored value, set during A2, to the label of the current calling
> > >>> process.
> > >>> *** THIS IS NOT CURRENTLY DONE IN THE RFC PATCH ***
> > >>> 2. Call security_tun_dev_attach() which sets the label of the TUN
> > >>> socket
> > >>> to match the label stored in the tun_struct LSM blob during B1. No
> > >>> authorization is done at this point since the socket is new/unlabeled.
> > >>> 
> > >>> C. TUNSETQUEUE
> > >>> 
> > >>> [Use the existing tun_struct LSM state to label the new TUN socket.]
> > >>> 
> > >>> 1. Call security_tun_dev_attach() which sets the label of the TUN
> > >>> socket
> > >>> to match the label stored in the tun_struct LSM blob set during either
> > >>> A2
> > >>> or B1. No authorization is done at this point since the socket is
> > >>> new/unlabeled.
> > >> 
> > >> Here's what bothers me. libvirt currently opens tun and passes
> > >> fd to qemu. What would prevent qemu from attaching fd using TUNSETQUEUE
> > >> to another device it does not own?
> > > 
> > > True, assuming all the above is correct and that I'm understanding it
> > > correctly (Jason?), we should probably add a new SELinux access control
> > > for
> > > TUNSETQUEUE.
> > 
> > Yes, we need make sure qemu can call TUNSETQUEUE for the device it does
> > not own.
> 
> Meaning can *not* call?

Sorry for not being clear. I mean qemu can call TUNSETQUEUE for a device it
owns, but cannot call it for a device it does not own.
> 
> > > The current DAC code exists in tun_not_capable().


* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Jason Wang @ 2012-12-05 13:45 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paul Moore, netdev, linux-security-module, selinux
In-Reply-To: <20121205114339.GA26649@redhat.com>

On Wednesday, December 05, 2012 01:43:39 PM Michael S. Tsirkin wrote:
> On Wed, Dec 05, 2012 at 02:17:30PM +0800, Jason Wang wrote:
> > On 12/04/2012 11:24 PM, Michael S. Tsirkin wrote:
> > > On Tue, Dec 04, 2012 at 09:24:43PM +0800, Jason Wang wrote:
> > >> On Monday, December 03, 2012 11:22:29 AM Paul Moore wrote:
> > >>> On Monday, December 03, 2012 06:15:42 PM Jason Wang wrote:
> > >>>> On 11/30/2012 06:06 AM, Paul Moore wrote:
> > >>>>> This patch corrects some problems with LSM/SELinux that were
> > >>>>> introduced
> > >>>>> with the multiqueue patchset.  The problem stems from the fact that
> > >>>>> the
> > >>>>> multiqueue work changed the relationship between the tun device and
> > >>>>> its
> > >>>>> associated socket; before the socket persisted for the life of the
> > >>>>> device, however after the multiqueue changes the socket only
> > >>>>> persisted
> > >>>>> for the life of the userspace connection (fd open).  For
> > >>>>> non-persistent
> > >>>>> devices this is not an issue, but for persistent devices this can
> > >>>>> cause
> > >>>>> the tun device to lose its SELinux label.
> > >>>>> 
> > >>>>> We correct this problem by adding an opaque LSM security blob to the
> > >>>>> tun device struct which allows us to have the LSM security state,
> > >>>>> e.g.
> > >>>>> SELinux labeling information, persist for the lifetime of the tun
> > >>>>> device.
> > >>> 
> > >>> ...
> > >>> 
> > >>>>> -static int selinux_tun_dev_attach(struct sock *sk)
> > >>>>> +static int selinux_tun_dev_attach(struct sock *sk, void *security)
> > >>>>> 
> > >>>>>  {
> > >>>>> 
> > >>>>> +	struct tun_security_struct *tunsec = security;
> > >>>>> 
> > >>>>>  	struct sk_security_struct *sksec = sk->sk_security;
> > >>>>>  	u32 sid = current_sid();
> > >>>>>  	int err;
> > >>>>> 
> > >>>>> +	/* we don't currently perform any NetLabel based labeling here ...
> > >>>>> 
> > >>>>>  	err = avc_has_perm(sid, sksec->sid, SECCLASS_TUN_SOCKET,
> > >>>>>  	
> > >>>>>  			   TUN_SOCKET__RELABELFROM, NULL);
> > >>>>>  	
> > >>>>>  	if (err)
> > >>>>>  	
> > >>>>>  		return err;
> > >>>>> 
> > >>>>> -	err = avc_has_perm(sid, sid, SECCLASS_TUN_SOCKET,
> > >>>>> +	err = avc_has_perm(sid, tunsec->sid, SECCLASS_TUN_SOCKET,
> > >>>>> 
> > >>>>>  			   TUN_SOCKET__RELABELTO, NULL);
> > >>>>>  	
> > >>>>>  	if (err)
> > >>>>>  	
> > >>>>>  		return err;
> > >>>>> 
> > >>>>> -	sksec->sid = sid;
> > >>>>> +	sksec->sid = tunsec->sid;
> > >>>>> +	sksec->sclass = SECCLASS_TUN_SOCKET;
> > >>>> 
> > >>>> I'm not sure whether this is correct, looks like we need to differ
> > >>>> between
> > >>>> TUNSETQUEUE and TUNSETIFF. When userspace call TUNSETIFF for
> > >>>> persistent
> > >>>> device, looks like we need change the sid of tunsec like in the past.
> > >>> 
> > >>> It may be that I'm misunderstanding TUNSETQUEUE and/or TUNSETIFF.  Can
> > >>> you
> > >>> elaborate as to why they should be different?
> > >> 
> > >> If I understand correctly, before multiqueue patchset, TUNSETIFF is
> > >> used to:
> > >> 
> > >> 1) Create the tun/tap network device
> > >> 2) For persistent device, re-attach the fd to the network device /
> > >> socket. In this case, we call selinux_tun_dev_attach() to relabel the
> > >> socket sid (in fact also the device's since the socket were persistent
> > >> also) to the sid of process that calls TUNSETIFF.
> > >> 
> > >> So, after the changes of multiqueue, we need try to preserve those
> > >> policy. The interesting part is the introducing of TUNSETQUEUE, it's
> > >> used to attach more file descriptors/sockets to a tun/tap device after
> > >> at least one file descriptor were attached to the tun/tap device
> > >> through TUNSETIFF. So I think maybe we need differ those two ioctls.
> > >> This patch looks fine for TUNSETQUEUE, but for TUNSETIFF, we need
> > >> relabel the tunsec to the process that calling TUNSETIFF for
> > >> persistent device?
> > > 
> > > Basically, it looks like currently once you get a tun fd,
> > > you can attach it to any device even if normally
> > > selinux would prevent you from accessing it.
> > 
> > Yes, some checking during TUNSETQUEUE is missing.
> > 
> > > If we reuse selinux_tun_dev_attach, we won't need to
> > > change selinux policy, with a new capability we will need to change it
> > > to allow libvirt to do TUNSETQUEUE.
> > 
> > Also needed for qemu too since it may call TUNSETQUEUE when guest wants
> > to change the number of queues.
> 
> Hmm that's nasty. If you allow qemu to do TUNSETQUEUE then how
> do you prevent it from attaching to some other tun?

Currently, we also do the uid and gid check in tun_not_capable() for
TUNSETQUEUE, so the fd can't be attached to the device if the current process
is not the owner. And if it is the owner, is it harmful to allow the fd to be
attached to another device?

The tun_file stores no state information about the tun, so it behaves much
like this: after you open a tun fd, you can call TUNSETIFF on any tun whose
owner is the current process.
> Maybe we can extend TUNSETQUEUE or add another ioctl to
> mark a queue active/inactive? Probably control transmit
> and receive being active separately as well.

It's better to extend the current TUNSETQUEUE if needed. To control
transmitting and receiving, we can add more parameters to this ioctl (e.g.
IFF_TX_ENABLE, IFF_TX_DISABLE).
> 
> > >> btw. Current code does allow calling TUNSETQUEUE to a persistent
> > >> tun/tap
> > >> device with no file attached. It should be a bug and need to be fixed.
> > > 
> > > Is this a problem? You can always
> > > attach
> > > set queue
> > > detach
> > > 
> > > and it would be hard to prevent this ...
> > 
> > Currently, the following steps are allowed:
> > 
> > 1. fd1 = open("/dev/net/tun");
> > 2. tunsetiff(fd1, "tap0");
> > 3. tunsetpersistent("tap0");
> > 4. close(fd1);
> > 5. fd2 = open("/dev/net/tun");
> > 6. tunsetqueue(fd2, "tap0");
> 
> Allowed for libvirt, right? Not for qemu.

But what if steps 1-4 are done by one process and steps 5-6 by another? The
issue only exists for a persistent device with no fd attached.

I agree we can add checks to TUNSETQUEUE and let it distinguish those
conditions. But for simplicity, I suggest letting userspace call TUNSETIFF
first, just as we did when tun supported only a single queue, and relabeling
the security of the tun_struct there. And we can use different checks (e.g. in
selinux, for TUNSETQUEUE and TUNSETIFF) to enable qemu to do TUNSETQUEUE
itself.

> 
> > Looks like step 6 should be forbidden since:
> > 
> > - no fd/sockets were attached to the device, we need use TUNSETIFF
> > instead to keep the API as we used do in single queue tun
> > - we need update the security information in tun_struct just like what
> > we discussed in this mail
> > - it may also miss checks in TUNSETIFF
> 
> We need to fix the checks anyway. Basically if you allow
> a queue to be the only fd that is attached
> (and I don't see how you can prevent this since
> userspace does not expect close to fail),
> I don't see why one way to get to this state
> should be legal and another illegal.

Well, we can return -EPERM when userspace calls TUNSETQUEUE on a tun_struct
with zero fds/queues.
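
A toy model of that rule combined with the ownership check, purely as a sketch: struct toy_tun and toy_tun_set_queue() are hypothetical and do not correspond to the real tun driver structures or to tun_not_capable():

```c
#include <errno.h>

/* Hypothetical, simplified stand-in for the state kept per tun device. */
struct toy_tun {
	unsigned int owner_uid;	/* uid allowed to attach queues */
	int numqueues;		/* fds/queues currently attached */
};

/*
 * Model of a TUNSETQUEUE-style attach: refuse callers that do not own the
 * device, and refuse to attach to a device that has no queue yet, so that
 * the first attach has to go through the TUNSETIFF path.
 */
static int toy_tun_set_queue(struct toy_tun *tun, unsigned int caller_uid)
{
	if (caller_uid != tun->owner_uid)
		return -EPERM;	/* ownership (DAC) check failed */
	if (tun->numqueues == 0)
		return -EPERM;	/* no fd attached: use TUNSETIFF first */
	tun->numqueues++;
	return 0;
}
```

In this model the open/TUNSETIFF/persist/close/open/TUNSETQUEUE sequence listed above fails at the last step, because numqueues is back to zero after the close.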
> 
> The rest is implementation detail.
> 
> > >>> One thing that I think we probably should change is the relabelto/from
> > >>> permissions in the function above (selinux_tun_dev_attach()); in the
> > >>> case
> > >>> where the socket does not yet have a label, e.g. 'sksec->sid == 0', we
> > >>> should probably skip the relabel permissions since we want to assign
> > >>> the
> > >>> TUN device label regardless in this case.
> > >> 
> > >> I'm not familiar with the selinux, have a quick glance of the code,
> > >> looks like the label has been initialized to SECINITSID_KERNEL in
> > >> selinux_socket_post_create().
> > >> 
> > >> Thanks
> 


* Re: [PATCH] net: ICMPv6 packets transmitted on wrong interface if nfmark is mangled
From: Dries De Winter @ 2012-12-05 13:41 UTC (permalink / raw)
  To: David Miller; +Cc: pablo, kaber, netdev, netfilter-devel
In-Reply-To: <20121203.170636.1325623456787407245.davem@davemloft.net>

2012/12/3 David Miller <davem@davemloft.net>:
> From: Dries De Winter <dries.dewinter@gmail.com>
> Date: Mon, 3 Dec 2012 22:31:51 +0100
>
>> Not fixing this means that skb->mark is unavailable for use on ICMPv6
>> packets because it will inevitably put those packets on the wrong
>> interface.
>
> Maybe this suggests that a better fix is to simply explicitly check
> for protocol ICMPV6 in ip6_route_me_harder().

Hmmm, maybe my subject line isn't very well chosen ... I don't really
mean all ICMPv6 traffic. ICMPv6 also includes Destination Unreachable,
Echo request/reply ... and lots of other types that are just regular
unicast packets that should follow the routing table like normal
packets. My concern is mainly about neighbour discovery and MLD. And
after having thought this over and over again, this discussion could
be extended to all link-local traffic in general, not to ICMPv6 in
general.

Routing doesn't make much sense for link-local traffic and typically
the sender specifies on what interface its data has to go out. In the
example of MLD and neighbour discovery this is done using those
special dst entries. But for userspace this can be done by specifying
a non-zero sin6_scope_id or by SO_BINDTODEVICE which sets
sk_bound_dev_if. When sending a message with a link-local destination,
these parameters are taken into account by routing.
ip6_route_me_harder() also takes sk_bound_dev_if into account, but it
has no access to the user supplied sin6_scope_id, so depending on the
method used, you may also hit this issue from user space.

My "noreroute" patch will not fix this. Therefore it's indeed maybe
better to add a simple check to ip6_route_me_harder(): not a check for
ICMPv6, but a check for (ipv6_addr_type(&iph->daddr) &
IPV6_ADDR_LINKLOCAL) instead. What do you think?
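
For illustration, a userspace approximation of such a test, assuming that the intent is to cover both link-local unicast (fe80::/10) and link-local scope multicast (ff02::/16, as used by neighbour discovery and MLD). It uses the standard IN6_IS_ADDR_* macros rather than the kernel's ipv6_addr_type(), and is_ipv6_linklocal() is a name invented for the sketch:

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Returns 1 if 'text' parses as an IPv6 address with link-local scope
 * (unicast fe80::/10 or link-local scope multicast), 0 otherwise.
 */
static int is_ipv6_linklocal(const char *text)
{
	struct in6_addr addr;

	if (inet_pton(AF_INET6, text, &addr) != 1)
		return 0;	/* not a valid IPv6 address */

	return (IN6_IS_ADDR_LINKLOCAL(&addr) ||
		IN6_IS_ADDR_MC_LINKLOCAL(&addr)) ? 1 : 0;
}
```

A rerouting path that skipped packets for which such a test is true would leave the sender's choice of outgoing interface untouched for link-local destinations.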

Dries.


* RE: [PATCH net-next 0/7] Allow to monitor multicast cache event via rtnetlink
From: David Laight @ 2012-12-05 11:41 UTC (permalink / raw)
  To: nicolas.dichtel, David Miller; +Cc: netdev
In-Reply-To: <50BF29DA.7020903@6wind.com>

> >> The one thing I worry about are those 64-bit statistics.  I fear that they
> >> may not be 64-bit aligned in the final netlink message.  This matters on cpus
> >> that trap on unaligned loads/stores, such as sparc and MIPS.
> >>
> >> Can you validate this?
> >>
> > I can have a try on a tile platform. I don't have access to sparc or mips.

> Hmm, I've read arm instead of mips! So I've tried on mips. Data are aligned on
> 32-bit, like for all netlink messages. nla_put_u64() will do the same, as it
> calls nla_put().
> 
> And the kernel will only use memcpy() to treat this attribute. Reader will be in
> userland.

Probably worth commenting that the 64bit items might only be 32bit aligned.
Just to stop anyone trying to read/write them with pointer casts.

I think they are currently done with memcpy() - which should be ok.
It might be possible to optimise by using a structure containing
a 64bit value marked __attribute__((aligned(4))).
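
A minimal sketch of the memcpy() approach for a reader in userland; read_u64_unaligned() is a name invented for the example, and the pointer is assumed to refer to at least 8 readable bytes of attribute payload:

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a 64-bit value from a location that may only be 32-bit aligned.
 * memcpy() avoids the unaligned load that a (uint64_t *) cast could
 * generate on cpus that trap on misaligned accesses.
 */
static uint64_t read_u64_unaligned(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

Compilers typically turn the fixed-size memcpy() into plain loads where the target allows unaligned access, so the safe form usually costs nothing there.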

	David


* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Michael S. Tsirkin @ 2012-12-05 11:44 UTC (permalink / raw)
  To: Jason Wang; +Cc: Paul Moore, netdev, linux-security-module, selinux
In-Reply-To: <50BEE76A.3080707@redhat.com>

On Wed, Dec 05, 2012 at 02:19:22PM +0800, Jason Wang wrote:
> On 12/05/2012 02:17 AM, Paul Moore wrote:
> > On Tuesday, December 04, 2012 07:36:26 PM Michael S. Tsirkin wrote:
> >> On Tue, Dec 04, 2012 at 11:18:57AM -0500, Paul Moore wrote:
> >>> Okay, based on your explanation of TUNSETQUEUE, the steps below are what I
> >>> believe we need to do ... if you disagree speak up quickly please.
> >>>
> >>> A. TUNSETIFF (new, non-persistent device)
> >>>
> >>> [Allocate and initialize the tun_struct LSM state based on the calling
> >>> process, use this state to label the TUN socket.]
> >>>
> >>> 1. Call security_tun_dev_create() which authorizes the action.
> >>> 2. Call security_tun_dev_alloc_security() which allocates the tun_struct
> >>> LSM blob and SELinux sets some internal blob state to record the label of
> >>> the calling process.
> >>> 3. Call security_tun_dev_attach() which sets the label of the TUN socket
> >>> to match the label stored in the tun_struct LSM blob during A2.  No
> >>> authorization is done at this point since the socket is new/unlabeled.
> >>>
> >>> B. TUNSETIFF (existing, persistent device)
> >>>
> >>> [Relabel the existing tun_struct LSM state based on the calling process,
> >>> use this state to label the TUN socket.]
> >>>
> >>> 1. Attempt to relabel/reset the tun_struct LSM blob from the currently
> >>> stored value, set during A2, to the label of the current calling process.
> >>> *** THIS IS NOT CURRENTLY DONE IN THE RFC PATCH ***
> >>> 2. Call security_tun_dev_attach() which sets the label of the TUN socket
> >>> to match the label stored in the tun_struct LSM blob during B1. No
> >>> authorization is done at this point since the socket is new/unlabeled.
> >>>
> >>> C. TUNSETQUEUE
> >>>
> >>> [Use the existing tun_struct LSM state to label the new TUN socket.]
> >>>
> >>> 1. Call security_tun_dev_attach() which sets the label of the TUN socket
> >>> to match the label stored in the tun_struct LSM blob set during either A2
> >>> or B1. No authorization is done at this point since the socket is
> >>> new/unlabeled.
> >> Here's what bothers me. libvirt currently opens tun and passes
> >> fd to qemu. What would prevent qemu from attaching fd using TUNSETQUEUE
> >> to another device it does not own?
> > True, assuming all the above is correct and that I'm understanding it 
> > correctly (Jason?), we should probably add a new SELinux access control for 
> > TUNSETQUEUE.
> 
> Yes, we need make sure qemu can call TUNSETQUEUE for the device it does
> not own.

Meaning can *not* call?

> >
> > The current DAC code exists in tun_not_capable().
> >


* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Michael S. Tsirkin @ 2012-12-05 11:43 UTC (permalink / raw)
  To: Jason Wang; +Cc: Paul Moore, netdev, linux-security-module, selinux
In-Reply-To: <50BEE6FA.1080507@redhat.com>

On Wed, Dec 05, 2012 at 02:17:30PM +0800, Jason Wang wrote:
> On 12/04/2012 11:24 PM, Michael S. Tsirkin wrote:
> > On Tue, Dec 04, 2012 at 09:24:43PM +0800, Jason Wang wrote:
> >> On Monday, December 03, 2012 11:22:29 AM Paul Moore wrote:
> >>> On Monday, December 03, 2012 06:15:42 PM Jason Wang wrote:
> >>>> On 11/30/2012 06:06 AM, Paul Moore wrote:
> >>>>> This patch corrects some problems with LSM/SELinux that were introduced
> >>>>> with the multiqueue patchset.  The problem stems from the fact that the
> >>>>> multiqueue work changed the relationship between the tun device and its
> >>>>> associated socket; before the socket persisted for the life of the
> >>>>> device, however after the multiqueue changes the socket only persisted
> >>>>> for the life of the userspace connection (fd open).  For non-persistent
> >>>>> devices this is not an issue, but for persistent devices this can cause
> >>>>> the tun device to lose its SELinux label.
> >>>>>
> >>>>> We correct this problem by adding an opaque LSM security blob to the
> >>>>> tun device struct which allows us to have the LSM security state, e.g.
> >>>>> SELinux labeling information, persist for the lifetime of the tun
> >>>>> device.
> >>> ...
> >>>
> >>>>> -static int selinux_tun_dev_attach(struct sock *sk)
> >>>>> +static int selinux_tun_dev_attach(struct sock *sk, void *security)
> >>>>>
> >>>>>  {
> >>>>>
> >>>>> +	struct tun_security_struct *tunsec = security;
> >>>>>
> >>>>>  	struct sk_security_struct *sksec = sk->sk_security;
> >>>>>  	u32 sid = current_sid();
> >>>>>  	int err;
> >>>>>
> >>>>> +	/* we don't currently perform any NetLabel based labeling here ...
> >>>>>
> >>>>>  	err = avc_has_perm(sid, sksec->sid, SECCLASS_TUN_SOCKET,
> >>>>>  	
> >>>>>  			   TUN_SOCKET__RELABELFROM, NULL);
> >>>>>  	
> >>>>>  	if (err)
> >>>>>  	
> >>>>>  		return err;
> >>>>>
> >>>>> -	err = avc_has_perm(sid, sid, SECCLASS_TUN_SOCKET,
> >>>>> +	err = avc_has_perm(sid, tunsec->sid, SECCLASS_TUN_SOCKET,
> >>>>>
> >>>>>  			   TUN_SOCKET__RELABELTO, NULL);
> >>>>>  	
> >>>>>  	if (err)
> >>>>>  	
> >>>>>  		return err;
> >>>>>
> >>>>> -	sksec->sid = sid;
> >>>>> +	sksec->sid = tunsec->sid;
> >>>>> +	sksec->sclass = SECCLASS_TUN_SOCKET;
> >>>> I'm not sure whether this is correct; it looks like we need to distinguish
> >>>> between TUNSETQUEUE and TUNSETIFF. When userspace calls TUNSETIFF for a
> >>>> persistent device, it looks like we need to change the sid of tunsec as in
> >>>> the past.
> >>> It may be that I'm misunderstanding TUNSETQUEUE and/or TUNSETIFF.  Can you
> >>> elaborate as to why they should be different?
> >> If I understand correctly, before the multiqueue patchset TUNSETIFF was used to:
> >>
> >> 1) Create the tun/tap network device.
> >> 2) For a persistent device, re-attach the fd to the network device / socket.
> >> In this case, we call selinux_tun_dev_attach() to relabel the socket sid (and
> >> in fact the device's too, since the socket was also persistent) to the sid
> >> of the process that calls TUNSETIFF.
> >>
> >> So, after the multiqueue changes, we need to try to preserve that policy. The
> >> interesting part is the introduction of TUNSETQUEUE, which is used to attach
> >> more file descriptors/sockets to a tun/tap device after at least one file
> >> descriptor has been attached to it through TUNSETIFF. So I think we may need
> >> to distinguish between those two ioctls. This patch looks fine for
> >> TUNSETQUEUE, but for TUNSETIFF on a persistent device, don't we need to
> >> relabel the tunsec to the process that is calling TUNSETIFF?
> > Basically, it looks like currently once you get a tun fd,
> > you can attach it to any device even if normally
> > selinux would prevent you from accessing it.
> 
> Yes, some checking during TUNSETQUEUE is missing.
> > If we reuse selinux_tun_dev_attach, we won't need to
> > change selinux policy, with a new capability we will need to change it
> > to allow libvirt to do TUNSETQUEUE.
> >
> 
> It is also needed for qemu, since it may call TUNSETQUEUE when the guest wants
> to change the number of queues.

Hmm that's nasty. If you allow qemu to do TUNSETQUEUE then how
do you prevent it from attaching to some other tun?
Maybe we can extend TUNSETQUEUE or add another ioctl to
mark a queue active/inactive? Probably control transmit
and receive being active separately as well.

> >> Btw, the current code does allow calling TUNSETQUEUE on a persistent tun/tap
> >> device with no file attached. That is probably a bug and needs to be fixed.
> > Is this a problem? You can always
> > attach
> > set queue
> > detach
> >
> > and it would be hard to prevent this ...
> 
> Currently, the following steps are allowed:
> 
> 1. fd1 = open("/dev/net/tun");
> 2. tunsetiff(fd1, "tap0");
> 3. tunsetpersistent("tap0");
> 4. close(fd1);
> 5. fd2 = open("/dev/net/tun");
> 6. tunsetqueue(fd2, "tap0");

Allowed for libvirt, right? Not for qemu.

> It looks like step 6 should be forbidden since:
> 
> - no fds/sockets are attached to the device, so we need to use TUNSETIFF
> instead to keep the API as it was in single-queue tun
> - we need to update the security information in tun_struct, as
> discussed in this mail
> - it may also skip the checks done in TUNSETIFF

We need to fix the checks anyway. Basically if you allow
a queue to be the only fd that is attached
(and I don't see how you can prevent this since
userspace does not expect close to fail),
I don't see why one way to get to this state
should be legal and another illegal.

The rest is implementation detail.

> >>> One thing that I think we probably should change is the relabelto/from
> >>> permissions in the function above (selinux_tun_dev_attach()); in the case
> >>> where the socket does not yet have a label, e.g. 'sksec->sid == 0', we
> >>> should probably skip the relabel permissions since we want to assign the
> >>> TUN device label regardless in this case.
> >> I'm not familiar with selinux, but from a quick glance at the code it looks
> >> like the label has been initialized to SECINITSID_KERNEL in
> >> selinux_socket_post_create().
> >>
> >> Thanks

^ permalink raw reply

* Re: [PATCH net-next 0/7] Allow to monitor multicast cache event via rtnetlink
From: Nicolas Dichtel @ 2012-12-05 11:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <50BE56D3.2030704@6wind.com>

On 04/12/2012 21:02, Nicolas Dichtel wrote:
> On 04/12/2012 19:09, David Miller wrote:
>> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> Date: Tue,  4 Dec 2012 12:13:34 +0100
>>
>>> The goal of this series is to be able to monitor multicast activity via
>>> rtnetlink.
>>>
>>> The main changes are:
>>>   - when the user dumps mfc entries, all entries are now returned,
>>>     including the unresolved cache.
>>>   - the kernel sends rtnetlink notifications when it adds/deletes mfc entries.
>>>
>>> As usual, the patch against iproute2 will be sent once the patches are
>>> included and net-next is merged. I can send it on demand.
>>
>> This looks good, applied, thanks Nicolas.
>>
>> The one thing I worry about are those 64-bit statistics.  I fear that they
>> may not be 64-bit aligned in the final netlink message.  This matters on cpus
>> that trap on unaligned loads/stores, such as sparc and MIPS.
>>
>> Can you validate this?
>>
> I can try it on a tile platform. I don't have access to sparc or mips.
Hmm, I had read arm instead of mips! So I've tried it on mips. The data is
aligned on 32 bits, as for all netlink messages. nla_put_u64() will do the same,
as it calls nla_put().

The kernel will only use memcpy() to handle this attribute; the reader will be
in userland.
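
For illustration, a minimal userspace sketch of how a reader can fetch such a
64-bit value from a payload that is only 32-bit aligned. It is not taken from
the patches; read_u64_unaligned() and roundtrip_at_offset4() are hypothetical
names, and the buffer simply stands in for a netlink attribute payload:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a buffer that is only guaranteed to be
 * 4-byte aligned.  memcpy() leaves it to the compiler to emit a load
 * sequence that is safe on CPUs that trap on unaligned accesses.
 */
static uint64_t read_u64_unaligned(const void *payload)
{
	uint64_t value;

	memcpy(&value, payload, sizeof(value));
	return value;
}

/* Hypothetical demo helper: store a u64 at an offset that is 4-byte
 * but not 8-byte aligned, then read it back.
 */
static uint64_t roundtrip_at_offset4(uint64_t in)
{
	/* uint64_t storage makes buf itself 8-byte aligned */
	uint64_t storage[2] = { 0, 0 };
	unsigned char *buf = (unsigned char *)storage;

	memcpy(buf + 4, &in, sizeof(in));
	return read_u64_unaligned(buf + 4);
}
```

Dereferencing the payload directly as a u64 pointer is what would trap on
sparc or mips; copying it out first avoids that.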

^ permalink raw reply

* [PATCH net-next v2 3/3] virtio-net: support changing the number of queue pairs through ethtool
From: Jason Wang @ 2012-12-05 10:37 UTC (permalink / raw)
  To: mst, rusty, davem, virtualization, netdev, linux-kernel, krkumar2
  Cc: bhutchings, jwhan, shiyer, kvm
In-Reply-To: <1354703872-25677-1-git-send-email-jasowang@redhat.com>

This patch implements the ethtool_{set|get}_channels methods of virtio-net to
allow the user to change the number of queues on demand while the device is
running.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 43 insertions(+), 0 deletions(-)
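
Note: the request validation done by virtnet_set_channels() can be sketched in
userspace roughly as follows. This is only an illustration; struct
channels_req and check_channels() are hypothetical stand-ins, not the kernel's
struct ethtool_channels or any function in the patch:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the fields of struct ethtool_channels
 * that the patch inspects; not the kernel definition.
 */
struct channels_req {
	unsigned int rx_count;
	unsigned int tx_count;
	unsigned int other_count;
	unsigned int combined_count;
};

/* Return 0 if the request is acceptable, -EINVAL otherwise.
 * max_queue_pairs plays the role of vi->max_queue_pairs.
 */
static int check_channels(const struct channels_req *req,
			  unsigned int max_queue_pairs)
{
	/* Only combined rx/tx pairs are supported. */
	if (req->rx_count || req->tx_count || req->other_count)
		return -EINVAL;

	if (req->combined_count > max_queue_pairs)
		return -EINVAL;

	return 0;
}
```

Like the patch as posted, the sketch does not reject a combined_count of zero.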

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index def11ce..51c9b75 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1075,10 +1075,53 @@ static void virtnet_get_drvinfo(struct net_device *dev,
 
 }
 
+/* TODO: Eliminate OOO packets during switching */
+static int virtnet_set_channels(struct net_device *dev,
+				struct ethtool_channels *channels)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	u16 queue_pairs = channels->combined_count;
+	int err;
+
+	/* We don't support separate rx/tx channels.
+	 * We don't allow setting 'other' channels.
+	 */
+	if (channels->rx_count || channels->tx_count || channels->other_count)
+		return -EINVAL;
+
+	if (queue_pairs > vi->max_queue_pairs)
+		return -EINVAL;
+
+	err = virtnet_set_queues(vi, queue_pairs);
+	if (!err) {
+		netif_set_real_num_tx_queues(dev, queue_pairs);
+		netif_set_real_num_rx_queues(dev, queue_pairs);
+
+		virtnet_set_affinity(vi, true);
+	}
+
+	return err;
+}
+
+static void virtnet_get_channels(struct net_device *dev,
+				 struct ethtool_channels *channels)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	channels->combined_count = vi->curr_queue_pairs;
+	channels->max_combined = vi->max_queue_pairs;
+	channels->max_other = 0;
+	channels->rx_count = 0;
+	channels->tx_count = 0;
+	channels->other_count = 0;
+}
+
 static const struct ethtool_ops virtnet_ethtool_ops = {
 	.get_drvinfo = virtnet_get_drvinfo,
 	.get_link = ethtool_op_get_link,
 	.get_ringparam = virtnet_get_ringparam,
+	.set_channels = virtnet_set_channels,
+	.get_channels = virtnet_get_channels,
 };
 
 #define MIN_MTU 68
-- 
1.7.1
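
Note: virtnet_set_channels() calls virtnet_set_affinity(vi, true) after a
successful change. As a rough model of the decision that function takes in
patch 2/3 (a sketch only; the enum and affinity_decision() are hypothetical
names, and ncpus stands in for num_online_cpus()):

```c
#include <assert.h>
#include <stdbool.h>

enum affinity_action { AFFINITY_KEEP, AFFINITY_SET, AFFINITY_CLEAR };

/* Model of the branch structure in virtnet_set_affinity():
 * - a "set" request is only honoured in multiqueue mode when the
 *   maximum number of queue pairs equals the number of online cpus;
 * - otherwise an existing hint is cleared, or nothing is done.
 */
static enum affinity_action affinity_decision(bool set,
					      unsigned int curr_queue_pairs,
					      unsigned int max_queue_pairs,
					      unsigned int ncpus,
					      bool hint_already_set)
{
	if ((curr_queue_pairs == 1 || max_queue_pairs != ncpus) && set) {
		if (hint_already_set)
			set = false;
		else
			return AFFINITY_KEEP;
	}

	return set ? AFFINITY_SET : AFFINITY_CLEAR;
}
```

So dropping back to a single queue pair through ethtool clears a hint that an
earlier multiqueue configuration had installed.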

^ permalink raw reply related

* [PATCH net-next v2 2/3] virtio_net: multiqueue support
From: Jason Wang @ 2012-12-05 10:37 UTC (permalink / raw)
  To: mst, rusty, davem, virtualization, netdev, linux-kernel, krkumar2
  Cc: bhutchings, jwhan, shiyer, kvm
In-Reply-To: <1354703872-25677-1-git-send-email-jasowang@redhat.com>

This patch adds multiqueue (VIRTIO_NET_F_RFS) support to the virtio_net
driver. A VIRTIO_NET_F_RFS capable device allows the driver to transmit and
receive packets through multiple queue pairs and performs packet steering to
get better performance. By default only one queue pair is used; the user can
change the number of queue pairs with ethtool once the next patch is applied.

When multiple queue pairs are used and the number of queue pairs is equal to
the number of vcpus, the driver applies the following optimizations to
implement per-cpu virtqueue pairs:

- select the txq based on the smp processor id.
- set the smp affinity hint to the cpu that owns the queue pair.

This can be used with the flow steering support of the device to guarantee
that the packets of a single flow are handled by the same cpu.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c        |  473 +++++++++++++++++++++++++++++++--------
 include/uapi/linux/virtio_net.h |   27 +++
 2 files changed, 402 insertions(+), 98 deletions(-)
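
Note: the virtqueue numbering used throughout this patch (0:rx0 1:tx0 2:rx1
3:tx1 ..., control vq last) can be checked with a small userspace sketch. The
helpers below mirror the names in the diff but take a plain index instead of a
struct virtqueue pointer; cvq_index() is a hypothetical addition for
illustration:

```c
#include <assert.h>

/* Receive queue i lives at virtqueue 2*i, transmit queue i at 2*i + 1. */
static int rxq2vq(int rxq) { return rxq * 2; }
static int txq2vq(int txq) { return txq * 2 + 1; }

/* Inverse mappings, taking the virtqueue index directly. */
static int vq2rxq(int vq) { return vq / 2; }
static int vq2txq(int vq) { return (vq - 1) / 2; }

/* Index of the control virtqueue when max_queue_pairs pairs exist;
 * it follows the last tx queue.
 */
static int cvq_index(int max_queue_pairs) { return max_queue_pairs * 2; }
```

With four queue pairs, for example, rx3 and tx3 map to virtqueues 6 and 7, and
the control virtqueue is number 8.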

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0dcaee7..def11ce 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -58,6 +58,9 @@ struct send_queue {
 
 	/* TX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	/* Name of the send queue: output.$index */
+	char name[40];
 };
 
 /* Internal representation of a receive virtqueue */
@@ -75,22 +78,34 @@ struct receive_queue {
 
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+	/* Name of this receive queue: input.$index */
+	char name[40];
 };
 
 struct virtnet_info {
 	struct virtio_device *vdev;
 	struct virtqueue *cvq;
 	struct net_device *dev;
-	struct send_queue sq;
-	struct receive_queue rq;
+	struct send_queue *sq;
+	struct receive_queue *rq;
 	unsigned int status;
 
+	/* Max # of queue pairs supported by the device */
+	u16 max_queue_pairs;
+
+	/* # of queue pairs currently used by the driver */
+	u16 curr_queue_pairs;
+
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
 
+	/* Has control virtqueue */
+	bool has_cvq;
+
 	/* enable config space updates */
 	bool config_enable;
 
@@ -105,6 +120,9 @@ struct virtnet_info {
 
 	/* Lock for config space updates */
 	struct mutex config_lock;
+
+	/* Has the affinity hint been set for virtqueues? */
+	bool affinity_hint_set;
 };
 
 struct skb_vnet_hdr {
@@ -125,6 +143,29 @@ struct padded_vnet_hdr {
 	char padding[6];
 };
 
+/* Converting between virtqueue no. and kernel tx/rx queue no.
+ * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
+ */
+static int vq2txq(struct virtqueue *vq)
+{
+	return (virtqueue_get_queue_index(vq) - 1) / 2;
+}
+
+static int txq2vq(int txq)
+{
+	return txq * 2 + 1;
+}
+
+static int vq2rxq(struct virtqueue *vq)
+{
+	return virtqueue_get_queue_index(vq) / 2;
+}
+
+static int rxq2vq(int rxq)
+{
+	return rxq * 2;
+}
+
 static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
 {
 	return (struct skb_vnet_hdr *)skb->cb;
@@ -165,7 +206,7 @@ static void skb_xmit_done(struct virtqueue *vq)
 	virtqueue_disable_cb(vq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -502,7 +543,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 static void skb_recv_done(struct virtqueue *rvq)
 {
 	struct virtnet_info *vi = rvq->vdev->priv;
-	struct receive_queue *rq = &vi->rq;
+	struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
 
 	/* Schedule NAPI, Suppress further interrupts if successful. */
 	if (napi_schedule_prep(&rq->napi)) {
@@ -532,15 +573,21 @@ static void refill_work(struct work_struct *work)
 	struct virtnet_info *vi =
 		container_of(work, struct virtnet_info, refill.work);
 	bool still_empty;
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		struct receive_queue *rq = &vi->rq[i];
 
-	napi_disable(&vi->rq.napi);
-	still_empty = !try_fill_recv(&vi->rq, GFP_KERNEL);
-	virtnet_napi_enable(&vi->rq);
+		napi_disable(&rq->napi);
+		still_empty = !try_fill_recv(rq, GFP_KERNEL);
+		virtnet_napi_enable(rq);
 
-	/* In theory, this can happen: if we don't get any buffers in
-	 * we will *never* try to fill again. */
-	if (still_empty)
-		schedule_delayed_work(&vi->refill, HZ/2);
+		/* In theory, this can happen: if we don't get any buffers in
+		 * we will *never* try to fill again.
+		 */
+		if (still_empty)
+			schedule_delayed_work(&vi->refill, HZ/2);
+	}
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
@@ -578,6 +625,21 @@ again:
 	return received;
 }
 
+static int virtnet_open(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		/* Make sure we have some buffers: if oom use wq. */
+		if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
+			schedule_delayed_work(&vi->refill, 0);
+		virtnet_napi_enable(&vi->rq[i]);
+	}
+
+	return 0;
+}
+
 static unsigned int free_old_xmit_skbs(struct send_queue *sq)
 {
 	struct sk_buff *skb;
@@ -650,7 +712,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	struct send_queue *sq = &vi->sq;
+	int qnum = skb_get_queue_mapping(skb);
+	struct send_queue *sq = &vi->sq[qnum];
 	int capacity;
 
 	/* Free up any pending old buffers before queueing new ones. */
@@ -664,13 +727,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (likely(capacity == -ENOMEM)) {
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					 "TXQ (%d) failure: out of memory\n",
+					 qnum);
 		} else {
 			dev->stats.tx_fifo_errors++;
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					 "Unexpected TXQ (%d) failure: %d\n",
+					 qnum, capacity);
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
@@ -685,12 +749,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
+		netif_stop_subqueue(dev, qnum);
 		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
 			capacity += free_old_xmit_skbs(sq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
+				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
@@ -758,23 +822,13 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
-	napi_schedule(&vi->rq.napi);
+	for (i = 0; i < vi->curr_queue_pairs; i++)
+		napi_schedule(&vi->rq[i].napi);
 }
 #endif
 
-static int virtnet_open(struct net_device *dev)
-{
-	struct virtnet_info *vi = netdev_priv(dev);
-
-	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(&vi->rq, GFP_KERNEL))
-		schedule_delayed_work(&vi->refill, 0);
-
-	virtnet_napi_enable(&vi->rq);
-	return 0;
-}
-
 /*
  * Send command via the control virtqueue and check status.  Commands
  * supported by the hypervisor, as indicated by feature bits, should
@@ -830,13 +884,39 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
 	rtnl_unlock();
 }
 
+static int virtnet_set_queues(struct virtnet_info *vi, u16 queue_pairs)
+{
+	struct scatterlist sg;
+	struct virtio_net_ctrl_rfs s;
+	struct net_device *dev = vi->dev;
+
+	if (!vi->has_cvq || !virtio_has_feature(vi->vdev, VIRTIO_NET_F_RFS))
+		return 0;
+
+	s.virtqueue_pairs = queue_pairs;
+	sg_init_one(&sg, &s, sizeof(s));
+
+	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RFS,
+				  VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET, &sg, 1, 0)){
+		dev_warn(&dev->dev, "Fail to set num of queue pairs to %d\n",
+			 queue_pairs);
+		return -EINVAL;
+	} else
+		vi->curr_queue_pairs = queue_pairs;
+
+	return 0;
+}
+
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->rq.napi);
+
+	for (i = 0; i < vi->max_queue_pairs; i++)
+		napi_disable(&vi->rq[i].napi);
 
 	return 0;
 }
@@ -943,13 +1023,41 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev, u16 vid)
 	return 0;
 }
 
+static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
+{
+	int i;
+
+	/* In multiqueue mode, when the number of cpus is equal to the number of
+	 * queue pairs, we make each queue pair private to one cpu by
+	 * setting the affinity hint to eliminate the contention.
+	 */
+	if ((vi->curr_queue_pairs == 1 ||
+	     vi->max_queue_pairs != num_online_cpus()) && set) {
+		if (vi->affinity_hint_set)
+			set = false;
+		else
+			return;
+	}
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		int cpu = set ? i : -1;
+		virtqueue_set_affinity(vi->rq[i].vq, cpu);
+		virtqueue_set_affinity(vi->sq[i].vq, cpu);
+	}
+
+	if (set)
+		vi->affinity_hint_set = true;
+	else
+		vi->affinity_hint_set = false;
+}
+
 static void virtnet_get_ringparam(struct net_device *dev,
 				struct ethtool_ringparam *ring)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq.vq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq.vq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0].vq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0].vq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
 }
@@ -984,6 +1092,21 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+/* To avoid contending for a lock held by a vcpu that would exit to the host,
+ * select the txq based on the processor id.
+ * TODO: handle cpu hotplug.
+ */
+static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+		  smp_processor_id();
+
+	while (unlikely(txq >= dev->real_num_tx_queues))
+		txq -= dev->real_num_tx_queues;
+
+	return txq;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -995,6 +1118,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
+	.ndo_select_queue     = virtnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = virtnet_netpoll,
 #endif
@@ -1030,10 +1154,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 done:
 	mutex_unlock(&vi->config_lock);
@@ -1046,48 +1170,203 @@ static void virtnet_config_changed(struct virtio_device *vdev)
 	schedule_work(&vi->config_work);
 }
 
+static void virtnet_free_queues(struct virtnet_info *vi)
+{
+	kfree(vi->rq);
+	kfree(vi->sq);
+}
+
+static void free_receive_bufs(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		while (vi->rq[i].pages)
+			__free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
+	}
+}
+
+static void free_unused_bufs(struct virtnet_info *vi)
+{
+	void *buf;
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		struct virtqueue *vq = vi->sq[i].vq;
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
+			dev_kfree_skb(buf);
+	}
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		struct virtqueue *vq = vi->rq[i].vq;
+
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
+			if (vi->mergeable_rx_bufs || vi->big_packets)
+				give_pages(&vi->rq[i], buf);
+			else
+				dev_kfree_skb(buf);
+			--vi->rq[i].num;
+		}
+		BUG_ON(vi->rq[i].num != 0);
+	}
+}
+
 static void virtnet_del_vqs(struct virtnet_info *vi)
 {
 	struct virtio_device *vdev = vi->vdev;
 
+	virtnet_set_affinity(vi, false);
+
 	vdev->config->del_vqs(vdev);
+
+	virtnet_free_queues(vi);
 }
 
-static int init_vqs(struct virtnet_info *vi)
+static int virtnet_find_vqs(struct virtnet_info *vi)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs, err;
-
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
-
-	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
-	if (err)
-		return err;
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int ret = -ENOMEM;
+	int i, total_vqs;
+	const char **names;
+
+	/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
+	 * possible N-1 RX/TX queue pairs used in multiqueue mode, followed by
+	 * possible control vq.
+	 */
+	total_vqs = vi->max_queue_pairs * 2 +
+		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kzalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	/* Parameters for control virtqueue, if any */
+	if (vi->has_cvq) {
+		callbacks[total_vqs - 1] = NULL;
+		names[total_vqs - 1] = "control";
+	}
 
-	vi->rq.vq = vqs[0];
-	vi->sq.vq = vqs[1];
+	/* Allocate/initialize parameters for send/receive virtqueues */
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		callbacks[rxq2vq(i)] = skb_recv_done;
+		callbacks[txq2vq(i)] = skb_xmit_done;
+		sprintf(vi->rq[i].name, "input.%d", i);
+		sprintf(vi->sq[i].name, "output.%d", i);
+		names[rxq2vq(i)] = vi->rq[i].name;
+		names[txq2vq(i)] = vi->sq[i].name;
+	}
 
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
+	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
+					 names);
+	if (ret)
+		goto err_find;
 
+	if (vi->has_cvq) {
+		vi->cvq = vqs[total_vqs - 1];
 		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
 			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
 	}
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		vi->rq[i].vq = vqs[rxq2vq(i)];
+		vi->sq[i].vq = vqs[txq2vq(i)];
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
+
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return ret;
+}
+
+static int virtnet_alloc_queues(struct virtnet_info *vi)
+{
+	int i;
+
+	vi->sq = kzalloc(sizeof(*vi->sq) * vi->max_queue_pairs, GFP_KERNEL);
+	if (!vi->sq)
+		goto err_sq;
+	vi->rq = kzalloc(sizeof(*vi->rq) * vi->max_queue_pairs, GFP_KERNEL);
+	if (!vi->rq)
+		goto err_rq;
+
+	INIT_DELAYED_WORK(&vi->refill, refill_work);
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		vi->rq[i].pages = NULL;
+		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
+			       napi_weight);
+
+		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
+	}
+
+	return 0;
+
+err_rq:
+	kfree(vi->sq);
+err_sq:
+	return -ENOMEM;
+}
+
+static int init_vqs(struct virtnet_info *vi)
+{
+	int ret;
+
+	/* Allocate send & receive queues */
+	ret = virtnet_alloc_queues(vi);
+	if (ret)
+		goto err;
+
+	ret = virtnet_find_vqs(vi);
+	if (ret)
+		goto err_free;
+
+	virtnet_set_affinity(vi, true);
+	return 0;
+
+err_free:
+	virtnet_free_queues(vi);
+err:
+	return ret;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
 {
-	int err;
+	int i, err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
+	u16 max_queue_pairs;
+
+	/* Find if host supports multiqueue virtio_net device */
+	err = virtio_config_val(vdev, VIRTIO_NET_F_RFS,
+				offsetof(struct virtio_net_config,
+				max_virtqueue_pairs), &max_queue_pairs);
+
+	/* We need at least 2 queues */
+	if (err || max_queue_pairs < VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN ||
+	    max_queue_pairs > VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MAX ||
+	    !virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
+		max_queue_pairs = 1;
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), max_queue_pairs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -1134,22 +1413,17 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->rq.napi, virtnet_poll, napi_weight);
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
-	vi->rq.pages = NULL;
 	vi->stats = alloc_percpu(struct virtnet_stats);
 	err = -ENOMEM;
 	if (vi->stats == NULL)
 		goto free;
 
-	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	mutex_init(&vi->config_lock);
 	vi->config_enable = true;
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
-	sg_init_table(vi->rq.sg, ARRAY_SIZE(vi->rq.sg));
-	sg_init_table(vi->sq.sg, ARRAY_SIZE(vi->sq.sg));
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1160,10 +1434,21 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
+		vi->has_cvq = true;
+
+	/* Use single tx/rx queue pair as default */
+	vi->curr_queue_pairs = 1;
+	vi->max_queue_pairs = max_queue_pairs;
+
+	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = init_vqs(vi);
 	if (err)
 		goto free_stats;
 
+	netif_set_real_num_tx_queues(dev, 1);
+	netif_set_real_num_rx_queues(dev, 1);
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
@@ -1171,12 +1456,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 	}
 
 	/* Last of all, set up some receive buffers. */
-	try_fill_recv(&vi->rq, GFP_KERNEL);
-
-	/* If we didn't even get one input buffer, we're useless. */
-	if (vi->rq.num == 0) {
-		err = -ENOMEM;
-		goto unregister;
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		try_fill_recv(&vi->rq[i], GFP_KERNEL);
+
+		/* If we didn't even get one input buffer, we're useless. */
+		if (vi->rq[i].num == 0) {
+			free_unused_bufs(vi);
+			err = -ENOMEM;
+			goto free_recv_bufs;
+		}
 	}
 
 	/* Assume link up if device can't report link status,
@@ -1189,12 +1477,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 		netif_carrier_on(dev);
 	}
 
-	pr_debug("virtnet: registered device %s\n", dev->name);
+	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
+		 dev->name, max_queue_pairs);
+
 	return 0;
 
-unregister:
+free_recv_bufs:
+	free_receive_bufs(vi);
 	unregister_netdev(dev);
 free_vqs:
+	cancel_delayed_work_sync(&vi->refill);
 	virtnet_del_vqs(vi);
 free_stats:
 	free_percpu(vi->stats);
@@ -1203,28 +1495,6 @@ free:
 	return err;
 }
 
-static void free_unused_bufs(struct virtnet_info *vi)
-{
-	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->sq.vq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
-	}
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rq.vq);
-		if (!buf)
-			break;
-		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(&vi->rq, buf);
-		else
-			dev_kfree_skb(buf);
-		--vi->rq.num;
-	}
-	BUG_ON(vi->rq.num != 0);
-}
-
 static void remove_vq_common(struct virtnet_info *vi)
 {
 	vi->vdev->config->reset(vi->vdev);
@@ -1232,10 +1502,9 @@ static void remove_vq_common(struct virtnet_info *vi)
 	/* Free unused buffers in both send and recv, if any. */
 	free_unused_bufs(vi);
 
-	virtnet_del_vqs(vi);
+	free_receive_bufs(vi);
 
-	while (vi->rq.pages)
-		__free_pages(get_a_page(&vi->rq, GFP_KERNEL), 0);
+	virtnet_del_vqs(vi);
 }
 
 static void __devexit virtnet_remove(struct virtio_device *vdev)
@@ -1261,6 +1530,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 static int virtnet_freeze(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
+	int i;
 
 	/* Prevent config work handler from accessing the device */
 	mutex_lock(&vi->config_lock);
@@ -1271,7 +1541,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	cancel_delayed_work_sync(&vi->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->rq.napi);
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			napi_disable(&vi->rq[i].napi);
+			netif_napi_del(&vi->rq[i].napi);
+		}
 
 	remove_vq_common(vi);
 
@@ -1283,24 +1556,28 @@ static int virtnet_freeze(struct virtio_device *vdev)
 static int virtnet_restore(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
-	int err;
+	int err, i;
 
 	err = init_vqs(vi);
 	if (err)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(&vi->rq);
+		for (i = 0; i < vi->max_queue_pairs; i++)
+			virtnet_napi_enable(&vi->rq[i]);
 
 	netif_device_attach(vi->dev);
 
-	if (!try_fill_recv(&vi->rq, GFP_KERNEL))
-		schedule_delayed_work(&vi->refill, 0);
+	for (i = 0; i < vi->max_queue_pairs; i++)
+		if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
+			schedule_delayed_work(&vi->refill, 0);
 
 	mutex_lock(&vi->config_lock);
 	vi->config_enable = true;
 	mutex_unlock(&vi->config_lock);
 
+	virtnet_set_queues(vi, vi->curr_queue_pairs);
+
 	return 0;
 }
 #endif
@@ -1318,7 +1595,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
-	VIRTIO_NET_F_GUEST_ANNOUNCE,
+	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_RFS,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 2470f54..7f45f3a 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -51,6 +51,8 @@
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_GUEST_ANNOUNCE 21	/* Guest can announce device on the
 					 * network */
+#define VIRTIO_NET_F_RFS	22	/* Device supports Receive Flow
+					 * Steering */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 #define VIRTIO_NET_S_ANNOUNCE	2	/* Announcement is needed */
@@ -60,6 +62,11 @@ struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* Maximum number of each of transmit and receive queues;
+	 * see VIRTIO_NET_F_RFS and VIRTIO_NET_CTRL_RFS.
+	 * Legal values are between 1 and 0x8000
+	 */
+	__u16 max_virtqueue_pairs;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
@@ -166,4 +173,24 @@ struct virtio_net_ctrl_mac {
 #define VIRTIO_NET_CTRL_ANNOUNCE       3
  #define VIRTIO_NET_CTRL_ANNOUNCE_ACK         0
 
+/*
+ * Control Receive Flow Steering
+ *
+ * The command VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET
+ * enables Receive Flow Steering, specifying the number of the transmit and
+ * receive queues that will be used. After the command is consumed and acked by
+ * the device, the device will not steer new packets on receive virtqueues
+ * other than specified nor read from transmit virtqueues other than specified.
+ * Accordingly, driver should not transmit new packets on virtqueues other than
+ * specified.
+ */
+struct virtio_net_ctrl_rfs {
+	__u16 virtqueue_pairs;
+};
+
+#define VIRTIO_NET_CTRL_RFS   4
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET        0
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN        1
+ #define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MAX        0x8000
+
 #endif /* _LINUX_VIRTIO_NET_H */
-- 
1.7.1
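
The VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET comment in the hunk above implies a
range check on the requested number of queue pairs. A minimal sketch of such
a check follows. It assumes stand-alone mirrors of the constants from the
hunk; rfs_pairs_valid() and its device_max parameter (standing in for
max_virtqueue_pairs read from config space) are hypothetical names and are
not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the constants added to virtio_net.h by the patch. */
#define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN 1
#define VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MAX 0x8000

/*
 * Hypothetical helper: returns true when `requested` queue pairs could be
 * sent with VIRTIO_NET_CTRL_RFS_VQ_PAIRS_SET, i.e. the value lies within
 * the legal range and does not exceed what the device advertises.
 */
static bool rfs_pairs_valid(uint16_t requested, uint16_t device_max)
{
	/* Reject a device that advertises an illegal maximum. */
	if (device_max < VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN ||
	    device_max > VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MAX)
		return false;

	return requested >= VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN &&
	       requested <= device_max;
}
```

With a device that reports 4 pairs, a request for 2 passes while requests
for 0 or 5 are rejected.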

* [PATCH net-next v2 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info
From: Jason Wang @ 2012-12-05 10:37 UTC (permalink / raw)
  To: mst, rusty, davem, virtualization, netdev, linux-kernel, krkumar2
  Cc: bhutchings, jwhan, shiyer, kvm
In-Reply-To: <1354703872-25677-1-git-send-email-jasowang@redhat.com>

To support multiqueue transmitq/receiveq, the first step is to separate the
queue-related structures from virtnet_info. This patch introduces the
send_queue and receive_queue structures and passes pointers to them as the
parameter to the functions that handle sending/receiving.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c |  282 ++++++++++++++++++++++++++--------------------
 1 files changed, 158 insertions(+), 124 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8262232..0dcaee7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -51,16 +51,40 @@ struct virtnet_stats {
 	u64 rx_packets;
 };
 
-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
-	struct net_device *dev;
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send_queue */
+	struct virtqueue *vq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+	/* Virtqueue associated with this receive_queue */
+	struct virtqueue *vq;
+
 	struct napi_struct napi;
-	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
 
+	/* Chain pages by the private ptr. */
+	struct page *pages;
+
+	/* RX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
+struct virtnet_info {
+	struct virtio_device *vdev;
+	struct virtqueue *cvq;
+	struct net_device *dev;
+	struct send_queue sq;
+	struct receive_queue rq;
+	unsigned int status;
+
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
@@ -81,13 +105,6 @@ struct virtnet_info {
 
 	/* Lock for config space updates */
 	struct mutex config_lock;
-
-	/* Chain pages by the private ptr. */
-	struct page *pages;
-
-	/* fragments + linear part + virtio header */
-	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -117,22 +134,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
  */
-static void give_pages(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct receive_queue *rq, struct page *page)
 {
 	struct page *end;
 
-	/* Find end of list, sew whole thing into vi->pages. */
+	/* Find end of list, sew whole thing into rq->pages. */
 	for (end = page; end->private; end = (struct page *)end->private);
-	end->private = (unsigned long)vi->pages;
-	vi->pages = page;
+	end->private = (unsigned long)rq->pages;
+	rq->pages = page;
 }
 
-static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
+static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 {
-	struct page *p = vi->pages;
+	struct page *p = rq->pages;
 
 	if (p) {
-		vi->pages = (struct page *)p->private;
+		rq->pages = (struct page *)p->private;
 		/* clear private here, it is used to chain pages */
 		p->private = 0;
 	} else
@@ -140,12 +157,12 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }
 
-static void skb_xmit_done(struct virtqueue *svq)
+static void skb_xmit_done(struct virtqueue *vq)
 {
-	struct virtnet_info *vi = svq->vdev->priv;
+	struct virtnet_info *vi = vq->vdev->priv;
 
 	/* Suppress further interrupts. */
-	virtqueue_disable_cb(svq);
+	virtqueue_disable_cb(vq);
 
 	/* We were probably waiting for more output buffers. */
 	netif_wake_queue(vi->dev);
@@ -167,9 +184,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
 }
 
 /* Called from bottom half context */
-static struct sk_buff *page_to_skb(struct virtnet_info *vi,
+static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int len)
 {
+	struct virtnet_info *vi = rq->vq->vdev->priv;
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	unsigned int copy, hdr_len, offset;
@@ -224,12 +242,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	}
 
 	if (page)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return skb;
 }
 
-static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	struct page *page;
@@ -243,7 +261,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 			skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		page = virtqueue_get_buf(vi->rvq, &len);
+		page = virtqueue_get_buf(rq->vq, &len);
 		if (!page) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 skb->dev->name, hdr->mhdr.num_buffers);
@@ -256,14 +274,15 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 
 		set_skb_frag(skb, page, 0, &len);
 
-		--vi->num;
+		--rq->num;
 	}
 	return 0;
 }
 
-static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
+static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 {
-	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_info *vi = rq->vq->vdev->priv;
+	struct net_device *dev = vi->dev;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 	struct sk_buff *skb;
 	struct page *page;
@@ -273,7 +292,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
 		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
+			give_pages(rq, buf);
 		else
 			dev_kfree_skb(buf);
 		return;
@@ -285,14 +304,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		skb_trim(skb, len);
 	} else {
 		page = buf;
-		skb = page_to_skb(vi, page, len);
+		skb = page_to_skb(rq, page, len);
 		if (unlikely(!skb)) {
 			dev->stats.rx_dropped++;
-			give_pages(vi, page);
+			give_pages(rq, page);
 			return;
 		}
 		if (vi->mergeable_rx_bufs)
-			if (receive_mergeable(vi, skb)) {
+			if (receive_mergeable(rq, skb)) {
 				dev_kfree_skb(skb);
 				return;
 			}
@@ -359,8 +378,9 @@ frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 {
+	struct virtnet_info *vi = rq->vq->vdev->priv;
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	int err;
@@ -372,77 +392,77 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
-	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
+	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
 
-	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
+	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *first, *list = NULL;
 	char *p;
 	int i, err, offset;
 
-	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
+	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
-		first = get_a_page(vi, gfp);
+		first = get_a_page(rq, gfp);
 		if (!first) {
 			if (list)
-				give_pages(vi, list);
+				give_pages(rq, list);
 			return -ENOMEM;
 		}
-		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
+		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
 
 		/* chain new page in list head to match sg */
 		first->private = (unsigned long)list;
 		list = first;
 	}
 
-	first = get_a_page(vi, gfp);
+	first = get_a_page(rq, gfp);
 	if (!first) {
-		give_pages(vi, list);
+		give_pages(rq, list);
 		return -ENOMEM;
 	}
 	p = page_address(first);
 
-	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
-	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
-	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
+	/* rq->sg[0], rq->sg[1] share the same page */
+	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
+	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
 
-	/* vi->rx_sg[1] for data packet, from offset */
+	/* rq->sg[1] for data packet, from offset */
 	offset = sizeof(struct padded_vnet_hdr);
-	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
+	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
 				first, gfp);
 	if (err < 0)
-		give_pages(vi, first);
+		give_pages(rq, first);
 
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *page;
 	int err;
 
-	page = get_a_page(vi, gfp);
+	page = get_a_page(rq, gfp);
 	if (!page)
 		return -ENOMEM;
 
-	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
+	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
 	if (err < 0)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return err;
 }
@@ -454,65 +474,68 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
  * before we're receiving packets, or from refill_work which is
  * careful to disable receiving (using napi_disable).
  */
-static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
+static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 {
+	struct virtnet_info *vi = rq->vq->vdev->priv;
 	int err;
 	bool oom;
 
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
+			err = add_recvbuf_mergeable(rq, gfp);
 		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
+			err = add_recvbuf_big(rq, gfp);
 		else
-			err = add_recvbuf_small(vi, gfp);
+			err = add_recvbuf_small(rq, gfp);
 
 		oom = err == -ENOMEM;
 		if (err < 0)
 			break;
-		++vi->num;
+		++rq->num;
 	} while (err > 0);
-	if (unlikely(vi->num > vi->max))
-		vi->max = vi->num;
-	virtqueue_kick(vi->rvq);
+	if (unlikely(rq->num > rq->max))
+		rq->max = rq->num;
+	virtqueue_kick(rq->vq);
 	return !oom;
 }
 
 static void skb_recv_done(struct virtqueue *rvq)
 {
 	struct virtnet_info *vi = rvq->vdev->priv;
+	struct receive_queue *rq = &vi->rq;
+
 	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&vi->napi)) {
+	if (napi_schedule_prep(&rq->napi)) {
 		virtqueue_disable_cb(rvq);
-		__napi_schedule(&vi->napi);
+		__napi_schedule(&rq->napi);
 	}
 }
 
-static void virtnet_napi_enable(struct virtnet_info *vi)
+static void virtnet_napi_enable(struct receive_queue *rq)
 {
-	napi_enable(&vi->napi);
+	napi_enable(&rq->napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
 	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
 	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
+	if (napi_schedule_prep(&rq->napi)) {
+		virtqueue_disable_cb(rq->vq);
 		local_bh_disable();
-		__napi_schedule(&vi->napi);
+		__napi_schedule(&rq->napi);
 		local_bh_enable();
 	}
 }
 
 static void refill_work(struct work_struct *work)
 {
-	struct virtnet_info *vi;
+	struct virtnet_info *vi =
+		container_of(work, struct virtnet_info, refill.work);
 	bool still_empty;
 
-	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
+	napi_disable(&vi->rq.napi);
+	still_empty = !try_fill_recv(&vi->rq, GFP_KERNEL);
+	virtnet_napi_enable(&vi->rq);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
@@ -522,29 +545,31 @@ static void refill_work(struct work_struct *work)
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
-	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
+	struct receive_queue *rq =
+		container_of(napi, struct receive_queue, napi);
+	struct virtnet_info *vi = rq->vq->vdev->priv;
 	void *buf;
 	unsigned int len, received = 0;
 
 again:
 	while (received < budget &&
-	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
-		receive_buf(vi->dev, buf, len);
-		--vi->num;
+	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
+		receive_buf(rq, buf, len);
+		--rq->num;
 		received++;
 	}
 
-	if (vi->num < vi->max / 2) {
-		if (!try_fill_recv(vi, GFP_ATOMIC))
+	if (rq->num < rq->max / 2) {
+		if (!try_fill_recv(rq, GFP_ATOMIC))
 			schedule_delayed_work(&vi->refill, 0);
 	}
 
 	/* Out of packets? */
 	if (received < budget) {
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
+		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
 		    napi_schedule_prep(napi)) {
-			virtqueue_disable_cb(vi->rvq);
+			virtqueue_disable_cb(rq->vq);
 			__napi_schedule(napi);
 			goto again;
 		}
@@ -553,13 +578,14 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct send_queue *sq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
 		u64_stats_update_begin(&stats->tx_syncp);
@@ -573,10 +599,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
@@ -611,25 +638,26 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(sq->sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(sq->vq, sq->sg, hdr->num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct send_queue *sq = &vi->sq;
 	int capacity;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(sq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(sq, skb);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
@@ -648,7 +676,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(sq->vq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -658,12 +686,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
 		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
+		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(sq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
 				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				virtqueue_disable_cb(sq->vq);
 			}
 		}
 	}
@@ -731,7 +759,7 @@ static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	napi_schedule(&vi->napi);
+	napi_schedule(&vi->rq.napi);
 }
 #endif
 
@@ -740,10 +768,10 @@ static int virtnet_open(struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 
 	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(vi, GFP_KERNEL))
+	if (!try_fill_recv(&vi->rq, GFP_KERNEL))
 		schedule_delayed_work(&vi->refill, 0);
 
-	virtnet_napi_enable(vi);
+	virtnet_napi_enable(&vi->rq);
 	return 0;
 }
 
@@ -808,7 +836,7 @@ static int virtnet_close(struct net_device *dev)
 
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->napi);
+	napi_disable(&vi->rq.napi);
 
 	return 0;
 }
@@ -920,11 +948,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq.vq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq.vq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
-
 }
 
 
@@ -1019,6 +1046,13 @@ static void virtnet_config_changed(struct virtio_device *vdev)
 	schedule_work(&vi->config_work);
 }
 
+static void virtnet_del_vqs(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+
+	vdev->config->del_vqs(vdev);
+}
+
 static int init_vqs(struct virtnet_info *vi)
 {
 	struct virtqueue *vqs[3];
@@ -1034,8 +1068,8 @@ static int init_vqs(struct virtnet_info *vi)
 	if (err)
 		return err;
 
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
+	vi->rq.vq = vqs[0];
+	vi->sq.vq = vqs[1];
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
 		vi->cvq = vqs[2];
@@ -1100,11 +1134,11 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
+	netif_napi_add(dev, &vi->rq.napi, virtnet_poll, napi_weight);
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
-	vi->pages = NULL;
+	vi->rq.pages = NULL;
 	vi->stats = alloc_percpu(struct virtnet_stats);
 	err = -ENOMEM;
 	if (vi->stats == NULL)
@@ -1114,8 +1148,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	mutex_init(&vi->config_lock);
 	vi->config_enable = true;
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
-	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
+	sg_init_table(vi->rq.sg, ARRAY_SIZE(vi->rq.sg));
+	sg_init_table(vi->sq.sg, ARRAY_SIZE(vi->sq.sg));
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1137,10 +1171,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	}
 
 	/* Last of all, set up some receive buffers. */
-	try_fill_recv(vi, GFP_KERNEL);
+	try_fill_recv(&vi->rq, GFP_KERNEL);
 
 	/* If we didn't even get one input buffer, we're useless. */
-	if (vi->num == 0) {
+	if (vi->rq.num == 0) {
 		err = -ENOMEM;
 		goto unregister;
 	}
@@ -1161,7 +1195,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 unregister:
 	unregister_netdev(dev);
 free_vqs:
-	vdev->config->del_vqs(vdev);
+	virtnet_del_vqs(vi);
 free_stats:
 	free_percpu(vi->stats);
 free:
@@ -1173,22 +1207,22 @@ static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
 	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
+		buf = virtqueue_detach_unused_buf(vi->sq.vq);
 		if (!buf)
 			break;
 		dev_kfree_skb(buf);
 	}
 	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rvq);
+		buf = virtqueue_detach_unused_buf(vi->rq.vq);
 		if (!buf)
 			break;
 		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
+			give_pages(&vi->rq, buf);
 		else
 			dev_kfree_skb(buf);
-		--vi->num;
+		--vi->rq.num;
 	}
-	BUG_ON(vi->num != 0);
+	BUG_ON(vi->rq.num != 0);
 }
 
 static void remove_vq_common(struct virtnet_info *vi)
@@ -1198,10 +1232,10 @@ static void remove_vq_common(struct virtnet_info *vi)
 	/* Free unused buffers in both send and recv, if any. */
 	free_unused_bufs(vi);
 
-	vi->vdev->config->del_vqs(vi->vdev);
+	virtnet_del_vqs(vi);
 
-	while (vi->pages)
-		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
+	while (vi->rq.pages)
+		__free_pages(get_a_page(&vi->rq, GFP_KERNEL), 0);
 }
 
 static void __devexit virtnet_remove(struct virtio_device *vdev)
@@ -1237,7 +1271,7 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	cancel_delayed_work_sync(&vi->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->napi);
+		napi_disable(&vi->rq.napi);
 
 	remove_vq_common(vi);
 
@@ -1256,11 +1290,11 @@ static int virtnet_restore(struct virtio_device *vdev)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(vi);
+		virtnet_napi_enable(&vi->rq);
 
 	netif_device_attach(vi->dev);
 
-	if (!try_fill_recv(vi, GFP_KERNEL))
+	if (!try_fill_recv(&vi->rq, GFP_KERNEL))
 		schedule_delayed_work(&vi->refill, 0);
 
 	mutex_lock(&vi->config_lock);
-- 
1.7.1
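
The patch above relies on two pointer hops: virtnet_poll() recovers the
receive_queue from the napi pointer with container_of(), and per-queue
helpers reach the device-wide virtnet_info through rq->vq->vdev->priv. The
sketch below reproduces both hops in user space. The structures are reduced
stand-ins for the kernel ones, and rq_from_napi() and vi_from_rq() are
hypothetical helper names used only for illustration.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced stand-ins for the kernel structures touched by the patch. */
struct napi_struct { int weight; };
struct virtio_device { void *priv; };
struct virtqueue { struct virtio_device *vdev; };

struct receive_queue {
	struct virtqueue *vq;
	struct napi_struct napi;
};

struct virtnet_info {
	struct virtio_device *vdev;
	struct receive_queue rq;
};

/* The poll callback only gets the napi pointer; recover the owning queue. */
static struct receive_queue *rq_from_napi(struct napi_struct *napi)
{
	return container_of(napi, struct receive_queue, napi);
}

/* Per-queue code reaches the device-wide state through the virtqueue. */
static struct virtnet_info *vi_from_rq(struct receive_queue *rq)
{
	return rq->vq->vdev->priv;
}
```

Because neither hop depends on where the receive_queue is stored, the same
code keeps working once a single embedded rq becomes an array of queues.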

* [PATCH net-next v2 0/3] Multiqueue support in virtio-net
From: Jason Wang @ 2012-12-05 10:37 UTC (permalink / raw)
  To: mst, rusty, davem, virtualization, netdev, linux-kernel, krkumar2
  Cc: bhutchings, jwhan, shiyer, kvm

Hi all:

This series is an updated version of the multiqueue virtio-net driver, based on
Krishna Kumar's work, which lets virtio-net use multiple rx/tx queues for
packet reception and transmission. Please review and comment.

A prototype implementation of qemu-kvm support can be found at
git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you
can specify the queues parameter for both tap and virtio-net, like:

./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...

then enable multiqueue through ethtool with:

ethtool -L eth0 combined 2

Changes from V1:
Addressing Michael's comments:
- fix typos in commit log
- don't move virtnet_open()
- don't set to NULL in virtnet_free_queues()
- style & comment fixes
- conditionally set the irq affinity hint based on online cpus and queue pairs
- move the virtnet_del_vqs to patch 1
- change the meaningless kzalloc() to kmalloc()
- open code the err handling
- store the name of virtqueue in send/receive queue
- avoid type cast in virtnet_find_vqs()
- fix the mem leak and freeing issue of names in virtnet_find_vqs()
- check cvq before setting the max_queue_pairs in virtnet_probe()
- check the cvq and VIRTIO_NET_F_RFS in virtnet_set_queues()
- set the curr_queue_pairs in virtnet_set_queues()
- use the error reported by virtnet_set_queues() as the return value of
  ethtool_set_channels()

Changes from RFC v7:
Addressing Rusty's comments:
- align the implementation (location of cvq) to v5.
- fix the style issue.
- use a global refill instead of per-vq one.
- check the VIRTIO_NET_F_RFS before calling virtnet_set_queues()

Addressing Michael's comments:
- rename the curr_queue_pairs in virtnet_probe() to max_queue_pairs
- validate the number of queue pairs supported by the device against
  VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MIN and VIRTIO_NET_CTRL_RFS_VQ_PAIRS_MAX.
- don't crash when failing to change the number of virtqueues
- don't set the affinity hint when only a single queue is used or there are too
  many virtqueues
- add a TODO of handling cpu hotplug
- allow user to set the number of queue pairs between 1 and max_queue_pairs

Changes from RFC v6:
- Align the implementation with the RFC spec update v5
- Addressing Rusty's comments:
  * split the patches
  * rename to max_queue_pairs and curr_queue_pairs
  * remove the useless status
  * fix the hibernation bug
- Addressing Ben's comments:
  * check other parameters in ethtool_set_queues

Changes from RFC v5:
- Align the implementation with the RFC spec update v4
- Switch the mode between single mode and multiqueue mode without reset
- Remove the 256 limitation of queues
- Use helpers to do the mapping between virtqueues and tx/rx queues
- Use combined channels instead of separate rx/tx queues when configuring the
  queue number
- Other coding style comments from Michael
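
The "helpers to do the mapping" item in the list above could look roughly
like the sketch below. It assumes an interleaved layout (receive queue n at
virtqueue index 2n, transmit queue n at index 2n + 1, control virtqueue
after all pairs). That layout, the helper names, and the use of plain
indices instead of struct virtqueue pointers are assumptions made for
illustration, not something this cover letter states.

```c
/* Assumed layout: rx0, tx0, rx1, tx1, ..., control virtqueue last. */

/* Virtqueue index of receive queue `rxq`. */
static int rxq2vq(int rxq)
{
	return rxq * 2;
}

/* Virtqueue index of transmit queue `txq`. */
static int txq2vq(int txq)
{
	return txq * 2 + 1;
}

/* Receive queue number for an even virtqueue index. */
static int vq2rxq(int vq_index)
{
	return vq_index / 2;
}

/* Transmit queue number for an odd virtqueue index. */
static int vq2txq(int vq_index)
{
	return (vq_index - 1) / 2;
}

/* Index of the control virtqueue on a device with `max_pairs` pairs. */
static int cvq_index(int max_pairs)
{
	return max_pairs * 2;
}
```

Under this assumption a two-pair device uses indices 0..3 for data and
index 4 for the control virtqueue.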

Changes from RFC v4:
- Add ability to negotiate the number of queues through control virtqueue
- Ethtool -{L|l} support and default the tx/rx queue number to 1
- Expose the API to set irq affinity instead of irq itself

Changes from RFC v3:
- Rebase to the net-next
- Let queue 2 be the control virtqueue to obey the spec
- Provides irq affinity
- Choose txq based on processor id
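
As an illustration of the last item, choosing a transmit queue from the
processor id can be as simple as wrapping the CPU number into the range of
active queues. The sketch below is an assumption about the general idea
only; pick_txq() is a hypothetical name and the series may select queues
differently.

```c
/*
 * Hypothetical helper: map a CPU id onto one of `num_txq` transmit queues.
 * Falls back to queue 0 when no queue count is given.
 */
static unsigned int pick_txq(unsigned int cpu, unsigned int num_txq)
{
	if (num_txq == 0)
		return 0;

	return cpu % num_txq;
}
```

With two queue pairs, CPUs 0 and 2 share queue 0 while CPUs 1 and 3 share
queue 1.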

Reference:
- Virtio spec RFC: http://patchwork.ozlabs.org/patch/201303/
- V1: https://lkml.org/lkml/2012/11/27/177
- RFC V7: https://lkml.org/lkml/2012/11/27/177a
- RFC V6: https://lkml.org/lkml/2012/10/30/127
- RFC V5: http://lwn.net/Articles/505388/
- RFC V4: https://lkml.org/lkml/2012/6/25/120
- RFC V2: http://lwn.net/Articles/467283/

Perf Numbers:

Will run some basic tests and post the results as a reply to this mail.

Jason Wang (3):
  virtio-net: separate fields of sending/receiving queue from
    virtnet_info
  virtio_net: multiqueue support
  virtio-net: support changing the number of queue pairs through
    ethtool

 drivers/net/virtio_net.c        |  726 +++++++++++++++++++++++++++++----------
 include/uapi/linux/virtio_net.h |   27 ++
 2 files changed, 567 insertions(+), 186 deletions(-)

* Re: [Patch 1/1] net/phy: Add interrupt support for dp83640 phy.
From: Richard Cochran @ 2012-12-05 10:05 UTC (permalink / raw)
  To: Stephan Gatzka; +Cc: netdev, davem
In-Reply-To: <1354652498-16573-1-git-send-email-stephan.gatzka@gmail.com>

On Tue, Dec 04, 2012 at 09:21:38PM +0100, Stephan Gatzka wrote:
> Added functions for ack_interrupt and config_intr. Tested on an mpc5200b
> powerpc board.
> 
> Signed-off-by: Stephan Gatzka <stephan.gatzka@gmail.com>

The patch looks okay to me, but I worry that this might fail on boards
which have not connected the PHYTER's PWRDOWN/INTN pin to anything.
Such designs really need PHY_POLL to work.

Taking a brief glance at the drivers for two such boards I know of
(m5234bcc and an IXP), it looks like their MAC drivers set mii_bus irq
to PHY_POLL, so it might work fine, but this patch still makes me
nervous that some other board might break.

Maybe this should be a kconfig option?

Thanks,
Richard

* Re: [PATCH -next v2] ipw2200: return error code on error in ipw_wx_get_auth()
From: Stanislav Yakovlev @ 2012-12-05  9:44 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: linville, yongjun_wei, linux-wireless, netdev
In-Reply-To: <CAPgLHd_AN+Tr5EQFBZMH0FZCp6ESHVURZHvd9SuW=a2PYkPoXw@mail.gmail.com>

On 5 December 2012 00:08, Wei Yongjun <weiyj.lk@gmail.com> wrote:
> From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
>
> We have assigned error code to 'ret' when get auth from some
> option is not supported but never used it, but we'd better return
> the error code.
>
> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>

Looks fine, thanks.

Stanislav.

> ---
>  drivers/net/wireless/ipw2x00/ipw2200.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/net/wireless/ipw2x00/ipw2200.c b/drivers/net/wireless/ipw2x00/ipw2200.c
> index 482f505..b0879ad 100644
> --- a/drivers/net/wireless/ipw2x00/ipw2200.c
> +++ b/drivers/net/wireless/ipw2x00/ipw2200.c
> @@ -6812,7 +6812,6 @@ static int ipw_wx_get_auth(struct net_device *dev,
>         struct libipw_device *ieee = priv->ieee;
>         struct lib80211_crypt_data *crypt;
>         struct iw_param *param = &wrqu->param;
> -       int ret = 0;
>
>         switch (param->flags & IW_AUTH_INDEX) {
>         case IW_AUTH_WPA_VERSION:
> @@ -6822,8 +6821,7 @@ static int ipw_wx_get_auth(struct net_device *dev,
>                 /*
>                  * wpa_supplicant will control these internally
>                  */
> -               ret = -EOPNOTSUPP;
> -               break;
> +               return -EOPNOTSUPP;
>
>         case IW_AUTH_TKIP_COUNTERMEASURES:
>                 crypt = priv->ieee->crypt_info.crypt[priv->ieee->crypt_info.tx_keyidx];
>
>

* RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers
From: David Laight @ 2012-12-05  9:35 UTC (permalink / raw)
  To: Nicolas Ferre
  Cc: David S. Miller, netdev, linux-arm-kernel, linux-kernel,
	Joachim Eastwood, Jean-Christophe PLAGNIOL-VILLARD,
	Havard Skinnemoen
In-Reply-To: <50BE2FEC.2070500@atmel.com>

> If I understand well, you mean that the call to:
> 
> 		dma_sync_single_range_for_device(&bp->pdev->dev, phys,
> 				pg_offset, frag_len, DMA_FROM_DEVICE);
> 
> in the rx path after having copied the data to skb is not needed?
> That is also the conclusion that I reached after having thought about
> this again... I will check this.

You need to make sure that the memory isn't in the data cache
when you give the rx buffer back to the MAC.
(and ensure the cpu doesn't read it until the rx is complete.)
I've NFI what that dma_sync call does - you need to invalidate
the cache lines.
 
> For the CRC, my driver is not using the CRC offloading feature for the
> moment. So no CRC is written by the device.

I was thinking it would matter if the MAC wrote the CRC into the
buffer (even though it was excluded from the length).
It doesn't - you only need to worry about data you've read.

> > I was wondering if the code needs to do per page allocations?
> > Perhaps that is necessary to avoid needing a large block of
> > contiguous physical memory (and virtual addresses)?
> 
> The page management seems interesting for future management of RX
> buffers as skb fragments: that will allow to avoid copying received data.

Dunno - the complexities of such buffer loaning schemes often
exceed the gain of avoiding the data copy.
Using buffers allocated to the skb is a bit different - since
you completely forget about the memory once you pass the skb
upstream.

Some quick sums indicate you might want to allocate 8k memory
blocks and split into 5 buffers.
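
One way to reproduce that kind of sum is sketched below. The 1536-byte
buffer size is an assumption (a 1518-byte maximum Ethernet frame plus the
2-byte offset, rounded up to a multiple of 64); split_block() and struct
split are hypothetical names used only for this illustration.

```c
#include <stddef.h>

/* Result of carving one memory block into equally sized rx buffers. */
struct split {
	size_t buffers;	/* whole buffers that fit in the block */
	size_t wasted;	/* bytes left over at the end of the block */
};

/*
 * Hypothetical helper: how many buffers of `buf_size` bytes fit into a
 * block of `block_size` bytes, and how many bytes remain unused.
 */
static struct split split_block(size_t block_size, size_t buf_size)
{
	struct split s = { 0, block_size };

	if (buf_size == 0 || buf_size > block_size)
		return s;

	s.buffers = block_size / buf_size;
	s.wasted = block_size - s.buffers * buf_size;
	return s;
}
```

Under these assumptions an 8192-byte block holds 5 buffers with 512 bytes
left over, while a 4096-byte page holds only 2 buffers and leaves 1024
bytes unused.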

	David

* Re: ip_rt_min_pmtu
From: Christopher Schramm @ 2012-12-05  9:33 UTC (permalink / raw)
  To: Rami Rosen; +Cc: netdev
In-Reply-To: <CAKoUArn0EYaeebFXtCgqDSYeJD2oFKUfns7Lbk7Rsg=N=pnpHA@mail.gmail.com>

On Wed, 05 Dec 2012 09:46:18 +0100, Rami Rosen wrote:
> But RFC 791 also declares 576 as PMTU:
> "All hosts must be prepared to accept datagrams of up to 576 octets".

That's the common mistake I mentioned. The sentence goes on: "(whether 
they arrive whole or in fragments)", so it says nothing about lower 
layers, especially not about the MTU.

The 576-octet IP datagram that every implementation is required to be able 
to handle could be transferred in anywhere from 12 to over 60 fragments if 
we had the minimal PMTU of 68, depending on the header size. Of course that 
leads to an overhead of nearly 90 percent, but it should be possible 
according to the RFC.

With Linux, trying to send large data over a path with PMTU < 552 will 
probably fail (unless you change the min_pmtu value).

^ permalink raw reply

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
From: Jesper Dangaard Brouer @ 2012-12-05  9:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu
In-Reply-To: <20121204133007.20215.52566.stgit@dragon>


First of all, this patch contains a small bug (see below), which
resulted in me not testing the correct patch...

Second, this patch does NOT behave as I expected and claimed.  Thus, the
conclusions in my previous response might be wrong!

The previous evictor patch, which let new fragments enter, worked
amazingly well.  But I suspect this might also be related to a
bug/problem in the evictor loop (which was being hidden by that patch).

My new *theory* is that the evictor loop will loop too much if
it finds a fragment which is INET_FRAG_COMPLETE ... in that case, we
don't advance the LRU list, and thus will pick up the exact same
inet_frag_queue again in the loop... to get out of the loop we need
another CPU or packet to change the LRU list for us... I'll test that
theory... (it could also be CPUs fighting over the same LRU head
element that causes this) ... more to come...


On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 4750d2b..d8bf59b 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -178,6 +178,16 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
>  
>  		q = list_first_entry(&nf->lru_list,
>  				struct inet_frag_queue, lru_list);
> +
> +		/* When head of LRU is very new/warm, then the head is
> +		 * most likely the one with most fragments and the
> +		 * tail with least, thus drop tail
> +		 */
> +		if (!force && q->creation_ts == (u32) jiffies) {
> +			q = list_entry(&nf->lru_list.prev,

Remove the "&" in &nf->lru_list.prev

> +				struct inet_frag_queue, lru_list);
> +		}
> +
>  		atomic_inc(&q->refcnt);
>  		read_unlock(&f->lock);

^ permalink raw reply

* Re: ip_rt_min_pmtu
From: Rami Rosen @ 2012-12-05  8:46 UTC (permalink / raw)
  To: Christopher Schramm; +Cc: netdev
In-Reply-To: <50BE654D.2010602@shakaweb.org>

Hi,
Just a short note:
RFC 791 does indeed set 68 as the internet module MTU.

But RFC 791 also declares 576 as PMTU:
"All hosts must be prepared to accept datagrams of up to 576 octets".
and it also says:
"The number 576 is selected to allow a reasonable sized data block to
be transmitted in addition to the required header information."

It seems that there is a distinction between a module's sending MTU and
a host's receiving MTU.

As for the historical details of why it was set at that time,
I have no idea.

Regards,
Rami Rosen
http://ramirose.wix.com/ramirosen



On Tue, Dec 4, 2012 at 11:04 PM, Christopher Schramm
<netdev@shakaweb.org> wrote:
> Hi,
>
> I'm looking into an interesting detail of the Linux IPv4 implementation I
> stumbled upon during a University course.
>
> In route.c there's a value ip_rt_min_pmtu, defined as 512 + 20 + 20, that
> tells Linux a minimum PMTU to use, even if e.g. an ICMP message tells it to
> set a smaller one.
>
> Of course, this is not a problem in the real world, but it is not
> standard-compliant, since RFC 791 defines a minimum MTU of 68 for IPv4. So
> I wonder what the reason for the restriction is.
>
> I looked into it and found that it appeared in Linux 2.3.15 with the
> following ID in route.c:
>
> v 1.71 1999/08/20 11:05:58 davem
>
> While it was not present in Linux 2.3.14 with:
>
> v 1.69 1999/06/09 10:11:02 davem
>
> I couldn't find any related discussion or patch on the LKML around those
> dates, so I'm asking you for any hints to find out the reason for
> implementing this lower bound.
>
> What I've found on the LKML is a topic around February 15th, 2001, titled
> "MTU and 2.4.x kernel", where Alexey Kuznetsov points out that the handling
> of "DF on syn frames" is broken for MTUs smaller than 128 and "Preventing
> DoSes requires to block pmtu discovery at 576 or at least 552".
>
> Does anybody know the actual reason for the change in 2.3.15? I first
> thought it's the common misinterpretation that 576 would be the lower bound
> for MTUs in IPv4, but I wonder why it was put in place as a patch years
> after the IPv4 implementation was already done. There seems to have been
> some clear reason for it. I also wonder why it has never been removed up to
> today if it's really nothing more than a mistake.
>
> Would be great if someone could help me shed some light on this.
>
> Regards

^ permalink raw reply

