Netdev List

* Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Willy Tarreau @ 2012-05-17 12:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Hi Eric,

I'm facing a regression in stable 3.2.17 and 3.0.31 which is
exhibited by your patch 'tcp: allow splice() to build full TSO
packets' which unfortunately I am very interested in !

What I'm observing is that TCP transmits using splice() stall
quite quickly if I'm using pipes larger than 64kB (even 65537
is enough to reliably observe the stall).
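
For readers following along, enlarging a pipe beyond the 64kB default can
be sketched as below. This is a minimal illustration and not haproxy's
code: make_sized_pipe() is a hypothetical name, and the fallback values
1031 (0x407) and 1032 are the Linux numbers for F_SETPIPE_SZ and
F_GETPIPE_SZ, used only if the headers do not expose them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc exposes F_SETPIPE_SZ only with this */
#endif
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031	/* 0x407 on Linux */
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032
#endif

/*
 * Hypothetical helper: create a pipe and ask the kernel to grow it to at
 * least `want` bytes. Returns the capacity actually granted, or -1 on
 * error. The kernel may round the requested size up.
 */
static int make_sized_pipe(int fds[2], int want)
{
	if (pipe(fds) < 0)
		return -1;

	if (fcntl(fds[0], F_SETPIPE_SZ, want) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	/* Read back the real capacity rather than trusting the request. */
	return fcntl(fds[0], F_GETPIPE_SZ);
}
```

Asking for 65537 bytes is enough to get a pipe larger than 64kB, since the
kernel rounds the capacity up.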

I'm seeing this on haproxy running on a small ARM machine (a
dockstar), which exchanges data through a gig switch with my
development PC. The NIC (mv643xx) doesn't support TSO but has
GSO enabled. If I disable GSO, the problem remains. I can however
make the problem disappear by disabling SG or Tx checksumming.
BTW, using recv/send() instead of splice() also gets rid of the
problem.

I can also reduce the risk of seeing the problem by increasing
the default TCP buffer sizes in tcp_wmem. By default I'm running
at 16kB, but if I increase the output buffer size above the pipe
size, the problem *seems* to disappear, though I can't be certain,
since larger buffers generally mean the problem takes longer to
appear, probably because the buffers don't need to be filled.
I am certain, however, that with 64k TCP buffers and 128k pipes
I still see it.

With strace, I'm seeing data fill up the pipe, with the splice()
call responsible for pushing the data to the output socket returning
-1 EAGAIN. During this time, the client receives no data.
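
The forwarding step described here (input fd to pipe, then pipe to output
fd, both non-blocking) can be sketched as follows. This is only a sketch
under assumptions, not haproxy's implementation: forward_once(), the
fwd_status values and the `pending` counter are hypothetical names. The
flags value matches the 0x3 seen in the traces (SPLICE_F_MOVE |
SPLICE_F_NONBLOCK).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* splice() and SPLICE_F_* need this with glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

enum fwd_status { FWD_OK, FWD_EOF, FWD_WAIT_IN, FWD_WAIT_OUT, FWD_ERROR };

/*
 * Hypothetical helper: move up to `max` bytes from in_fd into the pipe,
 * then try to drain everything the pipe holds into out_fd.
 *   FWD_WAIT_IN  - nothing to read right now; poll in_fd for input
 *   FWD_WAIT_OUT - out_fd returned EAGAIN or took only part of the data;
 *                  the rest stays in the pipe, poll out_fd for output
 * `*pending` tracks how many bytes are currently parked in the pipe.
 */
static enum fwd_status forward_once(int in_fd, int pipe_rd, int pipe_wr,
				    int out_fd, size_t max, size_t *pending)
{
	const unsigned int flags = SPLICE_F_MOVE | SPLICE_F_NONBLOCK;
	ssize_t n;

	n = splice(in_fd, NULL, pipe_wr, NULL, max, flags);
	if (n < 0 && errno != EAGAIN)
		return FWD_ERROR;
	if (n > 0)
		*pending += (size_t)n;
	if (*pending == 0)
		return n == 0 ? FWD_EOF : FWD_WAIT_IN;

	n = splice(pipe_rd, NULL, out_fd, NULL, *pending, flags);
	if (n < 0)
		return errno == EAGAIN ? FWD_WAIT_OUT : FWD_ERROR;

	*pending -= (size_t)n;
	return *pending ? FWD_WAIT_OUT : FWD_OK;
}
```

A caller would typically register out_fd for output events on FWD_WAIT_OUT
and call forward_once() again once the socket is reported writable. The
stall reported in this mail corresponds to that event never arriving.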

Something bugs me: I have tested with a dummy server of mine,
httpterm, which uses tee+splice() to push data outside, and it
has no problem filling the gig pipe, and correctly recovers
from the EAGAIN:

  send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
  tee(0x3, 0x6, 0x10000, 0x2)             = 42552
  splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
  tee(0x3, 0x6, 0x10000, 0x2)             = 13880
  splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
  ...
  tee(0x3, 0x6, 0x10000, 0x2)             = 13880
  splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
  tee(0x3, 0x6, 0x10000, 0x2)             = 50744
  splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
  tee(0x3, 0x6, 0x10000, 0x2)             = 30264
  splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)

etc...

It's only with haproxy which uses splice() to copy data between
two sockets that I'm getting the issue (data forwarded from fd 0xe
to fd 0x6) :

  16:03:17.797144 pipe([36, 37])          = 0
  16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
  16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
  16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
  16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
  16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
  16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
  16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
  16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
  16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
  16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
  16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
  16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
  16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
  16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
  16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
  16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
  16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
  16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
  16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
  16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
  16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
  16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
  16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
  16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)

So output fd 6 hangs here and does not appear again in the trace
until the point below, where I pressed Ctrl-C to stop the test:

  16:03:24.740985 gettimeofday({1324652604, 741042}, NULL) = 0
  16:03:24.741148 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
  16:03:24.951762 gettimeofday({1324652604, 951838}, NULL) = 0
  16:03:24.951956 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EPIPE (Broken pipe)

I tried disabling LRO/GRO at the input interface (which happens to be
the same) to see if fragmentation of input data had any impact on this
but nothing changes.

Please note that I'm not even certain the patch is the culprit. I
suspect that by improving splice() efficiency, it might make a
latent issue more visible. I have no data to back this
feeling, but nothing strikes me in your patch.

I don't know what I can do to troubleshoot this issue. I don't want
to pollute the list with network captures nor strace outputs, but I
have them if you're interested in verifying a few things.

I have another platform available for a test (Atom+82574L supporting
TSO). I'll rebuild and boot on this one to see if I observe the same
behaviour.

If you have any suggestion about things to check or tweaks to change
in the code, I'm quite open to experiment.

Best regards,
Willy

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 12:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <4FB4E782.3070502@linutronix.de>

On Thu, 2012-05-17 at 13:56 +0200, Sebastian Andrzej Siewior wrote:
> On 05/17/2012 01:50 PM, Jeff Kirsher wrote:
> >
> > You're correct, I apologize.  This was my fault, I applied your v1 of the
> > patch and then realized there was a v2.
> >
> > I will re-send the series with the correct patch.
> 
> Okay. I haven't seen [0] in the series. Did you merge it somewhere?
> 
> [0] http://thread.gmane.org/gmane.linux.drivers.e1000.devel/10019
> 
> Sebastian

No, not yet.  Aaron is still validating that patch since it was actually
the last one you sent me.  I expect to be pushing it in the next day or
so with some ixgbe patches, once it finishes validation.

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Sebastian Andrzej Siewior @ 2012-05-17 11:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <1337255417.2714.49.camel@jtkirshe-mobl>

On 05/17/2012 01:50 PM, Jeff Kirsher wrote:
>
> You're correct, I apologize.  This was my fault, I applied your v1 of the
> patch and then realized there was a v2.
>
> I will re-send the series with the correct patch.

Okay. I haven't seen [0] in the series. Did you merge it somewhere?

[0] http://thread.gmane.org/gmane.linux.drivers.e1000.devel/10019

Sebastian

^ permalink raw reply

* Re: [PATCH 06/15] batman-adv: Distributed ARP Table - add snooping functions for ARP messages
From: Marek Lindner @ 2012-05-17 11:53 UTC (permalink / raw)
  To: b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <201205121626.38520.lindner_marek-LWAfsSFWpa4@public.gmane.org>


David,

> On Tuesday, May 01, 2012 08:59:04 David Miller wrote:
> > From: Antonio Quartulli <ordex-GaUfNO9RBHfsrOwW+9ziJQ@public.gmane.org>
> > Date: Tue, 1 May 2012 00:22:30 +0200
> > 
> > > However this patch also contains a procedure which queries the neigh
> > > table in order to understand whether a given host is known or not.
> > > Would it be possible to do that in another way (Without manually
> > > touching the table)?
> > > 
> > > Instead, in the next patch (patch 06/15) batman-adv manually increases
> > > the neigh timeouts. Do you think we should avoid doing that as well?
> > > If we are allowed to do that, how can we perform the same operation in
> > > a cleaner way?
> > > 
> > > Last question: why can't other modules use exported functions? Are you
> > > going to change them as well?
> > 
> > I really don't have time to discuss your neigh issues right now as I'm
> > busy speaking at conferences and dealing with the backlog of other
> > patches.
> > 
> > You'll need to find someone else to discuss it with you, sorry.
> 
> I hope now is a good moment to bring the questions back onto the table. We
> still are not sure how to proceed because we have no clear picture of what
> is going to come and how the exported functions are supposed to be used.
> 
> David, if you don't have the time to discuss the ARP handling with us could
> you name someone who knows your plans and the code equally well ? So far,
> nobody has stepped up.

Let me add another piece of information: The distributed ARP table does not
really depend on the kernel's ARP table. We can easily write our own backend
to be totally independent of the kernel's ARP table. Initially, we thought it
might be considered a smart move if the code made use of existing kernel
infrastructure instead of writing our own storage / user space API / etc.,
hence duplicating what is already there. But if you feel this is the better
way forward, we certainly will make the necessary changes.

Regards,
Marek

^ permalink raw reply

* Re: [net-next 0/4][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-05-17 11:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, sassmann
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

On Thu, 2012-05-17 at 04:27 -0700, Jeff Kirsher wrote:
> This series of patches contains updates for e1000, e1000e and igb.
> 
> The following are changes since commit dc6b9b78234fecdc6d2ca5e1629185718202bcf5:
>   net: include/net/sock.h cleanup
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master
> 
> Bruce Allan (1):
>   e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS
> 
> Matthew Vick (1):
>   igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.
> 
> Sebastian Andrzej Siewior (2):
>   e1000: remove workaround for Errata 23 from jumbo alloc
>   e1000: look in the page and not in skb->data for the last byte
> 
>  drivers/net/ethernet/intel/e1000/e1000_main.c  |   30 ++++--------------------
>  drivers/net/ethernet/intel/e1000e/defines.h    |    2 +-
>  drivers/net/ethernet/intel/igb/e1000_defines.h |    2 +
>  drivers/net/ethernet/intel/igb/igb_main.c      |    3 ++
>  4 files changed, 11 insertions(+), 26 deletions(-)
> 

v2 of the series will be coming.

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 11:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <4FB4E34F.2050004@linutronix.de>

On Thu, 2012-05-17 at 13:38 +0200, Sebastian Andrzej Siewior wrote:
> On 05/17/2012 01:27 PM, Jeff Kirsher wrote:
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > index fefbf4d..6ac80c8 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > @@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >   		/* errors is only valid for DD + EOP descriptors */
> >   		if (unlikely((status & E1000_RXD_STAT_EOP) &&
> >   		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
> > -			u8 last_byte = *(skb->data + length - 1);
> > +			u8 *mapped;
> > +			u8 last_byte;
> > +
> > +			mapped = kmap_atomic(buffer_info->page);
> > +			last_byte = *(mapped + length - 1);
> >   			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
> >   				       last_byte)) {
> >   				spin_lock_irqsave(&adapter->stats_lock,
> 
> This is not what I sent. My original patch [0] had an unmap as well.
> One comment was that kmap_atomic() is too much overhead because the
> page can never be highmem. So I changed it to page_address() [1].
> 
> [0] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10008
> [1] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10012
> 
> Sebastian

You're correct, I apologize.  This was my fault, I applied your v1 of the
patch and then realized there was a v2.

I will re-send the series with the correct patch.

^ permalink raw reply

* Re: [PATCH 1/1] smsc95xx: add FLAG_POINTTOPOINT flag for driver_info
From: Ben Hutchings @ 2012-05-17 11:45 UTC (permalink / raw)
  To: Xiao Jiang; +Cc: steve.glendinning, gregkh, netdev, linux-usb, linux-kernel
In-Reply-To: <4FB4B98E.7000208@gmail.com>

On Thu, 2012-05-17 at 16:40 +0800, Xiao Jiang wrote:
> Ben Hutchings wrote:
> > On Wed, 2012-05-16 at 16:01 +0800, jgq516@gmail.com wrote:
> >   
> >> From: Xiao Jiang <jgq516@gmail.com>
> >>
> >> commit c26134 introduced FLAG_POINTTOPOINT flag for USB ethernet devices
> >> which possibly use "usb%d" names, add this flag to make sure pandaboard
> >> can mount nfs with smsc95xx NIC.
> >>     
> >
> > These are normal Ethernet interfaces, whereas FLAG_POINTTOPOINT is for
> > devices that use non-standard short physical links.
> >
> >   
> This flag is used by some USB NICs. I'm not familiar with those cards;
> perhaps they use non-standard short physical links as you said.
> But smsc95xx seems to need this flag to use the "usb%d" name,

But this is a regular Ethernet interface and should be named
accordingly.

> at least my pandaboard can't mount NFS with the eth0 name. Is there
> another way to avoid the NFS issue while keeping smsc95xx's name
> unchanged? Thanks.
[...]

I don't know what this NFS issue is, but I don't see how this can be the
correct solution.

Ben.

-- 
Ben Hutchings
Every program is either trivial or else contains at least one bug

^ permalink raw reply

* Re: [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Sebastian Andrzej Siewior @ 2012-05-17 11:38 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, sassmann
In-Reply-To: <1337254070-32500-4-git-send-email-jeffrey.t.kirsher@intel.com>

On 05/17/2012 01:27 PM, Jeff Kirsher wrote:
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index fefbf4d..6ac80c8 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>   		/* errors is only valid for DD + EOP descriptors */
>   		if (unlikely((status & E1000_RXD_STAT_EOP) &&
>   		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
> -			u8 last_byte = *(skb->data + length - 1);
> +			u8 *mapped;
> +			u8 last_byte;
> +
> +			mapped = kmap_atomic(buffer_info->page);
> +			last_byte = *(mapped + length - 1);
>   			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
>   				       last_byte)) {
>   				spin_lock_irqsave(&adapter->stats_lock,

This is not what I sent. My original patch [0] had an unmap as well.
One comment was that kmap_atomic() is too much overhead because the
page can never be highmem. So I changed it to page_address() [1].

[0] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10008
[1] http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/10012

Sebastian

^ permalink raw reply

* [net-next 2/4] e1000: remove workaround for Errata 23 from jumbo alloc
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

According to the comment, errata 23 says that the memory we allocate
can't cross a 64KiB boundary. In case of jumbo frames we allocate
complete pages which can never cross the 64KiB boundary because
64KiB should be a multiple of PAGE_SIZE so we stop either before the
boundary or start after it but never cross it. Furthermore the check
seems bogus because it looks at skb->data which is not seen by the HW
at all because we only pass the DMA address of the page we allocated. So
I *think* the workaround is not required here.
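
The boundary argument above can be checked with a few lines of arithmetic.
The sketch below is illustrative only: crosses_64k() is a hypothetical
helper, not the driver's e1000_check_64k_bound(), and a 4 KiB page size is
assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: true if the byte range [start, start + len) spans
 * a 64 KiB boundary. Assumes len > 0.
 */
static bool crosses_64k(uintptr_t start, size_t len)
{
	uintptr_t last = start + len - 1;

	/* Bits 16 and up select the 64 KiB window; they differ iff we cross. */
	return (start >> 16) != (last >> 16);
}
```

With 4096-byte pages, every page-aligned, page-sized buffer ends at or
before the next 64 KiB boundary, so crosses_64k(addr, 4096) is false for
any addr that is a multiple of 4096. An unaligned buffer such as
crosses_64k(0x1F800, 4096) does cross.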

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |   24 ------------------------
 1 files changed, 0 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f1aef68..fefbf4d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4391,30 +4391,6 @@ e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
 			break;
 		}
 
-		/* Fix for errata 23, can't cross 64kB boundary */
-		if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-			struct sk_buff *oldskb = skb;
-			e_err(rx_err, "skb align check failed: %u bytes at "
-			      "%p\n", bufsz, skb->data);
-			/* Try again, without freeing the previous */
-			skb = netdev_alloc_skb_ip_align(netdev, bufsz);
-			/* Failed allocation, critical failure */
-			if (!skb) {
-				dev_kfree_skb(oldskb);
-				adapter->alloc_rx_buff_failed++;
-				break;
-			}
-
-			if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-				/* give up */
-				dev_kfree_skb(skb);
-				dev_kfree_skb(oldskb);
-				break; /* while (cleaned_count--) */
-			}
-
-			/* Use new allocation */
-			dev_kfree_skb(oldskb);
-		}
 		buffer_info->skb = skb;
 		buffer_info->length = adapter->rx_buffer_len;
 check_page:
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 3/4] e1000: look in the page and not in skb->data for the last byte
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Sebastian Andrzej Siewior, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The code seems to want to look at the last byte where the HW puts some
information. Since the skb->data area is never seen by the HW I guess it
does not work as expected. We pass the page address to the HW so I
*think* in order to get to the last byte where the information might be
one should use the page buffer and take a look.
This is of course not more than just compile tested.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index fefbf4d..6ac80c8 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4066,7 +4066,11 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 		/* errors is only valid for DD + EOP descriptors */
 		if (unlikely((status & E1000_RXD_STAT_EOP) &&
 		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
-			u8 last_byte = *(skb->data + length - 1);
+			u8 *mapped;
+			u8 last_byte;
+
+			mapped = kmap_atomic(buffer_info->page);
+			last_byte = *(mapped + length - 1);
 			if (TBI_ACCEPT(hw, status, rx_desc->errors, length,
 				       last_byte)) {
 				spin_lock_irqsave(&adapter->stats_lock,
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 4/4] igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Matthew Vick, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Matthew Vick <matthew.vick@intel.com>

Under certain scenarios, it's possible that bursty manageability traffic
over the BMC-to-OS path may overrun the internal manageability receive
buffer causing dropped manageability packets. Clearing this bit prevents
this situation by interrupting coalescing to allow manageability traffic
through.
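
As a rough illustration of the register update this patch performs, the
sketch below models the read-modify-write on a plain 32-bit value. The two
masks are the values from the igb defines in this patch (written here with
a `u` suffix); dmacr_clear_bmc2osw() is a hypothetical stand-in and does
not touch hardware.

```c
#include <stdint.h>

#define E1000_DMACR_DMAC_EN		0x80000000u /* Enable DMA Coalescing */
#define E1000_DMACR_DC_BMC2OSW_EN	0x00008000u /* BMC-to-OS Watchdog Enable */

/*
 * Hypothetical model of the update in igb_init_dmac(): takes the value
 * built so far and returns what would be written back to E1000_DMACR
 * with the BMC-to-OS watchdog bit cleared and all other bits untouched.
 */
static uint32_t dmacr_clear_bmc2osw(uint32_t reg)
{
	return reg & ~E1000_DMACR_DC_BMC2OSW_EN;
}
```

For example, a value with coalescing enabled, the watchdog bit set and the
watchdog timer field at 1000 >> 5 (31) is 0x8000801F; after the clear it
becomes 0x8000001F.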

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |    2 ++
 drivers/net/ethernet/intel/igb/igb_main.c      |    3 +++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 6409f85..ec7e4fe 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -301,6 +301,8 @@
 							* transactions */
 #define E1000_DMACR_DMAC_LX_SHIFT       28
 #define E1000_DMACR_DMAC_EN             0x80000000 /* Enable DMA Coalescing */
+/* DMA Coalescing BMC-to-OS Watchdog Enable */
+#define E1000_DMACR_DC_BMC2OSW_EN	0x00008000
 
 #define E1000_DMCTXTH_DMCTTHR_MASK      0x00000FFF /* DMA Coalescing Transmit
 							* Threshold */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 9bbf1a2..dd3bfe8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -7147,6 +7147,9 @@ static void igb_init_dmac(struct igb_adapter *adapter, u32 pba)
 
 			/* watchdog timer= +-1000 usec in 32usec intervals */
 			reg |= (1000 >> 5);
+
+			/* Disable BMC-to-OS Watchdog Enable */
+			reg &= ~E1000_DMACR_DC_BMC2OSW_EN;
 			wr32(E1000_DMACR, reg);
 
 			/*
-- 
1.7.7.6

^ permalink raw reply related

* [net-next 0/4][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series of patches contains updates for e1000, e1000e and igb.

The following are changes since commit dc6b9b78234fecdc6d2ca5e1629185718202bcf5:
  net: include/net/sock.h cleanup
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Bruce Allan (1):
  e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS

Matthew Vick (1):
  igb: Disable the BMC-to-OS Watchdog Enable bit for DMAC.

Sebastian Andrzej Siewior (2):
  e1000: remove workaround for Errata 23 from jumbo alloc
  e1000: look in the page and not in skb->data for the last byte

 drivers/net/ethernet/intel/e1000/e1000_main.c  |   30 ++++--------------------
 drivers/net/ethernet/intel/e1000e/defines.h    |    2 +-
 drivers/net/ethernet/intel/igb/e1000_defines.h |    2 +
 drivers/net/ethernet/intel/igb/igb_main.c      |    3 ++
 4 files changed, 11 insertions(+), 26 deletions(-)

-- 
1.7.7.6

^ permalink raw reply

* [net-next 1/4] e1000e: fix typo in definition of E1000_CTRL_EXT_FORCE_SMBUS
From: Jeff Kirsher @ 2012-05-17 11:27 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1337254070-32500-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

This define is needed by i217.

Reported-by: Bjorn Mork <bjorn@mork.no>
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 11c4666..351a409 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -76,7 +76,7 @@
 /* Extended Device Control */
 #define E1000_CTRL_EXT_LPCD  0x00000004     /* LCD Power Cycle Done */
 #define E1000_CTRL_EXT_SDP3_DATA 0x00000080 /* Value of SW Definable Pin 3 */
-#define E1000_CTRL_EXT_FORCE_SMBUS 0x00000004 /* Force SMBus mode*/
+#define E1000_CTRL_EXT_FORCE_SMBUS 0x00000800 /* Force SMBus mode */
 #define E1000_CTRL_EXT_EE_RST    0x00002000 /* Reinitialize from EEPROM */
 #define E1000_CTRL_EXT_SPD_BYPS  0x00008000 /* Speed Select Bypass */
 #define E1000_CTRL_EXT_RO_DIS    0x00020000 /* Relaxed Ordering disable */
-- 
1.7.7.6

^ permalink raw reply related

* [net] e1000: Prevent reset task killing itself.
From: Jeff Kirsher @ 2012-05-17 11:04 UTC (permalink / raw)
  To: davem; +Cc: Tushar Dave, netdev, gospo, sassmann, stable, Jeff Kirsher

From: Tushar Dave <tushar.n.dave@intel.com>

Killing reset task while adapter is resetting causes deadlock.
Only kill reset task if adapter is not resetting.
Ref bug #43132 on bugzilla.kernel.org

CC: stable@vger.kernel.org
Signed-off-by: Tushar Dave <tushar.n.dave@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

@stable - this patch is applicable back to 3.1 kernels

---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 37caa88..8d8908d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -493,7 +493,11 @@ out:
 static void e1000_down_and_stop(struct e1000_adapter *adapter)
 {
 	set_bit(__E1000_DOWN, &adapter->flags);
-	cancel_work_sync(&adapter->reset_task);
+
+	/* Only kill reset task if adapter is not resetting */
+	if (!test_bit(__E1000_RESETTING, &adapter->flags))
+		cancel_work_sync(&adapter->reset_task);
+
 	cancel_delayed_work_sync(&adapter->watchdog_task);
 	cancel_delayed_work_sync(&adapter->phy_info_task);
 	cancel_delayed_work_sync(&adapter->fifo_stall_task);
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH net-next] net/mlx4_en: num cores tx rings for every UP
From: Amir Vadai @ 2012-05-17 10:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Oren Duer, Amir Vadai, John Fastabend, Liran Liss

Change the TX ring scheme such that the number of rings for untagged packets
and for tagged packets (per each of the vlan priorities) is the same, unlike
the current situation where tagged traffic gets one ring per priority while
untagged traffic gets as many rings as there are cores.

Queue selection is done as follows:

If the mqprio qdisc operates on the interface, such that the core networking
code invoked the device setup_tc ndo callback, a mapping of skb->priority =>
queue set is forced for both tagged and untagged traffic.

Else, the egress map skb->priority => User priority is used for tagged traffic,
and all untagged traffic is sent through the tx rings of UP 0.
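
The non-mqprio path can be modelled in user space as below. This is a
sketch under assumptions, not the driver code: select_tx_ring() and
ring_to_up() are hypothetical names, `hash % rings_p_up` is a stand-in
for the stack's flow hash reduced to one UP's block of rings, and
rings_p_up is assumed to be non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_UP		8	/* MLX4_EN_NUM_UP in the patch */
#define VLAN_PRIO_SHIFT	13	/* priority is the top 3 bits of the VLAN tag */

/*
 * Hypothetical model: each user priority (UP) owns a contiguous block of
 * rings_p_up rings. Untagged traffic uses the block of UP 0.
 */
static uint16_t select_tx_ring(uint32_t hash, bool has_vlan, uint16_t vlan_tci,
			       uint16_t rings_p_up)
{
	uint8_t up = has_vlan ? (uint8_t)(vlan_tci >> VLAN_PRIO_SHIFT) : 0;

	return (uint16_t)(hash % rings_p_up + up * rings_p_up);
}

/* Inverse mapping, as used when activating rings: which UP owns a ring. */
static uint8_t ring_to_up(uint16_t ring, uint16_t rings_p_up)
{
	return (uint8_t)(ring / rings_p_up);
}
```

With 4 rings per UP, an untagged packet whose hash is 10 lands on ring 2,
while the same hash with a VLAN tag of priority 5 lands on ring 22, which
ring_to_up() maps back to UP 5.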

The patch follows the outcome of discussing that issue with John Fastabend
in this thread: http://comments.gmane.org/gmane.linux.network/229877

Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Liran Liss <liranl@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c   |    6 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   41 +++++++++++++++++------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   15 +++++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    9 ++---
 4 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 346fdb2..988b242 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -101,6 +101,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 	int i;
 
 	params->udp_rss = udp_rss;
+	params->num_tx_rings_p_up = min_t(int, num_online_cpus(),
+			MLX4_EN_MAX_TX_RING_P_UP);
 	if (params->udp_rss && !(mdev->dev->caps.flags
 					& MLX4_DEV_CAP_FLAG_UDP_RSS)) {
 		mlx4_warn(mdev, "UDP RSS is not supported on this device.\n");
@@ -113,8 +115,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 		params->prof[i].tx_ppp = pfctx;
 		params->prof[i].tx_ring_size = MLX4_EN_DEF_TX_RING_SIZE;
 		params->prof[i].rx_ring_size = MLX4_EN_DEF_RX_RING_SIZE;
-		params->prof[i].tx_ring_num = MLX4_EN_NUM_TX_RINGS +
-			MLX4_EN_NUM_PPP_RINGS;
+		params->prof[i].tx_ring_num = params->num_tx_rings_p_up *
+			MLX4_EN_NUM_UP;
 		params->prof[i].rss_rings = 0;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index eaa8fad..926d8aa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -47,9 +47,22 @@
 
 static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
-	if (up != MLX4_EN_NUM_UP)
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	int i;
+	unsigned int q, offset = 0;
+
+	if (up && up != MLX4_EN_NUM_UP)
 		return -EINVAL;
 
+	netdev_set_num_tc(dev, up);
+
+	/* Partition Tx queues evenly amongst UP's */
+	q = priv->tx_ring_num / up;
+	for (i = 0; i < up; i++) {
+		netdev_set_tc_queue(dev, i, q, offset);
+		offset += q;
+	}
+
 	return 0;
 }
 
@@ -661,7 +674,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = &priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-				max(0, i - MLX4_EN_NUM_TX_RINGS));
+			i / priv->mdev->profile.num_tx_rings_p_up);
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -986,6 +999,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 
 	mlx4_en_free_resources(priv);
 
+	kfree(priv->tx_ring);
+	kfree(priv->tx_cq);
+
 	free_netdev(dev);
 }
 
@@ -1091,6 +1107,18 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	priv->ctrl_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
 			MLX4_WQE_CTRL_SOLICITED);
 	priv->tx_ring_num = prof->tx_ring_num;
+	priv->tx_ring = kzalloc(sizeof(struct mlx4_en_tx_ring) *
+			priv->tx_ring_num, GFP_KERNEL);
+	if (!priv->tx_ring) {
+		err = -ENOMEM;
+		goto out;
+	}
+	priv->tx_cq = kzalloc(sizeof(struct mlx4_en_cq) * priv->tx_ring_num,
+			GFP_KERNEL);
+	if (!priv->tx_cq) {
+		err = -ENOMEM;
+		goto out;
+	}
 	priv->rx_ring_num = prof->rx_ring_num;
 	priv->mac_index = -1;
 	priv->msg_enable = MLX4_EN_MSG_LEVEL;
@@ -1138,15 +1166,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
-	netdev_set_num_tc(dev, MLX4_EN_NUM_UP);
-
-	/* First 9 rings are for UP 0 */
-	netdev_set_tc_queue(dev, 0, MLX4_EN_NUM_TX_RINGS + 1, 0);
-
-	/* Partition Tx queues evenly amongst UP's 1-7 */
-	for (i = 1; i < MLX4_EN_NUM_UP; i++)
-		netdev_set_tc_queue(dev, i, 1, MLX4_EN_NUM_TX_RINGS + i);
-
 	SET_ETHTOOL_OPS(dev, &mlx4_en_ethtool_ops);
 
 	/* Set defualt MAC */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 9a38483..019d856 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -525,14 +525,17 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
-	u16 vlan_tag = 0;
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	u16 rings_p_up = priv->mdev->profile.num_tx_rings_p_up;
+	u8 up = 0;
 
-	if (vlan_tx_tag_present(skb)) {
-		vlan_tag = vlan_tx_tag_get(skb);
-		return MLX4_EN_NUM_TX_RINGS + (vlan_tag >> 13);
-	}
+	if (dev->num_tc)
+		return skb_tx_hash(dev, skb);
+
+	if (vlan_tx_tag_present(skb))
+		up = vlan_tx_tag_get(skb) >> VLAN_PRIO_SHIFT;
 
-	return skb_tx_hash(dev, skb);
+	return __skb_tx_hash(dev, skb, rings_p_up) + up * rings_p_up;
 }
 
 static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 5d87637..6ae3509 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -111,9 +111,7 @@ enum {
 #define MLX4_EN_MIN_TX_SIZE	(4096 / TXBB_SIZE)
 
 #define MLX4_EN_SMALL_PKT_SIZE		64
-#define MLX4_EN_NUM_TX_RINGS		8
-#define MLX4_EN_NUM_PPP_RINGS		8
-#define MAX_TX_RINGS			(MLX4_EN_NUM_TX_RINGS + MLX4_EN_NUM_PPP_RINGS)
+#define MLX4_EN_MAX_TX_RING_P_UP	32
 #define MLX4_EN_NUM_UP			8
 #define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
@@ -339,6 +337,7 @@ struct mlx4_en_profile {
 	u32 active_ports;
 	u32 small_pkt_int;
 	u8 no_reset;
+	u8 num_tx_rings_p_up;
 	struct mlx4_en_port_profile prof[MLX4_MAX_PORTS + 1];
 };
 
@@ -477,9 +476,9 @@ struct mlx4_en_priv {
 	u16 num_frags;
 	u16 log_rx_info;
 
-	struct mlx4_en_tx_ring tx_ring[MAX_TX_RINGS];
+	struct mlx4_en_tx_ring *tx_ring;
 	struct mlx4_en_rx_ring rx_ring[MAX_RX_RINGS];
-	struct mlx4_en_cq tx_cq[MAX_TX_RINGS];
+	struct mlx4_en_cq *tx_cq;
 	struct mlx4_en_cq rx_cq[MAX_RX_RINGS];
 	struct work_struct mcast_task;
 	struct work_struct mac_task;
-- 
1.7.8.2
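
For readers following the ring layout in the patch above, here is a rough userspace sketch of the index arithmetic in the reworked mlx4_en_select_queue(). It assumes each user priority (UP) owns rings_p_up consecutive Tx rings. The helper name select_ring() and the plain modulo on a flow hash are illustrative stand-ins only; the driver itself calls __skb_tx_hash(), which scales the hash differently.

```c
#include <stdint.h>

#define VLAN_PRIO_SHIFT 13	/* priority bits are the top 3 bits of the VLAN TCI */

/*
 * Hypothetical helper (not driver code): map a flow hash and a VLAN TCI
 * to a Tx ring index, assuming every user priority owns rings_p_up
 * consecutive rings.
 */
static uint16_t select_ring(uint32_t flow_hash, uint16_t vlan_tci,
			    int vlan_present, uint16_t rings_p_up)
{
	uint8_t up = 0;

	if (vlan_present)
		up = vlan_tci >> VLAN_PRIO_SHIFT;

	/* Plain modulo stands in for __skb_tx_hash() in this sketch. */
	return (uint16_t)(flow_hash % rings_p_up + up * rings_p_up);
}
```

With rings_p_up = 4, an untagged flow lands in rings 0-3, while a flow tagged with priority 3 lands in rings 12-15.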

^ permalink raw reply related

* Re: [PATCH v4 6/6] net: sh_eth: use NAPI
From: Francois Romieu @ 2012-05-17 10:33 UTC (permalink / raw)
  To: Shimoda, Yoshihiro; +Cc: netdev, SH-Linux
In-Reply-To: <4FB32D17.30404@renesas.com>

Shimoda, Yoshihiro <yoshihiro.shimoda.uh@renesas.com> :
[...]
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index c64a31c..edc7dfe 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
[...]
> +static int sh_eth_poll(struct napi_struct *napi, int budget)
> +{
> +	struct sh_eth_private *mdp = container_of(napi, struct sh_eth_private,
> +						  napi);
> +	struct net_device *ndev = mdp->ndev;
> +	struct sh_eth_cpu_data *cd = mdp->cd;
> +	int work_done = 0, txfree_num;
> +	u32 intr_status = sh_eth_read(ndev, EESR);
> +
> +	/* Clear interrupt flags */
> +	sh_eth_write(ndev, intr_status, EESR);
> +
> +	/* check txdesc */
> +	txfree_num = sh_eth_txfree(ndev);

[...]
> @@ -1678,19 +1710,15 @@ static int sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	struct sh_eth_private *mdp = netdev_priv(ndev);
>  	struct sh_eth_txdesc *txdesc;
>  	u32 entry;
> -	unsigned long flags;
> 
> -	spin_lock_irqsave(&mdp->lock, flags);
>  	if ((mdp->cur_tx - mdp->dirty_tx) >= (mdp->num_tx_ring - 4)) {
>  		if (!sh_eth_txfree(ndev)) {

There are now two racing sh_eth_txfree and there is no [PATCH v4 7/6].

If I may suggest a slightly different approach, I would apply the patch
below before anything NAPI related:

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index d63e09b..6d77462 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -1495,18 +1495,6 @@ static int sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	u32 entry;
 	unsigned long flags;
 
-	spin_lock_irqsave(&mdp->lock, flags);
-	if ((mdp->cur_tx - mdp->dirty_tx) >= (TX_RING_SIZE - 4)) {
-		if (!sh_eth_txfree(ndev)) {
-			if (netif_msg_tx_queued(mdp))
-				dev_warn(&ndev->dev, "TxFD exhausted.\n");
-			netif_stop_queue(ndev);
-			spin_unlock_irqrestore(&mdp->lock, flags);
-			return NETDEV_TX_BUSY;
-		}
-	}
-	spin_unlock_irqrestore(&mdp->lock, flags);
-
 	entry = mdp->cur_tx % TX_RING_SIZE;
 	mdp->tx_skbuff[entry] = skb;
 	txdesc = &mdp->tx_ring[entry];
@@ -1531,6 +1519,14 @@ static int sh_eth_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	if (!(sh_eth_read(ndev, EDTRR) & sh_eth_get_edtrr_trns(mdp)))
 		sh_eth_write(ndev, sh_eth_get_edtrr_trns(mdp), EDTRR);
 
+	spin_lock_irqsave(&mdp->lock, flags);
+	if ((mdp->cur_tx - mdp->dirty_tx) >= (TX_RING_SIZE - 4)) {
+		if (netif_msg_tx_queued(mdp))
+			dev_warn(&ndev->dev, "TxFD exhausted.\n");
+		netif_stop_queue(ndev);
+	}
+	spin_unlock_irqrestore(&mdp->lock, flags);
+
 	return NETDEV_TX_OK;
 }
 

Rationale: the driver does not need to return NETDEV_TX_BUSY. Instead, it
should stop the queue after accepting the current packet, which signals
that it will not handle more packets. You may add an extra check at the
start of sh_eth_start_xmit() that returns NETDEV_TX_BUSY, but it should
be understood as a debug / bug helper only.

Then you can convert to a {start/stop} queue race free NAPI with adequate
barriers.
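
A minimal userspace model of the approach described above may help. It is only a sketch: struct tx_ring, ring_xmit() and ring_reclaim() are hypothetical names, the `stopped` flag stands in for netif_stop_queue() / netif_wake_queue(), and locking is left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_SIZE	64
#define TX_RESERVE	4	/* same "- 4" headroom as in the patch */

struct tx_ring {
	uint32_t cur_tx;	/* next slot the transmit path fills */
	uint32_t dirty_tx;	/* oldest slot not yet reclaimed */
	bool stopped;		/* models the stopped state of the queue */
};

/*
 * Accept one packet, then stop the queue if the ring is nearly full.
 * Returning false models the debug-only NETDEV_TX_BUSY path: it is
 * reached only if the caller ignores the stopped state.
 */
static bool ring_xmit(struct tx_ring *r)
{
	if (r->stopped)
		return false;

	r->cur_tx++;
	if (r->cur_tx - r->dirty_tx >= TX_RING_SIZE - TX_RESERVE)
		r->stopped = true;
	return true;
}

/* Completion path: reclaim up to n descriptors, restart the queue if there is room. */
static void ring_reclaim(struct tx_ring *r, uint32_t n)
{
	uint32_t in_flight = r->cur_tx - r->dirty_tx;

	if (n > in_flight)
		n = in_flight;
	r->dirty_tx += n;
	if (r->stopped && r->cur_tx - r->dirty_tx < TX_RING_SIZE - TX_RESERVE)
		r->stopped = false;
}
```

In this model the transmit path never has to refuse a packet in normal operation, because the queue is stopped as soon as the last safe slot has been used.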

-- 
Ueimor

^ permalink raw reply related

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: KAMEZAWA Hiroyuki @ 2012-05-17 10:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrew Morton, cgroups, linux-mm, devel, netdev, Tejun Heo,
	Li Zefan, Johannes Weiner, Michal Hocko
In-Reply-To: <4FB4D14D.4020303-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

(2012/05/17 19:22), Glauber Costa wrote:

> On 05/17/2012 02:18 PM, KAMEZAWA Hiroyuki wrote:
>> (2012/05/17 18:52), Glauber Costa wrote:
>>
>>> On 05/17/2012 09:37 AM, Andrew Morton wrote:
>>>>>>   If that happens, locking in static_key_slow_inc will prevent any damage.
>>>>>>   My previous version had explicit code to prevent that, but we were
>>>>>>   pointed out that this is already part of the static_key expectations, so
>>>>>>   that was dropped.
>>>> This makes no sense.  If two threads run that code concurrently,
>>>> key->enabled gets incremented twice.  Nobody anywhere has a record that
>>>> this happened so it cannot be undone.  key->enabled is now in an
>>>> unknown state.
>>>
>>> Kame, Tejun,
>>>
>>> Andrew is right. It seems we will need that mutex after all. Just this
>>> is not a race, and neither something that should belong in the
>>> static_branch interface.
>>>
>>
>>
>> Hmm....how about having
>>
>> res_counter_xchg_limit(res,&old_limit, new_limit);
>>
>> if (!cg_proto->updated&&  old_limit == RESOURCE_MAX)
>> 	....update labels...
>>
>> Then, no mutex overhead maybe and activated will be updated only once.
>> Ah, but please fix in a way you like. Above is an example.
> 
> I think a mutex is a lot cleaner than adding a new function to the 
> res_counter interface.
> 
> We could do a counter, and then later decrement the key until the 
> counter reaches zero, but between those two, I still think a mutex here 
> is preferable.
> 
> Only that, instead of coming up with a mutex of ours, we could export 
> and reuse set_limit_mutex from memcontrol.c
> 


ok, please.

thx,
-Kame

> 
>> Thanks,
>> -Kame
>> (*) I'm sorry I won't be able to read e-mails, tomorrow.
>>
> Ok Kame. I am not in a terrible hurry to fix this, it doesn't seem to be 
> hurting any real workload.
> 
> 

^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: Glauber Costa @ 2012-05-17 10:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, cgroups, linux-mm, devel, netdev, Tejun Heo,
	Li Zefan, Johannes Weiner, Michal Hocko
In-Reply-To: <4FB4D061.10406-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

On 05/17/2012 02:18 PM, KAMEZAWA Hiroyuki wrote:
> (2012/05/17 18:52), Glauber Costa wrote:
>
>> On 05/17/2012 09:37 AM, Andrew Morton wrote:
>>>>>   If that happens, locking in static_key_slow_inc will prevent any damage.
>>>>>   My previous version had explicit code to prevent that, but we were
>>>>>   pointed out that this is already part of the static_key expectations, so
>>>>>   that was dropped.
>>> This makes no sense.  If two threads run that code concurrently,
>>> key->enabled gets incremented twice.  Nobody anywhere has a record that
>>> this happened so it cannot be undone.  key->enabled is now in an
>>> unknown state.
>>
>> Kame, Tejun,
>>
>> Andrew is right. It seems we will need that mutex after all. Just this
>> is not a race, and neither something that should belong in the
>> static_branch interface.
>>
>
>
> Hmm....how about having
>
> res_counter_xchg_limit(res,&old_limit, new_limit);
>
> if (!cg_proto->updated&&  old_limit == RESOURCE_MAX)
> 	....update labels...
>
> Then, no mutex overhead maybe and activated will be updated only once.
> Ah, but please fix in a way you like. Above is an example.

I think a mutex is a lot cleaner than adding a new function to the 
res_counter interface.

We could do a counter, and then later decrement the key until the 
counter reaches zero, but between those two, I still think a mutex here 
is preferable.

Only that, instead of coming up with a mutex of ours, we could export 
and reuse set_limit_mutex from memcontrol.c


> Thanks,
> -Kame
> (*) I'm sorry I won't be able to read e-mails, tomorrow.
>
Ok Kame. I am not in a terrible hurry to fix this, it doesn't seem to be 
hurting any real workload.

^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: KAMEZAWA Hiroyuki @ 2012-05-17 10:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrew Morton, cgroups, linux-mm, devel, netdev, Tejun Heo,
	Li Zefan, Johannes Weiner, Michal Hocko
In-Reply-To: <4FB4CA4D.50608@parallels.com>

(2012/05/17 18:52), Glauber Costa wrote:

> On 05/17/2012 09:37 AM, Andrew Morton wrote:
>>>>  If that happens, locking in static_key_slow_inc will prevent any damage.
>>>>  My previous version had explicit code to prevent that, but we were
>>>>  pointed out that this is already part of the static_key expectations, so
>>>>  that was dropped.
>> This makes no sense.  If two threads run that code concurrently,
>> key->enabled gets incremented twice.  Nobody anywhere has a record that
>> this happened so it cannot be undone.  key->enabled is now in an
>> unknown state.
> 
> Kame, Tejun,
> 
> Andrew is right. It seems we will need that mutex after all. Just this 
> is not a race, and neither something that should belong in the 
> static_branch interface.
> 


Hmm....how about having

res_counter_xchg_limit(res, &old_limit, new_limit);

if (!cg_proto->updated && old_limit == RESOURCE_MAX)
	....update labels...

Then, no mutex overhead maybe and activated will be updated only once.
Ah, but please fix in a way you like. Above is an example.
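
To make the exchange idea concrete, here is a userspace sketch using C11 atomics. Note that res_counter_xchg_limit() is only a proposal in this mail, so everything below is an assumption: struct counter, set_limit_first_time() and the RESOURCE_MAX stand-in are hypothetical names, and the extra cg_proto->updated check is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RESOURCE_MAX UINT64_MAX	/* stand-in for "no limit set" */

struct counter {
	_Atomic uint64_t limit;
};

/*
 * Swap in the new limit and report whether this call made the
 * transition from "unlimited" to "limited". Because the exchange is
 * atomic, two concurrent callers cannot both observe RESOURCE_MAX as
 * the old value.
 */
static bool set_limit_first_time(struct counter *c, uint64_t new_limit)
{
	uint64_t old = atomic_exchange(&c->limit, new_limit);

	return old == RESOURCE_MAX && new_limit != RESOURCE_MAX;
}
```

Only the caller that gets `true` would go on to update the labels. A limit that is later reset to RESOURCE_MAX and set again would report `true` a second time, which is presumably why the snippet above also tests an `updated` flag.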

Thanks,
-Kame
(*) I'm sorry I won't be able to read e-mails, tomorrow.




^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: Glauber Costa @ 2012-05-17  9:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, linux-mm, devel, kamezawa.hiroyu, netdev, Tejun Heo,
	Li Zefan, Johannes Weiner, Michal Hocko
In-Reply-To: <20120516223715.5d1b4385.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

On 05/17/2012 09:37 AM, Andrew Morton wrote:
>> >  If that happens, locking in static_key_slow_inc will prevent any damage.
>> >  My previous version had explicit code to prevent that, but we were
>> >  pointed out that this is already part of the static_key expectations, so
>> >  that was dropped.
> This makes no sense.  If two threads run that code concurrently,
> key->enabled gets incremented twice.  Nobody anywhere has a record that
> this happened so it cannot be undone.  key->enabled is now in an
> unknown state.

Kame, Tejun,

Andrew is right. It seems we will need that mutex after all. It is just
that this is not a race, nor something that should belong in the
static_branch interface.

We want to make sure that enabled is not updated before the jump label
update, because we need a specific ordering guarantee at the patched
sites. And *that*, the interface guarantees, and we were wrong to
believe it did not. That is a correctness issue for the accounting, and
that part is right.

But when we disarm it, we'll need to make sure that happened only once, 
otherwise we may never unpatch it. That, or we'd need that to be a 
counter. The jump label interface does not - and should not - keep track 
of how many updates happened to a key. That's the role of whoever is 
using it.
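
The point about the user of the key tracking its own state can be sketched in userspace as follows. This is an illustration under stated assumptions, not the patch: struct proto_acct, acct_activate() and acct_destroy() are hypothetical names, key_enabled stands in for key->enabled, and a single pthread mutex stands in for whatever lock the kernel code ends up using.

```c
#include <pthread.h>
#include <stdbool.h>

/* One shared lock serializes every arm/disarm decision. */
static pthread_mutex_t limit_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for key->enabled: counts the groups holding the key. */
static int key_enabled;

struct proto_acct {
	bool active;	/* does this group currently hold its one reference? */
};

/* Arm the key for this group, at most once however often it is called. */
static void acct_activate(struct proto_acct *p)
{
	pthread_mutex_lock(&limit_lock);
	if (!p->active) {
		p->active = true;
		key_enabled++;
	}
	pthread_mutex_unlock(&limit_lock);
}

/* Disarm at destroy time, again at most once per group. */
static void acct_destroy(struct proto_acct *p)
{
	pthread_mutex_lock(&limit_lock);
	if (p->active) {
		p->active = false;
		key_enabled--;
	}
	pthread_mutex_unlock(&limit_lock);
}
```

Because the flag is tested and changed under the lock, repeated or concurrent calls for the same group change the count only once in each direction, so the count can always return to zero.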

If you agree with the above, I'll send this patch again with the correction.

Andrew, thank you very much. Do you spot anything else here?

^ permalink raw reply

* Re: [PATCH 1/1] smsc95xx: add FLAG_POINTTOPOINT flag for driver_info
From: Xiao Jiang @ 2012-05-17  9:51 UTC (permalink / raw)
  To: Ming Lei; +Cc: steve.glendinning, gregkh, netdev, linux-usb, linux-kernel
In-Reply-To: <CACVXFVPLf9+8qQKgkikexq3ao=b9fM4jOCasMWVJrbZEVSj_Tg@mail.gmail.com>

Ming Lei wrote:
> On Thu, May 17, 2012 at 10:23 AM, Xiao Jiang <jgq516@gmail.com> wrote:
>   
>> Ming Lei wrote:
>>     
>>> On Wed, May 16, 2012 at 4:01 PM,  <jgq516@gmail.com> wrote:
>>>
>>>       
>>>> From: Xiao Jiang <jgq516@gmail.com>
>>>>
>>>> commit c26134 introduced FLAG_POINTTOPOINT flag for USB ethernet devices
>>>> which possibly use "usb%d" names, add this flag to make sure pandaboard
>>>> can mount nfs with smsc95xx NIC.
>>>>
>>>>         
>>> Without the flag, I also can mount nfs successfully on my Pandaboard...
>>>       
>
> I always mount nfs in console, and not tried to mount nfs as root fs.
>
>   
>>>       
>> I have pulled latest tree
>> (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> commit 0e93b4b304ae052ba1bc73f6d34a68556fe93429), and enable related options
>> (USB_NET_SMSC95XX,
>> USB_EHCI_HCD and USB_EHCI_HCD_OMAP) with omap2plus_config, However the
>> kernel still can't mount
>> nfs, pls see below infos.
>>
>> [    3.114105] smsc95xx v1.0.4
>> [    4.533752] smsc95xx 1-1.1:1.0: *eth0*: register 'smsc95xx' at
>> usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet, fe:b9:1b:07:8e:d1
>> [  108.854217] VFS: Unable to mount root fs via NFS, trying floppy.
>> [  108.861114] VFS: Cannot open root device "nfs" or unknown-block(2,0):
>> error -6
>> [  108.868713] Please append a correct "root=" boot option; here are the
>> available partitions:
>> [  108.877655] b300         7761920 mmcblk0  driver: mmcblk
>> [  108.883239]   b301           40131 mmcblk0p1
>> 00000000-0000-0000-0000-000000000mmcblk0p1
>> [  108.891662]   b302         7719232 mmcblk0p2
>> 00000000-0000-0000-0000-000000000mmcblk0p2
>> [  108.900146] Kernel panic - not syncing: VFS: Unable to mount root fs on
>> unknown-block(2,0)
>>
>> BTW: I tested it with OMAP4430 ES2.2 pandaboard, the issue can be solved
>> with apply the patch.
>>
>> Is there something which I missed? thanks.
>>     
>
> What is your kernel parameter? Maybe you use 'usb%d' in kernel parameter for
> mounting nfs as root fs. If so, could you try 'eth%d' in kernel cmd?
>
> In fact, smsc95xx is a real LAN interface, and 'eth%d' should be prefered name
> as described in changelog of commit
> c261344d3ce3edac781f9d3c7eabe2e96d8e8fe8(usbnet:use eth%d name for
> known ethernet devices)
>
>   
Thanks for pointing that out, I used the wrong kernel parameter.

Regards,
Xiao
> Thanks,
>   

^ permalink raw reply

* tcp timestamp issues with google servers
From: Miklos Szeredi @ 2012-05-17  9:39 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Sometimes connecting to google.com, gmail.com and other Google servers
fails or takes ages.  When this hits, it hits all Google servers at the
same time and it's persistent.  It never happens with anything other
than Google.  Rebooting helps.  Rarely, it goes away spontaneously.

Apparently google is sometimes replying with an invalid TSecr timestamp
value (smaller than the one sent in the last packet) and this confuses
the Linux TCP stack which either discards the packet or sends a Reset.
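
As a rough illustration of why such a reply gets rejected, here is a sketch of a window check on the echoed timestamp, using 32-bit wraparound arithmetic. The helper name tsecr_plausible() and its parameters are hypothetical, and the exact logic in the kernel may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if seq2 <= seq1 <= seq3, allowing for 32-bit wraparound. */
static bool between(uint32_t seq1, uint32_t seq2, uint32_t seq3)
{
	return seq3 - seq2 >= seq1 - seq2;
}

/*
 * Hypothetical helper: an echoed timestamp (TSecr) is plausible only if
 * it lies between the TSval of our first SYN and our current clock.
 */
static bool tsecr_plausible(uint32_t tsecr, uint32_t first_syn_tsval,
			    uint32_t now)
{
	return between(tsecr, first_syn_tsval, now);
}
```

In the dump below, the first SYN carries TSV=35355050 while the SYN-ACK echoes TSER=35325344, which is older than anything sent on this connection, so a check of this kind fails.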

Network dump attached.

I found only a couple of references to this issue:

http://gotchas.livejournal.com/3028.html

http://groups.google.com/group/comp.os.linux.networking/browse_thread/thread/29f56feded11b42a

Turning off tcp timestamps fixes the issue:

  sysctl -w net.ipv4.tcp_timestamps=0

Not sure why this happens only to me and a very few others.

It appears to be an issue with Google's TCP stack (is it a modified
stack?).  I also considered my network switch (restarting it doesn't
help) or something at the ISP, but those look unlikely.

Any ideas?

Thanks,
Miklos



  1   0.000000 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35355050 TSER=0 WS=5
  2   0.002730 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=0 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184565067 TSER=35325344 WS=6
  3   0.002776 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [RST] Seq=1 Win=0 Len=0
  4   1.001408 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35356052 TSER=0 WS=5
  5   1.004136 74.125.232.226 -> 192.168.28.100 TCP [TCP Previous segment lost] http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184566068 TSER=35325344 WS=6
  6   1.411915 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184566476 TSER=35325344 WS=6
  7   2.011568 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184567076 TSER=35325344 WS=6
  8   3.005400 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35358056 TSER=0 WS=5
  9   3.007972 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184568072 TSER=35325344 WS=6
 10   3.212862 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184568277 TSER=35325344 WS=6
 11   5.612449 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184570677 TSER=35325344 WS=6
 12   7.013405 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35362064 TSER=0 WS=5
 13   7.016627 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184572080 TSER=35325344 WS=6
 14  10.412642 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184575477 TSER=35325344 WS=6
 15  15.029547 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35370080 TSER=0 WS=5
 16  15.032931 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=15638919 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184580097 TSER=35325344 WS=6
 17  31.061400 192.168.28.100 -> 74.125.232.226 TCP 51303 > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSV=35386112 TSER=0 WS=5
 18  31.064538 74.125.232.226 -> 192.168.28.100 TCP [TCP Previous segment lost] http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184596129 TSER=35325344 WS=6
 19  31.416339 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184596480 TSER=35325344 WS=6
 20  32.015998 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184597081 TSER=35325344 WS=6
 21  33.216276 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184598281 TSER=35325344 WS=6
 22  35.616879 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184600681 TSER=35325344 WS=6
 23  40.417065 74.125.232.226 -> 192.168.28.100 TCP http > 51303 [SYN, ACK] Seq=485350292 Ack=1 Win=14180 Len=0 MSS=1430 SACK_PERM=1 TSV=1184605482 TSER=35325344 WS=6

^ permalink raw reply

* [PULL] virtio: last minute fixes for 3.4
From: Michael S. Tsirkin @ 2012-05-17  9:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm, mst, netdev, linux-kernel, virtualization, uobergfe,
	amit.shah, David Miller

The following changes since commit 0e93b4b304ae052ba1bc73f6d34a68556fe93429:

  Merge git://git.kernel.org/pub/scm/virt/kvm/kvm (2012-05-16 14:30:51 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git for_linus

for you to fetch changes up to ec13ee80145ccb95b00e6e610044bbd94a170051:

  virtio_net: invoke softirqs after __napi_schedule (2012-05-17 12:16:38 +0300)

----------------------------------------------------------------
virtio: last minute fixes for 3.4

Here are a couple of last minute virtio fixes for 3.4.
Hope it's not too late yet - I might have tried too hard
to make sure the fix is well tested.

Fixes are by Amit and myself. One fixes module removal,
one fixes suspend of a VM, and the last one fixes the handling
of an out of memory condition.
They are thus very low risk as most people never hit these paths, but they do fix
very annoying problems for people who do use these features.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Amit Shah (2):
      virtio: console: tell host of open ports after resume from s3/s4
      virtio: balloon: let host know of updated balloon size before module removal

Michael S. Tsirkin (1):
      virtio_net: invoke softirqs after __napi_schedule

 drivers/char/virtio_console.c   |    7 +++++++
 drivers/net/virtio_net.c        |    2 ++
 drivers/virtio/virtio_balloon.c |    1 +
 3 files changed, 10 insertions(+), 0 deletions(-)

^ permalink raw reply

* [PATCH 2/2] [net/virtio_net]: make virtio_net support NUMA info
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky
In-Reply-To: <1337246456-30909-1-git-send-email-kernelfans@gmail.com>

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Vhost net uses a separate transfer logic unit on each node.
Virtio net must determine which logic unit it will talk to,
so that we can improve the performance.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/net/virtio_net.c |  425 ++++++++++++++++++++++++++++++++++------------
 1 files changed, 314 insertions(+), 111 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index af8acc8..31abafa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -50,16 +50,32 @@ struct virtnet_stats {
 	u64 rx_packets;
 };
 
+struct napi_info {
+	struct napi_struct napi;
+	struct work_struct enable_napi;
+};
+
+struct vnet_virtio_node {
+	struct virtio_node vnode;
+	int demo_cpu;
+	struct napi_info info;
+	struct delayed_work refill;
+	struct virtnet_info *owner;
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	/* we want to scatter in different host nodes */
+	struct virtqueue **vqs, **rvqs, **svqs;
+	struct virtqueue *cvq;
+	/* we want to scatter in different host nodes */
+	struct vnet_virtio_node **vnet_nodes;
 	struct net_device *dev;
-	struct napi_struct napi;
+
 	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
-
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
@@ -69,9 +85,6 @@ struct virtnet_info {
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
@@ -136,7 +149,6 @@ static void skb_xmit_done(struct virtqueue *svq)
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
-
 	/* We were probably waiting for more output buffers. */
 	netif_wake_queue(vi->dev);
 }
@@ -220,7 +232,8 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	return skb;
 }
 
-static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
+static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb,
+	struct virtqueue *rvq)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	struct page *page;
@@ -234,7 +247,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 			skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		page = virtqueue_get_buf(vi->rvq, &len);
+		page = virtqueue_get_buf(rvq, &len);
 		if (!page) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 skb->dev->name, hdr->mhdr.num_buffers);
@@ -252,7 +265,8 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 	return 0;
 }
 
-static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
+static void receive_buf(struct net_device *dev, void *buf, unsigned int len,
+	struct virtqueue *rvq)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
@@ -283,7 +297,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 			return;
 		}
 		if (vi->mergeable_rx_bufs)
-			if (receive_mergeable(vi, skb)) {
+			if (receive_mergeable(vi, skb, rvq)) {
 				dev_kfree_skb(skb);
 				return;
 			}
@@ -353,7 +367,67 @@ frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+/* todo, this will be redesign, and as a part of exporting host numa info to
+  * guest scheduler  */
+/* fix me, host numa node id directly exposed to guest? */
+
+/* fill in by host */
+static s16 __vapicid_to_vnode[MAX_LOCAL_APIC];
+/* fix me, HOST_NUMNODES is defined by host */
+#define  HOST_NUMNODES  128
+static struct cpumask vnode_to_vcpumask_map[HOST_NUMNODES];
+DECLARE_PER_CPU(int, vcpu_to_vnode_map);
+
+void init_vnode_map(void)
+{
+	int cpu, apicid, vnode;
+	for_each_possible_cpu(cpu) {
+		apicid = cpu_physical_id(cpu);
+		vnode = __vapicid_to_vnode[apicid];
+		per_cpu(vcpu_to_vnode_map, cpu) = vnode;
+	}
+}
+
+struct cpumask *vnode_to_vcpumask(int virtio_node)
+{
+	struct cpumask *msk = &vnode_to_vcpumask_map[virtio_node];
+	return msk;
+}
+
+static int first_vcpu_on_virtio_node(int virtio_node)
+{
+	 struct cpumask *msk = vnode_to_vcpumask(virtio_node);
+	 return cpumask_first(msk);
+}
+
+static int vcpu_to_virtio_node(void)
+{
+	int vnode = __get_cpu_var(vcpu_to_vnode_map);
+	return vnode;
+}
+/* end of todo */
+
+static int virtqueue_pickup(struct virtnet_info *vi, struct virtqueue **vq, int rx)
+{
+	int node;
+	int i;
+	struct vnet_virtio_node *vnnode;
+	node = vcpu_to_virtio_node();
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		if (vnnode->vnode.node_id == node) {
+			if (rx == 0)
+				*vq = vnnode->vnode.svq;
+			else
+				*vq = vnnode->vnode.rvq;
+			return 0;
+		}
+	}
+	*vq = NULL;
+	return -1;
+}
+
+static int add_recvbuf_small(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
@@ -369,15 +443,14 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
 
 	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
-
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_big(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct page *first, *list = NULL;
 	char *p;
@@ -415,7 +488,8 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
+
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
 				first, gfp);
 	if (err < 0)
 		give_pages(vi, first);
@@ -423,7 +497,7 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_mergeable(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct page *page;
 	int err;
@@ -433,8 +507,7 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
 		return -ENOMEM;
 
 	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
-
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, 1, page, gfp);
 	if (err < 0)
 		give_pages(vi, page);
 
@@ -448,18 +521,17 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
  * before we're receiving packets, or from refill_work which is
  * careful to disable receiving (using napi_disable).
  */
-static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
+static bool try_fill_recv(struct virtnet_info *vi, struct virtqueue *rvq, gfp_t gfp)
 {
 	int err;
 	bool oom;
-
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
+			err = add_recvbuf_mergeable(vi, rvq, gfp);
 		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
+			err = add_recvbuf_big(vi, rvq, gfp);
 		else
-			err = add_recvbuf_small(vi, gfp);
+			err = add_recvbuf_small(vi, rvq, gfp);
 
 		oom = err == -ENOMEM;
 		if (err < 0)
@@ -468,31 +540,79 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
 	} while (err > 0);
 	if (unlikely(vi->num > vi->max))
 		vi->max = vi->num;
-	virtqueue_kick(vi->rvq);
+
+	virtqueue_kick(rvq);
 	return !oom;
 }
 
+static void try_fill_all_recv(struct virtnet_info *vi, gfp_t gfp)
+{
+	int i, cpu, err;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		err = try_fill_recv(vi, vnnode->vnode.rvq, gfp);
+		if (!err) {
+			cpu = first_vcpu_on_virtio_node(vnnode->vnode.node_id);
+			queue_delayed_work_on(cpu, system_nrt_wq, &vnnode->refill, 0);
+		}
+	}
+	return;
+}
+
 static void skb_recv_done(struct virtqueue *rvq)
 {
-	struct virtnet_info *vi = rvq->vdev->priv;
+	struct vnet_virtio_node *vnet_node = container_of(rvq->node, struct vnet_virtio_node, vnode);
+	struct napi_struct *napi = &vnet_node->info.napi;
+
 	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&vi->napi)) {
+	if (napi_schedule_prep(napi)) {
 		virtqueue_disable_cb(rvq);
-		__napi_schedule(&vi->napi);
+		__napi_schedule(napi);
 	}
 }
 
-static void virtnet_napi_enable(struct virtnet_info *vi)
+static void virtnet_napi_enable(struct napi_struct *napi, struct virtqueue *rvq)
 {
-	napi_enable(&vi->napi);
+	napi_enable(napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
 	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
 	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
-		__napi_schedule(&vi->napi);
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(rvq);
+		__napi_schedule(napi);
+	}
+}
+
+static void virtnet_napis_disable(struct virtnet_info *vi)
+{
+	int i;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		napi_disable(&vnnode->info.napi);
+	}
+}
+
+static void napi_enable_worker(struct work_struct *work)
+{
+	struct vnet_virtio_node *vnnode = container_of(work,
+		struct vnet_virtio_node, refill.work);
+	struct virtqueue *rvq = vnnode->vnode.rvq;
+	virtnet_napi_enable(&vnnode->info.napi, rvq);
+}
+
+static void virtnet_napis_enable(struct virtnet_info *vi)
+{
+	int i;
+	struct work_struct *work;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		work = &vnnode->info.enable_napi;
+		queue_work_on(vnnode->demo_cpu, system_wq, work);
 	}
 }
 
@@ -500,43 +620,52 @@ static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi;
 	bool still_empty;
+	struct napi_struct *napi;
+	struct virtqueue *rvq;
+	struct vnet_virtio_node *vnnode = container_of(work,
+		struct vnet_virtio_node, refill.work);
 
-	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
+	vi = vnnode->owner;
+	napi = &vnnode->info.napi;
+	rvq = vnnode->vnode.rvq;
+	napi_disable(napi);
+
+	still_empty = !try_fill_recv(vi, rvq, GFP_KERNEL);
+	virtnet_napi_enable(napi, rvq);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
 	if (still_empty)
-		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
+		queue_delayed_work_on(vnnode->demo_cpu, system_nrt_wq, &vnnode->refill, HZ/2);
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
-	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
+	struct virtnet_info *vi;
 	void *buf;
 	unsigned int len, received = 0;
-
+	struct vnet_virtio_node *vnnode = container_of(napi, struct vnet_virtio_node, info.napi);
+	struct virtqueue *rvq = vnnode->vnode.rvq;
+	vi = vnnode->owner;
 again:
 	while (received < budget &&
-	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
-		receive_buf(vi->dev, buf, len);
+	       (buf = virtqueue_get_buf(rvq, &len)) != NULL) {
+		receive_buf(vi->dev, buf, len, rvq);
 		--vi->num;
 		received++;
 	}
 
 	if (vi->num < vi->max / 2) {
-		if (!try_fill_recv(vi, GFP_ATOMIC))
-			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+		if (!try_fill_recv(vi, rvq, GFP_ATOMIC))
+			queue_delayed_work(system_nrt_wq, &vnnode->refill, 0);
 	}
 
 	/* Out of packets? */
 	if (received < budget) {
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
+		if (unlikely(!virtqueue_enable_cb(rvq)) &&
 		    napi_schedule_prep(napi)) {
-			virtqueue_disable_cb(vi->rvq);
+			virtqueue_disable_cb(rvq);
 			__napi_schedule(napi);
 			goto again;
 		}
@@ -545,13 +674,13 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi, struct virtqueue *svq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -565,7 +694,7 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct virtqueue *svq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -608,7 +737,8 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
 	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+
+	return virtqueue_add_buf(svq, vi->tx_sg, hdr->num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
@@ -616,12 +746,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
+	struct virtqueue *svq;
+	virtqueue_pickup(vi, &svq, 0);
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, svq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, svq, skb);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
@@ -640,7 +772,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(svq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -650,12 +782,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
 		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
+		if (unlikely(!virtqueue_enable_cb_delayed(svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
 				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				virtqueue_disable_cb(svq);
 			}
 		}
 	}
@@ -718,20 +850,15 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	napi_schedule(&vi->napi);
+	virtnet_napis_enable(vi);
 }
 #endif
 
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
-
-	virtnet_napi_enable(vi);
+	try_fill_all_recv(vi, GFP_KERNEL);
+	virtnet_napis_enable(vi);
 	return 0;
 }
 
@@ -783,11 +910,10 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	/* Make sure refill_work doesn't re-enable napi! */
-	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->napi);
-
+	int i;
+	for (i = 0; i < vi->vdev->node_cnt; i++)
+		cancel_delayed_work_sync(&vi->vnet_nodes[i]->refill);
+	virtnet_napis_disable(vi);
 	return 0;
 }
 
@@ -897,9 +1023,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
 				struct ethtool_ringparam *ring)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct vnet_virtio_node *vnnode = vi->vnet_nodes[0];
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vnnode->vnode.rvq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vnnode->vnode.svq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
 
@@ -986,29 +1113,61 @@ static void virtnet_config_changed(struct virtio_device *vdev)
 
 static int init_vqs(struct virtnet_info *vi)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
+	struct virtqueue **vqs;
 	const char *names[] = { "input", "output", "control" };
-	int nvqs, err;
-
+	const char **name_array;
+	vq_callback_t **callbacks;
+	int node_cnt, nvqs, err =  -ENOMEM;
+	int i;
 	/* We expect two virtqueues, receive then send,
 	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
+	node_cnt = vi->vdev->node_cnt;
+	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)? node_cnt*2+1 :
+		node_cnt*2;
+	callbacks = kzalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	for (i = 0; i < node_cnt; i++)
+		callbacks[i] = skb_recv_done;
+	for (; i < node_cnt*2; i++)
+		callbacks[i] = skb_xmit_done;
+
+	name_array = kmalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	if ( name_array == NULL)
+		goto free_callbacks;
+
+	for (i = 0; i < node_cnt; i++)
+		name_array[i] = names[0];
+	for (; i <  node_cnt*2; i++)
+		name_array[i] = names[1];
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
+		name_array[i] = names[2];
+
+	vqs = kmalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	if (vqs == NULL)
+		goto free_name;
 
-	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
+	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, name_array);
 	if (err)
-		return err;
+		goto free_vqs;
 
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
+	vi->vqs = vqs;
+	vi->rvqs = vi->vqs;
+	vi->svqs = vi->vqs + vi->vdev->node_cnt;
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
+		vi->cvq = vi->vqs[vi->vdev->node_cnt*2];
 
 		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
 			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
 	}
-	return 0;
+	err = 0;
+free_vqs:
+	if (err)
+		kfree(vqs);
+free_name:
+	kfree(name_array);
+free_callbacks:
+	kfree(callbacks);
+	return err;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
@@ -1016,6 +1175,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	int err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
+	int i, size, cur, prev = 0;
+	struct vnet_virtio_node *vnnode;
 
 	/* Allocate ourselves a network device with room for our info */
 	dev = alloc_etherdev(sizeof(struct virtnet_info));
@@ -1064,7 +1225,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
+
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
@@ -1074,7 +1235,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (vi->stats == NULL)
 		goto free;
 
-	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
 	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
@@ -1086,19 +1246,46 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
-
 	err = init_vqs(vi);
 	if (err)
 		goto free_stats;
 
+	/* The host node that backs each napi_struct is determined by where KVM
+	 * handles the page fault, so allocate them separately.
+	 */
+	vi->vnet_nodes = kmalloc(sizeof(void *) * vi->vdev->node_cnt, GFP_KERNEL);
+	size = PAGE_ALIGN(sizeof(struct vnet_virtio_node));
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = kmalloc(size, GFP_KERNEL);
+		if (vnnode == NULL) {
+			err = -ENOMEM;
+			goto free_napi;
+		}
+		cur = find_next_bit(&vi->vdev->allow_map, 64, prev);
+		prev = cur + 1;
+		vnnode->vnode.node_id = cur;
+		vnnode->owner = vi;
+		vnnode->vnode.rvq = vi->rvqs[i];
+		vnnode->vnode.svq = vi->svqs[i];
+		vnnode->demo_cpu = first_vcpu_on_virtio_node(cur);
+
+		vi->rvqs[i]->node = &vnnode->vnode;
+		vi->svqs[i]->node = &vnnode->vnode;
+
+		INIT_WORK(&vnnode->info.enable_napi, napi_enable_worker);
+		netif_napi_add(dev, &vnnode->info.napi, virtnet_poll, napi_weight);
+		INIT_DELAYED_WORK(&vnnode->refill, refill_work);
+		vi->vnet_nodes[i] = vnnode;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
 		goto free_vqs;
 	}
 
-	/* Last of all, set up some receive buffers. */
-	try_fill_recv(vi, GFP_KERNEL);
+	try_fill_all_recv(vi, GFP_KERNEL);
+
 
 	/* If we didn't even get one input buffer, we're useless. */
 	if (vi->num == 0) {
@@ -1121,6 +1308,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 unregister:
 	unregister_netdev(dev);
+free_napi:
+	while (i-- > 0) {
+		vnnode = vi->vnet_nodes[i];
+		netif_napi_del(&vnnode->info.napi);
+		kfree(vnnode);
+	}
 free_vqs:
 	vdev->config->del_vqs(vdev);
 free_stats:
@@ -1133,32 +1326,39 @@ free:
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
-	}
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rvq);
-		if (!buf)
-			break;
-		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
-		else
+	int i;
+	struct virtqueue *svq, *rvq;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		svq = vi->svqs[i];
+		rvq = vi->rvqs[i];
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
 			dev_kfree_skb(buf);
-		--vi->num;
+		}
+		while (1) {
+			buf = virtqueue_detach_unused_buf(rvq);
+			if (!buf)
+				break;
+			if (vi->mergeable_rx_bufs || vi->big_packets)
+				give_pages(vi, buf);
+			else
+				dev_kfree_skb(buf);
+			--vi->num;
+		}
 	}
 	BUG_ON(vi->num != 0);
 }
 
+
 static void remove_vq_common(struct virtnet_info *vi)
 {
 	vi->vdev->config->reset(vi->vdev);
 
 	/* Free unused buffers in both send and recv, if any. */
 	free_unused_bufs(vi);
-
 	vi->vdev->config->del_vqs(vi->vdev);
 
 	while (vi->pages)
@@ -1172,7 +1372,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 	unregister_netdev(vi->dev);
 
 	remove_vq_common(vi);
-
+	kfree(vi->vqs);
+	kfree(vi->vnet_nodes);
 	free_percpu(vi->stats);
 	free_netdev(vi->dev);
 }
@@ -1181,17 +1382,22 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 static int virtnet_freeze(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
+	int i;
 
-	virtqueue_disable_cb(vi->rvq);
-	virtqueue_disable_cb(vi->svq);
+	for (i = 0; i < vdev->node_cnt; i++) {
+		virtqueue_disable_cb(vi->rvqs[i]);
+		virtqueue_disable_cb(vi->svqs[i]);
+	}
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
 		virtqueue_disable_cb(vi->cvq);
 
 	netif_device_detach(vi->dev);
-	cancel_delayed_work_sync(&vi->refill);
+
+	for (i = 0; i < vdev->node_cnt; i++)
+		cancel_delayed_work_sync(&vi->vnet_nodes[i]->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->napi);
+		virtnet_napis_disable(vi);
 
 	remove_vq_common(vi);
 
@@ -1208,13 +1414,10 @@ static int virtnet_restore(struct virtio_device *vdev)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(vi);
+		virtnet_napis_enable(vi);
 
 	netif_device_attach(vi->dev);
-
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
-
+	try_fill_all_recv(vi, GFP_KERNEL);
 	return 0;
 }
 #endif
-- 
1.7.4.4



* [PATCH 1/2] [kvm/virtio]: make virtio support NUMA attr
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky
In-Reply-To: <1337246456-30909-1-git-send-email-kernelfans@gmail.com>

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

For each NUMA node reported by vhost, allocate a pair of I/O virtqueues,
assign each of them an MSI-X IRQ, and set the IRQ affinity to the set of
vcpus in the same node.
Also allocate the vqs with a PAGE_SIZE-aligned size, so that the host can
back them with pages from different nodes when the page faults happen.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c       |    2 +-
 drivers/virtio/virtio_pci.c   |   35 +++++++++++++++++++++++++++++++++--
 drivers/virtio/virtio_ring.c  |    9 ++++++---
 include/linux/virtio.h        |    9 +++++++++
 include/linux/virtio_config.h |    1 +
 include/linux/virtio_pci.h    |    9 +++++++++
 6 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 984c501..79e873f 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -136,7 +136,7 @@ static int virtio_dev_probe(struct device *_d)
 			set_bit(i, dev->features);
 
 	dev->config->finalize_features(dev);
-
+	dev->config->get_numa_map(dev);
 	err = drv->probe(dev);
 	if (err)
 		add_status(dev, VIRTIO_CONFIG_S_FAILED);
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 2e03d41..5bb8a97 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -129,6 +129,24 @@ static void vp_finalize_features(struct virtio_device *vdev)
 	iowrite32(vdev->features[0], vp_dev->ioaddr+VIRTIO_PCI_GUEST_FEATURES);
 }
 
+static void vp_get_numa_map(struct virtio_device *vdev)
+{
+	int i, cnt = 0, sz = 32;
+	int cur, prev = 0;
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+
+	/* We only support 32 numa bits. */
+	vdev->allow_map = ioread32(vp_dev->ioaddr+VIRTIO_PCI_NUMA_MAP);
+	for (i = 0; i < sz; i++) {
+		cur = find_next_bit(&vdev->allow_map, sz, prev);
+		prev = cur + 1;
+		if (cur >= sz)
+			break;
+		cnt++;
+	}
+	vdev->node_cnt = cnt;
+}
+
 /* virtio config->get() implementation */
 static void vp_get(struct virtio_device *vdev, unsigned offset,
 		   void *buf, unsigned len)
@@ -516,6 +534,8 @@ static int vp_try_to_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	u16 msix_vec;
 	int i, err, nvectors, allocated_vectors;
+	int irq, next, prev = 0;
+	struct cpumask *mask;
 
 	if (!use_msix) {
 		/* Old style: one normal interrupt for change and all vqs. */
@@ -562,14 +582,24 @@ static int vp_try_to_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			 sizeof *vp_dev->msix_names,
 			 "%s-%s",
 			 dev_name(&vp_dev->vdev.dev), names[i]);
-		err = request_irq(vp_dev->msix_entries[msix_vec].vector,
-				  vring_interrupt, 0,
+		irq = vp_dev->msix_entries[msix_vec].vector;
+		err = request_irq(irq, vring_interrupt, 0,
 				  vp_dev->msix_names[msix_vec],
 				  vqs[i]);
 		if (err) {
 			vp_del_vq(vqs[i]);
 			goto error_find;
 		}
+		if (i == vdev->node_cnt)
+			prev = 0;
+		/* FIXME: the bitmap size (64) is hard-coded. */
+		next = find_next_bit(&vdev->allow_map, 64, prev);
+		prev = next + 1;
+		if (next < 64) {
+			mask = vnode_to_vcpumask(next);
+			cpumask_and(mask, cpu_online_mask, mask);
+			irq_set_affinity(irq, mask);
+		}
 	}
 	return 0;
 
@@ -619,6 +649,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.del_vqs	= vp_del_vqs,
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
+	.get_numa_map = vp_get_numa_map,
 	.bus_name	= vp_bus_name,
 };
 
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..5baa949 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -626,15 +626,18 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 				      const char *name)
 {
 	struct vring_virtqueue *vq;
-	unsigned int i;
+	unsigned int i, size;
 
 	/* We assume num is a power of 2. */
 	if (num & (num - 1)) {
 		dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
 		return NULL;
 	}
-
-	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+	size = PAGE_ALIGN(sizeof(*vq) + sizeof(void *)*num);
+	/* Allocate a page-aligned size, so the host can place the
+	 * virtqueue on the proper node.
+	 */
+	vq = kmalloc(size, GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8efd28a..ec992c9 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -9,6 +9,12 @@
 #include <linux/mod_devicetable.h>
 #include <linux/gfp.h>
 
+struct virtio_node {
+	int node_id;
+	struct virtqueue *rvq;
+	struct virtqueue *svq;
+};
+
 /**
  * virtqueue - a queue to register buffers for sending or receiving.
  * @list: the chain of virtqueues for this device
@@ -22,6 +28,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	struct virtio_node *node;
 	void *priv;
 };
 
@@ -66,6 +73,8 @@ struct virtio_device {
 	struct virtio_device_id id;
 	struct virtio_config_ops *config;
 	struct list_head vqs;
+	int node_cnt;
+	unsigned long allow_map;
 	/* Note that this is a Linux set_bit-style bitmap. */
 	unsigned long features[1];
 	void *priv;
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 7323a33..5e2fd77 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -124,6 +124,7 @@ struct virtio_config_ops {
 	void (*del_vqs)(struct virtio_device *);
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
+	void (*get_numa_map)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
 };
 
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index ea66f3f..1426717 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -78,9 +78,18 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+#ifdef VIRTIO_NUMA
+/* 32-bit bitmap of allowed NUMA nodes */
+#define VIRTIO_PCI_NUMA_MAP         24
+
+/* The remaining space is defined by each driver as the per-driver
+ * configuration space */
+#define VIRTIO_PCI_CONFIG(dev)		28
+#else
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
 #define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
+#endif
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION		0
-- 
1.7.4.4
