Netdev List


* PCI runtime power management for r8169
From: Markus Feldmann @ 2010-04-24 11:30 UTC
  To: netdev

Hi all,

As I read on this mailing list, there is (or will be) runtime power
management for the r8169 driver. I have the following devices:
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
04:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet (rev 10)
04:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet (rev 10)
04:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet (rev 10)

Not all of them are used at the same time. So how can I put the unused
ones into a low-power state?

regards Markus

* Re: [PATCH] e100: Fix the TX workqueue race
From: Alan Cox @ 2010-04-24 11:11 UTC
  To: David Miller; +Cc: netdev, e1000-devel
In-Reply-To: <20100423.163545.157549100.davem@davemloft.net>

On Fri, 23 Apr 2010 16:35:45 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: David Miller <davem@davemloft.net>
> Date: Fri, 23 Apr 2010 16:31:27 -0700 (PDT)
> 
> > I'll apply this to net-2.6, thanks Alan.
> 
> Nevermind...
> 
> Doesn't apply to net-2.6, but even when I fix that up it doesn't
> even compile.  There is no 'dev' variable present etc.
> 
> You even use a combination of "dev" and "netdev" in the resulting
> code block.
> 
> If it doesn't even build, I doubt it's been tested either.

No idea why it won't apply - I guess net has diverged from -next in
this area. The other problem was that I did not type "stg ref" before
"stg export".

(If it doesn't apply, I'll look at it next weekend, when -next and -net
ought to be back in sync?)


commit 526fb792b745da7c9532725a1a6ecc83a01110cf
Author: Alan Cox <alan@linux.intel.com>
Date:   Sat Apr 24 12:09:23 2010 +0100

    e100: Fix the TX workqueue race
    
    Nothing stops the workqueue being left to run in parallel with close or a
    few other operations. This causes double unmaps and the like.
    
    See kerneloops.org #1041230 for an example
    
    Signed-off-by: Alan Cox <alan@linux.intel.com>

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 3e8d000..ef97bfc 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -168,6 +168,7 @@
 #include <linux/ethtool.h>
 #include <linux/string.h>
 #include <linux/firmware.h>
+#include <linux/rtnetlink.h>
 #include <asm/unaligned.h>
 
 
@@ -2280,8 +2281,13 @@ static void e100_tx_timeout_task(struct work_struct *work)
 
 	netif_printk(nic, tx_err, KERN_DEBUG, nic->netdev,
 		     "scb.status=0x%02X\n", ioread8(&nic->csr->scb.status));
-	e100_down(netdev_priv(netdev));
-	e100_up(netdev_priv(netdev));
+
+	rtnl_lock();
+	if (netif_running(netdev)) {
+		e100_down(netdev_priv(netdev));
+		e100_up(netdev_priv(netdev));
+	}
+	rtnl_unlock();
 }
 
 static int e100_loopback_test(struct nic *nic, enum loopback loopback_mode)

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: jamal @ 2010-04-24 14:10 UTC
  To: Eric Dumazet
  Cc: Changli Gao, David S. Miller, Tom Herbert, Stephen Hemminger,
	netdev
In-Reply-To: <1272060153.8918.8.camel@bigi>

[-- Attachment #1: Type: text/plain, Size: 203 bytes --]

On Fri, 2010-04-23 at 18:02 -0400, jamal wrote:

> I've done a setup with the last patch from Changli + net-next - I will
> post test results tomorrow AM.

ok, annotated results attached. 

cheers,
jamal

[-- Attachment #2: summary-apr23.txt --]
[-- Type: text/plain, Size: 45513 bytes --]

                sink     cpu all      cpuint       cpuapp
nn-standalone   93.95%   84.5%        99.8%        79.8%
nn-rps          96.41%   85.4%        95.5%        82.5%
nn-cl           97.29%   84.0%        99.9%        79.6%
nn-cl-rps       97.76%   86.5%        96.5%        84.8%

nn-standalone: Basic net-next from Apr23
nn-rps: Basic net-next from Apr23 with rps mask ee and irq affinity to cpu0
nn-cl: Basic net-next from Apr23 + Changli patch
nn-cl-rps: Basic net-next from Apr23 + Changli patch + rps mask ee and irq affinity to cpu0
sink: the amount of traffic the system was able to sink in.
cpu all: avg % system cpu consumed in test
cpuint: avg %cpu consumed by the cpu where interrupts happened
cpuapp: avg %cpu consumed by a sample cpu which did app processing

Testing was as previously explained.
I repeated each test 4-5 times and took averages.

It seems the non-rps case has improved dramatically since the last
net-next I tested. The rps case has also improved, but the gap between
rps and non-rps is smaller.
[There are just too many variables for me to pinpoint one item as
the contributor. For example, the sky2 driver may have become worse
(consuming more cycles), but I can't quantify it yet (I just see
sky2_rx_submit showing up higher in profiles than before).
Also, call_function_single_interrupt shows up prominently on application
processing CPUs, but improved with Changli's changes.]
After doing the math, I don't trust my results after applying Changli's
patch. It seems both the rps and non-rps cases have gotten better (and I
don't see how Changli's patch contributes to the non-rps case). It also
seems that the gap between rps and non-rps is non-existent now. In other
words, there is no benefit to using rps (it consumes more cpu for the
same throughput). So it is likely that I need to repeat these tests;
maybe I did something wrong in my setup...
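
The arithmetic behind the paragraph above can be sketched roughly as
follows. The sink and "cpu all" figures are copied from the summary
table; the helper names (gap, sink_per_cpu) and the sink/cpu ratio
itself are illustrative assumptions, not something used in the
original tests.

```python
# Summary table values copied from the results above (percentages).
results = {
    "nn-standalone": {"sink": 93.95, "cpu_all": 84.5},
    "nn-rps":        {"sink": 96.41, "cpu_all": 85.4},
    "nn-cl":         {"sink": 97.29, "cpu_all": 84.0},
    "nn-cl-rps":     {"sink": 97.76, "cpu_all": 86.5},
}


def gap(rps_name, base_name, key):
    """Difference (in percentage points) between an rps run and its non-rps base."""
    return round(results[rps_name][key] - results[base_name][key], 2)


def sink_per_cpu(name):
    """Hypothetical efficiency figure: traffic sunk per unit of system cpu."""
    row = results[name]
    return round(row["sink"] / row["cpu_all"], 3)


if __name__ == "__main__":
    for rps_name, base_name in (("nn-rps", "nn-standalone"), ("nn-cl-rps", "nn-cl")):
        print(rps_name, "vs", base_name,
              "sink gap:", gap(rps_name, base_name, "sink"),
              "cpu gap:", gap(rps_name, base_name, "cpu_all"))
    for name in results:
        print(name, "sink per cpu:", sink_per_cpu(name))
```

With these numbers, rps adds 2.46 points of sink for 0.9 points of cpu
on plain net-next, but only 0.47 points of sink for 2.5 points of cpu
once Changli's patch is applied, which is the shrinking gap described
above.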

And here are the profiles:
--------------------------

cpu0 always received all the interrupts regardless of the test.
cpu1, cpu7, etc. were processing apps.
I could not spot much difference between before and after Changli's patch.
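
One way to compare two snapshots more systematically is to parse the
perf top rows and diff the pcnt column per function. The sketch below
is only an illustration: the regular expression, the function names
(parse_perftop, pcnt_delta) and the choice of sample rows are
assumptions, and pcnt is relative to each snapshot's own sample count,
so the deltas are rough indicators at best. The sample rows are taken
from one cpu0 snapshot of nn-standalone and one of nn-cl in the
profiles that follow.

```python
import re

# Matches one perf top data row: samples, pcnt, function name, DSO.
ROW = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)")

# A few cpu0 rows copied from the nn-standalone and nn-cl profiles.
NN_STANDALONE_CPU0 = """
             3511.00 36.7% sky2_poll              [sky2]
              519.00  5.4% __udp4_lib_lookup      [kernel]
              431.00  4.5% ip_route_input         [kernel]
              264.00  2.8% sky2_rx_submit         [sky2]
"""

NN_CL_CPU0 = """
             4445.00 33.4% sky2_poll              [sky2]
              707.00  5.3% __udp4_lib_lookup      [kernel]
              662.00  5.0% ip_route_input         [kernel]
              214.00  1.6% sky2_rx_submit         [sky2]
"""


def parse_perftop(text):
    """Return {function: pcnt} for every data row in a perf top snapshot."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows[match.group(3)] = float(match.group(2))
    return rows


def pcnt_delta(before, after):
    """List (function, after - before) for shared functions, largest change first."""
    old, new = parse_perftop(before), parse_perftop(after)
    shared = old.keys() & new.keys()
    deltas = [(fn, round(new[fn] - old[fn], 1)) for fn in shared]
    # Sort by magnitude of change, then by name for a stable order.
    return sorted(deltas, key=lambda item: (-abs(item[1]), item[0]))


if __name__ == "__main__":
    for fn, delta in pcnt_delta(NN_STANDALONE_CPU0, NN_CL_CPU0):
        print(f"{fn:24s} {delta:+.1f}")
```

Header, separator and PerfTop banner lines do not match the row
pattern, so a whole snapshot can be pasted in unchanged.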


I: Test setup : nn-standalone: Basic net-next from Apr23

All cpus

-------------------------------------------------------------------------------
   PerfTop:    3784 irqs/sec  kernel:84.2% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             3254.00 10.3% sky2_poll                   [sky2]  
             1853.00  5.9% _raw_spin_lock_irqsave      [kernel]
              872.00  2.8% fget                        [kernel]
              870.00  2.8% copy_user_generic_string    [kernel]
              819.00  2.6% _raw_spin_unlock_irqrestore [kernel]
              729.00  2.3% sys_epoll_ctl               [kernel]
              701.00  2.2% datagram_poll               [kernel]
              615.00  2.0% udp_recvmsg                 [kernel]
              602.00  1.9% _raw_spin_lock_bh           [kernel]
              595.00  1.9% system_call                 [kernel]
              592.00  1.9% kmem_cache_free             [kernel]
              574.00  1.8% schedule                    [kernel]
              568.00  1.8% _raw_spin_lock              [kernel]


-------------------------------------------------------------------------------
   PerfTop:    3574 irqs/sec  kernel:85.1% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             5023.00 10.9% sky2_poll                   [sky2]  
             2762.00  6.0% _raw_spin_lock_irqsave      [kernel]
             1319.00  2.9% copy_user_generic_string    [kernel]
             1306.00  2.8% fget                        [kernel]
             1198.00  2.6% _raw_spin_unlock_irqrestore [kernel]
             1071.00  2.3% datagram_poll               [kernel]
             1061.00  2.3% sys_epoll_ctl               [kernel]
              927.00  2.0% _raw_spin_lock_bh           [kernel]
              917.00  2.0% system_call                 [kernel]
              901.00  1.9% udp_recvmsg                 [kernel]
              895.00  1.9% kmem_cache_free             [kernel]
              819.00  1.8% _raw_spin_lock              [kernel]
              802.00  1.7% schedule                    [kernel]
              774.00  1.7% sys_epoll_wait              [kernel]
              720.00  1.6% kmem_cache_alloc            [kernel]


-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

              751.00 36.1% sky2_poll              [sky2]  
              108.00  5.2% __udp4_lib_lookup      [kernel]
               95.00  4.6% ip_route_input         [kernel]
               83.00  4.0% _raw_spin_lock         [kernel]
               79.00  3.8% _raw_spin_lock_irqsave [kernel]
               77.00  3.7% __netif_receive_skb    [kernel]
               77.00  3.7% __alloc_skb            [kernel]
               66.00  3.2% ip_rcv                 [kernel]
               60.00  2.9% __udp4_lib_rcv         [kernel]
               54.00  2.6% sock_queue_rcv_skb     [kernel]
               45.00  2.2% sky2_rx_submit         [sky2]  
               42.00  2.0% __wake_up_common       [kernel]
               40.00  1.9% __kmalloc              [kernel]
               39.00  1.9% sock_def_readable      [kernel]
               30.00  1.4% ep_poll_callback       [kernel]


-------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:99.8% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

             3511.00 36.7% sky2_poll              [sky2]  
              519.00  5.4% __udp4_lib_lookup      [kernel]
              431.00  4.5% ip_route_input         [kernel]
              353.00  3.7% _raw_spin_lock_irqsave [kernel]
              351.00  3.7% __alloc_skb            [kernel]
              338.00  3.5% __netif_receive_skb    [kernel]
              337.00  3.5% _raw_spin_lock         [kernel]
              307.00  3.2% ip_rcv                 [kernel]
              264.00  2.8% sky2_rx_submit         [sky2]  
              254.00  2.7% sock_queue_rcv_skb     [kernel]
              246.00  2.6% __udp4_lib_rcv         [kernel]
              206.00  2.2% sock_def_readable      [kernel]
              177.00  1.9% __wake_up_common       [kernel]
              168.00  1.8% __kmalloc              [kernel]


-------------------------------------------------------------------------------
   PerfTop:     908 irqs/sec  kernel:80.0% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

              177.00  6.7% _raw_spin_lock_irqsave      [kernel]
              120.00  4.5% copy_user_generic_string    [kernel]
              110.00  4.2% fget                        [kernel]
              108.00  4.1% datagram_poll               [kernel]
               98.00  3.7% _raw_spin_lock_bh           [kernel]
               91.00  3.4% sys_epoll_ctl               [kernel]
               89.00  3.4% kmem_cache_free             [kernel]
               77.00  2.9% system_call                 [kernel]
               76.00  2.9% schedule                    [kernel]
               76.00  2.9% _raw_spin_unlock_irqrestore [kernel]
               63.00  2.4% fput                        [kernel]
               61.00  2.3% sys_epoll_wait              [kernel]
               61.00  2.3% udp_recvmsg                 [kernel]
               49.00  1.8% process_recv                mcpudp  


-------------------------------------------------------------------------------
   PerfTop:     815 irqs/sec  kernel:79.8% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

              491.00  8.0% _raw_spin_lock_irqsave      [kernel.kallsyms]
              285.00  4.7% copy_user_generic_string    [kernel.kallsyms]
              252.00  4.1% fget                        [kernel.kallsyms]
              215.00  3.5% datagram_poll               [kernel.kallsyms]
              206.00  3.4% _raw_spin_unlock_irqrestore [kernel.kallsyms]
              204.00  3.3% sys_epoll_ctl               [kernel.kallsyms]
              196.00  3.2% _raw_spin_lock_bh           [kernel.kallsyms]
              184.00  3.0% udp_recvmsg                 [kernel.kallsyms]
              184.00  3.0% kmem_cache_free             [kernel.kallsyms]
              180.00  2.9% system_call                 [kernel.kallsyms]
              168.00  2.7% sys_epoll_wait              [kernel.kallsyms]
              159.00  2.6% schedule                    [kernel.kallsyms]
              144.00  2.4% fput                        [kernel.kallsyms]


II: Test setup 
nn-rps: Basic net-next from Apr23 with rps mask ee and irq affinity to cpu0

-------------------------------------------------------------------------------
   PerfTop:    3558 irqs/sec  kernel:85.0% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

             3519.00 15.9% sky2_poll                      [sky2]  
              865.00  3.9% _raw_spin_lock_irqsave         [kernel]
              568.00  2.6% _raw_spin_unlock_irqrestore    [kernel]
              526.00  2.4% sky2_intr                      [sky2]  
              493.00  2.2% __netif_receive_skb            [kernel]
              477.00  2.2% _raw_spin_lock                 [kernel]
              470.00  2.1% ip_rcv                         [kernel]
              456.00  2.1% fget                           [kernel]
              447.00  2.0% sys_epoll_ctl                  [kernel]
              420.00  1.9% copy_user_generic_string       [kernel]
              387.00  1.8% ip_route_input                 [kernel]
              359.00  1.6% system_call                    [kernel]
              334.00  1.5% kmem_cache_free                [kernel]
              310.00  1.4% kmem_cache_alloc               [kernel]
              302.00  1.4% call_function_single_interrupt [kernel]


-------------------------------------------------------------------------------
   PerfTop:    3546 irqs/sec  kernel:85.8% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

             6592.00 16.2% sky2_poll                      [sky2]  
             1540.00  3.8% _raw_spin_lock_irqsave         [kernel]
             1014.00  2.5% _raw_spin_unlock_irqrestore    [kernel]
              885.00  2.2% fget                           [kernel]
              881.00  2.2% _raw_spin_lock                 [kernel]
              880.00  2.2% sky2_intr                      [sky2]  
              872.00  2.1% __netif_receive_skb            [kernel]
              858.00  2.1% ip_rcv                         [kernel]
              802.00  2.0% sys_epoll_ctl                  [kernel]
              710.00  1.7% copy_user_generic_string       [kernel]
              696.00  1.7% system_call                    [kernel]
              692.00  1.7% ip_route_input                 [kernel]
              634.00  1.6% schedule                       [kernel]
              618.00  1.5% kmem_cache_free                [kernel]
              605.00  1.5% call_function_single_interrupt [kernel]


cpu0

-------------------------------------------------------------------------------
   PerfTop:     971 irqs/sec  kernel:96.5% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             4222.00 58.2% sky2_poll                   [sky2]  
              668.00  9.2% sky2_intr                   [sky2]  
              228.00  3.1% __alloc_skb                 [kernel]
              183.00  2.5% get_rps_cpu                 [kernel]
              138.00  1.9% sky2_rx_submit              [sky2]  
              124.00  1.7% enqueue_to_backlog          [kernel]
              119.00  1.6% __kmalloc                   [kernel]
              103.00  1.4% kmem_cache_alloc            [kernel]
               91.00  1.3% _raw_spin_lock              [kernel]
               90.00  1.2% _raw_spin_lock_irqsave      [kernel]
               73.00  1.0% swiotlb_sync_single         [kernel]
               72.00  1.0% irq_entries_start           [kernel]
               55.00  0.8% copy_user_generic_string    [kernel]
               53.00  0.7% _raw_spin_unlock_irqrestore [kernel]
               48.00  0.7% fget                        [kernel]


-------------------------------------------------------------------------------
   PerfTop:     998 irqs/sec  kernel:94.8% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             6745.00 58.5% sky2_poll                   [sky2]  
              831.00  7.2% sky2_intr                   [sky2]  
              352.00  3.1% __alloc_skb                 [kernel]
              281.00  2.4% get_rps_cpu                 [kernel]
              226.00  2.0% sky2_rx_submit              [sky2]  
              186.00  1.6% __kmalloc                   [kernel]
              181.00  1.6% enqueue_to_backlog          [kernel]
              173.00  1.5% _raw_spin_lock_irqsave      [kernel]
              166.00  1.4% kmem_cache_alloc            [kernel]
              162.00  1.4% _raw_spin_lock              [kernel]
               99.00  0.9% swiotlb_sync_single         [kernel]
               98.00  0.9% irq_entries_start           [kernel]
               94.00  0.8% fget                        [kernel]
               92.00  0.8% _raw_spin_unlock_irqrestore [kernel]
               80.00  0.7% system_call                 [kernel]


cpu1


-------------------------------------------------------------------------------
   PerfTop:     724 irqs/sec  kernel:82.0% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _________________

              204.00  5.3% _raw_spin_lock_irqsave         [kernel.kallsyms]
              153.00  4.0% _raw_spin_unlock_irqrestore    [kernel.kallsyms]
              147.00  3.8% call_function_single_interrupt [kernel.kallsyms]
              139.00  3.6% __netif_receive_skb            [kernel.kallsyms]
              135.00  3.5% sys_epoll_ctl                  [kernel.kallsyms]
              132.00  3.4% ip_rcv                         [kernel.kallsyms]
              129.00  3.3% fget                           [kernel.kallsyms]
              128.00  3.3% _raw_spin_lock                 [kernel.kallsyms]
              122.00  3.2% system_call                    [kernel.kallsyms]
              118.00  3.1% ip_route_input                 [kernel.kallsyms]
              109.00  2.8% kmem_cache_free                [kernel.kallsyms]
              108.00  2.8% copy_user_generic_string       [kernel.kallsyms]
               90.00  2.3% schedule                       [kernel.kallsyms]
               85.00  2.2% fput                           [kernel.kallsyms]



-------------------------------------------------------------------------------
   PerfTop:     763 irqs/sec  kernel:83.0% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _________________

              428.00  6.2% _raw_spin_lock_irqsave         [kernel.kallsyms]
              302.00  4.4% _raw_spin_unlock_irqrestore    [kernel.kallsyms]
              269.00  3.9% __netif_receive_skb            [kernel.kallsyms]
              258.00  3.7% call_function_single_interrupt [kernel.kallsyms]
              254.00  3.7% fget                           [kernel.kallsyms]
              238.00  3.4% ip_rcv                         [kernel.kallsyms]
              230.00  3.3% sys_epoll_ctl                  [kernel.kallsyms]
              222.00  3.2% _raw_spin_lock                 [kernel.kallsyms]
              220.00  3.2% ip_route_input                 [kernel.kallsyms]
              197.00  2.9% system_call                    [kernel.kallsyms]
              189.00  2.7% kmem_cache_free                [kernel.kallsyms]
              184.00  2.7% copy_user_generic_string       [kernel.kallsyms]
              144.00  2.1% ep_remove                      [kernel.kallsyms]
              140.00  2.0% schedule                       [kernel.kallsyms]


-------------------------------------------------------------------------------
   PerfTop:     546 irqs/sec  kernel:83.3% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _________________

              346.00  5.7% _raw_spin_lock_irqsave         [kernel.kallsyms]
              275.00  4.6% _raw_spin_unlock_irqrestore    [kernel.kallsyms]
              238.00  3.9% call_function_single_interrupt [kernel.kallsyms]
              228.00  3.8% fget                           [kernel.kallsyms]
              222.00  3.7% __netif_receive_skb            [kernel.kallsyms]
              219.00  3.6% sys_epoll_ctl                  [kernel.kallsyms]
              209.00  3.5% _raw_spin_lock                 [kernel.kallsyms]
              205.00  3.4% ip_rcv                         [kernel.kallsyms]
              199.00  3.3% ip_route_input                 [kernel.kallsyms]
              173.00  2.9% system_call                    [kernel.kallsyms]
              170.00  2.8% copy_user_generic_string       [kernel.kallsyms]
              167.00  2.8% kmem_cache_free                [kernel.kallsyms]
              127.00  2.1% ep_remove                      [kernel.kallsyms]
              123.00  2.0% dst_release                    [kernel.kalls



III: Test setup 
nn-cl: Basic net-next from Apr23 + Changli patch

-------------------------------------------------------------------------------
   PerfTop:    3789 irqs/sec  kernel:84.1% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

             3514.00 10.2% sky2_poll                   [sky2]              
             1862.00  5.4% _raw_spin_lock_irqsave      [kernel]            
             1274.00  3.7% system_call                 [kernel]            
              926.00  2.7% fget                        [kernel]            
              872.00  2.5% _raw_spin_unlock_irqrestore [kernel]            
              862.00  2.5% copy_user_generic_string    [kernel]            
              766.00  2.2% sys_epoll_ctl               [kernel]            
              765.00  2.2% datagram_poll               [kernel]            
              671.00  2.0% _raw_spin_lock_bh           [kernel]            
              668.00  1.9% kmem_cache_free             [kernel]            
              602.00  1.8% udp_recvmsg                 [kernel]            
              586.00  1.7% _raw_spin_lock              [kernel]            
              585.00  1.7% vread_tsc                   [kernel].vsyscall_fn



-------------------------------------------------------------------------------
   PerfTop:    3794 irqs/sec  kernel:83.6% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

             4756.00  9.8% sky2_poll                   [sky2]              
             2742.00  5.7% _raw_spin_lock_irqsave      [kernel]            
             1826.00  3.8% system_call                 [kernel]            
             1285.00  2.7% fget                        [kernel]            
             1284.00  2.7% copy_user_generic_string    [kernel]            
             1235.00  2.6% _raw_spin_unlock_irqrestore [kernel]            
             1096.00  2.3% sys_epoll_ctl               [kernel]            
             1071.00  2.2% datagram_poll               [kernel]            
              954.00  2.0% kmem_cache_free             [kernel]            
              925.00  1.9% _raw_spin_lock_bh           [kernel]            
              888.00  1.8% vread_tsc                   [kernel].vsyscall_fn
              880.00  1.8% udp_recvmsg                 [kernel]            
              793.00  1.6% _raw_spin_lock              [kernel]            
              790.00  1.6% schedule                    [kernel]   

-------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:99.9% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

              675.00 32.6% sky2_poll              [sky2]  
              116.00  5.6% __udp4_lib_lookup      [kernel]
              111.00  5.4% ip_route_input         [kernel]
               81.00  3.9% _raw_spin_lock_irqsave [kernel]
               81.00  3.9% _raw_spin_lock         [kernel]
               70.00  3.4% __alloc_skb            [kernel]
               67.00  3.2% ip_rcv                 [kernel]
               66.00  3.2% __netif_receive_skb    [kernel]
               61.00  2.9% __udp4_lib_rcv         [kernel]
               57.00  2.8% sock_queue_rcv_skb     [kernel]
               47.00  2.3% sock_def_readable      [kernel]
               42.00  2.0% __kmalloc              [kernel]
               42.00  2.0% __wake_up_common       [kernel]
               38.00  1.8% sky2_rx_submit         [sky2]  

-------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

             2526.00 32.8% sky2_poll              [sky2]  
              406.00  5.3% ip_route_input         [kernel]
              399.00  5.2% __udp4_lib_lookup      [kernel]
              328.00  4.3% _raw_spin_lock_irqsave [kernel]
              307.00  4.0% _raw_spin_lock         [kernel]
              296.00  3.8% ip_rcv                 [kernel]
              287.00  3.7% __alloc_skb            [kernel]
              272.00  3.5% sock_queue_rcv_skb     [kernel]
              224.00  2.9% __udp4_lib_rcv         [kernel]
              224.00  2.9% __netif_receive_skb    [kernel]
              182.00  2.4% sock_def_readable      [kernel]
              163.00  2.1% __wake_up_common       [kernel]
              140.00  1.8% sky2_rx_submit         [sky2]  

-------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

             4445.00 33.4% sky2_poll              [sky2]  
              707.00  5.3% __udp4_lib_lookup      [kernel]
              662.00  5.0% ip_route_input         [kernel]
              567.00  4.3% _raw_spin_lock_irqsave [kernel]
              512.00  3.8% __alloc_skb            [kernel]
              506.00  3.8% ip_rcv                 [kernel]
              476.00  3.6% sock_queue_rcv_skb     [kernel]
              473.00  3.6% _raw_spin_lock         [kernel]
              415.00  3.1% __udp4_lib_rcv         [kernel]
              408.00  3.1% __netif_receive_skb    [kernel]
              306.00  2.3% sock_def_readable      [kernel]
              272.00  2.0% __wake_up_common       [kernel]
              260.00  2.0% __kmalloc              [kernel]
              216.00  1.6% _raw_read_lock         [kernel]
              214.00  1.6% sky2_rx_submit         [sky2]  


-------------------------------------------------------------------------------
   PerfTop:     748 irqs/sec  kernel:80.9% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

              244.00  7.4% _raw_spin_lock_irqsave      [kernel]            
              207.00  6.2% system_call                 [kernel]            
              127.00  3.8% _raw_spin_unlock_irqrestore [kernel]            
              124.00  3.7% copy_user_generic_string    [kernel]            
              122.00  3.7% sys_epoll_ctl               [kernel]            
              120.00  3.6% fget                        [kernel]            
              118.00  3.6% datagram_poll               [kernel]            
               96.00  2.9% schedule                    [kernel]            
               94.00  2.8% _raw_spin_lock_bh           [kernel]            
               86.00  2.6% vread_tsc                   [kernel].vsyscall_fn
               82.00  2.5% udp_recvmsg                 [kernel]            
               76.00  2.3% fput                        [kernel]            
               73.00  2.2% kmem_cache_free             [kernel]            
               67.00  2.0% sys_epoll_wait              [kernel]         

-------------------------------------------------------------------------------
   PerfTop:     625 irqs/sec  kernel:78.6% [1000Hz cycles],  (all, cpu: 1)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

              488.00  7.5% _raw_spin_lock_irqsave      [kernel]            
              380.00  5.9% system_call                 [kernel]            
              274.00  4.2% copy_user_generic_string    [kernel]            
              252.00  3.9% fget                        [kernel]            
              244.00  3.8% datagram_poll               [kernel]            
              217.00  3.3% _raw_spin_unlock_irqrestore [kernel]            
              211.00  3.3% sys_epoll_ctl               [kernel]            
              186.00  2.9% schedule                    [kernel]            
              185.00  2.9% _raw_spin_lock_bh           [kernel]            
              173.00  2.7% udp_recvmsg                 [kernel]            
              169.00  2.6% vread_tsc                   [kernel].vsyscall_fn
              164.00  2.5% kmem_cache_free             [kernel]            
              143.00  2.2% fput                        [kernel]            
              133.00  2.1% sys_epoll_wait              [kernel]        


IV: Test setup 
nn-cl-rps: Basic net-next from Apr23 + Changli patch + rps mask ee,irq aff

--------------------------------------------------------------------------
   PerfTop:    3043 irqs/sec  kernel:87.5% [1000Hz cycles],  (all, 8 CPUs)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

             2240.00 20.4% sky2_poll                  [sky2]              
              375.00  3.4% _raw_spin_lock_irqsave     [kernel]            
              335.00  3.0% sky2_intr                  [sky2]              
              326.00  3.0% system_call                [kernel]            
              239.00  2.2% _raw_spin_unlock_irqrestor [kernel]            
              224.00  2.0% ip_rcv                     [kernel]            
              201.00  1.8% __netif_receive_skb        [kernel]            
              198.00  1.8% sys_epoll_ctl              [kernel]            
              190.00  1.7% _raw_spin_lock             [kernel]            
              182.00  1.7% fget                       [kernel]            
              169.00  1.5% copy_user_generic_string   [kernel]            
              165.00  1.5% kmem_cache_free            [kernel]            
              149.00  1.4% load_balance               [kernel]            
              146.00  1.3% ip_route_input             [kernel]           


--------------------------------------------------------------------------
   PerfTop:    3210 irqs/sec  kernel:85.8% [1000Hz cycles],  (all, 8 CPUs)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

             6539.00 20.4% sky2_poll                  [sky2]              
             1106.00  3.4% _raw_spin_lock_irqsave     [kernel]            
             1014.00  3.2% sky2_intr                  [sky2]              
              976.00  3.0% system_call                [kernel]            
              684.00  2.1% _raw_spin_unlock_irqrestor [kernel]            
              611.00  1.9% ip_rcv                     [kernel]            
              601.00  1.9% fget                       [kernel]            
              593.00  1.8% _raw_spin_lock             [kernel]            
              592.00  1.8% sys_epoll_ctl              [kernel]            
              574.00  1.8% __netif_receive_skb        [kernel]            
              526.00  1.6% copy_user_generic_string   [kernel]            
              482.00  1.5% kmem_cache_free            [kernel]            
              480.00  1.5% ip_route_input             [kernel]            
              425.00  1.3% vread_tsc                  [kernel].vsyscall_fn
              410.00  1.3% kmem_cache_alloc           [kernel]            


--------------------------------------------------------------------------
   PerfTop:     999 irqs/sec  kernel:97.2% [1000Hz cycles],  (all, cpu: 0)
--------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             2035.00 60.5% sky2_poll                   [sky2]  
              302.00  9.0% sky2_intr                   [sky2]  
              109.00  3.2% __alloc_skb                 [kernel]
               57.00  1.7% _raw_spin_lock              [kernel]
               57.00  1.7% get_rps_cpu                 [kernel]
               52.00  1.5% __kmalloc                   [kernel]
               51.00  1.5% enqueue_to_backlog          [kernel]
               49.00  1.5% _raw_spin_lock_irqsave      [kernel]
               44.00  1.3% kmem_cache_alloc            [kernel]
               34.00  1.0% sky2_rx_submit              [sky2]  
               33.00  1.0% swiotlb_sync_single         [kernel]
               31.00  0.9% system_call                 [kernel]
               28.00  0.8% irq_entries_start           [kernel]
               22.00  0.7% _raw_spin_unlock_irqrestore [kernel]
               21.00  0.6% sky2_remove                 [sky2]  

--------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:96.2% [1000Hz cycles],  (all, cpu: 0)
--------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             5493.00 60.1% sky2_poll                   [sky2]  
              803.00  8.8% sky2_intr                   [sky2]  
              281.00  3.1% __alloc_skb                 [kernel]
              233.00  2.6% get_rps_cpu                 [kernel]
              136.00  1.5% enqueue_to_backlog          [kernel]
              132.00  1.4% __kmalloc                   [kernel]
              126.00  1.4% _raw_spin_lock              [kernel]
              122.00  1.3% kmem_cache_alloc            [kernel]
              122.00  1.3% _raw_spin_lock_irqsave      [kernel]
              102.00  1.1% swiotlb_sync_single         [kernel]
               88.00  1.0% sky2_rx_submit              [sky2]  
               77.00  0.8% system_call                 [kernel]
               69.00  0.8% irq_entries_start           [kernel]
               55.00  0.6% _raw_spin_unlock_irqrestore [kernel]
               54.00  0.6% copy_user_generic_string    [kernel]

--------------------------------------------------------------------------
   PerfTop:     999 irqs/sec  kernel:97.5% [1000Hz cycles],  (all, cpu: 0)
--------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ________

             6699.00 60.1% sky2_poll                   [sky2]  
              988.00  8.9% sky2_intr                   [sky2]  
              327.00  2.9% __alloc_skb                 [kernel]
              261.00  2.3% get_rps_cpu                 [kernel]
              168.00  1.5% __kmalloc                   [kernel]
              161.00  1.4% kmem_cache_alloc            [kernel]
              160.00  1.4% enqueue_to_backlog          [kernel]
              157.00  1.4% _raw_spin_lock              [kernel]
              125.00  1.1% _raw_spin_lock_irqsave      [kernel]
              122.00  1.1% swiotlb_sync_single         [kernel]
              114.00  1.0% sky2_rx_submit              [sky2]  
               96.00  0.9% system_call                 [kernel]
               85.00  0.8% irq_entries_start           [kernel]
               66.00  0.6% sky2_remove                 [sky2]  
               64.00  0.6% _raw_spin_unlock_irqrestore [kernel]

--------------------------------------------------------------------------
   PerfTop:     420 irqs/sec  kernel:84.8% [1000Hz cycles],  (all, cpu: 2)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

              188.00  4.8% _raw_spin_lock_irqsave     [kernel]            
              175.00  4.5% system_call                [kernel]            
              155.00  4.0% _raw_spin_unlock_irqrestor [kernel]            
              143.00  3.7% __netif_receive_skb        [kernel]            
              124.00  3.2% ip_route_input             [kernel]            
              122.00  3.1% fget                       [kernel]            
              118.00  3.0% ip_rcv                     [kernel]            
              115.00  2.9% sys_epoll_ctl              [kernel]            
              107.00  2.7% call_function_single_inter [kernel]            
               98.00  2.5% vread_tsc                  [kernel].vsyscall_fn
               97.00  2.5% _raw_spin_lock             [kernel]            
               89.00  2.3% copy_user_generic_string   [kernel]        

--------------------------------------------------------------------------
   PerfTop:     372 irqs/sec  kernel:87.9% [1000Hz cycles],  (all, cpu: 2)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

              212.00  4.6% _raw_spin_lock_irqsave     [kernel]            
              192.00  4.2% system_call                [kernel]            
              187.00  4.1% __netif_receive_skb        [kernel]            
              184.00  4.0% ip_rcv                     [kernel]            
              174.00  3.8% ip_route_input             [kernel]            
              165.00  3.6% _raw_spin_unlock_irqrestor [kernel]            
              143.00  3.1% call_function_single_inter [kernel]            
              135.00  3.0% fget                       [kernel]            
              133.00  2.9% sys_epoll_ctl              [kernel]            
              122.00  2.7% _raw_spin_lock             [kernel]            
              112.00  2.5% __udp4_lib_lookup          [kernel]            
               99.00  2.2% copy_user_generic_string   [kernel]            
               93.00  2.0% vread_tsc                  [kernel].vsyscall_fn
               90.00  2.0% kmem_cache_free            [kernel]            
               89.00  1.9% ep_remove                  [kernel]        
--------------------------------------------------------------------------
   PerfTop:     269 irqs/sec  kernel:85.1% [1000Hz cycles],  (all, cpu: 7)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

               23.00  4.6% _raw_spin_lock_irqsave     [kernel]            
               21.00  4.2% system_call                [kernel]            
               19.00  3.8% _raw_spin_unlock_irqrestor [kernel]            
               17.00  3.4% fget                       [kernel]            
               15.00  3.0% __netif_receive_skb        [kernel]            
               14.00  2.8% dst_release                [kernel]            
               13.00  2.6% call_function_single_inter [kernel]            
               11.00  2.2% kmem_cache_free            [kernel]            
               10.00  2.0% vread_tsc                  [kernel].vsyscall_fn
               10.00  2.0% copy_user_generic_string   [kernel]            
               10.00  2.0% ktime_get                  [kernel]            
               10.00  2.0% ip_route_input             [kernel]            
               10.00  2.0% schedule                   [kernel]            


--------------------------------------------------------------------------
   PerfTop:     253 irqs/sec  kernel:84.6% [1000Hz cycles],  (all, cpu: 7)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

              109.00  4.9% system_call                [kernel]            
              104.00  4.6% _raw_spin_lock_irqsave     [kernel]            
               79.00  3.5% ip_rcv                     [kernel]            
               74.00  3.3% _raw_spin_unlock_irqrestor [kernel]            
               71.00  3.2% fget                       [kernel]            
               68.00  3.0% sys_epoll_ctl              [kernel]            
               66.00  2.9% ip_route_input             [kernel]            
               58.00  2.6% call_function_single_inter [kernel]            
               55.00  2.4% _raw_spin_lock             [kernel]            
               54.00  2.4% copy_user_generic_string   [kernel]            
               53.00  2.4% __netif_receive_skb        [kernel]            
               51.00  2.3% schedule                   [kernel]            
               51.00  2.3% kmem_cache_free            [kernel]            
               43.00  1.9% vread_tsc                  [kernel].vsyscall_fn
               38.00  1.7% __udp4_lib_lookup          [kernel]  

--------------------------------------------------------------------------
   PerfTop:     236 irqs/sec  kernel:84.3% [1000Hz cycles],  (all, cpu: 7)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

              131.00  4.9% _raw_spin_lock_irqsave     [kernel]            
              128.00  4.8% system_call                [kernel]            
              101.00  3.8% _raw_spin_unlock_irqrestor [kernel]            
               89.00  3.3% fget                       [kernel]            
               85.00  3.2% sys_epoll_ctl              [kernel]            
               81.00  3.0% ip_rcv                     [kernel]            
               76.00  2.8% ip_route_input             [kernel]            
               66.00  2.5% call_function_single_inter [kernel]            
               65.00  2.4% _raw_spin_lock             [kernel]            
               65.00  2.4% kmem_cache_free            [kernel]            
               64.00  2.4% copy_user_generic_string   [kernel]            
               57.00  2.1% __netif_receive_skb        [kernel]            
               47.00  1.8% schedule                   [kernel]            
               45.00  1.7% vread_tsc                  [kernel].vsyscall_fn


--------------------------------------------------------------------------
   PerfTop:     478 irqs/sec  kernel:82.2% [1000Hz cycles],  (all, cpu: 2)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

              319.00  5.2% _raw_spin_lock_irqsave     [kernel]            
              289.00  4.7% system_call                [kernel]            
              246.00  4.0% _raw_spin_unlock_irqrestor [kernel]            
              199.00  3.2% ip_route_input             [kernel]            
              198.00  3.2% __netif_receive_skb        [kernel]            
              197.00  3.2% sys_epoll_ctl              [kernel]            
              183.00  3.0% ip_rcv                     [kernel]            
              182.00  2.9% fget                       [kernel]            
              166.00  2.7% call_function_single_inter [kernel]            
              157.00  2.5% copy_user_generic_string   [kernel]            
              149.00  2.4% kmem_cache_free            [kernel]            
              146.00  2.4% vread_tsc                  [kernel].vsyscall_fn
              133.00  2.1% _raw_spin_lock             [kernel]            
              118.00  1.9% schedule                   [kernel]            
              112.00  1.8% __udp4_lib_lookup          [kernel]            



--------------------------------------------------------------------------
   PerfTop:     535 irqs/sec  kernel:83.0% [1000Hz cycles],  (all, cpu: 2)
--------------------------------------------------------------------------

             samples  pcnt function                   DSO
             _______ _____ __________________________ ____________________

              345.00  5.2% _raw_spin_lock_irqsave     [kernel]            
              291.00  4.4% system_call                [kernel]            
              255.00  3.9% _raw_spin_unlock_irqrestor [kernel]            
              218.00  3.3% fget                       [kernel]            
              201.00  3.0% ip_route_input             [kernel]            
              193.00  2.9% __netif_receive_skb        [kernel]            
              193.00  2.9% sys_epoll_ctl              [kernel]            
              180.00  2.7% ip_rcv                     [kernel]            
              173.00  2.6% call_function_single_inter [kernel]            
              163.00  2.5% copy_user_generic_string   [kernel]            
              152.00  2.3% kmem_cache_free            [kernel]            
              151.00  2.3% vread_tsc                  [kernel].vsyscall_fn
              142.00  2.1% _raw_spin_lock             [kernel]            
              131.00  2.0% schedule                   [kernel]            



^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add enic ndo_vf_set_port_profile op support for dynamic vnics
From: Scott Feldman @ 2010-04-24 14:30 UTC (permalink / raw)
  To: Chris Wright; +Cc: davem, netdev, arnd
In-Reply-To: <20100424022121.GE3843@x200.localdomain>

On 4/23/10 7:21 PM, "Chris Wright" <chrisw@redhat.com> wrote:

> * Scott Feldman (scofeldm@cisco.com) wrote:
>> -#define DRV_VERSION  "1.3.1.1"
>> +#define DRV_VERSION  "1.3.1.1-iov"
> 
> not a version bump?

Any version difference will work to tell this patch from what's upstream already.

>> @@ -810,14 +819,24 @@ static void enic_reset_mcaddrs(struct enic *enic)
>>  
>>  static int enic_set_mac_addr(struct net_device *netdev, char *addr)
>>  {
>> - if (!is_valid_ether_addr(addr))
>> -  return -EADDRNOTAVAIL;
>> + struct enic *enic = netdev_priv(netdev);
>>  
>> - memcpy(netdev->dev_addr, addr, netdev->addr_len);
>> + if (enic_is_dynamic(enic)) {
>> +  random_ether_addr(netdev->dev_addr);
> 
> Would it make more sense to just ignore this?  Then the default (not
> port profile configured yet) mac is still a useful identifier.

Dynamic enics start out with all zero mac addr, so this assigns a random one
to the interface when created.  After I sent out this patch, I realized if
mac addr arg wasn't included in port-profile by user (it's optional), then I
should use this random netdev->dev_addr for the port-profile mac addr.  Next
patch.

>> +static int enic_set_mac_address(struct net_device *netdev, void *p)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> +
> 
> Ever?  Even on non-dynamic enic?  Oh, I see, this was just a lie before ;-)

Static enics get mac addr from mgmt tool; and dynamic enics get mac addr
during port-profile assignment.  In either case, there is no need for the
user to change the interface mac addr, so I'm adding this explicit block.

It's weird, without setting any value to ndo_set_mac_address, you can change
the mac addr on the interface when the interface is DOWN but not when it's
UP.  Not sure why that is.  In any case, adding this ndo_set_mac_address
callback blocks all attempts to change mac addr on interface.

>> +static int enic_set_vf_port_profile(struct net_device *netdev, int vf,
>> + u8 *port_profile, u8 *mac, u8 *host_uuid, u8 *client_uuid,
>> + u8 *client_name)
>> +{
>> + struct enic *enic = netdev_priv(netdev);
>> + struct vic_provinfo *vp;
>> + u8 oui[3] = VIC_PROVINFO_CISCO_OUI;
>> + int err;
>> +
>> + if (!enic_is_dynamic(enic))
>> +  return -EOPNOTSUPP;
> 
> Do you want to validate vf (like require it to be 0) or something?

Yes, I should add a check for vf == 0.

> How should userspace know to talk directly to the VF (dynamic enic) in
> this device, but a PF + VF_index for another device?

Can we use IFLA_NUM_VF?  In this enic case since PF==VF (for now), we'll
need to return IFLA_NUM_VF=1.  Let me see what we can do here.

>> + err = enic_vnic_dev_deinit(enic);
>> + if (err)
>> +  goto err_out;
>> +
>> + err = enic_dev_init_prov(enic, vp);
>> +
>> +err_out:
>> + vic_provinfo_free(vp);
>> +
>> + enic_set_multicast_list(netdev);
> 
> Should that happen in error case (and is the locking correct)?

Locking is correct.  I'll change this to do this only on the non-error path.

-scott

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/2] Add ndo_set_vf_port_profile (was iovnl)
From: Scott Feldman @ 2010-04-24 14:37 UTC (permalink / raw)
  To: Chris Wright; +Cc: davem, netdev, arnd
In-Reply-To: <20100424022242.GF3843@x200.localdomain>

On 4/23/10 7:22 PM, "Chris Wright" <chrisw@redhat.com> wrote:

>> I took some liberties and s/SR-IOV/IOV in the code comments around the
>> ndo_set_vf_* cmds as they can apply to both SR-IOV and non-SR-IOV adapters,
>> as long as there is a PF:VF parent:child relationship.
> 
> For enic case, which do you expect to use for net_dev and VF index?  Would
> this be VF + index== 0 (meaning the degenerate case you described last
> time where PF==VF)?

Yes, for this enic PF==VF, but that's a short term situation.  It's a small
matter of programming (in firmware) to turn enic into the more general case.
But I want to focus on getting port-profile support in first, with the
current enic+firmware.

>> A port-profile is used to configure/enable the network port backing the VF,
>> not
>> to configure the host-facing side of the VF.
> 
> How shall we do the lldpad case?

Same as before with iovnl.  The sender of RTM_SETLINK msg (say libvirt)
needs to send with mcast group RTMGRP_LINK and listener (say lldpad) needs
to listen on that mcast group.  This way, both kernel and user-space get the
msg.

>> + if (tb[IFLA_VF_PORT_PROFILE]) {
>> +  struct ifla_vf_port_profile *ivp;
>> +  ivp = nla_data(tb[IFLA_VF_PORT_PROFILE]);
>> +  err = -EOPNOTSUPP;
>> +  if (ops->ndo_set_vf_port_profile)
>> +   ivp->port_profile[sizeof(ivp->port_profile)-1] = 0;
>> +   ivp->host_uuid[sizeof(ivp->host_uuid)-1] = 0;
>> +   ivp->client_uuid[sizeof(ivp->client_uuid)-1] = 0;
>> +   ivp->client_name[sizeof(ivp->client_name)-1] = 0;
> 
> Seems a little unusual to modify the buffer, add a kernel internal structure
> that can be passed to ndo callback (where buffer lens can be known)?

Ok, let me see what can be done here.

-scott


^ permalink raw reply

* [PATCH v3] rps: optimize rps_get_cpu()
From: Changli Gao @ 2010-04-24 15:17 UTC (permalink / raw)
  To: David Miller; +Cc: Tom Herbert, Eric Dumazet, netdev, Changli Gao

Optimize get_rps_cpu().

Don't initialize ports when we can get the ports. One memory access for ports
rather than two.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/core/dev.c |   24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index a4a7c36..4d43f1a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2229,7 +2229,11 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	int cpu = -1;
 	u8 ip_proto;
 	u16 tcpu;
-	u32 addr1, addr2, ports, ihl;
+	u32 addr1, addr2, ihl;
+	union {
+		u32 v32;
+		u16 v16[2];
+	} ports;
 
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
@@ -2275,7 +2279,6 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	default:
 		goto done;
 	}
-	ports = 0;
 	switch (ip_proto) {
 	case IPPROTO_TCP:
 	case IPPROTO_UDP:
@@ -2285,25 +2288,20 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	case IPPROTO_SCTP:
 	case IPPROTO_UDPLITE:
 		if (pskb_may_pull(skb, (ihl * 4) + 4)) {
-			__be16 *hports = (__be16 *) (skb->data + (ihl * 4));
-			u32 sport, dport;
-
-			sport = (__force u16) hports[0];
-			dport = (__force u16) hports[1];
-			if (dport < sport)
-				swap(sport, dport);
-			ports = (sport << 16) + dport;
+			ports.v32 = * (__force u32 *) (skb->data + (ihl * 4));
+			if (ports.v16[1] < ports.v16[0])
+				swap(ports.v16[0], ports.v16[1]);
+			break;
 		}
-		break;
-
 	default:
+		ports.v32 = 0;
 		break;
 	}
 
 	/* get a consistent hash (same value on both flow directions) */
 	if (addr2 < addr1)
 		swap(addr1, addr2);
-	skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd);
+	skb->rxhash = jhash_3words(addr1, addr2, ports.v32, hashrnd);
 	if (!skb->rxhash)
 		skb->rxhash = 1;
 

^ permalink raw reply related

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: jamal @ 2010-04-24 16:04 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, Tom Herbert, Eric Dumazet, netdev
In-Reply-To: <1272122227-13070-1-git-send-email-xiaosuo@gmail.com>

By the time you hit this code (at least on machines that make sense for
RPS), you already have the ethernet header, IP header and transport
ports in cache, no?
I think the sport << 16 shifting is avoided - but I don't think there's
any effect on mem access.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: Changli Gao @ 2010-04-24 16:19 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, Tom Herbert, Eric Dumazet, netdev
In-Reply-To: <1272125052.8918.18.camel@bigi>

On Sun, Apr 25, 2010 at 12:04 AM, jamal <hadi@cyberus.ca> wrote:
>
> By the time you hit this code (at least on machines that make sense for
> RPS), you already have the ethernet header, IP header and transport
> ports in cache, no?
> I think the sport << 16 shifting is avoided - but I don't think there's
> any effect on mem access.

Maybe I have used the wrong word. Sorry. If the ports are already in
cache, the new code has only one cache access for ports, and the later
operations are in registers.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: jamal @ 2010-04-24 16:22 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, Tom Herbert, Eric Dumazet, netdev
In-Reply-To: <q2i412e6f7f1004240919la9825819i81a4d3063ed781a5@mail.gmail.com>

On Sun, 2010-04-25 at 00:19 +0800, Changli Gao wrote:

> Maybe I have used the wrong word. Sorry. If the ports are already in
> cache, the new code has only one cache access for ports, and the later
> operations are in registers.

Ok, that makes more sense - so your commit log is confusing.
You are saving a shift operation per packet - probably not a big deal
but better than zero ;->

cheers,
jamal


^ permalink raw reply

* [patch v2] sctp: cleanup: remove duplicate assignment
From: Dan Carpenter @ 2010-04-24 17:19 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: Sridhar Samudrala, David S. Miller, Wei Yongjun, Chris Dischino,
	linux-sctp, netdev, kernel-janitors
In-Reply-To: <4BD1AE9D.6090807@hp.com>

This assignment isn't needed because we did it earlier already.

Another reason to delete the assignment is that it triggers a
Smatch warning about checking for NULL pointers after a dereference.

Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Dan Carpenter <error27@gmail.com>
---
Thanks Vlad.  I came so close to seeing that myself if only I had opened
my eyes a tiny bit more.  :P

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 17cb400..33aed1c 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -419,10 +419,17 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
 	if (!retval)
 		goto nomem_chunk;
 
-	/* Per the advice in RFC 2960 6.4, send this reply to
-	 * the source of the INIT packet.
+	/* RFC 2960 6.4 Multi-homed SCTP Endpoints
+	 *
+	 * An endpoint SHOULD transmit reply chunks (e.g., SACK,
+	 * HEARTBEAT ACK, * etc.) to the same destination transport
+	 * address from which it received the DATA or control chunk
+	 * to which it is replying.
+	 *
+	 * [INIT ACK back to where the INIT came from.]
 	 */
 	retval->transport = chunk->transport;
+
 	retval->subh.init_hdr =
 		sctp_addto_chunk(retval, sizeof(initack), &initack);
 	retval->param_hdr.v = sctp_addto_chunk(retval, addrs_len, addrs.v);
@@ -461,18 +468,6 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
 	/* We need to remove the const qualifier at this point.  */
 	retval->asoc = (struct sctp_association *) asoc;
 
-	/* RFC 2960 6.4 Multi-homed SCTP Endpoints
-	 *
-	 * An endpoint SHOULD transmit reply chunks (e.g., SACK,
-	 * HEARTBEAT ACK, * etc.) to the same destination transport
-	 * address from which it received the DATA or control chunk
-	 * to which it is replying.
-	 *
-	 * [INIT ACK back to where the INIT came from.]
-	 */
-	if (chunk)
-		retval->transport = chunk->transport;
-
 nomem_chunk:
 	kfree(cookie);
 nomem_cookie:

^ permalink raw reply related

* Re: [PATCHv5] add mergeable receiver buffers support to vhost
From: Michael S. Tsirkin @ 2010-04-24 19:07 UTC (permalink / raw)
  To: David L Stevens; +Cc: rusty, kvm, virtualization, netdev
In-Reply-To: <1272053205.3114.3.camel@lab1.dls>

On Fri, Apr 23, 2010 at 01:06:45PM -0700, David L Stevens wrote:
> This patch adds mergeable receive buffers support to vhost.
> 
> Signed-off-by: David L Stevens <dlstevens@us.ibm.com>

It seems the logging is wrong. Did you test live migration? Please do.
I think the reason for the bug could be that you did some cut and paste
from code that was there before 86e9424d7252bae5ad1c17b4b8088193e6b27cbe.
So I put a suggestion on reducing this duplication a bit, below.

Also, I think this patch adds sparse errors: some __user annotations
seem to be missing.  Could you please make your patch apply
on top of patch 'vhost: fix sparse warnings' from
Christoph Hellwig, and then make sure your patch
does not add new sparse errors?

I also wanted to make some coding style tweaks so that the patch
matches the style of the rest of the code. I could do them myself,
but since there are these issues and we need another round anyway,
I put them in comments in the mail below.

Thanks!

> diff -ruNp net-next-v0/drivers/vhost/net.c net-next-v5/drivers/vhost/net.c
> --- net-next-v0/drivers/vhost/net.c	2010-04-22 11:31:57.000000000 -0700
> +++ net-next-v5/drivers/vhost/net.c	2010-04-22 12:41:17.000000000 -0700
> @@ -109,7 +109,7 @@ static void handle_tx(struct vhost_net *
>  	};
>  	size_t len, total_len = 0;
>  	int err, wmem;
> -	size_t hdr_size;
> +	size_t vhost_hlen;
>  	struct socket *sock = rcu_dereference(vq->private_data);
>  	if (!sock)
>  		return;
> @@ -128,13 +128,13 @@ static void handle_tx(struct vhost_net *
>  
>  	if (wmem < sock->sk->sk_sndbuf / 2)
>  		tx_poll_stop(net);
> -	hdr_size = vq->hdr_size;
> +	vhost_hlen = vq->vhost_hlen;
>  
>  	for (;;) {
> -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> -					 ARRAY_SIZE(vq->iov),
> -					 &out, &in,
> -					 NULL, NULL);
> +		head = vhost_get_desc(&net->dev, vq, vq->iov,
> +				      ARRAY_SIZE(vq->iov),
> +				      &out, &in,
> +				      NULL, NULL);
>  		/* Nothing new?  Wait for eventfd to tell us they refilled. */
>  		if (head == vq->num) {
>  			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
> @@ -155,20 +155,20 @@ static void handle_tx(struct vhost_net *
>  			break;
>  		}
>  		/* Skip header. TODO: support TSO. */
> -		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> +		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, out);
>  		msg.msg_iovlen = out;
>  		len = iov_length(vq->iov, out);
>  		/* Sanity check */
>  		if (!len) {
>  			vq_err(vq, "Unexpected header len for TX: "
>  			       "%zd expected %zd\n",
> -			       iov_length(vq->hdr, s), hdr_size);
> +			       iov_length(vq->hdr, s), vhost_hlen);
>  			break;
>  		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>  		if (unlikely(err < 0)) {
> -			vhost_discard_vq_desc(vq);
> +			vhost_discard_desc(vq, 1);
>  			tx_poll_start(net, sock);
>  			break;
>  		}
> @@ -187,12 +187,25 @@ static void handle_tx(struct vhost_net *
>  	unuse_mm(net->dev.mm);
>  }
>  
> +static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
> +{
> +	struct sk_buff *head;
> +	int len = 0;
> +
> +	lock_sock(sk);
> +	head = skb_peek(&sk->sk_receive_queue);
> +	if (head)
> +		len = head->len + vq->sock_hlen;
> +	release_sock(sk);
> +	return len;
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_rx(struct vhost_net *net)
>  {
>  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> -	unsigned head, out, in, log, s;
> +	unsigned in, log, s;
>  	struct vhost_log *vq_log;
>  	struct msghdr msg = {
>  		.msg_name = NULL,
> @@ -203,14 +216,14 @@ static void handle_rx(struct vhost_net *
>  		.msg_flags = MSG_DONTWAIT,
>  	};
>  
> -	struct virtio_net_hdr hdr = {
> -		.flags = 0,
> -		.gso_type = VIRTIO_NET_HDR_GSO_NONE
> +	struct virtio_net_hdr_mrg_rxbuf hdr = {
> +		.hdr.flags = 0,
> +		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
>  	};
>  
>  	size_t len, total_len = 0;
> -	int err;
> -	size_t hdr_size;
> +	int err, headcount, datalen;
> +	size_t vhost_hlen;
>  	struct socket *sock = rcu_dereference(vq->private_data);
>  	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
>  		return;
> @@ -218,18 +231,18 @@ static void handle_rx(struct vhost_net *
>  	use_mm(net->dev.mm);
>  	mutex_lock(&vq->mutex);
>  	vhost_disable_notify(vq);
> -	hdr_size = vq->hdr_size;
> +	vhost_hlen = vq->vhost_hlen;
>  
>  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>  		vq->log : NULL;
>  
> -	for (;;) {
> -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> -					 ARRAY_SIZE(vq->iov),
> -					 &out, &in,
> -					 vq_log, &log);
> +	while ((datalen = vhost_head_len(vq, sock->sk))) {
> +		headcount = vhost_get_desc_n(vq, vq->heads, datalen+vhost_hlen,

checkpatch does not catch this but please add spaces around +.

> +					     &in, vq_log, &log);
> +		if (headcount < 0)
> +			break;
>  		/* OK, now we need to know about added descriptors. */
> -		if (head == vq->num) {
> +		if (!headcount) {
>  			if (unlikely(vhost_enable_notify(vq))) {
>  				/* They have slipped one in as we were
>  				 * doing that: check again. */
> @@ -241,46 +254,54 @@ static void handle_rx(struct vhost_net *
>  			break;
>  		}
>  		/* We don't need to be notified again. */
> -		if (out) {
> -			vq_err(vq, "Unexpected descriptor format for RX: "
> -			       "out %d, int %d\n",
> -			       out, in);
> -			break;
> -		}
> -		/* Skip header. TODO: support TSO/mergeable rx buffers. */
> -		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> +		/* Skip header. TODO: support TSO. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
>  		msg.msg_iovlen = in;
>  		len = iov_length(vq->iov, in);
>  		/* Sanity check */
>  		if (!len) {
>  			vq_err(vq, "Unexpected header len for RX: "
>  			       "%zd expected %zd\n",
> -			       iov_length(vq->hdr, s), hdr_size);
> +			       iov_length(vq->hdr, s), vhost_hlen);
>  			break;
>  		}
>  		err = sock->ops->recvmsg(NULL, sock, &msg,
>  					 len, MSG_DONTWAIT | MSG_TRUNC);
>  		/* TODO: Check specific error and bomb out unless EAGAIN? */
>  		if (err < 0) {
> -			vhost_discard_vq_desc(vq);
> +			vhost_discard_desc(vq, headcount);
>  			break;
>  		}
> -		/* TODO: Should check and handle checksum. */
> -		if (err > len) {
> -			pr_err("Discarded truncated rx packet: "
> -			       " len %d > %zd\n", err, len);
> -			vhost_discard_vq_desc(vq);
> +		if (err != datalen) {
> +			pr_err("Discarded rx packet: "
> +			       " len %d, expected %zd\n", err, datalen);
> +			vhost_discard_desc(vq, headcount);
>  			continue;
>  		}
>  		len = err;
> -		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
> +		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr,
> +				     vhost_hlen);
>  		if (err) {
>  			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
>  			       vq->iov->iov_base, err);
>  			break;
>  		}
> -		len += hdr_size;
> -		vhost_add_used_and_signal(&net->dev, vq, head, len);
> +		/* TODO: Should check and handle checksum. */
> +		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
> +			struct virtio_net_hdr_mrg_rxbuf hdr;
> +			struct iovec *iov = vhost_hlen ? vq->hdr : vq->iov;
> +
> +			if (memcpy_toiovecend(iov, (unsigned char *)&headcount,
> +				      offsetof(typeof(hdr), num_buffers),
> +				      sizeof(hdr.num_buffers))) {
> +				vq_err(vq, "Failed num_buffers write");
> +				vhost_discard_desc(vq, headcount);
> +				break;
> +			}
> +		}
> +		len += vhost_hlen;
> +		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> +					    headcount);
>  		if (unlikely(vq_log))
>  			vhost_log_write(vq, vq_log, log, len);
>  		total_len += len;
> @@ -561,9 +582,24 @@ done:
>  
>  static int vhost_net_set_features(struct vhost_net *n, u64 features)
>  {
> -	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
> -		sizeof(struct virtio_net_hdr) : 0;
> +	size_t vhost_hlen;
> +	size_t sock_hlen;
>  	int i;
> +
> +	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
> +		/* vhost provides vnet_hdr */
> +		vhost_hlen = sizeof(struct virtio_net_hdr);
> +		if (features & (1 << VIRTIO_NET_F_MRG_RXBUF))
> +			vhost_hlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +		sock_hlen = 0;
> +	} else {
> +		/* socket provides vnet_hdr */
> +		vhost_hlen = 0;
> +		if (features & (1 << VIRTIO_NET_F_MRG_RXBUF))
> +			sock_hlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +		else
> +			sock_hlen = sizeof(struct virtio_net_hdr);
> +	}
>  	mutex_lock(&n->dev.mutex);
>  	if ((features & (1 << VHOST_F_LOG_ALL)) &&
>  	    !vhost_log_access_ok(&n->dev)) {
> @@ -574,7 +610,8 @@ static int vhost_net_set_features(struct
>  	smp_wmb();
>  	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
>  		mutex_lock(&n->vqs[i].mutex);
> -		n->vqs[i].hdr_size = hdr_size;
> +		n->vqs[i].vhost_hlen = vhost_hlen;
> +		n->vqs[i].sock_hlen = sock_hlen;
>  		mutex_unlock(&n->vqs[i].mutex);
>  	}
>  	vhost_net_flush(n);
> diff -ruNp net-next-v0/drivers/vhost/vhost.c net-next-v5/drivers/vhost/vhost.c
> --- net-next-v0/drivers/vhost/vhost.c	2010-04-22 11:31:57.000000000 -0700
> +++ net-next-v5/drivers/vhost/vhost.c	2010-04-22 12:19:59.000000000 -0700
> @@ -114,7 +114,8 @@ static void vhost_vq_reset(struct vhost_
>  	vq->used_flags = 0;
>  	vq->log_used = false;
>  	vq->log_addr = -1ull;
> -	vq->hdr_size = 0;
> +	vq->vhost_hlen = 0;
> +	vq->sock_hlen = 0;
>  	vq->private_data = NULL;
>  	vq->log_base = NULL;
>  	vq->error_ctx = NULL;
> @@ -861,6 +862,53 @@ static unsigned get_indirect(struct vhos
>  	return 0;
>  }
>  
> +/* This is a multi-buffer version of vhost_get_vq_desc
> + * @vq		- the relevant virtqueue
> + * datalen	- data length we'll be reading
> + * @iovcount	- returned count of io vectors we fill
> + * @log		- vhost log
> + * @log_num	- log offset
> + *	returns number of buffer heads allocated, negative on error
> + */
> +int vhost_get_desc_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
> +		     int datalen, int *iovcount, struct vhost_log *log,
> +		     unsigned int *log_num)
> +{
> +	int out, in;
> +	int seg = 0;		/* iov index */

In rest of code, I always put comments above the commented code,
not to the right of it.

> +	int hc = 0;		/* head count */

Maybe just call the variable head_count? That'd make the
comment unnecessary.

> +	int rv;

The rest of the code uses r or ret for such variables.

> +
> +	while (datalen > 0) {
> +		if (hc >= VHOST_NET_MAX_SG) {
> +			rv = -ENOBUFS;
> +			goto err;
> +		}
> +		heads[hc].id = vhost_get_desc(vq->dev, vq, vq->iov+seg,
> +					      ARRAY_SIZE(vq->iov)-seg, &out,

same here, spaces around + and -

> +					      &in, log, log_num);
> +		if (heads[hc].id == vq->num) {
> +			rv = 0;
> +			goto err;
> +		}
> +		if (out || in <= 0) {
> +			vq_err(vq, "unexpected descriptor format for RX: "
> +				"out %d, in %d\n", out, in);
> +			rv = -EINVAL;
> +			goto err;
> +		}
> +		heads[hc].len = iov_length(vq->iov+seg, in);

and here

> +		datalen -= heads[hc].len;
> +		hc++;

I use ++x in the rest of the code where I don't care for
the old value.

> +		seg += in;
> +	}
> +	*iovcount = seg;
> +	return hc;
> +err:
> +	vhost_discard_desc(vq, hc);
> +	return rv;
> +}
> +
>  /* This looks in the virtqueue and for the first available buffer, and converts
>   * it to an iovec for convenient access.  Since descriptors consist of some
>   * number of output then some number of input descriptors, it's actually two
> @@ -868,7 +916,7 @@ static unsigned get_indirect(struct vhos
>   *
>   * This function returns the descriptor number found, or vq->num (which
>   * is never a valid descriptor number) if none was found. */
> -unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> +unsigned vhost_get_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
>  			   struct iovec iov[], unsigned int iov_size,
>  			   unsigned int *out_num, unsigned int *in_num,
>  			   struct vhost_log *log, unsigned int *log_num)
> @@ -986,9 +1034,9 @@ unsigned vhost_get_vq_desc(struct vhost_
>  }
>  
>  /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> -void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
> +void vhost_discard_desc(struct vhost_virtqueue *vq, int n)
>  {
> -	vq->last_avail_idx--;
> +	vq->last_avail_idx -= n;
>  }
>  
>  /* After we've used one of their buffers, we tell them about it.  We'll then
> @@ -1017,6 +1065,54 @@ int vhost_add_used(struct vhost_virtqueu
>  	if (unlikely(vq->log_used)) {
>  		/* Make sure data is seen before log. */
>  		smp_wmb();
> +		log_write(vq->log_base, vq->log_addr + sizeof *vq->used->ring *
> +			  (vq->last_used_idx % vq->num),
> +			  sizeof *vq->used->ring);
> +		log_write(vq->log_base, vq->log_addr, sizeof *vq->used->ring);
> +		if (vq->log_ctx)
> +			eventfd_signal(vq->log_ctx, 1);
> +	}
> +	vq->last_used_idx++;
> +	return 0;

The above looks like a copy from an old version of vhost.


> +}
> +
> +/* After we've used one of their buffers, we tell them about it.  We'll then
> + * want to notify the guest, using eventfd. */
> +int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
> +		   int count)

please align int at least to the right of (.

> +{
> +	struct vring_used_elem *used;
> +	int start, n;
> +
> +	if (count <= 0)
> +		return -EINVAL;

Is the above necessary?
Just make count unsigned for clarity?

> +
> +	start = vq->last_used_idx % vq->num;
> +	if (vq->num - start < count)
> +		n = vq->num - start;
> +	else
> +		n = count;

I'd say use min, or reorder code as I suggest below
to only have a single if.

> +	used = vq->used->ring + start;
> +	if (copy_to_user(used, heads, sizeof(heads[0])*n)) {

Pls put spaces around *
Also I'd prefer sizeof *used here instead of sizeof(heads[0]),
it's shorter. My style is also to put () after sizeof only
if the argument is a type, and order the expression
to make precedence not matter. So we'd end up with:

	n * sizeof *used.

> +		vq_err(vq, "Failed to write used");
> +		return -EFAULT;
> +	}
> +	if (n < count) {	/* wrapped the ring */

In the rest of the code, I always put comments above the commented code,
not to the right of it.

The only case where we can have n < count is wrap-around,
so I find this a roundabout way to code this up, and we
end up testing for the same condition twice, which makes the
code fragile (ignoring the performance impact).
Also note the need for extra logging, as explained below.

Maybe it's cleanest to have static __vhost_add_used_n which assumes
no wrap-around, and then just call:

vhost_add_used_n()
	if (unlikely(n < count)) {
		if (r = __vhost_add_used_n())
			return r;
		heads += n;
		count -= n;
	}
	return __vhost_add_used_n()

this would do an extra write into the used index,
but this is almost free.
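
To make the suggestion above concrete, here is a minimal userspace sketch
of the split, assuming a plain in-memory ring instead of the guest-visible
used ring. The names struct ring, ring_add_n and __ring_add_n are
hypothetical stand-ins, memcpy replaces copy_to_user, and memory barriers
and logging are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct vring_used_elem. */
struct elem {
	unsigned int id;
	unsigned int len;
};

/* Hypothetical stand-in for the used ring plus last_used_idx. */
struct ring {
	struct elem *slots;
	unsigned int num;       /* number of slots in the ring */
	unsigned int last_used; /* free-running index, like last_used_idx */
	unsigned int idx;       /* published index, like used->idx */
};

/* Copy count entries that are known not to cross the end of the ring,
 * then publish the new index. */
static int __ring_add_n(struct ring *r, const struct elem *heads,
			unsigned int count)
{
	unsigned int start = r->last_used % r->num;

	assert(count <= r->num - start);
	memcpy(r->slots + start, heads, count * sizeof *heads);
	r->last_used += count;
	r->idx = r->last_used;
	return 0;
}

/* Split a batch at the end of the ring so that each helper call is
 * contiguous. Assumes count <= r->num. */
static int ring_add_n(struct ring *r, const struct elem *heads,
		      unsigned int count)
{
	unsigned int n = r->num - r->last_used % r->num;
	int ret;

	if (n < count) {
		ret = __ring_add_n(r, heads, n);
		if (ret)
			return ret;
		heads += n;
		count -= n;
	}
	return __ring_add_n(r, heads, count);
}
```

With a 4-slot ring and last_used at 3, adding three entries fills slot 3
first and then slots 0 and 1. The index is published twice in that case,
which is the extra write mentioned above.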


> +		used = vq->used->ring;
> +		if (copy_to_user(used, heads+n, sizeof(heads[0])*(count-n))) {

spaces around + and *

> +			vq_err(vq, "Failed to write used");
> +			return -EFAULT;
> +		}
> +	}
> +	/* Make sure buffer is written before we update index. */
> +	smp_wmb();
> +	if (put_user(vq->last_used_idx+count, &vq->used->idx)) {

and here

> +		vq_err(vq, "Failed to increment used idx");
> +		return -EFAULT;
> +	}
> +	if (unlikely(vq->log_used)) {
> +		/* Make sure data is seen before log. */
> +		smp_wmb();
>  		/* Log used ring entry write. */
>  		log_write(vq->log_base,
>  			  vq->log_addr +

this uses 'used' pointer, but I think it's wrong in case
of wrap-around. I think we'll need extra logging for
wrap-around. That would be 3 copies of identical code:
maybe add a function?

static void vhost_log_used(struct vhost_virtqueue *vq,
			   struct vring_used_elem __user *used)
{
	/* Make sure data is seen before log. */
	smp_wmb();
	/* Log used ring entry write. */
	log_write(vq->log_base,
		  vq->log_addr +
		   ((void __user *)used - (void __user *)vq->used),
		  sizeof *used);
	/* Log used index update. */
	log_write(vq->log_base,
		  vq->log_addr + offsetof(struct vring_used, idx),
		  sizeof vq->used->idx);
	if (vq->log_ctx)
		eventfd_signal(vq->log_ctx, 1);
}
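
As a rough illustration of why wrap-around needs extra logging: a batch
that crosses the end of the ring dirties two separate byte ranges of the
used ring, so one log_write based on a single 'used' pointer cannot cover
both. The sketch below computes those ranges in userspace. struct range,
used_ranges and the ring_off/elem_size parameters are hypothetical and
are not part of vhost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte range inside the used ring structure. */
struct range {
	size_t off;
	size_t len;
};

/* Fill out[] with the byte ranges touched when count entries are written
 * starting at free-running index last_used in a ring of num entries.
 * ring_off is the offset of the entry array from the start of the
 * structure and elem_size is the size of one entry.
 * Returns the number of ranges: 1 without wrap-around, 2 with it. */
static int used_ranges(unsigned int last_used, unsigned int num,
		       unsigned int count, size_t ring_off,
		       size_t elem_size, struct range out[2])
{
	unsigned int start = last_used % num;
	unsigned int n = num - start;

	assert(count > 0 && count <= num);

	out[0].off = ring_off + start * elem_size;
	if (n >= count) {
		/* The whole batch fits before the end of the ring. */
		out[0].len = count * elem_size;
		return 1;
	}
	/* Tail of the ring first, then the remainder from slot 0. */
	out[0].len = n * elem_size;
	out[1].off = ring_off;
	out[1].len = (count - n) * elem_size;
	return 2;
}
```

For the vring layout, where the entry array follows two 16-bit fields and
each entry holds two 32-bit words, ring_off would be 4 and elem_size 8.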


> @@ -1029,7 +1125,7 @@ int vhost_add_used(struct vhost_virtqueu
>  		if (vq->log_ctx)
>  			eventfd_signal(vq->log_ctx, 1);
>  	}
> -	vq->last_used_idx++;
> +	vq->last_used_idx += count;
>  	return 0;
>  }
>  
> @@ -1062,6 +1158,15 @@ void vhost_add_used_and_signal(struct vh
>  	vhost_signal(dev, vq);
>  }
>  
> +/* multi-buffer version of vhost_add_used_and_signal */
> +void vhost_add_used_and_signal_n(struct vhost_dev *dev,
> +				 struct vhost_virtqueue *vq,
> +				 struct vring_used_elem *heads, int count)
> +{
> +	vhost_add_used_n(vq, heads, count);
> +	vhost_signal(dev, vq);
> +}
> +
>  /* OK, now we need to know about added descriptors. */
>  bool vhost_enable_notify(struct vhost_virtqueue *vq)
>  {
> @@ -1086,7 +1191,7 @@ bool vhost_enable_notify(struct vhost_vi
>  		return false;
>  	}
>  
> -	return avail_idx != vq->last_avail_idx;
> +	return avail_idx != vq->avail_idx;
>  }
>  
>  /* We don't need to be notified again. */
> diff -ruNp net-next-v0/drivers/vhost/vhost.h net-next-v5/drivers/vhost/vhost.h
> --- net-next-v0/drivers/vhost/vhost.h	2010-03-22 12:04:38.000000000 -0700
> +++ net-next-v5/drivers/vhost/vhost.h	2010-04-22 11:35:54.000000000 -0700
> @@ -84,7 +84,9 @@ struct vhost_virtqueue {
>  	struct iovec indirect[VHOST_NET_MAX_SG];
>  	struct iovec iov[VHOST_NET_MAX_SG];
>  	struct iovec hdr[VHOST_NET_MAX_SG];
> -	size_t hdr_size;
> +	size_t vhost_hlen;
> +	size_t sock_hlen;
> +	struct vring_used_elem heads[VHOST_NET_MAX_SG];
>  	/* We use a kind of RCU to access private pointer.
>  	 * All readers access it from workqueue, which makes it possible to
>  	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
> @@ -120,16 +122,23 @@ long vhost_dev_ioctl(struct vhost_dev *,
>  int vhost_vq_access_ok(struct vhost_virtqueue *vq);
>  int vhost_log_access_ok(struct vhost_dev *);
>  
> -unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
> +int vhost_get_desc_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
> +		     int datalen, int *iovcount, struct vhost_log *log,
> +		     unsigned int *log_num);
> +unsigned vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
>  			   struct iovec iov[], unsigned int iov_count,
>  			   unsigned int *out_num, unsigned int *in_num,
>  			   struct vhost_log *log, unsigned int *log_num);
> -void vhost_discard_vq_desc(struct vhost_virtqueue *);
> +void vhost_discard_desc(struct vhost_virtqueue *, int);
>  
>  int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
> -void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
> +int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
> +		    int count);
>  void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
> -			       unsigned int head, int len);
> +			       unsigned int id, int len);
> +void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
> +			       struct vring_used_elem *heads, int count);
> +void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
>  void vhost_disable_notify(struct vhost_virtqueue *);
>  bool vhost_enable_notify(struct vhost_virtqueue *);
>  
> @@ -149,7 +158,8 @@ enum {
>  	VHOST_FEATURES = (1 << VIRTIO_F_NOTIFY_ON_EMPTY) |
>  			 (1 << VIRTIO_RING_F_INDIRECT_DESC) |
>  			 (1 << VHOST_F_LOG_ALL) |
> -			 (1 << VHOST_NET_F_VIRTIO_NET_HDR),
> +			 (1 << VHOST_NET_F_VIRTIO_NET_HDR) |
> +			 (1 << VIRTIO_NET_F_MRG_RXBUF),
>  };
>  
>  static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> 

^ permalink raw reply

* Re: [RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.
From: Michael S. Tsirkin @ 2010-04-24 19:32 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: arnd, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <1272006513-5188-1-git-send-email-xiaohui.xin@intel.com>

On Fri, Apr 23, 2010 at 03:08:33PM +0800, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> The vhost-net backend now only supports synchronous send/recv
> operations. The patch provides multiple submits and asynchronous
> notifications. This is needed for zero-copy case.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> ---
> 
> Michael,
> >>>Can't vhost supply a kiocb completion callback that will handle the list?
> >>Yes, thanks. And with it I also remove the vq->receivr finally.
> >>Thanks
> >>Xiaohui
> 
> >Nice progress. I commented on some minor issues below.
> >Thanks!
> 
> The updated patch addressed your comments on the minor issues.
> Thanks!
> 
> Thanks
> Xiaohui  
> 
>  drivers/vhost/net.c   |  236 +++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.c |  120 ++++++++++++++-----------
>  drivers/vhost/vhost.h |   14 +++
>  3 files changed, 314 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 38989d1..18f6c41 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -23,6 +23,8 @@
>  #include <linux/if_arp.h>
>  #include <linux/if_tun.h>
>  #include <linux/if_macvlan.h>
> +#include <linux/mpassthru.h>
> +#include <linux/aio.h>
>  
>  #include <net/sock.h>
>  
> @@ -48,6 +50,7 @@ struct vhost_net {
>  	struct vhost_dev dev;
>  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
>  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> +	struct kmem_cache       *cache;
>  	/* Tells us whether we are polling a socket for TX.
>  	 * We only do this when socket buffer fills up.
>  	 * Protected by tx vq lock. */
> @@ -92,11 +95,138 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
>  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
>  }
>  
> +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	if (!list_empty(&vq->notifier)) {
> +		iocb = list_first_entry(&vq->notifier,
> +				struct kiocb, ki_list);
> +		list_del(&iocb->ki_list);
> +	}
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> +	return iocb;
> +}
> +
> +static void handle_iocb(struct kiocb *iocb)
> +{
> +	struct vhost_virtqueue *vq = iocb->private;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&vq->notify_lock, flags);
> +	list_add_tail(&iocb->ki_list, &vq->notifier);
> +	spin_unlock_irqrestore(&vq->notify_lock, flags);

Don't we need to wake up the wq as well?

> +}
> +
> +static int is_async_vq(struct vhost_virtqueue *vq)
> +{
> +	return (vq->link_state == VHOST_VQ_LINK_ASYNC);

() not needed

> +}
> +
> +static void handle_async_rx_events_notify(struct vhost_net *net,
> +					  struct vhost_virtqueue *vq,
> +					  struct socket *sock)
> +{
> +	struct kiocb *iocb = NULL;
> +	struct vhost_log *vq_log = NULL;
> +	int rx_total_len = 0;
> +	unsigned int head, log, in, out;
> +	int size;
> +
> +	if (!is_async_vq(vq))
> +		return;
> +
> +	if (sock->sk->sk_data_ready)
> +		sock->sk->sk_data_ready(sock->sk, 0);
> +
> +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> +		vq->log : NULL;
> +
> +	while ((iocb = notify_dequeue(vq)) != NULL) {
> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, iocb->ki_nbytes);
> +		size = iocb->ki_nbytes;
> +		head = iocb->ki_pos;
> +		rx_total_len += iocb->ki_nbytes;
> +
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);

I am confused by the above. Isn't ki_dtor handle_iocb?
Why is it called here?

> +		kmem_cache_free(net->cache, iocb);
> +
> +		/* when log is enabled, recomputing the log info is needed,
> +		 * since these buffers are in async queue, and may not get
> +		 * the log info before.
> +		 */
> +		if (unlikely(vq_log)) {
> +			if (!log)

log is uninitialized now?

> +				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
> +						    ARRAY_SIZE(vq->iov),
> +						    &out, &in, vq_log,
> +						    &log, head);
> +			vhost_log_write(vq, vq_log, log, size);
> +		}
> +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
> +static void handle_async_tx_events_notify(struct vhost_net *net,
> +					  struct vhost_virtqueue *vq)
> +{
> +	struct kiocb *iocb = NULL;
> +	int tx_total_len = 0;
> +
> +	if (!is_async_vq(vq))
> +		return;
> +
> +	while ((iocb = notify_dequeue(vq)) != NULL) {

Please just write this as while ((iocb = notify_dequeue(vq)))
above as well
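
For reference, the assignment-in-condition form looks like this in a
self-contained userspace sketch. struct node, dequeue and drain are
hypothetical; locking is omitted, and the budget check only mimics the
VHOST_NET_WEIGHT cut-off in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue entry standing in for struct kiocb. */
struct node {
	struct node *next;
	int nbytes;
};

/* Detach and return the first entry, or NULL if the queue is empty. */
static struct node *dequeue(struct node **head)
{
	struct node *n = *head;

	if (n)
		*head = n->next;
	return n;
}

/* Drain entries until the queue is empty or at least budget bytes were
 * handled. Returns the number of bytes handled. */
static int drain(struct node **head, int budget)
{
	struct node *n;
	int total = 0;

	assert(head != NULL);

	/* The extra parentheses mark the assignment as intentional. */
	while ((n = dequeue(head))) {
		total += n->nbytes;
		if (total >= budget)
			break;
	}
	return total;
}
```

Entries left on the queue when the budget is hit stay there for the next
pass, which is what the vhost_poll_queue call before the break arranges
in the patch.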

> +		vhost_add_used_and_signal(&net->dev, vq,
> +				iocb->ki_pos, 0);

pls indent continuation lines to the right of (
above as well

> +		tx_total_len += iocb->ki_nbytes;
> +
> +		if (iocb->ki_dtor)
> +			iocb->ki_dtor(iocb);

same question as above

> +
> +		kmem_cache_free(net->cache, iocb);
> +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> +			vhost_poll_queue(&vq->poll);
> +			break;
> +		}
> +	}
> +}
> +
> +static struct kiocb *create_iocb(struct vhost_net *net,
> +				 struct vhost_virtqueue *vq,
> +				 unsigned head)
> +{
> +	struct kiocb *iocb = NULL;
> +
> +	if (!is_async_vq(vq))
> +		return NULL;
> +
> +	iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> +	if (!iocb)
> +		return NULL;
> +	iocb->private = vq;
> +	iocb->ki_pos = head;
> +	iocb->ki_dtor = handle_iocb;

So, dtor calls handle_iocb, but what causes vhost
to wake-up is really poll, right?


> +	if (vq == &net->dev.vqs[VHOST_NET_VQ_RX]) {
> +		iocb->ki_user_data = vq->num;

Is the above used?

> +		iocb->ki_iovec = vq->hdr;
> +	}
> +	return iocb;
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
>  {
>  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	struct kiocb *iocb = NULL;

Why do we need to init iocb to NULL?

>  	unsigned head, out, in, s;
>  	struct msghdr msg = {
>  		.msg_name = NULL,
> @@ -129,6 +259,8 @@ static void handle_tx(struct vhost_net *net)
>  		tx_poll_stop(net);
>  	hdr_size = vq->hdr_size;
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -156,6 +288,13 @@ static void handle_tx(struct vhost_net *net)
>  		/* Skip header. TODO: support TSO. */
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
>  		msg.msg_iovlen = out;
> +
> +		if (is_async_vq(vq)) {
> +			iocb = create_iocb(net, vq, head);
> +			if (!iocb)
> +				break;
> +		}
> +
>  		len = iov_length(vq->iov, out);
>  		/* Sanity check */
>  		if (!len) {
> @@ -165,12 +304,18 @@ static void handle_tx(struct vhost_net *net)
>  			break;
>  		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
>  		if (unlikely(err < 0)) {
> +			if (is_async_vq(vq))
> +				kmem_cache_free(net->cache, iocb);
>  			vhost_discard_vq_desc(vq);
>  			tx_poll_start(net, sock);
>  			break;
>  		}
> +
> +		if (is_async_vq(vq))
> +			continue;
> +
>  		if (err != len)
>  			pr_err("Truncated TX packet: "
>  			       " len %d != %zd\n", err, len);
> @@ -182,6 +327,8 @@ static void handle_tx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_tx_events_notify(net, vq);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
> @@ -191,6 +338,7 @@ static void handle_tx(struct vhost_net *net)
>  static void handle_rx(struct vhost_net *net)
>  {
>  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> +	struct kiocb *iocb = NULL;
>  	unsigned head, out, in, log, s;
>  	struct vhost_log *vq_log;
>  	struct msghdr msg = {
> @@ -211,7 +359,8 @@ static void handle_rx(struct vhost_net *net)
>  	int err;
>  	size_t hdr_size;
>  	struct socket *sock = rcu_dereference(vq->private_data);
> -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> +			vq->link_state == VHOST_VQ_LINK_SYNC))
>  		return;
>  
>  	use_mm(net->dev.mm);
> @@ -219,9 +368,17 @@ static void handle_rx(struct vhost_net *net)
>  	vhost_disable_notify(vq);
>  	hdr_size = vq->hdr_size;
>  
> +	/* In async cases, when write log is enabled, in case the submitted
> +	 * buffers did not get log info before the log enabling, so we'd
> +	 * better recompute the log info when needed. We do this in
> +	 * handle_async_rx_events_notify().
> +	 */
> +
>  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>  		vq->log : NULL;
>  
> +	handle_async_rx_events_notify(net, vq, sock);
> +
>  	for (;;) {
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
> @@ -250,6 +407,13 @@ static void handle_rx(struct vhost_net *net)
>  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
>  		msg.msg_iovlen = in;
>  		len = iov_length(vq->iov, in);
> +
> +		if (is_async_vq(vq)) {
> +			iocb = create_iocb(net, vq, head);
> +			if (!iocb)
> +				break;
> +		}
> +
>  		/* Sanity check */
>  		if (!len) {
>  			vq_err(vq, "Unexpected header len for RX: "
> @@ -257,13 +421,20 @@ static void handle_rx(struct vhost_net *net)
>  			       iov_length(vq->hdr, s), hdr_size);
>  			break;
>  		}
> -		err = sock->ops->recvmsg(NULL, sock, &msg,
> +
> +		err = sock->ops->recvmsg(iocb, sock, &msg,
>  					 len, MSG_DONTWAIT | MSG_TRUNC);
>  		/* TODO: Check specific error and bomb out unless EAGAIN? */
>  		if (err < 0) {
> +			if (is_async_vq(vq))
> +				kmem_cache_free(net->cache, iocb);
>  			vhost_discard_vq_desc(vq);
>  			break;
>  		}
> +
> +		if (is_async_vq(vq))
> +			continue;
> +
>  		/* TODO: Should check and handle checksum. */
>  		if (err > len) {
>  			pr_err("Discarded truncated rx packet: "
> @@ -289,6 +460,8 @@ static void handle_rx(struct vhost_net *net)
>  		}
>  	}
>  
> +	handle_async_rx_events_notify(net, vq, sock);
> +
>  	mutex_unlock(&vq->mutex);
>  	unuse_mm(net->dev.mm);
>  }
> @@ -342,6 +515,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
>  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
>  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> +	n->cache = NULL;
>  
>  	f->private_data = n;
>  
> @@ -405,6 +579,18 @@ static void vhost_net_flush(struct vhost_net *n)
>  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
>  }
>  
> +static void vhost_async_cleanup(struct vhost_net *n)
> +{
> +	/* clean the notifier */
> +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> +	struct kiocb *iocb = NULL;
> +	if (n->cache) {
> +		while ((iocb = notify_dequeue(vq)) != NULL)
> +			kmem_cache_free(n->cache, iocb);
> +		kmem_cache_destroy(n->cache);
> +	}
> +}
> +
>  static int vhost_net_release(struct inode *inode, struct file *f)
>  {
>  	struct vhost_net *n = f->private_data;
> @@ -421,6 +607,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>  	/* We do an extra flush before freeing memory,
>  	 * since jobs can re-queue themselves. */
>  	vhost_net_flush(n);
> +	vhost_async_cleanup(n);
>  	kfree(n);
>  	return 0;
>  }
> @@ -472,21 +659,58 @@ static struct socket *get_tap_socket(int fd)
>  	return sock;
>  }
>  
> -static struct socket *get_socket(int fd)
> +static struct socket *get_mp_socket(int fd)
> +{
> +	struct file *file = fget(fd);
> +	struct socket *sock;
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	sock = mp_get_socket(file);
> +	if (IS_ERR(sock))
> +		fput(file);
> +	return sock;
> +}
> +
> +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
> +				 enum vhost_vq_link_state *state)
>  {
>  	struct socket *sock;
>  	/* special case to disable backend */
>  	if (fd == -1)
>  		return NULL;
> +
> +	*state = VHOST_VQ_LINK_SYNC;
> +
>  	sock = get_raw_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
>  	sock = get_tap_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
> +	sock = get_mp_socket(fd);
> +	if (!IS_ERR(sock)) {
> +		*state = VHOST_VQ_LINK_ASYNC;
> +		return sock;
> +	}
>  	return ERR_PTR(-ENOTSOCK);
>  }
>  
> +static void vhost_init_link_state(struct vhost_net *n, int index)

so let's pass link state as parameter, and set it in this function.
And maybe pass in vq, no need for index tricks.

> +{
> +	struct vhost_virtqueue *vq = n->vqs + index;
> +
> +	WARN_ON(!mutex_is_locked(&vq->mutex));

there's a single call site, I don't think we need this check.

> +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> +		INIT_LIST_HEAD(&vq->notifier);
> +		spin_lock_init(&vq->notify_lock);
> +		if (!n->cache) {
> +			n->cache = kmem_cache_create("vhost_kiocb",

vhost_net_kiocb would be a better name

> +					sizeof(struct kiocb), 0,
> +					SLAB_HWCACHE_ALIGN, NULL);
> +		}

no need for {} for single statement if.

> +	}
> +}
> +
>  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  {
>  	struct socket *sock, *oldsock;
> @@ -510,12 +734,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  		r = -EFAULT;
>  		goto err_vq;
>  	}
> -	sock = get_socket(fd);
> +	sock = get_socket(vq, fd, &vq->link_state);
>  	if (IS_ERR(sock)) {
>  		r = PTR_ERR(sock);
>  		goto err_vq;
>  	}
>  
> +	vhost_init_link_state(n, index);
> +
>  	/* start polling new socket */
>  	oldsock = vq->private_data;
>  	if (sock == oldsock)
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 3f10194..add77d3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -860,61 +860,17 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
>  	return 0;
>  }
>  
> -/* This looks in the virtqueue and for the first available buffer, and converts
> - * it to an iovec for convenient access.  Since descriptors consist of some
> - * number of output then some number of input descriptors, it's actually two
> - * iovecs, but we pack them into one and note how many of each there were.
> - *
> - * This function returns the descriptor number found, or vq->num (which
> - * is never a valid descriptor number) if none was found. */
> -unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> -			   struct iovec iov[], unsigned int iov_size,
> -			   unsigned int *out_num, unsigned int *in_num,
> -			   struct vhost_log *log, unsigned int *log_num)
> +/* This computes the log info according to the index of buffer */
> +unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> +			     struct iovec iov[], unsigned int iov_size,
> +			     unsigned int *out_num, unsigned int *in_num,
> +			     struct vhost_log *log, unsigned int *log_num,
> +			     unsigned int head)
>  {
>  	struct vring_desc desc;
>  	unsigned int i, head, found = 0;
> -	u16 last_avail_idx;
> -	int ret;
> -
> -	/* Check it isn't doing very strange things with descriptor numbers. */
> -	last_avail_idx = vq->last_avail_idx;
> -	if (get_user(vq->avail_idx, &vq->avail->idx)) {
> -		vq_err(vq, "Failed to access avail idx at %p\n",
> -		       &vq->avail->idx);
> -		return vq->num;
> -	}
> -
> -	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
> -		vq_err(vq, "Guest moved used index from %u to %u",
> -		       last_avail_idx, vq->avail_idx);
> -		return vq->num;
> -	}
> -
> -	/* If there's nothing new since last we looked, return invalid. */
> -	if (vq->avail_idx == last_avail_idx)
> -		return vq->num;
> +	unsigned int ret;
>  
> -	/* Only get avail ring entries after they have been exposed by guest. */
> -	smp_rmb();
> -
> -	/* Grab the next descriptor number they're advertising, and increment
> -	 * the index we've seen. */
> -	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
> -		vq_err(vq, "Failed to read head: idx %d address %p\n",
> -		       last_avail_idx,
> -		       &vq->avail->ring[last_avail_idx % vq->num]);
> -		return vq->num;
> -	}
> -
> -	/* If their number is silly, that's an error. */
> -	if (head >= vq->num) {
> -		vq_err(vq, "Guest says index %u > %u is available",
> -		       head, vq->num);
> -		return vq->num;
> -	}
> -
> -	/* When we start there are none of either input nor output. */
>  	*out_num = *in_num = 0;
>  	if (unlikely(log))
>  		*log_num = 0;
> @@ -978,8 +934,70 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
>  			*out_num += ret;
>  		}
>  	} while ((i = next_desc(&desc)) != -1);
> +	return head;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which
> + * is never a valid descriptor number) if none was found. */
> +unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> +			   struct iovec iov[], unsigned int iov_size,
> +			   unsigned int *out_num, unsigned int *in_num,
> +			   struct vhost_log *log, unsigned int *log_num)
> +{
> +	struct vring_desc desc;
> +	unsigned int i, head, found = 0;
> +	u16 last_avail_idx;
> +	unsigned int ret;
> +
> +	/* Check it isn't doing very strange things with descriptor numbers. */
> +	last_avail_idx = vq->last_avail_idx;
> +	if (get_user(vq->avail_idx, &vq->avail->idx)) {
> +		vq_err(vq, "Failed to access avail idx at %p\n",
> +		       &vq->avail->idx);
> +		return vq->num;
> +	}
> +
> +	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
> +		vq_err(vq, "Guest moved used index from %u to %u",
> +		       last_avail_idx, vq->avail_idx);
> +		return vq->num;
> +	}
> +
> +	/* If there's nothing new since last we looked, return invalid. */
> +	if (vq->avail_idx == last_avail_idx)
> +		return vq->num;
> +
> +	/* Only get avail ring entries after they have been exposed by guest. */
> +	rmb();
> +
> +	/* Grab the next descriptor number they're advertising, and increment
> +	 * the index we've seen. */
> +	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
> +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> +		       last_avail_idx,
> +		       &vq->avail->ring[last_avail_idx % vq->num]);
> +		return vq->num;
> +	}
> +
> +	/* If their number is silly, that's an error. */
> +	if (head >= vq->num) {
> +		vq_err(vq, "Guest says index %u > %u is available",
> +		       head, vq->num);
> +		return vq->num;
> +	}
> +
> +	ret = __vhost_get_vq_desc(dev, vq, iov, iov_size,
> +				  out_num, in_num,
> +				  log, log_num, head);
>  
>  	/* On success, increment avail index. */
> +	if (ret == vq->num)
> +		return ret;
>  	vq->last_avail_idx++;
>  	return head;
>  }
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 44591ba..3c9cbce 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -43,6 +43,11 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> +enum vhost_vq_link_state {
> +	VHOST_VQ_LINK_SYNC = 0,
> +	VHOST_VQ_LINK_ASYNC = 1,
> +};
> +
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -96,6 +101,10 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log log[VHOST_NET_MAX_SG];
> +	/* Differiate async socket for 0-copy from normal */
> +	enum vhost_vq_link_state link_state;
> +	struct list_head notifier;
> +	spinlock_t notify_lock;
>  };
>  
>  struct vhost_dev {
> @@ -124,6 +133,11 @@ unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
>  			   struct iovec iov[], unsigned int iov_count,
>  			   unsigned int *out_num, unsigned int *in_num,
>  			   struct vhost_log *log, unsigned int *log_num);
> +unsigned __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
> +			   struct iovec iov[], unsigned int iov_count,
> +			   unsigned int *out_num, unsigned int *in_num,
> +			   struct vhost_log *log, unsigned int *log_num,
> +			   unsigned int head);
>  void vhost_discard_vq_desc(struct vhost_virtqueue *);
>  
>  int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
> -- 
> 1.5.4.4


* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-24 20:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: paulmck, Patrick McHardy, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <Pine.LNX.4.64.1004241219280.32071@ask.diku.dk>

On Saturday 24 April 2010 at 13:11 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 23 Apr 2010, Eric Dumazet wrote:
> 
> > On Thursday 22 April 2010 at 22:38 +0200, Jesper Dangaard Brouer wrote:
> >
> >>
> >> I think it's plausible; there is a lot of modification going on.
> >> Approx 40.000 deletes/sec and 40.000 inserts/sec.
> >> The hash bucket size is 300032, and with 80000 modifications/sec, we are
> >> (potentially) changing 26.6% of the hash chains each second.
> >>
> >> As can be seen from the graphs:
> >>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html
> >>
> >> Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is
> >> caught searching...
> >>
> >>
> >>> maybe hash table has one slot :)
> >>
> >> Guess I have to reproduce the DoS attack in a testlab (I will first have
> >> time Tuesday), so we can determine if it's bad hashing or a restart of the
> >> search loop.
> >>
> >>
> >> The traffic pattern was fairly simple:
> >>
> >> 200-byte UDP packets, coming from approx 60 source IPs, going to one
> >> destination IP.  The UDP destination port number was varied in the range
> >> of 1 to 6000.   The source UDP port was varied a bit more, some ranging
> >> from 32768 to 61000, and some from 1028 to 5000.
> >>
> >>
> >
> > Re-reading this, I am not sure there is a real problem on RCU as you
> > pointed out.
> >
> > With 800.000 entries, in a 300.032 buckets hash table, each lookup hit
> > about 3 entries (aka searches in conntrack stats)
> >
> > 300.000 packets/second -> 900.000 'searches' per second.
> 
> The machine is not getting 300.000 pps, it's only getting 40.000 pps (the 
> rest is stopped by the NIC by sending Ethernet flowcontrol pause frames)
> 
>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/eth0-rx.png
> 
> We are doing 700.000 'searches' per second, with 40.000 pps, thus on 
> average the list length (in each hash bucket) just needs to be 17.5 
> elements.  Is this an acceptable hash distribution, with 900.000 elements 
> in a 300.032 buckets hash table?
> 
> 
> > If you have four cpus all trying to insert/delete entries in //, they
> > all hit the central conntrack lock.
> 
> This machine only has two CPUs, or rather one physical CPU and one 
> hyperthreaded.  The server is an old HP380 G4, with an old Xeon type CPU 
> 3.4 GHz (1MB cache).  If I remember correctly it's based on Pentium-4 
> technology, which had pretty bad hyperthreading.
> 
> 
> > On a DDOS scenario, every packet needs to take this lock twice,
> > once to free an old conntrack (early drop), once to insert a new entry.
> 
> I was worried that the "early drop", i.e. freeing an old conntrack, would 
> disturb the conntrack searching.
> 
> 
> > To scale this, only way would be to have an array of locks, like we have
> > for TCP/UDP hash tables.
> >
> > I did some tests here, with a multiqueue card, flooded with 300.000
> > pack/second, 65.536 source IP, millions of flows, and nothing wrong
> > happened (except packet drops, of course)
> 
> A small hint when testing: use Harald's tool 'lnstat' to see the stats on 
> the command line, so you don't need to graph everything with RRDtool:
> 
> Command:
>   lnstat -f nf_conntrack -i 1 -c 1000
> 
> I don't have a multiqueue NIC in this old machine.
> 
> I also ran some tests on my 10G testlab, but it didn't go wrong. 
> Tweaking the pktgen DDoS I could get the system to do 4.500.000 'searches' 
> per sec with 1.500.000 packets/sec. (Have not reloaded the kernel with 
> the new failed lookup stats).  Guess my 10G machines are too fast to hit 
> the issues.
> 
> 
> > My two cpus were busy 100%, after tweaking smp_affinities, because on
> > first try, irqbalance put "01" mask on both queues, so only one ksoftirq
> > was working, other cpu was idle :(
> 
> Think my machine is somewhat slower than yours, perhaps it's simply not 
> fast enough for this kind of workload (pretty sure that the cache in the 
> CPU is getting f*ked in this case).
> 
> On my machine one CPU is stuck in softirq:
>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/cpu_softirq001.png
> 
> And another observation is that the CPUs are disturbing each other on the 
> RX softirq code path.
> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/softnet_time_squeeze_rx-softirq001.png
> (/proc/net/softnet_stat column 3)
> 
> Monday or Tuesday I'll do a test setup with some old HP380 G4 machines to 
> see if I can reproduce the DDoS attack scenario, and see if I can get 
> it into the lookup loop.

Theoretically a loop is very unlikely, given that a single retry is very
unlikely too.

Unless a cpu gets in its cache a corrupted value of a 'next' pointer.

Maybe a hardware problem ?

My test machine is a fairly low end one, an AMD Athlon Dual core 5050e,
2.6 GHz

I used an igb card for ease of setup, and to make sure my two cores
would handle packets in parallel, without RPS.

With the same hash bucket size (300.032) and max conntracks (800.000), and
after more than 10 hours of testing, not a single lookup was restarted
because of a nulls with wrong value.
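
For readers unfamiliar with the "nulls" check mentioned here, the idea can be sketched in userspace C. This is only an illustration modelled loosely on the kernel's hlist_nulls lists; the names (struct entry, make_nulls, walk_chain) are hypothetical and this is not the nf_conntrack code. The assumption is that each chain ends in a marker with bit 0 set that encodes the bucket number, so a lockless reader that ends up on another bucket's chain can notice and restart:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a hash table entry. */
struct entry {
	uintptr_t next;	/* next entry, or a nulls marker ending the chain */
	int key;
};

/* A nulls marker has bit 0 set and carries the bucket number above it. */
static uintptr_t make_nulls(unsigned long bucket)
{
	assert(bucket <= (UINTPTR_MAX >> 1));
	return ((uintptr_t)bucket << 1) | 1;
}

static int is_nulls(uintptr_t p)
{
	return (int)(p & 1);
}

static unsigned long nulls_value(uintptr_t p)
{
	return (unsigned long)(p >> 1);
}

enum walk_result { WALK_FOUND, WALK_MISS, WALK_RESTART };

/*
 * Walk a chain starting at 'pos' on behalf of a lookup in 'bucket'.
 * Reaching the end marker of a different bucket means an entry was
 * moved while we were walking, so the caller must restart the lookup
 * from the head of its own bucket.
 */
static enum walk_result walk_chain(uintptr_t pos, unsigned long bucket, int key)
{
	while (!is_nulls(pos)) {
		const struct entry *e = (const struct entry *)pos;

		if (e->key == key)
			return WALK_FOUND;
		pos = e->next;
	}
	return nulls_value(pos) == bucket ? WALK_MISS : WALK_RESTART;
}
```

A restart is only signalled when the walk ends on a marker for the wrong bucket, which requires an entry to be freed and reinserted elsewhere while a reader is standing on it.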

I can set up a test on a 16 cpu machine, with a multiqueue card too.

Hmm, I forgot to say I am using net-next-2.6, not your kernel version...



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH  kernel 2.6.34-rc5] smc91c92_cs: spin_unlock_irqrestore before calling smc_interrupt()
From: Ken Kawasaki @ 2010-04-24 20:37 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20100411075014.d3e80847.ken_kawasaki@spring.nifty.jp>


smc91c92_cs:
  * spin_unlock_irqrestore before calling smc_interrupt() in media_check()
     to avoid lockup.
  * use spin_lock_irqsave for ethtool function.

Signed-off-by: Ken Kawasaki <ken_kawasaki@spring.nifty.jp>

---

--- linux-2.6.34-rc5/drivers/net/pcmcia/smc91c92_cs.c.orig	2010-04-24 07:33:50.000000000 +0900
+++ linux-2.6.34-rc5/drivers/net/pcmcia/smc91c92_cs.c	2010-04-24 07:42:44.000000000 +0900
@@ -1804,23 +1804,30 @@ static void media_check(u_long arg)
     SMC_SELECT_BANK(1);
     media |= (inw(ioaddr + CONFIG) & CFG_AUI_SELECT) ? 2 : 1;
 
+    SMC_SELECT_BANK(saved_bank);
+    spin_unlock_irqrestore(&smc->lock, flags);
+
     /* Check for pending interrupt with watchdog flag set: with
        this, we can limp along even if the interrupt is blocked */
     if (smc->watchdog++ && ((i>>8) & i)) {
 	if (!smc->fast_poll)
 	    printk(KERN_INFO "%s: interrupt(s) dropped!\n", dev->name);
+	local_irq_save(flags);
 	smc_interrupt(dev->irq, dev);
+	local_irq_restore(flags);
 	smc->fast_poll = HZ;
     }
     if (smc->fast_poll) {
 	smc->fast_poll--;
 	smc->media.expires = jiffies + HZ/100;
 	add_timer(&smc->media);
-	SMC_SELECT_BANK(saved_bank);
-	spin_unlock_irqrestore(&smc->lock, flags);
 	return;
     }
 
+    spin_lock_irqsave(&smc->lock, flags);
+
+    saved_bank = inw(ioaddr + BANK_SELECT);
+
     if (smc->cfg & CFG_MII_SELECT) {
 	if (smc->mii_if.phy_id < 0)
 	    goto reschedule;
@@ -1978,15 +1985,16 @@ static int smc_get_settings(struct net_d
 	unsigned int ioaddr = dev->base_addr;
 	u16 saved_bank = inw(ioaddr + BANK_SELECT);
 	int ret;
+	unsigned long flags;
 
-	spin_lock_irq(&smc->lock);
+	spin_lock_irqsave(&smc->lock, flags);
 	SMC_SELECT_BANK(3);
 	if (smc->cfg & CFG_MII_SELECT)
 		ret = mii_ethtool_gset(&smc->mii_if, ecmd);
 	else
 		ret = smc_netdev_get_ecmd(dev, ecmd);
 	SMC_SELECT_BANK(saved_bank);
-	spin_unlock_irq(&smc->lock);
+	spin_unlock_irqrestore(&smc->lock, flags);
 	return ret;
 }
 
@@ -1996,15 +2004,16 @@ static int smc_set_settings(struct net_d
 	unsigned int ioaddr = dev->base_addr;
 	u16 saved_bank = inw(ioaddr + BANK_SELECT);
 	int ret;
+	unsigned long flags;
 
-	spin_lock_irq(&smc->lock);
+	spin_lock_irqsave(&smc->lock, flags);
 	SMC_SELECT_BANK(3);
 	if (smc->cfg & CFG_MII_SELECT)
 		ret = mii_ethtool_sset(&smc->mii_if, ecmd);
 	else
 		ret = smc_netdev_set_ecmd(dev, ecmd);
 	SMC_SELECT_BANK(saved_bank);
-	spin_unlock_irq(&smc->lock);
+	spin_unlock_irqrestore(&smc->lock, flags);
 	return ret;
 }
 
@@ -2014,12 +2023,13 @@ static u32 smc_get_link(struct net_devic
 	unsigned int ioaddr = dev->base_addr;
 	u16 saved_bank = inw(ioaddr + BANK_SELECT);
 	u32 ret;
+	unsigned long flags;
 
-	spin_lock_irq(&smc->lock);
+	spin_lock_irqsave(&smc->lock, flags);
 	SMC_SELECT_BANK(3);
 	ret = smc_link_ok(dev);
 	SMC_SELECT_BANK(saved_bank);
-	spin_unlock_irq(&smc->lock);
+	spin_unlock_irqrestore(&smc->lock, flags);
 	return ret;
 }
 
@@ -2056,16 +2066,17 @@ static int smc_ioctl (struct net_device 
 	int rc = 0;
 	u16 saved_bank;
 	unsigned int ioaddr = dev->base_addr;
+	unsigned long flags;
 
 	if (!netif_running(dev))
 		return -EINVAL;
 
-	spin_lock_irq(&smc->lock);
+	spin_lock_irqsave(&smc->lock, flags);
 	saved_bank = inw(ioaddr + BANK_SELECT);
 	SMC_SELECT_BANK(3);
 	rc = generic_mii_ioctl(&smc->mii_if, mii, cmd, NULL);
 	SMC_SELECT_BANK(saved_bank);
-	spin_unlock_irq(&smc->lock);
+	spin_unlock_irqrestore(&smc->lock, flags);
 	return rc;
 }
 


* [PATCH 2/2] sky2: add support for receive hashing (v3)
From: Stephen Hemminger @ 2010-04-24 23:22 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: davem, netdev
In-Reply-To: <4BD0F8AA.2060700@garzik.org>

Subject: sky2: add support for receive hashing

Sky2 hardware supports hardware receive hash calculation.
Now that Receive Packet Steering is available, add support
to enable it.

This version does not depend on CONFIG_RPS. Also set_flags rejects
all values except RXHASH, so driver won't have to change next time
somebody adds a new one.


Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---

 drivers/net/sky2.c |   75 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/net/sky2.h |   23 ++++++++++++++++
 2 files changed, 96 insertions(+), 2 deletions(-)

--- a/drivers/net/sky2.c	2010-04-23 08:56:41.444717559 -0700
+++ b/drivers/net/sky2.c	2010-04-23 08:56:51.335959014 -0700
@@ -1193,6 +1193,39 @@ static void rx_set_checksum(struct sky2_
 		     ? BMU_ENA_RX_CHKSUM : BMU_DIS_RX_CHKSUM);
 }
 
+/* Enable/disable receive hash calculation (RSS) */
+static void rx_set_rss(struct net_device *dev)
+{
+	struct sky2_port *sky2 = netdev_priv(dev);
+	struct sky2_hw *hw = sky2->hw;
+	int i, nkeys = 4;
+
+	/* Supports IPv6 and other modes */
+	if (hw->flags & SKY2_HW_NEW_LE) {
+		nkeys = 10;
+		sky2_write32(hw, SK_REG(sky2->port, RSS_CFG), HASH_ALL);
+	}
+
+	/* Program RSS initial values */
+	if (dev->features & NETIF_F_RXHASH) {
+		u32 key[nkeys];
+
+		get_random_bytes(key, nkeys * sizeof(u32));
+		for (i = 0; i < nkeys; i++)
+			sky2_write32(hw, SK_REG(sky2->port, RSS_KEY + i * 4),
+				     key[i]);
+
+		/* Need to turn on (undocumented) flag to make hashing work  */
+		sky2_write32(hw, SK_REG(sky2->port, RX_GMF_CTRL_T),
+			     RX_STFW_ENA);
+
+		sky2_write32(hw, Q_ADDR(rxqaddr[sky2->port], Q_CSR),
+			     BMU_ENA_RX_RSS_HASH);
+	} else
+		sky2_write32(hw, Q_ADDR(rxqaddr[sky2->port], Q_CSR),
+			     BMU_DIS_RX_RSS_HASH);
+}
+
 /*
  * The RX Stop command will not work for Yukon-2 if the BMU does not
  * reach the end of packet and since we can't make sure that we have
@@ -1425,6 +1458,9 @@ static void sky2_rx_start(struct sky2_po
 	if (!(hw->flags & SKY2_HW_NEW_LE))
 		rx_set_checksum(sky2);
 
+	if (!(hw->flags & SKY2_HW_RSS_BROKEN))
+		rx_set_rss(sky2->netdev);
+
 	/* submit Rx ring */
 	for (i = 0; i < sky2->rx_pending; i++) {
 		re = sky2->rx_ring + i;
@@ -2534,6 +2570,14 @@ static void sky2_rx_checksum(struct sky2
 	}
 }
 
+static void sky2_rx_hash(struct sky2_port *sky2, u32 status)
+{
+	struct sk_buff *skb;
+
+	skb = sky2->rx_ring[sky2->rx_next].skb;
+	skb->rxhash = le32_to_cpu(status);
+}
+
 /* Process status response ring */
 static int sky2_status_intr(struct sky2_hw *hw, int to_do, u16 idx)
 {
@@ -2606,6 +2650,10 @@ static int sky2_status_intr(struct sky2_
 				sky2_rx_checksum(sky2, status);
 			break;
 
+		case OP_RSS_HASH:
+			sky2_rx_hash(sky2, status);
+			break;
+
 		case OP_TXINDEXLE:
 			/* TX index reports status for both ports */
 			sky2_tx_done(hw->dev[0], status & 0xfff);
@@ -2960,6 +3008,8 @@ static int __devinit sky2_init(struct sk
 	switch(hw->chip_id) {
 	case CHIP_ID_YUKON_XL:
 		hw->flags = SKY2_HW_GIGABIT | SKY2_HW_NEWER_PHY;
+		if (hw->chip_rev < CHIP_REV_YU_XL_A2)
+			hw->flags |= SKY2_HW_RSS_BROKEN;
 		break;
 
 	case CHIP_ID_YUKON_EC_U:
@@ -2985,10 +3035,11 @@ static int __devinit sky2_init(struct sk
 			dev_err(&hw->pdev->dev, "unsupported revision Yukon-EC rev A1\n");
 			return -EOPNOTSUPP;
 		}
-		hw->flags = SKY2_HW_GIGABIT;
+		hw->flags = SKY2_HW_GIGABIT | SKY2_HW_RSS_BROKEN;
 		break;
 
 	case CHIP_ID_YUKON_FE:
+		hw->flags = SKY2_HW_RSS_BROKEN;
 		break;
 
 	case CHIP_ID_YUKON_FE_P:
@@ -4112,6 +4163,25 @@ static int sky2_set_eeprom(struct net_de
 	return sky2_vpd_write(sky2->hw, cap, data, eeprom->offset, eeprom->len);
 }
 
+static int sky2_set_flags(struct net_device *dev, u32 data)
+{
+	struct sky2_port *sky2 = netdev_priv(dev);
+
+	if (data & ~ETH_FLAG_RXHASH)
+		return -EOPNOTSUPP;
+
+	if (data & ETH_FLAG_RXHASH) {
+		if (sky2->hw->flags & SKY2_HW_RSS_BROKEN)
+			return -EINVAL;
+
+		dev->features |= NETIF_F_RXHASH;
+	} else
+		dev->features &= ~NETIF_F_RXHASH;
+
+	rx_set_rss(dev);
+
+	return 0;
+}
 
 static const struct ethtool_ops sky2_ethtool_ops = {
 	.get_settings	= sky2_get_settings,
@@ -4143,6 +4213,7 @@ static const struct ethtool_ops sky2_eth
 	.phys_id	= sky2_phys_id,
 	.get_sset_count = sky2_get_sset_count,
 	.get_ethtool_stats = sky2_get_ethtool_stats,
+	.set_flags	= sky2_set_flags,
 };
 
 #ifdef CONFIG_SKY2_DEBUG
@@ -4496,6 +4567,10 @@ static __devinit struct net_device *sky2
 	if (highmem)
 		dev->features |= NETIF_F_HIGHDMA;
 
+	/* Enable receive hashing unless hardware is known broken */
+	if (!(hw->flags & SKY2_HW_RSS_BROKEN))
+		dev->features |= NETIF_F_RXHASH;
+
 #ifdef SKY2_VLAN_TAG_USED
 	/* The workaround for FE+ status conflicts with VLAN tag detection. */
 	if (!(sky2->hw->chip_id == CHIP_ID_YUKON_FE_P &&
@@ -4692,7 +4767,7 @@ static int __devinit sky2_probe(struct p
 		goto err_out_iounmap;
 
 	/* ring for status responses */
-	hw->st_size = hw->ports * roundup_pow_of_two(2*RX_MAX_PENDING + TX_MAX_PENDING);
+	hw->st_size = hw->ports * roundup_pow_of_two(3*RX_MAX_PENDING + TX_MAX_PENDING);
 	hw->st_le = pci_alloc_consistent(pdev, hw->st_size * sizeof(struct sky2_status_le),
 					 &hw->st_dma);
 	if (!hw->st_le)
--- a/drivers/net/sky2.h	2010-04-23 08:56:41.424736307 -0700
+++ b/drivers/net/sky2.h	2010-04-23 08:56:51.335959014 -0700
@@ -694,8 +694,21 @@ enum {
 	TXA_CTRL	= 0x0210,/*  8 bit	Tx Arbiter Control Register */
 	TXA_TEST	= 0x0211,/*  8 bit	Tx Arbiter Test Register */
 	TXA_STAT	= 0x0212,/*  8 bit	Tx Arbiter Status Register */
+
+	RSS_KEY		= 0x0220, /* RSS Key setup */
+	RSS_CFG		= 0x0248, /* RSS Configuration */
 };
 
+enum {
+	HASH_TCP_IPV6_EX_CTRL	= 1<<5,
+	HASH_IPV6_EX_CTRL	= 1<<4,
+	HASH_TCP_IPV6_CTRL	= 1<<3,
+	HASH_IPV6_CTRL		= 1<<2,
+	HASH_TCP_IPV4_CTRL	= 1<<1,
+	HASH_IPV4_CTRL		= 1<<0,
+
+	HASH_ALL		= 0x3f,
+};
 
 enum {
 	B6_EXT_REG	= 0x0300,/* External registers (GENESIS only) */
@@ -2261,6 +2274,7 @@ struct sky2_hw {
 #define SKY2_HW_NEW_LE		0x00000020	/* new LSOv2 format */
 #define SKY2_HW_AUTO_TX_SUM	0x00000040	/* new IP decode for Tx */
 #define SKY2_HW_ADV_POWER_CTL	0x00000080	/* additional PHY power regs */
+#define SKY2_HW_RSS_BROKEN	0x00000100
 
 	u8	     	     chip_id;
 	u8		     chip_rev;


* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-25  2:31 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271938343.4032.30.camel@bigi>

On Thu, Apr 22, 2010 at 8:12 PM, jamal <hadi@cyberus.ca> wrote:
>
>> I see slave/application cpus hit _raw_spin_lock_irqsave() and
>> _raw_spin_unlock_irqrestore().
>>
>> Maybe a ring buffer could help (instead of a double linked queue) for
>> backlog, or the double queue trick, if Changli wants to respin his
>> patch.
>>
>
> Ok, I will have some cycles later today/tomorrow or for sure on the weekend.
> My setup is still intact - so i can test.
>

I read the code again, and found that we don't use spin_lock_irqsave();
we use local_irq_save() and spin_lock() instead, so
_raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not be
related to the backlog. The lock may be sk_receive_queue.lock.

Jamal, did you use a single socket to serve all the clients?

BTW: completion_queue and output_queue in softnet_data are both LIFO
queues. For completion_queue, FIFO is better, as the last used skb is
more likely to be in cache, and should be reused first. Since slab always
caches the last freed memory at the head, we'd better free the skbs in
FIFO order. For output_queue, FIFO is good for fairness among qdiscs.
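
The LIFO/FIFO distinction above comes down to which end of a singly linked list new entries are added to. A minimal userspace sketch, using hypothetical names (struct buf, push_head, push_tail, pop_head) rather than the actual softnet_data code, and assuming the consumer always takes entries from the head:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb or qdisc on a singly linked queue. */
struct buf {
	int id;
	struct buf *next;
};

struct queue {
	struct buf *head;
	struct buf **tailp;	/* address of the last 'next' pointer */
};

static void queue_init(struct queue *q)
{
	q->head = NULL;
	q->tailp = &q->head;
}

/* LIFO: the newest entry becomes the head and is consumed first. */
static void push_head(struct queue *q, struct buf *b)
{
	assert(b != NULL);
	b->next = q->head;
	if (!q->head)
		q->tailp = &b->next;
	q->head = b;
}

/* FIFO: the newest entry goes to the tail and is consumed last. */
static void push_tail(struct queue *q, struct buf *b)
{
	assert(b != NULL);
	b->next = NULL;
	*q->tailp = b;
	q->tailp = &b->next;
}

/* The consumer always takes from the head; returns NULL when empty. */
static struct buf *pop_head(struct queue *q)
{
	struct buf *b = q->head;

	if (b) {
		q->head = b->next;
		if (!q->head)
			q->tailp = &q->head;
		b->next = NULL;
	}
	return b;
}
```

Keeping a tail pointer makes the FIFO append O(1), at the cost of one extra word per queue; with head insertion only, entries queued as 1, 2, 3 are consumed as 3, 2, 1.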

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)


* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-25  2:34 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <m2xa44ae5cd1004231559hcf90671asf146a43b4748c2c3@mail.gmail.com>

On Fri, Apr 23, 2010 at 06:59:12PM -0400, Miles Lane wrote:
> On Fri, Apr 23, 2010 at 3:42 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Apr 23, 2010 at 08:50:59AM -0400, Miles Lane wrote:
> >> Hi Paul,
> >> There has been a bit of back and forth, and I am not sure what patches
> >> I should test now.
> >> Could you send me a bundle of whatever needs testing now?
> >
> > Hello, Miles,
> >
> > I am posting my set as replies to this message.  There are a couple
> > of KVM fixes that are going up via Avi's tree, and a number of networking
> > fixes that are going up via Dave Miller's tree -- a number of these
> > are against quickly changing code, so it didn't make sense for me to
> > keep them separately.
> >
> > I believe that the two splats below are addressed by this patch set
> > carried in the networking tree:
> >
> >        https://patchwork.kernel.org/patch/90754/
> 
> With your twelve patches and the one linked to above applied to
> 2.6.34-rc5-git3, here are the warnings I see:
> 
> [    0.173969] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    0.174097] ---------------------------------------------------
> [    0.174226] include/linux/cgroup.h:534 invoked
> rcu_dereference_check() without protection!
> [    0.174429]
> [    0.174430] other info that might help us debug this:
> [    0.174431]
> [    0.174792]
> [    0.174793] rcu_scheduler_active = 1, debug_locks = 1
> [    0.175037] no locks held by watchdog/0/5.
> [    0.175162]
> [    0.175163] stack backtrace:
> [    0.175405] Pid: 5, comm: watchdog/0 Not tainted 2.6.34-rc5-git3 #22
> [    0.175534] Call Trace:
> [    0.175666]  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [    0.175799]  [<ffffffff8102d678>] task_subsys_state+0x59/0x70
> [    0.175931]  [<ffffffff810328fa>] __sched_setscheduler+0x19d/0x300
> [    0.176064]  [<ffffffff8102b477>] ? need_resched+0x1e/0x28
> [    0.176196]  [<ffffffff813cd401>] ? schedule+0x5c3/0x66e
> [    0.176327]  [<ffffffff81091943>] ? watchdog+0x0/0x8c
> [    0.176457]  [<ffffffff81032a78>] sched_setscheduler+0xe/0x10
> [    0.176587]  [<ffffffff8109196d>] watchdog+0x2a/0x8c
> [    0.176677]  [<ffffffff81091943>] ? watchdog+0x0/0x8c
> [    0.176808]  [<ffffffff81057152>] kthread+0x89/0x91
> [    0.176939]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    0.177073]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> [    0.177204]  [<ffffffff813cfc40>] ? restore_args+0x0/0x30
> [    0.177334]  [<ffffffff810570c9>] ? kthread+0x0/0x91
> [    0.177463]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10

According to Documentation/cgroups/cgroups.txt, we must hold cgroup_mutex,
the task's alloc_lock, or be in an RCU read-side critical section.
We are in none of these.

I would argue that sched_setscheduler() should take care of
synchronization, but am not sure which of these three is appropriate
for sched_setscheduler() to acquire.  Peter, thoughts?
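
To make the rule concrete, the check that produces this kind of splat can be modelled in userspace as "warn unless at least one of the three protections is in effect". The following is only a sketch under that assumption: struct caller_ctx, checked_deref() and the mock_* helpers are hypothetical and stand in for lockdep state, and it does not say which of the three sched_setscheduler() ought to use:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-caller state; the kernel tracks this via lockdep. */
struct caller_ctx {
	bool cgroup_mutex_held;
	bool task_lock_held;
	int rcu_read_depth;	/* nesting depth of RCU read-side sections */
	int warnings;		/* number of "suspicious usage" reports */
};

static void mock_rcu_read_lock(struct caller_ctx *c)
{
	c->rcu_read_depth++;
}

static void mock_rcu_read_unlock(struct caller_ctx *c)
{
	assert(c->rcu_read_depth > 0);
	c->rcu_read_depth--;
}

/* Any one of the three protections is enough. */
static bool access_protected(const struct caller_ctx *c)
{
	return c->cgroup_mutex_held || c->task_lock_held ||
	       c->rcu_read_depth > 0;
}

/* Return the pointer unchanged, but count a warning if unprotected. */
static void *checked_deref(struct caller_ctx *c, void *p)
{
	assert(c != NULL);
	if (!access_protected(c))
		c->warnings++;
	return p;
}
```

In this model the watchdog trace above corresponds to calling checked_deref() with all three fields clear, which is why the report says "no locks held".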

> [    3.173419] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    3.173419] ---------------------------------------------------
> [    3.173419] kernel/cgroup.c:4438 invoked rcu_dereference_check()
> without protection!
> [    3.173419]
> [    3.173419] other info that might help us debug this:
> [    3.173419]
> [    3.173419]
> [    3.173419] rcu_scheduler_active = 1, debug_locks = 1
> [    3.173419] 2 locks held by async/0/668:
> [    3.173419]  #0:  (&shost->scan_mutex){+.+.+.}, at:
> [<ffffffff812df020>] __scsi_add_device+0x83/0xe4
> [    3.173419]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> [<ffffffff811f2df9>] blkiocg_add_blkio_group+0x29/0x7f
> [    3.173419]
> [    3.173419] stack backtrace:
> [    3.173419] Pid: 668, comm: async/0 Not tainted 2.6.34-rc5-git3 #22
> [    3.173419] Call Trace:
> [    3.173419]  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [    3.173419]  [<ffffffff8107f9ad>] css_id+0x3f/0x51
> [    3.173419]  [<ffffffff811f2e08>] blkiocg_add_blkio_group+0x38/0x7f
> [    3.173419]  [<ffffffff811f4dd0>] cfq_init_queue+0xdf/0x2dc
> [    3.173419]  [<ffffffff811e33b1>] elevator_init+0xba/0xf5
> [    3.173419]  [<ffffffff812dbfaa>] ? scsi_request_fn+0x0/0x451
> [    3.173419]  [<ffffffff811e68d7>] blk_init_queue_node+0x12f/0x135
> [    3.173419]  [<ffffffff811e68e9>] blk_init_queue+0xc/0xe
> [    3.173419]  [<ffffffff812dc41c>] __scsi_alloc_queue+0x21/0x111
> [    3.173419]  [<ffffffff812dc524>] scsi_alloc_queue+0x18/0x64
> [    3.173419]  [<ffffffff812de520>] scsi_alloc_sdev+0x19e/0x256
> [    3.173419]  [<ffffffff812de6be>] scsi_probe_and_add_lun+0xe6/0x9c5
> [    3.173419]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    3.173419]  [<ffffffff813ce056>] ? __mutex_lock_common+0x3e4/0x43a
> [    3.173419]  [<ffffffff812df020>] ? __scsi_add_device+0x83/0xe4
> [    3.173419]  [<ffffffff812d09dc>] ? transport_setup_classdev+0x0/0x17
> [    3.173419]  [<ffffffff812df020>] ? __scsi_add_device+0x83/0xe4
> [    3.173419]  [<ffffffff812df055>] __scsi_add_device+0xb8/0xe4
> [    3.173419]  [<ffffffff812ea945>] ata_scsi_scan_host+0x74/0x16e
> [    3.173419]  [<ffffffff81057699>] ? autoremove_wake_function+0x0/0x34
> [    3.173419]  [<ffffffff812e8de4>] async_port_probe+0xab/0xb7
> [    3.173419]  [<ffffffff8105e1b1>] ? async_thread+0x0/0x1f4
> [    3.173419]  [<ffffffff8105e2b6>] async_thread+0x105/0x1f4
> [    3.173419]  [<ffffffff81033d8e>] ? default_wake_function+0x0/0xf
> [    3.173419]  [<ffffffff8105e1b1>] ? async_thread+0x0/0x1f4
> [    3.173419]  [<ffffffff81057152>] kthread+0x89/0x91
> [    3.173419]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    3.173419]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> [    3.173419]  [<ffffffff813cfc40>] ? restore_args+0x0/0x30
> [    3.173419]  [<ffffffff810570c9>] ? kthread+0x0/0x91
> [    3.173419]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10

Please see below for a patch for this based on my earlier conversation
with Vivek Goyal.  (Vivek, if you are already pushing a fix elsewhere,
please let me know, and I will drop my patch in favor of yours.)

> [   32.905446] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   32.905449] ---------------------------------------------------
> [   32.905453] net/core/dev.c:1993 invoked rcu_dereference_check()
> without protection!
> [   32.905456]
> [   32.905457] other info that might help us debug this:
> [   32.905458]
> [   32.905461]
> [   32.905462] rcu_scheduler_active = 1, debug_locks = 1
> [   32.905466] 2 locks held by canberra-gtk-pl/4182:
> [   32.905469]  #0:  (sk_lock-AF_INET){+.+.+.}, at:
> [<ffffffff81394f7d>] inet_stream_connect+0x3a/0x24d
> [   32.905483]  #1:  (rcu_read_lock_bh){.+....}, at:
> [<ffffffff8134a789>] dev_queue_xmit+0x14e/0x4b8
> [   32.905495]
> [   32.905496] stack backtrace:
> [   32.905500] Pid: 4182, comm: canberra-gtk-pl Not tainted 2.6.34-rc5-git3 #22
> [   32.905504] Call Trace:
> [   32.905512]  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [   32.905518]  [<ffffffff8134a894>] dev_queue_xmit+0x259/0x4b8
> [   32.905524]  [<ffffffff8134a789>] ? dev_queue_xmit+0x14e/0x4b8
> [   32.905531]  [<ffffffff81041c66>] ? _local_bh_enable_ip+0xcd/0xda
> [   32.905538]  [<ffffffff813536da>] neigh_resolve_output+0x234/0x285
> [   32.905544]  [<ffffffff8136f69f>] ip_finish_output2+0x257/0x28c
> [   32.905549]  [<ffffffff8136f73c>] ip_finish_output+0x68/0x6a
> [   32.905554]  [<ffffffff81370433>] T.866+0x52/0x59
> [   32.905559]  [<ffffffff8137067e>] ip_output+0xaa/0xb4
> [   32.905565]  [<ffffffff8136eb38>] ip_local_out+0x20/0x24
> [   32.905571]  [<ffffffff8136f184>] ip_queue_xmit+0x309/0x368
> [   32.905578]  [<ffffffff810e4226>] ? __kmalloc_track_caller+0x111/0x155
> [   32.905585]  [<ffffffff8138316f>] ? tcp_connect+0x223/0x3d3
> [   32.905591]  [<ffffffff813818f1>] tcp_transmit_skb+0x707/0x745
> [   32.905597]  [<ffffffff813832c2>] tcp_connect+0x376/0x3d3
> [   32.905604]  [<ffffffff81268a43>] ? secure_tcp_sequence_number+0x55/0x6f
> [   32.905610]  [<ffffffff81387270>] tcp_v4_connect+0x3df/0x455
> [   32.905617]  [<ffffffff8133cb59>] ? lock_sock_nested+0xf3/0x102
> [   32.905623]  [<ffffffff81394fe7>] inet_stream_connect+0xa4/0x24d
> [   32.905629]  [<ffffffff8133b398>] sys_connect+0x90/0xd0
> [   32.905636]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> [   32.905642]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [   32.905649]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   32.905655]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

A fix for the above is already in Dave Miller's tree.

> [   51.912282] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   51.912285] ---------------------------------------------------
> [   51.912289] net/mac80211/sta_info.c:886 invoked
> rcu_dereference_check() without protection!
> [   51.912293]
> [   51.912293] other info that might help us debug this:
> [   51.912295]
> [   51.912298]
> [   51.912298] rcu_scheduler_active = 1, debug_locks = 1
> [   51.912302] no locks held by wpa_supplicant/3951.
> [   51.912305]
> [   51.912306] stack backtrace:
> [   51.912310] Pid: 3951, comm: wpa_supplicant Not tainted 2.6.34-rc5-git3 #22
> [   51.912314] Call Trace:
> [   51.912317]  <IRQ>  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [   51.912345]  [<ffffffffa014f9ae>]
> ieee80211_find_sta_by_hw+0x46/0x10f [mac80211]
> [   51.912358]  [<ffffffffa014fa8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
> [   51.912373]  [<ffffffffa01e50f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
> [   51.912380]  [<ffffffff8106842b>] ? mark_lock+0x2d/0x235
> [   51.912391]  [<ffffffffa0252f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
> [   51.912399]  [<ffffffff8120a353>] ? is_swiotlb_buffer+0x2e/0x3b
> [   51.912407]  [<ffffffffa024bbf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
> [   51.912414]  [<ffffffff81068904>] ? trace_hardirqs_on_caller+0xfa/0x13f
> [   51.912422]  [<ffffffffa024c3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
> [   51.912429]  [<ffffffff810411f3>] tasklet_action+0xa7/0x10f
> [   51.912435]  [<ffffffff81042205>] __do_softirq+0x144/0x252
> [   51.912442]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
> [   51.912447]  [<ffffffff810050e4>] do_softirq+0x38/0x80
> [   51.912452]  [<ffffffff81041cd2>] irq_exit+0x45/0x94
> [   51.912457]  [<ffffffff81004829>] do_IRQ+0xad/0xc4
> [   51.912463]  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> [   51.912470]  [<ffffffff813cfb93>] ret_from_intr+0x0/0xf
> [   51.912474]  <EOI>  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> [   51.912484]  [<ffffffff8106a75d>] ? lock_release+0x208/0x215
> [   51.912490]  [<ffffffff810cbc1c>] might_fault+0xac/0xb3
> [   51.912495]  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> [   51.912501]  [<ffffffff812025e3>] __clear_user+0x15/0x59
> [   51.912508]  [<ffffffff8100b2bc>] save_i387_xstate+0x9c/0x1bc
> [   51.912515]  [<ffffffff81002276>] do_signal+0x240/0x686
> [   51.912521]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> [   51.912527]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [   51.912533]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   51.912539]  [<ffffffff810026e3>] do_notify_resume+0x27/0x5f
> [   51.912545]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   51.912551]  [<ffffffff81002e86>] int_signal+0x12/0x17

This is a repeat from last time that confused me at the time.  I could
do a hacky "fix" by putting an RCU read-side critical section around
the for_each_sta_info() in ieee80211_find_sta_by_hw(), but I do not
understand this code well enough to feel comfortable doing so.
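
For reference, the shape of such a "fix" can be sketched in userspace as
below. This is only an illustration under stated assumptions: the
rcu_read_lock()/rcu_read_unlock() here are stand-in stubs that count
nesting, and struct sta and find_sta() are hypothetical simplifications,
not the mac80211 code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the kernel primitives (assumptions, not the real API). */
static int rcu_nesting;            /* models read-side nesting depth */
static void rcu_read_lock(void)   { rcu_nesting++; }
static void rcu_read_unlock(void) { assert(rcu_nesting > 0); rcu_nesting--; }

/* Hypothetical, simplified station entry; not mac80211's struct sta_info. */
struct sta {
	unsigned char addr[6];
	struct sta *next;
};

/* Walk the list entirely inside a read-side critical section. */
static struct sta *find_sta(struct sta *head, const unsigned char *addr)
{
	struct sta *s, *found = NULL;

	rcu_read_lock();
	for (s = head; s; s = s->next) {
		assert(rcu_nesting > 0);   /* the check lockdep performs, in spirit */
		if (memcmp(s->addr, addr, 6) == 0) {
			found = s;
			break;
		}
	}
	rcu_read_unlock();
	return found;   /* caller must otherwise guarantee the entry stays valid */
}
```

The returned pointer is used after the unlock, so wrapping only the walk
silences the warning without by itself guaranteeing the entry's lifetime.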

Johannes, any enlightenment?

> [   51.929529] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   51.929532] ---------------------------------------------------
> [   51.929536] net/mac80211/sta_info.c:886 invoked
> rcu_dereference_check() without protection!
> [   51.929540]
> [   51.929541] other info that might help us debug this:
> [   51.929542]
> [   51.929545]
> [   51.929546] rcu_scheduler_active = 1, debug_locks = 1
> [   51.929550] 1 lock held by Xorg/4013:
> [   51.929553]  #0:  (clock-AF_UNIX){++.+..}, at: [<ffffffff8133cebd>]
> sock_def_readable+0x19/0x62
> [   51.929567]
> [   51.929568] stack backtrace:
> [   51.929573] Pid: 4013, comm: Xorg Not tainted 2.6.34-rc5-git3 #22
> [   51.929576] Call Trace:
> [   51.929579]  <IRQ>  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [   51.929603]  [<ffffffffa014f9fe>]
> ieee80211_find_sta_by_hw+0x96/0x10f [mac80211]
> [   51.929615]  [<ffffffffa014fa8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
> [   51.929631]  [<ffffffffa01e50f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
> [   51.929642]  [<ffffffffa0252f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
> [   51.929649]  [<ffffffff81068685>] ? mark_held_locks+0x52/0x70
> [   51.929656]  [<ffffffff813cf46c>] ? _raw_spin_unlock_irqrestore+0x3a/0x69
> [   51.929662]  [<ffffffff8120a353>] ? is_swiotlb_buffer+0x2e/0x3b
> [   51.929671]  [<ffffffffa024bbf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
> [   51.929680]  [<ffffffffa024c3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
> [   51.929687]  [<ffffffff810411f3>] tasklet_action+0xa7/0x10f
> [   51.929693]  [<ffffffff81042205>] __do_softirq+0x144/0x252
> [   51.929700]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
> [   51.929705]  [<ffffffff810050e4>] do_softirq+0x38/0x80
> [   51.929711]  [<ffffffff81041cd2>] irq_exit+0x45/0x94
> [   51.929717]  [<ffffffff81019b10>] smp_apic_timer_interrupt+0x87/0x95
> [   51.929724]  [<ffffffff81003553>] apic_timer_interrupt+0x13/0x20
> [   51.929727]  <EOI>  [<ffffffff813cf46e>] ?
> _raw_spin_unlock_irqrestore+0x3c/0x69
> [   51.929739]  [<ffffffff8102d3fb>] __wake_up_sync_key+0x49/0x52
> [   51.929745]  [<ffffffff8133cee7>] sock_def_readable+0x43/0x62
> [   51.929751]  [<ffffffff813b1c61>] unix_stream_sendmsg+0x243/0x2e2
> [   51.929758]  [<ffffffff8133b912>] ? sock_aio_write+0x0/0xcf
> [   51.929764]  [<ffffffff81339342>] __sock_sendmsg+0x59/0x64
> [   51.929770]  [<ffffffff8133b9cd>] sock_aio_write+0xbb/0xcf
> [   51.929777]  [<ffffffff810e9909>] do_sync_readv_writev+0xbc/0xfb
> [   51.929785]  [<ffffffff811c1792>] ? selinux_file_permission+0xa2/0xaf
> [   51.929790]  [<ffffffff810e9690>] ? copy_from_user+0x2a/0x2c
> [   51.929797]  [<ffffffff811baff1>] ? security_file_permission+0x11/0x13
> [   51.929804]  [<ffffffff810ea6a6>] do_readv_writev+0xa2/0x122
> [   51.929810]  [<ffffffff810ead93>] ? fcheck_files+0x8f/0xc9
> [   51.929816]  [<ffffffff810ea764>] vfs_writev+0x3e/0x49
> [   51.929821]  [<ffffffff810ea84a>] sys_writev+0x45/0x8e
> [   51.929828]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

Ditto.

						Thanx, Paul

------------------------------------------------------------------------

commit 0868dd631def762ba00c2f0f397a53c5cdf24ae2
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Sat Apr 24 19:23:30 2010 -0700

    block-cgroup: fix RCU-lockdep splat in blkiocg_add_blkio_group()
    
    It is necessary to be in an RCU read-side critical section when invoking
    css_id(), so this patch adds one to blkiocg_add_blkio_group().  This is
    actually a false positive, because this is called at initialization time,
    and hence always refers to the root cgroup, which cannot go away.
    
    Located-by: Miles Lane <miles.lane@gmail.com>
    Suggested-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5fe03de..55c8c73 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -71,7 +71,9 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 
 	spin_lock_irqsave(&blkcg->lock, flags);
 	rcu_assign_pointer(blkg->key, key);
+	rcu_read_lock();
 	blkg->blkcg_id = css_id(&blkcg->css);
+	rcu_read_unlock();
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	spin_unlock_irqrestore(&blkcg->lock, flags);
 #ifdef CONFIG_DEBUG_BLK_CGROUP

^ permalink raw reply related

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-25  2:36 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <u2qa44ae5cd1004232235i2e1cd2a0g634fc1d5d8c3f7c2@mail.gmail.com>

On Sat, Apr 24, 2010 at 01:35:01AM -0400, Miles Lane wrote:
> 2.6.34-rc5-git5 with all of your patches applied.
> 
> I reconfigured my kernel build options and got the following new issue:
> 
> [    2.686515] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    2.686519] ---------------------------------------------------
> [    2.686523] kernel/cgroup.c:4438 invoked rcu_dereference_check()
> without protection!
> [    2.686526]
> [    2.686527] other info that might help us debug this:
> [    2.686529]
> [    2.686532]
> [    2.686533] rcu_scheduler_active = 1, debug_locks = 1
> [    2.686537] 2 locks held by swapper/1:
> [    2.686540]  #0:  (mtd_table_mutex){+.+.+.}, at:
> [<ffffffff812d7714>] register_mtd_blktrans+0xa2/0x25e
> [    2.686555]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> [<ffffffff811ca7bd>] blkiocg_add_blkio_group+0x29/0x7f
> [    2.686566]
> [    2.686567] stack backtrace:
> [    2.686572] Pid: 1, comm: swapper Not tainted 2.6.34-rc5-git5 #25
> [    2.686576] Call Trace:
> [    2.686584]  [<ffffffff810642da>] lockdep_rcu_dereference+0x9d/0xa5
> [    2.686591]  [<ffffffff8107af54>] css_id+0x3f/0x52
> [    2.686597]  [<ffffffff811ca7cc>] blkiocg_add_blkio_group+0x38/0x7f
> [    2.686603]  [<ffffffff811cc593>] cfq_init_queue+0xdf/0x2dc
> [    2.686609]  [<ffffffff811bb858>] elevator_init+0xba/0xf5
> [    2.686616]  [<ffffffff812d7046>] ? mtd_blktrans_request+0x0/0x1c
> [    2.686623]  [<ffffffff811c0b62>] blk_init_queue_node+0x12f/0x135
> [    2.686629]  [<ffffffff811c0b74>] blk_init_queue+0xc/0xe
> [    2.686635]  [<ffffffff812d7777>] register_mtd_blktrans+0x105/0x25e
> [    2.686642]  [<ffffffff818c0de9>] ? init_mtdblock+0x0/0x2c
> [    2.686648]  [<ffffffff818c0e13>] init_mtdblock+0x2a/0x2c
> [    2.686656]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
> [    2.686663]  [<ffffffff818986a6>] kernel_init+0x160/0x1ea
> [    2.686669]  [<ffffffff81003814>] kernel_thread_helper+0x4/0x10
> [    2.686677]  [<ffffffff8140d77c>] ? restore_args+0x0/0x30
> [    2.686683]  [<ffffffff81898546>] ? kernel_init+0x0/0x1ea
> [    2.686688]  [<ffffffff81003810>] ? kernel_thread_helper+0x0/0x10
> [    2.687683] mtdoops: mtd device (mtddev=name/number) must be supplied

This should be covered by the patch I sent with my previous email.

And thank you again, Miles, for all the testing!!!

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH] e100: Fix the TX workqueue race
From: David Miller @ 2010-04-25  2:58 UTC (permalink / raw)
  To: alan; +Cc: e1000-devel, netdev
In-Reply-To: <20100424121127.084b9766@linux.intel.com>

From: Alan Cox <alan@linux.intel.com>
Date: Sat, 24 Apr 2010 12:11:27 +0100

> No idea why it won't apply - I guess net has diverged from -next in
> this area. Other problem is not typing "stg ref" before "stg export"

It has; the debug print statement above the lines you are changing is
completely different.

Please generate this patch against net-2.6 so I can apply it, thanks
Alan.

------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH] e100: Fix the TX workqueue race
From: David Miller @ 2010-04-25  3:00 UTC (permalink / raw)
  To: alan; +Cc: netdev, e1000-devel
In-Reply-To: <20100424113629.0ad3569b@linux.intel.com>

From: Alan Cox <alan@linux.intel.com>
Date: Sat, 24 Apr 2010 11:36:29 +0100

> Puzzling as it came from a building -next tree. Will see what's
> happened next week if I get time, but I'm afraid net stuff isn't a
> priority - in fact it's disappointing that having diagnosed a bug
> months ago (which was the hard bit) and posted a test patch months
> ago the maintainers haven't fixed it.

It's disappointing to me that someone as experienced and skilled
as yourself can't generate a clean patch which is 1) against
the appropriate tree for a bug fix and 2) actually compiles.

Or is this too much to ask? :-)

^ permalink raw reply

* Re: [PATCH 2/2] sky2: add support for receive hashing (v3)
From: David Miller @ 2010-04-25  3:04 UTC (permalink / raw)
  To: shemminger; +Cc: jeff, netdev
In-Reply-To: <20100424162239.1aae32e0@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Sat, 24 Apr 2010 16:22:39 -0700

> Subject: sky2: add support for receive hashing
> 
> Sky2 hardware supports hardware receive hash calculation.
> Now that Receive Packet Steering is available, add support
> to enable it.
> 
> This version does not depend on CONFIG_RPS. Also set_flags rejects
> all values except RXHASH, so driver won't have to change next time
> somebody adds a new one.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied, thanks Stephen.
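
The "rejects all values except RXHASH" behaviour described in the quoted
changelog could look roughly like the sketch below. It is a standalone
illustration, not the sky2 code: FLAG_RXHASH, FLAG_OTHER, struct dev_state
and set_flags() are hypothetical names, and the flag values are made up
rather than taken from the kernel's ethtool definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the kernel's ETH_FLAG_* values. */
#define FLAG_RXHASH  (1u << 0)
#define FLAG_OTHER   (1u << 1)

/* Hypothetical device state. */
struct dev_state {
	uint32_t features;
};

/* Accept RXHASH only; any other bit is rejected, including bits added later. */
static int set_flags(struct dev_state *dev, uint32_t data)
{
	if (data & ~FLAG_RXHASH)
		return -EOPNOTSUPP;

	if (data & FLAG_RXHASH)
		dev->features |= FLAG_RXHASH;
	else
		dev->features &= ~FLAG_RXHASH;
	return 0;
}
```

Masking against the single supported bit, rather than listing the
unsupported ones, is what keeps the check valid when new flags appear.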

^ permalink raw reply

* 2.6.33.2 networking regression
From: Dave Jones @ 2010-04-25  3:16 UTC (permalink / raw)
  To: netdev; +Cc: stable

Something odd happened when I upgraded my router
from 2.6.33.1 to .33.2.  Its internal NIC (a VIA Velocity)
stopped receiving packets.

dmesg was getting flooded with..

[  188.919957] via-velocity 0000:00:0e.0: BAR 0: set to [io  0xf800-0xf8ff] (PCI address [0xf800-0xf8ff]
[  188.920002] via-velocity 0000:00:0e.0: BAR 1: set to [mem 0xfdffe000-0xfdffe0ff] (PCI address [0xfdffe000-0xfdffe0ff]
[  203.913967] via-velocity 0000:00:0e.0: BAR 0: set to [io  0xf800-0xf8ff] (PCI address [0xf800-0xf8ff]
[  203.914181] via-velocity 0000:00:0e.0: BAR 1: set to [mem 0xfdffe000-0xfdffe0ff] (PCI address [0xfdffe000-0xfdffe0ff]

every so often for some reason.

rebooting back to .1, it works fine.

There don't appear to be any direct changes to via-velocity.c in the
diff, so I'm really confused. Any clues ? I'll bisect it, but it
probably won't be until Monday..

	Dave

^ permalink raw reply

* Re: 2.6.33.2 networking regression
From: David Miller @ 2010-04-25  3:19 UTC (permalink / raw)
  To: davej; +Cc: netdev, stable
In-Reply-To: <20100425031656.GA27598@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Sat, 24 Apr 2010 23:16:57 -0400

> There don't appear to be any direct changes to via-velocity.c in the
> diff, so I'm really confused. Any clues ? I'll bisect it, but it
> probably won't be until Monday..

Looks like some x86/PCI/ACPI change causes this, rather than a
networking change.

^ permalink raw reply

* Re: [PATCH] e100: Fix the TX workqueue race
From: David Miller @ 2010-04-25  4:10 UTC (permalink / raw)
  To: alan; +Cc: e1000-devel, netdev
In-Reply-To: <20100424.195859.193729555.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Sat, 24 Apr 2010 19:58:59 -0700 (PDT)

> Please generate this patch against net-2.6 so I can apply it, thanks
> Alan.

Nevermind, I took care of this for you.

^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: David Miller @ 2010-04-25  5:51 UTC (permalink / raw)
  To: xiaosuo; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <1272122227-13070-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Sat, 24 Apr 2010 23:17:07 +0800

> optimize rps_get_cpu().
> 
> Don't initialize ports when we can get the ports: one memory access for ports
> rather than two.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied, thanks.

We can load both addresses in one go on 64-bit btw.

It seems we're just duplicating, one by one, the optimizations
we already do in INET_COMBINED_PORTS() and INET_ADDR_COOKIE().
:-)
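
As a rough illustration of loading both addresses in one go, here is a
userspace sketch. It assumes the two 32-bit addresses sit next to each
other in memory; struct addr_pair, load_addrs() and same_addrs() are
hypothetical names, and this does not reproduce the kernel macros
mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header fragment: source and destination address are adjacent. */
struct addr_pair {
	uint32_t saddr;
	uint32_t daddr;
};

/* Load both 32-bit addresses with a single 64-bit access. */
static uint64_t load_addrs(const struct addr_pair *p)
{
	uint64_t both;

	memcpy(&both, p, sizeof(both));   /* compilers typically emit one load */
	return both;
}

/* Compare two address pairs with one 64-bit comparison instead of two 32-bit ones. */
static int same_addrs(const struct addr_pair *a, const struct addr_pair *b)
{
	return load_addrs(a) == load_addrs(b);
}
```

memcpy() is used instead of a pointer cast so the sketch stays free of
alignment and aliasing problems.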

^ permalink raw reply
