Netdev List
 help / color / mirror / Atom feed
* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 18:04 UTC (permalink / raw)
  To: Tom Herbert; +Cc: hadi, netdev, robert, David Miller, Changli Gao, Andi Kleen
In-Reply-To: <t2p65634d661004141031xf80f62e7sb64362ea1ce10a1f@mail.gmail.com>

Le mercredi 14 avril 2010 à 10:31 -0700, Tom Herbert a écrit :
> The point of RPS is to increase parallelism, but the cost of that is
> more overhead per packet.  If you are running a single flow, then
> you'll see latency increase for that flow.  With more concurrent flows
> the benefits of parallelism kick in and latency gets better.-- we've
> seen the break even point around ten connections in our tests.  Also,
> I don't think we've made the claim that RPS should generally perform
> better than multi-queue, the primary motivation for RPS is make single
> queue NICs give reasonable performance.
> 

Yes, multiqueue is far better of course, but in case of hardware lacking
multiqueue, RPS can help many workloads, where application has _some_
work to do, not only counting frames or so...

RPS overhead (IPI, cache misses, ...) must be amortized by
parallelization or we lose.

A ping test is not an ideal candidate for RPS, since everything is done
at softirq level, and should be faster without RPS...




^ permalink raw reply

* Re: [E1000-devel] e1000e/netdev.c patch -- tx_ring->next_to_use
From: Brandeburg, Jesse @ 2010-04-14 18:12 UTC (permalink / raw)
  To: Charles Slivkoff
  Cc: terry.loftin@hp.com, e1000-devel@lists.sourceforge.net,
	davem@davemloft.net, Kirsher, Jeffrey T, netdev, emil.s.tantilov
In-Reply-To: <4BC5E563.9050206@cmu.edu>



On Wed, 14 Apr 2010, Charles Slivkoff wrote:
> I have been experiencing a number of system hangs which I believe are 
> due to the e1000e driver. I have a Dell Optiplex 760, Intel Core 2 Duo, 
> 4GB RAM, and I'm running Ubuntu 9.10 (32-bit).

have you filed a bug at launchpad?  if so what is the number?  I just want 
to unite all the information we have.

>  From the stack included in the kernel oops output, I decided to apply 
> the patch you provided, which I found posted here:
> 
> 	http://patchwork.ozlabs.org/patch/49175/

Hi Charles, I copied netdev for you.  I agree the panic you're seeing is 
from something inside the e1000e driver.  The question becomes why is the 
driver getting a null pointer dereference in transmit cleanup.

> This morning, I attempted an rsync operation which caused a hang once again.
> 
> I am attaching the oops output from 04/08/2010 and 04/14/2010.

I see you also filed a bug at e1000's sourceforge, thank you.

As a workaround you can try disabling TSO using ethtool to see if that 
helps.  We need to reproduce this here if possible.

ethtool -K eth0 tso off

Do you happen to *not* have irqbalance installed or enabled?  I was 
confused by the move_irq in one of the stack traces.  In any case it 
probably doesn't matter but I was not expecting to see that there.

For others, I've included the panic traces inline here...

[603636.169243] BUG: unable to handle kernel NULL pointer dereference at 000000ac
[603636.172898] IP: [<f82ee88f>] e1000_clean_tx_irq+0x8f/0x330 [e1000e]
[603636.172898] *pdpt = 000000002c954001 *pde = 0000000000000000
[603636.172898] Oops: 0000 [#1] SMP
[603636.172898] last sysfs file: /sys/devices/virtual/block/ram9/uevent
[603636.172898] Modules linked in: isofs udf crc_itu_t ppp_async crc_ccitt 
vmnet vmci vmmon binfmt_misc cisco_ipsec(P) openafs(P) deflate 
zlib_deflate ctr twofish twofish_common camellia serpent blowfish cast5 
des_generic cbc aes_i586 aes_generic xcbc rmd160 sha256_generic 
sha1_generic crypto_null af_key nfsd exportfsnfs lockd nfs_acl auth_rpcgss 
sunrpc snd_hda_codec_analog ipt_REJECT ipt_LOG xt_limit xt_tcpudp xt_state 
ipt_addrtype snd_hda_intel snd_hda_codec snd_usb_audio snd_pcm_oss 
snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_usb_lib snd_seq_midi 
ip6table_filter ip6_tables nf_nat_irc nf_conntrack_irc snd_rawmidi 
snd_seq_midi_event snd_seq nf_nat_ftp nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack snd_hwdep coretemp 
iptable_filter uvcvideo videodev snd_timer snd_seq_device v4l1_compat 
ip_tables x_tables psmouse serio_raw ppdev dell_wmi dcdbas parport_pc 
fglrx(P) snd soundcore lp snd_page_alloc parport heci(C) usbhid intel_agp 
e1000e agpgart
[603636.172898]
[603636.172898] Pid: 4368, comm: chrome Tainted: P         C  (2.6.31-20-generic-pae #58-Ubuntu) OptiPlex 760
[603636.172898] EIP: 0060:[<f82ee88f>] EFLAGS: 00010246 CPU: 0
[603636.172898] EIP is at e1000_clean_tx_irq+0x8f/0x330 [e1000e]
[603636.172898] EAX: 00000000 EBX: 00000024 ECX: 00000240 EDX: f902e360
[603636.172898] ESI: f6e741e0 EDI: ef412000 EBP: ebcbbd84 ESP: ebcbbd24
[603636.172898]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[603636.172898] Process chrome (pid: 4368, ti=ebcba000 task=ee5cd7f0 task.ti=ebcba000)
[603636.172898] Stack:
[603636.172898]  fffb2668 ec9b6d50 00000001 00000014 d7938e00 ebcbbeb4 0000000000000000
[603636.172898] <0> 00000000 00000000 ec9f01b8 00000001 f64dc000 b54cd000 00000086 00000001
[603636.172898] <0> f64dc340 00000024 00000086 c057bf01 00000001 f64dc340 f64dc340 00000040
[603636.172898] Call Trace:
[603636.172898]  [<c057bf01>] ? do_page_fault+0x141/0x380
[603636.172898]  [<f82f0a54>] ? e1000_clean+0x54/0x270 [e1000e]
[603636.172898]  [<c04a7795>] ? net_rx_action+0xe5/0x1c0
[603636.172898]  [<c014cb30>] ? __do_softirq+0x90/0x1a0
[603636.172898]  [<c019189c>] ? handle_IRQ_event+0x4c/0x140
[603636.172898]  [<c01fcb42>] ? __d_lookup+0x102/0x110
[603636.172898]  [<c0194544>] ? move_native_irq+0x14/0x50
[603636.172898]  [<c014cc7d>] ? do_softirq+0x3d/0x40
[603636.172898]  [<c014cdbd>] ? irq_exit+0x5d/0x70
[603636.172898]  [<c0104f50>] ? do_IRQ+0x50/0xc0
[603636.172898]  [<c01e6ec2>] ? __mem_cgroup_uncharge_common+0xa2/0xf0
[603636.172898]  [<c01039f0>] ? common_interrupt+0x30/0x40
[603636.172898]  [<c048007b>] ? hidinput_configure_usage+0xcab/0x2290
[603636.172898]  [<c05700d8>] ? hlt_loop+0x3/0xb
[603636.172898]  [<c04edbd1>] ? udp_v4_get_port+0x1/0x20
[603636.172898]  [<c04f6421>] ? inet_autobind+0x21/0x60
[603636.172898]  [<c04f659d>] ? inet_dgram_connect+0x5d/0x70
[603636.172898]  [<c049684e>] ? sys_connect+0xae/0xd0
[603636.172898]  [<c02d03b3>] ? security_d_instantiate+0x13/0x30
[603636.172898]  [<c01fc690>] ? d_instantiate+0x40/0x50
[603636.172898]  [<c0495178>] ? sock_attach_fd+0x78/0xc0
[603636.172898]  [<c0579a88>] ? _spin_lock+0x8/0x10
[603636.172898]  [<c01e9207>] ? fd_install+0x47/0x60
[603636.172898]  [<c04951fd>] ? sock_map_fd+0x3d/0x60
[603636.172898]  [<c0497578>] ? sys_socketcall+0x248/0x270
[603636.172898]  [<c01032c3>] ? sysenter_do_call+0x12/0x28
[603636.1: lost 7 rtc interrupts
[603636.538819] hpet1: lost 7 rtc interrupts
[603636.542825] hpet1: lost 7 rtc interrupts
[603636.546830] hpet1: lost 8 rtc interrupts
[603636.550836] hpet1: lost 7 rtc interrupts
[603636.554842] hpet1: lost 7 rtc interrupts
[603636.558848] hpet1: lost 7 rtc interrupts
[603636.562854] hpet1: lost 8 rtc interrupts
[603636.566865] ---[ end trace f3dd0b8abcd2bca2 ]---
[603636.571563] Kernel panic - not syncing: Fatal exception in interrupt
[603636.577995] Pid: 4368, comm: chrome Tainted: P      D  C 2.6.31-20-generic-pae #58-Ubuntu
[603636.586245] Call Trace:
[603636.588779]  [<c05775ee>] ? printk+0x18/0x1a
[603636.593132]  [<c0577532>] panic+0x43/0xe7
[603636.597226]  [<c057a935>] oops_end+0xc5/0xd0
[603636.601580]  [<c0129084>] no_context+0xb4/0xd0
[603636.606107]  [<c01290dd>] __bad_area_nosemaphore+0x3d/0x1a0
[603636.611760]  [<c012eb2e>] ? kmap_atomic_prot+0xde/0x100
[603636.617067]  [<c012e972>] ? kunmap_atomic+0x52/0x70


and....


[496797.222642] BUG: unable to handle kernel NULL pointer dereference at 000000ac
[496797.229405] IP: [<f922d50b>] e1000_clean_tx_irq+0xcb/0x320 [e1000e]
[496797.232626] *pdpt = 000000002b5cd001 *pde = 0000000000000000
[496797.232626] Oops: 0000 [#1] SMP
[496797.232626] last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/class
[496797.232626] Modules linked in: e1000e vmnet vmci vmmon binfmt_misc 
cisco_ipsec(P) openafs(P) deflate zlib_deflate ctr twofish twofish_common 
camellia serpent blowfish cast5 des_generic cbc aes_i586 aes_generic xcbc 
rmd160 sha256_generic sha1_generic crypto_null af_key nfsd exportfs nfs 
lockd nfs_acl auth_rpcgss sunrpc snd_hda_codec_analog ipt_REJECT ipt_LOG 
xt_limit xt_tcpudp xt_state ipt_addrtype ip6table_filter ip6_tables 
nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 snd_usb_audio snd_hda_intel snd_hda_codec snd_seq_dummy 
snd_pcm_oss nf_conntrack_ftp snd_mixer_oss nf_conntrack iptable_filter 
ppdev ip_tables x_tables snd_pcm snd_usb_lib snd_hwdep snd_seq_oss 
dell_wmi dcdbas uvcvideo videodev v4l1_compat psmouse serio_raw fglrx(P) 
snd_seq_midi parport_pc snd_rawmidi snd_seq_midi_event snd_seq snd_timer 
snd_seq_device snd soundcoresnd_page_alloc heci(C) coretemp lp parport 
usbhid intel_agp agpgart [last unloaded: e1000e]
[496797.232626]
[496797.232626] Pid: 11466, comm: ssh Tainted: P         C (2.6.31-20-generic-pae #58-Ubuntu) OptiPlex 760
[496797.232626] EIP: 0060:[<f922d50b>] EFLAGS: 00210246 CPU: 1
[496797.232626] EIP is at e1000_clean_tx_irq+0xcb/0x320 [e1000e]
[496797.232626] EAX: 00000000 EBX: 00000056 ECX: 00000560 EDX: f8554810
[496797.232626] ESI: e9cba360 EDI: e9c6c000 EBP: efc8fcfc ESP: efc8fc9c
[496797.232626]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[496797.232626] Process ssh (pid: 11466, ti=efc8e000 task=f0554b60 task.ti=efc8e000)
[496797.232626] Stack:
[496797.232626]  00000020 20dd855d 00000005 f0014c00 e49ea5a0 efc8fcf8 c04e2387efc8fce4
[496797.232626] <0> c04dfc04 000252d0 00002b00 00000001 f6074000 e49ea580 00004912 0000000f
[496797.232626] <0> f6074340 00000056 0000058e c0127f01 0000000f f6074340 f6074340 00000040
[496797.232626] Call Trace:
[496797.232626]  [<c04e2387>] ? tcp_transmit_skb+0x397/0x650
[496797.232626]  [<c04dfc04>] ? tcp_clean_rtx_queue+0x3f4/0x7b0
[496797.232626]  [<c0127f01>] ? native_patch+0xf1/0x110
[496797.232626]  [<f922f504>] ? e1000_clean+0x54/0x270 [e1000e]
[496797.232626]  [<c0152227>] ? lock_timer_base+0x27/0x50
[496797.232626]  [<c04a7795>] ? net_rx_action+0xe5/0x1c0
[496797.232626]  [<c014cb30>] ? __do_softirq+0x90/0x1a0
[496797.232626]  [<c04db9df>] ? __tcp_ack_snd_check+0x5f/0x80
[496797.232626]  [<c04e0dfe>] ? tcp_rcv_established+0x32e/0x5f0
[496797.232626]  [<c014cc7d>] ? do_softirq+0x3d/0x40
[496797.232626]  [<c014d805>] ? local_bh_enable_ip+0x75/0x90
[496797.232626]  [<c0579c51>] ? _spin_unlock_bh+0x11/0x20
[496797.232626]  [<c04983d4>] ? release_sock+0x94/0xa0
[496797.232626]  [<c04d5535>] ? tcp_push+0x75/0xb0
[496797.488251]  [<c04d86bd>] ? tcp_sendmsg+0x67d/0x900



^ permalink raw reply

* Re: e1000e/netdev.c patch -- tx_ring->next_to_use
From: Charles Slivkoff @ 2010-04-14 18:47 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: emil.s.tantilov, e1000-devel@lists.sourceforge.net, netdev,
	terry.loftin@hp.com, Kirsher, Jeffrey T, davem@davemloft.net
In-Reply-To: <alpine.WNT.2.00.1004141105060.4368@jbrandeb-desk1.amr.corp.intel.com>

On 04/14/2010 02:12 PM, Brandeburg, Jesse wrote:
...

> have you filed a bug at launchpad?  if so what is the number?  I just want 
> to unite all the information we have.

I just submitted one, using "ubuntu-bug", so a number of associated 
system details are included as attachments.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/563267

...

> As a workaround you can try disabling TSO using ethtool to see if that 
> helps.  We need to reproduce this here if possible.
> 
> ethtool -K eth0 tso off

I will give this a try.

> Do you happen to *not* have irqbalance installed or enabled?  I was 
> confused by the move_irq in one of the stack traces.  In any case it 
> probably doesn't matter but I was not expecting to see that there.

irqbalance is *not* installed.

...


Thanks,

-Charles


------------------------------------------------------------------------------
Download Intel&#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-14 18:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, netdev, robert, David Miller, Changli Gao,
	Andi Kleen
In-Reply-To: <1271268242.16881.1719.camel@edumazet-laptop>

On Wed, 2010-04-14 at 20:04 +0200, Eric Dumazet wrote:

> Yes, multiqueue is far better of course, but in case of hardware lacking
> multiqueue, RPS can help many workloads, where application has _some_
> work to do, not only counting frames or so...

Agreed. So to enumerate, the benefits come in if:
a) you have many processors
b) you have single-queue nic
c) at sub-threshold traffic you dont care about a little latency
d) you have a specific cache hierachy
e) app is working hard to process incoming messages

> RPS overhead (IPI, cache misses, ...) must be amortized by
> parallelization or we lose.

Indeed. 
How well they can be amortized seems very cpu or board specific.

I think the main challenge for my pedantic mind is missing details. Is
there a paper on rps? Example for #d above, the commit log mentions that
rps benefits if you have certain types of "cache hierachy". Probably
some arch with large shared L2/3 (maybe inclusive) cache will benefit.
example: it does well on Nehalem and probably opterons as long (as you
dont start stacking these things on some interconnect like QPI or HT).
But what happens when you have FSB sharing across cores (still a very
common setup)? etc etc

Can I ask what hardware you run this on?

> A ping test is not an ideal candidate for RPS, since everything is done
> at softirq level, and should be faster without RPS...

ping wont do justice to the possible potential of rps mostly because it
generates very little traffic i.e the part #c above. But it helps me at
least boot a machine with proper setup - but it is not totally useless
because i think the cost of IPI can be deduced from the results.
I am going to put together some udp app with variable think-time to see
what happens. Would that be a reasonable thing to test on?

It would be valuable to have something like Documentation/networking/rps
to detail things a little more. 

cheers,
jamal


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Stephen Hemminger @ 2010-04-14 18:53 UTC (permalink / raw)
  To: hadi
  Cc: Tom Herbert, Eric Dumazet, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271245986.3943.55.camel@bigi>

On Wed, 14 Apr 2010 07:53:06 -0400
jamal <hadi@cyberus.ca> wrote:

> Results:
> CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
> cpu was being chewed highly by sky2_poll and on the redirected-to-core
> it was always smp_call_function_single

I posted a patch to use sky2 hardware hash (RSS) which should lower the
cost per packet.

-- 

^ permalink raw reply

* Över 300 spel
From: King Spin @ 2010-04-14 20:04 UTC (permalink / raw)
  To: newbieb

Kan du bevara en hemlighet? Det här stället är er helt fantastisk och jag vann stort här igår kväll.

http://www.bestgames-lux.net/sv/


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Stephen Hemminger @ 2010-04-14 19:44 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271271222.4567.51.camel@bigi>

On Wed, 14 Apr 2010 14:53:42 -0400
jamal <hadi@cyberus.ca> wrote:

> Agreed. So to enumerate, the benefits come in if:
> a) you have many processors
> b) you have single-queue nic
> c) at sub-threshold traffic you dont care about a little latency

There probably needs to be better autotuning for this, there is no reason
that RPS to be steering packets unless the queue is getting backed up.
Some kind of high / low water mark mechanism is needed.

RPS might also interact with the core turbo boost functionality on Intel chips.
Newer chips will make a single core faster if other core can be kept idle.


-- 

^ permalink raw reply

* [BUG net-next-2.6] fib: Some rcu warning
From: Eric Dumazet @ 2010-04-14 19:54 UTC (permalink / raw)
  To: netdev; +Cc: Paul E. McKenney

[   27.756998] IPv4 FIB: Using LC-trie version 0.409
[   27.757121] 
[   27.757121] ===================================================
[   27.757228] [ INFO: suspicious rcu_dereference_check() usage. ]
[   27.757285] ---------------------------------------------------
[   27.757342] net/ipv4/fib_trie.c:212 invoked rcu_dereference_check()
without protection!
[   27.757417] 
[   27.757417] other info that might help us debug this:
[   27.757418] 
[   27.757569] 
[   27.757570] rcu_scheduler_active = 1, debug_locks = 0
[   27.757674] 2 locks held by ip/5686:
[   27.757727]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8134d247>]
rtnl_lock+0x17/0x20
[   27.757936]  #1:  ((inetaddr_chain).rwsem){.+.+.+}, at:
[<ffffffff810654a7>] __blocking_notifier_call_chain+0x47/0x90
[   27.758148] 
[   27.758149] stack backtrace:
[   27.758249] Pid: 5686, comm: ip Not tainted
2.6.34-rc3-03164-gb4bf665-dirty #10
[   27.758323] Call Trace:
[   27.758377]  [<ffffffff81071b4f>] lockdep_rcu_dereference+0xaf/0xc0
[   27.758437]  [<ffffffff813ae32c>] fib_find_node+0x19c/0x220
[   27.758495]  [<ffffffff813b0bdc>] fib_table_insert+0xac/0x760
[   27.758552]  [<ffffffff813aa121>] ? fib_get_table+0x91/0xd0
[   27.758609]  [<ffffffff813aa0b8>] ? fib_get_table+0x28/0xd0
[   27.758667]  [<ffffffff813aa561>] fib_magic+0x111/0x120
[   27.758723]  [<ffffffff813aa6a0>] fib_add_ifaddr+0x130/0x170
[   27.758780]  [<ffffffff813aad3c>] fib_inetaddr_event+0x5c/0x290
[   27.758838]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
[   27.758896]  [<ffffffff810654bd>] __blocking_notifier_call_chain
+0x5d/0x90
[   27.758956]  [<ffffffff81065506>] blocking_notifier_call_chain
+0x16/0x20
[   27.759016]  [<ffffffff813a0c44>] __inet_insert_ifa+0xd4/0x170
[   27.759089]  [<ffffffff813a0cf2>] inet_insert_ifa+0x12/0x20
[   27.759146]  [<ffffffff813a2140>] inetdev_event+0x400/0x430
[   27.759204]  [<ffffffff81362472>] ? netlink_broadcast+0x262/0x3f0
[   27.759263]  [<ffffffff81351391>] ? fib_rules_event+0x21/0x180
[   27.759320]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
[   27.759378]  [<ffffffff8106512e>] __raw_notifier_call_chain+0xe/0x10
[   27.759436]  [<ffffffff81065146>] raw_notifier_call_chain+0x16/0x20
[   27.761102]  [<ffffffff8133fc1b>] call_netdevice_notifiers+0x1b/0x20
[   27.761161]  [<ffffffff8133ffad>] __dev_notify_flags+0x7d/0x90
[   27.761219]  [<ffffffff81340005>] dev_change_flags+0x45/0x70
[   27.761278]  [<ffffffff813a2e79>] devinet_ioctl+0x5e9/0x770
[   27.761336]  [<ffffffff813a36b1>] inet_ioctl+0x61/0x80
[   27.761392]  [<ffffffff8132c532>] sock_do_ioctl+0x32/0x60
[   27.761449]  [<ffffffff8132d06c>] compat_sock_ioctl+0x89c/0xb60
[   27.761507]  [<ffffffff81075816>] ? __lock_acquire+0x486/0xaf0
[   27.761567]  [<ffffffff810cbfeb>] ? __do_fault+0x12b/0x480
[   27.761624]  [<ffffffff810b4a93>] ? filemap_fault+0xd3/0x3d0
[   27.761683]  [<ffffffff810f7104>] ? fget_light+0x174/0x360
[   27.761740]  [<ffffffff810cc1f9>] ? __do_fault+0x339/0x480
[   27.761798]  [<ffffffff811370e3>] compat_sys_ioctl+0xd3/0x1900
[   27.761856]  [<ffffffff810ce28a>] ? handle_mm_fault+0x19a/0x780
[   27.761915]  [<ffffffff81023832>] ? do_page_fault+0xe2/0x3b0
[   27.761972]  [<ffffffff81064703>] ? up_read+0x23/0x40
[   27.762028]  [<ffffffff810238f5>] ? do_page_fault+0x1a5/0x3b0
[   27.762087]  [<ffffffff813f36cb>] ? _raw_spin_unlock+0x2b/0x40
[   27.762145]  [<ffffffff810f3508>] ? fd_install+0xa8/0xe0
[   27.762202]  [<ffffffff8132eced>] ? sock_map_fd+0x2d/0x40
[   27.762259]  [<ffffffff813f3a69>] ? retint_swapgs+0xe/0x13
[   27.762331]  [<ffffffff8107292d>] ? trace_hardirqs_on_caller
+0x10d/0x190
[   27.762391]  [<ffffffff813f2d21>] ? trace_hardirqs_off_thunk
+0x3a/0x3c
[   27.762450]  [<ffffffff81028b64>] sysenter_dispatch+0x7/0x30
[   27.762507]  [<ffffffff813f2ce2>] ? trace_hardirqs_on_thunk+0x3a/0x3f



^ permalink raw reply

* [PATCH net-next-2.6 v3] fasync: RCU and fine grained locking
From: Eric Dumazet @ 2010-04-14 19:55 UTC (permalink / raw)
  To: David Miller; +Cc: Paul E. McKenney, netdev, linux-kernel, Lai Jiangshan
In-Reply-To: <1271259264.16881.1703.camel@edumazet-laptop>

Here is V3 of the patch. This patch is a preliminary work to full RCU
conversion of sock_def_readable() & sock_def_write_space() functions,
called nearly for each packet...

I based it against David net-next-2.6 tree.

Thanks

[PATCH net-next-2.6 v3] fasync: RCU and fine grained locking

kill_fasync() uses a central rwlock, candidate for RCU conversion, to
avoid cache line ping pongs on SMP.

fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
section instead during whole list scan.

Use a spinlock per fasync_struct to synchronize kill_fasync_rcu() and
fasync_{remove|add}_entry(). This spinlock is IRQ safe, so sock_fasync()
doesnt need its own implementation and can use fasync_helper(), to
reduce code size and complexity.

We can remove __kill_fasync() direct use in net/socket.c, and rename it
to kill_fasync_rcu().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
---
v3: sock_fasync() can use generic fasync_helper(), this gives a nice
cleanup.

v2: As Lai Jiangshan noticed, we need a mutual exclusion between
fasync_{remove|add}_entry() and kill_fasync_rcu().

 fs/fcntl.c         |   66 ++++++++++++++++++++++++--------------
 include/linux/fs.h |   12 +++----
 net/socket.c       |   73 ++++++-------------------------------------
 3 files changed, 59 insertions(+), 92 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 452d02f..0a14074 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -614,9 +614,15 @@ int send_sigurg(struct fown_struct *fown)
 	return ret;
 }
 
-static DEFINE_RWLOCK(fasync_lock);
+static DEFINE_SPINLOCK(fasync_lock);
 static struct kmem_cache *fasync_cache __read_mostly;
 
+static void fasync_free_rcu(struct rcu_head *head)
+{
+	kmem_cache_free(fasync_cache,
+			container_of(head, struct fasync_struct, fa_rcu));
+}
+
 /*
  * Remove a fasync entry. If successfully removed, return
  * positive and clear the FASYNC flag. If no entry exists,
@@ -625,8 +631,6 @@ static struct kmem_cache *fasync_cache __read_mostly;
  * NOTE! It is very important that the FASYNC flag always
  * match the state "is the filp on a fasync list".
  *
- * We always take the 'filp->f_lock', in since fasync_lock
- * needs to be irq-safe.
  */
 static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 {
@@ -634,17 +638,22 @@ static int fasync_remove_entry(struct file *filp, struct fasync_struct **fapp)
 	int result = 0;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
+		fa->fa_file = NULL;
+		spin_unlock_irq(&fa->fa_lock);
+
 		*fp = fa->fa_next;
-		kmem_cache_free(fasync_cache, fa);
+		call_rcu(&fa->fa_rcu, fasync_free_rcu);
 		filp->f_flags &= ~FASYNC;
 		result = 1;
 		break;
 	}
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -666,25 +675,30 @@ static int fasync_add_entry(int fd, struct file *filp, struct fasync_struct **fa
 		return -ENOMEM;
 
 	spin_lock(&filp->f_lock);
-	write_lock_irq(&fasync_lock);
+	spin_lock(&fasync_lock);
 	for (fp = fapp; (fa = *fp) != NULL; fp = &fa->fa_next) {
 		if (fa->fa_file != filp)
 			continue;
+
+		spin_lock_irq(&fa->fa_lock);
 		fa->fa_fd = fd;
+		spin_unlock_irq(&fa->fa_lock);
+
 		kmem_cache_free(fasync_cache, new);
 		goto out;
 	}
 
+	spin_lock_init(&new->fa_lock);
 	new->magic = FASYNC_MAGIC;
 	new->fa_file = filp;
 	new->fa_fd = fd;
 	new->fa_next = *fapp;
-	*fapp = new;
+	rcu_assign_pointer(*fapp, new);
 	result = 1;
 	filp->f_flags |= FASYNC;
 
 out:
-	write_unlock_irq(&fasync_lock);
+	spin_unlock(&fasync_lock);
 	spin_unlock(&filp->f_lock);
 	return result;
 }
@@ -704,37 +718,41 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
 
 EXPORT_SYMBOL(fasync_helper);
 
-void __kill_fasync(struct fasync_struct *fa, int sig, int band)
+/*
+ * rcu_read_lock() is held
+ */
+static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
 {
 	while (fa) {
-		struct fown_struct * fown;
+		struct fown_struct *fown;
 		if (fa->magic != FASYNC_MAGIC) {
 			printk(KERN_ERR "kill_fasync: bad magic number in "
 			       "fasync_struct!\n");
 			return;
 		}
-		fown = &fa->fa_file->f_owner;
-		/* Don't send SIGURG to processes which have not set a
-		   queued signum: SIGURG has its own default signalling
-		   mechanism. */
-		if (!(sig == SIGURG && fown->signum == 0))
-			send_sigio(fown, fa->fa_fd, band);
-		fa = fa->fa_next;
+		spin_lock(&fa->fa_lock);
+		if (fa->fa_file) {
+			fown = &fa->fa_file->f_owner;
+			/* Don't send SIGURG to processes which have not set a
+			   queued signum: SIGURG has its own default signalling
+			   mechanism. */
+			if (!(sig == SIGURG && fown->signum == 0))
+				send_sigio(fown, fa->fa_fd, band);
+		}
+		spin_unlock(&fa->fa_lock);
+		fa = rcu_dereference(fa->fa_next);
 	}
 }
 
-EXPORT_SYMBOL(__kill_fasync);
-
 void kill_fasync(struct fasync_struct **fp, int sig, int band)
 {
 	/* First a quick test without locking: usually
 	 * the list is empty.
 	 */
 	if (*fp) {
-		read_lock(&fasync_lock);
-		/* reread *fp after obtaining the lock */
-		__kill_fasync(*fp, sig, band);
-		read_unlock(&fasync_lock);
+		rcu_read_lock();
+		kill_fasync_rcu(rcu_dereference(*fp), sig, band);
+		rcu_read_unlock();
 	}
 }
 EXPORT_SYMBOL(kill_fasync);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..018d382 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1280,10 +1280,12 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
 
 
 struct fasync_struct {
-	int	magic;
-	int	fa_fd;
-	struct	fasync_struct	*fa_next; /* singly linked list */
-	struct	file 		*fa_file;
+	spinlock_t		fa_lock;
+	int			magic;
+	int			fa_fd;
+	struct fasync_struct	*fa_next; /* singly linked list */
+	struct file		*fa_file;
+	struct rcu_head		fa_rcu;
 };
 
 #define FASYNC_MAGIC 0x4601
@@ -1292,8 +1294,6 @@ struct fasync_struct {
 extern int fasync_helper(int, struct file *, int, struct fasync_struct **);
 /* can be called from interrupts */
 extern void kill_fasync(struct fasync_struct **, int, int);
-/* only for net: no internal synchronization */
-extern void __kill_fasync(struct fasync_struct *, int, int);
 
 extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
 extern int f_setown(struct file *filp, unsigned long arg, int force);
diff --git a/net/socket.c b/net/socket.c
index 35bc198..9822081 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1067,78 +1067,27 @@ static int sock_close(struct inode *inode, struct file *filp)
  *	1. fasync_list is modified only under process context socket lock
  *	   i.e. under semaphore.
  *	2. fasync_list is used under read_lock(&sk->sk_callback_lock)
- *	   or under socket lock.
- *	3. fasync_list can be used from softirq context, so that
- *	   modification under socket lock have to be enhanced with
- *	   write_lock_bh(&sk->sk_callback_lock).
- *							--ANK (990710)
+ *	   or under socket lock
  */
 
 static int sock_fasync(int fd, struct file *filp, int on)
 {
-	struct fasync_struct *fa, *fna = NULL, **prev;
-	struct socket *sock;
-	struct sock *sk;
-
-	if (on) {
-		fna = kmalloc(sizeof(struct fasync_struct), GFP_KERNEL);
-		if (fna == NULL)
-			return -ENOMEM;
-	}
-
-	sock = filp->private_data;
+	struct socket *sock = filp->private_data;
+	struct sock *sk = sock->sk;
 
-	sk = sock->sk;
-	if (sk == NULL) {
-		kfree(fna);
+	if (sk == NULL)
 		return -EINVAL;
-	}
 
 	lock_sock(sk);
 
-	spin_lock(&filp->f_lock);
-	if (on)
-		filp->f_flags |= FASYNC;
-	else
-		filp->f_flags &= ~FASYNC;
-	spin_unlock(&filp->f_lock);
-
-	prev = &(sock->fasync_list);
+	fasync_helper(fd, filp, on, &sock->fasync_list);
 
-	for (fa = *prev; fa != NULL; prev = &fa->fa_next, fa = *prev)
-		if (fa->fa_file == filp)
-			break;
-
-	if (on) {
-		if (fa != NULL) {
-			write_lock_bh(&sk->sk_callback_lock);
-			fa->fa_fd = fd;
-			write_unlock_bh(&sk->sk_callback_lock);
-
-			kfree(fna);
-			goto out;
-		}
-		fna->fa_file = filp;
-		fna->fa_fd = fd;
-		fna->magic = FASYNC_MAGIC;
-		fna->fa_next = sock->fasync_list;
-		write_lock_bh(&sk->sk_callback_lock);
-		sock->fasync_list = fna;
+	if (!sock->fasync_list)
+		sock_reset_flag(sk, SOCK_FASYNC);
+	else
 		sock_set_flag(sk, SOCK_FASYNC);
-		write_unlock_bh(&sk->sk_callback_lock);
-	} else {
-		if (fa != NULL) {
-			write_lock_bh(&sk->sk_callback_lock);
-			*prev = fa->fa_next;
-			if (!sock->fasync_list)
-				sock_reset_flag(sk, SOCK_FASYNC);
-			write_unlock_bh(&sk->sk_callback_lock);
-			kfree(fa);
-		}
-	}
 
-out:
-	release_sock(sock->sk);
+	release_sock(sk);
 	return 0;
 }
 
@@ -1159,10 +1108,10 @@ int sock_wake_async(struct socket *sock, int how, int band)
 		/* fall through */
 	case SOCK_WAKE_IO:
 call_kill:
-		__kill_fasync(sock->fasync_list, SIGIO, band);
+		kill_fasync(&sock->fasync_list, SIGIO, band);
 		break;
 	case SOCK_WAKE_URG:
-		__kill_fasync(sock->fasync_list, SIGURG, band);
+		kill_fasync(&sock->fasync_list, SIGURG, band);
 	}
 	return 0;
 }

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 19:58 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hadi, Tom Herbert, netdev, robert, David Miller, Changli Gao,
	Andi Kleen
In-Reply-To: <20100414124426.6aee95c3@nehalam>

Le mercredi 14 avril 2010 à 12:44 -0700, Stephen Hemminger a écrit :
> On Wed, 14 Apr 2010 14:53:42 -0400
> jamal <hadi@cyberus.ca> wrote:
> 
> > Agreed. So to enumerate, the benefits come in if:
> > a) you have many processors
> > b) you have single-queue nic
> > c) at sub-threshold traffic you dont care about a little latency
> 
> There probably needs to be better autotuning for this, there is no reason
> that RPS to be steering packets unless the queue is getting backed up.
> Some kind of high / low water mark mechanism is needed.
> 
> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.
> 
> 

This was discussed a while ago, and Out Of Order packet delivery was the
thing that frightened us a bit.

Every time we change RPS to be on or off, we might have some extra
noise. Maybe we already have this problem with irqbalance ?




^ permalink raw reply

* Re: [BUG net-next-2.6] fib: Some rcu warning
From: Eric Dumazet @ 2010-04-14 20:10 UTC (permalink / raw)
  To: netdev, David Miller; +Cc: Paul E. McKenney
In-Reply-To: <1271274844.16881.1743.camel@edumazet-laptop>

Le mercredi 14 avril 2010 à 21:54 +0200, Eric Dumazet a écrit :
> [   27.756998] IPv4 FIB: Using LC-trie version 0.409
> [   27.757121] 
> [   27.757121] ===================================================
> [   27.757228] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   27.757285] ---------------------------------------------------
> [   27.757342] net/ipv4/fib_trie.c:212 invoked rcu_dereference_check()
> without protection!
> [   27.757417] 
> [   27.757417] other info that might help us debug this:
> [   27.757418] 
> [   27.757569] 
> [   27.757570] rcu_scheduler_active = 1, debug_locks = 0
> [   27.757674] 2 locks held by ip/5686:
> [   27.757727]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8134d247>]
> rtnl_lock+0x17/0x20
> [   27.757936]  #1:  ((inetaddr_chain).rwsem){.+.+.+}, at:
> [<ffffffff810654a7>] __blocking_notifier_call_chain+0x47/0x90
> [   27.758148] 
> [   27.758149] stack backtrace:
> [   27.758249] Pid: 5686, comm: ip Not tainted
> 2.6.34-rc3-03164-gb4bf665-dirty #10
> [   27.758323] Call Trace:
> [   27.758377]  [<ffffffff81071b4f>] lockdep_rcu_dereference+0xaf/0xc0
> [   27.758437]  [<ffffffff813ae32c>] fib_find_node+0x19c/0x220
> [   27.758495]  [<ffffffff813b0bdc>] fib_table_insert+0xac/0x760
> [   27.758552]  [<ffffffff813aa121>] ? fib_get_table+0x91/0xd0
> [   27.758609]  [<ffffffff813aa0b8>] ? fib_get_table+0x28/0xd0
> [   27.758667]  [<ffffffff813aa561>] fib_magic+0x111/0x120
> [   27.758723]  [<ffffffff813aa6a0>] fib_add_ifaddr+0x130/0x170
> [   27.758780]  [<ffffffff813aad3c>] fib_inetaddr_event+0x5c/0x290
> [   27.758838]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [   27.758896]  [<ffffffff810654bd>] __blocking_notifier_call_chain
> +0x5d/0x90
> [   27.758956]  [<ffffffff81065506>] blocking_notifier_call_chain
> +0x16/0x20
> [   27.759016]  [<ffffffff813a0c44>] __inet_insert_ifa+0xd4/0x170
> [   27.759089]  [<ffffffff813a0cf2>] inet_insert_ifa+0x12/0x20
> [   27.759146]  [<ffffffff813a2140>] inetdev_event+0x400/0x430
> [   27.759204]  [<ffffffff81362472>] ? netlink_broadcast+0x262/0x3f0
> [   27.759263]  [<ffffffff81351391>] ? fib_rules_event+0x21/0x180
> [   27.759320]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [   27.759378]  [<ffffffff8106512e>] __raw_notifier_call_chain+0xe/0x10
> [   27.759436]  [<ffffffff81065146>] raw_notifier_call_chain+0x16/0x20
> [   27.761102]  [<ffffffff8133fc1b>] call_netdevice_notifiers+0x1b/0x20
> [   27.761161]  [<ffffffff8133ffad>] __dev_notify_flags+0x7d/0x90
> [   27.761219]  [<ffffffff81340005>] dev_change_flags+0x45/0x70
> [   27.761278]  [<ffffffff813a2e79>] devinet_ioctl+0x5e9/0x770
> [   27.761336]  [<ffffffff813a36b1>] inet_ioctl+0x61/0x80
> [   27.761392]  [<ffffffff8132c532>] sock_do_ioctl+0x32/0x60
> [   27.761449]  [<ffffffff8132d06c>] compat_sock_ioctl+0x89c/0xb60
> [   27.761507]  [<ffffffff81075816>] ? __lock_acquire+0x486/0xaf0
> [   27.761567]  [<ffffffff810cbfeb>] ? __do_fault+0x12b/0x480
> [   27.761624]  [<ffffffff810b4a93>] ? filemap_fault+0xd3/0x3d0
> [   27.761683]  [<ffffffff810f7104>] ? fget_light+0x174/0x360
> [   27.761740]  [<ffffffff810cc1f9>] ? __do_fault+0x339/0x480
> [   27.761798]  [<ffffffff811370e3>] compat_sys_ioctl+0xd3/0x1900
> [   27.761856]  [<ffffffff810ce28a>] ? handle_mm_fault+0x19a/0x780
> [   27.761915]  [<ffffffff81023832>] ? do_page_fault+0xe2/0x3b0
> [   27.761972]  [<ffffffff81064703>] ? up_read+0x23/0x40
> [   27.762028]  [<ffffffff810238f5>] ? do_page_fault+0x1a5/0x3b0
> [   27.762087]  [<ffffffff813f36cb>] ? _raw_spin_unlock+0x2b/0x40
> [   27.762145]  [<ffffffff810f3508>] ? fd_install+0xa8/0xe0
> [   27.762202]  [<ffffffff8132eced>] ? sock_map_fd+0x2d/0x40
> [   27.762259]  [<ffffffff813f3a69>] ? retint_swapgs+0xe/0x13
> [   27.762331]  [<ffffffff8107292d>] ? trace_hardirqs_on_caller
> +0x10d/0x190
> [   27.762391]  [<ffffffff813f2d21>] ? trace_hardirqs_off_thunk
> +0x3a/0x3c
> [   27.762450]  [<ffffffff81028b64>] sysenter_dispatch+0x7/0x30
> [   27.762507]  [<ffffffff813f2ce2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> 
> 

Here is the patch to remove this problem.

[PATCH] fib: suppress lockdep-RCU false positive in FIB trie.

Followup of commit 634a4b20

Allow tnode_get_child_rcu() to be called either under rcu_read_lock()
protection or with RTNL held.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 59a8387..c98f115 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -209,7 +209,9 @@ static inline struct node *tnode_get_child_rcu(struct tnode *tn, unsigned int i)
 {
 	struct node *ret = tnode_get_child(tn, i);
 
-	return rcu_dereference(ret);
+	return rcu_dereference_check(ret,
+				     rcu_read_lock_held() ||
+				     lockdep_rtnl_is_held());
 }
 
 static inline int tnode_child_length(const struct tnode *tn)



^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-14 20:22 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <20100414124426.6aee95c3@nehalam>

On Wed, 2010-04-14 at 12:44 -0700, Stephen Hemminger wrote:

> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.

how well does it work with Linux? Sounds like all i need to do is turn
on some BIOS feature. 
One of the negatives with multiqueue nics is because the core selection
is static, you could end up overloading one core while others stay idle.
This seems to steal cycle capacity from the idle cores and gives it to
the busy cpus. nice. So i see it as a boost to multiqueue.

cheers,
jamal


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 20:27 UTC (permalink / raw)
  To: hadi
  Cc: Stephen Hemminger, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271276568.4567.59.camel@bigi>

Le mercredi 14 avril 2010 à 16:22 -0400, jamal a écrit :
> On Wed, 2010-04-14 at 12:44 -0700, Stephen Hemminger wrote:
> 
> > RPS might also interact with the core turbo boost functionality on Intel chips.
> > Newer chips will make a single core faster if other core can be kept idle.
> 
> how well does it work with Linux? Sounds like all i need to do is turn
> on some BIOS feature. 
> One of the negatives with multiqueue nics is because the core selection
> is static, you could end up overloading one core while others stay idle.
> This seems to steal cycle capacity from the idle cores and gives it to
> the busy cpus. nice. So i see it as a boost to multiqueue.

Only if more than one flow is involved.

And if you have many flows, chance they will spread several queues...




^ permalink raw reply

* Re: [BUG net-next-2.6] fib: Some rcu warning
From: Paul E. McKenney @ 2010-04-14 20:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1271274844.16881.1743.camel@edumazet-laptop>

On Wed, Apr 14, 2010 at 09:54:04PM +0200, Eric Dumazet wrote:
> [   27.756998] IPv4 FIB: Using LC-trie version 0.409
> [   27.757121] 
> [   27.757121] ===================================================
> [   27.757228] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   27.757285] ---------------------------------------------------
> [   27.757342] net/ipv4/fib_trie.c:212 invoked rcu_dereference_check()
> without protection!

Looks like this needs to allow for RTNL.  Please see below for a
patch.  Other places in this file might also need this.

							Thanx, Paul

> [   27.757417] 
> [   27.757417] other info that might help us debug this:
> [   27.757418] 
> [   27.757569] 
> [   27.757570] rcu_scheduler_active = 1, debug_locks = 0
> [   27.757674] 2 locks held by ip/5686:
> [   27.757727]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8134d247>]
> rtnl_lock+0x17/0x20
> [   27.757936]  #1:  ((inetaddr_chain).rwsem){.+.+.+}, at:
> [<ffffffff810654a7>] __blocking_notifier_call_chain+0x47/0x90
> [   27.758148] 
> [   27.758149] stack backtrace:
> [   27.758249] Pid: 5686, comm: ip Not tainted
> 2.6.34-rc3-03164-gb4bf665-dirty #10
> [   27.758323] Call Trace:
> [   27.758377]  [<ffffffff81071b4f>] lockdep_rcu_dereference+0xaf/0xc0
> [   27.758437]  [<ffffffff813ae32c>] fib_find_node+0x19c/0x220
> [   27.758495]  [<ffffffff813b0bdc>] fib_table_insert+0xac/0x760
> [   27.758552]  [<ffffffff813aa121>] ? fib_get_table+0x91/0xd0
> [   27.758609]  [<ffffffff813aa0b8>] ? fib_get_table+0x28/0xd0
> [   27.758667]  [<ffffffff813aa561>] fib_magic+0x111/0x120
> [   27.758723]  [<ffffffff813aa6a0>] fib_add_ifaddr+0x130/0x170
> [   27.758780]  [<ffffffff813aad3c>] fib_inetaddr_event+0x5c/0x290
> [   27.758838]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [   27.758896]  [<ffffffff810654bd>] __blocking_notifier_call_chain
> +0x5d/0x90
> [   27.758956]  [<ffffffff81065506>] blocking_notifier_call_chain
> +0x16/0x20
> [   27.759016]  [<ffffffff813a0c44>] __inet_insert_ifa+0xd4/0x170
> [   27.759089]  [<ffffffff813a0cf2>] inet_insert_ifa+0x12/0x20
> [   27.759146]  [<ffffffff813a2140>] inetdev_event+0x400/0x430
> [   27.759204]  [<ffffffff81362472>] ? netlink_broadcast+0x262/0x3f0
> [   27.759263]  [<ffffffff81351391>] ? fib_rules_event+0x21/0x180
> [   27.759320]  [<ffffffff810650d8>] notifier_call_chain+0x58/0x80
> [   27.759378]  [<ffffffff8106512e>] __raw_notifier_call_chain+0xe/0x10
> [   27.759436]  [<ffffffff81065146>] raw_notifier_call_chain+0x16/0x20
> [   27.761102]  [<ffffffff8133fc1b>] call_netdevice_notifiers+0x1b/0x20
> [   27.761161]  [<ffffffff8133ffad>] __dev_notify_flags+0x7d/0x90
> [   27.761219]  [<ffffffff81340005>] dev_change_flags+0x45/0x70
> [   27.761278]  [<ffffffff813a2e79>] devinet_ioctl+0x5e9/0x770
> [   27.761336]  [<ffffffff813a36b1>] inet_ioctl+0x61/0x80
> [   27.761392]  [<ffffffff8132c532>] sock_do_ioctl+0x32/0x60
> [   27.761449]  [<ffffffff8132d06c>] compat_sock_ioctl+0x89c/0xb60
> [   27.761507]  [<ffffffff81075816>] ? __lock_acquire+0x486/0xaf0
> [   27.761567]  [<ffffffff810cbfeb>] ? __do_fault+0x12b/0x480
> [   27.761624]  [<ffffffff810b4a93>] ? filemap_fault+0xd3/0x3d0
> [   27.761683]  [<ffffffff810f7104>] ? fget_light+0x174/0x360
> [   27.761740]  [<ffffffff810cc1f9>] ? __do_fault+0x339/0x480
> [   27.761798]  [<ffffffff811370e3>] compat_sys_ioctl+0xd3/0x1900
> [   27.761856]  [<ffffffff810ce28a>] ? handle_mm_fault+0x19a/0x780
> [   27.761915]  [<ffffffff81023832>] ? do_page_fault+0xe2/0x3b0
> [   27.761972]  [<ffffffff81064703>] ? up_read+0x23/0x40
> [   27.762028]  [<ffffffff810238f5>] ? do_page_fault+0x1a5/0x3b0
> [   27.762087]  [<ffffffff813f36cb>] ? _raw_spin_unlock+0x2b/0x40
> [   27.762145]  [<ffffffff810f3508>] ? fd_install+0xa8/0xe0
> [   27.762202]  [<ffffffff8132eced>] ? sock_map_fd+0x2d/0x40
> [   27.762259]  [<ffffffff813f3a69>] ? retint_swapgs+0xe/0x13
> [   27.762331]  [<ffffffff8107292d>] ? trace_hardirqs_on_caller
> +0x10d/0x190
> [   27.762391]  [<ffffffff813f2d21>] ? trace_hardirqs_off_thunk
> +0x3a/0x3c
> [   27.762450]  [<ffffffff81028b64>] sysenter_dispatch+0x7/0x30
> [   27.762507]  [<ffffffff813f2ce2>] ? trace_hardirqs_on_thunk+0x3a/0x3f

commit 3f69d3dbdd2468f13591b676ca36b3d61d1a1294
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed Apr 14 13:30:13 2010 -0700

    net: fix RCU lockdep splat in tnode_get_child_rcu()
    
    Located-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index af5d897..45bda1f 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -208,7 +208,8 @@ static inline struct node *tnode_get_child_rcu(struct tnode *tn, unsigned int i)
 {
 	struct node *ret = tnode_get_child(tn, i);
 
-	return rcu_dereference(ret);
+	return rcu_dereference_check(ret,
+				     rcu_read_lock_held() || rtnl_is_locked());
 }
 
 static inline int tnode_child_length(const struct tnode *tn)

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Andi Kleen @ 2010-04-14 20:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hadi, Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <20100414124426.6aee95c3@nehalam>

> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.

In addition to Turbo using less cores can also help to save power.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 20:31 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004141835.57673.arnd@arndb.de>

On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > 
> > > > qemu needs the ability to inject raw packets into device
> > > > from userspace, bypassing vhost/virtio (for live migration).
> > > 
> > > Ok, but since there is only a write callback and no read, it won't
> > > actually be able to do this with the current code, right?
> > 
> > I think it'll work as is, with vhost qemu only ever writes,
> > never reads from device. We'll also never need GSO etc
> > which is a large part of what tap does (and macvtap will
> > have to do).
> 
> Ah, I see. I didn't realize that qemu needs to write to the
> device even if vhost is used. But for the case of migration to
> another machine without vhost, wouldn't qemu also need to read?

Not that I know. Why?

> > > Moreover, it seems weird to have a new type of interface here that
> > > duplicates tap/macvtap with less functionality. Coming back
> > > to your original comment, this means that while mpassthru is currently
> > > not duplicating the actual code from macvtap, it would need to do
> > > exactly that to get the qemu interface right!
> > > 
> > I don't think so, see above. anyway, both can reuse tun.c :)
> 
> There is one significant difference between macvtap/mpassthru and
> tun/tap in that the directions are reversed. While macvtap and
> mpassthru forward data from write into dev_queue_xmit and from
> skb_receive into read, tun/tap forwards data from write into
> skb_receive and from start_xmit into read.
> 
> Also, I'm not really objecting to duplicating code between
> macvtap and mpassthru, as the implementation can always be merged.
> My main objection is instead to having two different _user_interfaces_
> for doing the same thing.
> 
> 	Arnd

They *could* do the same thing :)

-- 
MST

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-14 20:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271276855.16881.1756.camel@edumazet-laptop>

On Wed, 2010-04-14 at 22:27 +0200, Eric Dumazet wrote:

> Only if more than one flow is involved.
> 
> And if you have many flows, chance they will spread several queues...

Over long period of time measurement, true; but even with > 1 flows, it
is possible that one flow is more active/intense than others (rtp vs
some bulk file transfer) or more processor intensive than others(eg
ipsec vs clear text) etc. 
 
BTW: just poking at intel doc on turbo boost and it seems the max a
core can steal from others is 400Mhz; so a core can go from 2.8Ghz
to 3.2Ghz. I am sure theres a lot of interesting dynamics from this ;->
I think i will try turning this thing in my tests since i have an i7.

cheers,
jamal


^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 20:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414203142.GA11321@redhat.com>

On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > > 
> > > > > qemu needs the ability to inject raw packets into device
> > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > > 
> > > > Ok, but since there is only a write callback and no read, it won't
> > > > actually be able to do this with the current code, right?
> > > 
> > > I think it'll work as is, with vhost qemu only ever writes,
> > > never reads from device. We'll also never need GSO etc
> > > which is a large part of what tap does (and macvtap will
> > > have to do).
> > 
> > Ah, I see. I didn't realize that qemu needs to write to the
> > device even if vhost is used. But for the case of migration to
> > another machine without vhost, wouldn't qemu also need to read?
> 
> Not that I know. Why?

Well, if the guest not only wants to send data but also receive
frames coming from other machines, they need to get from the kernel
into qemu, and the only way I can see for doing that is to read
from this device if there is no vhost support around on the new
machine.

Maybe we're talking about different things here.

	Arnd

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-14 20:40 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <201004142239.50299.arnd@arndb.de>

On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> On Wednesday 14 April 2010 22:31:42 Michael S. Tsirkin wrote:
> > On Wed, Apr 14, 2010 at 06:35:57PM +0200, Arnd Bergmann wrote:
> > > On Wednesday 14 April 2010, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > > qemu needs the ability to inject raw packets into device
> > > > > > from userspace, bypassing vhost/virtio (for live migration).
> > > > > 
> > > > > Ok, but since there is only a write callback and no read, it won't
> > > > > actually be able to do this with the current code, right?
> > > > 
> > > > I think it'll work as is, with vhost qemu only ever writes,
> > > > never reads from device. We'll also never need GSO etc
> > > > which is a large part of what tap does (and macvtap will
> > > > have to do).
> > > 
> > > Ah, I see. I didn't realize that qemu needs to write to the
> > > device even if vhost is used. But for the case of migration to
> > > another machine without vhost, wouldn't qemu also need to read?
> > 
> > Not that I know. Why?
> 
> Well, if the guest not only wants to send data but also receive
> frames coming from other machines, they need to get from the kernel
> into qemu, and the only way I can see for doing that is to read
> from this device if there is no vhost support around on the new
> machine.
> 
> Maybe we're talking about different things here.
> 
> 	Arnd

mpassthrough is currently useless without vhost.
If the new machine has no vhost, it can't use mpassthrough :)

-- 
MST

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-14 20:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271276855.16881.1756.camel@edumazet-laptop>

> Only if more than one flow is involved.
>
> And if you have many flows, chance they will spread several queues...
>

But use too many queues and the efficiency of NAPI drops and cost of
device interrupts becomes dominant, so that the overhead from
additional hard interrupts can surpass the overhead of doing RPS and
the IPIs.  I believe we are seeing this is in some of our results
which shows that a combination of multi-queue and RPS can be better
than just multi-queue (see rps changelog).  Again, I'm not claiming
that is generally true, but there are a lot of factors to consider.

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-14 20:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, davem, jdike
In-Reply-To: <20100414204003.GC11321@redhat.com>

On Wednesday 14 April 2010 22:40:03 Michael S. Tsirkin wrote:
> On Wed, Apr 14, 2010 at 10:39:49PM +0200, Arnd Bergmann wrote:
> > Well, if the guest not only wants to send data but also receive
> > frames coming from other machines, they need to get from the kernel
> > into qemu, and the only way I can see for doing that is to read
> > from this device if there is no vhost support around on the new
> > machine.
> > 
> > Maybe we're talking about different things here.
>
> mpassthrough is currently useless without vhost.
> If the new machine has no vhost, it can't use mpassthrough :)

Ok. Is that a planned feature though? vhost is currently limited
to guests with a virtio-net driver and even if you extend it to other
guest emulations, it will probably always be a subset of the qemu
supported drivers, but it may be useful to support zero-copy on other
drivers as well.

	Arnd

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-14 20:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <z2y65634d661004141345o9bdcf759sf266866931823baf@mail.gmail.com>

Le mercredi 14 avril 2010 à 13:45 -0700, Tom Herbert a écrit :
> > Only if more than one flow is involved.
> >
> > And if you have many flows, chance they will spread several queues...
> >
> 
> But use too many queues and the efficiency of NAPI drops and cost of
> device interrupts becomes dominant, so that the overhead from
> additional hard interrupts can surpass the overhead of doing RPS and
> the IPIs.  I believe we are seeing this is in some of our results
> which shows that a combination of multi-queue and RPS can be better
> than just multi-queue (see rps changelog).  Again, I'm not claiming
> that is generally true, but there are a lot of factors to consider.
> --

RPS can be tuned (Changli wants a finer tuning...), it would be
intereting to tune multiqueue devices too. I dont know if its possible
right now.

On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
10Gigabit has 16 queues. It might be good to use less queues according
to your results on some workloads, and eventually use RPS on a second
layering.





^ permalink raw reply

* Re: HTB - What's the minimal value for 'rate' parameter?
From: Jarek Poplawski @ 2010-04-14 21:45 UTC (permalink / raw)
  To: Antonio Almeida; +Cc: netdev, kaber, davem, devik
In-Reply-To: <p2q298f5c051004140322hc54f5315taa5c7d85cb74f06@mail.gmail.com>

Antonio Almeida wrote, On 04/14/2010 12:22 PM:

> What do you mean with "1:2 has grandchildren with overflown rate tables"?
> I couldn't understand your idea. Is there any mistake in the
> configuration I sent?
> How would you set rates for this particular example?


class htb 1:1 root rate 1000Mbit ceil 1000Mbit
class htb 1:2 parent 1:1 rate 4096Kbit ceil 4096Kbit
class htb 1:10 parent 1:2 rate 1024Kbit ceil 4096Kbit
class htb 1:11 parent 1:2 rate 1024Kbit ceil 4096Kbit
class htb 1:101 parent 1:10 prio 3 rate 8bit ceil 4096Kbit
class htb 1:111 parent 1:11 prio 3 rate 8bit ceil 4096Kbit

Classes 1:101 and 1:111 have too low rates, which causes wrong (overflowed!)
values in their rate tables, so their rates could be practically
uncontrollable. They are limited by their ceils instead, so something like:

class htb 1:101 parent 1:10 leaf 101: prio 3 rate 4096Kbit ceil 4096Kbit
class htb 1:111 parent 1:11 leaf 111: prio 3 rate 4096Kbit ceil 4096Kbit

But then their guaranteed rates are higher than their parents, and the
sum is higher than grandparent's rate, which means the config is wrong.
(You have to control these sums - HTB doesn't.)

As I wrote before, the minimal (overflow safe) rate depends on max
packet size, and for 1500 byte it would be something around:
1500b/2min, so if your clients can wait so long, try this:

class htb 1:101 parent 1:10 leaf 101: prio 3 rate 100bit ceil 4096Kbit
class htb 1:111 parent 1:11 leaf 111: prio 3 rate 100bit ceil 4096Kbit

Regards,
Jarek P.

^ permalink raw reply

* [net-next-2.6 PATCH] ixgbe: fix bug with vlan strip in promsic mode
From: Jeff Kirsher @ 2010-04-14 22:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Jesse Brandeburg, Jeff Kirsher

From: Jesse Brandeburg <jesse.brandeburg@intel.com>

The ixgbe driver was setting up 82598 hardware correctly, so that
when promiscuous mode was enabled hardware stripping was turned
off.  But on 82599 the logic to disable/enable hardware stripping
is different, and the code was not updated correctly when the
hardware vlan stripping was enabled as default.

This change comprises the creation of two new helper functions
and calling them from the right locations to disable and enable
hardware stripping of vlan tags at appropriate times.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |  115 +++++++++++++++++++++++++---------------
 1 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..a98ff0e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2482,12 +2482,74 @@ static void ixgbe_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
 	hw->mac.ops.set_vfta(&adapter->hw, vid, pool_ndx, false);
 }
 
+/**
+ * ixgbe_vlan_filter_disable - helper to disable hw vlan filtering
+ * @adapter: driver data
+ */
+static void ixgbe_vlan_filter_disable(struct ixgbe_adapter *adapter)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+	int i, j;
+
+	switch (hw->mac.type) {
+	case ixgbe_mac_82598EB:
+		vlnctrl &= ~(IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE);
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		break;
+	case ixgbe_mac_82599EB:
+		vlnctrl &= ~IXGBE_VLNCTRL_VFE;
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		for (i = 0; i < adapter->num_rx_queues; i++) {
+			j = adapter->rx_ring[i]->reg_idx;
+			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
+			vlnctrl &= ~IXGBE_RXDCTL_VME;
+			IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+/**
+ * ixgbe_vlan_filter_enable - helper to enable hw vlan filtering
+ * @adapter: driver data
+ */
+static void ixgbe_vlan_filter_enable(struct ixgbe_adapter *adapter)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+	int i, j;
+
+	switch (hw->mac.type) {
+	case ixgbe_mac_82598EB:
+		vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		break;
+	case ixgbe_mac_82599EB:
+		vlnctrl |= IXGBE_VLNCTRL_VFE;
+		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+		for (i = 0; i < adapter->num_rx_queues; i++) {
+			j = adapter->rx_ring[i]->reg_idx;
+			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
+			vlnctrl |= IXGBE_RXDCTL_VME;
+			IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
 static void ixgbe_vlan_rx_register(struct net_device *netdev,
                                    struct vlan_group *grp)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
-	u32 ctrl;
-	int i, j;
 
 	if (!test_bit(__IXGBE_DOWN, &adapter->state))
 		ixgbe_irq_disable(adapter);
@@ -2498,25 +2560,7 @@ static void ixgbe_vlan_rx_register(struct net_device *netdev,
 	 * still receive traffic from a DCB-enabled host even if we're
 	 * not in DCB mode.
 	 */
-	ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_VLNCTRL);
-
-	/* Disable CFI check */
-	ctrl &= ~IXGBE_VLNCTRL_CFIEN;
-
-	/* enable VLAN tag stripping */
-	if (adapter->hw.mac.type == ixgbe_mac_82598EB) {
-		ctrl |= IXGBE_VLNCTRL_VME;
-	} else if (adapter->hw.mac.type == ixgbe_mac_82599EB) {
-		for (i = 0; i < adapter->num_rx_queues; i++) {
-			u32 ctrl;
-			j = adapter->rx_ring[i]->reg_idx;
-			ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_RXDCTL(j));
-			ctrl |= IXGBE_RXDCTL_VME;
-			IXGBE_WRITE_REG(&adapter->hw, IXGBE_RXDCTL(j), ctrl);
-		}
-	}
-
-	IXGBE_WRITE_REG(&adapter->hw, IXGBE_VLNCTRL, ctrl);
+	ixgbe_vlan_filter_enable(adapter);
 
 	ixgbe_vlan_rx_add_vid(netdev, 0);
 
@@ -2551,17 +2595,17 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_hw *hw = &adapter->hw;
-	u32 fctrl, vlnctrl;
+	u32 fctrl;
 
 	/* Check for Promiscuous and All Multicast modes */
 
 	fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
-	vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
 
 	if (netdev->flags & IFF_PROMISC) {
 		hw->addr_ctrl.user_set_promisc = 1;
 		fctrl |= (IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
-		vlnctrl &= ~IXGBE_VLNCTRL_VFE;
+		/* don't hardware filter vlans in promisc mode */
+		ixgbe_vlan_filter_disable(adapter);
 	} else {
 		if (netdev->flags & IFF_ALLMULTI) {
 			fctrl |= IXGBE_FCTRL_MPE;
@@ -2569,12 +2613,11 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
 		} else {
 			fctrl &= ~(IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
 		}
-		vlnctrl |= IXGBE_VLNCTRL_VFE;
+		ixgbe_vlan_filter_enable(adapter);
 		hw->addr_ctrl.user_set_promisc = 0;
 	}
 
 	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
-	IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
 
 	/* reprogram secondary unicast list */
 	hw->mac.ops.update_uc_addr_list(hw, netdev);
@@ -2641,7 +2684,7 @@ static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
 static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
-	u32 txdctl, vlnctrl;
+	u32 txdctl;
 	int i, j;
 
 	ixgbe_dcb_check_config(&adapter->dcb_cfg);
@@ -2659,22 +2702,8 @@ static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
 		IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j), txdctl);
 	}
 	/* Enable VLAN tag insert/strip */
-	vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
-	if (hw->mac.type == ixgbe_mac_82598EB) {
-		vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
-		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
-		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
-	} else if (hw->mac.type == ixgbe_mac_82599EB) {
-		vlnctrl |= IXGBE_VLNCTRL_VFE;
-		vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
-		IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
-		for (i = 0; i < adapter->num_rx_queues; i++) {
-			j = adapter->rx_ring[i]->reg_idx;
-			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
-			vlnctrl |= IXGBE_RXDCTL_VME;
-			IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(j), vlnctrl);
-		}
-	}
+	ixgbe_vlan_filter_enable(adapter);
+
 	hw->mac.ops.set_vfta(&adapter->hw, 0, 0, true);
 }
 


^ permalink raw reply related

* Re: hdlc_ppp: why no detach()?
From: Krzysztof Halasa @ 2010-04-14 22:48 UTC (permalink / raw)
  To: Michael Barkowski
  Cc: David S. Miller, Julia Lawall, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4BC32B00.1030600@ruggedcom.com>

Hello Michael,

Michael Barkowski <michaelbarkowski@ruggedcom.com> writes:

> I am looking at your hdlc_ppp code and I don't understand: why is there
> not the equivalent of fr_detach() in there?

I assume you mean .detach = fr_destroy(). It's used only to kill
subdevices, i.e. it has nothing to do with the interface being up/down.

> pc8300_drv:cpc_remove_one() frees netdevs quite confidently but I wonder
> how it can be so sure that there are not skbs in hdlc_ppp's tx_queue
> associated with those devices before freeing them....q

Theoretically all paths adding skbs to the tx_queue should send them out
before returning (possibly also on behalf of other devices). However I
wonder if it's the case. Let's see: Only ppp_tx_cp() adds to the queue
directly:
- ppp_rx() (calls ppp_tx_flush())
- ppp_timer (calls ppp_tx_flush())
- ppp_cp_event():
  - ppp_cp_parse_cr() (calls ppp_tx_flush())
  - ppp_stop() calls ppp_cp_event(), but it won't queue any skb, it only
    marks the connection as closed and does the same to IPCP and IPV6CP.

This means the problematic part is ppp_start() which calls
ppp_cp_event(LCP, START) = IRC | SCR | 3 meaning
Initialize-Restart-Count, Send-Configure-Request and change state to
REQ_SENT. This causes two problems:

1. The SCR packet will be delayed by 2 seconds (both first and second
   SCR will be sent the same time). Perhaps we delay only a little
   (instead of full 2 seconds) and only then send the initial packet.

2. (as you noted) the skb will be added to tx_queue and left there. If
   we happen to "ifconfig up" and "rmmod driver" before receiving any
   packet and before ppp->req_timeout (2 seconds) and before any other
   PPP interface does the same, we will eventually get skb with invalid
   ->dev. This is simple to drain in .close (detach is a wrong place
   since it may be called long after the interface is deactivated, there
   is no need to delay it past .close). The fix for #1 will already fix
   #2, but the redundant safety doesn't cost us anything.

Thanks for noting the problem, I'll post a patch shortly.

Also it seems the timeouts etc. should be configurable. ATM we're only
fixing bugs, good.
-- 
Krzysztof Halasa

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox