Netdev List
* [vmxnet3] possible irq lock inversion dependency detected
From: Jongman Heo @ 2011-01-21  9:44 UTC (permalink / raw)
  To: netdev


I'm using Fedora 14 on VMware.

With the latest Linus git tree (2b1caf6ed7b888), the following warnings are printed.

Is this a known issue? I don't know whether this is a regression or not.
This is my first time using the vmxnet3 driver.

===============================================================
[   17.593243] NET: Registered protocol family 10
[   17.640420] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   18.418134] auditd (733): /proc/733/oom_adj is deprecated, please 
use /proc/733/oom_score_adj instead.
[   24.074627] eth0: intr type 3, mode 0, 5 vectors allocated
[   24.075450] eth0: NIC Link is Up 10000 Mbps
[   24.081505] 
[   24.081507] =========================================================
[   24.081693] [ INFO: possible irq lock inversion dependency detected ]
[   24.081797] 2.6.38-rc1+ #85
[   24.081914] ---------------------------------------------------------
[   24.082061] dbus-daemon/847 just changed the state of lock:
[   24.082200]  (&(&mc->mca_lock)->rlock){+.-...}, at: [<f85a034e>] 
mld_ifc_timer_expire+0x12a/0x1f2 [ipv6]
[   24.082488] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   24.082690]  (&(&adapter->cmd_lock)->rlock){+.+...}
[   24.082769] 
[   24.082770] and interrupts could create inverse lock ordering between them.
[   24.082772] 
[   24.083196] 
[   24.083197] other info that might help us debug this:
[   24.083415] 3 locks held by dbus-daemon/847:
[   24.083538]  #0:  (&mm->mmap_sem){++++++}, at: [<c07d49ea>] 
do_page_fault+0x140/0x33b
[   24.083799]  #1:  (&idev->mc_ifc_timer){+.-...}, at: [<c04459c7>] 
run_timer_softirq+0x11c/0x268
[   24.084081]  #2:  (&ndev->lock){++.-..}, at: [<f85a023f>] 
mld_ifc_timer_expire+0x1b/0x1f2 [ipv6]
[   24.084364] 
[   24.084365] the shortest dependencies between 2nd lock and 1st lock:
[   24.084659]   -> (&(&adapter->cmd_lock)->rlock){+.+...} ops: 28 {
[   24.084826]      HARDIRQ-ON-W at:
[   24.084987]                                            [<c0461e11>] 
__lock_acquire+0x2d9/0xbf2
[   24.085302]                                            [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.085507]                                            [<c07d1835>] 
_raw_spin_lock+0x33/0x40
[   24.085708]                                            [<f855e2bf>] 
vmxnet3_alloc_intr_resources+0x18/0x1c1 [vmxnet3]
[   24.085964]                                            [<f8562b23>] 
vmxnet3_probe_device+0x503/0x712 [vmxnet3]
[   24.086180]                                            [<c05f645a>] 
local_pci_probe+0x2f/0x5a
[   24.086382]                                            [<c05f68ed>] 
pci_device_probe+0x48/0x6b
[   24.086582]                                            [<c067f87a>] 
driver_probe_device+0x115/0x1ec
[   24.086788]                                            [<c067f990>] 
__driver_attach+0x3f/0x5b
[   24.087014]                                            [<c067eb28>] 
bus_for_each_dev+0x3d/0x60
[   24.087214]                                            [<c067f50e>] 
driver_attach+0x19/0x1b
[   24.087411]                                            [<c067f1a4>] 
bus_add_driver+0xbd/0x215
[   24.087611]                                            [<c067fb61>] 
driver_register+0x7f/0xde
[   24.087811]                                            [<c05f6adb>] 
__pci_register_driver+0x4c/0xa9
[   24.088046]                                            [<f8568036>] 
0xf8568036
[   24.088238]                                            [<c0401268>] 
do_one_initcall+0x87/0x143
[   24.088439]                                            [<c046b0a6>] 
sys_init_module+0x130d/0x14aa
[   24.088643]                                            [<c040319f>] 
sysenter_do_call+0x12/0x38
[   24.088844]      SOFTIRQ-ON-W at:
[   24.115469]                                            [<c0461e30>] 
__lock_acquire+0x2f8/0xbf2
[   24.115483]                                            [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.115486]                                            [<c07d1835>] 
_raw_spin_lock+0x33/0x40
[   24.115493]                                            [<f855e2bf>] 
vmxnet3_alloc_intr_resources+0x18/0x1c1 [vmxnet3]
[   24.115508]                                            [<f8562b23>] 
vmxnet3_probe_device+0x503/0x712 [vmxnet3]
[   24.115513]                                            [<c05f645a>] 
local_pci_probe+0x2f/0x5a
[   24.115519]                                            [<c05f68ed>] 
pci_device_probe+0x48/0x6b
[   24.115523]                                            [<c067f87a>] 
driver_probe_device+0x115/0x1ec
[   24.115529]                                            [<c067f990>] 
__driver_attach+0x3f/0x5b
[   24.115532]                                            [<c067eb28>] 
bus_for_each_dev+0x3d/0x60
[   24.115535]                                            [<c067f50e>] 
driver_attach+0x19/0x1b
[   24.115539]                                            [<c067f1a4>] 
bus_add_driver+0xbd/0x215
[   24.115542]                                            [<c067fb61>] 
driver_register+0x7f/0xde
[   24.115545]                                            [<c05f6adb>] 
__pci_register_driver+0x4c/0xa9
[   24.115555]                                            [<f8568036>] 
0xf8568036
[   24.115562]                                            [<c0401268>] 
do_one_initcall+0x87/0x143
[   24.115567]                                            [<c046b0a6>] 
sys_init_module+0x130d/0x14aa
[   24.115590]                                            [<c040319f>] 
sysenter_do_call+0x12/0x38
[   24.115596]      INITIAL USE at:
[   24.115598]                                           [<c0461e85>] 
__lock_acquire+0x34d/0xbf2
[   24.115602]                                           [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.115606]                                           [<c07d1835>] 
_raw_spin_lock+0x33/0x40
[   24.115609]                                           [<f855e2bf>] 
vmxnet3_alloc_intr_resources+0x18/0x1c1 [vmxnet3]
[   24.115614]                                           [<f8562b23>] 
vmxnet3_probe_device+0x503/0x712 [vmxnet3]
[   24.115619]                                           [<c05f645a>] 
local_pci_probe+0x2f/0x5a
[   24.115622]                                           [<c05f68ed>] 
pci_device_probe+0x48/0x6b
[   24.115626]                                           [<c067f87a>] 
driver_probe_device+0x115/0x1ec
[   24.115629]                                           [<c067f990>] 
__driver_attach+0x3f/0x5b
[   24.115633]                                           [<c067eb28>] 
bus_for_each_dev+0x3d/0x60
[   24.115636]                                           [<c067f50e>] 
driver_attach+0x19/0x1b
[   24.115639]                                           [<c067f1a4>] 
bus_add_driver+0xbd/0x215
[   24.115642]                                           [<c067fb61>] 
driver_register+0x7f/0xde
[   24.115645]                                           [<c05f6adb>] 
__pci_register_driver+0x4c/0xa9
[   24.115648]                                           [<f8568036>] 
0xf8568036
[   24.115652]                                           [<c0401268>] 
do_one_initcall+0x87/0x143
[   24.115655]                                           [<c046b0a6>] 
sys_init_module+0x130d/0x14aa
[   24.115659]                                           [<c040319f>] 
sysenter_do_call+0x12/0x38
[   24.115662]    }
[   24.115663]    ... key      at: [<f8564580>] __key.40447+0x0/0xffffe7b2 
[vmxnet3]
[   24.115668]    ... acquired at:
[   24.115670]    [<c0462b5f>] lock_acquire+0xb7/0xd7
[   24.115673]    [<c07d1920>] _raw_spin_lock_irqsave+0x40/0x50
[   24.115676]    [<f855f494>] vmxnet3_set_mc+0x11a/0x165 [vmxnet3]
[   24.115684]    [<c0751f4d>] __dev_set_rx_mode+0x76/0x7a
[   24.115689]    [<c0751f6c>] dev_set_rx_mode+0x1b/0x26
[   24.115692]    [<c0752014>] __dev_open+0x9d/0xaf
[   24.115694]    [<c07521e6>] __dev_change_flags+0x98/0x10d
[   24.115697]    [<c07522c1>] dev_change_flags+0x13/0x3f
[   24.115699]    [<c075ae71>] do_setlink+0x245/0x56b
[   24.115703]    [<c075b6a6>] rtnl_setlink+0xaa/0xc6
[   24.115706]    [<c075b90f>] rtnetlink_rcv_msg+0x1a0/0x1af
[   24.115709]    [<c07694fd>] netlink_rcv_skb+0x32/0x73
[   24.115712]    [<c075b3bc>] rtnetlink_rcv+0x1b/0x22
[   24.115714]    [<c0769098>] netlink_unicast+0xc4/0x120
[   24.115716]    [<c076934e>] netlink_sendmsg+0x25a/0x271
[   24.115719]    [<c074135e>] __sock_sendmsg+0x54/0x5b
[   24.115723]    [<c07419dd>] sock_sendmsg+0x95/0xac
[   24.115726]    [<c0742fbc>] sys_sendmsg+0x181/0x1e8
[   24.115729]    [<c074348c>] sys_socketcall+0x22c/0x287
[   24.115732]    [<c040319f>] sysenter_do_call+0x12/0x38
[   24.115735] 
[   24.115736]  -> (_xmit_ETHER){+.....} ops: 6 {
[   24.115741]     HARDIRQ-ON-W at:
[   24.115742]                                          [<c0461e11>] 
__lock_acquire+0x2d9/0xbf2
[   24.115746]                                          [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.115750]                                          [<c07d1a20>] 
_raw_spin_lock_bh+0x38/0x45
[   24.115753]                                          [<c07553cd>] 
__dev_mc_add+0x23/0x61
[   24.115761]                                          [<c0755424>] 
dev_mc_add+0xa/0xc
[   24.115764]                                          [<f85a0bb9>] 
igmp6_group_added+0x56/0x139 [ipv6]
[   24.115784]                                          [<f85a114f>] 
ipv6_dev_mc_inc+0x1fb/0x20c [ipv6]
[   24.115799]                                          [<f858d0f9>] 
ipv6_add_dev+0x26d/0x28b [ipv6]
[   24.115834]                                          [<f8590007>] 
addrconf_notify+0x57/0x52c [ipv6]
[   24.115848]                                          [<c074ec2a>] 
register_netdevice_notifier+0x54/0x14e
[   24.115852]                                          [<f866b324>] 0xf866b324
[   24.115856]                                          [<f866b18a>] 0xf866b18a
[   24.115859]                                          [<c0401268>] 
do_one_initcall+0x87/0x143
[   24.115862]                                          [<c046b0a6>] 
sys_init_module+0x130d/0x14aa
[   24.115867]                                          [<c040319f>] 
sysenter_do_call+0x12/0x38
[   24.115870]     INITIAL USE at:
[   24.115872]                                         [<c0461e85>] 
__lock_acquire+0x34d/0xbf2
[   24.115876]                                         [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.115880]                                         [<c07d1a20>] 
_raw_spin_lock_bh+0x38/0x45
[   24.115884]                                         [<c07553cd>] 
__dev_mc_add+0x23/0x61
[   24.115887]                                         [<c0755424>] 
dev_mc_add+0xa/0xc
[   24.115891]                                         [<f85a0bb9>] 
igmp6_group_added+0x56/0x139 [ipv6]
[   24.115911]                                         [<f85a114f>] 
ipv6_dev_mc_inc+0x1fb/0x20c [ipv6]
[   24.115926]                                         [<f858d0f9>] 
ipv6_add_dev+0x26d/0x28b [ipv6]
[   24.115939]                                         [<f8590007>] 
addrconf_notify+0x57/0x52c [ipv6]
[   24.115951]                                         [<c074ec2a>] 
register_netdevice_notifier+0x54/0x14e
[   24.115954]                                         [<f866b324>] 0xf866b324
[   24.115957]                                         [<f866b18a>] 0xf866b18a
[   24.115960]                                         [<c0401268>] 
do_one_initcall+0x87/0x143
[   24.115963]                                         [<c046b0a6>] 
sys_init_module+0x130d/0x14aa
[   24.115966]                                         [<c040319f>] 
sysenter_do_call+0x12/0x38
[   24.115970]   }
[   24.115971]   ... key      at: [<c10308b8>] netdev_addr_lock_key+0x8/0x1d0
[   24.115985]   ... acquired at:
[   24.115986]    [<c0462b5f>] lock_acquire+0xb7/0xd7
[   24.115990]    [<c07d1a20>] _raw_spin_lock_bh+0x38/0x45
[   24.115992]    [<c07553cd>] __dev_mc_add+0x23/0x61
[   24.115995]    [<c0755424>] dev_mc_add+0xa/0xc
[   24.115997]    [<f85a0bb9>] igmp6_group_added+0x56/0x139 [ipv6]
[   24.116013]    [<f85a114f>] ipv6_dev_mc_inc+0x1fb/0x20c [ipv6]
[   24.116027]    [<f858d0f9>] ipv6_add_dev+0x26d/0x28b [ipv6]
[   24.116039]    [<f8590007>] addrconf_notify+0x57/0x52c [ipv6]
[   24.116051]    [<c074ec2a>] register_netdevice_notifier+0x54/0x14e
[   24.116054]    [<f866b324>] 0xf866b324
[   24.116056]    [<f866b18a>] 0xf866b18a
[   24.116058]    [<c0401268>] do_one_initcall+0x87/0x143
[   24.116061]    [<c046b0a6>] sys_init_module+0x130d/0x14aa
[   24.116064]    [<c040319f>] sysenter_do_call+0x12/0x38
[   24.116067] 
[   24.116068] -> (&(&mc->mca_lock)->rlock){+.-...} ops: 6 {
[   24.116071]    HARDIRQ-ON-W at:
[   24.116073]                                        [<c0461e11>] 
__lock_acquire+0x2d9/0xbf2
[   24.116077]                                        [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.116080]                                        [<c07d1a20>] 
_raw_spin_lock_bh+0x38/0x45
[   24.116083]                                        [<f85a0b8b>] 
igmp6_group_added+0x28/0x139 [ipv6]
[   24.116102]                                        [<f85a114f>] 
ipv6_dev_mc_inc+0x1fb/0x20c [ipv6]
[   24.116118]                                        [<f858d0f9>] 
ipv6_add_dev+0x26d/0x28b [ipv6]
[   24.116130]                                        [<f866b2f0>] 0xf866b2f0
[   24.116133]                                        [<f866b18a>] 0xf866b18a
[   24.116136]                                        [<c0401268>] 
do_one_initcall+0x87/0x143
[   24.116139]                                        [<c046b0a6>] 
sys_init_module+0x130d/0x14aa
[   24.116143]                                        [<c040319f>] 
sysenter_do_call+0x12/0x38
[   24.116146]    IN-SOFTIRQ-W at:
[   24.116148]                                        [<c0461dbc>] 
__lock_acquire+0x284/0xbf2
[   24.116151]                                        [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.116154]                                        [<c07d1a20>] 
_raw_spin_lock_bh+0x38/0x45
[   24.116158]                                        [<f85a034e>] 
mld_ifc_timer_expire+0x12a/0x1f2 [ipv6]
[   24.116173]                                        [<c0445a4a>] 
run_timer_softirq+0x19f/0x268
[   24.116180]                                        [<c043fd5b>] 
__do_softirq+0xa9/0x16a
[   24.116183]    INITIAL USE at:
[   24.116185]                                       [<c0461e85>] 
__lock_acquire+0x34d/0xbf2
[   24.116188]                                       [<c0462b5f>] 
lock_acquire+0xb7/0xd7
[   24.116191]                                       [<c07d1a20>] 
_raw_spin_lock_bh+0x38/0x45
[   24.116195]                                       [<f85a0b8b>] 
igmp6_group_added+0x28/0x139 [ipv6]
[   24.116210]                                       [<f85a114f>] 
ipv6_dev_mc_inc+0x1fb/0x20c [ipv6]
[   24.116226]                                       [<f858d0f9>] 
ipv6_add_dev+0x26d/0x28b [ipv6]
[   24.116238]                                       [<f866b2f0>] 0xf866b2f0
[   24.116241]                                       [<f866b18a>] 0xf866b18a
[   24.116244]                                       [<c0401268>] 
do_one_initcall+0x87/0x143
[   24.116247]                                       [<c046b0a6>] 
sys_init_module+0x130d/0x14aa
[   24.116251]                                       [<c040319f>] 
sysenter_do_call+0x12/0x38
[   24.116254]  }
[   24.116255]  ... key      at: [<f85b382c>] __key.38329+0x0/0xffff9cd8 [ipv6]
[   24.116266]  ... acquired at:
[   24.116268]    [<c046135b>] check_usage_forwards+0x6f/0x77
[   24.116271]    [<c0461a70>] mark_lock+0xf3/0x1bb
[   24.116273]    [<c0461dbc>] __lock_acquire+0x284/0xbf2
[   24.116276]    [<c0462b5f>] lock_acquire+0xb7/0xd7
[   24.116279]    [<c07d1a20>] _raw_spin_lock_bh+0x38/0x45
[   24.116282]    [<f85a034e>] mld_ifc_timer_expire+0x12a/0x1f2 [ipv6]
[   24.116296]    [<c0445a4a>] run_timer_softirq+0x19f/0x268
[   24.116299]    [<c043fd5b>] __do_softirq+0xa9/0x16a
[   24.116302] 
[   24.116303] 
[   24.116303] stack backtrace:
[   24.116307] Pid: 847, comm: dbus-daemon Not tainted 2.6.38-rc1+ #85
[   24.116309] Call Trace:
[   24.116314]  [<c04612e2>] ? print_irq_inversion_bug+0xfc/0x106
[   24.116317]  [<c046135b>] ? check_usage_forwards+0x6f/0x77
[   24.116320]  [<c0461a70>] ? mark_lock+0xf3/0x1bb
[   24.116323]  [<c04612ec>] ? check_usage_forwards+0x0/0x77
[   24.116327]  [<c0461dbc>] ? __lock_acquire+0x284/0xbf2
[   24.116330]  [<c04607f5>] ? save_trace+0x37/0x93
[   24.116333]  [<c046267c>] ? __lock_acquire+0xb44/0xbf2
[   24.116348]  [<f85a034e>] ? mld_ifc_timer_expire+0x12a/0x1f2 [ipv6]
[   24.116352]  [<c0462b5f>] ? lock_acquire+0xb7/0xd7
[   24.116366]  [<f85a034e>] ? mld_ifc_timer_expire+0x12a/0x1f2 [ipv6]
[   24.116370]  [<c07d1a20>] ? _raw_spin_lock_bh+0x38/0x45
[   24.116385]  [<f85a034e>] ? mld_ifc_timer_expire+0x12a/0x1f2 [ipv6]
[   24.116400]  [<f85a034e>] ? mld_ifc_timer_expire+0x12a/0x1f2 [ipv6]
[   24.116403]  [<c04459c7>] ? run_timer_softirq+0x11c/0x268
[   24.116410]  [<c0445a4a>] ? run_timer_softirq+0x19f/0x268
[   24.116413]  [<c04459c7>] ? run_timer_softirq+0x11c/0x268
[   24.116428]  [<f85a0224>] ? mld_ifc_timer_expire+0x0/0x1f2 [ipv6]
[   24.116432]  [<c043fd5b>] ? __do_softirq+0xa9/0x16a
[   24.116434]  [<c043fcb2>] ? __do_softirq+0x0/0x16a
[   24.116436]  <IRQ>  [<c043fead>] ? irq_exit+0x38/0x6c
[   24.116443]  [<c0419e71>] ? smp_apic_timer_interrupt+0x66/0x73
[   24.116447]  [<c05e6cc0>] ? trace_hardirqs_off_thunk+0xc/0x10
[   24.116451]  [<c07d2522>] ? apic_timer_interrupt+0x36/0x3c
[   24.116456]  [<c04bb9d7>] ? copy_user_highpage.clone.44+0x21/0x34
[   24.116459]  [<c04bc87a>] ? do_wp_page+0x397/0x514
[   24.116462]  [<c07d183c>] ? _raw_spin_lock+0x3a/0x40
[   24.116465]  [<c04be2a8>] ? handle_pte_fault+0x67f/0x6ea
[   24.116468]  [<c04be3bf>] ? handle_mm_fault+0xac/0xb8
[   24.116472]  [<c07d4bcd>] ? do_page_fault+0x323/0x33b
[   24.116475]  [<c0462b77>] ? lock_acquire+0xcf/0xd7
[   24.116478]  [<c07d205d>] ? restore_all_notrace+0x0/0x18
[   24.116481]  [<c04601a3>] ? trace_hardirqs_off_caller+0x2e/0x86
[   24.116484]  [<c07d48aa>] ? do_page_fault+0x0/0x33b
[   24.116487]  [<c07d27a4>] ? error_code+0x6c/0x74
[   24.550299] RPC: Registered udp transport module.
[   24.550405] RPC: Registered tcp transport module.
[   24.550498] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   28.499064] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[   28.725260] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery 
directory
[   28.783996] NFSD: starting 90-second grace period
[   33.488381] Bridge firewalling registered
[   34.443551] ------------[ cut here ]------------
[   34.443561] WARNING: at net/core/dev.c:1351 dev_disable_lro+0x54/0x57()
[   34.443563] Hardware name: VMware Virtual Platform
[   34.443565] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat bridge stp 
llc nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc xt_physdev 
nf_conntrack_tftp nf_conntrack_netbios_ns ip6t_REJECT nf_conntrack_ipv6 
nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 vmhgfs uinput snd_ens1371 
gameport snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm 
snd_timer microcode vmxnet3 vmci snd soundcore snd_page_alloc i2c_piix4 mptspi 
mptscsih mptbase scsi_transport_spi [last unloaded: scsi_wait_scan]
[   34.443605] Pid: 1358, comm: libvirtd Not tainted 2.6.38-rc1+ #85
[   34.443607] Call Trace:
[   34.443615]  [<c043a801>] ? warn_slowpath_common+0x77/0x8c
[   34.443618]  [<c074d8cb>] ? dev_disable_lro+0x54/0x57
[   34.443620]  [<c074d8cb>] ? dev_disable_lro+0x54/0x57
[   34.443623]  [<c043a833>] ? warn_slowpath_null+0x1d/0x1f
[   34.443626]  [<c074d8cb>] ? dev_disable_lro+0x54/0x57
[   34.443630]  [<c079a574>] ? devinet_sysctl_forward+0xd5/0x139
[   34.443633]  [<c079a49f>] ? devinet_sysctl_forward+0x0/0x139
[   34.443638]  [<c051e889>] ? proc_sys_call_handler.clone.0+0x6a/0x89
[   34.443641]  [<c051e8a8>] ? proc_sys_write+0x0/0x22
[   34.443643]  [<c051e8c5>] ? proc_sys_write+0x1d/0x22
[   34.443649]  [<c04da9d8>] ? vfs_write+0x86/0xde
[   34.443651]  [<c04dba58>] ? fget_light+0x5f/0x66
[   34.443654]  [<c04daba6>] ? sys_write+0x3d/0x5e
[   34.443659]  [<c040319f>] ? sysenter_do_call+0x12/0x38
[   34.443662] ---[ end trace 06a697a570356b0c ]---
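
One way to read the lockdep report above: mca_lock is taken in softirq context (IN-SOFTIRQ-W), cmd_lock has been taken with softirqs enabled (SOFTIRQ-ON-W), and the recorded chain mca_lock -> _xmit_ETHER -> cmd_lock means cmd_lock can end up nested inside mca_lock. The sketch below is a user-space toy model of that check, not lockdep's implementation; every type, constant and function name in it is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical usage bits, loosely modelled on the states lockdep prints. */
enum {
    IN_SOFTIRQ = 1u << 0,  /* lock was acquired from softirq context */
    SOFTIRQ_ON = 1u << 1,  /* lock was acquired with softirqs enabled */
};

#define MAX_LOCKS 8

struct lock_graph {
    size_t nlocks;
    const char *name[MAX_LOCKS];
    unsigned usage[MAX_LOCKS];
    /* dep[a][b] is true if lock b was ever taken while lock a was held */
    bool dep[MAX_LOCKS][MAX_LOCKS];
};

/* Depth-first search: index of a SOFTIRQ_ON lock reachable from 'from',
 * or -1 if none is reachable. */
static int find_unsafe(const struct lock_graph *g, size_t from, bool *seen)
{
    seen[from] = true;
    for (size_t to = 0; to < g->nlocks; to++) {
        if (!g->dep[from][to] || seen[to])
            continue;
        if (g->usage[to] & SOFTIRQ_ON)
            return (int)to;
        int hit = find_unsafe(g, to, seen);
        if (hit >= 0)
            return hit;
    }
    return -1;
}

/* Returns the index of a softirq-unsafe lock that can be nested inside the
 * softirq-used lock 'lock', or -1 if no such inversion is possible. */
int softirq_inversion(const struct lock_graph *g, size_t lock)
{
    bool seen[MAX_LOCKS] = { false };

    assert(g->nlocks <= MAX_LOCKS && lock < g->nlocks);
    if (!(g->usage[lock] & IN_SOFTIRQ))
        return -1;
    return find_unsafe(g, lock, seen);
}

/* Builds the three-lock chain taken from the report. */
struct lock_graph report_graph(void)
{
    struct lock_graph g = { .nlocks = 3 };

    g.name[0] = "mca_lock";    g.usage[0] = IN_SOFTIRQ;
    g.name[1] = "_xmit_ETHER"; g.usage[1] = 0;
    g.name[2] = "cmd_lock";    g.usage[2] = SOFTIRQ_ON;
    g.dep[0][1] = true;  /* __dev_mc_add() called with mca_lock held */
    g.dep[1][2] = true;  /* vmxnet3_set_mc() called with the address lock held */
    return g;
}
```

Calling softirq_inversion(&g, 0) on the graph from report_graph() returns index 2 (cmd_lock). In this model, clearing SOFTIRQ_ON on cmd_lock makes the check pass; whether that corresponds to the right fix for the driver is not established in this message.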


* [PATCH] netfilter: ipvs: fix compiler warnings
From: Changli Gao @ 2011-01-21 10:02 UTC (permalink / raw)
  To: Simon Horman
  Cc: Wensong Zhang, Julian Anastasov, Patrick McHardy, David S. Miller,
	netdev, lvs-devel, netfilter-devel, Changli Gao

Fix compiler warnings when no transport protocol load balancing support
is configured.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/netfilter/ipvs/ip_vs_core.c  |    4 +---
 net/netfilter/ipvs/ip_vs_ctl.c   |    4 ++++
 net/netfilter/ipvs/ip_vs_proto.c |    4 ++++
 3 files changed, 9 insertions(+), 3 deletions(-)
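
The pattern used in ip_vs_ctl.c can be shown in isolation: a local variable that is only used inside #ifdef branches is itself declared under the union of those conditions, so a build with none of them enabled has no unused variable to warn about. This is a stand-alone sketch; the DEMO_* macros, the struct and the timeout values are invented stand-ins, not IPVS symbols.

```c
#include <assert.h>

/* Hypothetical feature switches standing in for CONFIG_IP_VS_PROTO_*. */
#define DEMO_PROTO_TCP 1
/* #define DEMO_PROTO_UDP 1 */

struct demo_timeouts { int tcp; int udp; };

/* Arbitrary demo values keyed by IP protocol number (6 = TCP). */
int demo_lookup_timeout(int proto)
{
    return proto == 6 ? 900 : 300;
}

void demo_get_timeouts(struct demo_timeouts *u)
{
#if defined(DEMO_PROTO_TCP) || defined(DEMO_PROTO_UDP)
    int timeout;  /* declared only when at least one branch below uses it */
#endif

    assert(u != 0);
    u->tcp = 0;
    u->udp = 0;
#ifdef DEMO_PROTO_TCP
    timeout = demo_lookup_timeout(6);
    u->tcp = timeout;
#endif
#ifdef DEMO_PROTO_UDP
    timeout = demo_lookup_timeout(17);
    u->udp = timeout;
#endif
}
```

With both DEMO_PROTO_* macros removed, the function still compiles and no declaration of timeout remains.
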
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index f36a84f..d889f4f 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1894,9 +1894,7 @@ static int __net_init __ip_vs_init(struct net *net)
 
 static void __net_exit __ip_vs_cleanup(struct net *net)
 {
-	struct netns_ipvs *ipvs = net_ipvs(net);
-
-	IP_VS_DBG(10, "ipvs netns %d released\n", ipvs->gen);
+	IP_VS_DBG(10, "ipvs netns %d released\n", net_ipvs(net)->gen);
 }
 
 static struct pernet_operations ipvs_core_ops = {
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 09ca2ce..68b8033 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2062,7 +2062,9 @@ static const struct file_operations ip_vs_stats_percpu_fops = {
  */
 static int ip_vs_set_timeout(struct net *net, struct ip_vs_timeout_user *u)
 {
+#if defined(CONFIG_IP_VS_PROTO_TCP) || defined(CONFIG_IP_VS_PROTO_UDP)
 	struct ip_vs_proto_data *pd;
+#endif
 
 	IP_VS_DBG(2, "Setting timeout tcp:%d tcpfin:%d udp:%d\n",
 		  u->tcp_timeout,
@@ -2405,7 +2407,9 @@ __ip_vs_get_dest_entries(struct net *net, const struct ip_vs_get_dests *get,
 static inline void
 __ip_vs_get_timeouts(struct net *net, struct ip_vs_timeout_user *u)
 {
+#if defined(CONFIG_IP_VS_PROTO_TCP) || defined(CONFIG_IP_VS_PROTO_UDP)
 	struct ip_vs_proto_data *pd;
+#endif
 
 #ifdef CONFIG_IP_VS_PROTO_TCP
 	pd = ip_vs_proto_data_get(net, IPPROTO_TCP);
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index 6ac986c..17484a4 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -60,6 +60,9 @@ static int __used __init register_ip_vs_protocol(struct ip_vs_protocol *pp)
 	return 0;
 }
 
+#if defined(CONFIG_IP_VS_PROTO_TCP) || defined(CONFIG_IP_VS_PROTO_UDP) || \
+    defined(CONFIG_IP_VS_PROTO_SCTP) || defined(CONFIG_IP_VS_PROTO_AH) || \
+    defined(CONFIG_IP_VS_PROTO_ESP)
 /*
  *	register an ipvs protocols netns related data
  */
@@ -85,6 +88,7 @@ register_ip_vs_proto_netns(struct net *net, struct ip_vs_protocol *pp)
 
 	return 0;
 }
+#endif
 
 /*
  *	unregister an ipvs protocol

* Re: RFC: pid "ownership" of ip config information
From: Nicolas de Pesloüan @ 2011-01-21 10:17 UTC (permalink / raw)
  To: Patrick Schaaf; +Cc: netdev
In-Reply-To: <1295602091.3582.1.camel@lat1>

On 21/01/2011 10:28, Patrick Schaaf wrote:
> Dear netdev,
>
> I want to solicit comments on a feature enhancement that occurred
> to me recently.
>
> Feature:
>
> - For "ip addr add", "ip route add", "ip rule add", and maybe
>   "ip link add", implement an option 'pid XXXXX' to specify a PID
> - if that PID is not currently existing, fail the operation
> - if, at a later time, that PID dies, automatically remove the
>   configuration, as if a corresponding "ip ... del" would have been given
>
> The feature would be useful in any kind of "IP takeover" scenario.
>
> I'm concretely working on deployment of keepalived (VRRP address
> takeover) and memcachedb (address takeover after berkeley DB master
> selection).
>
> It would also apply to all kinds of routing daemons (zebra, quagga...).
>
> In all these cases, for as long as the process is working normally,
> it can trigger the relevant address withdrawal, but when the process
> dies unexpectedly (oom killer or whatever), addresses are left
> configured, while a partner on another host might take them over,
> resulting in actively duplicate IPs and the application breaking.
>
> The alternative to such a feature, would be to have an additional
> monitoring process, which would watch the PID somehow, and need to
> be configured to know what to withdraw when it dies.
>
> Before I go ahead and try to implement that, I would like to have
> some feedback regarding the idea
>
> - has it been discussed before?
> - would it be accepted by the relevant maintainers?
> - did I overlook alternative solutions to the problem?

Some user-space clustering systems already exist that should provide the same functionality.
Did you have a look at http://www.linux-ha.org/ ?

> best regards
>    Patrick
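
The user-space alternative mentioned in the quoted proposal (a monitoring process that watches the PID and withdraws configuration when it dies) could look roughly like the sketch below. It is an illustration only: pid_is_alive() and watch_pid() are hypothetical helpers, and the cleanup callback is assumed to issue whatever "ip ... del" commands match the original configuration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* True if a process with this PID currently exists. kill() with signal 0
 * sends nothing and performs only the existence and permission checks. */
bool pid_is_alive(pid_t pid)
{
    assert(pid > 0);
    if (kill(pid, 0) == 0)
        return true;
    return errno == EPERM;  /* exists, but we may not signal it */
}

/* Poll until the PID disappears, then run the cleanup callback and return
 * its result. Watching our own PID would never terminate, so reject it. */
int watch_pid(pid_t pid, long poll_ms, int (*on_exit)(void *), void *arg)
{
    struct timespec delay = { poll_ms / 1000, (poll_ms % 1000) * 1000000L };

    assert(pid != getpid() && poll_ms > 0 && on_exit != NULL);
    while (pid_is_alive(pid))
        nanosleep(&delay, NULL);
    return on_exit(arg);
}
```

Polling like this has known weaknesses: an unreaped zombie still counts as existing, and a PID can be reused by an unrelated process between two polls. Those races are part of what an in-kernel "pid" option would avoid.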

* Re: Flow Control and Port Mirroring Revisited
From: Michael S. Tsirkin @ 2011-01-21  9:59 UTC (permalink / raw)
  To: Simon Horman
  Cc: Rick Jones, Jesse Gross, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm
In-Reply-To: <20110120083727.GA1807@verge.net.au>

On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> 
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > >So it won't be all that simple to implement well, and before we try,
> > >I'd like to know whether there are applications that are helped
> > >by it. For example, we could try to measure latency at various
> > >pps and see whether the backpressure helps. netperf has -b, -w
> > >flags which might help these measurements.
> > 
> > Those options are enabled when one adds --enable-burst to the
> > pre-compilation ./configure  of netperf (one doesn't have to
> > recompile netserver).  However, if one is also looking at latency
> > statistics via the -j option in the top-of-trunk, or simply at the
> > histogram with --enable-histogram on the ./configure and a verbosity
> > level of 2 (global -v 2) then one wants the very top of trunk
> > netperf from:
> 
> Hi,
> 
> I have constructed a test where I run an un-paced  UDP_STREAM test in
> one guest and a paced omni rr test in another guest at the same time.

Hmm, what is this supposed to measure?  Basically each time you run an
un-paced UDP_STREAM you get some random load on the network.
You can't tell what it was exactly, only that it was between
the send and receive throughput.

> Briefly, I get the following results from the omni test:
> 
> 1. Omni test only:		MEAN_LATENCY=272.00
> 2. Omni and stream test:	MEAN_LATENCY=3423.00
> 3. cpu and net_cls group:	MEAN_LATENCY=493.00
>    As per 2 plus cgroups are created for each guest
>    and guest tasks added to the groups
> 4. 100Mbit/s class:		MEAN_LATENCY=273.00
>    As per 3 plus the net_cls groups each have a 100Mbit/s HTB class
> 5. cpu.shares=128:		MEAN_LATENCY=652.00
>    As per 4 plus the cpu groups have cpu.shares set to 128
> 6. Busy CPUS:			MEAN_LATENCY=15126.00
>    As per 5 but the CPUs are made busy using a simple shell while loop
> 
> There is a bit of noise in the results as the two netperf invocations
> aren't started at exactly the same moment
> 
> For reference, my netperf invocations are:
> netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
> netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200
> 
> foo contains
> PROTOCOL
> THROUGHPUT,THROUGHPUT_UNITS
> LOCAL_SEND_THROUGHPUT
> LOCAL_RECV_THROUGHPUT
> REMOTE_SEND_THROUGHPUT
> REMOTE_RECV_THROUGHPUT
> RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
> P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
> LOCAL_CPU_UTIL,REMOTE_CPU_UTIL

* [PATCH net-next-2.6] net_sched:  TCQ_F_CAN_BYPASS generalization
From: Eric Dumazet @ 2011-01-21 11:04 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Patrick McHardy, Jesper Dangaard Brouer, Jarek Poplawski,
	jamal
In-Reply-To: <1295537236.2825.286.camel@edumazet-laptop>

Now that the qdisc stab is handled before the TCQ_F_CAN_BYPASS test in
__dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to qdiscs other
than pfifo_fast: pfifo, bfifo, pfifo_head_drop and sfq.

SFQ is special because it can have external classifiers, and in these
cases, we cannot bypass queue discipline (packet could be dropped by
classifier) without admin asking it, or further changes.

It's worth doing this, especially for SFQ, to avoid dirtying memory when
no packets are already waiting in the queue.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
---
I am not sure RED can use bypass too, feel free to comment on this ;)
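
For readers who want the fifo rule without the surrounding kernel context, here is a user-space sketch of the decision the patch adds to fifo_init(), together with an assumed transmit-side test. The demo_* names, the struct layout and the flag value are stand-ins invented for this example; only the two limit comparisons are taken from the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in flag bit with an arbitrary value; the real TCQ_F_CAN_BYPASS
 * lives in the kernel headers and is not reproduced here. */
#define DEMO_F_CAN_BYPASS 0x4u

enum demo_fifo_kind { DEMO_PFIFO, DEMO_BFIFO };

struct demo_qdisc {
    enum demo_fifo_kind kind;
    uint32_t limit;  /* packets for pfifo, bytes for bfifo */
    uint32_t flags;
    uint32_t qlen;   /* packets currently queued */
};

/* Same comparisons as the patch: bypass is allowed only if the queue
 * could hold at least one packet (one MTU worth of bytes for bfifo). */
void demo_fifo_update_bypass(struct demo_qdisc *q, uint32_t mtu)
{
    bool bypass;

    assert(q != 0);
    if (q->kind == DEMO_BFIFO)
        bypass = q->limit >= mtu;
    else
        bypass = q->limit >= 1;

    if (bypass)
        q->flags |= DEMO_F_CAN_BYPASS;
    else
        q->flags &= ~DEMO_F_CAN_BYPASS;
}

/* Assumed transmit-side test: skip enqueue/dequeue only when the flag is
 * set and nothing is already waiting, so packet order is preserved. */
bool demo_may_bypass(const struct demo_qdisc *q)
{
    return (q->flags & DEMO_F_CAN_BYPASS) && q->qlen == 0;
}
```

A bfifo with a 1000-byte limit and a 1500-byte MTU would never accept a full-sized packet, so the sketch clears the flag for it; raising the limit to 3000 bytes sets the flag again.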

 net/sched/sch_fifo.c    |   13 ++++++++++++-
 net/sched/sch_generic.c |    5 ++---
 net/sched/sch_mq.c      |    1 -
 net/sched/sch_mqprio.c  |    1 -
 net/sched/sch_sfq.c     |    6 ++++++
 5 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index b3075f8..f7290d2 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -64,11 +64,13 @@ static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 static int fifo_init(struct Qdisc *sch, struct nlattr *opt)
 {
 	struct fifo_sched_data *q = qdisc_priv(sch);
+	bool bypass;
+	bool is_bfifo = sch->ops == &bfifo_qdisc_ops;
 
 	if (opt == NULL) {
 		u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;
 
-		if (sch->ops == &bfifo_qdisc_ops)
+		if (is_bfifo)
 			limit *= psched_mtu(qdisc_dev(sch));
 
 		q->limit = limit;
@@ -81,6 +83,15 @@ static int fifo_init(struct Qdisc *sch, struct nlattr *opt)
 		q->limit = ctl->limit;
 	}
 
+	if (is_bfifo)
+		bypass = q->limit >= psched_mtu(qdisc_dev(sch));
+	else
+		bypass = q->limit >= 1;
+
+	if (bypass)
+		sch->flags |= TCQ_F_CAN_BYPASS;
+	else
+		sch->flags &= ~TCQ_F_CAN_BYPASS;
 	return 0;
 }
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index cc17e79..0da09d5 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -527,6 +527,8 @@ static int pfifo_fast_init(struct Qdisc *qdisc, struct nlattr *opt)
 	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
 		skb_queue_head_init(band2list(priv, prio));
 
+	/* Can by-pass the queue discipline */
+	qdisc->flags |= TCQ_F_CAN_BYPASS;
 	return 0;
 }
 
@@ -691,9 +693,6 @@ static void attach_one_default_qdisc(struct net_device *dev,
 			netdev_info(dev, "activation failed\n");
 			return;
 		}
-
-		/* Can by-pass the queue discipline for default qdisc */
-		qdisc->flags |= TCQ_F_CAN_BYPASS;
 	}
 	dev_queue->qdisc_sleeping = qdisc;
 }
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index ecc302f..ec5cbc8 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -61,7 +61,6 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 						    TC_H_MIN(ntx + 1)));
 		if (qdisc == NULL)
 			goto err;
-		qdisc->flags |= TCQ_F_CAN_BYPASS;
 		priv->qdiscs[ntx] = qdisc;
 	}
 
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 8620c65..fbc6f53 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -130,7 +130,6 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 			err = -ENOMEM;
 			goto err;
 		}
-		qdisc->flags |= TCQ_F_CAN_BYPASS;
 		priv->qdiscs[i] = qdisc;
 	}
 
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 156ad30..fdba52a 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -560,6 +560,10 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 		slot_queue_init(&q->slots[i]);
 		sfq_link(q, i);
 	}
+	if (q->limit >= 1)
+		sch->flags |= TCQ_F_CAN_BYPASS;
+	else
+		sch->flags &= ~TCQ_F_CAN_BYPASS;
 	return 0;
 }
 
@@ -611,6 +615,8 @@ static unsigned long sfq_get(struct Qdisc *sch, u32 classid)
 static unsigned long sfq_bind(struct Qdisc *sch, unsigned long parent,
 			      u32 classid)
 {
+	/* we cannot bypass queue discipline anymore */
+	sch->flags &= ~TCQ_F_CAN_BYPASS;
 	return 0;
 }
 



^ permalink raw reply related

* RE: Using ethernet device as efficient small packet generator
From: juice @ 2011-01-21 11:44 UTC (permalink / raw)
  To: Loke, Chetan, Jon Zhou, Eric Dumazet, Stephen Hemminger, netdev
In-Reply-To: <D3F292ADF945FB49B35E96C94C2061B90ECC4FAC@nsmail.netscout.com>

>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Jon Zhou
>> Sent: December 23, 2010 3:58 AM
>> To: juice@swagman.org; Eric Dumazet; Stephen Hemminger;
>> netdev@vger.kernel.org
>> Subject: RE: Using ethernet device as efficient small packet generator
>>
>>
>> At another old kernel(2.6.16) with tg3 and bnx2 1G NIC,XEON E5450, I
>> only got 490K pps(it is about 300Mbps,30% GE), I think the reason is
>> multiqueue unsupported in this kernel.
>>
>> I will do a test with 1Gb nic on the new kernel later.
>>
>
>
> I can hit close to 1M pps(first time every time) w/ a 64-byte payload on
> my VirtualMachine(running 2.6.33) via vmxnet3 vNIC -
>
>
> [root@localhost ~]# cat /proc/net/pktgen/eth2
> Params: count 0  min_pkt_size: 60  max_pkt_size: 60
>      frags: 0  delay: 0  clone_skb: 0  ifname: eth2
>      flows: 0 flowlen: 0
>      queue_map_min: 0  queue_map_max: 0
>      dst_min: 192.168.222.2  dst_max:
>         src_min:   src_max:
>      src_mac: 00:50:56:b1:00:19 dst_mac: 00:50:56:c0:00:3e
>      udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
>      src_mac_count: 0  dst_mac_count: 0
>      Flags:
> Current:
>      pkts-sofar: 59241012  errors: 0
>      started: 1898437021us  stopped: 1957709510us idle: 9168us
>      seq_num: 59241013  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>      cur_saddr: 0x0  cur_daddr: 0x2dea8c0
>      cur_udp_dst: 9  cur_udp_src: 9
>      cur_queue_map: 0
>      flows: 0
> Result: OK: 59272488(c59263320+d9168) nsec, 59241012 (60byte,0frags)
>   999468pps 479Mb/sec (479744640bps) errors: 0
>
>
>
> Chetan
>


Hi again.

It has been a while since I was last able to test this,
as there have been some other matters at hand.
However, I have now managed to rerun my tests on several different kernels.

I am now using a PCIe Intel e1000e card, which should be able to handle
the required amount of traffic.

The statistics that I get are as follows:

kernel 2.6.32-27 (ubuntu 10.10 default)
    pktgen:           750064pps 360Mb/sec (360030720bps)
    AX4000 analyser:  Total bitrate:             383.879 MBits/s
                      Bandwidth:                 38.39% GE
                      Average packet interval:   1.33 us

kernel 2.6.37 (latest stable from kernel.org)
    pktgen:           786848pps 377Mb/sec (377687040bps)
    AX4000 analyser:  Total bitrate:             402.904 MBits/s
                      Bandwidth:                 40.29% GE
                      Average packet interval:   1.27 us

kernel 2.6.38-rc1 (latest from kernel.org)
    pktgen:           795297pps 381Mb/sec (381742560bps)
    AX4000 analyser:  Total bitrate:             407.117 MBits/s
                      Bandwidth:                 40.72% GE
                      Average packet interval:   1.26 us


In every case I have set the IRQ affinity of eth1 to CPU0 and started
the test running in kpktgend_0.

The complete data of my measurements follows at the end of this post.

It looks like the small packet sending efficiency of the ethernet driver
is improving all the time, albeit quite slowly.

Now, I would be interested in knowing whether it is indeed possible to
increase the sending rate to near full 1GE capacity with the ethernet
card I am currently using, or do I have a hardware limitation here?

I recall hearing that there are some enhanced versions of the e1000
network card that have been geared towards higher performance
at the expense of some functionality or general system efficiency.
Can anybody point me to how to do that?

As I stated before, quoting myself:

> Which do you suppose is the reason for poor performance on my setup,
> is it lack of multiqueue HW in the GE NIC's I am using or is it lack
> of multiqueue support in the kernel (2.6.32) that I am using?
>
> Is multiqueue really necessary to achieve the full 1GE saturation, or
> is it only needed on 10GE NIC's?
>
> As I understand multiqueue is useful only if there are lots of CPU cores
> to run, each handling one queue.
>
> The application I am thinking of, preloading a packet sequence into the
> kernel from a userland application and then sending from that buffer,
> probably does not benefit much from many cores; it would be enough
> for one CPU to handle the sending while the other core(s) handle
> other tasks.

Yours, Jussi Ohenoja


*** Measurement details follow ***


root@d8labralinux:/var/home/juice# lspci -vvv -s 04:00.0
04:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet
Controller (Copper) (rev 06)
	Subsystem: Intel Corporation Device 1082
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at f3cc0000 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at f3ce0000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at cce0 [size=32]
	Expansion ROM at f3d00000 [disabled] [size=128K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0
Enable-
		Address: 0000000000000000  Data: 0000
	Capabilities: [e0] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <4us, L1
<64us
			ClockPM- Suprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive-
BWMgmt- ABWMgmt-
	Capabilities: [100] Advanced Error Reporting <?>
	Capabilities: [140] Device Serial Number b1-e5-7c-ff-ff-21-1b-00
	Kernel modules: e1000e

root@d8labralinux:/var/home/juice# ethtool eth1
Settings for eth1:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Advertised pause frame use: No
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  Not reported
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: No
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: on
	Supports Wake-on: pumbag
	Wake-on: d
	Current message level: 0x00000001 (1)
	Link detected: yes





2.6.38-rc1
----------

dmesg:

[  195.685655] e1000e: Intel(R) PRO/1000 Network Driver - 1.2.20-k2
[  195.685658] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[  195.685677] e1000e 0000:04:00.0: Disabling ASPM  L1
[  195.685690] e1000e 0000:04:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[  195.685707] e1000e 0000:04:00.0: setting latency timer to 64
[  195.685852] e1000e 0000:04:00.0: irq 69 for MSI/MSI-X
[  195.869917] e1000e 0000:04:00.0: eth1: (PCI Express:2.5GB/s:Width x1) 00:1b:21:7c:e5:b1
[  195.869921] e1000e 0000:04:00.0: eth1: Intel(R) PRO/1000 Network Connection
[  195.870006] e1000e 0000:04:00.0: eth1: MAC: 1, PHY: 4, PBA No: D50861-006
[  196.017285] e1000e 0000:04:00.0: irq 69 for MSI/MSI-X
[  196.073144] e1000e 0000:04:00.0: irq 69 for MSI/MSI-X
[  196.073630] ADDRCONF(NETDEV_UP): eth1: link is not ready
[  198.746000] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[  198.746162] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[  209.564433] eth1: no IPv6 routers present


pktgen:

Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 77203892067us  stopped: 77216465982us idle: 1325us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 12573914(c12572589+d1325) nsec, 10000000 (60byte,0frags)
  795297pps 381Mb/sec (381742560bps) errors: 0


AX4000 analyser:

   Total bitrate:             407.117 MBits/s
   Bandwidth:                 40.72% GE
   Average packet interval:   1.26 us






2.6.37
------


dmesg:

[ 1810.959907] e1000e: Intel(R) PRO/1000 Network Driver - 1.2.7-k2
[ 1810.959909] e1000e: Copyright (c) 1999 - 2010 Intel Corporation.
[ 1810.959928] e1000e 0000:04:00.0: Disabling ASPM  L1
[ 1810.959942] e1000e 0000:04:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 1810.959961] e1000e 0000:04:00.0: setting latency timer to 64
[ 1810.960103] e1000e 0000:04:00.0: irq 66 for MSI/MSI-X
[ 1811.137269] e1000e 0000:04:00.0: eth1: (PCI Express:2.5GB/s:Width x1) 00:1b:21:7c:e5:b1
[ 1811.137272] e1000e 0000:04:00.0: eth1: Intel(R) PRO/1000 Network Connection
[ 1811.137358] e1000e 0000:04:00.0: eth1: MAC: 1, PHY: 4, PBA No: d50861-006
[ 1811.286173] e1000e 0000:04:00.0: irq 66 for MSI/MSI-X
[ 1811.342065] e1000e 0000:04:00.0: irq 66 for MSI/MSI-X
[ 1811.342575] ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 1814.010736] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 1814.010949] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 1824.082148] eth1: no IPv6 routers present


pktgen:

Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 265936151us  stopped: 278645077us idle: 1651us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 12708925(c12707274+d1651) nsec, 10000000 (60byte,0frags)
  786848pps 377Mb/sec (377687040bps) errors: 0


AX4000 analyser:

   Total bitrate:             402.904 MBits/s
   Bandwidth:                 40.29% GE
   Average packet interval:   1.27 us






2.6.32-27
---------


dmesg:

[    2.178800] e1000e: Intel(R) PRO/1000 Network Driver - 1.0.2-k2
[    2.178802] e1000e: Copyright (c) 1999-2008 Intel Corporation.
[    2.178854] e1000e 0000:04:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    2.178887] e1000e 0000:04:00.0: setting latency timer to 64
[    2.179039] e1000e 0000:04:00.0: irq 53 for MSI/MSI-X
[    2.360700] 0000:04:00.0: eth1: (PCI Express:2.5GB/s:Width x1) 00:1b:21:7c:e5:b1
[    2.360702] 0000:04:00.0: eth1: Intel(R) PRO/1000 Network Connection
[    2.360787] 0000:04:00.0: eth1: MAC: 1, PHY: 4, PBA No: d50861-006
[    9.551486] e1000e 0000:04:00.0: irq 53 for MSI/MSI-X
[    9.607309] e1000e 0000:04:00.0: irq 53 for MSI/MSI-X
[    9.607876] ADDRCONF(NETDEV_UP): eth1: link is not ready
[   12.448302] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   12.448544] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   23.068498] eth1: no IPv6 routers present


pktgen:

Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 799760010us  stopped: 813092189us idle: 1314us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 13332178(c13330864+d1314) nsec, 10000000 (60byte,0frags)
  750064pps 360Mb/sec (360030720bps) errors: 0


AX4000 analyser:

   Total bitrate:             383.879 MBits/s
   Bandwidth:                 38.39% GE
   Average packet interval:   1.33 us




root@d8labralinux:/var/home/juice/pkt_test# cat ./pktgen_conf
#!/bin/bash

#modprobe pktgen

function pgset() {
  local result
  echo $1 > $PGDEV
  result=`cat $PGDEV | fgrep "Result: OK:"`
  if [ "$result" = "" ]; then
    cat $PGDEV | fgrep Result:
  fi
}

function pg() {
  echo inject > $PGDEV
  cat $PGDEV
}

# Config Start Here -----------------------------------------------------------

# thread config
# Each CPU has its own thread. Two CPU example. We add eth1, eth2 respectively.
PGDEV=/proc/net/pktgen/kpktgend_0
echo "Removing all devices"
pgset "rem_device_all"
PGDEV=/proc/net/pktgen/kpktgend_1
pgset "rem_device_all"

PGDEV=/proc/net/pktgen/kpktgend_0
echo "Adding eth1"
pgset "add_device eth1"
#echo "Setting max_before_softirq 10000"
#pgset "max_before_softirq 10000"

# device config
# ipg is inter packet gap. 0 means maximum speed.
CLONE_SKB="clone_skb 1"
# NIC adds 4 bytes CRC
PKT_SIZE="pkt_size 60"
# COUNT 0 means forever
#COUNT="count 0"
COUNT="count 10000000"
IPG="delay 0"
PGDEV=/proc/net/pktgen/eth1
echo "Configuring $PGDEV"
pgset "$COUNT"
pgset "$CLONE_SKB"
pgset "$PKT_SIZE"
pgset "$IPG"
pgset "dst 10.10.11.2"
pgset "dst_mac 00:04:23:08:91:dc"
pgset "queue_map_min 0"

# Time to run
PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"

# Result can be viewed in /proc/net/pktgen/eth1





^ permalink raw reply

* Re: [PATCH] Ensure that we unshare skbs prior to calling pskb_may_pull in bonding driver
From: Neil Horman @ 2011-01-21 11:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, andy, fubar
In-Reply-To: <20110120.164723.73670910.davem@davemloft.net>

On Thu, Jan 20, 2011 at 04:47:23PM -0800, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Thu, 20 Jan 2011 14:02:31 -0500
> 
> > Recently reported oops:
> 
> Applied, but please compose reasonable Subject lines with your patches,
> always begin the line with a subsystem tag followed by a colon.
> 
> This way we get
> 
> 	bonding: Foo bar baz
> 
> instead of
> 
> 	Foo bar baz in the bonding driver
> 
> Thanks.
> 
Yeah, my bad, I realized I screwed up the Subject the second I sent the email,
sorry about that.

Regards
Neil


^ permalink raw reply

* RE: Using ethernet device as efficient small packet generator
From: Eric Dumazet @ 2011-01-21 11:51 UTC (permalink / raw)
  To: juice; +Cc: Loke, Chetan, Jon Zhou, Stephen Hemminger, netdev
In-Reply-To: <13dbf221c875a931d408784495884998.squirrel@www.liukuma.net>

Le vendredi 21 janvier 2011 à 13:44 +0200, juice a écrit :

> Hi again.
> 
> It has been a while since last time I got to be able to test this
> again, as there have been some other matters at hand.
> However, now I managed to rerun my tests in several different kernels.
> 
> I am using now a PCIe Intel e1000e card, that should be able to handle
> the needed traffic amount.
> 
> The statistics that I get are as follows:
> 
> kernel 2.6.32-27 (ubuntu 10.10 default)
>     pktgen:           750064pps 360Mb/sec (360030720bps)
>     AX4000 analyser:  Total bitrate:             383.879 MBits/s
>                       Bandwidth:                 38.39% GE
>                       Average packet intereval:  1.33 us
> 
> kernel 2.6.37 (latest stable from kernel.org)
>     pktgen:           786848pps 377Mb/sec (377687040bps)
>     AX4000 analyser:  Total bitrate:             402.904 MBits/s
>                       Bandwidth:                 40.29% GE
>                       Average packet intereval:  1.27 us
> 
> kernel 2.6.38-rc1 (latest from kernel.org)
>     pktgen:           795297pps 381Mb/sec (381742560bps)
>     AX4000 analyser:  Total bitrate:             407.117 MBits/s
>                       Bandwidth:                 40.72% GE
>                       Average packet intereval:  1.26 us
> 
> 

...

> pktgen:
> 
> Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
>      frags: 0  delay: 0  clone_skb: 1  ifname: eth1
>      flows: 0 flowlen: 0
>      queue_map_min: 0  queue_map_max: 0
>      dst_min: 10.10.11.2  dst_max:
>         src_min:   src_max:
>      src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
>      udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
>      src_mac_count: 0  dst_mac_count: 0
>      Flags:
> Current:
>      pkts-sofar: 10000000  errors: 0
>      started: 77203892067us  stopped: 77216465982us idle: 1325us
>      seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
>      cur_saddr: 0x0  cur_daddr: 0x20b0a0a
>      cur_udp_dst: 9  cur_udp_src: 9
>      cur_queue_map: 0
>      flows: 0
> Result: OK: 12573914(c12572589+d1325) nsec, 10000000 (60byte,0frags)
>   795297pps 381Mb/sec (381742560bps) errors: 0
> 
> 
> AX4000 analyser:
> 
>    Total bitrate:             407.117 MBits/s
>    Bandwidth:                 40.72% GE
>    Average packet intereval:  1.26 us
> 
> 

You should try

CLONE_SKB="clone_skb 10"
...
pgset "$CLONE_SKB"


Because I suspect you hit a performance problem on skb
allocation/filling/use/freeing

You can use perf tool to get some performance profile while your pktgen
session is running

# cd tools/perf
# make
...
# ./perf top




^ permalink raw reply

* RE: Using ethernet device as efficient small packet generator
From: juice @ 2011-01-21 12:12 UTC (permalink / raw)
  To: Eric Dumazet, Loke, Chetan, Jon Zhou, Stephen Hemminger, netdev
In-Reply-To: <1295610709.2601.35.camel@edumazet-laptop>

> Le vendredi 21 janvier 2011 à 13:44 +0200, juice a écrit :
>
>
> You should try
>
> CLONE_SKB="clone_skb 10"
> ...
> pgset "$CLONE_SKB"
>
>
> Because I suspect you hit a performance problem on skb
> allocation/filling/use/freeing

Actually, that makes the performance worse:
(Now I tried it with kernel 2.6.37, which is currently running)

root@d8labralinux:/var/home/juice/pkt_test# cat /proc/net/pktgen/eth1
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 10  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:7c:e5:b1 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 2555660074us  stopped: 2569239323us idle: 3484us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 13579248(c13575763+d3484) nsec, 10000000 (60byte,0frags)
  736417pps 353Mb/sec (353480160bps) errors: 0


> You can use perf tool to get some performance profile while your pktgen
> session is running
>
> # cd tools/perf
> # make
> ...
> # ./perf top
>

I can try that.
Where do I get the performance profiler tool?


Yours, Jussi Ohenoja



^ permalink raw reply

* Re: [PATCH v4] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Michał Mirosław @ 2011-01-21 12:26 UTC (permalink / raw)
  To: Po-Yu Chuang
  Cc: netdev, linux-kernel, bhutchings, eric.dumazet, joe, dilinger,
	Po-Yu Chuang
In-Reply-To: <1295596533-1748-1-git-send-email-ratbert.chuang@gmail.com>

2011/1/21 Po-Yu Chuang <ratbert.chuang@gmail.com>:
> From: Po-Yu Chuang <ratbert@faraday-tech.com>
>
> FTMAC100 Ethernet Media Access Controller supports 10/100 Mbps and
> MII.  This driver has been working on some ARM/NDS32 SoC's including
> Faraday A320 and Andes AG101.
>
> Signed-off-by: Po-Yu Chuang <ratbert@faraday-tech.com>
[...]
> +static void ftmac100_txdes_reset(struct ftmac100_txdes *txdes)
> +{
> +       /* clear all except end of ring bit */
> +       txdes->txdes0 = 0;
> +       txdes->txdes1 &= FTMAC100_TXDES1_EDOTR;
> +       txdes->txdes2 = 0;
> +       txdes->txdes3 = 0;
> +}

This also probably needs cpu_to_le32().

[...]
> +static void ftmac100_free_buffers(struct ftmac100 *priv)
> +{
> +       int i;
> +
> +       for (i = 0; i < RX_QUEUE_ENTRIES; i += 2) {
> +               struct ftmac100_rxdes *rxdes = &priv->descs->rxdes[i];
> +               dma_addr_t d = ftmac100_rxdes_get_dma_addr(rxdes);
> +               void *page = ftmac100_rxdes_get_va(rxdes);
> +
> +               if (d)
> +                       dma_unmap_single(priv->dev, d, PAGE_SIZE,
> +                                        DMA_FROM_DEVICE);
> +
> +               if (page != NULL)
> +                       free_page((unsigned long)page);
> +       }
> +
[...]

> +static int ftmac100_alloc_buffers(struct ftmac100 *priv)
> +{
> +       int i;
> +
> +       priv->descs = dma_alloc_coherent(priv->dev,
> +                                        sizeof(struct ftmac100_descs),
> +                                        &priv->descs_dma_addr,
> +                                        GFP_KERNEL | GFP_DMA);
> +       if (priv->descs == NULL)
> +               return -ENOMEM;
> +
> +       memset(priv->descs, 0, sizeof(struct ftmac100_descs));
> +
> +       /* initialize RX ring */
> +
> +       ftmac100_rxdes_set_end_of_ring(&priv->descs->rxdes[RX_QUEUE_ENTRIES - 1]);
> +
> +       for (i = 0; i < RX_QUEUE_ENTRIES; i += 2) {
> +               struct ftmac100_rxdes *rxdes = &priv->descs->rxdes[i];
> +               void *page;
> +               dma_addr_t d;
> +
> +               page = (void *)__get_free_page(GFP_KERNEL | GFP_DMA);
> +               if (page == NULL)
> +                       goto err;
> +
> +               d = dma_map_single(priv->dev, page, PAGE_SIZE, DMA_FROM_DEVICE);
> +               if (unlikely(dma_mapping_error(priv->dev, d))) {
> +                       free_page((unsigned long)page);
> +                       goto err;
> +               }
> +
> +               /*
> +                * The hardware enforces a sub-2K maximum packet size, so we
> +                * put two buffers on every hardware page.
> +                */
> +               ftmac100_rxdes_set_va(rxdes, page);
> +               ftmac100_rxdes_set_va(rxdes + 1, page + PAGE_SIZE / 2);
> +
> +               ftmac100_rxdes_set_dma_addr(rxdes, d);
> +               ftmac100_rxdes_set_dma_addr(rxdes + 1, d + PAGE_SIZE / 2);
> +
> +               ftmac100_rxdes_set_buffer_size(rxdes, RX_BUF_SIZE);
> +               ftmac100_rxdes_set_buffer_size(rxdes + 1, RX_BUF_SIZE);
> +
> +               ftmac100_rxdes_set_dma_own(rxdes);
> +               ftmac100_rxdes_set_dma_own(rxdes + 1);
> +       }
[...]

Did you test this? This looks like it will result in double free after
packet RX, as you are giving the same page (referenced once) to two
distinct RX descriptors, that may be assigned different packets.

Since you're not implementing any RX offloads, you might just allocate
fresh skbs with alloc_skb() and store the skb pointer in rxdes3. Since the
hardware doesn't touch it, you can skip cpu_to_le32()/le32_to_cpu()
there (leave a comment, though).

Unless this needs to work for ISA devices, you should drop GFP_DMA
allocation flag.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: MultiPath TCP in the Linux Kernel
From: Christoph Paasch @ 2011-01-21 13:19 UTC (permalink / raw)
  To: Peter Chacko
  Cc: netdev, bonding-devel, linux-sctp, MS PRASAD, Lal Samuel Varghese,
	Gregory S. Tseytin
In-Reply-To: <AANLkTi=v7jJzfRoeQLGzwU29p2dVqJqd=Eap_8ntfxKP@mail.gmail.com>

Hi Peter,

On Thursday, January 20, 2011 wrote Peter Chacko:
> Is there any document that shows the difference of SCTP multi-streaming and
> MPTCP ? SCTP team has done lot of great workin in load-balanced
> multi-streaming work across multi-hopped hosts. I am curious to learn more
> on your work in the context of this.
MPTCP can be compared to SCTP-CMT. Regular SCTP just uses one path for a 
stream and thus does not increase the throughput.
At http://inl.info.ucl.ac.be/mptcp/ there is a master's thesis in which a
student compared many different multi-path transport solutions. Have a look at
the document "multi-path congestion control" by Arnaud Ongenae on this
website.

> Does MP-TCP make use of CM ? (Congestion Manager, created out of MIT) for
> sharing congestion states across TCP ensembles ?
No, MPTCP uses the coupled congestion control. However, this congestion-
manager might be an interesting extension to mptcp.

> How does it work with a firewall that will allow only a session that is the
> reply of a SYN packet from inside firewall ? (Because other parallel
> streams of TCP doesn't send SYN to the same destination, and how will the
> firewall will allow the replies of this other session go through ? )
Do you mean the case where a server has 2 interfaces and a client (behind the
firewall) establishes an MPTCP session to the server?
The server will tell the client about its available addresses.
Additionally, the server will itself try to establish the
additional subflow. This attempt will get blocked by the firewall.
As the server has sent its set of addresses to the client, the client will also
attempt to establish a new subflow. This attempt will pass the firewall as it
comes from the inside.

> Is there any commerical use  of this work ?
Our implementation is still in alpha-state. There are still missing features, 
and it's not yet 100% stable. It's a major modification of the TCP/IP-stack 
because packets from the subflows need to get pushed up to the 'meta-
connection' and reordered there.

> Will this also support message level TCP packet exchange ? (So that  main
> strengths of SCTP are "in" to the TCP stack ).
What do you mean by "message level TCP packet exchange"?

Best regards,
Christoph


> 2011/1/20 Christoph Paasch <christoph.paasch@uclouvain.be>
> 
> > Hi,
> > 
> > MultiPath TCP is not a port of SCTP. It is based on regular TCP and
> > presents a
> > regular socket-api to the application. Thus applications do not have to
> > be modified.
> > 
> > MPTCP opens several TCP-subflows across it's different IP-addresses, and
> > lets
> > the data go over these different TCP sessions. To synchronize the data-
> > transfer MPTCP uses TCP-options. Thus, on the wire it looks like regular
> > TCP,
> > with the only difference being that there are additional TCP-options.
> > 
> > MPTCP increases the throughput, because it uses the TCP-subflows
> > simultaneously. With our implementation we got 2Gbps throughput for a
> > single
> > iperf-session on a machine having two 1Gb-interfaces (using
> > jumbo-frames), whereas regular TCP could only go up to 1Gbps, as it only
> > uses one interface.
> > 
> > To maintain bottleneck-fairness the Coupled Congestion Control controls
> > the congestion window of the individual subflows (included in the
> > implementation
> > since the latest release).
> > http://datatracker.ietf.org/doc/draft-ietf-mptcp-congestion/
> > 
> > 
> > Cheers,
> > Christoph
> > 
> > P.S.: We have a public webserver running MPTCP at
> > http://mptcp.info.ucl.ac.be
> > So you can directly try out the power of MPTCP... ;-)
> > 
> > On Thursday, January 20, 2011 wrote Peter Chacko:
> > > SCTP already provides that , and is TCP Multi-Path is going to be a
> > > port
> > 
> > of
> > 
> > > it or any other difference ?
> > > 
> > > We are looking to use SCTP for this feature, but as we found it it has
> > 
> > not
> > 
> > > kicked off , because of its firewall issues, we are trying add
> > > Multi-Pathing at application layer, sharing all the congestion
> > 
> > states(like
> > 
> > > CM idea) as we are building a WAN optimized storage replication module
> > > as part of our cloud storage gateway development.
> > > 
> > > Curious to see more info on this.
> > > 
> > > Thanks
> > > 
> > > 2011/1/20 Christoph Paasch <christoph.paasch@uclouvain.be>
> > > 
> > > > Hi all,
> > > > 
> > > > The IETF is developing a new transport layer solution, MultiPath TCP,
> > > > which allows to efficiently exploit several Internet paths between a
> > > > pair of hosts,
> > > > while presenting a single TCP connection to the application layer.
> > > > 
> > > > At the UCLouvain in Belgium we are developping the support for
> > 
> > MultiPath
> > 
> > > > TCP
> > > > in the Linux Kernel. The implementation is a major extension to the
> > 
> > Linux
> > 
> > > > TCP-
> > > > stack.
> > > > 
> > > > For general information, access:
> > > > http://inl.info.ucl.ac.be/mptcp
> > > > https://scm.info.ucl.ac.be/trac/mptcp/
> > > > 
> > > > To access the git-repository:
> > > > git://scm.info.ucl.ac.be/mtcp.git
> > > > 
> > > > branches:
> > > >        mptcp_2.6.36 - based on Linux Kernel 2.6.36
> > > >        mtcp_no_subrcvqueue - based on Linux Kernel 2.6.28
> > > > 
> > > > For questions, feedback,... feel free to subscribe to the mptcp-dev
> > > > Mailing-
> > > > List:
> > > > https://listes-2.sipr.ucl.ac.be/sympa/info/mptcp-dev
> > > > 
> > > > 
> > > > Regards,
> > > > Christoph
> > > > 
> > > > --
> > > > Christoph Paasch
> > > > PhD Student
> > > > 
> > > > IP Networking Lab --- http://inl.info.ucl.ac.be
> > > > MultiPath TCP in the Linux Kernel --- http://inl.info.ucl.ac.be/mptcp
> > > > Université Catholique de Louvain
> > > > 
> > > > www.rollerbulls.be
> > > > --
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > --
> > Christoph Paasch
> > PhD Student
> > 
> > IP Networking Lab --- http://inl.info.ucl.ac.be
> > MultiPath TCP in the Linux Kernel --- http://inl.info.ucl.ac.be/mptcp
> > Université Catholique de Louvain
> > 
> > www.rollerbulls.be
> > --

--
Christoph Paasch
PhD Student

IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://inl.info.ucl.ac.be/mptcp
Université Catholique de Louvain

www.rollerbulls.be
--

^ permalink raw reply

* Re: MultiPath TCP in the Linux Kernel
From: Christoph Paasch @ 2011-01-21 13:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Hagen Paul Pfeifer, netdev, bonding-devel, linux-sctp
In-Reply-To: <1295538176.2825.311.camel@edumazet-laptop>


On Thursday, January 20, 2011, Eric Dumazet wrote:
> > if you want your work to become part of the official network stack, you
> > should align your effort with the official Linux way. This means you should
> > split your work and publish patches on this mailing list.
> 
> Hmm, they probably know that, and prefer to wait until the MTCP stuff is
> mature before patch submission :)
Exactly... :)

The protocol specification of MPTCP is not stable yet 
(http://datatracker.ietf.org/doc/draft-ietf-mptcp-multiaddressed/).

Also, not all features are in our implementation yet (e.g., IPv6 support 
and dual IPv4/IPv6 support), and we need to make our implementation 
less intrusive to the regular TCP/IP stack.

However, as developer resources are limited, this will still take quite some 
time... ;)

Cheers,
Christoph


--
Christoph Paasch
PhD Student

IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://inl.info.ucl.ac.be/mptcp
Université Catholique de Louvain

www.rollerbulls.be
--

^ permalink raw reply

* Re: Using ethernet device as efficient small packet generator
From: Ben Greear @ 2011-01-21 13:38 UTC (permalink / raw)
  To: juice; +Cc: Eric Dumazet, Loke, Chetan, Jon Zhou, Stephen Hemminger, netdev
In-Reply-To: <37bfb9ca79c2325ec4b70033f509200a.squirrel@www.liukuma.net>

On 01/21/2011 04:12 AM, juice wrote:
>> Le vendredi 21 janvier 2011 à 13:44 +0200, juice a écrit :
>>
>>
>> You should try
>>
>> CLONE_SKB="clone_skb 10"
>> ...
>> pgset "$CLONE_SKB"
>>
>>
>> Because I suspect you hit a performance problem on skb
>> allocation/filling/use/freeing
>
> Actually, that makes the performance worse:
> (Now I tried it with kernel 2.6.37, which is currently running)

Maybe try a clone_skb of 1000 or so.  It zeroes its memory when
allocating a packet, which can be quite expensive.

Also note that the Ethernet inter-frame gap isn't accounted for in
the BPS, but it is a significant amount of the total bandwidth
when using 64-byte packets.
You are pushing a bit more than half of the theoretical limit of
around 1,400,000 64-byte packets per second for 1Gbps ethernet.
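
For reference, the theoretical limit can be worked out from the per-frame
wire overhead. The sketch below is illustrative only: the function name is
made up here, and it assumes the usual Ethernet accounting of 8 bytes of
preamble/start-of-frame delimiter plus a 12-byte inter-frame gap on top of
each frame.

```c
#include <assert.h>

/* Bytes on the wire per frame that are not part of the frame itself
 * (assumption: 7-byte preamble + 1-byte SFD + 12-byte inter-frame gap). */
#define WIRE_OVERHEAD_BYTES 20u

/* Upper bound on frames per second for a link speed (bits/s) and a
 * frame size (bytes, including the FCS). Hypothetical helper name. */
static unsigned long max_frames_per_sec(unsigned long long link_bps,
                                        unsigned int frame_bytes)
{
    unsigned long long bits_per_frame;

    assert(frame_bytes > 0);
    bits_per_frame = (unsigned long long)(frame_bytes + WIRE_OVERHEAD_BYTES) * 8;
    return (unsigned long)(link_bps / bits_per_frame);
}
```

With link_bps = 1000000000 and frame_bytes = 64, each frame occupies
(64 + 20) * 8 = 672 bit times, so the bound is about 1,488,000 frames per
second, a little above the round figure quoted above.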

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-21 13:55 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Jay Vosburgh, John Fastabend, netdev@vger.kernel.org
In-Reply-To: <4D3745AF.5040808@gmail.com>

On 01/19/2011 11:12 PM, Nicolas de Pesloüan wrote:

> If you have time for that, then yes, please, do the same test using
> balance-rr+vlan to segregate path. With those results, we would have
> the opportunity to enhance the documentation with some well tested cases
> of TCP load balancing on a LAN, not limited to 802.3ad automatic setup.
> Both setups make sense, and assuming the results would be similar is
> probably true, but not reliable enough to assert it into the documentation.
>
> Thanks,
>
> Nicolas.
>
Nicolas,
I've run similar tests for the VLAN tunneling scenario. The results are 
identical, as I expected. The only significant difference is link failure 
handling: 802.3ad mode allows almost painless load redistribution, while 
balance-rr causes packet loss.
The only question for me now is whether my patch could be applied to the 
upstream version. Fixing issues with adaptation to net-next code isn't a 
problem, if nobody objects.
There were 2 tests:
1) unidirectional test
2) bidirectional test
Below are results:

Iperf results:
test 1:
  iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300
------------------------------------------------------------
Client connecting to 192.168.111.128, TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
[  3] local 192.168.111.129 port 9999 connected with 192.168.111.128 
port 9999
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-300.0 sec  141637 MBytes  3960 Mbits/sec

test 2:
iperf -f m -c 192.168.111.128 -B 192.168.111.129 -p 9999 -t 300 
--dualtest -P 4
------------------------------------------------------------
Server listening on TCP port 9999
Binding to local address 192.168.111.129
TCP window size: 32.0 MByte (default)
------------------------------------------------------------
...
[SUM]  0.0-300.2 sec  111334 MBytes  3111 Mbits/sec
[SUM]  0.0-300.4 sec  109582 MBytes  3060 Mbits/sec

TCP stats:
receiver side, before test 1:
[root@target1 ~]# netstat -st
IcmpMsg:
     InType0: 4
     InType3: 6
     InType8: 2
     OutType0: 2
     OutType3: 6
     OutType8: 4
Tcp:
     4 active connections openings
     2 passive connection openings
     3 failed connection attempts
     0 connection resets received
     3 connections established
     10252 segments received
     29766 segments send out
     2 segments retransmited
     0 bad segments received.
     0 resets sent
UdpLite:
TcpExt:
     3 delayed acks sent
     613 packets directly queued to recvmsg prequeue.
     16 packets directly received from backlog
     1760 packets directly received from prequeue
     428 packets header predicted
     10 packets header predicted and directly queued to user
     9295 acknowledgments not containing data received
     265 predicted acknowledgments
     0 TCP data loss events
     1 other TCP timeouts
     TCPSackMerged: 1
     TCPSackShiftFallback: 1
IpExt:
     InMcastPkts: 92
     OutMcastPkts: 64
     InBcastPkts: 2
     InOctets: 1089217
     OutOctets: 265005791
     InMcastOctets: 16294
     OutMcastOctets: 10364
     InBcastOctets: 483


receiver side, after test 1:
[root@target1 ~]# netstat -st
IcmpMsg:
     InType0: 17
     InType3: 6
     InType8: 9
     OutType0: 9
     OutType3: 6
     OutType8: 19
Tcp:
     84 active connections openings
     14 passive connection openings
     6 failed connection attempts
     4 connection resets received
     4 connections established
     16684784 segments received
     16704650 segments send out
     22 segments retransmited
     0 bad segments received.
     6 resets sent
UdpLite:
TcpExt:
     39 TCP sockets finished time wait in slow timer
     23 delayed acks sent
     83 delayed acks further delayed because of locked socket
     Quick ack mode was activated 225 times
     1019 packets directly queued to recvmsg prequeue.
     3235352384 packets directly received from backlog
     483600 packets directly received from prequeue
     86065 packets header predicted
     4855 packets header predicted and directly queued to user
     10369 acknowledgments not containing data received
     928 predicted acknowledgments
     0 TCP data loss events
     2 retransmits in slow start
     6 other TCP timeouts
     225 DSACKs sent for old packets
     1 connections reset due to unexpected data
     TCPSackMerged: 1
     TCPSackShiftFallback: 3
IpExt:
     InMcastPkts: 108
     OutMcastPkts: 72
     InBcastPkts: 4
     InOctets: -936746758
     OutOctets: 1556837236
     InMcastOctets: 16774
     OutMcastOctets: 10620
     InBcastOctets: 966

receiver side, after test 2:
[root@target1 ~]# netstat -st
IcmpMsg:
     InType0: 17
     InType3: 6
     InType8: 12
     OutType0: 12
     OutType3: 6
     OutType8: 19
Tcp:
     144 active connections openings
     25 passive connection openings
     29 failed connection attempts
     7 connection resets received
     4 connections established
     44349148 segments received
     44401154 segments send out
     58434 segments retransmited
     0 bad segments received.
     6 resets sent
UdpLite:
TcpExt:
     58 TCP sockets finished time wait in slow timer
     735072 packets rejects in established connections because of timestamp
     34 delayed acks sent
     359 delayed acks further delayed because of locked socket
     Quick ack mode was activated 14800 times
     2112 packets directly queued to recvmsg prequeue.
     3753925448 packets directly received from backlog
     4377976 packets directly received from prequeue
     847653 packets header predicted
     105696 packets header predicted and directly queued to user
     8804473 acknowledgments not containing data received
     154775 predicted acknowledgments
     10465 times recovered from packet loss due to SACK data
     Detected reordering 1 times using FACK
     Detected reordering 11185 times using SACK
     Detected reordering 182 times using time stamp
     2116 congestion windows fully recovered
     18951 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 58
     8 congestion windows recovered after partial ack
     0 TCP data loss events
     53 timeouts after SACK recovery
     1 timeouts in loss state
     57287 fast retransmits
     12 forward retransmits
     793 retransmits in slow start
     10 other TCP timeouts
     263 sack retransmits failed
     14800 DSACKs sent for old packets
     31 DSACKs sent for out of order packets
     14289 DSACKs received
     43 DSACKs for out of order packets received
     1 connections reset due to unexpected data
     TCPDSACKIgnoredOld: 8615
     TCPDSACKIgnoredNoUndo: 5683
     TCPSackMerged: 1
     TCPSackShiftFallback: 15015212
IpExt:
     InMcastPkts: 116
     OutMcastPkts: 76
     InBcastPkts: 4
     InOctets: 1012355682
     OutOctets: -1540562156
     InMcastOctets: 17014
     OutMcastOctets: 10748
     InBcastOctets: 966


sender side, before test 1:
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 4
     InType8: 32
     OutType0: 32
     OutType3: 4
Tcp:
     1 active connections openings
     2 passive connection openings
     0 failed connection attempts
     0 connection resets received
     3 connections established
     30268 segments received
     10217 segments send out
     0 segments retransmited
     0 bad segments received.
     3 resets sent
UdpLite:
TcpExt:
     7 delayed acks sent
     6332 packets directly queued to recvmsg prequeue.
     8 packets directly received from backlog
     46104 packets directly received from prequeue
     27935 packets header predicted
     11 packets header predicted and directly queued to user
     455 acknowledgments not containing data received
     119 predicted acknowledgments
     0 TCP data loss events
     TCPSackShiftFallback: 1
IpExt:
     InMcastPkts: 87
     OutMcastPkts: 54
     InBcastPkts: 2
     InOctets: 265039007
     OutOctets: 1083024
     InMcastOctets: 16444
     OutMcastOctets: 9893
     InBcastOctets: 483

sender side, after test 1:
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 4
     InType8: 53
     OutType0: 53
     OutType3: 4
Tcp:
     69 active connections openings
     12 passive connection openings
     2 failed connection attempts
     4 connection resets received
     4 connections established
     16704819 segments received
     16684841 segments send out
     401 segments retransmited
     0 bad segments received.
     10 resets sent
UdpLite:
TcpExt:
     31 TCP sockets finished time wait in slow timer
     25 delayed acks sent
     6515 packets directly queued to recvmsg prequeue.
     24 packets directly received from backlog
     46988 packets directly received from prequeue
     27974 packets header predicted
     115 packets header predicted and directly queued to user
     10259331 acknowledgments not containing data received
     12483 predicted acknowledgments
     166 times recovered from packet loss due to SACK data
     Detected reordering 1 times using FACK
     Detected reordering 7 times using SACK
     Detected reordering 1 times using time stamp
     1 congestion windows fully recovered
     41 congestion windows partially recovered using Hoe heuristic
     0 TCP data loss events
     386 fast retransmits
     5 forward retransmits
     3 other TCP timeouts
     1 times receiver scheduled too late for direct processing
     225 DSACKs received
     1 connections reset due to unexpected data
     TCPDSACKIgnoredOld: 167
     TCPDSACKIgnoredNoUndo: 58
     TCPSackShiftFallback: 30925668
IpExt:
     InMcastPkts: 103
     OutMcastPkts: 62
     InBcastPkts: 4
     InOctets: 1556368288
     OutOctets: -934790015
     InMcastOctets: 16924
     OutMcastOctets: 10149
     InBcastOctets: 966

sender side, after test 2:
[root@target2 ~]# netstat -st
IcmpMsg:
     InType3: 4
     InType8: 56
     OutType0: 56
     OutType3: 4
Tcp:
     117 active connections openings
     25 passive connection openings
     2 failed connection attempts
     7 connection resets received
     4 connections established
     44383169 segments received
     44367187 segments send out
     59660 segments retransmited
     0 bad segments received.
     34 resets sent
UdpLite:
TcpExt:
     2 TCP sockets finished time wait in fast timer
     57 TCP sockets finished time wait in slow timer
     717082 packets rejects in established connections because of timestamp
     46 delayed acks sent
     202 delayed acks further delayed because of locked socket
     Quick ack mode was activated 14356 times
     7432 packets directly queued to recvmsg prequeue.
     135038632 packets directly received from backlog
     3633432 packets directly received from prequeue
     783534 packets header predicted
     94671 packets header predicted and directly queued to user
     20034470 acknowledgments not containing data received
     177885 predicted acknowledgments
     10851 times recovered from packet loss due to SACK data
     Detected reordering 6 times using FACK
     Detected reordering 9217 times using SACK
     Detected reordering 111 times using time stamp
     2125 congestion windows fully recovered
     19325 congestion windows partially recovered using Hoe heuristic
     TCPDSACKUndo: 71
     7 congestion windows recovered after partial ack
     0 TCP data loss events
     52 timeouts after SACK recovery
     58562 fast retransmits
     67 forward retransmits
     736 retransmits in slow start
     8 other TCP timeouts
     226 sack retransmits failed
     1 times receiver scheduled too late for direct processing
     14356 DSACKs sent for old packets
     44 DSACKs sent for out of order packets
     14679 DSACKs received
     31 DSACKs for out of order packets received
     1 connections reset due to unexpected data
     TCPDSACKIgnoredOld: 8899
     TCPDSACKIgnoredNoUndo: 5791
     TCPSackShiftFallback: 47227517
IpExt:
     InMcastPkts: 109
     OutMcastPkts: 65
     InBcastPkts: 4
     InOctets: -1885181292
     OutOctets: 1366995261
     InMcastOctets: 17104
     OutMcastOctets: 10245
     InBcastOctets: 966

-- 
Best regards,
Oleg Ukhno,
ITO Team lead
Yandex LLC.


^ permalink raw reply

* Re: 2.6.38-rc1: arp triggers RTNL assertion
From: Richard Cochran @ 2011-01-21 14:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jamie Heilman, linux-kernel, netdev
In-Reply-To: <1295593946.2613.52.camel@edumazet-laptop>

On Fri, Jan 21, 2011 at 08:12:26AM +0100, Eric Dumazet wrote:
> Thanks for the report, I am looking at this right now.

FYI, I had this too. Happens every time I use my UMTS modem.

Kernel: private branch from e744070fd4ff9d3114277e52d77afa21579adce2


Jan 13 05:07:23 riccoc20 pppd[12961]: Serial connection established.
Jan 13 05:07:23 riccoc20 pppd[12961]: Using interface ppp0
Jan 13 05:07:23 riccoc20 pppd[12961]: Connect: ppp0 <--> /dev/ttyACM0
Jan 13 05:07:24 riccoc20 pppd[12961]: PAP authentication succeeded
Jan 13 05:07:25 riccoc20 pppd[12961]: replacing old default route to eth0 [10.0.0.1]
Jan 13 05:07:25 riccoc20 pppd[12961]: found interface eth0 for proxy arp
Jan 13 05:07:25 riccoc20 pppd[12961]: local  IP address 46.124.43.179
Jan 13 05:07:25 riccoc20 pppd[12961]: remote IP address 10.6.6.6
Jan 13 05:07:25 riccoc20 pppd[12961]: primary   DNS address 213.162.69.169
Jan 13 05:07:25 riccoc20 pppd[12961]: secondary DNS address 213.162.65.1
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088740] RTNL: assertion failed at /home/cochran/work/git/net-next-2.6/net/core/neighbour.c (589)
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088746] Pid: 12961, comm: pppd Tainted: P            2.6.37+ #3
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088748] Call Trace:
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088757]  [<c04da48f>] ? pneigh_lookup+0x1af/0x1c0
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088761]  [<c05254fe>] ? arp_req_set+0x18e/0x290
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088764]  [<c04dbb62>] ? __rtnl_unlock+0x12/0x20
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088767]  [<c04cd972>] ? netdev_run_todo+0x42/0x230
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088771]  [<c0329821>] ? apparmor_capable+0x21/0x70
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088774]  [<c04cf81d>] ? dev_get_by_name_rcu+0x8d/0xb0
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088777]  [<c0525844>] ? arp_ioctl+0x244/0x260
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088780]  [<c052a5e5>] ? inet_ioctl+0xa5/0xb0
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088784]  [<c04bc6bd>] ? sock_ioctl+0x6d/0x290
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088786]  [<c04bc650>] ? sock_ioctl+0x0/0x290
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088790]  [<c02222cc>] ? do_vfs_ioctl+0x8c/0x620
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088794]  [<c05ab22a>] ? do_page_fault+0x1ca/0x450
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088796]  [<c04be7bb>] ? sys_send+0x3b/0x40
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088798]  [<c04bf3b8>] ? sys_socketcall+0x1d8/0x2a0
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088801]  [<c02228c7>] ? sys_ioctl+0x67/0x80
Jan 13 05:07:25 riccoc20 kernel: [ 3151.088804]  [<c05a80a4>] ? syscall_call+0x7/0xb
Jan 13 05:07:31 riccoc20 ntpdate[13030]: no server suitable for synchronization found

^ permalink raw reply

* [PATCH 1/8] af_unix: Documentation on multicast unix sockets
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 .../networking/multicast-unix-sockets.txt          |  171 ++++++++++++++++++++
 1 files changed, 171 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/multicast-unix-sockets.txt

diff --git a/Documentation/networking/multicast-unix-sockets.txt b/Documentation/networking/multicast-unix-sockets.txt
new file mode 100644
index 0000000..0cc30cb
--- /dev/null
+++ b/Documentation/networking/multicast-unix-sockets.txt
@@ -0,0 +1,171 @@
+Multicast Unix sockets
+======================
+
+Multicast is implemented on SOCK_DGRAM and SOCK_SEQPACKET Unix sockets.
+
+A userspace application can create a multicast group with:
+
+  struct unix_mreq mreq = {0,};
+  mreq.address.sun_family = AF_UNIX;
+  mreq.address.sun_path[0] = '\0';
+  strcpy(mreq.address.sun_path + 1, "socket-address");
+
+  sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
+  ret = setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast_group, which is reference counted and exists
+as long as the socket that created it exists or the group has at least one
+member.
+
+Then a multicast group can be joined with:
+
+  ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast, which holds the settings of the membership,
+mainly whether loopback is enabled. A socket can be a member of several
+multicast groups.
+
+The socket is part of the multicast group until it is released, shut down with
+RCV_SHUTDOWN, or explicitly leaves the group:
+
+  ret = setsockopt(sockfd, SOL_UNIX, UNIX_LEAVE_GROUP, &mreq, sizeof(mreq));
+
+Struct unix_mcast nodes are linked in two RCU lists:
+- (struct unix_sock)->mcast_subscriptions
+- (struct unix_mcast_group)->mcast_members
+
+              unix_mcast_group  unix_mcast_group
+                      |                 |
+                      v                 v
+unix_sock  ---->  unix_mcast  ----> unix_mcast
+                      |
+                      v
+unix_sock  ---->  unix_mcast
+                      |
+                      v
+unix_sock  ---->  unix_mcast
+
+
+SOCK_DGRAM semantics
+====================
+
+          G          The socket which created the group
+       /  |  \
+     P1  P2  P3      The member sockets
+
+Messages sent to the group are received by all members except the sender itself
+unless the sending socket has UNIX_MREQ_LOOPBACK set.
+
+Non-members can also send to the group socket G and the message will be
+broadcast to the group members; however, socket G itself does not receive
+messages sent to the group through it.
+
+
+SOCK_SEQPACKET semantics
+========================
+
+When a connection is performed on a SOCK_SEQPACKET multicast socket, a new
+socket is created and its file descriptor is received by accept().
+
+          L          The listening socket
+       /  |  \
+     A1  A2  A3      The accepted sockets
+      |   |   |
+     C1  C2  C3      The connected sockets
+
+Messages sent on the C1 socket are received by:
+- C1 itself if UNIX_MREQ_LOOPBACK is set.
+- The peer socket A1 if UNIX_MREQ_SEND_TO_PEER is set.
+- The other members of the multicast group C2 and C3.
+
+Only members can send to the group in this case.
+
+
+Atomic delivery and ordering
+============================
+
+Each message sent is delivered atomically to either none of the recipients or
+all the recipients, even with interruptions and errors.
+
+Locking is used in order to keep the ordering consistent on all recipients. We
+want to avoid the following scenario. Two emitters A and B, and 2 recipients, C
+and D:
+
+           C    D
+A -------->|    |    Step 1: A's message is delivered to C
+B -------->|    |    Step 2: B's message is delivered to C
+B ---------|--->|    Step 3: B's message is delivered to D
+A ---------|--->|    Step 4: A's message is delivered to D
+
+Result: - C received (A, B)
+        - D received (B, A)
+
+Although A and B had a list of recipients (C, D) in the same order, C and D
+received the messages in a different order. To avoid this scenario, we need a
+locking mechanism while the messages are being delivered with skb_queue_tail().
+
+Solution 1:
+The easiest implementation would be to use a global spinlock on the group, but
+it creates avoidable contention, especially when there are two independent
+streams set up with socket filters; e.g. if A sends messages received only by
+C, and B sends messages received only by D.
+
+Solution 2:
+Fine-grained locking could be implemented with a spinlock on each recipient.
+Before delivering the message to the recipients, the sender takes a spinlock on
+each recipient at the same time.
+
+Taking several spinlocks on the same struct can be dangerous and lead to
+deadlocks. This is prevented by sorting the list of sockets by memory address
+and taking the spinlocks in that order. The ordered list of recipients is
+computed on demand when a message is sent and the list is cached for
+performance. When the group membership changes, the generation of the
+membership is incremented and the ordered recipient list is invalidated.
+
+With this solution, the number of spinlocks taken simultaneously can be
+arbitrarily large. Whilst it works, it breaks the lockdep mechanism.
+
+Solution 3:
+The current implementation is similar to solution 2 but with a limit on the
+number of spinlocks taken simultaneously (8), so lockdep works fine. A hash
+function and bit array with n=8 specify which spinlocks to take.  Contention
+on independent streams can still happen but it is less likely.
+
+
+Flow control
+============
+
+When a socket's receiving queue is full, the default behavior is to block
+senders (or to return -EAGAIN on non-blocking sockets). The socket can also
+join a multicast group with the flag UNIX_MREQ_DROP_WHEN_FULL. In this case,
+messages sent to the group will not be delivered to that socket when its
+receiving queue is full.
+
+Messages are still delivered atomically to all members who don't have the flag
+UNIX_MREQ_DROP_WHEN_FULL. If send() returns -EAGAIN, nobody received the
+message. If send() blocks because of one member, the other members don't
+receive the message until all sockets (except those with
+UNIX_MREQ_DROP_WHEN_FULL set) can receive at the same time.
+
+poll/epoll/select on POLLOUT events have a consistent behavior; they block if
+at least one member of the multicast group without UNIX_MREQ_DROP_WHEN_FULL has
+a full receiving queue.
+
+
+Multicast socket reference counting
+===================================
+
+A poller for POLLOUT events can block for any member of the group. The poller
+can use the wait queue "peer_wait" of any member. So it is important that Unix
+sockets are not released before all pollers exit. This is achieved by:
+
+- Incrementing the reference counter of a socket when it joins a multicast
+  group.
+- Decrementing it when the group is destroyed, that is when all
+  sockets keeping a reference on the group released their reference on the
+  group.
+
+struct unix_mcast_group keeps track of both current members and previous
+members. When a socket leaves a group, it is removed from the members list and
+put in the dead members list. This is done in order to take advantage of RCU
+lists, which reduces lock contention.
-- 
1.7.2.3
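
As a rough illustration of "Solution 3" in the documentation above, the
userspace sketch below maps each recipient to one of 8 lock classes, records
the classes in a bit array, and always takes the selected locks in ascending
index order. Everything here is an assumption made for illustration: the
hash function, the helper names and the C11 atomic_flag spinlocks are
hypothetical and are not taken from the patch series.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_CLASS_COUNT 8	/* same bound as described in the text */

struct group_locks {
	atomic_flag lock[LOCK_CLASS_COUNT];
};

static void group_locks_init(struct group_locks *g)
{
	for (unsigned int i = 0; i < LOCK_CLASS_COUNT; i++)
		atomic_flag_clear(&g->lock[i]);
}

/* Hypothetical hash: derive a lock class from the recipient's address. */
static unsigned int lock_class_of(const void *recipient)
{
	return (unsigned int)(((uintptr_t)recipient >> 4) % LOCK_CLASS_COUNT);
}

/* Bit i is set when at least one recipient hashes to class i. */
static unsigned int lock_mask(void *const *recipients, size_t n)
{
	unsigned int mask = 0;

	for (size_t i = 0; i < n; i++)
		mask |= 1u << lock_class_of(recipients[i]);
	return mask;
}

/* Take the selected locks in ascending index order. Every sender uses
 * the same order, so two senders cannot wait on each other in a cycle. */
static void lock_classes(struct group_locks *g, unsigned int mask)
{
	for (unsigned int i = 0; i < LOCK_CLASS_COUNT; i++) {
		if (!(mask & (1u << i)))
			continue;
		while (atomic_flag_test_and_set_explicit(&g->lock[i],
							 memory_order_acquire))
			;	/* spin until the lock is free */
	}
}

/* Release the selected locks in descending index order. */
static void unlock_classes(struct group_locks *g, unsigned int mask)
{
	for (unsigned int i = LOCK_CLASS_COUNT; i-- > 0; ) {
		if (mask & (1u << i))
			atomic_flag_clear_explicit(&g->lock[i],
						   memory_order_release);
	}
}
```

A sender would compute lock_mask() over its recipient list, call
lock_classes(), queue the message on every recipient, then call
unlock_classes(). Two senders whose recipients hash to disjoint classes do
not contend, while senders sharing a class are serialised, which is the
trade-off the text describes.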

^ permalink raw reply related

* [PATCH 3/8] af_unix: add setsockopt on unix sockets
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

unix_setsockopt() is called only on SOCK_DGRAM and SOCK_SEQPACKET unix sockets

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 net/unix/af_unix.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index d8d98d5..7ea85de 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -512,6 +512,8 @@ static unsigned int unix_dgram_poll(struct file *, struct socket *,
 				    poll_table *);
 static int unix_ioctl(struct socket *, unsigned int, unsigned long);
 static int unix_shutdown(struct socket *, int);
+static int unix_setsockopt(struct socket *, int, int,
+			   char __user *, unsigned int);
 static int unix_stream_sendmsg(struct kiocb *, struct socket *,
 			       struct msghdr *, size_t);
 static int unix_stream_recvmsg(struct kiocb *, struct socket *,
@@ -559,7 +561,7 @@ static const struct proto_ops unix_dgram_ops = {
 	.ioctl =	unix_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	unix_shutdown,
-	.setsockopt =	sock_no_setsockopt,
+	.setsockopt =	unix_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_dgram_sendmsg,
 	.recvmsg =	unix_dgram_recvmsg,
@@ -580,7 +582,7 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.ioctl =	unix_ioctl,
 	.listen =	unix_listen,
 	.shutdown =	unix_shutdown,
-	.setsockopt =	sock_no_setsockopt,
+	.setsockopt =	unix_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	unix_seqpacket_sendmsg,
 	.recvmsg =	unix_dgram_recvmsg,
@@ -1561,6 +1563,13 @@ out:
 }
 
 
+static int unix_setsockopt(struct socket *sock, int level, int optname,
+			   char __user *optval, unsigned int optlen)
+{
+	return -EOPNOTSUPP;
+}
+
+
 static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 			       struct msghdr *msg, size_t len)
 {
-- 
1.7.2.3

^ permalink raw reply related

* [PATCH 4/8] af_unix: create, join and leave multicast groups with setsockopt
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy, Ian Molton
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

Multicast is implemented on SOCK_DGRAM and SOCK_SEQPACKET unix sockets.

A userspace application can create a multicast group with:
  struct unix_mreq mreq;
  mreq.address.sun_family = AF_UNIX;
  mreq.address.sun_path[0] = '\0';
  strcpy(mreq.address.sun_path + 1, "socket-address");
  mreq.flags = 0;

  sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
  ret = setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq, sizeof(mreq));

Then a multicast group can be joined and left with:
  ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));
  ret = setsockopt(sockfd, SOL_UNIX, UNIX_LEAVE_GROUP, &mreq, sizeof(mreq));

A socket can be a member of several multicast groups.

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Ian Molton <ian.molton@collabora.co.uk>
---
 include/net/af_unix.h |   77 +++++++++++
 net/unix/Kconfig      |   10 ++
 net/unix/af_unix.c    |  339 ++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 424 insertions(+), 2 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 18e5c3f..f2b605b 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -41,7 +41,62 @@ struct unix_skb_parms {
 				spin_lock_nested(&unix_sk(s)->lock, \
 				SINGLE_DEPTH_NESTING)
 
+/* UNIX socket options */
+#define UNIX_CREATE_GROUP	1
+#define UNIX_JOIN_GROUP		2
+#define UNIX_LEAVE_GROUP	3
+
+/* Flags on unix_mreq */
+
+/* On UNIX_JOIN_GROUP: the socket will receive its own messages */
+#define UNIX_MREQ_LOOPBACK		0x01
+
+/* On UNIX_JOIN_GROUP: the messages will also be received by the peer */
+#define UNIX_MREQ_SEND_TO_PEER		0x02
+
+/* On UNIX_JOIN_GROUP: just drop the message instead of blocking if the
+ * receiving queue is full */
+#define UNIX_MREQ_DROP_WHEN_FULL	0x04
+
+struct unix_mreq {
+	struct sockaddr_un	address;
+	unsigned int		flags;
+};
+
 #ifdef __KERNEL__
+
+struct unix_mcast_group {
+	/* RCU list of (struct unix_mcast)->member_node
+	 * Messages sent to the multicast group are delivered to this list of
+	 * members */
+	struct hlist_head	mcast_members;
+
+	/* RCU list of (struct unix_mcast)->member_dead_node
+	 * When the group dies, previous members' reference counters must be
+	 * decremented */
+	struct hlist_head	mcast_dead_members;
+
+	/* RCU list of (struct sock_set)->list */
+	struct hlist_head	mcast_members_lists;
+
+	atomic_t		mcast_members_cnt;
+
+	/* The generation is incremented each time a peer joins or
+	 * leaves the group. It is used to invalidate old lists
+	 * struct sock_set */
+	atomic_t		mcast_membership_generation;
+
+	/* Locks to guarantee causal order in deliveries */
+#define MCAST_LOCK_CLASS_COUNT	8
+	spinlock_t		lock[MCAST_LOCK_CLASS_COUNT];
+
+	/* The group is referenced by:
+	 * - the socket that created the multicast group
+	 * - the accepted sockets (SOCK_SEQPACKET only)
+	 * - the current members of the group */
+	atomic_t		refcnt;
+};
+
 /* The AF_UNIX socket */
 struct unix_sock {
 	/* WARNING: sk has to be the first member */
@@ -57,9 +112,31 @@ struct unix_sock {
 	spinlock_t		lock;
 	unsigned int		gc_candidate : 1;
 	unsigned int		gc_maybe_cycle : 1;
+	unsigned int		mcast_send_to_peer : 1;
+	unsigned int		mcast_drop_when_peer_full : 1;
 	unsigned char		recursion_level;
+	struct unix_mcast_group	*mcast_group;
+
+	/* RCU List of (struct unix_mcast)->subscription_node
+	 * A socket can subscribe to several multicast groups
+	 */
+	struct hlist_head	mcast_subscriptions;
+
 	struct socket_wq	peer_wq;
 };
+
+struct unix_mcast {
+	struct unix_sock	*member;
+	struct unix_mcast_group	*group;
+	unsigned int		flags;
+	struct hlist_node	subscription_node;
+	/* A subscription cannot be both alive and dead but we cannot use the
+	 * same field because RCU readers run lockless. member_dead_node is
+	 * not read by lockless RCU readers. */
+	struct hlist_node	member_node;
+	struct hlist_node	member_dead_node;
+};
+
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
 #define peer_wait peer_wq.wait
diff --git a/net/unix/Kconfig b/net/unix/Kconfig
index 5a69733..e3e5d9b 100644
--- a/net/unix/Kconfig
+++ b/net/unix/Kconfig
@@ -19,3 +19,13 @@ config UNIX
 
 	  Say Y unless you know what you are doing.
 
+config UNIX_MULTICAST
+	depends on UNIX && EXPERIMENTAL
+	bool "Multicast over Unix domain sockets"
+	---help---
+	  If you say Y here, you will include support for multicasting on Unix
+	  domain sockets. Support is available for SOCK_DGRAM and
+	  SOCK_SEQPACKET. Certain types of delivery synchronisation are
+	  provided, see Documentation/networking/multicast-unix-sockets.txt
+
+
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 7ea85de..f25c020 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -117,6 +117,9 @@
 
 static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 static DEFINE_SPINLOCK(unix_table_lock);
+#ifdef CONFIG_UNIX_MULTICAST
+static DEFINE_SPINLOCK(unix_multicast_lock);
+#endif
 static atomic_long_t unix_nr_socks;
 
 #define unix_sockets_unbound	(&unix_socket_table[UNIX_HASH_SIZE])
@@ -371,6 +374,28 @@ static void unix_sock_destructor(struct sock *sk)
 #endif
 }
 
+#ifdef CONFIG_UNIX_MULTICAST
+static void
+destroy_mcast_group(struct unix_mcast_group *group)
+{
+	struct unix_mcast *node;
+	struct hlist_node *pos;
+	struct hlist_node *pos_tmp;
+
+	BUG_ON(atomic_read(&group->refcnt) != 0);
+	BUG_ON(!hlist_empty(&group->mcast_members));
+
+	hlist_for_each_entry_safe(node, pos, pos_tmp,
+				  &group->mcast_dead_members,
+				  member_dead_node) {
+		hlist_del_rcu(&node->member_dead_node);
+		sock_put(&node->member->sk);
+		kfree(node);
+	}
+	kfree(group);
+}
+#endif
+
 static int unix_release_sock(struct sock *sk, int embrion)
 {
 	struct unix_sock *u = unix_sk(sk);
@@ -379,6 +404,11 @@ static int unix_release_sock(struct sock *sk, int embrion)
 	struct sock *skpair;
 	struct sk_buff *skb;
 	int state;
+#ifdef CONFIG_UNIX_MULTICAST
+	struct unix_mcast *node;
+	struct hlist_node *pos;
+	struct hlist_node *pos_tmp;
+#endif
 
 	unix_remove_socket(sk);
 
@@ -392,6 +422,23 @@ static int unix_release_sock(struct sock *sk, int embrion)
 	u->mnt	     = NULL;
 	state = sk->sk_state;
 	sk->sk_state = TCP_CLOSE;
+#ifdef CONFIG_UNIX_MULTICAST
+	spin_lock(&unix_multicast_lock);
+	hlist_for_each_entry_safe(node, pos, pos_tmp, &u->mcast_subscriptions,
+				  subscription_node) {
+		hlist_del_rcu(&node->member_node);
+		hlist_del_rcu(&node->subscription_node);
+		atomic_dec(&node->group->mcast_members_cnt);
+		atomic_inc(&node->group->mcast_membership_generation);
+		hlist_add_head_rcu(&node->member_dead_node,
+				   &node->group->mcast_dead_members);
+		if (atomic_dec_and_test(&node->group->refcnt))
+			destroy_mcast_group(node->group);
+	}
+	if (u->mcast_group && atomic_dec_and_test(&u->mcast_group->refcnt))
+		destroy_mcast_group(u->mcast_group);
+	spin_unlock(&unix_multicast_lock);
+#endif
 	unix_state_unlock(sk);
 
 	wake_up_interruptible_all(&u->peer_wait);
@@ -631,6 +678,9 @@ static struct sock *unix_create1(struct net *net, struct socket *sock)
 	atomic_long_set(&u->inflight, 0);
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
+#ifdef CONFIG_UNIX_MULTICAST
+	INIT_HLIST_HEAD(&u->mcast_subscriptions);
+#endif
 	init_waitqueue_head(&u->peer_wait);
 	unix_insert_socket(unix_sockets_unbound, sk);
 out:
@@ -1055,6 +1105,10 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	struct sock *newsk = NULL;
 	struct sock *other = NULL;
 	struct sk_buff *skb = NULL;
+#ifdef CONFIG_UNIX_MULTICAST
+	struct unix_mcast *node;
+	struct hlist_node *pos;
+#endif
 	unsigned hash;
 	int st;
 	int err;
@@ -1082,6 +1136,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	newsk = unix_create1(sock_net(sk), NULL);
 	if (newsk == NULL)
 		goto out;
+	newu = unix_sk(newsk);
 
 	/* Allocate skb for sending to listening sock */
 	skb = sock_wmalloc(newsk, 1, 0, GFP_KERNEL);
@@ -1094,6 +1149,8 @@ restart:
 	if (!other)
 		goto out;
 
+	otheru = unix_sk(other);
+
 	/* Latch state of peer */
 	unix_state_lock(other);
 
@@ -1165,6 +1222,18 @@ restart:
 		goto out_unlock;
 	}
 
+#ifdef CONFIG_UNIX_MULTICAST
+	/* Multicast sockets */
+	hlist_for_each_entry_rcu(node, pos, &u->mcast_subscriptions,
+				 subscription_node) {
+		if (node->group == otheru->mcast_group) {
+			atomic_inc(&otheru->mcast_group->refcnt);
+			newu->mcast_group = otheru->mcast_group;
+			break;
+		}
+	}
+#endif
+
 	/* The way is open! Fastly set all the necessary fields... */
 
 	sock_hold(sk);
@@ -1172,9 +1241,7 @@ restart:
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
 	init_peercred(newsk);
-	newu = unix_sk(newsk);
 	newsk->sk_wq		= &newu->peer_wq;
-	otheru = unix_sk(other);
 
 	/* copy address information from listening to new sock*/
 	if (otheru->addr) {
@@ -1563,10 +1630,278 @@ out:
 }
 
 
+#ifdef CONFIG_UNIX_MULTICAST
+static int unix_mc_create(struct socket *sock, struct unix_mreq *mreq)
+{
+	struct sock *other;
+	int err;
+	unsigned hash;
+	int namelen;
+	struct unix_mcast_group *mcast_group;
+	int i;
+
+	if (mreq->address.sun_family != AF_UNIX ||
+	    mreq->address.sun_path[0] != '\0')
+		return -EINVAL;
+
+	err = unix_mkname(&mreq->address, sizeof(struct sockaddr_un), &hash);
+	if (err < 0)
+		return err;
+
+	namelen = err;
+	other = unix_find_other(sock_net(sock->sk), &mreq->address, namelen,
+				sock->type, hash, &err);
+	if (other) {
+		sock_put(other);
+		return -EADDRINUSE;
+	}
+
+	mcast_group = kmalloc(sizeof(struct unix_mcast_group), GFP_KERNEL);
+	if (!mcast_group)
+		return -ENOBUFS;
+
+	INIT_HLIST_HEAD(&mcast_group->mcast_members);
+	INIT_HLIST_HEAD(&mcast_group->mcast_dead_members);
+	INIT_HLIST_HEAD(&mcast_group->mcast_members_lists);
+	atomic_set(&mcast_group->mcast_members_cnt, 0);
+	atomic_set(&mcast_group->mcast_membership_generation, 1);
+	atomic_set(&mcast_group->refcnt, 1);
+	for (i = 0 ; i < MCAST_LOCK_CLASS_COUNT ; i++) {
+		spin_lock_init(&mcast_group->lock[i]);
+		lockdep_set_subclass(&mcast_group->lock[i], i);
+	}
+
+	err = sock->ops->bind(sock,
+		(struct sockaddr *)&mreq->address,
+		sizeof(struct sockaddr_un));
+	if (err < 0) {
+		kfree(mcast_group);
+		return err;
+	}
+
+	unix_state_lock(sock->sk);
+	unix_sk(sock->sk)->mcast_group = mcast_group;
+	unix_state_unlock(sock->sk);
+
+	return 0;
+}
+
+
+static int unix_mc_join(struct socket *sock, struct unix_mreq *mreq)
+{
+	struct unix_sock *u = unix_sk(sock->sk);
+	struct sock *other, *peer;
+	struct unix_mcast_group *group;
+	struct unix_mcast *node;
+	int err;
+	unsigned hash;
+	int namelen;
+
+	if (mreq->address.sun_family != AF_UNIX ||
+	    mreq->address.sun_path[0] != '\0')
+		return -EINVAL;
+
+	/* sockets which represent a group are not allowed to join another
+	 * group */
+	if (u->mcast_group)
+		return -EINVAL;
+
+	err = unix_autobind(sock);
+	if (err < 0)
+		return err;
+
+	err = unix_mkname(&mreq->address, sizeof(struct sockaddr_un), &hash);
+	if (err < 0)
+		return err;
+
+	namelen = err;
+	other = unix_find_other(sock_net(sock->sk), &mreq->address, namelen,
+				sock->type, hash, &err);
+	if (!other)
+		return -EINVAL;
+
+	group = unix_sk(other)->mcast_group;
+
+	if (!group) {
+		err = -EADDRINUSE;
+		goto sock_put_out;
+	}
+
+	node = kmalloc(sizeof(struct unix_mcast), GFP_KERNEL);
+	if (!node) {
+		err = -ENOMEM;
+		goto sock_put_out;
+	}
+	node->member = u;
+	node->group = group;
+	node->flags = mreq->flags;
+
+	if (sock->sk->sk_type == SOCK_SEQPACKET) {
+		peer = unix_peer_get(sock->sk);
+		if (peer) {
+			atomic_inc(&group->refcnt);
+			unix_sk(peer)->mcast_group = group;
+			sock_put(peer);
+		}
+	}
+
+	unix_state_lock(sock->sk);
+	unix_sk(sock->sk)->mcast_send_to_peer =
+		!!(mreq->flags & UNIX_MREQ_SEND_TO_PEER);
+	unix_sk(sock->sk)->mcast_drop_when_peer_full =
+		!!(mreq->flags & UNIX_MREQ_DROP_WHEN_FULL);
+	unix_state_unlock(sock->sk);
+
+	/* Keep a reference */
+	sock_hold(sock->sk);
+	atomic_inc(&group->refcnt);
+
+	spin_lock(&unix_multicast_lock);
+	hlist_add_head_rcu(&node->member_node,
+			   &group->mcast_members);
+	hlist_add_head_rcu(&node->subscription_node, &u->mcast_subscriptions);
+	atomic_inc(&group->mcast_members_cnt);
+	atomic_inc(&group->mcast_membership_generation);
+	spin_unlock(&unix_multicast_lock);
+
+	return 0;
+
+sock_put_out:
+	sock_put(other);
+	return err;
+}
+
+
+static int unix_mc_leave(struct socket *sock, struct unix_mreq *mreq)
+{
+	struct unix_sock *u = unix_sk(sock->sk);
+	struct sock *other;
+	struct unix_mcast_group *group;
+	struct unix_mcast *node;
+	struct hlist_node *pos;
+	int err;
+	unsigned hash;
+	int namelen;
+
+	if (mreq->address.sun_family != AF_UNIX ||
+	    mreq->address.sun_path[0] != '\0')
+		return -EINVAL;
+
+	err = unix_mkname(&mreq->address, sizeof(struct sockaddr_un), &hash);
+	if (err < 0)
+		return err;
+
+	namelen = err;
+	other = unix_find_other(sock_net(sock->sk), &mreq->address, namelen,
+				sock->type, hash, &err);
+	if (!other)
+		return -EINVAL;
+
+	group = unix_sk(other)->mcast_group;
+
+	if (!group) {
+		err = -EINVAL;
+		goto sock_put_out;
+	}
+
+	spin_lock(&unix_multicast_lock);
+
+	hlist_for_each_entry_rcu(node, pos, &u->mcast_subscriptions,
+			     subscription_node) {
+		if (node->group == group)
+			break;
+	}
+
+	if (!pos) {
+		spin_unlock(&unix_multicast_lock);
+		err = -EINVAL;
+		goto sock_put_out;
+	}
+
+	hlist_del_rcu(&node->member_node);
+	hlist_del_rcu(&node->subscription_node);
+	atomic_dec(&group->mcast_members_cnt);
+	atomic_inc(&group->mcast_membership_generation);
+	hlist_add_head_rcu(&node->member_dead_node,
+			   &group->mcast_dead_members);
+	spin_unlock(&unix_multicast_lock);
+
+	if (sock->sk->sk_type == SOCK_SEQPACKET) {
+		struct sock *peer = unix_peer_get(sock->sk);
+		if (peer) {
+			unix_sk(peer)->mcast_group = NULL;
+			atomic_dec(&group->refcnt);
+			sock_put(peer);
+		}
+	}
+
+	synchronize_rcu();
+
+	if (atomic_dec_and_test(&group->refcnt)) {
+		spin_lock(&unix_multicast_lock);
+		destroy_mcast_group(group);
+		spin_unlock(&unix_multicast_lock);
+	}
+
+	err = 0;
+
+	/* If the receiving queue of that socket was full, some writers on the
+	 * multicast group may be blocked */
+	wake_up_interruptible_sync_poll(&u->peer_wait,
+					POLLOUT | POLLWRNORM | POLLWRBAND);
+
+sock_put_out:
+	sock_put(other);
+	return err;
+}
+#endif
+
 static int unix_setsockopt(struct socket *sock, int level, int optname,
 			   char __user *optval, unsigned int optlen)
 {
+#ifdef CONFIG_UNIX_MULTICAST
+	struct unix_mreq mreq;
+	int err = 0;
+
+	if (level != SOL_UNIX)
+		return -ENOPROTOOPT;
+
+	switch (optname) {
+	case UNIX_CREATE_GROUP:
+	case UNIX_JOIN_GROUP:
+	case UNIX_LEAVE_GROUP:
+		if (optlen < sizeof(struct unix_mreq))
+			return -EINVAL;
+		if (copy_from_user(&mreq, optval, sizeof(struct unix_mreq)))
+			return -EFAULT;
+		break;
+
+	default:
+		break;
+	}
+
+	switch (optname) {
+	case UNIX_CREATE_GROUP:
+		err = unix_mc_create(sock, &mreq);
+		break;
+
+	case UNIX_JOIN_GROUP:
+		err = unix_mc_join(sock, &mreq);
+		break;
+
+	case UNIX_LEAVE_GROUP:
+		err = unix_mc_leave(sock, &mreq);
+		break;
+
+	default:
+		err = -ENOPROTOOPT;
+		break;
+	}
+
+	return err;
+#else
 	return -EOPNOTSUPP;
+#endif
 }
 
 
-- 
1.7.2.3


* [PATCH 5/8] af_unix: find the recipients of a multicast group
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

unix_find_multicast_recipients() returns a list of recipients for the specific
multicast address. It checks the options UNIX_MREQ_SEND_TO_PEER and
UNIX_MREQ_LOOPBACK to get the right recipients.

The list of recipients is ordered and guaranteed not to have duplicates.
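
The ordering and de-duplication described above can be sketched in plain
userspace C. This is only an illustration: struct item and sort_and_dedup()
are hypothetical names, qsort() stands in for the kernel's sort(), and the
reference counting done by the patch is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's struct sock_item. */
struct item {
	const void *s;	/* recipient identity; NULL marks a dropped duplicate */
};

/* Order items by the numeric value of their pointer. */
static int item_compare(const void *_a, const void *_b)
{
	uintptr_t a = (uintptr_t)((const struct item *)_a)->s;
	uintptr_t b = (uintptr_t)((const struct item *)_b)->s;

	return (a > b) - (a < b);
}

/* Sort items[0..cnt-1] by address, clear every copy of a duplicate except
 * the last one, and return the number of distinct recipients.
 * Assumes no input entry is NULL. */
static size_t sort_and_dedup(struct item *items, size_t cnt)
{
	size_t i;
	size_t unique = cnt ? 1 : 0;

	qsort(items, cnt, sizeof(items[0]), item_compare);
	for (i = 1; i < cnt; i++) {
		if (items[i].s == items[i - 1].s)
			items[i - 1].s = NULL;	/* keep only the later copy */
		else
			unique++;
	}
	return unique;
}
```

Sorting by address gives every sender the same locking order, which is the
reason the patch gives for keeping the array ordered.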

When the caller has finished with the list of recipients, it will call
up_sock_set() and the list can be reused by another sender.
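
The reuse rule can be sketched as follows, assuming a single thread and a
plain array instead of the RCU list. struct cached_set, claim_set() and
release_set() are hypothetical names; the in_use flag stands in for the
per-set semaphore and release_set() plays the role of up_sock_set().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a cached recipient list. */
struct cached_set {
	int generation;	/* membership generation the set was built for */
	bool in_use;	/* stands in for the per-set semaphore */
};

/* Claim a free set built for the current generation.
 * Returns NULL when the caller has to build a new set. */
static struct cached_set *claim_set(struct cached_set *sets, size_t n,
				    int generation)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sets[i].in_use)
			continue;	/* used by another sender */
		if (sets[i].generation != generation)
			continue;	/* outdated, left for garbage collection */
		sets[i].in_use = true;
		return &sets[i];
	}
	return NULL;
}

/* Hand the set back so that another sender can reuse it. */
static void release_set(struct cached_set *set)
{
	set->in_use = false;
}
```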

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 net/unix/af_unix.c |  259 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 256 insertions(+), 3 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f25c020..fe0d3bb 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,18 +114,84 @@
 #include <linux/mount.h>
 #include <net/checksum.h>
 #include <linux/security.h>
-
-static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
-static DEFINE_SPINLOCK(unix_table_lock);
 #ifdef CONFIG_UNIX_MULTICAST
+#include <linux/sort.h>
+
 static DEFINE_SPINLOCK(unix_multicast_lock);
 #endif
+static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
+static DEFINE_SPINLOCK(unix_table_lock);
 static atomic_long_t unix_nr_socks;
 
 #define unix_sockets_unbound	(&unix_socket_table[UNIX_HASH_SIZE])
 
 #define UNIX_ABSTRACT(sk)	(unix_sk(sk)->addr->hash != UNIX_HASH_SIZE)
 
+#ifdef CONFIG_UNIX_MULTICAST
+/* Array of sockets used in multicast deliveries */
+struct sock_item {
+	/* constant fields */
+	struct sock *s;
+	unsigned int flags;
+
+	/* fields reinitialized at every send */
+	struct sk_buff *skb;
+	unsigned int to_deliver:1;
+};
+
+struct sock_set {
+	/* struct sock_set is used by one sender at a time */
+	struct semaphore sem;
+	struct hlist_node list;
+	struct rcu_head rcu;
+	int generation;
+
+	/* the sender should consider only sockets from items[offset] to
+	 * items[cnt-1] */
+	int cnt;
+	int offset;
+	/* Bitfield of (struct unix_mcast_group)->lock spinlocks to take in
+	 * order to guarantee causal order of delivery */
+	u8 hash;
+	/* ordered list of sockets without duplicates. Cell zero is reserved
+	 * for sending a message to the accepted socket (SOCK_SEQPACKET only).
+	 */
+	struct sock_item items[0];
+};
+
+static void up_sock_set(struct sock_set *set)
+{
+	if ((set->offset == 0) && set->items[0].s) {
+		sock_put(set->items[0].s);
+		set->items[0].s = NULL;
+		set->items[0].skb = NULL;
+	}
+	up(&set->sem);
+}
+
+static void kfree_sock_set(struct sock_set *set)
+{
+	int i;
+	for (i = set->offset ; i < set->cnt ; i++) {
+		if (set->items[i].s)
+			sock_put(set->items[i].s);
+	}
+	kfree(set);
+}
+
+static int sock_item_compare(const void *_a, const void *_b)
+{
+	const struct sock_item *a = _a;
+	const struct sock_item *b = _b;
+	if (a->s > b->s)
+		return 1;
+	else if (a->s < b->s)
+		return -1;
+	else
+		return 0;
+}
+#endif
+
 #ifdef CONFIG_SECURITY_NETWORK
 static void unix_get_secdata(struct scm_cookie *scm, struct sk_buff *skb)
 {
@@ -379,6 +445,7 @@ static void
 destroy_mcast_group(struct unix_mcast_group *group)
 {
 	struct unix_mcast *node;
+	struct sock_set *set;
 	struct hlist_node *pos;
 	struct hlist_node *pos_tmp;
 
@@ -392,6 +459,12 @@ destroy_mcast_group(struct unix_mcast_group *group)
 		sock_put(&node->member->sk);
 		kfree(node);
 	}
+	hlist_for_each_entry_safe(set, pos, pos_tmp,
+				  &group->mcast_members_lists,
+				  list) {
+		hlist_del_rcu(&set->list);
+		kfree_sock_set(set);
+	}
 	kfree(group);
 }
 #endif
@@ -851,6 +924,186 @@ fail:
 	return NULL;
 }
 
+#ifdef CONFIG_UNIX_MULTICAST
+static int unix_find_multicast_members(struct sock_set *set,
+				       int recipient_cnt,
+				       struct hlist_head *list)
+{
+	struct unix_mcast *node;
+	struct hlist_node *pos;
+
+	hlist_for_each_entry_rcu(node, pos, list,
+			     member_node) {
+		struct sock *s;
+
+		if (set->cnt + 1 > recipient_cnt)
+			return -ENOMEM;
+
+		s = &node->member->sk;
+		sock_hold(s);
+		set->items[set->cnt].s = s;
+		set->items[set->cnt].flags = node->flags;
+		set->cnt++;
+
+		set->hash |= 1 << ((((unsigned long)s) >> 6) & 0x07);
+	}
+
+	return 0;
+}
+
+void sock_set_reclaim(struct rcu_head *rp)
+{
+	struct sock_set *set = container_of(rp, struct sock_set, rcu);
+	kfree_sock_set(set);
+}
+
+static struct sock_set *unix_find_multicast_recipients(struct sock *sender,
+				struct unix_mcast_group *group,
+				int *err)
+{
+	struct sock_set *set = NULL; /* fake GCC */
+	struct sock_set *del_set;
+	struct hlist_node *pos;
+	int recipient_cnt;
+	int generation;
+	int i;
+
+	BUG_ON(sender == NULL);
+	BUG_ON(group == NULL);
+
+	/* Find an available set if any */
+	generation = atomic_read(&group->mcast_membership_generation);
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(set, pos, &group->mcast_members_lists,
+			     list) {
+		if (down_trylock(&set->sem)) {
+			/* the set is being used by someone else */
+			continue;
+		}
+		if (set->generation == generation) {
+			/* the set is still valid, use it */
+			break;
+		}
+		/* The set is outdated. It will be removed from the RCU list
+		 * soon but not in this lockless RCU read */
+		up(&set->sem);
+	}
+	rcu_read_unlock();
+	if (pos)
+		goto list_found;
+
+	/* We cannot allocate in the spin lock. First, count the recipients */
+try_again:
+	generation = atomic_read(&group->mcast_membership_generation);
+	recipient_cnt = atomic_read(&group->mcast_members_cnt);
+
+	/* Allocate for the set and hope the number of recipients does not
+	 * change while the lock is released. If it changes, we have to try
+	 * again... We allocate a bit more than needed, so if a _few_ members
+	 * are added in a multicast group meanwhile, we don't always need to
+	 * try again. */
+	recipient_cnt += 5;
+
+	set = kmalloc(sizeof(struct sock_set)
+		      + sizeof(struct sock_item) * recipient_cnt,
+	    GFP_KERNEL);
+	if (!set) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+	sema_init(&set->sem, 0);
+	set->cnt = 1;
+	set->offset = 1;
+	set->generation = generation;
+	set->hash = 0;
+
+	rcu_read_lock();
+	if (unix_find_multicast_members(set, recipient_cnt,
+			&group->mcast_members)) {
+		rcu_read_unlock();
+		kfree_sock_set(set);
+		goto try_again;
+	}
+	rcu_read_unlock();
+
+	/* Keep the array ordered to prevent deadlocks when locking the
+	 * receiving queues. The ordering is:
+	 * - First, the accepted socket (SOCK_SEQPACKET only)
+	 * - Then, the member sockets ordered by memory address
+	 * The accepted socket cannot be member of a multicast group.
+	 */
+	sort(set->items + 1, set->cnt - 1, sizeof(struct sock_item),
+	     sock_item_compare, NULL);
+	/* Avoid duplicates */
+	for (i = 2 ; i < set->cnt ; i++) {
+		if (set->items[i].s == set->items[i - 1].s) {
+			sock_put(set->items[i - 1].s);
+			set->items[i - 1].s = NULL;
+		}
+	}
+
+	if (generation != atomic_read(&group->mcast_membership_generation)) {
+		kfree_sock_set(set);
+		goto try_again;
+	}
+
+	/* Take the lock to insert the new list but take the opportunity to do
+	 * some garbage collection on outdated lists */
+	spin_lock(&unix_multicast_lock);
+	hlist_for_each_entry_rcu(del_set, pos, &group->mcast_members_lists,
+			     list) {
+		if (down_trylock(&del_set->sem)) {
+			/* the list is being used by someone else */
+			continue;
+		}
+		if (del_set->generation < generation) {
+			hlist_del_rcu(&del_set->list);
+			call_rcu(&del_set->rcu, sock_set_reclaim);
+		}
+		up(&del_set->sem);
+	}
+	hlist_add_head_rcu(&set->list,
+			   &group->mcast_members_lists);
+	spin_unlock(&unix_multicast_lock);
+
+list_found:
+	/* List found. Initialize the first item. */
+	if (sender->sk_type == SOCK_SEQPACKET
+	    && unix_peer(sender)
+	    && unix_sk(sender)->mcast_send_to_peer) {
+		set->offset = 0;
+		sock_hold(unix_peer(sender));
+		set->items[0].s = unix_peer(sender);
+		set->items[0].skb = NULL;
+		set->items[0].to_deliver = 1;
+		set->items[0].flags =
+			unix_sk(sender)->mcast_drop_when_peer_full
+			? UNIX_MREQ_DROP_WHEN_FULL : 0;
+	} else {
+		set->items[0].s = NULL;
+		set->items[0].skb = NULL;
+		set->items[0].to_deliver = 0;
+		set->offset = 1;
+	}
+
+	/* Initialize the other items. */
+	for (i = 1 ; i < set->cnt ; i++) {
+		set->items[i].skb = NULL;
+		if (set->items[i].s == NULL) {
+			set->items[i].to_deliver = 0;
+			continue;
+		}
+		if (set->items[i].flags & UNIX_MREQ_LOOPBACK
+		    || sender != set->items[i].s)
+			set->items[i].to_deliver = 1;
+		else
+			set->items[i].to_deliver = 0;
+	}
+
+	return set;
+}
+#endif
+
 
 static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-- 
1.7.2.3


* [PATCH 6/8] af_unix: Deliver message to several recipients in case of multicast
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy, Ian Molton
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

unix_dgram_sendmsg() implements the delivery both for SOCK_DGRAM and
SOCK_SEQPACKET unix sockets.

The delivery is done in an atomic way; either the message is delivered to all
recipients or none, even in case of interruptions or errors.
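
The all-or-none rule can be sketched as a check phase followed by a commit
phase. This is a simplified userspace model, not the patch itself: struct
recipient and deliver_all_or_none() are hypothetical names, queues are plain
counters, and locking is ignored. Where the sketch returns -1, the patch
either waits for the full receiver and restarts, or fails with -EAGAIN when
no timeout is left.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one recipient's receive queue. */
struct recipient {
	int queued;		/* messages currently queued */
	int capacity;		/* maximum queue length */
	bool drop_when_full;	/* models UNIX_MREQ_DROP_WHEN_FULL */
};

/* Returns the number of recipients that received the message, or -1 if a
 * blocking recipient is full, in which case no queue is modified. */
static int deliver_all_or_none(struct recipient *r, size_t n)
{
	size_t i;
	int delivered = 0;

	/* Phase 1: make sure no blocking recipient would refuse the message */
	for (i = 0; i < n; i++) {
		if (r[i].queued >= r[i].capacity && !r[i].drop_when_full)
			return -1;
	}

	/* Phase 2: commit; full recipients left here asked for drops */
	for (i = 0; i < n; i++) {
		if (r[i].queued >= r[i].capacity)
			continue;
		r[i].queued++;
		delivered++;
	}
	return delivered;
}
```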

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Ian Molton <ian.molton@collabora.co.uk>
---
 net/unix/af_unix.c |  242 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 242 insertions(+), 0 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fe0d3bb..4147d64 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1715,6 +1715,210 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
 	return err;
 }
 
+#ifdef CONFIG_UNIX_MULTICAST
+static void kfree_skb_sock_set(struct sock_set *set)
+{
+	int i;
+	for (i = set->offset ; i < set->cnt ; i++) {
+		if (set->items[i].skb) {
+			kfree_skb(set->items[i].skb);
+			set->items[i].skb = NULL;
+		}
+	}
+}
+
+static void unix_mcast_lock(struct unix_mcast_group *group,
+			    struct sock_set *set)
+{
+	int i;
+	for (i = 0 ; i < MCAST_LOCK_CLASS_COUNT ; i++) {
+		if (set->hash & (1 << i))
+			spin_lock_nested(&group->lock[i], i);
+	}
+}
+
+static void unix_mcast_unlock(struct unix_mcast_group *group,
+			      struct sock_set *set)
+{
+	int i;
+	for (i = MCAST_LOCK_CLASS_COUNT - 1 ; i >= 0 ; i--) {
+		if (set->hash & (1 << i))
+			spin_unlock(&group->lock[i]);
+	}
+}
+
+
+static int unix_dgram_sendmsg_multicast(struct sock_iocb *siocb,
+					struct sock *sk,
+					struct sk_buff *skb,
+					struct unix_mcast_group *group,
+					struct sock_set *others_set,
+					size_t len,
+					int max_level,
+					long timeo)
+{
+	int err;
+	int i;
+
+	BUG_ON(!others_set);
+
+restart:
+	for (i = others_set->offset ; i < others_set->cnt ; i++) {
+		struct sock *cur = others_set->items[i].s;
+		unsigned int pkt_len;
+		struct sk_filter *filter;
+
+		if (!others_set->items[i].to_deliver)
+			continue;
+
+		BUG_ON(others_set->items[i].skb);
+		BUG_ON(cur == NULL);
+
+		rcu_read_lock();
+		filter = rcu_dereference(cur->sk_filter);
+		if (filter)
+			pkt_len = sk_run_filter(skb, filter->insns);
+		else
+			pkt_len = 0xffffffff;
+		rcu_read_unlock();
+
+		if (pkt_len == 0) {
+			others_set->items[i].to_deliver = 0;
+			continue;
+		}
+
+		others_set->items[i].skb = skb_clone(skb, GFP_KERNEL);
+		if (!others_set->items[i].skb) {
+			kfree_skb_sock_set(others_set);
+			err = -ENOMEM;
+			goto out_free;
+		}
+		skb_set_owner_w(others_set->items[i].skb, sk);
+		err = unix_scm_to_skb(siocb->scm, others_set->items[i].skb,
+				      true);
+		if (err < 0)
+			goto out_free;
+		unix_get_secdata(siocb->scm, others_set->items[i].skb);
+		pskb_trim(others_set->items[i].skb, pkt_len);
+	}
+
+	for (i = others_set->offset ; i < others_set->cnt ; i++) {
+		struct sock *cur = others_set->items[i].s;
+
+		if (!others_set->items[i].to_deliver)
+			continue;
+
+		unix_state_lock(cur);
+
+		if (cur->sk_shutdown & RCV_SHUTDOWN) {
+			unix_state_unlock(cur);
+			kfree_skb(others_set->items[i].skb);
+			others_set->items[i].skb = NULL;
+			others_set->items[i].to_deliver = 0;
+			continue;
+		}
+
+		if (sk->sk_type != SOCK_SEQPACKET) {
+			err = security_unix_may_send(sk->sk_socket,
+						     cur->sk_socket);
+			if (err) {
+				unix_state_unlock(cur);
+				kfree_skb(others_set->items[i].skb);
+				others_set->items[i].skb = NULL;
+				others_set->items[i].to_deliver = 0;
+				continue;
+			}
+		}
+
+		if (unix_peer(cur) != sk && unix_recvq_full(cur)) {
+			kfree_skb(others_set->items[i].skb);
+			others_set->items[i].skb = NULL;
+
+			if (others_set->items[i].flags
+					& UNIX_MREQ_DROP_WHEN_FULL) {
+				/* Drop the skbs and continue */
+				unix_state_unlock(cur);
+				others_set->items[i].to_deliver = 0;
+				continue;
+			} else {
+				if (!timeo) {
+					unix_state_unlock(cur);
+					err = -EAGAIN;
+					goto out_free;
+				}
+
+				timeo = unix_wait_for_peer(cur, timeo);
+
+				err = sock_intr_errno(timeo);
+				if (signal_pending(current))
+					goto out_free;
+
+				kfree_skb_sock_set(others_set);
+				goto restart;
+			}
+		}
+		unix_state_unlock(cur);
+	}
+
+	unix_mcast_lock(group, others_set);
+	for (i = others_set->offset ; i < others_set->cnt ; i++) {
+		struct sock *cur = others_set->items[i].s;
+
+		if (!others_set->items[i].to_deliver)
+			continue;
+
+		BUG_ON(cur == NULL);
+		BUG_ON(others_set->items[i].skb == NULL);
+
+		unix_state_lock(cur);
+
+		if (sock_flag(cur, SOCK_DEAD)) {
+			unix_state_unlock(cur);
+
+			kfree_skb(others_set->items[i].skb);
+			others_set->items[i].skb = NULL;
+			others_set->items[i].to_deliver = 0;
+			continue;
+		}
+
+		if (sock_flag(cur, SOCK_RCVTSTAMP))
+			__net_timestamp(others_set->items[i].skb);
+
+		skb_queue_tail(&cur->sk_receive_queue,
+			       others_set->items[i].skb);
+		others_set->items[i].skb = NULL;
+		if (max_level > unix_sk(cur)->recursion_level)
+			unix_sk(cur)->recursion_level = max_level;
+
+		unix_state_unlock(cur);
+	}
+	unix_mcast_unlock(group, others_set);
+
+	for (i = others_set->offset ; i < others_set->cnt ; i++) {
+		struct sock *cur = others_set->items[i].s;
+
+		if (!others_set->items[i].to_deliver)
+			continue;
+
+		cur->sk_data_ready(cur, len);
+	}
+
+	kfree_skb(skb);
+	scm_destroy(siocb->scm);
+	up_sock_set(others_set);
+	return len;
+
+out_free:
+	kfree_skb(skb);
+	if (others_set) {
+		kfree_skb_sock_set(others_set);
+		up_sock_set(others_set);
+	}
+	return err;
+}
+#endif
+
+
 /*
  *	Send AF_UNIX data.
  */
@@ -1735,6 +1939,10 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	long timeo;
 	struct scm_cookie tmp_scm;
 	int max_level;
+#ifdef CONFIG_UNIX_MULTICAST
+	struct unix_mcast_group *group = NULL;
+	struct sock_set *others_set = NULL;
+#endif
 
 	if (NULL == siocb->scm)
 		siocb->scm = &tmp_scm;
@@ -1756,8 +1964,20 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sunaddr = NULL;
 		err = -ENOTCONN;
 		other = unix_peer_get(sk);
+
 		if (!other)
 			goto out;
+
+#ifdef CONFIG_UNIX_MULTICAST
+		group = unix_sk(other)->mcast_group;
+		if (group) {
+			others_set = unix_find_multicast_recipients(sk,
+				group, &err);
+
+			if (!others_set)
+				goto out;
+		}
+#endif
 	}
 
 	if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
@@ -1795,6 +2015,28 @@ restart:
 					hash, &err);
 		if (other == NULL)
 			goto out_free;
+
+#ifdef CONFIG_UNIX_MULTICAST
+		group = unix_sk(other)->mcast_group;
+		if (group) {
+			others_set = unix_find_multicast_recipients(sk,
+				group, &err);
+
+			sock_put(other);
+			other = NULL;
+
+			if (!others_set)
+				goto out;
+		}
+	}
+
+	if (group) {
+		err = unix_dgram_sendmsg_multicast(siocb, sk, skb, group,
+			others_set, len, max_level, timeo);
+		if (err < 0)
+			goto out;
+		return err;
+#endif
 	}
 
 	if (sk_filter(other, skb) < 0) {
-- 
1.7.2.3


* [PATCH 7/8] af_unix: implement poll(POLLOUT) for multicast sockets
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

When a socket subscribed to a multicast group has its incoming queue full, it
can either block the emission to the multicast group or let the messages be
dropped. The latter is useful to monitor all messages without slowing down the
traffic.

It is specified with the flag UNIX_MREQ_DROP_WHEN_FULL when the multicast group
is joined.

poll(POLLOUT) is implemented by checking the receiving queues of all
subscribed sockets. If at least one of them has a full receiving queue and
does not have UNIX_MREQ_DROP_WHEN_FULL set, the multicast socket is not
writable.
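
As a rough userspace model of that rule (struct member and group_writable()
are hypothetical names, and the special case for a member whose peer is the
polling socket is omitted):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one subscribed socket as seen by poll(). */
struct member {
	bool recvq_full;	/* receiving queue is full */
	bool drop_when_full;	/* joined with UNIX_MREQ_DROP_WHEN_FULL */
};

/* The group is writable unless some member that cannot drop messages has
 * a full receiving queue. */
static bool group_writable(const struct member *m, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (m[i].drop_when_full)
			continue;	/* never blocks the sender */
		if (m[i].recvq_full)
			return false;
	}
	return true;
}
```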

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 net/unix/af_unix.c |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4147d64..138d9a2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2940,6 +2940,11 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 {
 	struct sock *sk = sock->sk, *other;
 	unsigned int mask, writable;
+#ifdef CONFIG_UNIX_MULTICAST
+	struct sock_set *others;
+	int err = 0;
+	int i;
+#endif
 
 	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
@@ -2980,6 +2985,34 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 		sock_put(other);
 	}
 
+#ifdef CONFIG_UNIX_MULTICAST
+	/*
+	 * On multicast sockets, we need to check if the receiving queue is
+	 * full on all peers who don't have UNIX_MREQ_DROP_WHEN_FULL.
+	 */
+	if (!other || !unix_sk(other)->mcast_group)
+		goto skip_multicast;
+	others = unix_find_multicast_recipients(sk,
+		unix_sk(other)->mcast_group, &err);
+	if (!others)
+		goto skip_multicast;
+	for (i = others->offset ; i < others->cnt ; i++) {
+		if (others->items[i].flags & UNIX_MREQ_DROP_WHEN_FULL)
+			continue;
+		if (unix_peer(others->items[i].s) != sk) {
+			sock_poll_wait(file,
+				&unix_sk(others->items[i].s)->peer_wait, wait);
+			if (unix_recvq_full(others->items[i].s)) {
+				writable = 0;
+				break;
+			}
+		}
+	}
+	up_sock_set(others);
+
+skip_multicast:
+#endif
+
 	if (writable)
 		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
 	else
-- 
1.7.2.3


* [PATCH 8/8] af_unix: Unsubscribe sockets from their multicast groups on RCV_SHUTDOWN
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 net/unix/af_unix.c |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 138d9a2..9b281cf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2820,6 +2820,10 @@ static int unix_shutdown(struct socket *sock, int mode)
 {
 	struct sock *sk = sock->sk;
 	struct sock *other;
+#ifdef CONFIG_UNIX_MULTICAST
+	struct unix_sock *u = unix_sk(sk);
+	int unsubscribed = 0;
+#endif
 
 	mode = (mode+1)&(RCV_SHUTDOWN|SEND_SHUTDOWN);
 
@@ -2831,7 +2835,38 @@ static int unix_shutdown(struct socket *sock, int mode)
 	other = unix_peer(sk);
 	if (other)
 		sock_hold(other);
+
+#ifdef CONFIG_UNIX_MULTICAST
+	/* If the socket subscribed to a multicast group and it is shut down
+	 * with (mode&RCV_SHUTDOWN), it should be unsubscribed or at least
+	 * stop blocking the peers */
+	if (mode&RCV_SHUTDOWN) {
+		struct unix_mcast *node;
+		struct hlist_node *pos;
+		struct hlist_node *pos_tmp;
+
+		spin_lock(&unix_multicast_lock);
+		hlist_for_each_entry_safe(node, pos, pos_tmp,
+					  &u->mcast_subscriptions,
+					  subscription_node) {
+			hlist_del_rcu(&node->member_node);
+			hlist_del_rcu(&node->subscription_node);
+			atomic_dec(&node->group->mcast_members_cnt);
+			atomic_inc(&node->group->mcast_membership_generation);
+			hlist_add_head_rcu(&node->member_dead_node,
+					   &node->group->mcast_dead_members);
+			unsubscribed = 1;
+		}
+		spin_unlock(&unix_multicast_lock);
+	}
+#endif
 	unix_state_unlock(sk);
+
+#ifdef CONFIG_UNIX_MULTICAST
+	if (unsubscribed)
+		wake_up_interruptible_all(&u->peer_wait);
+#endif
+
 	sk->sk_state_change(sk);
 
 	if (other &&
-- 
1.7.2.3


* [PATCH 2/8] af_unix: Add constant for unix socket options level
From: Alban Crequy @ 2011-01-21 14:39 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Lennart Poettering, netdev,
	linux-doc, linux-
  Cc: Alban Crequy
In-Reply-To: <20110121143751.57b1453d@chocolatine.cbg.collabora.co.uk>

Assign the next free socket options level to be used by the unix
protocol and address family.

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 include/linux/socket.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index edbb1d0..a257d1c 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -308,6 +308,7 @@ struct ucred {
 #define SOL_IUCV	277
 #define SOL_CAIF	278
 #define SOL_ALG		279
+#define SOL_UNIX	280
 
 /* IPX options */
 #define IPX_TYPE	1
-- 
1.7.2.3
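
For context, a socket options level is the second argument of setsockopt()
and getsockopt(). The sketch below shows how userspace might pass the new
level. It is an illustration under assumptions: the fallback value 280 is
the one proposed in this patch, and UNIX_EXAMPLE_OPT and try_unix_option()
are hypothetical names that do not come from the series. On a kernel without
the series applied the call is expected to fail with an error.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#ifndef SOL_UNIX
#define SOL_UNIX 280		/* value proposed by this patch */
#endif

#define UNIX_EXAMPLE_OPT 1	/* hypothetical option name, illustration only */

/* Try to set a (hypothetical) option at the SOL_UNIX level.
 * Returns 0 on success or a negative errno value on failure. */
static int try_unix_option(void)
{
	int val = 1;
	int err = 0;
	int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

	if (fd < 0)
		return -errno;

	if (setsockopt(fd, SOL_UNIX, UNIX_EXAMPLE_OPT, &val, sizeof(val)) < 0)
		err = errno;

	close(fd);
	return -err;
}
```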


* Re: [PATCH] e1000: add support for Marvell Alaska M88E1118R PHY
From: Florian Fainelli @ 2011-01-21 16:27 UTC (permalink / raw)
  To: Dirk Brandewie; +Cc: Jeff Kirsher, netdev@vger.kernel.org, David Miller
In-Reply-To: <1295549359.7387.30.camel@localhost.localdomain>

Hello Dirk, Jeff,

On Thursday 20 January 2011 19:49:19 Dirk Brandewie wrote:
> On Wed, 2011-01-19 at 22:51 -0800, Jeff Kirsher wrote:
> > On Wed, Jan 19, 2011 at 01:09, Florian Fainelli <ffainelli@freebox.fr> wrote:
> > > From: Florian Fainelli <ffainelli@freebox.fr>
> > > 
> > > This patch adds support for Marvell Alaska M88E1118R PHY chips. Support
> > > for other M88* PHYs is already there, so there is nothing more to add
> > > than its PHY id.
> > > 
> > > Signed-off-by: Florian Fainelli <ffainelli@freebox.fr>
> > > CC: Dirk Brandewie <dirk.j.brandewie@intel.com>
> > > CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > > ---
> > 
> > The patch itself looks fine.  I am concerned about validation.
> > 
> > Dirk - is there a chance that the ce4100 will use this PHY?  If so,
> > can you cover the validation?
> 
> Florian is working on a CE4100 based platform.  It looks like they used
> a different PHY from the Falcon Falls reference platform. I can't
> directly test this patch since I don't have their hardware.  I will be
> testing .38-rc1 next week on Falcon Falls.

Indeed, we use this PHY on our hardware.

> 
> I think the best we can do without the hardware is to compare the data
> sheet for the new PHY with the PHYs already supported and make sure they
> are compatible.  If the datasheets match up for the features the driver
> is using this seems pretty low risk IMHO.

As far as I could check, all M88E111* PHYs should behave the same for the
setup done in e1000.
--
Florian

* Re: [PATCH] netfilter: ipvs: fix compiler warnings
From: Patrick McHardy @ 2011-01-21 16:50 UTC (permalink / raw)
  To: Changli Gao
  Cc: Simon Horman, Wensong Zhang, Julian Anastasov, David S. Miller,
	netdev, lvs-devel, netfilter-devel
In-Reply-To: <1295604133-6869-1-git-send-email-xiaosuo@gmail.com>

On 21.01.2011 11:02, Changli Gao wrote:
> Fix compiler warnings when no transport protocol load balancing support
> is configured.

Thanks Changli, I'll apply your patch once one of the IPVS developers
ACKs this.
