Netdev List
 help / color / mirror / Atom feed
* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Cong Wang @ 2014-10-22 17:37 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org,
	Paul McKenney, Eric W. Biederman
In-Reply-To: <CA+5PVA7Ro_ejBUqsZ9StWVeu59==fGnj6e4Gx8zM4_3+Lq5s4A@mail.gmail.com>

(Adding Paul and Eric in Cc)

I am not aware of any change in net/core/dev.c related here,
so I guess it's a bug in rcu_barrier().

Thanks.

On Wed, Oct 22, 2014 at 10:12 AM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
>
> Someone else is seeing this when they try and modprobe ppp_generic:
>
> [  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 seconds.
> [  240.599338]       Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> [  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  240.599583] kworker/u16:5   D ffff8802202db480 12400   100      2 0x00000000
> [  240.599744] Workqueue: netns cleanup_net
> [  240.599823]  ffff8802202eb9e8 0000000000000096 ffff8802202db480
> 00000000001d5f00
> [  240.600066]  ffff8802202ebfd8 00000000001d5f00 ffff8800368c3480
> ffff8802202db480
> [  240.600228]  ffffffff81ee2690 7fffffffffffffff ffffffff81ee2698
> ffffffff81ee2690
> [  240.600386] Call Trace:
> [  240.600445]  [<ffffffff8185e239>] schedule+0x29/0x70
> [  240.600541]  [<ffffffff8186345c>] schedule_timeout+0x26c/0x410
> [  240.600651]  [<ffffffff81865ef7>] ? retint_restore_args+0x13/0x13
> [  240.600765]  [<ffffffff818644e4>] ? _raw_spin_unlock_irq+0x34/0x50
> [  240.600879]  [<ffffffff8185fc6c>] wait_for_completion+0x10c/0x150
> [  240.601025]  [<ffffffff810e53e0>] ? wake_up_state+0x20/0x20
> [  240.601133]  [<ffffffff8112a749>] _rcu_barrier+0x159/0x200
> [  240.601237]  [<ffffffff8112a845>] rcu_barrier+0x15/0x20
> [  240.601335]  [<ffffffff81718ebf>] netdev_run_todo+0x6f/0x310
> [  240.601442]  [<ffffffff8170da85>] ? rollback_registered_many+0x265/0x2e0
> [  240.601564]  [<ffffffff81725f2e>] rtnl_unlock+0xe/0x10
> [  240.601660]  [<ffffffff8170f8e6>] default_device_exit_batch+0x156/0x180
> [  240.601781]  [<ffffffff810fd8a0>] ? abort_exclusive_wait+0xb0/0xb0
> [  240.601895]  [<ffffffff81707993>] ops_exit_list.isra.1+0x53/0x60
> [  240.602028]  [<ffffffff81708540>] cleanup_net+0x100/0x1f0
> [  240.602131]  [<ffffffff810ccfa8>] process_one_work+0x218/0x850
> [  240.602241]  [<ffffffff810ccf0f>] ? process_one_work+0x17f/0x850
> [  240.602350]  [<ffffffff810cd6c7>] ? worker_thread+0xe7/0x4a0
> [  240.602454]  [<ffffffff810cd64b>] worker_thread+0x6b/0x4a0
> [  240.602555]  [<ffffffff810cd5e0>] ? process_one_work+0x850/0x850
> [  240.602665]  [<ffffffff810d399b>] kthread+0x10b/0x130
> [  240.602762]  [<ffffffff81028cc9>] ? sched_clock+0x9/0x10
> [  240.602862]  [<ffffffff810d3890>] ? kthread_create_on_node+0x250/0x250
> [  240.603004]  [<ffffffff818651fc>] ret_from_fork+0x7c/0xb0
> [  240.603106]  [<ffffffff810d3890>] ? kthread_create_on_node+0x250/0x250
> [  240.603224] 4 locks held by kworker/u16:5/100:
> [  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: [<ffffffff810ccf0f>]
> process_one_work+0x17f/0x850
> [  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
> [<ffffffff810ccf0f>] process_one_work+0x17f/0x850
> [  240.603691]  #2:  (net_mutex){+.+.+.}, at: [<ffffffff817084cc>]
> cleanup_net+0x8c/0x1f0
> [  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
> [<ffffffff8112a625>] _rcu_barrier+0x35/0x200
> [  240.604211] INFO: task modprobe:1387 blocked for more than 120 seconds.
> [  240.604329]       Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> [  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  240.604570] modprobe        D ffff8800cb4f1a40 13112  1387   1386 0x00000080
> [  240.604719]  ffff8800cafbbbe8 0000000000000096 ffff8800cb4f1a40
> 00000000001d5f00
> [  240.604878]  ffff8800cafbbfd8 00000000001d5f00 ffff880223280000
> ffff8800cb4f1a40
> [  240.605068]  ffff8800cb4f1a40 ffffffff81f8fb48 0000000000000246
> ffff8800cb4f1a40
> [  240.605228] Call Trace:
> [  240.605283]  [<ffffffff8185e7e1>] schedule_preempt_disabled+0x31/0x80
> [  240.605400]  [<ffffffff81860033>] mutex_lock_nested+0x183/0x440
> [  240.605510]  [<ffffffff8170835f>] ? register_pernet_subsys+0x1f/0x50
> [  240.605626]  [<ffffffff8170835f>] ? register_pernet_subsys+0x1f/0x50
> [  240.605757]  [<ffffffffa0701000>] ? 0xffffffffa0701000
> [  240.605854]  [<ffffffff8170835f>] register_pernet_subsys+0x1f/0x50
> [  240.606005]  [<ffffffffa0701048>] br_init+0x48/0xd3 [bridge]
> [  240.606112]  [<ffffffff81002148>] do_one_initcall+0xd8/0x210
> [  240.606224]  [<ffffffff81153c02>] load_module+0x20c2/0x2870
> [  240.606327]  [<ffffffff8114ebe0>] ? store_uevent+0x70/0x70
> [  240.606433]  [<ffffffff8110ac26>] ? lock_release_non_nested+0x3c6/0x3d0
> [  240.606557]  [<ffffffff81154497>] SyS_init_module+0xe7/0x140
> [  240.606664]  [<ffffffff818652a9>] system_call_fastpath+0x12/0x17
> [  240.606773] 1 lock held by modprobe/1387:
> [  240.606845]  #0:  (net_mutex){+.+.+.}, at: [<ffffffff8170835f>]
> register_pernet_subsys+0x1f/0x50
> [  240.607114] INFO: task modprobe:1466 blocked for more than 120 seconds.
> [  240.607231]       Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
> [  240.607337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  240.607473] modprobe        D ffff88020fbab480 13096  1466   1399 0x00000084
> [  240.607622]  ffff88020d1bbbe8 0000000000000096 ffff88020fbab480
> 00000000001d5f00
> [  240.607791]  ffff88020d1bbfd8 00000000001d5f00 ffffffff81e1b580
> ffff88020fbab480
> [  240.607949]  ffff88020fbab480 ffffffff81f8fb48 0000000000000246
> ffff88020fbab480
> [  240.608138] Call Trace:
> [  240.608193]  [<ffffffff8185e7e1>] schedule_preempt_disabled+0x31/0x80
> [  240.608316]  [<ffffffff81860033>] mutex_lock_nested+0x183/0x440
> [  240.608425]  [<ffffffff817083ad>] ? register_pernet_device+0x1d/0x70
> [  240.608542]  [<ffffffff817083ad>] ? register_pernet_device+0x1d/0x70
> [  240.608662]  [<ffffffffa071d000>] ? 0xffffffffa071d000
> [  240.608759]  [<ffffffff817083ad>] register_pernet_device+0x1d/0x70
> [  240.608881]  [<ffffffffa071d020>] ppp_init+0x20/0x1000 [ppp_generic]
> [  240.609021]  [<ffffffff81002148>] do_one_initcall+0xd8/0x210
> [  240.609131]  [<ffffffff81153c02>] load_module+0x20c2/0x2870
> [  240.609235]  [<ffffffff8114ebe0>] ? store_uevent+0x70/0x70
> [  240.609339]  [<ffffffff8110ac26>] ? lock_release_non_nested+0x3c6/0x3d0
> [  240.609462]  [<ffffffff81154497>] SyS_init_module+0xe7/0x140
> [  240.609568]  [<ffffffff818652a9>] system_call_fastpath+0x12/0x17
> [  240.609677] 1 lock held by modprobe/1466:
> [  240.609749]  #0:  (net_mutex){+.+.+.}, at: [<ffffffff817083ad>]
> register_pernet_device+0x1d/0x70
>
> Looks like contention on net_mutex or something, but I honestly have
> no idea yet.  I can't recreate it myself at the moment or I would
> bisect.
>
> Has nobody else run into this with the pre-3.18 kernels?  Fedora isn't
> carrying any patches in this area.
>
> josh
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Josh Boyer @ 2014-10-22 17:12 UTC (permalink / raw)
  To: Kevin Fenzi; +Cc: netdev, Linux-Kernel@Vger. Kernel. Org
In-Reply-To: <20141021151225.5df96645@voldemort.scrye.com>

On Tue, Oct 21, 2014 at 5:12 PM, Kevin Fenzi <kevin@scrye.com> wrote:
> On Mon, 20 Oct 2014 14:53:59 -0600
> Kevin Fenzi <kevin@scrye.com> wrote:
>
>> On Mon, 20 Oct 2014 16:43:26 -0400
>> Dave Jones <davej@redhat.com> wrote:
>>
>> > I've seen similar soft lockup traces from the sys_unshare path when
>> > running my fuzz tester.  It seems that if you create enough network
>> > namespaces, it can take a huge amount of time for them to be
>> > iterated. (Running trinity with '-c unshare' you can see the slow
>> > down happen. In some cases, it takes so long that the watchdog
>> > process kills it -- though the SIGKILL won't get delivered until
>> > the unshare() completes)
>> >
>> > Any idea what this machine had been doing prior to this that may
>> > have involved creating lots of namespaces ?
>>
>> That was right after boot. ;)
>>
>> This is my main rawhide running laptop.
>>
>> A 'ip netns list' shows nothing.
>
> Some more information:
>
> The problem started between:
>
> v3.17-7872-g5ff0b9e1a1da and v3.17-8307-gf1d0d14120a8
>
> (I can try and do a bisect, but have to head out on a trip tomorrow)
>
> In all the kernels with the problem, there is a kworker process in D.
>
> sysrq-t says:
>                                             Showing all locks held in the system:
> Oct 21 15:06:31 voldemort.scrye.com kernel: 4 locks held by kworker/u16:0/6:
> Oct 21 15:06:31 voldemort.scrye.com kernel:  #0:  ("%s""netns"){.+.+.+}, at: [<ffffffff810ccbff>] process_one_work+0x17f/0x850
> Oct 21 15:06:31 voldemort.scrye.com kernel:  #1:  (net_cleanup_work){+.+.+.}, at: [<ffffffff810ccbff>] process_one_work+0x17f/0x850
> Oct 21 15:06:31 voldemort.scrye.com kernel:  #2:  (net_mutex){+.+.+.}, at: [<ffffffff817069fc>] cleanup_net+0x8c/0x1f0
> Oct 21 15:06:31 voldemort.scrye.com kernel:  #3:
> (rcu_sched_state.barrier_mutex){+.+...}, at: [<ffffffff8112a395>]
> _rcu_barrier+0x35/0x200
>
> On first running any of the systemd units that use PrivateNetwork, then
> run ok, but they are also set to timeout after a minute. On sucessive
> runs they hang in D also.

Someone else is seeing this when they try and modprobe ppp_generic:

[  240.599195] INFO: task kworker/u16:5:100 blocked for more than 120 seconds.
[  240.599338]       Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
[  240.599446] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  240.599583] kworker/u16:5   D ffff8802202db480 12400   100      2 0x00000000
[  240.599744] Workqueue: netns cleanup_net
[  240.599823]  ffff8802202eb9e8 0000000000000096 ffff8802202db480
00000000001d5f00
[  240.600066]  ffff8802202ebfd8 00000000001d5f00 ffff8800368c3480
ffff8802202db480
[  240.600228]  ffffffff81ee2690 7fffffffffffffff ffffffff81ee2698
ffffffff81ee2690
[  240.600386] Call Trace:
[  240.600445]  [<ffffffff8185e239>] schedule+0x29/0x70
[  240.600541]  [<ffffffff8186345c>] schedule_timeout+0x26c/0x410
[  240.600651]  [<ffffffff81865ef7>] ? retint_restore_args+0x13/0x13
[  240.600765]  [<ffffffff818644e4>] ? _raw_spin_unlock_irq+0x34/0x50
[  240.600879]  [<ffffffff8185fc6c>] wait_for_completion+0x10c/0x150
[  240.601025]  [<ffffffff810e53e0>] ? wake_up_state+0x20/0x20
[  240.601133]  [<ffffffff8112a749>] _rcu_barrier+0x159/0x200
[  240.601237]  [<ffffffff8112a845>] rcu_barrier+0x15/0x20
[  240.601335]  [<ffffffff81718ebf>] netdev_run_todo+0x6f/0x310
[  240.601442]  [<ffffffff8170da85>] ? rollback_registered_many+0x265/0x2e0
[  240.601564]  [<ffffffff81725f2e>] rtnl_unlock+0xe/0x10
[  240.601660]  [<ffffffff8170f8e6>] default_device_exit_batch+0x156/0x180
[  240.601781]  [<ffffffff810fd8a0>] ? abort_exclusive_wait+0xb0/0xb0
[  240.601895]  [<ffffffff81707993>] ops_exit_list.isra.1+0x53/0x60
[  240.602028]  [<ffffffff81708540>] cleanup_net+0x100/0x1f0
[  240.602131]  [<ffffffff810ccfa8>] process_one_work+0x218/0x850
[  240.602241]  [<ffffffff810ccf0f>] ? process_one_work+0x17f/0x850
[  240.602350]  [<ffffffff810cd6c7>] ? worker_thread+0xe7/0x4a0
[  240.602454]  [<ffffffff810cd64b>] worker_thread+0x6b/0x4a0
[  240.602555]  [<ffffffff810cd5e0>] ? process_one_work+0x850/0x850
[  240.602665]  [<ffffffff810d399b>] kthread+0x10b/0x130
[  240.602762]  [<ffffffff81028cc9>] ? sched_clock+0x9/0x10
[  240.602862]  [<ffffffff810d3890>] ? kthread_create_on_node+0x250/0x250
[  240.603004]  [<ffffffff818651fc>] ret_from_fork+0x7c/0xb0
[  240.603106]  [<ffffffff810d3890>] ? kthread_create_on_node+0x250/0x250
[  240.603224] 4 locks held by kworker/u16:5/100:
[  240.603304]  #0:  ("%s""netns"){.+.+.+}, at: [<ffffffff810ccf0f>]
process_one_work+0x17f/0x850
[  240.603495]  #1:  (net_cleanup_work){+.+.+.}, at:
[<ffffffff810ccf0f>] process_one_work+0x17f/0x850
[  240.603691]  #2:  (net_mutex){+.+.+.}, at: [<ffffffff817084cc>]
cleanup_net+0x8c/0x1f0
[  240.603869]  #3:  (rcu_sched_state.barrier_mutex){+.+...}, at:
[<ffffffff8112a625>] _rcu_barrier+0x35/0x200
[  240.604211] INFO: task modprobe:1387 blocked for more than 120 seconds.
[  240.604329]       Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
[  240.604434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  240.604570] modprobe        D ffff8800cb4f1a40 13112  1387   1386 0x00000080
[  240.604719]  ffff8800cafbbbe8 0000000000000096 ffff8800cb4f1a40
00000000001d5f00
[  240.604878]  ffff8800cafbbfd8 00000000001d5f00 ffff880223280000
ffff8800cb4f1a40
[  240.605068]  ffff8800cb4f1a40 ffffffff81f8fb48 0000000000000246
ffff8800cb4f1a40
[  240.605228] Call Trace:
[  240.605283]  [<ffffffff8185e7e1>] schedule_preempt_disabled+0x31/0x80
[  240.605400]  [<ffffffff81860033>] mutex_lock_nested+0x183/0x440
[  240.605510]  [<ffffffff8170835f>] ? register_pernet_subsys+0x1f/0x50
[  240.605626]  [<ffffffff8170835f>] ? register_pernet_subsys+0x1f/0x50
[  240.605757]  [<ffffffffa0701000>] ? 0xffffffffa0701000
[  240.605854]  [<ffffffff8170835f>] register_pernet_subsys+0x1f/0x50
[  240.606005]  [<ffffffffa0701048>] br_init+0x48/0xd3 [bridge]
[  240.606112]  [<ffffffff81002148>] do_one_initcall+0xd8/0x210
[  240.606224]  [<ffffffff81153c02>] load_module+0x20c2/0x2870
[  240.606327]  [<ffffffff8114ebe0>] ? store_uevent+0x70/0x70
[  240.606433]  [<ffffffff8110ac26>] ? lock_release_non_nested+0x3c6/0x3d0
[  240.606557]  [<ffffffff81154497>] SyS_init_module+0xe7/0x140
[  240.606664]  [<ffffffff818652a9>] system_call_fastpath+0x12/0x17
[  240.606773] 1 lock held by modprobe/1387:
[  240.606845]  #0:  (net_mutex){+.+.+.}, at: [<ffffffff8170835f>]
register_pernet_subsys+0x1f/0x50
[  240.607114] INFO: task modprobe:1466 blocked for more than 120 seconds.
[  240.607231]       Not tainted 3.18.0-0.rc1.git2.1.fc22.x86_64 #1
[  240.607337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  240.607473] modprobe        D ffff88020fbab480 13096  1466   1399 0x00000084
[  240.607622]  ffff88020d1bbbe8 0000000000000096 ffff88020fbab480
00000000001d5f00
[  240.607791]  ffff88020d1bbfd8 00000000001d5f00 ffffffff81e1b580
ffff88020fbab480
[  240.607949]  ffff88020fbab480 ffffffff81f8fb48 0000000000000246
ffff88020fbab480
[  240.608138] Call Trace:
[  240.608193]  [<ffffffff8185e7e1>] schedule_preempt_disabled+0x31/0x80
[  240.608316]  [<ffffffff81860033>] mutex_lock_nested+0x183/0x440
[  240.608425]  [<ffffffff817083ad>] ? register_pernet_device+0x1d/0x70
[  240.608542]  [<ffffffff817083ad>] ? register_pernet_device+0x1d/0x70
[  240.608662]  [<ffffffffa071d000>] ? 0xffffffffa071d000
[  240.608759]  [<ffffffff817083ad>] register_pernet_device+0x1d/0x70
[  240.608881]  [<ffffffffa071d020>] ppp_init+0x20/0x1000 [ppp_generic]
[  240.609021]  [<ffffffff81002148>] do_one_initcall+0xd8/0x210
[  240.609131]  [<ffffffff81153c02>] load_module+0x20c2/0x2870
[  240.609235]  [<ffffffff8114ebe0>] ? store_uevent+0x70/0x70
[  240.609339]  [<ffffffff8110ac26>] ? lock_release_non_nested+0x3c6/0x3d0
[  240.609462]  [<ffffffff81154497>] SyS_init_module+0xe7/0x140
[  240.609568]  [<ffffffff818652a9>] system_call_fastpath+0x12/0x17
[  240.609677] 1 lock held by modprobe/1466:
[  240.609749]  #0:  (net_mutex){+.+.+.}, at: [<ffffffff817083ad>]
register_pernet_device+0x1d/0x70

Looks like contention on net_mutex or something, but I honestly have
no idea yet.  I can't recreate it myself at the moment or I would
bisect.

Has nobody else run into this with the pre-3.18 kernels?  Fedora isn't
carrying any patches in this area.

josh

^ permalink raw reply

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-22 17:11 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev
In-Reply-To: <CAHA+R7OTmmeUf=_LaN1_isL1ZKiqcS0pYE0eKpRbPhaPeX-2jQ@mail.gmail.com>

Hi,

On Wed, Oct 22, 2014 at 6:56 PM, Cong Wang <cwang@twopensource.com> wrote:
> At least split TCP is transparent to applications, while your approach is not.
> I don't understand why you said it typically operates on some ports, since
> TCP is stateful.

I see that I might have used the wrong word here. I am use to calling
them TCP splitters, but I see that the devices are also referred to as
transparent TCP proxies. Anyhow, they are still transparent, but they
violate end-to-end (even though I guess that is pretty common
now-a-days).

What I mean by the port-comment is that only connections to some ports
are proxied/split. For example, one of the operators in Norway only
proxy port 80, so any HTTPS transfer risk getting stuck after a
temporary disconnect.

-Kristian

^ permalink raw reply

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-22 17:08 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development
In-Reply-To: <20141022.121418.1477947755244353588.davem@davemloft.net>

Hi,

On Wed, Oct 22, 2014 at 6:14 PM, David Miller <davem@davemloft.net> wrote:
> Instead, I would expect the device layer to trigger a notification
> during a "technology change" or whatever you want to call losing
> connectivity, whichi TCP can receive and use to start sending zero
> windows over all TCP connections using that path.

I totally agree that this is ideally something that should be
controlled by the device layer. However, these temporary disconnects
are not visible through any normal link events (like link down, loss
of address, ...). The only way to detect the events is to parse meta
data coming from devices and look at traffic statistics. This would
involve for example adding parsing of the different mobile broadband
protocols (QMI, MBIM, and so on) to the device layer. When looking at
for example the commits for the QMI driver, parsing QMI messages seems
to have intentionally been left up to user space applications to avoid
bloating driver.

> And therefore there should be a global option that turns this on for
> the entire system by default.
>
> This requires a lot more work than you have done here, you need to
> add all the notification handling, the logic in TCP to look at the
> attached route on send and trigger zero window probes if the device
> event has happened, etc.

Another approach I designed was to have a separate TCP Freeze module
and trigger the freeze/unfreeze through genetlink-messages. A user
space application will be responsible for monitoring the devices and
decide when to trigger the ZWAs. Would a design like that be
acceptable?

-Kristian

^ permalink raw reply

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Cong Wang @ 2014-10-22 16:56 UTC (permalink / raw)
  To: Kristian Evensen; +Cc: netdev
In-Reply-To: <1413992196-4891-1-git-send-email-kristian.evensen@gmail.com>

On Wed, Oct 22, 2014 at 8:36 AM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> From: Kristian Evensen <kristian.evensen@gmail.com>
>
> This patch introduces support for Freeze-TCP [1].
>
> Devices that are mobile frequently experience temporary disconnects, for example
> due to signal fading or a technology change. These changes can last for a
> substantial amount of time (>10 seconds), potentially causing multiple RTOs to
> expire and the sender to enter slow start. Even though a device has
> reconnected, it can take a long time for the TCP connection to recover.
>
> Operators of mobile broadband networks mitigate this issue by placing TCP
> splitters at the edge of their networks. However, the splitters typically only
> operate on some ports (mostly only port 80) and violate the end-to-end
> principle. The operator's TCP splitter receives a notification when a temporary
> disconnect occurs and starts sending Zero Window Announcements (ZWA) to the
> remote part of the connection. When a devices regains connectivity, the window
> is reopened.

At least split TCP is transparent to applications, while your approach is not.
I don't understand why you said it typically operates on some ports, since
TCP is stateful.

BTW, AFAIK Linux doesn't support split TCP.

^ permalink raw reply

* Re: [PATCH] net: tso: fix unaligned access to crafted TCP header in helper API
From: David Miller @ 2014-10-22 16:53 UTC (permalink / raw)
  To: karl.beldan; +Cc: netdev, karl.beldan, ezequiel.garcia
In-Reply-To: <1413900365-12289-1-git-send-email-karl.beldan@gmail.com>

From: Karl Beldan <karl.beldan@gmail.com>
Date: Tue, 21 Oct 2014 16:06:05 +0200

> From: Karl Beldan <karl.beldan@rivierawaves.com>
> 
> The crafted header start address is from a driver supplied buffer, which
> one can reasonably expect to be aligned on a 4-bytes boundary.
> However ATM the TSO helper API is only used by ethernet drivers and
> the tcp header will then be aligned to a 2-bytes only boundary from the
> header start address.
> 
> Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] sfc: remove incorrect EFX_BUG_ON_PARANOID check
From: David Miller @ 2014-10-22 16:51 UTC (permalink / raw)
  To: ecree; +Cc: netdev, sshah, jcooper, linux-net-drivers
In-Reply-To: <alpine.LFD.2.03.1410211353290.26215@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 21 Oct 2014 14:50:29 +0100

> From: Jon Cooper <jcooper@solarflare.com>
> 
> write_count and insert_count can wrap around, making > check invalid.
> 
> Fixes: 70b33fb0ddec827cbbd14cdc664fc27b2ef4a6b6 ("sfc: add support for
>  skb->xmit_more").
> 
> Signed-off-by: Edward Cree <ecree@solarflare.com>

Applied, thanks.

^ permalink raw reply

* Re: [net-next 1/2] ip6_tunnel: put ip6tnl0 FB device into 'any' mode
From: David Miller @ 2014-10-22 16:47 UTC (permalink / raw)
  To: alan; +Cc: netdev, edumazet
In-Reply-To: <1413879088-13513-1-git-send-email-alan@al-an.info>

From: "Alexey Andriyanov" <alan@al-an.info>
Date: Tue, 21 Oct 2014 12:11:27 +0400

> The fallback device is in ipv6 mode by default.
> The mode can not be changed in runtime, so there
> is no way to decapsulate ip4in6 packets coming from
> various sources without creating the specific tunnel
> ifaces for each peer.
> 
> Cc: David Miller <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexey Andriyanov <alan@al-an.info>

I don't think you can legitimately change this after all
these years.  You'll break someone's setup somehow.

You're going to have to find a way to achieve what you
want whilst keeping the default unchanged.

Sorry.

^ permalink raw reply

* Re: [PATCH] netfilter: log: protect nf_log_register against double registering
From: Pablo Neira Ayuso @ 2014-10-22 16:44 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: netfilter-devel, netdev
In-Reply-To: <5447A429.3030608@redhat.com>

On Wed, Oct 22, 2014 at 10:33:45AM -0200, Marcelo Ricardo Leitner wrote:
> On 22-10-2014 10:02, Pablo Neira Ayuso wrote:
> >On Mon, Oct 20, 2014 at 07:58:03PM -0200, Marcelo Ricardo Leitner wrote:
> >>Currently, despite the comment right before the function,
> >>nf_log_register allows registering two loggers on with the same type and
> >>end up overwriting the previous register.
> >>
> >>Not a real issue today as current tree doesn't have two loggers for the
> >>same type but it's better to get this protected.
> >>
> >>Also make sure that all of its callers do error checking.
> >
> >No major objetions to this sanity check. Some comment below.
> >
> >>Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> >>---
> >>
> >>Notes:
> >>     Please let me know if you have any issues with the identation on
> >>     nf_log_register. I just couldn't find a better one.
> >
> >You can split nf_log_register() in two functions to avoid this.
> 
> Sorry but I don't follow this one. You mean having the check on
> nf_log_register() and then calling a __nf_log_register() to actually
> register it?
> 
> Now I'm thinking on wrapping
>             rcu_dereference_protected(loggers[i][logger->type],
> +					lockdep_is_held(&nf_log_mutex))
> into a macro or something like that, because that's the issue in
> there and this construction is called several times. Something like:
> 
> #define logger_deref_protected(pf, type) \
>         rcu_dereference_protected(loggers[pf][type], \
>                                   lockdep_is_held(&nf_log_mutex));
> 
> WDYT?

Seems OK, I think this can be:

#define nft_log_dereference(logger)

So you can use this both from net->nf.nf_loggers[x] and
loggers[x][y] and we have one single macro and we avoid the indent
issues.

Thanks.

^ permalink raw reply

* [PATCH net v1 0/2] amd-xgbe: AMD XGBE driver fixes 2014-10-22
From: Tom Lendacky @ 2014-10-22 16:26 UTC (permalink / raw)
  To: netdev; +Cc: davem

The following series of patches includes fixes to the driver.

- Properly handle feature changes via ethtool by using correctly sized
  variables
- Perform proper napi packet counting and budget checking

This patch series is based on net.

---

Tom Lendacky (2):
      amd-xgbe: Properly handle feature changes via ethtool
      amd-xgbe: Fix napi Rx budget accounting


 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

-- 
Tom Lendacky

^ permalink raw reply

* [PATCH net v1 2/2] amd-xgbe: Fix napi Rx budget accounting
From: Tom Lendacky @ 2014-10-22 16:26 UTC (permalink / raw)
  To: netdev; +Cc: davem
In-Reply-To: <20141022162605.31495.98889.stgit@tlendack-t1.amdoffice.net>

Currently the amd-xgbe driver increments the packets processed counter
each time a descriptor is processed.  Since a packet can be represented
by more than one descriptor incrementing the counter in this way is not
appropriate.  Also, since multiple descriptors cause the budget check
to be short circuited, sometimes the returned value from the poll
function would be larger than the budget value resulting in a WARN_ONCE
being triggered.

Update the polling logic to properly account for the number of packets
processed and exit when the budget value is reached.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index a480b23..2349ea9 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1598,7 +1598,8 @@ static int xgbe_rx_poll(struct xgbe_channel *channel, int budget)
 	struct skb_shared_hwtstamps *hwtstamps;
 	unsigned int incomplete, error, context_next, context;
 	unsigned int len, put_len, max_len;
-	int received = 0;
+	unsigned int received = 0;
+	int packet_count = 0;
 
 	DBGPR("-->xgbe_rx_poll: budget=%d\n", budget);
 
@@ -1608,7 +1609,7 @@ static int xgbe_rx_poll(struct xgbe_channel *channel, int budget)
 
 	rdata = XGBE_GET_DESC_DATA(ring, ring->cur);
 	packet = &ring->packet_data;
-	while (received < budget) {
+	while (packet_count < budget) {
 		DBGPR("  cur = %d\n", ring->cur);
 
 		/* First time in loop see if we need to restore state */
@@ -1662,7 +1663,7 @@ read_again:
 			if (packet->errors)
 				DBGPR("Error in received packet\n");
 			dev_kfree_skb(skb);
-			continue;
+			goto next_packet;
 		}
 
 		if (!context) {
@@ -1677,7 +1678,7 @@ read_again:
 					}
 
 					dev_kfree_skb(skb);
-					continue;
+					goto next_packet;
 				}
 				memcpy(skb_tail_pointer(skb), rdata->skb->data,
 				       put_len);
@@ -1694,7 +1695,7 @@ read_again:
 
 		/* Stray Context Descriptor? */
 		if (!skb)
-			continue;
+			goto next_packet;
 
 		/* Be sure we don't exceed the configured MTU */
 		max_len = netdev->mtu + ETH_HLEN;
@@ -1705,7 +1706,7 @@ read_again:
 		if (skb->len > max_len) {
 			DBGPR("packet length exceeds configured MTU\n");
 			dev_kfree_skb(skb);
-			continue;
+			goto next_packet;
 		}
 
 #ifdef XGMAC_ENABLE_RX_PKT_DUMP
@@ -1739,6 +1740,9 @@ read_again:
 
 		netdev->last_rx = jiffies;
 		napi_gro_receive(&pdata->napi, skb);
+
+next_packet:
+		packet_count++;
 	}
 
 	/* Check if we need to save state before leaving */
@@ -1752,9 +1756,9 @@ read_again:
 		rdata->state.error = error;
 	}
 
-	DBGPR("<--xgbe_rx_poll: received = %d\n", received);
+	DBGPR("<--xgbe_rx_poll: packet_count = %d\n", packet_count);
 
-	return received;
+	return packet_count;
 }
 
 static int xgbe_poll(struct napi_struct *napi, int budget)

^ permalink raw reply related

* [PATCH net v1 1/2] amd-xgbe: Properly handle feature changes via ethtool
From: Tom Lendacky @ 2014-10-22 16:26 UTC (permalink / raw)
  To: netdev; +Cc: davem
In-Reply-To: <20141022162605.31495.98889.stgit@tlendack-t1.amdoffice.net>

The ndo_set_features callback function was improperly using an unsigned
int to save the current feature value for features such as NETIF_F_RXCSUM.
Since that feature is in the upper 32 bits of a 64 bit variable the
result was always 0 making it not possible to actually turn off the
hardware RX checksum support.  Change the unsigned int type to the
netdev_features_t type in order to properly capture the current value
and perform the proper operation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 2955499..a480b23 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1465,7 +1465,7 @@ static int xgbe_set_features(struct net_device *netdev,
 {
 	struct xgbe_prv_data *pdata = netdev_priv(netdev);
 	struct xgbe_hw_if *hw_if = &pdata->hw_if;
-	unsigned int rxcsum, rxvlan, rxvlan_filter;
+	netdev_features_t rxcsum, rxvlan, rxvlan_filter;
 
 	rxcsum = pdata->netdev_features & NETIF_F_RXCSUM;
 	rxvlan = pdata->netdev_features & NETIF_F_HW_VLAN_CTAG_RX;

^ permalink raw reply related

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: David Miller @ 2014-10-22 16:14 UTC (permalink / raw)
  To: kristian.evensen; +Cc: netdev
In-Reply-To: <1413992196-4891-1-git-send-email-kristian.evensen@gmail.com>

From: Kristian Evensen <kristian.evensen@gmail.com>
Date: Wed, 22 Oct 2014 17:36:36 +0200

> This patch introduces support for Freeze-TCP [1].

By your description I would not expect the application to get involved
with the actual final zero window advertisement decision at all.

Instead, I would expect the device layer to trigger a notification
during a "technology change" or whatever you want to call losing
connectivity, whichi TCP can receive and use to start sending zero
windows over all TCP connections using that path.

So the socket option enables or disables the facility, but doesn't
actually trigger the zero window advertisement.  A real device based
event does that.

The application has no business watching for the loss of connectivity,
and I am certain you do not want that logice in every application in
order for it to take advantage of this.

And therefore there should be a global option that turns this on for
the entire system by default.

This requires a lot more work than you have done here, you need to
add all the notification handling, the logic in TCP to look at the
attached route on send and trigger zero window probes if the device
event has happened, etc.

^ permalink raw reply

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-22 16:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development
In-Reply-To: <1413993272.9031.3.camel@edumazet-glaptop2.roam.corp.google.com>

Hi,

On Wed, Oct 22, 2014 at 5:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This asymmetry looks strange
>
> Following sequence should be allowed :
>
> getsockopt(...  TCP_FREEZE, &val, ...)
> setsockopt(...  TCP_FREEZE, &val, ...)
>
> So setsockopt() should accept val = 0

Thanks for you comment and I agree. The reasoning behind my original
ordering was that I wanted the values to be in the order which made
most logical sense to me, which is Enable (1), Disable (2) and Disable
with TR-ACK (3). However, I see now that when using the option and
when combined with getsockopt(), this does not make much sense. I will
wait for some more feedback and send a revised version tomorrow with
the following ordering: Disable (0), Enable (1), Disable with TR-ACK
(2).

Thanks again,
Kristian

^ permalink raw reply

* [PATCH] net: fix saving TX flow hash in sock for outgoing connections
From: Sathya Perla @ 2014-10-22 16:12 UTC (permalink / raw)
  To: netdev; +Cc: therbert

The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
to set skb->hash which is used to spread flows across multiple TXQs.

But, the above routines are invoked before the source port of the connection
is created. Because of this all outgoing connections that just differ in the
source port get hashed into the same TXQ.

This patch fixes this problem for IPv4/6 by invoking the the above routines
after the source port is available for the socket.

Fixes: b73c3d0e4("net: Save TX flow hash in sock and set in skbuf on xmit")

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 net/ipv4/tcp_ipv4.c |    4 ++--
 net/ipv6/tcp_ipv6.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 94d1a77..9c7d762 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -206,8 +206,6 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	inet->inet_dport = usin->sin_port;
 	inet->inet_daddr = daddr;
 
-	inet_set_txhash(sk);
-
 	inet_csk(sk)->icsk_ext_hdr_len = 0;
 	if (inet_opt)
 		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
@@ -224,6 +222,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (err)
 		goto failure;
 
+	inet_set_txhash(sk);
+
 	rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
 			       inet->inet_sport, inet->inet_dport, sk);
 	if (IS_ERR(rt)) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8314955..ace29b6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -200,8 +200,6 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	sk->sk_v6_daddr = usin->sin6_addr;
 	np->flow_label = fl6.flowlabel;
 
-	ip6_set_txhash(sk);
-
 	/*
 	 *	TCP over IPv4
 	 */
@@ -297,6 +295,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (err)
 		goto late_failure;
 
+	ip6_set_txhash(sk);
+
 	if (!tp->write_seq && likely(!tp->repair))
 		tp->write_seq = secure_tcpv6_sequence_number(np->saddr.s6_addr32,
 							     sk->sk_v6_daddr.s6_addr32,
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Eric Dumazet @ 2014-10-22 15:54 UTC (permalink / raw)
  To: Kristian Evensen; +Cc: netdev
In-Reply-To: <1413992196-4891-1-git-send-email-kristian.evensen@gmail.com>

On Wed, 2014-10-22 at 17:36 +0200, Kristian Evensen wrote:
> From: Kristian Evensen <kristian.evensen@gmail.com>
> 
> This patch introduces support for Freeze-TCP [1].
> 
> Devices that are mobile frequently experience temporary disconnects, for example
> due to signal fading or a technology change. These changes can last for a
> substantial amount of time (>10 seconds), potentially causing multiple RTOs to
> expire and the sender to enter slow start. Even though a device has
> reconnected, it can take a long time for the TCP connection to recover.
> 
> Operators of mobile broadband networks mitigate this issue by placing TCP
> splitters at the edge of their networks. However, the splitters typically only
> operate on some ports (mostly only port 80) and violate the end-to-end
> principle. The operator's TCP splitter receives a notification when a temporary
> disconnect occurs and starts sending Zero Window Announcements (ZWA) to the
> remote part of the connection. When a devices regains connectivity, the window
> is reopened.
> 
> Freeze-TCP is a client-side only approach for enabling application developers to
> trigger sending ZWAs. It is implemented as a socket option and accepts three
> different values. If the value is set to one, the connection is frozen. A ZWA is
> sent and the window size set to 0 in any reply to additional packets arriving
> from remote party. If the value is set to two, the connection is unfrozen and a
> window update announcement is sent. If the value is set to three, two additional
> window update announcements are sent. This is referred to as TR-ACK in the paper
> and is used to increase probability that a window update announcement will be
> received.
> 
> When to trigger Freeze-TCP depends on the application requirements and
> underlaying network, is not the responsibility of the kernel. One approach is to
> have the application, or a daemon, analyze the meta data exported from a mobile
> broadband modem. A temporary disconnect can often be detected in advance by
> looking at different statistics.
> 
> [1] - T. Goff, J. Moronski, D. S. Phatak, and V. Gupta, "Freeze-TCP: a True
> End-to-end TCP Enhancement Mechanism for Mobile Environments," In Proceedings of
> IEEE INFOCOM 2000. URL: http://www.csee.umbc.edu/~phatak/publications/ftcp.pdf
> 
> Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
> ---
>  include/linux/tcp.h      |  3 ++-
>  include/uapi/linux/tcp.h |  1 +
>  net/ipv4/tcp.c           | 33 +++++++++++++++++++++++++++++++++
>  net/ipv4/tcp_output.c    |  8 +++++++-
>  4 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index c2dee7d..7ed26c1 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -187,7 +187,8 @@ struct tcp_sock {
>  		syn_data:1,	/* SYN includes data */
>  		syn_fastopen:1,	/* SYN includes Fast Open option */
>  		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
> -		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
> +		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
> +		frozen:1;	/* Artifically deflate announced window to 0 */
>  	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
>  
>  /* RTT measurement */
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 3b97183..bc0684d 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -112,6 +112,7 @@ enum {
>  #define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
>  #define TCP_TIMESTAMP		24
>  #define TCP_NOTSENT_LOWAT	25	/* limit number of unsent bytes in write queue */
> +#define TCP_FREEZE		26	/* Freeze TCP connection by sending ZWA */
>  
>  struct tcp_repair_opt {
>  	__u32	opt_code;
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1bec4e7..5bf30d0 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2339,6 +2339,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>  	int val;
>  	int err = 0;
> +	u8 itr = 0;
>  
>  	/* These are data/string values, all the others are ints */
>  	switch (optname) {
> @@ -2600,6 +2601,35 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  		tp->notsent_lowat = val;
>  		sk->sk_write_space(sk);
>  		break;
> +	case TCP_FREEZE:
> +		if (val < 1 || val > 3 ||
> +		    !((1 << sk->sk_state) & TCPF_ESTABLISHED)) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		if (val == 1) {
> +			tp->frozen = 1;
> +			tcp_send_ack(sk);
> +			break;
> +		} else if (!tp->frozen) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		tp->frozen = 0;
> +		tcp_send_ack(sk);
> +
> +		if (val == 2)
> +			break;
> +
> +		/* If val is three, send two additional reconnection ACKs to
> +		 * increase chance of a non-zero windows announcement arriving.
> +		 */
> +		for (itr = 0; itr < 2; itr++)
> +			tcp_send_ack(sk);
> +
> +		break;
>  	default:
>  		err = -ENOPROTOOPT;
>  		break;
> @@ -2832,6 +2862,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>  	case TCP_NOTSENT_LOWAT:
>  		val = tp->notsent_lowat;
>  		break;
> +	case TCP_FREEZE:
> +		val = tp->frozen;
> +		break;
>  	default:
>  		return -ENOPROTOOPT;
>  	}


This asymmetry looks strange

Following sequence should be allowed :

getsockopt(...  TCP_FREEZE, &val, ...)
setsockopt(...  TCP_FREEZE, &val, ...)

So setsockopt() should accept val = 0

^ permalink raw reply

* [PATCH RFC v2 10/16] virtio_net: use v1.0 endian.
From: Michael S. Tsirkin @ 2014-10-22 15:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1413992894-22976-1-git-send-email-mst@redhat.com>

From: Rusty Russell <rusty@rustcorp.com.au>

[Cornelia Huck: converted some missed fields]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d75256bd..2e6561e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -347,13 +347,14 @@ err:
 }
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
+					 struct virtnet_info *vi,
 					 struct receive_queue *rq,
 					 unsigned long ctx,
 					 unsigned int len)
 {
 	void *buf = mergeable_ctx_to_buf_address(ctx);
 	struct skb_vnet_hdr *hdr = buf;
-	int num_buf = hdr->mhdr.num_buffers;
+	u16 num_buf = virtio16_to_cpu(rq->vq->vdev, hdr->mhdr.num_buffers);
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
 	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
@@ -369,7 +370,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
 		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
-				 dev->name, num_buf, hdr->mhdr.num_buffers);
+				 dev->name, num_buf,
+				 virtio16_to_cpu(rq->vq->vdev,
+						 hdr->mhdr.num_buffers));
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
@@ -454,7 +457,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	}
 
 	if (vi->mergeable_rx_bufs)
-		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
+		skb = receive_mergeable(dev, vi, rq, (unsigned long)buf, len);
 	else if (vi->big_packets)
 		skb = receive_big(dev, rq, buf, len);
 	else
@@ -473,8 +476,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 		pr_debug("Needs csum!\n");
 		if (!skb_partial_csum_set(skb,
-					  hdr->hdr.csum_start,
-					  hdr->hdr.csum_offset))
+			  virtio16_to_cpu(vi->vdev, hdr->hdr.csum_start),
+			  virtio16_to_cpu(vi->vdev, hdr->hdr.csum_offset)))
 			goto frame_err;
 	} else if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) {
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -505,7 +508,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		if (hdr->hdr.gso_type & VIRTIO_NET_HDR_GSO_ECN)
 			skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
 
-		skb_shinfo(skb)->gso_size = hdr->hdr.gso_size;
+		skb_shinfo(skb)->gso_size = virtio16_to_cpu(vi->vdev,
+							    hdr->hdr.gso_size);
 		if (skb_shinfo(skb)->gso_size == 0) {
 			net_warn_ratelimited("%s: zero gso size.\n", dev->name);
 			goto frame_err;
@@ -867,16 +871,19 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		hdr->hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-		hdr->hdr.csum_start = skb_checksum_start_offset(skb);
-		hdr->hdr.csum_offset = skb->csum_offset;
+		hdr->hdr.csum_start = cpu_to_virtio16(vi->vdev,
+						skb_checksum_start_offset(skb));
+		hdr->hdr.csum_offset = cpu_to_virtio16(vi->vdev,
+							 skb->csum_offset);
 	} else {
 		hdr->hdr.flags = 0;
 		hdr->hdr.csum_offset = hdr->hdr.csum_start = 0;
 	}
 
 	if (skb_is_gso(skb)) {
-		hdr->hdr.hdr_len = skb_headlen(skb);
-		hdr->hdr.gso_size = skb_shinfo(skb)->gso_size;
+		hdr->hdr.hdr_len = cpu_to_virtio16(vi->vdev, skb_headlen(skb));
+		hdr->hdr.gso_size = cpu_to_virtio16(vi->vdev,
+						    skb_shinfo(skb)->gso_size);
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
 			hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 		else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
@@ -1182,7 +1189,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	sg_init_table(sg, 2);
 
 	/* Store the unicast list and count in the front of the buffer */
-	mac_data->entries = uc_count;
+	mac_data->entries = cpu_to_virtio32(vi->vdev, uc_count);
 	i = 0;
 	netdev_for_each_uc_addr(ha, dev)
 		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
@@ -1193,7 +1200,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	/* multicast list and count fill the end */
 	mac_data = (void *)&mac_data->macs[uc_count][0];
 
-	mac_data->entries = mc_count;
+	mac_data->entries = cpu_to_virtio32(vi->vdev, mc_count);
 	i = 0;
 	netdev_for_each_mc_addr(ha, dev)
 		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
-- 
MST

^ permalink raw reply related

* Re: [PATCH] igb: don't reuse pages with pfmemalloc flag
From: Eric Dumazet @ 2014-10-22 15:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: alexander.h.duyck, netdev, sassmann, e1000-devel,
	peter.p.waskiewicz.jr, bruce.w.allan, jesse.brandeburg, davem,
	john.ronciak, gregkh, linux-kernel
In-Reply-To: <1413985819-9553-1-git-send-email-klamm@yandex-team.ru>

On Wed, 2014-10-22 at 17:50 +0400, Roman Gushchin wrote:
> Incoming packet is dropped silently by sk_filter(), if the skb was
> allocated from pfmemalloc reserves and the corresponding socket is
> not marked with the SOCK_MEMALLOC flag.
> 
> Igb driver allocates pages for DMA with __skb_alloc_page(), which
> calls alloc_pages_node() with the __GFP_MEMALLOC flag. So, in case
> of OOM condition, igb can get pages with pfmemalloc flag set.
> 
> If an incoming packet hits the pfmemalloc page and is large enough
> (small packets are copying into the memory, allocated with
> netdev_alloc_skb_ip_align(), so they are not affected), it will be
> dropped.
> 
> This behavior is ok under high memory pressure, but the problem is
> that the igb driver reuses these mapped pages. So, packets are still
> dropping even if all memory issues are gone and there is a plenty
> of free memory.
> 
> In my case, some TCP sessions hang on a small percentage (< 0.1%)
> of machines days after OOMs.
> 
> Fix this by avoiding reuse of such pages.
> 
> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
> ---

Interesting...

It seems we also need to clear skb->pfmemalloc in napi_reuse_skb()




------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-22 15:36 UTC (permalink / raw)
  To: netdev; +Cc: Kristian Evensen

From: Kristian Evensen <kristian.evensen@gmail.com>

This patch introduces support for Freeze-TCP [1].

Devices that are mobile frequently experience temporary disconnects, for example
due to signal fading or a technology change. These changes can last for a
substantial amount of time (>10 seconds), potentially causing multiple RTOs to
expire and the sender to enter slow start. Even though a device has
reconnected, it can take a long time for the TCP connection to recover.

Operators of mobile broadband networks mitigate this issue by placing TCP
splitters at the edge of their networks. However, the splitters typically only
operate on some ports (mostly only port 80) and violate the end-to-end
principle. The operator's TCP splitter receives a notification when a temporary
disconnect occurs and starts sending Zero Window Announcements (ZWA) to the
remote part of the connection. When a devices regains connectivity, the window
is reopened.

Freeze-TCP is a client-side only approach for enabling application developers to
trigger sending ZWAs. It is implemented as a socket option and accepts three
different values. If the value is set to one, the connection is frozen. A ZWA is
sent and the window size set to 0 in any reply to additional packets arriving
from remote party. If the value is set to two, the connection is unfrozen and a
window update announcement is sent. If the value is set to three, two additional
window update announcements are sent. This is referred to as TR-ACK in the paper
and is used to increase probability that a window update announcement will be
received.

When to trigger Freeze-TCP depends on the application requirements and
underlaying network, is not the responsibility of the kernel. One approach is to
have the application, or a daemon, analyze the meta data exported from a mobile
broadband modem. A temporary disconnect can often be detected in advance by
looking at different statistics.

[1] - T. Goff, J. Moronski, D. S. Phatak, and V. Gupta, "Freeze-TCP: a True
End-to-end TCP Enhancement Mechanism for Mobile Environments," In Proceedings of
IEEE INFOCOM 2000. URL: http://www.csee.umbc.edu/~phatak/publications/ftcp.pdf

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
---
 include/linux/tcp.h      |  3 ++-
 include/uapi/linux/tcp.h |  1 +
 net/ipv4/tcp.c           | 33 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_output.c    |  8 +++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c2dee7d..7ed26c1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -187,7 +187,8 @@ struct tcp_sock {
 		syn_data:1,	/* SYN includes data */
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
-		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
+		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
+		frozen:1;	/* Artifically deflate announced window to 0 */
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
 /* RTT measurement */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 3b97183..bc0684d 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -112,6 +112,7 @@ enum {
 #define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
 #define TCP_TIMESTAMP		24
 #define TCP_NOTSENT_LOWAT	25	/* limit number of unsent bytes in write queue */
+#define TCP_FREEZE		26	/* Freeze TCP connection by sending ZWA */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1bec4e7..5bf30d0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2339,6 +2339,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int val;
 	int err = 0;
+	u8 itr = 0;
 
 	/* These are data/string values, all the others are ints */
 	switch (optname) {
@@ -2600,6 +2601,35 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		tp->notsent_lowat = val;
 		sk->sk_write_space(sk);
 		break;
+	case TCP_FREEZE:
+		if (val < 1 || val > 3 ||
+		    !((1 << sk->sk_state) & TCPF_ESTABLISHED)) {
+			err = -EINVAL;
+			break;
+		}
+
+		if (val == 1) {
+			tp->frozen = 1;
+			tcp_send_ack(sk);
+			break;
+		} else if (!tp->frozen) {
+			err = -EINVAL;
+			break;
+		}
+
+		tp->frozen = 0;
+		tcp_send_ack(sk);
+
+		if (val == 2)
+			break;
+
+		/* If val is three, send two additional reconnection ACKs to
+		 * increase chance of a non-zero windows announcement arriving.
+		 */
+		for (itr = 0; itr < 2; itr++)
+			tcp_send_ack(sk);
+
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2832,6 +2862,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_NOTSENT_LOWAT:
 		val = tp->notsent_lowat;
 		break;
+	case TCP_FREEZE:
+		val = tp->frozen;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3af2129..9c1429b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -958,7 +958,13 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		 */
 		th->window	= htons(min(tp->rcv_wnd, 65535U));
 	} else {
-		th->window	= htons(tcp_select_window(sk));
+		/* Because window is only artifically deflated to zero, we
+		 * postpone updating tcp state until connection is unfrozen
+		 */
+		if (unlikely(tp->frozen))
+			th->window = 0;
+		else
+			th->window = htons(tcp_select_window(sk));
 	}
 	th->check		= 0;
 	th->urg_ptr		= 0;
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH] ovs: Turn vports with dependencies into separate modules
From: Thomas Graf @ 2014-10-22 15:29 UTC (permalink / raw)
  To: dev; +Cc: netdev

The internal and netdev vport remain part of openvswitch.ko. Encap
vports including vxlan, gre, and geneve can be built as separate
modules and are loaded on demand. Modules can be unloaded after use.
Datapath ports keep a reference to the vport module during their
lifetime.

Allows to remove the error prone maintenance of the global list
vport_ops_list.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/openvswitch/Kconfig              |  18 +++----
 net/openvswitch/Makefile             |  14 ++---
 net/openvswitch/datapath.c           |  16 +++++-
 net/openvswitch/vport-geneve.c       |  23 +++++++-
 net/openvswitch/vport-gre.c          |  23 +++++++-
 net/openvswitch/vport-internal_dev.c |  17 +++++-
 net/openvswitch/vport-netdev.c       |  14 ++++-
 net/openvswitch/vport-netdev.h       |   3 ++
 net/openvswitch/vport-vxlan.c        |  23 +++++++-
 net/openvswitch/vport.c              | 102 ++++++++++++++++++++++++-----------
 net/openvswitch/vport.h              |  14 +++--
 11 files changed, 199 insertions(+), 68 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index ba3bb82..2a9673e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -29,11 +29,11 @@ config OPENVSWITCH
 	  If unsure, say N.
 
 config OPENVSWITCH_GRE
-	bool "Open vSwitch GRE tunneling support"
+	tristate "Open vSwitch GRE tunneling support"
 	depends on INET
 	depends on OPENVSWITCH
-	depends on NET_IPGRE_DEMUX && !(OPENVSWITCH=y && NET_IPGRE_DEMUX=m)
-	default y
+	depends on NET_IPGRE_DEMUX
+	default OPENVSWITCH
 	---help---
 	  If you say Y here, then the Open vSwitch will be able create GRE
 	  vport.
@@ -43,11 +43,11 @@ config OPENVSWITCH_GRE
 	  If unsure, say Y.
 
 config OPENVSWITCH_VXLAN
-	bool "Open vSwitch VXLAN tunneling support"
+	tristate "Open vSwitch VXLAN tunneling support"
 	depends on INET
 	depends on OPENVSWITCH
-	depends on VXLAN && !(OPENVSWITCH=y && VXLAN=m)
-	default y
+	depends on VXLAN
+	default OPENVSWITCH
 	---help---
 	  If you say Y here, then the Open vSwitch will be able create vxlan vport.
 
@@ -56,11 +56,11 @@ config OPENVSWITCH_VXLAN
 	  If unsure, say Y.
 
 config OPENVSWITCH_GENEVE
-	bool "Open vSwitch Geneve tunneling support"
+	tristate "Open vSwitch Geneve tunneling support"
 	depends on INET
 	depends on OPENVSWITCH
-	depends on GENEVE && !(OPENVSWITCH=y && GENEVE=m)
-	default y
+	depends on GENEVE
+	default OPENVSWITCH
 	---help---
 	  If you say Y here, then the Open vSwitch will be able create geneve vport.
 
diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
index 9a33a27..91b9478 100644
--- a/net/openvswitch/Makefile
+++ b/net/openvswitch/Makefile
@@ -15,14 +15,6 @@ openvswitch-y := \
 	vport-internal_dev.o \
 	vport-netdev.o
 
-ifneq ($(CONFIG_OPENVSWITCH_GENEVE),)
-openvswitch-y += vport-geneve.o
-endif
-
-ifneq ($(CONFIG_OPENVSWITCH_VXLAN),)
-openvswitch-y += vport-vxlan.o
-endif
-
-ifneq ($(CONFIG_OPENVSWITCH_GRE),)
-openvswitch-y += vport-gre.o
-endif
+obj-$(CONFIG_OPENVSWITCH_GENEVE)+= vport-geneve.o
+obj-$(CONFIG_OPENVSWITCH_VXLAN)	+= vport-vxlan.o
+obj-$(CONFIG_OPENVSWITCH_GRE)	+= vport-gre.o
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index e6d7255..aecddb9 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -59,6 +59,7 @@
 #include "vport-netdev.h"
 
 int ovs_net_id __read_mostly;
+EXPORT_SYMBOL(ovs_net_id);
 
 static struct genl_family dp_packet_genl_family;
 static struct genl_family dp_flow_genl_family;
@@ -1764,6 +1765,7 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		return -ENOMEM;
 
 	ovs_lock();
+restart:
 	dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
 	err = -ENODEV;
 	if (!dp)
@@ -1795,8 +1797,11 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 
 	vport = new_vport(&parms);
 	err = PTR_ERR(vport);
-	if (IS_ERR(vport))
+	if (IS_ERR(vport)) {
+		if (err == -EAGAIN)
+			goto restart;
 		goto exit_unlock_free;
+	}
 
 	err = ovs_vport_cmd_fill_info(vport, reply, info->snd_portid,
 				      info->snd_seq, 0, OVS_VPORT_CMD_NEW);
@@ -2112,12 +2117,18 @@ static int __init dp_init(void)
 	if (err)
 		goto error_netns_exit;
 
+	err = ovs_netdev_init();
+	if (err)
+		goto error_unreg_notifier;
+
 	err = dp_register_genl();
 	if (err < 0)
-		goto error_unreg_notifier;
+		goto error_unreg_netdev;
 
 	return 0;
 
+error_unreg_netdev:
+	ovs_netdev_exit();
 error_unreg_notifier:
 	unregister_netdevice_notifier(&ovs_dp_device_notifier);
 error_netns_exit:
@@ -2137,6 +2148,7 @@ error:
 static void dp_cleanup(void)
 {
 	dp_unregister_genl(ARRAY_SIZE(dp_genl_families));
+	ovs_netdev_exit();
 	unregister_netdevice_notifier(&ovs_dp_device_notifier);
 	unregister_pernet_device(&ovs_net_ops);
 	rcu_barrier();
diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
index 106a9d8..70c9765 100644
--- a/net/openvswitch/vport-geneve.c
+++ b/net/openvswitch/vport-geneve.c
@@ -17,6 +17,7 @@
 #include <linux/rculist.h>
 #include <linux/udp.h>
 #include <linux/if_vlan.h>
+#include <linux/module.h>
 
 #include <net/geneve.h>
 #include <net/icmp.h>
@@ -28,6 +29,8 @@
 #include "datapath.h"
 #include "vport.h"
 
+static struct vport_ops ovs_geneve_vport_ops;
+
 /**
  * struct geneve_port - Keeps track of open UDP ports
  * @gs: The socket created for this port number.
@@ -225,11 +228,29 @@ static const char *geneve_get_name(const struct vport *vport)
 	return geneve_port->name;
 }
 
-const struct vport_ops ovs_geneve_vport_ops = {
+static struct vport_ops ovs_geneve_vport_ops = {
 	.type		= OVS_VPORT_TYPE_GENEVE,
 	.create		= geneve_tnl_create,
 	.destroy	= geneve_tnl_destroy,
 	.get_name	= geneve_get_name,
 	.get_options	= geneve_get_options,
 	.send		= geneve_tnl_send,
+	.owner          = THIS_MODULE,
 };
+
+static int __init ovs_geneve_tnl_init(void)
+{
+	return ovs_vport_ops_register(&ovs_geneve_vport_ops);
+}
+
+static void __exit ovs_geneve_tnl_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_geneve_vport_ops);
+}
+
+module_init(ovs_geneve_tnl_init);
+module_exit(ovs_geneve_tnl_exit);
+
+MODULE_DESCRIPTION("OVS: Geneve swiching port");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("vport-type-5");
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index 108b82d..00270b6 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -29,6 +29,7 @@
 #include <linux/jhash.h>
 #include <linux/list.h>
 #include <linux/kernel.h>
+#include <linux/module.h>
 #include <linux/workqueue.h>
 #include <linux/rculist.h>
 #include <net/route.h>
@@ -45,6 +46,8 @@
 #include "datapath.h"
 #include "vport.h"
 
+static struct vport_ops ovs_gre_vport_ops;
+
 /* Returns the least-significant 32 bits of a __be64. */
 static __be32 be64_get_low32(__be64 x)
 {
@@ -281,10 +284,28 @@ static void gre_tnl_destroy(struct vport *vport)
 	gre_exit();
 }
 
-const struct vport_ops ovs_gre_vport_ops = {
+static struct vport_ops ovs_gre_vport_ops = {
 	.type		= OVS_VPORT_TYPE_GRE,
 	.create		= gre_create,
 	.destroy	= gre_tnl_destroy,
 	.get_name	= gre_get_name,
 	.send		= gre_tnl_send,
+	.owner		= THIS_MODULE,
 };
+
+static int __init ovs_gre_tnl_init(void)
+{
+	return ovs_vport_ops_register(&ovs_gre_vport_ops);
+}
+
+static void __exit ovs_gre_tnl_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_gre_vport_ops);
+}
+
+module_init(ovs_gre_tnl_init);
+module_exit(ovs_gre_tnl_exit);
+
+MODULE_DESCRIPTION("OVS: GRE switching port");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("vport-type-3");
diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 8451612..10dc07e 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -36,6 +36,8 @@ struct internal_dev {
 	struct vport *vport;
 };
 
+static struct vport_ops ovs_internal_vport_ops;
+
 static struct internal_dev *internal_dev_priv(struct net_device *netdev)
 {
 	return netdev_priv(netdev);
@@ -238,7 +240,7 @@ static int internal_dev_recv(struct vport *vport, struct sk_buff *skb)
 	return len;
 }
 
-const struct vport_ops ovs_internal_vport_ops = {
+static struct vport_ops ovs_internal_vport_ops = {
 	.type		= OVS_VPORT_TYPE_INTERNAL,
 	.create		= internal_dev_create,
 	.destroy	= internal_dev_destroy,
@@ -261,10 +263,21 @@ struct vport *ovs_internal_dev_get_vport(struct net_device *netdev)
 
 int ovs_internal_dev_rtnl_link_register(void)
 {
-	return rtnl_link_register(&internal_dev_link_ops);
+	int err;
+
+	err = rtnl_link_register(&internal_dev_link_ops);
+	if (err < 0)
+		return err;
+
+	err = ovs_vport_ops_register(&ovs_internal_vport_ops);
+	if (err < 0)
+		rtnl_link_unregister(&internal_dev_link_ops);
+
+	return err;
 }
 
 void ovs_internal_dev_rtnl_link_unregister(void)
 {
+	ovs_vport_ops_unregister(&ovs_internal_vport_ops);
 	rtnl_link_unregister(&internal_dev_link_ops);
 }
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index d21f77d..877ee74 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -33,6 +33,8 @@
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
 
+static struct vport_ops ovs_netdev_vport_ops;
+
 /* Must be called with rcu_read_lock. */
 static void netdev_port_receive(struct vport *vport, struct sk_buff *skb)
 {
@@ -224,10 +226,20 @@ struct vport *ovs_netdev_get_vport(struct net_device *dev)
 		return NULL;
 }
 
-const struct vport_ops ovs_netdev_vport_ops = {
+static struct vport_ops ovs_netdev_vport_ops = {
 	.type		= OVS_VPORT_TYPE_NETDEV,
 	.create		= netdev_create,
 	.destroy	= netdev_destroy,
 	.get_name	= ovs_netdev_get_name,
 	.send		= netdev_send,
 };
+
+int __init ovs_netdev_init(void)
+{
+	return ovs_vport_ops_register(&ovs_netdev_vport_ops);
+}
+
+void ovs_netdev_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_netdev_vport_ops);
+}
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index 8df01c11..6f7038e 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -41,4 +41,7 @@ netdev_vport_priv(const struct vport *vport)
 const char *ovs_netdev_get_name(const struct vport *);
 void ovs_netdev_detach_dev(struct vport *);
 
+int __init ovs_netdev_init(void);
+void ovs_netdev_exit(void);
+
 #endif /* vport_netdev.h */
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 2735e01..965e750 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -24,6 +24,7 @@
 #include <linux/net.h>
 #include <linux/rculist.h>
 #include <linux/udp.h>
+#include <linux/module.h>
 
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -50,6 +51,8 @@ struct vxlan_port {
 	char name[IFNAMSIZ];
 };
 
+static struct vport_ops ovs_vxlan_vport_ops;
+
 static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 {
 	return vport_priv(vport);
@@ -192,11 +195,29 @@ static const char *vxlan_get_name(const struct vport *vport)
 	return vxlan_port->name;
 }
 
-const struct vport_ops ovs_vxlan_vport_ops = {
+static struct vport_ops ovs_vxlan_vport_ops = {
 	.type		= OVS_VPORT_TYPE_VXLAN,
 	.create		= vxlan_tnl_create,
 	.destroy	= vxlan_tnl_destroy,
 	.get_name	= vxlan_get_name,
 	.get_options	= vxlan_get_options,
 	.send		= vxlan_tnl_send,
+	.owner		= THIS_MODULE,
 };
+
+static int __init ovs_vxlan_tnl_init(void)
+{
+	return ovs_vport_ops_register(&ovs_vxlan_vport_ops);
+}
+
+static void __exit ovs_vxlan_tnl_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_vxlan_vport_ops);
+}
+
+module_init(ovs_vxlan_tnl_init);
+module_exit(ovs_vxlan_tnl_exit);
+
+MODULE_DESCRIPTION("OVS: VXLAN switching port");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("vport-type-4");
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 6015802..8168ef0 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -28,6 +28,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/compat.h>
 #include <net/net_namespace.h>
+#include <linux/module.h>
 
 #include "datapath.h"
 #include "vport.h"
@@ -36,22 +37,7 @@
 static void ovs_vport_record_error(struct vport *,
 				   enum vport_err_type err_type);
 
-/* List of statically compiled vport implementations.  Don't forget to also
- * add yours to the list at the bottom of vport.h. */
-static const struct vport_ops *vport_ops_list[] = {
-	&ovs_netdev_vport_ops,
-	&ovs_internal_vport_ops,
-
-#ifdef CONFIG_OPENVSWITCH_GRE
-	&ovs_gre_vport_ops,
-#endif
-#ifdef CONFIG_OPENVSWITCH_VXLAN
-	&ovs_vxlan_vport_ops,
-#endif
-#ifdef CONFIG_OPENVSWITCH_GENEVE
-	&ovs_geneve_vport_ops,
-#endif
-};
+static LIST_HEAD(vport_ops_list);
 
 /* Protected by RCU read lock for reading, ovs_mutex for writing. */
 static struct hlist_head *dev_table;
@@ -88,6 +74,32 @@ static struct hlist_head *hash_bucket(struct net *net, const char *name)
 	return &dev_table[hash & (VPORT_HASH_BUCKETS - 1)];
 }
 
+int ovs_vport_ops_register(struct vport_ops *ops)
+{
+	int err = -EEXIST;
+	struct vport_ops *o;
+
+	ovs_lock();
+	list_for_each_entry(o, &vport_ops_list, list)
+		if (ops->type == o->type)
+			goto errout;
+
+	list_add_tail(&ops->list, &vport_ops_list);
+	err = 0;
+errout:
+	ovs_unlock();
+	return err;
+}
+EXPORT_SYMBOL(ovs_vport_ops_register);
+
+void ovs_vport_ops_unregister(struct vport_ops *ops)
+{
+	ovs_lock();
+	list_del(&ops->list);
+	ovs_unlock();
+}
+EXPORT_SYMBOL(ovs_vport_ops_unregister);
+
 /**
  *	ovs_vport_locate - find a port that has already been created
  *
@@ -153,6 +165,7 @@ struct vport *ovs_vport_alloc(int priv_size, const struct vport_ops *ops,
 
 	return vport;
 }
+EXPORT_SYMBOL(ovs_vport_alloc);
 
 /**
  *	ovs_vport_free - uninitialize and free vport
@@ -173,6 +186,18 @@ void ovs_vport_free(struct vport *vport)
 	free_percpu(vport->percpu_stats);
 	kfree(vport);
 }
+EXPORT_SYMBOL(ovs_vport_free);
+
+static struct vport_ops *ovs_vport_lookup(const struct vport_parms *parms)
+{
+	struct vport_ops *ops;
+
+	list_for_each_entry(ops, &vport_ops_list, list)
+		if (ops->type == parms->type)
+			return ops;
+
+	return NULL;
+}
 
 /**
  *	ovs_vport_add - add vport device (for kernel callers)
@@ -184,31 +209,40 @@ void ovs_vport_free(struct vport *vport)
  */
 struct vport *ovs_vport_add(const struct vport_parms *parms)
 {
+	struct vport_ops *ops;
 	struct vport *vport;
-	int err = 0;
-	int i;
 
-	for (i = 0; i < ARRAY_SIZE(vport_ops_list); i++) {
-		if (vport_ops_list[i]->type == parms->type) {
-			struct hlist_head *bucket;
+	ops = ovs_vport_lookup(parms);
+	if (ops) {
+		struct hlist_head *bucket;
 
-			vport = vport_ops_list[i]->create(parms);
-			if (IS_ERR(vport)) {
-				err = PTR_ERR(vport);
-				goto out;
-			}
+		if (!try_module_get(ops->owner))
+			return ERR_PTR(-EAFNOSUPPORT);
 
-			bucket = hash_bucket(ovs_dp_get_net(vport->dp),
-					     vport->ops->get_name(vport));
-			hlist_add_head_rcu(&vport->hash_node, bucket);
+		vport = ops->create(parms);
+		if (IS_ERR(vport)) {
+			module_put(ops->owner);
 			return vport;
 		}
+
+		bucket = hash_bucket(ovs_dp_get_net(vport->dp),
+				     vport->ops->get_name(vport));
+		hlist_add_head_rcu(&vport->hash_node, bucket);
+		return vport;
 	}
 
-	err = -EAFNOSUPPORT;
+	/* Unlock to attempt module load and return -EAGAIN if load
+	 * was successful as we need to restart the port addition
+	 * workflow.
+	 */
+	ovs_unlock();
+	request_module("vport-type-%d", parms->type);
+	ovs_lock();
 
-out:
-	return ERR_PTR(err);
+	if (!ovs_vport_lookup(parms))
+		return ERR_PTR(-EAFNOSUPPORT);
+	else
+		return ERR_PTR(-EAGAIN);
 }
 
 /**
@@ -242,6 +276,8 @@ void ovs_vport_del(struct vport *vport)
 	hlist_del_rcu(&vport->hash_node);
 
 	vport->ops->destroy(vport);
+
+	module_put(vport->ops->owner);
 }
 
 /**
@@ -457,6 +493,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
 	}
 	ovs_dp_process_packet(skb, &key);
 }
+EXPORT_SYMBOL(ovs_vport_receive);
 
 /**
  *	ovs_vport_send - send a packet on a device
@@ -535,3 +572,4 @@ void ovs_vport_deferred_free(struct vport *vport)
 
 	call_rcu(&vport->rcu, free_vport_rcu);
 }
+EXPORT_SYMBOL(ovs_vport_deferred_free);
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 8942125..e41c3fa 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -161,6 +161,9 @@ struct vport_ops {
 	const char *(*get_name)(const struct vport *);
 
 	int (*send)(struct vport *, struct sk_buff *);
+
+	struct module *owner;
+	struct list_head list;
 };
 
 enum vport_err_type {
@@ -209,14 +212,6 @@ static inline struct vport *vport_from_priv(void *priv)
 void ovs_vport_receive(struct vport *, struct sk_buff *,
 		       struct ovs_tunnel_info *);
 
-/* List of statically compiled vport implementations.  Don't forget to also
- * add yours to the list at the top of vport.c. */
-extern const struct vport_ops ovs_netdev_vport_ops;
-extern const struct vport_ops ovs_internal_vport_ops;
-extern const struct vport_ops ovs_gre_vport_ops;
-extern const struct vport_ops ovs_vxlan_vport_ops;
-extern const struct vport_ops ovs_geneve_vport_ops;
-
 static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 				      const void *start, unsigned int len)
 {
@@ -224,4 +219,7 @@ static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 		skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
 }
 
+int ovs_vport_ops_register(struct vport_ops *ops);
+void ovs_vport_ops_unregister(struct vport_ops *ops);
+
 #endif /* vport.h */
-- 
1.9.3

^ permalink raw reply related

* [PATCH] net: fec: ptp: fix NULL pointer dereference if ptp_clock is not set
From: Philipp Zabel @ 2014-10-22 14:34 UTC (permalink / raw)
  To: David S. Miller, Luwei Zhou; +Cc: netdev, Philipp Zabel

Since commit 278d24047891 (net: fec: ptp: Enable PPS output based on ptp clock)
fec_enet_interrupt calls fec_ptp_check_pps_event unconditionally, which calls
into ptp_clock_event. If fep->ptp_clock is NULL, ptp_clock_event tries to
dereference the NULL pointer.
Since on i.MX53 fep->bufdesc_ex is not set, fec_ptp_init is never called,
and fep->ptp_clock is NULL, which reliably causes a kernel panic.

This patch adds a check for fep->ptp_clock == NULL in fec_enet_interrupt.

Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
---
 drivers/net/ethernet/freescale/fec_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 81b96cf..50a851d 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1581,7 +1581,8 @@ fec_enet_interrupt(int irq, void *dev_id)
 		complete(&fep->mdio_done);
 	}
 
-	fec_ptp_check_pps_event(fep);
+	if (fep->ptp_clock)
+		fec_ptp_check_pps_event(fep);
 
 	return ret;
 }
-- 
2.1.1

^ permalink raw reply related

* Re: [PATCH] Documentation: ptp: Fix build failure on MIPS cross builds
From: Peter Foley @ 2014-10-22 14:09 UTC (permalink / raw)
  To: Richard Cochran
  Cc: David Daney, David Miller, markos.chandras, linux-mips, corbet,
	netdev, linux-doc@vger.kernel.org, LKML
In-Reply-To: <20141022080302.GA4037@localhost.localdomain>

On Wed, Oct 22, 2014 at 4:03 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> In the mean time, I would like to restore the testptp.mk that *does*
> cross compile, so that people may use the test program if they
> want. In fact I use this all the time, and so I am a bit annoyed that
> something working was deleted and replaced with something broken.

Sure, I didn't realize that anyone was actually using testptp.mk on a
regular basis.
Feel free to restore it.

^ permalink raw reply

* [PATCH] igb: don't reuse pages with pfmemalloc flag
From: Roman Gushchin @ 2014-10-22 13:50 UTC (permalink / raw)
  To: jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	carolyn.wyborny, donald.c.skidmore, gregory.v.rose,
	peter.p.waskiewicz.jr, alexander.h.duyck, john.ronciak,
	tushar.n.dave, davem, sassmann, gregkh, e1000-devel
  Cc: netdev, linux-kernel, Roman Gushchin

Incoming packet is dropped silently by sk_filter(), if the skb was
allocated from pfmemalloc reserves and the corresponding socket is
not marked with the SOCK_MEMALLOC flag.

Igb driver allocates pages for DMA with __skb_alloc_page(), which
calls alloc_pages_node() with the __GFP_MEMALLOC flag. So, in case
of OOM condition, igb can get pages with pfmemalloc flag set.

If an incoming packet hits the pfmemalloc page and is large enough
(small packets are copying into the memory, allocated with
netdev_alloc_skb_ip_align(), so they are not affected), it will be
dropped.

This behavior is ok under high memory pressure, but the problem is
that the igb driver reuses these mapped pages. So, packets are still
dropping even if all memory issues are gone and there is a plenty
of free memory.

In my case, some TCP sessions hang on a small percentage (< 0.1%)
of machines days after OOMs.

Fix this by avoiding reuse of such pages.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 0d4c897..6586392 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6178,6 +6178,9 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
 	if (unlikely(page_to_nid(page) != numa_node_id()))
 		return false;
 
+	if (unlikely(page->pfmemalloc))
+		return false;
+
 #if (PAGE_SIZE < 8192)
 	/* if we are only owner of page we can reuse it */
 	if (unlikely(page_count(page) != 1))
@@ -6245,7 +6248,8 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 		memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
 
 		/* we can reuse buffer as-is, just make sure it is local */
-		if (likely(page_to_nid(page) == numa_node_id()))
+		if (likely((page_to_nid(page) == numa_node_id()) &&
+			   !page->pfmemalloc))
 			return true;
 
 		/* this page cannot be reused so discard it */
-- 
1.9.3

^ permalink raw reply related

* [PATCH 1/3] xen-netback: make feature-rx-notify mandatory
From: David Vrabel @ 2014-10-22 13:08 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1413983335-8307-1-git-send-email-david.vrabel@citrix.com>

Frontends that do not provide feature-rx-notify may stall because
netback depends on the notification from frontend to wake the guest Rx
thread (even if can_queue is false).

This could be fixed but feature-rx-notify was introduced in 2006 and I
am not aware of any frontends that do not implement this.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netback/common.h    |    5 -----
 drivers/net/xen-netback/interface.c |   12 +-----------
 drivers/net/xen-netback/xenbus.c    |   13 ++++---------
 3 files changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index d4eb8d2..93ca77c 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -228,9 +228,6 @@ struct xenvif {
 	u8 ip_csum:1;
 	u8 ipv6_csum:1;
 
-	/* Internal feature information. */
-	u8 can_queue:1;	    /* can queue packets for receiver? */
-
 	/* Is this interface disabled? True when backend discovers
 	 * frontend is rogue.
 	 */
@@ -272,8 +269,6 @@ void xenvif_xenbus_fini(void);
 
 int xenvif_schedulable(struct xenvif *vif);
 
-int xenvif_must_stop_queue(struct xenvif_queue *queue);
-
 int xenvif_queue_stopped(struct xenvif_queue *queue);
 void xenvif_wake_queue(struct xenvif_queue *queue);
 
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index f379689..c6759b1 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -60,16 +60,6 @@ void xenvif_skb_zerocopy_complete(struct xenvif_queue *queue)
 	atomic_dec(&queue->inflight_packets);
 }
 
-static inline void xenvif_stop_queue(struct xenvif_queue *queue)
-{
-	struct net_device *dev = queue->vif->dev;
-
-	if (!queue->vif->can_queue)
-		return;
-
-	netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
-}
-
 int xenvif_schedulable(struct xenvif *vif)
 {
 	return netif_running(vif->dev) &&
@@ -209,7 +199,7 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (!xenvif_rx_ring_slots_available(queue, min_slots_needed)) {
 		queue->rx_stalled.function = xenvif_rx_stalled;
 		queue->rx_stalled.data = (unsigned long)queue;
-		xenvif_stop_queue(queue);
+		netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
 		mod_timer(&queue->rx_stalled,
 			  jiffies + rx_drain_timeout_jiffies);
 	}
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 8079c31..9060857 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -873,15 +873,10 @@ static int read_xenbus_vif_flags(struct backend_info *be)
 	if (!rx_copy)
 		return -EOPNOTSUPP;
 
-	if (vif->dev->tx_queue_len != 0) {
-		if (xenbus_scanf(XBT_NIL, dev->otherend,
-				 "feature-rx-notify", "%d", &val) < 0)
-			val = 0;
-		if (val)
-			vif->can_queue = 1;
-		else
-			/* Must be non-zero for pfifo_fast to work. */
-			vif->dev->tx_queue_len = 1;
+	if (xenbus_scanf(XBT_NIL, dev->otherend,
+			 "feature-rx-notify", "%d", &val) < 0 || val == 0) {
+		xenbus_dev_fatal(dev, -EINVAL, "feature-rx-notify is mandatory");
+		return -EINVAL;
 	}
 
 	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-sg",
-- 
1.7.10.4

^ permalink raw reply related

* [PATCHv1 0/3 net-next] xen-netback: guest Rx queue drain and stall fixes
From: David Vrabel @ 2014-10-22 13:08 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu

This series fixes two critical xen-netback bugs.

1. Netback may consume all of host memory by queuing an unlimited
   number of skb on the internal guest Rx queue.  This behaviour is
   guest triggerable.

2. Carrier flapping under high traffic rates which reduces
   performance.

The first patch is a prerequite.  Removing support for frontends with
feature-rx-notify makes it easier to reason about the correctness of
netback since it no longer has to support this outdated and broken
mode.

David

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox