Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6
From: David Miller @ 2017-09-20 19:45 UTC (permalink / raw)
  To: tom; +Cc: tom, netdev, pablo, laforge, rohit
In-Reply-To: <CALx6S36cJEU2LkbNf-d8kUHCPj78bt5Y2Bd_rdvjxJ9epOEqAQ@mail.gmail.com>

From: Tom Herbert <tom@herbertland.com>
Date: Wed, 20 Sep 2017 11:03:52 -0700

> On Mon, Sep 18, 2017 at 9:19 PM, David Miller <davem@davemloft.net> wrote:
>> From: Tom Herbert <tom@quantonium.net>
>> Date: Mon, 18 Sep 2017 17:38:58 -0700
>>
>>> Allow peers to be specified by IPv6 addresses.
>>>
>>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>>
>> Hmmm, can you just check the socket family or something like that?
> 
> I'm not sure what code you're referring to.

There is a socket associated with the tunnel to do the encapsulation
and it has an address family, right?

^ permalink raw reply

* Re: Regression in throughput between kvm guests over virtual bridge
From: Matthew Rosato @ 2017-09-20 19:38 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst
In-Reply-To: <1173ab1f-e2b6-26b3-8c3c-bd5ceaa1bd8e@redhat.com>


> Seems to make some progress on wakeup mitigation. Previous patch tries
> to reduce the unnecessary traversal of waitqueue during rx. Attached
> patch goes even further which disables rx polling during processing tx.
> Please try it to see if it has any difference.

Unfortunately, this patch doesn't seem to have made a difference.  I
tried runs with both this patch and the previous patch applied, as well
as only this patch applied for comparison (numbers from vhost thread of
sending VM):

4.12    4.13     patch1   patch2   patch1+2
2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key

In each case, the regression in throughput was still present.

> And two questions:
> - Is the issue existed if you do uperf between 2VMs (instead of 4VMs)

Verified that the second set of guests are not actually required, I can
see the regression with only 2 VMs.

> - Can enable batching in the tap of sending VM improve the performance
> (ethtool -C $tap rx-frames 64)

I tried this, but it did not help (actually seemed to make things a
little worse)

^ permalink raw reply

* Re: usb/net/p54: trying to register non-static key in p54_unregister_leds
From: Christian Lamparter @ 2017-09-20 19:27 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Kalle Valo, linux-wireless, netdev, LKML, Dmitry Vyukov,
	Kostya Serebryany, syzkaller, Stephen Boyd, Tejun Heo, Yong Zhang
In-Reply-To: <CAAeHK+xPoNn7i5xOzEqHX6tfDpmpb1rd0hcAxZpTHKi6B8xzsw@mail.gmail.com>

On Wednesday, September 20, 2017 8:37:08 PM CEST Andrey Konovalov wrote:
> Hi!
> 
> I've got the following report while fuzzing the kernel with syzkaller.
> 
> On commit ebb2c2437d8008d46796902ff390653822af6cc4 (Sep 18).
> 
> INFO: trying to register non-static key.
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> CPU: 1 PID: 1404 Comm: kworker/1:1 Not tainted
> 4.14.0-rc1-42251-gebb2c2437d80-dirty #205
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Workqueue: usb_hub_wq hub_event
> Call Trace:
>  __dump_stack lib/dump_stack.c:16
>  dump_stack+0x292/0x395 lib/dump_stack.c:52
>  register_lock_class+0x6c4/0x1a00 kernel/locking/lockdep.c:769
>  __lock_acquire+0x27e/0x4550 kernel/locking/lockdep.c:3385
>  lock_acquire+0x259/0x620 kernel/locking/lockdep.c:4002
>  flush_work+0xf0/0x8c0 kernel/workqueue.c:2886
>  __cancel_work_timer+0x51d/0x870 kernel/workqueue.c:2961
>  cancel_delayed_work_sync+0x1f/0x30 kernel/workqueue.c:3081
>  p54_unregister_leds+0x6c/0xc0 drivers/net/wireless/intersil/p54/led.c:160
>  p54_unregister_common+0x3d/0xb0 drivers/net/wireless/intersil/p54/main.c:856
>  p54u_disconnect+0x86/0x120 drivers/net/wireless/intersil/p54/p54usb.c:1073
>  usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
>  __device_release_driver drivers/base/dd.c:861
>  device_release_driver_internal+0x4f4/0x5c0 drivers/base/dd.c:893
>  device_release_driver+0x1e/0x30 drivers/base/dd.c:918
>  bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
>  device_del+0x5c4/0xab0 drivers/base/core.c:1985
>  usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
>  usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
>  hub_port_connect drivers/usb/core/hub.c:4754
>  hub_port_connect_change drivers/usb/core/hub.c:5009
>  port_event drivers/usb/core/hub.c:5115
>  hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
>  process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
>  process_scheduled_works kernel/workqueue.c:2179
>  worker_thread+0xb2b/0x1850 kernel/workqueue.c:2255
>  kthread+0x3a1/0x470 kernel/kthread.c:231
>  ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
> 

It seems this is caused as a result of:
    -> lock_map_acquire(&work->lockdep_map);
	    lock_map_release(&work->lockdep_map);

    in flush_work() [0]

This was added by:

	commit 0976dfc1d0cd80a4e9dfaf87bd8744612bde475a
	Author: Stephen Boyd <sboyd@codeaurora.org>
	Date:   Fri Apr 20 17:28:50 2012 -0700

	workqueue: Catch more locking problems with flush_work()
    
Looking at the Stephen's patch, it's clear that it was made
with "static DECLARE_WORK(work, my_work)" in mind. However
p54's led_work is "per-device", hence it is stored in the
devices context p54_common, which is dynamically allocated.
So, maybe revert Stephen's patch?

[0] <http://elixir.free-electrons.com/linux/latest/source/kernel/workqueue.c#L2853>

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 19:23 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <07f0c800-7ee7-1ebe-9f5f-eaad52a7b42d@itcare.pl>



W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>
>
> W dniu 2017-09-20 o 20:36, Cong Wang pisze:
>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet 
>> <eric.dumazet@gmail.com> wrote:
>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>>
>>>> This is very odd.
>>>>
>>>> We only free netdevice in free_netdev() and it is only called when
>>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>>> to be NULL.
>>> If there is a missing dev_hold() or one dev_put() in excess,
>>> this would allow the netdev to be freed too soon.
>>>
>>> -> Use after free.
>>> memory holding netdev could be reallocated-cleared by some other kernel
>>> user.
>>>
>> Sure, but only unregister could trigger a free. If there is no 
>> unregister,
>> like what Pawel claims, then there is no free, the refcnt just goes to
>> 0 but the memory is still there.
>>
> About possible mistake from my side with bisect - i can judge too 
> early that some bisect was good
> the road was:
> git bisect start
> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
> 'pinctrl-v4.13-1' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' 
> of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid 
> using stack larger than 1024.
> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
> 'udp-reduce-cache-pressure'
> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
> # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
> 's390-net-updates-part-2'
> git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
> # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
> 'bpf-ctx-narrow'
> git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
> # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove 
> cp_outgoing
> git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
> # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
> TCP_MD5SIG_EXT socket option to set a key address prefix
> git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
> # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a 
> new function dst_dev_put()
>
> And currently have this running for about 4 hours without problems.
>
>
>
> git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
> # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
> DST_NOCACHE flag
>
> Here for sure - panic
>
> git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
> # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
> dst_hold_safe() properly
> git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
> # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
> dst_hold_safe() properly
> git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
> # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
> dst->__refcnt for insertion into fib6 tree
>
> im not 100% sure tor last two
> Will test them again starting from
> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() 
> properly
>
>
> git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
> # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC 
> and remove the operation of dst_free()
>
>
>
> git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
> # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: 
> mark DST_NOGC and remove the operation of dst_free()
>
>
>
What i can say more
I can reproduce this on any server with similar configuration
the difference can be teamd instead of bonding
ixgbe or i40e and mlx5
Same problems

vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel
But normally in lab when using only plain routing no bgpd and about 128 
vlans - with 128 routes - cant reproduce this - this apperas only with 
bgp - minimum where i can reproduce this was about 130k prefixes with 
about 286 nexthops

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 19:13 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <CAM_iQpVF3Gf3G=543o8g+5N2dtEL4DcUPN3a6=L8V_G0CXZ65w@mail.gmail.com>



W dniu 2017-09-20 o 20:36, Cong Wang pisze:
> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>
>>> This is very odd.
>>>
>>> We only free netdevice in free_netdev() and it is only called when
>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>> to be NULL.
>> If there is a missing dev_hold() or one dev_put() in excess,
>> this would allow the netdev to be freed too soon.
>>
>> -> Use after free.
>> memory holding netdev could be reallocated-cleared by some other kernel
>> user.
>>
> Sure, but only unregister could trigger a free. If there is no unregister,
> like what Pawel claims, then there is no free, the refcnt just goes to
> 0 but the memory is still there.
>
About possible mistake from my side with bisect - i can judge too early 
that some bisect was good
the road was:
git bisect start
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
'pinctrl-v4.13-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' 
of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using 
stack larger than 1024.
git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
'udp-reduce-cache-pressure'
git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
# bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
's390-net-updates-part-2'
git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
# good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
'bpf-ctx-narrow'
git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
# good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove 
cp_outgoing
git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
# bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
TCP_MD5SIG_EXT socket option to set a key address prefix
git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new 
function dst_dev_put()

And currently have this running for about 4 hours without problems.



git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
DST_NOCACHE flag

Here for sure - panic

git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
# bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
dst_hold_safe() properly
git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
dst_hold_safe() properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
# bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
dst->__refcnt for insertion into fib6 tree

im not 100% sure tor last two
Will test them again starting from
[95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly


git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC 
and remove the operation of dst_free()



git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: 
mark DST_NOGC and remove the operation of dst_free()

^ permalink raw reply

* [PATCH] hv_netvsc: fix send buffer failure on MTU change
From: Stephen Hemminger @ 2017-09-20 18:17 UTC (permalink / raw)
  To: davem, kys; +Cc: netdev, Alex Ng, Stephen Hemminger

From: Alex Ng <alexng@microsoft.com>

If MTU is changed the host would reject the send buffer change.
This problem is result of recent change to allow changing send
buffer size.

Every time we change the MTU, we store the previous net_device section
count before destroying the buffer, but we don’t store the previous
section size. When we reinitialize the buffer, its size is calculated
by multiplying the previous count and previous size. Since we
continuously increase the MTU, the host returns us a decreasing count
value while the section size is reinitialized to 1728 bytes every
time.

This eventually leads to a condition where the calculated buf_size is
so small that the host rejects it.

Fixes: 8b5327975ae1 ("netvsc: allow controlling send/recv buffer size")
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h | 2 ++
 drivers/net/hyperv/netvsc.c     | 7 ++-----
 drivers/net/hyperv/netvsc_drv.c | 8 ++++++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index d98cdfb1536b..5176be76ca7d 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -150,6 +150,8 @@ struct netvsc_device_info {
 	u32  num_chn;
 	u32  send_sections;
 	u32  recv_sections;
+	u32  send_section_size;
+	u32  recv_section_size;
 };
 
 enum rndis_device_state {
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index a5511b7326af..8d5077fb0492 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -76,9 +76,6 @@ static struct netvsc_device *alloc_net_device(void)
 	net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
 	net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
 
-	net_device->recv_section_size = NETVSC_RECV_SECTION_SIZE;
-	net_device->send_section_size = NETVSC_SEND_SECTION_SIZE;
-
 	init_completion(&net_device->channel_init_wait);
 	init_waitqueue_head(&net_device->subchan_open);
 	INIT_WORK(&net_device->subchan_work, rndis_set_subchannel);
@@ -262,7 +259,7 @@ static int netvsc_init_buf(struct hv_device *device,
 	int ret = 0;
 
 	/* Get receive buffer area. */
-	buf_size = device_info->recv_sections * net_device->recv_section_size;
+	buf_size = device_info->recv_sections * device_info->recv_section_size;
 	buf_size = roundup(buf_size, PAGE_SIZE);
 
 	net_device->recv_buf = vzalloc(buf_size);
@@ -344,7 +341,7 @@ static int netvsc_init_buf(struct hv_device *device,
 		goto cleanup;
 
 	/* Now setup the send buffer. */
-	buf_size = device_info->send_sections * net_device->send_section_size;
+	buf_size = device_info->send_sections * device_info->send_section_size;
 	buf_size = round_up(buf_size, PAGE_SIZE);
 
 	net_device->send_buf = vzalloc(buf_size);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index d4902ee5f260..a32ae02e1b6c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -848,7 +848,9 @@ static int netvsc_set_channels(struct net_device *net,
 	device_info.num_chn = count;
 	device_info.ring_size = ring_size;
 	device_info.send_sections = nvdev->send_section_cnt;
+	device_info.send_section_size = nvdev->send_section_size;
 	device_info.recv_sections = nvdev->recv_section_cnt;
+	device_info.recv_section_size = nvdev->recv_section_size;
 
 	rndis_filter_device_remove(dev, nvdev);
 
@@ -963,7 +965,9 @@ static int netvsc_change_mtu(struct net_device *ndev, int mtu)
 	device_info.ring_size = ring_size;
 	device_info.num_chn = nvdev->num_chn;
 	device_info.send_sections = nvdev->send_section_cnt;
+	device_info.send_section_size = nvdev->send_section_size;
 	device_info.recv_sections = nvdev->recv_section_cnt;
+	device_info.recv_section_size = nvdev->recv_section_size;
 
 	rndis_filter_device_remove(hdev, nvdev);
 
@@ -1485,7 +1489,9 @@ static int netvsc_set_ringparam(struct net_device *ndev,
 	device_info.num_chn = nvdev->num_chn;
 	device_info.ring_size = ring_size;
 	device_info.send_sections = new_tx;
+	device_info.send_section_size = nvdev->send_section_size;
 	device_info.recv_sections = new_rx;
+	device_info.recv_section_size = nvdev->recv_section_size;
 
 	netif_device_detach(ndev);
 	was_opened = rndis_filter_opened(nvdev);
@@ -1934,7 +1940,9 @@ static int netvsc_probe(struct hv_device *dev,
 	device_info.ring_size = ring_size;
 	device_info.num_chn = VRSS_CHANNEL_DEFAULT;
 	device_info.send_sections = NETVSC_DEFAULT_TX;
+	device_info.send_section_size = NETVSC_SEND_SECTION_SIZE;
 	device_info.recv_sections = NETVSC_DEFAULT_RX;
+	device_info.recv_section_size = NETVSC_RECV_SECTION_SIZE;
 
 	nvdev = rndis_filter_device_add(dev, &device_info);
 	if (IS_ERR(nvdev)) {
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
From: Craig Gallek @ 2017-09-20 18:51 UTC (permalink / raw)
  To: Daniel Mack; +Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller, netdev
In-Reply-To: <47e47f0d-d04d-b52b-647a-7883aa5f3268@zonque.org>

On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack <daniel@zonque.org> wrote:
> Hi Craig,
>
> Thanks, this looks much cleaner already :)
>
> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>>       struct lpm_trie_node __rcu **trim;
>>       struct lpm_trie_node *node;
>>       unsigned long irq_flags;
>> -     unsigned int next_bit;
>> +     unsigned int next_bit = 0;
>
> This default assignment seems wrong, and I guess you only added it to
> squelch a compiler warning?
Yes, this variable is only initialized after the lookup iterations
below (meaning it will never be initialized the the root-removal
case).

> [...]
>
>> +     /* If the node has one child, we may be able to collapse the tree
>> +      * while removing this node if the node's child is in the same
>> +      * 'next bit' slot as this node was in its parent or if the node
>> +      * itself is the root.
>> +      */
>> +     if (trim == &trie->root) {
>> +             next_bit = node->child[0] ? 0 : 1;
>> +             rcu_assign_pointer(trie->root, node->child[next_bit]);
>> +             kfree_rcu(node, rcu);
>
> I don't think you should treat this 'root' case special.
>
> Instead, move the 'next_bit' assignment outside of the condition ...
I'm not quite sure I follow.  Are you saying do something like this:

if (trim == &trie->root) {
  next_bit = node->child[0] ? 0 : 1;
}
if (rcu_access_pointer(node->child[next_bit])) {
...

This would save a couple lines of code, but I think the as-is
implementation is slightly easier to understand.  I don't have a
strong opinion either way, though.

Thanks for the pointers,
Craig

^ permalink raw reply

* usb/net/p54: trying to register non-static key in p54_unregister_leds
From: Andrey Konovalov @ 2017-09-20 18:37 UTC (permalink / raw)
  To: Christian Lamparter, Kalle Valo, linux-wireless, netdev, LKML
  Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller

Hi!

I've got the following report while fuzzing the kernel with syzkaller.

On commit ebb2c2437d8008d46796902ff390653822af6cc4 (Sep 18).

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 1404 Comm: kworker/1:1 Not tainted
4.14.0-rc1-42251-gebb2c2437d80-dirty #205
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 register_lock_class+0x6c4/0x1a00 kernel/locking/lockdep.c:769
 __lock_acquire+0x27e/0x4550 kernel/locking/lockdep.c:3385
 lock_acquire+0x259/0x620 kernel/locking/lockdep.c:4002
 flush_work+0xf0/0x8c0 kernel/workqueue.c:2886
 __cancel_work_timer+0x51d/0x870 kernel/workqueue.c:2961
 cancel_delayed_work_sync+0x1f/0x30 kernel/workqueue.c:3081
 p54_unregister_leds+0x6c/0xc0 drivers/net/wireless/intersil/p54/led.c:160
 p54_unregister_common+0x3d/0xb0 drivers/net/wireless/intersil/p54/main.c:856
 p54u_disconnect+0x86/0x120 drivers/net/wireless/intersil/p54/p54usb.c:1073
 usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
 __device_release_driver drivers/base/dd.c:861
 device_release_driver_internal+0x4f4/0x5c0 drivers/base/dd.c:893
 device_release_driver+0x1e/0x30 drivers/base/dd.c:918
 bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
 device_del+0x5c4/0xab0 drivers/base/core.c:1985
 usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
 usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
 hub_port_connect drivers/usb/core/hub.c:4754
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
 process_scheduled_works kernel/workqueue.c:2179
 worker_thread+0xb2b/0x1850 kernel/workqueue.c:2255
 kthread+0x3a1/0x470 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Cong Wang @ 2017-09-20 18:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <1505932231.29839.106.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>> but dmesg at this time shows nothing about interfaces or flaps.
>>
>> This is very odd.
>>
>> We only free netdevice in free_netdev() and it is only called when
>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>> to be NULL.
>
> If there is a missing dev_hold() or one dev_put() in excess,
> this would allow the netdev to be freed too soon.
>
> -> Use after free.
> memory holding netdev could be reallocated-cleared by some other kernel
> user.
>

Sure, but only unregister could trigger a free. If there is no unregister,
like what Pawel claims, then there is no free, the refcnt just goes to
0 but the memory is still there.

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Eric Dumazet @ 2017-09-20 18:30 UTC (permalink / raw)
  To: Cong Wang
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <CAM_iQpUOOCcES4d_ueK2wizn2oyDZPEzLfWQFu3oOq=BuRzdiw@mail.gmail.com>

On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
> but dmesg at this time shows nothing about interfaces or flaps.
> 
> This is very odd.
> 
> We only free netdevice in free_netdev() and it is only called when
> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
> to be NULL.

If there is a missing dev_hold() or one dev_put() in excess,
this would allow the netdev to be freed too soon.

-> Use after free.
memory holding netdev could be reallocated-cleared by some other kernel
user.

^ permalink raw reply

* Re: cross namespace interface notification for tun devices
From: Cong Wang @ 2017-09-20 18:29 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Netdev, Mathias
In-Reply-To: <CAHmME9oSZeAoDtSLxoGiEcTcCbmPLf-nj3OJcJ1ma8fs4BWQAw@mail.gmail.com>

On Tue, Sep 19, 2017 at 2:02 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Tue, Sep 19, 2017 at 10:40 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> By "notification" I assume you mean netlink notification.
>
> Yes, netlink notification.
>
>> The question is why does the process in A still care about
>> the device sitting in B?
>>
>> Also, the process should be able to receive a last notification
>> on IFF_UP|IFF_RUNNING before device is finally moved to B.
>> After this point, it should not have any relation to netns A
>> any more, like the device were completely gone.
>
> That's very clearly not the case with a tun device. Tun devices work
> by letting a userspace process control the inputs (ndo_start_xmit) and
> outputs (netif_rx) of the actual network device. This controlling
> userspace process needs to know when its own interface that it
> controls goes up and down. In the kernel, we can do this by just
> checking dev->flags&IFF_UP, and receive notifications on ndo_open and
> ndo_stop. In userspace, the controlling process looses the ability to
> receive notifications like ndo_open/ndo_stop when the interface is
> moved to a new namespace. After the interface is moved to a namespace,
> the process will still control inputs and ouputs (ndo_start_xmit and
> netif_rx), but it will no longer receive netlink notifications for the
> equivalent of ndo_open and ndo_stop. This is problematic.

Sounds like we should set NETIF_F_NETNS_LOCAL for tun
device.

What is your legitimate use case of send/receive packet to/from
a tun device in a different netns?

^ permalink raw reply

* Re: [PATCH v5 05/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY
From: Corentin Labbe @ 2017-09-20 18:23 UTC (permalink / raw)
  To: Rob Herring
  Cc: Andrew Lunn, Mark Rutland, Florian Fainelli, Alexandre Torgue,
	devicetree@vger.kernel.org, Catalin Marinas, Will Deacon,
	Russell King, linux-kernel@vger.kernel.org, Chen-Yu Tsai, netdev,
	Giuseppe CAVALLARO, Maxime Ripard,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <CAL_JsqJ2eVA9Zg60Hagqm732ZoOzHWTwVBqOabOhjE5aQ26L+g@mail.gmail.com>

On Tue, Sep 19, 2017 at 09:49:52PM -0500, Rob Herring wrote:
> On Thu, Sep 14, 2017 at 2:19 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> >> > Is the MDIO controller "allwinner,sun8i-h3-emac" or "snps,dwmac-mdio"?
> >> > If the latter, then I think the node is fine, but then the mux should be
> >> > a child node of it. IOW, the child of an MDIO controller should either
> >> > be a mux node or slave devices.
> >
> > Hi Rob
> >
> > Up until now, children of an MDIO bus have been MDIO devices. Those
> > MDIO devices are either Ethernet PHYs, Ethernet Switches, or the
> > oddball devices that Broadcom iProc has, like generic PHYs.
> >
> > We have never had MDIO-muxes as MDIO children. A Mux is not an MDIO
> > device, and does not have the properties of an MDIO device. It is not
> > addressable on the MDIO bus. The current MUXes are addressed via GPIOs
> > or MMIO.
> 
> The DT parent/child relationship defines the bus topology. We describe
> MDIO buses in that way and if a mux is sitting between the controller
> and the devices, then the DT hierarchy should reflect that. Now
> sometimes we have 2 options for what interface has the parent/child
> relationship (e.g. an I2C controlled USB hub chip), but in this case
> we don't.
> 

Putting mdio-mux as a child of it (the mdio node) give me:
[   18.175338] libphy: stmmac: probed
[   18.175379] mdio_bus stmmac-0: /soc/ethernet@1c30000/mdio/mdio-mux has invalid PHY address
[   18.175408] mdio_bus stmmac-0: scan phy mdio-mux at address 0
[   18.175450] mdio_bus stmmac-0: scan phy mdio-mux at address 1
[   18.175482] mdio_bus stmmac-0: scan phy mdio-mux at address 2
[   18.175513] mdio_bus stmmac-0: scan phy mdio-mux at address 3
[   18.175544] mdio_bus stmmac-0: scan phy mdio-mux at address 4
[   18.175575] mdio_bus stmmac-0: scan phy mdio-mux at address 5
[   18.175607] mdio_bus stmmac-0: scan phy mdio-mux at address 6
[   18.175638] mdio_bus stmmac-0: scan phy mdio-mux at address 7
[   18.175669] mdio_bus stmmac-0: scan phy mdio-mux at address 8
[   18.175700] mdio_bus stmmac-0: scan phy mdio-mux at address 9
[   18.175731] mdio_bus stmmac-0: scan phy mdio-mux at address 10
[   18.175762] mdio_bus stmmac-0: scan phy mdio-mux at address 11
[   18.175795] mdio_bus stmmac-0: scan phy mdio-mux at address 12
[   18.175827] mdio_bus stmmac-0: scan phy mdio-mux at address 13
[   18.175858] mdio_bus stmmac-0: scan phy mdio-mux at address 14
[   18.175889] mdio_bus stmmac-0: scan phy mdio-mux at address 15
[   18.175919] mdio_bus stmmac-0: scan phy mdio-mux at address 16
[   18.175951] mdio_bus stmmac-0: scan phy mdio-mux at address 17
[   18.175982] mdio_bus stmmac-0: scan phy mdio-mux at address 18
[   18.176014] mdio_bus stmmac-0: scan phy mdio-mux at address 19
[   18.176045] mdio_bus stmmac-0: scan phy mdio-mux at address 20
[   18.176076] mdio_bus stmmac-0: scan phy mdio-mux at address 21
[   18.176107] mdio_bus stmmac-0: scan phy mdio-mux at address 22
[   18.176139] mdio_bus stmmac-0: scan phy mdio-mux at address 23
[   18.176170] mdio_bus stmmac-0: scan phy mdio-mux at address 24
[   18.176202] mdio_bus stmmac-0: scan phy mdio-mux at address 25
[   18.176233] mdio_bus stmmac-0: scan phy mdio-mux at address 26
[   18.176271] mdio_bus stmmac-0: scan phy mdio-mux at address 27
[   18.176320] mdio_bus stmmac-0: scan phy mdio-mux at address 28
[   18.176371] mdio_bus stmmac-0: scan phy mdio-mux at address 29
[   18.176420] mdio_bus stmmac-0: scan phy mdio-mux at address 30
[   18.176452] mdio_bus stmmac-0: scan phy mdio-mux at address 31

Adding a fake <reg> to mdio-mux remove it, but I found that a bit ugly.
Or perhaps patching of_mdiobus_register() to not scan node with compatible "mdio-mux".

What do you think ?

Regards

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Cong Wang @ 2017-09-20 18:22 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Eric Dumazet, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <3c227be7-a954-a406-1987-24e908cf214c@itcare.pl>

On Wed, Sep 20, 2017 at 10:55 AM, Paweł Staszewski
<pstaszewski@itcare.pl> wrote:
>
>
> W dniu 2017-09-20 o 19:50, Cong Wang pisze:
>
> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
>
> Sorry for top-posting, but this is to give context to Wei, since Pawel
> used a top posting way to report his bisection.
>
> Wei, can you take a look at Pawel report ?
>
> Crash happens in dst_destroy() at following :
>
> if (dst->dev)
>      dev_put(dst->dev); <<CRASH>>
>
>
> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>
> 65 ff 08                decl   %gs:(%rax)   // CRASH since rax = NULL
>
>
>
> Pawel, please share your netdevices and routing setup  ?
>
> Looks like a double dev_put() on some dev...
>
> Pawel, do you have any idea how this is triggered? Does your
> test try to remove some network device? If so which one?
> I noticed you have at least multiple vlan, bond and ixgbe
> devices.
>
> Just after i start bgp sessions
> So when host is starting i have all bgp sessions to upstreams shutdown
>
> To trigger panic i just enable all 6x bgp sessions at once to upstreams -
> and zebra is start to pull prefixes and push them to the kernel
>
> Then some traffic is generated from test hosts thru this backup router and
> panic is generated - every time after 10 to 15 seconds after bgp sessions
> are connected.
>
> I'm not removing any interface at this time or do anything with interfaces -
> just wait.
>
> And yes there are vlans attached to the bond devices
> but dmesg at this time shows nothing about interfaces or flaps.

This is very odd.

We only free netdevice in free_netdev() and it is only called when
we unregister a netdevice. Otherwise pcpu_refcnt is impossible
to be NULL.

^ permalink raw reply

* Re: [PATCH net-next 12/14] gtp: Configuration for zero UDP checksum
From: Tom Herbert @ 2017-09-20 18:09 UTC (permalink / raw)
  To: David Miller
  Cc: Tom Herbert, Linux Kernel Network Developers, Pablo Neira Ayuso,
	Harald Welte, Rohit Seth
In-Reply-To: <20170918.212445.1680068602644706922.davem@davemloft.net>

On Mon, Sep 18, 2017 at 9:24 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@quantonium.net>
> Date: Mon, 18 Sep 2017 17:39:02 -0700
>
>> Add configuration to control use of zero checksums on transmit for both
>> IPv4 and IPv6, and control over accepting zero IPv6 checksums on
>> receive.
>>
>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>
> I thought we were trying to move away from this special case of allowing
> zero UDP checksums with tunnels, especially for ipv6.

I don't have a strong preference either way. I like consistency with
VXLAN and foo/UDP, but I guess it's not required. Interestingly, since
GTP only carries IP, IPv6 zero checksums are actually safer here than
VXLAN or GRE/UDP.

Tom

^ permalink raw reply

* Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6
From: Tom Herbert @ 2017-09-20 18:03 UTC (permalink / raw)
  To: David Miller
  Cc: Tom Herbert, Linux Kernel Network Developers, Pablo Neira Ayuso,
	Harald Welte, Rohit Seth
In-Reply-To: <20170918.211952.1397675528248642600.davem@davemloft.net>

On Mon, Sep 18, 2017 at 9:19 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@quantonium.net>
> Date: Mon, 18 Sep 2017 17:38:58 -0700
>
>> Allow peers to be specified by IPv6 addresses.
>>
>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>
> Hmmm, can you just check the socket family or something like that?

I'm not sure what code you're referring to.

Thanks

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 17:58 UTC (permalink / raw)
  To: Wei Wang; +Cc: Eric Dumazet, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <CAEA6p_D8Av0QAJ46Gzx=Bt3uP-NshTmz2QoBzYaO8OK9+CFQNA@mail.gmail.com>



W dniu 2017-09-20 o 19:46, Wei Wang pisze:
>>> This is why I suggested to replace the BUG() in another mail
>>>
>>> So :
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index
>>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>>> 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>>     */
>>>    static inline void dev_put(struct net_device *dev)
>>>    {
>>> -       this_cpu_dec(*dev->pcpu_refcnt);
>>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>>> +
>>> +       if (!pref) {
>>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>>> %d\n",
>>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>>> +               for (;;)
>>> +                       cpu_relax();
>>> +       }
>>> +       this_cpu_dec(*pref);
>>>    }
>>>      /**
>>>
> Thanks a lot Eric for the debug patch.
>
> Pawel,
>
> I want to confirm with you about the last good commit when you did bisection.
> You mentioned:
>
>> And the last one
>>
>> git bisect good
>> Bisecting: 1 revision left to test after this (roughly 1 step)
>> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
>> insertion into fib6 tree
>>
>> With this have kernel panic same as always
>>
>> git bisect bad
>> Bisecting: 0 revisions left to test after this (roughly 0 steps)
>> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
>> remove the operation of dst_free()
>
> So it breaks right at:
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()
> Right?
> If you sync the image to one commit before the above one:
> [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
> Does it crash?
Later today i will repeat last three steps - in about next 3 hours after 
rush hours of internet traffic - now i cant touch backup router  :)

>
> And could you confirm that your config does not have any IPv6
> addresses or routes configured?
There is ipv6 enabled
And yes there are some ipv6 ip's
One interface have ipv6 enabled with one static route

  but no ipv6 bgp sessions - so nt many ipv6 prefixes and ipv6 fib is 
almost empty

ip -6 r ls | wc -l
57


>
> Thanks.
> Wei
>
>
> 6:03 +0200, Paweł Staszewski wrote:
>>>> Nit much more after adding this patch
>>>>
>>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>>
>>> This is why I suggested to replace the BUG() in another mail
>>>
>>> So :
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index
>>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>>> 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>>     */
>>>    static inline void dev_put(struct net_device *dev)
>>>    {
>>> -       this_cpu_dec(*dev->pcpu_refcnt);
>>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>>> +
>>> +       if (!pref) {
>>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>>> %d\n",
>>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>>> +               for (;;)
>>> +                       cpu_relax();
>>> +       }
>>> +       this_cpu_dec(*pref);
>>>    }
>>>      /**
>>>
>>>
>>>
>> Full panic
>>
>> https://bugzilla.kernel.org/attachment.cgi?id=258531
>>
>>
>> I will change patch and apply but later today cause now cant use backup
>> router as testlab - Internet rush hours if something happens this will be
>> bed when second router will have bugged kernel :)
>>
>>

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Eric Dumazet @ 2017-09-20 17:59 UTC (permalink / raw)
  To: Cong Wang
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <CAM_iQpV8qV8a_yFT0--4Z==FuWda=4EQzKJzHTL+op4S5b9E0A@mail.gmail.com>

On Wed, 2017-09-20 at 10:50 -0700, Cong Wang wrote:
> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Sorry for top-posting, but this is to give context to Wei, since Pawel
> > used a top posting way to report his bisection.
> >
> > Wei, can you take a look at Pawel report ?
> >
> > Crash happens in dst_destroy() at following :
> >
> > if (dst->dev)
> >      dev_put(dst->dev); <<CRASH>>
> >
> >
> > dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
> >
> > 65 ff 08                decl   %gs:(%rax)   // CRASH since rax = NULL
> >
> >
> >
> > Pawel, please share your netdevices and routing setup  ?
> 
> Looks like a double dev_put() on some dev...
> 
> Pawel, do you have any idea how this is triggered? Does your
> test try to remove some network device? If so which one?
> I noticed you have at least multiple vlan, bond and ixgbe
> devices.

Or a missing dev_hold() somewhere.

^ permalink raw reply

* Re: mac80211: avoid allocating TXQs that won't be used
From: Johannes Berg @ 2017-09-20 17:51 UTC (permalink / raw)
  To: Colin Ian King
  Cc: David S. Miller,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <3592b0b1-0455-ca9a-9ca7-702d7cf421ff-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

On Wed, 2017-09-20 at 17:08 +0100, Colin Ian King wrote:
> Johannes,
> 
> Static analysis with CoverityScan on linux-next today detected a null
> pointer dereference issue on commit:
> 
> From 0fc4b3403d215ecd3c05505ec1f0028a227ed319 Mon Sep 17 00:00:00
> 2001
> From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Date: Thu, 22 Jun 2017 12:20:29 +0200
> Subject: [PATCH] mac80211: avoid allocating TXQs that won't be used
> 
> Issue: sdata is null when the sdata is dereferenced by:
> 
>                    sdata->vif.type != NL80211_IFTYPE_AP_VLAN &&
>                    sdata->vif.type != NL80211_IFTYPE_MONITOR)
> 
> note that sdata is assigned a non-null much later with the statement
> sdata = netdev_priv(ndev).

Yeah, umm, that should be checking just 'type'. Thanks, will fix.

johannes

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Cong Wang @ 2017-09-20 17:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paweł Staszewski, Wei Wang, Linux Kernel Network Developers,
	Eric Dumazet
In-Reply-To: <1505913075.29839.90.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Sorry for top-posting, but this is to give context to Wei, since Pawel
> used a top posting way to report his bisection.
>
> Wei, can you take a look at Pawel report ?
>
> Crash happens in dst_destroy() at following :
>
> if (dst->dev)
>      dev_put(dst->dev); <<CRASH>>
>
>
> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>
> 65 ff 08                decl   %gs:(%rax)   // CRASH since rax = NULL
>
>
>
> Pawel, please share your netdevices and routing setup  ?

Looks like a double dev_put() on some dev...

Pawel, do you have any idea how this is triggered? Does your
test try to remove some network device? If so which one?
I noticed you have at least multiple vlan, bond and ixgbe
devices.

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Wei Wang @ 2017-09-20 17:46 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Eric Dumazet, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <273f49c7-58c0-0b41-10ff-6f891ffb72e9@itcare.pl>

>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index
>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>> 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>    */
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -       this_cpu_dec(*dev->pcpu_refcnt);
>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +       if (!pref) {
>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>> %d\n",
>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>> +               for (;;)
>> +                       cpu_relax();
>> +       }
>> +       this_cpu_dec(*pref);
>>   }
>>     /**
>>

Thanks a lot Eric for the debug patch.

Pawel,

I want to confirm with you about the last good commit when you did bisection.
You mentioned:

> And the last one
>
> git bisect good
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
> insertion into fib6 tree
>
> With this have kernel panic same as always
>
> git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()


So it breaks right at:
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
remove the operation of dst_free()
Right?
If you sync the image to one commit before the above one:
[9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
Does it crash?

And could you confirm that your config does not have any IPv6
addresses or routes configured?

Thanks.
Wei


6:03 +0200, Paweł Staszewski wrote:
>>>
>>> Nit much more after adding this patch
>>>
>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>
>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index
>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>> 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>    */
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -       this_cpu_dec(*dev->pcpu_refcnt);
>> +       int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +       if (!pref) {
>> +               pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>> %d\n",
>> +                      dev, dev->name, dev->reg_state, dev->dismantle);
>> +               for (;;)
>> +                       cpu_relax();
>> +       }
>> +       this_cpu_dec(*pref);
>>   }
>>     /**
>>
>>
>>
>
> Full panic
>
> https://bugzilla.kernel.org/attachment.cgi?id=258531
>
>
> I will change patch and apply but later today cause now cant use backup
> router as testlab - Internet rush hours if something happens this will be
> bed when second router will have bugged kernel :)
>
>

^ permalink raw reply

* Re: [PATCH net-next 1/5] net: add support for noref skb->sk
From: Eric Dumazet @ 2017-09-20 17:41 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <39b55ec9584df7dfd6eb498a3d354cd2e08e3eaa.1505926196.git.pabeni@redhat.com>

On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> Noref sk do not carry a socket refcount, are valid
> only inside the current RCU section and must be
> explicitly cleared before exiting such section.
> 
> They will be used in a later patch to allow early demux
> without sock refcounting.




> +/* dummy destructor used by noref sockets */
> +void sock_dummyfree(struct sk_buff *skb)
> +{

BUG();

> +}
> +EXPORT_SYMBOL(sock_dummyfree);
> +


I do not see how you ensure we do not leave RCU section with an skb
destructor pointing to this sock_dummyfree()

This patch series looks quite dangerous to me.

Do we really have real applications using connected UDP sockets and
wanting very high pps throughput ?

I am pretty sure the bottleneck is the sender part.

^ permalink raw reply

* Re: [PATCH RFC V1 net-next 0/6] Time based packet transmission
From: levipearson @ 2017-09-20 17:35 UTC (permalink / raw)
  To: rcochran
  Cc: netdev, linux-kernel, intel-wired-lan, vinicius.gomes,
	andre.guedes, john.stultz, jesus.sanchez-palencia, henrik, tglx,
	anna-maria, davem
In-Reply-To: <cover.1505719061.git.rcochran@linutronix.de>

> This series is an early RFC that introduces a new socket option
> allowing time based transmission of packets.  This option will be
> useful in implementing various real time protocols over Ethernet,
> including but not limited to P802.1Qbv, which is currently finding
> its way into 802.1Q.
> 
> * Open questions about SO_TXTIME semantics
> 
>   - What should the kernel do if the dialed Tx time is in the past?
>     Should the packet be sent ASAP, or should we throw an error?

Based on the i210 and latest NXP/Freescale FEC launch time behavior,
the hardware timestamps work over 1-second windows corresponding to
the time elapsed since the last PTP second began. When considering the
head-of-queue frame, the launch time is compared to the elapsed time
counter and if the elapsed time is between exactly the launch time and
half a second after the launch time, it is launched. If you enqueue a
frame with a scheduled launch time that ends up more than half a second
late, it is considered by the hardware to be scheduled *in the future*
at the offset belonging to the next second after the 1-second window
wraps around.

So *slightly* late (<<.5sec late) frames could be scheduled as normal,
but approaching .5sec late frames would have to either be dropped or 
have their schedule changed to avoid blocking the queue for a large
fraction of a second.

I don't like the idea of changing the scheduled time, and anything that
is close to half a second late is most likely useless. But it is also
reasonable to let barely-late frames go out ASAP--in the case of a Qav-
shaped stream, the bunching would get smoothed out downstream. A timed
launch schedule need not be used as an exact time, but a "don't send
before time X" flag. Both are useful in different circumstances.

A configurable parameter for allowable lateness, with the upper bound
set by the driver based on the hardware capabilities, seems ideal.
Barring that, I would suggest dropping frames with already-missed
launch times.

> 
>   - Should the kernel inform the user if it detects a missed deadline,
>     via the error queue for example?

I think some sort of counter for mis-scheduled/late-delivered frames
would be in keeping with the general 802.1 error handling strategy.

> 
>   - What should the timescale be for the dialed Tx time?  Should the
>     kernel select UTC when using the SW Qdisc and the HW time
>     otherwise?  Or should the socket option include a clockid_t?

When I implemented something like this, I left it relative to the HW
time for the sake of simplicity, but I don't have a strong opinion.

> 
> * Things todo
> 
>   - Design a Qdisc for purpose of configuring SO_TXTIME.  There should
>     be one option to dial HW offloading or SW best effort.

You seem focused on Qbv, but there is another aspect of the endpoint
requirements for Qav that this would provide a perfect use case for. A
bridge can treat all traffic in a Qav-shaped class equally, but an
endpoint must essentially run one credit-based shaper per active stream
feeding into the class--this is because a stream must adhere to its
frames-per-interval promise in its t-spec, and when the observation
interval is not an even multiple of the sample rate, it will occasionally
have an observation interval with no frame. This leaves extra bandwidth
in the class reservation, but it cannot be used by any other stream if
it would cause more than one frame per interval to be sent!

Even if a stream is not explicitly scheduled in userspace, a per-stream
Qdisc could apply a rough launch time that the class Qdisc (or hardware
shaping) would use to ensure the frames-per-interval aspect of the
reservation for the stream is adhered to. For example, each observation
interval could be assigned a launch time, and all streams would get a
number of frames corresponding to their frames-per-interval reservation
assigned that same launch time before being put into the class queue.
The i210's shaper would then only consider the current interval's set 
of frames ready to launch, and spread them evenly with its hardware
credit-based shaping.

For industrial and automotive control applications, a Qbv Qdisc based on
SO_TXTIME would be very interesting, but pro and automotive media uses
will most likely continue to use SRP + Qav, and these are becoming
increasingly common uses as you can see by the growing support for Qav in
automotive chips.

>   - Implement the SW best effort variant.  Here is my back of the
>     napkin sketch.  Each interface has its own timerqueue keeping the
>     TXTIME packets in order and a FIFO for all other traffic.  A guard
>     window starts at the earliest deadline minus the maximum MTU minus
>     a configurable fudge factor.  The Qdisc uses a hrtimer to transmit
>     the next packet in the timerqueue.  During the guard window, all
>     other traffic is defered unless the next packet can be transmitted
>     before the guard window expires.

This sounds plausible to me.

> 
> * Current limitations
> 
>   - The driver does not handle out of order packets.  If user space
>     sends a packet with an earlier Tx time, then the code should stop
>     the queue, reshuffle the descriptors accordingly, and then
>     restart the queue.

You might store the last scheduled timestamp in the driver private struct
and drop any frame with a timestamp not greater or equal to the last one.

> 
>   - The driver does not correctly queue up packets in the distant
>     future.  The i210 has a limited time window of +/- 0.5 seconds.
>     Packets with a Tx time greater than that should be deferred in
>     order to enqueue them later on.

The limit is not half a second in the future, but half a second from the
previous scheduled frame if one is enqueued. Another use case for the last
scheduled frame field. There are definitely cases that might need to be
deferred though.

> 
> * Performance measurements
> 
>   1. Prepared a PC and the Device Under Test (DUT) each with an Intel
>      i210 card connected with a crossover cable.
>   2. The DUT was a Pentium(R) D CPU 2.80GHz running PREEMPT_RT
>      4.9.40-rt30 with about 50 usec maximum latency under cyclictest.
>   3. Synchronized the DUT's PHC to the PC's PHC using ptp4l.
>   4. Synchronized the DUT's system clock to its PHC using phc2sys.
>   5. Started netperf to produce some network load.
>   6. Measured the arrival time of the packets at the PC's PHC using
>      hardware time stamping.
> 
>   I ran ten minute tests both with and without using the so_txtime
>   option, with a period was 1 millisecond.  I then repeated the
>   so_txtime case but with a 250 microsecond period.  The measured
>   offset from the expected period (in nanoseconds) is shown in the
>   following table.
> 
>   |         | plain preempt_rt |     so_txtime | txtime @ 250 us |
>   |---------+------------------+---------------+-----------------|
>   | min:    |    +1.940800e+04 | +4.720000e+02 |   +4.720000e+02 |
>   | max:    |    +7.556000e+04 | +5.680000e+02 |   +5.760000e+02 |
>   | pk-pk:  |    +5.615200e+04 | +9.600000e+01 |   +1.040000e+02 |
>   | mean:   |    +3.292776e+04 | +5.072274e+02 |   +5.073602e+02 |
>   | stddev: |    +6.514709e+03 | +1.310849e+01 |   +1.507144e+01 |
>   | count:  |           600000 |        600000 |         2400000 |
> 
>   Using so_txtime, the peak to peak jitter is about 100 nanoseconds,
>   independent of the period.  In contrast, plain preempt_rt shows a
>   jitter of of 56 microseconds.  The average delay of 507 nanoseconds
>   when using so_txtime is explained by the documented input and output
>   delays on the i210 cards.
> 
>   The test program is appended, below.  If anyone is interested in
>   reproducing this test, I can provide helper scripts.
> 
> Thanks,
> Richard
> 

< most of test program snipped >

> 
> 	/*
> 	 * We specify the transmission time in the CMSG.
> 	 */
> 	if (use_so_txtime) {
> 		msg.msg_control = u.buf;
> 		msg.msg_controllen = sizeof(u.buf);
> 		cmsg = CMSG_FIRSTHDR(&msg);
> 		cmsg->cmsg_level = SOL_SOCKET;
> 		cmsg->cmsg_type = SO_TXTIME;
> 		cmsg->cmsg_len = CMSG_LEN(sizeof(__u64));
> 		*((__u64 *) CMSG_DATA(cmsg)) = txtime;
> 	}
> 	cnt = sendmsg(fd, &msg, 0);

An interesting use case I have explored is to increase efficiency by batching
transmissions with sendmmsg. This is attractive when getting large chunks of
audio data from ALSA and scheduling them for transmit all at once.

Anyway, I am wholly in favor of this proposal--in fact, it is very similar to
a patch set I shared with Eric Mann and others at Intel in early Dec 2016 with
the intention to get some early feedback before submitting here. I never heard
back and got busy with other things. I only mention this since you said
elsewhere that you got this idea from Eric Mann yourself, and I am curious
whether Eric and I came up with it independently (which I would not be
surprised at).

Levi

^ permalink raw reply

* Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map
From: Peter Zijlstra @ 2017-09-20 17:26 UTC (permalink / raw)
  To: Yonghong Song; +Cc: rostedt, ast, daniel, netdev, kernel-team, David Miller
In-Reply-To: <20170920060935.1102268-2-yhs@fb.com>

On Tue, Sep 19, 2017 at 11:09:32PM -0700, Yonghong Song wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e691b7..2d5bbe5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event *event)
>   *     will not be local and we cannot read them atomically
>   *   - must not have a pmu::count method
>   */
> -int perf_event_read_local(struct perf_event *event, u64 *value)
> +int perf_event_read_local(struct perf_event *event, u64 *value,
> +			  u64 *enabled, u64 *running)
>  {
>  	unsigned long flags;
>  	int ret = 0;
> +	u64 now;
>  
>  	/*
>  	 * Disabling interrupts avoids all counter scheduling (context
> @@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>  		goto out;
>  	}
>  
> +	now = event->shadow_ctx_time + perf_clock();
> +	if (enabled)
> +		*enabled = now - event->tstamp_enabled;
>  	/*
>  	 * If the event is currently on this CPU, its either a per-task event,
>  	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
>  	 * oncpu == -1).
>  	 */
> -	if (event->oncpu == smp_processor_id())
> +	if (event->oncpu == smp_processor_id()) {
>  		event->pmu->read(event);
> -
> +		if (running)
> +			*running = now - event->tstamp_running;
> +	} else if (running) {
> +		*running = event->total_time_running;
> +	}
>  	*value = local64_read(&event->count);
>  out:
>  	local_irq_restore(flags);

Yeah, this looks about right.

Dave, could we have this in a topic tree of sorts, because I have a
pending series to rework all the timekeeping and it might be nice to not
have sfr run into all sorts of conflicts.

^ permalink raw reply

* [PATCH net-next 5/5] udp: perform full socket lookup in early demux
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1505926196.git.pabeni@redhat.com>

Since UDP early demux lookup fetches noref socket references,
we can safely be optimistic about it and set the sk reference
even if the skb is not going to land on such socket, avoiding
the rx dst cache usage for unconnected unicast sockets.

This avoids a second lookup for unconnected sockets, and clean
up a bit the whole udp early demux code.

After this change, on hosts not acting as routers, the UDP
early demux never affect negatively the receive performances,
while before this change UDP early demux caused measurable
performance impact for unconnected sockets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/udp.h |  2 ++
 net/ipv4/udp.c      | 62 +++++++++++++++++++----------------------------------
 net/ipv6/udp.c      | 57 ++++++++++++------------------------------------
 3 files changed, 38 insertions(+), 83 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..9c68b57543cc 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -92,6 +92,8 @@ static inline struct udp_sock *udp_sk(const struct sock *sk)
 	return (struct udp_sock *)sk;
 }
 
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie);
+
 static inline void udp_set_no_check6_tx(struct sock *sk, bool val)
 {
 	udp_sk(sk)->no_check6_tx = val;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ba49d5aa9f09..5cbbd78024dc 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2043,6 +2043,11 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
 							 inet_compute_pseudo);
 }
 
+static bool udp_use_rx_dst_cache(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_state == TCP_ESTABLISHED || skb->pkt_type != PACKET_HOST;
+}
+
 /*
  *	All we need to do is get the socket, and then do a checksum.
  */
@@ -2088,8 +2093,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
-			udp_sk_rx_dst_set(sk, dst);
+		if (udp_use_rx_dst_cache(sk, skb))
+			dst_update(&sk->sk_rx_dst, dst);
 
 		ret = udp_queue_rcv_skb(sk, skb);
 		if (!noref_sk)
@@ -2196,42 +2201,28 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 	return result;
 }
 
-/* For unicast we should only early demux connected sockets or we can
- * break forwarding setups.  The chains here can be long so only check
- * if the first socket is an exact match and if not move on.
- */
-static struct sock *__udp4_lib_demux_lookup(struct net *net,
-					    __be16 loc_port, __be32 loc_addr,
-					    __be16 rmt_port, __be32 rmt_addr,
-					    int dif, int sdif)
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie)
 {
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
+	struct dst_entry *dst = dst_access(&sk->sk_rx_dst, cookie);
 
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (INET_MATCH(sk, net, acookie, rmt_addr,
-			       loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
+	if (dst) {
+		/* set noref for now.
+		 * any place which wants to hold dst has to call
+		 * dst_hold_safe()
+		 */
+		skb_dst_set_noref(skb, dst);
 	}
-	return NULL;
 }
+EXPORT_SYMBOL_GPL(udp_set_skb_rx_dst);
 
 void udp_v4_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
+	int dif = skb->dev->ifindex;
+	int sdif = inet_sdif(skb);
 	const struct iphdr *iph;
 	const struct udphdr *uh;
 	struct sock *sk = NULL;
-	struct dst_entry *dst;
-	int dif = skb->dev->ifindex;
-	int sdif = inet_sdif(skb);
 	int ours;
 
 	/* validate the packet */
@@ -2260,25 +2251,16 @@ void udp_v4_early_demux(struct sk_buff *skb)
 						   uh->source, iph->saddr,
 						   dif, sdif);
 	} else if (skb->pkt_type == PACKET_HOST) {
-		sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
-					     uh->source, iph->saddr, dif, sdif);
+		sk = __udp4_lib_lookup(net, iph->saddr, uh->source, iph->daddr,
+				       uh->dest, dif, sdif, &udp_table, skb);
 	}
 
 	if (!sk)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, 0);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp_use_rx_dst_cache(sk, skb))
+		udp_set_skb_rx_dst(sk, skb, 0);
 }
 
 int udp_rcv(struct sk_buff *skb)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8f62392c4c35..67d340679c3a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -773,13 +773,18 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 
 static void udp6_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	if (udp_sk_rx_dst_set(sk, dst)) {
+	if (unlikely(dst_update(&sk->sk_rx_dst, dst))) {
 		const struct rt6_info *rt = (const struct rt6_info *)dst;
 
 		inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
 	}
 }
 
+static bool udp6_use_rx_dst_cache(struct sock *sk)
+{
+	return sk->sk_state == TCP_ESTABLISHED;
+}
+
 int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		   int proto)
 {
@@ -830,7 +835,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
+		if (udp6_use_rx_dst_cache(sk))
 			udp6_sk_rx_dst_set(sk, dst);
 
 		ret = udpv6_queue_rcv_skb(sk, skb);
@@ -905,37 +910,13 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	return 0;
 }
 
-
-static struct sock *__udp6_lib_demux_lookup(struct net *net,
-			__be16 loc_port, const struct in6_addr *loc_addr,
-			__be16 rmt_port, const struct in6_addr *rmt_addr,
-			int dif, int sdif)
-{
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp6_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
-
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (sk->sk_state == TCP_ESTABLISHED &&
-		    INET6_MATCH(sk, net, rmt_addr, loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
-	}
-	return NULL;
-}
-
 static void udp_v6_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
-	const struct udphdr *uh;
-	struct sock *sk;
-	struct dst_entry *dst;
 	int dif = skb->dev->ifindex;
 	int sdif = inet6_sdif(skb);
+	const struct udphdr *uh;
+	struct sock *sk;
 
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) +
 	    sizeof(struct udphdr)))
@@ -944,10 +925,9 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	uh = udp_hdr(skb);
 
 	if (skb->pkt_type == PACKET_HOST)
-		sk = __udp6_lib_demux_lookup(net, uh->dest,
-					     &ipv6_hdr(skb)->daddr,
-					     uh->source, &ipv6_hdr(skb)->saddr,
-					     dif, sdif);
+		sk = __udp6_lib_lookup(net, &ipv6_hdr(skb)->saddr, uh->source,
+				       &ipv6_hdr(skb)->daddr, uh->dest, dif,
+				       sdif, &udp_table, skb);
 	else
 		return;
 
@@ -955,17 +935,8 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp6_use_rx_dst_cache(sk))
+		udp_set_skb_rx_dst(sk, skb, inet6_sk(sk)->rx_dst_cookie);
 }
 
 static __inline__ int udpv6_rcv(struct sk_buff *skb)
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 4/5] net: add simple socket-like dst cache helpers
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <cover.1505926196.git.pabeni@redhat.com>

It will be used by later patches to reduce code duplication.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/dst.h | 12 ++++++++++++
 net/core/dst.c    | 16 ++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..a6a39357f19a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -485,6 +485,18 @@ static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 	return dst;
 }
 
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst);
+static inline struct dst_entry *dst_access(struct dst_entry **cache,
+					      u32 cookie)
+{
+	struct dst_entry *dst = READ_ONCE(*cache);
+
+	if (!dst)
+		return NULL;
+
+	return dst_check(dst, cookie);
+}
+
 /* Flags for xfrm_lookup flags argument. */
 enum {
 	XFRM_LOOKUP_ICMP = 1 << 0,
diff --git a/net/core/dst.c b/net/core/dst.c
index a6c47da7d0f8..6aff0a3e7ba3 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -205,6 +205,22 @@ void dst_release_immediate(struct dst_entry *dst)
 }
 EXPORT_SYMBOL(dst_release_immediate);
 
+/* 'dst' is not refcounted */
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst)
+{
+	if (likely(*cache == dst))
+		return false;
+
+	if (dst_hold_safe(dst)) {
+		struct dst_entry *old = xchg(cache, dst);
+
+		dst_release(old);
+		return old != dst;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(dst_update);
+
 u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 {
 	struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);
-- 
2.13.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox