Netdev List
 help / color / mirror / Atom feed
* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 21:27 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <6c36fa00-13e6-8cd7-ff5c-df7739220736@itcare.pl>



W dniu 2017-09-20 o 23:25, Paweł Staszewski pisze:
>
>
> W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze:
>>
>>
>> W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze:
>>>
>>>
>>> W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze:
>>>>
>>>>
>>>> W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>>>>>
>>>>>
>>>>> W dniu 2017-09-20 o 20:36, Cong Wang pisze:
>>>>>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet 
>>>>>> <eric.dumazet@gmail.com> wrote:
>>>>>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>>>>>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>>>>>>
>>>>>>>> This is very odd.
>>>>>>>>
>>>>>>>> We only free netdevice in free_netdev() and it is only called when
>>>>>>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>>>>>>> to be NULL.
>>>>>>> If there is a missing dev_hold() or one dev_put() in excess,
>>>>>>> this would allow the netdev to be freed too soon.
>>>>>>>
>>>>>>> -> Use after free.
>>>>>>> memory holding netdev could be reallocated-cleared by some other 
>>>>>>> kernel
>>>>>>> user.
>>>>>>>
>>>>>> Sure, but only unregister could trigger a free. If there is no 
>>>>>> unregister,
>>>>>> like what Pawel claims, then there is no free, the refcnt just 
>>>>>> goes to
>>>>>> 0 but the memory is still there.
>>>>>>
>>>>> About possible mistake from my side with bisect - i can judge too 
>>>>> early that some bisect was good
>>>>> the road was:
>>>>> git bisect start
>>>>> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
>>>>> 'pinctrl-v4.13-1' of 
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
>>>>> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
>>>>> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
>>>>> 'next' of 
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
>>>>> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
>>>>> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid 
>>>>> using stack larger than 1024.
>>>>> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
>>>>> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
>>>>> 'udp-reduce-cache-pressure'
>>>>> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
>>>>> # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
>>>>> 's390-net-updates-part-2'
>>>>> git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
>>>>> # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
>>>>> 'bpf-ctx-narrow'
>>>>> git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
>>>>> # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: 
>>>>> remove cp_outgoing
>>>>> git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
>>>>> # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
>>>>> TCP_MD5SIG_EXT socket option to set a key address prefix
>>>>> git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
>>>>> # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce 
>>>>> a new function dst_dev_put()
>>>>>
>>>>> And currently have this running for about 4 hours without problems.
>>>>>
>>>>>
>>>>>
>>>>> git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
>>>>> # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
>>>>> DST_NOCACHE flag
>>>>>
>>>>> Here for sure - panic
>>>>>
>>>>> git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
>>>>> # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
>>>>> dst_hold_safe() properly
>>>>> git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
>>>>> # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
>>>>> dst_hold_safe() properly
>>>>> git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
>>>>> # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
>>>>> dst->__refcnt for insertion into fib6 tree
>>>>>
>>>>> im not 100% sure tor last two
>>>>> Will test them again starting from
>>>>> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call 
>>>>> dst_dev_put() properly
>>>>>
>>>>>
>>>>> git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
>>>>> # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark 
>>>>> DST_NOGC and remove the operation of dst_free()
>>>>>
>>>>>
>>>>>
>>>>> git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
>>>>> # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] 
>>>>> ipv4: mark DST_NOGC and remove the operation of dst_free()
>>>>>
>>>>>
>>>>>
>>>> What i can say more
>>>> I can reproduce this on any server with similar configuration
>>>> the difference can be teamd instead of bonding
>>>> ixgbe or i40e and mlx5
>>>> Same problems
>>>>
>>>> vlans - more or less prefixes learned from bgp -> zebra -> netlink 
>>>> -> kernel
>>>> But normally in lab when using only plain routing no bgpd and about 
>>>> 128 vlans - with 128 routes - cant reproduce this - this apperas 
>>>> only with bgp - minimum where i can reproduce this was about 130k 
>>>> prefixes with about 286 nexthops
>>>>
>>>>
>>>>
>>>>
>>> bisected again and same result:
>>> b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
>>> commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
>>> Author: Wei Wang <weiwan@google.com>
>>> Date:   Sat Jun 17 10:42:32 2017 -0700
>>>
>>>     ipv4: mark DST_NOGC and remove the operation of dst_free()
>>>
>>>     With the previous preparation patches, we are ready to get rid 
>>> of the
>>>     dst gc operation in ipv4 code and release dst based on refcnt only.
>>>     So this patch adds DST_NOGC flag for all IPv4 dst and remove the 
>>> calls
>>>     to dst_free().
>>>     At this point, all dst created in ipv4 code do not use the dst gc
>>>     anymore and will be destroyed at the point when refcnt drops to 0.
>>>
>>>     Signed-off-by: Wei Wang <weiwan@google.com>
>>>     Acked-by: Martin KaFai Lau <kafai@fb.com>
>>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>>
>>> :040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 
>>> 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M      net
>>>
>>> Will add now version 2 of patch from Eric and we will see
>>>
>>>
>> after adding patch
>> perf top catch
>>    PerfTop:   77159 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz 
>> cycles],  (all, 40 CPUs)
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
>>
>>
>>     60.95%  [kernel]            [k] dev_put.part.6
>>      4.00%  [kernel]            [k] ixgbe_poll
>>      3.63%  [kernel]            [k] irq_entries_start
>>      1.22%  [kernel]            [k] fib_table_lookup
>>      1.15%  [kernel]            [k] do_raw_spin_lock
>>      1.05%  [kernel]            [k] ixgbe_xmit_frame_ring
>>      1.04%  [kernel]            [k] lookup
>>      0.87%  [kernel]            [k] eth_type_trans
>>
>>
>> no panic on console - rebooting to check logs
>>
>>
> Nothing logged
>
>
after adding this patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
   */
  static inline void dev_put(struct net_device *dev)
  {
-	this_cpu_dec(*dev->pcpu_refcnt);
+	int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+	if (!pref) {
+		pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+		       dev, dev->name, dev->reg_state, dev->dismantle);
+		for (;;)
+			cpu_relax();
+	}
+	this_cpu_dec(*pref);
  }
  
  /**


Have just halted console - no output no reaction on kbd
nothing in any syslog/log and catched only something from perf top

^ permalink raw reply related

* Re: [PATCH net-next] udp: do rmem bulk free even if the rx sk queue is empty
From: David Miller @ 2017-09-20 21:29 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, edumazet
In-Reply-To: <f82856528a033394c407539d1ff4b92e243bad36.1505815819.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 19 Sep 2017 12:11:43 +0200

> The commit 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
> reduced greatly the cacheline contention between the BH and the US
> reader batching the rmem updates in most scenarios.
> 
> Such optimization is explicitly avoided if the US reader is faster
> then BH processing.
> 
> My fault, I initially suggested this kind of behavior due to concerns
> of possible regressions with small sk_rcvbuf values. Tests showed
> such concerns are misplaced, so this commit relaxes the condition
> for rmem bulk updates, obtaining small but measurable performance
> gain in the scenario described above.
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied, thanks Paolo.

^ permalink raw reply

* Re: [PATCH net-next 3/3] virtio-net: support XDP_REDIRECT
From: David Miller @ 2017-09-20 21:28 UTC (permalink / raw)
  To: jasowang; +Cc: mst, virtualization, netdev, linux-kernel, john.fastabend
In-Reply-To: <1505814163-7315-3-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Tue, 19 Sep 2017 17:42:43 +0800

> This patch tries to add XDP_REDIRECT for virtio-net. The changes are
> not complex as we could use exist XDP_TX helpers for most of the
> work. The rest is passing the XDP_TX to NAPI handler for implementing
> batching.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/3] virtio-net: add packet len average only when needed during XDP
From: David Miller @ 2017-09-20 21:28 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, virtualization, john.fastabend, linux-kernel, mst
In-Reply-To: <1505814163-7315-2-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Tue, 19 Sep 2017 17:42:42 +0800

> There's no need to add packet len average in the case of XDP_PASS
> since it will be done soon after skb is created.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 1/3] virtio-net: remove unnecessary parameter of virtnet_xdp_xmit()
From: David Miller @ 2017-09-20 21:28 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, virtualization, john.fastabend, linux-kernel, mst
In-Reply-To: <1505814163-7315-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Tue, 19 Sep 2017 17:42:41 +0800

> CC: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied.

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 21:25 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <26e8b899-d8b1-caa4-436b-e23ffe384b0f@itcare.pl>



W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze:
>
>
> W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze:
>>
>>
>> W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze:
>>>
>>>
>>> W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>>>>
>>>>
>>>> W dniu 2017-09-20 o 20:36, Cong Wang pisze:
>>>>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet 
>>>>> <eric.dumazet@gmail.com> wrote:
>>>>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>>>>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>>>>>
>>>>>>> This is very odd.
>>>>>>>
>>>>>>> We only free netdevice in free_netdev() and it is only called when
>>>>>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>>>>>> to be NULL.
>>>>>> If there is a missing dev_hold() or one dev_put() in excess,
>>>>>> this would allow the netdev to be freed too soon.
>>>>>>
>>>>>> -> Use after free.
>>>>>> memory holding netdev could be reallocated-cleared by some other 
>>>>>> kernel
>>>>>> user.
>>>>>>
>>>>> Sure, but only unregister could trigger a free. If there is no 
>>>>> unregister,
>>>>> like what Pawel claims, then there is no free, the refcnt just 
>>>>> goes to
>>>>> 0 but the memory is still there.
>>>>>
>>>> About possible mistake from my side with bisect - i can judge too 
>>>> early that some bisect was good
>>>> the road was:
>>>> git bisect start
>>>> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
>>>> 'pinctrl-v4.13-1' of 
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
>>>> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
>>>> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
>>>> 'next' of 
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
>>>> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
>>>> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid 
>>>> using stack larger than 1024.
>>>> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
>>>> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
>>>> 'udp-reduce-cache-pressure'
>>>> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
>>>> # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
>>>> 's390-net-updates-part-2'
>>>> git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
>>>> # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
>>>> 'bpf-ctx-narrow'
>>>> git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
>>>> # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove 
>>>> cp_outgoing
>>>> git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
>>>> # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
>>>> TCP_MD5SIG_EXT socket option to set a key address prefix
>>>> git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
>>>> # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a 
>>>> new function dst_dev_put()
>>>>
>>>> And currently have this running for about 4 hours without problems.
>>>>
>>>>
>>>>
>>>> git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
>>>> # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
>>>> DST_NOCACHE flag
>>>>
>>>> Here for sure - panic
>>>>
>>>> git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
>>>> # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
>>>> dst_hold_safe() properly
>>>> git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
>>>> # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
>>>> dst_hold_safe() properly
>>>> git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
>>>> # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
>>>> dst->__refcnt for insertion into fib6 tree
>>>>
>>>> im not 100% sure tor last two
>>>> Will test them again starting from
>>>> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() 
>>>> properly
>>>>
>>>>
>>>> git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
>>>> # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark 
>>>> DST_NOGC and remove the operation of dst_free()
>>>>
>>>>
>>>>
>>>> git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
>>>> # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] 
>>>> ipv4: mark DST_NOGC and remove the operation of dst_free()
>>>>
>>>>
>>>>
>>> What i can say more
>>> I can reproduce this on any server with similar configuration
>>> the difference can be teamd instead of bonding
>>> ixgbe or i40e and mlx5
>>> Same problems
>>>
>>> vlans - more or less prefixes learned from bgp -> zebra -> netlink 
>>> -> kernel
>>> But normally in lab when using only plain routing no bgpd and about 
>>> 128 vlans - with 128 routes - cant reproduce this - this apperas 
>>> only with bgp - minimum where i can reproduce this was about 130k 
>>> prefixes with about 286 nexthops
>>>
>>>
>>>
>>>
>> bisected again and same result:
>> b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
>> commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
>> Author: Wei Wang <weiwan@google.com>
>> Date:   Sat Jun 17 10:42:32 2017 -0700
>>
>>     ipv4: mark DST_NOGC and remove the operation of dst_free()
>>
>>     With the previous preparation patches, we are ready to get rid of 
>> the
>>     dst gc operation in ipv4 code and release dst based on refcnt only.
>>     So this patch adds DST_NOGC flag for all IPv4 dst and remove the 
>> calls
>>     to dst_free().
>>     At this point, all dst created in ipv4 code do not use the dst gc
>>     anymore and will be destroyed at the point when refcnt drops to 0.
>>
>>     Signed-off-by: Wei Wang <weiwan@google.com>
>>     Acked-by: Martin KaFai Lau <kafai@fb.com>
>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>> :040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 
>> 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M      net
>>
>> Will add now version 2 of patch from Eric and we will see
>>
>>
> after adding patch
> perf top catch
>    PerfTop:   77159 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz 
> cycles],  (all, 40 CPUs)
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
>
>
>     60.95%  [kernel]            [k] dev_put.part.6
>      4.00%  [kernel]            [k] ixgbe_poll
>      3.63%  [kernel]            [k] irq_entries_start
>      1.22%  [kernel]            [k] fib_table_lookup
>      1.15%  [kernel]            [k] do_raw_spin_lock
>      1.05%  [kernel]            [k] ixgbe_xmit_frame_ring
>      1.04%  [kernel]            [k] lookup
>      0.87%  [kernel]            [k] eth_type_trans
>
>
> no panic on console - rebooting to check logs
>
>
Nothing logged

^ permalink raw reply

* Re: [PATCH] bgmac: Remove all offloading features, including GRO.
From: rosenp @ 2017-09-20 21:27 UTC (permalink / raw)
  To: Florian Fainelli, Eric Dumazet; +Cc: netdev
In-Reply-To: <0CEBE536-2C18-4ED2-90CB-3836CEF9A70E@gmail.com>

Sorry for the noise. After more testing I've found out that the cause
was that I had BBR enabled on my laptop. Switching back to CUBIC fixed
the issue.

In other words, this patch is detrimental.

~67mbps - gro off
~87mbps - gro on

On Fri, 2017-09-15 at 23:04 -0700, Florian Fainelli wrote:
> On September 15, 2017 5:38:42 PM PDT, rosenp@gmail.com wrote:
> > I have not. Unfortunately I own no gigabit hardware to test this
> > on.
> > The MIPS CPU runs at 300MHz on my unit.
> > 
> 
> bgmac is used on Gigabit capable hardware, like Northstar and
> Northstar Plus, and others too, so unless you can get access to such
> HW or get confirmation from someone that your patches changes
> something, I would just drop this change and not bother. This is
> already not 100mbits/sec linerate...
> 
> > On Fri, 2017-09-15 at 17:34 -0700, Eric Dumazet wrote:
> > > On Fri, 2017-09-15 at 17:23 -0700, Rosen Penev wrote:
> > > > On a linksys E1200v1 (actually a crossflashed E1000v2), the
> > > > offloading features give no measurable benefit to speed or
> > > > latency.
> > > > Furthermore, disabling GRO actually improves iperf performance
> > > > by a
> > > > whoppimg 3mbps. Results:
> > > > 
> > > > Currently:
> > > > 
> > > > v2: Changed napi_gro_receive to netif_receive_skb. Seems to
> > > > have an
> > > > identical result.
> > > > 
> > > > Signed-off-by: Rosen Penev <rosenp@gmail.com>
> > > > ---
> > > >  drivers/net/ethernet/broadcom/bgmac.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/broadcom/bgmac.c
> > > > b/drivers/net/ethernet/broadcom/bgmac.c
> > > > index 48d672b204a4..1fb0053aeee7 100644
> > > > --- a/drivers/net/ethernet/broadcom/bgmac.c
> > > > +++ b/drivers/net/ethernet/broadcom/bgmac.c
> > > > @@ -483,7 +483,7 @@ static int bgmac_dma_rx_read(struct bgmac
> > > > *bgmac, struct bgmac_dma_ring *ring,
> > > >  			skb->protocol = eth_type_trans(skb,
> > > > bgmac-
> > > > > net_dev);
> > > > 
> > > >  			bgmac->net_dev->stats.rx_bytes += len;
> > > >  			bgmac->net_dev->stats.rx_packets++;
> > > > -			napi_gro_receive(&bgmac->napi, skb);
> > > > +			netif_receive_skb(skb);
> > > >  			handled++;
> > > >  		} while (0);
> > > >  
> > > 
> > > And have you tested 1Gbit link speed ?
> > > ( Or 2.5 Gbit link speed )
> > > 
> > > If you want to disable GRO on your host, fine : you can use
> > > ethtool
> > > -K
> > > 
> > > 
> > > 
> 
> (please don't top-post)

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 21:24 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <7d8731ee-6c91-bbfd-6313-c95f52d7b49a@itcare.pl>



W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze:
>
>
> W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze:
>>
>>
>> W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>>>
>>>
>>> W dniu 2017-09-20 o 20:36, Cong Wang pisze:
>>>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet 
>>>> <eric.dumazet@gmail.com> wrote:
>>>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>>>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>>>>
>>>>>> This is very odd.
>>>>>>
>>>>>> We only free netdevice in free_netdev() and it is only called when
>>>>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>>>>> to be NULL.
>>>>> If there is a missing dev_hold() or one dev_put() in excess,
>>>>> this would allow the netdev to be freed too soon.
>>>>>
>>>>> -> Use after free.
>>>>> memory holding netdev could be reallocated-cleared by some other 
>>>>> kernel
>>>>> user.
>>>>>
>>>> Sure, but only unregister could trigger a free. If there is no 
>>>> unregister,
>>>> like what Pawel claims, then there is no free, the refcnt just goes to
>>>> 0 but the memory is still there.
>>>>
>>> About possible mistake from my side with bisect - i can judge too 
>>> early that some bisect was good
>>> the road was:
>>> git bisect start
>>> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
>>> 'pinctrl-v4.13-1' of 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
>>> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
>>> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
>>> 'next' of 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
>>> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
>>> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid 
>>> using stack larger than 1024.
>>> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
>>> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
>>> 'udp-reduce-cache-pressure'
>>> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
>>> # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
>>> 's390-net-updates-part-2'
>>> git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
>>> # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
>>> 'bpf-ctx-narrow'
>>> git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
>>> # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove 
>>> cp_outgoing
>>> git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
>>> # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
>>> TCP_MD5SIG_EXT socket option to set a key address prefix
>>> git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
>>> # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a 
>>> new function dst_dev_put()
>>>
>>> And currently have this running for about 4 hours without problems.
>>>
>>>
>>>
>>> git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
>>> # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
>>> DST_NOCACHE flag
>>>
>>> Here for sure - panic
>>>
>>> git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
>>> # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
>>> dst_hold_safe() properly
>>> git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
>>> # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
>>> dst_hold_safe() properly
>>> git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
>>> # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
>>> dst->__refcnt for insertion into fib6 tree
>>>
>>> im not 100% sure tor last two
>>> Will test them again starting from
>>> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() 
>>> properly
>>>
>>>
>>> git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
>>> # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark 
>>> DST_NOGC and remove the operation of dst_free()
>>>
>>>
>>>
>>> git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
>>> # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: 
>>> mark DST_NOGC and remove the operation of dst_free()
>>>
>>>
>>>
>> What i can say more
>> I can reproduce this on any server with similar configuration
>> the difference can be teamd instead of bonding
>> ixgbe or i40e and mlx5
>> Same problems
>>
>> vlans - more or less prefixes learned from bgp -> zebra -> netlink -> 
>> kernel
>> But normally in lab when using only plain routing no bgpd and about 
>> 128 vlans - with 128 routes - cant reproduce this - this apperas only 
>> with bgp - minimum where i can reproduce this was about 130k prefixes 
>> with about 286 nexthops
>>
>>
>>
>>
> bisected again and same result:
> b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
> commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
> Author: Wei Wang <weiwan@google.com>
> Date:   Sat Jun 17 10:42:32 2017 -0700
>
>     ipv4: mark DST_NOGC and remove the operation of dst_free()
>
>     With the previous preparation patches, we are ready to get rid of the
>     dst gc operation in ipv4 code and release dst based on refcnt only.
>     So this patch adds DST_NOGC flag for all IPv4 dst and remove the 
> calls
>     to dst_free().
>     At this point, all dst created in ipv4 code do not use the dst gc
>     anymore and will be destroyed at the point when refcnt drops to 0.
>
>     Signed-off-by: Wei Wang <weiwan@google.com>
>     Acked-by: Martin KaFai Lau <kafai@fb.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>
> :040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 
> 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M      net
>
> Will add now version 2 of patch from Eric and we will see
>
>
after adding patch
perf top catch
    PerfTop:   77159 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz 
cycles],  (all, 40 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

     60.95%  [kernel]            [k] dev_put.part.6
      4.00%  [kernel]            [k] ixgbe_poll
      3.63%  [kernel]            [k] irq_entries_start
      1.22%  [kernel]            [k] fib_table_lookup
      1.15%  [kernel]            [k] do_raw_spin_lock
      1.05%  [kernel]            [k] ixgbe_xmit_frame_ring
      1.04%  [kernel]            [k] lookup
      0.87%  [kernel]            [k] eth_type_trans


no panic on console - rebooting to check logs

^ permalink raw reply

* Re: [PATCH v4 4/4] samples/bpf: Add documentation on cross compilation
From: Daniel Borkmann @ 2017-09-20 21:25 UTC (permalink / raw)
  To: Joel Fernandes, linux-kernel
  Cc: netdev, alison, juri.lelli, fengc, davem, ast, kernel-team
In-Reply-To: <20170920161159.25747-4-joelaf@google.com>

On 09/20/2017 06:11 PM, Joel Fernandes wrote:
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Joel Fernandes <joelaf@google.com>

(Minor typo pointed out by Randy, but rest looks fine.)

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [Patch v3 2/3] ipv4: Namespaceify tcp_fastopen_key knob
From: David Miller @ 2017-09-20 21:25 UTC (permalink / raw)
  To: yanhaishuang; +Cc: kuznet, edumazet, weiwan, lucab, netdev, linux-kernel
In-Reply-To: <1505813896-12121-2-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Tue, 19 Sep 2017 17:38:15 +0800

> @@ -128,6 +130,8 @@ struct netns_ipv4 {
>  	struct inet_timewait_death_row tcp_death_row;
>  	int sysctl_max_syn_backlog;
>  	int sysctl_tcp_fastopen;
> +	struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
> +	spinlock_t tcp_fastopen_ctx_lock;
>  
>  #ifdef CONFIG_NET_L3_MASTER_DEV
>  	int sysctl_udp_l3mdev_accept;

Where are you releasing this context during netns teardown?
I think this is a leak.

^ permalink raw reply

* Re: [PATCH v4 3/4] samples/bpf: Fix pt_regs issues when cross-compiling
From: Daniel Borkmann @ 2017-09-20 21:25 UTC (permalink / raw)
  To: Joel Fernandes, linux-kernel
  Cc: netdev, alison, juri.lelli, fengc, davem, ast, kernel-team
In-Reply-To: <20170920161159.25747-3-joelaf@google.com>

On 09/20/2017 06:11 PM, Joel Fernandes wrote:
> BPF samples fail to build when cross-compiling for ARM64 because of incorrect
> pt_regs param selection. This is because clang defines __x86_64__ and
> bpf_headers thinks we're building for x86. Since clang is building for the BPF
> target, it shouldn't make assumptions about what target the BPF program is
> going to run on. To fix this, lets pass ARCH so the header knows which target
> the BPF program is being compiled for and can use the correct pt_regs code.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Joel Fernandes <joelaf@google.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH v4 2/4] samples/bpf: Enable cross compiler support
From: Daniel Borkmann @ 2017-09-20 21:24 UTC (permalink / raw)
  To: Joel Fernandes, linux-kernel
  Cc: netdev, alison, juri.lelli, fengc, davem, ast, kernel-team
In-Reply-To: <20170920161159.25747-2-joelaf@google.com>

On 09/20/2017 06:11 PM, Joel Fernandes wrote:
> When cross compiling, bpf samples use HOSTCC for compiling the non-BPF part of
> the sample, however what we really want is to use the cross compiler to build
> for the cross target since that is what will load and run the BPF sample.
> Detect this and compile samples correctly.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Joel Fernandes <joelaf@google.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH v4 1/4] samples/bpf: Use getppid instead of getpgrp for array map stress
From: Daniel Borkmann @ 2017-09-20 21:24 UTC (permalink / raw)
  To: Joel Fernandes, linux-kernel
  Cc: netdev, alison, juri.lelli, fengc, davem, ast, kernel-team
In-Reply-To: <20170920161159.25747-1-joelaf@google.com>

On 09/20/2017 06:11 PM, Joel Fernandes wrote:
> When cross-compiling the bpf sample map_perf_test for aarch64, I find that
> __NR_getpgrp is undefined. This causes build errors. This syscall is deprecated
> and requires defining __ARCH_WANT_SYSCALL_DEPRECATED. To avoid having to define
> that, just use a different syscall (getppid) for the array map stress test.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Joel Fernandes <joelaf@google.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [Patch v3 1/3] ipv4: Namespaceify tcp_fastopen knob
From: David Miller @ 2017-09-20 21:22 UTC (permalink / raw)
  To: yanhaishuang; +Cc: kuznet, edumazet, weiwan, lucab, netdev, linux-kernel
In-Reply-To: <1505813896-12121-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Tue, 19 Sep 2017 17:38:14 +0800

> -		if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) &&
> -		    (sysctl_tcp_fastopen & TFO_SERVER_ENABLE) &&
> +		tcp_fastopen =  sock_net(sk)->ipv4.sysctl_tcp_fastopen;
                              ^^

Please change that to one space.

And also please provide an appropriate "[PATCH vX 0/3] " header
posting when you respin this series.

> @@ -282,18 +280,19 @@ struct sock *tcp_try_fastopen(struct sock *sk, struct sk_buff *skb,
>  	struct tcp_fastopen_cookie valid_foc = { .len = -1 };
>  	bool syn_data = TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq + 1;
>  	struct sock *child;
> +	int tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen;

Please order local variables from longest to shortest line (aka. reverse
christmas tree format).

^ permalink raw reply

* Re: [PATCH v3 14/31] vxfs: Define usercopy region in vxfs_inode slab cache
From: Kees Cook @ 2017-09-20 21:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, David Windsor, linux-fsdevel@vger.kernel.org,
	Network Development, Linux-MM,
	kernel-hardening@lists.openwall.com
In-Reply-To: <20170920205642.GA20023@infradead.org>

On Wed, Sep 20, 2017 at 1:56 PM, Christoph Hellwig <hch@infradead.org> wrote:
> Hi Kees,
>
> I've only got this single email from you, which on it's own doesn't
> compile and seems to be part of a 31 patch series.
>
> So as-is NAK, doesn't work.
>
> Please make sure to always send every patch in a series to every
> developer you want to include.

This is why I included several other lists on the full CC (am I
unlucky enough to have you not subscribed to any of them?). Adding a
CC for everyone can result in a huge CC list, especially for the
forth-coming 300-patch timer_list series. ;)

Do you want me to resend the full series to you, or would you prefer
something else like a patchwork bundle? (I'll explicitly add you to CC
for any future versions, though.)

-Kees

-- 
Kees Cook
Pixel Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: ipv4 ID calculation
From: Stephen Hemminger @ 2017-09-20 21:15 UTC (permalink / raw)
  To: Harsha Chenji; +Cc: netdev
In-Reply-To: <CAMB9WxJpwdZ2gFygTdWZmpLoCG0A79ReCbts-phpT76JfTPHsQ@mail.gmail.com>

On Mon, 18 Sep 2017 20:43:05 -0400
Harsha Chenji <cjkernel@gmail.com> wrote:

> Hi all,
> 
> Where is the ID field of the IPv4 header created when the DF flag is
> set? I am looking at ip_build_and_send_pkt. The code seems to have
> changed in 4.4-rc1:
> 
> if (ip_dont_fragment(sk, &rt->dst)) {
>     iph->frag_off = htons(IP_DF);
>     iph->id = 0;
> } else {
>     iph->frag_off = 0;
>     __ip_select_ident(net, iph, 1);
> }
> 
> old code (executed irrespective of DF or not):
> 
>     ip_select_ident(sock_net(sk), skb, sk);
> 
> The code in Stevens is basically iph->id = htons(ip_ident++) and now
> it seems to be calculated based on a hash + lookup table.
> 
> So where is the id of 0 overwritten when DF is set? Didn't find any
> info in the docs.

IP id doesn't matter if Dont Fragment bit is set. The IP id is used
by receiver (and firewalls) to coalesce fragments by the ID. If DF
is set no system in the path is supposed to fragment. The idea is
to reduce the number of possible cases of fragment collisions.



> P.S. - is this the right mailing list for these kind of questions?

Sort of, the list is more about technical discussions and patches.

You could find more info by using git blame to find the commit that
introduced the change, then read the log for that.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: lan9303: Add adjust_link() method
From: David Miller @ 2017-09-20 21:14 UTC (permalink / raw)
  To: privat; +Cc: andrew, vivien.didelot, f.fainelli, netdev, linux-kernel
In-Reply-To: <20170919080924.11144-1-privat@egil-hjelmeland.no>

From: Egil Hjelmeland <privat@egil-hjelmeland.no>
Date: Tue, 19 Sep 2017 10:09:24 +0200

> Make the driver react to device tree "fixed-link" declaration on CPU port.
> 
> - turn off autonegotiation
> - force speed 10 or 100 mb/s
> - force duplex mode
> 
> Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>

Applied, thank you.

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 21:10 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <548a893c-8a8f-3b07-00f7-d75de1d777ad@itcare.pl>



W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze:
>
>
> W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>>
>>
>> W dniu 2017-09-20 o 20:36, Cong Wang pisze:
>>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet 
>>> <eric.dumazet@gmail.com> wrote:
>>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>>>
>>>>> This is very odd.
>>>>>
>>>>> We only free netdevice in free_netdev() and it is only called when
>>>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>>>> to be NULL.
>>>> If there is a missing dev_hold() or one dev_put() in excess,
>>>> this would allow the netdev to be freed too soon.
>>>>
>>>> -> Use after free.
>>>> memory holding netdev could be reallocated-cleared by some other 
>>>> kernel
>>>> user.
>>>>
>>> Sure, but only unregister could trigger a free. If there is no 
>>> unregister,
>>> like what Pawel claims, then there is no free, the refcnt just goes to
>>> 0 but the memory is still there.
>>>
>> About possible mistake from my side with bisect - i can judge too 
>> early that some bisect was good
>> the road was:
>> git bisect start
>> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
>> 'pinctrl-v4.13-1' of 
>> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
>> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
>> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
>> 'next' of 
>> git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
>> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
>> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid 
>> using stack larger than 1024.
>> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
>> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
>> 'udp-reduce-cache-pressure'
>> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
>> # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
>> 's390-net-updates-part-2'
>> git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
>> # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
>> 'bpf-ctx-narrow'
>> git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
>> # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove 
>> cp_outgoing
>> git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
>> # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
>> TCP_MD5SIG_EXT socket option to set a key address prefix
>> git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
>> # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a 
>> new function dst_dev_put()
>>
>> And currently have this running for about 4 hours without problems.
>>
>>
>>
>> git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
>> # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
>> DST_NOCACHE flag
>>
>> Here for sure - panic
>>
>> git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
>> # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
>> dst_hold_safe() properly
>> git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
>> # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
>> dst_hold_safe() properly
>> git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
>> # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
>> dst->__refcnt for insertion into fib6 tree
>>
>> im not 100% sure tor last two
>> Will test them again starting from
>> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() 
>> properly
>>
>>
>> git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
>> # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC 
>> and remove the operation of dst_free()
>>
>>
>>
>> git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
>> # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: 
>> mark DST_NOGC and remove the operation of dst_free()
>>
>>
>>
> What i can say more
> I can reproduce this on any server with similar configuration
> the difference can be teamd instead of bonding
> ixgbe or i40e and mlx5
> Same problems
>
> vlans - more or less prefixes learned from bgp -> zebra -> netlink -> 
> kernel
> But normally in lab when using only plain routing no bgpd and about 
> 128 vlans - with 128 routes - cant reproduce this - this apperas only 
> with bgp - minimum where i can reproduce this was about 130k prefixes 
> with about 286 nexthops
>
>
>
>
bisected again and same result:
b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
Author: Wei Wang <weiwan@google.com>
Date:   Sat Jun 17 10:42:32 2017 -0700

     ipv4: mark DST_NOGC and remove the operation of dst_free()

     With the previous preparation patches, we are ready to get rid of the
     dst gc operation in ipv4 code and release dst based on refcnt only.
     So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
     to dst_free().
     At this point, all dst created in ipv4 code do not use the dst gc
     anymore and will be destroyed at the point when refcnt drops to 0.

     Signed-off-by: Wei Wang <weiwan@google.com>
     Acked-by: Martin KaFai Lau <kafai@fb.com>
     Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 
831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M      net

Will add now version 2 of patch from Eric and we will see

^ permalink raw reply

* Re: [PATCH net] bpf: one perf event close won't free bpf program attached by another perf event
From: David Miller @ 2017-09-20 21:12 UTC (permalink / raw)
  To: yhs; +Cc: peterz, rostedt, ast, daniel, netdev, kernel-team
In-Reply-To: <20170918233836.1817062-1-yhs@fb.com>

From: Yonghong Song <yhs@fb.com>
Date: Mon, 18 Sep 2017 16:38:36 -0700

> This patch fixes a bug exhibited by the following scenario:
>   1. fd1 = perf_event_open with attr.config = ID1
>   2. attach bpf program prog1 to fd1
>   3. fd2 = perf_event_open with attr.config = ID1
>      <this will be successful>
>   4. user program closes fd2 and prog1 is detached from the tracepoint.
>   5. user program with fd1 does not work properly as tracepoint
>      no output any more.
> 
> The issue happens at step 4. Multiple perf_event_open can be called
> successfully, but only one bpf prog pointer in the tp_event. In the
> current logic, any fd release for the same tp_event will free
> the tp_event->prog.
> 
> The fix is to free tp_event->prog only when the closing fd
> corresponds to the one which registered the program.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>

I've applied this and queued it up for -stable as it looks good
to me and 2 days is enough time for waiting for any other reviews.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v2] bridge: also trigger RTM_NEWLINK when interface is released from bridge
From: David Miller @ 2017-09-20 21:09 UTC (permalink / raw)
  To: vincent; +Cc: stephen, bridge, netdev, dsahern
In-Reply-To: <20170916141833.28344-1-vincent@bernat.im>

From: Vincent Bernat <vincent@bernat.im>
Date: Sat, 16 Sep 2017 16:18:33 +0200

David, I am CC:'ing you because you've done work in this area over the
past year.  I'm applying this patch, it's been sitting since the 16th
and likes entirely correct to me.  But if you have objections just let
me know.

> Currently, when an interface is released from a bridge via
> ioctl(), we get a RTM_DELLINK event through netlink:
> 
> Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
>     link/ether 6e:23:c2:54:3a:b3
> 
> Userspace has to interpret that as a removal from the bridge, not as a
> complete removal of the interface. When an bridged interface is
> completely removed, we get two events:
> 
> Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 master bridge0 state DOWN
>     link/ether 6e:23:c2:54:3a:b3
> Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
>     link/ether 6e:23:c2:54:3a:b3 brd ff:ff:ff:ff:ff:ff
> 
> In constrast, when an interface is released from a bond, we get a
> RTM_NEWLINK with only the new characteristics (no master):
> 
> 3: dummy1: <BROADCAST,NOARP,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master bond0 state UNKNOWN group default
>     link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
> 3: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
> 4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
>     link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
> 3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default
>     link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
> 3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default
>     link/ether ca:c8:7b:66:f8:25 brd ff:ff:ff:ff:ff:ff
> 4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
>     link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
> 
> Userland may be confused by the fact we say a link is deleted while
> its characteristics are only modified. A first solution would have
> been to turn the RTM_DELLINK event in del_nbp() into a RTM_NEWLINK
> event. However, maybe some piece of userland is relying on this
> RTM_DELLINK to detect when a bridged interface is released. Instead,
> we also emit a RTM_NEWLINK event once the interface is
> released (without master info).
> 
> Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
>     link/ether 8a:bb:e7:94:b1:f8
> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 8a:bb:e7:94:b1:f8 brd ff:ff:ff:ff:ff:ff
> 
> This is done only when using ioctl(). When using Netlink, such an
> event is already automatically emitted in do_setlink().
> 
> Signed-off-by: Vincent Bernat <vincent@bernat.im>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] packet: hold bind lock when rebinding to fanout hook
From: David Miller @ 2017-09-20 21:03 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: willemb, netdev, nixiaoming
In-Reply-To: <CAF=yD-J3PYMsZXxogr_9-fbDXHnTT-LBq0TXjbFbhSHXKeA65w@mail.gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Fri, 15 Sep 2017 10:07:46 -0400

> On Thu, Sep 14, 2017 at 5:14 PM, Willem de Bruijn <willemb@google.com> wrote:
>> Packet socket bind operations must hold the po->bind_lock. This keeps
>> po->running consistent with whether the socket is actually on a ptype
>> list to receive packets.
>>
>> fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
>> binds the fanout object to receive through packet_rcv_fanout.
>>
>> Make it hold the po->bind_lock when testing po->running and rebinding.
>> Else, it can race with other rebind operations, such as that in
>> packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
>> can result in a socket being added to a fanout group twice, causing
>> use-after-free KASAN bug reports, among others.
>>
>> Reported independently by both trinity and syzkaller.
>> Verified that the syzkaller reproducer passes after this patch.
>>
> 
> I forgot to add the Fixes tag, sorry.
> 
> Fixes: dc99f600698d ("packet: Add fanout support.")

Applied and queued up for stable as it fixes this race and I can't
see any new problems it introduces.

But boy is this one messy area.

The scariest thing to me now is the save/restore sequence done by
packet_set_ring(), for example.

	spin_lock(&po->bind_lock);
	was_running = po->running;
	num = po->num;
	if (was_running) {
		po->num = 0;
		__unregister_prot_hook(sk, false);
	}
	spin_unlock(&po->bind_lock);
 ...
	spin_lock(&po->bind_lock);
	if (was_running) {
		po->num = num;
		register_prot_hook(sk);
	}
	spin_unlock(&po->bind_lock);

The socket is also locked during this sequence but that doesn't
prevent parallel changes to the running state.

Since po->bind_lock is dropped, it's possible for another thread
to grab bind_lock and bind it meanwhile.

The above code seems to assume that can't happen, and that
register_prot_hook() will always see po->running set to zero
and rebind the socket.

If the race happens we'll have weird state, because we did not
rebind yet we modified po->num.

We seem to have a hierachy of sleeping and non-sleeping locks
that do not work well together.

^ permalink raw reply

* Re: [PATCH v3 14/31] vxfs: Define usercopy region in vxfs_inode slab cache
From: Christoph Hellwig @ 2017-09-20 20:56 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, David Windsor, Christoph Hellwig, linux-fsdevel,
	netdev, linux-mm, kernel-hardening
In-Reply-To: <1505940337-79069-15-git-send-email-keescook@chromium.org>

Hi Kees,

I've only got this single email from you, which on it's own doesn't
compile and seems to be part of a 31 patch series.

So as-is NAK, doesn't work.

Please make sure to always send every patch in a series to every
developer you want to include.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [net-next] macvlan: code refine to check data before using
From: David Miller @ 2017-09-20 20:47 UTC (permalink / raw)
  To: zhangshengju; +Cc: fgao, vyasevic, netdev
In-Reply-To: <1505866343-14135-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Wed, 20 Sep 2017 08:12:23 +0800

> This patch checks data first at one place, return if it's null.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Looks good, applied, thanks.

^ permalink raw reply

* Re: [PATCH] ipv6: Use ipv6_authlen for len in ipv6_skip_exthdr
From: David Miller @ 2017-09-20 20:45 UTC (permalink / raw)
  To: qasdfgtyuiop; +Cc: trivial, netdev, kuznet, yoshfuji
In-Reply-To: <20170920161817.16203-1-qasdfgtyuiop@gmail.com>

From: Xiang Gao <qasdfgtyuiop@gmail.com>
Date: Wed, 20 Sep 2017 12:18:17 -0400

> In ipv6_skip_exthdr, the lengh of AH header is computed manually
> as (hp->hdrlen+2)<<2. However, in include/linux/ipv6.h, a macro
> named ipv6_authlen is already defined for exactly the same job. This
> commit replaces the manual computation code with the macro.
> 
> Signed-off-by: Xiang Gao <qasdfgtyuiop@gmail.com>

Applied, thank you.

^ permalink raw reply

* [PATCH v3 31/31] lkdtm: Update usercopy tests for whitelisting
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, linux-fsdevel, netdev, linux-mm, kernel-hardening,
	David Windsor
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

This updates the USERCOPY_HEAP_FLAG_* tests to USERCOPY_HEAP_WHITELIST_*,
since the final form of usercopy whitelisting ended up using an offset/size
window instead of the earlier proposed allocation flags.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 drivers/misc/lkdtm.h          |  4 +-
 drivers/misc/lkdtm_core.c     |  4 +-
 drivers/misc/lkdtm_usercopy.c | 88 ++++++++++++++++++++++++-------------------
 3 files changed, 53 insertions(+), 43 deletions(-)

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index bfb6c45b6130..327bcf46fab5 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -75,8 +75,8 @@ void __init lkdtm_usercopy_init(void);
 void __exit lkdtm_usercopy_exit(void);
 void lkdtm_USERCOPY_HEAP_SIZE_TO(void);
 void lkdtm_USERCOPY_HEAP_SIZE_FROM(void);
-void lkdtm_USERCOPY_HEAP_FLAG_TO(void);
-void lkdtm_USERCOPY_HEAP_FLAG_FROM(void);
+void lkdtm_USERCOPY_HEAP_WHITELIST_TO(void);
+void lkdtm_USERCOPY_HEAP_WHITELIST_FROM(void);
 void lkdtm_USERCOPY_STACK_FRAME_TO(void);
 void lkdtm_USERCOPY_STACK_FRAME_FROM(void);
 void lkdtm_USERCOPY_STACK_BEYOND(void);
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 981b3ef71e47..6e2d767ecaaa 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -245,8 +245,8 @@ struct crashtype crashtypes[] = {
 	CRASHTYPE(ATOMIC_TIMING),
 	CRASHTYPE(USERCOPY_HEAP_SIZE_TO),
 	CRASHTYPE(USERCOPY_HEAP_SIZE_FROM),
-	CRASHTYPE(USERCOPY_HEAP_FLAG_TO),
-	CRASHTYPE(USERCOPY_HEAP_FLAG_FROM),
+	CRASHTYPE(USERCOPY_HEAP_WHITELIST_TO),
+	CRASHTYPE(USERCOPY_HEAP_WHITELIST_FROM),
 	CRASHTYPE(USERCOPY_STACK_FRAME_TO),
 	CRASHTYPE(USERCOPY_STACK_FRAME_FROM),
 	CRASHTYPE(USERCOPY_STACK_BEYOND),
diff --git a/drivers/misc/lkdtm_usercopy.c b/drivers/misc/lkdtm_usercopy.c
index df6ac985fbb5..f6055f4922bf 100644
--- a/drivers/misc/lkdtm_usercopy.c
+++ b/drivers/misc/lkdtm_usercopy.c
@@ -19,7 +19,7 @@
  */
 static volatile size_t unconst = 0;
 static volatile size_t cache_size = 1024;
-static struct kmem_cache *bad_cache;
+static struct kmem_cache *whitelist_cache;
 
 static const unsigned char test_text[] = "This is a test.\n";
 
@@ -114,6 +114,10 @@ static noinline void do_usercopy_stack(bool to_user, bool bad_frame)
 	vm_munmap(user_addr, PAGE_SIZE);
 }
 
+/*
+ * This checks for whole-object size validation with hardened usercopy,
+ * with or without usercopy whitelisting.
+ */
 static void do_usercopy_heap_size(bool to_user)
 {
 	unsigned long user_addr;
@@ -171,77 +175,79 @@ static void do_usercopy_heap_size(bool to_user)
 	kfree(two);
 }
 
-static void do_usercopy_heap_flag(bool to_user)
+/*
+ * This checks for the specific whitelist window within an object. If this
+ * test passes, then do_usercopy_heap_size() tests will pass too.
+ */
+static void do_usercopy_heap_whitelist(bool to_user)
 {
-	unsigned long user_addr;
-	unsigned char *good_buf = NULL;
-	unsigned char *bad_buf = NULL;
+	unsigned long user_alloc;
+	unsigned char *buf = NULL;
+	unsigned char __user *user_addr;
+	size_t offset, size;
 
 	/* Make sure cache was prepared. */
-	if (!bad_cache) {
+	if (!whitelist_cache) {
 		pr_warn("Failed to allocate kernel cache\n");
 		return;
 	}
 
 	/*
-	 * Allocate one buffer from each cache (kmalloc will have the
-	 * SLAB_USERCOPY flag already, but "bad_cache" won't).
+	 * Allocate a buffer with a whitelisted window in the buffer.
 	 */
-	good_buf = kmalloc(cache_size, GFP_KERNEL);
-	bad_buf = kmem_cache_alloc(bad_cache, GFP_KERNEL);
-	if (!good_buf || !bad_buf) {
-		pr_warn("Failed to allocate buffers from caches\n");
+	buf = kmem_cache_alloc(whitelist_cache, GFP_KERNEL);
+	if (!buf) {
+		pr_warn("Failed to allocate buffer from whitelist cache\n");
 		goto free_alloc;
 	}
 
 	/* Allocate user memory we'll poke at. */
-	user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+	user_alloc = vm_mmap(NULL, 0, PAGE_SIZE,
 			    PROT_READ | PROT_WRITE | PROT_EXEC,
 			    MAP_ANONYMOUS | MAP_PRIVATE, 0);
-	if (user_addr >= TASK_SIZE) {
+	if (user_alloc >= TASK_SIZE) {
 		pr_warn("Failed to allocate user memory\n");
 		goto free_alloc;
 	}
+	user_addr = (void __user *)user_alloc;
 
-	memset(good_buf, 'A', cache_size);
-	memset(bad_buf, 'B', cache_size);
+	memset(buf, 'B', cache_size);
+
+	/* Whitelisted window in buffer, from kmem_cache_create_usercopy. */
+	offset = (cache_size / 4) + unconst;
+	size = (cache_size / 16) + unconst;
 
 	if (to_user) {
-		pr_info("attempting good copy_to_user with SLAB_USERCOPY\n");
-		if (copy_to_user((void __user *)user_addr, good_buf,
-				 cache_size)) {
+		pr_info("attempting good copy_to_user inside whitelist\n");
+		if (copy_to_user(user_addr, buf + offset, size)) {
 			pr_warn("copy_to_user failed unexpectedly?!\n");
 			goto free_user;
 		}
 
-		pr_info("attempting bad copy_to_user w/o SLAB_USERCOPY\n");
-		if (copy_to_user((void __user *)user_addr, bad_buf,
-				 cache_size)) {
+		pr_info("attempting bad copy_to_user outside whitelist\n");
+		if (copy_to_user(user_addr, buf + offset - 1, size)) {
 			pr_warn("copy_to_user failed, but lacked Oops\n");
 			goto free_user;
 		}
 	} else {
-		pr_info("attempting good copy_from_user with SLAB_USERCOPY\n");
-		if (copy_from_user(good_buf, (void __user *)user_addr,
-				   cache_size)) {
+		pr_info("attempting good copy_from_user inside whitelist\n");
+		if (copy_from_user(buf + offset, user_addr, size)) {
 			pr_warn("copy_from_user failed unexpectedly?!\n");
 			goto free_user;
 		}
 
-		pr_info("attempting bad copy_from_user w/o SLAB_USERCOPY\n");
-		if (copy_from_user(bad_buf, (void __user *)user_addr,
-				   cache_size)) {
+		pr_info("attempting bad copy_from_user outside whitelist\n");
+		if (copy_from_user(buf + offset - 1, user_addr, size)) {
 			pr_warn("copy_from_user failed, but lacked Oops\n");
 			goto free_user;
 		}
 	}
 
 free_user:
-	vm_munmap(user_addr, PAGE_SIZE);
+	vm_munmap(user_alloc, PAGE_SIZE);
 free_alloc:
-	if (bad_buf)
-		kmem_cache_free(bad_cache, bad_buf);
-	kfree(good_buf);
+	if (buf)
+		kmem_cache_free(whitelist_cache, buf);
 }
 
 /* Callable tests. */
@@ -255,14 +261,14 @@ void lkdtm_USERCOPY_HEAP_SIZE_FROM(void)
 	do_usercopy_heap_size(false);
 }
 
-void lkdtm_USERCOPY_HEAP_FLAG_TO(void)
+void lkdtm_USERCOPY_HEAP_WHITELIST_TO(void)
 {
-	do_usercopy_heap_flag(true);
+	do_usercopy_heap_whitelist(true);
 }
 
-void lkdtm_USERCOPY_HEAP_FLAG_FROM(void)
+void lkdtm_USERCOPY_HEAP_WHITELIST_FROM(void)
 {
-	do_usercopy_heap_flag(false);
+	do_usercopy_heap_whitelist(false);
 }
 
 void lkdtm_USERCOPY_STACK_FRAME_TO(void)
@@ -313,11 +319,15 @@ void lkdtm_USERCOPY_KERNEL(void)
 void __init lkdtm_usercopy_init(void)
 {
 	/* Prepare cache that lacks SLAB_USERCOPY flag. */
-	bad_cache = kmem_cache_create("lkdtm-no-usercopy", cache_size, 0,
-				      0, NULL);
+	whitelist_cache =
+		kmem_cache_create_usercopy("lkdtm-usercopy", cache_size,
+					   0, 0,
+					   cache_size / 4,
+					   cache_size / 16,
+					   NULL);
 }
 
 void __exit lkdtm_usercopy_exit(void)
 {
-	kmem_cache_destroy(bad_cache);
+	kmem_cache_destroy(whitelist_cache);
 }
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox