Netdev List
 help / color / mirror / Atom feed
* Re: Re: [Intel-wired-lan] [PATCH] ixgbe: initialize u64_stats_sync structures early at ixgbe_probe
From: Alexander Duyck @ 2017-04-25 17:16 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Singh, Krishneil K, Song, Liwei (Wind River), Kirsher, Jeffrey T,
	netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <trinity-a348af9b-9ef5-47a9-a14e-94265c880cf1-1493134775614@3capp-gmx-bs78>

On Tue, Apr 25, 2017 at 8:39 AM, Lino Sanfilippo <LinoSanfilippo@gmx.de> wrote:
> Hi,
>
>> This patch doesn't look right to me. I would suggest rejecting it.
>>
>> The call to initialize the stats should be done when the ring is
>> allocated, not in ixgbe_probe(). This should probably be done in
>> ixgbe_alloc_q_vector() instead.
>>
>
> AFAICS ixgbe_alloc_q_vector() is also called in probe() (by ixgbe_init_interrupt_scheme()).
> Furthermore it is also called in resume() which would lead to multiple initialization of
> the u64_stats_sync in case of resume.

ixgbe_alloc_q_vector() is what allocates the ring structures that are
being initialized here. Calling it anywhere other than here doesn't
make sense since what we want to do is initialize the memory after we
have allocated it, but before we hand the pointer to it over to a
netdev or in this case an adapter structure.

> IMHO the u64_stats_sync variables have to be initialized before register_netdev() is called
> since this is the point from which userspace can call ixgbe_get_stats64(). I would say the
> best place to do so is the probe() function as it is done in this patch.

I would disagree here. We should be initializing the stats variables
after we allocate them. Especially since we can end up freeing and
reallocating them any time the number of queues is changed.

> Just my 2 cents.
>
> Regards,
> Lino

My advice would be to look through the code and verify what it is you
need to initialize and where it should happen. In this case we are
getting a lockdep splat since we are just letting things get
initialized with kzalloc and aren't following up in the right place. I
don't disagree that the original code has the u64_stats_init in the
wrong place since we can open/close the interface and trigger a
reinitialization of the stats. I would say we need to initialize the
stats just after we allocate them in memory so that if we decide to
free and reallocate the rings we initialize the new rings before they
are added to the netdev and don't reintroduce this issue in just a
different form.

- Alex

^ permalink raw reply

* Re: [PATCH v4 net] net: ipv6: regenerate host route if moved to gc list
From: Martin KaFai Lau @ 2017-04-25 17:18 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, dvyukov, andreyknvl, mmanning, eric.dumazet
In-Reply-To: <1493137049-16465-1-git-send-email-dsa@cumulusnetworks.com>

On Tue, Apr 25, 2017 at 09:17:29AM -0700, David Ahern wrote:
[...]
>
> All of those faults are fixed by regenerating the host route if the
> existing one has been moved to the gc list, something that can be
> determined by checking if the rt6i_ref counter is 0.
Acked-by: Martin KaFai Lau <kafai@fb.com>

>
> Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

> ---
> v4
> - move 'prev = ifp->rt;' under spinlock as requested by Eric
>
> v3
> - removed 'if (prev)' and just call ip6_rt_put; added comment about spinlock
>
> v2
> - change ifp->rt under spinlock vs cmpxchg
> - add comment about rt6i_ref == 0
>
>  net/ipv6/addrconf.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 80ce478c4851..0ea96c4d334d 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -3271,14 +3271,24 @@ static void addrconf_gre_config(struct net_device *dev)
>  static int fixup_permanent_addr(struct inet6_dev *idev,
>  				struct inet6_ifaddr *ifp)
>  {
> -	if (!ifp->rt) {
> -		struct rt6_info *rt;
> +	/* rt6i_ref == 0 means the host route was removed from the
> +	 * FIB, for example, if 'lo' device is taken down. In that
> +	 * case regenerate the host route.
> +	 */
> +	if (!ifp->rt || !atomic_read(&ifp->rt->rt6i_ref)) {
> +		struct rt6_info *rt, *prev;
>
>  		rt = addrconf_dst_alloc(idev, &ifp->addr, false);
>  		if (unlikely(IS_ERR(rt)))
>  			return PTR_ERR(rt);
>
> +		/* ifp->rt can be accessed outside of rtnl */
> +		spin_lock(&ifp->lock);
> +		prev = ifp->rt;
>  		ifp->rt = rt;
> +		spin_unlock(&ifp->lock);
> +
> +		ip6_rt_put(prev);
>  	}
>
>  	if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
> --
> 2.1.4
>

^ permalink raw reply

* Re: [RFC PATCH 3/7] net: add option to get information about timestamped packets
From: Willem de Bruijn @ 2017-04-25 17:23 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Willem de Bruijn, Network Development, Richard Cochran,
	Soheil Hassas Yeganeh, Keller, Jacob E, Denny Page, Jiri Benc
In-Reply-To: <20170425135642.GB27148@localhost>

On Tue, Apr 25, 2017 at 9:56 AM, Miroslav Lichvar <mlichvar@redhat.com> wrote:
> On Mon, Apr 24, 2017 at 11:18:13AM -0400, Willem de Bruijn wrote:
>> On Mon, Apr 24, 2017 at 5:00 AM, Miroslav Lichvar <mlichvar@redhat.com> wrote:
>> > Would "skb->data - skb->head -
>> > skb->mac_header + skb->len" always work as the L2 length for received
>> > packets at the time when the cmsg is prepared?
>>
>> (skb->data - skb->head) - skb->mac_header computes the length
>> of data before the mac, such as reserve?
>
> data - head includes the reserve, but mac_header does too, so I think
> it should be just the length of MAC header and everything up to the
> data.
>
>> Do you mean skb->data -
>> skb->mac_header (or - skb_mac_offset(skb))?
>
> That would give me a pointer? If I used skb_mac_offset(), the total
> length would be just skb->len - skb_mac_offset()?

It appears so. The only existing caller first checks
skb_mac_header_was_set(skb).

^ permalink raw reply

* Re: Blogpost evaluation this [PATCH v4 net-next RFC] net: Generic XDP
From: Andy Gospodarek @ 2017-04-25 17:25 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, xdp-newbies, netdev
In-Reply-To: <20170424.182643.485613135674690555.davem@davemloft.net>

On Mon, Apr 24, 2017 at 06:26:43PM -0400, David Miller wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Mon, 24 Apr 2017 16:24:05 +0200
> 
> > I've done a very detailed evaluation of this patch, and I've created a
> > blogpost like report here:
> > 
> >  https://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html
> 
> Thanks for doing this Jesper.

Yes, this is excellent.  I'm not all the way thru it, but I looked at
the data and corroborate the results you are seeing.

My results for both optimized and generic XDP for
xdp_bench01_mem_access_cost --action XDP_DROP --readmem are quite
similar to yours (11.7Mpps and 7.8Mpps, respectively for me 11.7Mpps and
8.4Mpps for you).

I also noted (as you did) that there is no discernible difference
running xdp_bench01_mem_access_cost with or without the --readmem
option since the packet data is already being accessed that late it the
stack.

> 
> > I didn't evaluate the adjust_head part, so I hope Andy is still
> > planning to validate that part?
> 
> I was hoping he would post some results today as well.
> 
> Andy, how goes it? :)

Sorry for the delayed response.  I was AFK yesterday, but based on
testing from Friday and what I wrapped up today all looks good to me.

On my system (i7-6700 CPU @ 3.40GHz) the reported and actual TX
throughput for xdp_tx_iptunnel is 4.6Mpps for the optimized XDP.

For generic XDP the reported throughput of xdp_tx_iptunnel is 4.6Mpps
but only ~880kpps actually on the wire.  It seems to me that can be
fixed with a follow-up for offending drivers or the stack if deemed that
there is a real error there.

> Once the basic patch is ready and integrated in we can try to do
> xmit_more in generic XDP and see what that does for XDP_TX
> performance.

Agreed.

^ permalink raw reply

* Re: [PATCH net-next 0/2] flower: add MPLS matching support
From: Benjamin LaHaise @ 2017-04-25 17:26 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Simon Horman, Jakub Kicinski, David Miller, netdev, bcrl,
	Jiri Pirko
In-Reply-To: <1a05824d-342a-6d06-5fa4-d747a8b8460f@mojatatu.com>

On Tue, Apr 25, 2017 at 08:47:00AM -0400, Jamal Hadi Salim wrote:
> On 17-04-25 07:55 AM, Simon Horman wrote:
> [..]
> > 
> > I agree something should be done wrt BOS. If the LABEL and TC are to
> > be left as-is then I think a similar treatment of BOS - that is masking it
> > - makes sense.
> > 
> > I also agree with statements made earlier in the thread that it is unlikely
> > that the unused bits of these attributes will be used - as opposed to a
> > bitmask of flag values which seems ripe for re-use for future flags.
> > 
> 
> For your use case, I think you are fine if you just do the mask in the
> kernel. A mask  to a user value implies "I am ignoring the rest
> of these bits - I dont  care if you set them "
> 
> > I would like to add to the discussion that I think in future it would
> > be good to expand the features provided by this patch to support supplying
> > a mask as part of the match - as flower supports for other fields such
> > as IP addresses. But I think the current scheme of masking out invalid bits
> > should also work in conjunction with user-supplied masks.
> > 
> 
> The challenge we have right now is "users do stoopid or malicious
> things". So are you going to accept the wrong bitmap + mask?

I think rejecting bits in a mask that clearly cannot be set (like the 
bits above the lower 20 bits in an MPLS label) makes perfect sense.  It 
doesn't impact usability for the tools since they shouldn't be set (and 
actually can't be in the iproute2 changes).

		-ben

> cheers,
> jamal
> 

^ permalink raw reply

* Re: Blogpost evaluation this [PATCH v4 net-next RFC] net: Generic XDP
From: David Miller @ 2017-04-25 17:31 UTC (permalink / raw)
  To: andy; +Cc: brouer, xdp-newbies, netdev
In-Reply-To: <20170425172549.GQ4730@C02RW35GFVH8.dhcp.broadcom.net>

From: Andy Gospodarek <andy@greyhouse.net>
Date: Tue, 25 Apr 2017 13:25:49 -0400

> On Mon, Apr 24, 2017 at 06:26:43PM -0400, David Miller wrote:
>> Andy, how goes it? :)
> 
> Sorry for the delayed response.  I was AFK yesterday, but based on
> testing from Friday and what I wrapped up today all looks good to me.
> 
> On my system (i7-6700 CPU @ 3.40GHz) the reported and actual TX
> throughput for xdp_tx_iptunnel is 4.6Mpps for the optimized XDP.
> 
> For generic XDP the reported throughput of xdp_tx_iptunnel is 4.6Mpps
> but only ~880kpps actually on the wire.  It seems to me that can be
> fixed with a follow-up for offending drivers or the stack if deemed that
> there is a real error there.

Ok, I'll commit the latest version with tested-by tags for you, Jesper,
and David Ahern added.

Thanks everyone.

^ permalink raw reply

* Re: [PATCH v4 net] net: ipv6: regenerate host route if moved to gc list
From: Eric Dumazet @ 2017-04-25 17:34 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, dvyukov, andreyknvl, mmanning, kafai
In-Reply-To: <1493137049-16465-1-git-send-email-dsa@cumulusnetworks.com>

On Tue, 2017-04-25 at 09:17 -0700, David Ahern wrote:
> Taking down the loopback device wreaks havoc on IPv6 routing. By
> extension, taking down a VRF device wreaks havoc on its table.
> 
> Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
> FIB code while running syzkaller fuzzer. The root cause is a dead dst
> that is on the garbage list gets reinserted into the IPv6 FIB. While on
> the gc (or perhaps when it gets added to the gc list) the dst->next is
> set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
> out-of-bounds access.
> 
> Andrey's reproducer was the key to getting to the bottom of this.
> 
> With IPv6, host routes for an address have the dst->dev set to the
> loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
> a walk of the fib evicting routes with the 'lo' device which means all
> host routes are removed. That process moves the dst which is attached to
> an inet6_ifaddr to the gc list and marks it as dead.
> 
> The recent change to keep global IPv6 addresses added a new function,
> fixup_permanent_addr, that is called on admin up. That function restarts
> dad for an inet6_ifaddr and when it completes the host route attached
> to it is inserted into the fib. Since the route was marked dead and
> moved to the gc list, re-inserting the route causes the reported
> out-of-bounds accesses. If the device with the address is taken down
> or the address is removed, the WARN_ON in fib6_del is triggered.
> 
> All of those faults are fixed by regenerating the host route if the
> existing one has been moved to the gc list, something that can be
> determined by checking if the rt6i_ref counter is 0.
> 
> Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Bug and configuration MPLS error?
From: Алексей Болдырев @ 2017-04-25 17:28 UTC (permalink / raw)
  To: netdev
In-Reply-To: <61671493137651@web14o.yandex.ru>

Короче, вот конфиг MPLS на одном из дистрибутивов:
In short, here's the MPLS configuration on one of the distributions:
226 sysctl -w net.mpls.conf.lo.input=1
227 sysctl -w net.mpls.platform_labels=1048575
228 ip link add veth0 type veth peer name veth1
229 ip link add veth2 type veth peer name veth3
230 sysctl -w net.mpls.conf.veth0.input=1
231 sysctl -w net.mpls.conf.veth2.input=1
232 ifconfig veth0 10.3.3.1 netmask 255.255.255.0
233 ifconfig veth2 10.4.4.1 netmask 255.255.255.0
234 ip netns add host1
235 ip netns add host2
236 ip link set veth1 netns host1
237 ip link set veth3 netns host2
238 ip netns exec host1 ifconfig veth1 10.3.3.2 netmask 255.255.255.0 up
239 ip netns exec host2 ifconfig veth3 10.4.4.2 netmask 255.255.255.0 up
240 ip netns exec host1 ip route add 10.10.10.2/32 encap mpls 112 via inet 10.3.3.1
241 ip netns exec host2 ip route add 10.10.10.1/32 encap mpls 111 via inet 10.4.4.1
242 ip -f mpls route add 111 via inet 10.3.3.2
243 ip -f mpls route add 112 via inet 10.4.4.2

Результаты теста:
Test Results:
tcp по mpls:
~ # ip netns exec host2 iperf3 -c 10.10.10.1 -B 10.10.10.2
Connecting to host 10.10.10.1, port 5201
[ 4] local 10.10.10.2 port 34021 connected to 10.10.10.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 912 KBytes 7.46 Mbits/sec 0 636 KBytes
[ 4] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
----------------------------------------

[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 912 KBytes 747 Kbits/sec 0 sender
[ 4] 0.00-10.00 sec 21.3 KBytes 17.5 Kbits/sec receiver

iperf Done.
~ #
udp по mpls:
~ # ip netns exec host2 iperf3 -c 10.10.10.1 -B 10.10.10.2 -u -b 10g
Connecting to host 10.10.10.1, port 5201
[ 4] local 10.10.10.2 port 56901 connected to 10.10.10.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-1.00 sec 438 MBytes 3.67 Gbits/sec 56049
[ 4] 1.00-2.00 sec 491 MBytes 4.12 Gbits/sec 62829
[ 4] 2.00-3.00 sec 492 MBytes 4.12 Gbits/sec 62919
[ 4] 3.00-4.00 sec 490 MBytes 4.11 Gbits/sec 62762
[ 4] 4.00-5.00 sec 491 MBytes 4.12 Gbits/sec 62891
[ 4] 5.00-6.00 sec 492 MBytes 4.13 Gbits/sec 62994
[ 4] 6.00-7.00 sec 503 MBytes 4.22 Gbits/sec 64322
[ 4] 7.00-8.00 sec 503 MBytes 4.22 Gbits/sec 64321
[ 4] 8.00-9.00 sec 502 MBytes 4.21 Gbits/sec 64279
[ 4] 9.00-10.00 sec 511 MBytes 4.28 Gbits/sec 65352
----------------------------------------

[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-10.00 sec 4.80 GBytes 4.12 Gbits/sec 0.001 ms 0/628718 (0%)
[ 4] Sent 628718 datagrams

iperf Done.
UDP как видим, проходит нормально.
UDP as seen, is normal.
Вот параметры интерфейсов:
Here are the interface parameters:
P:
veth0 Link encap:Ethernet HWaddr 72:0D:9E:D7:BC:B3
inet addr:10.3.3.1 Bcast:10.3.3.255 Mask:255.255.255.0
inet6 addr: fe80::700d:9eff:fed7:bcb3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
RX packets:126 errors:0 dropped:0 overruns:0 frame:0
TX packets:629026 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9592 (9.3 KiB) TX bytes:5178498619 (4.8 GiB)

veth2 Link encap:Ethernet HWaddr CE:24:F8:1F:99:C1
inet addr:10.4.4.1 Bcast:10.4.4.255 Mask:255.255.255.0
inet6 addr: fe80::cc24:f8ff:fe1f:99c1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
RX packets:629015 errors:0 dropped:0 overruns:0 frame:0
TX packets:135 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5181014123 (4.8 GiB) TX bytes:9564 (9.3 KiB)
PE1:
~ # ip netns exec host2 ifconfig
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

veth3 Link encap:Ethernet HWaddr 36:00:C2:29:0D:F9
inet addr:10.4.4.2 Bcast:10.4.4.255 Mask:255.255.255.0
inet6 addr: fe80::3400:c2ff:fe29:df9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65200 Metric:1
RX packets:136 errors:0 dropped:0 overruns:0 frame:0
TX packets:629015 errors:0 dropped:1 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9596 (9.3 KiB) TX bytes:5181014123 (4.8 GiB)
PE2:
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

veth1 Link encap:Ethernet HWaddr DA:B2:AD:31:68:77
inet addr:10.3.3.2 Bcast:10.3.3.255 Mask:255.255.255.0
inet6 addr: fe80::d8b2:adff:fe31:6877/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65200 Metric:1
RX packets:629027 errors:0 dropped:0 overruns:0 frame:0
TX packets:126 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5178498651 (4.8 GiB) TX bytes:9592 (9.3 KiB)
Тоже самое, только на более свежом ядре: http://forum.nag.ru/forum/index.php?showtopic=128927&st=0
The same thing, only on amore recent nucleus:
Ядро:
Core:
/ # uname -r
4.8.6
Конфиг ядра:
Kernel Config:
https://pastebin.com/raw/EE1k05cT

Это баг ядра, или ошибка конфигурирования?
Is it a kernel bug, or a configuration error?

^ permalink raw reply

* [PATCH net-next v5] net: Generic XDP
From: David Miller @ 2017-04-25 17:43 UTC (permalink / raw)
  To: netdev; +Cc: andy, brouer, ast, daniel, netdev, xdp-newbies


This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with less
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another depedency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

v5:
 - Handle both negative and positive offset after running prog
 - Fix mac length in XDP_TX case (Alexei)
 - Use rcu_dereference_protected() in free_netdev (kbuild test robot)

v4:
 - Fix MAC header adjustmnet before calling prog (David Ahern)
 - Disable LRO when generic XDP is installed (Michael Chan)
 - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
 - Do not perform generic XDP on reinjected packets (DaveM)

v3:
 - Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
 - Elide GRO when generic XDP is in use.  (Alexei)
 - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
 - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
 - Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
 - Use RCU for generic xdp_prog, thanks to Johannes Berg.

Tested-by: Andy Gospodarek <andy@greyhouse.net>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---

Committed to net-next... thanks again everyone.

 include/linux/netdevice.h    |   8 +++
 include/uapi/linux/if_link.h |   4 +-
 net/core/dev.c               | 155 +++++++++++++++++++++++++++++++++++++++++--
 net/core/gro_cells.c         |   2 +-
 net/core/rtnetlink.c         |  40 ++++++-----
 5 files changed, 187 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5d5267f..46d220c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1905,9 +1905,17 @@ struct net_device {
 	struct lock_class_key	*qdisc_tx_busylock;
 	struct lock_class_key	*qdisc_running_key;
 	bool			proto_down;
+	struct bpf_prog __rcu	*xdp_prog;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
+static inline bool netif_elide_gro(const struct net_device *dev)
+{
+	if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
+		return true;
+	return false;
+}
+
 #define	NETDEV_ALIGN		32
 
 static inline
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8b405af..633aa02 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -887,7 +887,9 @@ enum {
 /* XDP section */
 
 #define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
-#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST)
+#define XDP_FLAGS_SKB_MODE		(2U << 0)
+#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
+					 XDP_FLAGS_SKB_MODE)
 
 enum {
 	IFLA_XDP_UNSPEC,
diff --git a/net/core/dev.c b/net/core/dev.c
index db6e315..1b3317c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -95,6 +95,7 @@
 #include <linux/notifier.h>
 #include <linux/skbuff.h>
 #include <linux/bpf.h>
+#include <linux/bpf_trace.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/busy_poll.h>
@@ -4251,6 +4252,125 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
+static struct static_key generic_xdp_needed __read_mostly;
+
+static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	struct bpf_prog *new = xdp->prog;
+	int ret = 0;
+
+	switch (xdp->command) {
+	case XDP_SETUP_PROG: {
+		struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
+
+		rcu_assign_pointer(dev->xdp_prog, new);
+		if (old)
+			bpf_prog_put(old);
+
+		if (old && !new) {
+			static_key_slow_dec(&generic_xdp_needed);
+		} else if (new && !old) {
+			static_key_slow_inc(&generic_xdp_needed);
+			dev_disable_lro(dev);
+		}
+		break;
+	}
+
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = !!rcu_access_pointer(dev->xdp_prog);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct bpf_prog *xdp_prog)
+{
+	struct xdp_buff xdp;
+	u32 act = XDP_DROP;
+	void *orig_data;
+	int hlen, off;
+	u32 mac_len;
+
+	/* Reinjected packets coming from act_mirred or similar should
+	 * not get XDP generic processing.
+	 */
+	if (skb_cloned(skb))
+		return XDP_PASS;
+
+	if (skb_linearize(skb))
+		goto do_drop;
+
+	/* The XDP program wants to see the packet starting at the MAC
+	 * header.
+	 */
+	mac_len = skb->data - skb_mac_header(skb);
+	hlen = skb_headlen(skb) + mac_len;
+	xdp.data = skb->data - mac_len;
+	xdp.data_end = xdp.data + hlen;
+	xdp.data_hard_start = skb->data - skb_headroom(skb);
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	off = xdp.data - orig_data;
+	if (off > 0)
+		__skb_pull(skb, off);
+	else if (off < 0)
+		__skb_push(skb, -off);
+
+	switch (act) {
+	case XDP_TX:
+		__skb_push(skb, mac_len);
+		/* fall through */
+	case XDP_PASS:
+		break;
+
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		/* fall through */
+	case XDP_ABORTED:
+		trace_xdp_exception(skb->dev, xdp_prog, act);
+		/* fall through */
+	case XDP_DROP:
+	do_drop:
+		kfree_skb(skb);
+		break;
+	}
+
+	return act;
+}
+
+/* When doing generic XDP we have to bypass the qdisc layer and the
+ * network taps in order to match in-driver-XDP behavior.
+ */
+static void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
+{
+	struct net_device *dev = skb->dev;
+	struct netdev_queue *txq;
+	bool free_skb = true;
+	int cpu, rc;
+
+	txq = netdev_pick_tx(dev, skb, NULL);
+	cpu = smp_processor_id();
+	HARD_TX_LOCK(dev, txq, cpu);
+	if (!netif_xmit_stopped(txq)) {
+		rc = netdev_start_xmit(skb, dev, txq, 0);
+		if (dev_xmit_complete(rc))
+			free_skb = false;
+	}
+	HARD_TX_UNLOCK(dev, txq);
+	if (free_skb) {
+		trace_xdp_exception(dev, xdp_prog, XDP_TX);
+		kfree_skb(skb);
+	}
+}
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4262,6 +4382,21 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 
 	rcu_read_lock();
 
+	if (static_key_false(&generic_xdp_needed)) {
+		struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+
+		if (xdp_prog) {
+			u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+
+			if (act != XDP_PASS) {
+				rcu_read_unlock();
+				if (act == XDP_TX)
+					generic_xdp_tx(skb, xdp_prog);
+				return NET_RX_DROP;
+			}
+		}
+	}
+
 #ifdef CONFIG_RPS
 	if (static_key_false(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
@@ -4494,7 +4629,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	enum gro_result ret;
 	int grow;
 
-	if (!(skb->dev->features & NETIF_F_GRO))
+	if (netif_elide_gro(skb->dev))
 		goto normal;
 
 	if (skb->csum_bad)
@@ -6723,6 +6858,7 @@ EXPORT_SYMBOL(dev_change_proto_down);
  */
 int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 {
+	int (*xdp_op)(struct net_device *dev, struct netdev_xdp *xdp);
 	const struct net_device_ops *ops = dev->netdev_ops;
 	struct bpf_prog *prog = NULL;
 	struct netdev_xdp xdp;
@@ -6730,14 +6866,16 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 
 	ASSERT_RTNL();
 
-	if (!ops->ndo_xdp)
-		return -EOPNOTSUPP;
+	xdp_op = ops->ndo_xdp;
+	if (!xdp_op || (flags & XDP_FLAGS_SKB_MODE))
+		xdp_op = generic_xdp_install;
+
 	if (fd >= 0) {
 		if (flags & XDP_FLAGS_UPDATE_IF_NOEXIST) {
 			memset(&xdp, 0, sizeof(xdp));
 			xdp.command = XDP_QUERY_PROG;
 
-			err = ops->ndo_xdp(dev, &xdp);
+			err = xdp_op(dev, &xdp);
 			if (err < 0)
 				return err;
 			if (xdp.prog_attached)
@@ -6753,7 +6891,7 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 	xdp.command = XDP_SETUP_PROG;
 	xdp.prog = prog;
 
-	err = ops->ndo_xdp(dev, &xdp);
+	err = xdp_op(dev, &xdp);
 	if (err < 0 && prog)
 		bpf_prog_put(prog);
 
@@ -7793,6 +7931,7 @@ EXPORT_SYMBOL(alloc_netdev_mqs);
 void free_netdev(struct net_device *dev)
 {
 	struct napi_struct *p, *n;
+	struct bpf_prog *prog;
 
 	might_sleep();
 	netif_free_tx_queues(dev);
@@ -7811,6 +7950,12 @@ void free_netdev(struct net_device *dev)
 	free_percpu(dev->pcpu_refcnt);
 	dev->pcpu_refcnt = NULL;
 
+	prog = rcu_dereference_protected(dev->xdp_prog, 1);
+	if (prog) {
+		bpf_prog_put(prog);
+		static_key_slow_dec(&generic_xdp_needed);
+	}
+
 	/*  Compatibility with error handling in drivers */
 	if (dev->reg_state == NETREG_UNINITIALIZED) {
 		netdev_freemem(dev);
diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
index c98bbfb..814e58a 100644
--- a/net/core/gro_cells.c
+++ b/net/core/gro_cells.c
@@ -13,7 +13,7 @@ int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
 	struct net_device *dev = skb->dev;
 	struct gro_cell *cell;
 
-	if (!gcells->cells || skb_cloned(skb) || !(dev->features & NETIF_F_GRO))
+	if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev))
 		return netif_rx(skb);
 
 	cell = this_cpu_ptr(gcells->cells);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 088f9c8..9031a6c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -896,15 +896,13 @@ static size_t rtnl_port_size(const struct net_device *dev,
 		return port_self_size;
 }
 
-static size_t rtnl_xdp_size(const struct net_device *dev)
+static size_t rtnl_xdp_size(void)
 {
 	size_t xdp_size = nla_total_size(0) +	/* nest IFLA_XDP */
-			  nla_total_size(1);	/* XDP_ATTACHED */
+			  nla_total_size(1) +	/* XDP_ATTACHED */
+			  nla_total_size(4);	/* XDP_FLAGS */
 
-	if (!dev->netdev_ops->ndo_xdp)
-		return 0;
-	else
-		return xdp_size;
+	return xdp_size;
 }
 
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
@@ -943,7 +941,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
-	       + rtnl_xdp_size(dev) /* IFLA_XDP */
+	       + rtnl_xdp_size() /* IFLA_XDP */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1251,23 +1249,35 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 
 static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
 {
-	struct netdev_xdp xdp_op = {};
 	struct nlattr *xdp;
+	u32 xdp_flags = 0;
+	u8 val = 0;
 	int err;
 
-	if (!dev->netdev_ops->ndo_xdp)
-		return 0;
 	xdp = nla_nest_start(skb, IFLA_XDP);
 	if (!xdp)
 		return -EMSGSIZE;
-	xdp_op.command = XDP_QUERY_PROG;
-	err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
-	if (err)
-		goto err_cancel;
-	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_op.prog_attached);
+	if (rcu_access_pointer(dev->xdp_prog)) {
+		xdp_flags = XDP_FLAGS_SKB_MODE;
+		val = 1;
+	} else if (dev->netdev_ops->ndo_xdp) {
+		struct netdev_xdp xdp_op = {};
+
+		xdp_op.command = XDP_QUERY_PROG;
+		err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
+		if (err)
+			goto err_cancel;
+		val = xdp_op.prog_attached;
+	}
+	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, val);
 	if (err)
 		goto err_cancel;
 
+	if (xdp_flags) {
+		err = nla_put_u32(skb, IFLA_XDP_FLAGS, xdp_flags);
+		if (err)
+			goto err_cancel;
+	}
 	nla_nest_end(skb, xdp);
 	return 0;
 
-- 
2.4.11

^ permalink raw reply related

* Re: [PATCH net-next] drivers: net: xgene-v2: Fix error return code in xge_mdio_config()
From: David Miller @ 2017-04-25 17:48 UTC (permalink / raw)
  To: weiyj.lk; +Cc: isubramanian, kchudgar, weiyongjun1, netdev
In-Reply-To: <20170425113650.22730-1-weiyj.lk@gmail.com>

From: Wei Yongjun <weiyj.lk@gmail.com>
Date: Tue, 25 Apr 2017 11:36:50 +0000

> From: Wei Yongjun <weiyongjun1@huawei.com>
> 
> Fix to return error code -ENODEV from the no PHY found error
> handling case instead of 0, as done elsewhere in this function.
> 
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 01/10] tcp: add tp->tcp_mstamp field
From: Soheil Hassas Yeganeh @ 2017-04-25 17:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet
In-Reply-To: <20170425171541.3417-2-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> We want to use precise timestamps in TCP stack, but we do not
> want to call possibly expensive kernel time services too often.
>
> tp->tcp_mstamp is guaranteed to be updated once per incoming packet.
>
> We will use it in the following patches, removing specific
> skb_mstamp_get() calls, and removing ack_time from
> struct tcp_sacktag_state.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> ---
>  include/linux/tcp.h  | 1 +
>  net/ipv4/tcp_input.c | 3 +++
>  2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index cbe5b602a2d349fdeb1e878305f37b4da1e6cc86..99a22f44c32e1587a6bf4835b65c7a4314807aa8 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -240,6 +240,7 @@ struct tcp_sock {
>         u32     tlp_high_seq;   /* snd_nxt at the time of TLP retransmit. */
>
>  /* RTT measurement */
> +       struct skb_mstamp tcp_mstamp; /* most recent packet received/sent */
>         u32     srtt_us;        /* smoothed round trip time << 3 in usecs */
>         u32     mdev_us;        /* medium deviation                     */
>         u32     mdev_max_us;    /* maximal mdev for the last rtt period */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5af2f04f885914491a7116c20056b3d2188d2d7d..bd18c65df4a9d9c2b66d8005f2cc4ff468140a73 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5362,6 +5362,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>
> +       skb_mstamp_get(&tp->tcp_mstamp);
>         if (unlikely(!sk->sk_rx_dst))
>                 inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
>         /*
> @@ -5922,6 +5923,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>
>         case TCP_SYN_SENT:
>                 tp->rx_opt.saw_tstamp = 0;
> +               skb_mstamp_get(&tp->tcp_mstamp);
>                 queued = tcp_rcv_synsent_state_process(sk, skb, th);
>                 if (queued >= 0)
>                         return queued;
> @@ -5933,6 +5935,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>                 return 0;
>         }
>
> +       skb_mstamp_get(&tp->tcp_mstamp);
>         tp->rx_opt.saw_tstamp = 0;
>         req = tp->fastopen_rsk;
>         if (req) {
> --
> 2.13.0.rc0.306.g87b477812d-goog
>

Nice patchset. Thanks, Eric!

^ permalink raw reply

* Re: [PATCH net-next 01/10] tcp: add tp->tcp_mstamp field
From: Neal Cardwell @ 2017-04-25 17:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-2-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> We want to use precise timestamps in TCP stack, but we do not
> want to call possibly expensive kernel time services too often.
>
> tp->tcp_mstamp is guaranteed to be updated once per incoming packet.
>
> We will use it in the following patches, removing specific
> skb_mstamp_get() calls, and removing ack_time from
> struct tcp_sacktag_state.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

This series is great! Thanks, Eric!

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 02/10] tcp: do not pass timestamp to tcp_rack_detect_loss()
From: Neal Cardwell @ 2017-04-25 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-3-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> We can use tp->tcp_mstamp as it contains a recent timestamp.
>
> This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout()
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 03/10] tcp: do not pass timestamp to tcp_rack_mark_lost()
From: Neal Cardwell @ 2017-04-25 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-4-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> This is no longer used, since tcp_rack_detect_loss() takes
> the timestamp from tp->tcp_mstamp
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 04/10] tcp: do not pass timestamp to tcp_rack_identify_loss()
From: Neal Cardwell @ 2017-04-25 17:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-5-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> Not used anymore now tp->tcp_mstamp holds the information.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 05/10] tcp: do not pass timestamp to tcp_fastretrans_alert()
From: Neal Cardwell @ 2017-04-25 17:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-6-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> Not used anymore now tp->tcp_mstamp holds the information.
>
> This is needed to remove sack_state.ack_time in a following patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 06/10] tcp: do not pass timestamp to tcp_rate_gen()
From: Neal Cardwell @ 2017-04-25 17:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-7-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> No longer needed, since tp->tcp_mstamp holds the information.
>
> This is needed to remove sack_state.ack_time in a following patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 07/10] tcp: do not pass timestamp to tcp_rack_advance()
From: Neal Cardwell @ 2017-04-25 17:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-8-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> No longer needed, since tp->tcp_mstamp holds the information.
>
> This is needed to remove sack_state.ack_time in a following patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 08/10] tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()
From: Neal Cardwell @ 2017-04-25 17:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-9-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> Following patch will remove ack_time from struct tcp_sacktag_state
>
> Same info is now found in tp->tcp_mstamp
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net-next 09/10] tcp: remove ack_time from struct tcp_sacktag_state
From: Neal Cardwell @ 2017-04-25 17:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-10-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> It is no longer needed, everything uses tp->tcp_mstamp instead.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats
From: Benjamin Poirier @ 2017-04-25 17:54 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: Brown, Aaron F, Neftin, Sasha, David S Miller, stephen,
	netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	Stefan Priebe
In-Reply-To: <1493111272.68612.2.camel@intel.com>

[-- Attachment #1: Type: text/plain, Size: 2874 bytes --]

On 2017/04/25 02:07, Jeff Kirsher wrote:
[...]
> > > 
> > > I don't see any memset in e1000e_get_stats64(). What kernel version
> > > are
> > > you looking at?
> > 
> > The call to memset was removed from the upstream kernel with:
> > -------------------------------------------------------------------
> > -----------------
> > commit 5944701df90d9577658e2354cc27c4ceaeca30fe
> > Author: stephen hemminger <stephen@networkplumber.org>
> > Date:   Fri Jan 6 19:12:53 2017 -0800
> > 
> >     net: remove useless memset's in drivers get_stats64
> > 
> >     In dev_get_stats() the statistic structure storage has already
> > been
> >     zeroed. Therefore network drivers do not need to call memset()
> > again.
> > ...
> > < changes to other drivers snipped out >
> > ...
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> > b/drivers/net/ethernet/int
> > index 723025b..79651eb 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -5925,7 +5925,6 @@ void e1000e_get_stats64(struct net_device
> > *netdev,
> >  {
> >         struct e1000_adapter *adapter = netdev_priv(netdev);
> > 
> > -       memset(stats, 0, sizeof(struct rtnl_link_stats64));
> >         spin_lock(&adapter->stats64_lock);
> >         e1000e_update_stats(adapter);
> >         /* Fill out the OS statistics structure */
> > -------------------------------------------------------------------
> > -----------------
> > 
> > This also is where the bad counters start to show up for e1000e for
> > my test systems.  From this driver on I see (very) large values for
> > tx_dropped, rx_over_errors and tx_fifo_errors on driver load (even
> > before bringing the interface up.  It seems the memset is not so
> > useless for this driver after all.  Would simply reverting the e1000e
> > portion of this patch resolve the issue?
> 
> Looks like Aaron beat me to the punch on pointing out that we had this
> very code in there before.  It appears that Stephen's
> assertion/assumption was incorrect about the stats structure being
> zero'd out, which is why we are seeing the issue.
> 
> I have no issue reverting Stephen's earlier patch, or do we want to
> pursue why the stats structure is not zero'd out and resolve that
> instead.  Either way, just want to make sure we are all on the same
> page as to the right solution so that we do not end up repeating this
> in the future.

If you revert the e1000e part of 5944701df90d ("net: remove useless
memset's in drivers get_stats64", v4.11-rc1) it will fix the issue with
ethtool but memset will be done twice for code paths that call
dev_get_stats() (sysfs, rtnl, ...). Not a big deal but this is not a
problem in the approach I initially suggested. Alternatively, we could
put a memset in e1000_get_ethtool_stats().

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 10/10] tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps
From: Neal Cardwell @ 2017-04-25 17:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-11-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> Some devices or distributions use HZ=100 or HZ=250
>
> TCP receive buffer autotuning has poor behavior caused by this choice.
> Since autotuning happens after 4 ms or 10 ms, short distance flows
> get their receive buffer tuned to a very high value, but after an initial
> period where it was frozen to (too small) initial value.
>
> With tp->tcp_mstamp introduction, we can switch to high resolution
> timestamps almost for free (at the expense of 8 additional bytes per
> TCP structure)
>
> Note that some TCP stacks use usec TCP timestamps where this
> patch makes even more sense : Many TCP flows have < 500 usec RTT.
> Hopefully this finer TS option can be standardized soon.
>
> Tested:
>  HZ=100 kernel
>  ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &
>
>  Peer without patch :
>  lpaa24:~# ss -tmi dst lpaa23
>  ...
>  skmem:(r0,rb8388608,...)
>  rcv_rtt:10 rcv_space:3210000 minrtt:0.017
>
>  Peer with the patch :
>  lpaa23:~# ss -tmi dst lpaa24
>  ...
>  skmem:(r0,rb428800,...)
>  rcv_rtt:0.069 rcv_space:30000 minrtt:0.017
>
> We can see saner RCVBUF, and more precise rcv_rtt information.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH] ipv6: ensure message length for raw socket is at least sizeof(ipv6hdr)
From: David Miller @ 2017-04-25 17:55 UTC (permalink / raw)
  To: glider; +Cc: dvyukov, kcc, edumazet, kuznet, linux-kernel, netdev
In-Reply-To: <20170425131827.66498-1-glider@google.com>

From: Alexander Potapenko <glider@google.com>
Date: Tue, 25 Apr 2017 15:18:27 +0200

> rawv6_send_hdrinc() expects that the buffer copied from the userspace
> contains the IPv6 header, so if too few bytes are copied parts of the
> header may remain uninitialized.
> 
> This bug has been detected with KMSAN.
> 
> Signed-off-by: Alexander Potapenko <glider@google.com>

Hmmm, ipv4 seems to lack this check as well.

I think we need to be careful here and fully understand why KMSAN doesn't
seem to be triggering in the ipv4 case but for ipv6 it is before I apply
this.

Thanks.

^ permalink raw reply

* Re: [PATCH net] ipv6: fix source routing
From: David Miller @ 2017-04-25 17:59 UTC (permalink / raw)
  To: sd; +Cc: netdev, hannes, david.lebrun
In-Reply-To: <1d451d4d6c04c259280dee65caeabf3e00e0eb94.1493128459.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Tue, 25 Apr 2017 15:56:50 +0200

> Commit a149e7c7ce81 ("ipv6: sr: add support for SRH injection through
> setsockopt") introduced handling of IPV6_SRCRT_TYPE_4, but at the same
> time restricted it to only IPV6_SRCRT_TYPE_0 and
> IPV6_SRCRT_TYPE_4. Previously, ipv6_push_exthdr() and fl6_update_dst()
> would also handle other values (ie STRICT and TYPE_2).
> 
> Restore previous source routing behavior, by handling IPV6_SRCRT_STRICT
> and IPV6_SRCRT_TYPE_2 the same way as IPV6_SRCRT_TYPE_0 in
> ipv6_push_exthdr() and fl6_update_dst().
> 
> Fixes: a149e7c7ce81 ("ipv6: sr: add support for SRH injection through setsockopt")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Applied, thanks.

^ permalink raw reply

* Re: [PATCHv2 net] bridge: move bridge multicast cleanup to ndo_uninit
From: David Miller @ 2017-04-25 18:02 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, nikolay, stephen
In-Reply-To: <6e156e72cb1a5e279da8ac53bdb601eee5d654fe.1493132317.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 25 Apr 2017 22:58:37 +0800

> During removing a bridge device, if the bridge is still up, a new mdb entry
> still can be added in br_multicast_add_group() after all mdb entries are
> removed in br_multicast_dev_del(). Like the path:
> 
>   mld_ifc_timer_expire ->
>     mld_sendpack -> ...
>       br_multicast_rcv ->
>         br_multicast_add_group
> 
> The new mp's timer will be set up. If the timer expires after the bridge
> is freed, it may cause use-after-free panic in br_multicast_group_expired.
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
> IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
> Call Trace:
>  <IRQ>
>  [<ffffffff81094536>] call_timer_fn+0x36/0x110
>  [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
>  [<ffffffff81096967>] run_timer_softirq+0x237/0x340
>  [<ffffffff8108dcbf>] __do_softirq+0xef/0x280
>  [<ffffffff8169889c>] call_softirq+0x1c/0x30
>  [<ffffffff8102c275>] do_softirq+0x65/0xa0
>  [<ffffffff8108e055>] irq_exit+0x115/0x120
>  [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
>  [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80
> 
> Nikolay also found it would cause a memory leak - the mdb hash is
> reallocated and not freed due to the mdb rehash.
> 
> unreferenced object 0xffff8800540ba800 (size 2048):
>   backtrace:
>     [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
>     [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
>     [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
>     [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
>     [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
>     [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
>     [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
>     [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
>     [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
>     [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
>     [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
>     [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
>     [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
>     [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
>     [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
>     [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
> 
> This could happen when ip link remove a bridge or destroy a netns with a
> bridge device inside.
> 
> With Nikolay's suggestion, this patch is to clean up bridge multicast in
> ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
> that netif_running check in br_multicast_add_group can avoid this issue.
> 
> v1->v2:
>   - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
>     of calling dev_close in br_dev_delete.
> 
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox