Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Toshiaki Makita @ 2018-05-02  3:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toshiaki Makita, netdev, Tariq Toukan, Daniel Borkmann,
	Alexei Starovoitov, Eran Ben Elisha
In-Reply-To: <20180501101420.544ff225@redhat.com>

On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
>>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
>>> <makita.toshiaki@lab.ntt.co.jp> wrote:
>>> 
>>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
>>>>> <toshiaki.makita1@gmail.com> wrote:
>>>>> 
>>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
>>>>>> xdp_frame *frame) +{ +	struct veth_priv *rcv_priv, *priv = 
>>>>>> netdev_priv(dev); +	int headroom = frame->data - (void 
>>>>>> *)frame; +	struct net_device *rcv; +	int err = 0; + +	rcv
>>>>>> = rcu_dereference(priv->peer); +	if (unlikely(!rcv)) + 
>>>>>> return -ENXIO; + +	rcv_priv = netdev_priv(rcv); +	/* 
>>>>>> xdp_ring is initialized on receive side? */ +	if 
>>>>>> (rcu_access_pointer(rcv_priv->xdp_prog)) { +		err = 
>>>>>> xdp_ok_fwd_dev(rcv, frame->len); +		if (unlikely(err)) + 
>>>>>> return err; + +		err = veth_xdp_enqueue(rcv_priv, 
>>>>>> veth_xdp_to_ptr(frame)); +	} else { +		struct sk_buff *skb;
>>>>>> + +		skb = veth_build_skb(frame, headroom, frame->len, 0);
>>>>>> +		if (unlikely(!skb)) +			return -ENOMEM; + +		/* Get page
>>>>>> ref in case skb is dropped in netif_rx. + * The caller is
>>>>>> responsible for freeing the page on error. +		 */ +
>>>>>> get_page(virt_to_page(frame->data));
>>>>> 
>>>>> I'm not sure you can make this assumption, that xdp_frames 
>>>>> coming from another device driver uses a refcnt based memory 
>>>>> model. But maybe I'm confused, as this looks like an SKB 
>>>>> receive path, but in the ndo_xdp_xmit().
>>>> 
>>>> I find this path similar to cpumap, which creates skb from 
>>>> redirected xdp frame. Once it is converted to skb, skb head is 
>>>> freed by page_frag_free, so anyway I needed to get the
>>>> refcount here regardless of memory model.
>>> 
>>> Yes I know, I wrote cpumap ;-)
>>> 
>>> First of all, I don't want to see such xdp_frame to SKB 
>>> conversion code in every driver.  Because that increase the 
>>> chances of errors.  And when looking at the details, then it 
>>> seems that you have made the mistake of making it possible to 
>>> leak xdp_frame info to the SKB (which cpumap takes into 
>>> account).
>> 
>> Do you mean leaving xdp_frame in skb->head is leaking something? 
>> how?
> 
> Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> in frame data on page reuse").

Thanks for sharing the info.

> But this time, the concern is a bpf_prog attached at TC/bpf_cls 
> level, that can that can adjust head via bpf_skb_change_head (for
> XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As 
> described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in 
> is not super critical at the moment, as this _currently_ runs as 
> CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.

What I don't get is why special casing xdp_frame info. My assumption is
that the area above skb->mac_header is uninit kernel memory and should
not be readable by unprivileged users anyway. So I didn't clear the area
at this point.

>>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
>>> should be "owned" by XDP and have the proper refcnt to deliver
>>> it directly to the network stack.
>>> 
>>> Third, if we choose that we want a fallback, in-case XDP is not 
>>> enabled on egress dev (but it have an ndo_xdp_xmit), then it 
>>> should be placed in the generic/core code.  E.g. 
>>> __bpf_tx_xdp_map() could look at the return code from 
>>> dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint, this would 
>>> make it easy to implement TX bulking towards the dev).
>> 
>> Right, this is a much cleaner way.
>> 
>> Although I feel like we should add this fallback for veth because 
>> it requires something which is different from other drivers 
>> (enabling XDP on the peer device of the egress device),
> 
> (This is why I Cc'ed Tariq...)
> 
> This is actually a general problem with the xdp "xmit" side (and not
>  specific to veth driver). The problem exists for other drivers as 
> well.
> 
> The problem is that a driver can implement ndo_xdp_xmit(), but the 
> driver might not have allocated the necessary XDP TX-queue resources
>  yet (or it might not be possible due to system resource limits).
> 
> The current "hack" is to load an XDP prog on the egress device, and 
> then assume that this cause the driver to also allocate the XDP 
> ndo_xdo_xmit side HW resources.  This is IMHO a wrong assumption.
 >
> We need a more generic way to test if a net_device is "ready/enabled"
> for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> on how to implement this...

My assumption on REDIRECT requirement came from this.
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

I guess you are saying thing are changing, and having an XDP program 
attached on the egress device is no longer generally sufficient. Looking 
forward to Tariq's solution.

Toshiaki Makita

> 
> My opinion is that it is a waste of (HW/mem) resources to always 
> allocate resources for ndo_xdp_xmit when loading an XDP program. 
> Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP 
> redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g. 
> today using ixgbe on machines with more than 96 CPUs, will fail due 
> to limited TX queue resources. Thus, blocking the mentioned 
> use-cases.
> 
> 
>> I'll drop the part for now. It should not be resolved in the driver
>> code.
> 
> Thank you.
> 

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context
From: kbuild test robot @ 2018-05-02  3:34 UTC (permalink / raw)
  To: Song Liu
  Cc: kbuild-all, netdev, Song Liu, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, Peter Zijlstra
In-Reply-To: <20180502000220.2585320-1-songliubraving@fb.com>

Hi Song,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Song-Liu/bpf-enable-stackmap-with-build_id-in-nmi-context/20180502-081727
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> kernel/bpf/stackmap.c:51:1: sparse: symbol '__pcpu_scope_irq_work' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [RFC PATCH] bpf: __pcpu_scope_irq_work can be static
From: kbuild test robot @ 2018-05-02  3:34 UTC (permalink / raw)
  To: Song Liu
  Cc: kbuild-all, netdev, Song Liu, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, Peter Zijlstra
In-Reply-To: <20180502000220.2585320-1-songliubraving@fb.com>


Fixes: 1bd02fb7c12e ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 stackmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index c33fec1..5d0a951 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -48,7 +48,7 @@ static void up_read_work(struct irq_work *entry)
 	work->sem = NULL;
 }
 
-DEFINE_PER_CPU(struct stack_map_irq_work, irq_work);
+static DEFINE_PER_CPU(struct stack_map_irq_work, irq_work);
 
 static inline bool stack_map_use_build_id(struct bpf_map *map)
 {

^ permalink raw reply related

* Re: [PATCH RFC 4/9] veth: Use NAPI for XDP
From: Toshiaki Makita @ 2018-05-02  3:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Toshiaki Makita, netdev
In-Reply-To: <20180501104306.33c65bd8@redhat.com>

On 18/05/01 (火) 17:43, Jesper Dangaard Brouer wrote:
> On Tue, 1 May 2018 17:02:34 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/05/01 16:50, Jesper Dangaard Brouer wrote:
>>> On Tue, 24 Apr 2018 23:39:18 +0900
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
>>>    
>>>> +static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
>>>> +{
>>>> +	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
>>>> +		return -ENOSPC;
>>>> +
>>>> +	return 0;
>>>> +}
>>>
>>> Here we have a lock per (enqueued) packet.  I'm working on changing the
>>> ndo_xdp_xmit API to allow bulking.  And the tun driver have exact same
>>> issue/need.
>>
>> Probably I should have noted in commitlog, but this per-packet lock is
>> removed in patch 9.
>> I'm curious about if any change is needed by your new API.
> 
> Again, I'm just moving this into the generic code, to avoid having to
> implement this for every driver.
> 
> Plus, with CONFIG_RETPOLINE we have the advantage of only calling
> ndo_xdp_xmit once (indirect function pointer call).

Nice. Once it becomes available I'll use it instead.

Toshiaki Makita

^ permalink raw reply

* Re: [RFC v3 0/5] virtio: support packed ring
From: Jason Wang @ 2018-05-02  3:49 UTC (permalink / raw)
  To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu, jfreimann
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>



On 2018年04月25日 13:15, Tiwei Bie wrote:
> Hello everyone,
>
> This RFC implements packed ring support in virtio driver.
>
> Some simple functional tests have been done with Jason's
> packed ring implementation in vhost:
>
> https://lkml.org/lkml/2018/4/23/12
>
> Both of ping and netperf worked as expected (with EVENT_IDX
> disabled). But there are below known issues:
>
> 1. Reloading the guest driver will break the Tx/Rx;

It looks like the reason is we don't sync wrap counter information 
between host and qemu through VHOST_SET/GET_VRING_BASE. And both vhost 
and qemu need to do this through encoding warp counters to higher bits 
of vhost_vring_state.num,

Thanks

^ permalink raw reply

* Re: [PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer
From: David Ahern @ 2018-05-02  4:12 UTC (permalink / raw)
  To: Dave Taht, netdev
In-Reply-To: <1525199561-9302-1-git-send-email-dave.taht@gmail.com>

On 5/1/18 12:32 PM, Dave Taht wrote:
> ack_recognize can shift pure tcp acks into another flowid.
> ---
>  examples/bpf/ack_recognize.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 98 insertions(+)
>  create mode 100644 examples/bpf/ack_recognize.c
> 
> diff --git a/examples/bpf/ack_recognize.c b/examples/bpf/ack_recognize.c
> new file mode 100644
> index 0000000..5da620c
> --- /dev/null
> +++ b/examples/bpf/ack_recognize.c
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2017 Google Inc.

2017?? it's 2018. Did you write this last year and just now posting?

> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> + * 02110-1301, USA.
> + */
> +
> +/*
> + * Author: dave.taht@gmail.com (Dave Taht)
> + *
> + * ack_recognizer: An eBPF program that correctly recognizes modern TCP ACKs,
> + * with tcp option fields like SACK and timestamps, and no additional data.
> + *
> + * ack_match call: Recognize "pure acks" with no data payload
> + *
> + */
> +
> +#include "bpf_api.h"
> +#include "linux/if_ether.h"
> +#include "linux/ip.h"
> +#include "linux/in.h"
> +#include "linux/ipv6.h"
> +#include "linux/tcp.h"
> +
> +/*
> + * A pure ack contains the ip header, the tcp header + options, flags with the
> + * ack field set, and no additional payload. That last bit is what every prior
> + * ack filter gets wrong, they typically assume an obsolete 64 bytes, and don't
> + * calculate the options (like sack or timestamps) to subtract from the payload.
> + */
> +
> +__section_cls_entry
> +int ack_match(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct ethhdr *eth = data;
> +	struct iphdr *iph = data + sizeof(*eth);
> +	struct tcphdr *tcp;
> +
> +	if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*tcp) > data_end)
> +		return 0;
> +
> +	if (eth->h_proto == htons(ETH_P_IP) &&
> +	    iph->version == 4) {

Why not make the version == 4 under the proto check?
	if (eth->h_proto == htons(ETH_P_IP) {
		if (iph->version != 4)
			return 0;
if the eth proto is ETH_P_IP, and version != 4 something is really wrong

> +		if(iph->protocol == IPPROTO_TCP &&
> +		   iph->ihl == 5 &&
> +		   data + sizeof(*eth) + 20 + sizeof(*tcp) <= data_end) {
> +			tcp = data + sizeof(*eth) + 20;
> +			if (tcp->ack &&
> +			    htons(iph->tot_len) == 20 + tcp->doff*4)
> +				return -1;

And then here why the magic '20' and limits on ihl = 5? both are trivial
to accommodate programmatically.

> +		}
> +	} else if (eth->h_proto == htons(ETH_P_IPV6) &&
> +		   iph->version == 6) {
> +		struct ipv6hdr *iph6 = (struct ipv6hdr *) iph;
> +		if(iph6->nexthdr == IPPROTO_TCP &&
> +		   data + sizeof(*eth) + 40 + sizeof(*tcp) <= data_end ) {
> +			tcp = data + sizeof(*eth) + 40;
> +			if (tcp->ack &&
> +			    tcp->doff*4 == htons(iph6->payload_len))
> +				return -1;

ditto here.

> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/* Example: Move acks into a priority queue:
> +
> +IFACE=eth0
> +tc qdisc del dev $IFACE root 2> /dev/null
> +tc qdisc add dev $IFACE root handle 1: prio bands 3 \
> +	priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> +tc qdisc add dev $IFACE parent 1:1 handle 10:1 sfq headdrop # acks only
> +tc qdisc add dev $IFACE parent 1:2 handle 20:1 fq_codel # all other traffic
> +tc qdisc add dev $IFACE parent 1:3 handle 30:1 fq_codel # unused
> +tc filter add dev $IFACE parent 1: prio 1 bpf \
> +	object-file ack_recognize.o flowid 1:1
> +
> +Please note that a strict priority queue is not a good idea (drr would be
> +better), nor is doing any level of prioritization on acks at all....
> +*/
> +
> +BPF_LICENSE("GPL");
> 

^ permalink raw reply

* Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.
From: William Tu @ 2018-05-02  4:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linux Kernel Network Developers, Yonghong Song, Yifeng Sun
In-Reply-To: <20180501231652.ec563qza4c2nayhx@ast-mbp>

On Tue, May 1, 2018 at 4:16 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Apr 30, 2018 at 10:15:05AM -0700, William Tu wrote:
>> Existing verifier does not allow 'ctx + const + const'.  However, due to
>> compiler optimization, there is a case where BPF compilerit generates
>> 'ctx + const + 0', as shown below:
>>
>>   599: (1d) if r2 == r4 goto pc+2
>>    R0=inv(id=0) R1=ctx(id=0,off=40,imm=0)
>>    R2=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
>>    R3=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R4=inv0
>>    R6=ctx(id=0,off=0,imm=0) R7=inv2
>>   600: (bf) r1 = r6                   // r1 is ctx
>>   601: (07) r1 += 36                  // r1 has offset 36
>>   602: (61) r4 = *(u32 *)(r1 +0)      // r1 + 0
>>   dereference of modified ctx ptr R1 off=36+0, ctx+const is allowed,
>>   ctx+const+const is not
>>
>> The reason for BPF backend generating this code is due optimization
>> likes this, explained from Yonghong:
>>     if (...)
>>         *(ctx + 60)
>>     else
>>         *(ctx + 56)
>>
>> The compiler translates it to
>>     if (...)
>>        ptr = ctx + 60
>>     else
>>        ptr = ctx + 56
>>     *(ptr + 0)
>>
>> So load ptr memory become an example of 'ctx + const + 0'.  This patch
>> enables support for this case.
>>
>> Fixes: f8ddadc4db6c7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>> Cc: Yonghong Song <yhs@fb.com>
>> Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
>> Signed-off-by: William Tu <u9012063@gmail.com>
>> ---
>>  kernel/bpf/verifier.c                       |  2 +-
>>  tools/testing/selftests/bpf/test_verifier.c | 13 +++++++++++++
>>  2 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 712d8655e916..c9a791b9cf2a 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1638,7 +1638,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>>               /* ctx accesses must be at a fixed offset, so that we can
>>                * determine what type of data were returned.
>>                */
>> -             if (reg->off) {
>> +             if (reg->off && off != reg->off) {
>>                       verbose(env,
>>                               "dereference of modified ctx ptr R%d off=%d+%d, ctx+const is allowed, ctx+const+const is not\n",
>>                               regno, reg->off, off - reg->off);
>> diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
>> index 1acafe26498b..95ad5d5723ae 100644
>> --- a/tools/testing/selftests/bpf/test_verifier.c
>> +++ b/tools/testing/selftests/bpf/test_verifier.c
>> @@ -8452,6 +8452,19 @@ static struct bpf_test tests[] = {
>>               .prog_type = BPF_PROG_TYPE_SCHED_CLS,
>>       },
>>       {
>> +             "arithmetic ops make PTR_TO_CTX + const + 0 valid",
>> +             .insns = {
>> +                     BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
>> +                                   offsetof(struct __sk_buff, data) -
>> +                                   offsetof(struct __sk_buff, mark)),

This is:
   r1 += N     // r1 has offset N: the offset between data and mark)

>> +                     BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),

This is:
   r0 = *(u32 *)(r1 +0)      // r1 + 0

So the above two lines create similar case I hit
  601: (07) r1 += 36                       // r1 has offset 36
  602: (61) r4 = *(u32 *)(r1 +0)      // r1 + 0

>
> How rewritten code looks here?
>
> The patch is allowing check_ctx_access() to proceed with sort-of
> correct 'off' and remember ctx_field_size,
> but in convert_ctx_accesses() it's using insn->off to do conversion.
> Which is zero in this case, so it will convert
> struct __sk_buff {
>         __u32 len; // offset 0
>
> into access of 'struct sk_buff'->len
> and then will add __sk_buff's &data - &mark delta to in-kernel len field.
> Which will point to some random field further down in struct sk_buff.
> Doesn't look correct at all.

why?
So it points to ctx + "offsetof(struct __sk_buff, data) -
offsetof(struct __sk_buff, mark)",
which is ctx + const
then I tested that 'ctx + const + 0' should pass the verifier

> How did you test this patch?
>
Without the patch, the test case will fail.
With the patch, the test case passes.

William

^ permalink raw reply

* Re: linux-next: manual merge of the bpf-next tree with the bpf tree
From: Song Liu @ 2018-05-02  4:40 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Daniel Borkmann, Alexei Starovoitov, Networking,
	Linux-Next Mailing List, Linux Kernel Mailing List, Yonghong Song
In-Reply-To: <20180502120921.654cc338@canb.auug.org.au>



> On May 1, 2018, at 7:09 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> Hi all,
> 
> Today's linux-next merge of the bpf-next tree got a conflict in:
> 
>  tools/testing/selftests/bpf/test_progs.c
> 
> between commit:
> 
>  a4e21ff8d9a3 ("bpf: minor fix to selftest test_stacktrace_build_id()")
> 
> from the bpf tree and commit:
> 
>  79b453501310 ("tools/bpf: add a test for bpf_get_stack with tracepoint prog")
> 
> from the bpf-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc tools/testing/selftests/bpf/test_progs.c
> index fac581f1c57f,aa336f0abebc..000000000000
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@@ -1137,9 -1193,14 +1193,14 @@@ static void test_stacktrace_build_id(vo
>  		  err, errno))
>  		goto disable_pmu;
> 
> + 	stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
> + 	if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap",
> + 		  "err %d errno %d\n", err, errno))
> + 		goto disable_pmu;
> + 
>  	assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
>  	       == 0);
> -	assert(system("./urandom_read if=/dev/urandom of=/dev/zero count=4 2> /dev/null") == 0);
> +	assert(system("./urandom_read") == 0);
>  	/* disable stack trace collection */
>  	key = 0;
>  	val = 1;
> @@@ -1188,8 -1249,15 +1249,15 @@@
>  		previous_key = key;
>  	} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
> 
> - 	CHECK(build_id_matches < 1, "build id match",
> - 	      "Didn't find expected build ID from the map\n");
> + 	if (CHECK(build_id_matches < 1, "build id match",
> -		  "Didn't find expected build ID from the map"))
> ++		  "Didn't find expected build ID from the map\n"))

^^^  Is there a "+" at the beginning of this line? 

Thanks,
Song

> + 		goto disable_pmu;
> + 
> + 	stack_trace_len = PERF_MAX_STACK_DEPTH
> + 		* sizeof(struct bpf_stack_build_id);
> + 	err = compare_stack_ips(stackmap_fd, stack_amap_fd, stack_trace_len);
> + 	CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
> + 	      "err %d errno %d\n", err, errno);
> 
>  disable_pmu:
>  	ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);

^ permalink raw reply

* Re: [PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer
From: Dave Taht @ 2018-05-02  4:51 UTC (permalink / raw)
  To: David Ahern; +Cc: Linux Kernel Network Developers
In-Reply-To: <96fb372d-4fc8-77db-e971-6b9540d8ad26@gmail.com>

On Tue, May 1, 2018 at 9:12 PM, David Ahern <dsahern@gmail.com> wrote:
> On 5/1/18 12:32 PM, Dave Taht wrote:
>> ack_recognize can shift pure tcp acks into another flowid.
>> ---
>>  examples/bpf/ack_recognize.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 98 insertions(+)
>>  create mode 100644 examples/bpf/ack_recognize.c
>>
>> diff --git a/examples/bpf/ack_recognize.c b/examples/bpf/ack_recognize.c
>> new file mode 100644
>> index 0000000..5da620c
>> --- /dev/null
>> +++ b/examples/bpf/ack_recognize.c
>> @@ -0,0 +1,98 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright 2017 Google Inc.
>
> 2017?? it's 2018. Did you write this last year and just now posting?

November, 2017? Shortly prior to you taking iproute2-next off of steven's hands.

It was the first stage of a proof of concept showing (eventually) that
a simple ack decimator/filter (like "drop every other ack", or simple
"drop head" in the supplied example) had a ton of problems, and that
to filter out acks correctly to any extent, it needed to peer back
into the queue. (see the sch_cake ack-filter controversy on-going on
this list)

While it's better than what is in wondershaper, I wouldn't recommend
anyone use it. It was, however, useful in acquiring a gut feel for
what several other broken ack filters might be doing in the field. I
submitted it as a possibly useful example for showing off tc-bpf and
to add fuel to the ack-filter fire under cake.

I can clean it up, SPF it, etc, if you want it. Otherwise, sorry for the noise.

>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version 2
>> + * of the License, or (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
>> + * 02110-1301, USA.
>> + */
>> +
>> +/*
>> + * Author: dave.taht@gmail.com (Dave Taht)
>> + *
>> + * ack_recognizer: An eBPF program that correctly recognizes modern TCP ACKs,
>> + * with tcp option fields like SACK and timestamps, and no additional data.
>> + *
>> + * ack_match call: Recognize "pure acks" with no data payload
>> + *
>> + */
>> +
>> +#include "bpf_api.h"
>> +#include "linux/if_ether.h"
>> +#include "linux/ip.h"
>> +#include "linux/in.h"
>> +#include "linux/ipv6.h"
>> +#include "linux/tcp.h"
>> +
>> +/*
>> + * A pure ack contains the ip header, the tcp header + options, flags with the
>> + * ack field set, and no additional payload. That last bit is what every prior
>> + * ack filter gets wrong, they typically assume an obsolete 64 bytes, and don't
>> + * calculate the options (like sack or timestamps) to subtract from the payload.
>> + */
>> +
>> +__section_cls_entry
>> +int ack_match(struct __sk_buff *skb)
>> +{
>> +     void *data = (void *)(long)skb->data;
>> +     void *data_end = (void *)(long)skb->data_end;
>> +     struct ethhdr *eth = data;
>> +     struct iphdr *iph = data + sizeof(*eth);
>> +     struct tcphdr *tcp;
>> +
>> +     if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*tcp) > data_end)
>> +             return 0;
>> +
>> +     if (eth->h_proto == htons(ETH_P_IP) &&
>> +         iph->version == 4) {
>
> Why not make the version == 4 under the proto check?
>         if (eth->h_proto == htons(ETH_P_IP) {
>                 if (iph->version != 4)
>                         return 0;
> if the eth proto is ETH_P_IP, and version != 4 something is really wrong
>
>> +             if(iph->protocol == IPPROTO_TCP &&
>> +                iph->ihl == 5 &&
>> +                data + sizeof(*eth) + 20 + sizeof(*tcp) <= data_end) {
>> +                     tcp = data + sizeof(*eth) + 20;
>> +                     if (tcp->ack &&
>> +                         htons(iph->tot_len) == 20 + tcp->doff*4)
>> +                             return -1;
>
> And then here why the magic '20' and limits on ihl = 5? both are trivial
> to accommodate programmatically.
>
>> +             }
>> +     } else if (eth->h_proto == htons(ETH_P_IPV6) &&
>> +                iph->version == 6) {
>> +             struct ipv6hdr *iph6 = (struct ipv6hdr *) iph;
>> +             if(iph6->nexthdr == IPPROTO_TCP &&
>> +                data + sizeof(*eth) + 40 + sizeof(*tcp) <= data_end ) {
>> +                     tcp = data + sizeof(*eth) + 40;
>> +                     if (tcp->ack &&
>> +                         tcp->doff*4 == htons(iph6->payload_len))
>> +                             return -1;
>
> ditto here.
>
>> +             }
>> +     }
>> +
>> +     return 0;
>> +}
>> +
>> +/* Example: Move acks into a priority queue:
>> +
>> +IFACE=eth0
>> +tc qdisc del dev $IFACE root 2> /dev/null
>> +tc qdisc add dev $IFACE root handle 1: prio bands 3 \
>> +     priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>> +tc qdisc add dev $IFACE parent 1:1 handle 10:1 sfq headdrop # acks only
>> +tc qdisc add dev $IFACE parent 1:2 handle 20:1 fq_codel # all other traffic
>> +tc qdisc add dev $IFACE parent 1:3 handle 30:1 fq_codel # unused
>> +tc filter add dev $IFACE parent 1: prio 1 bpf \
>> +     object-file ack_recognize.o flowid 1:1
>> +
>> +Please note that a strict priority queue is not a good idea (drr would be
>> +better), nor is doing any level of prioritization on acks at all....
>> +*/
>> +
>> +BPF_LICENSE("GPL");
>>
>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply

* Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.
From: Alexei Starovoitov @ 2018-05-02  4:52 UTC (permalink / raw)
  To: William Tu; +Cc: Linux Kernel Network Developers, Yonghong Song, Yifeng Sun
In-Reply-To: <CALDO+SbDZ5BDz2Pt_66O96Y0ZtvpB78oKaaxtL-Toy_npr0_Zw@mail.gmail.com>

On Tue, May 01, 2018 at 09:35:29PM -0700, William Tu wrote:
> 
> > How did you test this patch?
> >
> Without the patch, the test case will fail.
> With the patch, the test case passes.

Please test it with real program and you'll see crashes and garbage returned.

^ permalink raw reply

* Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats
From: Michael Chan @ 2018-05-02  5:12 UTC (permalink / raw)
  To: Zumeng Chen
  Cc: Netdev, open list, Siva Reddy Kallam,
	prashant.sreedharan@broadcom.com, David Miller, Zumeng Chen
In-Reply-To: <20180502004234.230662-1-zumeng.chen@gmail.com>

On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen <zumeng.chen@gmail.com> wrote:

> diff --git a/drivers/net/ethernet/broadcom/tg3.h b/drivers/net/ethernet/broadcom/tg3.h
> index 3b5e98e..c61d83c 100644
> --- a/drivers/net/ethernet/broadcom/tg3.h
> +++ b/drivers/net/ethernet/broadcom/tg3.h
> @@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
>         TG3_FLAG_ROBOSWITCH,
>         TG3_FLAG_ONE_DMA_AT_ONCE,
>         TG3_FLAG_RGMII_MODE,
> +       TG3_FLAG_HALT,

I think you should be able to use the existing INIT_COMPLETE flag and
not have to add a new flag.

>
>         /* Add new flags before this comment and TG3_FLAG_NUMBER_OF_FLAGS */
>         TG3_FLAG_NUMBER_OF_FLAGS,       /* Last entry in enum TG3_FLAGS */
> --
> 2.9.3
>

^ permalink raw reply

* Re: [Xen-devel] [PATCH 2/6] xen-netfront: copy response out of shared buffer before accessing it
From: Oleksandr Andrushchenko @ 2018-05-02  5:20 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, xen-devel
  Cc: Juergen Gross, open list:NETWORKING DRIVERS, stable, open list,
	Boris Ostrovsky
In-Reply-To: <98a855dceb47dbebd9c87e024084f14a5cb127f7.1525122026.git-series.marmarek@invisiblethingslab.com>

On 05/01/2018 12:01 AM, Marek Marczykowski-Górecki wrote:
> Make local copy of the response, otherwise backend might modify it while
> frontend is already processing it - leading to time of check / time of
> use issue.
>
> This is complementary to XSA155.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
>   drivers/net/xen-netfront.c | 51 +++++++++++++++++++--------------------
>   1 file changed, 25 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 4dd0668..dc99763 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -387,13 +387,13 @@ static void xennet_tx_buf_gc(struct netfront_queue *queue)
>   		rmb(); /* Ensure we see responses up to 'rp'. */
>   
>   		for (cons = queue->tx.rsp_cons; cons != prod; cons++) {
Side comment: the original concern was expressed on the above counters,
will those be addressed as a dedicated series?
> -			struct xen_netif_tx_response *txrsp;
> +			struct xen_netif_tx_response txrsp;
>   
> -			txrsp = RING_GET_RESPONSE(&queue->tx, cons);
> -			if (txrsp->status == XEN_NETIF_RSP_NULL)
> +			RING_COPY_RESPONSE(&queue->tx, cons, &txrsp);
> +			if (txrsp.status == XEN_NETIF_RSP_NULL)
>   				continue;
>   
IMO, there is still no guarantee you access consistent data after this 
change.
What if part of the response was ok when you started copying and
then, in the middle, backend poisons the end of the response?
This seems to be just like minimizing(?) chances to work with inconsistent
data rather than removing the possibility of such completely
> -			id  = txrsp->id;
> +			id  = txrsp.id;
>   			skb = queue->tx_skbs[id].skb;
>   			if (unlikely(gnttab_query_foreign_access(
>   				queue->grant_tx_ref[id]) != 0)) {
> @@ -741,7 +741,7 @@ static int xennet_get_extras(struct netfront_queue *queue,
>   			     RING_IDX rp)
>   
>   {
> -	struct xen_netif_extra_info *extra;
> +	struct xen_netif_extra_info extra;
>   	struct device *dev = &queue->info->netdev->dev;
>   	RING_IDX cons = queue->rx.rsp_cons;
>   	int err = 0;
> @@ -757,24 +757,23 @@ static int xennet_get_extras(struct netfront_queue *queue,
>   			break;
>   		}
>   
> -		extra = (struct xen_netif_extra_info *)
> -			RING_GET_RESPONSE(&queue->rx, ++cons);
> +		RING_COPY_RESPONSE(&queue->rx, ++cons, &extra);
>   
> -		if (unlikely(!extra->type ||
> -			     extra->type >= XEN_NETIF_EXTRA_TYPE_MAX)) {
> +		if (unlikely(!extra.type ||
> +			     extra.type >= XEN_NETIF_EXTRA_TYPE_MAX)) {
>   			if (net_ratelimit())
>   				dev_warn(dev, "Invalid extra type: %d\n",
> -					extra->type);
> +					extra.type);
>   			err = -EINVAL;
>   		} else {
> -			memcpy(&extras[extra->type - 1], extra,
> -			       sizeof(*extra));
> +			memcpy(&extras[extra.type - 1], &extra,
> +			       sizeof(extra));
>   		}
>   
>   		skb = xennet_get_rx_skb(queue, cons);
>   		ref = xennet_get_rx_ref(queue, cons);
>   		xennet_move_rx_slot(queue, skb, ref);
> -	} while (extra->flags & XEN_NETIF_EXTRA_FLAG_MORE);
> +	} while (extra.flags & XEN_NETIF_EXTRA_FLAG_MORE);
>   
>   	queue->rx.rsp_cons = cons;
>   	return err;
> @@ -784,28 +783,28 @@ static int xennet_get_responses(struct netfront_queue *queue,
>   				struct netfront_rx_info *rinfo, RING_IDX rp,
>   				struct sk_buff_head *list)
>   {
> -	struct xen_netif_rx_response *rx = &rinfo->rx;
> +	struct xen_netif_rx_response rx = rinfo->rx;
>   	struct xen_netif_extra_info *extras = rinfo->extras;
>   	struct device *dev = &queue->info->netdev->dev;
>   	RING_IDX cons = queue->rx.rsp_cons;
>   	struct sk_buff *skb = xennet_get_rx_skb(queue, cons);
>   	grant_ref_t ref = xennet_get_rx_ref(queue, cons);
> -	int max = MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD);
> +	int max = MAX_SKB_FRAGS + (rx.status <= RX_COPY_THRESHOLD);
>   	int slots = 1;
>   	int err = 0;
>   	unsigned long ret;
>   
> -	if (rx->flags & XEN_NETRXF_extra_info) {
> +	if (rx.flags & XEN_NETRXF_extra_info) {
>   		err = xennet_get_extras(queue, extras, rp);
>   		cons = queue->rx.rsp_cons;
>   	}
>   
>   	for (;;) {
> -		if (unlikely(rx->status < 0 ||
> -			     rx->offset + rx->status > XEN_PAGE_SIZE)) {
> +		if (unlikely(rx.status < 0 ||
> +			     rx.offset + rx.status > XEN_PAGE_SIZE)) {
>   			if (net_ratelimit())
>   				dev_warn(dev, "rx->offset: %u, size: %d\n",
> -					 rx->offset, rx->status);
> +					 rx.offset, rx.status);
>   			xennet_move_rx_slot(queue, skb, ref);
>   			err = -EINVAL;
>   			goto next;
> @@ -819,7 +818,7 @@ static int xennet_get_responses(struct netfront_queue *queue,
>   		if (ref == GRANT_INVALID_REF) {
>   			if (net_ratelimit())
>   				dev_warn(dev, "Bad rx response id %d.\n",
> -					 rx->id);
> +					 rx.id);
>   			err = -EINVAL;
>   			goto next;
>   		}
> @@ -832,7 +831,7 @@ static int xennet_get_responses(struct netfront_queue *queue,
>   		__skb_queue_tail(list, skb);
>   
>   next:
> -		if (!(rx->flags & XEN_NETRXF_more_data))
> +		if (!(rx.flags & XEN_NETRXF_more_data))
>   			break;
>   
>   		if (cons + slots == rp) {
> @@ -842,7 +841,7 @@ static int xennet_get_responses(struct netfront_queue *queue,
>   			break;
>   		}
>   
> -		rx = RING_GET_RESPONSE(&queue->rx, cons + slots);
> +		RING_COPY_RESPONSE(&queue->rx, cons + slots, &rx);
>   		skb = xennet_get_rx_skb(queue, cons + slots);
>   		ref = xennet_get_rx_ref(queue, cons + slots);
>   		slots++;
> @@ -898,9 +897,9 @@ static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
>   	struct sk_buff *nskb;
>   
>   	while ((nskb = __skb_dequeue(list))) {
> -		struct xen_netif_rx_response *rx =
> -			RING_GET_RESPONSE(&queue->rx, ++cons);
> +		struct xen_netif_rx_response rx;
>   		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
> +		RING_COPY_RESPONSE(&queue->rx, ++cons, &rx);
>   
>   		if (shinfo->nr_frags == MAX_SKB_FRAGS) {
>   			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
> @@ -911,7 +910,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
>   		BUG_ON(shinfo->nr_frags >= MAX_SKB_FRAGS);
>   
>   		skb_add_rx_frag(skb, shinfo->nr_frags, skb_frag_page(nfrag),
> -				rx->offset, rx->status, PAGE_SIZE);
> +				rx.offset, rx.status, PAGE_SIZE);
>   
>   		skb_shinfo(nskb)->nr_frags = 0;
>   		kfree_skb(nskb);
> @@ -1007,7 +1006,7 @@ static int xennet_poll(struct napi_struct *napi, int budget)
>   	i = queue->rx.rsp_cons;
>   	work_done = 0;
>   	while ((i != rp) && (work_done < budget)) {
> -		memcpy(rx, RING_GET_RESPONSE(&queue->rx, i), sizeof(*rx));
> +		RING_COPY_RESPONSE(&queue->rx, i, rx);
>   		memset(extras, 0, sizeof(rinfo.extras));
>   
>   		err = xennet_get_responses(queue, &rinfo, rp, &tmpq);

^ permalink raw reply

* [PATCH net] sctp: init active key for the new asoc in dupcook_a and dupcook_b
From: Xin Long @ 2018-05-02  5:37 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

When processing a duplicate cookie-echo chunk, for case 'A' and 'B',
after sctp_process_init for the new asoc, if auth is enabled for the
cookie-ack chunk, the active key should also be initialized.

Otherwise, the cookie-ack chunk made later can not be set with auth
shkey properly, and a crash can even be caused by this, as after
Commit 1b1e0bc99474 ("sctp: add refcnt support for sh_key"), sctp
needs to hold the shkey when making control chunks.

Fixes: 1b1e0bc99474 ("sctp: add refcnt support for sh_key")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/sm_statefuns.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index dd0594a..98acfed 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -1794,6 +1794,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_a(
 			       GFP_ATOMIC))
 		goto nomem;
 
+	if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
+		goto nomem;
+
 	/* Make sure no new addresses are being added during the
 	 * restart.  Though this is a pretty complicated attack
 	 * since you'd have to get inside the cookie.
@@ -1906,6 +1909,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_b(
 			       GFP_ATOMIC))
 		goto nomem;
 
+	if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
+		goto nomem;
+
 	/* Update the content of current association.  */
 	sctp_add_cmd_sf(commands, SCTP_CMD_UPDATE_ASSOC, SCTP_ASOC(new_asoc));
 	sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
-- 
2.1.0

^ permalink raw reply related

* [PATCH net] sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
From: Xin Long @ 2018-05-02  5:39 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

When processing a duplicate cookie-echo chunk, for case 'D', sctp will
not process the param from this chunk. It means old asoc has nothing
to be updated, and the new temp asoc doesn't have the complete info.

So there's no reason to use the new asoc when creating the cookie-ack
chunk. Otherwise, like when auth is enabled for cookie-ack, the chunk
can not be set with auth, and it will definitely be dropped by peer.

This issue is there since very beginning, and we fix it by using the
old asoc instead.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/sm_statefuns.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 98acfed..28c070e 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -2056,7 +2056,7 @@ static enum sctp_disposition sctp_sf_do_dupcook_d(
 		}
 	}
 
-	repl = sctp_make_cookie_ack(new_asoc, chunk);
+	repl = sctp_make_cookie_ack(asoc, chunk);
 	if (!repl)
 		goto nomem;
 
-- 
2.1.0

^ permalink raw reply related

* RE: RTL8723BE performance regression
From: Pkshih @ 2018-05-02  5:44 UTC (permalink / raw)
  To: João Paulo Rechi Vita, Larry Finger
  Cc: Steve deRosier, 莊彥宣, Birming Chiu, Shaofu,
	Steven Ting, Chaoming_Li, Kalle Valo, linux-wireless,
	Network Development, LKML, Daniel Drake,
	João Paulo Rechi Vita, linux@endlessm.com
In-Reply-To: <CA+A7VXUUsZHD2gr9TBcV6jZBOPGZv7_vK-wWZ0MvGgiCnAiUgQ@mail.gmail.com>



> -----Original Message-----
> From: João Paulo Rechi Vita [mailto:jprvita@gmail.com]
> Sent: Wednesday, May 02, 2018 6:41 AM
> To: Larry Finger
> Cc: Steve deRosier; 莊彥宣; Pkshih; Birming Chiu; Shaofu; Steven Ting; Chaoming_Li; Kalle Valo;
> linux-wireless; Network Development; LKML; Daniel Drake; João Paulo Rechi Vita; linux@endlessm.com
> Subject: Re: RTL8723BE performance regression
> 
> On Tue, Apr 3, 2018 at 7:51 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
> > On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
> >>
> >> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net>
> >> wrote:
> >>
> >> (...)
> >>
> >>> As the antenna selection code changes affected your first bisection, do
> >>> you
> >>> have one of those HP laptops with only one antenna and the incorrect
> >>> coding
> >>> in the FUSE?
> >>
> >>
> >> Yes, that is why I've been passing ant_sel=1 during my tests -- this
> >> was needed to achieve a good performance in the past, before this
> >> regression. I've also opened the laptop chassis and confirmed the
> >> antenna cable is plugged to the connector labeled with "1" on the
> >> card.
> >>
> >>> If so, please make sure that you still have the same signal
> >>> strength for good and bad cases. I have tried to keep the driver and the
> >>> btcoex code in sync, but there may be some combinations of antenna
> >>> configuration and FUSE contents that cause the code to fail.
> >>>
> >>
> >> What is the recommended way to monitor the signal strength?
> >
> >
> > The btcoex code is developed for multiple platforms by a different group
> > than the Linux driver. I think they made a change that caused ant_sel to
> > switch from 1 to 2. At least numerous comments at
> > github.com/lwfinger/rtlwifi_new claimed they needed to make that change.
> >
> > Mhy recommended method is to verify the wifi device name with "iw dev". Then
> > using that device
> >
> > sudo iw dev <dev_name> scan | egrep "SSID|signal"
> >
> 
> I have confirmed that the performance regression is indeed tied to
> signal strength: on the good cases signal was between -16 and -8 dBm,
> whereas in bad cases signal was always between -50 to - 40 dBm. I've
> also switched to testing bandwidth in controlled LAN environment using
> iperf3, as suggested by Steve deRosier, with the DUT being the only
> machine connected to the 2.4 GHz radio and the machine running the
> iperf3 server connected via ethernet.
> 

We have new experimental results in commit af8a41cccf8f46 ("rtlwifi: cleanup 
8723be ant_sel definition"). You can use the above commit and do the same 
experiments (with ant_sel=0, 1 and 2) in your side, and then share your results.
Since performance is tied to signal strength, you can only share signal strength.

Regards
PK


^ permalink raw reply

* [PATCH net] sctp: fix the issue that the cookie-ack with auth can't get processed
From: Xin Long @ 2018-05-02  5:45 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

When auth is enabled for cookie-ack chunk, in sctp_inq_pop, sctp
processes auth chunk first, then continues to the next chunk in
this packet if chunk_end + chunk_hdr size < skb_tail_pointer().
Otherwise, it will go to the next packet or discard this chunk.

However, it missed the fact that cookie-ack chunk's size is equal
to chunk_hdr size, which couldn't match that check, and thus this
chunk would not get processed.

This patch fixes it by changing the check to chunk_end + chunk_hdr
size <= skb_tail_pointer().

Fixes: 26b87c788100 ("net: sctp: fix remote memory pressure from excessive queueing")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/inqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index 23ebc53..eb93ffe 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -217,7 +217,7 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
 	skb_pull(chunk->skb, sizeof(*ch));
 	chunk->subh.v = NULL; /* Subheader is no longer valid.  */
 
-	if (chunk->chunk_end + sizeof(*ch) < skb_tail_pointer(chunk->skb)) {
+	if (chunk->chunk_end + sizeof(*ch) <= skb_tail_pointer(chunk->skb)) {
 		/* This is not a singleton */
 		chunk->singleton = 0;
 	} else if (chunk->chunk_end > skb_tail_pointer(chunk->skb)) {
-- 
2.1.0

^ permalink raw reply related

* Re: linux-next: manual merge of the bpf-next tree with the bpf tree
From: Stephen Rothwell @ 2018-05-02  5:50 UTC (permalink / raw)
  To: Song Liu
  Cc: Daniel Borkmann, Alexei Starovoitov, Networking,
	Linux-Next Mailing List, Linux Kernel Mailing List, Yonghong Song
In-Reply-To: <0942C278-336F-4795-BE63-FAD7FBAA231B@fb.com>

[-- Attachment #1: Type: text/plain, Size: 588 bytes --]

Hi Song,

On Wed, 2 May 2018 04:40:20 +0000 Song Liu <songliubraving@fb.com> wrote:
>
> > - 	CHECK(build_id_matches < 1, "build id match",
> > - 	      "Didn't find expected build ID from the map\n");
> > + 	if (CHECK(build_id_matches < 1, "build id match",
> > -		  "Didn't find expected build ID from the map"))
> > ++		  "Didn't find expected build ID from the map\n"))  
> 
> ^^^  Is there a "+" at the beginning of this line? 

No, this is a merge resolution diff, so the ++ means that the line did
not appear in either parent commit.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-05-02  5:56 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, Tariq Toukan, Daniel Borkmann,
	Alexei Starovoitov, Eran Ben Elisha, brouer
In-Reply-To: <874a1c42-bbdf-0ab4-b96e-3b07e6fb4aef@gmail.com>

On Wed, 2 May 2018 12:33:47 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> > On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> > <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:  
> >>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
> >>> <makita.toshiaki@lab.ntt.co.jp> wrote:
> >>>   
> >>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:  
> >>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
> >>>>> <toshiaki.makita1@gmail.com> wrote:
> >>>>>   
> >>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
[...]
> >>>>> 
> >>>>> I'm not sure you can make this assumption, that xdp_frames 
> >>>>> coming from another device driver uses a refcnt based memory 
> >>>>> model. But maybe I'm confused, as this looks like an SKB 
> >>>>> receive path, but in the ndo_xdp_xmit().  
> >>>> 
> >>>> I find this path similar to cpumap, which creates skb from 
> >>>> redirected xdp frame. Once it is converted to skb, skb head is 
> >>>> freed by page_frag_free, so anyway I needed to get the
> >>>> refcount here regardless of memory model.  
> >>> 
> >>> Yes I know, I wrote cpumap ;-)
> >>> 
> >>> First of all, I don't want to see such xdp_frame to SKB 
> >>> conversion code in every driver.  Because that increase the 
> >>> chances of errors.  And when looking at the details, then it 
> >>> seems that you have made the mistake of making it possible to 
> >>> leak xdp_frame info to the SKB (which cpumap takes into 
> >>> account).  
> >> 
> >> Do you mean leaving xdp_frame in skb->head is leaking something? 
> >> how?  
> > 
> > Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> > headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> > in frame data on page reuse").  
> 
> Thanks for sharing the info.
> 
> > But this time, the concern is a bpf_prog attached at TC/bpf_cls 
> > level, that can that can adjust head via bpf_skb_change_head (for
> > XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As 
> > described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in 
> > is not super critical at the moment, as this _currently_ runs as 
> > CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.  
> 
> What I don't get is why special casing xdp_frame info. My assumption is
> that the area above skb->mac_header is uninit kernel memory and should
> not be readable by unprivileged users anyway. So I didn't clear the area
> at this point.

This is also my understanding. But Alexei explicitly asked me to handle
this xdp_frame info case.  I assume that more work is required for the
transition from CAP_SYS_ADMIN to CAP_NET_ADMIN, we just don't want to
add more/new code that makes this more difficult.


> >>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
> >>> should be "owned" by XDP and have the proper refcnt to deliver
> >>> it directly to the network stack.
> >>> 
> >>> Third, if we choose that we want a fallback, in-case XDP is not 
> >>> enabled on egress dev (but it have an ndo_xdp_xmit), then it 
> >>> should be placed in the generic/core code.  E.g. 
> >>> __bpf_tx_xdp_map() could look at the return code from 
> >>> dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint, this would 
> >>> make it easy to implement TX bulking towards the dev).  
> >> 
> >> Right, this is a much cleaner way.
> >> 
> >> Although I feel like we should add this fallback for veth because 
> >> it requires something which is different from other drivers 
> >> (enabling XDP on the peer device of the egress device),  
> > 
> > (This is why I Cc'ed Tariq...)
> > 
> > This is actually a general problem with the xdp "xmit" side (and not
> >  specific to veth driver). The problem exists for other drivers as 
> > well.
> > 
> > The problem is that a driver can implement ndo_xdp_xmit(), but the 
> > driver might not have allocated the necessary XDP TX-queue resources
> >  yet (or it might not be possible due to system resource limits).
> > 
> > The current "hack" is to load an XDP prog on the egress device, and 
> > then assume that this cause the driver to also allocate the XDP 
> > ndo_xdo_xmit side HW resources.  This is IMHO a wrong assumption.
>  >
> > We need a more generic way to test if a net_device is "ready/enabled"
> > for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> > on how to implement this...  
> 
> My assumption on REDIRECT requirement came from this.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

Yes, I wrote that.

> I guess you are saying thing are changing, and having an XDP program 
> attached on the egress device is no longer generally sufficient. Looking 
> forward to Tariq's solution.

Yes, (hopefully) things are changing. Loading a dummy XDP program to
enable ndo_xdp_xmit, turned out to be a bad model, for all the reasons
I listed.

I hope Tariq find some time to work on this ... ;-)


> Toshiaki Makita
> 
> > 
> > My opinion is that it is a waste of (HW/mem) resources to always 
> > allocate resources for ndo_xdp_xmit when loading an XDP program. 
> > Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP 
> > redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g. 
> > today using ixgbe on machines with more than 96 CPUs, will fail due 
> > to limited TX queue resources. Thus, blocking the mentioned 
> > use-cases.
> > 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: RTL8723BE performance regression
From: Pkshih @ 2018-05-02  5:58 UTC (permalink / raw)
  To: jprvita@gmail.com, Larry.Finger@lwfinger.net
  Cc: linux-kernel@vger.kernel.org, jprvita@endlessm.com, Birming Chiu,
	drake@endlessm.com, Chaoming_Li, kvalo@codeaurora.org,
	莊彥宣, derosier@gmail.com, Steven Ting,
	netdev@vger.kernel.org, linux@endlessm.com, Shaofu,
	linux-wireless@vger.kernel.org
In-Reply-To: <5B2DA6FDDF928F4E855344EE0A5C39D13BEC14C0@RTITMBSV07.realtek.com.tw>

On Wed, 2018-05-02 at 05:44 +0000, Pkshih wrote:
> 
> > -----Original Message-----
> > From: João Paulo Rechi Vita [mailto:jprvita@gmail.com]
> > Sent: Wednesday, May 02, 2018 6:41 AM
> > To: Larry Finger
> > Cc: Steve deRosier; 莊彥宣; Pkshih; Birming Chiu; Shaofu; Steven Ting; Chaoming_Li; Kalle Valo;
> > linux-wireless; Network Development; LKML; Daniel Drake; João Paulo Rechi Vita; linux@endlessm.c
> om
> > Subject: Re: RTL8723BE performance regression
> > 
> > On Tue, Apr 3, 2018 at 7:51 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
> > > On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
> > >>
> > >> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net>
> > >> wrote:
> > >>
> > >> (...)
> > >>
> > >>> As the antenna selection code changes affected your first bisection, do
> > >>> you
> > >>> have one of those HP laptops with only one antenna and the incorrect
> > >>> coding
> > >>> in the FUSE?
> > >>
> > >>
> > >> Yes, that is why I've been passing ant_sel=1 during my tests -- this
> > >> was needed to achieve a good performance in the past, before this
> > >> regression. I've also opened the laptop chassis and confirmed the
> > >> antenna cable is plugged to the connector labeled with "1" on the
> > >> card.
> > >>
> > >>> If so, please make sure that you still have the same signal
> > >>> strength for good and bad cases. I have tried to keep the driver and the
> > >>> btcoex code in sync, but there may be some combinations of antenna
> > >>> configuration and FUSE contents that cause the code to fail.
> > >>>
> > >>
> > >> What is the recommended way to monitor the signal strength?
> > >
> > >
> > > The btcoex code is developed for multiple platforms by a different group
> > > than the Linux driver. I think they made a change that caused ant_sel to
> > > switch from 1 to 2. At least numerous comments at
> > > github.com/lwfinger/rtlwifi_new claimed they needed to make that change.
> > >
> > > Mhy recommended method is to verify the wifi device name with "iw dev". Then
> > > using that device
> > >
> > > sudo iw dev <dev_name> scan | egrep "SSID|signal"
> > >
> > 
> > I have confirmed that the performance regression is indeed tied to
> > signal strength: on the good cases signal was between -16 and -8 dBm,
> > whereas in bad cases signal was always between -50 to - 40 dBm. I've
> > also switched to testing bandwidth in controlled LAN environment using
> > iperf3, as suggested by Steve deRosier, with the DUT being the only
> > machine connected to the 2.4 GHz radio and the machine running the
> > iperf3 server connected via ethernet.
> > 
> 
> We have new experimental results in commit af8a41cccf8f46 ("rtlwifi: cleanup 
> 8723be ant_sel definition"). You can use the above commit and do the same 
> experiments (with ant_sel=0, 1 and 2) in your side, and then share your results.
> Since performance is tied to signal strength, you can only share signal strength.
> 

Please pay attention to cold reboot once ant_sel is changed.


^ permalink raw reply

* Re: linux-next: manual merge of the bpf-next tree with the bpf tree
From: Song Liu @ 2018-05-02  6:05 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Daniel Borkmann, Alexei Starovoitov, Networking,
	Linux-Next Mailing List, Linux Kernel Mailing List, Yonghong Song
In-Reply-To: <20180502155014.4f5acdcb@canb.auug.org.au>



> On May 1, 2018, at 10:50 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> Hi Song,
> 
> On Wed, 2 May 2018 04:40:20 +0000 Song Liu <songliubraving@fb.com> wrote:
>> 
>>> - 	CHECK(build_id_matches < 1, "build id match",
>>> - 	      "Didn't find expected build ID from the map\n");
>>> + 	if (CHECK(build_id_matches < 1, "build id match",
>>> -		  "Didn't find expected build ID from the map"))
>>> ++		  "Didn't find expected build ID from the map\n"))  
>> 
>> ^^^  Is there a "+" at the beginning of this line? 
> 
> No, this is a merge resolution diff, so the ++ means that the line did
> not appear in either parent commit.

Hi Stephen,

Thanks for the explanation! 

Song

^ permalink raw reply

* [PATCH net-next] cxgb4: add new T5 device id's
From: Ganesh Goudar @ 2018-05-02  6:17 UTC (permalink / raw)
  To: netdev, davem; +Cc: nirranjan, indranil, Ganesh Goudar

Add device id's 0x5019, 0x501a and 0x501b for T5
cards.

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
index 51b1803..90b5274 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
@@ -145,6 +145,9 @@ CH_PCI_DEVICE_ID_TABLE_DEFINE_BEGIN
 	CH_PCI_ID_TABLE_FENTRY(0x5016),	/* T580-OCP-SO */
 	CH_PCI_ID_TABLE_FENTRY(0x5017),	/* T520-OCP-SO */
 	CH_PCI_ID_TABLE_FENTRY(0x5018),	/* T540-BT */
+	CH_PCI_ID_TABLE_FENTRY(0x5019),	/* T540-LP-BT */
+	CH_PCI_ID_TABLE_FENTRY(0x501a),	/* T540-SO-BT */
+	CH_PCI_ID_TABLE_FENTRY(0x501b),	/* T540-SO-CR */
 	CH_PCI_ID_TABLE_FENTRY(0x5080),	/* Custom T540-cr */
 	CH_PCI_ID_TABLE_FENTRY(0x5081),	/* Custom T540-LL-cr */
 	CH_PCI_ID_TABLE_FENTRY(0x5082),	/* Custom T504-cr */
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH] vhost: make msg padding explicit
From: Kevin Easton @ 2018-05-02  6:28 UTC (permalink / raw)
  To: David Miller; +Cc: mst, linux-kernel, jasowang, kvm, virtualization, netdev
In-Reply-To: <20180501.140551.1944561534446599966.davem@davemloft.net>

On Tue, May 01, 2018 at 02:05:51PM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Tue, 1 May 2018 20:19:19 +0300
> 
> > On Tue, May 01, 2018 at 11:28:22AM -0400, David Miller wrote:
> >> From: "Michael S. Tsirkin" <mst@redhat.com>
> >> Date: Fri, 27 Apr 2018 19:02:05 +0300
> >> 
> >> > There's a 32 bit hole just after type. It's best to
> >> > give it a name, this way compiler is forced to initialize
> >> > it with rest of the structure.
> >> > 
> >> > Reported-by: Kevin Easton <kevin@guarana.org>
> >> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> 
> >> Michael, will you be sending this directly to Linus or would you like
> >> me to apply it to net or net-next?
> >> 
> >> Thanks.
> > 
> > I'd prefer you to apply it for net and cc stable if possible.
> 
> Ok, applied, and added to my -stable submission queue.

Hold on, this patch changes the layout for i386 (where there is
no padding at all).  And it's part of UAPI.

    - Kevin

> 

^ permalink raw reply

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Julian Anastasov @ 2018-05-02  6:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team,
	lvs-devel
In-Reply-To: <20180419212324.1542504-1-kafai@fb.com>


	Hello,

On Thu, 19 Apr 2018, Martin KaFai Lau wrote:

> This patch is not a proper fix and mainly serves for discussion purpose.
> It is based on net-next which I have been using to debug the issue.
> 
> The change that works around the issue is in ensure_mtu_is_adequate().
> Other changes are the rippling effect in function arg.
> 
> This bug was uncovered by one of our legacy service that
> are still using ipvs for load balancing.  In that setup,
> the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> before tx it out to eth0.
> 
> The problem is the kernel stack could pass a skb (which was
> originated from a sys_write(tcp_fd)) to the driver with skb->len
> bigger than the device MTU.  In one NIC setup (with gso and tso off)
> that we are using, it upset the NIC/driver and caused the tx queue
> stalled for tens of seconds which is how it got uncovered.
> (On the NIC side, the NIC firmware and driver have been fixed
> to avoid this tx queue stall after seeing this skb).

	In the last week I analyzed the situation and found
that just changes in route.c are able to solve the problems,
at 99% :) I'm posting a separate patch for this.

	Here is what happens, I'm testing with FTP which
starts with connection to port 21 and then with data connection,
from local ftp client to local Virtual IP (forwarded to remote
real server via tunnel).

- TCP creates local route which after commit 839da4d98960
appears to be non-cached, before this commit it is cached.
Route is saved and reused.

- initial traffic for port 21 does not use GSO. But after
every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
to report the reduced MTU. These updates are stored in fnhe_pmtu
but they do not go to any route, even if we try to get fresh
output route. Why? Because the local routes are not cached, so
they can not use the fnhe. This is what my patch for route.c
will fix. With this fix FTP-DATA gets route with reduced PMTU.

- later, there are large GSO packets in the FTP-DATA connection.
Currently, IPVS does not send ICMP error for exceeded PMTU
by GSO packets (this will be fixed, see patch below). Even
if they reach TCP and it refreshes its route, TCP can not get
any actual PMTU values from the routing, so continues to
use the large gso_size without the patch for route.c

	For the changes in IPVS that I show below as RFC:

- I synchronized the PMTU checks in __mtu_check_toobig* with
IPv4/IPv6 forwarding

- I modified your changes, see ipvs_gso_hlen() and how I use it
at start of ensure_mtu_is_adequate(), after skb is made writable,
before the PMTU checks.

	In my tests, all variants work, just with skb_decrease_gso_size
or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure
if this is safe for the different GSO configurations.

	I'm analyzing what can be changed in route.c, so that
dst.ops->update_pmtu changes to take effect for the provided
route. If that is possible, it will allow the PMTU change to
take immediate effect for the local routes.

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 4527921..7a2a0a89 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
 	return dest_dst;
 }
 
-static inline bool
-__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
+/* Check if packet size violates MTU */
+static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu)
 {
+	if (skb->len <= mtu)
+		return false;
+
+	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
+		return false;
+
+	if (IPCB(skb)->frag_max_size) {
+		/* frag_max_size tell us that, this packet have been
+		 * defragmented by netfilter IPv4 conntrack module.
+		 */
+		if (IPCB(skb)->frag_max_size > mtu)
+			return true; /* largest fragment violate MTU */
+	}
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+		return false;
+	return true;
+}
+
+/* Check if packet size violates MTU */
+static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
+{
+	if (skb->len <= mtu)
+		return false;
 	if (IP6CB(skb)->frag_max_size) {
 		/* frag_max_size tell us that, this packet have been
 		 * defragmented by netfilter IPv6 conntrack module.
@@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
 		if (IP6CB(skb)->frag_max_size > mtu)
 			return true; /* largest fragment violate MTU */
 	}
-	else if (skb->len > mtu && !skb_is_gso(skb)) {
-		return true; /* Packet size violate MTU size */
-	}
-	return false;
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+		return false;
+	return true;
 }
 
 /* Get route to daddr, update *saddr, optionally bind route to saddr */
@@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
 		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
 }
 
+/* GSO: length of network and transport headers, 0=unsupported */
+static unsigned short ipvs_gso_hlen(struct sk_buff *skb,
+				    struct ip_vs_iphdr *ipvsh)
+{
+	unsigned short hlen = ipvsh->len - ipvsh->off;
+
+	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) {
+		struct tcphdr _tcph, *th;
+
+		th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph);
+		if (th)
+			return hlen + (th->doff << 2);
+	}
+	return 0;
+}
+ 
 static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 					  int rt_mode,
 					  struct ip_vs_iphdr *ipvsh,
 					  struct sk_buff *skb, int mtu)
 {
+	/* Re-segment traffic from local clients */
+	if (!skb->dev && skb_is_gso(skb)) {
+		unsigned short hlen = ipvs_gso_hlen(skb, ipvsh);
+
+		if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size &&
+		    mtu > hlen) {
+			u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen);
+
+			IP_VS_DBG(10, "Reducing gso_size=%u with %u\n",
+				skb_shinfo(skb)->gso_size, reduce);
+			skb_decrease_gso_size(skb_shinfo(skb), reduce);
+			skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+			skb_shinfo(skb)->gso_segs = 0;
+		}
+	}
 #ifdef CONFIG_IP_VS_IPV6
 	if (skb_af == AF_INET6) {
 		struct net *net = ipvs->net;
@@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
 			return true;
 
-		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
-			     skb->len > mtu && !skb_is_gso(skb) &&
-			     !ip_vs_iph_icmp(ipvsh))) {
+		if (unlikely(__mtu_check_toobig(skb, mtu))) {
 			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 				  htonl(mtu));
 			IP_VS_DBG(1, "frag needed for %pI4\n",

Regards

^ permalink raw reply related

* [PATCH net] ipv4: fix fnhe usage by non-cached routes
From: Julian Anastasov @ 2018-05-02  6:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Martin KaFai Lau, kernel-team, David Ahern, Xin Long

Allow some non-cached routes to use non-expired fnhe:

1. ip_del_fnhe: moved above and now called by find_exception.
The 4.5+ commit deed49df7390 expires fnhe only when caching
routes. Change that to:

1.1. use fnhe for non-cached local output routes, with the help
from (2)

1.2. allow __mkroute_input to detect expired fnhe (outdated
fnhe_gw, for example) when do_cache is false, eg. when itag!=0
for unicast destinations.

2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
to use fnhe info even when the new route will not be cached into fnhe.
After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib
result for local traffic") it means all local routes will be affected
because they are not cached. This change is used to solve a PMTU
problem with IPVS (and probably Netfilter DNAT) setups that redirect
local clients from target local IP (local route to Virtual IP)
to new remote IP target, eg. IPVS TUN real server. Loopback has
64K MTU and we need to create fnhe on the local route that will
keep the reduced PMTU for the Virtual IP. Without this change
fnhe_pmtu is updated from ICMP but never exposed to non-cached
local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
with flowi4_oif=any for 4.14+).

3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
new entries

Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic")
Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with oif")
Fixes: deed49df7390 ("route: check and remove route cache when we get route")
Cc: David Ahern <dsahern@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/route.c | 118 +++++++++++++++++++++++++------------------------------
 1 file changed, 53 insertions(+), 65 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ccb25d8..1412a7b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -709,7 +709,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 		fnhe->fnhe_gw = gw;
 		fnhe->fnhe_pmtu = pmtu;
 		fnhe->fnhe_mtu_locked = lock;
-		fnhe->fnhe_expires = expires;
+		fnhe->fnhe_expires = max(1UL, expires);
 
 		/* Exception created; mark the cached routes for the nexthop
 		 * stale, so anyone caching it rechecks if this exception
@@ -1297,6 +1297,36 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 	return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
 
+static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
+{
+	struct fnhe_hash_bucket *hash;
+	struct fib_nh_exception *fnhe, __rcu **fnhe_p;
+	u32 hval = fnhe_hashfun(daddr);
+
+	spin_lock_bh(&fnhe_lock);
+
+	hash = rcu_dereference_protected(nh->nh_exceptions,
+					 lockdep_is_held(&fnhe_lock));
+	hash += hval;
+
+	fnhe_p = &hash->chain;
+	fnhe = rcu_dereference_protected(*fnhe_p, lockdep_is_held(&fnhe_lock));
+	while (fnhe) {
+		if (fnhe->fnhe_daddr == daddr) {
+			rcu_assign_pointer(*fnhe_p, rcu_dereference_protected(
+				fnhe->fnhe_next, lockdep_is_held(&fnhe_lock)));
+			fnhe_flush_routes(fnhe);
+			kfree_rcu(fnhe, rcu);
+			break;
+		}
+		fnhe_p = &fnhe->fnhe_next;
+		fnhe = rcu_dereference_protected(fnhe->fnhe_next,
+						 lockdep_is_held(&fnhe_lock));
+	}
+
+	spin_unlock_bh(&fnhe_lock);
+}
+
 static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr)
 {
 	struct fnhe_hash_bucket *hash = rcu_dereference(nh->nh_exceptions);
@@ -1310,8 +1340,14 @@ static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr)
 
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
-		if (fnhe->fnhe_daddr == daddr)
+		if (fnhe->fnhe_daddr == daddr) {
+			if (fnhe->fnhe_expires &&
+			    time_after(jiffies, fnhe->fnhe_expires)) {
+				ip_del_fnhe(nh, daddr);
+				break;
+			}
 			return fnhe;
+		}
 	}
 	return NULL;
 }
@@ -1636,36 +1672,6 @@ static void ip_handle_martian_source(struct net_device *dev,
 #endif
 }
 
-static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
-{
-	struct fnhe_hash_bucket *hash;
-	struct fib_nh_exception *fnhe, __rcu **fnhe_p;
-	u32 hval = fnhe_hashfun(daddr);
-
-	spin_lock_bh(&fnhe_lock);
-
-	hash = rcu_dereference_protected(nh->nh_exceptions,
-					 lockdep_is_held(&fnhe_lock));
-	hash += hval;
-
-	fnhe_p = &hash->chain;
-	fnhe = rcu_dereference_protected(*fnhe_p, lockdep_is_held(&fnhe_lock));
-	while (fnhe) {
-		if (fnhe->fnhe_daddr == daddr) {
-			rcu_assign_pointer(*fnhe_p, rcu_dereference_protected(
-				fnhe->fnhe_next, lockdep_is_held(&fnhe_lock)));
-			fnhe_flush_routes(fnhe);
-			kfree_rcu(fnhe, rcu);
-			break;
-		}
-		fnhe_p = &fnhe->fnhe_next;
-		fnhe = rcu_dereference_protected(fnhe->fnhe_next,
-						 lockdep_is_held(&fnhe_lock));
-	}
-
-	spin_unlock_bh(&fnhe_lock);
-}
-
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
 			   const struct fib_result *res,
@@ -1719,20 +1725,10 @@ static int __mkroute_input(struct sk_buff *skb,
 
 	fnhe = find_exception(&FIB_RES_NH(*res), daddr);
 	if (do_cache) {
-		if (fnhe) {
+		if (fnhe)
 			rth = rcu_dereference(fnhe->fnhe_rth_input);
-			if (rth && rth->dst.expires &&
-			    time_after(jiffies, rth->dst.expires)) {
-				ip_del_fnhe(&FIB_RES_NH(*res), daddr);
-				fnhe = NULL;
-			} else {
-				goto rt_cache;
-			}
-		}
-
-		rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
-
-rt_cache:
+		else
+			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
 		if (rt_cache_valid(rth)) {
 			skb_dst_set_noref(skb, &rth->dst);
 			goto out;
@@ -2216,39 +2212,31 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		 * the loopback interface and the IP_PKTINFO ipi_ifindex will
 		 * be set to the loopback interface as well.
 		 */
-		fi = NULL;
+		do_cache = false;
 	}
 
 	fnhe = NULL;
 	do_cache &= fi != NULL;
-	if (do_cache) {
+	if (fi) {
 		struct rtable __rcu **prth;
 		struct fib_nh *nh = &FIB_RES_NH(*res);
 
 		fnhe = find_exception(nh, fl4->daddr);
+		if (!do_cache)
+			goto add;
 		if (fnhe) {
 			prth = &fnhe->fnhe_rth_output;
-			rth = rcu_dereference(*prth);
-			if (rth && rth->dst.expires &&
-			    time_after(jiffies, rth->dst.expires)) {
-				ip_del_fnhe(nh, fl4->daddr);
-				fnhe = NULL;
-			} else {
-				goto rt_cache;
+		} else {
+			if (unlikely(fl4->flowi4_flags &
+				     FLOWI_FLAG_KNOWN_NH &&
+				     !(nh->nh_gw &&
+				       nh->nh_scope == RT_SCOPE_LINK))) {
+				do_cache = false;
+				goto add;
 			}
+			prth = raw_cpu_ptr(nh->nh_pcpu_rth_output);
 		}
-
-		if (unlikely(fl4->flowi4_flags &
-			     FLOWI_FLAG_KNOWN_NH &&
-			     !(nh->nh_gw &&
-			       nh->nh_scope == RT_SCOPE_LINK))) {
-			do_cache = false;
-			goto add;
-		}
-		prth = raw_cpu_ptr(nh->nh_pcpu_rth_output);
 		rth = rcu_dereference(*prth);
-
-rt_cache:
 		if (rt_cache_valid(rth) && dst_hold_safe(&rth->dst))
 			return rth;
 	}
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net-next] udp: disable gso with no_check_tx
From: Michal Kubecek @ 2018-05-02  7:05 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn, Willem de Bruijn
In-Reply-To: <20180430195836.69378-1-willemdebruijn.kernel@gmail.com>

On Mon, Apr 30, 2018 at 03:58:36PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Syzbot managed to send a udp gso packet without checksum offload into
> the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
> triggered the skb_warn_bad_offload.
> 
>   RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
>    skb_gso_segment include/linux/netdevice.h:4038 [inline]
>    validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
>    __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
>    dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
> 
> UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
> udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
> catch and fail in this case.

Sounds rather familiar... perhaps we might want to check other
exceptions added to UFO over the years, some might apply here as well.

Michal Kubecek

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox