Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [bpf-next PATCH 1/4] bpf: devmap introduce dev_map_enqueue
From: kbuild test robot @ 2018-05-10  9:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: kbuild-all, netdev, Daniel Borkmann, Alexei Starovoitov,
	Jesper Dangaard Brouer, Christoph Hellwig, BjörnTöpel,
	Magnus Karlsson
In-Reply-To: <152587157974.20423.10791157575158535841.stgit@firesoul>

[-- Attachment #1: Type: text/plain, Size: 2637 bytes --]

Hi Jesper,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Jesper-Dangaard-Brouer/xdp-introduce-bulking-for-ndo_xdp_xmit-API/20180510-134105
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-acpi-redef (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from net/core/filter.c:52:0:
   include/linux/bpf.h:577:28: warning: 'struct bpf_dtab_netdev' declared inside parameter list will not be visible outside of this definition or declaration
    int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
                               ^~~~~~~~~~~~~~~
   net/core/filter.c: In function '__bpf_tx_xdp_map':
>> net/core/filter.c:3025:25: error: passing argument 1 of 'dev_map_enqueue' from incompatible pointer type [-Werror=incompatible-pointer-types]
      err = dev_map_enqueue(dst, xdp);
                            ^~~
   In file included from net/core/filter.c:52:0:
   include/linux/bpf.h:577:5: note: expected 'struct bpf_dtab_netdev *' but argument is of type 'struct bpf_dtab_netdev *'
    int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp)
        ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/dev_map_enqueue +3025 net/core/filter.c

  3013	
  3014	static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
  3015				    struct bpf_map *map,
  3016				    struct xdp_buff *xdp,
  3017				    u32 index)
  3018	{
  3019		int err;
  3020	
  3021		switch (map->map_type) {
  3022		case BPF_MAP_TYPE_DEVMAP: {
  3023			struct bpf_dtab_netdev *dst = fwd;
  3024	
> 3025			err = dev_map_enqueue(dst, xdp);
  3026			if (err)
  3027				return err;
  3028			__dev_map_insert_ctx(map, index);
  3029			break;
  3030		}
  3031		case BPF_MAP_TYPE_CPUMAP: {
  3032			struct bpf_cpu_map_entry *rcpu = fwd;
  3033	
  3034			err = cpu_map_enqueue(rcpu, xdp, dev_rx);
  3035			if (err)
  3036				return err;
  3037			__cpu_map_insert_ctx(map, index);
  3038			break;
  3039		}
  3040		case BPF_MAP_TYPE_XSKMAP: {
  3041			struct xdp_sock *xs = fwd;
  3042	
  3043			err = __xsk_map_redirect(map, xdp, xs);
  3044			return err;
  3045		}
  3046		default:
  3047			break;
  3048		}
  3049		return 0;
  3050	}
  3051	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30404 bytes --]

^ permalink raw reply

* Re: [bpf-next v3 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: Toke Høiland-Jørgensen @ 2018-05-10  9:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David Ahern
  Cc: netdev, borkmann, ast, davem, shm, roopa, john.fastabend, brouer
In-Reply-To: <20180510093158.08a7ed4b@redhat.com>

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Wed,  9 May 2018 20:34:26 -0700
> David Ahern <dsahern@gmail.com> wrote:
>
>> Provide a helper for doing a FIB and neighbor lookup in the kernel
>> tables from an XDP program. The helper provides a fastpath for forwarding
>> packets. If the packet is a local delivery or for any reason is not a
>> simple lookup and forward, the packet continues up the stack.
>> 
>> If it is to be forwarded, the forwarding can be done directly if the
>> neighbor is already known. If the neighbor does not exist, the first
>> few packets go up the stack for neighbor resolution. Once resolved, the
>> xdp program provides the fast path.
>> 
>> On successful lookup the nexthop dmac, current device smac and egress
>> device index are returned.
>> 
>> The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
>> are implemented in this patch. The API includes layer 4 parameters if
>> the XDP program chooses to do deep packet inspection to allow compare
>> against ACLs implemented as FIB rules.
>> 
>> Header rewrite is left to the XDP program.
>> 
>> The lookup takes 2 flags:
>> - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
>>   straight to the table associated with the device (expert setting for
>>   those looking to maximize throughput)
>> 
>> - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
>>   Default is an ingress lookup.
>> 
>> Initial performance numbers collected by Jesper, forwarded packets/sec:
>> 
>>        Full stack    XDP FIB lookup    XDP Direct lookup
>> IPv4   1,947,969       7,074,156          7,415,333
>> IPv6   1,728,000       6,165,504          7,262,720
>> 
>
> The "Full stack" tests were with netfilter modules unloaded.  Default
> setting with netfilter conntrack loaded and default Fedora firewall
> rules, show around 700Kpps.
>
>> These number are single CPU core forwarding on a Broadwell
>> E5-1650 v4 @ 3.60GHz.
>> 
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> This helper is awesome, as it really shows how XDP is meant to work in
> concert and cooperate with the existing network stack.

+1!

-Toke

^ permalink raw reply

* Re: [bpf-next PATCH 4/4] xdp: change ndo_xdp_xmit API to support bulking
From: kbuild test robot @ 2018-05-10  9:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: kbuild-all, netdev, Daniel Borkmann, Alexei Starovoitov,
	Jesper Dangaard Brouer, Christoph Hellwig, BjörnTöpel,
	Magnus Karlsson
In-Reply-To: <152587159495.20423.14022994969026458789.stgit@firesoul>

[-- Attachment #1: Type: text/plain, Size: 977 bytes --]

Hi Jesper,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Jesper-Dangaard-Brouer/xdp-introduce-bulking-for-ndo_xdp_xmit-API/20180510-134105
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   net/core/xdp.o: In function `__xdp_return':
>> xdp.c:(.text+0x356): undefined reference to `__page_pool_put_page'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11635 bytes --]

^ permalink raw reply

* Re: [RFC v3 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-05-10  8:56 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, virtualization, linux-kernel, netdev, wexu
In-Reply-To: <5885acac-e9e3-3abf-b6a2-7347f4d55be2@redhat.com>

On Thu, May 10, 2018 at 03:34:50PM +0800, Jason Wang wrote:
> On 2018年05月10日 15:32, Jason Wang wrote:
> > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > +    /* We're using some buffers from the free list. */
> > > +    vq->vq.num_free -= descs_used;
> > > +
> > > +    /* Update free pointer */
> > > +    if (indirect) {
> > > +        n = head + 1;
> > > +        if (n >= vq->vring_packed.num) {
> > > +            n = 0;
> > > +            vq->wrap_counter ^= 1;
> > > +        }
> > > +        vq->next_avail_idx = n;
> > > +    } else
> > > +        vq->next_avail_idx = i;
> > 
> > During testing zerocopy (out of order completion), I found driver may
> > submit two identical buffer id to vhost. So the above code may not work
> > well.
> > 
> > Consider the case that driver adds 3 buffer and virtqueue size is 8.
> > 
> > a) id = 0,count = 2,next_avail = 2
> > 
> > b) id = 2,count = 4,next_avail = 2
> 
> next_avail should be 6 here.
> 
> > 
> > c) id = 4,count = 2,next_avail = 0
> > 
> 
> id should be 6 here.
> 
> Thanks
> 
> > if packet b is done before packet a, driver may think buffer id 0 is
> > available and try to use it if even if the real buffer 0 was not done.
> > 
> > Thanks

Nice catch! Thanks a lot!
I'll implement an ID allocator.

Best regards,
Tiwei Bie

^ permalink raw reply

* Re: STMMAC driver with TSO enabled issue
From: Jose Abreu @ 2018-05-10  8:55 UTC (permalink / raw)
  To: Bhadram Varka, Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <89c0a735-9e34-89c6-7692-579e48dadaa6@nvidia.com>

++net-dev

Hi Bhadram,

On 09-05-2018 12:03, Bhadram Varka wrote:
> Hi,
>
> Thanks for responding.
>
> Tried below suggested way. Still observing the issue -

It seems stmmac has a bug in the RX side when using TSO which is
causing all the RX descriptors to be consumed. The stmmac_rx()
function will need to be refactored. I will send a fix ASAP.

Thanks and Best Regards,
Jose Miguel Abreu

>
> [root@alarm ~]# iperf3 -c 10.19.65.141
> Connecting to host 10.19.65.141, port 5201
> [  5] local 10.19.65.210 port 57630 connected to 10.19.65.141
> port 5201
> [   65.408268] stmmac_tso_xmit(): line = 2842
> [   65.412362] stmmac_tso_xmit: tcphdrlen 32, hdr_len 66,
> pay_len 0, mss 1448
> [   65.419224]  skb->len 8754, skb->data_len 8688
> [   65.423672] stmmac_tso_xmit: curr=20 dirty=17 f=18, e=20,
> f_p=00000000178e52e1, nfrags 1
> [   65.431747] TX descriptor ring:
> [   65.434881] 000 [0x82005000]: 0x0 0x0 0x0 0x0
> [   65.439230] 001 [0x82005010]: 0x0 0x0 0x0 0x0
> [   65.443578] 002 [0x82005020]: 0x0 0x0 0x0 0x0
> [   65.447927] 003 [0x82005030]: 0x0 0x0 0x0 0x0
> [   65.452275] 004 [0x82005040]: 0x0 0x0 0x0 0x0
> [   65.456622] 005 [0x82005050]: 0x0 0x0 0x0 0x0
> [   65.460970] 006 [0x82005060]: 0x0 0x0 0x0 0x0
> [   65.465316] 007 [0x82005070]: 0x0 0x0 0x0 0x0
> [   65.469664] 008 [0x82005080]: 0x0 0x0 0x0 0x0
> [   65.474010] 009 [0x82005090]: 0x0 0x0 0x0 0x0
> [   65.478357] 010 [0x820050a0]: 0x0 0x0 0x0 0x0
> [   65.482706] 011 [0x820050b0]: 0x0 0x0 0x0 0x0
> [   65.487053] 012 [0x820050c0]: 0x0 0x0 0x0 0x0
> [   65.491400] 013 [0x820050d0]: 0x0 0x0 0x0 0x0
> [   65.495746] 014 [0x820050e0]: 0x0 0x0 0x0 0x0
> [   65.500092] 015 [0x820050f0]: 0x0 0x0 0x0 0x0
> [   65.504438] 016 [0x82005100]: 0x0 0x0 0x0 0x0
> [   65.508784] 017 [0x82005110]: 0x0 0x0 0x5a8 0xc4000000
> [   65.513910] 018 [0x82005120]: 0xfb297000 0x0 0x42 0xa04421f0
> [   65.519557] 019 [0x82005130]: 0xfb298000 0x0 0x21f0 0x90000000
> [   65.525376] 020 [0x82005140]: 0x0 0x0 0x0 0x0
> [   65.529722] 021 [0x82005150]: 0x0 0x0 0x0 0x0
> [   65.534069] 022 [0x82005160]: 0x0 0x0 0x0 0x0
> [   65.538414] 023 [0x82005170]: 0x0 0x0 0x0 0x0
> [   65.542761] 024 [0x82005180]: 0x0 0x0 0x0 0x0
> [   65.547107] 025 [0x82005190]: 0x0 0x0 0x0 0x0
> [   65.551454] 026 [0x820051a0]: 0x0 0x0 0x0 0x0
> [   65.555802] 027 [0x820051b0]: 0x0 0x0 0x0 0x0
> [   65.560147] 028 [0x820051c0]: 0x0 0x0 0x0 0x0
> [   65.564493] 029 [0x820051d0]: 0x0 0x0 0x0 0x0
> [   65.568840] 030 [0x820051e0]: 0x0 0x0 0x0 0x0
> [   65.573187] 031 [0x820051f0]: 0x0 0x0 0x0 0x0
> [   65.577533] 032 [0x82005200]: 0x0 0x0 0x0 0x0
> [   65.581879] 033 [0x82005210]: 0x0 0x0 0x0 0x0
> [   65.586225] 034 [0x82005220]: 0x0 0x0 0x0 0x0
> [   65.590571] 035 [0x82005230]: 0x0 0x0 0x0 0x0
> [   65.594917] 036 [0x82005240]: 0x0 0x0 0x0 0x0
> [   65.599262] 037 [0x82005250]: 0x0 0x0 0x0 0x0
> [   65.603607] 038 [0x82005260]: 0x0 0x0 0x0 0x0
> [   65.607952] 039 [0x82005270]: 0x0 0x0 0x0 0x0
> [   65.612297] 040 [0x82005280]: 0x0 0x0 0x0 0x0
> [   65.616643] 041 [0x82005290]: 0x0 0x0 0x0 0x0
> [   65.620989] 042 [0x820052a0]: 0x0 0x0 0x0 0x0
> [   65.625336] 043 [0x820052b0]: 0x0 0x0 0x0 0x0
> [   65.629681] 044 [0x820052c0]: 0x0 0x0 0x0 0x0
> [   65.634027] 045 [0x820052d0]: 0x0 0x0 0x0 0x0
> [   65.638372] 046 [0x820052e0]: 0x0 0x0 0x0 0x0
> [   65.642718] 047 [0x820052f0]: 0x0 0x0 0x0 0x0
> [   65.647063] 048 [0x82005300]: 0x0 0x0 0x0 0x0
> [   65.651408] 049 [0x82005310]: 0x0 0x0 0x0 0x0
> [   65.655754] 050 [0x82005320]: 0x0 0x0 0x0 0x0
> [   65.660099] 051 [0x82005330]: 0x0 0x0 0x0 0x0
> [   65.664444] 052 [0x82005340]: 0x0 0x0 0x0 0x0
> [   65.668790] 053 [0x82005350]: 0x0 0x0 0x0 0x0
> [   65.673134] 054 [0x82005360]: 0x0 0x0 0x0 0x0
> [   65.677480] 055 [0x82005370]: 0x0 0x0 0x0 0x0
> [   65.681825] 056 [0x82005380]: 0x0 0x0 0x0 0x0
> [   65.686170] 057 [0x82005390]: 0x0 0x0 0x0 0x0
> [   65.690515] 058 [0x820053a0]: 0x0 0x0 0x0 0x0
> [   65.694861] 059 [0x820053b0]: 0x0 0x0 0x0 0x0
> [   65.699206] 060 [0x820053c0]: 0x0 0x0 0x0 0x0
> [   65.703552] 061 [0x820053d0]: 0x0 0x0 0x0 0x0
> [   65.707898] 062 [0x820053e0]: 0x0 0x0 0x0 0x0
> [   65.712243] 063 [0x820053f0]: 0x0 0x0 0x0 0x0
> [   65.716706] stmmac_tso_xmit(): line = 2842
> [   65.720802] stmmac_tso_xmit: tcphdrlen 32, hdr_len 66,
> pay_len 0, mss 1448
> [   65.727669]  skb->len 4410, skb->data_len 4344
> [   65.732114] stmmac_tso_xmit: curr=22 dirty=19 f=20, e=22,
> f_p=00000000b1247b41, nfrags 1
> [   65.740190] TX descriptor ring:
> [   65.743327] 000 [0x82005000]: 0x0 0x0 0x0 0x0
> [   65.747678] 001 [0x82005010]: 0x0 0x0 0x0 0x0
> [   65.752029] 002 [0x82005020]: 0x0 0x0 0x0 0x0
> [   65.756378] 003 [0x82005030]: 0x0 0x0 0x0 0x0
> [   65.760727] 004 [0x82005040]: 0x0 0x0 0x0 0x0
> [   65.765077] 005 [0x82005050]: 0x0 0x0 0x0 0x0
> [   65.769427] 006 [0x82005060]: 0x0 0x0 0x0 0x0
> [   65.773776] 007 [0x82005070]: 0x0 0x0 0x0 0x0
> [   65.778126] 008 [0x82005080]: 0x0 0x0 0x0 0x0
> [   65.782476] 009 [0x82005090]: 0x0 0x0 0x0 0x0
> [   65.786826] 010 [0x820050a0]: 0x0 0x0 0x0 0x0
> [   65.791176] 011 [0x820050b0]: 0x0 0x0 0x0 0x0
> [   65.795526] 012 [0x820050c0]: 0x0 0x0 0x0 0x0
> [   65.799875] 013 [0x820050d0]: 0x0 0x0 0x0 0x0
> [   65.804225] 014 [0x820050e0]: 0x0 0x0 0x0 0x0
> [   65.808575] 015 [0x820050f0]: 0x0 0x0 0x0 0x0
> [   65.812925] 016 [0x82005100]: 0x0 0x0 0x0 0x0
> [   65.817274] 017 [0x82005110]: 0x0 0x0 0x0 0x0
> [   65.821625] 018 [0x82005120]: 0x0 0x0 0x0 0x0
> [   65.825976] 019 [0x82005130]: 0xfb298000 0x0 0x21f0 0x90000000
> [   65.831800] 020 [0x82005140]: 0xfb2a1000 0x0 0x42 0xa04410f8
> [   65.837450] 021 [0x82005150]: 0xfb2a2000 0x0 0x10f8 0x90000000
> [   65.843273] 022 [0x82005160]: 0x0 0x0 0x0 0x0
> [   65.847622] 023 [0x82005170]: 0x0 0x0 0x0 0x0
> [   65.851971] 024 [0x82005180]: 0x0 0x0 0x0 0x0
> [   65.856319] 025 [0x82005190]: 0x0 0x0 0x0 0x0
> [   65.860670] 026 [0x820051a0]: 0x0 0x0 0x0 0x0
> [   65.865020] 027 [0x820051b0]: 0x0 0x0 0x0 0x0
> [   65.869369] 028 [0x820051c0]: 0x0 0x0 0x0 0x0
> [   65.873719] 029 [0x820051d0]: 0x0 0x0 0x0 0x0
> [   65.878068] 030 [0x820051e0]: 0x0 0x0 0x0 0x0
> [   65.882418] 031 [0x820051f0]: 0x0 0x0 0x0 0x0
> [   65.886767] 032 [0x82005200]: 0x0 0x0 0x0 0x0
> [   65.891118] 033 [0x82005210]: 0x0 0x0 0x0 0x0
> [   65.895467] 034 [0x82005220]: 0x0 0x0 0x0 0x0
> [   65.899816] 035 [0x82005230]: 0x0 0x0 0x0 0x0
> [   65.904165] 036 [0x82005240]: 0x0 0x0 0x0 0x0
> [   65.908515] 037 [0x82005250]: 0x0 0x0 0x0 0x0
> [   65.912865] 038 [0x82005260]: 0x0 0x0 0x0 0x0
> [   65.917215] 039 [0x82005270]: 0x0 0x0 0x0 0x0
> [   65.921564] 040 [0x82005280]: 0x0 0x0 0x0 0x0
> [   65.925915] 041 [0x82005290]: 0x0 0x0 0x0 0x0
> [   65.930264] 042 [0x820052a0]: 0x0 0x0 0x0 0x0
> [   65.934615] 043 [0x820052b0]: 0x0 0x0 0x0 0x0
> [   65.938964] 044 [0x820052c0]: 0x0 0x0 0x0 0x0
> [   65.943313] 045 [0x820052d0]: 0x0 0x0 0x0 0x0
> [   65.947664] 046 [0x820052e0]: 0x0 0x0 0x0 0x0
> [   65.952012] 047 [0x820052f0]: 0x0 0x0 0x0 0x0
> [   65.956363] 048 [0x82005300]: 0x0 0x0 0x0 0x0
> [   65.960712] 049 [0x82005310]: 0x0 0x0 0x0 0x0
> [   65.965061] 050 [0x82005320]: 0x0 0x0 0x0 0x0
> [   65.969410] 051 [0x82005330]: 0x0 0x0 0x0 0x0
> [   65.973760] 052 [0x82005340]: 0x0 0x0 0x0 0x0
> [   65.978110] 053 [0x82005350]: 0x0 0x0 0x0 0x0
> [   65.982460] 054 [0x82005360]: 0x0 0x0 0x0 0x0
> [   65.986812] 055 [0x82005370]: 0x0 0x0 0x0 0x0
> [   65.991161] 056 [0x82005380]: 0x0 0x0 0x0 0x0
> [   65.995510] 057 [0x82005390]: 0x0 0x0 0x0 0x0
> [   65.999860] 058 [0x820053a0]: 0x0 0x0 0x0 0x0
> [   66.004210] 059 [0x820053b0]: 0x0 0x0 0x0 0x0
> [   66.008559] 060 [0x820053c0]: 0x0 0x0 0x0 0x0
> [   66.012908] 061 [0x820053d0]: 0x0 0x0 0x0 0x0
> [   66.017257] 062 [0x820053e0]: 0x0 0x0 0x0 0x0
> [   66.021607] 063 [0x820053f0]: 0x0 0x0 0x0 0x0
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   184 KBytes  1.50 Mbits/sec    0  
> 1.41 KBytes
> [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
> [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
> [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
> [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41
> KBytes
>
>
> On 5/9/2018 3:35 PM, Jose Abreu wrote:
>> Hi Bhadram,
>>
>> On 09-05-2018 08:18, Bhadram Varka wrote:
>>>
>>> +               queue0 {
>>> +                       snps,weight = <0x10>;
>>
>>> +               queue1 {
>>> +                       snps,weight = <0x10>;
>>>
>>
>>> +               queue2 {
>>> +                       snps,weight = <0x10>;
>>>
>>
>>> +               queue3 {
>>> +                       snps,weight = <0x10>;
>>>
>>
>> This is wrong. You can't use the same weight for all queues.
>> Please try with different weights (for example: 0x10, 0x11, 0x12,
>> 0x13).
>>
>> Thanks and Best Regards,
>> Jose Miguel Abreu
>>
>>
>

^ permalink raw reply

* Significant capacity drop on loopback interface
From: Naruto Nguyen @ 2018-05-10  8:35 UTC (permalink / raw)
  To: netdev

Hello everyone,

Recently, I used netperf to test the TCP performance on loopback
interface on my 2 nodes, one is installed kernel 4.4.103 and the other
is 3.12.61

netperf -l 100 -t TCP_RR
netperf -l 100 -t TCP_RR -- -D

In both cases, I see that the throughput on 4.4.103 is about just 1/2
in comparing with 3.12.61 node

# netperf -l 100 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
AF_INET to localhost () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       100.00   37714.68
16384  87380

netperf -l 100 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
AF_INET to localhost () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       100.00   64038.41
16384  87380

When running tcpdump to capture all packets in loopback interface, I
see that during 200s capture, the number of packets on loopback of
4.4.103 is double the number of packets in 3.12.61? Could you please
let me know if it can cause the low throughput as above? Do we have
any tuning for TCP on loopback to improve the performace (actually the
low throughput also happens with UDP) or if we have any known
performance issue in 4.4 kernel on loopback?

Thanks a lot,
Brs,
Naruto

^ permalink raw reply

* [PATCH net] net: Correct wrong skb_flow_limit check when enable RPS
From: gfree.wind @ 2018-05-10  8:28 UTC (permalink / raw)
  To: davem, daniel, jakub.kicinski, dsahern, netdev; +Cc: Gao Feng

From: Gao Feng <gfree.wind@vip.163.com>

The skb flow limit is implemented for each CPU independently. In the
current codes, the function skb_flow_limit gets the softnet_data by
this_cpu_ptr. But the target cpu of enqueue_to_backlog would be not
the current cpu when enable RPS. As the result, the skb_flow_limit checks
the stats of current CPU, while the skb is going to append the queue of
another CPU. It isn't the expected behavior.

Now pass the softnet_data as a param to softnet_data to make consistent.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
---
 net/core/dev.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b..0f98eff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3883,18 +3883,15 @@ static int rps_ipi_queued(struct softnet_data *sd)
 int netdev_flow_limit_table_len __read_mostly = (1 << 12);
 #endif
 
-static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen)
+static bool skb_flow_limit(struct softnet_data *sd, struct sk_buff *skb, unsigned int qlen)
 {
 #ifdef CONFIG_NET_FLOW_LIMIT
 	struct sd_flow_limit *fl;
-	struct softnet_data *sd;
 	unsigned int old_flow, new_flow;
 
 	if (qlen < (netdev_max_backlog >> 1))
 		return false;
 
-	sd = this_cpu_ptr(&softnet_data);
-
 	rcu_read_lock();
 	fl = rcu_dereference(sd->flow_limit);
 	if (fl) {
@@ -3938,7 +3935,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	if (!netif_running(skb->dev))
 		goto drop;
 	qlen = skb_queue_len(&sd->input_pkt_queue);
-	if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
+	if (qlen <= netdev_max_backlog && !skb_flow_limit(sd, skb, qlen)) {
 		if (qlen) {
 enqueue:
 			__skb_queue_tail(&sd->input_pkt_queue, skb);
-- 
1.9.1

^ permalink raw reply related

* Re: [bpf-next PATCH 4/4] xdp: change ndo_xdp_xmit API to support bulking
From: kbuild test robot @ 2018-05-10  8:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: kbuild-all, netdev, Daniel Borkmann, Alexei Starovoitov,
	Jesper Dangaard Brouer, Christoph Hellwig, BjörnTöpel,
	Magnus Karlsson
In-Reply-To: <152587159495.20423.14022994969026458789.stgit@firesoul>

[-- Attachment #1: Type: text/plain, Size: 1820 bytes --]

Hi Jesper,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Jesper-Dangaard-Brouer/xdp-introduce-bulking-for-ndo_xdp_xmit-API/20180510-134105
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-a1-05100951 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   net/core/xdp.o: In function `__xdp_return':
>> net/core/xdp.c:323: undefined reference to `__page_pool_put_page'

vim +323 net/core/xdp.c

   310	
   311	static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct)
   312	{
   313		struct xdp_mem_allocator *xa;
   314		struct page *page;
   315	
   316		switch (mem->type) {
   317		case MEM_TYPE_PAGE_POOL:
   318			rcu_read_lock();
   319			/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
   320			xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
   321			page = virt_to_head_page(data);
   322			if (xa)
 > 323				__page_pool_put_page(xa->page_pool, page, napi_direct);
   324			else
   325				put_page(page);
   326			rcu_read_unlock();
   327			break;
   328		case MEM_TYPE_PAGE_SHARED:
   329			page_frag_free(data);
   330			break;
   331		case MEM_TYPE_PAGE_ORDER0:
   332			page = virt_to_page(data); /* Assumes order0 page*/
   333			put_page(page);
   334			break;
   335		default:
   336			/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
   337			break;
   338		}
   339	}
   340	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29808 bytes --]

^ permalink raw reply

* Re: [RFC v3 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-10  7:34 UTC (permalink / raw)
  To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <927f4478-5a81-31d4-ac69-f9ec26248591@redhat.com>



On 2018年05月10日 15:32, Jason Wang wrote:
>
>
> On 2018年04月25日 13:15, Tiwei Bie wrote:
>> +    /* We're using some buffers from the free list. */
>> +    vq->vq.num_free -= descs_used;
>> +
>> +    /* Update free pointer */
>> +    if (indirect) {
>> +        n = head + 1;
>> +        if (n >= vq->vring_packed.num) {
>> +            n = 0;
>> +            vq->wrap_counter ^= 1;
>> +        }
>> +        vq->next_avail_idx = n;
>> +    } else
>> +        vq->next_avail_idx = i;
>
> During testing zerocopy (out of order completion), I found driver may 
> submit two identical buffer id to vhost. So the above code may not 
> work well.
>
> Consider the case that driver adds 3 buffer and virtqueue size is 8.
>
> a) id = 0,count = 2,next_avail = 2
>
> b) id = 2,count = 4,next_avail = 2

next_avail should be 6 here.

>
> c) id = 4,count = 2,next_avail = 0
>

id should be 6 here.

Thanks

> if packet b is done before packet a, driver may think buffer id 0 is 
> available and try to use it if even if the real buffer 0 was not done.
>
> Thanks
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v3 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-10  7:32 UTC (permalink / raw)
  To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu, jfreimann
In-Reply-To: <20180425051550.24342-4-tiwei.bie@intel.com>

On 2018年04月25日 13:15, Tiwei Bie wrote:
> +	/* We're using some buffers from the free list. */
> +	vq->vq.num_free -= descs_used;
> +
> +	/* Update free pointer */
> +	if (indirect) {
> +		n = head + 1;
> +		if (n >= vq->vring_packed.num) {
> +			n = 0;
> +			vq->wrap_counter ^= 1;
> +		}
> +		vq->next_avail_idx = n;
> +	} else
> +		vq->next_avail_idx = i;

During testing zerocopy (out of order completion), I found driver may 
submit two identical buffer id to vhost. So the above code may not work 
well.

Consider the case that driver adds 3 buffer and virtqueue size is 8.

a) id = 0,count = 2,next_avail = 2

b) id = 2,count = 4,next_avail = 2

c) id = 4,count = 2,next_avail = 0

if packet b is done before packet a, driver may think buffer id 0 is 
available and try to use it if even if the real buffer 0 was not done.

Thanks

^ permalink raw reply

* Re: [bpf-next v3 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: Jesper Dangaard Brouer @ 2018-05-10  7:31 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
	brouer
In-Reply-To: <20180510033427.20756-9-dsahern@gmail.com>

On Wed,  9 May 2018 20:34:26 -0700
David Ahern <dsahern@gmail.com> wrote:

> Provide a helper for doing a FIB and neighbor lookup in the kernel
> tables from an XDP program. The helper provides a fastpath for forwarding
> packets. If the packet is a local delivery or for any reason is not a
> simple lookup and forward, the packet continues up the stack.
> 
> If it is to be forwarded, the forwarding can be done directly if the
> neighbor is already known. If the neighbor does not exist, the first
> few packets go up the stack for neighbor resolution. Once resolved, the
> xdp program provides the fast path.
> 
> On successful lookup the nexthop dmac, current device smac and egress
> device index are returned.
> 
> The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
> are implemented in this patch. The API includes layer 4 parameters if
> the XDP program chooses to do deep packet inspection to allow compare
> against ACLs implemented as FIB rules.
> 
> Header rewrite is left to the XDP program.
> 
> The lookup takes 2 flags:
> - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
>   straight to the table associated with the device (expert setting for
>   those looking to maximize throughput)
> 
> - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
>   Default is an ingress lookup.
> 
> Initial performance numbers collected by Jesper, forwarded packets/sec:
> 
>        Full stack    XDP FIB lookup    XDP Direct lookup
> IPv4   1,947,969       7,074,156          7,415,333
> IPv6   1,728,000       6,165,504          7,262,720
> 

The "Full stack" tests were with netfilter modules unloaded.  Default
setting with netfilter conntrack loaded and default Fedora firewall
rules, show around 700Kpps.

> These number are single CPU core forwarding on a Broadwell
> E5-1650 v4 @ 3.60GHz.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

This helper is awesome, as it really shows how XDP is meant to work in
concert and cooperate with the existing network stack.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH] e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes
From: Benjamin Poirier @ 2018-05-10  7:28 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: Keller, Jacob E, Achim Mildenberger, olouvignes, jayanth,
	ehabkost, postmodern.mod3, Bart.VanAssche, intel-wired-lan,
	netdev, linux-kernel
In-Reply-To: <02874ECE860811409154E81DA85FBB5882D918F3@ORSMSX115.amr.corp.intel.com>

There have been multiple reports of crashes that look like
kernel: RIP: 0010:[<ffffffff8110303f>] timecounter_read+0xf/0x50
[...]
kernel: Call Trace:
kernel:  [<ffffffffa0806b0f>] e1000e_phc_gettime+0x2f/0x60 [e1000e]
kernel:  [<ffffffffa0806c5d>] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
kernel:  [<ffffffff810992c5>] process_one_work+0x155/0x440
kernel:  [<ffffffff81099e16>] worker_thread+0x116/0x4b0
kernel:  [<ffffffff8109f422>] kthread+0xd2/0xf0
kernel:  [<ffffffff8163184f>] ret_from_fork+0x3f/0x70

These can be traced back to the fact that e1000e_systim_reset() skips the
timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL, which
leads to a null deref in timecounter_read().

Commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) reworked
e1000e_get_base_timinca() in such a way that it can return -EINVAL for
e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.

Some experimentation has shown that on I219 (e1000_pch_spt, "MAC: 12")
adapters, the E1000_TSYNCRXCTL_SYSCFI flag is unstable; TSYNCRXCTL reads
sometimes don't have the SYSCFI bit set. Retrying the read shortly after
finds the bit to be set. This was observed at boot (probe) but also link up
and link down.

Moreover, the phc (PTP Hardware Clock) seems to operate normally even after
reads where SYSCFI=0. Therefore, remove this register read and
unconditionally set the clock parameters.

Reported-by: Achim Mildenberger <admin@fph.physik.uni-karlsruhe.de>
Message-Id: <20180425065243.g5mqewg5irkwgwgv@f2>
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1075876
Fixes: 83129b37ef35 ("e1000e: fix systim issues")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index ec4a9759a6f2..3afb1f3b6f91 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3546,15 +3546,12 @@ s32 e1000e_get_base_timinca(struct e1000_adapter *adapter, u32 *timinca)
 		}
 		break;
 	case e1000_pch_spt:
-		if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
-			/* Stable 24MHz frequency */
-			incperiod = INCPERIOD_24MHZ;
-			incvalue = INCVALUE_24MHZ;
-			shift = INCVALUE_SHIFT_24MHZ;
-			adapter->cc.shift = shift;
-			break;
-		}
-		return -EINVAL;
+		/* Stable 24MHz frequency */
+		incperiod = INCPERIOD_24MHZ;
+		incvalue = INCVALUE_24MHZ;
+		shift = INCVALUE_SHIFT_24MHZ;
+		adapter->cc.shift = shift;
+		break;
 	case e1000_pch_cnp:
 		if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
 			/* Stable 24MHz frequency */
-- 
2.16.3

^ permalink raw reply related

* Re: [bpf-next v3 9/9] samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP
From: Jesper Dangaard Brouer @ 2018-05-10  7:22 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
	brouer, Tariq Toukan
In-Reply-To: <20180510033427.20756-10-dsahern@gmail.com>

On Wed,  9 May 2018 20:34:27 -0700
David Ahern <dsahern@gmail.com> wrote:

> Simple example of fast-path forwarding. It has a serious flaw
> in not verifying the egress device index supports XDP forwarding.
> If the egress device does not packets are dropped.
> 
> Take this only as a simple example of fast-path forwarding.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> Acked-by: David S. Miller <davem@davemloft.net>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

I agree that sample program have this flaw, but it should not stop this
patchset. We need to find a more reliable way of detecting/verifying
that an egress device supports XDP forwarding, from within the BPF prog.

As this sample program hints, we could do a lookup in the devmap, to
get this info(?)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH v2] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
From: Christophe JAILLET @ 2018-05-10  7:06 UTC (permalink / raw)
  To: davem, tariqt
  Cc: netdev, linux-rdma, linux-kernel, kernel-janitors,
	Christophe JAILLET

If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.

So, doing some explicit kfree in the error handling path would lead to
some double kfree.

Simplify code to avoid such a case.

Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
v1 -> v2 : rewrite the fix as explained by Tariq Toukan
           (this 2nd version may have been posted twice, once without the
           v2 tag. PLease ignore the first one)
---

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index e0adac4a9a19..9670b33fc9b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3324,12 +3324,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 					   MAX_TX_RINGS, GFP_KERNEL);
 		if (!priv->tx_ring[t]) {
 			err = -ENOMEM;
-			goto err_free_tx;
+			goto out;
 		}
 		priv->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
 					 MAX_TX_RINGS, GFP_KERNEL);
 		if (!priv->tx_cq[t]) {
-			kfree(priv->tx_ring[t]);
 			err = -ENOMEM;
 			goto out;
 		}
@@ -3582,11 +3581,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 
 	return 0;
 
-err_free_tx:
-	while (t--) {
-		kfree(priv->tx_ring[t]);
-		kfree(priv->tx_cq[t]);
-	}
 out:
 	mlx4_en_destroy_netdev(dev);
 	return err;
-- 
2.17.0

^ permalink raw reply related

* [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
From: Christophe JAILLET @ 2018-05-10  7:02 UTC (permalink / raw)
  To: davem, tariqt
  Cc: netdev, linux-rdma, linux-kernel, kernel-janitors,
	Christophe JAILLET

If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.

So, doing some explicit kfree in the error handling path would lead to
some double kfree.

Simplify code to avoid such a case.

Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index e0adac4a9a19..9670b33fc9b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3324,12 +3324,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 					   MAX_TX_RINGS, GFP_KERNEL);
 		if (!priv->tx_ring[t]) {
 			err = -ENOMEM;
-			goto err_free_tx;
+			goto out;
 		}
 		priv->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
 					 MAX_TX_RINGS, GFP_KERNEL);
 		if (!priv->tx_cq[t]) {
-			kfree(priv->tx_ring[t]);
 			err = -ENOMEM;
 			goto out;
 		}
@@ -3582,11 +3581,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 
 	return 0;
 
-err_free_tx:
-	while (t--) {
-		kfree(priv->tx_ring[t]);
-		kfree(priv->tx_cq[t]);
-	}
 out:
 	mlx4_en_destroy_netdev(dev);
 	return err;
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next v2] tcp: Add mark for TIMEWAIT sockets
From: Jon Maxwell @ 2018-05-10  6:53 UTC (permalink / raw)
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, jmaxwell

This version has some suggestions by Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP 
races. 
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement. 
- Factorize code as sk_fullsock() check is not necessary.

Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large 
increase in Uplink retransmissions caused by the client never receiving 
the final ACK to their FINACK - this ACK misses the mark and routes out 
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect" 
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via 
setsockopt(SO_MARK..).  

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the 
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location. 
Then progate this so that the skb gets sent with the correct mark. Do the same 
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
 include/net/inet_timewait_sock.h |  1 +
 net/ipv4/ip_output.c             |  2 +-
 net/ipv4/tcp_ipv4.c              | 16 ++++++++++++++--
 net/ipv4/tcp_minisocks.c         |  1 +
 net/ipv6/tcp_ipv6.c              |  6 +++++-
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
 #define tw_dr			__tw_common.skc_tw_dr
 
 	int			tw_timeout;
+	__u32			tw_mark;
 	volatile unsigned char	tw_substate;
 	unsigned char		tw_rcv_wscale;
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..b5e21eb198d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1561,7 +1561,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 		oif = skb->skb_iif;
 
 	flowi4_init_output(&fl4, oif,
-			   IP4_REPLY_MARK(net, skb->mark),
+			   IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
 			   RT_TOS(arg->tos),
 			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
 			   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..caf23de88f8a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 	struct sock *sk1 = NULL;
 #endif
 	struct net *net;
+	struct sock *ctl_sk;
 
 	/* Never send a reset in response to a reset. */
 	if (th->rst)
@@ -723,11 +724,16 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 	arg.tos = ip_hdr(skb)->tos;
 	arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
 	local_bh_disable();
-	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+	ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+	if (sk)
+		ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+				   inet_twsk(sk)->tw_mark : sk->sk_mark;
+	ip_send_unicast_reply(ctl_sk,
 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
 			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
 			      &arg, arg.iov[0].iov_len);
 
+	ctl_sk->sk_mark = 0;
 	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
 	__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
 	local_bh_enable();
@@ -759,6 +765,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
 	} rep;
 	struct net *net = sock_net(sk);
 	struct ip_reply_arg arg;
+	struct sock *ctl_sk;
 
 	memset(&rep.th, 0, sizeof(struct tcphdr));
 	memset(&arg, 0, sizeof(arg));
@@ -809,11 +816,16 @@ static void tcp_v4_send_ack(const struct sock *sk,
 	arg.tos = tos;
 	arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
 	local_bh_disable();
-	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+	ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+	if (sk)
+		ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+				   inet_twsk(sk)->tw_mark : sk->sk_mark;
+	ip_send_unicast_reply(ctl_sk,
 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
 			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
 			      &arg, arg.iov[0].iov_len);
 
+	ctl_sk->sk_mark = 0;
 	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
 	local_bh_enable();
 }
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 57b5468b5139..f867658b4b30 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -263,6 +263,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		struct inet_sock *inet = inet_sk(sk);
 
 		tw->tw_transparent	= inet->transparent;
+		tw->tw_mark		= sk->sk_mark;
 		tw->tw_rcv_wscale	= tp->rx_opt.rcv_wscale;
 		tcptw->tw_rcv_nxt	= tp->rcv_nxt;
 		tcptw->tw_snd_nxt	= tp->snd_nxt;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d83cd16..7d47c2b550a9 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -803,6 +803,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
 	unsigned int tot_len = sizeof(struct tcphdr);
 	struct dst_entry *dst;
 	__be32 *topt;
+	__u32 mark = 0;
 
 	if (tsecr)
 		tot_len += TCPOLEN_TSTAMP_ALIGNED;
@@ -871,7 +872,10 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
 		fl6.flowi6_oif = oif;
 	}
 
-	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark);
+	if (sk)
+		mark = (sk->sk_state == TCP_TIME_WAIT) ?
+			inet_twsk(sk)->tw_mark : sk->sk_mark;
+	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark;
 	fl6.fl6_dport = t1->dest;
 	fl6.fl6_sport = t1->source;
 	fl6.flowi6_uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
-- 
2.13.6

^ permalink raw reply related

* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Dmitry Vyukov @ 2018-05-10  6:46 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Tommi Rantala, Neil Horman, Xin Long, David Ahern,
	Daniel Borkmann, Cong Wang, David Miller, Eric Dumazet,
	Willem de Bruijn, Jakub Kicinski, Rasmus Villemoes, netdev, LKML,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, syzkaller, Dan Streetman,
	Eric W. Biederman, Alexey Kodanev
In-Reply-To: <CALZtONDn_yQb_EeSgbwk0Trzp7TA2uXLc6tuxmmWd0uyqOq5aA@mail.gmail.com>

On Mon, Apr 16, 2018 at 9:42 PM, Dan Streetman <ddstreet@ieee.org> wrote:
>>>>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>> On 20.02.2018 18:26, Neil Horman wrote:
>>>>>>>
>>>>>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>>>>>
>>>>>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>>>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>>>>>
>>>>>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>>>>>
>>>>>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>>>>>> latest upstream tree.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These two commits are in v4.16-rc1:
>>>>>>>>>
>>>>>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>>>>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
>>>>>>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>>>>>>
>>>>>>>>>      sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>>>>>> ...
>>>>>>>>>      Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>>>>>      Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>>>>>>> secondary
>>>>>>>>> addresses")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>>>>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
>>>>>>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>>>>>>
>>>>>>>>>      sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>>>>>> ...
>>>>>>>>>      Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>>>>>>> secondary
>>>>>>>>> addresses for ipv6")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I guess we missed something if it's still reproducible.
>>>>>>>>>
>>>>>>>>> I can check it later this week, unless someone else beat me to it.
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Tommi,
>>>>>>>>
>>>>>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>>>>>> another one then. But I am still seeing these:
>>>>>>>>
>>>>>>>> [   58.799130] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>> [   60.847138] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>> [   62.895093] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>> [   64.943103] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>>
>>>>>>>> on upstream tree pulled ~12 hours ago.
>>>>>>>>
>>>>>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>>>>>> out a
>>>>>>> backtrace if the device name matches "lo".  That should tell us
>>>>>>> definitively if
>>>>>>> the problem is in the same location or not
>>>>>>
>>>>>>
>>>>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>>>>> sent in the first email in this thread:
>>>>>>
>>>>>> With 4.16-rc2 unable to reproduce.
>>>>>>
>>>>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>>>>> lo to become free. Usage count = 3"
>>>>>>
>>>>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>>>>> cherry-picked on top, unable to reproduce.
>>>>>>
>>>>>>
>>>>>> Is syzkaller doing something else now to trigger the bug...?
>>>>>> Can you still trigger the bug with the same reproducer?
>>>>>
>>>>> Hi Neil, Tommi,
>>>>>
>>>>> Reviving this old thread about "unregister_netdevice: waiting for lo
>>>>> to become free. Usage count = 3" hangs.
>>>>> I still did not have time to deep dive into what happens there (too
>>>>> many bugs coming from syzbot). But this still actively happens and I
>>>>> suspect accounts to a significant portion of various hang reports,
>>>>> which are quite unpleasant.
>>>>>
>>>>> One idea that could make it all simpler:
>>>>>
>>>>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>>>>> prolonged periods of time under any non-buggy conditions? E.g. more
>>>>> than 1-2 minutes?
>>>>> If it only supposed to wait briefly for things that already supposed
>>>>> to be shutting down, and we add a WARNING there after some timeout,
>>>>> then syzbot will report all info how/when it happens, hopefully
>>>>> extracting reproducers, and all the nice things.
>>>>> But this WARNING should not have any false positives under any
>>>>> realistic conditions (e.g. waiting for arrival of remote packets with
>>>>> large timeouts).
>>>>>
>>>>> Looking at some task hung reports, it seems that this code holds some
>>>>> mutexes, takes workqueue thread and prevents any progress with
>>>>> destruction of other devices (and net namespace creation/destruction),
>>>>> so I guess it should not wait for any indefinite periods of time?
>>>>
>>>> I'm working on this currently:
>>>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>>>
>>>> I added a summary of what I've found to be the cause (or at least, one
>>>> possible cause) of this:
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>>>>
>>>> I'm working on a patch to work around the main side-effect of this,
>>>> which is hanging while holding the global net mutex.  Hangs will still
>>>> happen (e.g. if a dst leaks) but should not affect anything else,
>>>> other than a leak of the dst and its net namespace.
>>>>
>>>> Fixing the dst leaks is important too, of course, but a dst leak (or
>>>> other cause) shouldn't break the entire system.
>>>
>>> Leaking some memory is definitely better than hanging the system.
>>>
>>> So I've made syzkaller to recognize "unregister_netdevice: waiting for
>>> (.*) to become free" as a kernel bug:
>>> https://github.com/google/syzkaller/commit/7a67784ca8bdc3b26cce2f0ec9a40d2dd9ec9396
>>> Unfortunately it does not make it catch these bugs because creating a
>>> net namespace per test is too damn slow, so namespaces are reused for
>>> lots of tests and when/if it's eventually destroyed it's already too
>>> late to find root cause.
>>>
>>> But I've run a one-off experiment with prompt net namespace
>>> destruction and syzkaller was able to easily extract a C reproducer:
>>> https://gist.githubusercontent.com/dvyukov/d571e8fff24e127ca48a8c4790d42bfa/raw/52050e93ba9afbb5126b9d7bb39b7e71a82af016/gistfile1.txt
>>>
>>> On upstream 16e205cf42da1f497b10a4a24f563e6c0d574eec with this config:
>>> https://gist.githubusercontent.com/dvyukov/9663c57443adb21f2795b92ef0829d62/raw/bbea0652e23746096dd56855a28f6c681aebcdee/gistfile1.txt
>>>
>>> this gives me:
>>>
>>> [   83.183198] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 9
>>> [   85.231202] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 9
>>> ...
>>> [  523.511205] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 9
>>> ...
>>>
>>> This is generated from this syzkaller program:
>>>
>>> r0 = socket$inet6(0xa, 0x1, 0x84)
>>> setsockopt$inet6_IPV6_XFRM_POLICY(r0, 0x29, 0x23,
>>> &(0x7f0000000380)={{{@in6=@remote={0xfe, 0x80, [], 0xbb},
>>> @in=@dev={0xac, 0x14, 0x14}, 0x0, 0x0, 0x0, 0x0, 0xa}, {}, {}, 0x0,
>>> 0x0, 0x1}, {{@in=@local={0xac, 0x14, 0x14, 0xaa}, 0x0, 0x32}, 0x0,
>>> @in=@local={0xac, 0x14, 0x14, 0xaa}, 0x3504}}, 0xe8)
>>> bind$inet6(r0, &(0x7f0000000000)={0xa, 0x4e20}, 0x1c)
>>> connect$inet(r0, &(0x7f0000000040)={0x2, 0x4e20, @dev={0xac, 0x14,
>>> 0x14, 0xd}}, 0x10)
>>> syz_emit_ethernet(0x3e, &(0x7f00000001c0)={@local={[0xaa, 0xaa, 0xaa,
>>> 0xaa, 0xaa], 0xaa}, @dev={[0xaa, 0xaa, 0xaa, 0xaa, 0xaa]}, [],
>>> {@ipv6={0x86dd, {0x0, 0x6, "50a09c", 0x8, 0xffffff11, 0x0,
>>> @remote={0xfe, 0x80, [], 0xbb}, @local={0xfe, 0x80, [], 0xaa}, {[],
>>> @udp={0x0, 0x4e20, 0x8}}}}}}, &(0x7f0000000040))
>>>
>>> So this seems to be related to IPv6 and/or xfrm and is potentially
>>> caused by external packets (that syz_emit_ethernet call).
>>
>>
>>
>> Here is another repro which seems to be a different bug (note that it
>> requires fault injection):
>>
>> https://gist.githubusercontent.com/dvyukov/1c56623016cc4c24a69d433c5114ad5b/raw/530478f571b195193101b912aa646948528baa8e/gistfile1.txt
>>
>> Dan, do you mind taking a look at them? Fixing these should eliminate
>> root causes of these hangs/leaks.
>
> Yep I will look at them, thanks for the reproducers.

Hi Dan,

Any updates on this? syzbot is hitting this all the time.

^ permalink raw reply

* [PATCH] net: ipv4: remove define INET_CSK_DEBUG and unnecessary EXPORT_SYMBOL
From: Joe Perches @ 2018-05-10  6:24 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI
  Cc: Li RongQing, Arnaldo Carvalho de Melo, netdev, linux-kernel

INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls.

EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2
pr_debug calls and is also unnecessary as the exported string can
be used directly by these calls.

Signed-off-by: Joe Perches <joe@perches.com>
---
 include/net/inet_connection_sock.h | 22 ++++------------------
 net/ipv4/inet_connection_sock.c    |  5 -----
 2 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ab6667275df..0a6c9e0f2b5a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -23,8 +23,6 @@
 #include <net/inet_sock.h>
 #include <net/request_sock.h>
 
-#define INET_CSK_DEBUG 1
-
 /* Cancel timers, when they are not required. */
 #undef INET_CSK_CLEAR_TIMERS
 
@@ -196,10 +194,6 @@ static inline void inet_csk_delack_init(struct sock *sk)
 void inet_csk_delete_keepalive_timer(struct sock *sk);
 void inet_csk_reset_keepalive_timer(struct sock *sk, unsigned long timeout);
 
-#ifdef INET_CSK_DEBUG
-extern const char inet_csk_timer_bug_msg[];
-#endif
-
 static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -214,12 +208,9 @@ static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
 #ifdef INET_CSK_CLEAR_TIMERS
 		sk_stop_timer(sk, &icsk->icsk_delack_timer);
 #endif
+	} else {
+		pr_debug("inet_csk BUG: unknown timer value\n");
 	}
-#ifdef INET_CSK_DEBUG
-	else {
-		pr_debug("%s", inet_csk_timer_bug_msg);
-	}
-#endif
 }
 
 /*
@@ -232,10 +223,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	if (when > max_when) {
-#ifdef INET_CSK_DEBUG
 		pr_debug("reset_xmit_timer: sk=%p %d when=0x%lx, caller=%p\n",
 			 sk, what, when, current_text_addr());
-#endif
 		when = max_when;
 	}
 
@@ -249,12 +238,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 		icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
 		icsk->icsk_ack.timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
+	} else {
+		pr_debug("inet_csk BUG: unknown timer value\n");
 	}
-#ifdef INET_CSK_DEBUG
-	else {
-		pr_debug("%s", inet_csk_timer_bug_msg);
-	}
-#endif
 }
 
 static inline unsigned long
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d046f2..33a88e045efd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -27,11 +27,6 @@
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
 
-#ifdef INET_CSK_DEBUG
-const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-EXPORT_SYMBOL(inet_csk_timer_bug_msg);
-#endif
-
 #if IS_ENABLED(CONFIG_IPV6)
 /* match_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses if IPv6
  *                          only, and any IPv4 addresses if not IPv6 only
-- 
2.15.0

^ permalink raw reply related

* Re: [PATCH net-next v1] tcp: Add mark for TIMEWAIT sockets
From: Jonathan Maxwell @ 2018-05-10  6:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, kuznet, yoshfuji, netdev, linux-kernel, Jon Maxwell
In-Reply-To: <eece810c-b3c8-4307-c03a-1cfdaa7109de@gmail.com>

On Thu, May 10, 2018 at 3:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 05/09/2018 10:21 PM, Jon Maxwell wrote:
>
> ...
>
>>       if (th->rst)
>> @@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>>       arg.tos = ip_hdr(skb)->tos;
>>       arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
>>       local_bh_disable();
>> -     ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
>> +     ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
>> +     if (sk && sk->sk_state == TCP_TIME_WAIT)
>> +             ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
>> +     else if (sk && sk_fullsock(sk))
>
> I do not believe we could have a non fullsock here ?
>

Okay thanks I'll make these changes to v2.

> A request socket (SYN_RECV state) at this point is not expected.
>
>
> So you can factorize :
>
> if (sk)
>         ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
>                           inet_twsk(sk)->tw_mark : sk->sk_mark;
>
> (same remark for IPv6)
>
>
>> +             ctl_sk->sk_mark = sk->sk_mark;
>> +     ip_send_unicast_reply(ctl_sk,
>>                             skb, &TCP_SKB_CB(skb)->header.h4.opt,
>>                             ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>>                             &arg, arg.iov[0].iov_len);
>
>

^ permalink raw reply

* Re: [PATCH 2/2] alx: add disable_wol paramenter
From: AceLan Kao @ 2018-05-10  5:58 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David Miller, James Cliburn, Chris Snook, rakesh, netdev,
	Linux-Kernel@Vger. Kernel. Org, Emily Chien
In-Reply-To: <20180509134543.GF14276@lunn.ch>

Hi Andrew,

We have some machines using Qualcomm Atheros Killer E2400 Gigabit
Ethernet Controller,
but none of them has the unintentional wake up issue.
We're willing to fix it if we encountered the issue, but before we can
do it, we need this feature is supported by the driver.

Taking the feature has been removed for 5 years into account, I doubt
if we still can reproduce this issue,
but again, to verify this issue we need to add back this feature first.
Set WoL disabled by default won't introduce any regression but give
users and developers a chance to fix it.

Best regards,
AceLan Kao.

2018-05-09 21:45 GMT+08:00 Andrew Lunn <andrew@lunn.ch>:
> On Wed, May 09, 2018 at 10:40:13AM +0800, AceLan Kao wrote:
>> Hi,
>>
>> I didn't get any response around one month.
>
> I didn't notice you posting an results of searching for the true
> problem? Please can you point me at an email archive.
>
>          Andrew

^ permalink raw reply

* Re: [net-next][PATCH] inet: Use switch case instead of multiple condition checks
From: Joe Perches @ 2018-05-10  5:49 UTC (permalink / raw)
  To: Li RongQing, netdev
In-Reply-To: <1525930656-6901-1-git-send-email-lirongqing@baidu.com>

On Thu, 2018-05-10 at 13:37 +0800, Li RongQing wrote:
> inet_csk_reset_xmit_timer uses multiple equality condition checks,
> so it is better to use switch case instead of them
[]
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
[]
> @@ -239,22 +239,31 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
>  		when = max_when;
>  	}
>  
> -	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
> -	    what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE ||
> -	    what == ICSK_TIME_REO_TIMEOUT) {
> +	switch (what) {
> +	case ICSK_TIME_RETRANS:
> +		/* fall through */
> +	case ICSK_TIME_PROBE0:
> +		/* fall through */
> +	case ICSK_TIME_EARLY_RETRANS:
> +		/* fall through */
> +	case ICSK_TIME_LOSS_PROBE:
> +		/* fall through */
> +	case ICSK_TIME_REO_TIMEOUT:

Please remove the /* fall through */ lines
and use consecutive case labels like:

	case ICSK_TIME_RETRANS:
	case ICSK_TIME_PROBE0:
	case ICSK_TIME_EARLY_RETRANS:
	case ICSK_TIME_LOSS_PROBE:
	case ICSK_TIME_REO_TIMEOUT:

^ permalink raw reply

* Re: [net-next][PATCH] inet: Use switch case instead of multiple condition checks
From: Eric Dumazet @ 2018-05-10  5:48 UTC (permalink / raw)
  To: Li RongQing, netdev
In-Reply-To: <1525930656-6901-1-git-send-email-lirongqing@baidu.com>



On 05/09/2018 10:37 PM, Li RongQing wrote:
> inet_csk_reset_xmit_timer uses multiple equality condition checks,
> so it is better to use switch case instead of them
> 

Why is it better ?

I prefer the concise form, and compiler really should not care.

^ permalink raw reply

* Re: [PATCH net-next v1] tcp: Add mark for TIMEWAIT sockets
From: Eric Dumazet @ 2018-05-10  5:45 UTC (permalink / raw)
  To: Jon Maxwell, davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, jmaxwell
In-Reply-To: <20180510052111.17886-1-jmaxwell37@gmail.com>



On 05/09/2018 10:21 PM, Jon Maxwell wrote:

...

>  	if (th->rst)
> @@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>  	arg.tos = ip_hdr(skb)->tos;
>  	arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
>  	local_bh_disable();
> -	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
> +	ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
> +	if (sk && sk->sk_state == TCP_TIME_WAIT)
> +		ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
> +	else if (sk && sk_fullsock(sk))

I do not believe we could have a non fullsock here ?

A request socket (SYN_RECV state) at this point is not expected. 


So you can factorize :

if (sk)
        ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
			  inet_twsk(sk)->tw_mark : sk->sk_mark;

(same remark for IPv6)


> +		ctl_sk->sk_mark = sk->sk_mark;
> +	ip_send_unicast_reply(ctl_sk,
>  			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
>  			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>  			      &arg, arg.iov[0].iov_len);

^ permalink raw reply

* Re: [PATCH ipsec-next] xfrm: Allow Output Mark to be Updated Using UPDSA
From: Eyal Birger @ 2018-05-10  5:44 UTC (permalink / raw)
  To: Nathan Harold; +Cc: netdev, steffen.klassert, tobias
In-Reply-To: <20180509204626.56561-1-nharold@google.com>

Hi Nathan,

On Wed,  9 May 2018 13:46:26 -0700
Nathan Harold <nharold@google.com> wrote:

> Allow UPDSA to change output_mark to permit
> policy separation of packet routing decisions from
> SA keying in systems that use mark-based routing.
> 
> In the output_mark, used as a routing and firewall
> mark for outbound packets, is made update-able which
> allows routing decisions to be handled independently
> of keying/SA creation. To maintain consistency with
> other optional attributes, the output mark is only
> updated if sent with a non-zero value. Once set, the
> output mark may not be reset to zero, which ensures
> that updating the SA does not require the mark to
> be re-sent to avoid the value being clobbered.

There is an attempt to extend the 'output_mark' to support the input
direction and masking.

In the proposed implementation, output_mark is converted to type 'struct
xfrm_mark' where the semantics are as follows:

- If mark is given by XFRMA_OUTPUT_MARK (renamed to XFRMA_SET_MARK)
  then a new XFRMA_SET_MARK_MASK attribute is consulted to set the mask
  value
- if no XFRMA_SET_MARK_MASK attribute is provided, the mask is set to
  0xffffffff

Therefore, if the mask value is 0, we can regard the mark as 'not
given'.

My question is, in the context of this patch, it seems that the
"Once set, the output mark may not be reset to zero" restriction may be
lifted in favor of updating the mark only if the new mask is non zero.

Does this make sense to you?
Eyal

^ permalink raw reply

* [net-next][PATCH] inet: Use switch case instead of multiple condition checks
From: Li RongQing @ 2018-05-10  5:37 UTC (permalink / raw)
  To: netdev

inet_csk_reset_xmit_timer uses multiple equality condition checks,
so it is better to use switch case instead of them

after this patch, the increased image size is acceptable

                                Before      After
size of net/ipv4/tcp_output.o:  721640      721648
size of vmlinux:                400236400   400236401

Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 include/net/inet_connection_sock.h | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ab6667275df..d2e9314cf43d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -239,22 +239,31 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 		when = max_when;
 	}
 
-	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
-	    what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE ||
-	    what == ICSK_TIME_REO_TIMEOUT) {
+	switch (what) {
+	case ICSK_TIME_RETRANS:
+		/* fall through */
+	case ICSK_TIME_PROBE0:
+		/* fall through */
+	case ICSK_TIME_EARLY_RETRANS:
+		/* fall through */
+	case ICSK_TIME_LOSS_PROBE:
+		/* fall through */
+	case ICSK_TIME_REO_TIMEOUT:
 		icsk->icsk_pending = what;
 		icsk->icsk_timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
-	} else if (what == ICSK_TIME_DACK) {
+		break;
+	case ICSK_TIME_DACK:
 		icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
 		icsk->icsk_ack.timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
-	}
+		break;
 #ifdef INET_CSK_DEBUG
-	else {
+	default:
 		pr_debug("%s", inet_csk_timer_bug_msg);
-	}
+		break;
 #endif
+	}
 }
 
 static inline unsigned long
-- 
2.16.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox