Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [REGRESSION next-20170426] Commit 09515ef5ddad ("of/acpi: Configure dma operations at probe time for platform/amba/pci bus devices") causes oops in mvneta
From: Sricharan R @ 2017-04-27 13:35 UTC (permalink / raw)
  To: Ralph Sennhauser
  Cc: Rafael J. Wysocki, Joerg Roedel, Bjorn Helgaas, linux-acpi,
	linux-kernel, linux-pci, Thomas Petazzoni, netdev
In-Reply-To: <20170426181508.687b52af@gmail.com>

Hi,

On 4/26/2017 9:45 PM, Ralph Sennhauser wrote:
> Hi Sricharan R,
> 
> Commit 09515ef5ddad ("of/acpi: Configure dma operations at probe time
> for platform/amba/pci bus devices") causes a kernel panic as in the log
> below on an armada-385. Reverting the commit fixes the issue.
> 
> Regards
> Ralph

Somehow not getting a obvious clue on whats going wrong with the logs
below. From the log and looking in to dts, the drivers seems to the one for
"marvell,armada-370-neta". Issue looks the data from the dma has gone bad
and subsequently referring the wrong data has resulted in the crash.
Looks like the dma_masks is the one going wrong.
Can i get some logs from mvneta_probe, about dev->dma_mask,
dev->coherent_dma_mask and dev->dma_ops with and without the patch
to see whats the difference ?

Regards,
 Sricharan

> 
> ---
> 
> [   18.288244] [c06d8480] *pgd=0061941e(bad)
> [   18.292271] Internal error: Oops: 80d [#1] SMP ARM
> [   18.297080] Modules linked in:
> [   18.471175] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.11.0-rc8-next-20170426 #3
> [   18.479909] Hardware name: Marvell Armada 380/385 (Device Tree)
> [   18.485850] task: c0a07000 task.stack: c0a00000
> [   18.490401] PC is at __memzero+0x40/0x80
> [   18.494336] LR is at 0x0
> [   18.496878] pc : [<c0317920>]    lr : [<00000000>]    psr: 00000113
> [   18.496878] sp : c0a01d0c  ip : 00000000  fp : c0a01d34
> [   18.508402] r10: df43f800  r9 : df43f800  r8 : 00000001
> [   18.513645] r7 : c06d7e40  r6 : 000007c0  r5 : c06d8480  r4 : de14aa80
> [   18.520196] r3 : 00000000  r2 : 00000000  r1 : ffffffe4  r0 : c06d8480
> [   18.526750] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [   18.533912] Control: 10c5387d  Table: 16f0004a  DAC: 00000051
> [   18.539679] Process swapper/0 (pid: 0, stack limit = 0xc0a00210)
> [   18.545708] Stack: (0xc0a01d0c to 0xc0a02000)
> [   18.550082] 1d00:                            c04caef0 c06d7e40 00000700 0cc04000 00000001
> [   18.558292] 1d20: df691810 df43f800 c0a01d4c c0a01d38 c04caf20 c04cae7c e1b1d000 df5bd980
> [   18.566503] 1d40: c0a01dc4 c0a01d50 c043d3d8 c04caf14 df41e494 00000000 1e652500 c0601460
> [   18.574714] 1d60: dfbd8880 c0a04680 c06d7e40 00000000 dfbd8888 00000000 00000000 00000188
> [   18.582924] 1d80: 1e652500 dfbd8880 00000040 00000100 df43fc80 00000001 00000000 dfbd8888
> [   18.591134] 1da0: c043cfd4 00000040 0000012c ffff91f5 c0a02d00 c0a01de8 c0a01e24 c0a01dc8
> [   18.599345] 1dc0: c04db2d4 c043cfe0 c015d81c c06db4d4 c0a04990 c0a04990 c0a302b2 1f267000
> [   18.607555] 1de0: c096ab40 dfbd1b40 c0a01de8 c0a01de8 c0a01df0 c0a01df0 c0a01e1c 00000000
> [   18.615766] 1e00: c0a0208c c0a00000 00000100 00000003 c0a02080 40000003 c0a01e84 c0a01e28
> [   18.623977] 1e20: c0122e9c c04db0cc c0a01e54 00000101 c0a01e4c 00200102 c0a02d00 ffff91f5
> [   18.632187] 1e40: 0000000a c06025b4 c0a31740 c09632a8 c0a02080 c0a01e28 c0169788 c0968420
> [   18.640397] 1e60: 00000000 00000000 00000001 df408000 e0803100 c0a01f58 c0a01e94 c0a01e88
> [   18.648607] 1e80: c01232d0 c0122d84 c0a01ebc c0a01e98 c015d6e4 c012322c c0a169c0 c0a03fac
> [   18.656818] 1ea0: e080210c c0a01ee8 e0802100 e0803100 c0a01ee4 c0a01ec0 c01014a4 c015d688
> [   18.665029] 1ec0: c01085b0 60000013 ffffffff c0a01f1c 00000000 c0a00000 c0a01f44 c0a01ee8
> [   18.673240] 1ee0: c010c86c c0101460 00000001 00000000 00000000 c0118e40 c0a00000 c0a03cf8
> [   18.681451] 1f00: c0a03cac c09696f8 00000000 00000000 c0a01f58 c0a01f44 c0a01f48 c0a01f38
> [   18.689661] 1f20: c01085ac c01085b0 60000013 ffffffff 00000051 00000000 c0a01f54 c0a01f48
> [   18.697871] 1f40: c05e3634 c010857c c0a01f8c c0a01f58 c0153458 c05e3618 c0a03c80 c0a0f30a
> [   18.706082] 1f60: c0a30c00 000000bd c0a30c00 c0a03c80 ffffffff c0a30c00 c0833a28 dfffcb40
> [   18.714292] 1f80: c0a01f9c c0a01f90 c0153718 c01532c0 c0a01fac c0a01fa0 c05dd758 c0153704
> [   18.722503] 1fa0: c0a01ff4 c0a01fb0 c0800d68 c05dd6e8 ffffffff ffffffff 00000000 c08006f8
> [   18.730713] 1fc0: 00000000 c0833a28 00000000 c0a30e94 c0a03c9c c0833a24 c0a081e8 0000406a
> [   18.738923] 1fe0: 414fc091 00000000 00000000 c0a01ff8 0000807c c08009b4 00000000 00000000
> [   18.747132] Backtrace:
> [   18.749591] [<c04cae70>] (__build_skb) from [<c04caf20>] (build_skb+0x18/0x6c)
> [   18.756843]  r9:df43f800 r8:df691810 r7:00000001 r6:0cc04000 r5:00000700 r4:c06d7e40
> [   18.764618] [<c04caf08>] (build_skb) from [<c043d3d8>] (mvneta_poll+0x404/0xc18)
> [   18.772042]  r5:df5bd980 r4:e1b1d000
> [   18.775632] [<c043cfd4>] (mvneta_poll) from [<c04db2d4>] (net_rx_action+0x214/0x308)
> [   18.783406]  r10:c0a01de8 r9:c0a02d00 r8:ffff91f5 r7:0000012c r6:00000040 r5:c043cfd4
> [   18.791266]  r4:dfbd8888
> [   18.793810] [<c04db0c0>] (net_rx_action) from [<c0122e9c>] (__do_softirq+0x124/0x248)
> [   18.801672]  r10:40000003 r9:c0a02080 r8:00000003 r7:00000100 r6:c0a00000 r5:c0a0208c
> [   18.809531]  r4:00000000
> [   18.812074] [<c0122d78>] (__do_softirq) from [<c01232d0>] (irq_exit+0xb0/0xe4)
> [   18.819325]  r10:c0a01f58 r9:e0803100 r8:df408000 r7:00000001 r6:00000000 r5:00000000
> [   18.827184]  r4:c0968420
> [   18.829729] [<c0123220>] (irq_exit) from [<c015d6e4>] (__handle_domain_irq+0x68/0xbc)
> [   18.837591] [<c015d67c>] (__handle_domain_irq) from [<c01014a4>] (gic_handle_irq+0x50/0x94)
> [   18.845976]  r9:e0803100 r8:e0802100 r7:c0a01ee8 r6:e080210c r5:c0a03fac r4:c0a169c0
> [   18.853749] [<c0101454>] (gic_handle_irq) from [<c010c86c>] (__irq_svc+0x6c/0x90)
> [   18.861261] Exception stack(0xc0a01ee8 to 0xc0a01f30)
> [   18.866332] 1ee0:                   00000001 00000000 00000000 c0118e40 c0a00000 c0a03cf8
> [   18.874543] 1f00: c0a03cac c09696f8 00000000 00000000 c0a01f58 c0a01f44 c0a01f48 c0a01f38
> [   18.882753] 1f20: c01085ac c01085b0 60000013 ffffffff
> [   18.887823]  r9:c0a00000 r8:00000000 r7:c0a01f1c r6:ffffffff r5:60000013 r4:c01085b0
> [   18.895601] [<c0108570>] (arch_cpu_idle) from [<c05e3634>] (default_idle_call+0x28/0x34)
> [   18.903727] [<c05e360c>] (default_idle_call) from [<c0153458>] (do_idle+0x1a4/0x1d0)
> [   18.911503] [<c01532b4>] (do_idle) from [<c0153718>] (cpu_startup_entry+0x20/0x24)
> [   18.919103]  r10:dfffcb40 r9:c0833a28 r8:c0a30c00 r7:ffffffff r6:c0a03c80 r5:c0a30c00
> [   18.926962]  r4:000000bd
> [   18.929508] [<c01536f8>] (cpu_startup_entry) from [<c05dd758>] (rest_init+0x7c/0x80)
> [   18.937284] [<c05dd6dc>] (rest_init) from [<c0800d68>] (start_kernel+0x3c0/0x3cc)
> [   18.944796] [<c08009a8>] (start_kernel) from [<0000807c>] (0x807c)
> [   18.951001] Code: a8a0500c cafffff9 08bd8000 e3110020 (18a0500c)
> [   18.957119] ---[ end trace 4e5c1e66e49610b0 ]---
> [   18.961753] Kernel panic - not syncing: Fatal exception in interrupt
> [   18.968133] CPU1: stopping
> [   18.970852] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D    O    4.11.0-rc8-next-20170426 #3
> [   18.979585] Hardware name: Marvell Armada 380/385 (Device Tree)
> [   18.985527] Backtrace:
> [   18.987986] [<c010ba4c>] (dump_backtrace) from [<c010bd20>] (show_stack+0x18/0x1c)
> [   18.995586]  r7:df467f28 r6:60000193 r5:c0a165b0 r4:00000000
> [   19.001268] [<c010bd08>] (show_stack) from [<c031883c>] (dump_stack+0x94/0xa8)
> [   19.008520] [<c03187a8>] (dump_stack) from [<c010ecec>] (handle_IPI+0x178/0x198)
> [   19.015945]  r7:df467f28 r6:00000000 r5:00000001 r4:c0a30ef0
> [   19.021627] [<c010eb74>] (handle_IPI) from [<c01014e4>] (gic_handle_irq+0x90/0x94)
> [   19.029227]  r7:df467f28 r6:e080210c r5:c0a03fac r4:c0a169c0
> [   19.034908] [<c0101454>] (gic_handle_irq) from [<c010c86c>] (__irq_svc+0x6c/0x90)
> [   19.042419] Exception stack(0xdf467f28 to 0xdf467f70)
> [   19.047490] 7f20:                   00000001 00000000 00000000 c0118e40 df466000 c0a03cf8
> [   19.055701] 7f40: c0a03cac c09696f8 00000000 00000000 df467f98 df467f84 df467f88 df467f78
> [   19.063911] 7f60: c01085ac c01085b0 60000013 ffffffff
> [   19.068982]  r9:df466000 r8:00000000 r7:df467f5c r6:ffffffff r5:60000013 r4:c01085b0
> [   19.076758] [<c0108570>] (arch_cpu_idle) from [<c05e3634>] (default_idle_call+0x28/0x34)
> [   19.084882] [<c05e360c>] (default_idle_call) from [<c0153458>] (do_idle+0x1a4/0x1d0)
> [   19.092657] [<c01532b4>] (do_idle) from [<c0153718>] (cpu_startup_entry+0x20/0x24)
> [   19.100258]  r10:00000000 r9:414fc091 r8:0000406a r7:c0a30f00 r6:10c0387d r5:00000001
> [   19.108118]  r4:00000087
> [   19.110661] [<c01536f8>] (cpu_startup_entry) from [<c010e918>] (secondary_start_kernel+0x150/0x15c)
> [   19.119743] [<c010e7c8>] (secondary_start_kernel) from [<0010162c>] (0x10162c)
> [   19.126993]  r5:00000051 r4:1f45c06a
> [   19.130583] Rebooting in 3 seconds..
> 
> ---
> 
> git bisect start
> # bad: [e0a8aa40bd2c7d973b6520293f3fd86dcbca847b] Add linux-next specific files for 20170426
> git bisect bad e0a8aa40bd2c7d973b6520293f3fd86dcbca847b
> # good: [c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201] Linux 4.11-rc1
> git bisect good c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201
> # good: [221c3c382a529418d3a2acc58f53101103f6ff13] Merge remote-tracking branch 'l2-mtd/master'
> git bisect good 221c3c382a529418d3a2acc58f53101103f6ff13
> # bad: [ffcefbbe2f8c0af5f22af921afd2756baceddd74] Merge remote-tracking branch 'spi/for-next'
> git bisect bad ffcefbbe2f8c0af5f22af921afd2756baceddd74
> # good: [c2e7f82d336a451ebb904b8bf9a5a558cf16c39b] drm: mali-dp: Check the mclk rate and allow up/down scaling
> git bisect good c2e7f82d336a451ebb904b8bf9a5a558cf16c39b
> # good: [03b600f368cfa6f94e9622dda82e60f55b5e6224] Merge remote-tracking branch 'block/for-next'
> git bisect good 03b600f368cfa6f94e9622dda82e60f55b5e6224
> # good: [8b1e05cbcdfc59496eff5870cb6b6ab964ecc733] Merge remote-tracking branch 'battery/for-next'
> git bisect good 8b1e05cbcdfc59496eff5870cb6b6ab964ecc733
> # good: [e765d496d3adb3d69bd8c53df6fd3f3b77e5b1d2] Merge remote-tracking branch 'watchdog/master'
> git bisect good e765d496d3adb3d69bd8c53df6fd3f3b77e5b1d2
> # bad: [858eed97e369df7af0993463f355aa9755227136] Merge remote-tracking branch 'audit/next'
> git bisect bad 858eed97e369df7af0993463f355aa9755227136
> # bad: [efc2195bcc35eebf06805806eb525893f3b9ab5c] Merge branches 'arm/exynos', 'arm/omap', 'arm/rockchip', 'arm/mediatek', 'arm/smmu', 'arm/core', 'x86/vt-d', 'x86/amd' and 'core' into next
> git bisect bad efc2195bcc35eebf06805806eb525893f3b9ab5c
> # bad: [316ca8804ea84a782d5ba2163711ebb22116ff5a] ACPI/IORT: Remove linker section for IORT entries probing
> git bisect bad 316ca8804ea84a782d5ba2163711ebb22116ff5a
> # good: [d7b0558230e444f29488fcee0b0b561015d16f8a] iommu/of: Prepare for deferred IOMMU configuration
> git bisect good d7b0558230e444f29488fcee0b0b561015d16f8a
> # bad: [09515ef5ddad71c7820e5e428da418b709feeb26] of/acpi: Configure dma operations at probe time for platform/amba/pci bus devices
> git bisect bad 09515ef5ddad71c7820e5e428da418b709feeb26
> # good: [1d9029d440e40b276c0691caed1de10c42d96bef] ACPI/IORT: Add function to check SMMUs drivers presence
> git bisect good 1d9029d440e40b276c0691caed1de10c42d96bef
> # good: [efc8551a276faab19d85079da02c5fb602b0dcbe] of: device: Fix overflow of coherent_dma_mask
> git bisect good efc8551a276faab19d85079da02c5fb602b0dcbe
> # first bad commit: [09515ef5ddad71c7820e5e428da418b709feeb26] of/acpi:
> Configure dma operations at probe time for platform/amba/pci bus devices
> 

-- 
"QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply

* Re: [PATCH net-next 6/6] bpf: show bpf programs
From: Hannes Frederic Sowa @ 2017-04-27 13:28 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, ast, daniel, jbenc, aconole, Martin KaFai Lau
In-Reply-To: <20170426212553.fq3uyhktfxvn36rv@ast-mbp>

On 26.04.2017 23:25, Alexei Starovoitov wrote:
> On Wed, Apr 26, 2017 at 08:24:19PM +0200, Hannes Frederic Sowa wrote:
>>  
>> +static const char *bpf_type_string(enum bpf_prog_type type)
>> +{
>> +	static const char *bpf_type_names[] = {
>> +#define X(type) #type
>> +		BPF_PROG_TYPES
>> +#undef X
>> +	};
>> +
>> +	if (type >= ARRAY_SIZE(bpf_type_names))
>> +		return "<unknown>";
>> +
>> +	return bpf_type_names[type];
>> +}
>> +
>>  static int ebpf_proc_show(struct seq_file *s, void *v)
>>  {
>> +	struct bpf_prog *prog;
>> +	struct bpf_prog_aux *aux;
>> +	char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
>> +
>>  	if (v == SEQ_START_TOKEN) {
>> -		seq_printf(s, "# tag\n");
>> +		seq_printf(s, "# tag\t\t\ttype\t\t\truntime\tcap\tmemlock\n");
>>  		return 0;
>>  	}
>>  
>> +	aux = v;
>> +	prog = aux->prog;
>> +
>> +	bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
>> +	seq_printf(s, "%s\t%s\t%s\t%s\t%llu\n", prog_tag,
>> +		   bpf_type_string(prog->type),
>> +		   prog->jited ? "jit" : "int",
>> +		   prog->priv_cap_sys_admin ? "priv" : "unpriv",
>> +		   prog->pages * 1ULL << PAGE_SHIFT);
> 
> As I said several times already I'm strongly against procfs
> style of exposing information about the programs.

I would not expand procfs interface to dump ebpf bytecode or jitcode,
albeit I would prefer a file system abstraction for that.

> I don't want this to become debugfs for bpf.

Right now it just prints a list of ebpf programs. You reject of where
things are going or do you already reject this particular patch?

> Maintaining the list of all loaded programs is fine
> and we need a way to iterate through them, but procfs
> is obviously not the interface to do that.
> Programs/maps are binary whereas any fs interface is text.

It should give admins only a high level overview of what is going on in
there system. I don't want o print any kind of opcodes there, just
giving people the possibility to see what is going on.

> It also doesn't scale with large number of programs/maps.
> I prefer Daniel's suggestion on adding 'get_next' like API.
> Also would be good if you can wait for Martin to finish his
> prog->handle/id patches. Then user space will be able
> to iterate through all the progs/maps and fetch all info about
> them through syscall in extensible way.
> And you wouldn't need to abuse kallsyms list for different purpose.

I don't buy the scalability argument:

What is the scalability issue with this O(1) approach on loading and
O(n) (but also without any locks held) on dumping? You can't do any
better. You can even dump several programs with one syscall instead of
slowly iterating over them.

Certainly I can wait with those patches until Martin presented his patches.

And regarding abusing the list, I just renamed it. If you think I abuse
something I can also add another list like bpf_prog_id patches do?

Bye bye,
Hannes

^ permalink raw reply

* Re: [PATCH net-next 6/6] bpf: show bpf programs
From: Hannes Frederic Sowa @ 2017-04-27 13:22 UTC (permalink / raw)
  To: Daniel Borkmann, netdev; +Cc: ast, daniel, jbenc, aconole
In-Reply-To: <590112BE.7040902@iogearbox.net>

On 26.04.2017 23:35, Daniel Borkmann wrote:
> On 04/26/2017 08:24 PM, Hannes Frederic Sowa wrote:
>> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> ---
>>   include/uapi/linux/bpf.h | 32 +++++++++++++++++++-------------
>>   kernel/bpf/core.c        | 30 +++++++++++++++++++++++++++++-
>>   2 files changed, 48 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index e553529929f683..d6506e320953d5 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -101,20 +101,26 @@ enum bpf_map_type {
>>       BPF_MAP_TYPE_HASH_OF_MAPS,
>>   };
>>
>> +#define BPF_PROG_TYPES            \
>> +    X(BPF_PROG_TYPE_UNSPEC),    \
>> +    X(BPF_PROG_TYPE_SOCKET_FILTER),    \
>> +    X(BPF_PROG_TYPE_KPROBE),    \
>> +    X(BPF_PROG_TYPE_SCHED_CLS),    \
>> +    X(BPF_PROG_TYPE_SCHED_ACT),    \
>> +    X(BPF_PROG_TYPE_TRACEPOINT),    \
>> +    X(BPF_PROG_TYPE_XDP),        \
>> +    X(BPF_PROG_TYPE_PERF_EVENT),    \
>> +    X(BPF_PROG_TYPE_CGROUP_SKB),    \
>> +    X(BPF_PROG_TYPE_CGROUP_SOCK),    \
>> +    X(BPF_PROG_TYPE_LWT_IN),    \
>> +    X(BPF_PROG_TYPE_LWT_OUT),    \
>> +    X(BPF_PROG_TYPE_LWT_XMIT),
>> +
>> +
>>   enum bpf_prog_type {
>> -    BPF_PROG_TYPE_UNSPEC,
>> -    BPF_PROG_TYPE_SOCKET_FILTER,
>> -    BPF_PROG_TYPE_KPROBE,
>> -    BPF_PROG_TYPE_SCHED_CLS,
>> -    BPF_PROG_TYPE_SCHED_ACT,
>> -    BPF_PROG_TYPE_TRACEPOINT,
>> -    BPF_PROG_TYPE_XDP,
>> -    BPF_PROG_TYPE_PERF_EVENT,
>> -    BPF_PROG_TYPE_CGROUP_SKB,
>> -    BPF_PROG_TYPE_CGROUP_SOCK,
>> -    BPF_PROG_TYPE_LWT_IN,
>> -    BPF_PROG_TYPE_LWT_OUT,
>> -    BPF_PROG_TYPE_LWT_XMIT,
>> +#define X(type) type
> 
> Defining X in uapi could clash easily with other headers e.g.
> from application side.

I think that X-macros are so much used that code review should catch
that no X macro is leaked beyond its scope.

I certainly can use some other macro names, maybe something along the
line from bpf_types.h.

> 
>> +    BPF_PROG_TYPES
>> +#undef X
>>   };
>>
>>   enum bpf_attach_type {
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 3ba175a24e971a..685c1d0f31e029 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -536,13 +536,41 @@ static void ebpf_proc_stop(struct seq_file *s,
>> void *v)
>>       rcu_read_unlock();
>>   }
>>
>> +static const char *bpf_type_string(enum bpf_prog_type type)
>> +{
>> +    static const char *bpf_type_names[] = {
>> +#define X(type) #type
>> +        BPF_PROG_TYPES
>> +#undef X
>> +    };
>> +
>> +    if (type >= ARRAY_SIZE(bpf_type_names))
>> +        return "<unknown>";
>> +
>> +    return bpf_type_names[type];
>> +}
>> +
>>   static int ebpf_proc_show(struct seq_file *s, void *v)
>>   {
>> +    struct bpf_prog *prog;
>> +    struct bpf_prog_aux *aux;
>> +    char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
>> +
>>       if (v == SEQ_START_TOKEN) {
>> -        seq_printf(s, "# tag\n");
>> +        seq_printf(s, "# tag\t\t\ttype\t\t\truntime\tcap\tmemlock\n");
>>           return 0;
>>       }
>>
>> +    aux = v;
>> +    prog = aux->prog;
>> +
>> +    bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
>> +    seq_printf(s, "%s\t%s\t%s\t%s\t%llu\n", prog_tag,
>> +           bpf_type_string(prog->type),
>> +           prog->jited ? "jit" : "int",
>> +           prog->priv_cap_sys_admin ? "priv" : "unpriv",
>> +           prog->pages * 1ULL << PAGE_SHIFT);
> 
> Yeah, so that would be quite similar to what we dump in
> bpf_prog_show_fdinfo() modulo the priv bit.

And, additionally, that you don't need to have a file descriptor handy
to do so, which is my main point.

> I generally agree that a facility for dumping all progs is needed
> and it was also on the TODO list after the bpf(2) cmd for dumping
> program insns back to user space.

I think this is orthogonal.

> I think the procfs interface has pro and cons: the upside is that
> you can use it with tools like cat to inspect it, but what you still
> cannot do is to say that you want to see the prog insns for, say,
> prog #4 from that list. If we could iterate over that list through fds
> via bpf(2) syscall, you could i) present the same info you have above
> via fdinfo already and ii) also dump the BPF insns from that specific
> program through a BPF_PROG_DUMP bpf(2) command. Once that dump also
> supports maps in progs, you could go further and fetch related map
> fds for inspection, etc.

Sure, that sounds super. But so far Linux and most (maybe I should write
all) subsystems always provided some easy way to get the insights of the
kernel without having to code or rely on special tools so far. And I
very much enjoyed that part on Linux. FreeBSD did very often abstract
some details behind shared libraries and commands that has no easy
pedant to use for with cat and grep.

> Such option of iterating through that would need a new BPF syscall
> cmd aka BPF_PROG_GET_NEXT which returns the first prog from the list
> and you would walk the next one by passing the current fd, which can
> later also be closed as not needed anymore. We could restrict that
> dump to capable(CAP_SYS_ADMIN), and the kernel tree would need to
> ship a tool e.g. under tools/bpf/ that can be used for inspection.
> 
Martin already posted his patches which add a bpf_prog_id to ebpf
programs. I will have alook at those and will comment over there.

Thansk for the review,
Hannes

^ permalink raw reply

* Re: [PATCH net-next 4/6] bpf: track if the bpf program was loaded with SYS_ADMIN capabilities
From: Hannes Frederic Sowa @ 2017-04-27 13:17 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, ast, daniel, jbenc, aconole
In-Reply-To: <20170426210826.mlwuuhq7jsu4n7f4@ast-mbp>

Hi,

On 26.04.2017 23:08, Alexei Starovoitov wrote:
> On Wed, Apr 26, 2017 at 08:24:17PM +0200, Hannes Frederic Sowa wrote:
>> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> ---
>>  include/linux/filter.h | 6 ++++--
>>  kernel/bpf/core.c      | 4 +++-
>>  kernel/bpf/syscall.c   | 7 ++++---
>>  kernel/bpf/verifier.c  | 4 ++--
>>  net/core/filter.c      | 6 +++---
>>  5 files changed, 16 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 63624c619e371b..635311f57bf24f 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -413,7 +413,8 @@ struct bpf_prog {
>>  				locked:1,	/* Program image locked? */
>>  				gpl_compatible:1, /* Is filter GPL compatible? */
>>  				cb_access:1,	/* Is control block accessed? */
>> -				dst_needed:1;	/* Do we need dst entry? */
>> +				dst_needed:1,	/* Do we need dst entry? */
>> +				priv_cap_sys_admin:1; /* Where we loaded as sys_admin? */
> 
> This is no go.
> You didn't provide any explanation whatsoever why you want to see this boolean value.

Sorry, should be in the description and will be added if this patch
series is considered to be worthwhile to pursue.

cap_sys_admin influences the verifier a lot in terms which programs are
accepted and which are not. So during investigations it might be even
interesting if the bpf program required those special flags or if the
same program could be loaded just as underprivileged.

I also have to review if we can/should attach cap_sys_admin verified
programs as unprivileged user. It should stop to install a ptr leaking
bpf program as ordinary user, even if one got the file descriptor, no?

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net-next] mlx5: work around unused function warning
From: Leon Romanovsky @ 2017-04-27 13:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Saeed Mahameed, Matan Barak, David S. Miller, Erez Shitrit,
	Dan Carpenter, Stephen Hemminger, Networking,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List
In-Reply-To: <CAK8P3a0kX4uNLBvLAHrhtmUc7eBLmD6dXA8=xxY+xJ4OsGFJbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1719 bytes --]

On Thu, Apr 27, 2017 at 02:04:15PM +0200, Arnd Bergmann wrote:
> On Thu, Apr 27, 2017 at 1:55 PM, Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> > On Thu, Apr 27, 2017 at 01:04:02PM +0200, Arnd Bergmann wrote:
> >> The previous patch addressed a sparse warning but replaced it with a
> >> compiler warning when CONFIG_MODULES is disabled:
> >>
> >> drivers/net/ethernet/mellanox/mlx5/core/ipoib.c:485:13: error: 'mlx5_rdma_netdev_free' defined but not used [-Werror=unused-function]
> >> drivers/net/ethernet/mellanox/mlx5/core/ipoib.c:423:27: error: 'mlx5_rdma_netdev_alloc' defined but not used [-Werror=unused-function]
> >>
> >> We should never export 'static' functions, so this makes them global
> >> again but hides them in another #ifdef like the change before.
> >>
> >> Fixes: a7082ef066f0 ("mlx5: hide unused functions")
> >> Signed-off-by: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>
> >
> > Hi Arnd,
> >
> > Thanks for the patch, but Stephen and Saeed already sent patch similar to it.
> > http://marc.info/?l=linux-netdev&m=149288674816288&w=2
>
> That link is for the patch that introduced the warning that I'm fixing here,
> it showed up yesterday in linux-next.
>
> Did you misread my patch, or just give the wrong link?

Sorry, I misread you patch, because I prepared patch which add users for
these functions and didn't notice both "static" and EXPORT_SYMBOLS at the same time.
I will send it in a couple of minutes.

Thanks

>
>       Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [ISSUE: sky2 - rx error] Link stops working under heavy traffic load connected to a mv88e6176
From: Andrew Lunn @ 2017-04-27 13:04 UTC (permalink / raw)
  To: Rafa Corvillo; +Cc: Stephen Hemminger, netdev
In-Reply-To: <5901DE9F.1070005@aoifes.com>

On Thu, Apr 27, 2017 at 02:05:51PM +0200, Rafa Corvillo wrote:
> On 25/04/17 17:27, Stephen Hemminger wrote:
> >On Fri, 21 Apr 2017 14:39:00 +0200
> >Rafa Corvillo <rafael.corvillo@aoifes.com> wrote:
> >
> >>We are working in an ARMv7 embedded system running kernel 4.9 (LEDE build).
> >>It is an imx6 board with 2 ethernet interfaces. One of them is connected to
> >>a Marvell switch.
> >>
> >>The schema of the system is the following:
> >>

Hi Rafa

Your ASCII art got messed up somewhere. Is this the correct
reconstruction?

   +-------------------+ eth0
   |                   +--+
   |                   |  |
   | Embedded system   +--+
   |                   |
   |      ARMv7        |
   |                   | Marvell 88E8057(sky2)     +-------------+
   |                   +--+                     +--+             +--+ eth1
   |                   |  +---------------------+  |             |  +------+
   |                   +--+      CPU port       +--+ mv88e6176   +--+
   +------+--+---------+                           |             |
emulated  |  |                                     |             |
GPIO      +--+                                  +--+             +--+ eth2
MDIO       +-----------------------------------+   |             |  +------+
                                MDIO            +--+             +--+
                                                   +-------------+

I assume you are using DSA? Since this is LEDE, it could be swconfig,
but the bridge configuration you mentioned would not make sense for
swconfig.

> >>If I connect the eth1/eth2, the link is up and I can do ping through it.
> >>But, once
> >>I start sending a heavy traffic load the link fails and the kernel sends the
> >>following messages:
> >>
> >>[   48.557140] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   48.564964] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   48.572110] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   48.579263] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   48.586417] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   48.593573] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   48.600718] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   54.877567] net_ratelimit: 6 callbacks suppressed
> >>[   54.882293] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >>[   61.413552] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
> >>length 1518
> >
> >The status error bits are in sky2.h
> >0x5f20010 is
> >      05f2 frame length => 1522
> >      0010 Too long err
> >
> >That means the packet was longer than the configured MTU.
> >You are probably getting packets with VLAN tag but have not configured
> >a VLAN.

Since you are using DSA, you will have DSA tags enabled on frames
to/from the switch. This adds an extra 8 byte header in the frame.  My
guess is, it is this header, not the VLAN tag which is causing you MTU
issues.

I think this is the first time i've seen sky2 used in a DSA
setup. mv643xx or mvneta is generally what is used, when using Marvell
chipsets. These drivers are more lenient about MTU, and are happy to
pass frames with additional headers.

> Thanks for the information. I have increased the MTU value to 1550
> (workaround) and it works if sends traffic (with iperf) from my
> computer to the unit. But, if I send traffic outside the unit, I get
> a new error message and link goes down:

Changing the MTU like this is not a good fix. It will allow you to
receive frames which are bigger, but it also means the local network
stack will generate bigger frames to be transmitted. You probably need
to modify the sky2 driver to allow it to receive frames bigger than
the interface MTU, by about 8 bytes.
 
> [ 4901.032989] sky2 0000:04:00.0 marvell: tx timeout
> [ 4904.722670] sky2 0000:04:00.0 marvell: Link is up at 1000 Mbps,
> full duplex, flow control both

Between the sky2 and the switch, do you have two back-to-back PHYs or
are you connecting the RGMII interfaces together?

    Andrew

^ permalink raw reply

* Re: [Patch net-next] ipv4: get rid of ip_ra_lock
From: Eric Dumazet @ 2017-04-27 12:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev
In-Reply-To: <1493240115-779-1-git-send-email-xiyou.wangcong@gmail.com>

On Wed, 2017-04-26 at 13:55 -0700, Cong Wang wrote:
> After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
> we always take RTNL lock for ip_ra_control() which is the only place
> we update the list ip_ra_chain, so the ip_ra_lock is no longer needed,
> we just need to disable BH there.
> 
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---

Looks great, but reading again this code, I believe we do not need to
disable BH at all ?

Thanks.

^ permalink raw reply

* Re: [PATCH RFC net-next v4 0/7] net: improve support for SCTP checksums
From: Marcelo Ricardo Leitner @ 2017-04-27 12:41 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Tom Herbert, Alexander Duyck, David Laight, David S . Miller,
	netdev, linux-sctp
In-Reply-To: <cover.1492692976.git.dcaratti@redhat.com>

On Thu, Apr 20, 2017 at 03:38:06PM +0200, Davide Caratti wrote:
> hello Tom,
> 
> On Fri, 2017-04-07 at 11:11 -0700, Tom Herbert wrote:
> > maybe just call it csum_not_ip then. Then just do "if
> > (unlikely(skb->csum_not_ip)) ..."
> 
> Ok, done. V4 uses this bit for SCTP only and leaves unmodified behavior
> when offloaded FCoE frames are processed. Further work is still possible
> to extend this fix for FCoE, if needed, either by using additional sk_buff
> bits, or using skb->csum_not_ip and use other data (e.g. skb->csum_offset)
> to distinguish SCTP from FCoE.
> 
> > the only case where this new bit is relevant is when
> > CHECKSUM_PARTIAL for a CRC is being done. When it's set for offloading
> > sctp crc it must be set. When CRC is resolved, in the helper for
> > instance, it must be cleared.
> 
> in V4 the bit is set when SCTP packets with offloaded checksum are
> generated; the bit is cleared when CRC32c is resolved for such packets
> (i.e. skb->ip_summed transitions from CHECKSUM_PARTIAL to CHECKSUM_NONE).
> 
> Any feedbacks are appreciated!
> thank you in advance,
> --
> davide
> 
> 
> Davide Caratti (7):
>   skbuff: add stub to help computing crc32c on SCTP packets
>   net: introduce skb_crc32c_csum_help
>   sk_buff: remove support for csum_bad in sk_buff
>   net: use skb->csum_not_inet to identify packets needing crc32c
>   net: more accurate checksumming in validate_xmit_skb()
>   openvswitch: more accurate checksumming in queue_userspace_packet()
>   sk_buff.h: improve description of CHECKSUM_{COMPLETE,UNNECESSARY}

Other than the comments I did on patch 2, this series LGTM.


> 
>  Documentation/networking/checksum-offloads.txt   | 11 +++--
>  drivers/net/ethernet/aquantia/atlantic/aq_ring.c |  2 +-
>  include/linux/netdevice.h                        |  8 +--
>  include/linux/skbuff.h                           | 58 +++++++++-------------
>  net/bridge/netfilter/nft_reject_bridge.c         |  5 +-
>  net/core/dev.c                                   | 63 +++++++++++++++++++++---
>  net/core/skbuff.c                                | 24 +++++++++
>  net/ipv4/netfilter/nf_reject_ipv4.c              |  2 +-
>  net/ipv6/netfilter/nf_reject_ipv6.c              |  3 --
>  net/openvswitch/datapath.c                       |  2 +-
>  net/sched/act_csum.c                             |  1 +
>  net/sctp/offload.c                               |  8 +++
>  net/sctp/output.c                                |  1 +
>  13 files changed, 128 insertions(+), 60 deletions(-)
> 
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next] net: bridge: Fix improper taking over HW learned FDB
From: Nikolay Aleksandrov @ 2017-04-27 12:33 UTC (permalink / raw)
  To: Arkadi Sharshevsky, netdev; +Cc: Ido Schimmel, bridge, davem
In-Reply-To: <1493280485-37845-1-git-send-email-arkadis@mellanox.com>

On 27/04/17 11:08, Arkadi Sharshevsky wrote:
> Commit 7e26bf45e4cb ("net: bridge: allow SW learn to take over HW fdb
> entries") added the ability to "take over an entry which was previously
> learned via HW when it shows up from a SW port".
> 
> However, if an entry was learned via HW and then a control packet
> (e.g., ARP request) was trapped to the CPU, the bridge driver will
> update the entry and remove the externally learned flag, although the
> entry is still present in HW. Instead, only clear the externally learned
> flag in case of roaming.
> 
> Fixes: 7e26bf45e4cb ("net: bridge: allow SW learn to take over HW fdb entries")
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Arkadi Sharashevsky <arkadis@mellanox.com>
> Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
>  net/bridge/br_fdb.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Thanks!

Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH RFC net-next v4 2/7] net: introduce skb_crc32c_csum_help
From: Marcelo Ricardo Leitner @ 2017-04-27 12:29 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Tom Herbert, Alexander Duyck, David Laight, David S . Miller,
	netdev, linux-sctp
In-Reply-To: <2d4895be40590b624bed89df38ea64e560aa2f10.1492692976.git.dcaratti@redhat.com>

On Thu, Apr 20, 2017 at 03:38:08PM +0200, Davide Caratti wrote:
> skb_crc32c_csum_help is like skb_checksum_help, but it is designed for
> checksumming SCTP packets using crc32c (see RFC3309), provided that
> libcrc32c.ko has been loaded before. In case libcrc32c is not loaded,
> invoking skb_crc32c_csum_help on a skb results in one the following
> printouts:
> 
> warn_crc32c_csum_update: attempt to compute crc32c without libcrc32c.ko
> warn_crc32c_csum_combine: attempt to compute crc32c without libcrc32c.ko
> 
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> ---
>  include/linux/netdevice.h |  1 +
>  include/linux/skbuff.h    |  3 ++-
>  net/core/dev.c            | 40 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b0aa089..bf84a67 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3898,6 +3898,7 @@ void netdev_rss_key_fill(void *buffer, size_t len);
>  
>  int dev_get_nest_level(struct net_device *dev);
>  int skb_checksum_help(struct sk_buff *skb);
> +int skb_crc32c_csum_help(struct sk_buff *skb);
>  struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
>  				  netdev_features_t features, bool tx_path);
>  struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index ba3ae21..ec4551b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -193,7 +193,8 @@
>   *     accordingly. Note the there is no indication in the skbuff that the
>   *     CHECKSUM_PARTIAL refers to an SCTP checksum, a driver that supports
>   *     both IP checksum offload and SCTP CRC offload must verify which offload
> - *     is configured for a packet presumably by inspecting packet headers.
> + *     is configured for a packet presumably by inspecting packet headers; in
> + *     case, skb_crc32c_csum_help is provided to compute CRC on SCTP packets.
>   *
>   *   NETIF_F_FCOE_CRC - This feature indicates that a device is capable of
>   *     offloading the FCOE CRC in a packet. To perform this offload the stack
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5d33e2b..c7aec95 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -140,6 +140,7 @@
>  #include <linux/hrtimer.h>
>  #include <linux/netfilter_ingress.h>
>  #include <linux/crash_dump.h>
> +#include <linux/sctp.h>
>  
>  #include "net-sysfs.h"
>  
> @@ -2606,6 +2607,45 @@ int skb_checksum_help(struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(skb_checksum_help);
>  
> +int skb_crc32c_csum_help(struct sk_buff *skb)
> +{
> +	__le32 crc32c_csum;
> +	int ret = 0, offset;
> +
> +	if (skb->ip_summed != CHECKSUM_PARTIAL)
> +		goto out;
> +
> +	if (unlikely(skb_is_gso(skb)))
> +		goto out;
> +
> +	/* Before computing a checksum, we should make sure no frag could
> +	 * be modified by an external entity : checksum could be wrong.
> +	 */
> +	if (unlikely(skb_has_shared_frag(skb))) {
> +		ret = __skb_linearize(skb);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	offset = skb_checksum_start_offset(skb);
> +	crc32c_csum = cpu_to_le32(~__skb_checksum(skb, offset,
> +						  skb->len - offset, ~(__u32)0,
> +						  crc32c_csum_stub));
> +	offset += offsetof(struct sctphdr, checksum);
> +	BUG_ON(offset >= skb_headlen(skb));

I suggest using WARN_ON_ONCE() here and returning an error instead. Will
still allow debugging and won't disrupt the system.

> +
> +	if (skb_cloned(skb) &&
> +	    !skb_clone_writable(skb, offset + sizeof(__le32))) {
> +		ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> +		if (ret)
> +			goto out;
> +	}

We could do this check (including the BUG_ON/WARN check above) before
the actual crc32 calc. This can fail, and if it does, we will have
calculated it in vain. Note how offset doesn't really depend on the
checksum result.

I know skb_checksum_help also does it this way, maybe it was because of
some cache optimization on the offset += checksum offset  operation?

> +	*(__le32 *)(skb->data + offset) = crc32c_csum;
> +	skb->ip_summed = CHECKSUM_NONE;
> +out:
> +	return ret;
> +}
> +
>  __be16 skb_network_protocol(struct sk_buff *skb, int *depth)
>  {
>  	__be16 type = skb->protocol;
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next] tcp: tcp_rack_reo_timeout() must update tp->tcp_mstamp
From: Soheil Hassas Yeganeh @ 2017-04-27 12:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng
In-Reply-To: <1493266255.6453.103.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, Apr 27, 2017 at 12:10 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> I wrongly assumed tp->tcp_mstamp was up to date at the time
> tcp_rack_reo_timeout() was called.
>
> It is not true, since we only update tcp->tcp_mstamp when receiving
> a packet (as initially done in commit 69e996c58a35 ("tcp: add
> tp->tcp_mstamp field")
>
> tcp_rack_reo_timeout() being called by a timer and not an incoming
> packet, we need to refresh tp->tcp_mstamp
>
> Fixes: 7c1c7308592f ("tcp: do not pass timestamp to tcp_rack_detect_loss()")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_recovery.c |    1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
> index cd72b3d3879e88181c8a4639f0334a24e4cda852..362b8c75bfab44cf87c2a01398a146a271bc1119 100644
> --- a/net/ipv4/tcp_recovery.c
> +++ b/net/ipv4/tcp_recovery.c
> @@ -166,6 +166,7 @@ void tcp_rack_reo_timeout(struct sock *sk)
>         u32 timeout, prior_inflight;
>
>         prior_inflight = tcp_packets_in_flight(tp);
> +       skb_mstamp_get(&tp->tcp_mstamp);
>         tcp_rack_detect_loss(sk, &timeout);
>         if (prior_inflight != tcp_packets_in_flight(tp)) {
>                 if (inet_csk(sk)->icsk_ca_state != TCP_CA_Recovery) {
>
>

Thanks for the fix!

^ permalink raw reply

* Re: [ISSUE: sky2 - rx error] Link stops working under heavy traffic load connected to a mv88e6176
From: Rafa Corvillo @ 2017-04-27 12:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20170425082741.59428876@xeon-e3>

On 25/04/17 17:27, Stephen Hemminger wrote:
> On Fri, 21 Apr 2017 14:39:00 +0200
> Rafa Corvillo <rafael.corvillo@aoifes.com> wrote:
>
>> We are working in an ARMv7 embedded system running kernel 4.9 (LEDE build).
>> It is an imx6 board with 2 ethernet interfaces. One of them is connected to
>> a Marvell switch.
>>
>> The schema of the system is the following:
>>
>>    +-------------------+ eth0
>>    |                   +--+
>>    |                   |  |
>>    | Embedded system   +--+
>>    |                   |
>>    |      ARMv7        |
>>    |                   | Marvell 88E8057(sky2) +-------------+
>>    |                   +--+ +--+             +--+ eth1
>>    |                   |  +---------------------+ |             |  +------+
>>    |                   +--+      CPU port       +--+ mv88e6176  +--+
>>    +------+--+---------+ |             |
>> emulated|  | |             |
>> GPIO    +--+ +--+             +--+ eth2
>> MDIO      +-----------------------------------+ |             |  +------+
>>                                 MDIO +--+             +--+
>> +-------------+
>>
>> There is a bridge (br-lan) which includes eth0/eth1/eth2
>>
>> If I connect the eth1/eth2, the link is up and I can do ping through it.
>> But, once
>> I start sending a heavy traffic load the link fails and the kernel sends the
>> following messages:
>>
>> [   48.557140] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   48.564964] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   48.572110] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   48.579263] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   48.586417] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   48.593573] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   48.600718] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   54.877567] net_ratelimit: 6 callbacks suppressed
>> [   54.882293] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>> [   61.413552] sky2 0000:04:00.0 marvell: rx error, status 0x5f20010
>> length 1518
>
> The status error bits are in sky2.h
> 0x5f20010 is
>       05f2 frame length => 1522
>       0010 Too long err
>
> That means the packet was longer than the configured MTU.
> You are probably getting packets with VLAN tag but have not configured
> a VLAN.
>
>
>

Thanks for the information. I have increased the MTU value to 1550 
(workaround) and it works if sends traffic (with iperf) from my computer 
to the unit. But, if I send traffic outside the unit, I get a new error 
message and link goes down:

[ 4901.032989] sky2 0000:04:00.0 marvell: tx timeout
[ 4904.722670] sky2 0000:04:00.0 marvell: Link is up at 1000 Mbps, full 
duplex, flow control both

Rafa

^ permalink raw reply

* Re: [PATCH v6 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
From: Jason A. Donenfeld @ 2017-04-27 12:04 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Netdev, LKML, David Laight, kernel-hardening, David Miller
In-Reply-To: <20170427113016.GA12448@bistromath.localdomain>

On Thu, Apr 27, 2017 at 1:30 PM, Sabrina Dubroca <sd@queasysnail.net> wrote:
> Hmm, I think this can actually happen:

Alright, perhaps better to err on the side of caution, then.

Jason

^ permalink raw reply

* Re: [PATCH net-next] mlx5: work around unused function warning
From: Arnd Bergmann @ 2017-04-27 12:04 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Saeed Mahameed, Matan Barak, David S. Miller, Erez Shitrit,
	Dan Carpenter, Stephen Hemminger, Networking, linux-rdma,
	Linux Kernel Mailing List
In-Reply-To: <20170427115513.GH14088@mtr-leonro.local>

On Thu, Apr 27, 2017 at 1:55 PM, Leon Romanovsky <leonro@mellanox.com> wrote:
> On Thu, Apr 27, 2017 at 01:04:02PM +0200, Arnd Bergmann wrote:
>> The previous patch addressed a sparse warning but replaced it with a
>> compiler warning when CONFIG_MODULES is disabled:
>>
>> drivers/net/ethernet/mellanox/mlx5/core/ipoib.c:485:13: error: 'mlx5_rdma_netdev_free' defined but not used [-Werror=unused-function]
>> drivers/net/ethernet/mellanox/mlx5/core/ipoib.c:423:27: error: 'mlx5_rdma_netdev_alloc' defined but not used [-Werror=unused-function]
>>
>> We should never export 'static' functions, so this makes them global
>> again but hides them in another #ifdef like the change before.
>>
>> Fixes: a7082ef066f0 ("mlx5: hide unused functions")
>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> Hi Arnd,
>
> Thanks for the patch, but Stephen and Saeed already sent patch similar to it.
> http://marc.info/?l=linux-netdev&m=149288674816288&w=2

That link is for the patch that introduced the warning that I'm fixing here,
it showed up yesterday in linux-next.

Did you misread my patch, or just give the wrong link?

      Arnd

^ permalink raw reply

* Re: [PATCH net v4 1/3] net: hns: support deferred probe when can not obtain irq
From: Matthias Brugger @ 2017-04-27 11:58 UTC (permalink / raw)
  To: Yankejian, davem, salil.mehta, yisen.zhuang, lipeng321, zhouhuiru,
	huangdaode
  Cc: netdev, linuxarm
In-Reply-To: <1493261053-68197-2-git-send-email-yankejian@huawei.com>



On 27/04/17 04:44, Yankejian wrote:
> From: lipeng <lipeng321@huawei.com>
>
> In the hip06 and hip07 SoCs, the interrupt lines from the
> DSAF controllers are connected to mbigen hw module.
> The mbigen module is probed with module_init, and, as such,
> is not guaranteed to probe before the HNS driver. So we need
> to support deferred probe.
>
> Signed-off-by: lipeng <lipeng321@huawei.com>
> Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
> Reviewed-by: Matthias Brugger <mbrugger@suse.com>
> ---
> change log:
> V3 -> V4:
> 1. Delete redundant commit message;
> 2. add Reviewed-by: Matthias Brugger <mbrugger@suse.com>;
>
> V2 -> V3:
> 1. Check return value when  platform_get_irq in hns_rcb_get_cfg;
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c | 4 +++-
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c | 8 +++++++-
>  drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h | 2 +-
>  3 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
> index 6ea8722..a41cf95 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c
> @@ -510,7 +510,9 @@ int hns_ppe_init(struct dsaf_device *dsaf_dev)
>
>  		hns_ppe_get_cfg(dsaf_dev->ppe_common[i]);
>
> -		hns_rcb_get_cfg(dsaf_dev->rcb_common[i]);
> +		ret = hns_rcb_get_cfg(dsaf_dev->rcb_common[i]);
> +		if (ret)
> +			goto get_rcb_cfg_fail;

a2185587ade7 ("net: hns: Simplify the exception sequence in hns_ppe_init()")
deleted get_rcb_cfg_fail label. This does not compile. Please rebase 
against net-next.

Best regards,
Matthias

>  	}
>
>  	for (i = 0; i < HNS_PPE_COM_NUM; i++)
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
> index f0ed80d6..673a5d3 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c
> @@ -452,7 +452,7 @@ static int hns_rcb_get_base_irq_idx(struct rcb_common_cb *rcb_common)
>   *hns_rcb_get_cfg - get rcb config
>   *@rcb_common: rcb common device
>   */
> -void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
> +int hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
>  {
>  	struct ring_pair_cb *ring_pair_cb;
>  	u32 i;
> @@ -477,10 +477,16 @@ void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common)
>  		ring_pair_cb->virq[HNS_RCB_IRQ_IDX_RX] =
>  		is_ver1 ? platform_get_irq(pdev, base_irq_idx + i * 2 + 1) :
>  			  platform_get_irq(pdev, base_irq_idx + i * 3);
> +		if ((ring_pair_cb->virq[HNS_RCB_IRQ_IDX_TX] == -EPROBE_DEFER) ||
> +		    (ring_pair_cb->virq[HNS_RCB_IRQ_IDX_RX] == -EPROBE_DEFER))
> +			return -EPROBE_DEFER;
> +
>  		ring_pair_cb->q.phy_base =
>  			RCB_COMM_BASE_TO_RING_BASE(rcb_common->phy_base, i);
>  		hns_rcb_ring_pair_get_cfg(ring_pair_cb);
>  	}
> +
> +	return 0;
>  }
>
>  /**
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
> index 99b4e1b..3d7b484 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.h
> @@ -110,7 +110,7 @@ struct rcb_common_cb {
>  void hns_rcb_common_free_cfg(struct dsaf_device *dsaf_dev, u32 comm_index);
>  int hns_rcb_common_init_hw(struct rcb_common_cb *rcb_common);
>  void hns_rcb_start(struct hnae_queue *q, u32 val);
> -void hns_rcb_get_cfg(struct rcb_common_cb *rcb_common);
> +int hns_rcb_get_cfg(struct rcb_common_cb *rcb_common);
>  void hns_rcb_get_queue_mode(enum dsaf_mode dsaf_mode,
>  			    u16 *max_vfn, u16 *max_q_per_vf);
>
>

^ permalink raw reply

* Re: [PATCH net-next] mlx5: work around unused function warning
From: Leon Romanovsky @ 2017-04-27 11:55 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Saeed Mahameed, Matan Barak, David S. Miller, Erez Shitrit,
	Dan Carpenter, Stephen Hemminger, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20170427110421.2431598-1-arnd-r2nGTMty4D4@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 2700 bytes --]

On Thu, Apr 27, 2017 at 01:04:02PM +0200, Arnd Bergmann wrote:
> The previous patch addressed a sparse warning but replaced it with a
> compiler warning when CONFIG_MODULES is disabled:
>
> drivers/net/ethernet/mellanox/mlx5/core/ipoib.c:485:13: error: 'mlx5_rdma_netdev_free' defined but not used [-Werror=unused-function]
> drivers/net/ethernet/mellanox/mlx5/core/ipoib.c:423:27: error: 'mlx5_rdma_netdev_alloc' defined but not used [-Werror=unused-function]
>
> We should never export 'static' functions, so this makes them global
> again but hides them in another #ifdef like the change before.
>
> Fixes: a7082ef066f0 ("mlx5: hide unused functions")
> Signed-off-by: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/ipoib.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>

Hi Arnd,

Thanks for the patch, but Stephen and Saeed already sent patch similar to it.
http://marc.info/?l=linux-netdev&m=149288674816288&w=2

Thanks

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
> index 3c84e36af018..eb48f112dff4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
> @@ -405,7 +405,6 @@ static int mlx5i_xmit(struct net_device *dev, struct sk_buff *skb,
>
>  	return mlx5i_sq_xmit(sq, skb, &mah->av, dqpn, dqkey);
>  }
> -#endif
>
>  static int mlx5i_check_required_hca_cap(struct mlx5_core_dev *mdev)
>  {
> @@ -420,10 +419,10 @@ static int mlx5i_check_required_hca_cap(struct mlx5_core_dev *mdev)
>  	return 0;
>  }
>
> -static struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
> -						 struct ib_device *ibdev,
> -						 const char *name,
> -						 void (*setup)(struct net_device *))
> +struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
> +					  struct ib_device *ibdev,
> +					  const char *name,
> +					  void (*setup)(struct net_device *))
>  {
>  	const struct mlx5e_profile *profile = &mlx5i_nic_profile;
>  	int nch = profile->max_nch(mdev);
> @@ -482,7 +481,7 @@ static struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
>  }
>  EXPORT_SYMBOL(mlx5_rdma_netdev_alloc);
>
> -static void mlx5_rdma_netdev_free(struct net_device *netdev)
> +void mlx5_rdma_netdev_free(struct net_device *netdev)
>  {
>  	struct mlx5e_priv          *priv    = mlx5i_epriv(netdev);
>  	const struct mlx5e_profile *profile = priv->profile;
> @@ -495,4 +494,4 @@ static void mlx5_rdma_netdev_free(struct net_device *netdev)
>  	mlx5e_destroy_mdev_resources(priv->mdev);
>  }
>  EXPORT_SYMBOL(mlx5_rdma_netdev_free);
> -
> +#endif
> --
> 2.9.0
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net] tcp: do not underestimate skb->truesize in tcp_trim_head()
From: Andrey Konovalov @ 2017-04-27 11:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1493252140.6453.96.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, Apr 27, 2017 at 2:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in
> skb_try_coalesce() using syzkaller and a filter attached to a TCP
> socket over loopback interface.
>
> I believe one issue with looped skbs is that tcp_trim_head() can end up
> producing skb with under estimated truesize.
>
> It hardly matters for normal conditions, since packets sent over
> loopback are never truncated.
>
> Bytes trimmed from skb->head should not change skb truesize, since
> skb->head is not reallocated.

Hi Eric,

With all 3 of your patches applied to net-next I don't see the warning any more.

Thanks!

Tested-by: Andrey Konovalov <andreyknvl@google.com>

>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> ---
>  net/ipv4/tcp_output.c |   19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c3c082ed38796f65948e7a1042e37813b71d5abf..a85d863c44196e60fd22e25471cf773e72d2c133 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1267,7 +1267,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
>   * eventually). The difference is that pulled data not copied, but
>   * immediately discarded.
>   */
> -static void __pskb_trim_head(struct sk_buff *skb, int len)
> +static int __pskb_trim_head(struct sk_buff *skb, int len)
>  {
>         struct skb_shared_info *shinfo;
>         int i, k, eat;
> @@ -1277,7 +1277,7 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
>                 __skb_pull(skb, eat);
>                 len -= eat;
>                 if (!len)
> -                       return;
> +                       return 0;
>         }
>         eat = len;
>         k = 0;
> @@ -1303,23 +1303,28 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
>         skb_reset_tail_pointer(skb);
>         skb->data_len -= len;
>         skb->len = skb->data_len;
> +       return len;
>  }
>
>  /* Remove acked data from a packet in the transmit queue. */
>  int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
>  {
> +       u32 delta_truesize;
> +
>         if (skb_unclone(skb, GFP_ATOMIC))
>                 return -ENOMEM;
>
> -       __pskb_trim_head(skb, len);
> +       delta_truesize = __pskb_trim_head(skb, len);
>
>         TCP_SKB_CB(skb)->seq += len;
>         skb->ip_summed = CHECKSUM_PARTIAL;
>
> -       skb->truesize        -= len;
> -       sk->sk_wmem_queued   -= len;
> -       sk_mem_uncharge(sk, len);
> -       sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
> +       if (delta_truesize) {
> +               skb->truesize      -= delta_truesize;
> +               sk->sk_wmem_queued -= delta_truesize;
> +               sk_mem_uncharge(sk, delta_truesize);
> +               sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
> +       }
>
>         /* Any change of skb->len requires recalculation of tso factor. */
>         if (tcp_skb_pcount(skb) > 1)
>
>

^ permalink raw reply

* Re: [PATCH net-next 4/6] bpf: track if the bpf program was loaded with SYS_ADMIN capabilities
From: Hannes Frederic Sowa @ 2017-04-27 11:39 UTC (permalink / raw)
  To: Daniel Borkmann, netdev; +Cc: ast, daniel, jbenc, aconole
In-Reply-To: <59010B5A.6060509@iogearbox.net>

On 26.04.2017 23:04, Daniel Borkmann wrote:
> On 04/26/2017 08:24 PM, Hannes Frederic Sowa wrote:
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 9a37860a80fc78..dc020d40bb770a 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -1100,7 +1100,7 @@ int bpf_prog_create(struct bpf_prog **pfp,
>> struct sock_fprog_kern *fprog)
>>       if (!bpf_check_basics_ok(fprog->filter, fprog->len))
>>           return -EINVAL;
>>
>> -    fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0);
>> +    fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0, false);
>>       if (!fp)
>>           return -ENOMEM;
>>
> 
> Did you check that transferring allow_ptr_leaks doesn't have a side
> effect on the nfp JIT? I believe it can also do cbpf migrations to
> a certain extend.

Initially I grepped allow_ptr_leaks usages and didn't see it. I just
looked through the code path and didn't see how it could have an impact.
Also, cbpf programs shouldn't depend on allow_ptr_leak to the best of my
knowledge, no?

Thanks,
Hannes

^ permalink raw reply

* Re: [PATCH v1 net-next 3/6] net: add new control message for incoming HW-timestamped packets
From: Willem de Bruijn @ 2017-04-27 11:38 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Network Development, Richard Cochran, Soheil Hassas Yeganeh,
	Keller, Jacob E, Denny Page, Jiri Benc
In-Reply-To: <20170427101554.GF5654@localhost>

On Thu, Apr 27, 2017 at 6:15 AM, Miroslav Lichvar <mlichvar@redhat.com> wrote:
> Thanks for the comments.
>
> On Wed, Apr 26, 2017 at 07:34:49PM -0400, Willem de Bruijn wrote:
>> > +struct net_device *dev_get_by_napi_id(unsigned int napi_id)
>> > +{
>> > +       struct net_device *dev = NULL;
>> > +       struct napi_struct *napi;
>> > +
>> > +       rcu_read_lock();
>> > +
>> > +       napi = napi_by_id(napi_id);
>> > +       if (napi)
>> > +               dev = napi->dev;
>> > +
>> > +       rcu_read_unlock();
>> > +
>> > +       return dev;
>> > +}
>> > +EXPORT_SYMBOL(dev_get_by_napi_id);
>>
>> Returning dev without holding a reference is not safe. You'll probably
>> have to call this with rcu_read_lock held instead.
>
> How about changing the function to simply return the index instead of
> the device (e.g. dev_get_ifindex_by_napi_id())? Would that be too
> specific?

I suspect so. If you can achieve the same by calling this function
within an rcu read_side critical section, I would do that.

>> >  /*
>> >   * called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
>> >   */
>> > @@ -699,8 +719,12 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
>> >                 empty = 0;
>> >         if (shhwtstamps &&
>> >             (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
>>
>> This information is also informative with software timestamps.
>
> But is it useful and worth the cost? If I have two interfaces and only
> one has HW timestamping, or just one interface which can timestamp
> incoming packets at a limited rate, I would prefer to not waste CPU
> cycles preparing and processing useless data.
>
>> And getting the real iif is definitely useful outside timestamps.
>
> Do you have an example? We have asked that in the original thread,
> but no one suggested anything. For AF_PACKET there is PACKET_ORIGDEV.
> When I was searching the Internet on how to get the index with INET
> sockets, it looked like I was the only one who had this problem :).

Okay. Maybe I'm mistaken. If no one else responds with a use case,
then I agree that we can ignore these cases.

>> An
>> alternative approach is to add versioning to IP_PKTINFO with a new
>> setsockopt IP_PKTINFO_VERSION plus a new struct in_pktinfo_v2
>> that extends in_pktinfo. Just a thought.
>
> The struct would contain both the original and last interface index,
> and the length as well? And similarly with in6_pktinfo?
>
> If there is an agreement that the information would useful also for
> other things than timestamping, I can try that. If not, I think it
> would be better to keep it tied to HW timestamping.

Ack.

^ permalink raw reply

* Re: [PATCH v6 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
From: Sabrina Dubroca @ 2017-04-27 11:30 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, LKML, David Laight, kernel-hardening, David Miller
In-Reply-To: <CAHmME9qDmcvzF_xeaxegC2RpBOs8PziJOaKEqv6Z_X1pUFbR0w@mail.gmail.com>

2017-04-27, 11:21:51 +0200, Jason A. Donenfeld wrote:
> However, perhaps there's the chance that fraglist skbs having
> separate fraglists are actually forbidden? Is this the case?

Hmm, I think this can actually happen:

    /*  net/ipv4/ip_fragment.c  */
    static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
    			 struct net_device *dev)
    {

    ...

    	/* If the first fragment is fragmented itself, we split
    	 * it to two chunks: the first with data and paged part
    	 * and the second, holding only fragments. */
    	if (skb_has_frag_list(head)) {
    		struct sk_buff *clone;
    		int i, plen = 0;

    		clone = alloc_skb(0, GFP_ATOMIC);
    		if (!clone)
    			goto out_nomem;
    		clone->next = head->next;
    		head->next = clone;
    		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
    		skb_frag_list_init(head);
    		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
    			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
    		clone->len = clone->data_len = head->data_len - plen;
    		head->data_len -= clone->len;
    		head->len -= clone->len;
    		clone->csum = 0;
    		clone->ip_summed = head->ip_summed;
    		add_frag_mem_limit(qp->q.net, clone->truesize);
    	}

    ...
    }

You can test that with a vxlan tunnel on top of a vxlan tunnel ("real"
MTU is 1500, first tunnel MTU set to 10000, second tunnel MTU set to
40000 -- or anything, as long as they both get fragmented).

-- 
Sabrina

^ permalink raw reply

* [patch iproute2 2/2] tc_gact: introduce support for goto chain action
From: Jiri Pirko @ 2017-04-27 11:15 UTC (permalink / raw)
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, dsa, edumazet, stephen, daniel,
	alexander.h.duyck, mlxsw, simon.horman
In-Reply-To: <1493291540-2119-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Allow user to set action "goto" with filter chain index as a parameter.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/linux/pkt_cls.h        |  1 +
 include/linux/tc_act/tc_gact.h |  1 +
 tc/m_gact.c                    | 32 ++++++++++++++++++++++++++++----
 tc/tc_util.c                   |  3 +++
 4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index 7a69f2a..4f347b0 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -37,6 +37,7 @@ enum {
 #define TC_ACT_QUEUED		5
 #define TC_ACT_REPEAT		6
 #define TC_ACT_REDIRECT		7
+#define TC_ACT_GOTO_CHAIN	8
 #define TC_ACT_JUMP		0x10000000
 
 /* Action type identifiers*/
diff --git a/include/linux/tc_act/tc_gact.h b/include/linux/tc_act/tc_gact.h
index 70b536a..388733d 100644
--- a/include/linux/tc_act/tc_gact.h
+++ b/include/linux/tc_act/tc_gact.h
@@ -26,6 +26,7 @@ enum {
 	TCA_GACT_PARMS,
 	TCA_GACT_PROB,
 	TCA_GACT_PAD,
+	TCA_GACT_CHAIN,
 	__TCA_GACT_MAX
 };
 #define TCA_GACT_MAX (__TCA_GACT_MAX - 1)
diff --git a/tc/m_gact.c b/tc/m_gact.c
index 755a3be..f8d1c1e 100644
--- a/tc/m_gact.c
+++ b/tc/m_gact.c
@@ -43,19 +43,21 @@ static void
 explain(void)
 {
 #ifdef CONFIG_GACT_PROB
-	fprintf(stderr, "Usage: ... gact <ACTION> [RAND] [INDEX]\n");
+	fprintf(stderr, "Usage: ... gact <ACTION> [RAND] [INDEX] [OPTIONS]\n");
 	fprintf(stderr,
-		"Where: \tACTION := reclassify | drop | continue | pass | pipe\n"
+		"Where: \tACTION := reclassify | drop | continue | pass | pipe | goto\n"
 			"\tRAND := random <RANDTYPE> <ACTION> <VAL>\n"
 			"\tRANDTYPE := netrand | determ\n"
 			"\tVAL : = value not exceeding 10000\n"
 			"\tINDEX := index value used\n"
+			"\tOPTIONS: := chain <CHAIN_INDEX>\n"
 			"\n");
 #else
-	fprintf(stderr, "Usage: ... gact <ACTION> [INDEX]\n");
+	fprintf(stderr, "Usage: ... gact <ACTION> [INDEX] [OPTIONS]\n");
 	fprintf(stderr,
-		"Where: \tACTION := reclassify | drop | continue | pass | pipe\n"
+		"Where: \tACTION := reclassify | drop | continue | pass | pipe | goto\n"
 		"\tINDEX := index value used\n"
+		"\tOPTIONS: := chain <CHAIN_INDEX>\n"
 		"\n");
 #endif
 }
@@ -93,6 +95,7 @@ parse_gact(struct action_util *a, int *argc_p, char ***argv_p,
 	int rd = 0;
 	struct tc_gact_p pp;
 #endif
+	__u32 chain_index;
 	struct rtattr *tail;
 
 	if (argc < 0)
@@ -173,6 +176,23 @@ parse_gact(struct action_util *a, int *argc_p, char ***argv_p,
 		}
 	}
 
+	if (ok && action == TC_ACT_GOTO_CHAIN) {
+		if (matches(*argv, "chain") == 0) {
+			NEXT_ARG();
+			if (get_u32(&chain_index, *argv, 10)) {
+				fprintf(stderr, "Illegal \"chain index\"\n");
+				return -1;
+			}
+			argc--;
+			argv++;
+		} else if (matches(*argv, "help") == 0) {
+			usage();
+		} else {
+			fprintf(stderr, "\"chain\" needs to be specified for \"goto\" action\n");
+			return -1;
+		}
+	}
+
 	if (!ok)
 		return -1;
 
@@ -184,6 +204,8 @@ parse_gact(struct action_util *a, int *argc_p, char ***argv_p,
 		addattr_l(n, MAX_MSG, TCA_GACT_PROB, &pp, sizeof(pp));
 	}
 #endif
+	if (action == TC_ACT_GOTO_CHAIN)
+		addattr32(n, MAX_MSG, TCA_GACT_CHAIN, chain_index);
 	tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
 
 	*argc_p = argc;
@@ -213,6 +235,8 @@ print_gact(struct action_util *au, FILE * f, struct rtattr *arg)
 	p = RTA_DATA(tb[TCA_GACT_PARMS]);
 
 	fprintf(f, "gact action %s", action_n2a(p->action));
+	if (tb[TCA_GACT_CHAIN])
+		fprintf(f, " chain %u", rta_getattr_u32(tb[TCA_GACT_CHAIN]));
 #ifdef CONFIG_GACT_PROB
 	if (tb[TCA_GACT_PROB] != NULL) {
 		pp = RTA_DATA(tb[TCA_GACT_PROB]);
diff --git a/tc/tc_util.c b/tc/tc_util.c
index 24ca1f1..a75367f 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -428,6 +428,8 @@ const char *action_n2a(int action)
 		return "pipe";
 	case TC_ACT_STOLEN:
 		return "stolen";
+	case TC_ACT_GOTO_CHAIN:
+		return "goto";
 	default:
 		snprintf(buf, 64, "%d", action);
 		buf[63] = '\0';
@@ -459,6 +461,7 @@ int action_a2n(char *arg, int *result, bool allow_num)
 		{"ok", TC_ACT_OK},
 		{"reclassify", TC_ACT_RECLASSIFY},
 		{"pipe", TC_ACT_PIPE},
+		{"goto", TC_ACT_GOTO_CHAIN},
 		{ NULL },
 	}, *iter;
 
-- 
2.7.4

^ permalink raw reply related

* [patch iproute2 1/2] tc_filter: add support for chain index
From: Jiri Pirko @ 2017-04-27 11:15 UTC (permalink / raw)
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, dsa, edumazet, stephen, daniel,
	alexander.h.duyck, mlxsw, simon.horman
In-Reply-To: <1493291540-2119-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Allow user to put filter to a specific chain identified by index.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/linux/rtnetlink.h |  1 +
 tc/tc_filter.c            | 87 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 73 insertions(+), 15 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index a96db83..70c5750 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -549,6 +549,7 @@ enum {
 	TCA_STAB,
 	TCA_PAD,
 	TCA_DUMP_INVISIBLE,
+	TCA_CHAIN,
 	__TCA_MAX
 };
 
diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index ff8713b..b13fb91 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -31,7 +31,7 @@ static void usage(void)
 	fprintf(stderr,
 		"Usage: tc filter [ add | del | change | replace | show ] dev STRING\n"
 		"Usage: tc filter get dev STRING parent CLASSID protocol PROTO handle FILTERID pref PRIO FILTER_TYPE\n"
-		"       [ pref PRIO ] protocol PROTO\n"
+		"       [ pref PRIO ] protocol PROTO [ chain CHAIN_INDEX ]\n"
 		"       [ estimator INTERVAL TIME_CONSTANT ]\n"
 		"       [ root | ingress | egress | parent CLASSID ]\n"
 		"       [ handle FILTERID ] [ [ FILTER_TYPE ] [ help | OPTIONS ] ]\n"
@@ -59,6 +59,8 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv)
 	__u32 prio = 0;
 	__u32 protocol = 0;
 	int protocol_set = 0;
+	__u32 chain_index;
+	int chain_index_set = 0;
 	char *fhandle = NULL;
 	char  d[16] = {};
 	char  k[16] = {};
@@ -127,6 +129,13 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv)
 				invarg("invalid protocol", *argv);
 			protocol = id;
 			protocol_set = 1;
+		} else if (matches(*argv, "chain") == 0) {
+			NEXT_ARG();
+			if (chain_index_set)
+				duparg("chain", *argv);
+			if (get_u32(&chain_index, *argv, 0))
+				invarg("invalid chain index value", *argv);
+			chain_index_set = 1;
 		} else if (matches(*argv, "estimator") == 0) {
 			if (parse_estimator(&argc, &argv, &est) < 0)
 				return -1;
@@ -146,6 +155,9 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv)
 
 	req.t.tcm_info = TC_H_MAKE(prio<<16, protocol);
 
+	if (chain_index_set)
+		addattr32(&req.n, sizeof(req), TCA_CHAIN, chain_index);
+
 	if (k[0])
 		addattr_l(&req.n, sizeof(req), TCA_KIND, k, strlen(k)+1);
 
@@ -167,6 +179,7 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv)
 			return -1;
 		}
 	}
+
 	if (est.ewma_log)
 		addattr_l(&req.n, sizeof(req), TCA_RATE, &est, sizeof(est));
 
@@ -193,6 +206,8 @@ static __u32 filter_parent;
 static int filter_ifindex;
 static __u32 filter_prio;
 static __u32 filter_protocol;
+static __u32 filter_chain_index;
+static int filter_chain_index_set;
 __u16 f_proto;
 
 int print_filter(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
@@ -270,6 +285,15 @@ int print_filter(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 		}
 	}
 	fprintf(fp, "%s ", rta_getattr_str(tb[TCA_KIND]));
+
+	if (tb[TCA_CHAIN]) {
+		__u32 chain_index = rta_getattr_u32(tb[TCA_CHAIN]);
+
+		if (!filter_chain_index_set ||
+		    filter_chain_index != chain_index)
+			fprintf(fp, "chain %u ", chain_index);
+	}
+
 	q = get_filter_kind(RTA_DATA(tb[TCA_KIND]));
 	if (tb[TCA_OPTIONS]) {
 		if (q)
@@ -311,6 +335,8 @@ static int tc_filter_get(int cmd, unsigned int flags, int argc, char **argv)
 	__u32 prio = 0;
 	__u32 protocol = 0;
 	int protocol_set = 0;
+	__u32 chain_index;
+	int chain_index_set = 0;
 	__u32 parent_handle = 0;
 	char *fhandle = NULL;
 	char  d[16] = {};
@@ -375,6 +401,13 @@ static int tc_filter_get(int cmd, unsigned int flags, int argc, char **argv)
 				invarg("invalid protocol", *argv);
 			protocol = id;
 			protocol_set = 1;
+		} else if (matches(*argv, "chain") == 0) {
+			NEXT_ARG();
+			if (chain_index_set)
+				duparg("chain", *argv);
+			if (get_u32(&chain_index, *argv, 0))
+				invarg("invalid chain index value", *argv);
+			chain_index_set = 1;
 		} else if (matches(*argv, "help") == 0) {
 			usage();
 			return 0;
@@ -401,6 +434,9 @@ static int tc_filter_get(int cmd, unsigned int flags, int argc, char **argv)
 
 	req.t.tcm_info = TC_H_MAKE(prio<<16, protocol);
 
+	if (chain_index_set)
+		addattr32(&req.n, sizeof(req), TCA_CHAIN, chain_index);
+
 	if (req.t.tcm_parent == TC_H_UNSPEC) {
 		fprintf(stderr, "Must specify filter parent\n");
 		return -1;
@@ -457,10 +493,20 @@ static int tc_filter_get(int cmd, unsigned int flags, int argc, char **argv)
 
 static int tc_filter_list(int argc, char **argv)
 {
-	struct tcmsg t = { .tcm_family = AF_UNSPEC };
+	struct {
+		struct nlmsghdr n;
+		struct tcmsg t;
+		char buf[MAX_MSG];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
+		.n.nlmsg_type = RTM_GETTFILTER,
+		.t.tcm_parent = TC_H_UNSPEC,
+		.t.tcm_family = AF_UNSPEC,
+	};
 	char d[16] = {};
 	__u32 prio = 0;
 	__u32 protocol = 0;
+	__u32 chain_index;
 	char *fhandle = NULL;
 
 	while (argc > 0) {
@@ -470,39 +516,39 @@ static int tc_filter_list(int argc, char **argv)
 				duparg("dev", *argv);
 			strncpy(d, *argv, sizeof(d)-1);
 		} else if (strcmp(*argv, "root") == 0) {
-			if (t.tcm_parent) {
+			if (req.t.tcm_parent) {
 				fprintf(stderr,
 					"Error: \"root\" is duplicate parent ID\n");
 				return -1;
 			}
-			filter_parent = t.tcm_parent = TC_H_ROOT;
+			filter_parent = req.t.tcm_parent = TC_H_ROOT;
 		} else if (strcmp(*argv, "ingress") == 0) {
-			if (t.tcm_parent) {
+			if (req.t.tcm_parent) {
 				fprintf(stderr,
 					"Error: \"ingress\" is duplicate parent ID\n");
 				return -1;
 			}
 			filter_parent = TC_H_MAKE(TC_H_CLSACT,
 						  TC_H_MIN_INGRESS);
-			t.tcm_parent = filter_parent;
+			req.t.tcm_parent = filter_parent;
 		} else if (strcmp(*argv, "egress") == 0) {
-			if (t.tcm_parent) {
+			if (req.t.tcm_parent) {
 				fprintf(stderr,
 					"Error: \"egress\" is duplicate parent ID\n");
 				return -1;
 			}
 			filter_parent = TC_H_MAKE(TC_H_CLSACT,
 						  TC_H_MIN_EGRESS);
-			t.tcm_parent = filter_parent;
+			req.t.tcm_parent = filter_parent;
 		} else if (strcmp(*argv, "parent") == 0) {
 			__u32 handle;
 
 			NEXT_ARG();
-			if (t.tcm_parent)
+			if (req.t.tcm_parent)
 				duparg("parent", *argv);
 			if (get_tc_classid(&handle, *argv))
 				invarg("invalid parent ID", *argv);
-			filter_parent = t.tcm_parent = handle;
+			filter_parent = req.t.tcm_parent = handle;
 		} else if (strcmp(*argv, "handle") == 0) {
 			NEXT_ARG();
 			if (fhandle)
@@ -526,6 +572,14 @@ static int tc_filter_list(int argc, char **argv)
 				invarg("invalid protocol", *argv);
 			protocol = res;
 			filter_protocol = protocol;
+		} else if (matches(*argv, "chain") == 0) {
+			NEXT_ARG();
+			if (filter_chain_index_set)
+				duparg("chain", *argv);
+			if (get_u32(&chain_index, *argv, 0))
+				invarg("invalid chain index value", *argv);
+			filter_chain_index_set = 1;
+			filter_chain_index = chain_index;
 		} else if (matches(*argv, "help") == 0) {
 			usage();
 		} else {
@@ -538,20 +592,23 @@ static int tc_filter_list(int argc, char **argv)
 		argc--; argv++;
 	}
 
-	t.tcm_info = TC_H_MAKE(prio<<16, protocol);
+	req.t.tcm_info = TC_H_MAKE(prio<<16, protocol);
 
 	ll_init_map(&rth);
 
 	if (d[0]) {
-		t.tcm_ifindex = ll_name_to_index(d);
-		if (t.tcm_ifindex == 0) {
+		req.t.tcm_ifindex = ll_name_to_index(d);
+		if (req.t.tcm_ifindex == 0) {
 			fprintf(stderr, "Cannot find device \"%s\"\n", d);
 			return 1;
 		}
-		filter_ifindex = t.tcm_ifindex;
+		filter_ifindex = req.t.tcm_ifindex;
 	}
 
-	if (rtnl_dump_request(&rth, RTM_GETTFILTER, &t, sizeof(t)) < 0) {
+	if (filter_chain_index_set)
+		addattr32(&req.n, sizeof(req), TCA_CHAIN, chain_index);
+
+	if (rtnl_dump_request_n(&rth, &req.n) < 0) {
 		perror("Cannot send dump request");
 		return 1;
 	}
-- 
2.7.4

^ permalink raw reply related

* Strange samples/bpf loading error for maps on net-next?
From: Jesper Dangaard Brouer @ 2017-04-27 11:15 UTC (permalink / raw)
  To: Daniel Borkmann, Martin KaFai Lau; +Cc: brouer, netdev@vger.kernel.org

To provoke this bug, remember that you MUST call:

 make headers_install

In the kernels root directory, else you will be compiling samples/bpf/
against the older headers previously installed.

The error looks like:

 $ sudo ./sockex1
 bpf_load_program() err=22
 fd 0 is not pointing to valid bpf_map
 sockex1: [...]/samples/bpf/sockex1_user.c:26: main: Assertion `setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0])) == 0' failed.
 Aborted

I've found that the bug were introduced in
 commit: fb30d4b71214 ("bpf: Add tests for map-in-map")

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [patch net-next 08/10] net: sched: introduce multichain support for filters
From: Jiri Pirko @ 2017-04-27 11:12 UTC (permalink / raw)
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, dsa, edumazet, stephen, daniel,
	alexander.h.duyck, mlxsw, simon.horman
In-Reply-To: <1493291540-2119-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Instead of having only one filter per block, introduce a list of chains
for every block. Create chain 0 by default. UAPI is extended so the user
can specify which chain he wants to change. If the new attribute is not
specified, chain 0 is used. That allows to maintain backward
compatibility. If chain does not exist and user wants to manipulate with
it, new chain is created with specified index. Also, when last filter is
removed from the chain, the chain is destroyed.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h          |   2 +
 include/net/sch_generic.h      |   9 +++-
 include/uapi/linux/rtnetlink.h |   1 +
 net/sched/cls_api.c            | 100 ++++++++++++++++++++++++++++++++++-------
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e56e715..2c213a6 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -18,6 +18,8 @@ int register_tcf_proto_ops(struct tcf_proto_ops *ops);
 int unregister_tcf_proto_ops(struct tcf_proto_ops *ops);
 
 #ifdef CONFIG_NET_CLS
+struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index);
+void tcf_chain_put(struct tcf_chain *chain);
 int tcf_block_get(struct tcf_block **p_block,
 		  struct tcf_proto __rcu **p_filter_chain);
 void tcf_block_put(struct tcf_block *block);
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 52bceed..569b565 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -8,6 +8,7 @@
 #include <linux/pkt_cls.h>
 #include <linux/percpu.h>
 #include <linux/dynamic_queue_limits.h>
+#include <linux/list.h>
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
 
@@ -236,7 +237,7 @@ struct tcf_proto {
 	struct Qdisc		*q;
 	void			*data;
 	const struct tcf_proto_ops	*ops;
-	struct tcf_block	*block;
+	struct tcf_chain	*chain;
 	struct rcu_head		rcu;
 };
 
@@ -251,10 +252,14 @@ struct qdisc_skb_cb {
 struct tcf_chain {
 	struct tcf_proto __rcu *filter_chain;
 	struct tcf_proto __rcu **p_filter_chain;
+	struct list_head list;
+	struct tcf_block *block;
+	u32 index; /* chain index */
+	unsigned int refcnt;
 };
 
 struct tcf_block {
-	struct tcf_chain *chain;
+	struct list_head chain_list;
 };
 
 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index cce0613..6487b21 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -549,6 +549,7 @@ enum {
 	TCA_STAB,
 	TCA_PAD,
 	TCA_DUMP_INVISIBLE,
+	TCA_CHAIN,
 	__TCA_MAX
 };
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 9f48061..a4fab8f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -129,7 +129,7 @@ static inline u32 tcf_auto_prio(struct tcf_proto *tp)
 
 static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 					  u32 prio, u32 parent, struct Qdisc *q,
-					  struct tcf_block *block)
+					  struct tcf_chain *chain)
 {
 	struct tcf_proto *tp;
 	int err;
@@ -165,7 +165,7 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 	tp->prio = prio;
 	tp->classid = parent;
 	tp->q = q;
-	tp->block = block;
+	tp->chain = chain;
 
 	err = tp->ops->init(tp);
 	if (err) {
@@ -186,15 +186,26 @@ static void tcf_proto_destroy(struct tcf_proto *tp)
 	kfree_rcu(tp, rcu);
 }
 
-static struct tcf_chain *tcf_chain_create(void)
+static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
+					  u32 chain_index)
 {
-	return kzalloc(sizeof(struct tcf_chain), GFP_KERNEL);
+	struct tcf_chain *chain;
+
+	chain = kzalloc(sizeof(*chain), GFP_KERNEL);
+	if (!chain)
+		return NULL;
+	list_add_tail(&chain->list, &block->chain_list);
+	chain->block = block;
+	chain->index = chain_index;
+	chain->refcnt = 1;
+	return chain;
 }
 
 static void tcf_chain_destroy(struct tcf_chain *chain)
 {
 	struct tcf_proto *tp;
 
+	list_del(&chain->list);
 	while ((tp = rtnl_dereference(chain->filter_chain)) != NULL) {
 		RCU_INIT_POINTER(chain->filter_chain, tp->next);
 		tcf_proto_destroy(tp);
@@ -202,6 +213,30 @@ static void tcf_chain_destroy(struct tcf_chain *chain)
 	kfree(chain);
 }
 
+struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index)
+{
+	struct tcf_chain *chain;
+
+	list_for_each_entry(chain, &block->chain_list, list) {
+		if (chain->index == chain_index) {
+			chain->refcnt++;
+			return chain;
+		}
+	}
+	return tcf_chain_create(block, chain_index);
+}
+EXPORT_SYMBOL(tcf_chain_get);
+
+void tcf_chain_put(struct tcf_chain *chain)
+{
+	/* Destroy unused chain, with exception of chain 0, which is the
+	 * default one and has to be always present.
+	 */
+	if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
+		tcf_chain_destroy(chain);
+}
+EXPORT_SYMBOL(tcf_chain_put);
+
 static void
 tcf_chain_filter_chain_ptr_set(struct tcf_chain *chain,
 			       struct tcf_proto __rcu **p_filter_chain)
@@ -213,16 +248,19 @@ int tcf_block_get(struct tcf_block **p_block,
 		  struct tcf_proto __rcu **p_filter_chain)
 {
 	struct tcf_block *block = kzalloc(sizeof(*block), GFP_KERNEL);
+	struct tcf_chain *chain;
 	int err;
 
 	if (!block)
 		return -ENOMEM;
-	block->chain = tcf_chain_create();
-	if (!block->chain) {
+	INIT_LIST_HEAD(&block->chain_list);
+	/* Create chain 0 by default, it has to be always present. */
+	chain = tcf_chain_create(block, 0);
+	if (!chain) {
 		err = -ENOMEM;
 		goto err_chain_create;
 	}
-	tcf_chain_filter_chain_ptr_set(block->chain, p_filter_chain);
+	tcf_chain_filter_chain_ptr_set(chain, p_filter_chain);
 	*p_block = block;
 	return 0;
 
@@ -234,9 +272,13 @@ EXPORT_SYMBOL(tcf_block_get);
 
 void tcf_block_put(struct tcf_block *block)
 {
+	struct tcf_chain *chain, *tmp;
+
 	if (!block)
 		return;
-	tcf_chain_destroy(block->chain);
+
+	list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+		tcf_chain_destroy(chain);
 	kfree(block);
 }
 EXPORT_SYMBOL(tcf_block_put);
@@ -354,10 +396,11 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	u32 prio;
 	bool prio_allocate;
 	u32 parent;
+	u32 chain_index;
 	struct net_device *dev;
 	struct Qdisc  *q;
 	struct tcf_chain_info chain_info;
-	struct tcf_chain *chain;
+	struct tcf_chain *chain = NULL;
 	struct tcf_block *block;
 	struct tcf_proto *tp;
 	const struct Qdisc_class_ops *cops;
@@ -443,7 +486,13 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		err = -EINVAL;
 		goto errout;
 	}
-	chain = block->chain;
+
+	chain_index = tca[TCA_CHAIN] ? nla_get_u32(tca[TCA_CHAIN]) : 0;
+	chain = tcf_chain_get(block, chain_index);
+	if (!chain) {
+		err = -ENOMEM;
+		goto errout;
+	}
 
 	if (n->nlmsg_type == RTM_DELTFILTER && prio == 0) {
 		tfilter_notify_chain(net, skb, n, chain, RTM_DELTFILTER);
@@ -477,7 +526,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			prio = tcf_auto_prio(tcf_chain_tp_prev(&chain_info));
 
 		tp = tcf_proto_create(nla_data(tca[TCA_KIND]),
-				      protocol, prio, parent, q, block);
+				      protocol, prio, parent, q, chain);
 		if (IS_ERR(tp)) {
 			err = PTR_ERR(tp);
 			goto errout;
@@ -550,6 +599,8 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 
 errout:
+	if (chain)
+		tcf_chain_put(chain);
 	if (cl)
 		cops->put(q, cl);
 	if (err == -EAGAIN)
@@ -578,6 +629,8 @@ static int tcf_fill_node(struct net *net, struct sk_buff *skb,
 	tcm->tcm_info = TC_H_MAKE(tp->prio, tp->protocol);
 	if (nla_put_string(skb, TCA_KIND, tp->ops->kind))
 		goto nla_put_failure;
+	if (nla_put_u32(skb, TCA_CHAIN, tp->chain->index))
+		goto nla_put_failure;
 	tcm->tcm_handle = fh;
 	if (RTM_DELTFILTER != event) {
 		tcm->tcm_handle = 0;
@@ -634,7 +687,7 @@ static int tcf_node_dump(struct tcf_proto *tp, unsigned long n,
 			     RTM_NEWTFILTER);
 }
 
-static void tcf_chain_dump(struct tcf_chain *chain, struct sk_buff *skb,
+static bool tcf_chain_dump(struct tcf_chain *chain, struct sk_buff *skb,
 			   struct netlink_callback *cb,
 			   long index_start, long *p_index)
 {
@@ -661,7 +714,7 @@ static void tcf_chain_dump(struct tcf_chain *chain, struct sk_buff *skb,
 					  NETLINK_CB(cb->skb).portid,
 					  cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					  RTM_NEWTFILTER) <= 0)
-				break;
+				return false;
 
 			cb->args[1] = 1;
 		}
@@ -676,14 +729,16 @@ static void tcf_chain_dump(struct tcf_chain *chain, struct sk_buff *skb,
 		tp->ops->walk(tp, &arg.w);
 		cb->args[1] = arg.w.count + 1;
 		if (arg.w.stop)
-			break;
+			return false;
 	}
+	return true;
 }
 
 /* called with RTNL */
 static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
+	struct nlattr *tca[TCA_MAX + 1];
 	struct net_device *dev;
 	struct Qdisc *q;
 	struct tcf_block *block;
@@ -693,9 +748,15 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	const struct Qdisc_class_ops *cops;
 	long index_start;
 	long index;
+	int err;
 
 	if (nlmsg_len(cb->nlh) < sizeof(*tcm))
 		return skb->len;
+
+	err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, NULL, NULL);
+	if (err)
+		return err;
+
 	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
 	if (!dev)
 		return skb->len;
@@ -719,11 +780,18 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	block = cops->tcf_block(q, cl);
 	if (!block)
 		goto errout;
-	chain = block->chain;
 
 	index_start = cb->args[0];
 	index = 0;
-	tcf_chain_dump(chain, skb, cb, index_start, &index);
+
+	list_for_each_entry(chain, &block->chain_list, list) {
+		if (tca[TCA_CHAIN] &&
+		    nla_get_u32(tca[TCA_CHAIN]) != chain->index)
+			continue;
+		if (!tcf_chain_dump(chain, skb, cb, index_start, &index))
+			break;
+	}
+
 	cb->args[0] = index;
 
 errout:
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 10/10] net: sched: extend gact to allow jumping to another filter chain
From: Jiri Pirko @ 2017-04-27 11:12 UTC (permalink / raw)
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, dsa, edumazet, stephen, daniel,
	alexander.h.duyck, mlxsw, simon.horman
In-Reply-To: <1493291540-2119-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Introduce new type of gact action called "goto_chain". This allows
user to specify a chain to be processed. This action type is
then processed as a return value in tcf_classify loop in similar
way as "reclassify" is, only it does not reset to the first filter
in chain but rather reset to the first filter of the desired chain.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h           |  9 +++++--
 include/net/tc_act/tc_gact.h        |  2 ++
 include/uapi/linux/pkt_cls.h        |  1 +
 include/uapi/linux/tc_act/tc_gact.h |  1 +
 net/sched/act_gact.c                | 48 ++++++++++++++++++++++++++++++++++++-
 net/sched/cls_api.c                 |  8 +++++--
 6 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 569b565..3688501 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -193,8 +193,13 @@ struct Qdisc_ops {
 
 
 struct tcf_result {
-	unsigned long	class;
-	u32		classid;
+	union {
+		struct {
+			unsigned long	class;
+			u32		classid;
+		};
+		const struct tcf_proto *goto_tp;
+	};
 };
 
 struct tcf_proto_ops {
diff --git a/include/net/tc_act/tc_gact.h b/include/net/tc_act/tc_gact.h
index b6f1739..58bee54 100644
--- a/include/net/tc_act/tc_gact.h
+++ b/include/net/tc_act/tc_gact.h
@@ -12,6 +12,8 @@ struct tcf_gact {
 	int			tcfg_paction;
 	atomic_t		packets;
 #endif
+	struct tcf_chain	*goto_chain;
+	struct rcu_head		rcu;
 };
 #define to_gact(a) ((struct tcf_gact *)a)
 
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index f1129e3..e03ba27 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -37,6 +37,7 @@ enum {
 #define TC_ACT_QUEUED		5
 #define TC_ACT_REPEAT		6
 #define TC_ACT_REDIRECT		7
+#define TC_ACT_GOTO_CHAIN	8
 #define TC_ACT_JUMP		0x10000000
 
 /* Action type identifiers*/
diff --git a/include/uapi/linux/tc_act/tc_gact.h b/include/uapi/linux/tc_act/tc_gact.h
index 70b536a..388733d 100644
--- a/include/uapi/linux/tc_act/tc_gact.h
+++ b/include/uapi/linux/tc_act/tc_gact.h
@@ -26,6 +26,7 @@ enum {
 	TCA_GACT_PARMS,
 	TCA_GACT_PROB,
 	TCA_GACT_PAD,
+	TCA_GACT_CHAIN,
 	__TCA_GACT_MAX
 };
 #define TCA_GACT_MAX (__TCA_GACT_MAX - 1)
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index c527c11..d63aebd 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
 #include <linux/tc_act/tc_gact.h>
 #include <net/tc_act/tc_gact.h>
 
@@ -54,6 +55,7 @@ static g_rand gact_rand[MAX_RAND] = { NULL, gact_net_rand, gact_determ };
 static const struct nla_policy gact_policy[TCA_GACT_MAX + 1] = {
 	[TCA_GACT_PARMS]	= { .len = sizeof(struct tc_gact) },
 	[TCA_GACT_PROB]		= { .len = sizeof(struct tc_gact_p) },
+	[TCA_GACT_CHAIN]	= { .type = NLA_U32 },
 };
 
 static int tcf_gact_init(struct net *net, struct tcf_proto *tp,
@@ -92,6 +94,9 @@ static int tcf_gact_init(struct net *net, struct tcf_proto *tp,
 	}
 #endif
 
+	if (parm->action == TC_ACT_GOTO_CHAIN && !tb[TCA_GACT_CHAIN])
+		return -EINVAL;
+
 	if (!tcf_hash_check(tn, parm->index, a, bind)) {
 		ret = tcf_hash_create(tn, parm->index, est, a,
 				      &act_gact_ops, bind, true);
@@ -121,11 +126,43 @@ static int tcf_gact_init(struct net *net, struct tcf_proto *tp,
 		gact->tcfg_ptype   = p_parm->ptype;
 	}
 #endif
+
+	if (gact->tcf_action == TC_ACT_GOTO_CHAIN) {
+		u32 chain_index = nla_get_u32(tb[TCA_GACT_CHAIN]);
+
+		if (!tp) {
+			if (ret == ACT_P_CREATED)
+				tcf_hash_release(*a, bind);
+			return -EINVAL;
+		}
+		gact->goto_chain = tcf_chain_get(tp->chain->block, chain_index);
+		if (!gact->goto_chain) {
+			if (ret == ACT_P_CREATED)
+				tcf_hash_release(*a, bind);
+			return -ENOMEM;
+		}
+	}
+
 	if (ret == ACT_P_CREATED)
 		tcf_hash_insert(tn, *a);
 	return ret;
 }
 
+static void tcf_gact_cleanup_rcu(struct rcu_head *rcu)
+{
+	struct tcf_gact *gact = container_of(rcu, struct tcf_gact, rcu);
+
+	if (gact->tcf_action == TC_ACT_GOTO_CHAIN)
+		tcf_chain_put(gact->goto_chain);
+}
+
+static void tcf_gact_cleanup(struct tc_action *a, int bind)
+{
+	struct tcf_gact *gact = to_gact(a);
+
+	call_rcu(&gact->rcu, tcf_gact_cleanup_rcu);
+}
+
 static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
 		    struct tcf_result *res)
 {
@@ -141,8 +178,13 @@ static int tcf_gact(struct sk_buff *skb, const struct tc_action *a,
 	}
 #endif
 	bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
-	if (action == TC_ACT_SHOT)
+	if (action == TC_ACT_SHOT) {
 		qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
+	} else if (action == TC_ACT_GOTO_CHAIN) {
+		struct tcf_chain *chain = gact->goto_chain;
+
+		res->goto_tp = rcu_dereference_bh(chain->filter_chain);
+	}
 
 	tcf_lastuse_update(&gact->tcf_tm);
 
@@ -194,6 +236,9 @@ static int tcf_gact_dump(struct sk_buff *skb, struct tc_action *a,
 	tcf_tm_dump(&t, &gact->tcf_tm);
 	if (nla_put_64bit(skb, TCA_GACT_TM, sizeof(t), &t, TCA_GACT_PAD))
 		goto nla_put_failure;
+	if (gact->tcf_action == TC_ACT_GOTO_CHAIN &&
+	    nla_put_u32(skb, TCA_GACT_CHAIN, gact->goto_chain->index))
+		goto nla_put_failure;
 	return skb->len;
 
 nla_put_failure:
@@ -225,6 +270,7 @@ static struct tc_action_ops act_gact_ops = {
 	.stats_update	=	tcf_gact_stats_update,
 	.dump		=	tcf_gact_dump,
 	.init		=	tcf_gact_init,
+	.cleanup	=	tcf_gact_cleanup,
 	.walk		=	tcf_gact_walker,
 	.lookup		=	tcf_gact_search,
 	.size		=	sizeof(struct tcf_gact),
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index dbc1348..a2d6bc7 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -304,10 +304,14 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			continue;
 
 		err = tp->classify(skb, tp, res);
-		if (unlikely(err == TC_ACT_RECLASSIFY && !compat_mode))
+		if (err == TC_ACT_RECLASSIFY && !compat_mode) {
 			goto reset;
-		if (err >= 0)
+		} else if (err == TC_ACT_GOTO_CHAIN) {
+			old_tp = res->goto_tp;
+			goto reset;
+		} else if (err >= 0) {
 			return err;
+		}
 	}
 
 	return TC_ACT_UNSPEC; /* signal: continue lookup */
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox