Netdev List


* RE: bnx2x: kernel panic in the bnx2x driver
From: Kalluru, Sudarsana @ 2018-06-22 10:20 UTC (permalink / raw)
  To: Vishwanath Pai, Elior, Ariel, Dept-Eng Everest Linux L2
  Cc: davem@davemloft.net, netdev@vger.kernel.org, dbanerje@akamai.com,
	pai.vishwain@gmail.com
In-Reply-To: <20180622050706.GA24578@akamai.com>

Hi Vishwanath,
    Thanks for your mail, and the analysis.
The fix would be to invoke bnx2x_rss() only when the device is open:
	if (bp->state == BNX2X_STATE_OPEN)
		return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
	else
		return 0;
Ariel,
   Could you please review the path (bnx2x_set_rss_flags() --> bnx2x_rss()) and confirm or correct the above?

Thanks,
Sudarsana

-----Original Message-----
From: Vishwanath Pai [mailto:vpai@akamai.com] 
Sent: 22 June 2018 10:37
To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
Subject: bnx2x: kernel panic in the bnx2x driver

Hi,

We recently noticed a kernel panic in the bnx2x driver when trying to set
rx-flow-hash parameters via ethtool during if-pre-up.d. I am running kernel
v4.17.2 from ubuntu-mainline-ppa. I have added the stack trace below:

[   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
[   18.280214] Oops: 0010 [#1] SMP PTI
[   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_leds kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_generic aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crypto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha256_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
[   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
[   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
[   18.280243] RIP: 0010:          (null)
[   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
[   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
[   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
[   18.280245] RBP: ffffb84bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
[   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
[   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
[   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
[   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
[   18.280249] Call Trace:
[   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
[   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
[   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
[   18.280279]  ethtool_set_rxnfc+0x9b/0x110
[   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
[   18.280283]  ? security_capable+0x3c/0x60
[   18.280284]  dev_ethtool+0x350/0x2610
[   18.280286]  ? page_cache_async_readahead+0x71/0x80
[   18.280288]  ? page_add_file_rmap+0x5d/0x220
[   18.280290]  ? inet_ioctl+0x182/0x1a0
[   18.280291]  dev_ioctl+0x203/0x3f0
[   18.280293]  ? dev_ioctl+0x203/0x3f0
[   18.280294]  sock_do_ioctl+0xae/0x150
[   18.280296]  sock_ioctl+0x1e2/0x330
[   18.280296]  ? sock_ioctl+0x1e2/0x330
[   18.280299]  do_vfs_ioctl+0xa8/0x620
[   18.280300]  ? dlci_ioctl_set+0x30/0x30
[   18.280301]  ? do_vfs_ioctl+0xa8/0x620
[   18.280302]  ? handle_mm_fault+0xe3/0x220
[   18.280304]  ksys_ioctl+0x75/0x80
[   18.280305]  __x64_sys_ioctl+0x1a/0x20
[   18.280307]  do_syscall_64+0x5a/0x120
[   18.280309]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   18.280310] RIP: 0033:0x7fc0e7fba107
[   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
[   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
[   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
[   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
[   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
[   18.280315] Code:  Bad RIP value.
[   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
[   18.280318] CR2: 0000000000000000
[   18.280319] ---[ end trace 5f361db3fb9059f1 ]---

To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
ethtool -N $IFACE rx-flow-hash udp6 "sdfn"

The problem here is that rss_obj in the bnx2x struct for the device hasn't been initialized yet, which causes an exception in bnx2x_config_rss() when calling "r->set_pending(r)" because r->set_pending is NULL. It looks like many things haven't been initialized at this point; most of that happens in "bnx2x_init_bp_objs()", which isn't called until ifup. Any thoughts on how this can be fixed? Would it be possible to safely move bnx2x_init_bp_objs() to, say, bnx2x_init_one(), which runs well before ifup?

Thanks,
Vishwanath
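
The failure described above (a call through a function pointer that is only populated at ifup) and the state check proposed in the reply can be illustrated with a small user-space sketch. Everything prefixed with demo_ is hypothetical and merely stands in for the driver's structures; this is not the bnx2x code, only a reduced model of the same pattern under those assumptions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
enum demo_state { DEMO_STATE_CLOSED, DEMO_STATE_OPEN };

struct demo_rss_obj {
	/* Populated only when the interface is brought up. */
	void (*set_pending)(struct demo_rss_obj *r);
	int pending;
};

struct demo_dev {
	enum demo_state state;
	struct demo_rss_obj rss_obj;
};

static void demo_set_pending(struct demo_rss_obj *r)
{
	r->pending = 1;
}

/* Stand-in for the object initialization that happens at ifup. */
static void demo_open(struct demo_dev *dev)
{
	dev->rss_obj.set_pending = demo_set_pending;
	dev->state = DEMO_STATE_OPEN;
}

/*
 * Touch the RSS object only when the device is open. On a closed
 * device the call is a successful no-op, so the NULL callback is
 * never dereferenced.
 */
static int demo_set_rss_flags(struct demo_dev *dev)
{
	if (dev->state != DEMO_STATE_OPEN)
		return 0;

	/* Defensive check; should not trigger once the device is open. */
	if (!dev->rss_obj.set_pending)
		return -EINVAL;

	dev->rss_obj.set_pending(&dev->rss_obj);
	return 0;
}
```

In this model, calling demo_set_rss_flags() before demo_open() returns 0 without touching the callback, which mirrors the intent of checking the device state before invoking bnx2x_rss().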

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Okash Khawaja @ 2018-06-22 10:24 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Daniel Borkmann, Martin KaFai Lau, Alexei Starovoitov,
	Yonghong Song, Jakub Kicinski, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <3db6047a-101a-2ed1-9ca3-9e90b45ea00f@netronome.com>

On Thu, Jun 21, 2018 at 11:42:59AM +0100, Quentin Monnet wrote:
> Hi Okash,
hi and sorry about delay in responding. the email got routed to
incorrect folder.
> 
> 2018-06-20 13:30 UTC-0700 ~ Okash Khawaja <osk@fb.com>
> > This consumes functionality exported in the previous patch. It does the
> > main job of printing with BTF data. This is used in the following patch
> > to provide a more readable output of a map's dump. It relies on
> > json_writer to do json printing. Below is sample output where map keys
> > are ints and values are of type struct A:
> > 
> > typedef int int_type;
> > enum E {
> >         E0,
> >         E1,
> > };
> > 
> > struct B {
> >         int x;
> >         int y;
> > };
> > 
> > struct A {
> >         int m;
> >         unsigned long long n;
> >         char o;
> >         int p[8];
> >         int q[4][8];
> >         enum E r;
> >         void *s;
> >         struct B t;
> >         const int u;
> >         int_type v;
> >         unsigned int w1: 3;
> >         unsigned int w2: 3;
> > };
> > 
> > $ sudo bpftool map dump -p id 14
> > [{
> >         "key": 0
> >     },{
> >         "value": {
> >             "m": 1,
> >             "n": 2,
> >             "o": "c",
> >             "p": [15,16,17,18,15,16,17,18
> >             ],
> >             "q": [[25,26,27,28,25,26,27,28
> >                 ],[35,36,37,38,35,36,37,38
> >                 ],[45,46,47,48,45,46,47,48
> >                 ],[55,56,57,58,55,56,57,58
> >                 ]
> >             ],
> >             "r": 1,
> >             "s": 0x7ffff6f70568,
> 
> Hexadecimal values, without quotes, are not valid JSON. Please stick to
> decimal values.
ah sorry, i used a buggy json validator. this should be a quick fix.
which would be better: should pointers be output as hex strings or as integers?

> 
> >             "t": {
> >                 "x": 5,
> >                 "y": 10
> >             },
> >             "u": 100,
> >             "v": 20,
> >             "w1": 0x7,
> >             "w2": 0x3
> >         }
> >     }
> > ]
> > 
> > This patch uses json's {} and [] to imply struct/union and array. More
> > explicit information can be added later. For example, a command line
> > option can be introduced to print whether a key or value is struct
> > or union, name of a struct etc. This will however come at the expense
> > of duplicating info when, for example, printing an array of structs.
> > enums are printed as ints without their names.
> > 
> > Signed-off-by: Okash Khawaja <osk@fb.com>
> > Acked-by: Martin KaFai Lau <kafai@fb.com>
> > 
> > ---
> >  tools/bpf/bpftool/btf_dumper.c |  247 +++++++++++++++++++++++++++++++++++++++++
> >  tools/bpf/bpftool/btf_dumper.h |   18 ++
> >  2 files changed, 265 insertions(+)
> > 
> > --- /dev/null
> > +++ b/tools/bpf/bpftool/btf_dumper.c
> > @@ -0,0 +1,247 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright (c) 2018 Facebook */
> > +
> > +#include <linux/btf.h>
> > +#include <linux/err.h>
> > +#include <stdio.h> /* for (FILE *) used by json_writer */
> > +#include <linux/bitops.h>
> > +#include <string.h>
> > +#include <ctype.h>
> > +
> > +#include "btf.h"
> > +#include "json_writer.h"
> > +
> > +#define BITS_PER_BYTE_MASK (BITS_PER_BYTE - 1)
> > +#define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK)
> > +#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
> > +#define BITS_ROUNDUP_BYTES(bits) \
> > +	(BITS_ROUNDDOWN_BYTES(bits) + !!BITS_PER_BYTE_MASKED(bits))
> > +
> > +static int btf_dumper_do_type(const struct btf *btf, uint32_t type_id,
> > +		uint8_t bit_offset, const void *data, json_writer_t *jw);
> > +
> > +static void btf_dumper_ptr(const void *data, json_writer_t *jw)
> > +{
> > +	jsonw_printf(jw, "%p", *((uintptr_t *)data));
> > +}
> > +
> > +static int btf_dumper_modifier(const struct btf *btf, uint32_t type_id,
> > +		const void *data, json_writer_t *jw)
> > +{
> > +	int32_t actual_type_id = btf__resolve_type(btf, type_id);
> > +	int ret;
> > +
> > +	if (actual_type_id < 0)
> > +		return actual_type_id;
> > +
> > +	ret = btf_dumper_do_type(btf, actual_type_id, 0, data, jw);
> > +
> > +	return ret;
> > +}
> > +
> > +static void btf_dumper_enum(const void *data, json_writer_t *jw)
> > +{
> > +	jsonw_printf(jw, "%d", *((int32_t *)data));
> > +}
> > +
> > +static int btf_dumper_array(const struct btf *btf, uint32_t type_id,
> > +		const void *data, json_writer_t *jw)
> > +{
> > +	const struct btf_type *t = btf__type_by_id(btf, type_id);
> > +	struct btf_array *arr = (struct btf_array *)(t + 1);
> > +	int64_t elem_size;
> > +	uint32_t i;
> > +	int ret;
> > +
> > +	elem_size = btf__resolve_size(btf, arr->type);
> > +	if (elem_size < 0)
> > +		return elem_size;
> > +
> > +	jsonw_start_array(jw);
> > +	for (i = 0; i < arr->nelems; i++) {
> > +		ret = btf_dumper_do_type(btf, arr->type, 0,
> > +				data + (i * elem_size), jw);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	jsonw_end_array(jw);
> > +
> > +	return 0;
> > +}
> > +
> > +static void btf_dumper_int_bits(uint32_t int_type, uint8_t bit_offset,
> > +		const void *data, json_writer_t *jw)
> > +{
> > +	uint16_t total_bits_offset;
> > +	uint32_t bits = BTF_INT_BITS(int_type);
> 
> Nit: Please use reverse-Christmas-tree here.
> 
> As for patch 3 you also have a number of continuation lines not aligned
> on the opening parenthesis from the first line, throughout the patch
> (please consider running checkpatch in "--strict" mode for the list).
i didn't use "--strict". that would explain style mismatch. will fix
that in v2.

> 
> > +	uint16_t bytes_to_copy;
> > +	uint16_t bits_to_copy;
> > +	uint8_t upper_bits;
> > +	union {
> > +		uint64_t u64_num;
> > +		uint8_t u8_nums[8];
> > +	} print_num;
> > +
> > +	total_bits_offset = bit_offset + BTF_INT_OFFSET(int_type);
> > +	data += BITS_ROUNDDOWN_BYTES(total_bits_offset);
> > +	bit_offset = BITS_PER_BYTE_MASKED(total_bits_offset);
> > +	bits_to_copy = bits + bit_offset;
> > +	bytes_to_copy = BITS_ROUNDUP_BYTES(bits_to_copy);
> > +
> > +	print_num.u64_num = 0;
> > +	memcpy(&print_num.u64_num, data, bytes_to_copy);
> > +
> > +	upper_bits = BITS_PER_BYTE_MASKED(bits_to_copy);
> > +	if (upper_bits) {
> > +		uint8_t mask = (1 << upper_bits) - 1;
> > +
> > +		print_num.u8_nums[bytes_to_copy - 1] &= mask;
> > +	}
> > +
> > +	print_num.u64_num >>= bit_offset;
> > +
> > +	jsonw_printf(jw, "0x%llx", print_num.u64_num);
> > +}
> > +
> > +static int btf_dumper_int(const struct btf_type *t, uint8_t bit_offset,
> > +		const void *data, json_writer_t *jw)
> > +{
> > +	uint32_t *int_type = (uint32_t *)(t + 1);
> > +	uint32_t bits = BTF_INT_BITS(*int_type);
> > +	int ret = 0;
> > +
> > +	/* if this is bit field */
> > +	if (bit_offset || BTF_INT_OFFSET(*int_type) ||
> > +			BITS_PER_BYTE_MASKED(bits)) {
> > +		btf_dumper_int_bits(*int_type, bit_offset, data, jw);
> > +		return ret;
> > +	}
> > +
> > +	switch (BTF_INT_ENCODING(*int_type)) {
> > +	case 0:
> > +		if (BTF_INT_BITS(*int_type) == 64)
> > +			jsonw_printf(jw, "%lu", *((uint64_t *)data));
> > +		else if (BTF_INT_BITS(*int_type) == 32)
> > +			jsonw_printf(jw, "%u", *((uint32_t *)data));
> > +		else if (BTF_INT_BITS(*int_type) == 16)
> > +			jsonw_printf(jw, "%hu", *((uint16_t *)data));
> > +		else if (BTF_INT_BITS(*int_type) == 8)
> > +			jsonw_printf(jw, "%hhu", *((uint8_t *)data));
> > +		else
> > +			btf_dumper_int_bits(*int_type, bit_offset, data, jw);
> > +		break;
> > +	case BTF_INT_SIGNED:
> > +		if (BTF_INT_BITS(*int_type) == 64)
> > +			jsonw_printf(jw, "%ld", *((int64_t *)data));
> > +		else if (BTF_INT_BITS(*int_type) == 32)
> > +			jsonw_printf(jw, "%d", *((int32_t *)data));
> > +		else if (BTF_INT_BITS(*int_type) ==  16)
> > +			jsonw_printf(jw, "%hd", *((int16_t *)data));
> > +		else if (BTF_INT_BITS(*int_type) ==  8)
> > +			jsonw_printf(jw, "%hhd", *((int8_t *)data));
> > +		else
> > +			btf_dumper_int_bits(*int_type, bit_offset, data, jw);
> > +		break;
> > +	case BTF_INT_CHAR:
> > +		if (*((char *)data) == '\0')
> > +			jsonw_null(jw);
> > +		else if (isprint(*((char *)data)))
> > +			jsonw_printf(jw, "\"%c\"", *((char *)data));
> > +		else
> > +			jsonw_printf(jw, "%hhx", *((char *)data));
> > +		break;
> > +	case BTF_INT_BOOL:
> > +		jsonw_bool(jw, *((int *)data));
> > +		break;
> > +	default:
> > +		/* shouldn't happen */
> > +		ret = -EINVAL;
> > +		break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static int btf_dumper_struct(const struct btf *btf, uint32_t type_id,
> > +		const void *data, json_writer_t *jw)
> > +{
> > +	const struct btf_type *t = btf__type_by_id(btf, type_id);
> > +	struct btf_member *m;
> > +	int ret = 0;
> > +	int i, vlen;
> > +
> > +	if (t == NULL)
> > +		return -EINVAL;
> > +
> > +	vlen = BTF_INFO_VLEN(t->info);
> > +	jsonw_start_object(jw);
> > +	m = (struct btf_member *)(t + 1);
> > +
> > +	for (i = 0; i < vlen; i++) {
> > +		jsonw_name(jw, btf__name_by_offset(btf, m[i].name_off));
> > +		ret = btf_dumper_do_type(btf, m[i].type,
> > +				BITS_PER_BYTE_MASKED(m[i].offset),
> > +				data + BITS_ROUNDDOWN_BYTES(m[i].offset), jw);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	jsonw_end_object(jw);
> > +
> > +	return 0;
> > +}
> > +
> > +static int btf_dumper_do_type(const struct btf *btf, uint32_t type_id,
> > +		uint8_t bit_offset, const void *data, json_writer_t *jw)
> > +{
> > +	const struct btf_type *t = btf__type_by_id(btf, type_id);
> > +	int ret = 0;
> > +
> > +	switch (BTF_INFO_KIND(t->info)) {
> > +	case BTF_KIND_INT:
> > +		ret = btf_dumper_int(t, bit_offset, data, jw);
> > +		break;
> > +	case BTF_KIND_STRUCT:
> > +	case BTF_KIND_UNION:
> > +		ret = btf_dumper_struct(btf, type_id, data, jw);
> > +		break;
> > +	case BTF_KIND_ARRAY:
> > +		ret = btf_dumper_array(btf, type_id, data, jw);
> > +		break;
> > +	case BTF_KIND_ENUM:
> > +		btf_dumper_enum(data, jw);
> > +		break;
> > +	case BTF_KIND_PTR:
> > +		btf_dumper_ptr(data, jw);
> > +		break;
> > +	case BTF_KIND_UNKN:
> > +		jsonw_printf(jw, "(unknown)");
> > +		break;
> > +	case BTF_KIND_FWD:
> > +		/* map key or value can't be forward */
> > +		ret = -EINVAL;
> > +		break;
> > +	case BTF_KIND_TYPEDEF:
> > +	case BTF_KIND_VOLATILE:
> > +	case BTF_KIND_CONST:
> > +	case BTF_KIND_RESTRICT:
> > +		ret = btf_dumper_modifier(btf, type_id, data, jw);
> > +		break;
> > +	default:
> > +		jsonw_printf(jw, "(unsupported-kind");
> > +		ret = -EINVAL;
> > +		break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +int32_t btf_dumper_type(const struct btf *btf, json_writer_t *jw,
> > +		uint32_t type_id, const void *data)
> > +{
> > +	if (!jw)
> > +		return -EINVAL;
> > +
> > +	return btf_dumper_do_type(btf, type_id, 0, data, jw);
> > +}
> > --- /dev/null
> > +++ b/tools/bpf/bpftool/btf_dumper.h
> > @@ -0,0 +1,18 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> 
> I believe SPDX tag in header files should use C++ style comments (//)?
okay i will verify and fix that.

> 
> > +/* Copyright (c) 2018 Facebook */
> > +
> > +#ifndef BTF_DUMPER_H
> > +#define BTF_DUMPER_H
> > +
> > +/* btf_dumper_type - json print data along with type information
> > + * @btf: btf instance initialised via btf__new()
> > + * @jw: json writer used for printing
> > + * @type_id: index in btf->types array. this points to the type to be dumped
> > + * @data: pointer the actual data, i.e. the values to be printed
> > + *
> > + * Returns zero on success and negative error code otherwise
> > + */
> > +int32_t btf_dumper_type(const struct btf *btf, json_writer_t *jw,
> > +		uint32_t type_id, void *data);
> > +
> > +#endif
> > 
> 
> Thanks,
> Quentin
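
For readers following the bit-field handling in btf_dumper_int_bits() quoted above, here is a minimal standalone sketch of the same idea: skip whole bytes, gather the bytes that cover the field, shift away the leading bits, and mask to the field width. The name demo_extract_bits() is hypothetical and the code is a simplified variant, not the patch itself. It assembles the bytes explicitly in little-endian order and assumes the field fits in 64 bits after the byte offset has been removed.

```c
#include <stdint.h>

/*
 * Extract an unsigned bit-field of nbits bits that starts bit_offset
 * bits into data. Bits are numbered from the least significant bit of
 * the first byte (little-endian layout).
 *
 * Assumption: (bit_offset % 8) + nbits <= 64 and nbits >= 1.
 */
static uint64_t demo_extract_bits(const uint8_t *data,
				  unsigned int bit_offset,
				  unsigned int nbits)
{
	unsigned int nbytes, i;
	uint64_t num = 0;

	/* Skip whole bytes, keep only the offset within the first byte. */
	data += bit_offset / 8;
	bit_offset %= 8;

	/* Number of bytes covering the field, rounded up. */
	nbytes = (bit_offset + nbits + 7) / 8;

	for (i = 0; i < nbytes; i++)
		num |= (uint64_t)data[i] << (8 * i);

	/* Drop the bits below the field, then mask to its width. */
	num >>= bit_offset;
	if (nbits < 64)
		num &= (UINT64_C(1) << nbits) - 1;

	return num;
}
```

With w1 = 7 and w2 = 3 packed into one byte as in the sample struct (assuming the compiler places w1 in bits 0-2 and w2 in bits 3-5), the byte is 0x1f; extracting 3 bits at offset 0 gives 7, and 3 bits at offset 3 gives 3.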

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Quentin Monnet @ 2018-06-22 10:39 UTC (permalink / raw)
  To: Okash Khawaja
  Cc: Daniel Borkmann, Martin KaFai Lau, Alexei Starovoitov,
	Yonghong Song, Jakub Kicinski, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <20180622102428.GA3050@w1t1fb>

2018-06-22 11:24 UTC+0100 ~ Okash Khawaja <osk@fb.com>
> On Thu, Jun 21, 2018 at 11:42:59AM +0100, Quentin Monnet wrote:
>> Hi Okash,
> hi and sorry about delay in responding. the email got routed to
> incorrect folder.
>>
>> 2018-06-20 13:30 UTC-0700 ~ Okash Khawaja <osk@fb.com>
>>> This consumes functionality exported in the previous patch. It does the
>>> main job of printing with BTF data. This is used in the following patch
>>> to provide a more readable output of a map's dump. It relies on
>>> json_writer to do json printing. Below is sample output where map keys
>>> are ints and values are of type struct A:
>>>
>>> typedef int int_type;
>>> enum E {
>>>         E0,
>>>         E1,
>>> };
>>>
>>> struct B {
>>>         int x;
>>>         int y;
>>> };
>>>
>>> struct A {
>>>         int m;
>>>         unsigned long long n;
>>>         char o;
>>>         int p[8];
>>>         int q[4][8];
>>>         enum E r;
>>>         void *s;
>>>         struct B t;
>>>         const int u;
>>>         int_type v;
>>>         unsigned int w1: 3;
>>>         unsigned int w2: 3;
>>> };
>>>
>>> $ sudo bpftool map dump -p id 14
>>> [{
>>>         "key": 0
>>>     },{
>>>         "value": {
>>>             "m": 1,
>>>             "n": 2,
>>>             "o": "c",
>>>             "p": [15,16,17,18,15,16,17,18
>>>             ],
>>>             "q": [[25,26,27,28,25,26,27,28
>>>                 ],[35,36,37,38,35,36,37,38
>>>                 ],[45,46,47,48,45,46,47,48
>>>                 ],[55,56,57,58,55,56,57,58
>>>                 ]
>>>             ],
>>>             "r": 1,
>>>             "s": 0x7ffff6f70568,
>>
>> Hexadecimal values, without quotes, are not valid JSON. Please stick to
>> decimal values.
> ah sorry, i used a buggy json validator. this should be a quick fix.
> which would be better: should pointers be output as hex strings or as integers?

I would go for integers. Although this is harder to read for humans, it
is easier to process for machines, which remain the primary targets for
JSON output.

Quentin
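
As an illustration of the integer option, a pointer-sized value can be formatted as a plain decimal number, which is valid JSON without quoting. The helper name demo_format_ptr() is hypothetical, and it writes into a caller-supplied buffer with snprintf() rather than going through json_writer; treat it as a sketch of the formatting only.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the pointer-sized value stored at data as a decimal number.
 * Returns what snprintf() returns: the number of characters that the
 * full output requires, excluding the terminating NUL.
 */
static int demo_format_ptr(char *buf, size_t len, const void *data)
{
	uintptr_t val;

	/* memcpy avoids an unaligned load from the map value. */
	memcpy(&val, data, sizeof(val));

	return snprintf(buf, len, "%" PRIuPTR, val);
}
```

PRIuPTR from <inttypes.h> matches uintptr_t on both 32-bit and 64-bit builds, so the format string and the argument type stay in agreement.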

* Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]
From: Marco Berizzi @ 2018-06-22 11:05 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpVEZFL==0BuQh26Wh-TEJ-hmz7NwGry2g7EZiK-FCjKAQ@mail.gmail.com>

> On 21 June 2018 at 1:00, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Please also test HFSC_RSC ("rt") if possible.

Hi Cong,

sorry for the delayed response.
I have tested this hfsc rt setup:

tc class add dev eth2 parent 1:2 classid 1:16 hfsc rt m2 500kbit
and
tc class add dev eth2 parent 1:2 classid 1:16 hfsc rt m2 5000kbit
and
tc class add dev eth2 parent 1:2 classid 1:16 hfsc rt m2 20000kbit

and I was getting the expected throughput specified by the m2
parameter.

> If you can confirm nothing breaks, I will send it out formally.

After 52 hours of uptime I don't see the previous error anymore.
 
> Thanks for testing!

and thanks for the patch.

* [PATCH v4] net: Remove depends on HAS_DMA in case of platform dependency
From: Geert Uytterhoeven @ 2018-06-22 11:08 UTC (permalink / raw)
  To: David S . Miller, Yisen Zhuang, Sergey Matyukevich, Salil Mehta,
	Kalle Valo, Igor Mitsyanko, Avinash Patil
  Cc: Wright Feng, Sergei Shtylyov, Quan Nguyen, Keyur Chudgar,
	Jiri Pirko, Iyappan Subramanian, Ido Schimmel, Hante Meuleman,
	Franky Lin, Chi-Hsien Lin, Arend van Spriel, netdev, linux-kernel,
	Geert Uytterhoeven

Remove dependencies on HAS_DMA where a Kconfig symbol depends on another
symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST".
In most cases this other symbol is an architecture or platform specific
symbol, or PCI.

Generic symbols and drivers without platform dependencies keep their
dependencies on HAS_DMA, to prevent compiling subsystems or drivers that
cannot work anyway.

This simplifies the dependencies and improves compile-test coverage.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
---
v4:
  - Rebase to v4.18-rc1 (applies to next-20180622, too),

v3:
  - Rebase to v4.17-rc1,
  - Drop obsolete note about FSL_FMAN,

v2:
  - Add Reviewed-by, Acked-by,
  - Drop RFC state,
  - Split per subsystem.
---
 drivers/net/ethernet/amd/Kconfig                | 2 +-
 drivers/net/ethernet/apm/xgene-v2/Kconfig       | 1 -
 drivers/net/ethernet/apm/xgene/Kconfig          | 1 -
 drivers/net/ethernet/arc/Kconfig                | 6 ++++--
 drivers/net/ethernet/broadcom/Kconfig           | 2 --
 drivers/net/ethernet/calxeda/Kconfig            | 2 +-
 drivers/net/ethernet/hisilicon/Kconfig          | 2 +-
 drivers/net/ethernet/marvell/Kconfig            | 8 +++-----
 drivers/net/ethernet/mellanox/mlxsw/Kconfig     | 2 +-
 drivers/net/ethernet/renesas/Kconfig            | 2 --
 drivers/net/wireless/broadcom/brcm80211/Kconfig | 1 -
 drivers/net/wireless/quantenna/qtnfmac/Kconfig  | 2 +-
 12 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/amd/Kconfig b/drivers/net/ethernet/amd/Kconfig
index d5c15e8bb3de706b..f273af136fc7c995 100644
--- a/drivers/net/ethernet/amd/Kconfig
+++ b/drivers/net/ethernet/amd/Kconfig
@@ -173,7 +173,7 @@ config SUNLANCE
 
 config AMD_XGBE
 	tristate "AMD 10GbE Ethernet driver"
-	depends on ((OF_NET && OF_ADDRESS) || ACPI || PCI) && HAS_IOMEM && HAS_DMA
+	depends on ((OF_NET && OF_ADDRESS) || ACPI || PCI) && HAS_IOMEM
 	depends on X86 || ARM64 || COMPILE_TEST
 	select BITREVERSE
 	select CRC32
diff --git a/drivers/net/ethernet/apm/xgene-v2/Kconfig b/drivers/net/ethernet/apm/xgene-v2/Kconfig
index 1205861b631896a0..eedd3f3dd22e2201 100644
--- a/drivers/net/ethernet/apm/xgene-v2/Kconfig
+++ b/drivers/net/ethernet/apm/xgene-v2/Kconfig
@@ -1,6 +1,5 @@
 config NET_XGENE_V2
 	tristate "APM X-Gene SoC Ethernet-v2 Driver"
-	depends on HAS_DMA
 	depends on ARCH_XGENE || COMPILE_TEST
 	help
 	  This is the Ethernet driver for the on-chip ethernet interface
diff --git a/drivers/net/ethernet/apm/xgene/Kconfig b/drivers/net/ethernet/apm/xgene/Kconfig
index afccb033177b3923..e4e33c900b577161 100644
--- a/drivers/net/ethernet/apm/xgene/Kconfig
+++ b/drivers/net/ethernet/apm/xgene/Kconfig
@@ -1,6 +1,5 @@
 config NET_XGENE
 	tristate "APM X-Gene SoC Ethernet Driver"
-	depends on HAS_DMA
 	depends on ARCH_XGENE || COMPILE_TEST
 	select PHYLIB
 	select MDIO_XGENE
diff --git a/drivers/net/ethernet/arc/Kconfig b/drivers/net/ethernet/arc/Kconfig
index e743ddf46343302f..5d0ab8e74b680cc6 100644
--- a/drivers/net/ethernet/arc/Kconfig
+++ b/drivers/net/ethernet/arc/Kconfig
@@ -24,7 +24,8 @@ config ARC_EMAC_CORE
 config ARC_EMAC
 	tristate "ARC EMAC support"
 	select ARC_EMAC_CORE
-	depends on OF_IRQ && OF_NET && HAS_DMA && (ARC || COMPILE_TEST)
+	depends on OF_IRQ && OF_NET
+	depends on ARC || COMPILE_TEST
 	---help---
 	  On some legacy ARC (Synopsys) FPGA boards such as ARCAngel4/ML50x
 	  non-standard on-chip ethernet device ARC EMAC 10/100 is used.
@@ -33,7 +34,8 @@ config ARC_EMAC
 config EMAC_ROCKCHIP
 	tristate "Rockchip EMAC support"
 	select ARC_EMAC_CORE
-	depends on OF_IRQ && OF_NET && REGULATOR && HAS_DMA && (ARCH_ROCKCHIP || COMPILE_TEST)
+	depends on OF_IRQ && OF_NET && REGULATOR
+	depends on ARCH_ROCKCHIP || COMPILE_TEST
 	---help---
 	  Support for Rockchip RK3036/RK3066/RK3188 EMAC ethernet controllers.
 	  This selects Rockchip SoC glue layer support for the
diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index af75156919edfead..4c3bfde6e8de00f2 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -157,7 +157,6 @@ config BGMAC
 config BGMAC_BCMA
 	tristate "Broadcom iProc GBit BCMA support"
 	depends on BCMA && BCMA_HOST_SOC
-	depends on HAS_DMA
 	depends on BCM47XX || ARCH_BCM_5301X || COMPILE_TEST
 	select BGMAC
 	select PHYLIB
@@ -170,7 +169,6 @@ config BGMAC_BCMA
 
 config BGMAC_PLATFORM
 	tristate "Broadcom iProc GBit platform support"
-	depends on HAS_DMA
 	depends on ARCH_BCM_IPROC || COMPILE_TEST
 	depends on OF
 	select BGMAC
diff --git a/drivers/net/ethernet/calxeda/Kconfig b/drivers/net/ethernet/calxeda/Kconfig
index 07d2201530d26c85..9fdd496b90ff47cb 100644
--- a/drivers/net/ethernet/calxeda/Kconfig
+++ b/drivers/net/ethernet/calxeda/Kconfig
@@ -1,6 +1,6 @@
 config NET_CALXEDA_XGMAC
 	tristate "Calxeda 1G/10G XGMAC Ethernet driver"
-	depends on HAS_IOMEM && HAS_DMA
+	depends on HAS_IOMEM
 	depends on ARCH_HIGHBANK || COMPILE_TEST
 	select CRC32
 	help
diff --git a/drivers/net/ethernet/hisilicon/Kconfig b/drivers/net/ethernet/hisilicon/Kconfig
index 8bcf470ff5f38a4e..fb1a7251f45d3369 100644
--- a/drivers/net/ethernet/hisilicon/Kconfig
+++ b/drivers/net/ethernet/hisilicon/Kconfig
@@ -5,7 +5,7 @@
 config NET_VENDOR_HISILICON
 	bool "Hisilicon devices"
 	default y
-	depends on (OF || ACPI) && HAS_DMA
+	depends on OF || ACPI
 	depends on ARM || ARM64 || COMPILE_TEST
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y.
diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index cc2f7701e71e1b03..f33fd22b351c856a 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -18,8 +18,8 @@ if NET_VENDOR_MARVELL
 
 config MV643XX_ETH
 	tristate "Marvell Discovery (643XX) and Orion ethernet support"
-	depends on (MV64X60 || PPC32 || PLAT_ORION || COMPILE_TEST) && INET
-	depends on HAS_DMA
+	depends on MV64X60 || PPC32 || PLAT_ORION || COMPILE_TEST
+	depends on INET
 	select PHYLIB
 	select MVMDIO
 	---help---
@@ -58,7 +58,6 @@ config MVNETA_BM_ENABLE
 config MVNETA
 	tristate "Marvell Armada 370/38x/XP/37xx network interface support"
 	depends on ARCH_MVEBU || COMPILE_TEST
-	depends on HAS_DMA
 	select MVMDIO
 	select PHYLINK
 	---help---
@@ -84,7 +83,6 @@ config MVNETA_BM
 config MVPP2
 	tristate "Marvell Armada 375/7K/8K network interface support"
 	depends on ARCH_MVEBU || COMPILE_TEST
-	depends on HAS_DMA
 	select MVMDIO
 	select PHYLINK
 	---help---
@@ -93,7 +91,7 @@ config MVPP2
 
 config PXA168_ETH
 	tristate "Marvell pxa168 ethernet support"
-	depends on HAS_IOMEM && HAS_DMA
+	depends on HAS_IOMEM
 	depends on CPU_PXA168 || ARCH_BERLIN || COMPILE_TEST
 	select PHYLIB
 	---help---
diff --git a/drivers/net/ethernet/mellanox/mlxsw/Kconfig b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
index f4d9c9975ac3d857..82827a8d3d67cac7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
@@ -30,7 +30,7 @@ config MLXSW_CORE_THERMAL
 
 config MLXSW_PCI
 	tristate "PCI bus implementation for Mellanox Technologies Switch ASICs"
-	depends on PCI && HAS_DMA && HAS_IOMEM && MLXSW_CORE
+	depends on PCI && HAS_IOMEM && MLXSW_CORE
 	default m
 	---help---
 	  This is PCI bus implementation for Mellanox Technologies Switch ASICs.
diff --git a/drivers/net/ethernet/renesas/Kconfig b/drivers/net/ethernet/renesas/Kconfig
index 27be51f0a421b43e..f3f7477043ce1061 100644
--- a/drivers/net/ethernet/renesas/Kconfig
+++ b/drivers/net/ethernet/renesas/Kconfig
@@ -17,7 +17,6 @@ if NET_VENDOR_RENESAS
 
 config SH_ETH
 	tristate "Renesas SuperH Ethernet support"
-	depends on HAS_DMA
 	depends on ARCH_RENESAS || SUPERH || COMPILE_TEST
 	select CRC32
 	select MII
@@ -31,7 +30,6 @@ config SH_ETH
 
 config RAVB
 	tristate "Renesas Ethernet AVB support"
-	depends on HAS_DMA
 	depends on ARCH_RENESAS || COMPILE_TEST
 	select CRC32
 	select MII
diff --git a/drivers/net/wireless/broadcom/brcm80211/Kconfig b/drivers/net/wireless/broadcom/brcm80211/Kconfig
index 9d99eb42d9176f0f..6acba67bca07abd7 100644
--- a/drivers/net/wireless/broadcom/brcm80211/Kconfig
+++ b/drivers/net/wireless/broadcom/brcm80211/Kconfig
@@ -60,7 +60,6 @@ config BRCMFMAC_PCIE
 	bool "PCIE bus interface support for FullMAC driver"
 	depends on BRCMFMAC
 	depends on PCI
-	depends on HAS_DMA
 	select BRCMFMAC_PROTO_MSGBUF
 	select FW_LOADER
 	---help---
diff --git a/drivers/net/wireless/quantenna/qtnfmac/Kconfig b/drivers/net/wireless/quantenna/qtnfmac/Kconfig
index 025fa6018550895a..8d1492a90bd135c0 100644
--- a/drivers/net/wireless/quantenna/qtnfmac/Kconfig
+++ b/drivers/net/wireless/quantenna/qtnfmac/Kconfig
@@ -7,7 +7,7 @@ config QTNFMAC
 config QTNFMAC_PEARL_PCIE
 	tristate "Quantenna QSR10g PCIe support"
 	default n
-	depends on HAS_DMA && PCI && CFG80211
+	depends on PCI && CFG80211
 	select QTNFMAC
 	select FW_LOADER
 	select CRC32
-- 
2.17.1

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Okash Khawaja @ 2018-06-22 11:17 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jakub Kicinski, Daniel Borkmann, Alexei Starovoitov,
	Yonghong Song, Quentin Monnet, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <20180622012052.htkvholi674x6i4f@kafai-mbp.dhcp.thefacebook.com>

On Thu, Jun 21, 2018 at 06:20:52PM -0700, Martin KaFai Lau wrote:
> On Thu, Jun 21, 2018 at 05:25:23PM -0700, Jakub Kicinski wrote:
> > On Thu, 21 Jun 2018 16:58:15 -0700, Martin KaFai Lau wrote:
> > > On Thu, Jun 21, 2018 at 04:07:19PM -0700, Jakub Kicinski wrote:
> > > > On Thu, 21 Jun 2018 15:51:17 -0700, Martin KaFai Lau wrote:  
> > > > > On Thu, Jun 21, 2018 at 02:59:35PM -0700, Jakub Kicinski wrote:  
> > > > > > On Wed, 20 Jun 2018 13:30:53 -0700, Okash Khawaja wrote:    
> > > > > > > $ sudo bpftool map dump -p id 14
> > > > > > > [{
> > > > > > >         "key": 0
> > > > > > >     },{
> > > > > > >         "value": {
> > > > > > >             "m": 1,
> > > > > > >             "n": 2,
> > > > > > >             "o": "c",
> > > > > > >             "p": [15,16,17,18,15,16,17,18
> > > > > > >             ],
> > > > > > >             "q": [[25,26,27,28,25,26,27,28
> > > > > > >                 ],[35,36,37,38,35,36,37,38
> > > > > > >                 ],[45,46,47,48,45,46,47,48
> > > > > > >                 ],[55,56,57,58,55,56,57,58
> > > > > > >                 ]
> > > > > > >             ],
> > > > > > >             "r": 1,
> > > > > > >             "s": 0x7ffff6f70568,
> > > > > > >             "t": {
> > > > > > >                 "x": 5,
> > > > > > >                 "y": 10
> > > > > > >             },
> > > > > > >             "u": 100,
> > > > > > >             "v": 20,
> > > > > > >             "w1": 0x7,
> > > > > > >             "w2": 0x3
> > > > > > >         }
> > > > > > >     }
> > > > > > > ]    
> > > > > > 
> > > > > > I don't think this format is okay, JSON output is an API you shouldn't
> > > > > > break.  You can change the non-JSON output whatever way you like, but
> > > > > > JSON must remain backwards compatible.
> > > > > > 
> > > > > > The dump today has object per entry, e.g.:
> > > > > > 
> > > > > > {
> > > > > >         "key":["0x00","0x00","0x00","0x00",
> > > > > >         ],
> > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > >         ]
> > > > > > }
> > > > > > 
> > > > > > This format must remain, you may only augment it with new fields.  E.g.:
> > > > > > 
> > > > > > {
> > > > > >         "key":["0x00","0x00","0x00","0x00",
> > > > > >         ],
> > > > > > 	"key_struct":{
> > > > > > 		"index":0
> > > > > > 	},
> Got a few questions.
> 
> When we support hashtab later, the key could be int
> but reusing the name as "index" is weird.
> The key could also be a struct (e.g. a struct to describe ip:port).
> Can you suggest how the "key_struct" will look like?
> 
> > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > >         ],
> > > > > > 	"value_struct":{
> > > > > > 		"src_ip":2,
> If for the same map the user changes the "src_ip" to an array of int[4]
> later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> Is it breaking backward compat?
> i.e.
> struct five_tuples {
> -	int src_ip;
> +	int src_ip[4];
> /* ... */
> };
> 
> > > > > > 		"dst_ip:0
> > > > > > 	}
> > > > > > }    
> > > > > I am not sure how useful to have both "key|value" and "(key|value)_struct"
> > > > > while most people would prefer "key_struct"/"value_struct" if it is
> > > > > available.  
> > > > 
> > > > Agreed, it's not that useful, especially with the string-hex debacle :(
> > > > It's just about the backwards compat.
> > > >   
> > > > > How about introducing a new option, like "-b", to print the
> > > > > map with BTF (if available) such that it won't break the existing
> > > > > one (-j or -p) while the "-b" output can keep using the "key"
> > > > > and "value".
> > > > > 
> > > > > The existing json can be kept as is.  
> > > > 
> > > > That was my knee jerk reaction too, but on reflection it doesn't sound
> > > > that great.  We expect people with new-enough bpftool to use btf, so it
> > > > should be available in the default output, without hiding it behind a
> > > > switch.  We could add a switch to hide the old output, but that doesn't
> > > > give us back the names...  What about Key and Value or k and v?  Or
> > > > key_fields and value_fields?  
> > > I thought the current default output is "plain" ;)
> > > Having said that, yes, the btf is currently printed in json.
> > > 
> > > Ideally, the default json output should do what most people want:
> > > print btf and btf only (if it is available).
> > > but I don't see a way out without new option if we need to
> > > be backward compat :(
> > > 
> > > Agree that showing the btf in the existing json output will be useful (e.g.
> > > to hint people that BTF is available).  If btf is showing in old json,
> > > also agree that the names should be the same with the new json.
> > > key_fields and value_fields may hint it has >1 fields though.
> > > May be "formatted_key" and "formatted_value"?
> > 
> > SGTM!  Or even maybe as a "formatted" object?:
> > 
> > {
> >          "key":["0x00","0x00","0x00","0x00"
> >          ],
> >          "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> >          ],
> > 	"formatted":{
> > 	 	"key":{
> >  			"index":0
> > 	 	},
> > 	 	"value":{
> >  			"src_ip":2,
> >  			"dst_ip":0
> > 	 	}
> > 	}
> hmm... that is an extra indentation (keep in mind that the "value" could
> already have a few nested structs which itself consumes a few indentations)
> but I guess adding another one may be ok-ish.
> 
> > }  
> > 
> > > > > > The name XYZ_struct may not be the best, perhaps you can come up with a
> > > > > > better one?  
> > > > > > 
> > > > > > Does that make sense?  Am I missing what you're doing here?
> > > > > > 
> > > > > > One process note - please make sure you run checkpatch.pl --strict on
> > > > > > bpftool patches before posting.
> > > > > > 
> > > > > > Thanks for working on this!    
> > 

Hi,

While I agree on the point of backward compatibility, I think printing
two overlapping pieces of information side-by-side will make the
interface less clear. Having separate outputs for the two will keep the
interface clear and readable.

Is there a major downside to adding a new flag for BTF output?
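
For illustration only, here is a minimal C sketch of the backward-compatible layout discussed in the quoted thread: the legacy "key"/"value" hex arrays stay as they are, and a "formatted" object is added only when BTF is available. This is not bpftool code. emit_entry(), append(), append_hex() and struct flow_value are hypothetical names made up for the example, and the field names are borrowed from the quoted example.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Made-up value layout for the example, mirroring the quoted field names. */
struct flow_value {
	int src_ip;
	int dst_ip;
};

/* Append formatted text at *off; stops writing once the buffer is full. */
static void append(char *out, size_t cap, size_t *off, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (*off >= cap)
		return;
	va_start(ap, fmt);
	n = vsnprintf(out + *off, cap - *off, fmt, ap);
	va_end(ap);
	if (n > 0)
		*off += (size_t)n;
}

/* Legacy representation: raw bytes as a JSON array of hex strings. */
static void append_hex(char *out, size_t cap, size_t *off,
		       const void *data, size_t len)
{
	const unsigned char *b = data;
	size_t i;

	append(out, cap, off, "[");
	for (i = 0; i < len; i++)
		append(out, cap, off, "%s\"0x%02x\"", i ? "," : "", b[i]);
	append(out, cap, off, "]");
}

/*
 * Emit one map entry. Existing consumers keep seeing "key" and "value";
 * the "formatted" object is purely additive. The return value is >= cap
 * if the output was truncated.
 */
size_t emit_entry(char *out, size_t cap, int index,
		  const struct flow_value *v, int have_btf)
{
	size_t off = 0;

	append(out, cap, &off, "{\"key\":");
	append_hex(out, cap, &off, &index, sizeof(index));
	append(out, cap, &off, ",\"value\":");
	append_hex(out, cap, &off, v, sizeof(*v));
	if (have_btf)
		append(out, cap, &off,
		       ",\"formatted\":{\"key\":{\"index\":%d},"
		       "\"value\":{\"src_ip\":%d,\"dst_ip\":%d}}",
		       index, v->src_ip, v->dst_ip);
	append(out, cap, &off, "}");
	return off;
}
```

A separate flag, as suggested here, would amount to replacing the have_btf branch with a different top-level layout instead of adding to the existing one.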

Thanks,
Okash

^ permalink raw reply

* Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Peter Robinson @ 2018-06-22 11:19 UTC (permalink / raw)
  To: netdev, linux-arm-kernel; +Cc: labbott

Hi All,

I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
(doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
others, both LPAE/normal kernels.

I'm a bit out of my depth in this part of the kernel, but I'm wondering
if it's a known issue; I couldn't find anything that looked obvious on a few
mailing lists.

Peter

[    9.955543] Modules linked in:
[    9.955562] CPU: 1 PID: 213 Comm: systemd-udevd Tainted: G      D
        4.18.0-0.rc1.git0.1.fc29.armv7hl #1
[    9.955566] Hardware name: BCM2835
[    9.955584] PC is at sk_filter_trim_cap+0x15c/0x1b8
[    9.955590] LR is at   (null)
[    9.955597] pc : [<c09d4d58>]    lr : [<00000000>]    psr: 60000013
[    9.955602] sp : c2cf9d58  ip : 00000000  fp : 00000000
[    9.955608] r10: ef2c3c00  r9 : c13093c0  r8 : 00000000
[    9.955615] r7 : 00000000  r6 : 00000001  r5 : f0f6a000  r4 : 00000000
[    9.955621] r3 : 00000007  r2 : 00000000  r1 : 00000000  r0 : 00000000
[    9.955629] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[    9.955640] Control: 10c5387d  Table: 02e6406a  DAC: 00000051
[    9.963334] Unable to handle kernel NULL pointer dereference at
virtual address 0000000c
[    9.964631] Process systemd-udevd (pid: 213, stack limit = 0x(ptrval))
[    9.964640] Stack: (0xc2cf9d58 to 0xc2cfa000)
[    9.964649] 9d40:
    00000000 c2c90540
[    9.964663] 9d60: 006000c0 00000000 00000000 c09a233c c2c90b40
c2c90b40 c2c90540 00000000
[    9.964678] 9d80: 00000000 00000000 c13093c0 c09fa2bc 006000c0
00000001 ee7f1800 00000000
[    9.964691] 9da0: 00000002 00000000 00000001 ef2c3c64 c2cf9f70
00000002 c2c90540 00000000
[    9.964706] 9dc0: c2cf9f68 00000083 ee7f1800 00000008 00000000
c09fa3b8 006000c0 00000000
[    9.964724] 9de0: 00000000 00000002 00000002 c09fc704 006000c0
00000000 ee7c7c00 00000000
[    9.976159] pgd = (ptrval)
[    9.979536] 9e00: 000000d5 00000000 00000000 00000000 c126a314
c2cf9f68 eec77880 c2cf9e50
[    9.979550] 9e20: 00000040 00000000 eec77880 00000000 00000000
c099a624 c2cf9f68 00000000
[    9.979565] 9e40: c2cf9e50 c099ae48 00000100 00000000 00000080
c04ab918 ee78e8c0 7fff0000
[    9.979580] 9e60: c2cf9e90 c2cf9eec ffff0000 000000a0 bed817e4
00000028 01a040a8 0000005b
[    9.979594] 9e80: 00000000 00000000 00000000 01a0ef00 00000128
40000028 b6cd9548 00000000
[    9.979607] 9ea0: 0000000d 00000000 bed817b8 00000000 00000010
00000000 00000002 00000000
[    9.985866] [0000000c] *pgd=00000000
[    9.988810] 9ec0: 00000000 00000000 01a0ef00 00000000 c2cf9fb0
00000128 bed817b8 00000000
[    9.988825] 9ee0: 00000000 c0407f18 00000000 00000000 c120bbec
b6e2ba00 c2cf9fb0 10c5387d
[    9.988841] 9f00: 01a0efb8 bed81720 bed81728 c03165fc 00005010
00001000 3e600000 c04ced24
[    9.988855] 9f20: ee4b5010 00000ff0 ee4b5000 00000000 ee4b6000
eec77880 bed817b8 00000000
[    9.988875] 9f40: 00000128 c0301204 c2cf8000 00000128 00000000
c099bc5c 00000000 00000000
[   10.000948] 9f60: 00000000 fffffff7 c2cf9eb0 0000000c 00000001
00000000 00000000 c2cf9e80
[   10.000961] 9f80: 00000000 c030ac08 00000000 00000000 00000040
00000000 00000000 01a0ef00
[   10.000976] 9fa0: bed817b8 c03011d4 00000000 01a0ef00 0000000d
bed817b8 00000000 00000000
[   10.000995] 9fc0: 00000000 01a0ef00 bed817b8 00000128 0000005b
01a0af00 01a0f620 00000000
[   10.228876] 9fe0: b6f9fad4 bed81780 b6de4780 b6cd9548 60000010
0000000d 00000000 00000000
[   10.237081] [<c09d4d58>] (sk_filter_trim_cap) from [<c09fa2bc>]
(netlink_broadcast_filtered+0x304/0x3dc)
[   10.246575] [<c09fa2bc>] (netlink_broadcast_filtered) from
[<c09fa3b8>] (netlink_broadcast+0x24/0x2c)
[   10.255806] [<c09fa3b8>] (netlink_broadcast) from [<c09fc704>]
(netlink_sendmsg+0x30c/0x340)
[   10.264258] [<c09fc704>] (netlink_sendmsg) from [<c099a624>]
(sock_sendmsg+0x3c/0x4c)
[   10.272100] [<c099a624>] (sock_sendmsg) from [<c099ae48>]
(___sys_sendmsg+0x1d8/0x218)
[   10.280030] [<c099ae48>] (___sys_sendmsg) from [<c099bc5c>]
(__sys_sendmsg+0x48/0x6c)
[   10.287872] [<c099bc5c>] (__sys_sendmsg) from [<c03011d4>]
(__sys_trace_return+0x0/0x10)
[   10.295962] Exception stack(0xc2cf9fa8 to 0xc2cf9ff0)
[   10.301018] 9fa0:                   00000000 01a0ef00 0000000d
bed817b8 00000000 00000000
[   10.309202] 9fc0: 00000000 01a0ef00 bed817b8 00000128 0000005b
01a0af00 01a0f620 00000000
[   10.317381] 9fe0: b6f9fad4 bed81780 b6de4780 b6cd9548
[   10.322442] Code: 1afffff7 e59c0000 e5830000 e3520000 (e584800c)
[   10.328557] Internal error: Oops: 805 [#8] SMP ARM
[   10.328768] ---[ end trace 2cb865e83300a747 ]---
[   10.333357] Modules linked in:
[   10.333374] CPU: 2 PID: 212 Comm: systemd-udevd Tainted: G      D
        4.18.0-0.rc1.git0.1.fc29.armv7hl #1
[   10.333378] Hardware name: BCM2835
[   10.333396] PC is at sk_filter_trim_cap+0x15c/0x1b8
[   10.333409] LR is at   (null)
[   10.341840] Unable to handle kernel NULL pointer dereference at
virtual address 0000000c
[   10.351172] pc : [<c09d4d58>]    lr : [<00000000>]    psr: 60000013
[   10.351179] sp : c2e5dd58  ip : 00000000  fp : 00000000
[   10.351185] r10: ef2c3c00  r9 : c13093c0  r8 : 00000000
[   10.351192] r7 : 00000000  r6 : 00000001  r5 : f0f6a000  r4 : 00000000
[   10.351198] r3 : 00000007  r2 : 00000000  r1 : 00000000  r0 : 00000000
[   10.351207] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   10.351215] Control: 10c5387d  Table: 02e6006a  DAC: 00000051
[   10.351231] Process systemd-udevd (pid: 212, stack limit = 0x(ptrval))
[   10.354654] pgd = (ptrval)
[   10.359496] Stack: (0xc2e5dd58 to 0xc2e5e000)
[   10.359505] dd40:
    00000000 ef3c0540
[   10.359520] dd60: 006000c0 00000000 00000000 c09a233c ef3c0b40
ef3c0b40 ef3c0540 00000000
[   10.359534] dd80: 00000000 00000000 c13093c0 c09fa2bc 006000c0
00000001 ee7f2000 00000000
[   10.359548] dda0: 00000002 00000000 00000001 ef2c3c64 c2e5df70
00000002 ef3c0540 00000000
[   10.359563] ddc0: c2e5df68 00000084 ee7f2000 00000008 00000000
c09fa3b8 006000c0 00000000
[   10.359585] dde0: 00000000 00000002 00000002 c09fc704 006000c0
00000000 c2c68d00 00000000
[   10.362574] [0000000c] *pgd=00000000
[   10.382706] de00: 000000d4 00000000 00000000 00000000 c126a314
c2e5df68 eec76c40 c2e5de50
[   10.382721] de20: 00000040 00000000 eec76c40 00000000 00000000
c099a624 c2e5df68 00000000
[   10.382735] de40: c2e5de50 c099ae48 00000100 00000000 00000080
c04ab918 ee78e8c0 7fff0000
[   10.382750] de60: c2e5de90 c2e5deec ffff0000 000000a0 bed817e4
00000028 01a040a8 0000005c
[   10.382764] de80: 00000000 00000000 00000000 01a0e0f0 00000128
40000028 b6cd9548 00000000
[   10.382780] dea0: 0000000d 00000000 bed817b8 00000000 00000010
00000000 00000002 00000000
[   10.397129] dec0: 00000000 00000000 01a0e0f0 00000000 c2e5dfb0
00000128 bed817b8 00000000
[   10.397144] dee0: 00000000 c0407f18 00000000 00000000 c120bbec
b6e2ba00 c2e5dfb0 10c5387d
[   10.397159] df00: 01a0e1a8 bed81720 bed81728 c03165fc 00006010
00001000 3e600000 c04ced24
[   10.397174] df20: ef216010 00000ff0 ef216000 00000000 ef217000
eec76c40 bed817b8 00000000
[   10.397189] df40: 00000128 c0301204 c2e5c000 00000128 00000000
c099bc5c 00000000 00000000
[   10.589571] df60: 00000000 fffffff7 c2e5deb0 0000000c 00000001
00000000 00000000 c2e5de80
[   10.589596] df80: 00000000 c030ac08 00000000 00000000 00000040
00000000 00000000 01a0e0f0
[   10.605946] dfa0: bed817b8 c03011d4 00000000 01a0e0f0 0000000d
bed817b8 00000000 00000000
[   10.614131] dfc0: 00000000 01a0e0f0 bed817b8 00000128 0000005c
01a0af00 01a0e920 00000000
[   10.622316] dfe0: b6f9fad4 bed81780 b6de4780 b6cd9548 60000010
0000000d 00000000 00000000
[   10.630594] [<c09d4d58>] (sk_filter_trim_cap) from [<c09fa2bc>]
(netlink_broadcast_filtered+0x304/0x3dc)
[   10.640088] [<c09fa2bc>] (netlink_broadcast_filtered) from
[<c09fa3b8>] (netlink_broadcast+0x24/0x2c)
[   10.650447] [<c09fa3b8>] (netlink_broadcast) from [<c09fc704>]
(netlink_sendmsg+0x30c/0x340)
[   10.658899] [<c09fc704>] (netlink_sendmsg) from [<c099a624>]
(sock_sendmsg+0x3c/0x4c)
[   10.666742] [<c099a624>] (sock_sendmsg) from [<c099ae48>]
(___sys_sendmsg+0x1d8/0x218)
[   10.674673] [<c099ae48>] (___sys_sendmsg) from [<c099bc5c>]
(__sys_sendmsg+0x48/0x6c)
[   10.682515] [<c099bc5c>] (__sys_sendmsg) from [<c03011d4>]
(__sys_trace_return+0x0/0x10)
[   10.690604] Exception stack(0xc2e5dfa8 to 0xc2e5dff0)
[   10.695660] dfa0:                   00000000 01a0e0f0 0000000d
bed817b8 00000000 00000000
[   10.703845] dfc0: 00000000 01a0e0f0 bed817b8 00000128 0000005c
01a0af00 01a0e920 00000000
[   10.712025] dfe0: b6f9fad4 bed81780 b6de4780 b6cd9548
[   10.717086] Code: 1afffff7 e59c0000 e5830000 e3520000 (e584800c)
[   10.723199] Internal error: Oops: 805 [#9] SMP ARM
[   10.723343] ---[ end trace 2cb865e83300a748 ]---

^ permalink raw reply

* Re: Crash in netlink/sk_filter_trim_cap on ARMv7 on 4.18rc1
From: Eric Dumazet @ 2018-06-22 12:55 UTC (permalink / raw)
  To: Peter Robinson, netdev, linux-arm-kernel; +Cc: labbott
In-Reply-To: <CALeDE9PP__kPHX_aW24kwzGf9BgA0gQOQJSY+Qw0yFMOLn4Pcw@mail.gmail.com>



On 06/22/2018 04:19 AM, Peter Robinson wrote:
> Hi All,
> 
> I'm seeing this netlink/sk_filter_trim_cap crash on ARMv7 across quite
> a few ARMv7 platforms on Fedora with 4.18rc1. I've tested RPi2/RPi3
> (doesn't happen on aarch64), AllWinner H3, BeagleBone and a few
> others, both LPAE/normal kernels.
> 
> I'm a bit out of my depth in this part of the kernel but I'm wondering
> if it's known, I couldn't find anything that looked obvious on a few
> mailing lists.
> 
> Peter

Hi Peter

Could you provide symbolic information ?

Thanks !

^ permalink raw reply

* Re: bnx2x: kernel panic in the bnx2x driver
From: Vishwanath Pai @ 2018-06-22 13:22 UTC (permalink / raw)
  To: Kalluru, Sudarsana, Elior, Ariel, Dept-Eng Everest Linux L2
  Cc: davem@davemloft.net, netdev@vger.kernel.org, dbanerje@akamai.com,
	pai.vishwain@gmail.com
In-Reply-To: <MW2PR07MB4139C0F5622B56A2294A80AC8A750@MW2PR07MB4139.namprd07.prod.outlook.com>

Hi Sudarsana,

Thanks for taking a look at my email. The fix you suggested would
definitely fix the kernel panic, but at the same time wouldn't it also
silently ignore the request by ethtool to set rx-flow-hash?

Thanks,
Vishwanath

On 06/22/2018 06:20 AM, Kalluru, Sudarsana wrote:
> Hi Vishwanath,
>     Thanks for your mail, and the analysis.
> The fix would be to invoke bnx2x_rss() only when the device is opened,
> 	if (bp->state == BNX2X_STATE_OPEN)
> 		return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
> 	else
> 		return 0;
> Ariel,
>    Could you please review the path (bnx2x_set_rss_flags()--> bnx2x_rss()) and confirm/correct on the above.
> 
> Thanks,
> Sudarsana
> 
> -----Original Message-----
> From: Vishwanath Pai [mailto:vpai@akamai.com] 
> Sent: 22 June 2018 10:37
> To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
> Subject: bnx2x: kernel panic in the bnx2x driver
> 
> External Email
> 
> Hi,
> 
> We recently noticed a kernel panic in the bnx2x driver when trying to set rx-flow-hash parameters via ethtool during if-pre-up.d. I am running kernel
> v4.17.2 from ubuntu-mainline-ppa. I have added the stack trace below:
> 
> [   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
> [   18.280214] Oops: 0010 [#1] SMP PTI
> [   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_leds kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_generic aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crypto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha256_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
> [   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
> [   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
> [   18.280243] RIP: 0010:          (null)
> [   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
> [   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
> [   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
> [   18.280245] RBP: ffffb84bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
> [   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
> [   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
> [   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
> [   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
> [   18.280249] Call Trace:
> [   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
> [   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
> [   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
> [   18.280279]  ethtool_set_rxnfc+0x9b/0x110
> [   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
> [   18.280283]  ? security_capable+0x3c/0x60
> [   18.280284]  dev_ethtool+0x350/0x2610
> [   18.280286]  ? page_cache_async_readahead+0x71/0x80
> [   18.280288]  ? page_add_file_rmap+0x5d/0x220
> [   18.280290]  ? inet_ioctl+0x182/0x1a0
> [   18.280291]  dev_ioctl+0x203/0x3f0
> [   18.280293]  ? dev_ioctl+0x203/0x3f0
> [   18.280294]  sock_do_ioctl+0xae/0x150
> [   18.280296]  sock_ioctl+0x1e2/0x330
> [   18.280296]  ? sock_ioctl+0x1e2/0x330
> [   18.280299]  do_vfs_ioctl+0xa8/0x620
> [   18.280300]  ? dlci_ioctl_set+0x30/0x30
> [   18.280301]  ? do_vfs_ioctl+0xa8/0x620
> [   18.280302]  ? handle_mm_fault+0xe3/0x220
> [   18.280304]  ksys_ioctl+0x75/0x80
> [   18.280305]  __x64_sys_ioctl+0x1a/0x20
> [   18.280307]  do_syscall_64+0x5a/0x120
> [   18.280309]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   18.280310] RIP: 0033:0x7fc0e7fba107
> [   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
> [   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
> [   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
> [   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
> [   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
> [   18.280315] Code:  Bad RIP value.
> [   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
> [   18.280318] CR2: 0000000000000000
> [   18.280319] ---[ end trace 5f361db3fb9059f1 ]---
> 
> To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
> ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
> ethtool -N $IFACE rx-flow-hash udp6 "sdfn"
> 
> The problem here is that rss_obj in the bnx2x struct for the device hasn't been initialized yet, which causes an exception in bnx2x_config_rss() when calling "r->set_pending(r)" because r->set_pending is NULL. It looks like many other things haven't been initialized at this point either; most of that happens in
> "bnx2x_init_bp_objs()", which isn't called until ifup. Any thoughts on how this can be fixed? Would it be possible to safely move bnx2x_init_bp_objs() to, say, bnx2x_init_one(), which runs well before ifup?
> 
> Thanks,
> Vishwanath
> 

^ permalink raw reply

* [PATCH net 0/2] net: dccp: fixes around rx_tstamp_last_feedback
From: Eric Dumazet @ 2018-06-22 13:44 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Gerrit Renker, dccp, Eric Dumazet, Eric Dumazet

This patch series fixes some issues with rx_tstamp_last_feedback.

- Switch to monotonic clock.
- Avoid potential overflows on fast hosts/networks.

Eric Dumazet (2):
  net: dccp: avoid crash in ccid3_hc_rx_send_feedback()
  net: dccp: switch rx_tstamp_last_feedback to monotonic clock

 net/dccp/ccids/ccid3.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

-- 
2.18.0.rc2.346.g013aa6912e-goog

^ permalink raw reply

* [PATCH net 1/2] net: dccp: avoid crash in ccid3_hc_rx_send_feedback()
From: Eric Dumazet @ 2018-06-22 13:44 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Gerrit Renker, dccp, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180622134415.104266-1-edumazet@google.com>

On fast hosts or malicious bots, we trigger a DCCP_BUG() which
seems excessive.

syzbot reported :

BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback()
CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ #112
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline]
 ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793
 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
 dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
 sk_backlog_rcv include/net/sock.h:914 [inline]
 __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517
 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875
 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256
 dst_input include/net/dst.h:450 [inline]
 ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492
 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
 process_backlog+0x219/0x760 net/core/dev.c:5373
 napi_poll net/core/dev.c:5771 [inline]
 net_rx_action+0x7da/0x1980 net/core/dev.c:5837
 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284
 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
 kthread+0x345/0x410 kernel/kthread.c:240
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
---
 net/dccp/ccids/ccid3.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index 8b5ba6dffac7ebc88fd21075793dc3db43a74a43..d57a2be1e2e09aee89347e286aca538303b7dee1 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -625,9 +625,8 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk,
 	case CCID3_FBACK_PERIODIC:
 		delta = ktime_us_delta(now, hc->rx_tstamp_last_feedback);
 		if (delta <= 0)
-			DCCP_BUG("delta (%ld) <= 0", (long)delta);
-		else
-			hc->rx_x_recv = scaled_div32(hc->rx_bytes_recv, delta);
+			delta = 1;
+		hc->rx_x_recv = scaled_div32(hc->rx_bytes_recv, delta);
 		break;
 	default:
 		return;
-- 
2.18.0.rc2.346.g013aa6912e-goog
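
As a standalone illustration of the clamp in the patch above, here is a small userspace model. It is not the kernel code: recv_rate() is a hypothetical stand-in for the scaled_div32() call, and saturating at UINT32_MAX is an assumption made for this sketch. The point is only that a non-positive interval is treated as 1 usec, so the rate is always computed and never divides by zero.

```c
#include <stdint.h>

/*
 * Userspace model only. Returns bytes per second for bytes_recv bytes
 * received over delta_us microseconds.
 */
uint32_t recv_rate(uint32_t bytes_recv, int64_t delta_us)
{
	uint64_t rate;

	/* Same idea as the patch: clamp instead of reporting a bug. */
	if (delta_us <= 0)
		delta_us = 1;

	rate = (uint64_t)bytes_recv * 1000000ULL / (uint64_t)delta_us;

	/* Assumption for this sketch: saturate rather than wrap. */
	return rate > UINT32_MAX ? UINT32_MAX : (uint32_t)rate;
}
```

With the delta from the syzbot report (-6195), recv_rate(1000, -6195) is computed as if 1 usec had elapsed.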

^ permalink raw reply related

* [PATCH net 2/2] net: dccp: switch rx_tstamp_last_feedback to monotonic clock
From: Eric Dumazet @ 2018-06-22 13:44 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Gerrit Renker, dccp, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180622134415.104266-1-edumazet@google.com>

To compute delays, it is better not to use the time of day, which can
be changed by admins or malicious programs.

Also change ccid3_first_li() to use s64 type for delta variable
to avoid potential overflows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
---
 net/dccp/ccids/ccid3.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index d57a2be1e2e09aee89347e286aca538303b7dee1..12877a1514e7b8e873cd26529e58f7ebaae99c1a 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -600,7 +600,7 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk,
 {
 	struct ccid3_hc_rx_sock *hc = ccid3_hc_rx_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
-	ktime_t now = ktime_get_real();
+	ktime_t now = ktime_get();
 	s64 delta = 0;
 
 	switch (fbtype) {
@@ -632,7 +632,7 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk,
 		return;
 	}
 
-	ccid3_pr_debug("Interval %ldusec, X_recv=%u, 1/p=%u\n", (long)delta,
+	ccid3_pr_debug("Interval %lldusec, X_recv=%u, 1/p=%u\n", delta,
 		       hc->rx_x_recv, hc->rx_pinv);
 
 	hc->rx_tstamp_last_feedback = now;
@@ -679,7 +679,8 @@ static int ccid3_hc_rx_insert_options(struct sock *sk, struct sk_buff *skb)
 static u32 ccid3_first_li(struct sock *sk)
 {
 	struct ccid3_hc_rx_sock *hc = ccid3_hc_rx_sk(sk);
-	u32 x_recv, p, delta;
+	u32 x_recv, p;
+	s64 delta;
 	u64 fval;
 
 	if (hc->rx_rtt == 0) {
@@ -687,7 +688,9 @@ static u32 ccid3_first_li(struct sock *sk)
 		hc->rx_rtt = DCCP_FALLBACK_RTT;
 	}
 
-	delta  = ktime_to_us(net_timedelta(hc->rx_tstamp_last_feedback));
+	delta = ktime_us_delta(ktime_get(), hc->rx_tstamp_last_feedback);
+	if (delta <= 0)
+		delta = 1;
 	x_recv = scaled_div32(hc->rx_bytes_recv, delta);
 	if (x_recv == 0) {		/* would also trigger divide-by-zero */
 		DCCP_WARN("X_recv==0\n");
-- 
2.18.0.rc2.346.g013aa6912e-goog
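
The same idea can be shown in userspace, as a rough sketch rather than anything taken from the patch: CLOCK_MONOTONIC is the userspace counterpart of ktime_get(), while CLOCK_REALTIME corresponds to ktime_get_real() and can jump when the wall clock is set. monotonic_us() and elapsed_us() are hypothetical helper names; the 64-bit signed delta and the clamp to 1 mirror what the patch does in ccid3_first_li().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std= */
#endif
#include <stdint.h>
#include <time.h>

/* Current monotonic time in microseconds; unaffected by settimeofday(). */
int64_t monotonic_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/*
 * Microseconds elapsed since an earlier monotonic_us() sample.
 * The delta is kept in a signed 64-bit value and clamped to at least 1,
 * so it is always safe to use as a divisor.
 */
int64_t elapsed_us(int64_t earlier_us)
{
	int64_t delta = monotonic_us() - earlier_us;

	return delta > 0 ? delta : 1;
}
```

A caller would store monotonic_us() when feedback is sent and pass that value to elapsed_us() on the next computation.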

^ permalink raw reply related

* [PATCH net V3 1/1] net/smc: coordinate wait queues for nonblocking connect
From: Ursula Braun @ 2018-06-22 14:01 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun,
	xiyou.wangcong, hch

The recent poll change may lead to stalls for non-blocking connecting
SMC sockets, since sock_poll_wait is no longer performed on the
internal CLC socket, but on the outer SMC socket.  kernel_connect() on
the internal CLC socket returns with -EINPROGRESS, but the wake up
logic does not work in all cases. If the internal CLC socket is still
in state TCP_SYN_SENT when polled, sock_poll_wait() from sock_poll()
does not sleep. It is supposed to sleep till the state of the internal
CLC socket switches to TCP_ESTABLISHED.

This patch temporarily propagates the wait queue from the internal
CLC sock to the SMC sock, till the non-blocking connect() is
finished.

In addition locking is reduced due to the removed poll waits.

Fixes: c0129a061442 ("smc: convert to ->poll_mask")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
 net/smc/af_smc.c | 13 +++++++++----
 net/smc/smc.h    |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index da7f02edcd37..7966e7ddb563 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -23,6 +23,7 @@
 #include <linux/workqueue.h>
 #include <linux/in.h>
 #include <linux/sched/signal.h>
+#include <linux/rcupdate.h>
 
 #include <net/sock.h>
 #include <net/tcp.h>
@@ -605,6 +606,11 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
 
 	smc_copy_sock_settings_to_clc(smc);
 	tcp_sk(smc->clcsock->sk)->syn_smc = 1;
+	if (flags & O_NONBLOCK) {
+		smc->smcwq = rcu_access_pointer(sk->sk_wq);
+		rcu_assign_pointer(sock->sk->sk_wq,
+				   rcu_access_pointer(smc->clcsock->sk->sk_wq));
+	}
 	rc = kernel_connect(smc->clcsock, addr, alen, flags);
 	if (rc)
 		goto out;
@@ -1285,12 +1291,9 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 
 	smc = smc_sk(sock->sk);
 	sock_hold(sk);
-	lock_sock(sk);
 	if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
 		/* delegate to CLC child sock */
-		release_sock(sk);
 		mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
-		lock_sock(sk);
 		sk->sk_err = smc->clcsock->sk->sk_err;
 		if (sk->sk_err) {
 			mask |= EPOLLERR;
@@ -1299,7 +1302,10 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 			if (sk->sk_state == SMC_INIT &&
 			    mask & EPOLLOUT &&
 			    smc->clcsock->sk->sk_state != TCP_CLOSE) {
+				lock_sock(sk);
+				rcu_assign_pointer(sock->sk->sk_wq, smc->smcwq);
 				rc = __smc_connect(smc);
+				release_sock(sk);
 				if (rc < 0)
 					mask |= EPOLLERR;
 				/* success cases including fallback */
@@ -1334,7 +1340,6 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 			mask |= EPOLLPRI;
 
 	}
-	release_sock(sk);
 	sock_put(sk);
 
 	return mask;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 51ae1f10d81a..89d6d7ef973f 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -190,6 +190,7 @@ struct smc_connection {
 struct smc_sock {				/* smc sock container */
 	struct sock		sk;
 	struct socket		*clcsock;	/* internal tcp socket */
+	struct socket_wq	*smcwq;		/* original smcsock wq */
 	struct smc_connection	conn;		/* smc connection */
 	struct smc_sock		*listen_smc;	/* listen parent */
 	struct work_struct	tcp_listen_work;/* handle tcp socket accepts */
-- 
2.16.4

^ permalink raw reply related

* RE: bnx2x: kernel panic in the bnx2x driver
From: Kalluru, Sudarsana @ 2018-06-22 14:21 UTC (permalink / raw)
  To: Vishwanath Pai, Elior, Ariel, Dept-Eng Everest Linux L2
  Cc: davem@davemloft.net, netdev@vger.kernel.org, dbanerje@akamai.com,
	pai.vishwain@gmail.com
In-Reply-To: <c6283d44-d2c4-5252-9128-f905c2068973@akamai.com>

Hi Vishwanath,
    The config will be cached in the device structure (bp->rss_conf_obj.udp_rss_v4) in this scenario, and will be applied in the load path (bnx2x_nic_load() --> bnx2x_init_rss()). I have unit tested the change on my setup.
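
A minimal sketch of that cache-then-apply behaviour, written as a standalone userspace model rather than driver code. struct fake_dev, set_udp_rss_v4(), dev_open() and push_to_hw() are hypothetical names for illustration; they only mirror the idea that the setting is stored while the device is down and pushed to hardware on load.

```c
#include <stdbool.h>

enum dev_state { DEV_CLOSED, DEV_OPEN };

/* Hypothetical device model, not the bnx2x structures. */
struct fake_dev {
	enum dev_state state;
	bool udp_rss_v4;	/* cached, user-requested setting */
	bool hw_udp_rss_v4;	/* what the "hardware" currently uses */
	int hw_writes;		/* number of times hardware was programmed */
};

/* Stand-in for the step that programs the hardware. */
static int push_to_hw(struct fake_dev *d)
{
	d->hw_udp_rss_v4 = d->udp_rss_v4;
	d->hw_writes++;
	return 0;
}

/* ethtool-style setter: always cache, touch hardware only when open. */
int set_udp_rss_v4(struct fake_dev *d, bool on)
{
	d->udp_rss_v4 = on;
	if (d->state == DEV_OPEN)
		return push_to_hw(d);
	return 0;	/* not lost: applied later by dev_open() */
}

/* Load path: apply whatever was cached while the device was down. */
int dev_open(struct fake_dev *d)
{
	d->state = DEV_OPEN;
	return push_to_hw(d);
}
```

In this model a request made before open returns 0 without programming anything, and the first dev_open() applies it, so the request is deferred rather than silently dropped.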

Thanks,
Sudarsana

-----Original Message-----
From: Vishwanath Pai [mailto:vpai@akamai.com] 
Sent: 22 June 2018 18:52
To: Kalluru, Sudarsana <Sudarsana.Kalluru@cavium.com>; Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
Subject: Re: bnx2x: kernel panic in the bnx2x driver

Hi Sudarsana,

Thanks for taking a look at my email. The fix you suggested would definitely fix the kernel panic, but at the same time wouldn't it also silently ignore the request by ethtool to set rx-flow-hash?

Thanks,
Vishwanath

On 06/22/2018 06:20 AM, Kalluru, Sudarsana wrote:
> Hi Vishwanath,
>     Thanks for your mail, and the analysis.
> The fix would be to invoke bnx2x_rss() only when the device is opened,
>       if (bp->state == BNX2X_STATE_OPEN)
>               return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
>       else
>               return 0;
> Ariel,
>    Could you please review the path (bnx2x_set_rss_flags()--> bnx2x_rss()) and confirm/correct on the above.
>
> Thanks,
> Sudarsana
>
> -----Original Message-----
> From: Vishwanath Pai [mailto:vpai@akamai.com]
> Sent: 22 June 2018 10:37
> To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 
> <Dept-EngEverestLinuxL2@cavium.com>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; 
> pai.vishwain@gmail.com
> Subject: bnx2x: kernel panic in the bnx2x driver
>
> External Email
>
> Hi,
>
> We recently noticed a kernel panic in the bnx2x driver when trying to 
> set rx-flow-hash parameters via ethtool during if-pre-up.d. I am 
> running kernel
> v4.17.2 from ubuntu-mainline-ppa. I have added the stack trace below:
>
> [   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
> [   18.280214] Oops: 0010 [#1] SMP PTI
> [   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_leds kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_generic aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crypto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha256_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
> [   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
> [   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
> [   18.280243] RIP: 0010:          (null)
> [   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
> [   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
> [   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
> [   18.280245] RBP: ffffb84bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
> [   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
> [   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
> [   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
> [   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
> [   18.280249] Call Trace:
> [   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
> [   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
> [   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
> [   18.280279]  ethtool_set_rxnfc+0x9b/0x110
> [   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
> [   18.280283]  ? security_capable+0x3c/0x60
> [   18.280284]  dev_ethtool+0x350/0x2610
> [   18.280286]  ? page_cache_async_readahead+0x71/0x80
> [   18.280288]  ? page_add_file_rmap+0x5d/0x220
> [   18.280290]  ? inet_ioctl+0x182/0x1a0
> [   18.280291]  dev_ioctl+0x203/0x3f0
> [   18.280293]  ? dev_ioctl+0x203/0x3f0
> [   18.280294]  sock_do_ioctl+0xae/0x150
> [   18.280296]  sock_ioctl+0x1e2/0x330
> [   18.280296]  ? sock_ioctl+0x1e2/0x330
> [   18.280299]  do_vfs_ioctl+0xa8/0x620
> [   18.280300]  ? dlci_ioctl_set+0x30/0x30
> [   18.280301]  ? do_vfs_ioctl+0xa8/0x620
> [   18.280302]  ? handle_mm_fault+0xe3/0x220
> [   18.280304]  ksys_ioctl+0x75/0x80
> [   18.280305]  __x64_sys_ioctl+0x1a/0x20
> [   18.280307]  do_syscall_64+0x5a/0x120
> [   18.280309]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   18.280310] RIP: 0033:0x7fc0e7fba107
> [   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
> [   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
> [   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
> [   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
> [   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
> [   18.280315] Code:  Bad RIP value.
> [   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
> [   18.280318] CR2: 0000000000000000
> [   18.280319] ---[ end trace 5f361db3fb9059f1 ]---
>
> To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
> ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
> ethtool -N $IFACE rx-flow-hash udp6 "sdfn"
>
> The problem here is that rss_obj in the bnx2x struct for the device hasn't
> been initialized yet, which causes an exception in bnx2x_config_rss()
> when calling "r->set_pending(r)" because r->set_pending is NULL. It
> looks like many things haven't been initialized at this point;
> most of that happens in bnx2x_init_bp_objs(), which isn't called until
> ifup. Any thoughts on how this can be fixed? Would it be possible to
> safely move bnx2x_init_bp_objs() to maybe bnx2x_init_one(), which runs
> well before ifup?
>
> Thanks,
> Vishwanath
>


^ permalink raw reply

* Re: [PATCH rdma-next 0/2] RoCE ICRC counter
From: Jason Gunthorpe @ 2018-06-22 15:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Mark Bloch,
	Talat Batheesh, Saeed Mahameed, linux-netdev
In-Reply-To: <20180621123756.32645-1-leon@kernel.org>

On Thu, Jun 21, 2018 at 03:37:54PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> Hi,
> 
> This series exposes RoCE ICRC counter through existing RDMA hw_counters
> sysfs interface.
> 
> First patch has all HW definitions in mlx5_ifc.h file and second patch is
> actual counter implementation.
> 
> Thanks
> 
> Talat Batheesh (2):
>   net/mlx5: Add RoCE RX ICRC encapsulated counter
>   IB/mlx5: Support RoCE ICRC encapsulated error counter
> 
>  drivers/infiniband/hw/mlx5/cmd.c     | 12 +++++++
>  drivers/infiniband/hw/mlx5/cmd.h     |  1 +
>  drivers/infiniband/hw/mlx5/main.c    | 62 ++++++++++++++++++++++++++++++++++--
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 +
>  include/linux/mlx5/mlx5_ifc.h        | 11 +++++--
>  5 files changed, 81 insertions(+), 6 deletions(-)

Applied to rdma for-next with the mellanox/mlx5-next branch

Thanks,
Jason

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-22 15:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180621211712-mutt-send-email-mst@kernel.org>

On Thu, 21 Jun 2018 21:20:13 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> > OK, so what about the following:
> > 
> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >   that we have a new uuid field in the virtio-net config space
> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > - when configuring, set the property to the group UUID of the vfio-pci
> >   device
> > - in the guest, use the uuid from the virtio-net device's config space
> >   if applicable; else, fall back to matching by MAC as done today
> > 
> > That should work for all virtio transports.  
> 
> True. I'm a bit unhappy that it's virtio net specific though
> since down the road I expect we'll have a very similar feature
> for scsi (and maybe others).
> 
> But we do not have a way to have fields that are portable
> both across devices and transports, and I think it would
> be a useful addition. How would this work though? Any idea?

Can we introduce some kind of device-independent config space area?
Pushing back the device-specific config space by a certain value if the
appropriate feature is negotiated and use that for things like the uuid?
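
To make the idea concrete, here is a rough sketch of how a driver might locate the device-specific fields if such a shared area were placed at the start of config space. Everything in it is an assumption for illustration only: the feature bit HYP_F_COMMON_CFG_AREA, its value, and the 16-byte area size are hypothetical and are not part of the virtio specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit and area size; neither is in the virtio spec. */
#define HYP_F_COMMON_CFG_AREA	(1ULL << 40)
#define HYP_COMMON_CFG_SIZE	16u	/* room for a 16-byte UUID */

/* Offset at which the device-specific fields start in config space. */
static size_t hyp_dev_cfg_offset(uint64_t negotiated)
{
	return (negotiated & HYP_F_COMMON_CFG_AREA) ? HYP_COMMON_CFG_SIZE : 0;
}

/* Returns the UUID bytes, or NULL if the feature was not negotiated. */
static const uint8_t *hyp_cfg_uuid(const uint8_t *cfg, size_t cfg_len,
				   uint64_t negotiated)
{
	if (!(negotiated & HYP_F_COMMON_CFG_AREA))
		return NULL;
	assert(cfg_len >= HYP_COMMON_CFG_SIZE);
	return cfg;
}

/* Pointer to the device-specific part, wherever it currently starts. */
static const uint8_t *hyp_dev_cfg(const uint8_t *cfg, uint64_t negotiated)
{
	return cfg + hyp_dev_cfg_offset(negotiated);
}
```

Without the feature, the device-specific layout is unchanged (offset 0), so existing drivers would keep working; with it, every device type would find the uuid at the same place.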

But regardless of that, I'm not sure whether extending this approach to
other device types is the way to go. Tying together two different
devices is creating complicated situations at least in the hypervisor
(even if it's fairly straightforward in the guest). [I have not come
around again to look at the "how to handle visibility in QEMU"
questions due to lack of cycles, sorry about that.]

So, what's the goal of this approach? Only to allow migration with
vfio-pci, or also to plug in a faster device and use it instead of an
already attached paravirtualized device?

What about migration of vfio devices that are not easily replaced by a
paravirtualized device? I'm thinking of vfio-ccw, where our main (and
currently only) supported device is dasd (disks) -- which can do a lot
of specialized things that virtio-blk does not support (and should not
or even cannot support). Would it be more helpful to focus on generic
migration support for vfio instead of going about it device by device?

This network device approach already seems far along, so it makes sense
to continue with it. But I'm not sure whether we want to spend time and
energy on that for other device types rather than working on a general
solution for vfio migration.

^ permalink raw reply

* [bpf PATCH v3 0/4] BPF fixes for sockhash
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev

This addresses two syzbot issues that led to identifying (by Eric and
Wei) a class of bugs where we don't correctly check for IPv4/v6
sockets and their associated state. The second issue was a locking
omission in sockhash.

The first patch addresses IPv6 socks, fixing an error where
sockhash would overwrite the prot pointer with the IPv4 prot. To fix
this, build a solution similar to the one in the TLS ULP. We continue
to allow socks in all states, not just ESTABLISHED, in this patch set
because, as Martin points out, there should be no issue with this
on the sockmap ULP: we don't use the ctx in this code.

The other issue syzbot found was that the tcp_close() handler missed
locking the hash bucket lock, which could corrupt the
sockhash bucket list if delete and close ran at the same time.
Also, the smap_list_remove() routine was not working correctly
at all. This was not caught in my testing because in general my
tests (to date at least; let's add some more robust selftests in
bpf-next) do things in the "expected" order: create map, add socks,
delete socks, then tear down maps. The tests we have that do the
ops out of this order were only working on single maps, not on
multiple maps, so we never saw the issue. Thanks syzbot. The fix is to
restructure the tcp_close() lock handling and fix the obvious
bug in smap_list_remove().

Finally, during review I noticed the release handler was omitted
from the upstream code (patch 4) due to an incorrect merge conflict
fix when I ported the code to latest bpf-next before submitting.

v3: rework patches, dropping ESTABLISH check and adding rcu
    annotation along with the smap_list_remove fix

Also big thanks to Martin for the thorough review; he caught at least
one case where I missed a call_rcu().

---

John Fastabend (4):
      bpf: sockmap, fix crash when ipv6 sock is added
      bpf: sockmap, fix smap_list_map_remove when psock is in many maps
      bpf: sockhash fix omitted bucket lock in sock_close
      bpf: sockhash, add release routine

 kernel/bpf/sockmap.c |  210 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 153 insertions(+), 57 deletions(-)

^ permalink raw reply

* [bpf PATCH v3 1/4] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.

Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used.
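
As a rough userspace illustration of the table-driven selection this patch introduces, the sketch below uses a simplified mock_proto in place of the kernel's struct proto. The names (mock_proto, mock_build_protos, mock_pick_proto, FAM_*, CFG_*) are hypothetical; only the overall shape, one ops copy per family and per configuration, follows the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct proto. */
struct mock_proto {
	const char *name;
	int (*sendmsg)(void);
	void (*close)(void);
};

enum { FAM_IPV4, FAM_IPV6, FAM_NUM };
enum { CFG_BASE, CFG_TX, CFG_NUM };

static int mock_bpf_sendmsg(void) { return 1; }
static void mock_bpf_close(void) { }

static struct mock_proto mock_prots[FAM_NUM][CFG_NUM];

/* Copy the family's base ops, then override only the hooked callbacks. */
static void mock_build_protos(struct mock_proto prot[CFG_NUM],
			      const struct mock_proto *base)
{
	prot[CFG_BASE] = *base;
	prot[CFG_BASE].close = mock_bpf_close;

	prot[CFG_TX] = prot[CFG_BASE];
	prot[CFG_TX].sendmsg = mock_bpf_sendmsg;
}

/* Select the variant matching the socket's family and TX-program state. */
static const struct mock_proto *mock_pick_proto(int is_ipv6, int has_tx_prog)
{
	int family = is_ipv6 ? FAM_IPV6 : FAM_IPV4;
	int conf = has_tx_prog ? CFG_TX : CFG_BASE;

	assert(mock_prots[family][conf].name != NULL);	/* built beforehand */
	return &mock_prots[family][conf];
}
```

Because each family keeps its own copy of the base ops, an IPv6 socket never ends up with IPv4 callbacks, which is the crash the patch describes.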

Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 kernel/bpf/sockmap.c |   58 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d8..d7fd17a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
 			    int offset, size_t size, int flags);
+static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
 {
@@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
 	return !empty;
 }
 
-static struct proto tcp_bpf_proto;
+enum {
+	SOCKMAP_IPV4,
+	SOCKMAP_IPV6,
+	SOCKMAP_NUM_PROTS,
+};
+
+enum {
+	SOCKMAP_BASE,
+	SOCKMAP_TX,
+	SOCKMAP_NUM_CONFIGS,
+};
+
+static struct proto *saved_tcpv6_prot __read_mostly;
+static DEFINE_SPINLOCK(tcpv6_prot_lock);
+static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS];
+static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
+			 struct proto *base)
+{
+	prot[SOCKMAP_BASE]			= *base;
+	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
+	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
+	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
+
+	prot[SOCKMAP_TX]			= prot[SOCKMAP_BASE];
+	prot[SOCKMAP_TX].sendmsg		= bpf_tcp_sendmsg;
+	prot[SOCKMAP_TX].sendpage		= bpf_tcp_sendpage;
+}
+
+static void update_sk_prot(struct sock *sk, struct smap_psock *psock)
+{
+	int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4;
+	int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE;
+
+	sk->sk_prot = &bpf_tcp_prots[family][conf];
+}
+
 static int bpf_tcp_init(struct sock *sk)
 {
 	struct smap_psock *psock;
@@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk)
 	psock->save_close = sk->sk_prot->close;
 	psock->sk_proto = sk->sk_prot;
 
-	if (psock->bpf_tx_msg) {
-		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
-		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
-		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
-		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
+	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
+	if (sk->sk_family == AF_INET6 &&
+	    unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
+		spin_lock_bh(&tcpv6_prot_lock);
+		if (likely(sk->sk_prot != saved_tcpv6_prot)) {
+			build_protos(bpf_tcp_prots[SOCKMAP_IPV6], sk->sk_prot);
+			smp_store_release(&saved_tcpv6_prot, sk->sk_prot);
+		}
+		spin_unlock_bh(&tcpv6_prot_lock);
 	}
-
-	sk->sk_prot = &tcp_bpf_proto;
+	update_sk_prot(sk, psock);
 	rcu_read_unlock();
 	return 0;
 }
@@ -1111,8 +1150,7 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
 
 static int bpf_tcp_ulp_register(void)
 {
-	tcp_bpf_proto = tcp_prot;
-	tcp_bpf_proto.close = bpf_tcp_close;
+	build_protos(bpf_tcp_prots[SOCKMAP_IPV4], &tcp_prot);
 	/* Once BPF TX ULP is registered it is never unregistered. It
 	 * will be in the ULP list for the lifetime of the system. Doing
 	 * duplicate registers is not a problem.

^ permalink raw reply related

* [bpf PATCH v3 2/4] bpf: sockmap, fix smap_list_map_remove when psock is in many maps
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

If a hashmap is free'd with open socks it removes the reference to
the hash entry from the psock. If that is the last reference to the
psock then it will also be free'd by the reference counting logic.
However the current logic that removes the hash reference from the
list of references is broken. In smap_list_remove() we first check
if the sockmap entry matches and then check if the hashmap entry
matches. But the sockmap entry will always match because it's NULL in
this case, which causes the first entry to be removed from the list.
If this is always the "right" entry (because the user adds/removes
entries in order) then everything is OK but otherwise a subsequent
bpf_tcp_close() may reference a free'd object.

To fix this create two list handlers one for sockmap and one for
sockhash.
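
The failure mode can be reproduced in a small userspace model. The sketch below is only an illustration with hypothetical names (mock_map_entry, old_remove, hash_remove); it uses an array instead of a kernel list and returns the index of the removed entry so the difference is easy to observe.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified entry: exactly one of 'entry' / 'hash_link' is non-NULL. */
struct mock_map_entry {
	void *entry;		/* sockmap slot, NULL for sockhash entries */
	void *hash_link;	/* sockhash element, NULL for sockmap entries */
	int removed;
};

/* Old combined test: a NULL 'entry' argument matches every sockhash entry. */
static int old_remove(struct mock_map_entry *e, size_t n,
		      void *entry, void *hash_link)
{
	for (size_t i = 0; i < n; i++) {
		if (e[i].removed)
			continue;
		if (e[i].entry == entry || e[i].hash_link == hash_link) {
			e[i].removed = 1;
			return (int)i;
		}
	}
	return -1;
}

/* Dedicated sockhash handler: compares only the hash link. */
static int hash_remove(struct mock_map_entry *e, size_t n, void *hash_link)
{
	assert(hash_link != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!e[i].removed && e[i].hash_link == hash_link) {
			e[i].removed = 1;
			return (int)i;
		}
	}
	return -1;
}
```

With two sockhash entries in the list, asking old_remove() to drop the second one removes the first instead, because NULL == NULL on the sockmap field; hash_remove() picks the intended entry.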

Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d7fd17a..69b26af 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1602,17 +1602,26 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void smap_list_remove(struct smap_psock *psock,
-			     struct sock **entry,
-			     struct htab_elem *hash_link)
+static void smap_list_map_remove(struct smap_psock *psock,
+				 struct sock **entry)
 {
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		if (e->entry == entry || e->hash_link == hash_link) {
+		if (e->entry == entry)
+			list_del(&e->list);
+	}
+}
+static void smap_list_hash_remove(struct smap_psock *psock,
+				  struct htab_elem *hash_link)
+{
+	struct smap_psock_map_entry *e, *tmp;
+
+	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+		struct htab_elem *c = e->hash_link;
+
+		if (c == hash_link)
 			list_del(&e->list);
-			break;
-		}
 	}
 }
 
@@ -1647,7 +1656,7 @@ static void sock_map_free(struct bpf_map *map)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, &stab->sock_map[i], NULL);
+			smap_list_map_remove(psock, &stab->sock_map[i]);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);
@@ -1706,7 +1715,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
 
 	if (psock->bpf_parse)
 		smap_stop_sock(psock, sock);
-	smap_list_remove(psock, &stab->sock_map[k], NULL);
+	smap_list_map_remove(psock, &stab->sock_map[k]);
 	smap_release_sock(psock, sock);
 out:
 	write_unlock_bh(&sock->sk_callback_lock);
@@ -1908,7 +1917,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		struct smap_psock *opsock = smap_psock_sk(osock);
 
 		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i], NULL);
+		smap_list_map_remove(opsock, &stab->sock_map[i]);
 		smap_release_sock(opsock, osock);
 		write_unlock_bh(&osock->sk_callback_lock);
 	}
@@ -2124,7 +2133,7 @@ static void sock_hash_free(struct bpf_map *map)
 			 * (psock) to be null and queued for garbage collection.
 			 */
 			if (likely(psock)) {
-				smap_list_remove(psock, NULL, l);
+				smap_list_hash_remove(psock, l);
 				smap_release_sock(psock, sock);
 			}
 			write_unlock_bh(&sock->sk_callback_lock);
@@ -2304,7 +2313,7 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		psock = smap_psock_sk(l_old->sk);
 
 		hlist_del_rcu(&l_old->hash_node);
-		smap_list_remove(psock, NULL, l_old);
+		smap_list_hash_remove(psock, l_old);
 		smap_release_sock(psock, l_old->sk);
 		free_htab_elem(htab, l_old);
 	}
@@ -2372,7 +2381,7 @@ static int sock_hash_delete_elem(struct bpf_map *map, void *key)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, NULL, l);
+			smap_list_hash_remove(psock, l);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);

^ permalink raw reply related

* [bpf PATCH v3 3/4] bpf: sockhash fix omitted bucket lock in sock_close
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

First, in tcp_close, reduce the scope of sk_callback_lock(). The lock is
only needed for protecting the maps list; the ingress and cork
lists are protected by the sock lock. Having the lock in a wider scope is
harmless but may confuse the reader, who may infer it is in fact
needed.

Next, in sock_hash_delete_elem() the pattern is as follows,

  sock_hash_delete_elem()
     [...]
     spin_lock(bucket_lock)
     l = lookup_elem_raw()
     if (l)
        hlist_del_rcu()
        write_lock(sk_callback_lock)
         .... destroy psock ...
        write_unlock(sk_callback_lock)
     spin_unlock(bucket_lock)

The ordering is necessary because we only know the {p}sock after
dereferencing the hash table which we can't do unless we have the
bucket lock held. Once we have the bucket lock and the psock element
it is deleted from the hashmap to ensure any other path doing a lookup
will fail. Finally, the refcnt is decremented and if zero the psock
is destroyed.

In parallel with the above (or free'ing the map) a tcp close event
may trigger tcp_close(), which at the moment omits the bucket lock
altogether (oops!). The flow looks like this,

  bpf_tcp_close()
     [...]
     write_lock(sk_callback_lock)
     for each psock->maps // list of maps this sock is part of
         hlist_del_rcu(ref_hash_node);
         .... destroy psock ...
     write_unlock(sk_callback_lock)

Obviously, and demonstrated by syzbot, this is broken because
we can have multiple threads deleting entries via hlist_del_rcu().

To fix this we might be tempted to wrap the hlist operation in a
bucket lock, but that would create a lock inversion problem. In
summary, to follow locking rules the psock's maps list needs the
sk_callback_lock, but we need the bucket lock to do the hlist_del_rcu.
To resolve the lock inversion problem, pop the head of the maps list
repeatedly and remove the reference until no more are left. If a
delete happens in parallel from the BPF API that is OK as well because
it will do a similar action: look up the sock in the map/hash, delete
it from the map/hash, and dec the refcnt. We check for this case
before doing a destroy on the psock to ensure we don't have two
threads tearing down a psock. The new logic is as follows,

  bpf_tcp_close()
  e = psock_map_pop(psock->maps) // done with sk_callback_lock
  bucket_lock() // lock hash list bucket
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l) {
     //only get here if elmnt was not already removed
     hlist_del_rcu()
     ... destroy psock...
  }
  bucket_unlock()
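
The pop-then-lock ordering above can be modelled in a few lines of userspace C. This is a single-threaded sketch with hypothetical names (mock_lock, mock_psock, mock_map_pop, mock_close) and flag-based mock locks, so it only checks the lock ordering, not real concurrency.

```c
#include <assert.h>
#include <stddef.h>

struct mock_lock { int held; };

static void mock_acquire(struct mock_lock *l) { assert(!l->held); l->held = 1; }
static void mock_drop(struct mock_lock *l) { assert(l->held); l->held = 0; }

struct mock_ref {
	struct mock_ref *next;
	int *bucket_slot;	/* non-zero while the element is in the hash */
};

struct mock_psock {
	struct mock_lock callback_lock;	/* protects 'maps' */
	struct mock_lock bucket_lock;	/* protects the hash bucket */
	struct mock_ref *maps;
	int released;
};

/* Detach the head of the maps list; only the list lock is held here. */
static struct mock_ref *mock_map_pop(struct mock_psock *p)
{
	struct mock_ref *e;

	mock_acquire(&p->callback_lock);
	e = p->maps;
	if (e)
		p->maps = e->next;
	mock_drop(&p->callback_lock);
	return e;
}

/* Close path: the two locks are never held at the same time. */
static void mock_close(struct mock_psock *p)
{
	struct mock_ref *e;

	while ((e = mock_map_pop(p)) != NULL) {
		assert(!p->callback_lock.held);
		mock_acquire(&p->bucket_lock);
		if (*e->bucket_slot) {	/* skip if already deleted elsewhere */
			*e->bucket_slot = 0;
			p->released++;
		}
		mock_drop(&p->bucket_lock);
	}
}
```

An element whose slot was already cleared by a parallel delete is skipped, which mirrors the "only get here if elmnt was not already removed" check.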

And finally for all the above to work add missing sk_callback_lock
around smap_list_remove in sock_hash_ctx_update_elem(). Otherwise
delete and update may corrupt maps list. Then add RCU annotations and
use rcu_dereference/rcu_assign_pointer to manage values relying on
RCU so that the object is not free'd from sock_hash_free() while it
is being referenced in bpf_tcp_close().

(As an aside, the sk_callback_lock serves two purposes. The
 first is to update the sock callbacks sk_data_ready, sk_write_space,
 etc. The second is to protect the psock 'maps' list. The 'maps' list
 is used (as shown above) to delete all map/hash references to a
 sock when the sock is closed.)

Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |  120 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 84 insertions(+), 36 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 69b26af..333427b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -72,6 +72,7 @@ struct bpf_htab {
 	u32 n_buckets;
 	u32 elem_size;
 	struct bpf_sock_progs progs;
+	struct rcu_head rcu;
 };
 
 struct htab_elem {
@@ -89,8 +90,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
 	struct list_head list;
 	struct sock **entry;
-	struct htab_elem *hash_link;
-	struct bpf_htab *htab;
+	struct htab_elem __rcu *hash_link;
+	struct bpf_htab __rcu *htab;
 };
 
 struct smap_psock {
@@ -258,16 +259,54 @@ static void bpf_tcp_release(struct sock *sk)
 	rcu_read_unlock();
 }
 
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
+					 u32 hash, void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node) {
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+	}
+
+	return NULL;
+}
+
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	atomic_dec(&htab->count);
 	kfree_rcu(l, rcu);
 }
 
+static struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
+						  struct smap_psock *psock)
+{
+	struct smap_psock_map_entry *e;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	e = list_first_entry_or_null(&psock->maps,
+				     struct smap_psock_map_entry,
+				     list);
+	if (e)
+		list_del(&e->list);
+	write_unlock_bh(&sk->sk_callback_lock);
+	return e;
+}
+
 static void bpf_tcp_close(struct sock *sk, long timeout)
 {
 	void (*close_fun)(struct sock *sk, long timeout);
-	struct smap_psock_map_entry *e, *tmp;
+	struct smap_psock_map_entry *e;
 	struct sk_msg_buff *md, *mtmp;
 	struct smap_psock *psock;
 	struct sock *osk;
@@ -286,7 +325,6 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	 */
 	close_fun = psock->save_close;
 
-	write_lock_bh(&sk->sk_callback_lock);
 	if (psock->cork) {
 		free_start_sg(psock->sock, psock->cork);
 		kfree(psock->cork);
@@ -299,20 +337,38 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 		kfree(md);
 	}
 
-	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+	e = psock_map_pop(sk, psock);
+	while (e) {
 		if (e->entry) {
 			osk = cmpxchg(e->entry, sk, NULL);
 			if (osk == sk) {
-				list_del(&e->list);
 				smap_release_sock(psock, sk);
 			}
 		} else {
-			hlist_del_rcu(&e->hash_link->hash_node);
-			smap_release_sock(psock, e->hash_link->sk);
-			free_htab_elem(e->htab, e->hash_link);
+			struct htab_elem *link = rcu_dereference(e->hash_link);
+			struct bpf_htab *htab = rcu_dereference(e->htab);
+			struct hlist_head *head;
+			struct htab_elem *l;
+			struct bucket *b;
+
+			b = __select_bucket(htab, link->hash);
+			head = &b->head;
+			raw_spin_lock_bh(&b->lock);
+			l = lookup_elem_raw(head,
+					    link->hash, link->key,
+					    htab->map.key_size);
+			/* If another thread deleted this object skip deletion.
+			 * The refcnt on psock may or may not be zero.
+			 */
+			if (l) {
+				hlist_del_rcu(&link->hash_node);
+				smap_release_sock(psock, link->sk);
+				free_htab_elem(htab, link);
+			}
+			raw_spin_unlock_bh(&b->lock);
 		}
+		e = psock_map_pop(sk, psock);
 	}
-	write_unlock_bh(&sk->sk_callback_lock);
 	rcu_read_unlock();
 	close_fun(sk, timeout);
 }
@@ -1618,7 +1674,7 @@ static void smap_list_hash_remove(struct smap_psock *psock,
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		struct htab_elem *c = e->hash_link;
+		struct htab_elem *c = rcu_dereference(e->hash_link);
 
 		if (c == hash_link)
 			list_del(&e->list);
@@ -2090,14 +2146,13 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+static void __bpf_htab_free(struct rcu_head *rcu)
 {
-	return &htab->buckets[hash & (htab->n_buckets - 1)];
-}
+	struct bpf_htab *htab;
 
-static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
-{
-	return &__select_bucket(htab, hash)->head;
+	htab = container_of(rcu, struct bpf_htab, rcu);
+	bpf_map_area_free(htab->buckets);
+	kfree(htab);
 }
 
 static void sock_hash_free(struct bpf_map *map)
@@ -2116,10 +2171,13 @@ static void sock_hash_free(struct bpf_map *map)
 	 */
 	rcu_read_lock();
 	for (i = 0; i < htab->n_buckets; i++) {
-		struct hlist_head *head = select_bucket(htab, i);
+		struct bucket *b = __select_bucket(htab, i);
+		struct hlist_head *head;
 		struct hlist_node *n;
 		struct htab_elem *l;
 
+		raw_spin_lock_bh(&b->lock);
+		head = &b->head;
 		hlist_for_each_entry_safe(l, n, head, hash_node) {
 			struct sock *sock = l->sk;
 			struct smap_psock *psock;
@@ -2137,12 +2195,12 @@ static void sock_hash_free(struct bpf_map *map)
 				smap_release_sock(psock, sock);
 			}
 			write_unlock_bh(&sock->sk_callback_lock);
-			kfree(l);
+			free_htab_elem(htab, l);
 		}
+		raw_spin_unlock_bh(&b->lock);
 	}
 	rcu_read_unlock();
-	bpf_map_area_free(htab->buckets);
-	kfree(htab);
+	call_rcu(&htab->rcu, __bpf_htab_free);
 }
 
 static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
@@ -2169,19 +2227,6 @@ static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
 	return l_new;
 }
 
-static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
-					 u32 hash, void *key, u32 key_size)
-{
-	struct htab_elem *l;
-
-	hlist_for_each_entry_rcu(l, head, hash_node) {
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
-			return l;
-	}
-
-	return NULL;
-}
-
 static inline u32 htab_map_hash(const void *key, u32 key_len)
 {
 	return jhash(key, key_len, 0);
@@ -2301,8 +2346,9 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		goto bucket_err;
 	}
 
-	e->hash_link = l_new;
-	e->htab = container_of(map, struct bpf_htab, map);
+	rcu_assign_pointer(e->hash_link, l_new);
+	rcu_assign_pointer(e->htab,
+			   container_of(map, struct bpf_htab, map));
 	list_add_tail(&e->list, &psock->maps);
 
 	/* add new element to the head of the list, so that
@@ -2313,8 +2359,10 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		psock = smap_psock_sk(l_old->sk);
 
 		hlist_del_rcu(&l_old->hash_node);
+		write_lock_bh(&l_old->sk->sk_callback_lock);
 		smap_list_hash_remove(psock, l_old);
 		smap_release_sock(psock, l_old->sk);
+		write_unlock_bh(&l_old->sk->sk_callback_lock);
 		free_htab_elem(htab, l_old);
 	}
 	raw_spin_unlock_bh(&b->lock);

^ permalink raw reply related

* [bpf PATCH v3 4/4] bpf: sockhash, add release routine
From: John Fastabend @ 2018-06-22 15:21 UTC (permalink / raw)
  To: john.fastabend, ast, daniel, kafai; +Cc: netdev
In-Reply-To: <20180622151123.24502.56029.stgit@john-Precision-Tower-5810>

Add map_release_uref pointer to hashmap ops. This was dropped when
original sockhash code was ported into bpf-next before initial
commit.
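
For readers unfamiliar with why a missing ops pointer goes unnoticed, here is a small sketch. It assumes, as I understand the core map code, that the caller treats the hook as optional and skips it when unset; the names (mock_map, mock_map_ops, mock_put_uref) are hypothetical and much simpler than the real struct bpf_map_ops.

```c
#include <assert.h>
#include <stddef.h>

struct mock_map { int progs_released; };

/* Simplified ops table; the real struct bpf_map_ops has many more hooks. */
struct mock_map_ops {
	void (*release_uref)(struct mock_map *map);	/* optional */
};

static void mock_release(struct mock_map *map)
{
	map->progs_released = 1;
}

/* Caller side: an unset hook is silently skipped, nothing is released. */
static void mock_put_uref(const struct mock_map_ops *ops, struct mock_map *map,
			  int *usercnt)
{
	assert(*usercnt > 0);
	if (--(*usercnt) == 0 && ops->release_uref)
		ops->release_uref(map);
}
```

With the hook left NULL the last user reference drops without any error, but the release work never runs, which is why the omission was only spotted during review.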

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 333427b..2456986 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2478,6 +2478,7 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
 	.map_get_next_key = sock_hash_get_next_key,
 	.map_update_elem = sock_hash_update_elem,
 	.map_delete_elem = sock_hash_delete_elem,
+	.map_release_uref = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,

^ permalink raw reply related

* Re: bnx2x: kernel panic in the bnx2x driver
From: Vishwanath Pai @ 2018-06-22 14:57 UTC (permalink / raw)
  To: Kalluru, Sudarsana, Elior, Ariel, Dept-Eng Everest Linux L2
  Cc: davem@davemloft.net, netdev@vger.kernel.org, dbanerje@akamai.com,
	pai.vishwain@gmail.com
In-Reply-To: <MW2PR07MB4139CC84215EDFEDC980CF098A750@MW2PR07MB4139.namprd07.prod.outlook.com>

Ah, that is great! I will test it out on my machine and let you know.

Thanks,
Vishwanath

On 06/22/2018 10:21 AM, Kalluru, Sudarsana wrote:
> Hi Vishwanath,
>     The config will be cached in the device structure (bp->rss_conf_obj.udp_rss_v4) in this scenario, and will be applied in the load path (bnx2x_nic_load() --> bnx2x_init_rss()). I have unit tested the change on my setup.
>
> Thanks,
> Sudarsana
>
> -----Original Message-----
> From: Vishwanath Pai [mailto:vpai@akamai.com] 
> Sent: 22 June 2018 18:52
> To: Kalluru, Sudarsana <Sudarsana.Kalluru@cavium.com>; Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; pai.vishwain@gmail.com
> Subject: Re: bnx2x: kernel panic in the bnx2x driver
>
> Hi Sudarsana,
>
> Thanks for taking a look at my email. The fix you suggested would definitely fix the kernel panic, but at the same time wouldn't it also silently ignore the request by ethtool to set rx-flow-hash?
>
> Thanks,
> Vishwanath
>
> On 06/22/2018 06:20 AM, Kalluru, Sudarsana wrote:
>> Hi Vishwanath,
>>     Thanks for your mail, and the analysis.
>> The fix would be to invoke bnx2x_rss() only when the device is opened,
>>       if (bp->state == BNX2X_STATE_OPEN)
>>               return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
>>       else
>>               return 0;
>> Ariel,
>>    Could you please review the path (bnx2x_set_rss_flags() --> bnx2x_rss()) and confirm or correct the above.
>>
>> Thanks,
>> Sudarsana
>>
>> -----Original Message-----
>> From: Vishwanath Pai [mailto:vpai@akamai.com]
>> Sent: 22 June 2018 10:37
>> To: Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 
>> <Dept-EngEverestLinuxL2@cavium.com>
>> Cc: davem@davemloft.net; netdev@vger.kernel.org; dbanerje@akamai.com; 
>> pai.vishwain@gmail.com
>> Subject: bnx2x: kernel panic in the bnx2x driver
>>
>> External Email
>>
>> Hi,
>>
>> We recently noticed a kernel panic in the bnx2x driver when trying to
>> set rx-flow-hash parameters via ethtool during if-pre-up.d. I am
>> running kernel v4.17.2 from ubuntu-mainline-ppa. I have added the
>> stack trace below:
>>
>> [   18.280209] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>> [   18.280212] PGD 8000000407a79067 P4D 8000000407a79067 PUD 40ce8a067 PMD 0
>> [   18.280214] Oops: 0010 [#1] SMP PTI
>> [   18.280215] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel joydev input_leds kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_generic aesni_intel gpio_ich aes_x86_64 usbhid lpc_ich crypto_simd ie31200_edac cryptd glue_helper intel_cstate mac_hid intel_rapl_perf bnx2x mdio tcp_bbr netconsole ipmi_devintf ipmi_msghandler i2c_i801 coretemp autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear sha256_mb mcryptd sha256_ssse3 hid ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt mpt3sas fb_sys_fops drm raid_class scsi_transport_sas ahci libahci shpchp video
>> [   18.280241] CPU: 6 PID: 1081 Comm: ethtool Not tainted 4.17.2-041702-generic #201806160433
>> [   18.280242] Hardware name: Foxconn CangJie/CangJie, BIOS CC1F108D 02/26/2014
>> [   18.280243] RIP: 0010:          (null)
>> [   18.280243] RSP: 0018:ffffb84bc260b9c0 EFLAGS: 00010246
>> [   18.280244] RAX: 0000000000000000 RBX: ffff92f987f020f0 RCX: 0000000000000000
>> [   18.280245] RDX: 0000000000000000 RSI: ffffb84bc260b9f8 RDI: ffff92f987f020f0
>> [   18.280245] RBP: ffffb84bc260b9e8 R08: 0000000000000001 R09: 0000000000000000
>> [   18.280246] R10: ffffb84bc260bd20 R11: 0000000000000000 R12: ffffb84bc260b9f8
>> [   18.280246] R13: ffff92f987f008c0 R14: 00007ffdb75bec40 R15: 0000000000000000
>> [   18.280247] FS:  00007fc0e8798700(0000) GS:ffff92f99fd80000(0000) knlGS:0000000000000000
>> [   18.280248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   18.280249] CR2: 0000000000000000 CR3: 0000000409b4c003 CR4: 00000000001606e0
>> [   18.280249] Call Trace:
>> [   18.280263]  ? bnx2x_config_rss+0x2f/0xd0 [bnx2x]
>> [   18.280270]  bnx2x_rss+0x1d9/0x210 [bnx2x]
>> [   18.280276]  bnx2x_set_rxnfc+0x17d/0x380 [bnx2x]
>> [   18.280279]  ethtool_set_rxnfc+0x9b/0x110
>> [   18.280281]  ? __do_page_cache_readahead+0x1da/0x2c0
>> [   18.280283]  ? security_capable+0x3c/0x60
>> [   18.280284]  dev_ethtool+0x350/0x2610
>> [   18.280286]  ? page_cache_async_readahead+0x71/0x80
>> [   18.280288]  ? page_add_file_rmap+0x5d/0x220
>> [   18.280290]  ? inet_ioctl+0x182/0x1a0
>> [   18.280291]  dev_ioctl+0x203/0x3f0
>> [   18.280293]  ? dev_ioctl+0x203/0x3f0
>> [   18.280294]  sock_do_ioctl+0xae/0x150
>> [   18.280296]  sock_ioctl+0x1e2/0x330
>> [   18.280296]  ? sock_ioctl+0x1e2/0x330
>> [   18.280299]  do_vfs_ioctl+0xa8/0x620
>> [   18.280300]  ? dlci_ioctl_set+0x30/0x30
>> [   18.280301]  ? do_vfs_ioctl+0xa8/0x620
>> [   18.280302]  ? handle_mm_fault+0xe3/0x220
>> [   18.280304]  ksys_ioctl+0x75/0x80
>> [   18.280305]  __x64_sys_ioctl+0x1a/0x20
>> [   18.280307]  do_syscall_64+0x5a/0x120
>> [   18.280309]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [   18.280310] RIP: 0033:0x7fc0e7fba107
>> [   18.280311] RSP: 002b:00007ffdb75beb78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>> [   18.280312] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc0e7fba107
>> [   18.280312] RDX: 00007ffdb75bed60 RSI: 0000000000008946 RDI: 0000000000000003
>> [   18.280313] RBP: 00007ffdb75bed50 R08: 00007ffdb75bed60 R09: 0000000000000001
>> [   18.280313] R10: 0000000000000541 R11: 0000000000000206 R12: 00007ffdb75beed0
>> [   18.280314] R13: 0000000000421020 R14: 000000000041fe28 R15: 0000000000000003
>> [   18.280315] Code:  Bad RIP value.
>> [   18.280317] RIP:           (null) RSP: ffffb84bc260b9c0
>> [   18.280318] CR2: 0000000000000000
>> [   18.280319] ---[ end trace 5f361db3fb9059f1 ]---
>>
>> To reproduce this I created a bash script in "/etc/network/if-pre-up.d/" with these two lines:
>> ethtool -N $IFACE rx-flow-hash udp4 "sdfn"
>> ethtool -N $IFACE rx-flow-hash udp6 "sdfn"
>>
>> The problem here is that rss_obj in the bnx2x struct for the device
>> hasn't been initialized yet, which causes an exception in
>> bnx2x_config_rss() when calling "r->set_pending(r)" because
>> r->set_pending is NULL. It looks like a lot of things haven't been
>> initialized at this point. Most of that happens in
>> bnx2x_init_bp_objs(), which isn't called until ifup. Any thoughts on
>> how this can be fixed? Would it be possible to safely move
>> bnx2x_init_bp_objs() to, say, bnx2x_init_one(), which runs well
>> before ifup?
>>
>> Thanks,
>> Vishwanath
>>
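
The pattern discussed in this thread (a callback that stays NULL until the load path runs, a setter guarded on the device state, and a cached setting applied at open) can be sketched in plain userspace C. This is only a model under stated assumptions: struct dev, struct rss_obj, dev_set_rss_flags(), dev_open() and the rest are hypothetical stand-ins, not the bnx2x driver's types or functions, and the "hardware" is just a field in the struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures. */
enum dev_state { DEV_CLOSED, DEV_OPEN };

struct rss_obj {
	/* Stays NULL until the load path initialises the object. */
	void (*set_pending)(struct rss_obj *r);
	bool pending;
	bool udp_rss_v4;     /* cached setting requested by the user */
	bool hw_udp_rss_v4;  /* what the modelled hardware currently has */
};

struct dev {
	enum dev_state state;
	struct rss_obj rss;
};

void rss_set_pending(struct rss_obj *r)
{
	r->pending = true;
}

/* Programs the cached setting; only safe once set_pending is assigned. */
int dev_rss(struct dev *d)
{
	d->rss.set_pending(&d->rss);
	d->rss.hw_udp_rss_v4 = d->rss.udp_rss_v4;
	d->rss.pending = false;
	return 0;
}

/* Setter: always cache the request, touch hardware only when open. */
int dev_set_rss_flags(struct dev *d, bool udp_rss_v4)
{
	d->rss.udp_rss_v4 = udp_rss_v4;
	if (d->state == DEV_OPEN)
		return dev_rss(d);
	return 0;
}

/* Load path: initialise the object, then apply whatever was cached. */
int dev_open(struct dev *d)
{
	d->rss.set_pending = rss_set_pending;
	d->state = DEV_OPEN;
	return dev_rss(d);
}
```

In this model a request made while the device is closed is not lost: dev_set_rss_flags() records it and dev_open() applies it later. Without the state check, the setter would call through the NULL set_pending pointer, which corresponds to the oops reported above.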

^ permalink raw reply

* Re: [PATCH v2 bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: Jesper Dangaard Brouer @ 2018-06-22 15:49 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: brouer, dsahern, netdev, borkmann, ast, davem, David Ahern
In-Reply-To: <20180621170936.2tobn5lu24l6xuo7@kafai-mbp.dhcp.thefacebook.com>

On Thu, 21 Jun 2018 10:09:36 -0700
Martin KaFai Lau <kafai@fb.com> wrote:

> On Wed, Jun 20, 2018 at 08:00:11PM -0700, dsahern@kernel.org wrote:
> > From: David Ahern <dsahern@gmail.com>
> > 
> > For ACLs implemented using either FIB rules or FIB entries, the BPF
> > program needs the FIB lookup status to be able to drop the packet.
> > Since the bpf_fib_lookup API has not reached a released kernel yet,
> > change the return code to contain an encoding of the FIB lookup
> > result and return the nexthop device index in the params struct.
> > 
> > In addition, inform the BPF program of any post FIB lookup reason as
> > to why the packet needs to go up the stack.
> > 
> > The fib result for unicast routes must have an egress device, so remove
> > the check that it is non-NULL.  
> Acked-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
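
As a rough illustration of why the quoted change matters to a BPF program: once the return code carries the lookup status, the caller can tell "drop this packet" apart from "hand it to the stack". The sketch below is plain userspace C, not a BPF program. The enum values, struct lookup_params and decide() are hypothetical names modelled on the description in the quoted commit message, not the kernel's definitions, and treating a negative return as "pass" is a choice made here for the example.

```c
#include <assert.h>

/* Hypothetical status codes modelled on the quoted description. */
enum lookup_status {
	LOOKUP_SUCCESS = 0,
	LOOKUP_BLACKHOLE,
	LOOKUP_UNREACHABLE,
	LOOKUP_PROHIBIT,
	LOOKUP_NOT_FORWARDED,
};

enum verdict { VERDICT_DROP, VERDICT_PASS, VERDICT_REDIRECT };

/* Hypothetical params struct carrying the nexthop device index. */
struct lookup_params {
	int ifindex;
};

/* Maps a lookup return code to what the program does with the packet. */
enum verdict decide(int rc, const struct lookup_params *p, int *out_ifindex)
{
	if (rc < 0)
		return VERDICT_PASS;	/* bad arguments: let the stack decide */

	switch (rc) {
	case LOOKUP_SUCCESS:
		*out_ifindex = p->ifindex;	/* egress device from params */
		return VERDICT_REDIRECT;
	case LOOKUP_BLACKHOLE:
	case LOOKUP_UNREACHABLE:
	case LOOKUP_PROHIBIT:
		return VERDICT_DROP;		/* ACL-style drop */
	default:
		return VERDICT_PASS;		/* needs full stack processing */
	}
}
```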

^ permalink raw reply

* Re: [PATCH net V3 1/1] net/smc: coordinate wait queues for nonblocking connect
From: Eric Dumazet @ 2018-06-22 15:49 UTC (permalink / raw)
  To: Ursula Braun, davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl,
	xiyou.wangcong, hch
In-Reply-To: <20180622140139.22981-1-ubraun@linux.ibm.com>



On 06/22/2018 07:01 AM, Ursula Braun wrote:
> The recent poll change may lead to stalls for non-blocking connecting
> SMC sockets, since sock_poll_wait is no longer performed on the
> internal CLC socket, but on the outer SMC socket.  kernel_connect() on
> the internal CLC socket returns with -EINPROGRESS, but the wake up
> logic does not work in all cases. If the internal CLC socket is still
> in state TCP_SYN_SENT when polled, sock_poll_wait() from sock_poll()
> does not sleep. It is supposed to sleep till the state of the internal
> CLC socket switches to TCP_ESTABLISHED.
> 
> This patch temporarily propagates the wait queue from the internal
> CLC sock to the SMC sock, till the non-blocking connect() is
> finished.
> 
> In addition locking is reduced due to the removed poll waits.
> 
> Fixes: c0129a061442 ("smc: convert to ->poll_mask")
> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
> ---
>  net/smc/af_smc.c | 13 +++++++++----
>  net/smc/smc.h    |  1 +
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index da7f02edcd37..7966e7ddb563 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -23,6 +23,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/in.h>
>  #include <linux/sched/signal.h>
> +#include <linux/rcupdate.h>
>  
>  #include <net/sock.h>
>  #include <net/tcp.h>
> @@ -605,6 +606,11 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
>  
>  	smc_copy_sock_settings_to_clc(smc);
>  	tcp_sk(smc->clcsock->sk)->syn_smc = 1;
> +	if (flags & O_NONBLOCK) {
> +		smc->smcwq = rcu_access_pointer(sk->sk_wq);
> +		rcu_assign_pointer(sock->sk->sk_wq,
> +				   rcu_access_pointer(smc->clcsock->sk->sk_wq));

That is obfuscation.

The following is much easier to read.

	sock->sk->sk_wq = smc->clcsock->sk->sk_wq;

But, this looks very suspect to me.

Nowhere in the stack do we divert sk->sk_wq to something else.

What about rcu users of sock->sk->sk_wq ?

		

> +	}
>  	rc = kernel_connect(smc->clcsock, addr, alen, flags);
>  	if (rc)
>  		goto out;
> @@ -1285,12 +1291,9 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
>  
>  	smc = smc_sk(sock->sk);
>  	sock_hold(sk);
> -	lock_sock(sk);
>  	if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
>  		/* delegate to CLC child sock */
> -		release_sock(sk);
>  		mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
> -		lock_sock(sk);
>  		sk->sk_err = smc->clcsock->sk->sk_err;
>  		if (sk->sk_err) {
>  			mask |= EPOLLERR;
> @@ -1299,7 +1302,10 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
>  			if (sk->sk_state == SMC_INIT &&
>  			    mask & EPOLLOUT &&
>  			    smc->clcsock->sk->sk_state != TCP_CLOSE) {
> +				lock_sock(sk);
> +				rcu_assign_pointer(sock->sk->sk_wq, smc->smcwq);
>  				rc = __smc_connect(smc);
> +				release_sock(sk);
>  				if (rc < 0)
>  					mask |= EPOLLERR;
>  				/* success cases including fallback */
> @@ -1334,7 +1340,6 @@ static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
>  			mask |= EPOLLPRI;
>  
>  	}
> -	release_sock(sk);
>  	sock_put(sk);
>  
>  	return mask;
> diff --git a/net/smc/smc.h b/net/smc/smc.h
> index 51ae1f10d81a..89d6d7ef973f 100644
> --- a/net/smc/smc.h
> +++ b/net/smc/smc.h
> @@ -190,6 +190,7 @@ struct smc_connection {
>  struct smc_sock {				/* smc sock container */
>  	struct sock		sk;
>  	struct socket		*clcsock;	/* internal tcp socket */
> +	struct socket_wq	*smcwq;		/* original smcsock wq */
>  	struct smc_connection	conn;		/* smc connection */
>  	struct smc_sock		*listen_smc;	/* listen parent */
>  	struct work_struct	tcp_listen_work;/* handle tcp socket accepts */
> 


No refcounting when ->smcwq is set ?

This looks quite risky to me.
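
To make the refcounting concern concrete, here is a small userspace sketch in which a borrowed wait-queue pointer holds its own reference, so the object cannot be freed while the borrower still points at it. It is an illustration only: struct wq, struct sock_model, borrow_wq() and restore_wq() are hypothetical names, the code uses C11 atomics rather than kernel primitives, and it does not address the separate question about RCU readers of sk_wq raised above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted wait queue. */
struct wq {
	atomic_int refs;
};

struct wq *wq_alloc(void)
{
	struct wq *w = malloc(sizeof(*w));

	if (w)
		atomic_init(&w->refs, 1);
	return w;
}

struct wq *wq_get(struct wq *w)
{
	atomic_fetch_add(&w->refs, 1);
	return w;
}

/* Drops one reference; returns 1 if this call freed the object. */
int wq_put(struct wq *w)
{
	if (atomic_fetch_sub(&w->refs, 1) == 1) {
		free(w);
		return 1;
	}
	return 0;
}

/* Hypothetical socket model: current wq plus the saved original. */
struct sock_model {
	struct wq *wq;
	struct wq *saved_wq;
};

/* Point outer at inner's wq, taking a reference for the borrowed pointer. */
void borrow_wq(struct sock_model *outer, struct sock_model *inner)
{
	outer->saved_wq = outer->wq;
	outer->wq = wq_get(inner->wq);
}

/* Drop the borrowed reference and restore the original wq. */
void restore_wq(struct sock_model *outer)
{
	wq_put(outer->wq);
	outer->wq = outer->saved_wq;
	outer->saved_wq = NULL;
}
```

If the inner socket goes away while the outer one still borrows its wait queue, the extra reference keeps the object alive until restore_wq() runs.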

^ permalink raw reply

* Re: [PATCH 0/4] docs: e100[0] fix build errors
From: Jeff Kirsher @ 2018-06-22 16:44 UTC (permalink / raw)
  To: Tobin C. Harding, Jonathan Corbet
  Cc: David S. Miller, linux-doc, netdev, linux-kernel
In-Reply-To: <20180622003708.31848-1-me@tobin.cc>

[-- Attachment #1: Type: text/plain, Size: 629 bytes --]

On Fri, 2018-06-22 at 10:37 +1000, Tobin C. Harding wrote:
> I may be a little confused here, I'm getting docs build failure on
> Linus' mainline, linux-next, and your docs-next but different errors.
> There seems to be patches to the first two trees that are not in your
> docs-next tree?
> 
> Do networking docs typically go through your tree?  Looks like
> networking has done some conversion to rst that hasn't gone through
> your
> tree.  Or else my git skills are failing.

Networking documentation changes went through David Miller's networking
tree because he maintains changes under Documentation/networking/

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply
