Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH] wimax: i2400: fix memory leak
From: Navid Emamdoost @ 2019-09-10 23:01 UTC (permalink / raw)
  Cc: emamd001, smccaman, kjlu, Navid Emamdoost, Inaky Perez-Gonzalez,
	linux-wimax, David S. Miller, netdev, linux-kernel

In i2400m_op_rfkill_sw_toggle cmd buffer should be released along with
skb response.

Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
---
 drivers/net/wimax/i2400m/op-rfkill.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wimax/i2400m/op-rfkill.c b/drivers/net/wimax/i2400m/op-rfkill.c
index 6642bcb27761..8efb493ceec2 100644
--- a/drivers/net/wimax/i2400m/op-rfkill.c
+++ b/drivers/net/wimax/i2400m/op-rfkill.c
@@ -127,6 +127,7 @@ int i2400m_op_rfkill_sw_toggle(struct wimax_dev *wimax_dev,
 			"%d\n", result);
 	result = 0;
 error_cmd:
+	kfree(cmd);
 	kfree_skb(ack_skb);
 error_msg_to_dev:
 error_alloc:
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH bpf-next v10 4/4] tools/testing/selftests/bpf: Add self-tests for helper bpf_get_pidns_info.
From: Yonghong Song @ 2019-09-10 22:55 UTC (permalink / raw)
  To: Carlos Neira, netdev@vger.kernel.org
  Cc: ebiederm@xmission.com, brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <20190906150952.23066-5-cneirabustos@gmail.com>



On 9/6/19 4:09 PM, Carlos Neira wrote:
> Added 2 selftest:
> 
> bpf_get_current_pidns_info helper is called in an interrupt
> context and also in a non interrupt context.
> 
> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> ---
>   tools/testing/selftests/bpf/Makefile               |   2 +-
>   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
>   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  52 ++++++++
>   .../selftests/bpf/progs/test_pidns_nmi_kern.c      |  52 ++++++++
>   tools/testing/selftests/bpf/test_pidns.c           | 146 +++++++++++++++++++++
>   tools/testing/selftests/bpf/test_pidns_nmi.c       | 139 ++++++++++++++++++++
>   6 files changed, 393 insertions(+), 1 deletion(-)
>   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
>   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c
>   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
>   create mode 100644 tools/testing/selftests/bpf/test_pidns_nmi.c
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 1faad0c3c3c9..8507c89141f5 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>   	test_cgroup_storage test_select_reuseport test_section_names \
>   	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
>   	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> -	test_sockopt_multi test_sockopt_inherit test_tcp_rtt
> +	test_sockopt_multi test_sockopt_inherit test_tcp_rtt test_pidns test_pidns_nmi
>   
>   BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>   TEST_GEN_FILES = $(BPF_OBJ_FILES)
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 6c4930bc6e2e..30a70ebe583a 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -313,6 +313,9 @@ static unsigned int (*bpf_set_hash)(void *ctx, __u32 hash) =
>   static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
>   				  unsigned long long flags) =
>   	(void *) BPF_FUNC_skb_adjust_room;
> +static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
> +					 unsigned int buf_size) =
> +	(void *) BPF_FUNC_get_current_pidns_info;
>   
>   /* Scan the ARCH passed in from ARCH env variable (see Makefile) */
>   #if defined(__TARGET_ARCH_x86)
> diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> new file mode 100644
> index 000000000000..a4c0bde1608b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> @@ -0,0 +1,52 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include "bpf_helpers.h"
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, struct bpf_pidns_info);
> +} nsidmap SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u32);
> +} pidmap SEC(".maps");
> +
> +SEC("tracepoint/syscalls/sys_enter_nanosleep")
> +int trace(void *ctx)
> +{
> +	struct bpf_pidns_info nsinfo;
> +	__u32 key = 0, *expected_pid;
> +	struct bpf_pidns_info  *val;
> +
> +	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
> +		return -EINVAL;
> +
> +	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
> +	__u64 tgid = bpf_get_current_pid_tgid();
> +
> +	if (!expected_pid || *expected_pid != nsinfo.pid ||
> +			nsinfo.tgid != (__u32)tgid)
> +		return 0;

The logic in the above is not right.
bpf_get_current_pidns_info() retrieved
   dev, nsid, tgid and pid.

You should create a map, populate dev/nsid you
got from the user space. And then update maybe
another map with tgid/pid and passed back to
user space for verification.

> +
> +	val = bpf_map_lookup_elem(&nsidmap, &key);
> +	if (val)
> +		*val = nsinfo;
> +
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +__u32 _version SEC("version") = 1;
> diff --git a/tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c
> new file mode 100644
> index 000000000000..7f02e4e29021
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c
> @@ -0,0 +1,52 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include "bpf_helpers.h"
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, struct bpf_pidns_info);
> +} nsidmap SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u32);
> +} pidmap SEC(".maps");
> +
> +SEC("tracepoint/net/netif_receive_skb")
> +int trace(void *ctx)
> +{
> +	struct bpf_pidns_info nsinfo;
> +	__u32 key = 0, *expected_pid;
> +	struct bpf_pidns_info  *val;
> +
> +	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
> +		return -EINVAL;
> +
> +	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
> +	__u64 tgid = bpf_get_current_pid_tgid();
> +
> +	if (!expected_pid || *expected_pid != nsinfo.pid ||
> +			nsinfo.tgid != (__u32)tgid)
> +		return 0;
> +
> +	val = bpf_map_lookup_elem(&nsidmap, &key);
> +	if (val)
> +		*val = nsinfo;
> +
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +__u32 _version SEC("version") = 1;

With the new nsfs interface function(), the file test_pidns_nmi_kern.c
is not needed any more.

Please also remove the in_interrupt() checking in the kernel helper 
implementation.

> diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c

Please restructure and put this file under 
tools/testing/selftests/bpf/prog_tests directory.

> new file mode 100644
> index 000000000000..0c579305da53
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_pidns.c
> @@ -0,0 +1,146 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <syscall.h>
> +#include <unistd.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "bpf_rlimit.h"
> +
> +#define CHECK(condition, tag, format...) ({		\
> +	int __ret = !!(condition);			\
> +	if (__ret) {					\
> +		printf("%s:FAIL:%s ", __func__, tag);	\
> +		printf(format);				\
> +	} else {					\
> +		printf("%s:PASS:%s\n", __func__, tag);	\
> +	}						\
> +	__ret;						\
> +})
> +
> +static int bpf_find_map(const char *test, struct bpf_object *obj,
> +			const char *name)
> +{
> +	struct bpf_map *map;
> +
> +	map = bpf_object__find_map_by_name(obj, name);
> +	if (!map)
> +		return -1;
> +	return bpf_map__fd(map);
> +}
> +
> +
> +int main(int argc, char **argv)
> +{
> +	const char *probe_name = "syscalls/sys_enter_nanosleep";
> +	const char *file = "test_pidns_kern.o";
> +	int err, bytes, efd, prog_fd, pmu_fd;
> +	struct bpf_pidns_info knsid = {};
> +	int pidmap_fd, nsidmap_fd;
> +	struct perf_event_attr attr = {};
> +	struct bpf_object *obj;
> +	__u32 key = 0, pid;
> +	int exit_code = 1;
> +	struct stat st;
> +	char buf[256];
> +
> +	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
> +	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> +		goto cleanup_cgroup_env;
> +
> +	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
> +	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +		  nsidmap_fd, errno))
> +		goto close_prog;
> +
> +	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> +	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +		  pidmap_fd, errno))
> +		goto close_prog;
> +
> +	pid = getpid();
> +	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> +
> +	snprintf(buf, sizeof(buf),
> +		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
> +	efd = open(buf, O_RDONLY, 0);
> +	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> +		goto close_prog;
> +	bytes = read(efd, buf, sizeof(buf));
> +	close(efd);
> +	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> +		  "bytes %d errno %d\n", bytes, errno))
> +		goto close_prog;
> +
> +	attr.config = strtol(buf, NULL, 0);
> +	attr.type = PERF_TYPE_TRACEPOINT;
> +	attr.sample_type = PERF_SAMPLE_RAW;
> +	attr.sample_period = 1;
> +	attr.wakeup_events = 1;
> +
> +	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> +	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> +		  errno))
> +		goto close_prog;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> +		  errno))
> +		goto close_pmu;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> +		  errno))
> +		goto close_pmu;
> +
> +	/* trigger some syscalls */
> +	sleep(1);
> +
> +	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
> +	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> +		goto close_pmu;
> +
> +	if (stat("/proc/self/ns/pid", &st))
> +		goto close_pmu;
> +
> +	if (CHECK(knsid.nsid != (__u32) st.st_ino, "namespace id",
> +		  "bpf %u user %u\n", knsid.nsid, (__u32) st.st_ino))
> +		goto close_pmu;
> +
> +	if (CHECK(major(knsid.dev) != major(st.st_rdev), "dev major",
> +		  "bpf %u user %u\n", major(knsid.dev), major(st.st_rdev)))
> +		goto close_pmu;
> +
> +	if (CHECK(minor(knsid.dev) != minor(st.st_rdev), "dev minor",
> +		  "bpf %u user %u\n", minor(knsid.dev), minor(st.st_rdev)))
> +		goto close_pmu;
> +
> +	exit_code = 0;
> +	printf("%s:PASS\n", argv[0]);
> +
> +close_pmu:
> +	close(pmu_fd);
> +close_prog:
> +	bpf_object__close(obj);
> +cleanup_cgroup_env:
> +	return exit_code;
> +}
> diff --git a/tools/testing/selftests/bpf/test_pidns_nmi.c b/tools/testing/selftests/bpf/test_pidns_nmi.c
> new file mode 100644
> index 000000000000..bb8075bbe7d8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_pidns_nmi.c
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <syscall.h>
> +#include <unistd.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "bpf_rlimit.h"
> +
> +#define CHECK(condition, tag, format...) ({		\
> +	int __ret = !!(condition);			\
> +	if (__ret) {					\
> +		printf("%s:FAIL:%s ", __func__, tag);	\
> +		printf(format);				\
> +	} else {					\
> +		printf("%s:PASS:%s\n", __func__, tag);	\
> +	}						\
> +	__ret;						\
> +})
> +
> +static int bpf_find_map(const char *test, struct bpf_object *obj,
> +			const char *name)
> +{
> +	struct bpf_map *map;
> +
> +	map = bpf_object__find_map_by_name(obj, name);
> +	if (!map)
> +		return -1;
> +	return bpf_map__fd(map);
> +}
> +
> +
> +int main(int argc, char **argv)
> +{
> +	const char *probe_name = "net/netif_receive_skb";
> +	const char *file = "test_pidns_kern.o";
> +	int err, bytes, efd, prog_fd, pmu_fd;
> +	struct bpf_pidns_info knsid = {};
> +	int pidmap_fd, nsidmap_fd;
> +	struct perf_event_attr attr = {};
> +	struct bpf_object *obj;
> +	__u32 key = 0, pid;
> +	int exit_code = 1;
> +	struct stat st;
> +	char buf[256];
> +
> +	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
> +	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> +		goto cleanup_cgroup_env;
> +
> +	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
> +	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +		  nsidmap_fd, errno))
> +		goto close_prog;
> +
> +	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> +	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +		  pidmap_fd, errno))
> +		goto close_prog;
> +
> +	pid = getpid();
> +	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> +
> +	snprintf(buf, sizeof(buf),
> +		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
> +	efd = open(buf, O_RDONLY, 0);
> +	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> +		goto close_prog;
> +	bytes = read(efd, buf, sizeof(buf));
> +	close(efd);
> +	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> +		  "bytes %d errno %d\n", bytes, errno))
> +		goto close_prog;
> +
> +	attr.config = strtol(buf, NULL, 0);
> +	attr.type = PERF_TYPE_TRACEPOINT;
> +	attr.sample_type = PERF_SAMPLE_RAW;
> +	attr.sample_period = 1;
> +	attr.wakeup_events = 1;
> +
> +	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> +	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> +		  errno))
> +		goto close_prog;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> +		  errno))
> +		goto close_pmu;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> +		  errno))
> +		goto close_pmu;
> +
> +	/* trigger some syscalls */
> +	sleep(1);
> +
> +	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
> +	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> +		goto close_pmu;
> +
> +	if (stat("/proc/self/ns/pid", &st))
> +		goto close_pmu;
> +
> +	if (CHECK(knsid.nsid != (__u32) st.st_ino, "namespace_id",
> +				"Called in interrupt context bpf %u user %u\n",
> +				knsid.nsid, (__u32) st.st_ino))
> +		goto close_pmu;
> +
> +	exit_code = 0;
> +	printf("%s:PASS\n", argv[0]);
> +
> +close_pmu:
> +	close(pmu_fd);
> +close_prog:
> +	bpf_object__close(obj);
> +cleanup_cgroup_env:
> +	return exit_code;
> +}

the file test_pidns_nmi.c is not needed any more.

^ permalink raw reply

* Re: ixgbe: driver drops packets routed from an IPSec interface with a "bad sa_idx" error
From: Michael Marley @ 2019-09-10 22:53 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev, Jeff Kirsher, steffen.klassert
In-Reply-To: <90dd9f8c-57fa-14c7-5d09-207b84ec3292@pensando.io>

On 9/10/19 5:43 PM, Shannon Nelson wrote:

> On 9/9/19 11:45 AM, Michael Marley wrote:
>> On 2019-09-09 14:21, Shannon Nelson wrote:
>>> On 9/6/19 11:13 AM, Michael Marley wrote:
>>>> (This is also reported at
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=204551, but it was
>>>> recommended that I send it to this list as well.)
>>>>
>>>> I have a put together a router that routes traffic from several
>>>> local subnets from a switch attached to an i82599ES card through an
>>>> IPSec VPN interface set up with StrongSwan. (The VPN is running on
>>>> an unrelated second interface with a different driver.)  Traffic
>>>> from the local interfaces to the VPN works as it should and
>>>> eventually makes it through the VPN server and out to the
>>>> Internet.  The return traffic makes it back to the router and
>>>> tcpdump shows it leaving by the i82599, but the traffic never
>>>> actually makes it onto the wire and I instead get one of
>>>>
>>>> enp1s0: ixgbe_ipsec_tx: bad sa_idx=64512 handle=0
>>>>
>>>> for each packet that should be transmitted.  (The sa_idx and handle
>>>> values are always the same.)
>>>>
>>>> I realized this was probably related to ixgbe's IPSec offloading
>>>> feature, so I tried with the motherboard's integrated e1000e device
>>>> and didn't have the problem.  I tried using ethtool to disable all
>>>> the IPSec-related offloads (tx-esp-segmentation, esp-hw-offload,
>>>> esp-tx-csum-hw-offload), but the problem persisted.  I then tried
>>>> recompiling the kernel with CONFIG_IXGBE_IPSEC=n and that worked
>>>> around the problem.
>>>>
>>>> I was also able to find another instance of the same problem
>>>> reported in Debian at
>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=930443. That
>>>> person seems to be having exactly the same issue as me, down to the
>>>> sa_idx and handle values being the same.
>>>>
>>>> If there are any more details I can provide to make this easier to
>>>> track down, please let me know.
>>>>
>>>> Thanks,
>>>>
>>>> Michael Marley
>>>
>>> Hi Michael,
>>>
>>> Thanks for pointing this out.  The issue this error message is
>>> complaining about is that the handle given to the driver is a bad
>>> value.  The handle is what helps the driver find the right encryption
>>> information, and in this case is an index into an array, one array for
>>> Rx and one for Tx, each of which have up to 1024 entries.  In order to
>>> encode them into a single value, 1024 is added to the Tx values to
>>> make the handle, and 1024 is subtracted to use the handle later.  Note
>>> that the bad sa_idx is 64512, which happens to also be -1024; if the
>>> Tx handle given to ixgbe for xmit is 0, we subtract 1024 from that and
>>> get this bad sa_idx value.
>>>
>>> That handle is supposed to be an opaque value only used by the
>>> driver.  It looks to me like either (a) the driver is not setting up
>>> the handle correctly when the SA is first set up, or (b) something in
>>> the upper levels of the ipsec code is clearing the handle value. We
>>> would need to know more about all the bits in your SA set up to have a
>>> better idea what parts of the ipsec code are being exercised when this
>>> problem happens.
>>>
>>> I currently don't have access to a good ixgbe setup on which to
>>> test/debug this, and I haven't been paying much attention lately to
>>> what's happening in the upper ipsec layers, so my help will be
>>> somewhat limited.  I'm hoping the the Intel folks can add a little
>>> help, so I've copied Jeff Kirsher on this (they'll probably point back
>>> to me since I wrote this chunk :-) ).  I've also copied Stephen
>>> Klassert for his ipsec thoughts.
>>>
>>> In the meantime, can you give more details on the exact ipsec rules
>>> that are used here, and are there any error messages coming from ixgbe
>>> regarding the ipsec rule setup that might help us identify what's
>>> happening?
>>>
>>> Thanks,
>>> sln
>>
>> Hi Shannon,
>>
>> Thanks for your response!  I apologize, I am a bit of a newbie to
>> IPSec myself, so I'm not 100% sure what is the best way to provide
>> the information you need, but here is the (slightly-redacted) output
>> of swanctl --list-sas first from the server and then from the client:
>>
>> <servername>: #24, ESTABLISHED, IKEv2, 3cb75c180ee5dc68_i
>> cc7dae551b603bb7_r*
>>   local  '<serverip>' @ <serverip>[4500]
>>   remote '<clientip>' @ <clientip>[4500]
>>   AES_GCM_16-256/PRF_HMAC_SHA2_512/ECP_384
>>   established 174180s ago
>>   <servername>: #110, reqid 12, INSTALLED, TUNNEL-in-UDP,
>> ESP:AES_GCM_16-256/ECP_384
>>     installed 469s ago
>>     in  c51a0f11 (-|0x00000064), 1548864 bytes, 19575 packets, 6s ago
>>     out c3bd9741 (-|0x00000064), 23618807 bytes, 22865 packets,    
>> 7s ago
>>     local  0.0.0.0/0 ::/0
>>     remote 0.0.0.0/0 ::/0
>>
>> <clientname>: #1, ESTABLISHED, IKEv2, 3cb75c180ee5dc68_i*
>> cc7dae551b603bb7_r
>>   local  '<clientip>' @ <clientip>[4500]
>>   remote '<serverip>' @ <serverip>[4500]
>>   AES_GCM_16-256/PRF_HMAC_SHA2_512/ECP_384
>>   established 174013s ago
>>   <clientname>: #54, reqid 1, INSTALLED, TUNNEL-in-UDP,
>> ESP:AES_GCM_16-256/ECP_384
>>     installed 303s ago, rekeying in 2979s, expires in 3657s
>>     in  c3bd9741 (-|0x00000064), 23178523 bytes, 20725 packets,    
>> 0s ago
>>     out c51a0f11 (-|0x00000064), 1429124 bytes, 17719 packets, 0s ago
>>     local  0.0.0.0/0 ::/0
>>     remote 0.0.0.0/0 ::/0
>>
>> It might also be worth mentioning that I am using an xfrm interface
>> to do "regular" routing rather than the policy-based routing that
>> StrongSwan/IPSec normally uses. If there is anything else that would
>> help more, I would be happy to provide it.
>>
>> Just to be clear though, I'm not trying to run IPSec on the ixgbe
>> interface at all.  The ixgbe adapter is being used to connect the
>> router to the switch on the LAN side of the network.  IPSec is
>> running on the WAN interface without any hardware acceleration
>> (besides AES-NI).  The problem occurs when a computer on the LAN
>> tries to access the WAN.  The outgoing packets work as expected and
>> the incoming packets are routed back out through the ixgbe device
>> toward the LAN client, but the driver drops the packets with the
>> sa_idx error.
>>
>> I hope this helps.
>>
>> Thanks,
>>
>> Michael
>
> I'm not familiar with StrongSwan and its configurations, but I'm
> guessing that if you didn't expressly enable it, perhaps StrongSwan
> enabled the ipsec offload capability.  I would suggest turning it off
> to at least get you passed the immediate issue.  If there isn't an
> obvious configuration knob in StrongSwan, perhaps you can at least use
> ethtool to disable the offload, which should be off be default anyway.
>
> You can check it with "ethtool -k ethX | grep esp-hw-offload" and see
> if it is set.  You can disable it with "ethtool -K ethX esp-hw-offload
> off"
>
> Meanwhile, can you please send the output of the following commands:
> uname -a
> ip xfrm s
> ip xfrm p
> dmesg | grep ixgbe
>
> And any other /var/log/syslog or /var/log/messages that look
> suspicious and might give any more insight to what's happening.
>
> Thanks,
> sln
>
StrongSwan has hardware offload disabled by default, and I didn't enable
it explicitly.  I also already tried turning off all those switches with
ethtool and it has no effect.  This doesn't surprise me though, because
as I said, I don't actually have the IPSec connection running over the
ixgbe device.  The IPSec connection runs over another network adapter
that doesn't support IPSec offload at all.  The problem comes when
traffic received over the IPSec interface is then routed back out
(unencrypted) through the ixgbe device into the local network.

Here is the rest of the data for which you asked:

michael@soapstone:~$ uname -a
Linux soapstone 5.3.0-10-generic #11-Ubuntu SMP Mon Sep 9 15:12:17 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux
michael@soapstone:~$ sudo ip xfrm s
src <srcip> dst <dstip>
        proto esp spi 0xcf6f90d3 reqid 1 mode tunnel
        replay-window 0 flag af-unspec
        aead rfc4106(gcm(aes))
0x254c6298b27ad65f61387c39e060698db777a335081d145ca6706d65b6be95770d2622b4
128
        encap type espinudp sport 4500 dport 4500 addr 0.0.0.0
        anti-replay context: seq 0x0, oseq 0xaaac, bitmap 0x00000000
        if_id 0x64
src <dstip> dst <srcip>
        proto esp spi 0xc6e02140 reqid 1 mode tunnel
        replay-window 32 flag af-unspec
        aead rfc4106(gcm(aes))
0x05473bd76e1b7268b54825b019d19c13a360193bc9aa20137204ea566409356da47fc7d7
128
        encap type espinudp sport 4500 dport 4500 addr 0.0.0.0
        anti-replay context: seq 0xab11, oseq 0x0, bitmap 0xffffffff
        if_id 0x64
michael@soapstone:~$ sudo ip xfrm p
src ::/0 dst ::/0
        dir out priority 399999
        tmpl src <srcip> dst <dstip>
                proto esp spi 0xcf6f90d3 reqid 1 mode tunnel
        if_id 0x64
src 0.0.0.0/0 dst 0.0.0.0/0
        dir out priority 399999
        tmpl src <srcip> dst <dstip>
                proto esp spi 0xcf6f90d3 reqid 1 mode tunnel
        if_id 0x64
src ::/0 dst ::/0
        dir fwd priority 399999
        tmpl src <dstip> dst <srcip>
                proto esp reqid 1 mode tunnel
        if_id 0x64
src ::/0 dst ::/0
        dir in priority 399999
        tmpl src <dstip> dst <srcip>
                proto esp reqid 1 mode tunnel
        if_id 0x64
src 0.0.0.0/0 dst 0.0.0.0/0
        dir fwd priority 399999
        tmpl src <dstip> dst <srcip>
                proto esp reqid 1 mode tunnel
        if_id 0x64
src 0.0.0.0/0 dst 0.0.0.0/0
        dir in priority 399999
        tmpl src <dstip> dst <srcip>
                proto esp reqid 1 mode tunnel
        if_id 0x64
src 0.0.0.0/0 dst 0.0.0.0/0
        socket in priority 0
src 0.0.0.0/0 dst 0.0.0.0/0
        socket out priority 0
src 0.0.0.0/0 dst 0.0.0.0/0
        socket in priority 0
src 0.0.0.0/0 dst 0.0.0.0/0
        socket out priority 0
src ::/0 dst ::/0
        socket in priority 0
src ::/0 dst ::/0
        socket out priority 0
src ::/0 dst ::/0
        socket in priority 0
src ::/0 dst ::/0
        socket out priority 0
michael@soapstone:~$ dmesg | grep -i ixgbe
[    0.780400] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
version 5.1.0-k
[    0.781606] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
[    0.954093] ixgbe 0000:01:00.0: Multiqueue Enabled: Rx Queue count =
8, Tx Queue count = 8 XDP Queue count = 0
[    0.955081] ixgbe 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth
(5 GT/s x8 link)
[    0.955860] ixgbe 0000:01:00.0: MAC: 2, PHY: 14, SFP+: 3, PBA No: Unknown
[    0.956519] ixgbe 0000:01:00.0: 00:1b:21:c0:00:1e
[    0.958079] ixgbe 0000:01:00.0: Intel(R) 10 Gigabit Network Connection
[    0.958884] libphy: ixgbe-mdio: probed
[    0.960220] ixgbe 0000:01:00.0 enp1s0: renamed from eth0
[    2.788208] ixgbe 0000:01:00.0: registered PHC device on enp1s0
[    2.966290] ixgbe 0000:01:00.0 enp1s0: detected SFP+: 3
[    3.110132] ixgbe 0000:01:00.0 enp1s0: NIC Link is Up 10 Gbps, Flow
Control: RX/TX

Thanks,

Michael



^ permalink raw reply

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Vijay Khemka @ 2019-09-10 22:48 UTC (permalink / raw)
  To: Florian Fainelli, David S. Miller, YueHaibing, Andrew Lunn,
	Kate Stewart, Mauro Carvalho Chehab, Luis Chamberlain,
	Thomas Gleixner, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: openbmc @ lists . ozlabs . org, joel@jms.id.au,
	linux-aspeed@lists.ozlabs.org, Sai Dasari
In-Reply-To: <bd5eab2e-6ba6-9e27-54d4-d9534da9d5f7@gmail.com>



On 9/10/19, 3:05 PM, "Florian Fainelli" <f.fainelli@gmail.com> wrote:

    On 9/10/19 2:37 PM, Vijay Khemka wrote:
    > HW checksum generation is not working for AST2500, specially with IPV6
    > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
    > it works perfectly fine with IPV6.
    > 
    > Verified with IPV6 enabled and can do ssh.
    
    How about IPv4, do these packets have problem? If not, can you continue
    advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?

I changed code from (netdev->hw_features &= ~NETIF_F_HW_CSUM) to 
(netdev->hw_features &= ~NETIF_F_ IPV6_CSUM). And it is not working. 
Don't know why. IPV4 works without any change but IPv6 needs HW_CSUM
Disabled.
    
    > 
    > Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
    > ---
    >  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
    >  1 file changed, 3 insertions(+), 2 deletions(-)
    > 
    > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
    > index 030fed65393e..591c9725002b 100644
    > --- a/drivers/net/ethernet/faraday/ftgmac100.c
    > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
    > @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct platform_device *pdev)
    >  	if (priv->use_ncsi)
    >  		netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
    >  
    > -	/* AST2400  doesn't have working HW checksum generation */
    > -	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
    > +	/* AST2400  and AST2500 doesn't have working HW checksum generation */
    > +	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
    > +		   of_device_is_compatible(np, "aspeed,ast2500-mac")))
    >  		netdev->hw_features &= ~NETIF_F_HW_CSUM;
    >  	if (np && of_get_property(np, "no-hw-checksum", NULL))
    >  		netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
    > 
    
    
    -- 
    Florian
    


^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
From: Yonghong Song @ 2019-09-10 22:46 UTC (permalink / raw)
  To: Carlos Neira, netdev@vger.kernel.org
  Cc: ebiederm@xmission.com, brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <20190906150952.23066-3-cneirabustos@gmail.com>



On 9/6/19 4:09 PM, Carlos Neira wrote:
> This helper(bpf_get_current_pidns_info) obtains the active namespace from
> current and returns pid, tgid, device and namespace id as seen from that
> namespace, allowing to instrument a process inside a container.
> 
> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> ---
>   include/linux/bpf.h      |  1 +
>   include/uapi/linux/bpf.h | 35 +++++++++++++++++++-
>   kernel/bpf/core.c        |  1 +
>   kernel/bpf/helpers.c     | 86 ++++++++++++++++++++++++++++++++++++++++++++++++
>   kernel/trace/bpf_trace.c |  2 ++
>   5 files changed, 124 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 5b9d22338606..819cb1c84be0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1055,6 +1055,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
>   extern const struct bpf_func_proto bpf_strtol_proto;
>   extern const struct bpf_func_proto bpf_strtoul_proto;
>   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>   
>   /* Shared helpers among cBPF and eBPF. */
>   void bpf_user_rnd_init_once(void);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b5889257cc33..3ec9aa1438b7 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2747,6 +2747,32 @@ union bpf_attr {
>    *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>    *
>    *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *	Description
> + *		Get tgid, pid and namespace id as seen by the current namespace,
> + *		and device major/minor numbers from /proc/self/ns/pid. Such
> + *		information is stored in *pidns* of size *size*.
> + *
> + *		This helper is used when pid filtering is needed inside a
> + *		container as bpf_get_current_tgid() helper always returns the
> + *		pid id as seen by the root namespace.
> + *	Return
> + *		0 on success
> + *
> + *		On failure, the returned value is one of the following:
> + *
> + *		**-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
> + *		or tgid of the current task.
> + *
> + *		**-ENOENT** if /proc/self/ns/pid does not exists.
> + *
> + *		**-ENOENT** if /proc/self/ns does not exists.
> + *
> + *		**-ENOMEM** if helper internal allocation fails.

-ENOMEM can be removed.

> + *
> + *		**-EPERM** if not able to call helper.
> + *
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -2859,7 +2885,8 @@ union bpf_attr {
>   	FN(sk_storage_get),		\
>   	FN(sk_storage_delete),		\
>   	FN(send_signal),		\
> -	FN(tcp_gen_syncookie),
> +	FN(tcp_gen_syncookie),		\
> +	FN(get_current_pidns_info),
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> @@ -3610,4 +3637,10 @@ struct bpf_sockopt {
>   	__s32	retval;
>   };
>   
> +struct bpf_pidns_info {
> +	__u32 dev;	/* dev_t from /proc/self/ns/pid inode */

    /* dev_t of pid namespace pseudo file (typically /proc/seelf/ns/pid) 
after following symbolic link */

> +	__u32 nsid;
> +	__u32 tgid;
> +	__u32 pid;
> +};
>   #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..3159f2a0188c 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>   
>   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>   {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..8dbe6347893c 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,11 @@
>   #include <linux/uidgid.h>
>   #include <linux/filter.h>
>   #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/kdev_t.h>
> +#include <linux/stat.h>
> +#include <linux/namei.h>
> +#include <linux/version.h>
>   
>   #include "../../lib/kstrtox.h"
>   
> @@ -312,6 +317,87 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   	preempt_enable();
>   }
>   
> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +	 size)
> +{
> +	const char *pidns_path = "/proc/self/ns/pid";
> +	struct pid_namespace *pidns = NULL;
> +	struct filename *fname = NULL;
> +	struct inode *inode;
> +	struct path kp;
> +	pid_t tgid = 0;
> +	pid_t pid = 0;
> +	int ret = -EINVAL;
> +	int len;
> +
> +	if (unlikely(in_interrupt() ||
> +			current->flags & (PF_KTHREAD | PF_EXITING)))
> +		return -EPERM;
> +
> +	if (unlikely(size != sizeof(struct bpf_pidns_info)))
> +		return -EINVAL;
> +
> +	pidns = task_active_pid_ns(current);
> +	if (unlikely(!pidns))
> +		return -ENOENT;
> +
> +	pidns_info->nsid =  pidns->ns.inum;
> +	pid = task_pid_nr_ns(current, pidns);
> +	if (unlikely(!pid))
> +		goto clear;
> +
> +	tgid = task_tgid_nr_ns(current, pidns);
> +	if (unlikely(!tgid))
> +		goto clear;
> +
> +	pidns_info->tgid = (u32) tgid;
> +	pidns_info->pid = (u32) pid;
> +
[...]
> +	fname = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> +	if (unlikely(!fname)) {
> +		ret = -ENOMEM;
> +		goto clear;
> +	}
> +	const size_t fnamesize = offsetof(struct filename, iname[1]);
> +	struct filename *tmp;
> +
> +	tmp = kmalloc(fnamesize, GFP_ATOMIC);
> +	if (unlikely(!tmp)) {
> +		__putname(fname);
> +		ret = -ENOMEM;
> +		goto clear;
> +	}
> +
> +	tmp->name = (char *)fname;
> +	fname = tmp;
> +	len = strlen(pidns_path) + 1;
> +	memcpy((char *)fname->name, pidns_path, len);
> +	fname->uptr = NULL;
> +	fname->aname = NULL;
> +	fname->refcnt = 1;
> +
> +	ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL);
> +	if (ret)
> +		goto clear;
> +
> +	inode = d_backing_inode(kp.dentry);
> +	pidns_info->dev = (u32)inode->i_rdev;
The above can bee replaced with new nsfs interface function
ns_get_inum_dev().
> +
> +	return 0;
> +
> +clear:
> +	memset((void *)pidns_info, 0, (size_t) size);
> +	return ret;
> +}
> +
> +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> +	.func		= bpf_get_current_pidns_info,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg2_type	= ARG_CONST_SIZE,
> +};
> +
>   #ifdef CONFIG_CGROUPS
>   BPF_CALL_0(bpf_get_current_cgroup_id)
>   {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..5e1dc22765a5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   #endif
>   	case BPF_FUNC_send_signal:
>   		return &bpf_send_signal_proto;
> +	case BPF_FUNC_get_current_pidns_info:
> +		return &bpf_get_current_pidns_info_proto;
>   	default:
>   		return NULL;
>   	}
> 

^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
From: Yonghong Song @ 2019-09-10 22:35 UTC (permalink / raw)
  To: Carlos Antonio Neira Bustos
  Cc: Al Viro, netdev@vger.kernel.org, ebiederm@xmission.com,
	brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <20190909174522.GA17882@frodo.byteswizards.com>


Carlos,

Discussed with Eric today for what is the best way to get
the device number for a namespace. The following patch seems
a reasonable start although Eric would like to see
how the helper is used in order to decide whether the
interface looks right.

commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
Author: Yonghong Song <yhs@fb.com>
Date:   Mon Sep 9 21:50:51 2019 -0700

     nsfs: add an interface function ns_get_inum_dev()

     This patch added an interface function
     ns_get_inum_dev(). Given a ns_common structure,
     the function returns the inode and device
     numbers. The function will be used later
     by a newly added bpf helper.

     Signed-off-by: Yonghong Song <yhs@fb.com>

diff --git a/fs/nsfs.c b/fs/nsfs.c
index a0431642c6b5..a603c6fc3f54 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
         return ERR_PTR(-EINVAL);
  }

+/* Get the device number for the current task pidns.
+ */
+void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
+{
+       *inum = ns->inum;
+       *dev = nsfs_mnt->mnt_sb->s_dev;
+}
+
  static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
  {
         struct inode *inode = d_inode(dentry);
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..b8fc680cdf1a 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct 
task_struct *task,
  typedef struct ns_common *ns_get_path_helper_t(void *);
  extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t 
ns_get_cb,
                             void *private_data);
+extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev);

  extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
                         const struct proc_ns_operations *ns_ops);

Could you put the above change and patch #1 and then have
all your other patches? In your kernel change, please use
interface function ns_get_inum_dev() to get pidns inode number
and dev number.

On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote:
> Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and
> provide technical insights needed on this one.
> But how do we move this forward?
> Al Viro's review is clear that this will not work and we should strip the name
> resolution code (thanks for your detailed analysis).
> As there is currently only one instance of the nsfs device on the system,
> I think we could leave out the retrieval of the pidns device number and address it
> when the situation changes.
> What do you think?
> 
> 
> On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote:
>>
>>
>> On 9/6/19 5:10 PM, Al Viro wrote:
>>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote:
>>>
>>>> -bash-4.4$ readlink /proc/self/ns/pid
>>>> pid:[4026531836]
>>>> -bash-4.4$ stat /proc/self/ns/pid
>>>>      File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’
>>>>      Size: 0               Blocks: 0          IO Block: 1024   symbolic link
>>>> Device: 4h/4d   Inode: 344795989   Links: 1
>>>> Access: (0777/lrwxrwxrwx)  Uid: (128203/     yhs)   Gid: (  100/   users)
>>>> Context: user_u:base_r:base_t
>>>> Access: 2019-09-06 16:06:09.431616380 -0700
>>>> Modify: 2019-09-06 16:06:09.431616380 -0700
>>>> Change: 2019-09-06 16:06:09.431616380 -0700
>>>>     Birth: -
>>>> -bash-4.4$
>>>>
>>>> Based on a discussion with Eric Biederman back in 2019 Linux
>>>> Plumbers, Eric suggested that to uniquely identify a
>>>> namespace, device id (major/minor) number should also
>>>> be included. Although today's kernel implementation
>>>> has the same device for all namespace pseudo files,
>>>> but from uapi perspective, device id should be included.
>>>>
>>>> That is the reason why we try to get device id which holds
>>>> pid namespace pseudo file.
>>>>
>>>> Do you have a better suggestion on how to get
>>>> the device id for 'current' pid namespace? Or from design, we
>>>> really should not care about device id at all?
>>>
>>> What the hell is "device id for pid namespace"?  This is the
>>> first time I've heard about that mystery object, so it's
>>> hard to tell where it could be found.
>>>
>>> I can tell you what device numbers are involved in the areas
>>> you seem to be looking in.
>>>
>>> 1) there's whatever device number that gets assigned to
>>> (this) procfs instance.  That, ironically, _is_ per-pidns, but
>>> that of the procfs instance, not that of your process (and
>>> those can be different).  That's what you get in ->st_dev
>>> when doing lstat() of anything in /proc (assuming that
>>> procfs is mounted there, in the first place).  NOTE:
>>> that's lstat(2), not stat(2).  stat(1) uses lstat(2),
>>> unless given -L (in which case it's stat(2) time).  The
>>> difference:
>>>
>>> root@kvm1:~# stat /proc/self/ns/pid
>>>     File: /proc/self/ns/pid -> pid:[4026531836]
>>>     Size: 0               Blocks: 0          IO Block: 1024   symbolic link
>>> Device: 4h/4d   Inode: 17396       Links: 1
>>> Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Access: 2019-09-06 19:43:11.871312319 -0400
>>> Modify: 2019-09-06 19:43:11.871312319 -0400
>>> Change: 2019-09-06 19:43:11.871312319 -0400
>>>    Birth: -
>>> root@kvm1:~# stat -L /proc/self/ns/pid
>>>     File: /proc/self/ns/pid
>>>     Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>> Device: 3h/3d   Inode: 4026531836  Links: 1
>>> Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Access: 2019-09-06 19:43:15.955313293 -0400
>>> Modify: 2019-09-06 19:43:15.955313293 -0400
>>> Change: 2019-09-06 19:43:15.955313293 -0400
>>>    Birth: -
>>>
>>> The former is lstat, the latter - stat.
>>>
>>> 2) device number of the filesystem where the symlink target lives.
>>> In this case, it's nsfs and there's only one instance on the entire
>>> system.  _That_ would be obtained by looking at st_dev in stat(2) on
>>> /proc/self/ns/pid (0:3 above).
>>>
>>> 3) device number *OF* the symlink.  That would be st_rdev in lstat(2).
>>> There's none - it's a symlink, not a character or block device.  It's
>>> always zero and always will be zero.
>>>
>>> 4) the same for the target; st_rdev in stat(2) results and again,
>>> there's no such beast - it's neither character nor block device.
>>>
>>> Your code is looking at (3).  Please, reread any textbook on Unix
>>> in the section that would cover stat(2) and discussion of the
>>> difference between st_dev and st_rdev.
>>>
>>> I have no idea what Eric had been talking about - it's hard to
>>> reconstruct by what you said so far.  Making nsfs per-userns,
>>> perhaps?  But that makes no sense whatsoever, not that userns
>>> ever had...  Cheap shots aside, I really can't guess what that's
>>> about.  Sorry.
>>
>> Thanks for the detailed information. The device number we want
>> is nsfs. Indeed, currently, there is only one instance
>> on the entire system. But not exactly sure what is the possibility
>> to have more than one nsfs device in the future. Maybe per-userns
>> or any other criteria?
>>
>>>
>>> In any case, pathname resolution is *NOT* for the situations where
>>> you can't block.  Even if it's procfs (and from the same pidns as
>>> the process) mounted there, there is no promise that the target
>>> of /proc/self has already been looked up and not evicted from
>>> memory since then.  And in case of cache miss pathwalk will
>>> have to call ->lookup(), which requires locking the directory
>>> (rw_sem, shared).  You can't do that in such context.
>>>
>>> And that doesn't even go into the possibility that process has
>>> something very different mounted on /proc.
>>>
>>> Again, I don't know what it is that you want to get to, but
>>> I would strongly recommend finding a way to get to that data
>>> that would not involve going anywhere near pathname resolution.
>>>
>>> How would you expect the userland to work with that value,
>>> whatever it might be?  If it's just a 32bit field that will
>>> never be read, you might as well store there the same value
>>> you store now (0, that is) in much cheaper and safer way ;-)
>>
>> Suppose inside pid namespace, user can pass the device number,
>> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map
>> or JIT). At runtime, bpf program will try to get device number,
>> say n2, for the 'current' process. If n1 is not the same as
>> n2, that means they are not in the same namespace. 'current'
>> is in the same pid namespace as the user iff
>> n1 == n2 and also pidns id is the same for 'current' and
>> the one with `lsns -t pid`.
>>
>> Are you aware of any way to get the pidns device number
>> for 'current' without going through the pathname
>> lookup?
>>

^ permalink raw reply related

* Re: [PATCH v2] tcp: Add TCP_INFO counter for packets received out-of-order
From: Neal Cardwell @ 2019-09-10 22:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Thomas Higdon, netdev, Jonathan Lemon, Dave Jones
In-Reply-To: <CANn89iKCSae880bS3MTwrm=MeTyPsntyXfkhJS7CfgtpiEpOsQ@mail.gmail.com>

On Tue, Sep 10, 2019 at 4:39 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Sep 10, 2019 at 10:11 PM Thomas Higdon <tph@fb.com> wrote:
> >
> >
> ...
> > Because an additional 32-bit member in struct tcp_info would cause
> > a hole on 64-bit systems, we reserve a struct member '_reserved'.
> ...
> > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > index b3564f85a762..990a5bae3ac1 100644
> > --- a/include/uapi/linux/tcp.h
> > +++ b/include/uapi/linux/tcp.h
> > @@ -270,6 +270,9 @@ struct tcp_info {
> >         __u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
> >         __u32   tcpi_dsack_dups;     /* RFC4898 tcpEStatsStackDSACKDups */
> >         __u32   tcpi_reord_seen;     /* reordering events seen */
> > +
> > +       __u32   _reserved;           /* Reserved for future 32-bit member. */
> > +       __u32   tcpi_rcv_ooopack;    /* Out-of-order packets received */
> >  };
> >
>
> Unfortunately we won't be able to use this hole, because the way the
> TCP_INFO works,
>
> The kernel will report the same size after the reserved field is
> renamed to something else.
>
> User space code is able to detect which fields are there or not based
> on what the kernel
> returns for the size of the structure.

If we are looking for something else useful to use for a __u32 to pair
up with this new field, I would suggest  we export tp->snd_wnd
(send-side receive window) and tp->rcv_wnd (receive-side receive
window) in tcp_info. We could export one of them now, and the other
the next time we need to add a field and need some useful "padding".
These fields could help folks diagnose whether a flow is
receive-window-limited at a given instant, using "ss", etc. I think
there was some interest in this internally in our team a while ago.

neal

^ permalink raw reply

* Re: [net-next 11/14] ixgbe: Prevent u8 wrapping of ITR value to something less than 10us
From: Alexander Duyck @ 2019-09-10 22:19 UTC (permalink / raw)
  To: Jeff Kirsher, davem
  Cc: netdev, nhorman, sassmann, Gregg Leventhal, Andrew Bowers
In-Reply-To: <20190910163434.2449-12-jeffrey.t.kirsher@intel.com>

On Tue, 2019-09-10 at 09:34 -0700, Jeff Kirsher wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> There were a couple cases where the ITR value generated via the adaptive
> ITR scheme could exceed 126. This resulted in the value becoming either 0
> or something less than 10. Switching back and forth between a value less
> than 10 and a value greater than 10 can cause issues as certain hardware
> features such as RSC to not function well when the ITR value has dropped
> that low.

One quick thing we can add on here is:
Fixes: b4ded8327fea ("ixgbe: Update adaptive ITR algorithm")

This is likely something that we may want to backport to stable.

> Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index dc034f4e8cf6..a5398b691aa8 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -2623,7 +2623,7 @@ static void ixgbe_update_itr(struct ixgbe_q_vector *q_vector,
>  		/* 16K ints/sec to 9.2K ints/sec */
>  		avg_wire_size *= 15;
>  		avg_wire_size += 11452;
> -	} else if (avg_wire_size <= 1980) {
> +	} else if (avg_wire_size < 1968) {
>  		/* 9.2K ints/sec to 8K ints/sec */
>  		avg_wire_size *= 5;
>  		avg_wire_size += 22420;
> @@ -2656,6 +2656,8 @@ static void ixgbe_update_itr(struct ixgbe_q_vector *q_vector,
>  	case IXGBE_LINK_SPEED_2_5GB_FULL:
>  	case IXGBE_LINK_SPEED_1GB_FULL:
>  	case IXGBE_LINK_SPEED_10_FULL:
> +		if (avg_wire_size > 8064)
> +			avg_wire_size = 8064;
>  		itr += DIV_ROUND_UP(avg_wire_size,
>  				    IXGBE_ITR_ADAPTIVE_MIN_INC * 64) *
>  		       IXGBE_ITR_ADAPTIVE_MIN_INC;



^ permalink raw reply

* Strange routing with VRF and 5.2.7+
From: Ben Greear @ 2019-09-10 22:17 UTC (permalink / raw)
  To: netdev

Today we were testing creating 200 virtual station vdevs on ath9k, and using
VRF for the routing.

This really slows down the machine in question.

During the minutes that it takes to bring these up and configure them,
we loose network connectivity on the management port.

If I do 'ip route show', it just shows the default route out of eth0, and
the subnet route.  But, if I try to ping the gateway, I get an ICMP error
coming back from the gateway of one of the virtual stations (which should be
safely using VRFs and so not in use when I do a plain 'ping' from the shell).

I tried running tshark on eth0 in the background and running ping, and it captures
no packets leaving eth0.

After some time (and during this time, my various scripts will be (re)configuring
vrfs and stations and related vrf routing tables and such,
but should *not* be messing with the main routing table, then suddenly
things start working again.

I am curious if anyone has seen anything similar or has suggestions for more
ways to debug this.  It seems reproducible, but it is a pain to
debug.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Vijay Khemka @ 2019-09-10 22:13 UTC (permalink / raw)
  To: Florian Fainelli, David S. Miller, YueHaibing, Andrew Lunn,
	Kate Stewart, Mauro Carvalho Chehab, Luis Chamberlain,
	Thomas Gleixner, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: openbmc @ lists . ozlabs . org, joel@jms.id.au,
	linux-aspeed@lists.ozlabs.org, Sai Dasari
In-Reply-To: <bd5eab2e-6ba6-9e27-54d4-d9534da9d5f7@gmail.com>



On 9/10/19, 3:05 PM, "Florian Fainelli" <f.fainelli@gmail.com> wrote:

    On 9/10/19 2:37 PM, Vijay Khemka wrote:
    > HW checksum generation is not working for AST2500, specially with IPV6
    > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
    > it works perfectly fine with IPV6.
    > 
    > Verified with IPV6 enabled and can do ssh.
    
    How about IPv4, do these packets have problem? If not, can you continue
    advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?

Yes IPv4 works. Let me test with your suggestion and will update patch.
    
    > 
    > Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
    > ---
    >  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
    >  1 file changed, 3 insertions(+), 2 deletions(-)
    > 
    > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
    > index 030fed65393e..591c9725002b 100644
    > --- a/drivers/net/ethernet/faraday/ftgmac100.c
    > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
    > @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct platform_device *pdev)
    >  	if (priv->use_ncsi)
    >  		netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
    >  
    > -	/* AST2400  doesn't have working HW checksum generation */
    > -	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
    > +	/* AST2400  and AST2500 doesn't have working HW checksum generation */
    > +	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
    > +		   of_device_is_compatible(np, "aspeed,ast2500-mac")))
    >  		netdev->hw_features &= ~NETIF_F_HW_CSUM;
    >  	if (np && of_get_property(np, "no-hw-checksum", NULL))
    >  		netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
    > 
    
    
    -- 
    Florian
    


^ permalink raw reply

* Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500
From: Florian Fainelli @ 2019-09-10 22:05 UTC (permalink / raw)
  To: Vijay Khemka, David S. Miller, YueHaibing, Andrew Lunn,
	Kate Stewart, Mauro Carvalho Chehab, Luis Chamberlain,
	Thomas Gleixner, netdev, linux-kernel
  Cc: openbmc @ lists . ozlabs . org, joel, linux-aspeed, sdasari
In-Reply-To: <20190910213734.3112330-1-vijaykhemka@fb.com>

On 9/10/19 2:37 PM, Vijay Khemka wrote:
> HW checksum generation is not working for AST2500, specially with IPV6
> over NCSI. All TCP packets with IPv6 get dropped. By disabling this
> it works perfectly fine with IPV6.
> 
> Verified with IPV6 enabled and can do ssh.

How about IPv4, do these packets have problem? If not, can you continue
advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?

> 
> Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
> ---
>  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
> index 030fed65393e..591c9725002b 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct platform_device *pdev)
>  	if (priv->use_ncsi)
>  		netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
>  
> -	/* AST2400  doesn't have working HW checksum generation */
> -	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
> +	/* AST2400  and AST2500 doesn't have working HW checksum generation */
> +	if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac") ||
> +		   of_device_is_compatible(np, "aspeed,ast2500-mac")))
>  		netdev->hw_features &= ~NETIF_F_HW_CSUM;
>  	if (np && of_get_property(np, "no-hw-checksum", NULL))
>  		netdev->hw_features &= ~(NETIF_F_HW_CSUM | NETIF_F_RXCSUM);
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] tcp: force a PSH flag on TSO packets
From: Soheil Hassas Yeganeh @ 2019-09-10 21:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Neal Cardwell,
	Yuchung Cheng, Daniel Borkmann, Tariq Toukan
In-Reply-To: <20190910214928.220727-1-edumazet@google.com>

On Tue, Sep 10, 2019 at 5:49 PM Eric Dumazet <edumazet@google.com> wrote:
>
> When tcp sends a TSO packet, adding a PSH flag on it
> reduces the sojourn time of GRO packet in GRO receivers.
>
> This is particularly the case under pressure, since RX queues
> receive packets for many concurrent flows.
>
> A sender can give a hint to GRO engines when it is
> appropriate to flush a super-packet, especially when pacing
> is in the picture, since next packet is probably delayed by
> one ms.
>
> Having less packets in GRO engine reduces chance
> of LRU eviction or inflated RTT, and reduces GRO cost.
>
> We found recently that we must not set the PSH flag on
> individual full-size MSS segments [1] :
>
>  Under pressure (CWR state), we better let the packet sit
>  for a small delay (depending on NAPI logic) so that the
>  ACK packet is delayed, and thus next packet we send is
>  also delayed a bit. Eventually the bottleneck queue can
>  be drained. DCTCP flows with CWND=1 have demonstrated
>  the issue.
>
> This patch allows to slowdown the aggregate traffic without
> involving high resolution timers on senders and/or
> receivers.
>
> It has been used at Google for about four years,
> and has been discussed at various networking conferences.
>
> [1] segments smaller than MSS already have PSH flag set
>     by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
>     has been requested by the user.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Tariq Toukan <tariqt@mellanox.com>


Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Eric, thank you for the patch and the very nice commit message!

> ---
>  net/ipv4/tcp_output.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 42abc9bd687a5fea627cbc7cfa750d022f393d84..fec6d67bfd146dc78f0f25173fd71b8b8cc752fe 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1050,11 +1050,22 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
>         tcb = TCP_SKB_CB(skb);
>         memset(&opts, 0, sizeof(opts));
>
> -       if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
> +       if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
>                 tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
> -       else
> +       } else {
>                 tcp_options_size = tcp_established_options(sk, skb, &opts,
>                                                            &md5);
> +               /* Force a PSH flag on all (GSO) packets to expedite GRO flush
> +                * at receiver : This slightly improve GRO performance.
> +                * Note that we do not force the PSH flag for non GSO packets,
> +                * because they might be sent under high congestion events,
> +                * and in this case it is better to delay the delivery of 1-MSS
> +                * packets and thus the corresponding ACK packet that would
> +                * release the following packet.
> +                */
> +               if (tcp_skb_pcount(skb) > 1)
> +                       tcb->tcp_flags |= TCPHDR_PSH;
> +       }
>         tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
>
>         /* if no packet is in qdisc/device queue, then allow XPS to select
> --
> 2.23.0.162.g0b9fbb3734-goog
>

^ permalink raw reply

* [PATCH net-next] tcp: force a PSH flag on TSO packets
From: Eric Dumazet @ 2019-09-10 21:49 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, Daniel Borkmann, Tariq Toukan

When tcp sends a TSO packet, adding a PSH flag on it
reduces the sojourn time of GRO packet in GRO receivers.

This is particularly the case under pressure, since RX queues
receive packets for many concurrent flows.

A sender can give a hint to GRO engines when it is
appropriate to flush a super-packet, especially when pacing
is in the picture, since next packet is probably delayed by
one ms.

Having less packets in GRO engine reduces chance
of LRU eviction or inflated RTT, and reduces GRO cost.

We found recently that we must not set the PSH flag on
individual full-size MSS segments [1] :

 Under pressure (CWR state), we better let the packet sit
 for a small delay (depending on NAPI logic) so that the
 ACK packet is delayed, and thus next packet we send is
 also delayed a bit. Eventually the bottleneck queue can
 be drained. DCTCP flows with CWND=1 have demonstrated
 the issue.

This patch allows to slowdown the aggregate traffic without
involving high resolution timers on senders and/or
receivers.

It has been used at Google for about four years,
and has been discussed at various networking conferences.

[1] segments smaller than MSS already have PSH flag set
    by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
    has been requested by the user.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tariq Toukan <tariqt@mellanox.com>
---
 net/ipv4/tcp_output.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 42abc9bd687a5fea627cbc7cfa750d022f393d84..fec6d67bfd146dc78f0f25173fd71b8b8cc752fe 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1050,11 +1050,22 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));

-	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
+	if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
-	else
+	} else {
 		tcp_options_size = tcp_established_options(sk, skb, &opts,
 							   &md5);
+		/* Force a PSH flag on all (GSO) packets to expedite GRO flush
+		 * at receiver : This slightly improve GRO performance.
+		 * Note that we do not force the PSH flag for non GSO packets,
+		 * because they might be sent under high congestion events,
+		 * and in this case it is better to delay the delivery of 1-MSS
+		 * packets and thus the corresponding ACK packet that would
+		 * release the following packet.
+		 */
+		if (tcp_skb_pcount(skb) > 1)
+			tcb->tcp_flags |= TCPHDR_PSH;
+	}
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);

 	/* if no packet is in qdisc/device queue, then allow XPS to select
-- 
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply related

* [net-next 3/3] net/mlx5: FWTrace, Reduce stack usage
From: Saeed Mahameed @ 2019-09-10 21:46 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed, Arnd Bergmann
In-Reply-To: <20190910214542.8433-1-saeedm@mellanox.com>

Mark mlx5_tracer_print_trace as noinline as the function only uses 512
bytes on the stack to avoid the following build warning:

drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c:660:13: error: stack frame size of 1032 bytes in function 'mlx5_fw_tracer_handle_traces' [-Werror,-Wframe-larger-than=]

Fixes: 70dd6fdb8987 ("net/mlx5: FW tracer, parse traces and kernel tracing support")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 2011eaf15cc5..94d7b69a95c7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -553,9 +553,10 @@ static void mlx5_fw_tracer_save_trace(struct mlx5_fw_tracer *tracer,
 	mutex_unlock(&tracer->st_arr.lock);
 }
 
-static void mlx5_tracer_print_trace(struct tracer_string_format *str_frmt,
-				    struct mlx5_core_dev *dev,
-				    u64 trace_timestamp)
+static noinline
+void mlx5_tracer_print_trace(struct tracer_string_format *str_frmt,
+			     struct mlx5_core_dev *dev,
+			     u64 trace_timestamp)
 {
 	char	tmp[512];
 
-- 
2.21.0


^ permalink raw reply related

* [net-next 2/3] net/mlx5: Fix addr's type in mlx5dr_icm_dm
From: Saeed Mahameed @ 2019-09-10 21:46 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Nathan Chancellor, Arnd Bergmann,
	Saeed Mahameed
In-Reply-To: <20190910214542.8433-1-saeedm@mellanox.com>

From: Nathan Chancellor <natechancellor@gmail.com>

clang errors when CONFIG_PHYS_ADDR_T_64BIT is not set:

drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c:121:8:
error: incompatible pointer types passing 'u64 *' (aka 'unsigned long
long *') to parameter of type 'phys_addr_t *' (aka 'unsigned int *')
[-Werror,-Wincompatible-pointer-types]
                                   &icm_mr->dm.addr, &icm_mr->dm.obj_id);
                                   ^~~~~~~~~~~~~~~~
include/linux/mlx5/driver.h:1092:39: note: passing argument to parameter
'addr' here
                         u64 length, u16 uid, phys_addr_t *addr, u32 *obj_id);
                                                           ^
1 error generated.

Use phys_addr_t for addr's type in mlx5dr_icm_dm, which won't change
anything with 64-bit builds because phys_addr_t is u64 when
CONFIG_PHYS_ADDR_T_64BIT is set, which is always when CONFIG_64BIT is
set.

Fixes: 29cf8febd185 ("net/mlx5: DR, ICM pool memory allocator")
Link: https://github.com/ClangBuiltLinux/linux/issues/653
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c
index e76f61e7555e..913f1e5aaaf2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c
@@ -53,7 +53,7 @@ struct mlx5dr_icm_pool {
 struct mlx5dr_icm_dm {
 	u32 obj_id;
 	enum mlx5_sw_icm_type type;
-	u64 addr;
+	phys_addr_t addr;
 	size_t length;
 };
 
-- 
2.21.0


^ permalink raw reply related

* [net-next 1/3] net/mlx5: Fix rt's type in dr_action_create_reformat_action
From: Saeed Mahameed @ 2019-09-10 21:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Nathan Chancellor, Austin Kim,
	Arnd Bergmann, Saeed Mahameed
In-Reply-To: <20190910214542.8433-1-saeedm@mellanox.com>

From: Nathan Chancellor <natechancellor@gmail.com>

clang warns:

drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c:1080:9:
warning: implicit conversion from enumeration type 'enum
mlx5_reformat_ctx_type' to different enumeration type 'enum
mlx5dr_action_type' [-Wenum-conversion]
                        rt = MLX5_REFORMAT_TYPE_L2_TO_L2_TUNNEL;
                           ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c:1082:9:
warning: implicit conversion from enumeration type 'enum
mlx5_reformat_ctx_type' to different enumeration type 'enum
mlx5dr_action_type' [-Wenum-conversion]
                        rt = MLX5_REFORMAT_TYPE_L2_TO_L3_TUNNEL;
                           ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c:1084:51:
warning: implicit conversion from enumeration type 'enum
mlx5dr_action_type' to different enumeration type 'enum
mlx5_reformat_ctx_type' [-Wenum-conversion]
                ret = mlx5dr_cmd_create_reformat_ctx(dmn->mdev, rt, data_sz, data,
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~            ^~
3 warnings generated.

Use the right type for rt, which is mlx5_reformat_ctx_type so there are
no warnings about mismatched types.

Fixes: 9db810ed2d37 ("net/mlx5: DR, Expose steering action functionality")
Link: https://github.com/ClangBuiltLinux/linux/issues/652
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Austin Kim <austindh.kim@gmail.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c
index a02f87f85c17..7d81a7735de5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c
@@ -1074,7 +1074,7 @@ dr_action_create_reformat_action(struct mlx5dr_domain *dmn,
 	case DR_ACTION_TYP_L2_TO_TNL_L2:
 	case DR_ACTION_TYP_L2_TO_TNL_L3:
 	{
-		enum mlx5dr_action_type rt;
+		enum mlx5_reformat_ctx_type rt;
 
 		if (action->action_type == DR_ACTION_TYP_L2_TO_TNL_L2)
 			rt = MLX5_REFORMAT_TYPE_L2_TO_L2_TUNNEL;
-- 
2.21.0


^ permalink raw reply related

* Re: ixgbe: driver drops packets routed from an IPSec interface with a "bad sa_idx" error
From: Shannon Nelson @ 2019-09-10 21:43 UTC (permalink / raw)
  To: Michael Marley; +Cc: netdev, Jeff Kirsher, steffen.klassert
In-Reply-To: <fb63dec226170199e9b0fd1b356d2314@michaelmarley.com>

On 9/9/19 11:45 AM, Michael Marley wrote:
> On 2019-09-09 14:21, Shannon Nelson wrote:
>> On 9/6/19 11:13 AM, Michael Marley wrote:
>>> (This is also reported at 
>>> https://bugzilla.kernel.org/show_bug.cgi?id=204551, but it was 
>>> recommended that I send it to this list as well.)
>>>
>>> I have a put together a router that routes traffic from several 
>>> local subnets from a switch attached to an i82599ES card through an 
>>> IPSec VPN interface set up with StrongSwan. (The VPN is running on 
>>> an unrelated second interface with a different driver.)  Traffic 
>>> from the local interfaces to the VPN works as it should and 
>>> eventually makes it through the VPN server and out to the Internet.  
>>> The return traffic makes it back to the router and tcpdump shows it 
>>> leaving by the i82599, but the traffic never actually makes it onto 
>>> the wire and I instead get one of
>>>
>>> enp1s0: ixgbe_ipsec_tx: bad sa_idx=64512 handle=0
>>>
>>> for each packet that should be transmitted.  (The sa_idx and handle 
>>> values are always the same.)
>>>
>>> I realized this was probably related to ixgbe's IPSec offloading 
>>> feature, so I tried with the motherboard's integrated e1000e device 
>>> and didn't have the problem.  I tried using ethtool to disable all 
>>> the IPSec-related offloads (tx-esp-segmentation, esp-hw-offload, 
>>> esp-tx-csum-hw-offload), but the problem persisted.  I then tried 
>>> recompiling the kernel with CONFIG_IXGBE_IPSEC=n and that worked 
>>> around the problem.
>>>
>>> I was also able to find another instance of the same problem 
>>> reported in Debian at 
>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=930443. That 
>>> person seems to be having exactly the same issue as me, down to the 
>>> sa_idx and handle values being the same.
>>>
>>> If there are any more details I can provide to make this easier to 
>>> track down, please let me know.
>>>
>>> Thanks,
>>>
>>> Michael Marley
>>
>> Hi Michael,
>>
>> Thanks for pointing this out.  The issue this error message is
>> complaining about is that the handle given to the driver is a bad
>> value.  The handle is what helps the driver find the right encryption
>> information, and in this case is an index into an array, one array for
>> Rx and one for Tx, each of which have up to 1024 entries.  In order to
>> encode them into a single value, 1024 is added to the Tx values to
>> make the handle, and 1024 is subtracted to use the handle later.  Note
>> that the bad sa_idx is 64512, which happens to also be -1024; if the
>> Tx handle given to ixgbe for xmit is 0, we subtract 1024 from that and
>> get this bad sa_idx value.
>>
>> That handle is supposed to be an opaque value only used by the
>> driver.  It looks to me like either (a) the driver is not setting up
>> the handle correctly when the SA is first set up, or (b) something in
>> the upper levels of the ipsec code is clearing the handle value. We
>> would need to know more about all the bits in your SA set up to have a
>> better idea what parts of the ipsec code are being exercised when this
>> problem happens.
>>
>> I currently don't have access to a good ixgbe setup on which to
>> test/debug this, and I haven't been paying much attention lately to
>> what's happening in the upper ipsec layers, so my help will be
>> somewhat limited.  I'm hoping the the Intel folks can add a little
>> help, so I've copied Jeff Kirsher on this (they'll probably point back
>> to me since I wrote this chunk :-) ).  I've also copied Stephen
>> Klassert for his ipsec thoughts.
>>
>> In the meantime, can you give more details on the exact ipsec rules
>> that are used here, and are there any error messages coming from ixgbe
>> regarding the ipsec rule setup that might help us identify what's
>> happening?
>>
>> Thanks,
>> sln
>
> Hi Shannon,
>
> Thanks for your response!  I apologize, I am a bit of a newbie to 
> IPSec myself, so I'm not 100% sure what is the best way to provide the 
> information you need, but here is the (slightly-redacted) output of 
> swanctl --list-sas first from the server and then from the client:
>
> <servername>: #24, ESTABLISHED, IKEv2, 3cb75c180ee5dc68_i 
> cc7dae551b603bb7_r*
>   local  '<serverip>' @ <serverip>[4500]
>   remote '<clientip>' @ <clientip>[4500]
>   AES_GCM_16-256/PRF_HMAC_SHA2_512/ECP_384
>   established 174180s ago
>   <servername>: #110, reqid 12, INSTALLED, TUNNEL-in-UDP, 
> ESP:AES_GCM_16-256/ECP_384
>     installed 469s ago
>     in  c51a0f11 (-|0x00000064), 1548864 bytes, 19575 packets, 6s ago
>     out c3bd9741 (-|0x00000064), 23618807 bytes, 22865 packets,     7s 
> ago
>     local  0.0.0.0/0 ::/0
>     remote 0.0.0.0/0 ::/0
>
> <clientname>: #1, ESTABLISHED, IKEv2, 3cb75c180ee5dc68_i* 
> cc7dae551b603bb7_r
>   local  '<clientip>' @ <clientip>[4500]
>   remote '<serverip>' @ <serverip>[4500]
>   AES_GCM_16-256/PRF_HMAC_SHA2_512/ECP_384
>   established 174013s ago
>   <clientname>: #54, reqid 1, INSTALLED, TUNNEL-in-UDP, 
> ESP:AES_GCM_16-256/ECP_384
>     installed 303s ago, rekeying in 2979s, expires in 3657s
>     in  c3bd9741 (-|0x00000064), 23178523 bytes, 20725 packets,     0s 
> ago
>     out c51a0f11 (-|0x00000064), 1429124 bytes, 17719 packets, 0s ago
>     local  0.0.0.0/0 ::/0
>     remote 0.0.0.0/0 ::/0
>
> It might also be worth mentioning that I am using an xfrm interface to 
> do "regular" routing rather than the policy-based routing that 
> StrongSwan/IPSec normally uses. If there is anything else that would 
> help more, I would be happy to provide it.
>
> Just to be clear though, I'm not trying to run IPSec on the ixgbe 
> interface at all.  The ixgbe adapter is being used to connect the 
> router to the switch on the LAN side of the network.  IPSec is running 
> on the WAN interface without any hardware acceleration (besides 
> AES-NI).  The problem occurs when a computer on the LAN tries to 
> access the WAN.  The outgoing packets work as expected and the 
> incoming packets are routed back out through the ixgbe device toward 
> the LAN client, but the driver drops the packets with the sa_idx error.
>
> I hope this helps.
>
> Thanks,
>
> Michael

I'm not familiar with StrongSwan and its configurations, but I'm 
guessing that if you didn't expressly enable it, perhaps StrongSwan 
enabled the ipsec offload capability.  I would suggest turning it off to 
at least get you passed the immediate issue.  If there isn't an obvious 
configuration knob in StrongSwan, perhaps you can at least use ethtool 
to disable the offload, which should be off be default anyway.

You can check it with "ethtool -k ethX | grep esp-hw-offload" and see if 
it is set.  You can disable it with "ethtool -K ethX esp-hw-offload off"

Meanwhile, can you please send the output of the following commands:
uname -a
ip xfrm s
ip xfrm p
dmesg | grep ixgbe

And any other /var/log/syslog or /var/log/messages that look suspicious 
and might give any more insight to what's happening.

Thanks,
sln


^ permalink raw reply

* [pull request][net-next 0/3] Mellanox, mlx5 build cleanup 2019-09-10
From: Saeed Mahameed @ 2019-09-10 21:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed

Hi Dave,

This series provides three build warnings cleanup patches for mlx5,
Originally i wanted to wait a bit more and attach more patches to this
series, but apparently this can't wait since already 3 different patches
for the same fix were submitted this week :).

For more information please see tag log below.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit 074be7fd99a29ff36dcb2c036b3b31a6b670b3cf:

  Merge branch 'nfp-implement-firmware-loading-policy' (2019-09-10 17:29:27 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2019-09-10

for you to fetch changes up to fa355bb1b0373e7fe087cfc830b1b0b9b6130388:

  net/mlx5: FWTrace, Reduce stack usage (2019-09-10 13:43:27 -0700)

----------------------------------------------------------------
mlx5-updates-2019-09-10

Misc build warnings cleanup for mlx5:

1) Reduce stack usage in FW trace
2) Fix addr's type in mlx5dr_icm_dm
3) Fix rt's type in dr_action_create_reformat_action

----------------------------------------------------------------
Nathan Chancellor (2):
      net/mlx5: Fix rt's type in dr_action_create_reformat_action
      net/mlx5: Fix addr's type in mlx5dr_icm_dm

Saeed Mahameed (1):
      net/mlx5: FWTrace, Reduce stack usage

 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c       | 7 ++++---
 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c   | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c | 2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

^ permalink raw reply

* [PATCH] cxgb4: Fix spelling typos
From: Arkadiusz Drabczyk @ 2019-09-10 20:49 UTC (permalink / raw)
  To: vishal; +Cc: davem, netdev, linux-kernel

Fix several spelling typos in comments in t4_hw.c.

Signed-off-by: Arkadiusz Drabczyk <arkadiusz@drabczyk.org>
---
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index f7fc553..f2a7824 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -329,7 +329,7 @@ int t4_wr_mbox_meat_timeout(struct adapter *adap, int mbox, const void *cmd,
 	for (i = 0; ; i += ms) {
 		/* If we've waited too long, return a busy indication.  This
 		 * really ought to be based on our initial position in the
-		 * mailbox access list but this is a start.  We very rearely
+		 * mailbox access list but this is a start.  We very rarely
 		 * contend on access to the mailbox ...
 		 */
 		pcie_fw = t4_read_reg(adap, PCIE_FW_A);
@@ -606,7 +606,7 @@ void t4_memory_rw_residual(struct adapter *adap, u32 off, u32 addr, u8 *buf,
  *
  *	Reads/writes an [almost] arbitrary memory region in the firmware: the
  *	firmware memory address and host buffer must be aligned on 32-bit
- *	boudaries; the length may be arbitrary.  The memory is transferred as
+ *	boundaries; the length may be arbitrary.  The memory is transferred as
  *	a raw byte sequence from/to the firmware's memory.  If this memory
  *	contains data structures which contain multi-byte integers, it's the
  *	caller's responsibility to perform appropriate byte order conversions.
@@ -3774,7 +3774,7 @@ int t4_phy_fw_ver(struct adapter *adap, int *phy_fw_ver)
  *	A negative error number will be returned if an error occurs.  If
  *	version number support is available and there's no need to upgrade
  *	the firmware, 0 will be returned.  If firmware is successfully
- *	transferred to the adapter, 1 will be retured.
+ *	transferred to the adapter, 1 will be returned.
  *
  *	NOTE: some adapters only have local RAM to store the PHY firmware.  As
  *	a result, a RESET of the adapter would cause that RAM to lose its
@@ -3808,7 +3808,7 @@ int t4_load_phy_fw(struct adapter *adap,
 	}
 
 	/* Ask the firmware where it wants us to copy the PHY firmware image.
-	 * The size of the file requires a special version of the READ coommand
+	 * The size of the file requires a special version of the READ command
 	 * which will pass the file size via the values field in PARAMS_CMD and
 	 * retrieve the return value from firmware and place it in the same
 	 * buffer values
@@ -4082,7 +4082,7 @@ static inline fw_port_cap32_t cc_to_fwcap_pause(enum cc_pause cc_pause)
 		fw_pause |= FW_PORT_CAP32_FORCE_PAUSE;
 
 	/* Translate orthogonal Pause controls into IEEE 802.3 Pause,
-	 * Asymetrical Pause for use in reporting to upper layer OS code, etc.
+	 * Asymmetrical Pause for use in reporting to upper layer OS code, etc.
 	 * Note that these bits are ignored in L1 Configure commands.
 	 */
 	if (cc_pause & PAUSE_RX) {
@@ -4151,7 +4151,7 @@ fw_port_cap32_t t4_link_acaps(struct adapter *adapter, unsigned int port,
 	/* Convert Common Code Forward Error Control settings into the
 	 * Firmware's API.  If the current Requested FEC has "Automatic"
 	 * (IEEE 802.3) specified, then we use whatever the Firmware
-	 * sent us as part of it's IEEE 802.3-based interpratation of
+	 * sent us as part of its IEEE 802.3-based interpretation of
 	 * the Transceiver Module EPROM FEC parameters.  Otherwise we
 	 * use whatever is in the current Requested FEC settings.
 	 */
@@ -4248,7 +4248,7 @@ int t4_link_l1cfg_core(struct adapter *adapter, unsigned int mbox,
 
 	/* Unfortunately, even if the Requested Port Capabilities "fit" within
 	 * the Physical Port Capabilities, some combinations of features may
-	 * still not be leagal.  For example, 40Gb/s and Reed-Solomon Forward
+	 * still not be legal.  For example, 40Gb/s and Reed-Solomon Forward
 	 * Error Correction.  So if the Firmware rejects the L1 Configure
 	 * request, flag that here.
 	 */
@@ -6797,7 +6797,7 @@ int t4_sge_ctxt_flush(struct adapter *adap, unsigned int mbox, int ctxt_type)
 }
 
 /**
- *	t4_read_sge_dbqtimers - reag SGE Doorbell Queue Timer values
+ *	t4_read_sge_dbqtimers - read SGE Doorbell Queue Timer values
  *	@adap - the adapter
  *	@ndbqtimers: size of the provided SGE Doorbell Queue Timer table
  *	@dbqtimers: SGE Doorbell Queue Timer table
@@ -6925,8 +6925,8 @@ int t4_fw_hello(struct adapter *adap, unsigned int mbox, unsigned int evt_mbox,
 			waiting -= 50;
 
 			/*
-			 * If neither Error nor Initialialized are indicated
-			 * by the firmware keep waiting till we exaust our
+			 * If neither Error nor Initialized are indicated
+			 * by the firmware keep waiting till we exhaust our
 			 * timeout ... and then retry if we haven't exhausted
 			 * our retries ...
 			 */
@@ -7238,7 +7238,7 @@ int t4_fl_pkt_align(struct adapter *adap)
 	 * separately.  The actual Ingress Packet Data alignment boundary
 	 * within Packed Buffer Mode is the maximum of these two
 	 * specifications.  (Note that it makes no real practical sense to
-	 * have the Pading Boudary be larger than the Packing Boundary but you
+	 * have the Padding Boundary be larger than the Packing Boundary but you
 	 * could set the chip up that way and, in fact, legacy T4 code would
 	 * end doing this because it would initialize the Padding Boundary and
 	 * leave the Packing Boundary initialized to 0 (16 bytes).)
@@ -8973,10 +8973,10 @@ static int t4_get_flash_params(struct adapter *adap)
 			goto found;
 		}
 
-	/* Decode Flash part size.  The code below looks repetative with
+	/* Decode Flash part size.  The code below looks repetitive with
 	 * common encodings, but that's not guaranteed in the JEDEC
-	 * specification for the Read JADEC ID command.  The only thing that
-	 * we're guaranteed by the JADEC specification is where the
+	 * specification for the Read JEDEC ID command.  The only thing that
+	 * we're guaranteed by the JEDEC specification is where the
 	 * Manufacturer ID is in the returned result.  After that each
 	 * Manufacturer ~could~ encode things completely differently.
 	 * Note, all Flash parts must have 64KB sectors.
@@ -9317,7 +9317,7 @@ int t4_init_devlog_params(struct adapter *adap)
 	struct fw_devlog_cmd devlog_cmd;
 	int ret;
 
-	/* If we're dealing with newer firmware, the Device Log Paramerters
+	/* If we're dealing with newer firmware, the Device Log Parameters
 	 * are stored in a designated register which allows us to access the
 	 * Device Log even if we can't talk to the firmware.
 	 */
-- 
2.9.0


^ permalink raw reply related

* Re: [PATCH 3/7] dt-bindings: mv88e6xxx: add ability to set default queue priorities per port
From: Vladimir Oltean @ 2019-09-10 20:46 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Florian Fainelli, Robert Beckett, netdev, Andrew Lunn,
	David S. Miller, Rob Herring, Mark Rutland, devicetree,
	Jiri Pirko, Ido Schimmel
In-Reply-To: <20190910124952.GG32337@t480s.localdomain>

Hi guys,

On 10/09/2019, Vivien Didelot <vivien.didelot@gmail.com> wrote:
> Hi Robert,
>
> On Tue, 10 Sep 2019 09:42:24 -0700, Florian Fainelli <f.fainelli@gmail.com>
> wrote:
>> This is a vendor specific driver/property,
>> marvell,default-queue-priority (which be cheapskate on words) would be
>> more readable. But still, I have some more fundamental issues with the
>> general approach, see my response in the cover letter.
>
> As Florian said, the DT is unlikely to welcome vendor specific nodes for
> configuration which may be generic through standard network userspace
> tools.
>
>
> Thanks,
>
> 	Vivien
>

While I do agree that the DT bindings are a big no-no for QoS
settings, the topic is interesting.
What is the user space knob for configuring port-default priority (say
RX queue)?
Something like this maybe? (a very forced "matchall" with rxnfc)
ethtool --config-nfc eth0 flow-type ether src 00:00:00:00:00:00 m
00:00:00:00:00:00 action 5

Regards,
-Vladimir

^ permalink raw reply

* Re: [PATCH v2] tcp: Add TCP_INFO counter for packets received out-of-order
From: Eric Dumazet @ 2019-09-10 20:38 UTC (permalink / raw)
  To: Thomas Higdon; +Cc: netdev, Jonathan Lemon, Dave Jones
In-Reply-To: <20190910201128.3967163-1-tph@fb.com>

On Tue, Sep 10, 2019 at 10:11 PM Thomas Higdon <tph@fb.com> wrote:
>
>
...
> Because an additional 32-bit member in struct tcp_info would cause
> a hole on 64-bit systems, we reserve a struct member '_reserved'.
...
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index b3564f85a762..990a5bae3ac1 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -270,6 +270,9 @@ struct tcp_info {
>         __u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
>         __u32   tcpi_dsack_dups;     /* RFC4898 tcpEStatsStackDSACKDups */
>         __u32   tcpi_reord_seen;     /* reordering events seen */
> +
> +       __u32   _reserved;           /* Reserved for future 32-bit member. */
> +       __u32   tcpi_rcv_ooopack;    /* Out-of-order packets received */
>  };
>

Unfortunately we won't be able to use this hole, because the way the
TCP_INFO works,

The kernel will report the same size after the reserved field is
renamed to something else.

User space code is able to detect which fields are there or not based
on what the kernel
returns for the size of the structure.

^ permalink raw reply

* Re: [PATCH net-next 0/6] net: stmmac: Improvements for -next
From: Jesse Brandeburg @ 2019-09-10 20:36 UTC (permalink / raw)
  To: Jose Abreu
  Cc: netdev, Joao Pinto, Giuseppe Cavallaro, Alexandre Torgue,
	David S. Miller, Maxime Coquelin, linux-stm32, linux-arm-kernel,
	linux-kernel, jesse.brandeburg
In-Reply-To: <cover.1568126224.git.joabreu@synopsys.com>

On Tue, 10 Sep 2019 16:41:21 +0200 Jose wrote:
> Misc patches for -next. It includes:
>  - Two fixes for features in -next only
>  - New features support for GMAC cores (which includes GMAC4 and GMAC5)
> 
> ---
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> Cc: Jose Abreu <joabreu@synopsys.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> Cc: netdev@vger.kernel.org
> Cc: linux-stm32@st-md-mailman.stormreply.com
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> ---
> 
> Jose Abreu (6):
>   net: stmmac: Prevent divide-by-zero
>   net: stmmac: Add VLAN HASH filtering support in GMAC4+
>   net: stmmac: xgmac: Reinitialize correctly a variable
>   net: stmmac: Add support for SA Insertion/Replacement in GMAC4+
>   net: stmmac: Add support for VLAN Insertion Offload in GMAC4+
>   net: stmmac: ARP Offload for GMAC4+ Cores

For the series, looks good to me.

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* [PATCH net] net: Fix null de-reference of device refcount
From: Subash Abhinov Kasiviswanathan @ 2019-09-10 20:02 UTC (permalink / raw)
  To: dlezcano, eric.dumazet, davem, netdev
  Cc: Subash Abhinov Kasiviswanathan, Sean Tranchetti

In event of failure during register_netdevice, free_netdev is
invoked immediately. free_netdev assumes that all the netdevice
refcounts have been dropped prior to it being called and as a
result frees and clears out the refcount pointer.

However, this is not necessarily true as some of the operations
in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
invocation after a grace period. The IPv4 callback in_dev_rcu_put
tries to access the refcount after free_netdev is called which
leads to a null de-reference-

44837.761523:   <6> Unable to handle kernel paging request at
                    virtual address 0000004a88287000
44837.761651:   <2> pc : in_dev_finish_destroy+0x4c/0xc8
44837.761654:   <2> lr : in_dev_finish_destroy+0x2c/0xc8
44837.762393:   <2> Call trace:
44837.762398:   <2>  in_dev_finish_destroy+0x4c/0xc8
44837.762404:   <2>  in_dev_rcu_put+0x24/0x30
44837.762412:   <2>  rcu_nocb_kthread+0x43c/0x468
44837.762418:   <2>  kthread+0x118/0x128
44837.762424:   <2>  ret_from_fork+0x10/0x1c

Fix this by waiting for the completion of the call_rcu() in
case of register_netdevice errors.

Fixes: 93ee31f14f6f ("[NET]: Fix free_netdev on register_netdev failure.")
Cc: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 49589ed..c7463c9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8796,6 +8796,8 @@ int register_netdevice(struct net_device *dev)
 	ret = notifier_to_errno(ret);
 	if (ret) {
 		rollback_registered(dev);
+		rcu_barrier();
+
 		dev->reg_state = NETREG_UNREGISTERED;
 	}
 	/*
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] tcp: Add TCP_INFO counter for packets received out-of-order
From: Thomas Higdon @ 2019-09-10 19:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jonathan Lemon, Dave Jones
In-Reply-To: <CANn89iJ5wANqhpR28y5AYf6GTBgzTau+u0N0ogG690C71LbxaA@mail.gmail.com>

On Mon, Sep 09, 2019 at 05:01:46PM +0200, Eric Dumazet wrote:
> On Mon, Sep 9, 2019 at 4:30 PM Thomas Higdon <tph@fb.com> wrote:
> > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > index b3564f85a762..20237987ccc8 100644
> > --- a/include/uapi/linux/tcp.h
> > +++ b/include/uapi/linux/tcp.h
> > @@ -270,6 +270,8 @@ struct tcp_info {
> >         __u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
> >         __u32   tcpi_dsack_dups;     /* RFC4898 tcpEStatsStackDSACKDups */
> >         __u32   tcpi_reord_seen;     /* reordering events seen */
> > +
> > +       __u32   tcpi_rcv_ooopack;    /* Out-of-order packets received */
> 
> This is problematic : you create a 32bit hole in this structure that
> we will never be able to fill.
> 
> We need to add another metric here so that the whole 64bit space is used.

I don't have another metric to add currently. Perhaps I could first place
a '__u32 _reserved' member so that someone else may replace it with
a 32-bit member in the future. Unless there is a canonical way to do
this? I couldn't find any prior examples.

> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 706cbb3b2986..2774680c5d05 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -4555,6 +4555,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
> >         tp->pred_flags = 0;
> >         inet_csk_schedule_ack(sk);
> >
> > +       tp->rcv_ooopack++;
> 
> We count skbs or we count segments ?
> 
> (GRO might have aggregated multiple segments)

Let's count segments -- I will copy the technique of tcp_segs_in(), which
checks the maximum of 1 and gso_segs from the lower layer. Interestingly,
on my development machine, which uses the virtio-net driver, when LRO is
enabled, gso_segs is always zero, even when an aggregated segment is
passed up the stack. I guess this may be a problem with virtio-net? It
still seems that using gso_segs here is the correct solution.

^ permalink raw reply

* Re: [PATCH net-next 2/2] mlx5: fix type mismatch
From: Arnd Bergmann @ 2019-09-10 19:51 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: dledford@redhat.com, davem@davemloft.net, jgg@ziepe.ca,
	leon@kernel.org, linux-rdma@vger.kernel.org, Yishai Hadas,
	Mark Bloch, linux-kernel@vger.kernel.org, Erez Shitrit,
	netdev@vger.kernel.org, Alex Vesker, Ariel Levkovich,
	Nathan Chancellor
In-Reply-To: <8311cb643690d3e80dddd5d4f2f6a7d923b9fbbc.camel@mellanox.com>

On Tue, Sep 10, 2019 at 7:56 PM Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> On Mon, 2019-09-09 at 21:50 +0200, Arnd Bergmann wrote:
> > In mlx5, pointers to 'phys_addr_t' and 'u64' are mixed since the
> > addition
> > of the pool memory allocator, leading to incorrect behavior on 32-bit
> > architectures and this compiler warning:
> >
> > drivers/net/ethernet/mellanox/mlx5/core/steering/dr_icm_pool.c:121:8:
> > error: incompatible pointer types passing 'u64 *' (aka 'unsigned long
> > long *') to parameter of type 'phys_addr_t *' (aka 'unsigned int *')
> > [-Werror,-Wincompatible-pointer-types]
> >                                    &icm_mr->dm.addr, &icm_mr-
> > >dm.obj_id);
> >                                    ^~~~~~~~~~~~~~~~
> > include/linux/mlx5/driver.h:1092:39: note: passing argument to
> > parameter 'addr' here
> >                          u64 length, u16 uid, phys_addr_t *addr, u32
> > *obj_id);
> >
> > Change the code to use 'u64' consistently in place of 'phys_addr_t'
> > to
> > fix this. The alternative of using phys_addr_t more would require a
> > larger
> > rework.
> >
> > Fixes: 29cf8febd185 ("net/mlx5: DR, ICM pool memory allocator")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> Hi Arnd,
>
> Nathan Chancellor Already submitted a patch to fix this and it is more
> minimal:
> https://patchwork.ozlabs.org/patch/1158177/
>
> I would like to use that patch if it is ok with you..

Yes, please do. I think I had tried something like that
initially and concluded it wasn't quite right before I went
into a different direction with my patch.

Looking at the two versions now, I also prefer Nathan's,
and I just confirmed that it fixes all the randconfig failures
I ran into.

       Arnd

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox