Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v7 1/3] vmcore: add API to collect hardware dump in second kernel
From: Stephen Hemminger @ 2018-05-01 23:28 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, davem, viro, ebiederm,
	akpm, torvalds, ganeshgr, nirranjan, indranil
In-Reply-To: <b6065b53c5446d98ee55e09119f6821f979418d4.1525197408.git.rahul.lakkireddy@chelsio.com>

On Tue,  1 May 2018 23:57:45 +0530
Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> wrote:

> +
> +int vmcore_add_device_dump(struct vmcoredd_data *data)
> +{
> +	return __vmcore_add_device_dump(data);
> +}
> +EXPORT_SYMBOL(vmcore_add_device_dump);
> +#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP

Why the stub wrapper function? 
it does nothing.

^ permalink raw reply

* [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context
From: Song Liu @ 2018-05-02  0:02 UTC (permalink / raw)
  To: netdev
  Cc: Song Liu, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	Peter Zijlstra

Currently, we cannot parse build_id in nmi context because of
up_read(&current->mm->mmap_sem), this makes stackmap with build_id
less useful. This patch enables parsing build_id in nmi by putting
the up_read() call in irq_work. To avoid memory allocation in nmi
context, we use per cpu variable for the irq_work. As a result, only
one irq_work per cpu is allowed. If the irq_work is in-use, we
fallback to only report ips.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/stackmap.c | 60 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3ba102b..c33fec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -11,6 +11,7 @@
 #include <linux/perf_event.h>
 #include <linux/elf.h>
 #include <linux/pagemap.h>
+#include <linux/irq_work.h>
 #include "percpu_freelist.h"
 
 #define STACK_CREATE_FLAG_MASK					\
@@ -32,6 +33,23 @@ struct bpf_stack_map {
 	struct stack_map_bucket *buckets[];
 };
 
+/* irq_work to run up_read() for build_id lookup in nmi context */
+struct stack_map_irq_work {
+	struct irq_work work;
+	struct rw_semaphore *sem;
+};
+
+static void up_read_work(struct irq_work *entry)
+{
+	struct stack_map_irq_work *work = container_of(entry,
+			struct stack_map_irq_work, work);
+
+	up_read(work->sem);
+	work->sem = NULL;
+}
+
+DEFINE_PER_CPU(struct stack_map_irq_work, irq_work);
+
 static inline bool stack_map_use_build_id(struct bpf_map *map)
 {
 	return (map->map_flags & BPF_F_STACK_BUILD_ID);
@@ -267,17 +285,27 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 {
 	int i;
 	struct vm_area_struct *vma;
+	bool in_nmi_ctx = in_nmi();
+	bool irq_work_busy = false;
+	struct stack_map_irq_work *work;
+
+	if (in_nmi_ctx) {
+		work = this_cpu_ptr(&irq_work);
+		if (work->sem)
+			/* cannot queue more up_read, fallback */
+			irq_work_busy = true;
+	}
 
 	/*
-	 * We cannot do up_read() in nmi context, so build_id lookup is
-	 * only supported for non-nmi events. If at some point, it is
-	 * possible to run find_vma() without taking the semaphore, we
-	 * would like to allow build_id lookup in nmi context.
+	 * We cannot do up_read() in nmi context. To do build_id lookup
+	 * in nmi context, we need to run up_read() in irq_work. We use
+	 * a percpu variable to do the irq_work. If the irq_work is
+	 * already used by another lookup, we fall back to report ips.
 	 *
 	 * Same fallback is used for kernel stack (!user) on a stackmap
 	 * with build_id.
 	 */
-	if (!user || !current || !current->mm || in_nmi() ||
+	if (!user || !current || !current->mm || irq_work_busy ||
 	    down_read_trylock(&current->mm->mmap_sem) == 0) {
 		/* cannot access current->mm, fall back to ips */
 		for (i = 0; i < trace_nr; i++) {
@@ -299,7 +327,13 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 			- vma->vm_start;
 		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
 	}
-	up_read(&current->mm->mmap_sem);
+
+	if (!in_nmi_ctx)
+		up_read(&current->mm->mmap_sem);
+	else {
+		work->sem = &current->mm->mmap_sem;
+		irq_work_queue(&work->work);
+	}
 }
 
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
@@ -575,3 +609,17 @@ const struct bpf_map_ops stack_map_ops = {
 	.map_update_elem = stack_map_update_elem,
 	.map_delete_elem = stack_map_delete_elem,
 };
+
+static int __init stack_map_init(void)
+{
+	int cpu;
+	struct stack_map_irq_work *work;
+
+	for_each_possible_cpu(cpu) {
+		work = per_cpu_ptr(&irq_work, cpu);
+		init_irq_work(&work->work, up_read_work);
+	}
+	pr_info("%s: done\n", __func__);
+	return 0;
+}
+subsys_initcall(stack_map_init);
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 2/2] bpf: add selftest for stackmap with build_id in NMI context
From: Song Liu @ 2018-05-02  0:02 UTC (permalink / raw)
  To: netdev; +Cc: Song Liu, kernel-team
In-Reply-To: <20180502000220.2585320-1-songliubraving@fb.com>

This new test captures stackmap with build_id with hardware event
PERF_COUNT_HW_CPU_CYCLES.

Because we only support one ips-to-build_id lookup per cpu in NMI
context, stack_amap will not be able to do the lookup in this test.
Therefore, we didn't do compare_stack_ips(), as it will alwasy fail.

urandom_read.c is extended to run configurable cycles so that it can be
caught by the perf event.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 tools/testing/selftests/bpf/test_progs.c   | 137 +++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/urandom_read.c |  10 ++-
 2 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index aa336f0..00bb08c 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1272,6 +1272,142 @@ static void test_stacktrace_build_id(void)
 	return;
 }
 
+static void test_stacktrace_build_id_nmi(void)
+{
+	int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
+	const char *file = "./test_stacktrace_build_id.o";
+	int err, pmu_fd, prog_fd;
+	struct perf_event_attr attr = {
+		.sample_freq = 5000,
+		.freq = 1,
+		.type = PERF_TYPE_HARDWARE,
+		.config = PERF_COUNT_HW_CPU_CYCLES,
+	};
+	__u32 key, previous_key, val, duration = 0;
+	struct bpf_object *obj;
+	char buf[256];
+	int i, j;
+	struct bpf_stack_build_id id_offs[PERF_MAX_STACK_DEPTH];
+	int build_id_matches = 0;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_PERF_EVENT, &obj, &prog_fd);
+	if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
+		goto out;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
+			 0 /* cpu 0 */, -1 /* group id */,
+			 0 /* flags */);
+	if (CHECK(pmu_fd < 0, "perf_event_open",
+		  "err %d errno %d. Does the test host support PERF_COUNT_HW_CPU_CYCLES?\n",
+		  pmu_fd, errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n",
+		  err, errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n",
+		  err, errno))
+		goto disable_pmu;
+
+	/* find map fds */
+	control_map_fd = bpf_find_map(__func__, obj, "control_map");
+	if (CHECK(control_map_fd < 0, "bpf_find_map control_map",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
+	if (CHECK(stackid_hmap_fd < 0, "bpf_find_map stackid_hmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
+	if (CHECK(stackmap_fd < 0, "bpf_find_map stackmap", "err %d errno %d\n",
+		  err, errno))
+		goto disable_pmu;
+
+	stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+	if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
+	       == 0);
+	assert(system("taskset 0x1 ./urandom_read 100000") == 0);
+	/* disable stack trace collection */
+	key = 0;
+	val = 1;
+	bpf_map_update_elem(control_map_fd, &key, &val, 0);
+
+	/* for every element in stackid_hmap, we can find a corresponding one
+	 * in stackmap, and vise versa.
+	 */
+	err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
+	if (CHECK(err, "compare_map_keys stackid_hmap vs. stackmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = compare_map_keys(stackmap_fd, stackid_hmap_fd);
+	if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = extract_build_id(buf, 256);
+
+	if (CHECK(err, "get build_id with readelf",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = bpf_map_get_next_key(stackmap_fd, NULL, &key);
+	if (CHECK(err, "get_next_key from stackmap",
+		  "err %d, errno %d\n", err, errno))
+		goto disable_pmu;
+
+	do {
+		char build_id[64];
+
+		err = bpf_map_lookup_elem(stackmap_fd, &key, id_offs);
+		if (CHECK(err, "lookup_elem from stackmap",
+			  "err %d, errno %d\n", err, errno))
+			goto disable_pmu;
+		for (i = 0; i < PERF_MAX_STACK_DEPTH; ++i)
+			if (id_offs[i].status == BPF_STACK_BUILD_ID_VALID &&
+			    id_offs[i].offset != 0) {
+				for (j = 0; j < 20; ++j)
+					sprintf(build_id + 2 * j, "%02x",
+						id_offs[i].build_id[j] & 0xff);
+				if (strstr(buf, build_id) != NULL)
+					build_id_matches = 1;
+			}
+		previous_key = key;
+	} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
+
+	if (CHECK(build_id_matches < 1, "build id match",
+		  "Didn't find expected build ID from the map\n"))
+		goto disable_pmu;
+
+	/*
+	 * We intentionally skip compare_stack_ips(). This is because we
+	 * only support one in_nmi() ips-to-build_id translation per cpu
+	 * at any time, thus stack_amap here will always fallback to
+	 * BPF_STACK_BUILD_ID_IP;
+	 */
+
+disable_pmu:
+	ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);
+
+close_pmu:
+	close(pmu_fd);
+
+close_prog:
+	bpf_object__close(obj);
+
+out:
+	return;
+}
+
 #define MAX_CNT_RAWTP	10ull
 #define MAX_STACK_RAWTP	100
 struct get_stack_trace_t {
@@ -1425,6 +1561,7 @@ int main(void)
 	test_tp_attach_query();
 	test_stacktrace_map();
 	test_stacktrace_build_id();
+	test_stacktrace_build_id_nmi();
 	test_stacktrace_map_raw_tp();
 	test_get_stack_raw_tp();
 
diff --git a/tools/testing/selftests/bpf/urandom_read.c b/tools/testing/selftests/bpf/urandom_read.c
index 4acfdeb..9de8b7c 100644
--- a/tools/testing/selftests/bpf/urandom_read.c
+++ b/tools/testing/selftests/bpf/urandom_read.c
@@ -6,15 +6,21 @@
 #include <stdlib.h>
 
 #define BUF_SIZE 256
-int main(void)
+
+int main(int argc, char *argv[])
 {
 	int fd = open("/dev/urandom", O_RDONLY);
 	int i;
 	char buf[BUF_SIZE];
+	int count = 4;
 
 	if (fd < 0)
 		return 1;
-	for (i = 0; i < 4; ++i)
+
+	if (argc == 2)
+		count = atoi(argv[1]);
+
+	for (i = 0; i < count; ++i)
 		read(fd, buf, BUF_SIZE);
 
 	close(fd);
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-05-02  0:20 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown
In-Reply-To: <20180430072034.GH23854@nanopsycho.orion>

On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>
>>> Now I try to change mac of the failover master:
>>> [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>>> RTNETLINK answers: Operation not supported
>>>
>>> That I did expect to work. I would expect this would change the mac of
>>> the master and both standby and primary slaves.
>> If a VF is untrusted, a VM will not able to change its MAC and moreover
> Note that at this point, I have no VF. So I'm not sure why you mention
> that.
>
>
>> in this mode we are assuming that the hypervisor has assigned the MAC and
>> guest is not expected to change the MAC.
> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
> mac and all works fine. How is this different? Change mac on "failover
> instance" should work and should propagate the mac down to its slaves.
>
>
>> For the initial implementation, i would propose not allowing the guest to
>> change the MAC of failover or standby dev.
> I see no reason for such restriction.
>

It is true that a VM user can change mac address of a normal virtio-net interface,
however when it is in STANDBY mode i think we should not allow this change specifically
because we are creating a failover instance based on a MAC that is assigned by the
hypervisor.

Moreover,  in a cloud environment i would think that PF/hypervisor assigns a MAC to
the VF and it cannot be changed by the guest.

So for the initial implementation, do you see any issues with having this restriction
in STANDBY mode.

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Andrew Morton @ 2018-05-02  0:36 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	Michal Hocko, linux-mm, edumazet, virtualization, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804241229410.23702@file01.intranet.prod.int.rdu2.redhat.com>

On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > > 
> > > > > Fixing __vmalloc code 
> > > > > is easy and it doesn't require cooperation with maintainers.
> > > > 
> > > > But it is a hack against the intention of the scope api.
> > > 
> > > It is not!
> > 
> > This discussion simply doesn't make much sense it seems. The scope API
> > is to document the scope of the reclaim recursion critical section. That
> > certainly is not a utility function like vmalloc.
> 
> That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
> developer) from converting the code to the scope API. You make nonsensical 
> excuses.
> 

Fun thread!

Winding back to the original problem, I'd state it as

- Caller uses kvmalloc() but passes the address into vmalloc-naive
  DMA API and

- Caller uses kvmalloc() but passes the address into kfree()

Yes?

If so, then...

Is there a way in which, in the kvmalloc-called-kmalloc path, we can
tag the slab-allocated memory with a "this memory was allocated with
kvmalloc()" flag?  I *think* there's extra per-object storage available
with suitable slab/slub debugging options?  Perhaps we could steal one
bit from the redzone, dunno.

If so then we can

a) set that flag in kvmalloc() if the kmalloc() call succeeded

b) check for that flag in the DMA code, WARN if it is set.

c) in kvfree(), clear that flag before calling kfree()

d) in kfree(), check for that flag and go WARN() if set.

So both potential bugs are detected all the time, dependent upon
CONFIG_SLUB_DEBUG (and perhaps other slub config options).

^ permalink raw reply

* [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats
From: Zumeng Chen @ 2018-05-02  0:42 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: michael.chan, siva.kallam, prashant, davem, Zumeng Chen,
	Zumeng Chen

From: Zumeng Chen <zumeng.chen@windriver.com>

Reading hw_stats will get the actual data after a sucessfull tg3_reset_hw,
which actually after tg3_timer_start, so TG3_FLAG_HALT is introduced to
tell tg3_get_stats64 when hw_stats is ready to read. It will be set after
having done memset(tp->hw_stats, 0) in tg3_halt and be clear when hw_init
done. Both tg3_get_stats64 and tg3_halt are protected by tp->lock in all
scope.

Meanwhile, this patch is also to fix a kernel BUG_ON(in_interrupt) crash
when tg3_free_consistent is stuck in tp->lock, which might get a lot of
in_softirq counts(512 or so), and BUG_ON when vunmap to unmap hw->stats.

------------[ cut here ]------------
kernel BUG at /kernel-source//mm/vmalloc.c:1621!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
task: ffffffc874310000 task.stack: ffffffc8742bc000
PC is at vunmap+0x48/0x50
LR is at __dma_free+0x98/0xa0
pc : [<ffffff80081eb420>] lr : [<ffffff8008097fb8>] pstate: 00000145
sp : ffffffc8742bfad0
x29: ffffffc8742bfad0 x28: ffffffc874310000
x27: ffffffc878931200 x26: ffffffc874530000
x25: 0000000000000003 x24: ffffff800b3aa000
x23: 00000000700bb000 x22: 0000000000000000
x21: 0000000000000000 x20: ffffffc87aafd0a0
x19: ffffff800b3aa000 x18: 0000000000000020
x17: 0000007f9e191e10 x16: ffffff8008eb0d28
x15: 000000000000000a x14: 0000000000070cc8
x13: ffffff8008c65000 x12: 00000000ffffffff
x11: 000000000000000a x10: ffffffbf21d0e220
x9 : 0000000000000004 x8 : ffffff8008c65000
x7 : 0000000000003ff0 x6 : 0000000000000000
x5 : ffffff8008097f20 x4 : 0000000000000000
x3 : ffffff8008fd4fff x2 : ffffffc87b361788
x1 : ffffff800b3aafff x0 : 0000000000000201
Process connmand (pid: 785, stack limit = 0xffffffc8742bc000)
Stack: (0xffffffc8742bfad0 to 0xffffffc8742c0000)
fac0:                                   ffffffc8742bfaf0 ffffff8008097fb8
fae0: 0000000000001000 ffffff80ffffffff ffffffc8742bfb30 ffffff8000b717d4
fb00: ffffffc87aafd0a0 ffffff8008a38000 ffffff800b3aa000 ffffffc874530904
fb20: ffffffc874530900 00000000700bb000 ffffffc8742bfb80 ffffff8000b80324
fb40: 0000000000000001 ffffffc874530900 0000000000000100 0000000000000200
fb60: 0000000000009003 ffffffc874530000 0000000000000003 ffffffc874530000
fb80: ffffffc8742bfbd0 ffffff8000b8aa5c ffffffc874530900 ffffffc874530000
fba0: 00000000ffff0001 0000000000000000 0000000000009003 ffffffc878931210
fbc0: 0000000000009002 ffffffc874530000 ffffffc8742bfc00 ffffff80088bf44c
fbe0: ffffffc874530000 ffffffc8742bfc50 00000000ffff0001 ffffffc874310000
fc00: ffffffc8742bfc30 ffffff80088bf5e4 ffffffc874530000 00000000ffff9002
fc20: ffffffc8742bfc40 ffffffc874530000 ffffffc8742bfc60 ffffff80088c9d58
fc40: ffffffc874530000 ffffff80088c9d34 ffffffc874530080 ffffffc874530080
fc60: ffffffc8742bfca0 ffffff80088c9e4c ffffffc874530000 0000000000009003
fc80: 0000000000008914 0000000000000000 0000007ffd94ba10 ffffffc8742bfd38
fca0: ffffffc8742bfcd0 ffffff80089509f8 0000000000000000 00000000ffffff9d
fcc0: 0000000000008914 0000000000000000 ffffffc8742bfd60 ffffff8008953088
fce0: 0000000000008914 ffffffc874b49b80 0000007ffd94ba10 ffffff8008e9b400
fd00: 0000000000000004 0000007ffd94ba10 0000000000000124 000000000000001d
fd20: ffffff8008a32000 ffffff8008e9b400 0000000000000004 0000000034687465
fd40: 0000000000000000 0000000000009002 0000000000000000 0000000000000000
fd60: ffffffc8742bfd90 ffffff80088a1720 ffffffc874b49b80 0000000000008914
fd80: 0000007ffd94ba10 0000000000000000 ffffffc8742bfdc0 ffffff80088a2648
fda0: 0000000000008914 0000007ffd94ba10 ffffff8008e9b400 ffffffc878a73c00
fdc0: ffffffc8742bfe00 ffffff800822e9e0 0000000000008914 0000007ffd94ba10
fde0: ffffffc874b49bb0 ffffffc8747e5800 ffffffc8742bfe50 ffffff800823cd58
fe00: ffffffc8742bfe80 ffffff800822f0ec 0000000000000000 ffffffc878a73c00
fe20: ffffffc878a73c00 0000000000000004 0000000000008914 0000000000080000
fe40: ffffffc8742bfe80 ffffff800822f0b0 0000000000000000 ffffffc878a73c00
fe60: ffffffc878a73c00 0000000000000004 0000000000008914 ffffff8008083730
fe80: 0000000000000000 ffffff8008083730 0000000000000000 00000048771fb000
fea0: ffffffffffffffff 0000007f9e191e1c 0000000000000000 0000000000000015
fec0: 0000000000000004 0000000000008914 0000007ffd94ba10 0000000000000000
fee0: 000000000000002f 0000000000000004 0000000000000010 0000000000000000
ff00: 000000000000001d 0000000fffffffff 0101010101010101 0000000000000000
ff20: 6532336338646634 00656c6261635f38 0000007f9e46a220 0000007f9e45f318
ff40: 00000000004c1a58 0000007f9e191e10 00000000000006df 0000000000000000
ff60: 0000000000000004 00000000004c6470 00000000004c3c40 0000000000512d20
ff80: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
ffa0: 0000000000000000 0000007ffd94b9f0 0000000000463dec 0000007ffd94b9f0
ffc0: 0000007f9e191e1c 0000000000000000 0000000000000004 000000000000001d
ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call trace:
Exception stack(0xffffffc8742bf900 to 0xffffffc8742bfa30)
f900: ffffff800b3aa000 0000008000000000 ffffffc8742bfad0 ffffff80081eb420
f920: ffffff8000000000 ffffff80081a58ec ffffffc8742bf940 ffffff80081c3ea8
f940: ffffffc8742bf990 ffffff80081a591c ffffffc8742bf970 ffffff8008a1d89c
f960: ffffff8008eb1780 ffffff8008eb1780 ffffffc8742bf990 ffffff80081a59dc
f980: ffffffbf21c4ae00 ffffffbf00000000 ffffffc8742bfa20 ffffff80081a5db4
f9a0: 0000000000000201 ffffff800b3aafff ffffffc87b361788 ffffff8008fd4fff
f9c0: 0000000000000000 ffffff8008097f20 0000000000000000 0000000000003ff0
f9e0: ffffff8008c65000 0000000000000004 ffffffbf21d0e220 000000000000000a
fa00: 00000000ffffffff ffffff8008c65000 0000000000070cc8 000000000000000a
fa20: ffffff8008eb0d28 0000007f9e191e10
[<ffffff80081eb420>] vunmap+0x48/0x50
[<ffffff8008097fb8>] __dma_free+0x98/0xa0
[<ffffff8000b717d4>] tg3_free_consistent+0x14c/0x190 [tg3]
[<ffffff8000b80324>] tg3_stop+0x204/0x238 [tg3]
[<ffffff8000b8aa5c>] tg3_close+0x34/0x98 [tg3]
[<ffffff80088bf44c>] __dev_close_many+0x94/0xe8
[<ffffff80088bf5e4>] __dev_close+0x34/0x50
[<ffffff80088c9d58>] __dev_change_flags+0xa0/0x160
[<ffffff80088c9e4c>] dev_change_flags+0x34/0x70
[<ffffff80089509f8>] devinet_ioctl+0x740/0x808
[<ffffff8008953088>] inet_ioctl+0x140/0x158
[<ffffff80088a1720>] sock_do_ioctl+0x40/0x88
[<ffffff80088a2648>] sock_ioctl+0x238/0x368
[<ffffff800822e9e0>] do_vfs_ioctl+0xb0/0x730
[<ffffff800822f0ec>] SyS_ioctl+0x8c/0xa8
[<ffffff8008083730>] el0_svc_naked+0x24/0x28
Code: f9400bf3 a8c27bfd d65f03c0 d503201f (d4210000)
---[ end trace e214990b7cc445ce ]---
Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Zumeng Chen <zumeng.chen@gmail.com>
---

V2 changes:

As Michael mentioned, use TG3_FLAG_HALT to flag the interface down.
Sanity tests include building and runtime, passed.

Cheers,
Zumeng 

 drivers/net/ethernet/broadcom/tg3.c | 13 ++++++++-----
 drivers/net/ethernet/broadcom/tg3.h |  1 +
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 537d571..690cdc6 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -8723,14 +8723,11 @@ static void tg3_free_consistent(struct tg3 *tp)
 	tg3_mem_rx_release(tp);
 	tg3_mem_tx_release(tp);
 
-	/* Protect tg3_get_stats64() from reading freed tp->hw_stats. */
-	tg3_full_lock(tp, 0);
 	if (tp->hw_stats) {
 		dma_free_coherent(&tp->pdev->dev, sizeof(struct tg3_hw_stats),
 				  tp->hw_stats, tp->stats_mapping);
 		tp->hw_stats = NULL;
 	}
-	tg3_full_unlock(tp);
 }
 
 /*
@@ -9334,6 +9331,7 @@ static int tg3_halt(struct tg3 *tp, int kind, bool silent)
 
 		/* And make sure the next sample is new data */
 		memset(tp->hw_stats, 0, sizeof(struct tg3_hw_stats));
+		tg3_flag_set(tp, HALT);
 	}
 
 	return err;
@@ -10732,6 +10730,7 @@ static int tg3_reset_hw(struct tg3 *tp, bool reset_phy)
  */
 static int tg3_init_hw(struct tg3 *tp, bool reset_phy)
 {
+	int retval;
 	/* Chip may have been just powered on. If so, the boot code may still
 	 * be running initialization. Wait for it to finish to avoid races in
 	 * accessing the hardware.
@@ -10743,7 +10742,11 @@ static int tg3_init_hw(struct tg3 *tp, bool reset_phy)
 
 	tw32(TG3PCI_MEM_WIN_BASE_ADDR, 0);
 
-	return tg3_reset_hw(tp, reset_phy);
+	retval = tg3_reset_hw(tp, reset_phy);
+	if (retval == 0)
+		tg3_flag_clear(tp, HALT);
+
+	return retval;
 }
 
 #ifdef CONFIG_TIGON3_HWMON
@@ -14155,7 +14158,7 @@ static void tg3_get_stats64(struct net_device *dev,
 	struct tg3 *tp = netdev_priv(dev);
 
 	spin_lock_bh(&tp->lock);
-	if (!tp->hw_stats) {
+	if (!tp->hw_stats || tg3_flag(tp, HALT)) {
 		*stats = tp->net_stats_prev;
 		spin_unlock_bh(&tp->lock);
 		return;
diff --git a/drivers/net/ethernet/broadcom/tg3.h b/drivers/net/ethernet/broadcom/tg3.h
index 3b5e98e..c61d83c 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
 	TG3_FLAG_ROBOSWITCH,
 	TG3_FLAG_ONE_DMA_AT_ONCE,
 	TG3_FLAG_RGMII_MODE,
+	TG3_FLAG_HALT,
 
 	/* Add new flags before this comment and TG3_FLAG_NUMBER_OF_FLAGS */
 	TG3_FLAG_NUMBER_OF_FLAGS,	/* Last entry in enum TG3_FLAGS */
-- 
2.9.3

^ permalink raw reply related

* [PATCH net] tcp_bbr: fix to zero idle_restart only upon S/ACKed data
From: Neal Cardwell @ 2018-05-02  1:45 UTC (permalink / raw)
  To: davem, netdev, ncardwell, ycheng, soheil, priyarjha, ysseung
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh,
	Priyaranjan Jha, Yousuk Seung

Previously the bbr->idle_restart tracking was zeroing out the
bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
e.g. receiving incoming data or receiver window updates. In such
situations BBR would forget that this was a restart-from-idle
situation, and if the min_rtt had expired it would unnecessarily enter
PROBE_RTT (even though we were actually restarting from idle but had
merely forgotten that fact).

The fix is simple: we need to remember we are restarting from idle
until we receive a S/ACK for some data (a S/ACK for the first flight
of data we send as we are restarting).

This commit is a stable candidate for kernels back as far as 4.9.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yousuk Seung <ysseung@google.com>
---
 net/ipv4/tcp_bbr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 158d105e76da1..58e2f479ffb4d 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -806,7 +806,9 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 			}
 		}
 	}
-	bbr->idle_restart = 0;
+	/* Restart after idle ends only once we process a new S/ACK for data */
+	if (rs->delivered > 0)
+		bbr->idle_restart = 0;
 }
 
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2018-05-02  1:52 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List, Thomas Winter,
	David Ahern

[-- Attachment #1: Type: text/plain, Size: 1445 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  include/net/ip6_route.h

between commit:

  edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6")

from the net tree and commit:

  93c2fb253d17 ("net/ipv6: Rename fib6_info struct elements")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc include/net/ip6_route.h
index abceb5864d99,8df4ff798b04..000000000000
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@@ -66,9 -66,10 +66,9 @@@ static inline bool rt6_need_strict(cons
  		(IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | IPV6_ADDR_LOOPBACK);
  }
  
- static inline bool rt6_qualify_for_ecmp(const struct rt6_info *rt)
+ static inline bool rt6_qualify_for_ecmp(const struct fib6_info *f6i)
  {
- 	return (rt->rt6i_flags & (RTF_ADDRCONF | RTF_DYNAMIC)) == 0;
 -	return (f6i->fib6_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) ==
 -	       RTF_GATEWAY;
++	return (f6i->fib6_flags & (RTF_ADDRCONF | RTF_DYNAMIC)) == 0;
  }
  
  void ip6_route_input(struct sk_buff *skb);

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH] net/xfrm: Fix lookups for states with spi == 0
From: Dmitry Safonov @ 2018-05-02  2:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: 0x7f454c46, Dmitry Safonov, Steffen Klassert, Herbert Xu,
	David S. Miller, netdev

It seems to be a valid use case to add xfrm state without
Security Parameter Indexes (SPI) value associated:
ip xfrm state add src $src dst $dst proto $proto mode $mode sel src $src dst $dst $algo

The bad thing is that it's currently impossible to get/delete the state
without SPI: __xfrm_state_insert() obviously doesn't add hash for zero
SPI in xfrm.state_byspi, and xfrm_user_state_lookup() will fail as
xfrm_state_lookup() does lookups by hash.

It also isn't possible to workaround from userspace as
xfrm_id_proto_match() will be always true for ah/esp/comp protos.

So, don't try looking up by hash if SPI == 0.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 net/xfrm/xfrm_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 080035f056d9..6b38503255c8 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -681,7 +681,7 @@ static struct xfrm_state *xfrm_user_state_lookup(struct net *net,
 	int err;
 	u32 mark = xfrm_mark_get(attrs, &m);
 
-	if (xfrm_id_proto_match(p->proto, IPSEC_PROTO_ANY)) {
+	if (xfrm_id_proto_match(p->proto, IPSEC_PROTO_ANY) && p->spi) {
 		err = -ESRCH;
 		x = xfrm_state_lookup(net, mark, &p->daddr, p->spi, p->proto, p->family);
 	} else {
-- 
2.13.6

^ permalink raw reply related

* Re: Suggestions on iterating eBPF maps
From: Lorenzo Colitti @ 2018-05-02  2:05 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Chenbo Feng, netdev, Daniel Borkmann, Joel Fernandes
In-Reply-To: <20180428010439.qryq3ejdyyhtb25u@ast-mbp>

On Sat, Apr 28, 2018 at 10:04 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> Another approach could be to use map-in-map and have almost atomic
> replace of the whole map with new potentially empty map. The prog
> can continue using the new map, while user space walks no longer
> accessed old map.

That sounds like a promising approach. I assume this would be
functionally equivalent to an approach where there is a map containing
a boolean that says whether to write to map A or map B? We'd then do
the following:

0. Kernel program is writing to map A.
1. Userspace pushes config that says to write to map B.
2. Kernel program starts to write to map B.
3. Userspace scans map A, collecting stats and deleting everything it finds.

One problem with this is: if the effects of #1 are not immediately
visible to the programs running on all cores, the program could still
be writing to map A and the deletes in #3 would result in loss of
data. Are there any guarantees around this? I know that hash map
writes are atomic, but I'm not aware of any other guarantees here. Are
there memory barriers around map writes and reads?

In the absence of guarantees, userspace could put a sleep between #1
and #3 and things would be correct Most Of The Time(TM), but if the
kernel is busy doing other things that might not be sufficient.
Thoughts?

^ permalink raw reply

* [PATCH V2 net-next 3/6] sctp: Build sctp offload support into the base kernel
From: Vladislav Yasevich @ 2018-05-02  2:07 UTC (permalink / raw)
  To: netdev
  Cc: virtio-dev, marcelo.leitner, nhorman, mst, virtualization,
	linux-sctp
In-Reply-To: <20180502020739.19239-1-vyasevic@redhat.com>

With the SCTP checksum offload support added to virtio, it is
now possible to get into a situation where SCTP not present
in the kernel, but the feature is negotiated.  Handle this just
like we do IPv6 and other modular offloads.
Move the sctp offload out of the sctp module and into the base
kernel.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
 include/net/sctp/sctp.h | 5 -----
 net/Kconfig             | 1 +
 net/sctp/Kconfig        | 1 -
 net/sctp/Makefile       | 3 ++-
 net/sctp/offload.c      | 4 +++-
 net/sctp/protocol.c     | 3 ---
 6 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 72c5b8f..625b45f 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -183,11 +183,6 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
 int __net_init sctp_proc_init(struct net *net);
 
 /*
- * sctp/offload.c
- */
-int sctp_offload_init(void);
-
-/*
  * sctp/stream_sched.c
  */
 void sctp_sched_ops_init(void);
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12..2773f98 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -64,6 +64,7 @@ config INET
 	bool "TCP/IP networking"
 	select CRYPTO
 	select CRYPTO_AES
+	select LIBCRC32C
 	---help---
 	  These are the protocols used on the Internet and on most local
 	  Ethernets. It is highly recommended to say Y here (this will enlarge
diff --git a/net/sctp/Kconfig b/net/sctp/Kconfig
index c740b18..d07477a 100644
--- a/net/sctp/Kconfig
+++ b/net/sctp/Kconfig
@@ -9,7 +9,6 @@ menuconfig IP_SCTP
 	select CRYPTO
 	select CRYPTO_HMAC
 	select CRYPTO_SHA1
-	select LIBCRC32C
 	---help---
 	  Stream Control Transmission Protocol
 
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index e845e45..ee206ca 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -5,6 +5,7 @@
 
 obj-$(CONFIG_IP_SCTP) += sctp.o
 obj-$(CONFIG_INET_SCTP_DIAG) += sctp_diag.o
+obj-$(CONFIG_INET) += offload.o
 
 sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  protocol.o endpointola.o associola.o \
@@ -12,7 +13,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
 	  output.o input.o debug.o stream.o auth.o \
-	  offload.o stream_sched.o stream_sched_prio.o \
+	  stream_sched.o stream_sched_prio.o \
 	  stream_sched_rr.o stream_interleave.o
 
 sctp_diag-y := diag.o
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 123e9f2..c61cbde 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -107,7 +107,7 @@ static const struct skb_checksum_ops crc32c_csum_ops = {
 	.combine = sctp_csum_combine,
 };
 
-int __init sctp_offload_init(void)
+static int __init sctp_offload_init(void)
 {
 	int ret;
 
@@ -127,3 +127,5 @@ int __init sctp_offload_init(void)
 out:
 	return ret;
 }
+
+fs_initcall(sctp_offload_init);
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index a24cde2..46d2b63 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1479,9 +1479,6 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_add_protocol;
 
-	if (sctp_offload_init() < 0)
-		pr_crit("%s: Cannot add SCTP protocol offload\n", __func__);
-
 out:
 	return status;
 err_v6_add_protocol:
-- 
2.9.5

^ permalink raw reply related

* [PATCH V2 net-next 0/6] virtio-net:  Add SCTP checksum offload support
From: Vladislav Yasevich @ 2018-05-02  2:07 UTC (permalink / raw)
  To: netdev
  Cc: linux-sctp, virtualization, virtio-dev, mst, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich

Now that we have SCTP offload capabilities in the kernel, we can add
them to virtio as well.  First step is SCTP checksum.

We need a new freature in virtio to negotiate this support since
SCTP is excluded with the stardard checksum and requires a little
bit extra.  This series proposes VIRTIO_NET_F_SCTP_CSUM feature bit.

As the "little bit extra",  the kernel uses a new bit in the skb
(skb->csum_not_inet) to determine whether to use standard inet checksum
or the SCTP CRC32c checksum.  This bit has to be communicated between
the host and the guest.  This bit is carried in the vnet header.

Tap and macvtap support is added through an extra feature for the
TUNSETOFFLOAD ioctl.  Additionally macvtap will no correctly
do sctp checksumming if the receive doesn't support SCTP offload.
This also turns on sctp offloading for macvlan devices.

As for the perf numbers, I am seeing about a 5% increase in vm-to-vm
and vm-to-hos throughput which is the same as manually disabling
sctp checksumming,since this is exactly what we are emulatting.
Sending outside the host,  the increase about 2.5-3%.

Since v1:
  - Fixed some old patch bugs that snuck in.
  - virtio feature bits moved to high bits
  - distinguish between GUEST and HOST features
  - Fixed offload control command to control sctp checksum offload
  - added ipvlan support

Vladislav Yasevich (6):
  virtio: Add support for SCTP checksum offloading
  sctp: Handle sctp packets with CHECKSUM_PARTIAL
  sctp: Build sctp offload support into the base kernel
  tun: Add support for SCTP checksum offload
  macvlan/macvtap: Add support for SCTP checksum offload.
  ipvlan: Add support for SCTP checksum offload

 drivers/net/ipvlan/ipvlan_main.c |  3 ++-
 drivers/net/macvlan.c            |  5 +++--
 drivers/net/tap.c                |  8 +++++---
 drivers/net/tun.c                |  7 ++++++-
 drivers/net/virtio_net.c         | 14 +++++++++++++-
 include/linux/virtio_net.h       |  6 ++++++
 include/net/sctp/sctp.h          |  5 -----
 include/uapi/linux/if_tun.h      |  1 +
 include/uapi/linux/virtio_net.h  |  5 +++++
 net/Kconfig                      |  1 +
 net/sctp/Kconfig                 |  1 -
 net/sctp/Makefile                |  3 ++-
 net/sctp/input.c                 | 11 ++++++++++-
 net/sctp/offload.c               |  4 +++-
 net/sctp/protocol.c              |  3 ---
 15 files changed, 57 insertions(+), 20 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH V2 net-next 1/6] virtio: Add support for SCTP checksum offloading
From: Vladislav Yasevich @ 2018-05-02  2:07 UTC (permalink / raw)
  To: netdev
  Cc: linux-sctp, virtualization, virtio-dev, mst, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich
In-Reply-To: <20180502020739.19239-1-vyasevic@redhat.com>

To support SCTP checksum offloading, we need to add a new feature
to virtio_net, so we can negotiate support between the hypervisor
and the guest.
The HOST feature bit signifies offloading support for transmit and
enables device offload features.
The GUEST feature bit signifies offloading support of recieve and
is currently only used by the driver in case of xdp.  

That patch also adds an addition virtio_net header flag which
mirrors the skb->csum_not_inet flag.  This flags is used to indicate
that is this an SCTP packet that needs its checksum computed by the
lower layer.  In this case, the lower layer is the host hypervisor or
possibly HW nic that supporst CRC32c offload.

In the case that GUEST feature bit is flag, it will be possible to
receive a virtio_net header with this bit set, which will set the
corresponding skb bit.  SCTP protocol will be updated to correctly
deal with it.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
 drivers/net/virtio_net.c        | 14 +++++++++++++-
 include/linux/virtio_net.h      |  6 ++++++
 include/uapi/linux/virtio_net.h |  5 +++++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec..34af280 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2148,6 +2148,8 @@ static int virtnet_clear_guest_offloads(struct virtnet_info *vi)
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))
 		offloads = 1ULL << VIRTIO_NET_F_GUEST_CSUM;
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_SCTP_CSUM))
+		offloads |= 1ULL << VIRTIO_NET_F_GUEST_SCTP_CSUM;
 
 	return virtnet_set_guest_offloads(vi, offloads);
 }
@@ -2160,6 +2162,8 @@ static int virtnet_restore_guest_offloads(struct virtnet_info *vi)
 		return 0;
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))
 		offloads |= 1ULL << VIRTIO_NET_F_GUEST_CSUM;
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_SCTP_CSUM))
+		offloads |= 1ULL << VIRTIO_NET_F_GUEST_SCTP_CSUM;
 
 	return virtnet_set_guest_offloads(vi, offloads);
 }
@@ -2724,6 +2728,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	/* Do we support "hardware" checksums? */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
 		/* This opens up the world of extra features. */
+
 		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
 		if (csum)
 			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
@@ -2746,9 +2751,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 			dev->features |= dev->hw_features & NETIF_F_ALL_TSO;
 		/* (!csum && gso) case will be fixed by register_netdev() */
 	}
+
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
 		dev->features |= NETIF_F_RXCSUM;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_SCTP_CSUM)) {
+		dev->hw_features |= NETIF_F_SCTP_CRC;
+		dev->features |= NETIF_F_SCTP_CRC;
+	}
+
 	dev->vlan_features = dev->features;
 
 	/* MTU range: 68 - 65535 */
@@ -2962,7 +2973,8 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, \
+	VIRTIO_NET_F_HOST_SCTP_CSUM, VIRTIO_NET_F_GUEST_SCTP_CSUM
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index f144216..28fffdc 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 
 		if (!skb_partial_csum_set(skb, start, off))
 			return -EINVAL;
+
+		if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
+			skb->csum_not_inet = 1;
 	}
 
 	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
@@ -91,6 +94,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 				skb_checksum_start_offset(skb));
 		hdr->csum_offset = __cpu_to_virtio16(little_endian,
 				skb->csum_offset);
+
+		if (skb->csum_not_inet)
+			hdr->flags |= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
 	} else if (has_data_valid &&
 		   skb->ip_summed == CHECKSUM_UNNECESSARY) {
 		hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed3..9dfca1a 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,10 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_GUEST_SCTP_CSUM  61 /* Guest handles SCTP pks w/ partial
+					  * csum */
+#define VIRTIO_NET_F_HOST_SCTP_CSUM   62 /* HOST handles SCTP pkts w/ partial
+					  * csum */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
@@ -101,6 +105,7 @@ struct virtio_net_config {
 struct virtio_net_hdr_v1 {
 #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
 #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
+#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4       /* Checksum is not inet */
 	__u8 flags;
 #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
 #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
-- 
2.9.5

^ permalink raw reply related

* [PATCH V2 net-next 2/6] sctp: Handle sctp packets with CHECKSUM_PARTIAL
From: Vladislav Yasevich @ 2018-05-02  2:07 UTC (permalink / raw)
  To: netdev
  Cc: linux-sctp, virtualization, virtio-dev, mst, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich
In-Reply-To: <20180502020739.19239-1-vyasevic@redhat.com>

With SCTP checksum offload available in virtio, it is now
possible for virtio to receive a sctp packet with CHECKSUM_PARTIAL
set (guest-to-guest traffic).  SCTP doesn't really have a
partial checksum like TCP does, because CRC32c can't do partial
additive checksumming.  It's all or nothing.  So an SCTP packet
with CHECKSUM_PARTIAL will have checksum set to 0.  Let SCTP
treat this as a valid checksum if CHECKSUM_PARTIAL is set.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
 net/sctp/input.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6..055b8ffa 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -80,8 +80,17 @@ static inline int sctp_rcv_checksum(struct net *net, struct sk_buff *skb)
 {
 	struct sctphdr *sh = sctp_hdr(skb);
 	__le32 cmp = sh->checksum;
-	__le32 val = sctp_compute_cksum(skb, 0);
+	__le32 val = 0;
 
+	/* In sctp PARTIAL checksum is always 0.  This is a case of
+	 * a packet received from guest that supports checksum offload.
+	 * Assume it's correct as there is really no way to verify,
+	 * and we want to avaoid computing it unnecesarily.
+	 */
+	if (skb->ip_summed == CHECKSUM_PARTIAL)
+		return 0;
+
+	val = sctp_compute_cksum(skb, 0);
 	if (val != cmp) {
 		/* CRC failure, dump it. */
 		__SCTP_INC_STATS(net, SCTP_MIB_CHECKSUMERRORS);
-- 
2.9.5

^ permalink raw reply related

* [PATCH V2 net-next 4/6] tun: Add support for SCTP checksum offload
From: Vladislav Yasevich @ 2018-05-02  2:07 UTC (permalink / raw)
  To: netdev
  Cc: linux-sctp, virtualization, virtio-dev, mst, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich
In-Reply-To: <20180502020739.19239-1-vyasevic@redhat.com>

Adds a new tun offload flag to allow for SCTP checksum offload.
The flag has to be set by the user and defaults to "no offload".
Add SCTP checksum support to the supported tun features.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
 drivers/net/tun.c           | 7 ++++++-
 include/uapi/linux/if_tun.h | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1ba262..4f314a6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -216,7 +216,7 @@ struct tun_struct {
 	struct net_device	*dev;
 	netdev_features_t	set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
-			  NETIF_F_TSO6)
+			  NETIF_F_TSO6|NETIF_F_SCTP_CRC)
 
 	int			align;
 	int			vnet_hdr_sz;
@@ -2719,6 +2719,11 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
 		arg &= ~TUN_F_UFO;
 	}
 
+	if (arg & TUN_F_SCTP_CSUM) {
+		features |= NETIF_F_SCTP_CRC;
+		arg &= ~TUN_F_SCTP_CSUM;
+	}
+
 	/* This gives the user a way to test for new features in future by
 	 * trying to set them. */
 	if (arg)
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index ee432cd..061aea8 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -86,6 +86,7 @@
 #define TUN_F_TSO6	0x04	/* I can handle TSO for IPv6 packets */
 #define TUN_F_TSO_ECN	0x08	/* I can handle TSO with ECN bits. */
 #define TUN_F_UFO	0x10	/* I can handle UFO packets */
+#define TUN_F_SCTP_CSUM 0x20	/* I can handle SCTP checksums offloads */
 
 /* Protocol info prepended to the packets (when IFF_NO_PI is not set) */
 #define TUN_PKT_STRIP	0x0001
-- 
2.9.5

^ permalink raw reply related

* [PATCH V2 net-next 5/6] macvlan/macvtap: Add support for SCTP checksum offload.
From: Vladislav Yasevich @ 2018-05-02  2:07 UTC (permalink / raw)
  To: netdev
  Cc: linux-sctp, virtualization, virtio-dev, mst, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich
In-Reply-To: <20180502020739.19239-1-vyasevic@redhat.com>

Since we now have support for software CRC32c offload, turn it on
for macvlan and macvtap devices so that guests can take advantage
of offload SCTP checksums to the host or host hardware.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
 drivers/net/macvlan.c | 5 +++--
 drivers/net/tap.c     | 8 +++++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 725f4b4..646b730 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -834,7 +834,7 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
 
 #define ALWAYS_ON_OFFLOADS \
 	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
-	 NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
+	 NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL | NETIF_F_SCTP_CRC)
 
 #define ALWAYS_ON_FEATURES (ALWAYS_ON_OFFLOADS | NETIF_F_LLTX)
 
@@ -842,7 +842,8 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
 	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
 	 NETIF_F_GSO | NETIF_F_TSO | NETIF_F_LRO | \
 	 NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
-	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
+	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
+	 NETIF_F_SCTP_CRC)
 
 #define MACVLAN_STATE_MASK \
 	((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9b6cb78..2c8512b 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -369,8 +369,7 @@ rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
 		 *	  check, we either support them all or none.
 		 */
 		if (skb->ip_summed == CHECKSUM_PARTIAL &&
-		    !(features & NETIF_F_CSUM_MASK) &&
-		    skb_checksum_help(skb))
+		    skb_csum_hwoffload_help(skb, features))
 			goto drop;
 		if (ptr_ring_produce(&q->ring, skb))
 			goto drop;
@@ -945,6 +944,9 @@ static int set_offload(struct tap_queue *q, unsigned long arg)
 		}
 	}
 
+	if (arg & TUN_F_SCTP_CSUM)
+		feature_mask |= NETIF_F_SCTP_CRC;
+
 	/* tun/tap driver inverts the usage for TSO offloads, where
 	 * setting the TSO bit means that the userspace wants to
 	 * accept TSO frames and turning it off means that user space
@@ -1077,7 +1079,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
 	case TUNSETOFFLOAD:
 		/* let the user check for future flags */
 		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-			    TUN_F_TSO_ECN | TUN_F_UFO))
+			    TUN_F_TSO_ECN | TUN_F_UFO | TUN_F_SCTP_CSUM))
 			return -EINVAL;
 
 		rtnl_lock();
-- 
2.9.5

^ permalink raw reply related

* [PATCH V2 net-next 6/6] ipvlan: Add support for SCTP checksum offload
From: Vladislav Yasevich @ 2018-05-02  2:07 UTC (permalink / raw)
  To: netdev
  Cc: linux-sctp, virtualization, virtio-dev, mst, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich
In-Reply-To: <20180502020739.19239-1-vyasevic@redhat.com>

Since ipvlan is a software device, we can turn on SCTP checksum
offload and let the checksum be computed at the lower layer.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 450eec2..ddb1590 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -161,7 +161,8 @@ static void ipvlan_port_destroy(struct net_device *dev)
 	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
 	 NETIF_F_GSO | NETIF_F_TSO | NETIF_F_GSO_ROBUST | \
 	 NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
-	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
+	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
+	 NETIF_F_SCTP_CRC)
 
 #define IPVLAN_STATE_MASK \
 	((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
-- 
2.9.5

^ permalink raw reply related

* linux-next: manual merge of the bpf-next tree with the bpf tree
From: Stephen Rothwell @ 2018-05-02  2:09 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List, Song Liu,
	Yonghong Song

[-- Attachment #1: Type: text/plain, Size: 2342 bytes --]

Hi all,

Today's linux-next merge of the bpf-next tree got a conflict in:

  tools/testing/selftests/bpf/test_progs.c

between commit:

  a4e21ff8d9a3 ("bpf: minor fix to selftest test_stacktrace_build_id()")

from the bpf tree and commit:

  79b453501310 ("tools/bpf: add a test for bpf_get_stack with tracepoint prog")

from the bpf-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc tools/testing/selftests/bpf/test_progs.c
index fac581f1c57f,aa336f0abebc..000000000000
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@@ -1137,9 -1193,14 +1193,14 @@@ static void test_stacktrace_build_id(vo
  		  err, errno))
  		goto disable_pmu;
  
+ 	stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+ 	if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap",
+ 		  "err %d errno %d\n", err, errno))
+ 		goto disable_pmu;
+ 
  	assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
  	       == 0);
 -	assert(system("./urandom_read if=/dev/urandom of=/dev/zero count=4 2> /dev/null") == 0);
 +	assert(system("./urandom_read") == 0);
  	/* disable stack trace collection */
  	key = 0;
  	val = 1;
@@@ -1188,8 -1249,15 +1249,15 @@@
  		previous_key = key;
  	} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
  
- 	CHECK(build_id_matches < 1, "build id match",
- 	      "Didn't find expected build ID from the map\n");
+ 	if (CHECK(build_id_matches < 1, "build id match",
 -		  "Didn't find expected build ID from the map"))
++		  "Didn't find expected build ID from the map\n"))
+ 		goto disable_pmu;
+ 
+ 	stack_trace_len = PERF_MAX_STACK_DEPTH
+ 		* sizeof(struct bpf_stack_build_id);
+ 	err = compare_stack_ips(stackmap_fd, stack_amap_fd, stack_trace_len);
+ 	CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+ 	      "err %d errno %d\n", err, errno);
  
  disable_pmu:
  	ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: Suggestions on iterating eBPF maps
From: Alexei Starovoitov @ 2018-05-02  2:33 UTC (permalink / raw)
  To: Lorenzo Colitti; +Cc: Chenbo Feng, netdev, Daniel Borkmann, Joel Fernandes
In-Reply-To: <CAKD1Yr0QQJwhZDH4uDevcZ8QwzhrVjRQZGDvPg+Q2ONJGAqqiQ@mail.gmail.com>

On Wed, May 02, 2018 at 11:05:19AM +0900, Lorenzo Colitti wrote:
> On Sat, Apr 28, 2018 at 10:04 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > Another approach could be to use map-in-map and have almost atomic
> > replace of the whole map with new potentially empty map. The prog
> > can continue using the new map, while user space walks no longer
> > accessed old map.
> 
> That sounds like a promising approach. I assume this would be
> functionally equivalent to an approach where there is a map containing
> a boolean that says whether to write to map A or map B? We'd then do
> the following:
> 
> 0. Kernel program is writing to map A.
> 1. Userspace pushes config that says to write to map B.
> 2. Kernel program starts to write to map B.
> 3. Userspace scans map A, collecting stats and deleting everything it finds.
> 
> One problem with this is: if the effects of #1 are not immediately
> visible to the programs running on all cores, the program could still
> be writing to map A and the deletes in #3 would result in loss of
> data. Are there any guarantees around this? I know that hash map
> writes are atomic, but I'm not aware of any other guarantees here. Are
> there memory barriers around map writes and reads?
> 
> In the absence of guarantees, userspace could put a sleep between #1
> and #3 and things would be correct Most Of The Time(TM), but if the
> kernel is busy doing other things that might not be sufficient.
> Thoughts?

if you use map-in-map you don't need extra boolean map.
0. bpf prog can do
   inner_map = lookup(map_in_map, key=0);
   lookup(inner_map, your_real_key);
1. user space writes into map_in_map[0] <- FD of new map
2. some cpus are using old inner map and some a new
3. user space does sys_membarrier(CMD_GLOBAL) which will do synchronize_sched()
   which in CONFIG_PREEMPT_NONE=y servers is the same as synchronize_rcu()
   which will guarantee that progs finished.
4. scan old inner map

^ permalink raw reply

* Re: [PATCH iproute2-master] iproute: Parse last nexthop in a multipath route
From: Stephen Hemminger @ 2018-05-02  2:37 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, dsahern, mlxsw
In-Reply-To: <20180501131635.14981-1-idosch@mellanox.com>

On Tue,  1 May 2018 16:16:35 +0300
Ido Schimmel <idosch@mellanox.com> wrote:

> Continue parsing a multipath payload as long as another nexthop can fit
> in the payload.
> 
> # ip route add 192.0.2.0/24 nexthop dev dummy0 nexthop dev dummy1
> 
> Before:
> # ip route show 192.0.2.0/24
> 192.0.2.0/24
>         nexthop dev dummy0 weight 1
> 
> After:
> # ip route show 192.0.2.0/24
> 192.0.2.0/24
>         nexthop dev dummy0 weight 1
>         nexthop dev dummy1 weight 1
> 
> Fixes: f48e14880a0e ("iproute: refactor multipath print")
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>


Applied, thanks.

There  really ought to be test cases for this.

^ permalink raw reply

* Re: [PATCH iproute2 1/2] README: update libdb build dependency information
From: Stephen Hemminger @ 2018-05-02  2:37 UTC (permalink / raw)
  To: Baruch Siach; +Cc: netdev
In-Reply-To: <1918c6139ac3ee9affe8e62817f30b110ab235c7.1525178588.git.baruch@tkos.co.il>

On Tue,  1 May 2018 15:43:07 +0300
Baruch Siach <baruch@tkos.co.il> wrote:

> Debian does not distribute libdb4.x-dev for quite some time now. Current
> stable carries libdb5.3-dev. Update the wording accordingly.
> 
> Signed-off-by: Baruch Siach <baruch@tkos.co.il>

Applied both patches, not sure how many people still use arpd anyway...

^ permalink raw reply

* Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring
From: Jason Wang @ 2018-05-02  2:51 UTC (permalink / raw)
  To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-5-tiwei.bie@intel.com>



On 2018年04月25日 13:15, Tiwei Bie wrote:
> This commit introduces the event idx support in packed
> ring. This feature is temporarily disabled, because the
> implementation in this patch may not work as expected,
> and some further discussions on the implementation are
> needed, e.g. do we have to check the wrap counter when
> checking whether a kick is needed?
>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
>   drivers/virtio/virtio_ring.c | 53 ++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 49 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 0181e93897be..b1039c2985b9 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
>   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
>   {
>   	struct vring_virtqueue *vq = to_vvq(_vq);
> -	u16 flags;
> +	u16 new, old, off_wrap, flags;
>   	bool needs_kick;
>   	u32 snapshot;
>   
> @@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
>   	 * suppressions. */
>   	virtio_mb(vq->weak_barriers);
>   
> +	old = vq->next_avail_idx - vq->num_added;
> +	new = vq->next_avail_idx;
> +	vq->num_added = 0;
> +
>   	snapshot = *(u32 *)vq->vring_packed.device;
> +	off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
>   	flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
>   
>   #ifdef DEBUG
> @@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
>   	vq->last_add_time_valid = false;
>   #endif
>   
> -	needs_kick = (flags != VRING_EVENT_F_DISABLE);
> +	if (flags == VRING_EVENT_F_DESC)
> +		needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);

I wonder whether or not the math is correct. Both new and event are in 
the unit of descriptor ring size, but old looks not.

Thanks

> +	else
> +		needs_kick = (flags != VRING_EVENT_F_DISABLE);
>   	END_USE(vq);
>   	return needs_kick;
>   }
> @@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
>   	if (vq->last_used_idx >= vq->vring_packed.num)
>   		vq->last_used_idx -= vq->vring_packed.num;
>   
> +	/* If we expect an interrupt for the next entry, tell host
> +	 * by writing event index and flush out the write before
> +	 * the read in the next get_buf call. */
> +	if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> +		virtio_store_mb(vq->weak_barriers,
> +				&vq->vring_packed.driver->off_wrap,
> +				cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> +						(vq->wrap_counter << 15)));
> +
>   #ifdef DEBUG
>   	vq->last_add_time_valid = false;
>   #endif
> @@ -1143,10 +1160,17 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
>   
>   	/* We optimistically turn back on interrupts, then check if there was
>   	 * more to do. */
> +	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> +	 * either clear the flags bit or point the event index at the next
> +	 * entry. Always update the event index to keep code simple. */
> +
> +	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> +			vq->last_used_idx | (vq->wrap_counter << 15));
>   
>   	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
>   		virtio_wmb(vq->weak_barriers);
> -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> +						     VRING_EVENT_F_ENABLE;
>   		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
>   							vq->event_flags_shadow);
>   	}
> @@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
>   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
>   {
>   	struct vring_virtqueue *vq = to_vvq(_vq);
> +	u16 bufs, used_idx, wrap_counter;
>   
>   	START_USE(vq);
>   
>   	/* We optimistically turn back on interrupts, then check if there was
>   	 * more to do. */
> +	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> +	 * either clear the flags bit or point the event index at the next
> +	 * entry. Always update the event index to keep code simple. */
> +
> +	/* TODO: tune this threshold */
> +	bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> +
> +	used_idx = vq->last_used_idx + bufs;
> +	wrap_counter = vq->wrap_counter;
> +
> +	if (used_idx >= vq->vring_packed.num) {
> +		used_idx -= vq->vring_packed.num;
> +		wrap_counter ^= 1;
> +	}
> +
> +	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> +			used_idx | (wrap_counter << 15));
>   
>   	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
>   		virtio_wmb(vq->weak_barriers);
> -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> +						     VRING_EVENT_F_ENABLE;
>   		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
>   							vq->event_flags_shadow);
>   	}
> @@ -1822,8 +1865,10 @@ void vring_transport_features(struct virtio_device *vdev)
>   		switch (i) {
>   		case VIRTIO_RING_F_INDIRECT_DESC:
>   			break;
> +#if 0
>   		case VIRTIO_RING_F_EVENT_IDX:
>   			break;
> +#endif
>   		case VIRTIO_F_VERSION_1:
>   			break;
>   		case VIRTIO_F_IOMMU_PLATFORM:

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH V2 net-next 1/6] virtio: Add support for SCTP checksum offloading
From: Michael S. Tsirkin @ 2018-05-02  3:16 UTC (permalink / raw)
  To: Vladislav Yasevich
  Cc: virtio-dev, marcelo.leitner, nhorman, netdev, virtualization,
	linux-sctp
In-Reply-To: <20180502020739.19239-2-vyasevic@redhat.com>

On Tue, May 01, 2018 at 10:07:34PM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
> The HOST feature bit signifies offloading support for transmit and
> enables device offload features.
> The GUEST feature bit signifies offloading support of recieve and
> is currently only used by the driver in case of xdp.  
> 
> That patch also adds an addition virtio_net header flag which
> mirrors the skb->csum_not_inet flag.  This flags is used to indicate
> that is this an SCTP packet that needs its checksum computed by the
> lower layer.  In this case, the lower layer is the host hypervisor or
> possibly HW nic that supporst CRC32c offload.
> 
> In the case that GUEST feature bit is flag, it will be possible to
> receive a virtio_net header with this bit set, which will set the
> corresponding skb bit.  SCTP protocol will be updated to correctly
> deal with it.
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
>  drivers/net/virtio_net.c        | 14 +++++++++++++-
>  include/linux/virtio_net.h      |  6 ++++++
>  include/uapi/linux/virtio_net.h |  5 +++++
>  3 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..34af280 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2148,6 +2148,8 @@ static int virtnet_clear_guest_offloads(struct virtnet_info *vi)
>  
>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))
>  		offloads = 1ULL << VIRTIO_NET_F_GUEST_CSUM;
> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_SCTP_CSUM))
> +		offloads |= 1ULL << VIRTIO_NET_F_GUEST_SCTP_CSUM;
>  
>  	return virtnet_set_guest_offloads(vi, offloads);
>  }
> @@ -2160,6 +2162,8 @@ static int virtnet_restore_guest_offloads(struct virtnet_info *vi)
>  		return 0;
>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))
>  		offloads |= 1ULL << VIRTIO_NET_F_GUEST_CSUM;
> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_SCTP_CSUM))
> +		offloads |= 1ULL << VIRTIO_NET_F_GUEST_SCTP_CSUM;
>  
>  	return virtnet_set_guest_offloads(vi, offloads);
>  }
> @@ -2724,6 +2728,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	/* Do we support "hardware" checksums? */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>  		/* This opens up the world of extra features. */
> +
>  		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
>  		if (csum)
>  			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> @@ -2746,9 +2751,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>  			dev->features |= dev->hw_features & NETIF_F_ALL_TSO;
>  		/* (!csum && gso) case will be fixed by register_netdev() */
>  	}
> +
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
>  		dev->features |= NETIF_F_RXCSUM;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_SCTP_CSUM)) {
> +		dev->hw_features |= NETIF_F_SCTP_CRC;
> +		dev->features |= NETIF_F_SCTP_CRC;
> +	}
> +
>  	dev->vlan_features = dev->features;
>  
>  	/* MTU range: 68 - 65535 */
> @@ -2962,7 +2973,8 @@ static struct virtio_device_id id_table[] = {
>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>  	VIRTIO_NET_F_CTRL_MAC_ADDR, \
>  	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> -	VIRTIO_NET_F_SPEED_DUPLEX
> +	VIRTIO_NET_F_SPEED_DUPLEX, \
> +	VIRTIO_NET_F_HOST_SCTP_CSUM, VIRTIO_NET_F_GUEST_SCTP_CSUM
>  
>  static unsigned int features[] = {
>  	VIRTNET_FEATURES,
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..28fffdc 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>  
>  		if (!skb_partial_csum_set(skb, start, off))
>  			return -EINVAL;
> +
> +		if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> +			skb->csum_not_inet = 1;
>  	}
>  
>  	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> @@ -91,6 +94,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>  				skb_checksum_start_offset(skb));
>  		hdr->csum_offset = __cpu_to_virtio16(little_endian,
>  				skb->csum_offset);
> +
> +		if (skb->csum_not_inet)
> +			hdr->flags |= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
>  	} else if (has_data_valid &&
>  		   skb->ip_summed == CHECKSUM_UNNECESSARY) {
>  		hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..9dfca1a 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -57,6 +57,10 @@
>  					 * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>  
> +#define VIRTIO_NET_F_GUEST_SCTP_CSUM  61 /* Guest handles SCTP pks w/ partial
> +					  * csum */
> +#define VIRTIO_NET_F_HOST_SCTP_CSUM   62 /* HOST handles SCTP pkts w/ partial
> +					  * csum */
>  #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>  
>  #ifndef VIRTIO_NET_NO_LEGACY
> @@ -101,6 +105,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4       /* Checksum is not inet */

Both comment and name are not very informative.
How about just saying CRC32c ?

>  	__u8 flags;
>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
> -- 
> 2.9.5

^ permalink raw reply

* Re: [PATCH V2 net-next 5/6] macvlan/macvtap: Add support for SCTP checksum offload.
From: Michael S. Tsirkin @ 2018-05-02  3:24 UTC (permalink / raw)
  To: Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, virtio-dev, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich
In-Reply-To: <20180502020739.19239-6-vyasevic@redhat.com>

On Tue, May 01, 2018 at 10:07:38PM -0400, Vladislav Yasevich wrote:
> Since we now have support for software CRC32c offload, turn it on
> for macvlan and macvtap devices so that guests can take advantage
> of offload SCTP checksums to the host or host hardware.
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
>  drivers/net/macvlan.c | 5 +++--
>  drivers/net/tap.c     | 8 +++++---
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index 725f4b4..646b730 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -834,7 +834,7 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
>  
>  #define ALWAYS_ON_OFFLOADS \
>  	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
> -	 NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
> +	 NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL | NETIF_F_SCTP_CRC)
>  
>  #define ALWAYS_ON_FEATURES (ALWAYS_ON_OFFLOADS | NETIF_F_LLTX)
>  
> @@ -842,7 +842,8 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
>  	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
>  	 NETIF_F_GSO | NETIF_F_TSO | NETIF_F_LRO | \
>  	 NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
> -	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
> +	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
> +	 NETIF_F_SCTP_CRC)
>  
>  #define MACVLAN_STATE_MASK \
>  	((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 9b6cb78..2c8512b 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -369,8 +369,7 @@ rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
>  		 *	  check, we either support them all or none.
>  		 */
>  		if (skb->ip_summed == CHECKSUM_PARTIAL &&
> -		    !(features & NETIF_F_CSUM_MASK) &&
> -		    skb_checksum_help(skb))
> +		    skb_csum_hwoffload_help(skb, features))
>  			goto drop;
>  		if (ptr_ring_produce(&q->ring, skb))
>  			goto drop;
> @@ -945,6 +944,9 @@ static int set_offload(struct tap_queue *q, unsigned long arg)
>  		}
>  	}
>  
> +	if (arg & TUN_F_SCTP_CSUM)
> +		feature_mask |= NETIF_F_SCTP_CRC;
> +

so this still affects TX, shouldn't this affect RX instead?


>  	/* tun/tap driver inverts the usage for TSO offloads, where
>  	 * setting the TSO bit means that the userspace wants to
>  	 * accept TSO frames and turning it off means that user space
> @@ -1077,7 +1079,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
>  	case TUNSETOFFLOAD:
>  		/* let the user check for future flags */
>  		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> -			    TUN_F_TSO_ECN | TUN_F_UFO))
> +			    TUN_F_TSO_ECN | TUN_F_UFO | TUN_F_SCTP_CSUM))
>  			return -EINVAL;
>  
>  		rtnl_lock();
> -- 
> 2.9.5

^ permalink raw reply

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Toshiaki Makita @ 2018-05-02  3:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toshiaki Makita, netdev, Tariq Toukan, Daniel Borkmann,
	Alexei Starovoitov, Eran Ben Elisha
In-Reply-To: <20180501101420.544ff225@redhat.com>

On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
>>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
>>> <makita.toshiaki@lab.ntt.co.jp> wrote:
>>> 
>>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
>>>>> <toshiaki.makita1@gmail.com> wrote:
>>>>> 
>>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
>>>>>> xdp_frame *frame) +{ +	struct veth_priv *rcv_priv, *priv = 
>>>>>> netdev_priv(dev); +	int headroom = frame->data - (void 
>>>>>> *)frame; +	struct net_device *rcv; +	int err = 0; + +	rcv
>>>>>> = rcu_dereference(priv->peer); +	if (unlikely(!rcv)) + 
>>>>>> return -ENXIO; + +	rcv_priv = netdev_priv(rcv); +	/* 
>>>>>> xdp_ring is initialized on receive side? */ +	if 
>>>>>> (rcu_access_pointer(rcv_priv->xdp_prog)) { +		err = 
>>>>>> xdp_ok_fwd_dev(rcv, frame->len); +		if (unlikely(err)) + 
>>>>>> return err; + +		err = veth_xdp_enqueue(rcv_priv, 
>>>>>> veth_xdp_to_ptr(frame)); +	} else { +		struct sk_buff *skb;
>>>>>> + +		skb = veth_build_skb(frame, headroom, frame->len, 0);
>>>>>> +		if (unlikely(!skb)) +			return -ENOMEM; + +		/* Get page
>>>>>> ref in case skb is dropped in netif_rx. + * The caller is
>>>>>> responsible for freeing the page on error. +		 */ +
>>>>>> get_page(virt_to_page(frame->data));
>>>>> 
>>>>> I'm not sure you can make this assumption, that xdp_frames 
>>>>> coming from another device driver uses a refcnt based memory 
>>>>> model. But maybe I'm confused, as this looks like an SKB 
>>>>> receive path, but in the ndo_xdp_xmit().
>>>> 
>>>> I find this path similar to cpumap, which creates skb from 
>>>> redirected xdp frame. Once it is converted to skb, skb head is 
>>>> freed by page_frag_free, so anyway I needed to get the
>>>> refcount here regardless of memory model.
>>> 
>>> Yes I know, I wrote cpumap ;-)
>>> 
>>> First of all, I don't want to see such xdp_frame to SKB 
>>> conversion code in every driver.  Because that increase the 
>>> chances of errors.  And when looking at the details, then it 
>>> seems that you have made the mistake of making it possible to 
>>> leak xdp_frame info to the SKB (which cpumap takes into 
>>> account).
>> 
>> Do you mean leaving xdp_frame in skb->head is leaking something? 
>> how?
> 
> Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> in frame data on page reuse").

Thanks for sharing the info.

> But this time, the concern is a bpf_prog attached at TC/bpf_cls 
> level, that can that can adjust head via bpf_skb_change_head (for
> XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As 
> described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in 
> is not super critical at the moment, as this _currently_ runs as 
> CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.

What I don't get is why special casing xdp_frame info. My assumption is
that the area above skb->mac_header is uninit kernel memory and should
not be readable by unprivileged users anyway. So I didn't clear the area
at this point.

>>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
>>> should be "owned" by XDP and have the proper refcnt to deliver
>>> it directly to the network stack.
>>> 
>>> Third, if we choose that we want a fallback, in-case XDP is not 
>>> enabled on egress dev (but it have an ndo_xdp_xmit), then it 
>>> should be placed in the generic/core code.  E.g. 
>>> __bpf_tx_xdp_map() could look at the return code from 
>>> dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint, this would 
>>> make it easy to implement TX bulking towards the dev).
>> 
>> Right, this is a much cleaner way.
>> 
>> Although I feel like we should add this fallback for veth because 
>> it requires something which is different from other drivers 
>> (enabling XDP on the peer device of the egress device),
> 
> (This is why I Cc'ed Tariq...)
> 
> This is actually a general problem with the xdp "xmit" side (and not
>  specific to veth driver). The problem exists for other drivers as 
> well.
> 
> The problem is that a driver can implement ndo_xdp_xmit(), but the 
> driver might not have allocated the necessary XDP TX-queue resources
>  yet (or it might not be possible due to system resource limits).
> 
> The current "hack" is to load an XDP prog on the egress device, and 
> then assume that this cause the driver to also allocate the XDP 
> ndo_xdo_xmit side HW resources.  This is IMHO a wrong assumption.
 >
> We need a more generic way to test if a net_device is "ready/enabled"
> for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> on how to implement this...

My assumption on REDIRECT requirement came from this.
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

I guess you are saying thing are changing, and having an XDP program 
attached on the egress device is no longer generally sufficient. Looking 
forward to Tariq's solution.

Toshiaki Makita

> 
> My opinion is that it is a waste of (HW/mem) resources to always 
> allocate resources for ndo_xdp_xmit when loading an XDP program. 
> Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP 
> redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g. 
> today using ixgbe on machines with more than 96 CPUs, will fail due 
> to limited TX queue resources. Thus, blocking the mentioned 
> use-cases.
> 
> 
>> I'll drop the part for now. It should not be resolved in the driver
>> code.
> 
> Thank you.
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox