* Re: [PATCH v4 bpf-next 0/4] selftests/bpf: fix compiling loop{1,2,3}.c on s390
From: Stanislav Fomichev @ 2019-07-11 20:35 UTC
To: Ilya Leoshkevich; +Cc: bpf, netdev, ys114321, daniel, davem, ast
In-Reply-To: <20190711142930.68809-1-iii@linux.ibm.com>
On 07/11, Ilya Leoshkevich wrote:
> Use PT_REGS_RC(ctx) instead of ctx->rax, which is not present on s390.
>
> This patch series consists of three preparatory commits, which make it
> possible to use PT_REGS_RC in BPF selftests, followed by the actual fix.
>
> > > Will this also work for 32-bit x86?
> > Thanks, this is a good catch: this builds, but makes 64-bit accesses, as
> > if it used the 64-bit variant of pt_regs. I will fix this.
> I found four problems in this area:
>
> 1. Selftest tracing progs are built with -target bpf, leading to struct
> pt_regs and friends being interpreted incorrectly.
> 2. When the Makefile is adjusted to build them without -target bpf, it
> still lacks -m32/-m64, leading to a similar issue.
> 3. There is no __i386__ define, leading to incorrect userspace struct
> pt_regs variant being chosen for x86.
> 4. Finally, there is an issue in my patch: when 1-3 are fixed, it fails
> to build, since i386 defines yet another set of field names.
>
> I will send fixes for problems 1-3 separately, I believe for this patch
> series to be correct, it's enough to fix #4 (which I did by adding
> another #ifdef).
>
> I've also changed ARCH to SRCARCH in patch #1, since while ARCH can be
> e.g. "i386", SRCARCH always corresponds to directory names under arch/.
>
> v1->v2: Split into multiple patches.
> v2->v3: Added arm64 support.
> v3->v4: Added i386 support, use SRCARCH instead of ARCH.
Still looks good to me, thanks!
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Again, should probably go via bpf to fix the existing tests, not bpf-next
(but I see bpf tree is not synced with net tree yet).
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
>
>
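For readers following along, the idea behind the fix can be shown with a small userspace sketch. This is not the selftest code: the demo_* names and struct layouts below are made up for illustration. Only the general approach (reach the return-value register through an arch-dependent macro instead of naming ctx->rax directly) comes from the cover letter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for two per-architecture register layouts. */
struct demo_regs_x86_64 {
	unsigned long rax;		/* return value register on x86-64 */
};

struct demo_regs_s390 {
	unsigned long gprs[16];		/* return value is in gprs[2] on s390 */
};

/* One #ifdef picks both the layout and the accessor for the build target. */
#if defined(__s390x__)
typedef struct demo_regs_s390 demo_regs_t;
#define DEMO_REGS_RC(x) ((x)->gprs[2])
#else
typedef struct demo_regs_x86_64 demo_regs_t;
#define DEMO_REGS_RC(x) ((x)->rax)
#endif

/* Portable read: callers never mention an arch-specific field name. */
static unsigned long demo_return_value(const demo_regs_t *ctx)
{
	assert(ctx != NULL);
	return DEMO_REGS_RC(ctx);
}

/* Portable write, used here only to set up a value to read back. */
static void demo_set_return_value(demo_regs_t *ctx, unsigned long v)
{
	assert(ctx != NULL);
	DEMO_REGS_RC(ctx) = v;
}
```

Code written against demo_return_value() builds unchanged on either layout, which is the property the series is after for loop{1,2,3}.c.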
* Re: [bpf-next v3 10/12] bpf: Implement bpf_prog_test_run for perf event programs
From: Stanislav Fomichev @ 2019-07-11 20:30 UTC
To: Krzesimir Nowak
Cc: linux-kernel, Alban Crequy, Iago López Galeiras,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
Yonghong Song, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
netdev, bpf, xdp-newbies
In-Reply-To: <20190708163121.18477-11-krzesimir@kinvolk.io>
On 07/08, Krzesimir Nowak wrote:
> As an input, test run for perf event program takes struct
> bpf_perf_event_data as ctx_in and struct bpf_perf_event_value as
> data_in. For an output, it basically ignores ctx_out and data_out.
>
> The implementation sets an instance of struct bpf_perf_event_data_kern
> in such a way that the BPF program reading data from context will
> receive what we passed to the bpf prog test run in ctx_in. Also BPF
> program can call bpf_perf_prog_read_value to receive what was passed
> in data_in.
>
> Changes since v2:
> - drop the changes in perf event verifier test - they are not needed
> anymore after reworked ctx size handling
>
> Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
> ---
> kernel/trace/bpf_trace.c | 60 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 60 insertions(+)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..b870fc2314d0 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -19,6 +19,8 @@
> #include "trace_probe.h"
> #include "trace.h"
>
> +#include <trace/events/bpf_test_run.h>
> +
> #define bpf_event_rcu_dereference(p) \
> rcu_dereference_protected(p, lockdep_is_held(&bpf_event_mutex))
>
> @@ -1160,7 +1162,65 @@ const struct bpf_verifier_ops perf_event_verifier_ops = {
> .convert_ctx_access = pe_prog_convert_ctx_access,
> };
>
> +static int pe_prog_test_run(struct bpf_prog *prog,
> + const union bpf_attr *kattr,
> + union bpf_attr __user *uattr)
> +{
> + struct bpf_perf_event_data_kern real_ctx = {0, };
> + struct perf_sample_data sample_data = {0, };
> + struct bpf_perf_event_data *fake_ctx;
> + struct bpf_perf_event_value *value;
> + struct perf_event event = {0, };
> + u32 retval = 0, duration = 0;
> + int err;
> +
> + if (kattr->test.data_size_out || kattr->test.data_out)
> + return -EINVAL;
> + if (kattr->test.ctx_size_out || kattr->test.ctx_out)
> + return -EINVAL;
> +
> + fake_ctx = bpf_receive_ctx(kattr, sizeof(struct bpf_perf_event_data));
> + if (IS_ERR(fake_ctx))
> + return PTR_ERR(fake_ctx);
> +
> + value = bpf_receive_data(kattr, sizeof(struct bpf_perf_event_value));
> + if (IS_ERR(value)) {
> + kfree(fake_ctx);
> + return PTR_ERR(value);
> + }
nit: maybe use a bpf_test_ prefix for receive_ctx/data, to signify that
they are used for tests only:
* bpf_test_receive_ctx
* bpf_test_receive_data
> +
> + real_ctx.regs = &fake_ctx->regs;
> + real_ctx.data = &sample_data;
> + real_ctx.event = &event;
> + perf_sample_data_init(&sample_data, fake_ctx->addr,
> + fake_ctx->sample_period);
> + event.cpu = smp_processor_id();
> + event.oncpu = -1;
> + event.state = PERF_EVENT_STATE_OFF;
> + local64_set(&event.count, value->counter);
> + event.total_time_enabled = value->enabled;
> + event.total_time_running = value->running;
> + /* make self as a leader - it is used only for checking the
> + * state field
> + */
> + event.group_leader = &event;
> + err = bpf_test_run(prog, &real_ctx, kattr->test.repeat,
> + BPF_TEST_RUN_PLAIN, &retval, &duration);
> + if (err) {
> + kfree(value);
> + kfree(fake_ctx);
> + return err;
> + }
> +
> + err = bpf_test_finish(uattr, retval, duration);
> + trace_bpf_test_finish(&err);
Can probably do:

	err = bpf_test_run(...);
	if (!err) {
		err = bpf_test_finish(uattr, retval, duration);
		trace_bpf_test_finish(&err);
	}
	kfree(..);
	kfree(..);
	return err;

So you don't have to copy-paste the error handling.
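The same single-exit shape can be tried out in userspace. The sketch below is only an illustration under assumptions: demo_run(), demo_finish() and demo_free() are hypothetical stand-ins for bpf_test_run(), bpf_test_finish() and kfree(), and the counter exists purely to make the cleanup observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int demo_frees;	/* number of buffers released so far (illustration only) */

static void demo_free(void *p)
{
	assert(p != NULL);
	free(p);
	demo_frees++;
}

/* Hypothetical stand-ins that can be told to fail. */
static int demo_run(int fail)    { return fail ? -EINTR : 0; }
static int demo_finish(int fail) { return fail ? -EFAULT : 0; }

static int demo_test_run(int run_fails, int finish_fails)
{
	void *ctx, *value;
	int err;

	ctx = malloc(16);
	if (!ctx)
		return -ENOMEM;
	value = malloc(16);
	if (!value) {
		demo_free(ctx);
		return -ENOMEM;
	}

	/* Only the success path is nested; cleanup below is shared. */
	err = demo_run(run_fails);
	if (!err)
		err = demo_finish(finish_fails);

	demo_free(value);
	demo_free(ctx);
	return err;
}
```

Whichever step fails, both buffers are released exactly once and the first error is the one returned.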
> + kfree(value);
> + kfree(fake_ctx);
> + return err;
> +}
> +
> const struct bpf_prog_ops perf_event_prog_ops = {
> + .test_run = pe_prog_test_run,
> };
>
> static DEFINE_MUTEX(bpf_event_mutex);
> --
> 2.20.1
>
* Re: [bpf-next v3 09/12] bpf: Split out some helper functions
From: Stanislav Fomichev @ 2019-07-11 20:25 UTC
To: Krzesimir Nowak
Cc: linux-kernel, Alban Crequy, Iago López Galeiras,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
Yonghong Song, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
netdev, bpf, xdp-newbies
In-Reply-To: <20190708163121.18477-10-krzesimir@kinvolk.io>
On 07/08, Krzesimir Nowak wrote:
> The moved functions are generally useful for implementing
> bpf_prog_test_run for other types of BPF programs - they don't have
> any network-specific stuff in them, so I can use them in a test run
> implementation for perf event BPF program too.
It's a bit hard to follow. Maybe split into multiple patches?
- The first one moves the relevant parts as is.
- The second one renames (though I'm not sure we need to rename, but up to
  you/maintainers).
- The third one removes duplication from bpf_prog_test_run_flow_dissector.
Also see a possible suggestion on BPF_TEST_RUN_SETUP_CGROUP_STORAGE below.
> Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
> ---
> include/linux/bpf.h | 28 +++++
> kernel/bpf/Makefile | 1 +
> kernel/bpf/test_run.c | 212 ++++++++++++++++++++++++++++++++++
> net/bpf/test_run.c | 263 +++++++++++-------------------------------
> 4 files changed, 308 insertions(+), 196 deletions(-)
> create mode 100644 kernel/bpf/test_run.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 18f4cc2c6acd..28db8ba57bc3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1143,4 +1143,32 @@ static inline u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type,
> }
> #endif /* CONFIG_INET */
>
> +/* Helper functions for bpf_prog_test_run implementations */
> +typedef u32 bpf_prog_run_helper_t(struct bpf_prog *prog, void *ctx,
> + void *private_data);
> +
> +enum bpf_test_run_flags {
> + BPF_TEST_RUN_PLAIN = 0,
> + BPF_TEST_RUN_SETUP_CGROUP_STORAGE = 1 << 0,
> +};
> +
> +int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 flags,
> + u32 *retval, u32 *duration);
> +
> +int bpf_test_run_cb(struct bpf_prog *prog, void *ctx, u32 repeat, u32 flags,
> + bpf_prog_run_helper_t run_prog, void *private_data,
> + u32 *retval, u32 *duration);
> +
> +int bpf_test_finish(union bpf_attr __user *uattr, u32 retval, u32 duration);
> +
> +void *bpf_receive_ctx(const union bpf_attr *kattr, u32 max_size);
> +
> +int bpf_send_ctx(const union bpf_attr *kattr, union bpf_attr __user *uattr,
> + const void *data, u32 size);
> +
> +void *bpf_receive_data(const union bpf_attr *kattr, u32 max_size);
> +
> +int bpf_send_data(const union bpf_attr *kattr, union bpf_attr __user *uattr,
> + const void *data, u32 size);
> +
> #endif /* _LINUX_BPF_H */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 29d781061cd5..570fd40288f4 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -22,3 +22,4 @@ obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> ifeq ($(CONFIG_INET),y)
> obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
> endif
> +obj-$(CONFIG_BPF_SYSCALL) += test_run.o
> diff --git a/kernel/bpf/test_run.c b/kernel/bpf/test_run.c
> new file mode 100644
> index 000000000000..0481373da8be
> --- /dev/null
> +++ b/kernel/bpf/test_run.c
> @@ -0,0 +1,212 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2017 Facebook
> + * Copyright (c) 2019 Tigera, Inc
> + */
> +
> +#include <asm/div64.h>
> +
> +#include <linux/bpf-cgroup.h>
> +#include <linux/bpf.h>
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +#include <linux/filter.h>
> +#include <linux/gfp.h>
> +#include <linux/kernel.h>
> +#include <linux/limits.h>
> +#include <linux/preempt.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +#include <linux/sched/signal.h>
> +#include <linux/slab.h>
> +#include <linux/timekeeping.h>
> +#include <linux/uaccess.h>
> +
> +static void teardown_cgroup_storage(struct bpf_cgroup_storage **storage)
> +{
> + enum bpf_cgroup_storage_type stype;
> +
> + if (!storage)
> + return;
> + for_each_cgroup_storage_type(stype)
> + bpf_cgroup_storage_free(storage[stype]);
> + kfree(storage);
> +}
> +
> +static struct bpf_cgroup_storage **setup_cgroup_storage(struct bpf_prog *prog)
> +{
> + enum bpf_cgroup_storage_type stype;
> + struct bpf_cgroup_storage **storage;
> + size_t size = MAX_BPF_CGROUP_STORAGE_TYPE;
> +
> + size *= sizeof(struct bpf_cgroup_storage *);
> + storage = kzalloc(size, GFP_KERNEL);
> + for_each_cgroup_storage_type(stype) {
> + storage[stype] = bpf_cgroup_storage_alloc(prog, stype);
> + if (IS_ERR(storage[stype])) {
> + storage[stype] = NULL;
> + teardown_cgroup_storage(storage);
> + return ERR_PTR(-ENOMEM);
> + }
> + }
> + return storage;
> +}
> +
> +static u32 run_bpf_prog(struct bpf_prog *prog, void *ctx, void *private_data)
> +{
> + return BPF_PROG_RUN(prog, ctx);
> +}
> +
> +int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 flags,
> + u32 *retval, u32 *duration)
> +{
> + return bpf_test_run_cb(prog, ctx, repeat, flags, run_bpf_prog, NULL,
> + retval, duration);
> +}
> +
> +int bpf_test_run_cb(struct bpf_prog *prog, void *ctx, u32 repeat, u32 flags,
> + bpf_prog_run_helper_t run_prog, void *private_data,
> + u32 *retval, u32 *duration)
> +{
> + struct bpf_cgroup_storage **storage = NULL;
> + u64 time_start, time_spent = 0;
> + int ret = 0;
> + u32 i;
> +
> + if (flags & BPF_TEST_RUN_SETUP_CGROUP_STORAGE) {
You can maybe get away without a flag. prog->aux->cgroup_storage[x] holds
a non-NULL pointer if the verifier found that the prog uses cgroup
storage, and bpf_cgroup_storage_alloc returns NULL if the program doesn't
use cgroup storage. So you should be able to figure out dynamically
whether you need to call bpf_cgroup_storage_set or not.
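A rough userspace model of that idea, under stated assumptions: struct demo_prog, demo_storage_alloc() and friends are hypothetical, and the allocator is modelled only on the behaviour described above (NULL when the program does not use a given storage type). Unlike the kernel helper, this sketch does not distinguish an allocation failure from "not used".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_STORAGE_TYPES 2

/* Hypothetical program descriptor: which storage types it references. */
struct demo_prog {
	bool uses_storage[DEMO_STORAGE_TYPES];
};

struct demo_storage {
	int type;
};

/* Returns NULL when the program does not use this storage type. */
static struct demo_storage *demo_storage_alloc(const struct demo_prog *prog,
					       int type)
{
	struct demo_storage *s;

	if (!prog->uses_storage[type])
		return NULL;
	s = calloc(1, sizeof(*s));
	if (s)
		s->type = type;
	return s;
}

/* Always try to allocate; the result tells the caller whether the
 * per-iteration "set storage" step is needed, so no flag is passed in.
 */
static int demo_setup_storage(const struct demo_prog *prog,
			      struct demo_storage *storage[DEMO_STORAGE_TYPES])
{
	int type, used = 0;

	for (type = 0; type < DEMO_STORAGE_TYPES; type++) {
		storage[type] = demo_storage_alloc(prog, type);
		if (storage[type])
			used++;
	}
	return used;
}

static void demo_teardown_storage(struct demo_storage *storage[DEMO_STORAGE_TYPES])
{
	int type;

	for (type = 0; type < DEMO_STORAGE_TYPES; type++) {
		free(storage[type]);	/* free(NULL) is a no-op */
		storage[type] = NULL;
	}
}
```

A caller would run the set step inside the loop only when demo_setup_storage() returned a non-zero count, which is what the flag currently encodes by hand.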
> + storage = setup_cgroup_storage(prog);
> + if (IS_ERR(storage))
> + return PTR_ERR(storage);
> + }
> +
> + if (!repeat)
> + repeat = 1;
> +
> + rcu_read_lock();
> + preempt_disable();
> + time_start = ktime_get_ns();
> + for (i = 0; i < repeat; i++) {
> + if (storage)
> + bpf_cgroup_storage_set(storage);
> + *retval = run_prog(prog, ctx, private_data);
> +
> + if (signal_pending(current)) {
> + preempt_enable();
> + rcu_read_unlock();
> + teardown_cgroup_storage(storage);
> + return -EINTR;
> + }
> +
> + if (need_resched()) {
> + time_spent += ktime_get_ns() - time_start;
> + preempt_enable();
> + rcu_read_unlock();
> +
> + cond_resched();
> +
> + rcu_read_lock();
> + preempt_disable();
> + time_start = ktime_get_ns();
> + }
> + }
> + time_spent += ktime_get_ns() - time_start;
> + preempt_enable();
> + rcu_read_unlock();
> +
> + do_div(time_spent, repeat);
> + *duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
> +
> + teardown_cgroup_storage(storage);
> +
> + return ret;
> +}
> +
> +int bpf_test_finish(union bpf_attr __user *uattr, u32 retval, u32 duration)
> +{
> + if (copy_to_user(&uattr->test.retval, &retval, sizeof(retval)))
> + return -EFAULT;
> + if (copy_to_user(&uattr->test.duration, &duration, sizeof(duration)))
> + return -EFAULT;
> + return 0;
> +}
> +
> +static void *bpf_receive_mem(u64 in, u32 in_size, u32 max_size)
> +{
> + void __user *data_in = u64_to_user_ptr(in);
> + void *data;
> + int err;
> +
> + if (!data_in && in_size)
> + return ERR_PTR(-EINVAL);
> + data = kzalloc(max_size, GFP_USER);
> + if (!data)
> + return ERR_PTR(-ENOMEM);
> +
> + if (data_in) {
> + err = bpf_check_uarg_tail_zero(data_in, max_size, in_size);
> + if (err) {
> + kfree(data);
> + return ERR_PTR(err);
> + }
> +
> + in_size = min_t(u32, max_size, in_size);
> + if (copy_from_user(data, data_in, in_size)) {
> + kfree(data);
> + return ERR_PTR(-EFAULT);
> + }
> + }
> + return data;
> +}
> +
> +static int bpf_send_mem(u64 out, u32 out_size, u32 *out_size_write,
> + const void *data, u32 data_size)
> +{
> + void __user *data_out = u64_to_user_ptr(out);
> + int err = -EFAULT;
> + u32 copy_size = data_size;
> +
> + if (!data_out && out_size)
> + return -EINVAL;
> +
> + if (!data || !data_out)
> + return 0;
> +
> + if (copy_size > out_size) {
> + copy_size = out_size;
> + err = -ENOSPC;
> + }
> +
> + if (copy_to_user(data_out, data, copy_size))
> + goto out;
> + if (copy_to_user(out_size_write, &data_size, sizeof(data_size)))
> + goto out;
> + if (err != -ENOSPC)
> + err = 0;
> +out:
> + return err;
> +}
> +
> +void *bpf_receive_data(const union bpf_attr *kattr, u32 max_size)
> +{
> + return bpf_receive_mem(kattr->test.data_in, kattr->test.data_size_in,
> + max_size);
> +}
> +
> +int bpf_send_data(const union bpf_attr *kattr, union bpf_attr __user *uattr,
> + const void *data, u32 size)
> +{
> + return bpf_send_mem(kattr->test.data_out, kattr->test.data_size_out,
> + &uattr->test.data_size_out, data, size);
> +}
> +
> +void *bpf_receive_ctx(const union bpf_attr *kattr, u32 max_size)
> +{
> + return bpf_receive_mem(kattr->test.ctx_in, kattr->test.ctx_size_in,
> + max_size);
> +}
> +
> +int bpf_send_ctx(const union bpf_attr *kattr, union bpf_attr __user *uattr,
> + const void *data, u32 size)
> +{
> + return bpf_send_mem(kattr->test.ctx_out, kattr->test.ctx_size_out,
> + &uattr->test.ctx_size_out, data, size);
> +}
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index 80e6f3a6864d..fe6b7b1af0cc 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -14,97 +14,6 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/bpf_test_run.h>
>
> -static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
> - u32 *retval, u32 *time)
> -{
> - struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = { NULL };
> - enum bpf_cgroup_storage_type stype;
> - u64 time_start, time_spent = 0;
> - int ret = 0;
> - u32 i;
> -
> - for_each_cgroup_storage_type(stype) {
> - storage[stype] = bpf_cgroup_storage_alloc(prog, stype);
> - if (IS_ERR(storage[stype])) {
> - storage[stype] = NULL;
> - for_each_cgroup_storage_type(stype)
> - bpf_cgroup_storage_free(storage[stype]);
> - return -ENOMEM;
> - }
> - }
> -
> - if (!repeat)
> - repeat = 1;
> -
> - rcu_read_lock();
> - preempt_disable();
> - time_start = ktime_get_ns();
> - for (i = 0; i < repeat; i++) {
> - bpf_cgroup_storage_set(storage);
> - *retval = BPF_PROG_RUN(prog, ctx);
> -
> - if (signal_pending(current)) {
> - ret = -EINTR;
> - break;
> - }
> -
> - if (need_resched()) {
> - time_spent += ktime_get_ns() - time_start;
> - preempt_enable();
> - rcu_read_unlock();
> -
> - cond_resched();
> -
> - rcu_read_lock();
> - preempt_disable();
> - time_start = ktime_get_ns();
> - }
> - }
> - time_spent += ktime_get_ns() - time_start;
> - preempt_enable();
> - rcu_read_unlock();
> -
> - do_div(time_spent, repeat);
> - *time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
> -
> - for_each_cgroup_storage_type(stype)
> - bpf_cgroup_storage_free(storage[stype]);
> -
> - return ret;
> -}
> -
> -static int bpf_test_finish(const union bpf_attr *kattr,
> - union bpf_attr __user *uattr, const void *data,
> - u32 size, u32 retval, u32 duration)
> -{
> - void __user *data_out = u64_to_user_ptr(kattr->test.data_out);
> - int err = -EFAULT;
> - u32 copy_size = size;
> -
> - /* Clamp copy if the user has provided a size hint, but copy the full
> - * buffer if not to retain old behaviour.
> - */
> - if (kattr->test.data_size_out &&
> - copy_size > kattr->test.data_size_out) {
> - copy_size = kattr->test.data_size_out;
> - err = -ENOSPC;
> - }
> -
> - if (data_out && copy_to_user(data_out, data, copy_size))
> - goto out;
> - if (copy_to_user(&uattr->test.data_size_out, &size, sizeof(size)))
> - goto out;
> - if (copy_to_user(&uattr->test.retval, &retval, sizeof(retval)))
> - goto out;
> - if (copy_to_user(&uattr->test.duration, &duration, sizeof(duration)))
> - goto out;
> - if (err != -ENOSPC)
> - err = 0;
> -out:
> - trace_bpf_test_finish(&err);
> - return err;
> -}
> -
> static void *bpf_test_init(const union bpf_attr *kattr, u32 size,
> u32 headroom, u32 tailroom)
> {
> @@ -125,63 +34,6 @@ static void *bpf_test_init(const union bpf_attr *kattr, u32 size,
> return data;
> }
>
> -static void *bpf_ctx_init(const union bpf_attr *kattr, u32 max_size)
> -{
> - void __user *data_in = u64_to_user_ptr(kattr->test.ctx_in);
> - void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out);
> - u32 size = kattr->test.ctx_size_in;
> - void *data;
> - int err;
> -
> - if (!data_in && !data_out)
> - return NULL;
> -
> - data = kzalloc(max_size, GFP_USER);
> - if (!data)
> - return ERR_PTR(-ENOMEM);
> -
> - if (data_in) {
> - err = bpf_check_uarg_tail_zero(data_in, max_size, size);
> - if (err) {
> - kfree(data);
> - return ERR_PTR(err);
> - }
> -
> - size = min_t(u32, max_size, size);
> - if (copy_from_user(data, data_in, size)) {
> - kfree(data);
> - return ERR_PTR(-EFAULT);
> - }
> - }
> - return data;
> -}
> -
> -static int bpf_ctx_finish(const union bpf_attr *kattr,
> - union bpf_attr __user *uattr, const void *data,
> - u32 size)
> -{
> - void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out);
> - int err = -EFAULT;
> - u32 copy_size = size;
> -
> - if (!data || !data_out)
> - return 0;
> -
> - if (copy_size > kattr->test.ctx_size_out) {
> - copy_size = kattr->test.ctx_size_out;
> - err = -ENOSPC;
> - }
> -
> - if (copy_to_user(data_out, data, copy_size))
> - goto out;
> - if (copy_to_user(&uattr->test.ctx_size_out, &size, sizeof(size)))
> - goto out;
> - if (err != -ENOSPC)
> - err = 0;
> -out:
> - return err;
> -}
> -
> /**
> * range_is_zero - test whether buffer is initialized
> * @buf: buffer to check
> @@ -238,6 +90,36 @@ static void convert_skb_to___skb(struct sk_buff *skb, struct __sk_buff *__skb)
> memcpy(__skb->cb, &cb->data, QDISC_CB_PRIV_LEN);
> }
>
> +static int bpf_net_prog_test_run_finish(const union bpf_attr *kattr,
> + union bpf_attr __user *uattr,
> + const void *data, u32 data_size,
> + const void *ctx, u32 ctx_size,
> + u32 retval, u32 duration)
> +{
> + int ret;
> + union bpf_attr fixed_kattr;
> + const union bpf_attr *kattr_ptr = kattr;
> +
> + /* Clamp copy (in bpf_send_mem) if the user has provided a
> + * size hint, but copy the full buffer if not to retain old
> + * behaviour.
> + */
> + if (!kattr->test.data_size_out && kattr->test.data_out) {
> + fixed_kattr = *kattr;
> + fixed_kattr.test.data_size_out = U32_MAX;
> + kattr_ptr = &fixed_kattr;
> + }
> +
> + ret = bpf_send_data(kattr_ptr, uattr, data, data_size);
> + if (!ret) {
> + ret = bpf_test_finish(uattr, retval, duration);
> + if (!ret && ctx)
> + ret = bpf_send_ctx(kattr_ptr, uattr, ctx, ctx_size);
> + }
> + trace_bpf_test_finish(&ret);
> + return ret;
> +}
> +
> int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
> union bpf_attr __user *uattr)
> {
> @@ -257,7 +139,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
> if (IS_ERR(data))
> return PTR_ERR(data);
>
> - ctx = bpf_ctx_init(kattr, sizeof(struct __sk_buff));
> + ctx = bpf_receive_ctx(kattr, sizeof(struct __sk_buff));
> if (IS_ERR(ctx)) {
> kfree(data);
> return PTR_ERR(ctx);
> @@ -307,7 +189,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
> ret = convert___skb_to_skb(skb, ctx);
> if (ret)
> goto out;
> - ret = bpf_test_run(prog, skb, repeat, &retval, &duration);
> + ret = bpf_test_run(prog, skb, repeat, BPF_TEST_RUN_SETUP_CGROUP_STORAGE,
> + &retval, &duration);
> if (ret)
> goto out;
> if (!is_l2) {
> @@ -327,10 +210,9 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
> /* bpf program can never convert linear skb to non-linear */
> if (WARN_ON_ONCE(skb_is_nonlinear(skb)))
> size = skb_headlen(skb);
> - ret = bpf_test_finish(kattr, uattr, skb->data, size, retval, duration);
> - if (!ret)
> - ret = bpf_ctx_finish(kattr, uattr, ctx,
> - sizeof(struct __sk_buff));
> + ret = bpf_net_prog_test_run_finish(kattr, uattr, skb->data, size,
> + ctx, sizeof(struct __sk_buff),
> + retval, duration);
> out:
> kfree_skb(skb);
> bpf_sk_storage_free(sk);
> @@ -365,32 +247,48 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
> rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, 0);
> xdp.rxq = &rxqueue->xdp_rxq;
>
> - ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration);
> + ret = bpf_test_run(prog, &xdp, repeat,
> + BPF_TEST_RUN_SETUP_CGROUP_STORAGE,
> + &retval, &duration);
> if (ret)
> goto out;
> if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN ||
> xdp.data_end != xdp.data + size)
> size = xdp.data_end - xdp.data;
> - ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration);
> + ret = bpf_net_prog_test_run_finish(kattr, uattr, xdp.data, size,
> + NULL, 0, retval, duration);
> out:
> kfree(data);
> return ret;
> }
>
> +struct bpf_flow_dissect_run_data {
> + __be16 proto;
> + int nhoff;
> + int hlen;
> +};
> +
> +static u32 bpf_flow_dissect_run(struct bpf_prog *prog, void *ctx,
> + void *private_data)
> +{
> + struct bpf_flow_dissect_run_data *data = private_data;
> +
> + return bpf_flow_dissect(prog, ctx, data->proto, data->nhoff, data->hlen);
> +}
> +
> int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
> const union bpf_attr *kattr,
> union bpf_attr __user *uattr)
> {
> + struct bpf_flow_dissect_run_data run_data = {};
> u32 size = kattr->test.data_size_in;
> struct bpf_flow_dissector ctx = {};
> u32 repeat = kattr->test.repeat;
> struct bpf_flow_keys flow_keys;
> - u64 time_start, time_spent = 0;
> const struct ethhdr *eth;
> u32 retval, duration;
> void *data;
> int ret;
> - u32 i;
>
> if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
> return -EINVAL;
> @@ -407,49 +305,22 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
>
> eth = (struct ethhdr *)data;
>
> - if (!repeat)
> - repeat = 1;
> -
> ctx.flow_keys = &flow_keys;
> ctx.data = data;
> ctx.data_end = (__u8 *)data + size;
>
> - rcu_read_lock();
> - preempt_disable();
> - time_start = ktime_get_ns();
> - for (i = 0; i < repeat; i++) {
> - retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
> - size);
> -
> - if (signal_pending(current)) {
> - preempt_enable();
> - rcu_read_unlock();
> -
> - ret = -EINTR;
> - goto out;
> - }
> -
> - if (need_resched()) {
> - time_spent += ktime_get_ns() - time_start;
> - preempt_enable();
> - rcu_read_unlock();
> -
> - cond_resched();
> -
> - rcu_read_lock();
> - preempt_disable();
> - time_start = ktime_get_ns();
> - }
> - }
> - time_spent += ktime_get_ns() - time_start;
> - preempt_enable();
> - rcu_read_unlock();
> -
> - do_div(time_spent, repeat);
> - duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
> + run_data.proto = eth->h_proto;
> + run_data.nhoff = ETH_HLEN;
> + run_data.hlen = size;
> + ret = bpf_test_run_cb(prog, &ctx, repeat, BPF_TEST_RUN_PLAIN,
> + bpf_flow_dissect_run, &run_data,
> + &retval, &duration);
> + if (ret)
> + goto out;
>
> - ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
> - retval, duration);
> + ret = bpf_net_prog_test_run_finish(kattr, uattr, &flow_keys,
> + sizeof(flow_keys), NULL, 0,
> + retval, duration);
>
> out:
> kfree(data);
> --
> 2.20.1
>
* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Michael S. Tsirkin @ 2019-07-11 19:52 UTC
To: Stefano Garzarella; +Cc: Jason Wang, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <20190711114134.xhmpciyglb2angl6@steredhat>
On Thu, Jul 11, 2019 at 01:41:34PM +0200, Stefano Garzarella wrote:
> On Thu, Jul 11, 2019 at 03:37:00PM +0800, Jason Wang wrote:
> >
> > On 2019/7/10 下午11:37, Stefano Garzarella wrote:
> > > Hi,
> > > as Jason suggested some months ago, I looked better at the virtio-net driver to
> > > understand if we can reuse some parts also in the virtio-vsock driver, since we
> > > have similar challenges (mergeable buffers, page allocation, small
> > > packets, etc.).
> > >
> > > Initially, I would add the skbuff in the virtio-vsock in order to re-use
> > > receive_*() functions.
> >
> >
> > Yes, that will be a good step.
> >
>
> Okay, I'll go on this way.
>
> >
> > > Then I would move receive_[small, big, mergeable]() and
> > > add_recvbuf_[small, big, mergeable]() outside of virtio-net driver, in order to
> > > call them also from virtio-vsock. I need to do some refactoring (e.g. leave the
> > > XDP part on the virtio-net driver), but I think it is feasible.
> > >
> > > The idea is to create a virtio-skb.[h,c] where put these functions and a new
> > > object where stores some attributes needed (e.g. hdr_len ) and status (e.g.
> > > some fields of struct receive_queue).
> >
> >
> > My understanding is we could be more ambitious here. Do you see any blocker
> > for reusing virtio-net directly? It's better to reuse not only the functions
> > but also the logic like NAPI to avoid re-inventing something buggy and
> > duplicated.
> >
>
> These are my concerns:
> - virtio-vsock is not a "net_device", so a lot of code related to
> ethtool, net devices (MAC address, MTU, speed, VLAN, XDP, offloading) will be
> not used by virtio-vsock.
>
> - virtio-vsock has a different header. We can consider it as part of
> virtio_net payload, but it precludes the compatibility with old hosts. This
> was one of the major doubts that made me think about using only the
> send/recv skbuff functions, that it shouldn't break the compatibility.
>
> >
> > > This is an idea of virtio-skb.h that
> > > I have in mind:
> > > struct virtskb;
> >
> >
> > What fields do you want to store in virtskb? It looks like the existing sk_buff is
> > flexible enough for us?
>
> My idea is to store queues information, like struct receive_queue or
> struct send_queue, and some device attributes (e.g. hdr_len ).
>
> >
> >
> > >
> > > struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
> > > struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
> > > struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);
> > >
> > > int virtskb_add_recvbuf_small(struct virtskb*vs, ...);
> > > int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
> > > int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);
> > >
> > > For the Guest->Host path it should be easier, so maybe I can add a
> > > "virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
> > > of xmit_skb().
> >
> >
> > I may miss something, but I don't see any thing that prevents us from using
> > xmit_skb() directly.
> >
>
> Yes, but my initial idea was to make it more parametric and not related to the
> virtio_net_hdr, so the 'hdr_len' could be a parameter and the
> 'num_buffers' should be handled by the caller.
>
> >
> > >
> > > Let me know if you have in mind better names or if I should put these function
> > > in another place.
> > >
> > > I would like to leave the control part completely separate, so, for example,
> > > the two drivers will negotiate the features independently and they will call
> > > the right virtskb_receive_*() function based on the negotiation.
> >
> >
> > If it's one the issue of negotiation, we can simply change the
> > virtnet_probe() to deal with different devices.
> >
> >
> > >
> > > I already started to work on it, but before to do more steps and send an RFC
> > > patch, I would like to hear your opinion.
> > > Do you think that makes sense?
> > > Do you see any issue or a better solution?
> >
> >
> > I still think we need to seek a way of adding some codes on virtio-net.c
> > directly if there's no huge different in the processing of TX/RX. That would
> > save us a lot time.
>
> After the reading of the buffers from the virtqueue I think the process
> is slightly different, because virtio-net will interface with the network
> stack, while virtio-vsock will interface with the vsock-core (socket).
> So the virtio-vsock implements the following:
> - control flow mechanism to avoid losing packets, informing the peer
> about the amount of memory available in the receive queue using some
> fields in the virtio_vsock_hdr
> - de-multiplexing parsing the virtio_vsock_hdr and choosing the right
> socket depending on the port
> - socket state handling
>
> We can use the virtio-net as transport, but we should add a lot of
> code to skip "net device" stuff when it is used by the virtio-vsock.
> This could break something in virtio-net, for this reason, I thought to reuse
> only the send/recv functions starting from the idea to split the virtio-net
> driver in two parts:
> a. one with all stuff related to the network stack
> b. one with the stuff needed to communicate with the host
>
> And use skbuff to communicate between parts. In this way, virtio-vsock
> can use only the b part.
>
> Maybe we can do this split in a better way, but I'm not sure it is
> simple.
>
> Thanks,
> Stefano
Frankly, skb is a huge structure which adds a lot of
overhead. I am not sure that using it is such a great idea
when building a device that does not have to interface
with the networking stack.
So I agree with Jason in theory. To clarify, he is basically saying
the current implementation is all wrong: it should be a protocol, and we
should teach the networking stack that there are reliable net devices that
handle just this protocol. We could add a flag in virtio net that
says it's such a device.
Whether it's doable, I don't know, and it's definitely not simple. In
particular, you would also have to re-implement existing devices in these
terms, and not just virtio: vmware vsock too.
If you want to do a POC you can add a new address family,
that's easier.
Just reusing random functions won't help. The net stack
is very heavy; if it manages to outperform vsock, it's
because vsock was not written with performance in mind.
But the smarts are in the core, not the virtio driver.
What makes vsock slow is design decisions like
using a workqueue to process packets,
not batching memory management, etc.
All things that the net core does for virtio net.
--
MST
* [PATCH net-next 3/3] net/mlx5: E-Switch, Reduce ingress acl modify metadata stack usage
From: Saeed Mahameed @ 2019-07-11 19:39 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed, Jianbo Liu
In-Reply-To: <20190711193937.29802-1-saeedm@mellanox.com>
Fix the following compiler warning:
In function ‘esw_vport_add_ingress_acl_modify_metadata’:
the frame size of 1084 bytes is larger than 1024 bytes [-Wframe-larger-than=]
Since the structure is never written to, we can statically allocate
it to avoid the stack usage.
Fixes: 7445cfb1169c ("net/mlx5: E-Switch, Tag packet with vport number in VF vports and uplink ingress ACLs")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 8ed4497929b9..5f78e76019c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1785,8 +1785,8 @@ static int esw_vport_add_ingress_acl_modify_metadata(struct mlx5_eswitch *esw,
struct mlx5_vport *vport)
{
u8 action[MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto)] = {};
+ static const struct mlx5_flow_spec spec = {};
struct mlx5_flow_act flow_act = {};
- struct mlx5_flow_spec spec = {};
int err = 0;
MLX5_SET(set_action_in, action, action_type, MLX5_ACTION_TYPE_SET);
--
2.21.0
^ permalink raw reply related
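The stack-usage change in the patch above can be illustrated outside the
kernel. This is only a userspace sketch, not the mlx5 code: the names
big_spec, spec_is_wildcard and match_all are hypothetical. A large,
zero-initialized structure that is only ever read can be declared
static const, so it is placed in read-only data rather than in the
function's stack frame.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a large, read-only match specification. */
struct big_spec {
    unsigned char match_criteria[512];
    unsigned char match_value[512];
};

/* Returns 1 if the spec is all zeroes, i.e. it matches everything. */
int spec_is_wildcard(const struct big_spec *spec)
{
    static const struct big_spec zero = {0}; /* not on the stack */

    assert(spec != NULL);
    return memcmp(spec, &zero, sizeof(zero)) == 0;
}

/* The empty spec is never written, so one shared instance is enough. */
int match_all(void)
{
    static const struct big_spec spec = {0}; /* 1024 bytes kept off the stack */

    return spec_is_wildcard(&spec);
}
```

The trade-off is that a static object is shared by every caller, which
is only safe here because nothing ever writes to it.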
* [PATCH net-next 2/3] net/mlx5e: Fix unused variable warning when CONFIG_MLX5_ESWITCH is off
From: Saeed Mahameed @ 2019-07-11 19:39 UTC (permalink / raw)
To: David S. Miller
Cc: netdev@vger.kernel.org, Saeed Mahameed, Mark Bloch, Tariq Toukan,
Nathan Chancellor
In-Reply-To: <20190711193937.29802-1-saeedm@mellanox.com>
In mlx5e_setup_tc, the "priv" variable is not used if
CONFIG_MLX5_ESWITCH is off. One way to fix this is to actually use it.
mlx5e_setup_tc_mqprio also needs the "priv" variable and it extracts it
on its own. We can simply pass priv to mlx5e_setup_tc_mqprio instead of
netdev and avoid extracting the priv var, which will also resolve the
compiler warning.
Fixes: 4e95bc268b91 ("net: flow_offload: add flow_block_cb_setup_simple()")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
CC: Nathan Chancellor <natechancellor@gmail.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6d0ae87c8ded..9163d6904741 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3390,10 +3390,9 @@ static int mlx5e_modify_channels_vsd(struct mlx5e_channels *chs, bool vsd)
return 0;
}
-static int mlx5e_setup_tc_mqprio(struct net_device *netdev,
+static int mlx5e_setup_tc_mqprio(struct mlx5e_priv *priv,
struct tc_mqprio_qopt *mqprio)
{
- struct mlx5e_priv *priv = netdev_priv(netdev);
struct mlx5e_channels new_channels = {};
u8 tc = mqprio->num_tc;
int err = 0;
@@ -3475,7 +3474,7 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
priv, priv, true);
#endif
case TC_SETUP_QDISC_MQPRIO:
- return mlx5e_setup_tc_mqprio(dev, type_data);
+ return mlx5e_setup_tc_mqprio(priv, type_data);
default:
return -EOPNOTSUPP;
}
--
2.21.0
^ permalink raw reply related
* [PATCH net-next 1/3] net/mlx5e: Fix compilation error in TLS code
From: Saeed Mahameed @ 2019-07-11 19:39 UTC (permalink / raw)
To: David S. Miller
Cc: netdev@vger.kernel.org, Tariq Toukan, Saeed Mahameed, Mao Wenan
In-Reply-To: <20190711193937.29802-1-saeedm@mellanox.com>
From: Tariq Toukan <tariqt@mellanox.com>
In the cited patch below, the Kconfig flags combination of:
CONFIG_MLX5_FPGA is not set
CONFIG_MLX5_TLS=y
CONFIG_MLX5_EN_TLS=y
leads to the compilation error:
./include/linux/mlx5/device.h:61:39: error: invalid application of
sizeof to incomplete type struct mlx5_ifc_tls_flow_bits.
Fix it.
Fixes: 90687e1a9a50 ("net/mlx5: Kconfig, Better organize compilation flags")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
CC: Mao Wenan <maowenan@huawei.com>
---
drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
index 879321b21616..d787bc0a4155 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
@@ -81,7 +81,6 @@ mlx5e_ktls_type_check(struct mlx5_core_dev *mdev,
struct tls_crypto_info *crypto_info) { return false; }
#endif
-#ifdef CONFIG_MLX5_FPGA_TLS
enum {
MLX5_ACCEL_TLS_TX = BIT(0),
MLX5_ACCEL_TLS_RX = BIT(1),
@@ -103,6 +102,7 @@ struct mlx5_ifc_tls_flow_bits {
u8 reserved_at_2[0x1e];
};
+#ifdef CONFIG_MLX5_FPGA_TLS
int mlx5_accel_tls_add_flow(struct mlx5_core_dev *mdev, void *flow,
struct tls_crypto_info *crypto_info,
u32 start_offload_tcp_sn, u32 *p_swid,
--
2.21.0
^ permalink raw reply related
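The shape of the TLS header fix above, sketched with made-up names
(CONFIG_DEMO_FPGA_TLS, demo_tls_flow and demo_tls_add_flow are all
hypothetical): type definitions stay outside the #ifdef so that sizeof()
always sees a complete type, and only the function declarations are
guarded, with a stub for the disabled case.

```c
#include <assert.h>
#include <stddef.h>

/* Shared layout: defined unconditionally so sizeof() works in every config. */
struct demo_tls_flow {
    unsigned char ipv6;
    unsigned char reserved[3];
    unsigned int src_port;
};

#ifdef CONFIG_DEMO_FPGA_TLS
/* The real implementation would live elsewhere when the feature is built in. */
int demo_tls_add_flow(const struct demo_tls_flow *flow);
#else
/* Stub keeps callers compiling when the feature is disabled. */
static inline int demo_tls_add_flow(const struct demo_tls_flow *flow)
{
    (void)flow;
    return -1;
}
#endif

/* Compiles in both configurations because the struct is always complete. */
size_t demo_tls_flow_size(void)
{
    return sizeof(struct demo_tls_flow);
}
```

Had the struct been left inside the #ifdef, demo_tls_flow_size() would
fail to compile with the option off, in the same way as the
"sizeof to incomplete type" error quoted in the commit message.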
* [PATCH net-next 0/3] Mellanox, mlx5 build fixes
From: Saeed Mahameed @ 2019-07-11 19:39 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed
Hi Dave,
I know net-next is closed, but these patches fix some build errors and
compiler warnings people have been complaining about.
I hope it is not too late, but in case it is a lot of trouble for you, I
guess they can wait.
Thanks,
Saeed.
---
Saeed Mahameed (2):
net/mlx5e: Fix unused variable warning when CONFIG_MLX5_ESWITCH is off
net/mlx5: E-Switch, Reduce ingress acl modify metadata stack usage
Tariq Toukan (1):
net/mlx5e: Fix compilation error in TLS code
drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 ++---
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
3 files changed, 4 insertions(+), 5 deletions(-)
--
2.21.0
^ permalink raw reply
* Re: [PATCH v3 net-next 13/19] ionic: Add initial ethtool support
From: Shannon Nelson @ 2019-07-11 19:10 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev
In-Reply-To: <20190708220406.GB17857@lunn.ch>
On 7/8/19 3:04 PM, Andrew Lunn wrote:
>> + case XCVR_PID_SFP_10GBASE_ER:
>> + ethtool_link_ksettings_add_link_mode(ks, supported,
>> + 10000baseER_Full);
>> + break;
> I don't know these link modes too well. But only setting a single bit
> seems odd. What i do know is that an SFP which supports 2500BaseX
> should also be able to support 1000BaseX. So should a 100G SFP also
> support 40G, 25G, 10G etc? The SERDES just runs a slower bitstream
> over the basic bitpipe?
Yes, but in this initial release we're not supporting changes to the
modes yet. That flexibility will come later.
>
>> + case XCVR_PID_QSFP_100G_ACC:
>> + case XCVR_PID_QSFP_40GBASE_ER4:
>> + case XCVR_PID_SFP_25GBASE_LR:
>> + case XCVR_PID_SFP_25GBASE_ER:
>> + dev_info(lif->ionic->dev, "no decode bits for xcvr type pid=%d / 0x%x\n",
>> + idev->port_info->status.xcvr.pid,
>> + idev->port_info->status.xcvr.pid);
>> + break;
> Why not add them?
Yes, this has been mentioned before. I might in the future, but I have
my hands full at the moment.
>
>
>> + memcpy(ks->link_modes.advertising, ks->link_modes.supported,
>> + sizeof(ks->link_modes.advertising));
> bitmap_copy() would be a better way to do this. You could consider
> adding a helper to ethtool.h.
Sure.
Thanks for your comments, and sorry I haven't responded as quickly as
I'd like... I'll be going through these and your other comments over the
next few days.
sln
^ permalink raw reply
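For reference, a userspace sketch of what a bitmap copy looks like
compared with a raw memcpy() over the field. The kernel's bitmap_copy()
takes a destination, a source and a bit count; everything prefixed
demo_/DEMO_ below is a hypothetical stand-in, including the bit count
of 90.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n) (((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)
#define DEMO_LINK_MODE_BITS 90 /* hypothetical number of link-mode bits */

/* Copy a bitmap of nbits bits, rounded up to whole unsigned longs. */
void demo_bitmap_copy(unsigned long *dst, const unsigned long *src,
                      unsigned int nbits)
{
    assert(dst && src);
    memcpy(dst, src, DEMO_BITS_TO_LONGS(nbits) * sizeof(unsigned long));
}

/* Mirror "supported" into "advertising"; returns 1 if they now match. */
int demo_advertise_all(void)
{
    unsigned long supported[DEMO_BITS_TO_LONGS(DEMO_LINK_MODE_BITS)] = {0};
    unsigned long advertising[DEMO_BITS_TO_LONGS(DEMO_LINK_MODE_BITS)] = {0};

    supported[0] |= 1UL << 12; /* pretend bit 12 is one supported mode */
    demo_bitmap_copy(advertising, supported, DEMO_LINK_MODE_BITS);
    return memcmp(advertising, supported, sizeof(supported)) == 0 &&
           (advertising[0] & (1UL << 12)) != 0;
}
```

The point of the helper is that the size is expressed in bits of the
bitmap rather than as sizeof() of whichever field happens to be named.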
* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Jonathan Lemon @ 2019-07-11 19:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Prout, Andrew - LLSC - MITLL, Christoph Paasch, David S . Miller,
netdev, Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell,
Tyler Hicks, Yuchung Cheng, Bruce Curtis, Dustin Marquess
In-Reply-To: <d4b1ab65-c308-382a-2a0e-9042750335e0@gmail.com>
On 11 Jul 2019, at 11:28, Eric Dumazet wrote:
> On 7/11/19 7:14 PM, Prout, Andrew - LLSC - MITLL wrote:
>>
>> In my opinion, if a small SO_SNDBUF below a certain value is no
>> longer supported, then SOCK_MIN_SNDBUF should be adjusted to reflect
>> this. The RCVBUF/SNDBUF sizes are supposed to be hints, no error is
>> returned if they are not honored. The kernel should continue to
>> function regardless of what userspace requests for their values.
>>
>
> It is supported to set whatever SO_SNDBUF value and get terrible
> performance.
>
> It always has been.
>
> The only difference is that we no longer allow an attacker to fool TCP
> stack
> and consume up to 2 GB per socket while SO_SNDBUF was set to 128 KB.
>
> The side effect is that in some cases, the workload can appear to have
> the signature of the attack.
>
> The solution is to increase your SO_SNDBUF, or even better let TCP
> stack autotune it.
> nobody forced you to set very small values for it.
I discovered we have some production services that set SO_SNDBUF to
very small values (~4k), as they are essentially doing interactive
communications, not bulk transfers. But there's a difference between
"terrible performance" and "TCP stops working".
--
Jonathan
^ permalink raw reply
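To see what a service that sets a ~4k SO_SNDBUF actually ends up with,
the value can be read back after setting it. This is a minimal sketch
using the standard setsockopt()/getsockopt() calls; the helper name
demo_effective_sndbuf is made up. On Linux the kernel doubles the
requested value and enforces a minimum, so the number read back is
normally larger than the one requested.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Request a send buffer of `requested` bytes on a fresh TCP socket and
 * return the size the kernel reports back, or -1 on error.
 */
int demo_effective_sndbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) < 0)
        actual = -1;
    close(fd);
    return actual;
}
```

Calling demo_effective_sndbuf(4096) and printing the result shows the
limit that checks against sk_sndbuf would be working with on that
socket.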
* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: David Miller @ 2019-07-11 19:02 UTC (permalink / raw)
To: idosch
Cc: nhorman, netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy,
pablo, jakub.kicinski, pieter.jansenvanvuuren, andrew, f.fainelli,
vivien.didelot, idosch
In-Reply-To: <20190711123909.GA10978@splinter>
From: Ido Schimmel <idosch@idosch.org>
Date: Thu, 11 Jul 2019 15:39:09 +0300
> Before I start working on v2, I would like to get your feedback on the
> high level plan. Also adding Neil who is the maintainer of drop_monitor
> (and counterpart DropWatch tool [1]).
I'll try to get back to this, but right now the merge window is completely
consuming me, so you will have to exercise extreme patience.
Thank you.
^ permalink raw reply
* [net 6/6] net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn
From: Saeed Mahameed @ 2019-07-11 18:54 UTC (permalink / raw)
To: David S. Miller
Cc: netdev@vger.kernel.org, Aya Levin, Feras Daoud, Saeed Mahameed
In-Reply-To: <20190711185353.5715-1-saeedm@mellanox.com>
From: Aya Levin <ayal@mellanox.com>
Check return value from mlx5e_attach_netdev, add error path on failure.
Fixes: 48935bbb7ae8 ("net/mlx5e: IPoIB, Add netdevice profile skeleton")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
index 9ca492b430d8..603d294757b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
@@ -698,7 +698,9 @@ static int mlx5_rdma_setup_rn(struct ib_device *ibdev, u8 port_num,
prof->init(mdev, netdev, prof, ipriv);
- mlx5e_attach_netdev(epriv);
+ err = mlx5e_attach_netdev(epriv);
+ if (err)
+ goto detach;
netif_carrier_off(netdev);
/* set rdma_netdev func pointers */
@@ -714,6 +716,11 @@ static int mlx5_rdma_setup_rn(struct ib_device *ibdev, u8 port_num,
return 0;
+detach:
+ prof->cleanup(epriv);
+ if (ipriv->sub_interface)
+ return err;
+ mlx5e_destroy_mdev_resources(mdev);
destroy_ht:
mlx5i_pkey_qpn_ht_cleanup(netdev);
return err;
--
2.21.0
^ permalink raw reply related
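The error-path pattern used in the IPoIB patch above, reduced to a
standalone sketch. The demo_dev structure and its flags are hypothetical
and only exist so the unwinding can be observed; the real function
unwinds driver resources instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping so the unwinding order can be observed. */
struct demo_dev {
    int ht_ready;
    int profile_ready;
    int attached;
};

int demo_attach(struct demo_dev *dev, int fail)
{
    if (fail)
        return -5; /* stand-in for a negative errno */
    dev->attached = 1;
    return 0;
}

/* Set up in order; on failure, undo only what was already done. */
int demo_setup(struct demo_dev *dev, int fail_attach)
{
    int err;

    assert(dev != NULL);
    dev->ht_ready = 1;
    dev->profile_ready = 1;

    err = demo_attach(dev, fail_attach);
    if (err)
        goto detach;

    return 0;

detach:
    dev->profile_ready = 0; /* undo the profile init */
    dev->ht_ready = 0;      /* then the earlier hash-table init */
    return err;
}
```

Without the check, a failed attach would be ignored and the caller
would carry on with a half-initialized device.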
* [net 5/6] net/mlx5e: Fix error flow in tx reporter diagnose
From: Saeed Mahameed @ 2019-07-11 18:54 UTC (permalink / raw)
To: David S. Miller
Cc: netdev@vger.kernel.org, Aya Levin, Tariq Toukan, Jiri Pirko,
Saeed Mahameed
In-Reply-To: <20190711185353.5715-1-saeedm@mellanox.com>
From: Aya Levin <ayal@mellanox.com>
Fix tx reporter's diagnose callback. Propagate error when failing to
gather diagnostics information or failing to print diagnostic data per
queue.
Fixes: de8650a82071 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index a778c15e5324..f3d98748b211 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -262,13 +262,13 @@ static int mlx5e_tx_reporter_diagnose(struct devlink_health_reporter *reporter,
err = mlx5_core_query_sq_state(priv->mdev, sq->sqn, &state);
if (err)
- break;
+ goto unlock;
err = mlx5e_tx_reporter_build_diagnose_output(fmsg, sq->sqn,
state,
netif_xmit_stopped(sq->txq));
if (err)
- break;
+ goto unlock;
}
err = devlink_fmsg_arr_pair_nest_end(fmsg);
if (err)
--
2.21.0
^ permalink raw reply related
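Why "break" loses the error in the diagnose patch above: the statement
after the loop assigns err again. A small sketch, with hypothetical
function names and under the assumption that the closing call after the
loop succeeds:

```c
#include <assert.h>

/* Hypothetical per-queue query: queue 1 fails, the others succeed. */
int demo_query(int queue)
{
    return queue == 1 ? -5 : 0;
}

/* Buggy shape: "break" falls through and the error is overwritten. */
int demo_diagnose_break(int nqueues)
{
    int err = 0;
    int i;

    for (i = 0; i < nqueues; i++) {
        err = demo_query(i);
        if (err)
            break;
    }
    err = 0; /* stand-in for a closing call that succeeds */
    return err;
}

/* Fixed shape: jump straight to the exit label and keep the error. */
int demo_diagnose_goto(int nqueues)
{
    int err = 0;
    int i;

    for (i = 0; i < nqueues; i++) {
        err = demo_query(i);
        if (err)
            goto unlock;
    }
    err = 0; /* closing call, only reached when every queue succeeded */
unlock:
    /* a real implementation would drop its lock here */
    return err;
}
```

With three queues, the first variant reports success even though
queue 1 failed, while the second returns the failure to the caller.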
* [net 4/6] net/mlx5e: Fix return value from timeout recover function
From: Saeed Mahameed @ 2019-07-11 18:54 UTC (permalink / raw)
To: David S. Miller
Cc: netdev@vger.kernel.org, Aya Levin, Jiri Pirko, Tariq Toukan,
Saeed Mahameed
In-Reply-To: <20190711185353.5715-1-saeedm@mellanox.com>
From: Aya Levin <ayal@mellanox.com>
Fix timeout recover function to return a meaningful return value.
When an interrupt was not sent by the FW, return IO error instead of
'true'.
Fixes: c7981bea48fb ("net/mlx5e: Fix return status of TX reporter timeout recover")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index 476dd97f7f2f..a778c15e5324 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -142,22 +142,20 @@ static int mlx5e_tx_reporter_timeout_recover(struct mlx5e_txqsq *sq)
{
struct mlx5_eq_comp *eq = sq->cq.mcq.eq;
u32 eqe_count;
- int ret;
netdev_err(sq->channel->netdev, "EQ 0x%x: Cons = 0x%x, irqn = 0x%x\n",
eq->core.eqn, eq->core.cons_index, eq->core.irqn);
eqe_count = mlx5_eq_poll_irq_disabled(eq);
- ret = eqe_count ? false : true;
if (!eqe_count) {
clear_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
- return ret;
+ return -EIO;
}
netdev_err(sq->channel->netdev, "Recover %d eqes on EQ 0x%x\n",
eqe_count, eq->core.eqn);
sq->channel->stats->eq_rearm++;
- return ret;
+ return 0;
}
int mlx5e_tx_reporter_timeout(struct mlx5e_txqsq *sq)
--
2.21.0
^ permalink raw reply related
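The difference between the old and new return values in the patch
above, as a sketch with hypothetical names. It assumes a caller that
follows the common "negative means error" convention; such a caller
does not see a bare 'true' (1) as a failure.

```c
#include <assert.h>
#include <errno.h>

/* Old shape: returns 1 ("true") when nothing was recovered. */
int demo_recover_old(unsigned int eqe_count)
{
    return eqe_count ? 0 : 1;
}

/* New shape: 0 on success, negative errno on failure. */
int demo_recover_new(unsigned int eqe_count)
{
    if (!eqe_count)
        return -EIO;
    return 0;
}

/* A caller that follows the usual "negative means error" convention. */
int demo_failed(int ret)
{
    return ret < 0;
}
```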
* [net 3/6] net/mlx5e: Rx, Fix checksum calculation for new hardware
From: Saeed Mahameed @ 2019-07-11 18:54 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed
In-Reply-To: <20190711185353.5715-1-saeedm@mellanox.com>
CQE checksum full mode in new HW provides a full checksum of the rx frame,
covering bytes starting from the eth protocol up to the last byte in the
received frame (frame_size - ETH_HLEN), as expected by the stack.
Fixing up skb->csum by the driver is not required in such a case. This fix
is to avoid wrong checksum calculation in drivers which already support
the new hardware with the new checksum mode.
Fixes: 85327a9c4150 ("net/mlx5: Update the list of the PCI supported devices")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 +
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 7 ++++++-
include/linux/mlx5/mlx5_ifc.h | 3 ++-
4 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index cc6797e24571..cc227a7aa79f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -294,6 +294,7 @@ enum {
MLX5E_RQ_STATE_ENABLED,
MLX5E_RQ_STATE_AM,
MLX5E_RQ_STATE_NO_CSUM_COMPLETE,
+ MLX5E_RQ_STATE_CSUM_FULL, /* cqe_csum_full hw bit is set */
};
struct mlx5e_cq {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a8e8350b38aa..98d75271fc73 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -855,6 +855,9 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
if (err)
goto err_destroy_rq;
+ if (MLX5_CAP_ETH(c->mdev, cqe_checksum_full))
+ __set_bit(MLX5E_RQ_STATE_CSUM_FULL, &c->rq.state);
+
if (params->rx_dim_enabled)
__set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 13133e7f088e..8a5f9411cac6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -873,8 +873,14 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
if (unlikely(get_ip_proto(skb, network_depth, proto) == IPPROTO_SCTP))
goto csum_unnecessary;
+ stats->csum_complete++;
skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
+
+ if (test_bit(MLX5E_RQ_STATE_CSUM_FULL, &rq->state))
+ return; /* CQE csum covers all received bytes */
+
+ /* csum might need some fixups ...*/
if (network_depth > ETH_HLEN)
/* CQE csum is calculated from the IP header and does
* not cover VLAN headers (if present). This will add
@@ -885,7 +891,6 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
skb->csum);
mlx5e_skb_padding_csum(skb, network_depth, proto, stats);
- stats->csum_complete++;
return;
}
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 5e74305e2e57..7e42efa143a0 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -749,7 +749,8 @@ struct mlx5_ifc_per_protocol_networking_offload_caps_bits {
u8 swp[0x1];
u8 swp_csum[0x1];
u8 swp_lso[0x1];
- u8 reserved_at_23[0xd];
+ u8 cqe_checksum_full[0x1];
+ u8 reserved_at_24[0xc];
u8 max_vxlan_udp_ports[0x8];
u8 reserved_at_38[0x6];
u8 max_geneve_opt_len[0x1];
--
2.21.0
^ permalink raw reply related
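A sketch of why the fixup exists at all. It assumes plain 16-bit ones'
complement arithmetic and an even split offset, and all demo_ names are
hypothetical. A checksum that starts at the IP header can be extended
to cover the bytes in front of it by adding their partial sum, and the
result equals a checksum computed over the whole range in one pass. In
full mode the hardware already provides the latter, so nothing needs to
be added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones' complement sum over len bytes (len even in this sketch). */
uint16_t demo_csum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(len % 2 == 0);
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)data[i] << 8 | data[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Ones' complement addition of two partial sums. */
uint16_t demo_csum_add(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;

    return (uint16_t)((sum & 0xffff) + (sum >> 16));
}

/* Extend a checksum that started later so it also covers hdr[0..hdr_len). */
uint16_t demo_csum_fixup(uint16_t hw_csum, const uint8_t *hdr, size_t hdr_len)
{
    return demo_csum_add(hw_csum, demo_csum(hdr, hdr_len));
}
```

For a buffer whose first 4 bytes are VLAN bytes and whose remainder is
the IP packet, demo_csum_fixup(demo_csum(buf + 4, n - 4), buf, 4) gives
the same value as demo_csum(buf, n).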
* [net 2/6] net/mlx5e: Fix port tunnel GRE entropy control
From: Saeed Mahameed @ 2019-07-11 18:54 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev@vger.kernel.org, Eli Britstein, Saeed Mahameed
In-Reply-To: <20190711185353.5715-1-saeedm@mellanox.com>
From: Eli Britstein <elibr@mellanox.com>
GRE entropy calculation is a single bit per card, and not per port.
Force disable GRE entropy calculation upon the first GRE encap rule,
and release the force at the last GRE encap rule removal. This is done
per port.
Fixes: 97417f6182f8 ("net/mlx5e: Fix GRE key by controlling port tunnel entropy calculation")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
.../mellanox/mlx5/core/lib/port_tun.c | 23 ++++---------------
1 file changed, 4 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c
index be69c1d7941a..48b5c847b642 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/port_tun.c
@@ -98,27 +98,12 @@ static int mlx5_set_entropy(struct mlx5_tun_entropy *tun_entropy,
*/
if (entropy_flags.gre_calc_supported &&
reformat_type == MLX5_REFORMAT_TYPE_L2_TO_NVGRE) {
- /* Other applications may change the global FW entropy
- * calculations settings. Check that the current entropy value
- * is the negative of the updated value.
- */
- if (entropy_flags.force_enabled &&
- enable == entropy_flags.gre_calc_enabled) {
- mlx5_core_warn(tun_entropy->mdev,
- "Unexpected GRE entropy calc setting - expected %d",
- !entropy_flags.gre_calc_enabled);
- return -EOPNOTSUPP;
- }
- err = mlx5_set_port_gre_tun_entropy_calc(tun_entropy->mdev, enable,
- entropy_flags.force_supported);
+ if (!entropy_flags.force_supported)
+ return 0;
+ err = mlx5_set_port_gre_tun_entropy_calc(tun_entropy->mdev,
+ enable, !enable);
if (err)
return err;
- /* if we turn on the entropy we don't need to force it anymore */
- if (entropy_flags.force_supported && enable) {
- err = mlx5_set_port_gre_tun_entropy_calc(tun_entropy->mdev, 1, 0);
- if (err)
- return err;
- }
} else if (entropy_flags.calc_supported) {
/* Other applications may change the global FW entropy
* calculations settings. Check that the current entropy value
--
2.21.0
^ permalink raw reply related
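The "first rule disables, last rule releases" behaviour described in
the commit message above can be pictured as a reference count. The
structure and function names below are hypothetical; only the pairing
of the two values (enable, !enable) is taken from the patch.

```c
#include <assert.h>

/* Hypothetical state for one port. */
struct demo_entropy {
    int gre_rules;    /* number of active GRE encap rules */
    int calc_enabled; /* 1 while entropy calculation is on */
    int forced;       /* 1 while the setting is being forced */
};

/* First GRE encap rule: force the calculation off. */
void demo_gre_rule_add(struct demo_entropy *e)
{
    if (e->gre_rules++ == 0) {
        e->calc_enabled = 0;
        e->forced = 1;
    }
}

/* Last GRE encap rule removed: re-enable and release the force. */
void demo_gre_rule_del(struct demo_entropy *e)
{
    assert(e->gre_rules > 0);
    if (--e->gre_rules == 0) {
        e->calc_enabled = 1;
        e->forced = 0;
    }
}
```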
* [net 1/6] net/mlx5: E-Switch, Fix default encap mode
From: Saeed Mahameed @ 2019-07-11 18:54 UTC (permalink / raw)
To: David S. Miller
Cc: netdev@vger.kernel.org, Maor Gottlieb, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190711185353.5715-1-saeedm@mellanox.com>
From: Maor Gottlieb <maorg@mellanox.com>
Encap mode is related to switchdev mode only. Move the init of
the encap mode to eswitch_offloads. Before this change, we reported
that eswitch supports encap, even though the device was in
non-SRIOV mode.
Fixes: 7768d1971de67 ('net/mlx5: E-Switch, Add control for encapsulation')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 5 -----
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 7 +++++++
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 6a921e24cd5e..e9339e7d6a18 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1882,11 +1882,6 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
esw->enabled_vports = 0;
esw->mode = SRIOV_NONE;
esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
- if (MLX5_CAP_ESW_FLOWTABLE_FDB(dev, reformat) &&
- MLX5_CAP_ESW_FLOWTABLE_FDB(dev, decap))
- esw->offloads.encap = DEVLINK_ESWITCH_ENCAP_MODE_BASIC;
- else
- esw->offloads.encap = DEVLINK_ESWITCH_ENCAP_MODE_NONE;
dev->priv.eswitch = esw;
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 47b446d30f71..c2beadc41c40 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1840,6 +1840,12 @@ int esw_offloads_init(struct mlx5_eswitch *esw, int vf_nvports,
{
int err;
+ if (MLX5_CAP_ESW_FLOWTABLE_FDB(esw->dev, reformat) &&
+ MLX5_CAP_ESW_FLOWTABLE_FDB(esw->dev, decap))
+ esw->offloads.encap = DEVLINK_ESWITCH_ENCAP_MODE_BASIC;
+ else
+ esw->offloads.encap = DEVLINK_ESWITCH_ENCAP_MODE_NONE;
+
err = esw_offloads_steering_init(esw, vf_nvports, total_nvports);
if (err)
return err;
@@ -1901,6 +1907,7 @@ void esw_offloads_cleanup(struct mlx5_eswitch *esw)
esw_offloads_devcom_cleanup(esw);
esw_offloads_unload_all_reps(esw, num_vfs);
esw_offloads_steering_cleanup(esw);
+ esw->offloads.encap = DEVLINK_ESWITCH_ENCAP_MODE_NONE;
}
static int esw_mode_from_devlink(u16 mode, u16 *mlx5_mode)
--
2.21.0
^ permalink raw reply related
* [pull request][net 0/6] Mellanox, mlx5 fixes 2019-07-11
From: Saeed Mahameed @ 2019-07-11 18:54 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed
Hi Dave,
This series introduces some fixes to mlx5 driver.
Please pull and let me know if there is any problem.
For -stable v4.15
('net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn')
For -stable v5.1
('net/mlx5e: Fix port tunnel GRE entropy control')
('net/mlx5e: Rx, Fix checksum calculation for new hardware')
('net/mlx5e: Fix return value from timeout recover function')
('net/mlx5e: Fix error flow in tx reporter diagnose')
For -stable v5.2
('net/mlx5: E-Switch, Fix default encap mode')
Conflict note: This pull request will produce a small conflict when
merged with net-next.
In drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
Take the hunk from net and replace:
esw_offloads_steering_init(esw, vf_nvports, total_nvports);
with:
esw_offloads_steering_init(esw);
Thanks,
Saeed.
---
The following changes since commit e858faf556d4e14c750ba1e8852783c6f9520a0e:
tcp: Reset bytes_acked and bytes_received when disconnecting (2019-07-08 19:29:19 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-07-11
for you to fetch changes up to ef1ce7d7b67b46661091c7ccc0396186b7a247ef:
net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn (2019-07-11 11:45:04 -0700)
----------------------------------------------------------------
mlx5-fixes-2019-07-11
----------------------------------------------------------------
Aya Levin (3):
net/mlx5e: Fix return value from timeout recover function
net/mlx5e: Fix error flow in tx reporter diagnose
net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn
Eli Britstein (1):
net/mlx5e: Fix port tunnel GRE entropy control
Maor Gottlieb (1):
net/mlx5: E-Switch, Fix default encap mode
Saeed Mahameed (1):
net/mlx5e: Rx, Fix checksum calculation for new hardware
drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 +
.../ethernet/mellanox/mlx5/core/en/reporter_tx.c | 10 ++++------
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 7 ++++++-
drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 5 -----
.../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 7 +++++++
.../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 9 ++++++++-
.../net/ethernet/mellanox/mlx5/core/lib/port_tun.c | 23 ++++------------------
include/linux/mlx5/mlx5_ifc.h | 3 ++-
9 files changed, 35 insertions(+), 33 deletions(-)
^ permalink raw reply
* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Eric Dumazet @ 2019-07-11 18:50 UTC (permalink / raw)
To: Michal Kubecek, netdev
Cc: Eric Dumazet, Christoph Paasch, Prout, Andrew - LLSC - MITLL,
David Miller, Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell,
Tyler Hicks, Yuchung Cheng, Bruce Curtis, Jonathan Lemon,
Dustin Marquess
In-Reply-To: <20190711182654.GG5700@unicorn.suse.cz>
On 7/11/19 8:26 PM, Michal Kubecek wrote:
>
> I'm aware it's not a realistic test. It was written as quick and simple
> check of the pre-4.19 patch, but it shows that even TLP may not get
> through.
Most TLP probes send new data, not retransmits.
But yes, I get your point.
SO_SNDBUF=15000 in your case is seriously wrong.
Let's code a safety feature over SO_SNDBUF to not allow pathologically small values,
because I do not want to support a constrained TCP stack in 2019.
^ permalink raw reply
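The limit being discussed in this thread can be checked by hand against
the numbers in Michal's trace (sndbuf=30000, wmemq=65600). The sketch
below models only the condition quoted from the 4.15+ kernels. The enum
and function names are hypothetical, and which queue a given kernel
passes for a loss probe is an assumption here.

```c
#include <assert.h>
#include <errno.h>

enum demo_tcp_queue {
    DEMO_FRAG_IN_WRITE_QUEUE,
    DEMO_FRAG_IN_RTX_QUEUE,
};

/* Returns -ENOMEM when the quoted limit would refuse the split, else 0. */
int demo_fragment_check(int wmem_queued, int sndbuf,
                        enum demo_tcp_queue queue)
{
    assert(wmem_queued >= 0 && sndbuf >= 0);
    if ((wmem_queued >> 1) > sndbuf && queue != DEMO_FRAG_IN_WRITE_QUEUE)
        return -ENOMEM;
    return 0;
}
```

With wmem_queued=65600 and sndbuf=30000, half of the queued memory is
32800, which exceeds 30000. A fragment request from the retransmit
queue is therefore refused with -ENOMEM, consistent with the ret=-12 in
the trace, while the same request from the write queue is allowed.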
* Re: [GIT] Networking
From: pr-tracker-bot @ 2019-07-11 18:35 UTC (permalink / raw)
To: David Miller; +Cc: torvalds, akpm, netdev, linux-kernel
In-Reply-To: <20190709.223834.2182721912834033108.davem@davemloft.net>
The pull request you sent on Tue, 09 Jul 2019 22:38:34 -0700 (PDT):
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git refs/heads/master
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/237f83dfbe668443b5e31c3c7576125871cca674
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
^ permalink raw reply
* Re: [PATCH] MAINTAINERS: update BPF JIT S390 maintainers
From: David Miller @ 2019-07-11 18:33 UTC (permalink / raw)
To: gor; +Cc: ast, daniel, heiko.carstens, borntraeger, iii, netdev, bpf,
linux-s390
In-Reply-To: <your-ad-here.call-01562758494-ext-2794@work.hours>
From: Vasily Gorbik <gor@linux.ibm.com>
Date: Wed, 10 Jul 2019 13:34:54 +0200
> Dave, Alexei, Daniel,
> would you take it via one of your trees? Or should I take it via s390?
I think it can go via the bpf tree.
^ permalink raw reply
* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: Jakub Kicinski @ 2019-07-11 18:32 UTC (permalink / raw)
To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <5d276814a76ad_698f2aaeaaf925bc8a@john-XPS-13-9370.notmuch>
On Thu, 11 Jul 2019 09:47:16 -0700, John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Wed, 10 Jul 2019 12:34:17 -0700, Jakub Kicinski wrote:
> > > > > > + if (sk->sk_prot->unhash)
> > > > > > + sk->sk_prot->unhash(sk);
> > > > > > + }
> > > > > > +
> > > > > > + ctx = tls_get_ctx(sk);
> > > > > > + if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> > > > > > + tls_sk_proto_cleanup(sk, ctx, timeo);
> >
> > Do we still need to hook into unhash? With patch 6 in place perhaps we
> > can just do disconnect 🥺
>
> ?? "can just do a disconnect", not sure I follow. We still need unhash
> in cases where we have a TLS socket transition from ESTABLISHED
> to LISTEN state without calling close(). This is independent of if
> sockmap is running or not.
>
> Originally, I thought this would be extremely rare but I did see it
> in real applications on the sockmap side so presumably it is possible
> here as well.
Ugh, sorry, I meant shutdown. Instead of replacing the unhash callback,
replace the shutdown callback. We probably shouldn't release the socket
lock there either, but we can sleep, so I'll be able to run the device
connection remove callback (which sleeps).
> > cleanup is going to kick off TX but also:
> >
> > if (unlikely(sk->sk_write_pending) &&
> > !wait_on_pending_writer(sk, &timeo))
> > tls_handle_open_record(sk, 0);
> >
> > Are we guaranteed that sk_write_pending is 0? Otherwise
> > wait_on_pending_writer is hiding yet another release_sock() :(
>
> Not seeing the path to release_sock() at the moment?
>
> tls_handle_open_record
> push_pending_record
> tls_sw_push_pending_record
> bpf_exec_tx_verdict
wait_on_pending_writer
sk_wait_event
release_sock
> If bpf_exec_tx_verdict does a redirect we could hit a release but that
> is another fix I have to get queued up shortly. I think we can fix
> that in another series.
Ugh.
^ permalink raw reply
* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Eric Dumazet @ 2019-07-11 18:28 UTC (permalink / raw)
To: Prout, Andrew - LLSC - MITLL, Eric Dumazet, Christoph Paasch
Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
Jonathan Lemon, Dustin Marquess
In-Reply-To: <adec774ed16540c6b627c2f607f3e216@ll.mit.edu>
On 7/11/19 7:14 PM, Prout, Andrew - LLSC - MITLL wrote:
>
> In my opinion, if a small SO_SNDBUF below a certain value is no longer supported, then SOCK_MIN_SNDBUF should be adjusted to reflect this. The RCVBUF/SNDBUF sizes are supposed to be hints, no error is returned if they are not honored. The kernel should continue to function regardless of what userspace requests for their values.
>
It is supported to set whatever SO_SNDBUF value and get terrible performance.
It always has been.
The only difference is that we no longer allow an attacker to fool TCP stack
and consume up to 2 GB per socket while SO_SNDBUF was set to 128 KB.
The side effect is that in some cases, the workload can appear to have the signature of the attack.
The solution is to increase your SO_SNDBUF, or even better let TCP stack autotune it.
Nobody forced you to set very small values for it.
^ permalink raw reply
* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Michal Kubecek @ 2019-07-11 18:26 UTC (permalink / raw)
To: netdev
Cc: Eric Dumazet, Christoph Paasch, Prout, Andrew - LLSC - MITLL,
David Miller, Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell,
Tyler Hicks, Yuchung Cheng, Bruce Curtis, Jonathan Lemon,
Dustin Marquess
In-Reply-To: <eb6121ea-b02d-672e-25c9-2ad054d49fc7@gmail.com>
On Thu, Jul 11, 2019 at 11:19:45AM +0200, Eric Dumazet wrote:
>
>
> On 7/11/19 9:28 AM, Christoph Paasch wrote:
> >
> >
> >> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>
> >>
> >>
> >> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
> >>>
> >>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
> >>>
> >>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
> >>> -The send socket was non-blocking
> >>> -SO_SNDBUF set to 128KiB
> >>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
> >>> -Load was on both systems: a while(1) program spinning on each CPU core
> >>> -The receiver was on an older unaffected kernel
> >>>
> >>
> >> SO_SNDBUF to 128KB does not permit to recover from heavy losses,
> >> since skbs needs to be allocated for retransmits.
> >
> > Would it make sense to always allow the alloc in tcp_fragment when coming from __tcp_retransmit_skb() through the retransmit-timer ?
>
> 4.15+ kernels have :
>
> if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
> tcp_queue != TCP_FRAG_IN_WRITE_QUEUE)) {
>
>
> Meaning that things like TLP will succeed.
I get
<idle>-0 [010] ..s. 301696.143296: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
<idle>-0 [010] d.s. 301696.143301: r_tcp_fragment_0: (tcp_send_loss_probe+0x13d/0x1f0 <- tcp_fragment) ret=-12
<idle>-0 [010] ..s. 301696.267644: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
<idle>-0 [010] d.s. 301696.267650: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
<idle>-0 [010] ..s. 301696.875289: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
<idle>-0 [010] d.s. 301696.875293: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
<idle>-0 [010] ..s. 301698.059267: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
<idle>-0 [010] d.s. 301698.059271: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
<idle>-0 [010] ..s. 301700.427225: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
<idle>-0 [010] d.s. 301700.427230: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
<idle>-0 [010] ..s. 301705.291144: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
<idle>-0 [010] d.s. 301705.291151: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
<idle>-0 [010] ..s. 301714.762961: p_tcp_fragment_0: (tcp_fragment+0x0/0x310) sndbuf=30000 wmemq=65600
<idle>-0 [010] d.s. 301714.762966: r_tcp_fragment_0: (__tcp_retransmit_skb+0xf9/0x800 <- tcp_fragment) ret=-12
on 5.2 kernel with this packetdrill script:
------------------------------------------------------------------------
--tolerance_usecs=10000
// flush cached TCP metrics
0.000 `ip tcp_metrics flush all`
// establish a connection
+0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [15000], 4) = 0
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0
+0.100 < S 0:0(0) win 60000 <mss 1000,nop,nop,sackOK,nop,wscale 7>
+0.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.100 < . 1:1(0) ack 1 win 2000
+0.000 accept(3, ..., ...) = 4
+0.100 write(4, ..., 30000) = 30000
+0.000 > . 1:2001(2000) ack 1
+0.000 > . 2001:4001(2000) ack 1
+0.000 > . 4001:6001(2000) ack 1
+0.000 > . 6001:8001(2000) ack 1
+0.000 > . 8001:10001(2000) ack 1
+0.010 < . 1:1(0) ack 10001 win 2000
+0.000 > . 10001:12001(2000) ack 1
+0.000 > . 12001:14001(2000) ack 1
+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:18001(2000) ack 1
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:22001(2000) ack 1
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:26001(2000) ack 1
+0.000 > . 26001:28001(2000) ack 1
+0.000 > P. 28001:30001(2000) ack 1
+0.010 < . 1:1(0) ack 30001 win 2000
+0.000 write(4, ..., 40000) = 40000
+0.000 > . 30001:32001(2000) ack 1
+0.000 > . 32001:34001(2000) ack 1
+0.000 > . 34001:36001(2000) ack 1
+0.000 > . 36001:38001(2000) ack 1
+0.000 > . 38001:40001(2000) ack 1
+0.000 > . 40001:42001(2000) ack 1
+0.000 > . 42001:44001(2000) ack 1
+0.000 > . 44001:46001(2000) ack 1
+0.000 > . 46001:48001(2000) ack 1
+0.000 > . 48001:50001(2000) ack 1
+0.000 > . 50001:52001(2000) ack 1
+0.000 > . 52001:54001(2000) ack 1
+0.000 > . 54001:56001(2000) ack 1
+0.000 > . 56001:58001(2000) ack 1
+0.000 > . 58001:60001(2000) ack 1
+0.000 > . 60001:62001(2000) ack 1
+0.000 > . 62001:64001(2000) ack 1
+0.000 > . 64001:66001(2000) ack 1
+0.000 > . 66001:68001(2000) ack 1
+0.000 > P. 68001:70001(2000) ack 1
+0.000 `ss -nteim state established sport == :8080`
+0.120~+0.200 > P. 69001:70001(1000) ack 1
------------------------------------------------------------------------
I'm aware it's not a realistic test. It was written as a quick and simple
check of the pre-4.19 patch, but it shows that even TLP may not get
through.
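
For anyone checking the numbers, here is a minimal sketch of the arithmetic
behind the ret=-12 lines in the trace. It only models the condition quoted
above from 4.15+ kernels; frag_limit_check() and the enum names are
hypothetical stand-ins, not kernel symbols, and I am assuming that both
tcp_send_loss_probe() and __tcp_retransmit_skb() call tcp_fragment() with a
queue other than the write queue. The script sets SO_SNDBUF to 15000, which
the kernel doubles, hence sndbuf=30000 in the trace.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's tcp_queue values. */
enum frag_queue {
	FRAG_IN_WRITE_QUEUE,	/* skb not sent yet */
	FRAG_IN_RTX_QUEUE,	/* skb already sent, waiting for an ACK */
};

/*
 * Model of the quoted limit check:
 *   (sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
 *   tcp_queue != TCP_FRAG_IN_WRITE_QUEUE
 * Returns 0 if fragmenting may proceed, -ENOMEM if it is refused.
 */
static int frag_limit_check(int wmem_queued, int sndbuf, enum frag_queue queue)
{
	if ((wmem_queued >> 1) > sndbuf && queue != FRAG_IN_WRITE_QUEUE)
		return -ENOMEM;
	return 0;
}
```

With the traced values, frag_limit_check(65600, 30000, FRAG_IN_RTX_QUEUE)
compares 32800 against 30000 and returns -ENOMEM, while the same sizes with
FRAG_IN_WRITE_QUEUE are allowed through.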
Michal
^ permalink raw reply
* Re: [PATCH net-next iproute2 2/3] tc: Introduce tc ct action
From: Marcelo Ricardo Leitner @ 2019-07-11 17:40 UTC (permalink / raw)
To: Paul Blakey
Cc: Roi Dayan, John Hurley, Yossi, Oz Shlomo, netdev@vger.kernel.org,
Aaron Conole, Rony Efraim, Justin Pettit, Jiri Pirko,
nst-kernel@redhat.com, Simon Horman, Zhike Wang, David Miller,
Kuperman
In-Reply-To: <5ded2e5b-958e-eca3-76ad-909ebf79234e@mellanox.com>
On Thu, Jul 11, 2019 at 07:21:51AM +0000, Paul Blakey wrote:
>
> On 7/9/2019 6:36 PM, Marcelo Ricardo Leitner wrote:
> > On Tue, Jul 09, 2019 at 06:58:36AM +0000, Paul Blakey wrote:
> >> On 7/8/2019 8:54 PM, Marcelo Ricardo Leitner wrote:
> >>> On Sun, Jul 07, 2019 at 11:53:47AM +0300, Paul Blakey wrote:
> >>>> New tc action to send packets to conntrack module, commit
> >>>> them, and set a zone, labels, mark, and nat on the connection.
> >>>>
> >>>> It can also clear the packet's conntrack state by using clear.
> >>>>
> >>>> Usage:
> >>>> ct clear
> >>>> ct commit [force] [zone] [mark] [label] [nat]
> >>> Isn't the 'commit' also optional? More like
> >>> ct [commit [force]] [zone] [mark] [label] [nat]
> >>>
> >>>> ct [nat] [zone]
> >>>>
> >>>> Signed-off-by: Paul Blakey <paulb@mellanox.com>
> >>>> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> >>>> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
> >>>> Acked-by: Jiri Pirko <jiri@mellanox.com>
> >>>> Acked-by: Roi Dayan <roid@mellanox.com>
> >>>> ---
> >>> ...
> >>>> +static void
> >>>> +usage(void)
> >>>> +{
> >>>> + fprintf(stderr,
> >>>> + "Usage: ct clear\n"
> >>>> + " ct commit [force] [zone ZONE] [mark MASKED_MARK] [label MASKED_LABEL] [nat NAT_SPEC]\n"
> >>> Ditto here then.
> >>
> >> In commit msg and here, it means there is multiple modes of operation. I
> >> think it's easier to split those.
> > Yep, that is good.
> > More below.
> >
> >> "ct clear" to clear it, no other options can be added here.
> >>
> >> "ct commit [force].... " sends to conntrack and commit a connection,
> >> and only for commit can you specify force mark label, and nat with
> >> nat_spec....
> >>
> >> and the last one, "ct [nat] [zone ZONE]" is to just send the packet to
> >> conntrack on some zone [optional], restore nat [optional].
> >>
> >>
> >>>> + " ct [nat] [zone ZONE]\n"
> >>>> + "Where: ZONE is the conntrack zone table number\n"
> >>>> + " NAT_SPEC is {src|dst} addr addr1[-addr2] [port port1[-port2]]\n"
> >>>> + "\n");
> >>>> + exit(-1);
> >>>> +}
> >>> ...
> >>>
> >>> The validation below doesn't enforce that commit must be there for
> >>> such case.
> >> which case? commit is optional. the above are the three valid patterns.
> > That's the point. But the 2nd example is saying 'commit' word is
> > mandatory in that mode. It is written as it is a command that was
> > selected.
> >
> > One may use just:
> > ct [zone]
> > And not
> > ct commit [zone]
> > Right?
>
> It is optional in the overall syntax.
>
>
> But I split it into modes:
>
> clear, commit, and "restore" (I unofficially call it that, because it is
> usually used to get the +est state on the packet and can restore nat, it
> doesn't actually restore anything for the first packet on the -trk rule)
>
> It is mandatory in the second mode (commit), if you don't specify commit
> or clear, you can only use the third form - "restore", which is to send
> to ct on some optional zone, and optionally restore nat (so we get
> ct [zone] [nat]).
I see. Thanks Paul.
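
To restate the three modes as I understand them, a minimal sketch follows. It
is not the iproute2 parser: ct_classify() and the enum are hypothetical names,
it looks at keywords only (the ZONE, MASKED_MARK, MASKED_LABEL and NAT_SPEC
values are left out), and it does not check keyword order.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical mode names for the three usage lines. */
enum ct_mode {
	CT_MODE_CLEAR,		/* ct clear */
	CT_MODE_COMMIT,		/* ct commit [force] [zone] [mark] [label] [nat] */
	CT_MODE_RESTORE,	/* ct [nat] [zone] */
	CT_MODE_INVALID,
};

/* Classify the keywords that follow "ct" on the command line. */
static enum ct_mode ct_classify(int argc, const char *const *argv)
{
	bool commit;
	int i;

	/* "clear" stands alone; no other option may follow it. */
	if (argc > 0 && strcmp(argv[0], "clear") == 0)
		return argc == 1 ? CT_MODE_CLEAR : CT_MODE_INVALID;

	commit = argc > 0 && strcmp(argv[0], "commit") == 0;

	for (i = commit ? 1 : 0; i < argc; i++) {
		const char *a = argv[i];

		/* zone and nat are accepted in both remaining modes. */
		if (strcmp(a, "zone") == 0 || strcmp(a, "nat") == 0)
			continue;
		/* force, mark and label are only valid after "commit". */
		if (commit && (strcmp(a, "force") == 0 ||
			       strcmp(a, "mark") == 0 ||
			       strcmp(a, "label") == 0))
			continue;
		return CT_MODE_INVALID;
	}

	return commit ? CT_MODE_COMMIT : CT_MODE_RESTORE;
}
```

Under this reading, "ct zone" lands in the restore mode, "ct commit zone" in
the commit mode, and "ct mark" without commit is rejected.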
Marcelo
^ permalink raw reply