* Re: [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Eric Dumazet @ 2018-04-19 23:15 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller
Cc: netdev, Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20180416173339.6310-5-edumazet@google.com>
On 04/16/2018 10:33 AM, Eric Dumazet wrote:
> Some networks can make sure TCP payload can exactly fit 4KB pages,
> with well chosen MSS/MTU and architectures.
>
> Implement mmap() system call so that applications can avoid
> copying data without complex splice() games.
>
> Note that a successful mmap( X bytes) on TCP socket is consuming
> bytes, as if recvmsg() has been done. (tp->copied += X)
>
Oh well, I should have run this code with LOCKDEP enabled :/
[ 974.320412] ======================================================
[ 974.326631] WARNING: possible circular locking dependency detected
[ 974.332816] 4.16.0-dbx-DEV #40 Not tainted
[ 974.336927] ------------------------------------------------------
[ 974.343107] b78299096/15790 is trying to acquire lock:
[ 974.348246] 000000006074c9cf (sk_lock-AF_INET6){+.+.}, at: tcp_mmap+0x7c/0x550
[ 974.355505]
but task is already holding lock:
[ 974.361366] 000000008dbe063b (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0x99/0x100
[ 974.368801]
which lock already depends on the new lock.
[ 974.377010]
the existing dependency chain (in reverse order) is:
[ 974.384501]
-> #1 (&mm->mmap_sem){++++}:
[ 974.389911] __might_fault+0x68/0x90
[ 974.394025] _copy_from_user+0x23/0xa0
[ 974.398311] sock_setsockopt+0x4a2/0xac0
[ 974.402761] __sys_setsockopt+0xd9/0xf0
[ 974.407118] SyS_setsockopt+0xe/0x20
[ 974.411242] do_syscall_64+0x6e/0x1a0
[ 974.415431] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 974.421011]
-> #0 (sk_lock-AF_INET6){+.+.}:
[ 974.426690] lock_acquire+0x95/0x1e0
[ 974.430813] lock_sock_nested+0x71/0xa0
[ 974.435196] tcp_mmap+0x7c/0x550
[ 974.438940] sock_mmap+0x23/0x30
[ 974.442695] mmap_region+0x3a4/0x5d0
[ 974.446808] do_mmap+0x313/0x530
[ 974.450571] vm_mmap_pgoff+0xc7/0x100
[ 974.454769] ksys_mmap_pgoff+0x1d5/0x260
[ 974.459247] SyS_mmap+0x1b/0x30
[ 974.462936] do_syscall_64+0x6e/0x1a0
[ 974.467114] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 974.472678]
other info that might help us debug this:
[ 974.480677] Possible unsafe locking scenario:
[ 974.486600] CPU0 CPU1
[ 974.491152] ---- ----
[ 974.495684] lock(&mm->mmap_sem);
[ 974.499089] lock(sk_lock-AF_INET6);
[ 974.505285] lock(&mm->mmap_sem);
[ 974.511211] lock(sk_lock-AF_INET6);
[ 974.514885]
*** DEADLOCK ***
[ 974.520825] 1 lock held by b78299096/15790:
[ 974.525018] #0: 000000008dbe063b (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0x99/0x100
[ 974.532852]
stack backtrace:
[ 974.537224] CPU: 25 PID: 15790 Comm: b78299096 Not tainted 4.16.0-dbx-DEV #40
[ 974.544371] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.40.0 06/22/2016
[ 974.551333] Call Trace:
[ 974.553792] dump_stack+0x70/0xa5
[ 974.557111] print_circular_bug.isra.39+0x1d8/0x1e6
[ 974.561982] __lock_acquire+0x1284/0x1340
[ 974.565992] ? tcp_mmap+0x7c/0x550
[ 974.569419] lock_acquire+0x95/0x1e0
[ 974.573011] ? lock_acquire+0x95/0x1e0
[ 974.576767] ? tcp_mmap+0x7c/0x550
[ 974.580167] lock_sock_nested+0x71/0xa0
[ 974.584023] ? tcp_mmap+0x7c/0x550
[ 974.587437] tcp_mmap+0x7c/0x550
[ 974.590677] sock_mmap+0x23/0x30
[ 974.593909] mmap_region+0x3a4/0x5d0
[ 974.597506] do_mmap+0x313/0x530
[ 974.600749] vm_mmap_pgoff+0xc7/0x100
[ 974.604414] ksys_mmap_pgoff+0x1d5/0x260
[ 974.608341] ? fd_install+0x25/0x30
[ 974.611849] ? trace_hardirqs_on_caller+0xef/0x180
[ 974.616641] SyS_mmap+0x1b/0x30
[ 974.619804] do_syscall_64+0x6e/0x1a0
[ 974.623462] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 974.628549] RIP: 0033:0x433749
[ 974.631600] RSP: 002b:00007ffd29fdb438 EFLAGS: 00000216 ORIG_RAX: 0000000000000009
[ 974.639197] RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000433749
[ 974.646323] RDX: 0000000000000008 RSI: 0000000000004000 RDI: 0000000020ab7000
[ 974.653463] RBP: 00007ffd29fdb460 R08: 0000000000000003 R09: 0000000000000000
[ 974.660603] R10: 0000000000000012 R11: 0000000000000216 R12: 0000000000401670
[ 974.667737] R13: 0000000000401700 R14: 0000000000000000 R15: 0000000000000000
I am not sure we can keep the mmap() API, since we probably need to lock the socket first,
then grab the vm semaphore (mmap_sem).
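For readers following the splat, the inversion can be modeled in a few lines of userspace C. This is only a sketch of the two-lock case, not how lockdep is implemented; the enum, the table and the function names are made up for illustration, and the two "paths" simply mirror the two call chains in the report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, named after the two locks in the report. */
enum lock_class { MMAP_SEM, SK_LOCK, NR_CLASSES };

/* before[a][b] is true once some path has taken b while holding a. */
static bool before[NR_CLASSES][NR_CLASSES];

/*
 * Record "held -> wanted" and report whether the opposite order was
 * already seen, which is the two-lock case of a circular dependency.
 */
static bool order_inverted(enum lock_class held, enum lock_class wanted)
{
	assert(held != wanted);
	before[held][wanted] = true;
	return before[wanted][held];
}

/* setsockopt(): socket locked, copy_from_user() may fault and take mmap_sem. */
static bool setsockopt_path(void)
{
	return order_inverted(SK_LOCK, MMAP_SEM);
}

/* mmap(): mmap_sem taken in vm_mmap_pgoff(), tcp_mmap() then locks the socket. */
static bool tcp_mmap_path(void)
{
	return order_inverted(MMAP_SEM, SK_LOCK);
}
```

Calling setsockopt_path() first records one order without complaint; a later tcp_mmap_path() then reports the inversion, which is the situation the warning describes.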
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Andrew Morton @ 2018-04-19 23:22 UTC (permalink / raw)
To: Mikulas Patocka
Cc: David Miller, linux-mm, eric.dumazet, edumazet, bhutchings,
netdev, linux-kernel, mst, jasowang, virtualization, dm-devel,
Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804191716100.10099@file01.intranet.prod.int.rdu2.redhat.com>
On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > >
> > > ...
> > >
> > > --- linux-2.6.orig/mm/util.c 2018-04-18 15:46:23.000000000 +0200
> > > +++ linux-2.6/mm/util.c 2018-04-18 16:00:43.000000000 +0200
> > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > > */
> > > void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > > {
> > > +#ifndef CONFIG_DEBUG_VM
> > > gfp_t kmalloc_flags = flags;
> > > void *ret;
> > >
> > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > */
> > > if (ret || size <= PAGE_SIZE)
> > > return ret;
> > > +#endif
> > >
> > > return __vmalloc_node_flags_caller(size, node, flags,
> > > __builtin_return_address(0));
> >
> > Well, it doesn't have to be done at compile-time, does it? We could
> > add a knob (in debugfs, presumably) which enables this at runtime.
> > That's far more user-friendly.
>
> But who will turn it on in debugfs?
But who will turn it on in Kconfig? Just a handful of developers. We
could add CONFIG_DEBUG_SG to the list in
Documentation/process/submit-checklist.rst, but nobody reads that.
Also, a whole bunch of defconfigs set CONFIG_DEBUG_SG=y and some
googling indicates that they aren't the only ones...
> It should be default for debugging
> kernels, so that users using them would report the error.
Well. This isn't the first time we've wanted to enable expensive (or
noisy) debugging things in debug kernels, by any means.
So how could we define a debug kernel in which it's OK to enable such
things?
- Could be "it's an -rc kernel". But then we'd be enabling a bunch of
untested code when Linus cuts a release.
- Could be "it's an -rc kernel with SUBLEVEL <= 5". But then we risk
unexpected things happening when Linus cuts -rc6, which still isn't
good.
- How about "it's an -rc kernel with odd-numbered SUBLEVEL and
SUBLEVEL <= 5". That way everybody who runs -rc1, -rc3 and -rc5 will
have kvmalloc debugging enabled. That's potentially nasty because
vmalloc is much slower than kmalloc. But kvmalloc() is only used for
large and probably infrequent allocations, so it's probably OK.
I wonder how we get at SUBLEVEL from within .c.
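As a rough userspace sketch of the last policy above: in the top-level Makefile the -rc tag is carried in EXTRAVERSION rather than in SUBLEVEL, so the sketch parses a release string such as "4.17.0-rc3" instead. Both function names are hypothetical, and the assumption that kernel code would feed in UTS_RELEASE is mine, not something settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Return N for a release string containing "-rcN", or 0 for a final release. */
static long rc_number(const char *release)
{
	const char *rc;

	assert(release != NULL);
	rc = strstr(release, "-rc");
	return rc ? strtol(rc + 3, NULL, 10) : 0;
}

/* Policy from the last bullet: odd-numbered -rc, and no later than -rc5. */
static bool kvmalloc_debug_wanted(const char *release)
{
	long rc = rc_number(release);

	return rc > 0 && rc <= 5 && (rc % 2) == 1;
}
```

With this policy -rc1, -rc3 and -rc5 get the debugging behavior, while even-numbered -rc kernels, later -rc kernels and final releases do not.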
* Re: [PATCH bpf-next v2 3/9] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Yonghong Song @ 2018-04-19 23:37 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180419043322.zmwxapw3vcimlgg6@ast-mbp>
On 4/18/18 9:33 PM, Alexei Starovoitov wrote:
> On Wed, Apr 18, 2018 at 09:54:38AM -0700, Yonghong Song wrote:
>> The special property of return values for helpers bpf_get_stack
>> and bpf_probe_read_str are captured in verifier.
>> Both helpers return a negative error code or
>> a length, which is equal to or smaller than the buffer
>> size argument. This additional information in the
>> verifier can avoid the condition such as "retval > bufsize"
>> in the bpf program. For example, for the code below,
>> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>> if (usize < 0 || usize > max_len)
>> return 0;
>> The verifier may have the following errors:
>> 52: (85) call bpf_get_stack#65
>> R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>> R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>> R9_w=inv800 R10=fp0,call_-1
>> 53: (bf) r8 = r0
>> 54: (bf) r1 = r8
>> 55: (67) r1 <<= 32
>> 56: (bf) r2 = r1
>> 57: (77) r2 >>= 32
>> 58: (25) if r2 > 0x31f goto pc+33
>> R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
>> umax_value=18446744069414584320,
>> var_off=(0x0; 0xffffffff00000000))
>> R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
>> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>> R8=inv(id=0) R9=inv800 R10=fp0,call_-1
>> 59: (1f) r9 -= r8
>> 60: (c7) r1 s>>= 32
>> 61: (bf) r2 = r7
>> 62: (0f) r2 += r1
>> math between map_value pointer and register with unbounded
>> min value is not allowed
>> The failure is due to llvm compiler optimization where register "r2",
>> which is a copy of "r1", is tested for condition while later on "r1"
>> is used for map_ptr operation. The verifier is not able to track such
>> inst sequence effectively.
>>
>> Without the "usize > max_len" condition, there is no llvm optimization
>> and the below generated code passed verifier:
>> 52: (85) call bpf_get_stack#65
>> R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>> R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>> R9_w=inv800 R10=fp0,call_-1
>> 53: (b7) r1 = 0
>> 54: (bf) r8 = r0
>> 55: (67) r8 <<= 32
>> 56: (c7) r8 s>>= 32
>> 57: (6d) if r1 s> r8 goto pc+24
>> R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
>> R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>> R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
>> R10=fp0,call_-1
>> 58: (bf) r2 = r7
>> 59: (0f) r2 += r8
>> 60: (1f) r9 -= r8
>> 61: (bf) r1 = r6
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> kernel/bpf/verifier.c | 31 ++++++++++++++++++++++++++++++-
>> 1 file changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index aba9425..a8302c3 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -2333,10 +2333,32 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>> return 0;
>> }
>>
>> +static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
>> + int func_id,
>> + struct bpf_reg_state *retval_state,
>> + bool is_check)
>> +{
>> + struct bpf_reg_state *src_reg, *dst_reg;
>> +
>> + if (ret_type != RET_INTEGER ||
>> + (func_id != BPF_FUNC_get_stack &&
>> + func_id != BPF_FUNC_probe_read_str))
>> + return;
>> +
>> + dst_reg = is_check ? retval_state : &regs[BPF_REG_0];
>> + if (func_id == BPF_FUNC_get_stack)
>> + src_reg = is_check ? &regs[BPF_REG_3] : retval_state;
>> + else
>> + src_reg = is_check ? &regs[BPF_REG_2] : retval_state;
>> +
>> + dst_reg->smax_value = src_reg->smax_value;
>> + dst_reg->umax_value = src_reg->umax_value;
>> +}
>
> I think this part can be made more generic, by using 'meta' logic.
> check_func_arg(.. &meta);
> can remember smax/umax into meta for arg_type_is_mem_size()
> and later refine_retval_range() can be applied to r0.
> This will help avoid mistakes with specifying reg by position (r2 or r3)
> like above snippet is doing.
Good suggestion. Let me try this.
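A minimal userspace sketch of how I read the suggestion, with simplified stand-in types: the size argument's upper bounds are remembered in the meta while arguments are checked, and applied to r0 afterwards, so no register is picked by position. The struct and function names here are illustrative only and are not the verifier's.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the verifier's types; only the bounds are kept. */
struct reg_bounds {
	int64_t smax_value;
	uint64_t umax_value;
};

struct call_meta {
	int have_mem_size;        /* set once a size argument was seen */
	struct reg_bounds msize;  /* upper bounds of that size argument */
};

/* While checking an argument: remember the bounds if it is a buffer size. */
static void note_mem_size_arg(struct call_meta *meta,
			      const struct reg_bounds *arg)
{
	assert(meta && arg);
	meta->msize = *arg;
	meta->have_mem_size = 1;
}

/* After the call: the return value cannot exceed the buffer size. */
static void refine_retval_range(struct reg_bounds *r0,
				const struct call_meta *meta)
{
	if (!meta->have_mem_size)
		return;
	r0->smax_value = meta->msize.smax_value;
	r0->umax_value = meta->msize.umax_value;
}
```

In the real verifier the refinement would still have to be limited to helpers whose return value is bounded by the size argument, as the func_id check in the patch does.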
>
>> +
>> static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
>> {
>> const struct bpf_func_proto *fn = NULL;
>> - struct bpf_reg_state *regs;
>> + struct bpf_reg_state *regs, retval_state;
>> struct bpf_call_arg_meta meta;
>> bool changes_data;
>> int i, err;
>> @@ -2415,6 +2437,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>> }
>>
>> regs = cur_regs(env);
>> +
>> + /* before reset caller saved regs, check special ret value */
>> + do_refine_retval_range(regs, fn->ret_type, func_id, &retval_state, 1);
>> +
>> /* reset caller saved regs */
>> for (i = 0; i < CALLER_SAVED_REGS; i++) {
>> mark_reg_not_init(env, regs, caller_saved[i]);
>> @@ -2456,6 +2482,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>> return -EINVAL;
>> }
>>
>> + /* apply additional constraints to ret value */
>> + do_refine_retval_range(regs, fn->ret_type, func_id, &retval_state, 0);
>> +
>> err = check_map_func_compatibility(env, meta.map_ptr, func_id);
>> if (err)
>> return err;
>> --
>> 2.9.5
>>
* Re: [PATCH bpf-next v2 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Yonghong Song @ 2018-04-19 23:39 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180419043511.n65ryn5twzcfyp2f@ast-mbp>
On 4/18/18 9:35 PM, Alexei Starovoitov wrote:
> On Wed, Apr 18, 2018 at 09:54:39AM -0700, Yonghong Song wrote:
>> When helpers like bpf_get_stack return an int value that is
>> later used for arithmetic computation, the LSH and ARSH
>> operations are often required to get proper sign extension into
>> 64-bit. For example, without this patch:
>> 54: R0=inv(id=0,umax_value=800)
>> 54: (bf) r8 = r0
>> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
>> 55: (67) r8 <<= 32
>> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
>> 56: (c7) r8 s>>= 32
>> 57: R8=inv(id=0)
>> With this patch:
>> 54: R0=inv(id=0,umax_value=800)
>> 54: (bf) r8 = r0
>> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
>> 55: (67) r8 <<= 32
>> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
>> 56: (c7) r8 s>>= 32
>> 57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
>> With better range of "R8", later on when "R8" is added to other register,
>> e.g., a map pointer or scalar-value register, the better register
>> range can be derived and verifier failure may be avoided.
>>
>> In our later example,
>> ......
>> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>> if (usize < 0)
>> return 0;
>> ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
>> ......
>> Without improving ARSH value range tracking, the register representing
>> "max_len - usize" will have smin_value equal to S64_MIN and will be
>> rejected by verifier.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> kernel/bpf/verifier.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index a8302c3..6148d31 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -2944,6 +2944,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
>> __update_reg_bounds(dst_reg);
>> break;
>> case BPF_RSH:
>> + case BPF_ARSH:
>
> I don't think that's correct.
> The code further down is very RSH specific.
Okay, I may need to introduce tnum_arshift then.
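Roughly what I have in mind, as a standalone sketch rather than a patch: shift both the value and the mask of the tnum arithmetically, so a known sign bit is replicated into the value and an unknown one into the mask. The struct below is a userspace stand-in, the helper names are placeholders, and only a constant shift amount below 64 is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Tristate number: bits set in mask are unknown, the rest come from value. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Arithmetic right shift on a 64-bit pattern without relying on signed >>. */
static uint64_t sar64(uint64_t x, unsigned int shift)
{
	uint64_t fill;

	assert(shift < 64);
	fill = (x >> 63) ? ~(~UINT64_C(0) >> shift) : 0;
	return (x >> shift) | fill;
}

/*
 * Shifting value and mask arithmetically keeps the sign information:
 * a known sign bit is replicated into value, an unknown one into mask.
 */
static struct tnum tnum_arshift_sketch(struct tnum a, unsigned int shift)
{
	struct tnum r = { sar64(a.value, shift), sar64(a.mask, shift) };

	return r;
}
```

For the R8 state in the commit message, var_off=(0x0; 0x3ff00000000) shifted by 32 gives (0x0; 0x3ff), which matches the "with this patch" output.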
>
>> if (umax_val >= insn_bitness) {
>> /* Shifts greater than 31 or 63 are undefined.
>> * This includes shifts by a negative number.
>> --
>> 2.9.5
>>
* Re: [PATCH bpf-next v2 7/9] samples/bpf: add a test for bpf_get_stack helper
From: Yonghong Song @ 2018-04-19 23:42 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180419043745.a23qak7peaurmiqg@ast-mbp>
On 4/18/18 9:37 PM, Alexei Starovoitov wrote:
> On Wed, Apr 18, 2018 at 09:54:42AM -0700, Yonghong Song wrote:
>> The test attached a kprobe program to kernel function sys_write.
>> It tested to get stack for user space, kernel space and user
>> space with build_id request. It also tested to get user
>> and kernel stack into the same buffer with back-to-back
>> bpf_get_stack helper calls.
>>
>> Whenever the kernel stack is available, the user space
>> application will check to ensure that sys_write/SyS_write
>> is part of the stack.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> samples/bpf/Makefile | 4 +
>> samples/bpf/trace_get_stack_kern.c | 86 +++++++++++++++++++++
>> samples/bpf/trace_get_stack_user.c | 150 +++++++++++++++++++++++++++++++++++++
>> 3 files changed, 240 insertions(+)
>
> since perf_read is being refactored out of trace_output_user.c in the previous patch
> please move it to selftests (instead of bpf_load.c) and move
> this whole test to selftests as well.
I put it here since I am attaching to a kprobe so that I can compare
address. I guess I can still do it by attaching to a kernel tracepoint.
Will move the tests to selftests as suggested.
* Re: [PATCH bpf-next v2 9/9] tools/bpf: add a test_progs test case for bpf_get_stack helper
From: Yonghong Song @ 2018-04-19 23:42 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180419043953.fcv33e2glomg33gp@ast-mbp>
On 4/18/18 9:39 PM, Alexei Starovoitov wrote:
> On Wed, Apr 18, 2018 at 09:54:44AM -0700, Yonghong Song wrote:
>> The test_stacktrace_map is enhanced to call bpf_get_stack
>> in the helper to get the stack trace as well.
>> The stack traces from bpf_get_stack and bpf_get_stackid
>> are compared to ensure that for the same stack as
>> represented as the same hash, their ip addresses
>> must be the same.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>
> could you please add a test for bpf_get_stack() with buildid as well?
> I think patch 2 implements it correctly, but would be good to have a test for it.
Right. Will improve the test to cover buildid as well.
* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Daniel Borkmann @ 2018-04-19 23:57 UTC (permalink / raw)
To: Martin KaFai Lau, netdev
Cc: Alexei Starovoitov, kernel-team, Arnaldo Carvalho de Melo
In-Reply-To: <20180418225606.2771620-1-kafai@fb.com>
On 04/19/2018 12:55 AM, Martin KaFai Lau wrote:
> This patch introduces BPF Type Format (BTF).
>
> BTF (BPF Type Format) is the meta data format which describes
> the data types of BPF program/map. Hence, it basically focuses
> on the C programming language, which modern BPF primarily
> uses. The first use case is to provide a generic pretty print
> capability for a BPF map.
>
> A modified pahole that can convert dwarf to BTF is here:
> https://github.com/iamkafai/pahole/tree/btf
> (Arnaldo, there are some BTF_KIND numbering changes on
> Apr 18th, d61426c1571)
>
> Please see individual patch for details.
>
> v5:
> - Remove BTF_KIND_FLOAT and BTF_KIND_FUNC which are not
> currently used. They can be added in the future.
> Some bpf_df_xxx() are removed together.
> - Add comment in patch 7 to clarify that the new bpffs_map_fops
> should not be extended further.
>
> v4:
> - Fix warning (remove unneeded semicolon)
> - Remove a redundant variable (nr_bytes) from btf_int_check_meta() in
> patch 1. Caught by W=1.
>
> v3:
> - Rebase to bpf-next
> - Fix sparse warning (by adding static)
> - Add BTF header logging: btf_verifier_log_hdr()
> - Fix the alignment test on btf->type_off
> - Add tests for the BTF header
> - Lower the max BTF size to 16MB. It should be enough
> for some time. We could raise it later if it would
> be needed.
>
> v2:
> - Use kvfree where needed in patch 1 and 2
> - Also consider BTF_INT_OFFSET() in the btf_int_check_meta()
> in patch 1
> - Fix an incorrect goto target in map_create() during
> the btf-error-path in patch 7
> - re-org some local vars to keep the rev xmas tree in btf.c
Series applied to bpf-next, thanks Martin. As discussed please follow up
with the bpftool patches.
Thanks,
Daniel
* Re: [bpf-next PATCH 3/3] bpf: add sample program to trace map events
From: Alexei Starovoitov @ 2018-04-20 0:27 UTC (permalink / raw)
To: Sebastiano Miano
Cc: netdev, ast, daniel, mingo, rostedt, brouer, fulvio.risso,
David S. Miller
In-Reply-To: <152406545918.3465.14253635905960610284.stgit@localhost.localdomain>
On Wed, Apr 18, 2018 at 05:30:59PM +0200, Sebastiano Miano wrote:
> This patch adds a sample program, called trace_map_events,
> that shows how to capture map events and filter them based on
> the map id.
...
> +struct bpf_map_keyval_ctx {
> + u64 pad; // First 8 bytes are not accessible by bpf code
> + u32 type; // offset:8; size:4; signed:0;
> + u32 key_len; // offset:12; size:4; signed:0;
> + u32 key; // offset:16; size:4; signed:0;
> + bool key_trunc; // offset:20; size:1; signed:0;
> + u32 val_len; // offset:24; size:4; signed:0;
> + u32 val; // offset:28; size:4; signed:0;
> + bool val_trunc; // offset:32; size:1; signed:0;
> + int ufd; // offset:36; size:4; signed:1;
> + u32 id; // offset:40; size:4; signed:0;
> +};
> +
> +SEC("tracepoint/bpf/bpf_map_lookup_elem")
> +int trace_bpf_map_lookup_elem(struct bpf_map_keyval_ctx *ctx)
> +{
> + struct map_event_data data;
> + int cpu = bpf_get_smp_processor_id();
> + bool *filter;
> + u32 key = 0, map_id = ctx->id;
> +
> + filter = bpf_map_lookup_elem(&filter_events, &key);
> + if (!filter)
> + return 1;
> +
> + if (!*filter)
> + goto send_event;
> +
> + /*
> + * If the map_id is not in the list of filtered
> + * ids we immediately return
> + */
> + if (!bpf_map_lookup_elem(&filtered_ids, &map_id))
> + return 0;
> +
> +send_event:
> + data.map_id = map_id;
> + data.evnt_type = MAP_LOOKUP;
> + data.map_type = ctx->type;
> +
> + bpf_perf_event_output(ctx, &map_event_trace, cpu, &data, sizeof(data));
> + return 0;
> +}
looks like the purpose of the series is to create map notify mechanism
so some external user space daemon can snoop all bpf map operations
that all other processes and bpf programs are doing.
I think it would be way better to create a proper mechanism for that
with permissions.
tracepoints in the bpf core were introduced as introspection mechanism
for debugging. Right now we have better ways to do introspection
with ids, queries, etc that bpftool is using, so original purpose of
those tracepoints is gone and they actually rot.
Let's not repurpose them into this map notify logic.
I don't want tracepoints in the bpf core to become a stable interface;
it will stifle development.
* Re: [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Michael S. Tsirkin @ 2018-04-20 0:40 UTC (permalink / raw)
To: Alexander Duyck
Cc: Daly, Dan, Bjorn Helgaas, Duyck, Alexander H, linux-pci,
virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch, netanel,
Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <CAKgT0Ude79FYrK4qA0OKRJ1NackyqPi-hZ8Zh3WSdLDFxoOosQ@mail.gmail.com>
On Tue, Apr 03, 2018 at 12:06:03PM -0700, Alexander Duyck wrote:
> On Tue, Apr 3, 2018 at 11:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 03, 2018 at 10:32:00AM -0700, Alexander Duyck wrote:
> >> On Tue, Apr 3, 2018 at 6:12 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Fri, Mar 16, 2018 at 09:40:34AM -0700, Alexander Duyck wrote:
> >> >> On Fri, Mar 16, 2018 at 9:34 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > On Thu, Mar 15, 2018 at 11:42:41AM -0700, Alexander Duyck wrote:
> >> >> >> From: Alexander Duyck <alexander.h.duyck@intel.com>
> >> >> >>
> >> >> >> Hardware-realized virtio_pci devices can implement SR-IOV, so this
> >> >> >> patch enables its use. The device in question is an upcoming Intel
> >> >> >> NIC that implements both a virtio_net PF and virtio_net VFs. These
> >> >> >> are hardware realizations of what has up to now been a software
> >> >> >> interface.
> >> >> >>
> >> >> >> The device in question has the following 4-part PCI IDs:
> >> >> >>
> >> >> >> PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
> >> >> >> VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
> >> >> >>
> >> >> >> The patch currently needs no check for device ID, because the callback
> >> >> >> will never be made for devices that do not assert the capability or
> >> >> >> when run on a platform incapable of SR-IOV.
> >> >> >>
> >> >> >> One reason for this patch is because the hardware requires the
> >> >> >> vendor ID of a VF to be the same as the vendor ID of the PF that
> >> >> >> created it. So it seemed logical to simply have a fully-functioning
> >> >> >> virtio_net PF create the VFs. This patch makes that possible.
> >> >> >>
> >> >> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >> >> >> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
> >> >> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> >> >> >
> >> >> > So if and when virtio PFs can manage the VFs, then we can
> >> >> > add a feature bit for that?
> >> >> > Seems reasonable.
> >> >>
> >> >> Yes. If nothing else you may not even need a feature bit depending on
> >> >> how things go.
> >> >
> >> > OTOH if the interface is changed in an incompatible way,
> >> > an old Linux will attempt to drive the new device
> >> > since there is no check.
> >> >
> >> > I think we should add a feature bit right away.
> >>
> >> I'm not sure why you would need a feature bit. The capability is
> >> controlled via PCI configuration space. If it is present the device
> >> has the capability. If it is not then it does not.
> >>
> >> Basically if the PCI configuration space is not present then the sysfs
> >> entries will not be spawned and nothing will attempt to use this
> >> function.
> >>
> >> - ALex
> >
> > It's about compatibility with older guests which ignore the
> > capability.
> >
> > The feature is thus helpful so host knows whether guest supports VFs.
>
> The thing is if the capability is ignored then the feature isn't used.
> So for SR-IOV it isn't an uncommon thing for there to be drivers for
> the PF floating around that do not support SR-IOV. In such cases
> SR-IOV just isn't used while the hardware could support it.
Right but how come there are VF drivers but PF driver does not
know about these?
And are there PF drivers that intentionally do not enable SRIOV
because it's known to be broken in some way?
Case in point I do think virtio wants to limit this
depending on a feature bit on general principles
(the principle being that all extensions have feature bits).
There are security implications here - we previously relied on
whitelisting after all.
Wouldn't it be safer to be a bit more careful and update the
actual PF drivers? It's just one line per driver, but it
can be done with an ack by driver maintainer.
If/once we find out all drivers do have it, we can then
change the default.
> I would think in the case of virtio it would be the same kind of
> thing. Basically if SR-IOV is supported by the host then the
> capability would be present. If SR-IOV is supported by the guest then
> it would make use of the capability to spawn VFs. If either the
> capability isn't present, or the driver doesn't use it then you won't
> be able to spawn VFs in the guest.
> Maybe I am missing something. Do you support dynamically changing the
> PCI configuration space for Virtio devices based on the presence of
> feature bits provided by the guest?
No. The point is that IMHO virtio at least should, in the absence of a
feature bit, ignore VFs rather than assume they are safe to drive
in an unmanaged way.
> Also are you saying this patch set should wait on the feature bit to
> be added, or are you talking about doing this as some sort of
> follow-up?
>
> - Alex
I think for virtio it should include the feature bit, yes.
Adding feature bit is very easy - post a patch to the virtio TC mailing
list, wait about a week to give people time to respond (two weeks if it
is around holidays and such).
--
MST
* Re: [RFC PATCH ghak32 V2 10/13] audit: add containerid support for seccomp and anom_abend records
From: Richard Guy Briggs @ 2018-04-20 0:42 UTC (permalink / raw)
To: Paul Moore
Cc: simo, jlayton, carlos, linux-api, containers, LKML, Eric Paris,
dhowells, Linux-Audit Mailing List, ebiederm, luto, netdev,
linux-fsdevel, cgroups, serge, viro
In-Reply-To: <CAHC9VhS6MKoLkzpfcmYBSNnvrtbL2FOF5PX9uOfivSVEWykkQg@mail.gmail.com>
On 2018-04-18 21:31, Paul Moore wrote:
> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > Add container ID auxiliary records to secure computing and abnormal end
> > standalone records.
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> > kernel/auditsc.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 7103d23..2f02ed9 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -2571,6 +2571,7 @@ static void audit_log_task(struct audit_buffer *ab)
> > void audit_core_dumps(long signr)
> > {
> > struct audit_buffer *ab;
> > + struct audit_context *context = audit_alloc_local();
>
> Looking quickly at do_coredump() I *believe* we can use current here.
>
> > if (!audit_enabled)
> > return;
> > @@ -2578,19 +2579,22 @@ void audit_core_dumps(long signr)
> > if (signr == SIGQUIT) /* don't care for those */
> > return;
> >
> > - ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
> > + ab = audit_log_start(context, GFP_KERNEL, AUDIT_ANOM_ABEND);
> > if (unlikely(!ab))
> > return;
> > audit_log_task(ab);
> > audit_log_format(ab, " sig=%ld res=1", signr);
> > audit_log_end(ab);
> > + audit_log_container_info(context, "abend", audit_get_containerid(current));
> > + audit_free_context(context);
> > }
> >
> > void __audit_seccomp(unsigned long syscall, long signr, int code)
> > {
> > struct audit_buffer *ab;
> > + struct audit_context *context = audit_alloc_local();
>
> We can definitely use current here.
Ok, so both syscall aux records. That eliminates this patch from the
set, can go in independently.
> > - ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_SECCOMP);
> > + ab = audit_log_start(context, GFP_KERNEL, AUDIT_SECCOMP);
> > if (unlikely(!ab))
> > return;
> > audit_log_task(ab);
> > @@ -2598,6 +2602,8 @@ void __audit_seccomp(unsigned long syscall, long signr, int code)
> > signr, syscall_get_arch(), syscall,
> > in_compat_syscall(), KSTK_EIP(current), code);
> > audit_log_end(ab);
> > + audit_log_container_info(context, "seccomp", audit_get_containerid(current));
> > + audit_free_context(context);
> > }
> >
> > struct list_head *audit_killed_trees(void)
>
> --
> paul moore
> www.paul-moore.com
>
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
* Re: [pci PATCH v7 0/5] Add support for unmanaged SR-IOV
From: Michael S. Tsirkin @ 2018-04-20 0:46 UTC (permalink / raw)
To: Alexander Duyck
Cc: Bjorn Helgaas, Duyck, Alexander H, linux-pci, virtio-dev, kvm,
Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch, netanel,
Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <CAKgT0UdTANRo3xnr89aWdfSPmMg81-W2H-ZcJUdC=5nUkf9RAw@mail.gmail.com>
On Thu, Apr 19, 2018 at 03:54:49PM -0700, Alexander Duyck wrote:
> On Thu, Mar 15, 2018 at 11:40 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > This series is meant to add support for SR-IOV on devices when the VFs are
> > not managed by the kernel. Examples of recent patches attempting to do this
> > include:
> > virtio - https://patchwork.kernel.org/patch/10241225/
> > pci-stub - https://patchwork.kernel.org/patch/10109935/
> > vfio - https://patchwork.kernel.org/patch/10103353/
> > uio - https://patchwork.kernel.org/patch/9974031/
> >
> > Since this is quickly blowing up into a multi-driver problem it is probably
> > best to implement this solution as generically as possible.
> >
> > This series is an attempt to do that. What we do with this patch set is
> > provide a generic framework to enable SR-IOV in the case that the PF driver
> > doesn't support managing the VFs itself.
> >
> > I based my patch set originally on the patch by Mark Rustad but there isn't
> > much left after going through and cleaning out the bits that were no longer
> > needed, and after incorporating the feedback from David Miller. At this point
> > the only item to be fully reused is his patch description, which is now
> > present in patch 3 of the set.
> >
> > This solution is limited in scope to just adding support for devices that
> > provide no functionality for SR-IOV other than allocating the VFs by
> > calling pci_enable_sriov. Previous sets had included patches for VFIO, but
> > for now I am dropping that as the scope of that work is larger than I
> > think I can take on at this time.
> >
> > v2: Reduced scope back to just virtio_pci and vfio-pci
> > Broke into 3 patch set from single patch
> > Changed autoprobe behavior to always set when num_vfs is set non-zero
> > v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is used
> > Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in kernel
> > v4: Dropped vfio-pci patch
> > Added ena and nvme to drivers now using pci_sriov_configure_unmanaged
> > Dropped pci_disable_sriov call in virtio_pci to be consistent with ena
> > v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_configure_unmanaged
> > Added new patch that enables pci_sriov_configure_simple
> > Updated drivers to use pci_sriov_configure_simple
> > v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
> > Updated drivers to drop "#ifdef" checks for IOV
> > Added pci-pf-stub as place for PF-only drivers to add support
> > v7: Dropped pci_id table explanation from pci-pf-stub driver
> > Updated pci_sriov_configure_simple to drop need for err value
> > Fixed comment explaining why pci_sriov_configure_simple is NULL
> >
>
> Just following up since this has been sitting in patchwork for just
> over a month now
> (https://patchwork.ozlabs.org/project/linux-pci/list/?series=34034).
> I'm just wondering what the expectation is on getting these pulled
> into the pci tree? I'm assuming that is the best place for these
> patches. Are there any concerns I still need to address or are these
> going to be pulled in at some point, and if so is there any ETA on
> when that will be?
>
> Thanks.
>
> - Alex
Sorry I didn't notice you had more questions. I have responded
hopefully explaining my concerns. Summary:
- For virtio we should add this with a feature bit.
- I am worried about security of this for the stub, but I am
not the maintainer there.
--
MST
* Re: [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Eric Dumazet @ 2018-04-20 1:01 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller
Cc: netdev, Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <7a961ead-e77d-7334-3c29-399e071670fb@gmail.com>
On 04/19/2018 04:15 PM, Eric Dumazet wrote:
> I am not sure we can keep mmap() API, since we probably need to first lock the socket,
> then grab vm semaphore.
>
We can keep the nice mmap() interface, provided we can add one hook, as in the following patch.
David, do you think such a patch would be acceptable to lkml and the mm/fs maintainers?
An alternative would be implementing an ioctl() or getsockopt() operation,
but that seems less natural...
Thanks !
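For readers following the locking argument, here is a minimal userspace sketch of the ordering problem, not kernel code. It assumes a toy rank checker in the spirit of lockdep: the socket lock has to be taken before mmap_sem, and taking them the other way round is counted as a violation. Every demo_* name is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock ranks: a lower rank must be taken first. */
enum demo_lock { DEMO_SK_LOCK = 0, DEMO_MMAP_SEM = 1, DEMO_NR_LOCKS };

static bool demo_held[DEMO_NR_LOCKS];
static int demo_violations;

static void demo_acquire(enum demo_lock l)
{
	int i;

	assert(!demo_held[l]);
	/* Taking l while a higher-ranked lock is held inverts the order. */
	for (i = l + 1; i < DEMO_NR_LOCKS; i++)
		if (demo_held[i])
			demo_violations++;
	demo_held[l] = true;
}

static void demo_release(enum demo_lock l)
{
	assert(demo_held[l]);
	demo_held[l] = false;
}

/* Original shape: mmap_sem first, socket lock taken inside ->mmap(). */
static void demo_mmap_without_hook(void)
{
	demo_acquire(DEMO_MMAP_SEM);
	demo_acquire(DEMO_SK_LOCK);	/* inversion is counted here */
	demo_release(DEMO_SK_LOCK);
	demo_release(DEMO_MMAP_SEM);
}

/* Hooked shape: the hook takes the socket lock before mmap_sem. */
static void demo_mmap_with_hook(void)
{
	demo_acquire(DEMO_SK_LOCK);	/* stands for mmap_hook(file, true) */
	demo_acquire(DEMO_MMAP_SEM);
	/* ->mmap() would run here with both locks held */
	demo_release(DEMO_MMAP_SEM);
	demo_release(DEMO_SK_LOCK);	/* stands for mmap_hook(file, false) */
}
```

Calling demo_mmap_with_hook() leaves demo_violations untouched, while demo_mmap_without_hook() bumps it once, which is the same shape of complaint LOCKDEP raised.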
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 92efaf1f89775f7b017477617dd983c10e0dc4d2..016c711ac33e226b4285ee5bd688e14661dc0879 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1714,6 +1714,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*mmap_hook) (struct file *, bool);
unsigned long mmap_supported_flags;
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
diff --git a/mm/util.c b/mm/util.c
index 1fc4fa7576f762bbbf341f056ca6d0be803a423f..b546c59a6169c4dfa9011c61e86da4d03496aa4d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -350,11 +350,20 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
ret = security_mmap_file(file, prot, flag);
if (!ret) {
- if (down_write_killable(&mm->mmap_sem))
+ void (*mmap_hook)(struct file *, bool) = file ? file->f_op->mmap_hook : NULL;
+
+ if (mmap_hook)
+ mmap_hook(file, true);
+ if (down_write_killable(&mm->mmap_sem)) {
+ if (mmap_hook)
+ mmap_hook(file, false);
return -EINTR;
+ }
ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
&populate, &uf);
up_write(&mm->mmap_sem);
+ if (mmap_hook)
+ mmap_hook(file, false);
userfaultfd_unmap_complete(mm, &uf);
if (populate)
mm_populate(ret, populate);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4022073b0aeea9d07af0fa825b640a00512908a3..79b05d6d41643e8c309dfb8bd9597dc8b00fb0e1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1756,8 +1756,6 @@ int tcp_mmap(struct file *file, struct socket *sock,
/* TODO: Maybe the following is not needed if pages are COW */
vma->vm_flags &= ~VM_MAYWRITE;
- lock_sock(sk);
-
ret = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
goto out;
@@ -1833,7 +1831,6 @@ int tcp_mmap(struct file *file, struct socket *sock,
ret = 0;
out:
- release_sock(sk);
kvfree(pages_array);
return ret;
}
diff --git a/net/socket.c b/net/socket.c
index f10f1d947c78c193b49379b0ec641d81367fb4cf..bcabae3c37d765e5c0548a14fc93c19258972b48 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -131,6 +131,16 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
+static void sock_mmap_hook(struct file *file, bool enter)
+{
+ struct socket *sock = file->private_data;
+ struct sock *sk = sock->sk;
+
+ if (enter)
+ lock_sock(sk);
+ else
+ release_sock(sk);
+}
/*
* Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
* in the operation structures but are done directly via the socketcall() multiplexor.
@@ -147,6 +157,7 @@ static const struct file_operations socket_file_ops = {
.compat_ioctl = compat_sock_ioctl,
#endif
.mmap = sock_mmap,
+ .mmap_hook = sock_mmap_hook,
.release = sock_close,
.fasync = sock_fasync,
.sendpage = sock_sendpage,
* Re: [RFC PATCH ghak32 V2 05/13] audit: add containerid support for ptrace and signals
From: Richard Guy Briggs @ 2018-04-20 1:03 UTC (permalink / raw)
To: Paul Moore
Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <CAHC9VhTy4fX1hYfD5tppbP-fRaVRMXOfeJ=Et96J_rc7Jw12Bw@mail.gmail.com>
On 2018-04-18 20:32, Paul Moore wrote:
> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > Add container ID support to ptrace and signals. In particular, the "op"
> > field provides a way to label the auxiliary record to which it is
> > associated.
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> > include/linux/audit.h | 16 +++++++++++-----
> > kernel/audit.c | 12 ++++++++----
> > kernel/audit.h | 2 ++
> > kernel/auditsc.c | 19 +++++++++++++++----
> > 4 files changed, 36 insertions(+), 13 deletions(-)
>
> ...
>
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index a12f21f..b238be5 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -142,6 +142,7 @@ struct audit_net {
> > kuid_t audit_sig_uid = INVALID_UID;
> > pid_t audit_sig_pid = -1;
> > u32 audit_sig_sid = 0;
> > +u64 audit_sig_cid = INVALID_CID;
> >
> > /* Records can be lost in several ways:
> > 0) [suppressed in audit_alloc]
> > @@ -1438,6 +1439,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
> > memcpy(sig_data->ctx, ctx, len);
> > security_release_secctx(ctx, len);
> > }
> > + sig_data->cid = audit_sig_cid;
> > audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO, 0, 0,
> > sig_data, sizeof(*sig_data) + len);
> > kfree(sig_data);
> > @@ -2051,20 +2053,22 @@ void audit_log_session_info(struct audit_buffer *ab)
> >
> > /*
> > * audit_log_container_info - report container info
> > - * @tsk: task to be recorded
> > * @context: task or local context for record
> > + * @op: containerid string description
> > + * @containerid: container ID to report
> > */
> > -int audit_log_container_info(struct task_struct *tsk, struct audit_context *context)
> > +int audit_log_container_info(struct audit_context *context,
> > + char *op, u64 containerid)
> > {
> > struct audit_buffer *ab;
> >
> > - if (!audit_containerid_set(tsk))
> > + if (!cid_valid(containerid))
> > return 0;
> > /* Generate AUDIT_CONTAINER_INFO with container ID */
> > ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_INFO);
> > if (!ab)
> > return -ENOMEM;
> > - audit_log_format(ab, "contid=%llu", audit_get_containerid(tsk));
> > + audit_log_format(ab, "op=%s contid=%llu", op, containerid);
> > audit_log_end(ab);
> > return 0;
> > }
>
> Let's get these changes into the first patch where
> audit_log_container_info() is defined. Why? This inserts a new field
> into the record which is a no-no. Yes, it is one single patchset, but
> they are still separate patches and who knows which patches a given
> distribution and/or tree may decide to backport.
Fair enough. That first thought went through my mind... Would it be
sufficient to move that field addition to the first patch and leave the
rest here to support ptrace and signals?
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 2bba324..2932ef1 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -113,6 +113,7 @@ struct audit_aux_data_pids {
> > kuid_t target_uid[AUDIT_AUX_PIDS];
> > unsigned int target_sessionid[AUDIT_AUX_PIDS];
> > u32 target_sid[AUDIT_AUX_PIDS];
> > + u64 target_cid[AUDIT_AUX_PIDS];
> > char target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
> > int pid_count;
> > };
> > @@ -1422,21 +1423,27 @@ static void audit_log_exit(struct audit_context *context, struct task_struct *ts
> > for (aux = context->aux_pids; aux; aux = aux->next) {
> > struct audit_aux_data_pids *axs = (void *)aux;
> >
> > - for (i = 0; i < axs->pid_count; i++)
> > + for (i = 0; i < axs->pid_count; i++) {
> > + char axsn[sizeof("aux0xN ")];
> > +
> > + sprintf(axsn, "aux0x%x", i);
> > if (audit_log_pid_context(context, axs->target_pid[i],
> > axs->target_auid[i],
> > axs->target_uid[i],
> > axs->target_sessionid[i],
> > axs->target_sid[i],
> > - axs->target_comm[i]))
> > + axs->target_comm[i])
> > + && audit_log_container_info(context, axsn, axs->target_cid[i]))
>
> Shouldn't this be an OR instead of an AND?
Yes. Bash-brain...
> > call_panic = 1;
> > + }
> > }
> >
> > if (context->target_pid &&
> > audit_log_pid_context(context, context->target_pid,
> > context->target_auid, context->target_uid,
> > context->target_sessionid,
> > - context->target_sid, context->target_comm))
> > + context->target_sid, context->target_comm)
> > + && audit_log_container_info(context, "target", context->target_cid))
>
> Same question.
Yes.
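To spell out the difference for both call sites, a small standalone sketch, assuming (as the callers here do) that each logging helper returns nonzero on failure. The demo_* helpers are hypothetical stand-ins, not the audit functions themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical logging helpers: return 0 on success, nonzero on failure. */
static int demo_log_pid(bool fail) { return fail ? -1 : 0; }
static int demo_log_container(bool fail) { return fail ? -1 : 0; }

/* With &&, a panic is flagged only when both records fail. */
static int demo_panic_and(bool pid_fails, bool cid_fails)
{
	return demo_log_pid(pid_fails) && demo_log_container(cid_fails);
}

/* With ||, a failure of either record flags a panic. */
static int demo_panic_or(bool pid_fails, bool cid_fails)
{
	return demo_log_pid(pid_fails) || demo_log_container(cid_fails);
}
```

One thing to keep in mind with the || form: C short-circuits, so when the first helper fails the second one is not called at all and its record is not emitted.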
> > call_panic = 1;
> >
> > if (context->pwd.dentry && context->pwd.mnt) {
>
> --
> paul moore
> www.paul-moore.com
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
* [net-next PATCH 0/3] Symmetric queue selection using XPS for Rx queues
From: Amritha Nambiar @ 2018-04-20 1:04 UTC (permalink / raw)
To: netdev, davem
Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
hannes, tom
This patch series implements support for Tx queue selection based on
the Rx queue map. This is done by configuring an Rx queue map per Tx queue
using a sysfs attribute. If the user configuration for Rx queues does
not apply, then the Tx queue selection falls back to XPS using CPUs and
finally to hashing.
XPS is refactored to support Tx queue selection based on either the
CPU map or the Rx-queue map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.
- /sys/class/net/eth0/queues/tx-*/xps_rxqs
This enables sending packets on the same Tx-Rx queue pair, which is
useful for busy-polling multi-threaded workloads where it is not
possible to pin the threads to a CPU. This is a rework of Sridhar's
patch for symmetric queueing via socket option:
https://www.spinics.net/lists/netdev/msg453106.html
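As a rough illustration of the fallback order described above (Rx queue map, then CPU map, then hashing), here is a standalone sketch. The tables, their sizes and all demo_* names are made up for the example and do not reflect the kernel data structures.

```c
#include <assert.h>

#define DEMO_NO_QUEUE (-1)

/* Hypothetical lookup tables; DEMO_NO_QUEUE means nothing is configured. */
static const int demo_rxq_to_txq[4] = { 2, DEMO_NO_QUEUE, 0, DEMO_NO_QUEUE };
static const int demo_cpu_to_txq[4] = { 1, 1, DEMO_NO_QUEUE, DEMO_NO_QUEUE };

static int demo_lookup(const int *map, int n, int id)
{
	return (id >= 0 && id < n) ? map[id] : DEMO_NO_QUEUE;
}

/* Rx-queue map first, then CPU map, then a hash over the Tx queues. */
static int demo_pick_tx_queue(int rxq, int cpu, unsigned int hash,
			      unsigned int num_txq)
{
	int q;

	assert(num_txq > 0);
	q = demo_lookup(demo_rxq_to_txq, 4, rxq);
	if (q == DEMO_NO_QUEUE)
		q = demo_lookup(demo_cpu_to_txq, 4, cpu);
	if (q == DEMO_NO_QUEUE)
		q = (int)(hash % num_txq);
	return q;
}
```

For instance, a flow received on Rx queue 0 goes out on Tx queue 2 regardless of the CPU, while a flow on Rx queue 1 (no entry) falls through to the CPU table.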
---
Amritha Nambiar (3):
net: Refactor XPS for CPUs and Rx queues
net: Enable Tx queue selection based on Rx queues
net-sysfs: Add interface for Rx queue map per Tx queue
include/linux/netdevice.h | 82 +++++++++++++++
include/net/sock.h | 18 +++
net/core/dev.c | 240 +++++++++++++++++++++++++++++++--------------
net/core/net-sysfs.c | 85 ++++++++++++++++
net/core/sock.c | 5 +
net/ipv4/tcp_input.c | 7 +
net/ipv4/tcp_ipv4.c | 1
net/ipv4/tcp_minisocks.c | 1
8 files changed, 357 insertions(+), 82 deletions(-)
* [net-next PATCH 1/3] net: Refactor XPS for CPUs and Rx queues
From: Amritha Nambiar @ 2018-04-20 1:04 UTC (permalink / raw)
To: netdev, davem
Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
hannes, tom
In-Reply-To: <152418597668.5832.5150463027149101930.stgit@anamdev.jf.intel.com>
Refactor XPS code to support Tx queue selection based on
CPU map or Rx queue map.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
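Note for reviewers: the loops in this patch use the attrmask_next() helper with an optional mask, so the same loop walks either the set bits of a mask or every id below nr_ids when the mask is NULL. Below is a userspace sketch of that behaviour, assuming a naive bit scan in place of find_next_bit(). All demo_* names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static int demo_test_bit(unsigned int nr, const unsigned long *mask)
{
	return (mask[nr / DEMO_BITS_PER_LONG] >> (nr % DEMO_BITS_PER_LONG)) & 1UL;
}

/*
 * Next set bit after n, or simply n + 1 when mask is NULL.
 * A result >= nr_bits means the walk is finished.
 */
static unsigned int demo_attrmask_next(int n, const unsigned long *mask,
				       unsigned int nr_bits)
{
	unsigned int i;

	assert(n >= -1);	/* -1 is the legal starting value */
	if (!mask)
		return (unsigned int)(n + 1);
	for (i = (unsigned int)(n + 1); i < nr_bits; i++)
		if (demo_test_bit(i, mask))
			break;
	return i;
}

/* Counts the ids visited, using the same loop shape as the patch. */
static unsigned int demo_count_ids(const unsigned long *mask,
				   unsigned int nr_ids)
{
	unsigned int count = 0;
	int j;

	for (j = -1; j = (int)demo_attrmask_next(j, mask, nr_ids),
	     j < (int)nr_ids;)
		count++;
	return count;
}
```

With a mask of 0x15 and 8 ids the loop visits ids 0, 2 and 4; with a NULL mask it visits all 8.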
include/linux/netdevice.h | 82 +++++++++++++++++-
net/core/dev.c | 206 +++++++++++++++++++++++++++++----------------
net/core/net-sysfs.c | 4 -
3 files changed, 216 insertions(+), 76 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 14e0777..40a9171 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -730,10 +730,21 @@ struct xps_map {
*/
struct xps_dev_maps {
struct rcu_head rcu;
- struct xps_map __rcu *cpu_map[0];
+ struct xps_map __rcu *attr_map[0];
};
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+ (_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
+enum xps_map_type {
+ XPS_MAP_RXQS,
+ XPS_MAP_CPUS,
+ __XPS_MAP_MAX
+};
+
#endif /* CONFIG_XPS */
#define TC_MAX_QUEUE 16
@@ -1867,7 +1878,7 @@ struct net_device {
int watchdog_timeo;
#ifdef CONFIG_XPS
- struct xps_dev_maps __rcu *xps_maps;
+ struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
#endif
#ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3204,6 +3215,71 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
#ifdef CONFIG_XPS
int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+ u16 index, enum xps_map_type type);
+
+static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
+ unsigned int nr_bits)
+{
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(j >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+ return test_bit(j, mask);
+}
+
+static inline bool attr_test_online(unsigned long j,
+ const unsigned long *online_mask,
+ unsigned int nr_bits)
+{
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(j >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+
+ if (online_mask)
+ return test_bit(j, online_mask);
+
+ if (j >= 0 && j < nr_bits)
+ return true;
+
+ return false;
+}
+
+static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
+ unsigned int nr_bits)
+{
+ /* -1 is a legal arg here. */
+ if (n != -1) {
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(n >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+ }
+
+ if (srcp)
+ return find_next_bit(srcp, nr_bits, n + 1);
+
+ return n + 1;
+}
+
+static inline int attrmask_next_and(int n, const unsigned long *src1p,
+ const unsigned long *src2p,
+ unsigned int nr_bits)
+{
+ /* -1 is a legal arg here. */
+ if (n != -1) {
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ WARN_ON_ONCE(n >= nr_bits);
+#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+ }
+
+ if (src1p && src2p)
+ return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
+ else if (src1p)
+ return find_next_bit(src1p, nr_bits, n + 1);
+ else if (src2p)
+ return find_next_bit(src2p, nr_bits, n + 1);
+
+ return n + 1;
+}
#else
static inline int netif_set_xps_queue(struct net_device *dev,
const struct cpumask *mask,
diff --git a/net/core/dev.c b/net/core/dev.c
index a490ef6..17c4883 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
int pos;
if (dev_maps)
- map = xmap_dereference(dev_maps->cpu_map[tci]);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
if (!map)
return false;
@@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
break;
}
- RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+ RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
kfree_rcu(map, rcu);
return false;
}
@@ -2138,30 +2138,47 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
u16 count)
{
+ const unsigned long *possible_mask = NULL;
+ enum xps_map_type type = XPS_MAP_RXQS;
struct xps_dev_maps *dev_maps;
- int cpu, i;
bool active = false;
+ unsigned int nr_ids;
+ int i, j;
mutex_lock(&xps_map_mutex);
- dev_maps = xmap_dereference(dev->xps_maps);
- if (!dev_maps)
- goto out_no_maps;
+ while (type < __XPS_MAP_MAX) {
+ dev_maps = xmap_dereference(dev->xps_maps[type]);
+ if (!dev_maps)
+ goto out_no_maps;
+
+ if (type == XPS_MAP_CPUS) {
+ if (num_possible_cpus() > 1)
+ possible_mask = cpumask_bits(cpu_possible_mask);
+ nr_ids = nr_cpu_ids;
+ } else if (type == XPS_MAP_RXQS) {
+ nr_ids = dev->num_rx_queues;
+ }
- for_each_possible_cpu(cpu)
- active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
- offset, count);
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;)
+ active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
+ count);
+ if (!active) {
+ RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+ kfree_rcu(dev_maps, rcu);
+ }
- if (!active) {
- RCU_INIT_POINTER(dev->xps_maps, NULL);
- kfree_rcu(dev_maps, rcu);
+ if (type == XPS_MAP_CPUS) {
+ for (i = offset + (count - 1); count--; i--)
+ netdev_queue_numa_node_write(
+ netdev_get_tx_queue(dev, i),
+ NUMA_NO_NODE);
+ }
+out_no_maps:
+ type++;
}
- for (i = offset + (count - 1); count--; i--)
- netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
- NUMA_NO_NODE);
-
-out_no_maps:
mutex_unlock(&xps_map_mutex);
}
@@ -2170,11 +2187,11 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
}
-static struct xps_map *expand_xps_map(struct xps_map *map,
- int cpu, u16 index)
+static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
+ u16 index, enum xps_map_type type)
{
- struct xps_map *new_map;
int alloc_len = XPS_MIN_MAP_ALLOC;
+ struct xps_map *new_map = NULL;
int i, pos;
for (pos = 0; map && pos < map->len; pos++) {
@@ -2183,7 +2200,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
return map;
}
- /* Need to add queue to this CPU's existing map */
+ /* Need to add tx-queue to this CPU's/rx-queue's existing map */
if (map) {
if (pos < map->alloc_len)
return map;
@@ -2191,9 +2208,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
alloc_len = map->alloc_len * 2;
}
- /* Need to allocate new map to store queue on this CPU's map */
- new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
- cpu_to_node(cpu));
+ /* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
+ * map
+ */
+ if (type == XPS_MAP_RXQS)
+ new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
+ else if (type == XPS_MAP_CPUS)
+ new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+ cpu_to_node(attr_index));
if (!new_map)
return NULL;
@@ -2205,14 +2227,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
return new_map;
}
-int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
- u16 index)
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+ u16 index, enum xps_map_type type)
{
+ const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
- int i, cpu, tci, numa_node_id = -2;
+ int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
bool active = false;
+ unsigned int nr_ids;
if (dev->num_tc) {
num_tc = dev->num_tc;
@@ -2221,16 +2245,33 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
return -EINVAL;
}
- maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
+ switch (type) {
+ case XPS_MAP_RXQS:
+ maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
+ dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+ nr_ids = dev->num_rx_queues;
+ break;
+ case XPS_MAP_CPUS:
+ maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
+ if (num_possible_cpus() > 1) {
+ online_mask = cpumask_bits(cpu_online_mask);
+ possible_mask = cpumask_bits(cpu_possible_mask);
+ }
+ dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_CPUS]);
+ nr_ids = nr_cpu_ids;
+ break;
+ default:
+ return -EINVAL;
+ }
+
if (maps_sz < L1_CACHE_BYTES)
maps_sz = L1_CACHE_BYTES;
mutex_lock(&xps_map_mutex);
- dev_maps = xmap_dereference(dev->xps_maps);
-
/* allocate memory for queue storage */
- for_each_cpu_and(cpu, cpu_online_mask, mask) {
+ for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
+ j < nr_ids;) {
if (!new_dev_maps)
new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
if (!new_dev_maps) {
@@ -2238,73 +2279,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
return -ENOMEM;
}
- tci = cpu * num_tc + tc;
- map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
+ tci = j * num_tc + tc;
+ map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
NULL;
- map = expand_xps_map(map, cpu, index);
+ map = expand_xps_map(map, j, index, type);
if (!map)
goto error;
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
if (!new_dev_maps)
goto out_no_new_maps;
- for_each_possible_cpu(cpu) {
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
- for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+ for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
/* fill in the new device map from the old device map */
- map = xmap_dereference(dev_maps->cpu_map[tci]);
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
/* We need to explicitly update tci as prevous loop
* could break out early if dev_maps is NULL.
*/
- tci = cpu * num_tc + tc;
+ tci = j * num_tc + tc;
- if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
- /* add queue to CPU maps */
+ if (attr_test_mask(j, mask, nr_ids) &&
+ attr_test_online(j, online_mask, nr_ids)) {
+ /* add tx-queue to CPU/rx-queue maps */
int pos = 0;
- map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+ map = xmap_dereference(new_dev_maps->attr_map[tci]);
while ((pos < map->len) && (map->queues[pos] != index))
pos++;
if (pos == map->len)
map->queues[map->len++] = index;
#ifdef CONFIG_NUMA
- if (numa_node_id == -2)
- numa_node_id = cpu_to_node(cpu);
- else if (numa_node_id != cpu_to_node(cpu))
- numa_node_id = -1;
+ if (type == XPS_MAP_CPUS) {
+ if (numa_node_id == -2)
+ numa_node_id = cpu_to_node(j);
+ else if (numa_node_id != cpu_to_node(j))
+ numa_node_id = -1;
+ }
#endif
} else if (dev_maps) {
/* fill in the new device map from the old device map */
- map = xmap_dereference(dev_maps->cpu_map[tci]);
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
/* copy maps belonging to foreign traffic classes */
for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
/* fill in the new device map from the old device map */
- map = xmap_dereference(dev_maps->cpu_map[tci]);
- RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
+ RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
}
- rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+ if (type == XPS_MAP_RXQS)
+ rcu_assign_pointer(dev->xps_maps[XPS_MAP_RXQS], new_dev_maps);
+ else if (type == XPS_MAP_CPUS)
+ rcu_assign_pointer(dev->xps_maps[XPS_MAP_CPUS], new_dev_maps);
/* Cleanup old maps */
if (!dev_maps)
goto out_no_old_maps;
- for_each_possible_cpu(cpu) {
- for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
- new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
- map = xmap_dereference(dev_maps->cpu_map[tci]);
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
+ for (i = num_tc, tci = j * num_tc; i--; tci++) {
+ new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
+ map = xmap_dereference(dev_maps->attr_map[tci]);
if (map && map != new_map)
kfree_rcu(map, rcu);
}
@@ -2317,19 +2366,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
active = true;
out_no_new_maps:
- /* update Tx queue numa node */
- netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
- (numa_node_id >= 0) ? numa_node_id :
- NUMA_NO_NODE);
+ if (type == XPS_MAP_CPUS) {
+ /* update Tx queue numa node */
+ netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+ (numa_node_id >= 0) ?
+ numa_node_id : NUMA_NO_NODE);
+ }
if (!dev_maps)
goto out_no_maps;
- /* removes queue from unused CPUs */
- for_each_possible_cpu(cpu) {
- for (i = tc, tci = cpu * num_tc; i--; tci++)
+ /* removes tx-queue from unused CPUs/rx-queues */
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
+ for (i = tc, tci = j * num_tc; i--; tci++)
active |= remove_xps_queue(dev_maps, tci, index);
- if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
+ if (!attr_test_mask(j, mask, nr_ids) ||
+ !attr_test_online(j, online_mask, nr_ids))
active |= remove_xps_queue(dev_maps, tci, index);
for (i = num_tc - tc, tci++; --i; tci++)
active |= remove_xps_queue(dev_maps, tci, index);
@@ -2337,7 +2390,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
/* free map if not active */
if (!active) {
- RCU_INIT_POINTER(dev->xps_maps, NULL);
+ if (type == XPS_MAP_RXQS)
+ RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_RXQS], NULL);
+ else if (type == XPS_MAP_CPUS)
+ RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_CPUS], NULL);
kfree_rcu(dev_maps, rcu);
}
@@ -2347,11 +2403,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
return 0;
error:
/* remove any maps that we added */
- for_each_possible_cpu(cpu) {
- for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
- new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+ for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+ j < nr_ids;) {
+ for (i = num_tc, tci = j * num_tc; i--; tci++) {
+ new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
map = dev_maps ?
- xmap_dereference(dev_maps->cpu_map[tci]) :
+ xmap_dereference(dev_maps->attr_map[tci]) :
NULL;
if (new_map && new_map != map)
kfree(new_map);
@@ -2363,6 +2420,13 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
kfree(new_dev_maps);
return -ENOMEM;
}
+
+int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
+ u16 index)
+{
+ return __netif_set_xps_queue(dev, cpumask_bits(mask), index,
+ XPS_MAP_CPUS);
+}
EXPORT_SYMBOL(netif_set_xps_queue);
#endif
@@ -3400,7 +3464,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
int queue_index = -1;
rcu_read_lock();
- dev_maps = rcu_dereference(dev->xps_maps);
+ dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
if (dev_maps) {
unsigned int tci = skb->sender_cpu - 1;
@@ -3409,7 +3473,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
tci += netdev_get_prio_tc_map(dev, skb->priority);
}
- map = rcu_dereference(dev_maps->cpu_map[tci]);
+ map = rcu_dereference(dev_maps->attr_map[tci]);
if (map) {
if (map->len == 1)
queue_index = map->queues[0];
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c476f07..d7abd33 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
rcu_read_lock();
- dev_maps = rcu_dereference(dev->xps_maps);
+ dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
if (dev_maps) {
for_each_possible_cpu(cpu) {
int i, tci = cpu * num_tc + tc;
struct xps_map *map;
- map = rcu_dereference(dev_maps->cpu_map[tci]);
+ map = rcu_dereference(dev_maps->attr_map[tci]);
if (!map)
continue;
* [net-next PATCH 2/3] net: Enable Tx queue selection based on Rx queues
From: Amritha Nambiar @ 2018-04-20 1:04 UTC (permalink / raw)
To: netdev, davem
Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
hannes, tom
In-Reply-To: <152418597668.5832.5150463027149101930.stgit@anamdev.jf.intel.com>
This patch adds support to pick Tx queue based on the Rx queue map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive
queue map does not apply, then the Tx queue selection falls back
to CPU map based selection and finally to hashing.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
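Note for reviewers: before the Rx queue recorded on a socket is used as an index into the Rx queue map, it has to belong to the transmitting device and be in range. A standalone sketch of that check follows, assuming valid queue indexes run from 0 to real_num_rx_queues - 1 and that -1 means "not recorded". The demo_* structs are hypothetical, trimmed-down views and not the kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down views of the socket and device state. */
struct demo_sock { int rx_ifindex; int rx_queue; };
struct demo_dev { int ifindex; unsigned int real_num_rx_queues; };

/* True when the Rx queue recorded on the socket can index dev's map. */
static bool demo_rx_queue_usable(const struct demo_sock *sk,
				 const struct demo_dev *dev)
{
	assert(dev != NULL);
	if (!sk || sk->rx_queue < 0)
		return false;
	/* The packet must have arrived on the device we transmit on. */
	if (sk->rx_ifindex != dev->ifindex)
		return false;
	/* Assumption: valid indexes run from 0 to real_num_rx_queues - 1. */
	return (unsigned int)sk->rx_queue < dev->real_num_rx_queues;
}
```

When the check fails, selection would simply continue with the CPU map and then hashing, as the description above says.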
include/net/sock.h | 18 ++++++++++++++++++
net/core/dev.c | 36 +++++++++++++++++++++++++++++-------
net/core/sock.c | 5 +++++
net/ipv4/tcp_input.c | 7 +++++++
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_minisocks.c | 1 +
6 files changed, 61 insertions(+), 7 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 74d725f..f10b2a2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair;
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
* @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_rx_queue_mapping: rx queue number for this connection
+ * @skc_rx_ifindex: rx ifindex for this connection
* @skc_flags: place holder for sk_flags
* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +217,10 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+ int skc_rx_queue_mapping;
+ int skc_rx_ifindex;
+#endif
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
@@ -326,6 +332,10 @@ struct sock {
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping __sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex __sk_common.skc_rx_ifindex
+#endif
#define sk_dontcopy_begin __sk_common.skc_dontcopy_begin
#define sk_dontcopy_end __sk_common.skc_dontcopy_end
@@ -1691,6 +1701,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
return sk ? sk->sk_tx_queue_mapping : -1;
}
+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+ sk->sk_rx_ifindex = skb->skb_iif;
+ sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+#endif
+}
+
static inline void sk_set_socket(struct sock *sk, struct socket *sock)
{
sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index 17c4883..cf24d47 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3456,18 +3456,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
}
#endif /* CONFIG_NET_EGRESS */
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
-{
#ifdef CONFIG_XPS
- struct xps_dev_maps *dev_maps;
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+ struct xps_dev_maps *dev_maps, unsigned int tci)
+{
struct xps_map *map;
int queue_index = -1;
- rcu_read_lock();
- dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
if (dev_maps) {
- unsigned int tci = skb->sender_cpu - 1;
-
if (dev->num_tc) {
tci *= dev->num_tc;
tci += netdev_get_prio_tc_map(dev, skb->priority);
@@ -3484,6 +3480,32 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
queue_index = -1;
}
}
+ return queue_index;
+}
+#endif
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+ enum xps_map_type i = XPS_MAP_RXQS;
+ struct xps_dev_maps *dev_maps;
+ struct sock *sk = skb->sk;
+ int queue_index = -1;
+ unsigned int tci = 0;
+
+ if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
+ dev->ifindex == sk->sk_rx_ifindex)
+ tci = sk->sk_rx_queue_mapping;
+
+ rcu_read_lock();
+ while (queue_index < 0 && i < __XPS_MAP_MAX) {
+ if (i == XPS_MAP_CPUS)
+ tci = skb->sender_cpu - 1;
+ dev_maps = rcu_dereference(dev->xps_maps[i]);
+ queue_index = __get_xps_queue_idx(dev, skb, dev_maps, tci);
+ i++;
+ }
+
rcu_read_unlock();
return queue_index;
diff --git a/net/core/sock.c b/net/core/sock.c
index b2c3db1..f7a4b46 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2820,6 +2820,11 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_pacing_rate = ~0U;
sk->sk_pacing_shift = 10;
sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+ sk->sk_rx_ifindex = -1;
+ sk->sk_rx_queue_mapping = -1;
+#endif
/*
* Before updating sk_refcnt, we must commit prior changes to memory
* (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0396fb9..157f401 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
#include <linux/errqueue.h>
#include <trace/events/tcp.h>
#include <linux/static_key.h>
+#include <net/busy_poll.h>
int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
@@ -5535,6 +5536,11 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
__tcp_fast_path_on(tp, tp->snd_wnd);
else
tp->pred_flags = 0;
+
+ if (skb) {
+ sk_mark_napi_id(sk, skb);
+ sk_mark_rx_queue(sk, skb);
+ }
}
static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
@@ -6347,6 +6353,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
tcp_rsk(req)->snt_isn = isn;
tcp_rsk(req)->txhash = net_tx_rndhash();
tcp_openreq_init_rwin(req, sk, dst);
+ sk_mark_rx_queue(req_to_sk(req), skb);
if (!want_cookie) {
tcp_reqsk_record_syn(sk, req, skb);
fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..132d9af 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1467,6 +1467,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
sock_rps_save_rxhash(sk, skb);
sk_mark_napi_id(sk, skb);
+ sk_mark_rx_queue(sk, skb);
if (dst) {
if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
!dst->ops->check(dst, 0)) {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 57b5468..c18d6f2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -835,6 +835,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
/* record NAPI ID of child */
sk_mark_napi_id(child, skb);
+ sk_mark_rx_queue(child, skb);
tcp_segs_in(tcp_sk(child), skb);
if (!sock_owned_by_user(child)) {
* [net-next PATCH 3/3] net-sysfs: Add interface for Rx queue map per Tx queue
From: Amritha Nambiar @ 2018-04-20 1:04 UTC (permalink / raw)
To: netdev, davem
Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
hannes, tom
In-Reply-To: <152418597668.5832.5150463027149101930.stgit@anamdev.jf.intel.com>
Extend transmit queue sysfs attribute to configure Rx queue map
per Tx queue. By default no receive queues are configured for the
Tx queue.
- /sys/class/net/eth0/queues/tx-*/xps_rxqs
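As an illustration of the value written to this attribute: the store handler in this patch parses its input with bitmap_parse(), which takes a hexadecimal bitmap, so bit N set means "Rx queue N maps to this Tx queue". The sketch below builds such a mask in userspace. The helper names are hypothetical and not part of the patch, and only Rx queues 0..31 are handled to keep it short.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: build a bitmap with one bit per Rx queue index.
 * Only queues 0..31 are handled in this sketch; larger indices are ignored.
 */
static uint32_t sketch_rxq_mask(const unsigned int *rxqs, size_t n)
{
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (rxqs[i] < 32)
            mask |= UINT32_C(1) << rxqs[i];
    }
    return mask;
}

/* Hypothetical helper: format the mask as the hex string that would be
 * written to xps_rxqs. Returns the snprintf() result.
 */
static int sketch_format_rxq_mask(char *buf, size_t len, uint32_t mask)
{
    return snprintf(buf, len, "%x", (unsigned int)mask);
}
```

For example, Rx queues 0 and 2 give the mask 0x5, so the string "5" would be written to the tx-N/xps_rxqs file of the chosen Tx queue.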
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
net/core/net-sysfs.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 81 insertions(+)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d7abd33..0654243 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+ struct net_device *dev = queue->dev;
+ struct xps_dev_maps *dev_maps;
+ unsigned long *mask, index;
+ int j, len, num_tc = 1, tc = 0;
+
+ mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+ GFP_KERNEL);
+ if (!mask)
+ return -ENOMEM;
+
+ index = get_netdev_queue_index(queue);
+
+ if (dev->num_tc) {
+ num_tc = dev->num_tc;
+ tc = netdev_txq_to_tc(dev, index);
+ if (tc < 0)
+ return -EINVAL;
+ }
+
+ rcu_read_lock();
+ dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+ if (dev_maps) {
+ for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues),
+ j < dev->num_rx_queues;) {
+ int i, tci = j * num_tc + tc;
+ struct xps_map *map;
+
+ map = rcu_dereference(dev_maps->attr_map[tci]);
+ if (!map)
+ continue;
+
+ for (i = map->len; i--;) {
+ if (map->queues[i] == index) {
+ set_bit(j, mask);
+ break;
+ }
+ }
+ }
+ }
+
+ len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+ rcu_read_unlock();
+ kfree(mask);
+
+ return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+ size_t len)
+{
+ struct net_device *dev = queue->dev;
+ unsigned long *mask, index;
+ int err;
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+ GFP_KERNEL);
+ if (!mask)
+ return -ENOMEM;
+
+ index = get_netdev_queue_index(queue);
+
+ err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+ if (err) {
+ kfree(mask);
+ return err;
+ }
+
+ err = __netif_set_xps_queue(dev, mask, index, XPS_MAP_RXQS);
+ kfree(mask);
+ return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+ = __ATTR_RW(xps_rxqs);
#endif /* CONFIG_XPS */
static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
&queue_traffic_class.attr,
#ifdef CONFIG_XPS
&xps_cpus_attribute.attr,
+ &xps_rxqs_attribute.attr,
&queue_tx_maxrate.attr,
#endif
NULL
* [GIT] Networking
From: David Miller @ 2018-04-20 1:17 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
1) Unbalanced refcounting in TIPC, from Jon Maloy.
2) Only allow TCP_MD5SIG to be set on sockets in close or
listen state. Once the connection is established it makes
no sense to change this. From Eric Dumazet.
3) Missing attribute validation in neigh_dump_table(), also from Eric
Dumazet.
4) Fix address comparisons in SCTP, from Xin Long.
5) Neigh proxy table clearing can deadlock, from Wolfgang
Bumiller.
6) Fix tunnel refcounting in l2tp, from Guillaume Nault.
7) Fix double list insert in team driver, from Paolo Abeni.
8) af_vsock.ko module was accidentally made unremovable, from
Stefan Hajnoczi.
9) Fix reference to freed llc_sap object in llc stack, from
Cong Wang.
10) Don't assume netdevice struct is DMA'able memory in virtio_net
driver, from Michael S. Tsirkin.
Please pull, thanks a lot!
The following changes since commit 5d1365940a68dd57b031b6e3c07d7d451cd69daf:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-04-12 11:09:05 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
for you to fetch changes up to 1255fcb2a655f05e02f3a74675a6d6525f187afd:
net/smc: fix shutdown in state SMC_LISTEN (2018-04-19 16:38:39 -0400)
----------------------------------------------------------------
Anders Roxell (1):
selftests: net: add in_netns.sh to TEST_PROGS
Bert Kenward (1):
sfc: check RSS is active for filter insert
Bjørn Mork (1):
tun: fix vlan packet truncation
Colin Ian King (2):
net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit"
Cong Wang (1):
llc: hold llc_sap before release_sock()
Dan Carpenter (1):
Revert "macsec: missing dev_put() on error in macsec_newlink()"
David S. Miller (6):
Merge branch 'ibmvnic-Fix-parameter-change-request-handling'
Merge branch 'nfp-improve-signal-handing-on-FW-waits-and-flower-control-message-Jakub Kicinski says:
Merge branch 'l2tp-remove-unsafe-calls-to-l2tp_tunnel_find_nth'
Merge branch 'sfc-ARFS-fixes'
Merge branch 'tipc-Better-check-user-provided-attributes'
Merge branch 'virtio-ctrl-buffer-fixes'
Doron Roberts-Kedes (1):
strparser: Fix incorrect strp->need_bytes value.
Edward Cree (3):
sfc: insert ARFS filters with replace_equal=true
sfc: pass the correctly bogus filter_id to rps_may_expire_flow()
sfc: limit ARFS workitems in flight per channel
Eric Biggers (1):
KEYS: DNS: limit the length of option strings
Eric Dumazet (5):
tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets
net: validate attribute sizes in neigh_dump_table()
net: af_packet: fix race in PACKET_{R|T}X_RING
tipc: add policy for TIPC_NLA_NET_ADDR
tipc: fix possible crash in __tipc_nl_net_set()
Gao Feng (1):
net: Fix one possible memleak in ip_setup_cork
Guillaume Nault (3):
l2tp: hold reference on tunnels in netlink dumps
l2tp: hold reference on tunnels printed in pppol2tp proc file
l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file
Jakub Kicinski (2):
nfp: ignore signals when communicating with management FW
nfp: print a message when mutex wait is interrupted
Jason Wang (1):
virtio-net: add missing virtqueue kick when flushing packets
Jon Maloy (3):
tipc: fix unbalanced reference counter
tipc: fix missing initializer in tipc_sendmsg()
tipc: fix use-after-free in tipc_nametbl_stop
Jonathan Corbet (1):
MAINTAINERS: Direct networking documentation changes to netdev
Jose Abreu (1):
net: stmmac: Disable ACS Feature for GMAC >= 4
Kees Cook (2):
ibmvnic: Define vnic_login_client_data name field as unsized array
net/tls: Remove VLA usage
Laura Abbott (1):
mISDN: Remove VLAs
Maxime Chevallier (2):
net: mvpp2: Fix TCAM filter reserved range
net: mvpp2: Fix DMA address mask size
Michael S. Tsirkin (3):
virtio_net: split out ctrl buffer
virtio_net: fix adding vids on big-endian
virtio_net: sparse annotation fix
Nathan Fontenot (2):
ibmvnic: Handle all login error conditions
ibmvnic: Do not notify peers on parameter change resets
Nicolas Dechesne (1):
net: qrtr: add MODULE_ALIAS_NETPROTO macro
Olivier Gayot (1):
docs: ip-sysctl.txt: fix name of some ipv6 variables
Paolo Abeni (1):
team: avoid adding twice the same option to the event list
Pawel Dembicki (1):
net: qmi_wwan: add Wistron Neweb D19Q1
Pieter Jansen van Vuuren (2):
nfp: flower: move route ack control messages out of the workqueue
nfp: flower: split and limit cmsg skb lists
Raghuram Chary J (1):
lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables
Randy Dunlap (1):
textsearch: fix kernel-doc warnings and add kernel-api section
Richard Cochran (1):
net: dsa: mv88e6xxx: Fix receive time stamp race condition.
Ronak Doshi (1):
vmxnet3: fix incorrect dereference when rxvlan is disabled
Soheil Hassas Yeganeh (1):
tcp: clear tp->packets_out when purging write queue
Stefan Hajnoczi (1):
VSOCK: make af_vsock.ko removable again
Subash Abhinov Kasiviswanathan (1):
net: qualcomm: rmnet: Fix warning seen with fill_info
Thomas Falcon (1):
ibmvnic: Clear pending interrupt after device reset
Toshiaki Makita (1):
vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi
Tung Nguyen (1):
tipc: fix infinite loop when dumping link monitor summary
Ursula Braun (1):
net/smc: fix shutdown in state SMC_LISTEN
Vasundhara Volam (1):
bnxt_en: Fix memory fault in bnxt_ethtool_init()
Wang Sheng-Hui (1):
filter.txt: update 'tools/net/' to 'tools/bpf/'
Wolfgang Bumiller (1):
net: fix deadlock while clearing neighbor proxy table
Xin Long (1):
sctp: do not check port in sctp_inet6_cmp_addr
dann frazier (1):
net: hns: Avoid action name truncation
sunlianwen (1):
net: change the comment of dev_mc_init
Documentation/core-api/kernel-api.rst | 13 ++++++
Documentation/networking/filter.txt | 6 +--
Documentation/networking/ip-sysctl.txt | 8 ++--
MAINTAINERS | 1 +
drivers/atm/iphase.c | 4 +-
drivers/isdn/mISDN/dsp_hwec.c | 8 ++--
drivers/isdn/mISDN/l1oip_core.c | 14 ++++--
drivers/net/dsa/mv88e6xxx/hwtstamp.c | 12 ++++-
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 +++++++++++---------
drivers/net/ethernet/broadcom/bnxt/bnxt_nvm_defs.h | 2 -
drivers/net/ethernet/hisilicon/hns/hnae.h | 2 +-
drivers/net/ethernet/ibm/ibmvnic.c | 85 +++++++++++++++++++++-------------
drivers/net/ethernet/ibm/ibmvnic.h | 1 -
drivers/net/ethernet/marvell/mvpp2.c | 14 +++---
drivers/net/ethernet/netronome/nfp/flower/cmsg.c | 44 +++++++++++++++---
drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 2 +
drivers/net/ethernet/netronome/nfp/flower/main.c | 6 ++-
drivers/net/ethernet/netronome/nfp/flower/main.h | 8 +++-
drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 5 +-
drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 3 +-
drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 11 +++--
drivers/net/ethernet/sfc/ef10.c | 7 ++-
drivers/net/ethernet/sfc/farch.c | 2 +-
drivers/net/ethernet/sfc/net_driver.h | 25 ++++++++++
drivers/net/ethernet/sfc/rx.c | 60 ++++++++++++------------
drivers/net/ethernet/stmicro/stmmac/dwmac4.h | 2 +-
drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 7 ---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 7 ++-
drivers/net/macsec.c | 5 +-
drivers/net/phy/microchip.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
drivers/net/team/team.c | 19 ++++++++
drivers/net/tun.c | 7 +--
drivers/net/usb/qmi_wwan.c | 1 +
drivers/net/virtio_net.c | 79 +++++++++++++++++++-------------
drivers/net/vmxnet3/vmxnet3_drv.c | 17 +++++--
drivers/net/vmxnet3/vmxnet3_int.h | 4 +-
include/linux/if_vlan.h | 7 ++-
include/linux/microchipphy.h | 8 ++++
include/linux/textsearch.h | 4 +-
lib/textsearch.c | 40 +++++++++-------
net/caif/chnl_net.c | 2 +-
net/core/dev.c | 2 +-
net/core/dev_addr_lists.c | 2 +-
net/core/neighbour.c | 40 ++++++++++------
net/dns_resolver/dns_key.c | 12 ++---
net/ipv4/ip_output.c | 8 ++--
net/ipv4/tcp.c | 8 ++--
net/l2tp/l2tp_core.c | 40 ++++++++--------
net/l2tp/l2tp_core.h | 3 +-
net/l2tp/l2tp_debugfs.c | 15 +++++-
net/l2tp/l2tp_netlink.c | 11 +++--
net/l2tp/l2tp_ppp.c | 24 +++++++---
net/llc/af_llc.c | 7 +++
net/packet/af_packet.c | 23 ++++++----
net/qrtr/qrtr.c | 1 +
net/sctp/ipv6.c | 60 ++++++++++++------------
net/smc/af_smc.c | 10 ++--
net/strparser/strparser.c | 7 ++-
net/tipc/monitor.c | 2 +-
net/tipc/name_table.c | 34 ++++++++------
net/tipc/name_table.h | 2 +-
net/tipc/net.c | 2 +
net/tipc/netlink.c | 5 +-
net/tipc/node.c | 11 ++---
net/tipc/socket.c | 4 +-
net/tipc/subscr.c | 5 +-
net/tls/tls_sw.c | 10 +++-
net/vmw_vsock/af_vsock.c | 6 +++
tools/testing/selftests/net/Makefile | 2 +-
69 files changed, 786 insertions(+), 349 deletions(-)
* Re: [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: David Miller @ 2018-04-20 1:17 UTC (permalink / raw)
To: eric.dumazet; +Cc: edumazet, netdev, ncardwell, ycheng, soheil
In-Reply-To: <1e7a78a6-27cd-9679-54d7-44d05484eda7@gmail.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 Apr 2018 18:01:32 -0700
> David, do you think such patch would be acceptable by lkml and mm/fs
> maintainers ?
You will have to ask them directly I think :)
* Re: [RFC PATCH ghak32 V2 06/13] audit: add support for non-syscall auxiliary records
From: Richard Guy Briggs @ 2018-04-20 1:23 UTC (permalink / raw)
To: Paul Moore
Cc: simo-H+wXaHxf7aLQT0dZR+AlfA, jlayton-H+wXaHxf7aLQT0dZR+AlfA,
carlos-H+wXaHxf7aLQT0dZR+AlfA, linux-api-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, LKML,
Eric Paris, dhowells-H+wXaHxf7aLQT0dZR+AlfA,
Linux-Audit Mailing List, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
luto-DgEjT+Ai2ygdnm+yROfE0A, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
cgroups-u79uwXL29TY76Z2rM5mHXA,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn
In-Reply-To: <CAHC9VhQbPbnrbxCD1fyTSxWgrXXXYnZw_=nbOhfMCO5Q5eSsWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On 2018-04-18 20:39, Paul Moore wrote:
> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Standalone audit records have the timestamp and serial number generated
> > on the fly and as such are unique, making them standalone. This new
> > function audit_alloc_local() generates a local audit context that will
> > be used only for a standalone record and its auxiliary record(s). The
> > context is discarded immediately after the local associated records are
> > produced.
> >
> > Signed-off-by: Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> > include/linux/audit.h | 8 ++++++++
> > kernel/auditsc.c | 20 +++++++++++++++++++-
> > 2 files changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index ed16bb6..c0b83cb 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -227,7 +227,9 @@ static inline int audit_log_container_info(struct audit_context *context,
> > /* These are defined in auditsc.c */
> > /* Public API */
> > extern int audit_alloc(struct task_struct *task);
> > +extern struct audit_context *audit_alloc_local(void);
> > extern void __audit_free(struct task_struct *task);
> > +extern void audit_free_context(struct audit_context *context);
> > extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
> > unsigned long a2, unsigned long a3);
> > extern void __audit_syscall_exit(int ret_success, long ret_value);
> > @@ -472,6 +474,12 @@ static inline int audit_alloc(struct task_struct *task)
> > {
> > return 0;
> > }
> > +static inline struct audit_context *audit_alloc_local(void)
> > +{
> > + return NULL;
> > +}
> > +static inline void audit_free_context(struct audit_context *context)
> > +{ }
> > static inline void audit_free(struct task_struct *task)
> > { }
> > static inline void audit_syscall_entry(int major, unsigned long a0,
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 2932ef1..7103d23 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -959,8 +959,26 @@ int audit_alloc(struct task_struct *tsk)
> > return 0;
> > }
> >
> > -static inline void audit_free_context(struct audit_context *context)
> > +struct audit_context *audit_alloc_local(void)
> > {
> > + struct audit_context *context;
> > +
> > + if (!audit_ever_enabled)
> > + return NULL; /* Return if not auditing. */
> > +
> > + context = audit_alloc_context(AUDIT_RECORD_CONTEXT);
> > + if (!context)
> > + return NULL;
> > + context->serial = audit_serial();
> > + context->ctime = current_kernel_time64();
> > + context->in_syscall = 1;
> > + return context;
> > +}
> > +
> > +inline void audit_free_context(struct audit_context *context)
> > +{
> > + if (!context)
> > + return;
> > audit_free_names(context);
> > unroll_tree_refs(context, NULL, 0);
> > free_tree_refs(context);
>
> I'm reserving the option to comment on this idea further as I make my
> way through the patchset, but audit_free_context() definitely
> shouldn't be declared as an inline function.
Ok, I think I follow. When it wasn't exported, inline was fine, but now
that it has been exported, it should no longer be inlined, or should use
an intermediate function name to export so that local uses of it can
remain inline.
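A minimal sketch of the second option, assuming a simplified stand-in context type (the struct and function names here are hypothetical, not the audit code): the body stays in a static inline helper for callers in the same file, and a plain non-inline wrapper is what other files link against.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a context object that owns one allocation. */
struct sketch_ctx {
    int *names;
};

/* File-local helper: callers in this file can still have it inlined. */
static inline void sketch_ctx_free_inline(struct sketch_ctx *ctx)
{
    if (!ctx)
        return;
    free(ctx->names);
    free(ctx);
}

/* Exported entry point: an ordinary external function, not inline. */
void sketch_ctx_free(struct sketch_ctx *ctx)
{
    sketch_ctx_free_inline(ctx);
}
```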
> paul moore
- RGB
--
Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
* [PATCH net-next v7 0/4] Enable virtio_net to act as a standby for a passthru device
From: Sridhar Samudrala @ 2018-04-20 1:42 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.
Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
used by hypervisor to indicate that virtio_net interface should act as
a standby for another device with the same MAC address.
Patch 2 introduces a failover module that provides a generic interface for
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and take over their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual
drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.
Patch 3 extends virtio_net to use alternate datapath when available and
registered. When STANDBY feature is enabled, virtio_net driver creates
an additional 'failover' netdev that acts as a master device and controls
2 slave devices. The original virtio_net netdev is registered as
'standby' netdev and a passthru/vf device with the same MAC gets
registered as 'primary' netdev. Both 'standby' and 'primary' netdevs are
associated with the same 'pci' device. The user accesses the network
interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
netdev as default for transmits when it is available with link up and
running.
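The transmit choice described above can be sketched roughly as follows. This is a simplified userspace model, not the failover module's code: the struct and function names are hypothetical, and the real implementation works on struct net_device under RCU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a slave netdev's state. */
struct sketch_netdev {
    bool link_up;   /* carrier is up */
    bool running;   /* interface is administratively running */
};

/* Prefer the 'primary' (VF) slave when it is registered, link up and
 * running; otherwise fall back to the 'standby' (virtio_net) slave.
 * Either pointer may be NULL when that slave is not registered.
 */
static const struct sketch_netdev *
sketch_pick_xmit_dev(const struct sketch_netdev *primary,
                     const struct sketch_netdev *standby)
{
    if (primary && primary->link_up && primary->running)
        return primary;
    return standby;
}
```

With this shape, unplugging the VF (primary becomes NULL) or losing its link moves transmits to the standby without the user of the 'failover' netdev having to change anything.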
Patch 4 refactors netvsc to use the registration/notification framework
supported by failover module.
As this patch series is initially focusing on usecases where hypervisor
fully controls the VM networking and the guest is not expected to directly
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'primary' netdev gets registered.
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.
This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
v7
- Rename 'bypass/active/backup' terminology to 'failover/primary/standby'
(jiri, mst)
- re-arranged dev_open() and dev_set_mtu() calls in the register routines
so that they don't get called for 2-netdev model. (stephen)
- fixed select_queue() routine to do queue selection based on VF if it is
registered as primary. (stephen)
- minor bugfixes
v6 RFC:
Simplified virtio_net changes by moving all the ndo_ops of the
bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
avoided 2 phase registration(driver + instances).
introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags
replaced mutex with a spinlock
v5 RFC:
Based on Jiri's comments, moved the common functionality to a 'bypass'
module so that the same notifier and event handlers to handle child
register/unregister/link change events can be shared between virtio_net
and netvsc.
Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
Jakub's suggestion for the naming issue with 3 netdev solution,
proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
east-west broadcasts to go over the PCI link.
- added support for the feature bit in qemu
Sridhar Samudrala (4):
virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
net: Introduce generic failover module
virtio_net: Extend virtio to use VF datapath when available
netvsc: refactor notifier/event handling code to use the failover
framework
drivers/net/Kconfig | 1 +
drivers/net/hyperv/Kconfig | 1 +
drivers/net/hyperv/hyperv_net.h | 2 +
drivers/net/hyperv/netvsc_drv.c | 208 +++-------
drivers/net/virtio_net.c | 38 +-
include/linux/netdevice.h | 16 +
include/net/failover.h | 96 +++++
include/uapi/linux/virtio_net.h | 3 +
net/Kconfig | 18 +
net/core/Makefile | 1 +
net/core/failover.c | 844 ++++++++++++++++++++++++++++++++++++++++
11 files changed, 1070 insertions(+), 158 deletions(-)
create mode 100644 include/net/failover.h
create mode 100644 net/core/failover.c
--
2.14.3
* [PATCH v7 net-next 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Sridhar Samudrala @ 2018-04-20 1:42 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>
This feature bit can be used by the hypervisor to indicate that the
virtio_net device should act as a standby for another device with the
same MAC address.
VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
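For illustration only, a feature bit such as bit 62 is tested against a 64-bit feature word as sketched below. The helper name is hypothetical; in the driver itself the check goes through the virtio core's feature helpers rather than a raw mask.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_STANDBY 62 /* value taken from this patch */

/* Hypothetical helper: true if feature bit 'bit' is set in 'features'. */
static bool sketch_has_feature(uint64_t features, unsigned int bit)
{
    return bit < 64 && (features & (UINT64_C(1) << bit)) != 0;
}
```

Because the bit index is above 31, the shift has to be done on a 64-bit constant; a plain `1 << 62` on a 32-bit int would be undefined.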
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/virtio_net.c | 2 +-
include/uapi/linux/virtio_net.h | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..6f95719ede40 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
- VIRTIO_NET_F_SPEED_DUPLEX
+ VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..a3715a3224c1 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
* Steering */
#define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
+#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
+ * with the same MAC.
+ */
#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
#ifndef VIRTIO_NET_NO_LEGACY
--
2.14.3
* [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Sridhar Samudrala @ 2018-04-20 1:42 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>
This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and take over their datapath. The notifier and
event handling code is based on the existing netvsc implementation.
It exposes 2 sets of interfaces to the paravirtual drivers.
1. existing netvsc driver that uses 2 netdev model. In this model, no
master netdev is created. The paravirtual driver registers each instance
of netvsc as a 'failover' instance along with a set of ops to manage the
slave events.
failover_register()
failover_unregister()
2. new virtio_net based solution that uses 3 netdev model. In this model,
the failover module provides interfaces to create/destroy additional master
netdev and all the slave events are managed internally.
failover_create()
failover_destroy()
These functions call failover_register()/failover_unregister() with the
master netdev created by the failover module.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
include/linux/netdevice.h | 16 +
include/net/failover.h | 96 ++++++
net/Kconfig | 18 +
net/core/Makefile | 1 +
net/core/failover.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 975 insertions(+)
create mode 100644 include/net/failover.h
create mode 100644 net/core/failover.c
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..ed535b6724e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,6 +1401,8 @@ struct net_device_ops {
* entity (i.e. the master device for bridged veth)
* @IFF_MACSEC: device is a MACsec device
* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
+ * @IFF_FAILOVER: device is a failover master device
+ * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
*/
enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1430,6 +1432,8 @@ enum netdev_priv_flags {
IFF_PHONY_HEADROOM = 1<<24,
IFF_MACSEC = 1<<25,
IFF_NO_RX_HANDLER = 1<<26,
+ IFF_FAILOVER = 1<<27,
+ IFF_FAILOVER_SLAVE = 1<<28,
};
#define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
@@ -1458,6 +1462,8 @@ enum netdev_priv_flags {
#define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
#define IFF_MACSEC IFF_MACSEC
#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
+#define IFF_FAILOVER IFF_FAILOVER
+#define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
/**
* struct net_device - The DEVICE structure.
@@ -4308,6 +4314,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
return dev->priv_flags & IFF_RXFH_CONFIGURED;
}
+static inline bool netif_is_failover(const struct net_device *dev)
+{
+ return dev->priv_flags & IFF_FAILOVER;
+}
+
+static inline bool netif_is_failover_slave(const struct net_device *dev)
+{
+ return dev->priv_flags & IFF_FAILOVER_SLAVE;
+}
+
/* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
static inline void netif_keep_dst(struct net_device *dev)
{
diff --git a/include/net/failover.h b/include/net/failover.h
new file mode 100644
index 000000000000..0b8601043d90
--- /dev/null
+++ b/include/net/failover.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_FAILOVER_H
+#define _NET_FAILOVER_H
+
+#include <linux/netdevice.h>
+
+struct failover_ops {
+ int (*slave_pre_register)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ int (*slave_join)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ int (*slave_pre_unregister)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ int (*slave_release)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ int (*slave_link_change)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct failover {
+ struct list_head list;
+ struct net_device __rcu *failover_dev;
+ struct failover_ops __rcu *ops;
+};
+
+/* failover state */
+struct failover_info {
+ /* primary netdev with same MAC */
+ struct net_device __rcu *primary_dev;
+
+ /* standby netdev */
+ struct net_device __rcu *standby_dev;
+
+ /* primary netdev stats */
+ struct rtnl_link_stats64 primary_stats;
+
+ /* standby netdev stats */
+ struct rtnl_link_stats64 standby_stats;
+
+ /* aggregated stats */
+ struct rtnl_link_stats64 failover_stats;
+
+ /* spinlock while updating stats */
+ spinlock_t stats_lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_FAILOVER)
+
+int failover_create(struct net_device *standby_dev,
+ struct failover **pfailover);
+void failover_destroy(struct failover *failover);
+
+int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
+ struct failover **pfailover);
+void failover_unregister(struct failover *failover);
+
+int failover_slave_unregister(struct net_device *slave_dev);
+
+#else
+
+static inline
+int failover_create(struct net_device *standby_dev,
+ struct failover **pfailover)
+{
+ return 0;
+}
+
+static inline
+void failover_destroy(struct failover *failover)
+{
+}
+
+static inline
+int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
+ struct failover **pfailover)
+{
+ return 0;
+}
+
+static inline
+void failover_unregister(struct failover *failover)
+{
+}
+
+static inline
+int failover_slave_unregister(struct net_device *slave_dev)
+{
+ return 0;
+}
+
+#endif
+
+#endif /* _NET_FAILOVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..388b99dfee10 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
on MAY_USE_DEVLINK to ensure they do not cause link errors when
devlink is a loadable module and the driver using it is built-in.
+config NET_FAILOVER
+ tristate "Failover interface"
+ help
+ This provides a generic interface for paravirtual drivers to listen
+ for netdev register/unregister/link change events from PCI Ethernet
+ devices with the same MAC and take over their datapath. This also
+ enables live migration of a VM with a directly attached VF by failing
+ over to the paravirtual datapath when the VF is unplugged.
+
+config MAY_USE_FAILOVER
+ tristate
+ default m if NET_FAILOVER=m
+ default y if NET_FAILOVER=y || NET_FAILOVER=n
+ help
+ Drivers using the failover infrastructure should have a dependency
+ on MAY_USE_FAILOVER to ensure they do not cause link errors when
+ failover is a loadable module and the driver using it is built-in.
+
endif # if NET
# Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..cef17518bb7d 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
obj-$(CONFIG_HWBM) += hwbm.o
obj-$(CONFIG_NET_DEVLINK) += devlink.o
obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_FAILOVER) += failover.o
diff --git a/net/core/failover.c b/net/core/failover.c
new file mode 100644
index 000000000000..7bee762cb737
--- /dev/null
+++ b/net/core/failover.c
@@ -0,0 +1,844 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/failover.h>
+
+static LIST_HEAD(failover_list);
+static DEFINE_SPINLOCK(failover_lock);
+
+static int failover_slave_pre_register(struct net_device *slave_dev,
+ struct net_device *failover_dev,
+ struct failover_ops *failover_ops)
+{
+ struct failover_info *finfo;
+ bool standby;
+
+ if (failover_ops) {
+ if (!failover_ops->slave_pre_register)
+ return -EINVAL;
+
+ return failover_ops->slave_pre_register(slave_dev,
+ failover_dev);
+ }
+
+ finfo = netdev_priv(failover_dev);
+ standby = (slave_dev->dev.parent == failover_dev->dev.parent);
+ if (standby ? rtnl_dereference(finfo->standby_dev) :
+ rtnl_dereference(finfo->primary_dev)) {
+ netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+ slave_dev->name, standby ? "standby" : "primary");
+ return -EEXIST;
+ }
+
+ /* Avoid non pci devices as primary netdev */
+ if (!standby && (!slave_dev->dev.parent ||
+ !dev_is_pci(slave_dev->dev.parent)))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int failover_slave_join(struct net_device *slave_dev,
+ struct net_device *failover_dev,
+ struct failover_ops *failover_ops)
+{
+ struct failover_info *finfo;
+ int err, orig_mtu;
+ bool standby;
+
+ if (failover_ops) {
+ if (!failover_ops->slave_join)
+ return -EINVAL;
+
+ return failover_ops->slave_join(slave_dev, failover_dev);
+ }
+
+ if (netif_running(failover_dev)) {
+ err = dev_open(slave_dev);
+ if (err && (err != -EBUSY)) {
+ netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
+ slave_dev->name, err);
+ goto err_dev_open;
+ }
+ }
+
+ /* Align MTU of slave with failover dev */
+ orig_mtu = slave_dev->mtu;
+ err = dev_set_mtu(slave_dev, failover_dev->mtu);
+ if (err) {
+ netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+ slave_dev->name, failover_dev->mtu);
+ goto err_set_mtu;
+ }
+
+ finfo = netdev_priv(failover_dev);
+ standby = (slave_dev->dev.parent == failover_dev->dev.parent);
+
+ dev_hold(slave_dev);
+
+ if (standby) {
+ rcu_assign_pointer(finfo->standby_dev, slave_dev);
+ dev_get_stats(finfo->standby_dev, &finfo->standby_stats);
+ } else {
+ rcu_assign_pointer(finfo->primary_dev, slave_dev);
+ dev_get_stats(finfo->primary_dev, &finfo->primary_stats);
+ failover_dev->min_mtu = slave_dev->min_mtu;
+ failover_dev->max_mtu = slave_dev->max_mtu;
+ }
+
+ netdev_info(failover_dev, "failover slave:%s joined\n",
+ slave_dev->name);
+
+ return 0;
+
+err_set_mtu:
+ dev_close(slave_dev);
+err_dev_open:
+ return err;
+}
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t failover_handle_frame(struct sk_buff **pskb)
+{
+ struct sk_buff *skb = *pskb;
+ struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+ skb->dev = ndev;
+
+ return RX_HANDLER_ANOTHER;
+}
+
+static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
+{
+ struct net_device *failover_dev;
+ struct failover *failover;
+
+ spin_lock(&failover_lock);
+ list_for_each_entry(failover, &failover_list, list) {
+ failover_dev = rtnl_dereference(failover->failover_dev);
+ if (ether_addr_equal(failover_dev->perm_addr, mac)) {
+ *ops = rtnl_dereference(failover->ops);
+ spin_unlock(&failover_lock);
+ return failover_dev;
+ }
+ }
+ spin_unlock(&failover_lock);
+ return NULL;
+}
+
+static int failover_slave_register(struct net_device *slave_dev)
+{
+ struct failover_ops *failover_ops;
+ struct net_device *failover_dev;
+ int ret;
+
+ ASSERT_RTNL();
+
+ failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
+ if (!failover_dev)
+ goto done;
+
+ ret = failover_slave_pre_register(slave_dev, failover_dev,
+ failover_ops);
+ if (ret)
+ goto done;
+
+ ret = netdev_rx_handler_register(slave_dev, failover_ops ?
+ failover_ops->handle_frame :
+ failover_handle_frame, failover_dev);
+ if (ret) {
+ netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
+ ret);
+ goto done;
+ }
+
+ ret = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
+ if (ret) {
+ netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
+ failover_dev->name, ret);
+ goto upper_link_failed;
+ }
+
+ slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
+
+ ret = failover_slave_join(slave_dev, failover_dev, failover_ops);
+ if (ret)
+ goto err_join;
+
+ call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
+
+ netdev_info(failover_dev, "failover slave:%s registered\n",
+ slave_dev->name);
+
+ goto done;
+
+err_join:
+ netdev_upper_dev_unlink(slave_dev, failover_dev);
+ slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+upper_link_failed:
+ netdev_rx_handler_unregister(slave_dev);
+done:
+ return NOTIFY_DONE;
+}
+
+static int failover_slave_pre_unregister(struct net_device *slave_dev,
+ struct net_device *failover_dev,
+ struct failover_ops *failover_ops)
+{
+ struct net_device *standby_dev, *primary_dev;
+ struct failover_info *finfo;
+
+ if (failover_ops) {
+ if (!failover_ops->slave_pre_unregister)
+ return -EINVAL;
+
+ return failover_ops->slave_pre_unregister(slave_dev,
+ failover_dev);
+ }
+
+ finfo = netdev_priv(failover_dev);
+ primary_dev = rtnl_dereference(finfo->primary_dev);
+ standby_dev = rtnl_dereference(finfo->standby_dev);
+
+ if (slave_dev != primary_dev && slave_dev != standby_dev)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int failover_slave_release(struct net_device *slave_dev,
+ struct net_device *failover_dev,
+ struct failover_ops *failover_ops)
+{
+ struct net_device *standby_dev, *primary_dev;
+ struct failover_info *finfo;
+
+ if (failover_ops) {
+ if (!failover_ops->slave_release)
+ return -EINVAL;
+
+ return failover_ops->slave_release(slave_dev, failover_dev);
+ }
+
+ finfo = netdev_priv(failover_dev);
+ primary_dev = rtnl_dereference(finfo->primary_dev);
+ standby_dev = rtnl_dereference(finfo->standby_dev);
+
+ if (slave_dev == standby_dev) {
+ RCU_INIT_POINTER(finfo->standby_dev, NULL);
+ } else {
+ RCU_INIT_POINTER(finfo->primary_dev, NULL);
+ if (standby_dev) {
+ failover_dev->min_mtu = standby_dev->min_mtu;
+ failover_dev->max_mtu = standby_dev->max_mtu;
+ }
+ }
+
+ dev_put(slave_dev);
+
+ netdev_info(failover_dev, "failover slave:%s released\n",
+ slave_dev->name);
+
+ return 0;
+}
+
+int failover_slave_unregister(struct net_device *slave_dev)
+{
+ struct failover_ops *failover_ops;
+ struct net_device *failover_dev;
+ int ret;
+
+ if (!netif_is_failover_slave(slave_dev))
+ goto done;
+
+ ASSERT_RTNL();
+
+ failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
+ if (!failover_dev)
+ goto done;
+
+ ret = failover_slave_pre_unregister(slave_dev, failover_dev,
+ failover_ops);
+ if (ret)
+ goto done;
+
+ netdev_rx_handler_unregister(slave_dev);
+ netdev_upper_dev_unlink(slave_dev, failover_dev);
+ slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+
+ failover_slave_release(slave_dev, failover_dev, failover_ops);
+
+ netdev_info(failover_dev, "failover slave:%s unregistered\n",
+ slave_dev->name);
+
+done:
+ return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(failover_slave_unregister);
+
+static bool failover_xmit_ready(struct net_device *dev)
+{
+ return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int failover_slave_link_change(struct net_device *slave_dev)
+{
+ struct net_device *failover_dev, *primary_dev, *standby_dev;
+ struct failover_ops *failover_ops;
+ struct failover_info *finfo;
+
+ if (!netif_is_failover_slave(slave_dev))
+ goto done;
+
+ ASSERT_RTNL();
+
+ failover_dev = failover_get_bymac(slave_dev->perm_addr, &failover_ops);
+ if (!failover_dev)
+ goto done;
+
+ if (failover_ops) {
+ if (!failover_ops->slave_link_change)
+ goto done;
+
+ return failover_ops->slave_link_change(slave_dev, failover_dev);
+ }
+
+ if (!netif_running(failover_dev))
+ return 0;
+
+ finfo = netdev_priv(failover_dev);
+
+ primary_dev = rtnl_dereference(finfo->primary_dev);
+ standby_dev = rtnl_dereference(finfo->standby_dev);
+
+ if (slave_dev != primary_dev && slave_dev != standby_dev)
+ goto done;
+
+ if ((primary_dev && failover_xmit_ready(primary_dev)) ||
+ (standby_dev && failover_xmit_ready(standby_dev))) {
+ netif_carrier_on(failover_dev);
+ netif_tx_wake_all_queues(failover_dev);
+ } else {
+ netif_carrier_off(failover_dev);
+ netif_tx_stop_all_queues(failover_dev);
+ }
+
+done:
+ return NOTIFY_DONE;
+}
+
+static bool failover_validate_event_dev(struct net_device *dev)
+{
+ /* Skip parent events */
+ if (netif_is_failover(dev))
+ return false;
+
+ /* Avoid non-Ethernet type devices */
+ if (dev->type != ARPHRD_ETHER)
+ return false;
+
+ return true;
+}
+
+static int
+failover_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+ struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+ if (!failover_validate_event_dev(event_dev))
+ return NOTIFY_DONE;
+
+ switch (event) {
+ case NETDEV_REGISTER:
+ return failover_slave_register(event_dev);
+ case NETDEV_UNREGISTER:
+ return failover_slave_unregister(event_dev);
+ case NETDEV_UP:
+ case NETDEV_DOWN:
+ case NETDEV_CHANGE:
+ return failover_slave_link_change(event_dev);
+ default:
+ return NOTIFY_DONE;
+ }
+}
+
+static struct notifier_block failover_notifier = {
+ .notifier_call = failover_event,
+};
+
+static int failover_open(struct net_device *dev)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ struct net_device *primary_dev, *standby_dev;
+ int err;
+
+ netif_carrier_off(dev);
+ netif_tx_wake_all_queues(dev);
+
+ primary_dev = rtnl_dereference(finfo->primary_dev);
+ if (primary_dev) {
+ err = dev_open(primary_dev);
+ if (err)
+ goto err_primary_open;
+ }
+
+ standby_dev = rtnl_dereference(finfo->standby_dev);
+ if (standby_dev) {
+ err = dev_open(standby_dev);
+ if (err)
+ goto err_standby_open;
+ }
+
+ return 0;
+
+err_standby_open:
+ if (primary_dev)
+ dev_close(primary_dev);
+err_primary_open:
+ netif_tx_disable(dev);
+ return err;
+}
+
+static int failover_close(struct net_device *dev)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ struct net_device *slave_dev;
+
+ netif_tx_disable(dev);
+
+ slave_dev = rtnl_dereference(finfo->primary_dev);
+ if (slave_dev)
+ dev_close(slave_dev);
+
+ slave_dev = rtnl_dereference(finfo->standby_dev);
+ if (slave_dev)
+ dev_close(slave_dev);
+
+ return 0;
+}
+
+static netdev_tx_t failover_drop_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+{
+ atomic_long_inc(&dev->tx_dropped);
+ dev_kfree_skb_any(skb);
+ return NETDEV_TX_OK;
+}
+
+static netdev_tx_t failover_start_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ struct net_device *xmit_dev;
+
+ /* Try xmit via primary netdev followed by standby netdev */
+ xmit_dev = rcu_dereference_bh(finfo->primary_dev);
+ if (!xmit_dev || !failover_xmit_ready(xmit_dev)) {
+ xmit_dev = rcu_dereference_bh(finfo->standby_dev);
+ if (!xmit_dev || !failover_xmit_ready(xmit_dev))
+ return failover_drop_xmit(skb, dev);
+ }
+
+ skb->dev = xmit_dev;
+ skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+ return dev_queue_xmit(skb);
+}
+
+static u16 failover_select_queue(struct net_device *dev, struct sk_buff *skb,
+ void *accel_priv,
+ select_queue_fallback_t fallback)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ struct net_device *primary_dev;
+ u16 txq;
+
+ rcu_read_lock();
+ primary_dev = rcu_dereference(finfo->primary_dev);
+ if (primary_dev) {
+ const struct net_device_ops *ops = primary_dev->netdev_ops;
+
+ if (ops->ndo_select_queue)
+ txq = ops->ndo_select_queue(primary_dev, skb,
+ accel_priv, fallback);
+ else
+ txq = fallback(primary_dev, skb);
+
+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+ rcu_read_unlock();
+ return txq;
+ }
+ rcu_read_unlock();
+
+ txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+ /* Save the original txq to restore before passing to the driver */
+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+ if (unlikely(txq >= dev->real_num_tx_queues)) {
+ do {
+ txq -= dev->real_num_tx_queues;
+ } while (txq >= dev->real_num_tx_queues);
+ }
+
+ return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void failover_fold_stats(struct rtnl_link_stats64 *_res,
+ const struct rtnl_link_stats64 *_new,
+ const struct rtnl_link_stats64 *_old)
+{
+ const u64 *new = (const u64 *)_new;
+ const u64 *old = (const u64 *)_old;
+ u64 *res = (u64 *)_res;
+ int i;
+
+ for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+ u64 nv = new[i];
+ u64 ov = old[i];
+ s64 delta = nv - ov;
+
+ /* detects if this particular field is 32bit only */
+ if (((nv | ov) >> 32) == 0)
+ delta = (s64)(s32)((u32)nv - (u32)ov);
+
+ /* filter anomalies, some drivers reset their stats
+ * at down/up events.
+ */
+ if (delta > 0)
+ res[i] += delta;
+ }
+}
+
+static void failover_get_stats(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ const struct rtnl_link_stats64 *new;
+ struct rtnl_link_stats64 temp;
+ struct net_device *slave_dev;
+
+ spin_lock(&finfo->stats_lock);
+ memcpy(stats, &finfo->failover_stats, sizeof(*stats));
+
+ rcu_read_lock();
+
+ slave_dev = rcu_dereference(finfo->primary_dev);
+ if (slave_dev) {
+ new = dev_get_stats(slave_dev, &temp);
+ failover_fold_stats(stats, new, &finfo->primary_stats);
+ memcpy(&finfo->primary_stats, new, sizeof(*new));
+ }
+
+ slave_dev = rcu_dereference(finfo->standby_dev);
+ if (slave_dev) {
+ new = dev_get_stats(slave_dev, &temp);
+ failover_fold_stats(stats, new, &finfo->standby_stats);
+ memcpy(&finfo->standby_stats, new, sizeof(*new));
+ }
+
+ rcu_read_unlock();
+
+ memcpy(&finfo->failover_stats, stats, sizeof(*stats));
+ spin_unlock(&finfo->stats_lock);
+}
+
+static int failover_change_mtu(struct net_device *dev, int new_mtu)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ struct net_device *primary_dev, *standby_dev;
+ int ret = 0;
+
+ primary_dev = rtnl_dereference(finfo->primary_dev);
+ if (primary_dev) {
+ ret = dev_set_mtu(primary_dev, new_mtu);
+ if (ret)
+ return ret;
+ }
+
+ standby_dev = rtnl_dereference(finfo->standby_dev);
+ if (standby_dev) {
+ ret = dev_set_mtu(standby_dev, new_mtu);
+ if (ret) {
+ if (primary_dev)
+ dev_set_mtu(primary_dev, dev->mtu);
+ return ret;
+ }
+ }
+
+ dev->mtu = new_mtu;
+ return 0;
+}
+
+static void failover_set_rx_mode(struct net_device *dev)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ struct net_device *slave_dev;
+
+ rcu_read_lock();
+
+ slave_dev = rcu_dereference(finfo->primary_dev);
+ if (slave_dev) {
+ dev_uc_sync_multiple(slave_dev, dev);
+ dev_mc_sync_multiple(slave_dev, dev);
+ }
+
+ slave_dev = rcu_dereference(finfo->standby_dev);
+ if (slave_dev) {
+ dev_uc_sync_multiple(slave_dev, dev);
+ dev_mc_sync_multiple(slave_dev, dev);
+ }
+
+ rcu_read_unlock();
+}
+
+static const struct net_device_ops failover_dev_ops = {
+ .ndo_open = failover_open,
+ .ndo_stop = failover_close,
+ .ndo_start_xmit = failover_start_xmit,
+ .ndo_select_queue = failover_select_queue,
+ .ndo_get_stats64 = failover_get_stats,
+ .ndo_change_mtu = failover_change_mtu,
+ .ndo_set_rx_mode = failover_set_rx_mode,
+ .ndo_validate_addr = eth_validate_addr,
+ .ndo_features_check = passthru_features_check,
+};
+
+#define FAILOVER_NAME "failover"
+#define FAILOVER_VERSION "0.1"
+
+static void failover_ethtool_get_drvinfo(struct net_device *dev,
+ struct ethtool_drvinfo *drvinfo)
+{
+ strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
+ strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
+}
+
+int failover_ethtool_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
+{
+ struct failover_info *finfo = netdev_priv(dev);
+ struct net_device *slave_dev;
+
+ slave_dev = rtnl_dereference(finfo->primary_dev);
+ if (!slave_dev || !failover_xmit_ready(slave_dev)) {
+ slave_dev = rtnl_dereference(finfo->standby_dev);
+ if (!slave_dev || !failover_xmit_ready(slave_dev)) {
+ cmd->base.duplex = DUPLEX_UNKNOWN;
+ cmd->base.port = PORT_OTHER;
+ cmd->base.speed = SPEED_UNKNOWN;
+
+ return 0;
+ }
+ }
+
+ return __ethtool_get_link_ksettings(slave_dev, cmd);
+}
+EXPORT_SYMBOL_GPL(failover_ethtool_get_link_ksettings);
+
+static const struct ethtool_ops failover_ethtool_ops = {
+ .get_drvinfo = failover_ethtool_get_drvinfo,
+ .get_link = ethtool_op_get_link,
+ .get_link_ksettings = failover_ethtool_get_link_ksettings,
+};
+
+static void failover_register_existing_slave(struct net_device *failover_dev)
+{
+ struct net *net = dev_net(failover_dev);
+ struct net_device *dev;
+
+ rtnl_lock();
+ for_each_netdev(net, dev) {
+ if (dev == failover_dev)
+ continue;
+ if (!failover_validate_event_dev(dev))
+ continue;
+ if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
+ failover_slave_register(dev);
+ }
+ rtnl_unlock();
+}
+
+int failover_register(struct net_device *dev, struct failover_ops *ops,
+ struct failover **pfailover)
+{
+ struct failover *failover;
+
+ failover = kzalloc(sizeof(*failover), GFP_KERNEL);
+ if (!failover)
+ return -ENOMEM;
+
+ rcu_assign_pointer(failover->ops, ops);
+ dev_hold(dev);
+ dev->priv_flags |= IFF_FAILOVER;
+ rcu_assign_pointer(failover->failover_dev, dev);
+
+ spin_lock(&failover_lock);
+ list_add_tail(&failover->list, &failover_list);
+ spin_unlock(&failover_lock);
+
+ failover_register_existing_slave(dev);
+
+ *pfailover = failover;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(failover_register);
+
+void failover_unregister(struct failover *failover)
+{
+ struct net_device *failover_dev;
+
+ failover_dev = rcu_dereference(failover->failover_dev);
+
+ failover_dev->priv_flags &= ~IFF_FAILOVER;
+ dev_put(failover_dev);
+
+ spin_lock(&failover_lock);
+ list_del(&failover->list);
+ spin_unlock(&failover_lock);
+
+ kfree(failover);
+}
+EXPORT_SYMBOL_GPL(failover_unregister);
+
+int failover_create(struct net_device *standby_dev, struct failover **pfailover)
+{
+ struct device *dev = standby_dev->dev.parent;
+ struct net_device *failover_dev;
+ int err;
+
+ /* Alloc at least 2 queues, for now we are going with 16 assuming
+ * that most devices being bonded won't have too many queues.
+ */
+ failover_dev = alloc_etherdev_mq(sizeof(struct failover_info), 16);
+ if (!failover_dev) {
+ dev_err(dev, "Unable to allocate failover_netdev!\n");
+ return -ENOMEM;
+ }
+
+ dev_net_set(failover_dev, dev_net(standby_dev));
+ SET_NETDEV_DEV(failover_dev, dev);
+
+ failover_dev->netdev_ops = &failover_dev_ops;
+ failover_dev->ethtool_ops = &failover_ethtool_ops;
+
+ /* Initialize the device options */
+ failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+ failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+ IFF_TX_SKB_SHARING);
+
+ /* don't acquire failover netdev's netif_tx_lock when transmitting */
+ failover_dev->features |= NETIF_F_LLTX;
+
+ /* Don't allow failover devices to change network namespaces. */
+ failover_dev->features |= NETIF_F_NETNS_LOCAL;
+
+ failover_dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+ NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+ NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+ failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+ failover_dev->features |= failover_dev->hw_features;
+
+ memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
+ failover_dev->addr_len);
+
+ failover_dev->min_mtu = standby_dev->min_mtu;
+ failover_dev->max_mtu = standby_dev->max_mtu;
+
+ err = register_netdev(failover_dev);
+ if (err < 0) {
+ dev_err(dev, "Unable to register failover_dev!\n");
+ goto err_register_netdev;
+ }
+
+ netif_carrier_off(failover_dev);
+
+ err = failover_register(failover_dev, NULL, pfailover);
+ if (err < 0)
+ goto err_failover;
+
+ return 0;
+
+err_failover:
+ unregister_netdev(failover_dev);
+err_register_netdev:
+ free_netdev(failover_dev);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(failover_create);
+
+void failover_destroy(struct failover *failover)
+{
+ struct net_device *failover_dev;
+ struct net_device *slave_dev;
+ struct failover_info *finfo;
+
+ if (!failover)
+ return;
+
+ failover_dev = rcu_dereference(failover->failover_dev);
+ finfo = netdev_priv(failover_dev);
+
+ netif_device_detach(failover_dev);
+
+ rtnl_lock();
+
+ slave_dev = rtnl_dereference(finfo->primary_dev);
+ if (slave_dev)
+ failover_slave_unregister(slave_dev);
+
+ slave_dev = rtnl_dereference(finfo->standby_dev);
+ if (slave_dev)
+ failover_slave_unregister(slave_dev);
+
+ failover_unregister(failover);
+
+ unregister_netdevice(failover_dev);
+
+ rtnl_unlock();
+
+ free_netdev(failover_dev);
+}
+EXPORT_SYMBOL_GPL(failover_destroy);
+
+static __init int
+failover_init(void)
+{
+ register_netdevice_notifier(&failover_notifier);
+
+ return 0;
+}
+module_init(failover_init);
+
+static __exit
+void failover_exit(void)
+{
+ unregister_netdevice_notifier(&failover_notifier);
+}
+module_exit(failover_exit);
+
+MODULE_DESCRIPTION("Failover infrastructure/interface for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
--
2.14.3
* [PATCH v7 net-next 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Sridhar Samudrala @ 2018-04-20 1:42 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>
This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a directly attached VF without the need to set up a bond/team
between a VF and virtio net device in the guest.
The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
into the guest to switch over to the VF datapath.
It uses the generic failover framework, which provides two functions to
create and destroy a master failover netdev. When the STANDBY feature is
enabled, an additional netdev (the failover netdev) is created that acts as
a master device and tracks the state of the two lower netdevs. The original
virtio_net netdev is marked as the 'standby' netdev, and a passthru device
with the same MAC is registered as the 'primary' netdev.
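As an illustration only, the primary/standby relationship can be pictured
with a small user-space model. This is a sketch, not code from the series:
the model_* names and fields are hypothetical stand-ins for struct
net_device and struct failover_info, and the readiness test simply mirrors
the netif_running() && netif_carrier_ok() check in failover_xmit_ready().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct net_device; only the two
 * bits of state that failover_xmit_ready() looks at are modelled.
 */
struct model_dev {
	const char *name;
	bool running;		/* models netif_running() */
	bool carrier_ok;	/* models netif_carrier_ok() */
};

/* Models the two slots in struct failover_info; either may be NULL. */
struct model_failover {
	struct model_dev *primary;	/* passthru VF with the same MAC */
	struct model_dev *standby;	/* original virtio_net netdev */
};

static bool model_xmit_ready(const struct model_dev *dev)
{
	return dev && dev->running && dev->carrier_ok;
}

/* Prefer the primary, fall back to the standby, else NULL (drop). */
static struct model_dev *model_pick_xmit_dev(const struct model_failover *f)
{
	if (model_xmit_ready(f->primary))
		return f->primary;
	if (model_xmit_ready(f->standby))
		return f->standby;
	return NULL;
}

/* The failover netdev reports carrier if either lower netdev is usable. */
static bool model_failover_carrier(const struct model_failover *f)
{
	return model_xmit_ready(f->primary) || model_xmit_ready(f->standby);
}
```

With both lower devices up, the model picks the primary; once the VF loses
carrier or is unplugged (slot set to NULL), it falls back to the standby,
which is the behaviour the migration sequence above relies on.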
This patch is based on the discussion initiated by Jesse in this thread:
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/Kconfig | 1 +
drivers/net/virtio_net.c | 36 +++++++++++++++++++++++++++++++++++-
2 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..5abe328973da 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
config VIRTIO_NET
tristate "Virtio network driver"
depends on VIRTIO
+ depends on MAY_USE_FAILOVER
---help---
This is the virtual network driver for virtio. It can be used with
QEMU based VMMs (like KVM or Xen). Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6f95719ede40..42b9f9bff48b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
#include <linux/cpu.h>
#include <linux/average.h>
#include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
#include <net/route.h>
#include <net/xdp.h>
+#include <net/failover.h>
static int napi_weight = NAPI_POLL_WEIGHT;
module_param(napi_weight, int, 0444);
@@ -206,6 +209,9 @@ struct virtnet_info {
u32 speed;
unsigned long guest_offloads;
+
+ /* failover when STANDBY feature enabled */
+ struct failover *failover;
};
struct padded_vnet_hdr {
@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
}
}
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+ size_t len)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+ int ret;
+
+ if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
+ return -EOPNOTSUPP;
+
+ ret = snprintf(buf, len, "_sby");
+ if (ret >= len)
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
static const struct net_device_ops virtnet_netdev = {
.ndo_open = virtnet_open,
.ndo_stop = virtnet_close,
@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
.ndo_xdp_xmit = virtnet_xdp_xmit,
.ndo_xdp_flush = virtnet_xdp_flush,
.ndo_features_check = passthru_features_check,
+ .ndo_get_phys_port_name = virtnet_get_phys_port_name,
};
static void virtnet_config_changed_work(struct work_struct *work)
@@ -2839,10 +2862,16 @@ static int virtnet_probe(struct virtio_device *vdev)
virtnet_init_settings(dev);
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+ err = failover_create(vi->dev, &vi->failover);
+ if (err)
+ goto free_vqs;
+ }
+
err = register_netdev(dev);
if (err) {
pr_debug("virtio_net: registering device failed\n");
- goto free_vqs;
+ goto free_failover;
}
virtio_device_ready(vdev);
@@ -2879,6 +2908,8 @@ static int virtnet_probe(struct virtio_device *vdev)
vi->vdev->config->reset(vdev);
unregister_netdev(dev);
+free_failover:
+ failover_destroy(vi->failover);
free_vqs:
cancel_delayed_work_sync(&vi->refill);
free_receive_page_frags(vi);
@@ -2913,6 +2944,8 @@ static void virtnet_remove(struct virtio_device *vdev)
unregister_netdev(vi->dev);
+ failover_destroy(vi->failover);
+
remove_vq_common(vi);
free_netdev(vi->dev);
@@ -3010,6 +3043,7 @@ static __init int virtio_net_driver_init(void)
ret = register_virtio_driver(&virtio_net_driver);
if (ret)
goto err_virtio;
+
return 0;
err_virtio:
cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
--
2.14.3
* [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Sridhar Samudrala @ 2018-04-20 1:42 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri
In-Reply-To: <1524188524-28411-1-git-send-email-sridhar.samudrala@intel.com>
Use the registration/notification framework supported by the generic
failover infrastructure.
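For readers following the series, the delegation the framework performs can
be pictured with a small user-space model. This is only a sketch under
stated assumptions: the model_* names, the integer device ids and
model_driver_join are hypothetical, standing in for struct failover_ops,
struct net_device and a driver-supplied callback. The shape of the check
(use the driver's ops when registered, reject a missing callback with
-EINVAL, otherwise run the generic path) follows failover_slave_join() in
net/core/failover.c from this series.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct failover_ops. */
struct model_ops {
	int (*slave_join)(int slave_id, int failover_id);
};

/* Stand-in for the generic join path used when no ops are registered. */
static int model_generic_join(int slave_id, int failover_id)
{
	(void)slave_id;
	(void)failover_id;
	return 0;
}

/* Hypothetical driver callback; returns a recognisable value. */
static int model_driver_join(int slave_id, int failover_id)
{
	return slave_id + failover_id;
}

/* Dispatch: driver ops take precedence over the generic behaviour. */
static int model_slave_join(const struct model_ops *ops,
			    int slave_id, int failover_id)
{
	if (ops) {
		if (!ops->slave_join)
			return -EINVAL;
		return ops->slave_join(slave_id, failover_id);
	}
	return model_generic_join(slave_id, failover_id);
}
```

In this model a driver that registers ops, as netvsc does here, keeps its
own join logic, while a caller that passes no ops gets the generic
behaviour.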
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/hyperv/Kconfig | 1 +
drivers/net/hyperv/hyperv_net.h | 2 +
drivers/net/hyperv/netvsc_drv.c | 208 ++++++++++------------------------------
3 files changed, 55 insertions(+), 156 deletions(-)
diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..56099d10beed 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
config HYPERV_NET
tristate "Microsoft Hyper-V virtual network driver"
depends on HYPERV
+ depends on MAY_USE_FAILOVER
help
Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..d8c2ff698693 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -768,6 +768,8 @@ struct net_device_context {
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
+
+ struct failover *failover;
};
/* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..8404c22de32b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
#include <net/pkt_sched.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <net/failover.h>
#include "hyperv_net.h"
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
}
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
- struct net_device *dev;
-
- ASSERT_RTNL();
-
- for_each_netdev(&init_net, dev) {
- if (dev->netdev_ops != &device_ops)
- continue; /* not a netvsc device */
-
- if (ether_addr_equal(mac, dev->perm_addr))
- return dev;
- }
-
- return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
- struct net_device *dev;
-
- ASSERT_RTNL();
-
- for_each_netdev(&init_net, dev) {
- struct net_device_context *net_device_ctx;
-
- if (dev->netdev_ops != &device_ops)
- continue; /* not a netvsc device */
-
- net_device_ctx = netdev_priv(dev);
- if (!rtnl_dereference(net_device_ctx->nvdev))
- continue; /* device is removed */
-
- if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
- return dev; /* a match */
- }
-
- return NULL;
-}
-
/* Called when VF is injecting data into network stack.
* Change the associated network device from VF to netvsc.
* note: already called with rcu_read_lock
@@ -1829,39 +1790,15 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
struct net_device *ndev)
{
struct net_device_context *ndev_ctx = netdev_priv(ndev);
- int ret;
-
- ret = netdev_rx_handler_register(vf_netdev,
- netvsc_vf_handle_frame, ndev);
- if (ret != 0) {
- netdev_err(vf_netdev,
- "can not register netvsc VF receive handler (err = %d)\n",
- ret);
- goto rx_handler_failed;
- }
-
- ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
- if (ret != 0) {
- netdev_err(vf_netdev,
- "can not set master device %s (err = %d)\n",
- ndev->name, ret);
- goto upper_link_failed;
- }
-
- /* set slave flag before open to prevent IPv6 addrconf */
- vf_netdev->flags |= IFF_SLAVE;
schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
- call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
netdev_info(vf_netdev, "joined to %s\n", ndev->name);
- return 0;
-upper_link_failed:
- netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
- return ret;
+ dev_hold(vf_netdev);
+ rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+ return 0;
}
static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,82 @@ static void netvsc_vf_setup(struct work_struct *w)
rtnl_unlock();
}
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_vf_pre_register(struct net_device *vf_netdev,
+ struct net_device *ndev)
{
- struct net_device *ndev;
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
- if (vf_netdev->addr_len != ETH_ALEN)
- return NOTIFY_DONE;
-
- /*
- * We will use the MAC address to locate the synthetic interface to
- * associate with the VF interface. If we don't find a matching
- * synthetic interface, move on.
- */
- ndev = get_netvsc_bymac(vf_netdev->perm_addr);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
- return NOTIFY_DONE;
-
- if (netvsc_vf_join(vf_netdev, ndev) != 0)
- return NOTIFY_DONE;
+ return -EEXIST;
netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
- dev_hold(vf_netdev);
- rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
- return NOTIFY_OK;
+ return 0;
}
/* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+ struct net_device *ndev)
{
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
- struct net_device *ndev;
bool vf_is_up = netif_running(vf_netdev);
- ndev = get_netvsc_byref(vf_netdev);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
if (!netvsc_dev)
- return NOTIFY_DONE;
+ return -EINVAL;
netvsc_switch_datapath(ndev, vf_is_up);
netdev_info(ndev, "Data path switched %s VF: %s\n",
vf_is_up ? "to" : "from", vf_netdev->name);
- return NOTIFY_OK;
+ return 0;
}
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_vf_release(struct net_device *vf_netdev,
+ struct net_device *ndev)
{
- struct net_device *ndev;
struct net_device_context *net_device_ctx;
- ndev = get_netvsc_byref(vf_netdev);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
- cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
+ if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+ return -EINVAL;
- netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+ cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
- netdev_rx_handler_unregister(vf_netdev);
- netdev_upper_dev_unlink(vf_netdev, ndev);
RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
dev_put(vf_netdev);
- return NOTIFY_OK;
+ return 0;
}
+static int netvsc_vf_pre_unregister(struct net_device *vf_netdev,
+ struct net_device *ndev)
+{
+ struct net_device_context *net_device_ctx;
+
+ net_device_ctx = netdev_priv(ndev);
+ if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+ return -EINVAL;
+
+ netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+
+ return 0;
+}
+
+static struct failover_ops netvsc_failover_ops = {
+ .slave_pre_register = netvsc_vf_pre_register,
+ .slave_join = netvsc_vf_join,
+ .slave_pre_unregister = netvsc_vf_pre_unregister,
+ .slave_release = netvsc_vf_release,
+ .slave_link_change = netvsc_vf_changed,
+ .handle_frame = netvsc_vf_handle_frame,
+};
+
static int netvsc_probe(struct hv_device *dev,
const struct hv_vmbus_device_id *dev_id)
{
@@ -2082,8 +2016,15 @@ static int netvsc_probe(struct hv_device *dev,
goto register_failed;
}
+ ret = failover_register(net, &netvsc_failover_ops,
+ &net_device_ctx->failover);
+ if (ret != 0)
+ goto err_failover;
+
return ret;
+err_failover:
+ unregister_netdev(net);
register_failed:
rndis_filter_device_remove(dev, nvdev);
rndis_failed:
@@ -2124,13 +2065,15 @@ static int netvsc_remove(struct hv_device *dev)
rtnl_lock();
vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
if (vf_netdev)
- netvsc_unregister_vf(vf_netdev);
+ failover_slave_unregister(vf_netdev);
if (nvdev)
rndis_filter_device_remove(dev, nvdev);
unregister_netdevice(net);
+ failover_unregister(ndev_ctx->failover);
+
rtnl_unlock();
rcu_read_unlock();
@@ -2157,54 +2100,8 @@ static struct hv_driver netvsc_drv = {
.remove = netvsc_remove,
};
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
- struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
- /* Skip our own events */
- if (event_dev->netdev_ops == &device_ops)
- return NOTIFY_DONE;
-
- /* Avoid non-Ethernet type devices */
- if (event_dev->type != ARPHRD_ETHER)
- return NOTIFY_DONE;
-
- /* Avoid Vlan dev with same MAC registering as VF */
- if (is_vlan_dev(event_dev))
- return NOTIFY_DONE;
-
- /* Avoid Bonding master dev with same MAC registering as VF */
- if ((event_dev->priv_flags & IFF_BONDING) &&
- (event_dev->flags & IFF_MASTER))
- return NOTIFY_DONE;
-
- switch (event) {
- case NETDEV_REGISTER:
- return netvsc_register_vf(event_dev);
- case NETDEV_UNREGISTER:
- return netvsc_unregister_vf(event_dev);
- case NETDEV_UP:
- case NETDEV_DOWN:
- return netvsc_vf_changed(event_dev);
- default:
- return NOTIFY_DONE;
- }
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
- .notifier_call = netvsc_netdev_event,
-};
-
static void __exit netvsc_drv_exit(void)
{
- unregister_netdevice_notifier(&netvsc_netdev_notifier);
vmbus_driver_unregister(&netvsc_drv);
}
@@ -2224,7 +2121,6 @@ static int __init netvsc_drv_init(void)
if (ret)
return ret;
- register_netdevice_notifier(&netvsc_netdev_notifier);
return 0;
}
--
2.14.3