Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf 0/5] bpf: allow wide (u64) aligned loads for some fields of bpf_sock_addr
From: Daniel Borkmann @ 2019-07-15 22:21 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, bpf; +Cc: davem, ast, Yonghong Song
In-Reply-To: <20190715163956.204061-1-sdf@google.com>

On 7/15/19 6:39 PM, Stanislav Fomichev wrote:
> When fixing selftests by adding support for wide stores, Yonghong
> reported that he had seen some examples where clang generates
> single u64 loads for two adjacent u32s as well:
> http://lore.kernel.org/netdev/a66c937f-94c0-eaf8-5b37-8587d66c0c62@fb.com
> 
> Let's support aligned u64 reads for some bpf_sock_addr fields
> as well.
> 
> (This can probably wait for bpf-next, I'll defer to Younhong and the
> maintainers.)
> 
> Cc: Yonghong Song <yhs@fb.com>
> 
> Stanislav Fomichev (5):
>   bpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok
>   bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and
>     msg_src_ip6
>   selftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c
>   selftests/bpf: add selftests for wide loads
>   bpf: sync bpf.h to tools/
> 
>  include/linux/filter.h                        |  2 +-
>  include/uapi/linux/bpf.h                      |  4 +-
>  net/core/filter.c                             | 24 ++++--
>  tools/include/uapi/linux/bpf.h                |  4 +-
>  .../selftests/bpf/verifier/wide_access.c      | 73 +++++++++++++++++++
>  .../selftests/bpf/verifier/wide_store.c       | 36 ---------
>  6 files changed, 95 insertions(+), 48 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/wide_access.c
>  delete mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> 

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf] samples/bpf: build with -D__TARGET_ARCH_$(SRCARCH)
From: Daniel Borkmann @ 2019-07-15 22:20 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190715091103.4030-1-iii@linux.ibm.com>

On 7/15/19 11:11 AM, Ilya Leoshkevich wrote:
> While $ARCH can be relatively flexible (see Makefile and
> tools/scripts/Makefile.arch), $SRCARCH always corresponds to a directory
> name under arch/.
> 
> Therefore, build samples with -D__TARGET_ARCH_$(SRCARCH), since that
> matches the expectations of bpf_helpers.h.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> Acked-by: Vasily Gorbik <gor@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [oss-drivers] Re: [RFC bpf-next 2/8] bpf: extend list based insn patching infra to verification layer
From: Andrii Nakryiko @ 2019-07-15 22:29 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Jakub Kicinski, bpf, Networking, oss-drivers
In-Reply-To: <87h87n39aj.fsf@netronome.com>

On Mon, Jul 15, 2019 at 3:02 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>
>
> Andrii Nakryiko writes:
>
> > On Thu, Jul 11, 2019 at 5:20 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >>
> >>
> >> Jiong Wang writes:
> >>
> >> > Andrii Nakryiko writes:
> >> >
> >> >> On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
> >> >>>
> >> >>> Verification layer also needs to handle auxiliar info as well as adjusting
> >> >>> subprog start.
> >> >>>
> >> >>> At this layer, insns inside patch buffer could be jump, but they should
> >> >>> have been resolved, meaning they shouldn't jump to insn outside of the
> >> >>> patch buffer. Lineration function for this layer won't touch insns inside
> >> >>> patch buffer.
> >> >>>
> >> >>> Adjusting subprog is finished along with adjusting jump target when the
> >> >>> input will cover bpf to bpf call insn, re-register subprog start is cheap.
> >> >>> But adjustment when there is insn deleteion is not considered yet.
> >> >>>
> >> >>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> >> >>> ---
> >> >>>  kernel/bpf/verifier.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>>  1 file changed, 150 insertions(+)
> >> >>>
> >> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> >>> index a2e7637..2026d64 100644
> >> >>> --- a/kernel/bpf/verifier.c
> >> >>> +++ b/kernel/bpf/verifier.c
> >> >>> @@ -8350,6 +8350,156 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
> >> >>>         }
> >> >>>  }
> >> >>>
> >> >>> +/* Linearize bpf list insn to array (verifier layer). */
> >> >>> +static struct bpf_verifier_env *
> >> >>> +verifier_linearize_list_insn(struct bpf_verifier_env *env,
> >> >>> +                            struct bpf_list_insn *list)
> >> >>
> >> >> It's unclear why this returns env back? It's not allocating a new env,
> >> >> so it's weird and unnecessary. Just return error code.
> >> >
> >> > The reason is I was thinking we have two layers in BPF, the core and the
> >> > verifier.
> >> >
> >> > For core layer (the relevant file is core.c), when doing patching, the
> >> > input is insn list and bpf_prog, the linearization should linearize the
> >> > insn list into insn array, and also whatever others affect inside bpf_prog
> >> > due to changing on insns, for example line info inside prog->aux. So the
> >> > return value is bpf_prog for core layer linearization hook.
> >> >
> >> > For verifier layer, it is similar, but the context if bpf_verifier_env, the
> >> > linearization hook should linearize the insn list, and also those affected
> >> > inside env, for example bpf_insn_aux_data, so the return value is
> >> > bpf_verifier_env, meaning returning an updated verifier context
> >> > (bpf_verifier_env) after insn list linearization.
> >>
> >> Realized your point is no new env is allocated, so just return error
> >> code. Yes, the env pointer is not changed, just internal data is
> >> updated. Return bpf_verifier_env mostly is trying to make the hook more
> >> clear that it returns an updated "context" where the linearization happens,
> >> for verifier layer, it is bpf_verifier_env, and for core layer, it is
> >> bpf_prog, so return value was designed to return these two types.
> >
> > Oh, I missed that core layer returns bpf_prog*. I think this is
> > confusing as hell and is very contrary to what one would expect. If
> > the function doesn't allocate those objects, it shouldn't return them,
> > except for rare cases of some accessor functions. Me reading this,
> > I'll always be suprised and will have to go skim code just to check
> > whether those functions really return new bpf_prog or
> > bpf_verifier_env, respectively.
>
> bpf_prog_realloc do return new bpf_prog, so we will need to return bpf_prog
> * for core layer.

Ah, I see, then it would make sense for core layer, but still is very
confusing for verifier_linearize_list_insn.
I still hope for unified solution, so it shouldn't matter. But it
pointed me to a bug in your code, see below.

>
> >
> > Please change them both to just return error code.
> >
> >>
> >> >
> >> > Make sense?
> >> >
> >> > Regards,
> >> > Jiong
> >> >
> >> >>
> >> >>> +{
> >> >>> +       u32 *idx_map, idx, orig_cnt, fini_cnt = 0;
> >> >>> +       struct bpf_subprog_info *new_subinfo;
> >> >>> +       struct bpf_insn_aux_data *new_data;
> >> >>> +       struct bpf_prog *prog = env->prog;
> >> >>> +       struct bpf_verifier_env *ret_env;
> >> >>> +       struct bpf_insn *insns, *insn;
> >> >>> +       struct bpf_list_insn *elem;
> >> >>> +       int ret;
> >> >>> +
> >> >>> +       /* Calculate final size. */
> >> >>> +       for (elem = list; elem; elem = elem->next)
> >> >>> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
> >> >>> +                       fini_cnt++;
> >> >>> +
> >> >>> +       orig_cnt = prog->len;
> >> >>> +       insns = prog->insnsi;
> >> >>> +       /* If prog length remains same, nothing else to do. */
> >> >>> +       if (fini_cnt == orig_cnt) {
> >> >>> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
> >> >>> +                       *insn = elem->insn;
> >> >>> +               return env;
> >> >>> +       }
> >> >>> +       /* Realloc insn buffer when necessary. */
> >> >>> +       if (fini_cnt > orig_cnt)
> >> >>> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
> >> >>> +                                       GFP_USER);
> >> >>> +       if (!prog)
> >> >>> +               return ERR_PTR(-ENOMEM);

On realloc failure, prog will be non-NULL, so you need to handle error
properly (and propagate it, instead of returning -ENOMEM):

if (IS_ERR(prog))
    return ERR_PTR(prog);


> >> >>> +       insns = prog->insnsi;
> >> >>> +       prog->len = fini_cnt;
> >> >>> +       ret_env = env;
> >> >>> +
> >> >>> +       /* idx_map[OLD_IDX] = NEW_IDX */
> >> >>> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
> >> >>> +       if (!idx_map)
> >> >>> +               return ERR_PTR(-ENOMEM);
> >> >>> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
> >> >>> +
> >> >>> +       /* Use the same alloc method used when allocating env->insn_aux_data. */
> >> >>> +       new_data = vzalloc(array_size(sizeof(*new_data), fini_cnt));
> >> >>> +       if (!new_data) {
> >> >>> +               kvfree(idx_map);
> >> >>> +               return ERR_PTR(-ENOMEM);
> >> >>> +       }
> >> >>> +
> >> >>> +       /* Copy over insn + calculate idx_map. */
> >> >>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
> >> >>> +               int orig_idx = elem->orig_idx - 1;
> >> >>> +
> >> >>> +               if (orig_idx >= 0) {
> >> >>> +                       idx_map[orig_idx] = idx;
> >> >>> +
> >> >>> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
> >> >>> +                               continue;
> >> >>> +
> >> >>> +                       new_data[idx] = env->insn_aux_data[orig_idx];
> >> >>> +
> >> >>> +                       if (elem->flag & LIST_INSN_FLAG_PATCHED)
> >> >>> +                               new_data[idx].zext_dst =
> >> >>> +                                       insn_has_def32(env, &elem->insn);
> >> >>> +               } else {
> >> >>> +                       new_data[idx].seen = true;
> >> >>> +                       new_data[idx].zext_dst = insn_has_def32(env,
> >> >>> +                                                               &elem->insn);
> >> >>> +               }
> >> >>> +               insns[idx++] = elem->insn;
> >> >>> +       }
> >> >>> +
> >> >>> +       new_subinfo = kvzalloc(sizeof(env->subprog_info), GFP_KERNEL);
> >> >>> +       if (!new_subinfo) {
> >> >>> +               kvfree(idx_map);
> >> >>> +               vfree(new_data);
> >> >>> +               return ERR_PTR(-ENOMEM);
> >> >>> +       }
> >> >>> +       memcpy(new_subinfo, env->subprog_info, sizeof(env->subprog_info));
> >> >>> +       memset(env->subprog_info, 0, sizeof(env->subprog_info));
> >> >>> +       env->subprog_cnt = 0;
> >> >>> +       env->prog = prog;
> >> >>> +       ret = add_subprog(env, 0);
> >> >>> +       if (ret < 0) {
> >> >>> +               ret_env = ERR_PTR(ret);
> >> >>> +               goto free_all_ret;
> >> >>> +       }
> >> >>> +       /* Relocate jumps using idx_map.
> >> >>> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
> >> >>> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
> >> >>> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
> >> >>> +        */
> >> >>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
> >> >>> +               int orig_idx = elem->orig_idx;
> >> >>> +
> >> >>> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
> >> >>> +                       continue;
> >> >>> +               if ((elem->flag & LIST_INSN_FLAG_PATCHED) || !orig_idx) {
> >> >>> +                       idx++;
> >> >>> +                       continue;
> >> >>> +               }
> >> >>> +
> >> >>> +               ret = bpf_jit_adj_imm_off(&insns[idx], orig_idx - 1, idx,
> >> >>> +                                         idx_map);
> >> >>> +               if (ret < 0) {
> >> >>> +                       ret_env = ERR_PTR(ret);
> >> >>> +                       goto free_all_ret;
> >> >>> +               }
> >> >>> +               /* Recalculate subprog start as we are at bpf2bpf call insn. */
> >> >>> +               if (ret > 0) {
> >> >>> +                       ret = add_subprog(env, idx + insns[idx].imm + 1);
> >> >>> +                       if (ret < 0) {
> >> >>> +                               ret_env = ERR_PTR(ret);
> >> >>> +                               goto free_all_ret;
> >> >>> +                       }
> >> >>> +               }
> >> >>> +               idx++;
> >> >>> +       }
> >> >>> +       if (ret < 0) {
> >> >>> +               ret_env = ERR_PTR(ret);
> >> >>> +               goto free_all_ret;
> >> >>> +       }
> >> >>> +
> >> >>> +       env->subprog_info[env->subprog_cnt].start = fini_cnt;
> >> >>> +       for (idx = 0; idx <= env->subprog_cnt; idx++)
> >> >>> +               new_subinfo[idx].start = env->subprog_info[idx].start;
> >> >>> +       memcpy(env->subprog_info, new_subinfo, sizeof(env->subprog_info));
> >> >>> +
> >> >>> +       /* Adjust linfo.
> >> >>> +        * FIXME: no support for insn removal at the moment.
> >> >>> +        */
> >> >>> +       if (prog->aux->nr_linfo) {
> >> >>> +               struct bpf_line_info *linfo = prog->aux->linfo;
> >> >>> +               u32 nr_linfo = prog->aux->nr_linfo;
> >> >>> +
> >> >>> +               for (idx = 0; idx < nr_linfo; idx++)
> >> >>> +                       linfo[idx].insn_off = idx_map[linfo[idx].insn_off];
> >> >>> +       }
> >> >>> +       vfree(env->insn_aux_data);
> >> >>> +       env->insn_aux_data = new_data;
> >> >>> +       goto free_mem_list_ret;
> >> >>> +free_all_ret:
> >> >>> +       vfree(new_data);
> >> >>> +free_mem_list_ret:
> >> >>> +       kvfree(new_subinfo);
> >> >>> +       kvfree(idx_map);
> >> >>> +       return ret_env;
> >> >>> +}
> >> >>> +
> >> >>>  static int opt_remove_dead_code(struct bpf_verifier_env *env)
> >> >>>  {
> >> >>>         struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
> >> >>> --
> >> >>> 2.7.4
> >> >>>
> >>
>

^ permalink raw reply

* Re: [PATCH bpf] selftests/bpf: make directory prerequisites order-only
From: Daniel Borkmann @ 2019-07-15 22:19 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190712135631.91398-1-iii@linux.ibm.com>

On 7/12/19 3:56 PM, Ilya Leoshkevich wrote:
> When directories are used as prerequisites in Makefiles, they can cause
> a lot of unnecessary rebuilds, because a directory is considered changed
> whenever a file in this directory is added, removed or modified.
> 
> If the only thing a target is interested in is the existence of the
> directory it depends on, which is the case for selftests/bpf, this
> directory should be specified as an order-only prerequisite: it would
> still be created in case it does not exist, but it would not trigger a
> rebuild of a target in case it's considered changed.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf] selftests/bpf: fix attach_probe on s390
From: Daniel Borkmann @ 2019-07-15 22:18 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190712134142.90668-1-iii@linux.ibm.com>

On 7/12/19 3:41 PM, Ilya Leoshkevich wrote:
> attach_probe test fails, because it cannot install a kprobe on a
> non-existent sys_nanosleep symbol.
> 
> Use the correct symbol name for the nanosleep syscall on 64-bit s390.
> Don't bother adding one for 31-bit mode, since tests are compiled only
> in 64-bit mode.
> 
> Fixes: 1e8611bbdfc9 ("selftests/bpf: add kprobe/uprobe selftests")
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> Acked-by: Vasily Gorbik <gor@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH v3 bpf 0/3] fix BTF verification size resolution
From: Daniel Borkmann @ 2019-07-15 22:18 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, yhs; +Cc: andrii.nakryiko, kernel-team
In-Reply-To: <20190712172557.4039121-1-andriin@fb.com>

On 7/12/19 7:25 PM, Andrii Nakryiko wrote:
> BTF size resolution logic isn't always resolving type size correctly, leading
> to erroneous map creation failures due to value size mismatch.
> 
> This patch set:
> 1. fixes the issue (patch #1);
> 2. adds tests for trickier cases (patch #2);
> 3. and converts few test cases utilizing BTF-defined maps, that previously
>    couldn't use typedef'ed arrays due to kernel bug (patch #3).
> 
> Andrii Nakryiko (3):
>   bpf: fix BTF verifier size resolution logic
>   selftests/bpf: add trickier size resolution tests
>   selftests/bpf: use typedef'ed arrays as map values
> 
>  kernel/bpf/btf.c                              | 19 ++--
>  .../bpf/progs/test_get_stack_rawtp.c          |  3 +-
>  .../bpf/progs/test_stacktrace_build_id.c      |  3 +-
>  .../selftests/bpf/progs/test_stacktrace_map.c |  2 +-
>  tools/testing/selftests/bpf/test_btf.c        | 88 +++++++++++++++++++
>  5 files changed, 104 insertions(+), 11 deletions(-)
> 

Applied, thanks!

^ permalink raw reply

* Re: [PATCH v2 bpf-next] selftests/bpf: fix "alu with different scalars 1" on s390
From: Daniel Borkmann @ 2019-07-15 22:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Ilya Leoshkevich; +Cc: bpf, Network Development, Y Song
In-Reply-To: <CAADnVQ+H9bOW+EY6=AKt7mqgdEgaPhc1Wk_o=Ez43CracLCaiA@mail.gmail.com>

On 7/16/19 12:13 AM, Alexei Starovoitov wrote:
> On Thu, Jul 4, 2019 at 1:53 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>>
>> BPF_LDX_MEM is used to load the least significant byte of the retrieved
>> test_val.index, however, on big-endian machines it ends up retrieving
>> the most significant byte.
>>
>> Use the correct least significant byte offset on big-endian machines.
>>
>> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
>> ---
>>
>> v1->v2:
>> - use __BYTE_ORDER instead of __BYTE_ORDER__.
>>
>>  tools/testing/selftests/bpf/verifier/value_ptr_arith.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> index c3de1a2c9dc5..e5940c4e8b8f 100644
>> --- a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> +++ b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> @@ -183,7 +183,11 @@
>>         BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>>         BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>>         BPF_EXIT_INSN(),
>> +#if __BYTE_ORDER == __LITTLE_ENDIAN
>>         BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
>> +#else
>> +       BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, sizeof(int) - 1),
>> +#endif
> 
> I think tests should be arch and endian independent where possible.
> In this case test_val.index is 4 byte and 4 byte load should work just as well.

Yes, agree, this should be fixed with BPF_W as load.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v2 bpf-next] selftests/bpf: fix "alu with different scalars 1" on s390
From: Alexei Starovoitov @ 2019-07-15 22:13 UTC (permalink / raw)
  To: Ilya Leoshkevich; +Cc: bpf, Network Development, Y Song
In-Reply-To: <20190704085224.65223-1-iii@linux.ibm.com>

On Thu, Jul 4, 2019 at 1:53 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
> BPF_LDX_MEM is used to load the least significant byte of the retrieved
> test_val.index, however, on big-endian machines it ends up retrieving
> the most significant byte.
>
> Use the correct least significant byte offset on big-endian machines.
>
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> ---
>
> v1->v2:
> - use __BYTE_ORDER instead of __BYTE_ORDER__.
>
>  tools/testing/selftests/bpf/verifier/value_ptr_arith.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
> index c3de1a2c9dc5..e5940c4e8b8f 100644
> --- a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
> +++ b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
> @@ -183,7 +183,11 @@
>         BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>         BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>         BPF_EXIT_INSN(),
> +#if __BYTE_ORDER == __LITTLE_ENDIAN
>         BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
> +#else
> +       BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, sizeof(int) - 1),
> +#endif

I think tests should be arch and endian independent where possible.
In this case test_val.index is 4 byte and 4 byte load should work just as well.

^ permalink raw reply

* Re: [PATCH 8/9] acpi: Use built-in RCU list checking for acpi_ioremaps list (v1)
From: Rafael J. Wysocki @ 2019-07-15 21:44 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Linux Kernel Mailing List, Alexey Kuznetsov, Bjorn Helgaas,
	Borislav Petkov, c0d1n61at3, David S. Miller, edumazet,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Josh Triplett, Kees Cook,
	Kernel Hardening, Cc: Android Kernel, Lai Jiangshan, Len Brown,
	ACPI Devel Maling List, open list:DOCUMENTATION, Linux PCI,
	Linux PM, Mathieu Desnoyers, NeilBrown, netdev, Oleg Nesterov,
	Paul E. McKenney, Pavel Machek, Peter Zijlstra, Rafael J. Wysocki,
	Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	Will Deacon, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190715143705.117908-9-joel@joelfernandes.org>

On Mon, Jul 15, 2019 at 4:43 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> list_for_each_entry_rcu has built-in RCU and lock checking. Make use of
> it for acpi_ioremaps list traversal.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/acpi/osl.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 9c0edf2fc0dd..2f9d0d20b836 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -14,6 +14,7 @@
>  #include <linux/slab.h>
>  #include <linux/mm.h>
>  #include <linux/highmem.h>
> +#include <linux/lockdep.h>
>  #include <linux/pci.h>
>  #include <linux/interrupt.h>
>  #include <linux/kmod.h>
> @@ -80,6 +81,7 @@ struct acpi_ioremap {
>
>  static LIST_HEAD(acpi_ioremaps);
>  static DEFINE_MUTEX(acpi_ioremap_lock);
> +#define acpi_ioremap_lock_held() lock_is_held(&acpi_ioremap_lock.dep_map)
>
>  static void __init acpi_request_region (struct acpi_generic_address *gas,
>         unsigned int length, char *desc)
> @@ -206,7 +208,7 @@ acpi_map_lookup(acpi_physical_address phys, acpi_size size)
>  {
>         struct acpi_ioremap *map;
>
> -       list_for_each_entry_rcu(map, &acpi_ioremaps, list)
> +       list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
>                 if (map->phys <= phys &&
>                     phys + size <= map->phys + map->size)
>                         return map;
> @@ -249,7 +251,7 @@ acpi_map_lookup_virt(void __iomem *virt, acpi_size size)
>  {
>         struct acpi_ioremap *map;
>
> -       list_for_each_entry_rcu(map, &acpi_ioremaps, list)
> +       list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
>                 if (map->virt <= virt &&
>                     virt + size <= map->virt + map->size)
>                         return map;
> --
> 2.22.0.510.g264f2c817a-goog
>

^ permalink raw reply

* [PATCH] bnx2x: Prevent load reordering in tx completion processing
From: Brian King @ 2019-07-15 21:41 UTC (permalink / raw)
  To: GR-everest-linux-l2; +Cc: skalluru, aelior, netdev, Brian King

This patch fixes an issue seen on Power systems with bnx2x which results
in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb
pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons
load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 656ed80..e2be5a6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata)
 	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
 	sw_cons = txdata->tx_pkt_cons;
 
+	/* Ensure subsequent loads occur after hw_cons */
+	smp_rmb();
+
 	while (sw_cons != hw_cons) {
 		u16 pkt_cons;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 21:09 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, Serge E. Hallyn, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190708181237.5poheliito7zpvmc@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 2:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 19:26, Paul Moore wrote:

...

> > I like the creativity, but I worry that at some point these
> > limitations are going to be raised (limits have a funny way of doing
> > that over time) and we will be in trouble.  I say "trouble" because I
> > want to be able to quickly do an audit container ID comparison and
> > we're going to pay a penalty for these larger values (we'll need this
> > when we add multiple auditd support and the requisite record routing).
> >
> > Thinking about this makes me also realize we probably need to think a
> > bit longer about audit container ID conflicts between orchestrators.
> > Right now we just take the value that is given to us by the
> > orchestrator, but if we want to allow multiple container orchestrators
> > to work without some form of cooperation in userspace (I think we have
> > to assume the orchestrators will not talk to each other) we likely
> > need to have some way to block reuse of an audit container ID.  We
> > would either need to prevent the orchestrator from explicitly setting
> > an audit container ID to a currently in use value, or instead generate
> > the audit container ID in the kernel upon an event triggered by the
> > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > start looking at the idr code, I think we will need to make use of it.
>
> To address this, I'd suggest that it is enforced to only allow the
> setting of descendants and to maintain a master list of audit container
> identifiers (with a hash table if necessary later) that includes the
> container owner.

We're discussing the audit container ID management policy elsewhere in
this thread so I won't comment on that here, but I did want to say
that we will likely need something better than a simple list of audit
container IDs from the start.  It's common for systems to have
thousands of containers now (or multiple thousands), which tells me
that a list is a poor choice.  You mentioned a hash table, so I would
suggest starting with that over the list for the initial patchset.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH v1] net: fec: optionally reset PHY via a reset-controller
From: Sven Van Asbroeck @ 2019-07-15 21:05 UTC (permalink / raw)
  To: Fugang Duan; +Cc: David S . Miller, netdev, linux-kernel

The current fec driver allows the PHY to be reset via a gpio,
specified in the devicetree. However, some PHYs need to be reset
in a more complex way.

To accommodate such PHYs, allow an optional reset controller
in the fec devicetree. If no reset controller is found, the
fec will fall back to the legacy reset behaviour.

Example:
&fec {
	phy-mode = "rgmii";
	resets = <&phy_reset>;
	reset-names = "phy";
	status = "okay";
};

Signed-off-by: Sven Van Asbroeck <TheSven73@gmail.com>
---

Will send a Documentation patch if this receives a positive review.

 drivers/net/ethernet/freescale/fec_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 38f10f7dcbc3..5a5f3ed6f16d 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -61,6 +61,7 @@
 #include <linux/regulator/consumer.h>
 #include <linux/if_vlan.h>
 #include <linux/pinctrl/consumer.h>
+#include <linux/reset.h>
 #include <linux/prefetch.h>
 #include <soc/imx/cpuidle.h>
 
@@ -3335,6 +3336,7 @@ static int fec_enet_get_irq_cnt(struct platform_device *pdev)
 static int
 fec_probe(struct platform_device *pdev)
 {
+	struct reset_control *phy_reset;
 	struct fec_enet_private *fep;
 	struct fec_platform_data *pdata;
 	struct net_device *ndev;
@@ -3490,7 +3492,9 @@ fec_probe(struct platform_device *pdev)
 	pm_runtime_set_active(&pdev->dev);
 	pm_runtime_enable(&pdev->dev);
 
-	ret = fec_reset_phy(pdev);
+	phy_reset = devm_reset_control_get_exclusive(&pdev->dev, "phy");
+	ret = IS_ERR(phy_reset) ? fec_reset_phy(pdev) :
+			reset_control_reset(phy_reset);
 	if (ret)
 		goto failed_reset;
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 21:04 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Serge E. Hallyn, Tycho Andersen, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190708180558.5bar6ripag3sdadl@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 15:29, Paul Moore wrote:

...

> > [REMINDER: It is an "*audit* container ID" and not a general
> > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> >
> > I'm not interested in supporting/merging something that isn't useful;
> > if this doesn't work for your use case then we need to figure out what
> > would work.  It sounds like nested containers are much more common in
> > the lxc world, can you elaborate a bit more on this?
> >
> > As far as the possible solutions you mention above, I'm not sure I
> > like the per-userns audit container IDs, I'd much rather just emit the
> > necessary tracking information via the audit record stream and let the
> > log analysis tools figure it out.  However, the bigger question is how
> > to limit (re)setting the audit container ID when you are in a non-init
> > userns.  For reasons already mentioned, using capable() is a non
> > starter for everything but the initial userns, and using ns_capable()
> > is equally poor as it essentially allows any userns the ability to
> > munge it's audit container ID (obviously not good).  It appears we
> > need a different method for controlling access to the audit container
> > ID.
>
> We're not quite ready yet for multiple audit daemons and possibly not
> yet for audit namespaces, but this is starting to look a lot like the
> latter.

A few quick comments on audit namespaces: the audit container ID is
not envisioned as a new namespace (even in nested form) and neither do
I consider running multiple audit daemons to be a new namespace.

From my perspective we create namespaces to allow us to redefine a
global resource for some subset of the system, e.g. providing a unique
/tmp for some number of processes on the system.  While it may be
tempting to think of the audit container ID as something we could
"namespace", especially when multiple audit daemons are concerned, in
some ways this would be counter productive; the audit container ID is
intended to be a global ID that can be used to associate audit event
records with a "container" where the "container" is defined by an
orchestrator outside the audit subsystem.  The global nature of the
audit container ID allows us to maintain a sane(ish) view of the
system in the audit log, if we were to "namespace" the audit container
ID such that the value was no longer guaranteed to be unique
throughout the system, we would need to additionally track the audit
namespace along with the audit container ID which starts to border on
insanity IMHO.

> If we can't trust ns_capable() then why are we passing on
> CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> gained otherwise?  Can it be inserted by cotainer image?  I think the
> answer is "no".  Either we trust ns_capable() or we have audit
> namespaces (recommend based on user namespace) (or both).

My thinking is that since ns_capable() checks the credentials with
respect to the current user namespace we can't rely on it to control
access since it would be possible for a privileged process running
inside an unprivileged container to manipulate the audit container ID
(containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
the container, while the container itself does not).

> At this point I would say we are at an impasse unless we trust
> ns_capable() or we implement audit namespaces.

I'm not sure how we can trust ns_capable(), but if you can think of a
way I would love to hear it.  I'm also not sure how namespacing audit
is helpful (see my above comments), but if you think it is please
explain.

> I don't think another mechanism to trust nested orchestrators/engines
> will buy us anything.
>
> Am I missing something?

Based on your questions/comments above it looks like your
understanding of ns_capable() does not match mine; if I'm thinking
about ns_capable() incorrectly, please educate me.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-15 20:58 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190711201633.552292e6@cakuba.netronome.com>

Jakub Kicinski wrote:
> On Thu, 11 Jul 2019 14:25:54 -0700, John Fastabend wrote:
> > Jakub Kicinski wrote:
> > > On Thu, 11 Jul 2019 09:47:16 -0700, John Fastabend wrote:  
> > > > Jakub Kicinski wrote:  
> > > > > On Wed, 10 Jul 2019 12:34:17 -0700, Jakub Kicinski wrote:    
> > > > > > > > > +		if (sk->sk_prot->unhash)
> > > > > > > > > +			sk->sk_prot->unhash(sk);
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	ctx = tls_get_ctx(sk);
> > > > > > > > > +	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> > > > > > > > > +		tls_sk_proto_cleanup(sk, ctx, timeo);    
> > > > > 
> > > > > Do we still need to hook into unhash? With patch 6 in place perhaps we
> > > > > can just do disconnect 🥺    
> > > > 
> > > > ?? "can just do a disconnect", not sure I folow. We still need unhash
> > > > in cases where we have a TLS socket transition from ESTABLISHED
> > > > to LISTEN state without calling close(). This is independent of if
> > > > sockmap is running or not.
> > > > 
> > > > Originally, I thought this would be extremely rare but I did see it
> > > > in real applications on the sockmap side so presumably it is possible
> > > > here as well.  
> > > 
> > > Ugh, sorry, I meant shutdown. Instead of replacing the unhash callback
> > > replace the shutdown callback. We probably shouldn't release the socket
> > > lock either there, but we can sleep, so I'll be able to run the device
> > > connection remove callback (which sleep).
> > 
> > ah OK seems doable to me. Do you want to write that on top of this
> > series? Or would you like to push it onto your branch and I can pull
> > it in push the rest of the patches on top and send it out? I think
> > if you can get to it in the next few days then it makes sense to wait.
> 
> Mm.. perhaps its easiest if we forget about HW for now and get SW 
> to work? Once you get the SW to 100% I can probably figure out what 
> to do for HW, but I feel like we got too many moving parts ATM.

Hi Jack,

I went ahead and pushed a v3 with your patches at the front. This resolves
a set of issues for me so I think it makes sense to push now and look
to resolve any further issues later. I'll look into the close with pending
data potential issue to see if it is/is-not a real issue.

Thanks,
John

^ permalink raw reply

* Re: [PATCH iproute2-rc 1/8] rdma: Update uapi headers to add statistic counter support
From: Stephen Hemminger @ 2019-07-15 20:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-2-leon@kernel.org>

On Wed, 10 Jul 2019 10:24:48 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Mark Zhang <markz@mellanox.com>
> 
> Update rdma_netlink.h to kernel commit 6e7be47a5345 ("RDMA/nldev:
> Allow get default counter statistics through RDMA netlink").
> 
> Signed-off-by: Mark Zhang <markz@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

I am waiting on this until it gets to Linus's tree.

^ permalink raw reply

* Re: [PATCH iproute2 master 0/3] devlink dumpit fixes
From: Stephen Hemminger @ 2019-07-15 20:51 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: netdev, moshe, ayal
In-Reply-To: <1562756601-19171-1-git-send-email-tariqt@mellanox.com>

On Wed, 10 Jul 2019 14:03:18 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> Hi,
> 
> This series from Aya contains several fixes for devlink health
> dump show command with binary data.
> 
> In patch 1 we replace the usage of doit with a dumpit, which
> is non-blocking and allows transferring larger amount of data.
> 
> Patches 2 and 3 fix the output for binary data prints, for both
> json and non-json.
> 
> Series generated against master commit:
> 2eb23f3e7aaf devlink: Show devlink port number
> 
> Regards,
> Tariq
> 
> Aya Levin (3):
>   devlink: Change devlink health dump show command to dumpit
>   devlink: Fix binary values print
>   devlink: Remove enclosing array brackets binary print with json format
> 
>  devlink/devlink.c | 41 +++++++++++++++++++++--------------------
>  1 file changed, 21 insertions(+), 20 deletions(-)
> 

Applied

^ permalink raw reply

* [bpf PATCH v3 8/8] bpf: sockmap/tls, close can race with map free
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |    8 +++++++-
 include/net/tcp.h     |    3 +++
 net/core/skmsg.c      |    4 ++--
 net/ipv4/tcp_ulp.c    |   13 +++++++++++++
 net/tls/tls_main.c    |   48 ++++++++++++++++++++++++++++++++++++------------
 5 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 50ced8aba9db..e4b3fb4bb77c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -354,7 +354,13 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
-		sk->sk_prot = psock->sk_proto;
+		struct inet_connection_sock *icsk = inet_csk(sk);
+		bool has_ulp = !!icsk->icsk_ulp_data;
+
+		if (has_ulp)
+			tcp_update_ulp(sk, psock->sk_proto);
+		else
+			sk->sk_prot = psock->sk_proto;
 		psock->sk_proto = NULL;
 	}
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cca3c59b98bf..f4702c8b9b8c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2102,6 +2102,8 @@ struct tcp_ulp_ops {
 
 	/* initialize ulp */
 	int (*init)(struct sock *sk);
+	/* update ulp */
+	void (*update)(struct sock *sk, struct proto *p);
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
 
@@ -2113,6 +2115,7 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
+void tcp_update_ulp(struct sock *sk, struct proto *p);
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 93bffaad2135..6832eeb4b785 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -585,12 +585,12 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-	rcu_assign_sk_user_data(sk, NULL);
 	sk_psock_cork_free(psock);
 	sk_psock_zap_ingress(psock);
-	sk_psock_restore_proto(sk, psock);
 
 	write_lock_bh(&sk->sk_callback_lock);
+	sk_psock_restore_proto(sk, psock);
+	rcu_assign_sk_user_data(sk, NULL);
 	if (psock->progs.skb_parser)
 		sk_psock_stop_strp(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 3d8a1d835471..4849edb62d52 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -96,6 +96,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
 	rcu_read_unlock();
 }
 
+void tcp_update_ulp(struct sock *sk, struct proto *proto)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (!icsk->icsk_ulp_ops) {
+		sk->sk_prot = proto;
+		return;
+	}
+
+	if (icsk->icsk_ulp_ops->update)
+		icsk->icsk_ulp_ops->update(sk, proto);
+}
+
 void tcp_cleanup_ulp(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index f4cb0522fa95..e67e687f79a2 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -323,15 +323,16 @@ static void tls_sk_proto_unhash(struct sock *sk)
 	long timeo = sock_sndtimeo(sk, 0);
 	struct tls_context *ctx;
 
-	if (unlikely(!icsk->icsk_ulp_data)) {
-		if (sk->sk_prot->unhash)
-			sk->sk_prot->unhash(sk);
-	}
-
 	ctx = tls_get_ctx(sk);
 	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
 		tls_sk_proto_cleanup(sk, ctx, timeo);
+
+	write_lock_bh(&sk->sk_callback_lock);
 	icsk->icsk_ulp_data = NULL;
+	if (sk->sk_prot->unhash == tls_sk_proto_unhash)
+		sk->sk_prot = ctx->sk_proto;
+	write_unlock_bh(&sk->sk_callback_lock);
+
 	tls_ctx_free_wq(ctx);
 
 	if (ctx->unhash)
@@ -340,15 +341,17 @@ static void tls_sk_proto_unhash(struct sock *sk)
 
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
-	void (*sk_proto_close)(struct sock *sk, long timeout);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tls_context *ctx = tls_get_ctx(sk);
 	long timeo = sock_sndtimeo(sk, 0);
 
+	if (unlikely(!ctx))
+		return;
+
 	if (ctx->tx_conf == TLS_SW)
 		tls_sw_cancel_work_tx(ctx);
 
 	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
 
 	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
 		goto skip_tx_cleanup;
@@ -356,17 +359,20 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
-	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
+	write_lock_bh(&sk->sk_callback_lock);
+	icsk->icsk_ulp_data = NULL;
+	if (sk->sk_prot->close == tls_sk_proto_close)
+		sk->sk_prot = ctx->sk_proto;
+	write_unlock_bh(&sk->sk_callback_lock);
 	release_sock(sk);
 	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
 		tls_sw_strparser_done(ctx);
 	if (ctx->rx_conf == TLS_SW)
 		tls_sw_free_ctx_rx(ctx);
-	sk_proto_close(sk, timeout);
-
+	ctx->sk_proto_close(sk, timeout);
 	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
 	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
@@ -833,7 +839,7 @@ static int tls_init(struct sock *sk)
 	int rc = 0;
 
 	if (tls_hw_prot(sk))
-		goto out;
+		return 0;
 
 	/* The TLS ulp is currently supported only for TCP sockets
 	 * in ESTABLISHED state.
@@ -844,22 +850,39 @@ static int tls_init(struct sock *sk)
 	if (sk->sk_state != TCP_ESTABLISHED)
 		return -ENOTSUPP;
 
+	tls_build_proto(sk);
+
 	/* allocate tls context */
+	write_lock_bh(&sk->sk_callback_lock);
 	ctx = create_ctx(sk);
 	if (!ctx) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
-	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
 	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
+	write_unlock_bh(&sk->sk_callback_lock);
 	return rc;
 }
 
+static void tls_update(struct sock *sk, struct proto *p)
+{
+	struct tls_context *ctx;
+
+	ctx = tls_get_ctx(sk);
+	if (likely(ctx)) {
+		ctx->sk_proto_close = p->close;
+		ctx->unhash = p->unhash;
+		ctx->sk_proto = p;
+	} else {
+		sk->sk_prot = p;
+	}
+}
+
 void tls_register_device(struct tls_device *device)
 {
 	spin_lock_bh(&device_spinlock);
@@ -880,6 +903,7 @@ static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
 	.name			= "tls",
 	.owner			= THIS_MODULE,
 	.init			= tls_init,
+	.update			= tls_update,
 };
 
 static int __init tls_register(void)


^ permalink raw reply related

* Re: [PATCH iproute2 v2] utils: don't match empty strings as prefixes
From: Stephen Hemminger @ 2019-07-15 20:49 UTC (permalink / raw)
  To: Matteo Croce; +Cc: netdev, David Ahern
In-Reply-To: <20190715180430.19902-1-mcroce@redhat.com>

On Mon, 15 Jul 2019 20:04:30 +0200
Matteo Croce <mcroce@redhat.com> wrote:

> iproute has an utility function which checks if a string is a prefix for
> another one, to allow use of abbreviated commands, e.g. 'addr' or 'a'
> instead of 'address'.
> 
> This routine unfortunately considers an empty string as prefix
> of any pattern, leading to undefined behaviour when an empty
> argument is passed to ip:
> 
>     # ip ''
>     1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>         link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>         inet 127.0.0.1/8 scope host lo
>            valid_lft forever preferred_lft forever
>         inet6 ::1/128 scope host
>            valid_lft forever preferred_lft forever
> 
>     # tc ''
>     qdisc noqueue 0: dev lo root refcnt 2
> 
>     # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0
>     # ip addr show dev dummy0
>     6: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
>         link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff
>         inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0
>            valid_lft forever preferred_lft forever
> 
> Rewrite matches() so it takes care of an empty input, and doesn't
> scan the input strings three times: the actual implementation
> does 2 strlen and a memcpy to accomplish the same task.
> 
> Signed-off-by: Matteo Croce <mcroce@redhat.com>

Thanks for following up. Applied

^ permalink raw reply

* [bpf PATCH v3 7/8] bpf: sockmap, only create entry if ulp is not already enabled
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

Sockmap does not currently support adding sockets after TLS has been
enabled. There never was a real use case for this so it was never
added. But, we lost the test for ULP at some point so add it here
and fail the socket insert if TLS is enabled. Future work could
make sockmap support this use case but fixup the bug here.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 56bcabe7c2f2..1330a7442e5b 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -334,6 +334,7 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 				  struct sock *sk, u64 flags)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_psock_link *link;
 	struct sk_psock *psock;
 	struct sock *osk;
@@ -344,6 +345,8 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 		return -EINVAL;
 	if (unlikely(idx >= map->max_entries))
 		return -E2BIG;
+	if (unlikely(icsk->icsk_ulp_data))
+		return -EINVAL;
 
 	link = sk_psock_init_link();
 	if (!link)


^ permalink raw reply related

* [bpf PATCH v3 6/8] bpf: sockmap, synchronize_rcu before free'ing map
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

We need to have a synchronize_rcu before free'ing the sockmap because
any outstanding psock references will have a pointer to the map and
when they use this could trigger a use after free.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 28702f2e9a4a..56bcabe7c2f2 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -247,6 +247,8 @@ static void sock_map_free(struct bpf_map *map)
 	raw_spin_unlock_bh(&stab->lock);
 	rcu_read_unlock();
 
+	synchronize_rcu();
+
 	bpf_map_area_free(stab->sks);
 	kfree(stab);
 }


^ permalink raw reply related

* [bpf PATCH v3 5/8] bpf: sockmap, sock_map_delete needs to use xchg
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

__sock_map_delete() may be called from a tcp event such as unhash or
close from the following trace,

  tcp_bpf_close()
    tcp_bpf_remove()
      sk_psock_unlink()
        sock_map_delete_from_link()
          __sock_map_delete()

In this case the sock lock is held but this only protects against
duplicate removals on the TCP side. If the map is free'd then we have
this trace,

  sock_map_free
    xchg()                  <- replaces map entry
    sock_map_unref()
      sk_psock_put()
        sock_map_del_link()

The __sock_map_delete() call however uses a read, test, null over the
map entry which can result in both paths trying to free the map
entry.

To fix use xchg in TCP paths as well so we avoid having two references
to the same map entry.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 52d4faeee18b..28702f2e9a4a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -276,16 +276,20 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
 	struct sock *sk;
+	int err = 0;
 
 	raw_spin_lock_bh(&stab->lock);
 	sk = *psk;
 	if (!sk_test || sk_test == sk)
-		*psk = NULL;
+		sk = xchg(psk, NULL);
+
+	if (likely(sk))
+		sock_map_unref(sk, psk);
+	else
+		err = -EINVAL;
+
 	raw_spin_unlock_bh(&stab->lock);
-	if (unlikely(!sk))
-		return -EINVAL;
-	sock_map_unref(sk, psk);
-	return 0;
+	return err;
 }
 
 static void sock_map_delete_from_link(struct bpf_map *map, struct sock *sk,


^ permalink raw reply related

* [bpf PATCH v3 4/8] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_dosconnect() without actually calling tcp_close which
would then call the tls close callback. Because of this a user could
disconnect a socket then put it in a LISTEN state which would break
our assumptions about sockets always being ESTABLISHED state.

More directly because close() can call unhash() and unhash is
implemented by sockmap if a sockmap socket has TLS enabled we can
incorrectly destroy the psock from unhash() and then call its close
handler again. But because the psock (sockmap socket representation)
is already destroyed we call close handler in sk->prot. However,
in some cases (TLS BASE/BASE case) this will still point at the
sockmap close handler resulting in a circular call and crash reported
by syzbot.

To fix both above issues implement the unhash() routine for TLS.

Fixes: 3c4d7559159bf ("tls: kernel TLS support")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/tls.h  |    5 ++++-
 net/tls/tls_main.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 72ddd16de056..79ef7049375d 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -251,6 +251,8 @@ struct tls_context {
 	u8 tx_conf:3;
 	u8 rx_conf:3;
 
+	struct proto *sk_proto;
+
 	int (*push_pending_record)(struct sock *sk, int flags);
 	void (*sk_write_space)(struct sock *sk);
 
@@ -288,6 +290,8 @@ struct tls_context {
 
 	struct list_head list;
 	refcount_t refcount;
+
+	struct work_struct gc;
 };
 
 enum tls_offload_ctx_dir {
@@ -359,7 +363,6 @@ void tls_sw_strparser_done(struct tls_context *tls_ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
-void tls_sw_close(struct sock *sk, long timeout);
 void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 9f4a9da182ae..f4cb0522fa95 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -251,6 +251,35 @@ static void tls_write_space(struct sock *sk)
 	ctx->sk_write_space(sk);
 }
 
+static void tls_ctx_free_deferred(struct work_struct *gc)
+{
+	struct tls_context *ctx = container_of(gc, struct tls_context, gc);
+
+	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
+		tls_sw_strparser_done(ctx);
+
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_free_ctx_rx(ctx);
+
+	/* Ensure any remaining work items are completed. The sk will
+	 * already have lost its tls_ctx reference by the time we get
+	 * here so no xmit operation will actually be performed.
+	 */
+	tls_sw_cancel_work_tx(ctx);
+	kfree(ctx);
+}
+
+static void tls_ctx_free_wq(struct tls_context *ctx)
+{
+	if (!ctx)
+		return;
+
+	memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
+	memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
+	INIT_WORK(&ctx->gc, tls_ctx_free_deferred);
+	schedule_work(&ctx->gc);
+}
+
 void tls_ctx_free(struct tls_context *ctx)
 {
 	if (!ctx)
@@ -288,6 +317,27 @@ static void tls_sk_proto_cleanup(struct sock *sk,
 #endif
 }
 
+static void tls_sk_proto_unhash(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+	struct tls_context *ctx;
+
+	if (unlikely(!icsk->icsk_ulp_data)) {
+		if (sk->sk_prot->unhash)
+			sk->sk_prot->unhash(sk);
+	}
+
+	ctx = tls_get_ctx(sk);
+	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
+		tls_sk_proto_cleanup(sk, ctx, timeo);
+	icsk->icsk_ulp_data = NULL;
+	tls_ctx_free_wq(ctx);
+
+	if (ctx->unhash)
+		ctx->unhash(sk);
+}
+
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
 	void (*sk_proto_close)(struct sock *sk, long timeout);
@@ -306,6 +356,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
+	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
@@ -611,6 +662,7 @@ static struct tls_context *create_ctx(struct sock *sk)
 	ctx->setsockopt = sk->sk_prot->setsockopt;
 	ctx->getsockopt = sk->sk_prot->getsockopt;
 	ctx->sk_proto_close = sk->sk_prot->close;
+	ctx->unhash = sk->sk_prot->unhash;
 	return ctx;
 }
 
@@ -734,20 +786,24 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 	prot[TLS_BASE][TLS_BASE].setsockopt	= tls_setsockopt;
 	prot[TLS_BASE][TLS_BASE].getsockopt	= tls_getsockopt;
 	prot[TLS_BASE][TLS_BASE].close		= tls_sk_proto_close;
+	prot[TLS_BASE][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_SW][TLS_BASE].sendmsg		= tls_sw_sendmsg;
 	prot[TLS_SW][TLS_BASE].sendpage		= tls_sw_sendpage;
+	prot[TLS_SW][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_BASE][TLS_SW].recvmsg		  = tls_sw_recvmsg;
 	prot[TLS_BASE][TLS_SW].stream_memory_read = tls_sw_stream_read;
 	prot[TLS_BASE][TLS_SW].close		  = tls_sk_proto_close;
+	prot[TLS_BASE][TLS_SW].unhash		  = tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_SW] = prot[TLS_SW][TLS_BASE];
 	prot[TLS_SW][TLS_SW].recvmsg		= tls_sw_recvmsg;
 	prot[TLS_SW][TLS_SW].stream_memory_read	= tls_sw_stream_read;
 	prot[TLS_SW][TLS_SW].close		= tls_sk_proto_close;
+	prot[TLS_SW][TLS_SW].unhash		= tls_sk_proto_unhash;
 
 #ifdef CONFIG_TLS_DEVICE
 	prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
@@ -798,6 +854,7 @@ static int tls_init(struct sock *sk)
 	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
+	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
 	return rc;


^ permalink raw reply related

* [bpf PATCH v3 3/8] tls: remove sock unlock/lock around strp_done()
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

The tls close() callback currently drops the sock lock to call
strp_done(). Split up the RX cleanup into stopping the strparser
and releasing most resources, syncing strparser and finally
freeing the context.

To avoid the need for a strp_done() call on the cleanup path
of device offload make sure we don't arm the strparser until
we are sure init will be successful.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h    |    4 ++-
 net/tls/tls_device.c |    1 -
 net/tls/tls_main.c   |   65 +++++++++++++++++++++++++-------------------------
 net/tls/tls_sw.c     |   33 ++++++++++++++++++-------
 4 files changed, 58 insertions(+), 45 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d4276cb6de53..72ddd16de056 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -107,9 +107,7 @@ struct tls_device {
 enum {
 	TLS_BASE,
 	TLS_SW,
-#ifdef CONFIG_TLS_DEVICE
 	TLS_HW,
-#endif
 	TLS_HW_RECORD,
 	TLS_NUM_CONFIG,
 };
@@ -357,6 +355,7 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
 void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
+void tls_sw_strparser_done(struct tls_context *tls_ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
@@ -365,6 +364,7 @@ void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
+void tls_sw_free_ctx_rx(struct tls_context *tls_ctx);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		   int nonblock, int flags, int *addr_len);
 bool tls_sw_stream_read(const struct sock *sk);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 4d67d72f007c..7c0b2b778703 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1045,7 +1045,6 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	rc = tls_set_sw_offload(sk, ctx, 0);
 	if (rc)
 		goto release_ctx;
-	tls_sw_strparser_arm(sk, ctx);
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
 					     &ctx->crypto_recv.info,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index ddda422498aa..9f4a9da182ae 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -261,27 +261,9 @@ void tls_ctx_free(struct tls_context *ctx)
 	kfree(ctx);
 }
 
-static void tls_sk_proto_close(struct sock *sk, long timeout)
+static void tls_sk_proto_cleanup(struct sock *sk,
+				 struct tls_context *ctx, long timeo)
 {
-	struct tls_context *ctx = tls_get_ctx(sk);
-	long timeo = sock_sndtimeo(sk, 0);
-	void (*sk_proto_close)(struct sock *sk, long timeout);
-	bool free_ctx = false;
-
-	if (ctx->tx_conf == TLS_SW)
-		tls_sw_cancel_work_tx(ctx);
-
-	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
-
-	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
-		goto skip_tx_cleanup;
-
-	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE) {
-		free_ctx = true;
-		goto skip_tx_cleanup;
-	}
-
 	if (unlikely(sk->sk_write_pending) &&
 	    !wait_on_pending_writer(sk, &timeo))
 		tls_handle_open_record(sk, 0);
@@ -298,27 +280,44 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	}
 
 	if (ctx->rx_conf == TLS_SW)
-		tls_sw_free_resources_rx(sk);
+		tls_sw_release_resources_rx(sk);
 
 #ifdef CONFIG_TLS_DEVICE
 	if (ctx->rx_conf == TLS_HW)
 		tls_device_offload_cleanup_rx(sk);
-
-	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW) {
-#else
-	{
 #endif
-		tls_ctx_free(ctx);
-		ctx = NULL;
-	}
+}
+
+static void tls_sk_proto_close(struct sock *sk, long timeout)
+{
+	void (*sk_proto_close)(struct sock *sk, long timeout);
+	struct tls_context *ctx = tls_get_ctx(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
+	lock_sock(sk);
+	sk_proto_close = ctx->sk_proto_close;
+
+	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
+		goto skip_tx_cleanup;
+
+	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
+		goto skip_tx_cleanup;
+
+	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
 	release_sock(sk);
+	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
+		tls_sw_strparser_done(ctx);
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_free_ctx_rx(ctx);
 	sk_proto_close(sk, timeout);
-	/* free ctx for TLS_HW_RECORD, used by tcp_set_state
-	 * for sk->sk_prot->unhash [tls_hw_unhash]
-	 */
-	if (free_ctx)
+
+	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
+	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
 }
 
@@ -544,9 +543,9 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 			rc = tls_set_sw_offload(sk, ctx, 0);
 			if (rc)
 				goto err_crypto_info;
-			tls_sw_strparser_arm(sk, ctx);
 			conf = TLS_SW;
 		}
+		tls_sw_strparser_arm(sk, ctx);
 	}
 
 	if (tx)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 38c0e53c727d..ee8fef312475 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2114,25 +2114,40 @@ void tls_sw_release_resources_rx(struct sock *sk)
 		skb_queue_purge(&ctx->rx_list);
 		crypto_free_aead(ctx->aead_recv);
 		strp_stop(&ctx->strp);
-		write_lock_bh(&sk->sk_callback_lock);
-		sk->sk_data_ready = ctx->saved_data_ready;
-		write_unlock_bh(&sk->sk_callback_lock);
-		release_sock(sk);
-		strp_done(&ctx->strp);
-		lock_sock(sk);
+		/* If tls_sw_strparser_arm() was not called (cleanup paths)
+		 * we still want to strp_stop(), but sk->sk_data_ready was
+		 * never swapped.
+		 */
+		if (ctx->saved_data_ready) {
+			write_lock_bh(&sk->sk_callback_lock);
+			sk->sk_data_ready = ctx->saved_data_ready;
+			write_unlock_bh(&sk->sk_callback_lock);
+		}
 	}
 }
 
-void tls_sw_free_resources_rx(struct sock *sk)
+void tls_sw_strparser_done(struct tls_context *tls_ctx)
 {
-	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
-	tls_sw_release_resources_rx(sk);
+	strp_done(&ctx->strp);
+}
+
+void tls_sw_free_ctx_rx(struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
 	kfree(ctx);
 }
 
+void tls_sw_free_resources_rx(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+
+	tls_sw_release_resources_rx(sk);
+	tls_sw_free_ctx_rx(tls_ctx);
+}
+
 /* The work handler to transmitt the encrypted records in tx_list */
 static void tx_work_handler(struct work_struct *work)
 {


^ permalink raw reply related

* [bpf PATCH v3 2/8] tls: remove close callback sock unlock/lock around TX work flush
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

The tls close() callback currently drops the sock lock, makes a
cancel_delayed_work_sync() call, and then relocks the sock.

By restructuring the code we can avoid droping lock and then
reclaiming it. To simplify this we do the following,

 tls_sk_proto_close
 set_bit(CLOSING)
 set_bit(SCHEDULE)
 cancel_delay_work_sync() <- cancel workqueue
 lock_sock(sk)
 ...
 release_sock(sk)
 strp_done()

Setting the CLOSING bit prevents the SCHEDULE bit from being
cleared by any workqueue items e.g. if one happens to be
scheduled and run between when we set SCHEDULE bit and cancel
work. Then because SCHEDULE bit is set now no new work will
be scheduled.

Tested with net selftests and bpf selftests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h  |    2 ++
 net/tls/tls_main.c |    3 +++
 net/tls/tls_sw.c   |   24 +++++++++++++++++-------
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 43f551cd508b..d4276cb6de53 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -162,6 +162,7 @@ struct tls_sw_context_tx {
 	int async_capable;
 
 #define BIT_TX_SCHEDULED	0
+#define BIT_TX_CLOSING		1
 	unsigned long tx_bitmask;
 };
 
@@ -360,6 +361,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
 void tls_sw_close(struct sock *sk, long timeout);
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 85a9d7d57b32..ddda422498aa 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -268,6 +268,9 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	void (*sk_proto_close)(struct sock *sk, long timeout);
 	bool free_ctx = false;
 
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
 	lock_sock(sk);
 	sk_proto_close = ctx->sk_proto_close;
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f58a8ffc2a9c..38c0e53c727d 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2054,6 +2054,15 @@ static void tls_data_ready(struct sock *sk)
 	}
 }
 
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+
+	set_bit(BIT_TX_CLOSING, &ctx->tx_bitmask);
+	set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask);
+	cancel_delayed_work_sync(&ctx->tx_work.work);
+}
+
 void tls_sw_free_resources_tx(struct sock *sk)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2065,11 +2074,6 @@ void tls_sw_free_resources_tx(struct sock *sk)
 	if (atomic_read(&ctx->encrypt_pending))
 		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 
-	release_sock(sk);
-	cancel_delayed_work_sync(&ctx->tx_work.work);
-	lock_sock(sk);
-
-	/* Tx whatever records we can transmit and abandon the rest */
 	tls_tx_records(sk, -1);
 
 	/* Free up un-sent records in tx_list. First, free
@@ -2137,11 +2141,17 @@ static void tx_work_handler(struct work_struct *work)
 					       struct tx_work, work);
 	struct sock *sk = tx_work->sk;
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+	struct tls_sw_context_tx *ctx;
 
-	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+	if (unlikely(!tls_ctx))
 		return;
 
+	ctx = tls_sw_ctx_tx(tls_ctx);
+	if (test_bit(BIT_TX_CLOSING, &ctx->tx_bitmask))
+		return;
+
+	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+		return;
 	lock_sock(sk);
 	tls_tx_records(sk, -1);
 	release_sock(sk);


^ permalink raw reply related

* [bpf PATCH v3 1/8] net/tls: don't arm strparser immediately in tls_set_sw_offload()
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

From: Jakub Kicinski <jakub.kicinski@netronome.com>

In tls_set_device_offload_rx() we prepare the software context
for RX fallback and proceed to add the connection to the device.
Unfortunately, software context prep includes arming strparser
so in case of a later error we have to release the socket lock
to call strp_done().

In preparation for not releasing the socket lock half way through
callbacks move arming strparser into a separate function.
Following patches will make use of that.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h    |    1 +
 net/tls/tls_device.c |    1 +
 net/tls/tls_main.c   |    8 +++++---
 net/tls/tls_sw.c     |   19 ++++++++++++-------
 4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 584609174fe0..43f551cd508b 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -355,6 +355,7 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 		  unsigned int optlen);
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 7c0b2b778703..4d67d72f007c 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1045,6 +1045,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	rc = tls_set_sw_offload(sk, ctx, 0);
 	if (rc)
 		goto release_ctx;
+	tls_sw_strparser_arm(sk, ctx);
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
 					     &ctx->crypto_recv.info,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 4674e57e66b0..85a9d7d57b32 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -526,6 +526,8 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 1);
+			if (rc)
+				goto err_crypto_info;
 			conf = TLS_SW;
 		}
 	} else {
@@ -537,13 +539,13 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 0);
+			if (rc)
+				goto err_crypto_info;
+			tls_sw_strparser_arm(sk, ctx);
 			conf = TLS_SW;
 		}
 	}
 
-	if (rc)
-		goto err_crypto_info;
-
 	if (tx)
 		ctx->tx_conf = conf;
 	else
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 53b4ad94e74a..f58a8ffc2a9c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2160,6 +2160,18 @@ void tls_sw_write_space(struct sock *sk, struct tls_context *ctx)
 	}
 }
 
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_rx *rx_ctx = tls_sw_ctx_rx(tls_ctx);
+
+	write_lock_bh(&sk->sk_callback_lock);
+	rx_ctx->saved_data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = tls_data_ready;
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	strp_check_rcv(&rx_ctx->strp);
+}
+
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2357,13 +2369,6 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		cb.parse_msg = tls_read_size;
 
 		strp_init(&sw_ctx_rx->strp, sk, &cb);
-
-		write_lock_bh(&sk->sk_callback_lock);
-		sw_ctx_rx->saved_data_ready = sk->sk_data_ready;
-		sk->sk_data_ready = tls_data_ready;
-		write_unlock_bh(&sk->sk_callback_lock);
-
-		strp_check_rcv(&sw_ctx_rx->strp);
 	}
 
 	goto out;


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox