Netdev List


* Re: [PATCH bpf-next v3 00/15] Introducing AF_XDP support
From: Alexei Starovoitov @ 2018-05-05  0:34 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Daniel Borkmann, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, Willem de Bruijn,
	Michael S. Tsirkin, Network Development, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <CAJ8uoz3V8x4uv8Xeb+qaVB0_Rkd73TuU=3ubvkDh9b7nAkXSyw@mail.gmail.com>

On Fri, May 04, 2018 at 01:22:17PM +0200, Magnus Karlsson wrote:
> On Fri, May 4, 2018 at 1:38 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
> >> On 05/02/2018 01:01 PM, Björn Töpel wrote:
> >> > From: Björn Töpel <bjorn.topel@intel.com>
> >> >
> >> > This patch set introduces a new address family called AF_XDP that is
> >> > optimized for high performance packet processing and, in upcoming
> >> > patch sets, zero-copy semantics. In this patch set, we have removed
> >> > all zero-copy related code in order to make it smaller, simpler and
> >> > hopefully more review friendly. This patch set only supports copy-mode
> >> > for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> >> > for RX using the XDP_DRV path. Zero-copy support requires XDP and
> >> > driver changes that Jesper Dangaard Brouer is working on. Some of his
> >> > work has already been accepted. We will publish our zero-copy support
> >> > for RX and TX on top of his patch sets at a later point in time.
> >>
> >> +1, would be great to see it land this cycle. Saw few minor nits here
> >> and there but nothing to hold it up, for the series:
> >>
> >> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> >>
> >> Thanks everyone!
> >
> > Great stuff!
> >
> > Applied to bpf-next, with one condition.
> > Upcoming zero-copy patches for both RX and TX need to be posted
> > and reviewed within this release window.
> > If netdev community as a whole won't be able to agree on the zero-copy
> > bits we'd need to revert this feature before the next merge window.
> 
> Thanks everyone for reviewing this. Highly appreciated.
> 
> Just so we understand the purpose correctly:
> 
> 1: Do you want to see the ZC patches in order to verify that the user
> space API holds? If so, we can produce an additional RFC  patch set
> using a big chunk of code that we had in RFC V1. We are not proud of
> this code since it is clunky, but it hopefully proves the point with
> the uapi being the same.
> 
> 2: And/Or are you worried about us all (the netdev community) not
> agreeing on a way to implement ZC internally in the drivers and the
> XDP infrastructure? This is not going to be possible to finish during
> this cycle since we do not like the implementation we had in RFC V1.
> Too intrusive and now we also have nicer abstractions from Jesper that
> we can use and extend to provide a (hopefully) much cleaner and less
> intrusive solution.

short answer: both.

Cleanliness and performance of the ZC code are not as important as
getting the API right. The main concern is that during the ZC review
process we will find out that the existing API has issues, so we have
to do this exercise before the merge window.
And an RFC won't fly. Send the patches for real. They have to go
through proper code review. The hackers of the netdev community
can accept a partial, or a bit unclean, or slightly inefficient
implementation, since it can and will be improved later,
but we cannot change the API once it goes into an official release.

Here is an example of an API concern:
this patch set added the shared umem concept. It sounds good in theory,
but will it perform well with ZC? Earlier RFCs didn't have that
feature. If it won't perform well then it shouldn't be in the tree.
The key reason to let AF_XDP into the tree is its performance promise.
If it doesn't perform we should rip it out and redesign.


* pull-request: bpf-next 2018-05-05
From: Daniel Borkmann @ 2018-05-05  0:25 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add initial infrastructure for AF_XDP sockets, which is optimized
   for high performance packet processing. This early work only adds
   copy-mode; zero-copy semantics with driver changes will land in
   subsequent patches. An AF_XDP socket has an RX and/or TX queue
   associated with it for receiving and sending packets. In contrast to
   AF_PACKET v2/3, descriptor queues are separated from packet buffers
   such that an RX or TX descriptor points to a data buffer in a memory
   area called UMEM. The latter can be shared so that packets don't need
   to be copied between RX and TX. An XDP BPF program will steer the
   packets to one of the AF_XDP sockets via a new BPF map called XSKMAP,
   from Björn and Magnus.

2) Add nfp BPF offload support for the bpf_event_output() helper. Having
   the driver reimplement and manage the perf array itself seems fragile
   and unnecessary, therefore the approach taken is that FW messages that
   carry the events are pushed out to the RB. Additionally, bpftool gets
   support to connect to a perf event map and dump ring buffer contents,
   useful for debugging purposes, from Jakub.

3) Add a new eBPF JIT for x86_32. Like in the arm32 case, 64 bit div/mod
   and xadd are still missing, as well as BPF to BPF calls, but other than
   that it's functional and numbers show 30% to 50% improvement compared
   to the interpreter, from Wang.

4) Implement a new BPF helper bpf_get_stack() to overcome limitations
   of stackmap and the bpf_get_stackid() helper. bpf_get_stack() allows
   sending stack traces directly to the BPF program, which can perform
   in-kernel processing and push them out via bpf_perf_event_output(),
   from Yonghong.

5) Remove LD_ABS and LD_IND as native eBPF instructions and implement
   them as rewrites. This significantly reduces complexity in JITs
   while keeping similar performance characteristics, and allows JITs
   to evolve better long term by having them all in C only, from Daniel.

6) Improve the code logic related to managing subprog information by
   unifying main prog and subprogs, unifying entry points and stack
   depth tracking into struct bpf_subprog_info, and adding end marker
   into subprog_info array to simplify iteration logic, from Jiong.

7) Remove tracepoints from BPF core as they started to rot away,
   causing panics triggered from syzkaller. Earlier ones from BPF
   fs were already removed, so follow up with the rest since we also
   have better introspection infrastructure these days, from Alexei.

8) Relax the bpf_current_task_under_cgroup() helper to allow usage in
   interrupt which is particularly useful for BPF programs attached
   to perf events, from Teng.

9) Formatting fixes in the new BPF uapi helper documentation for
   bpf_perf_event_read() and bpf_get_stack() and relaxing whitespace
   constraints in bpf_helpers_doc.py to ease documentation, from Quentin.

10) Dump the bpftool 'loaded at:' information in ISO 8601 format in
    the plain variant and seconds since the Epoch in JSON to ease parsing,
    also from Quentin.

11) Various cleanups mostly around coding and comment style, and several
    capitalization, typo and grammar fixups in comments for the x64 BPF
    JIT, from Ingo.

12) Fix up BPF context struct types in uapi BPF helper documentation
    where some of them were mistakenly using kernel types, from Andrey.

13) Document that under CONFIG_BPF_JIT_ALWAYS_ON mode the bpf_jit_enable
    mode 2 is not available, from Leo.

14) Import erspan uapi header file into tools infra so that BPF tunnel
    helpers can use it and won't cause issues due to missing headers on
    some systems, from William.
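
To make the descriptor/UMEM split described in item 1 concrete, here is a
minimal user-space sketch. It is not the AF_XDP uapi: the struct names, the
ring layout, the frame size and the forward_one() helper are all assumptions
made up for illustration. It only shows the idea that a descriptor carries an
offset into one shared buffer area, so a packet can move from an RX ring to a
TX ring without its payload being copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 2048u	/* hypothetical frame size */
#define NUM_FRAMES 4u		/* hypothetical number of frames */
#define RING_SIZE  4u		/* must be a power of two */

/* Hypothetical descriptor: names packet data by its offset in the buffer
 * area instead of embedding the data itself.
 */
struct desc {
	uint32_t offset;	/* byte offset of the frame in umem[] */
	uint32_t len;		/* payload length in bytes */
};

/* Hypothetical single-threaded ring of descriptors. */
struct ring {
	struct desc slots[RING_SIZE];
	uint32_t head;		/* producer index, free running */
	uint32_t tail;		/* consumer index, free running */
};

/* One buffer area shared by the RX and TX rings (stand-in for UMEM). */
static uint8_t umem[FRAME_SIZE * NUM_FRAMES];

static int ring_full(const struct ring *r)
{
	return r->head - r->tail == RING_SIZE;
}

static int ring_push(struct ring *r, struct desc d)
{
	if (ring_full(r))
		return -1;
	r->slots[r->head++ & (RING_SIZE - 1)] = d;
	return 0;
}

static int ring_pop(struct ring *r, struct desc *d)
{
	if (r->head == r->tail)
		return -1;	/* empty */
	*d = r->slots[r->tail++ & (RING_SIZE - 1)];
	return 0;
}

/* Move one packet from RX to TX. Only the 8-byte descriptor is copied;
 * the payload stays where it is in umem[]. Returns a pointer to the
 * payload, or NULL if RX is empty or TX is full.
 */
static const uint8_t *forward_one(struct ring *rx, struct ring *tx)
{
	struct desc d;

	if (ring_full(tx) || ring_pop(rx, &d) < 0)
		return NULL;
	assert((size_t)d.offset + d.len <= sizeof(umem));
	ring_push(tx, d);	/* cannot fail: TX space was checked above */
	return umem + d.offset;
}
```

A real AF_XDP socket also has fill and completion rings and shares its rings
with the kernel through mmap(); none of that is modelled here.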

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

This has a minor merge conflict in tools/testing/selftests/bpf/test_progs.c.
Resolution is to take the hunk from bpf-next tree and change the first CHECK()
condition such that the missing '\n' is added to the end of the string, like:

        if (CHECK(build_id_matches < 1, "build id match",
                  "Didn't find expected build ID from the map\n"))
                goto disable_pmu;

Let me know if you run into any other unforeseen issue. Thanks a lot!

----------------------------------------------------------------

The following changes since commit 79741a38b4a2538a68342c45b813ecb9dd648ee8:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2018-04-26 21:19:50 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to e94fa1d93117e7f1eb783dc9cae6c70650944449:

  bpf, xskmap: fix crash in xsk_map_alloc error path handling (2018-05-04 14:55:54 -0700)

----------------------------------------------------------------
Alexei Starovoitov (5):
      Merge branch 'bpf_get_stack'
      Merge branch 'fix-bpf-helpers-doc'
      bpf: remove tracepoints from bpf core
      Merge branch 'AF_XDP-initial-support'
      Merge branch 'move-ld_abs-to-native-BPF'

Andrey Ignatov (2):
      bpf: Fix helpers ctx struct types in uapi doc
      bpf: Sync bpf.h to tools/

Björn Töpel (7):
      net: initial AF_XDP skeleton
      xsk: add user memory registration support sockopt
      xsk: add Rx queue setup and mmap support
      xsk: add Rx receive functions and poll support
      bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
      xsk: wire up XDP_DRV side of AF_XDP
      xsk: wire up XDP_SKB side of AF_XDP

Daniel Borkmann (17):
      Merge branch 'bpf-formatting-fixes-helpers'
      bpf: prefix cbpf internal helpers with bpf_
      bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier
      bpf: implement ld_abs/ld_ind in native bpf
      bpf: add skb_load_bytes_relative helper
      bpf, x64: remove ld_abs/ld_ind
      bpf, arm64: remove ld_abs/ld_ind
      bpf, sparc64: remove ld_abs/ld_ind
      bpf, arm32: remove ld_abs/ld_ind
      bpf, mips64: remove ld_abs/ld_ind
      bpf, ppc64: remove ld_abs/ld_ind
      bpf, s390x: remove ld_abs/ld_ind
      bpf, x32: remove ld_abs/ld_ind
      bpf: sync tools bpf.h uapi header
      Merge branch 'bpf-subprog-mgmt-cleanup'
      Merge branch 'bpf-event-output-offload'
      bpf, xskmap: fix crash in xsk_map_alloc error path handling

Ingo Molnar (1):
      x86/bpf: Clean up non-standard comments, to make the code more readable

Jakub Kicinski (10):
      bpf: offload: allow offloaded programs to use perf event arrays
      nfp: bpf: record offload neutral maps in the driver
      bpf: export bpf_event_output()
      bpf: replace map pointer loads before calling into offloads
      nfp: bpf: perf event output helpers support
      nfp: bpf: rewrite map pointers with NFP TIDs
      tools: bpftool: fold hex keyword in command help
      tools: bpftool: move get_possible_cpus() to common code
      tools: bpftool: add simple perf event output reader
      bpf: fix references to free_bpf_prog_info() in comments

Jiong Wang (3):
      bpf: unify main prog and subprog
      bpf: centre subprog information fields
      bpf: add faked "ending" subprog

Leo Yan (1):
      bpf, doc: Update bpf_jit_enable limitation for CONFIG_BPF_JIT_ALWAYS_ON

Magnus Karlsson (8):
      xsk: add umem fill queue support and mmap
      xsk: add support for bind for Rx
      xsk: add umem completion queue support and mmap
      xsk: add Tx queue setup and mmap support
      dev: packet: make packet_direct_xmit a common function
      xsk: support for Tx
      xsk: statistics support
      samples/bpf: sample application and documentation for AF_XDP sockets

Quentin Monnet (5):
      bpf: fix formatting for bpf_perf_event_read() helper doc
      bpf: fix formatting for bpf_get_stack() helper doc
      bpf: update bpf.h uapi header for tools
      tools: bpftool: change time format for program 'loaded at:' information
      bpf: relax constraints on formatting for eBPF helper documentation

Teng Qin (1):
      bpf: Allow bpf_current_task_under_cgroup in interrupt

Wang YanQing (1):
      bpf, x86_32: add eBPF JIT compiler for ia32

William Tu (1):
      tools, include: Grab a copy of linux/erspan.h

Yonghong Song (11):
      bpf: change prototype for stack_map_get_build_id_offset
      bpf: add bpf_get_stack helper
      bpf/verifier: refine retval R0 state for bpf_get_stack helper
      bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals
      bpf/verifier: improve register value range tracking with ARSH
      tools/bpf: add bpf_get_stack helper to tools headers
      samples/bpf: move common-purpose trace functions to selftests
      tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
      tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
      tools/bpf: add a test for bpf_get_stack with tracepoint prog
      samples/bpf: fix kprobe attachment issue on x64

 Documentation/networking/af_xdp.rst                |  297 +++
 Documentation/networking/filter.txt                |    6 +
 Documentation/networking/index.rst                 |    1 +
 Documentation/sysctl/net.txt                       |    1 +
 MAINTAINERS                                        |    9 +-
 arch/arm/net/bpf_jit_32.c                          |   77 -
 arch/arm64/net/bpf_jit_comp.c                      |   65 -
 arch/mips/net/ebpf_jit.c                           |  104 -
 arch/powerpc/net/Makefile                          |    2 +-
 arch/powerpc/net/bpf_jit64.h                       |   37 +-
 arch/powerpc/net/bpf_jit_asm64.S                   |  180 --
 arch/powerpc/net/bpf_jit_comp64.c                  |  109 +-
 arch/s390/net/Makefile                             |    2 +-
 arch/s390/net/bpf_jit.S                            |  116 -
 arch/s390/net/bpf_jit.h                            |   20 +-
 arch/s390/net/bpf_jit_comp.c                       |  127 +-
 arch/sparc/net/Makefile                            |    5 +-
 arch/sparc/net/bpf_jit_64.h                        |   29 -
 arch/sparc/net/bpf_jit_asm_64.S                    |  162 --
 arch/sparc/net/bpf_jit_comp_64.c                   |   79 +-
 arch/x86/Kconfig                                   |    2 +-
 arch/x86/include/asm/nospec-branch.h               |   30 +-
 arch/x86/net/Makefile                              |    7 +-
 arch/x86/net/bpf_jit.S                             |  154 --
 arch/x86/net/bpf_jit_comp.c                        |  343 +--
 arch/x86/net/bpf_jit_comp32.c                      | 2419 ++++++++++++++++++++
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c      |   16 +-
 drivers/net/ethernet/netronome/nfp/bpf/fw.h        |   20 +-
 drivers/net/ethernet/netronome/nfp/bpf/jit.c       |   76 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.c      |   28 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.h      |   24 +-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c   |  172 +-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c  |   78 +-
 drivers/net/ethernet/netronome/nfp/nfp_app.c       |    2 +-
 include/linux/bpf.h                                |   35 +-
 include/linux/bpf_trace.h                          |    1 -
 include/linux/bpf_types.h                          |    3 +
 include/linux/bpf_verifier.h                       |    9 +-
 include/linux/filter.h                             |    9 +-
 include/linux/netdevice.h                          |    1 +
 include/linux/socket.h                             |    5 +-
 include/linux/tnum.h                               |    4 +-
 include/net/xdp.h                                  |    1 +
 include/net/xdp_sock.h                             |   66 +
 include/trace/events/bpf.h                         |  355 ---
 include/uapi/linux/bpf.h                           |   94 +-
 include/uapi/linux/if_xdp.h                        |   87 +
 kernel/bpf/Makefile                                |    3 +
 kernel/bpf/core.c                                  |  108 +-
 kernel/bpf/inode.c                                 |   16 +-
 kernel/bpf/offload.c                               |    6 +-
 kernel/bpf/stackmap.c                              |   80 +-
 kernel/bpf/syscall.c                               |   17 +-
 kernel/bpf/tnum.c                                  |   10 +
 kernel/bpf/verifier.c                              |  247 +-
 kernel/bpf/xskmap.c                                |  241 ++
 kernel/trace/bpf_trace.c                           |   52 +-
 lib/test_bpf.c                                     |  570 +++--
 net/Kconfig                                        |    1 +
 net/Makefile                                       |    1 +
 net/core/dev.c                                     |   73 +-
 net/core/filter.c                                  |  345 ++-
 net/core/sock.c                                    |   12 +-
 net/core/xdp.c                                     |   15 +-
 net/packet/af_packet.c                             |   42 +-
 net/xdp/Kconfig                                    |    7 +
 net/xdp/Makefile                                   |    2 +
 net/xdp/xdp_umem.c                                 |  260 +++
 net/xdp/xdp_umem.h                                 |   67 +
 net/xdp/xdp_umem_props.h                           |   23 +
 net/xdp/xsk.c                                      |  656 ++++++
 net/xdp/xsk_queue.c                                |   73 +
 net/xdp/xsk_queue.h                                |  247 ++
 samples/bpf/Makefile                               |   15 +-
 samples/bpf/bpf_load.c                             |   97 +-
 samples/bpf/bpf_load.h                             |    7 -
 samples/bpf/offwaketime_user.c                     |    1 +
 samples/bpf/sampleip_user.c                        |    1 +
 samples/bpf/spintest_user.c                        |    1 +
 samples/bpf/trace_event_user.c                     |    1 +
 samples/bpf/trace_output_user.c                    |  110 +-
 samples/bpf/xdpsock.h                              |   11 +
 samples/bpf/xdpsock_kern.c                         |   56 +
 samples/bpf/xdpsock_user.c                         |  948 ++++++++
 scripts/bpf_helpers_doc.py                         |   14 +-
 security/selinux/hooks.c                           |    4 +-
 security/selinux/include/classmap.h                |    4 +-
 tools/bpf/bpftool/Documentation/bpftool-map.rst    |   40 +-
 tools/bpf/bpftool/Documentation/bpftool.rst        |    2 +-
 tools/bpf/bpftool/Makefile                         |    7 +-
 tools/bpf/bpftool/bash-completion/bpftool          |   36 +-
 tools/bpf/bpftool/common.c                         |   77 +-
 tools/bpf/bpftool/main.h                           |    7 +-
 tools/bpf/bpftool/map.c                            |   80 +-
 tools/bpf/bpftool/map_perf_ring.c                  |  347 +++
 tools/bpf/bpftool/prog.c                           |    8 +-
 tools/include/uapi/linux/bpf.h                     |   93 +-
 tools/include/uapi/linux/erspan.h                  |   52 +
 tools/testing/selftests/bpf/Makefile               |    4 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |    2 +
 tools/testing/selftests/bpf/test_get_stack_rawtp.c |  102 +
 tools/testing/selftests/bpf/test_progs.c           |  242 +-
 .../selftests/bpf/test_stacktrace_build_id.c       |   20 +-
 tools/testing/selftests/bpf/test_stacktrace_map.c  |   19 +-
 tools/testing/selftests/bpf/test_verifier.c        |  311 ++-
 tools/testing/selftests/bpf/trace_helpers.c        |  180 ++
 tools/testing/selftests/bpf/trace_helpers.h        |   23 +
 107 files changed, 8852 insertions(+), 2713 deletions(-)
 create mode 100644 Documentation/networking/af_xdp.rst
 delete mode 100644 arch/powerpc/net/bpf_jit_asm64.S
 delete mode 100644 arch/s390/net/bpf_jit.S
 delete mode 100644 arch/sparc/net/bpf_jit_asm_64.S
 delete mode 100644 arch/x86/net/bpf_jit.S
 create mode 100644 arch/x86/net/bpf_jit_comp32.c
 create mode 100644 include/net/xdp_sock.h
 delete mode 100644 include/trace/events/bpf.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 kernel/bpf/xskmap.c
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c
 create mode 100644 tools/bpf/bpftool/map_perf_ring.c
 create mode 100644 tools/include/uapi/linux/erspan.h
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.h


* Re: [PATCH net-next] net/ipv6: rename rt6_next to fib6_next
From: David Miller @ 2018-05-04 23:55 UTC (permalink / raw)
  To: dsahern; +Cc: netdev
In-Reply-To: <20180504205424.10948-1-dsahern@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Fri,  4 May 2018 13:54:24 -0700

> This slipped through the cracks in the followup set to the fib6_info flip.
> Rename rt6_next to fib6_next.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied, thanks David.


* Re: pull-request: bpf 2018-05-05
From: David Miller @ 2018-05-04 23:50 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev
In-Reply-To: <20180504222147.18850-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Sat,  5 May 2018 00:21:47 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Sanitize attr->{prog,map}_type from bpf(2) since used as an array index
>    to retrieve prog/map specific ops such that we prevent potential out of
>    bounds value under speculation, from Mark and Daniel.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.
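
For readers unfamiliar with the class of fix in the quoted item 1, the sketch
below shows one common way to keep a user-controlled array index in bounds
even if the bounds check is mispredicted: compute a mask without a branch and
AND it into the index. This is a user-space illustration only, and it is an
assumption that the actual fix works in this style; index_mask(),
lookup_type() and the type_names[] table are hypothetical names. The masking
arithmetic is only valid while index and size are both below LONG_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Return ~0UL when index < size and 0 otherwise, without branching.
 * If index >= size, then (size - 1 - index) wraps around and has its
 * top bit set; if index < size, neither operand has the top bit set.
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	const unsigned int top = sizeof(unsigned long) * CHAR_BIT - 1;
	unsigned long oob = (index | (size - 1UL - index)) >> top;

	return oob - 1UL;	/* 0 -> all ones, 1 -> zero */
}

/* Hypothetical ops table indexed by a value supplied by the caller. */
static const char *const type_names[] = {
	"unspec", "socket_filter", "kprobe", "sched_cls",
};

#define NUM_TYPES (sizeof(type_names) / sizeof(type_names[0]))

static const char *lookup_type(unsigned long type)
{
	if (type >= NUM_TYPES)
		return NULL;
	/* Clamp again with data flow rather than control flow, so an
	 * out-of-range value becomes index 0 on a mispredicted path.
	 */
	type &= index_mask(type, NUM_TYPES);
	return type_names[type];
}
```

A production version must also stop the compiler from optimizing the mask
away after the bounds check; this sketch makes no attempt to do that.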


* Re: [PATCH] net: disable UDP punt on sockets in RCV_SHUTDWON
From: Eric Dumazet @ 2018-05-04 23:44 UTC (permalink / raw)
  To: Chintan Shah, davem, kuznet, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
  Cc: kamensky, takondra, xe-linux-external, enkechen
In-Reply-To: <1525468117-61242-1-git-send-email-chintsha@cisco.com>



On 05/04/2018 02:08 PM, Chintan Shah wrote:
> A UDP application which opens multiple sockets with same local
> address/port combination (using SO_REUSEPORT/SO_REUSEADDR socket options);
> and issues connect to a remote socket (using one of these local socket).
> Now if the same socket, which issued connect, issues shutdown (SHUT_RD);
> packets would still be queued to this socket (if sent from same remote
> client, which the local socket connected to), and not delivered to the
> other socket in the normal state.
> 

Confusing changelog.

sk_shutdown is on a different cache line, so this additional fetch would cause
loss of performance if many sockets are scanned in the hash bucket.

If you are trying to add full 4-tuple hash table to UDP, and accept() ability,
this would require a bit more than this hack...
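
The cache line remark above can be checked mechanically for any structure.
The sketch below uses a made-up struct fake_sock (it is not struct sock, and
its field layout is an assumption chosen for illustration) together with an
assumed 64-byte cache line, and shows how offsetof() tells whether two fields
that a lookup loop reads would be served by the same line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u	/* assumed cache line size in bytes */

/* Hypothetical layout: lookup keys first, shutdown state much later. */
struct fake_sock {
	unsigned int   hash;
	unsigned short port;
	unsigned short family;
	char           other_state[96];
	unsigned char  shutdown;
};

/* Index of the cache line a member starts in, assuming the structure
 * itself begins on a cache line boundary.
 */
#define CACHE_LINE_OF(type, member) \
	(offsetof(type, member) / CACHE_LINE)

#define SAME_CACHE_LINE(type, a, b) \
	(CACHE_LINE_OF(type, a) == CACHE_LINE_OF(type, b))

/* Sanity check that the lookup keys of the hypothetical struct are
 * all served by a single cache line fetch.
 */
static void check_hot_fields(void)
{
	assert(SAME_CACHE_LINE(struct fake_sock, hash, port));
	assert(SAME_CACHE_LINE(struct fake_sock, hash, family));
}
```

With this layout, testing shutdown inside a loop that otherwise only compares
hash and port touches a second cache line for every socket scanned.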


* Re: [PATCH net] sctp: delay the authentication for the duplicated cookie-echo chunk
From: Marcelo Ricardo Leitner @ 2018-05-04 22:33 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <091d842812b99059231ff87e9bb7dff175336525.1525424710.git.lucien.xin@gmail.com>

On Fri, May 04, 2018 at 05:05:10PM +0800, Xin Long wrote:
> Now sctp only delays the authentication for the normal cookie-echo
> chunk by setting chunk->auth_chunk in sctp_endpoint_bh_rcv(). But
> for the duplicated one with auth, in sctp_assoc_bh_rcv(), it does
> authentication first based on the old asoc, which will definitely
> fail due to the different auth info in the old asoc.
>
> The duplicated cookie-echo chunk will create a new asoc with the
> auth info from this chunk, and the authentication should also be
> done with the new asoc's auth info for all of the collision 'A',
> 'B' and 'D'. Otherwise, the duplicated cookie-echo chunk with auth
> will never pass the authentication and create the new connection.
>
> This issue exists since very beginning, and this fix is to make
> sctp_assoc_bh_rcv() follow the way sctp_assoc_bh_rcv() does for
   I guess you meant sctp_endpoint_bh_rcv here --^ right?

Other than this LGTM

> the normal cookie-echo chunk to delay the authentication.
>
> While at it, remove the unused params from sctp_sf_authenticate()
> and define sctp_auth_chunk_verify() used for all the places that
> do the delayed authentication.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/sctp/associola.c    | 30 ++++++++++++++++-
>  net/sctp/sm_statefuns.c | 86 ++++++++++++++++++++++++++-----------------------
>  2 files changed, 75 insertions(+), 41 deletions(-)
>
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 837806d..a47179d 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1024,8 +1024,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>  	struct sctp_endpoint *ep;
>  	struct sctp_chunk *chunk;
>  	struct sctp_inq *inqueue;
> -	int state;
> +	int first_time = 1;	/* is this the first time through the loop */
>  	int error = 0;
> +	int state;
>
>  	/* The association should be held so we should be safe. */
>  	ep = asoc->ep;
> @@ -1036,6 +1037,30 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>  		state = asoc->state;
>  		subtype = SCTP_ST_CHUNK(chunk->chunk_hdr->type);
>
> +		/* If the first chunk in the packet is AUTH, do special
> +		 * processing specified in Section 6.3 of SCTP-AUTH spec
> +		 */
> +		if (first_time && subtype.chunk == SCTP_CID_AUTH) {
> +			struct sctp_chunkhdr *next_hdr;
> +
> +			next_hdr = sctp_inq_peek(inqueue);
> +			if (!next_hdr)
> +				goto normal;
> +
> +			/* If the next chunk is COOKIE-ECHO, skip the AUTH
> +			 * chunk while saving a pointer to it so we can do
> +			 * Authentication later (during cookie-echo
> +			 * processing).
> +			 */
> +			if (next_hdr->type == SCTP_CID_COOKIE_ECHO) {
> +				chunk->auth_chunk = skb_clone(chunk->skb,
> +							      GFP_ATOMIC);
> +				chunk->auth = 1;
> +				continue;
> +			}
> +		}
> +
> +normal:
>  		/* SCTP-AUTH, Section 6.3:
>  		 *    The receiver has a list of chunk types which it expects
>  		 *    to be received only after an AUTH-chunk.  This list has
> @@ -1074,6 +1099,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>  		/* If there is an error on chunk, discard this packet. */
>  		if (error && chunk)
>  			chunk->pdiscard = 1;
> +
> +		if (first_time)
> +			first_time = 0;
>  	}
>  	sctp_association_put(asoc);
>  }
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index 28c070e..c9ae340 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -153,10 +153,7 @@ static enum sctp_disposition sctp_sf_violation_chunk(
>  					struct sctp_cmd_seq *commands);
>
>  static enum sctp_ierror sctp_sf_authenticate(
> -					struct net *net,
> -					const struct sctp_endpoint *ep,
>  					const struct sctp_association *asoc,
> -					const union sctp_subtype type,
>  					struct sctp_chunk *chunk);
>
>  static enum sctp_disposition __sctp_sf_do_9_1_abort(
> @@ -626,6 +623,38 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
>  	return SCTP_DISPOSITION_CONSUME;
>  }
>
> +static bool sctp_auth_chunk_verify(struct net *net, struct sctp_chunk *chunk,
> +				   const struct sctp_association *asoc)
> +{
> +	struct sctp_chunk auth;
> +
> +	if (!chunk->auth_chunk)
> +		return true;
> +
> +	/* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
> +	 * is supposed to be authenticated and we have to do delayed
> +	 * authentication.  We've just recreated the association using
> +	 * the information in the cookie and now it's much easier to
> +	 * do the authentication.
> +	 */
> +
> +	/* Make sure that we and the peer are AUTH capable */
> +	if (!net->sctp.auth_enable || !asoc->peer.auth_capable)
> +		return false;
> +
> +	/* set-up our fake chunk so that we can process it */
> +	auth.skb = chunk->auth_chunk;
> +	auth.asoc = chunk->asoc;
> +	auth.sctp_hdr = chunk->sctp_hdr;
> +	auth.chunk_hdr = (struct sctp_chunkhdr *)
> +				skb_push(chunk->auth_chunk,
> +					 sizeof(struct sctp_chunkhdr));
> +	skb_pull(chunk->auth_chunk, sizeof(struct sctp_chunkhdr));
> +	auth.transport = chunk->transport;
> +
> +	return sctp_sf_authenticate(asoc, &auth) == SCTP_IERROR_NO_ERROR;
> +}
> +
>  /*
>   * Respond to a normal COOKIE ECHO chunk.
>   * We are the side that is being asked for an association.
> @@ -763,37 +792,9 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
>  	if (error)
>  		goto nomem_init;
>
> -	/* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
> -	 * is supposed to be authenticated and we have to do delayed
> -	 * authentication.  We've just recreated the association using
> -	 * the information in the cookie and now it's much easier to
> -	 * do the authentication.
> -	 */
> -	if (chunk->auth_chunk) {
> -		struct sctp_chunk auth;
> -		enum sctp_ierror ret;
> -
> -		/* Make sure that we and the peer are AUTH capable */
> -		if (!net->sctp.auth_enable || !new_asoc->peer.auth_capable) {
> -			sctp_association_free(new_asoc);
> -			return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
> -		}
> -
> -		/* set-up our fake chunk so that we can process it */
> -		auth.skb = chunk->auth_chunk;
> -		auth.asoc = chunk->asoc;
> -		auth.sctp_hdr = chunk->sctp_hdr;
> -		auth.chunk_hdr = (struct sctp_chunkhdr *)
> -					skb_push(chunk->auth_chunk,
> -						 sizeof(struct sctp_chunkhdr));
> -		skb_pull(chunk->auth_chunk, sizeof(struct sctp_chunkhdr));
> -		auth.transport = chunk->transport;
> -
> -		ret = sctp_sf_authenticate(net, ep, new_asoc, type, &auth);
> -		if (ret != SCTP_IERROR_NO_ERROR) {
> -			sctp_association_free(new_asoc);
> -			return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
> -		}
> +	if (!sctp_auth_chunk_verify(net, chunk, new_asoc)) {
> +		sctp_association_free(new_asoc);
> +		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
>  	}
>
>  	repl = sctp_make_cookie_ack(new_asoc, chunk);
> @@ -1797,13 +1798,15 @@ static enum sctp_disposition sctp_sf_do_dupcook_a(
>  	if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
>  		goto nomem;
>
> +	if (!sctp_auth_chunk_verify(net, chunk, new_asoc))
> +		return SCTP_DISPOSITION_DISCARD;
> +
>  	/* Make sure no new addresses are being added during the
>  	 * restart.  Though this is a pretty complicated attack
>  	 * since you'd have to get inside the cookie.
>  	 */
> -	if (!sctp_sf_check_restart_addrs(new_asoc, asoc, chunk, commands)) {
> +	if (!sctp_sf_check_restart_addrs(new_asoc, asoc, chunk, commands))
>  		return SCTP_DISPOSITION_CONSUME;
> -	}
>
>  	/* If the endpoint is in the SHUTDOWN-ACK-SENT state and recognizes
>  	 * the peer has restarted (Action A), it MUST NOT setup a new
> @@ -1912,6 +1915,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_b(
>  	if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
>  		goto nomem;
>
> +	if (!sctp_auth_chunk_verify(net, chunk, new_asoc))
> +		return SCTP_DISPOSITION_DISCARD;
> +
>  	/* Update the content of current association.  */
>  	sctp_add_cmd_sf(commands, SCTP_CMD_UPDATE_ASSOC, SCTP_ASOC(new_asoc));
>  	sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
> @@ -2009,6 +2015,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_d(
>  	 * a COOKIE ACK.
>  	 */
>
> +	if (!sctp_auth_chunk_verify(net, chunk, asoc))
> +		return SCTP_DISPOSITION_DISCARD;
> +
>  	/* Don't accidentally move back into established state. */
>  	if (asoc->state < SCTP_STATE_ESTABLISHED) {
>  		sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_STOP,
> @@ -4171,10 +4180,7 @@ enum sctp_disposition sctp_sf_eat_fwd_tsn_fast(
>   * The return value is the disposition of the chunk.
>   */
>  static enum sctp_ierror sctp_sf_authenticate(
> -					struct net *net,
> -					const struct sctp_endpoint *ep,
>  					const struct sctp_association *asoc,
> -					const union sctp_subtype type,
>  					struct sctp_chunk *chunk)
>  {
>  	struct sctp_shared_key *sh_key = NULL;
> @@ -4275,7 +4281,7 @@ enum sctp_disposition sctp_sf_eat_auth(struct net *net,
>  						  commands);
>
>  	auth_hdr = (struct sctp_authhdr *)chunk->skb->data;
> -	error = sctp_sf_authenticate(net, ep, asoc, type, chunk);
> +	error = sctp_sf_authenticate(asoc, chunk);
>  	switch (error) {
>  	case SCTP_IERROR_AUTH_BAD_HMAC:
>  		/* Generate the ERROR chunk and discard the rest
> --
> 2.1.0
>

* Re: [PATCH bpf-next 09/10] tools: bpftool: add simple perf event output reader
From: Jakub Kicinski @ 2018-05-04 22:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: daniel, oss-drivers, netdev, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim
In-Reply-To: <20180504212501.hn2rnv7t3ik563mg@ast-mbp>

CC perf folks

On Fri, 4 May 2018 14:25:03 -0700, Alexei Starovoitov wrote:
> > +static void
> > +perf_event_read(struct event_ring_info *ring, void **buf, size_t *buf_len)
> > +{
> > +	volatile struct perf_event_mmap_page *header = ring->mem;
> > +	__u64 buffer_size = MMAP_PAGE_CNT * get_page_size();
> > +	__u64 data_tail = header->data_tail;
> > +	__u64 data_head = header->data_head;
> > +	void *base, *begin, *end;
> > +
> > +	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
> > +	if (data_head == data_tail)
> > +		return;  
> 
> this function was copied several times into different places.
> I think it's time to put into common lib. Like libbpf.

Agreed, I think libbpf would work, although there is nothing BPF
specific in this loop AFAICT now.
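
For illustration only, here is a minimal sketch of what a shared, non-BPF-specific helper for the head/tail check could look like. It is not code from this series: struct ring_header is a hypothetical stand-in for the two fields of struct perf_event_mmap_page that the loop reads, and the names ring_pending() and ring_consume() are assumptions. It uses C11 atomics in place of the compiler barrier quoted above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the data_head/data_tail fields of
 * struct perf_event_mmap_page; not the real kernel layout.
 */
struct ring_header {
	_Atomic uint64_t data_head;	/* advanced by the producer */
	_Atomic uint64_t data_tail;	/* advanced by the consumer */
};

/* Return the number of unread bytes. The acquire load on data_head
 * plays the role of the read barrier: record contents must not be
 * read before the head value that published them.
 */
static uint64_t ring_pending(struct ring_header *hdr)
{
	uint64_t tail = atomic_load_explicit(&hdr->data_tail,
					     memory_order_relaxed);
	uint64_t head = atomic_load_explicit(&hdr->data_head,
					     memory_order_acquire);

	return head - tail;
}

/* Mark nbytes as consumed. The release store makes sure all reads of
 * the records finish before the space is handed back to the producer.
 */
static void ring_consume(struct ring_header *hdr, uint64_t nbytes)
{
	uint64_t tail = atomic_load_explicit(&hdr->data_tail,
					     memory_order_relaxed);

	atomic_store_explicit(&hdr->data_tail, tail + nbytes,
			      memory_order_release);
}
```

A caller would return early when ring_pending() is zero, which corresponds to the data_head == data_tail test in the quoted function.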

> Would be great if you can do it in the follow up.

Looking into it now, I found these:

$ git grep 'data_head == data_tail'
tools/bpf/bpftool/map_perf_ring.c:      if (data_head == data_tail)
tools/testing/selftests/bpf/trace_helpers.c:    if (data_head == data_tail)

Are there any other copies I should try to cater to?  I have changed a
few things compared to the selftest, and I guess others may have modified
their copies too.  Just trying to make sure that what we put in libbpf
would cater to most possible use cases.

Should I also move bpf_perf_event_open()/test_bpf_perf_event() to libbpf?

> for the set:
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Thanks!

* Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation
From: Alexander Duyck @ 2018-05-04 22:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Netdev, Willem de Bruijn, David Miller
In-Reply-To: <52c9b572-ddcd-94ea-b9b6-787ca924698a@gmail.com>

On Fri, May 4, 2018 at 1:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 05/04/2018 11:30 AM, Alexander Duyck wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>
>> This patch is meant to allow us to avoid having to recompute the checksum
>> from scratch and have it passed as a parameter.
>>
>> Instead of taking that approach we can take advantage of the fact that the
>> length that was used to compute the existing checksum is included in the
>> UDP header. If we cancel that out by adding the value XOR with 0xFFFF we
>> can then just add the new length in and fold that into the new result.
>>
>
>>
>> +     uh = udp_hdr(segs);
>> +
>> +     /* compute checksum adjustment based on old length versus new */
>> +     newlen = htons(sizeof(*uh) + mss);
>> +     check = ~csum_fold((__force __wsum)((__force u32)uh->check +
>> +                                         ((__force u32)uh->len ^ 0xFFFF) +
>> +                                         (__force u32)newlen));
>> +
>
>
> Can't this use csum_sub() instead of this ^ 0xFFFF trick ?

I could but that actually adds more instructions to all this since
csum_sub will perform the inversion across a 32b checksum when we only
need to bitflip a 16 bit value. I had considered doing (u16)(~uh->len)
but thought type casting it more than once would be a pain as well.

What I wanted to avoid is having to do the extra math to account for
the rollover. Adding 3 16 bit values will result in at most an 18 bit
value which can then be folded. Doing it this way we avoid that extra
add w/ carry logic that is needed for csum_add/sub.
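
As a rough user-space sketch of the arithmetic described above (not the kernel code itself): the function names below are hypothetical, and the sum is a plain, uninverted ones'-complement sum over 16-bit words, standing in for the partial checksum held in uh->check. The length word is cancelled by adding its 16-bit complement, the new length is added, and the result is folded once.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xFFFFu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: ones'-complement sum over n 16-bit words, from scratch. */
static uint16_t sum_words(const uint16_t *words, size_t n)
{
	uint32_t acc = 0;
	size_t i;

	for (i = 0; i < n; i++)
		acc = fold32(acc + words[i]);
	return (uint16_t)acc;
}

/* Incremental update: adding (old_len ^ 0xFFFF) subtracts old_len in
 * ones'-complement arithmetic. Three 16-bit addends give at most an
 * 18-bit value (3 * 0xFFFF = 0x2FFFD), so a single fold is enough.
 */
static uint16_t sum_replace_len(uint16_t sum, uint16_t old_len,
				uint16_t new_len)
{
	uint32_t acc = (uint32_t)sum +
		       (uint32_t)(old_len ^ 0xFFFFu) +
		       (uint32_t)new_len;

	return fold32(acc);
}
```

For any input where at least one word is non-zero, sum_replace_len() gives the same result as recomputing the sum with sum_words() after changing the length word.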

* pull-request: bpf 2018-05-05
From: Daniel Borkmann @ 2018-05-04 22:21 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Sanitize attr->{prog,map}_type from bpf(2) since used as an array index
   to retrieve prog/map specific ops such that we prevent potential out of
   bounds value under speculation, from Mark and Daniel.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit a8d7aa17bbc970971ccdf71988ea19230ab368b1:

  dccp: fix tasklet usage (2018-05-03 15:14:57 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to d0f1a451e33d9ca834422622da30aa68daade56b:

  bpf: use array_index_nospec in find_prog_type (2018-05-03 19:29:35 -0700)

----------------------------------------------------------------
Daniel Borkmann (1):
      bpf: use array_index_nospec in find_prog_type

Mark Rutland (1):
      bpf: fix possible spectre-v1 in find_and_alloc_map()

 kernel/bpf/syscall.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)
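
For readers unfamiliar with the technique, the following is a user-space sketch of the branch-free index clamping that array_index_nospec is built on. It only illustrates the mask arithmetic and is not a hardened mitigation: an optimizing compiler is free to transform it, and the kernel uses architecture-specific code where needed. The names index_mask(), clamp_index(), lookup_type() and the table are hypothetical. It assumes size <= LONG_MAX and that right-shifting a negative long is an arithmetic shift (true for GCC and Clang, implementation-defined in ISO C).

```c
#include <limits.h>
#include <stddef.h>

/* All ones when index < size, zero otherwise, computed without a
 * branch. Assumes size <= LONG_MAX and an arithmetic right shift.
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* Force an out-of-bounds index to 0 so that even a mispredicted
 * bounds check cannot be used to read past the end of the array.
 */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask(index, size);
}

#define NUM_TYPES 4

/* Hypothetical per-type table, standing in for a prog/map ops array. */
static const char *const type_names[NUM_TYPES] = {
	"unspec", "hash", "array", "prog_array",
};

static const char *lookup_type(unsigned long type)
{
	if (type >= NUM_TYPES)
		return NULL;
	type = clamp_index(type, NUM_TYPES);	/* clamp after the check */
	return type_names[type];
}
```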

* Re: [PATCH iproute2] rdma: fix header files
From: David Ahern @ 2018-05-04 22:13 UTC (permalink / raw)
  To: Stephen Hemminger, swise; +Cc: netdev
In-Reply-To: <20180504215608.11305-1-stephen@networkplumber.org>

On 5/4/18 3:56 PM, Stephen Hemminger wrote:
> All user api headers in iproute2 should be in include/uapi
> so that script can be used to put correct sanitized kernel headers
> there. And the header files for rdma must be a complete set; if one
> header file includes another, all must be present.
> 
> This fixes build on older distributions, and Windows Services
> for Linux.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  include/uapi/rdma/ib_user_sa.h                |   77 ++
>  include/uapi/rdma/ib_user_verbs.h             | 1210 +++++++++++++++++
>  .../uapi/rdma/rdma_netlink.h                  |   13 +
>  .../uapi/rdma/rdma_user_cm.h                  |    6 +-
>  4 files changed, 1303 insertions(+), 3 deletions(-)
>  create mode 100644 include/uapi/rdma/ib_user_sa.h
>  create mode 100644 include/uapi/rdma/ib_user_verbs.h
>  rename {rdma/include => include}/uapi/rdma/rdma_netlink.h (95%)
>  rename {rdma/include => include}/uapi/rdma/rdma_user_cm.h (98%)
> 

Stephen:

Per a recent discussion the RDMA folks need to take ownership of the
uapi files. RDMA features do not hit Dave's net-next tree so the rdma
code can never hit iproute2-next during a dev cycle.

* Re: [PATCH v2 4/4] smack: provide socketpair callback
From: Casey Schaufler @ 2018-05-04 22:01 UTC (permalink / raw)
  To: David Herrmann, linux-kernel
  Cc: James Morris, Paul Moore, teg, Stephen Smalley, selinux,
	linux-security-module, Eric Paris, serge, davem, netdev
In-Reply-To: <20180504142822.15233-5-dh.herrmann@gmail.com>

On 5/4/2018 7:28 AM, David Herrmann wrote:
> From: Tom Gundersen <teg@jklm.no>
>
> Make sure to implement the new socketpair callback so the SO_PEERSEC
> call on socketpair(2)s will return correct information.
>
> Signed-off-by: Tom Gundersen <teg@jklm.no>
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>

This doesn't look like it will cause any problems.
I've only been able to test it in a general way. I
haven't created specific tests, but it passes the
usual Smack use cases.

Acked-by: Casey Schaufler <casey@schaufler-ca.com>

> ---
>  security/smack/smack_lsm.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index 0b414836bebd..dcb976f98df2 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -2842,6 +2842,27 @@ static int smack_socket_post_create(struct socket *sock, int family,
>  	return smack_netlabel(sock->sk, SMACK_CIPSO_SOCKET);
>  }
>  
> +/**
> + * smack_socket_socketpair - create socket pair
> + * @socka: one socket
> + * @sockb: another socket
> + *
> + * Cross reference the peer labels for SO_PEERSEC
> + *
> + * Returns 0 on success, and error code otherwise
> + */
> +static int smack_socket_socketpair(struct socket *socka,
> +		                   struct socket *sockb)
> +{
> +	struct socket_smack *asp = socka->sk->sk_security;
> +	struct socket_smack *bsp = sockb->sk->sk_security;
> +
> +	asp->smk_packet = bsp->smk_out;
> +	bsp->smk_packet = asp->smk_out;
> +
> +	return 0;
> +}
> +
>  #ifdef SMACK_IPV6_PORT_LABELING
>  /**
>   * smack_socket_bind - record port binding information.
> @@ -4724,6 +4745,7 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(unix_may_send, smack_unix_may_send),
>  
>  	LSM_HOOK_INIT(socket_post_create, smack_socket_post_create),
> +	LSM_HOOK_INIT(socket_socketpair, smack_socket_socketpair),
>  #ifdef SMACK_IPV6_PORT_LABELING
>  	LSM_HOOK_INIT(socket_bind, smack_socket_bind),
>  #endif

* [PATCH iproute2] rdma: fix header files
From: Stephen Hemminger @ 2018-05-04 21:56 UTC (permalink / raw)
  To: swise; +Cc: netdev, Stephen Hemminger

All user api headers in iproute2 should be in include/uapi
so that script can be used to put correct sanitized kernel headers
there. And the header files for rdma must be a complete set; if one
header file includes another, all must be present.

This fixes build on older distributions, and Windows Services
for Linux.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 include/uapi/rdma/ib_user_sa.h                |   77 ++
 include/uapi/rdma/ib_user_verbs.h             | 1210 +++++++++++++++++
 .../uapi/rdma/rdma_netlink.h                  |   13 +
 .../uapi/rdma/rdma_user_cm.h                  |    6 +-
 4 files changed, 1303 insertions(+), 3 deletions(-)
 create mode 100644 include/uapi/rdma/ib_user_sa.h
 create mode 100644 include/uapi/rdma/ib_user_verbs.h
 rename {rdma/include => include}/uapi/rdma/rdma_netlink.h (95%)
 rename {rdma/include => include}/uapi/rdma/rdma_user_cm.h (98%)

diff --git a/include/uapi/rdma/ib_user_sa.h b/include/uapi/rdma/ib_user_sa.h
new file mode 100644
index 000000000000..0d2607f0cd20
--- /dev/null
+++ b/include/uapi/rdma/ib_user_sa.h
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+/*
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef IB_USER_SA_H
+#define IB_USER_SA_H
+
+#include <linux/types.h>
+
+enum {
+	IB_PATH_GMP		= 1,
+	IB_PATH_PRIMARY		= (1<<1),
+	IB_PATH_ALTERNATE	= (1<<2),
+	IB_PATH_OUTBOUND	= (1<<3),
+	IB_PATH_INBOUND		= (1<<4),
+	IB_PATH_INBOUND_REVERSE = (1<<5),
+	IB_PATH_BIDIRECTIONAL	= IB_PATH_OUTBOUND | IB_PATH_INBOUND_REVERSE
+};
+
+struct ib_path_rec_data {
+	__u32	flags;
+	__u32	reserved;
+	__u32	path_rec[16];
+};
+
+struct ib_user_path_rec {
+	__u8	dgid[16];
+	__u8	sgid[16];
+	__be16	dlid;
+	__be16	slid;
+	__u32	raw_traffic;
+	__be32	flow_label;
+	__u32	reversible;
+	__u32	mtu;
+	__be16	pkey;
+	__u8	hop_limit;
+	__u8	traffic_class;
+	__u8	numb_path;
+	__u8	sl;
+	__u8	mtu_selector;
+	__u8	rate_selector;
+	__u8	rate;
+	__u8	packet_life_time_selector;
+	__u8	packet_life_time;
+	__u8	preference;
+};
+
+#endif /* IB_USER_SA_H */
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
new file mode 100644
index 000000000000..9be07394fdbe
--- /dev/null
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -0,0 +1,1210 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+/*
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ * Copyright (c) 2005, 2006 Cisco Systems.  All rights reserved.
+ * Copyright (c) 2005 PathScale, Inc.  All rights reserved.
+ * Copyright (c) 2006 Mellanox Technologies.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef IB_USER_VERBS_H
+#define IB_USER_VERBS_H
+
+#include <linux/types.h>
+
+/*
+ * Increment this value if any changes that break userspace ABI
+ * compatibility are made.
+ */
+#define IB_USER_VERBS_ABI_VERSION	6
+#define IB_USER_VERBS_CMD_THRESHOLD    50
+
+enum {
+	IB_USER_VERBS_CMD_GET_CONTEXT,
+	IB_USER_VERBS_CMD_QUERY_DEVICE,
+	IB_USER_VERBS_CMD_QUERY_PORT,
+	IB_USER_VERBS_CMD_ALLOC_PD,
+	IB_USER_VERBS_CMD_DEALLOC_PD,
+	IB_USER_VERBS_CMD_CREATE_AH,
+	IB_USER_VERBS_CMD_MODIFY_AH,
+	IB_USER_VERBS_CMD_QUERY_AH,
+	IB_USER_VERBS_CMD_DESTROY_AH,
+	IB_USER_VERBS_CMD_REG_MR,
+	IB_USER_VERBS_CMD_REG_SMR,
+	IB_USER_VERBS_CMD_REREG_MR,
+	IB_USER_VERBS_CMD_QUERY_MR,
+	IB_USER_VERBS_CMD_DEREG_MR,
+	IB_USER_VERBS_CMD_ALLOC_MW,
+	IB_USER_VERBS_CMD_BIND_MW,
+	IB_USER_VERBS_CMD_DEALLOC_MW,
+	IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL,
+	IB_USER_VERBS_CMD_CREATE_CQ,
+	IB_USER_VERBS_CMD_RESIZE_CQ,
+	IB_USER_VERBS_CMD_DESTROY_CQ,
+	IB_USER_VERBS_CMD_POLL_CQ,
+	IB_USER_VERBS_CMD_PEEK_CQ,
+	IB_USER_VERBS_CMD_REQ_NOTIFY_CQ,
+	IB_USER_VERBS_CMD_CREATE_QP,
+	IB_USER_VERBS_CMD_QUERY_QP,
+	IB_USER_VERBS_CMD_MODIFY_QP,
+	IB_USER_VERBS_CMD_DESTROY_QP,
+	IB_USER_VERBS_CMD_POST_SEND,
+	IB_USER_VERBS_CMD_POST_RECV,
+	IB_USER_VERBS_CMD_ATTACH_MCAST,
+	IB_USER_VERBS_CMD_DETACH_MCAST,
+	IB_USER_VERBS_CMD_CREATE_SRQ,
+	IB_USER_VERBS_CMD_MODIFY_SRQ,
+	IB_USER_VERBS_CMD_QUERY_SRQ,
+	IB_USER_VERBS_CMD_DESTROY_SRQ,
+	IB_USER_VERBS_CMD_POST_SRQ_RECV,
+	IB_USER_VERBS_CMD_OPEN_XRCD,
+	IB_USER_VERBS_CMD_CLOSE_XRCD,
+	IB_USER_VERBS_CMD_CREATE_XSRQ,
+	IB_USER_VERBS_CMD_OPEN_QP,
+};
+
+enum {
+	IB_USER_VERBS_EX_CMD_QUERY_DEVICE = IB_USER_VERBS_CMD_QUERY_DEVICE,
+	IB_USER_VERBS_EX_CMD_CREATE_CQ = IB_USER_VERBS_CMD_CREATE_CQ,
+	IB_USER_VERBS_EX_CMD_CREATE_QP = IB_USER_VERBS_CMD_CREATE_QP,
+	IB_USER_VERBS_EX_CMD_MODIFY_QP = IB_USER_VERBS_CMD_MODIFY_QP,
+	IB_USER_VERBS_EX_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_THRESHOLD,
+	IB_USER_VERBS_EX_CMD_DESTROY_FLOW,
+	IB_USER_VERBS_EX_CMD_CREATE_WQ,
+	IB_USER_VERBS_EX_CMD_MODIFY_WQ,
+	IB_USER_VERBS_EX_CMD_DESTROY_WQ,
+	IB_USER_VERBS_EX_CMD_CREATE_RWQ_IND_TBL,
+	IB_USER_VERBS_EX_CMD_DESTROY_RWQ_IND_TBL,
+	IB_USER_VERBS_EX_CMD_MODIFY_CQ
+};
+
+/*
+ * Make sure that all structs defined in this file remain laid out so
+ * that they pack the same way on 32-bit and 64-bit architectures (to
+ * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ * Specifically:
+ *  - Do not use pointer types -- pass pointers in __u64 instead.
+ *  - Make sure that any structure larger than 4 bytes is padded to a
+ *    multiple of 8 bytes.  Otherwise the structure size will be
+ *    different between 32-bit and 64-bit architectures.
+ */
+
+struct ib_uverbs_async_event_desc {
+	__aligned_u64 element;
+	__u32 event_type;	/* enum ib_event_type */
+	__u32 reserved;
+};
+
+struct ib_uverbs_comp_event_desc {
+	__aligned_u64 cq_handle;
+};
+
+struct ib_uverbs_cq_moderation_caps {
+	__u16     max_cq_moderation_count;
+	__u16     max_cq_moderation_period;
+	__u32     reserved;
+};
+
+/*
+ * All commands from userspace should start with a __u32 command field
+ * followed by __u16 in_words and out_words fields (which give the
+ * length of the command block and response buffer if any in 32-bit
+ * words).  The kernel driver will read these fields first and read
+ * the rest of the command struct based on these value.
+ */
+
+#define IB_USER_VERBS_CMD_COMMAND_MASK 0xff
+#define IB_USER_VERBS_CMD_FLAG_EXTENDED 0x80000000u
+
+struct ib_uverbs_cmd_hdr {
+	__u32 command;
+	__u16 in_words;
+	__u16 out_words;
+};
+
+struct ib_uverbs_ex_cmd_hdr {
+	__aligned_u64 response;
+	__u16 provider_in_words;
+	__u16 provider_out_words;
+	__u32 cmd_hdr_reserved;
+};
+
+struct ib_uverbs_get_context {
+	__aligned_u64 response;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_get_context_resp {
+	__u32 async_fd;
+	__u32 num_comp_vectors;
+};
+
+struct ib_uverbs_query_device {
+	__aligned_u64 response;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_query_device_resp {
+	__aligned_u64 fw_ver;
+	__be64 node_guid;
+	__be64 sys_image_guid;
+	__aligned_u64 max_mr_size;
+	__aligned_u64 page_size_cap;
+	__u32 vendor_id;
+	__u32 vendor_part_id;
+	__u32 hw_ver;
+	__u32 max_qp;
+	__u32 max_qp_wr;
+	__u32 device_cap_flags;
+	__u32 max_sge;
+	__u32 max_sge_rd;
+	__u32 max_cq;
+	__u32 max_cqe;
+	__u32 max_mr;
+	__u32 max_pd;
+	__u32 max_qp_rd_atom;
+	__u32 max_ee_rd_atom;
+	__u32 max_res_rd_atom;
+	__u32 max_qp_init_rd_atom;
+	__u32 max_ee_init_rd_atom;
+	__u32 atomic_cap;
+	__u32 max_ee;
+	__u32 max_rdd;
+	__u32 max_mw;
+	__u32 max_raw_ipv6_qp;
+	__u32 max_raw_ethy_qp;
+	__u32 max_mcast_grp;
+	__u32 max_mcast_qp_attach;
+	__u32 max_total_mcast_qp_attach;
+	__u32 max_ah;
+	__u32 max_fmr;
+	__u32 max_map_per_fmr;
+	__u32 max_srq;
+	__u32 max_srq_wr;
+	__u32 max_srq_sge;
+	__u16 max_pkeys;
+	__u8  local_ca_ack_delay;
+	__u8  phys_port_cnt;
+	__u8  reserved[4];
+};
+
+struct ib_uverbs_ex_query_device {
+	__u32 comp_mask;
+	__u32 reserved;
+};
+
+struct ib_uverbs_odp_caps {
+	__aligned_u64 general_caps;
+	struct {
+		__u32 rc_odp_caps;
+		__u32 uc_odp_caps;
+		__u32 ud_odp_caps;
+	} per_transport_caps;
+	__u32 reserved;
+};
+
+struct ib_uverbs_rss_caps {
+	/* Corresponding bit will be set if qp type from
+	 * 'enum ib_qp_type' is supported, e.g.
+	 * supported_qpts |= 1 << IB_QPT_UD
+	 */
+	__u32 supported_qpts;
+	__u32 max_rwq_indirection_tables;
+	__u32 max_rwq_indirection_table_size;
+	__u32 reserved;
+};
+
+struct ib_uverbs_tm_caps {
+	/* Max size of rendezvous request message */
+	__u32 max_rndv_hdr_size;
+	/* Max number of entries in tag matching list */
+	__u32 max_num_tags;
+	/* TM flags */
+	__u32 flags;
+	/* Max number of outstanding list operations */
+	__u32 max_ops;
+	/* Max number of SGE in tag matching entry */
+	__u32 max_sge;
+	__u32 reserved;
+};
+
+struct ib_uverbs_ex_query_device_resp {
+	struct ib_uverbs_query_device_resp base;
+	__u32 comp_mask;
+	__u32 response_length;
+	struct ib_uverbs_odp_caps odp_caps;
+	__aligned_u64 timestamp_mask;
+	__aligned_u64 hca_core_clock; /* in KHZ */
+	__aligned_u64 device_cap_flags_ex;
+	struct ib_uverbs_rss_caps rss_caps;
+	__u32  max_wq_type_rq;
+	__u32 raw_packet_caps;
+	struct ib_uverbs_tm_caps tm_caps;
+	struct ib_uverbs_cq_moderation_caps cq_moderation_caps;
+	__aligned_u64 max_dm_size;
+};
+
+struct ib_uverbs_query_port {
+	__aligned_u64 response;
+	__u8  port_num;
+	__u8  reserved[7];
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_query_port_resp {
+	__u32 port_cap_flags;
+	__u32 max_msg_sz;
+	__u32 bad_pkey_cntr;
+	__u32 qkey_viol_cntr;
+	__u32 gid_tbl_len;
+	__u16 pkey_tbl_len;
+	__u16 lid;
+	__u16 sm_lid;
+	__u8  state;
+	__u8  max_mtu;
+	__u8  active_mtu;
+	__u8  lmc;
+	__u8  max_vl_num;
+	__u8  sm_sl;
+	__u8  subnet_timeout;
+	__u8  init_type_reply;
+	__u8  active_width;
+	__u8  active_speed;
+	__u8  phys_state;
+	__u8  link_layer;
+	__u8  reserved[2];
+};
+
+struct ib_uverbs_alloc_pd {
+	__aligned_u64 response;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_alloc_pd_resp {
+	__u32 pd_handle;
+};
+
+struct ib_uverbs_dealloc_pd {
+	__u32 pd_handle;
+};
+
+struct ib_uverbs_open_xrcd {
+	__aligned_u64 response;
+	__u32 fd;
+	__u32 oflags;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_open_xrcd_resp {
+	__u32 xrcd_handle;
+};
+
+struct ib_uverbs_close_xrcd {
+	__u32 xrcd_handle;
+};
+
+struct ib_uverbs_reg_mr {
+	__aligned_u64 response;
+	__aligned_u64 start;
+	__aligned_u64 length;
+	__aligned_u64 hca_va;
+	__u32 pd_handle;
+	__u32 access_flags;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_reg_mr_resp {
+	__u32 mr_handle;
+	__u32 lkey;
+	__u32 rkey;
+};
+
+struct ib_uverbs_rereg_mr {
+	__aligned_u64 response;
+	__u32 mr_handle;
+	__u32 flags;
+	__aligned_u64 start;
+	__aligned_u64 length;
+	__aligned_u64 hca_va;
+	__u32 pd_handle;
+	__u32 access_flags;
+};
+
+struct ib_uverbs_rereg_mr_resp {
+	__u32 lkey;
+	__u32 rkey;
+};
+
+struct ib_uverbs_dereg_mr {
+	__u32 mr_handle;
+};
+
+struct ib_uverbs_alloc_mw {
+	__aligned_u64 response;
+	__u32 pd_handle;
+	__u8  mw_type;
+	__u8  reserved[3];
+};
+
+struct ib_uverbs_alloc_mw_resp {
+	__u32 mw_handle;
+	__u32 rkey;
+};
+
+struct ib_uverbs_dealloc_mw {
+	__u32 mw_handle;
+};
+
+struct ib_uverbs_create_comp_channel {
+	__aligned_u64 response;
+};
+
+struct ib_uverbs_create_comp_channel_resp {
+	__u32 fd;
+};
+
+struct ib_uverbs_create_cq {
+	__aligned_u64 response;
+	__aligned_u64 user_handle;
+	__u32 cqe;
+	__u32 comp_vector;
+	__s32 comp_channel;
+	__u32 reserved;
+	__aligned_u64 driver_data[0];
+};
+
+enum ib_uverbs_ex_create_cq_flags {
+	IB_UVERBS_CQ_FLAGS_TIMESTAMP_COMPLETION = 1 << 0,
+	IB_UVERBS_CQ_FLAGS_IGNORE_OVERRUN = 1 << 1,
+};
+
+struct ib_uverbs_ex_create_cq {
+	__aligned_u64 user_handle;
+	__u32 cqe;
+	__u32 comp_vector;
+	__s32 comp_channel;
+	__u32 comp_mask;
+	__u32 flags;  /* bitmask of ib_uverbs_ex_create_cq_flags */
+	__u32 reserved;
+};
+
+struct ib_uverbs_create_cq_resp {
+	__u32 cq_handle;
+	__u32 cqe;
+};
+
+struct ib_uverbs_ex_create_cq_resp {
+	struct ib_uverbs_create_cq_resp base;
+	__u32 comp_mask;
+	__u32 response_length;
+};
+
+struct ib_uverbs_resize_cq {
+	__aligned_u64 response;
+	__u32 cq_handle;
+	__u32 cqe;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_resize_cq_resp {
+	__u32 cqe;
+	__u32 reserved;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_poll_cq {
+	__aligned_u64 response;
+	__u32 cq_handle;
+	__u32 ne;
+};
+
+struct ib_uverbs_wc {
+	__aligned_u64 wr_id;
+	__u32 status;
+	__u32 opcode;
+	__u32 vendor_err;
+	__u32 byte_len;
+	union {
+		__be32 imm_data;
+		__u32 invalidate_rkey;
+	} ex;
+	__u32 qp_num;
+	__u32 src_qp;
+	__u32 wc_flags;
+	__u16 pkey_index;
+	__u16 slid;
+	__u8 sl;
+	__u8 dlid_path_bits;
+	__u8 port_num;
+	__u8 reserved;
+};
+
+struct ib_uverbs_poll_cq_resp {
+	__u32 count;
+	__u32 reserved;
+	struct ib_uverbs_wc wc[0];
+};
+
+struct ib_uverbs_req_notify_cq {
+	__u32 cq_handle;
+	__u32 solicited_only;
+};
+
+struct ib_uverbs_destroy_cq {
+	__aligned_u64 response;
+	__u32 cq_handle;
+	__u32 reserved;
+};
+
+struct ib_uverbs_destroy_cq_resp {
+	__u32 comp_events_reported;
+	__u32 async_events_reported;
+};
+
+struct ib_uverbs_global_route {
+	__u8  dgid[16];
+	__u32 flow_label;
+	__u8  sgid_index;
+	__u8  hop_limit;
+	__u8  traffic_class;
+	__u8  reserved;
+};
+
+struct ib_uverbs_ah_attr {
+	struct ib_uverbs_global_route grh;
+	__u16 dlid;
+	__u8  sl;
+	__u8  src_path_bits;
+	__u8  static_rate;
+	__u8  is_global;
+	__u8  port_num;
+	__u8  reserved;
+};
+
+struct ib_uverbs_qp_attr {
+	__u32	qp_attr_mask;
+	__u32	qp_state;
+	__u32	cur_qp_state;
+	__u32	path_mtu;
+	__u32	path_mig_state;
+	__u32	qkey;
+	__u32	rq_psn;
+	__u32	sq_psn;
+	__u32	dest_qp_num;
+	__u32	qp_access_flags;
+
+	struct ib_uverbs_ah_attr ah_attr;
+	struct ib_uverbs_ah_attr alt_ah_attr;
+
+	/* ib_qp_cap */
+	__u32	max_send_wr;
+	__u32	max_recv_wr;
+	__u32	max_send_sge;
+	__u32	max_recv_sge;
+	__u32	max_inline_data;
+
+	__u16	pkey_index;
+	__u16	alt_pkey_index;
+	__u8	en_sqd_async_notify;
+	__u8	sq_draining;
+	__u8	max_rd_atomic;
+	__u8	max_dest_rd_atomic;
+	__u8	min_rnr_timer;
+	__u8	port_num;
+	__u8	timeout;
+	__u8	retry_cnt;
+	__u8	rnr_retry;
+	__u8	alt_port_num;
+	__u8	alt_timeout;
+	__u8	reserved[5];
+};
+
+struct ib_uverbs_create_qp {
+	__aligned_u64 response;
+	__aligned_u64 user_handle;
+	__u32 pd_handle;
+	__u32 send_cq_handle;
+	__u32 recv_cq_handle;
+	__u32 srq_handle;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
+	__u8  sq_sig_all;
+	__u8  qp_type;
+	__u8  is_srq;
+	__u8  reserved;
+	__aligned_u64 driver_data[0];
+};
+
+enum ib_uverbs_create_qp_mask {
+	IB_UVERBS_CREATE_QP_MASK_IND_TABLE = 1UL << 0,
+};
+
+enum {
+	IB_UVERBS_CREATE_QP_SUP_COMP_MASK = IB_UVERBS_CREATE_QP_MASK_IND_TABLE,
+};
+
+enum {
+	/*
+	 * This value is equal to IB_QP_DEST_QPN.
+	 */
+	IB_USER_LEGACY_LAST_QP_ATTR_MASK = 1ULL << 20,
+};
+
+enum {
+	/*
+	 * This value is equal to IB_QP_RATE_LIMIT.
+	 */
+	IB_USER_LAST_QP_ATTR_MASK = 1ULL << 25,
+};
+
+struct ib_uverbs_ex_create_qp {
+	__aligned_u64 user_handle;
+	__u32 pd_handle;
+	__u32 send_cq_handle;
+	__u32 recv_cq_handle;
+	__u32 srq_handle;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
+	__u8  sq_sig_all;
+	__u8  qp_type;
+	__u8  is_srq;
+	__u8 reserved;
+	__u32 comp_mask;
+	__u32 create_flags;
+	__u32 rwq_ind_tbl_handle;
+	__u32  source_qpn;
+};
+
+struct ib_uverbs_open_qp {
+	__aligned_u64 response;
+	__aligned_u64 user_handle;
+	__u32 pd_handle;
+	__u32 qpn;
+	__u8  qp_type;
+	__u8  reserved[7];
+	__aligned_u64 driver_data[0];
+};
+
+/* also used for open response */
+struct ib_uverbs_create_qp_resp {
+	__u32 qp_handle;
+	__u32 qpn;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
+	__u32 reserved;
+};
+
+struct ib_uverbs_ex_create_qp_resp {
+	struct ib_uverbs_create_qp_resp base;
+	__u32 comp_mask;
+	__u32 response_length;
+};
+
+/*
+ * This struct needs to remain a multiple of 8 bytes to keep the
+ * alignment of the modify QP parameters.
+ */
+struct ib_uverbs_qp_dest {
+	__u8  dgid[16];
+	__u32 flow_label;
+	__u16 dlid;
+	__u16 reserved;
+	__u8  sgid_index;
+	__u8  hop_limit;
+	__u8  traffic_class;
+	__u8  sl;
+	__u8  src_path_bits;
+	__u8  static_rate;
+	__u8  is_global;
+	__u8  port_num;
+};
+
+struct ib_uverbs_query_qp {
+	__aligned_u64 response;
+	__u32 qp_handle;
+	__u32 attr_mask;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_query_qp_resp {
+	struct ib_uverbs_qp_dest dest;
+	struct ib_uverbs_qp_dest alt_dest;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
+	__u32 qkey;
+	__u32 rq_psn;
+	__u32 sq_psn;
+	__u32 dest_qp_num;
+	__u32 qp_access_flags;
+	__u16 pkey_index;
+	__u16 alt_pkey_index;
+	__u8  qp_state;
+	__u8  cur_qp_state;
+	__u8  path_mtu;
+	__u8  path_mig_state;
+	__u8  sq_draining;
+	__u8  max_rd_atomic;
+	__u8  max_dest_rd_atomic;
+	__u8  min_rnr_timer;
+	__u8  port_num;
+	__u8  timeout;
+	__u8  retry_cnt;
+	__u8  rnr_retry;
+	__u8  alt_port_num;
+	__u8  alt_timeout;
+	__u8  sq_sig_all;
+	__u8  reserved[5];
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_modify_qp {
+	struct ib_uverbs_qp_dest dest;
+	struct ib_uverbs_qp_dest alt_dest;
+	__u32 qp_handle;
+	__u32 attr_mask;
+	__u32 qkey;
+	__u32 rq_psn;
+	__u32 sq_psn;
+	__u32 dest_qp_num;
+	__u32 qp_access_flags;
+	__u16 pkey_index;
+	__u16 alt_pkey_index;
+	__u8  qp_state;
+	__u8  cur_qp_state;
+	__u8  path_mtu;
+	__u8  path_mig_state;
+	__u8  en_sqd_async_notify;
+	__u8  max_rd_atomic;
+	__u8  max_dest_rd_atomic;
+	__u8  min_rnr_timer;
+	__u8  port_num;
+	__u8  timeout;
+	__u8  retry_cnt;
+	__u8  rnr_retry;
+	__u8  alt_port_num;
+	__u8  alt_timeout;
+	__u8  reserved[2];
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_ex_modify_qp {
+	struct ib_uverbs_modify_qp base;
+	__u32	rate_limit;
+	__u32	reserved;
+};
+
+struct ib_uverbs_modify_qp_resp {
+};
+
+struct ib_uverbs_ex_modify_qp_resp {
+	__u32  comp_mask;
+	__u32  response_length;
+};
+
+struct ib_uverbs_destroy_qp {
+	__aligned_u64 response;
+	__u32 qp_handle;
+	__u32 reserved;
+};
+
+struct ib_uverbs_destroy_qp_resp {
+	__u32 events_reported;
+};
+
+/*
+ * The ib_uverbs_sge structure isn't used anywhere, since we assume
+ * the ib_sge structure is packed the same way on 32-bit and 64-bit
+ * architectures in both kernel and user space.  It's just here to
+ * document the ABI.
+ */
+struct ib_uverbs_sge {
+	__aligned_u64 addr;
+	__u32 length;
+	__u32 lkey;
+};
+
+struct ib_uverbs_send_wr {
+	__aligned_u64 wr_id;
+	__u32 num_sge;
+	__u32 opcode;
+	__u32 send_flags;
+	union {
+		__be32 imm_data;
+		__u32 invalidate_rkey;
+	} ex;
+	union {
+		struct {
+			__aligned_u64 remote_addr;
+			__u32 rkey;
+			__u32 reserved;
+		} rdma;
+		struct {
+			__aligned_u64 remote_addr;
+			__aligned_u64 compare_add;
+			__aligned_u64 swap;
+			__u32 rkey;
+			__u32 reserved;
+		} atomic;
+		struct {
+			__u32 ah;
+			__u32 remote_qpn;
+			__u32 remote_qkey;
+			__u32 reserved;
+		} ud;
+	} wr;
+};
+
+struct ib_uverbs_post_send {
+	__aligned_u64 response;
+	__u32 qp_handle;
+	__u32 wr_count;
+	__u32 sge_count;
+	__u32 wqe_size;
+	struct ib_uverbs_send_wr send_wr[0];
+};
+
+struct ib_uverbs_post_send_resp {
+	__u32 bad_wr;
+};
+
+struct ib_uverbs_recv_wr {
+	__aligned_u64 wr_id;
+	__u32 num_sge;
+	__u32 reserved;
+};
+
+struct ib_uverbs_post_recv {
+	__aligned_u64 response;
+	__u32 qp_handle;
+	__u32 wr_count;
+	__u32 sge_count;
+	__u32 wqe_size;
+	struct ib_uverbs_recv_wr recv_wr[0];
+};
+
+struct ib_uverbs_post_recv_resp {
+	__u32 bad_wr;
+};
+
+struct ib_uverbs_post_srq_recv {
+	__aligned_u64 response;
+	__u32 srq_handle;
+	__u32 wr_count;
+	__u32 sge_count;
+	__u32 wqe_size;
+	struct ib_uverbs_recv_wr recv[0];
+};
+
+struct ib_uverbs_post_srq_recv_resp {
+	__u32 bad_wr;
+};
+
+struct ib_uverbs_create_ah {
+	__aligned_u64 response;
+	__aligned_u64 user_handle;
+	__u32 pd_handle;
+	__u32 reserved;
+	struct ib_uverbs_ah_attr attr;
+};
+
+struct ib_uverbs_create_ah_resp {
+	__u32 ah_handle;
+};
+
+struct ib_uverbs_destroy_ah {
+	__u32 ah_handle;
+};
+
+struct ib_uverbs_attach_mcast {
+	__u8  gid[16];
+	__u32 qp_handle;
+	__u16 mlid;
+	__u16 reserved;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_detach_mcast {
+	__u8  gid[16];
+	__u32 qp_handle;
+	__u16 mlid;
+	__u16 reserved;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_flow_spec_hdr {
+	__u32 type;
+	__u16 size;
+	__u16 reserved;
+	/* followed by flow_spec */
+	__aligned_u64 flow_spec_data[0];
+};
+
+struct ib_uverbs_flow_eth_filter {
+	__u8  dst_mac[6];
+	__u8  src_mac[6];
+	__be16 ether_type;
+	__be16 vlan_tag;
+};
+
+struct ib_uverbs_flow_spec_eth {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	struct ib_uverbs_flow_eth_filter val;
+	struct ib_uverbs_flow_eth_filter mask;
+};
+
+struct ib_uverbs_flow_ipv4_filter {
+	__be32 src_ip;
+	__be32 dst_ip;
+	__u8	proto;
+	__u8	tos;
+	__u8	ttl;
+	__u8	flags;
+};
+
+struct ib_uverbs_flow_spec_ipv4 {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	struct ib_uverbs_flow_ipv4_filter val;
+	struct ib_uverbs_flow_ipv4_filter mask;
+};
+
+struct ib_uverbs_flow_tcp_udp_filter {
+	__be16 dst_port;
+	__be16 src_port;
+};
+
+struct ib_uverbs_flow_spec_tcp_udp {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	struct ib_uverbs_flow_tcp_udp_filter val;
+	struct ib_uverbs_flow_tcp_udp_filter mask;
+};
+
+struct ib_uverbs_flow_ipv6_filter {
+	__u8    src_ip[16];
+	__u8    dst_ip[16];
+	__be32	flow_label;
+	__u8	next_hdr;
+	__u8	traffic_class;
+	__u8	hop_limit;
+	__u8	reserved;
+};
+
+struct ib_uverbs_flow_spec_ipv6 {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	struct ib_uverbs_flow_ipv6_filter val;
+	struct ib_uverbs_flow_ipv6_filter mask;
+};
+
+struct ib_uverbs_flow_spec_action_tag {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	__u32			      tag_id;
+	__u32			      reserved1;
+};
+
+struct ib_uverbs_flow_spec_action_drop {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+};
+
+struct ib_uverbs_flow_spec_action_handle {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	__u32			      handle;
+	__u32			      reserved1;
+};
+
+struct ib_uverbs_flow_tunnel_filter {
+	__be32 tunnel_id;
+};
+
+struct ib_uverbs_flow_spec_tunnel {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	struct ib_uverbs_flow_tunnel_filter val;
+	struct ib_uverbs_flow_tunnel_filter mask;
+};
+
+struct ib_uverbs_flow_spec_esp_filter {
+	__u32 spi;
+	__u32 seq;
+};
+
+struct ib_uverbs_flow_spec_esp {
+	union {
+		struct ib_uverbs_flow_spec_hdr hdr;
+		struct {
+			__u32 type;
+			__u16 size;
+			__u16 reserved;
+		};
+	};
+	struct ib_uverbs_flow_spec_esp_filter val;
+	struct ib_uverbs_flow_spec_esp_filter mask;
+};
+
+struct ib_uverbs_flow_attr {
+	__u32 type;
+	__u16 size;
+	__u16 priority;
+	__u8  num_of_specs;
+	__u8  reserved[2];
+	__u8  port;
+	__u32 flags;
+	/* Following are the optional layers according to user request
+	 * struct ib_flow_spec_xxx
+	 * struct ib_flow_spec_yyy
+	 */
+	struct ib_uverbs_flow_spec_hdr flow_specs[0];
+};
+
+struct ib_uverbs_create_flow  {
+	__u32 comp_mask;
+	__u32 qp_handle;
+	struct ib_uverbs_flow_attr flow_attr;
+};
+
+struct ib_uverbs_create_flow_resp {
+	__u32 comp_mask;
+	__u32 flow_handle;
+};
+
+struct ib_uverbs_destroy_flow  {
+	__u32 comp_mask;
+	__u32 flow_handle;
+};
+
+struct ib_uverbs_create_srq {
+	__aligned_u64 response;
+	__aligned_u64 user_handle;
+	__u32 pd_handle;
+	__u32 max_wr;
+	__u32 max_sge;
+	__u32 srq_limit;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_create_xsrq {
+	__aligned_u64 response;
+	__aligned_u64 user_handle;
+	__u32 srq_type;
+	__u32 pd_handle;
+	__u32 max_wr;
+	__u32 max_sge;
+	__u32 srq_limit;
+	__u32 max_num_tags;
+	__u32 xrcd_handle;
+	__u32 cq_handle;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_create_srq_resp {
+	__u32 srq_handle;
+	__u32 max_wr;
+	__u32 max_sge;
+	__u32 srqn;
+};
+
+struct ib_uverbs_modify_srq {
+	__u32 srq_handle;
+	__u32 attr_mask;
+	__u32 max_wr;
+	__u32 srq_limit;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_query_srq {
+	__aligned_u64 response;
+	__u32 srq_handle;
+	__u32 reserved;
+	__aligned_u64 driver_data[0];
+};
+
+struct ib_uverbs_query_srq_resp {
+	__u32 max_wr;
+	__u32 max_sge;
+	__u32 srq_limit;
+	__u32 reserved;
+};
+
+struct ib_uverbs_destroy_srq {
+	__aligned_u64 response;
+	__u32 srq_handle;
+	__u32 reserved;
+};
+
+struct ib_uverbs_destroy_srq_resp {
+	__u32 events_reported;
+};
+
+struct ib_uverbs_ex_create_wq  {
+	__u32 comp_mask;
+	__u32 wq_type;
+	__aligned_u64 user_handle;
+	__u32 pd_handle;
+	__u32 cq_handle;
+	__u32 max_wr;
+	__u32 max_sge;
+	__u32 create_flags; /* Use enum ib_wq_flags */
+	__u32 reserved;
+};
+
+struct ib_uverbs_ex_create_wq_resp {
+	__u32 comp_mask;
+	__u32 response_length;
+	__u32 wq_handle;
+	__u32 max_wr;
+	__u32 max_sge;
+	__u32 wqn;
+};
+
+struct ib_uverbs_ex_destroy_wq  {
+	__u32 comp_mask;
+	__u32 wq_handle;
+};
+
+struct ib_uverbs_ex_destroy_wq_resp {
+	__u32 comp_mask;
+	__u32 response_length;
+	__u32 events_reported;
+	__u32 reserved;
+};
+
+struct ib_uverbs_ex_modify_wq  {
+	__u32 attr_mask;
+	__u32 wq_handle;
+	__u32 wq_state;
+	__u32 curr_wq_state;
+	__u32 flags; /* Use enum ib_wq_flags */
+	__u32 flags_mask; /* Use enum ib_wq_flags */
+};
+
+/* Prevent memory allocation rather than max expected size */
+#define IB_USER_VERBS_MAX_LOG_IND_TBL_SIZE 0x0d
+struct ib_uverbs_ex_create_rwq_ind_table  {
+	__u32 comp_mask;
+	__u32 log_ind_tbl_size;
+	/* Following are the wq handles according to log_ind_tbl_size
+	 * wq_handle1
+	 * wq_handle2
+	 */
+	__u32 wq_handles[0];
+};
+
+struct ib_uverbs_ex_create_rwq_ind_table_resp {
+	__u32 comp_mask;
+	__u32 response_length;
+	__u32 ind_tbl_handle;
+	__u32 ind_tbl_num;
+};
+
+struct ib_uverbs_ex_destroy_rwq_ind_table  {
+	__u32 comp_mask;
+	__u32 ind_tbl_handle;
+};
+
+struct ib_uverbs_cq_moderation {
+	__u16 cq_count;
+	__u16 cq_period;
+};
+
+struct ib_uverbs_ex_modify_cq {
+	__u32 cq_handle;
+	__u32 attr_mask;
+	struct ib_uverbs_cq_moderation attr;
+	__u32 reserved;
+};
+
+#define IB_DEVICE_NAME_MAX 64
+
+#endif /* IB_USER_VERBS_H */
diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
similarity index 95%
rename from rdma/include/uapi/rdma/rdma_netlink.h
rename to include/uapi/rdma/rdma_netlink.h
index 9446a72136e8..60416ed71c0f 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -388,6 +388,19 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_RES_LOCAL_DMA_LKEY,	/* u32 */
 	RDMA_NLDEV_ATTR_RES_UNSAFE_GLOBAL_RKEY,	/* u32 */
 
+	/*
+	 * Provides logical name and index of netdevice which is
+	 * connected to physical port. This information is relevant
+	 * for RoCE and iWARP.
+	 *
+	 * The netdevices which are associated with containers are
+	 * supposed to be exported together with GID table once it
+	 * will be exposed through the netlink. Because the
+	 * associated netdevices are properties of GIDs.
+	 */
+	RDMA_NLDEV_ATTR_NDEV_INDEX,		/* u32 */
+	RDMA_NLDEV_ATTR_NDEV_NAME,		/* string */
+
 	RDMA_NLDEV_ATTR_MAX
 };
 #endif /* _RDMA_NETLINK_H */
diff --git a/rdma/include/uapi/rdma/rdma_user_cm.h b/include/uapi/rdma/rdma_user_cm.h
similarity index 98%
rename from rdma/include/uapi/rdma/rdma_user_cm.h
rename to include/uapi/rdma/rdma_user_cm.h
index da099af0ace7..e1269024af47 100644
--- a/rdma/include/uapi/rdma/rdma_user_cm.h
+++ b/include/uapi/rdma/rdma_user_cm.h
@@ -31,8 +31,8 @@
  * SOFTWARE.
  */
 
-#ifndef _RDMA_USER_CM_H
-#define _RDMA_USER_CM_H
+#ifndef RDMA_USER_CM_H
+#define RDMA_USER_CM_H
 
 #include <linux/types.h>
 #include <linux/socket.h>
@@ -321,4 +321,4 @@ struct rdma_ucm_migrate_resp {
 	__u32 events_reported;
 };
 
-#endif /* _RDMA_USER_CM_H */
+#endif /* RDMA_USER_CM_H */
-- 
2.17.0

* Re: [PATCH bpf-next 09/10] tools: bpftool: add simple perf event output reader
From: Daniel Borkmann @ 2018-05-04 21:53 UTC (permalink / raw)
  To: Jakub Kicinski, alexei.starovoitov; +Cc: oss-drivers, netdev
In-Reply-To: <20180504013717.29317-10-jakub.kicinski@netronome.com>

On 05/04/2018 03:37 AM, Jakub Kicinski wrote:
> Users of BPF sooner or later discover perf_event_output() helpers
> and BPF_MAP_TYPE_PERF_EVENT_ARRAY.  Dumping this array type is
> not possible, however, we can add simple reading of perf events.
> Create a new event_pipe subcommand for maps, this sub command
> will only work with BPF_MAP_TYPE_PERF_EVENT_ARRAY maps.
> 
> Parts of the code from samples/bpf/trace_output_user.c.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
[...]

One remark below:

[...]
> +static void
> +print_bpf_output(struct event_ring_info *ring, struct perf_event_sample *e)
> +{
> +	struct {
> +		struct perf_event_header header;
> +		__u64 id;
> +		__u64 lost;
> +	} *lost = (void *)e;
> +	struct timespec ts;
> +
> +	if (clock_gettime(CLOCK_MONOTONIC, &ts)) {
> +		perror("Can't read clock for timestamp");
> +		return;
> +	}

Instead of taking the timestamp with clock_gettime() above, it is probably
better to pick it up via PERF_SAMPLE_TIME, which needs to be added to
sample_type so that the time also ends up in the ring buffer (RB). Since
you poll with a timeout of 200 below and don't set a wakeup event for the
perf RB (that is probably fine here, but it can be done based on watermark
or events), the clock_gettime() value will be off compared to when the
event was actually put into the RB.
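
To illustrate the suggestion, here is a minimal sketch that is not taken from
the patch. It assumes sample_type is exactly PERF_SAMPLE_TIME |
PERF_SAMPLE_RAW, in which case the kernel-recorded time precedes the raw
payload in each PERF_RECORD_SAMPLE. The names bpf_output_sample,
init_bpf_output_attr() and sample_time_ns() are hypothetical.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of a PERF_RECORD_SAMPLE when sample_type is exactly
 * PERF_SAMPLE_TIME | PERF_SAMPLE_RAW: the time field precedes the raw data.
 */
struct bpf_output_sample {
	struct perf_event_header header;
	uint64_t time;		/* PERF_SAMPLE_TIME */
	uint32_t size;		/* PERF_SAMPLE_RAW: payload length */
	unsigned char data[];	/* PERF_SAMPLE_RAW: payload */
};

/* The time field must directly follow the 8-byte record header. */
static_assert(offsetof(struct bpf_output_sample, time) ==
	      sizeof(struct perf_event_header), "unexpected sample layout");

/* Fill in an attr for a BPF output event that also records a timestamp. */
static void init_bpf_output_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_BPF_OUTPUT;
	attr->sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_TIME;
	attr->sample_period = 1;
	attr->wakeup_events = 1;	/* wake the reader on every event */
}

/* Return the time stored in a sample record, or 0 for other record types. */
static uint64_t sample_time_ns(const void *record)
{
	const struct bpf_output_sample *s = record;

	if (s->header.type != PERF_RECORD_SAMPLE)
		return 0;
	return s->time;
}
```

With a layout like this, the printing code could report the time stored in
the record rather than the time at which the reader happened to wake up.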

> +	if (json_output) {
> +		jsonw_start_object(json_wtr);
> +		jsonw_name(json_wtr, "timestamp");
> +		jsonw_uint(json_wtr, ts.tv_sec * 1000000000ull + ts.tv_nsec);
> +		jsonw_name(json_wtr, "type");
> +		jsonw_uint(json_wtr, e->header.type);
> +		jsonw_name(json_wtr, "cpu");
> +		jsonw_uint(json_wtr, ring->cpu);
> +		jsonw_name(json_wtr, "index");
> +		jsonw_uint(json_wtr, ring->key);
> +		if (e->header.type == PERF_RECORD_SAMPLE) {
> +			jsonw_name(json_wtr, "data");
> +			print_data_json(e->data, e->size);
> +		} else if (e->header.type == PERF_RECORD_LOST) {
> +			jsonw_name(json_wtr, "lost");
> +			jsonw_start_object(json_wtr);
> +			jsonw_name(json_wtr, "id");
> +			jsonw_uint(json_wtr, lost->id);
> +			jsonw_name(json_wtr, "count");
> +			jsonw_uint(json_wtr, lost->lost);
> +			jsonw_end_object(json_wtr);
> +		}
> +		jsonw_end_object(json_wtr);
> +	} else {
> +		if (e->header.type == PERF_RECORD_SAMPLE) {
> +			printf("== @%ld.%ld CPU: %d index: %d =====\n",
> +			       (long)ts.tv_sec, ts.tv_nsec,
> +			       ring->cpu, ring->key);
> +			fprint_hex(stdout, e->data, e->size, " ");
> +			printf("\n");
> +		} else if (e->header.type == PERF_RECORD_LOST) {
> +			printf("lost %lld events\n", lost->lost);
> +		} else {
> +			printf("unknown event type=%d size=%d\n",
> +			       e->header.type, e->header.size);
> +		}

* [PATCH bpf-next 5/6] bpf: btf: Update tools/include/uapi/linux/btf.h with BTF ID
From: Martin KaFai Lau @ 2018-05-04 21:49 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180504214955.1058805-1-kafai@fb.com>

This patch syncs tools/include/uapi/linux/bpf.h with
the newly introduced BTF ID support.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
---
 tools/include/uapi/linux/bpf.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 83a95ae388dd..fff51c187d1e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -96,6 +96,7 @@ enum bpf_cmd {
 	BPF_PROG_QUERY,
 	BPF_RAW_TRACEPOINT_OPEN,
 	BPF_BTF_LOAD,
+	BPF_BTF_GET_FD_BY_ID,
 };
 
 enum bpf_map_type {
@@ -343,6 +344,7 @@ union bpf_attr {
 			__u32		start_id;
 			__u32		prog_id;
 			__u32		map_id;
+			__u32		btf_id;
 		};
 		__u32		next_id;
 		__u32		open_flags;
@@ -2129,6 +2131,15 @@ struct bpf_map_info {
 	__u32 ifindex;
 	__u64 netns_dev;
 	__u64 netns_ino;
+	__u32 btf_id;
+	__u32 btf_key_id;
+	__u32 btf_value_id;
+} __attribute__((aligned(8)));
+
+struct bpf_btf_info {
+	__aligned_u64 btf;
+	__u32 btf_size;
+	__u32 id;
 } __attribute__((aligned(8)));
 
 /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
-- 
2.9.5

* [PATCH bpf-next 1/6] bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y
From: Martin KaFai Lau @ 2018-05-04 21:49 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180504214955.1058805-1-kafai@fb.com>

If CONFIG_REFCOUNT_FULL=y, refcount_inc() WARNs when the refcount is 0.
When creating a new btf, the initial btf->refcnt is 0, which
triggered the following:

[   34.855452] refcount_t: increment on 0; use-after-free.
[   34.856252] WARNING: CPU: 6 PID: 1857 at lib/refcount.c:153 refcount_inc+0x26/0x30
....
[   34.868809] Call Trace:
[   34.869168]  btf_new_fd+0x1af6/0x24d0
[   34.869645]  ? btf_type_seq_show+0x200/0x200
[   34.870212]  ? lock_acquire+0x3b0/0x3b0
[   34.870726]  ? security_capable+0x54/0x90
[   34.871247]  __x64_sys_bpf+0x1b2/0x310
[   34.871761]  ? __ia32_sys_bpf+0x310/0x310
[   34.872285]  ? bad_area_access_error+0x310/0x310
[   34.872894]  do_syscall_64+0x95/0x3f0

This patch uses refcount_set() instead.
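
For illustration only, the rule behind the warning can be sketched with a
single-threaded userspace stand-in for refcount_t. The names struct ref,
ref_set(), ref_inc() and ref_put() are hypothetical and are not the kernel
API. The point is that a fresh object gets its first reference by setting
the counter, because an increment from 0 is treated as a possible
use-after-free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, non-atomic stand-in for the kernel's refcount_t. */
struct ref {
	unsigned int count;
};

/* Initialise the counter; this is how a new object gets its first ref. */
static void ref_set(struct ref *r, unsigned int n)
{
	r->count = n;
}

/* Checked increment: a 0 -> 1 transition is reported and refused,
 * since a zero count normally means the object is already being freed.
 */
static bool ref_inc(struct ref *r)
{
	if (r->count == 0) {
		fprintf(stderr, "ref: increment on 0; use-after-free.\n");
		return false;
	}
	r->count++;
	return true;
}

/* Drop one reference; returns true when the last reference is gone. */
static bool ref_put(struct ref *r)
{
	assert(r->count > 0);
	return --r->count == 0;
}
```

In this sketch, calling ref_inc() on a zero-initialised object fails, while
ref_set(&r, 1) followed by ref_inc(&r) succeeds.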

Reported-by: Yonghong Song <yhs@fb.com>
Tested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 22e1046a1a86..fa0dce0452e7 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -1977,7 +1977,7 @@ static struct btf *btf_parse(void __user *btf_data, u32 btf_data_size,
 
 	if (!err) {
 		btf_verifier_env_free(env);
-		btf_get(btf);
+		refcount_set(&btf->refcnt, 1);
 		return btf;
 	}
 
-- 
2.9.5

* [PATCH bpf-next 3/6] bpf: btf: Add struct bpf_btf_info
From: Martin KaFai Lau @ 2018-05-04 21:49 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180504214955.1058805-1-kafai@fb.com>

During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's
info.info is directly filled with the BTF binary data.  It is
not extensible.  In this case, we want to add BTF ID.

This patch adds "struct bpf_btf_info" which has the BTF ID as
one of its members.  The BTF binary data itself is exposed through
the "btf" and "btf_size" members.
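
As a rough userspace model of the copy semantics (a sketch, not the kernel
code): the caller passes a buffer address in "btf" and its capacity in
"btf_size"; at most that many bytes are copied, and "btf_size" is then
overwritten with the full size so the caller can detect truncation.
struct btf_info_sketch and fill_btf_info() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the fields in struct bpf_btf_info. */
struct btf_info_sketch {
	uint64_t btf;		/* in: address of the caller's buffer */
	uint32_t btf_size;	/* in: buffer capacity, out: full BTF size */
	uint32_t id;		/* out: BTF ID */
};

/* Copy at most info->btf_size bytes of BTF data into the caller's buffer,
 * then report the ID and the full data size back through *info.
 */
static void fill_btf_info(struct btf_info_sketch *info,
			  const void *btf_data, uint32_t data_size,
			  uint32_t id)
{
	uint32_t copy;

	assert(info != NULL);
	copy = data_size < info->btf_size ? data_size : info->btf_size;
	if (copy)
		memcpy((void *)(uintptr_t)info->btf, btf_data, copy);

	info->btf_size = data_size;
	info->id = id;
}
```

A caller that gets back a btf_size larger than the capacity it passed in
knows the data was truncated and can retry with a bigger buffer.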

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
---
 include/uapi/linux/bpf.h |  6 ++++++
 kernel/bpf/btf.c         | 26 +++++++++++++++++++++-----
 kernel/bpf/syscall.c     | 17 ++++++++++++++++-
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6106f23a9a8a..d615c777b573 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2137,6 +2137,12 @@ struct bpf_map_info {
 	__u32 btf_value_id;
 } __attribute__((aligned(8)));
 
+struct bpf_btf_info {
+	__aligned_u64 btf;
+	__u32 btf_size;
+	__u32 id;
+} __attribute__((aligned(8)));
+
 /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
  * by user and intended to be used by socket (e.g. to bind to, depends on
  * attach attach type).
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 40950b6bf395..ded10ab47b8a 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -2114,12 +2114,28 @@ int btf_get_info_by_fd(const struct btf *btf,
 		       const union bpf_attr *attr,
 		       union bpf_attr __user *uattr)
 {
-	void __user *udata = u64_to_user_ptr(attr->info.info);
-	u32 copy_len = min_t(u32, btf->data_size,
-			     attr->info.info_len);
+	struct bpf_btf_info __user *uinfo;
+	struct bpf_btf_info info = {};
+	u32 info_copy, btf_copy;
+	void __user *ubtf;
+	u32 uinfo_len;
 
-	if (copy_to_user(udata, btf->data, copy_len) ||
-	    put_user(btf->data_size, &uattr->info.info_len))
+	uinfo = u64_to_user_ptr(attr->info.info);
+	uinfo_len = attr->info.info_len;
+
+	info_copy = min_t(u32, uinfo_len, sizeof(info));
+	if (copy_from_user(&info, uinfo, info_copy))
+		return -EFAULT;
+
+	info.id = btf->id;
+	ubtf = u64_to_user_ptr(info.btf);
+	btf_copy = min_t(u32, btf->data_size, info.btf_size);
+	if (copy_to_user(ubtf, btf->data, btf_copy))
+		return -EFAULT;
+	info.btf_size = btf->data_size;
+
+	if (copy_to_user(uinfo, &info, info_copy) ||
+	    put_user(info_copy, &uattr->info.info_len))
 		return -EFAULT;
 
 	return 0;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8b0a45d65454..d2895e3e5cbf 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2019,6 +2019,21 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
 	return 0;
 }
 
+static int bpf_btf_get_info_by_fd(struct btf *btf,
+				  const union bpf_attr *attr,
+				  union bpf_attr __user *uattr)
+{
+	struct bpf_btf_info __user *uinfo = u64_to_user_ptr(attr->info.info);
+	u32 info_len = attr->info.info_len;
+	int err;
+
+	err = check_uarg_tail_zero(uinfo, sizeof(*uinfo), info_len);
+	if (err)
+		return err;
+
+	return btf_get_info_by_fd(btf, attr, uattr);
+}
+
 #define BPF_OBJ_GET_INFO_BY_FD_LAST_FIELD info.info
 
 static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
@@ -2042,7 +2057,7 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 		err = bpf_map_get_info_by_fd(f.file->private_data, attr,
 					     uattr);
 	else if (f.file->f_op == &btf_fops)
-		err = btf_get_info_by_fd(f.file->private_data, attr, uattr);
+		err = bpf_btf_get_info_by_fd(f.file->private_data, attr, uattr);
 	else
 		err = -EINVAL;
 
-- 
2.9.5

* [PATCH bpf-next 4/6] bpf: btf: Some test_btf clean up
From: Martin KaFai Lau @ 2018-05-04 21:49 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180504214955.1058805-1-kafai@fb.com>

This patch adds a CHECK() macro for condition checking
and error reporting, similar to the one in test_progs.c.

It also counts the number of tests passed/skipped/failed and
prints them at the end of the test run.
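
A portable variant of the pattern can be sketched as follows. The macro in
the patch uses a GNU statement expression; this sketch swaps in a plain
variadic helper, check_failed(), which is a hypothetical name, as is
demo_test(). The counters and count_result() follow the patch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t pass_cnt;
static uint32_t error_cnt;

/* Print "<func>:<line>:FAIL <message>" when the condition holds and
 * return the condition, so the caller can branch on it.
 */
static int check_failed(int failed, const char *func, int line,
			const char *fmt, ...)
{
	if (failed) {
		va_list ap;

		fprintf(stderr, "%s:%d:FAIL ", func, line);
		va_start(ap, fmt);
		vfprintf(stderr, fmt, ap);
		va_end(ap);
	}
	return failed;
}

#define CHECK(condition, ...) \
	check_failed(!!(condition), __func__, __LINE__, __VA_ARGS__)

/* Record one test outcome and terminate its output line. */
static int count_result(int err)
{
	if (err)
		error_cnt++;
	else
		pass_cnt++;

	fprintf(stderr, "\n");
	return err;
}

/* Example test body: fails when given a negative value. */
static int demo_test(int value)
{
	if (CHECK(value < 0, "value:%d is negative", value))
		return -1;

	fprintf(stderr, "OK");
	return 0;
}
```

Each test is wrapped as count_result(demo_test(...)), so messages no longer
carry their own trailing newline and the counters stay in one place.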

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
---
 tools/testing/selftests/bpf/test_btf.c | 201 ++++++++++++++++-----------------
 1 file changed, 99 insertions(+), 102 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index 7b39b1f712a1..b7880a20fad1 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -20,6 +20,30 @@
 
 #include "bpf_rlimit.h"
 
+static uint32_t pass_cnt;
+static uint32_t error_cnt;
+static uint32_t skip_cnt;
+
+#define CHECK(condition, format...) ({					\
+	int __ret = !!(condition);					\
+	if (__ret) {							\
+		fprintf(stderr, "%s:%d:FAIL ", __func__, __LINE__);	\
+		fprintf(stderr, format);				\
+	}								\
+	__ret;								\
+})
+
+static int count_result(int err)
+{
+	if (err)
+		error_cnt++;
+	else
+		pass_cnt++;
+
+	fprintf(stderr, "\n");
+	return err;
+}
+
 #define min(a, b) ((a) < (b) ? (a) : (b))
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
@@ -894,17 +918,13 @@ static void *btf_raw_create(const struct btf_header *hdr,
 	void *raw_btf;
 
 	type_sec_size = get_type_sec_size(raw_types);
-	if (type_sec_size < 0) {
-		fprintf(stderr, "Cannot get nr_raw_types\n");
+	if (CHECK(type_sec_size < 0, "Cannot get nr_raw_types"))
 		return NULL;
-	}
 
 	size_needed = sizeof(*hdr) + type_sec_size + str_sec_size;
 	raw_btf = malloc(size_needed);
-	if (!raw_btf) {
-		fprintf(stderr, "Cannot allocate memory for raw_btf\n");
+	if (CHECK(!raw_btf, "Cannot allocate memory for raw_btf"))
 		return NULL;
-	}
 
 	/* Copy header */
 	memcpy(raw_btf, hdr, sizeof(*hdr));
@@ -915,8 +935,7 @@ static void *btf_raw_create(const struct btf_header *hdr,
 	for (i = 0; i < type_sec_size / sizeof(raw_types[0]); i++) {
 		if (raw_types[i] == NAME_TBD) {
 			next_str = get_next_str(next_str, end_str);
-			if (!next_str) {
-				fprintf(stderr, "Error in getting next_str\n");
+			if (CHECK(!next_str, "Error in getting next_str")) {
 				free(raw_btf);
 				return NULL;
 			}
@@ -973,9 +992,8 @@ static int do_test_raw(unsigned int test_num)
 	free(raw_btf);
 
 	err = ((btf_fd == -1) != test->btf_load_err);
-	if (err)
-		fprintf(stderr, "btf_load_err:%d btf_fd:%d\n",
-			test->btf_load_err, btf_fd);
+	CHECK(err, "btf_fd:%d test->btf_load_err:%u",
+	      btf_fd, test->btf_load_err);
 
 	if (err || btf_fd == -1)
 		goto done;
@@ -992,16 +1010,15 @@ static int do_test_raw(unsigned int test_num)
 	map_fd = bpf_create_map_xattr(&create_attr);
 
 	err = ((map_fd == -1) != test->map_create_err);
-	if (err)
-		fprintf(stderr, "map_create_err:%d map_fd:%d\n",
-			test->map_create_err, map_fd);
+	CHECK(err, "map_fd:%d test->map_create_err:%u",
+	      map_fd, test->map_create_err);
 
 done:
 	if (!err)
-		fprintf(stderr, "OK\n");
+		fprintf(stderr, "OK");
 
 	if (*btf_log_buf && (err || args.always_log))
-		fprintf(stderr, "%s\n", btf_log_buf);
+		fprintf(stderr, "\n%s", btf_log_buf);
 
 	if (btf_fd != -1)
 		close(btf_fd);
@@ -1017,10 +1034,10 @@ static int test_raw(void)
 	int err = 0;
 
 	if (args.raw_test_num)
-		return do_test_raw(args.raw_test_num);
+		return count_result(do_test_raw(args.raw_test_num));
 
 	for (i = 1; i <= ARRAY_SIZE(raw_tests); i++)
-		err |= do_test_raw(i);
+		err |= count_result(do_test_raw(i));
 
 	return err;
 }
@@ -1080,8 +1097,7 @@ static int do_test_get_info(unsigned int test_num)
 	*btf_log_buf = '\0';
 
 	user_btf = malloc(raw_btf_size);
-	if (!user_btf) {
-		fprintf(stderr, "Cannot allocate memory for user_btf\n");
+	if (CHECK(!user_btf, "!user_btf")) {
 		err = -1;
 		goto done;
 	}
@@ -1089,9 +1105,7 @@ static int do_test_get_info(unsigned int test_num)
 	btf_fd = bpf_load_btf(raw_btf, raw_btf_size,
 			      btf_log_buf, BTF_LOG_BUF_SIZE,
 			      args.always_log);
-	if (btf_fd == -1) {
-		fprintf(stderr, "bpf_load_btf:%s(%d)\n",
-			strerror(errno), errno);
+	if (CHECK(btf_fd == -1, "errno:%d", errno)) {
 		err = -1;
 		goto done;
 	}
@@ -1103,31 +1117,31 @@ static int do_test_get_info(unsigned int test_num)
 		       raw_btf_size - expected_nbytes);
 
 	err = bpf_obj_get_info_by_fd(btf_fd, user_btf, &user_btf_size);
-	if (err || user_btf_size != raw_btf_size ||
-	    memcmp(raw_btf, user_btf, expected_nbytes)) {
-		fprintf(stderr,
-			"err:%d(errno:%d) raw_btf_size:%u user_btf_size:%u expected_nbytes:%u memcmp:%d\n",
-			err, errno,
-			raw_btf_size, user_btf_size, expected_nbytes,
-			memcmp(raw_btf, user_btf, expected_nbytes));
+	if (CHECK(err || user_btf_size != raw_btf_size ||
+		  memcmp(raw_btf, user_btf, expected_nbytes),
+		  "err:%d(errno:%d) raw_btf_size:%u user_btf_size:%u expected_nbytes:%u memcmp:%d",
+		  err, errno,
+		  raw_btf_size, user_btf_size, expected_nbytes,
+		  memcmp(raw_btf, user_btf, expected_nbytes))) {
 		err = -1;
 		goto done;
 	}
 
 	while (expected_nbytes < raw_btf_size) {
 		fprintf(stderr, "%u...", expected_nbytes);
-		if (user_btf[expected_nbytes++] != 0xff) {
-			fprintf(stderr, "!= 0xff\n");
+		if (CHECK(user_btf[expected_nbytes++] != 0xff,
+			  "user_btf[%u]:%x != 0xff", expected_nbytes - 1,
+			  user_btf[expected_nbytes - 1])) {
 			err = -1;
 			goto done;
 		}
 	}
 
-	fprintf(stderr, "OK\n");
+	fprintf(stderr, "OK");
 
 done:
 	if (*btf_log_buf && (err || args.always_log))
-		fprintf(stderr, "%s\n", btf_log_buf);
+		fprintf(stderr, "\n%s", btf_log_buf);
 
 	free(raw_btf);
 	free(user_btf);
@@ -1144,10 +1158,10 @@ static int test_get_info(void)
 	int err = 0;
 
 	if (args.get_info_test_num)
-		return do_test_get_info(args.get_info_test_num);
+		return count_result(do_test_get_info(args.get_info_test_num));
 
 	for (i = 1; i <= ARRAY_SIZE(get_info_tests); i++)
-		err |= do_test_get_info(i);
+		err |= count_result(do_test_get_info(i));
 
 	return err;
 }
@@ -1175,28 +1189,21 @@ static int file_has_btf_elf(const char *fn)
 	Elf *elf;
 	int ret;
 
-	if (elf_version(EV_CURRENT) == EV_NONE) {
-		fprintf(stderr, "Failed to init libelf\n");
+	if (CHECK(elf_version(EV_CURRENT) == EV_NONE,
+		  "elf_version(EV_CURRENT) == EV_NONE"))
 		return -1;
-	}
 
 	elf_fd = open(fn, O_RDONLY);
-	if (elf_fd == -1) {
-		fprintf(stderr, "Cannot open file %s: %s(%d)\n",
-			fn, strerror(errno), errno);
+	if (CHECK(elf_fd == -1, "open(%s): errno:%d", fn, errno))
 		return -1;
-	}
 
 	elf = elf_begin(elf_fd, ELF_C_READ, NULL);
-	if (!elf) {
-		fprintf(stderr, "Failed to read ELF from %s. %s\n", fn,
-			elf_errmsg(elf_errno()));
+	if (CHECK(!elf, "elf_begin(%s): %s", fn, elf_errmsg(elf_errno()))) {
 		ret = -1;
 		goto done;
 	}
 
-	if (!gelf_getehdr(elf, &ehdr)) {
-		fprintf(stderr, "Failed to get EHDR from %s\n", fn);
+	if (CHECK(!gelf_getehdr(elf, &ehdr), "!gelf_getehdr(%s)", fn)) {
 		ret = -1;
 		goto done;
 	}
@@ -1205,9 +1212,8 @@ static int file_has_btf_elf(const char *fn)
 		const char *sh_name;
 		GElf_Shdr sh;
 
-		if (gelf_getshdr(scn, &sh) != &sh) {
-			fprintf(stderr,
-				"Failed to get section header from %s\n", fn);
+		if (CHECK(gelf_getshdr(scn, &sh) != &sh,
+			  "file:%s gelf_getshdr != &sh", fn)) {
 			ret = -1;
 			goto done;
 		}
@@ -1243,53 +1249,44 @@ static int do_test_file(unsigned int test_num)
 		return err;
 
 	if (err == 0) {
-		fprintf(stderr, "SKIP. No ELF %s found\n", BTF_ELF_SEC);
+		fprintf(stderr, "SKIP. No ELF %s found", BTF_ELF_SEC);
+		skip_cnt++;
 		return 0;
 	}
 
 	obj = bpf_object__open(test->file);
-	if (IS_ERR(obj))
+	if (CHECK(IS_ERR(obj), "obj: %ld", PTR_ERR(obj)))
 		return PTR_ERR(obj);
 
 	err = bpf_object__btf_fd(obj);
-	if (err == -1) {
-		fprintf(stderr, "bpf_object__btf_fd: -1\n");
+	if (CHECK(err == -1, "bpf_object__btf_fd: -1"))
 		goto done;
-	}
 
 	prog = bpf_program__next(NULL, obj);
-	if (!prog) {
-		fprintf(stderr, "Cannot find bpf_prog\n");
+	if (CHECK(!prog, "Cannot find bpf_prog")) {
 		err = -1;
 		goto done;
 	}
 
 	bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT);
 	err = bpf_object__load(obj);
-	if (err < 0) {
-		fprintf(stderr, "bpf_object__load: %d\n", err);
+	if (CHECK(err < 0, "bpf_object__load: %d", err))
 		goto done;
-	}
 
 	map = bpf_object__find_map_by_name(obj, "btf_map");
-	if (!map) {
-		fprintf(stderr, "btf_map not found\n");
+	if (CHECK(!map, "btf_map not found")) {
 		err = -1;
 		goto done;
 	}
 
 	err = (bpf_map__btf_key_id(map) == 0 || bpf_map__btf_value_id(map) == 0)
 		!= test->btf_kv_notfound;
-	if (err) {
-		fprintf(stderr,
-			"btf_kv_notfound:%u btf_key_id:%u btf_value_id:%u\n",
-			test->btf_kv_notfound,
-			bpf_map__btf_key_id(map),
-			bpf_map__btf_value_id(map));
+	if (CHECK(err, "btf_key_id:%u btf_value_id:%u test->btf_kv_notfound:%u",
+		  bpf_map__btf_key_id(map), bpf_map__btf_value_id(map),
+		  test->btf_kv_notfound))
 		goto done;
-	}
 
-	fprintf(stderr, "OK\n");
+	fprintf(stderr, "OK");
 
 done:
 	bpf_object__close(obj);
@@ -1302,10 +1299,10 @@ static int test_file(void)
 	int err = 0;
 
 	if (args.file_test_num)
-		return do_test_file(args.file_test_num);
+		return count_result(do_test_file(args.file_test_num));
 
 	for (i = 1; i <= ARRAY_SIZE(file_tests); i++)
-		err |= do_test_file(i);
+		err |= count_result(do_test_file(i));
 
 	return err;
 }
@@ -1425,7 +1422,7 @@ static int test_pprint(void)
 	unsigned int key;
 	uint8_t *raw_btf;
 	ssize_t nread;
-	int err;
+	int err, ret;
 
 	fprintf(stderr, "%s......", test->descr);
 	raw_btf = btf_raw_create(&hdr_tmpl, test->raw_types,
@@ -1441,10 +1438,8 @@ static int test_pprint(void)
 			      args.always_log);
 	free(raw_btf);
 
-	if (btf_fd == -1) {
+	if (CHECK(btf_fd == -1, "errno:%d", errno)) {
 		err = -1;
-		fprintf(stderr, "bpf_load_btf: %s(%d)\n",
-			strerror(errno), errno);
 		goto done;
 	}
 
@@ -1458,26 +1453,23 @@ static int test_pprint(void)
 	create_attr.btf_value_id = test->value_id;
 
 	map_fd = bpf_create_map_xattr(&create_attr);
-	if (map_fd == -1) {
+	if (CHECK(map_fd == -1, "errno:%d", errno)) {
 		err = -1;
-		fprintf(stderr, "bpf_creat_map_btf: %s(%d)\n",
-			strerror(errno), errno);
 		goto done;
 	}
 
-	if (snprintf(pin_path, sizeof(pin_path), "%s/%s",
-		     "/sys/fs/bpf", test->map_name) == sizeof(pin_path)) {
+	ret = snprintf(pin_path, sizeof(pin_path), "%s/%s",
+		       "/sys/fs/bpf", test->map_name);
+
+	if (CHECK(ret == sizeof(pin_path), "pin_path %s/%s is too long",
+		  "/sys/fs/bpf", test->map_name)) {
 		err = -1;
-		fprintf(stderr, "pin_path is too long\n");
 		goto done;
 	}
 
 	err = bpf_obj_pin(map_fd, pin_path);
-	if (err) {
-		fprintf(stderr, "Cannot pin to %s. %s(%d).\n", pin_path,
-			strerror(errno), errno);
+	if (CHECK(err, "bpf_obj_pin(%s): errno:%d.", pin_path, errno))
 		goto done;
-	}
 
 	for (key = 0; key < test->max_entries; key++) {
 		set_pprint_mapv(&mapv, key);
@@ -1485,10 +1477,8 @@ static int test_pprint(void)
 	}
 
 	pin_file = fopen(pin_path, "r");
-	if (!pin_file) {
+	if (CHECK(!pin_file, "fopen(%s): errno:%d", pin_path, errno)) {
 		err = -1;
-		fprintf(stderr, "fopen(%s): %s(%d)\n", pin_path,
-			strerror(errno), errno);
 		goto done;
 	}
 
@@ -1497,9 +1487,8 @@ static int test_pprint(void)
 	       *line == '#')
 		;
 
-	if (nread <= 0) {
+	if (CHECK(nread <= 0, "Unexpected EOF")) {
 		err = -1;
-		fprintf(stderr, "Unexpected EOF\n");
 		goto done;
 	}
 
@@ -1518,9 +1507,9 @@ static int test_pprint(void)
 					  mapv.ui8a[4], mapv.ui8a[5], mapv.ui8a[6], mapv.ui8a[7],
 					  pprint_enum_str[mapv.aenum]);
 
-		if (nexpected_line == sizeof(expected_line)) {
+		if (CHECK(nexpected_line == sizeof(expected_line),
+			  "expected_line is too long")) {
 			err = -1;
-			fprintf(stderr, "expected_line is too long\n");
 			goto done;
 		}
 
@@ -1535,15 +1524,15 @@ static int test_pprint(void)
 		nread = getline(&line, &line_len, pin_file);
 	} while (++key < test->max_entries && nread > 0);
 
-	if (key < test->max_entries) {
+	if (CHECK(key < test->max_entries,
+		  "Unexpected EOF. key:%u test->max_entries:%u",
+		  key, test->max_entries)) {
 		err = -1;
-		fprintf(stderr, "Unexpected EOF\n");
 		goto done;
 	}
 
-	if (nread > 0) {
+	if (CHECK(nread > 0, "Unexpected extra pprint output: %s", line)) {
 		err = -1;
-		fprintf(stderr, "Unexpected extra pprint output: %s\n", line);
 		goto done;
 	}
 
@@ -1551,9 +1540,9 @@ static int test_pprint(void)
 
 done:
 	if (!err)
-		fprintf(stderr, "OK\n");
+		fprintf(stderr, "OK");
 	if (*btf_log_buf && (err || args.always_log))
-		fprintf(stderr, "%s\n", btf_log_buf);
+		fprintf(stderr, "\n%s", btf_log_buf);
 	if (btf_fd != -1)
 		close(btf_fd);
 	if (map_fd != -1)
@@ -1634,6 +1623,12 @@ static int parse_args(int argc, char **argv)
 	return 0;
 }
 
+static void print_summary(void)
+{
+	fprintf(stderr, "PASS:%u SKIP:%u FAIL:%u\n",
+		pass_cnt - skip_cnt, skip_cnt, error_cnt);
+}
+
 int main(int argc, char **argv)
 {
 	int err = 0;
@@ -1655,15 +1650,17 @@ int main(int argc, char **argv)
 		err |= test_file();
 
 	if (args.pprint_test)
-		err |= test_pprint();
+		err |= count_result(test_pprint());
 
 	if (args.raw_test || args.get_info_test || args.file_test ||
 	    args.pprint_test)
-		return err;
+		goto done;
 
 	err |= test_raw();
 	err |= test_get_info();
 	err |= test_file();
 
+done:
+	print_summary();
 	return err;
 }
-- 
2.9.5


* [PATCH bpf-next 6/6] bpf: btf: Tests for BPF_OBJ_GET_INFO_BY_FD and BPF_BTF_GET_FD_BY_ID
From: Martin KaFai Lau @ 2018-05-04 21:49 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180504214955.1058805-1-kafai@fb.com>

This patch adds tests for BPF_BTF_GET_FD_BY_ID and the new
btf_id/btf_key_id/btf_value_id members in "struct bpf_map_info".

It also modifies the existing BPF_OBJ_GET_INFO_BY_FD test
to reflect the new "struct bpf_btf_info".
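
One of the new tests ("Large bpf_btf_info") passes an info buffer that
is larger than the kernel's struct and expects GET_INFO to fail when the
extra bytes are non-zero and to succeed when they are zero.  The rule
itself can be sketched in plain userspace C as below.  This is only an
illustration under assumptions: tail_is_zero() is a hypothetical helper
name, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function): return true if every
 * byte of buf in [known_size, actual_size) is zero.  "known_size" plays
 * the role of the struct size the callee understands, "actual_size" the
 * size the caller passed in.
 */
static bool tail_is_zero(const void *buf, size_t known_size,
			 size_t actual_size)
{
	const unsigned char *p = buf;
	size_t i;

	assert(buf);

	/* Nothing to check when the caller's buffer is not larger. */
	for (i = known_size; i < actual_size; i++)
		if (p[i])
			return false;

	return true;
}
```

A caller would reject the request when tail_is_zero() returns false and
otherwise report back the size it actually understands, which is what
the test checks through info_len.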

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
---
 tools/lib/bpf/bpf.c                    |  10 ++
 tools/lib/bpf/bpf.h                    |   1 +
 tools/testing/selftests/bpf/test_btf.c | 289 +++++++++++++++++++++++++++++++--
 3 files changed, 287 insertions(+), 13 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 76b36cc16e7f..a3a8fb2ac697 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -458,6 +458,16 @@ int bpf_map_get_fd_by_id(__u32 id)
 	return sys_bpf(BPF_MAP_GET_FD_BY_ID, &attr, sizeof(attr));
 }
 
+int bpf_btf_get_fd_by_id(__u32 id)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.btf_id = id;
+
+	return sys_bpf(BPF_BTF_GET_FD_BY_ID, &attr, sizeof(attr));
+}
+
 int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
 {
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 553b11ad52b3..fb3a146d92ff 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -98,6 +98,7 @@ int bpf_prog_get_next_id(__u32 start_id, __u32 *next_id);
 int bpf_map_get_next_id(__u32 start_id, __u32 *next_id);
 int bpf_prog_get_fd_by_id(__u32 id);
 int bpf_map_get_fd_by_id(__u32 id);
+int bpf_btf_get_fd_by_id(__u32 id);
 int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len);
 int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
 		   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);
diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index b7880a20fad1..c8bceae7ec02 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -1047,9 +1047,13 @@ struct btf_get_info_test {
 	const char *str_sec;
 	__u32 raw_types[MAX_NR_RAW_TYPES];
 	__u32 str_sec_size;
-	int info_size_delta;
+	int btf_size_delta;
+	int (*special_test)(unsigned int test_num);
 };
 
+static int test_big_btf_info(unsigned int test_num);
+static int test_btf_id(unsigned int test_num);
+
 const struct btf_get_info_test get_info_tests[] = {
 {
 	.descr = "== raw_btf_size+1",
@@ -1060,7 +1064,7 @@ const struct btf_get_info_test get_info_tests[] = {
 	},
 	.str_sec = "",
 	.str_sec_size = sizeof(""),
-	.info_size_delta = 1,
+	.btf_size_delta = 1,
 },
 {
 	.descr = "== raw_btf_size-3",
@@ -1071,20 +1075,274 @@ const struct btf_get_info_test get_info_tests[] = {
 	},
 	.str_sec = "",
 	.str_sec_size = sizeof(""),
-	.info_size_delta = -3,
+	.btf_size_delta = -3,
+},
+{
+	.descr = "Large bpf_btf_info",
+	.raw_types = {
+		/* int */				/* [1] */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),
+		BTF_END_RAW,
+	},
+	.str_sec = "",
+	.str_sec_size = sizeof(""),
+	.special_test = test_big_btf_info,
+},
+{
+	.descr = "BTF ID",
+	.raw_types = {
+		/* int */				/* [1] */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),
+		/* unsigned int */			/* [2] */
+		BTF_TYPE_INT_ENC(0, 0, 0, 32, 4),
+		BTF_END_RAW,
+	},
+	.str_sec = "",
+	.str_sec_size = sizeof(""),
+	.special_test = test_btf_id,
 },
 };
 
+static inline __u64 ptr_to_u64(const void *ptr)
+{
+	return (__u64)(unsigned long)ptr;
+}
+
+static int test_big_btf_info(unsigned int test_num)
+{
+	const struct btf_get_info_test *test = &get_info_tests[test_num - 1];
+	uint8_t *raw_btf = NULL, *user_btf = NULL;
+	unsigned int raw_btf_size;
+	struct {
+		struct bpf_btf_info info;
+		uint64_t garbage;
+	} info_garbage;
+	struct bpf_btf_info *info;
+	int btf_fd = -1, err;
+	uint32_t info_len;
+
+	raw_btf = btf_raw_create(&hdr_tmpl,
+				 test->raw_types,
+				 test->str_sec,
+				 test->str_sec_size,
+				 &raw_btf_size);
+
+	if (!raw_btf)
+		return -1;
+
+	*btf_log_buf = '\0';
+
+	user_btf = malloc(raw_btf_size);
+	if (CHECK(!user_btf, "!user_btf")) {
+		err = -1;
+		goto done;
+	}
+
+	btf_fd = bpf_load_btf(raw_btf, raw_btf_size,
+			      btf_log_buf, BTF_LOG_BUF_SIZE,
+			      args.always_log);
+	if (CHECK(btf_fd == -1, "errno:%d", errno)) {
+		err = -1;
+		goto done;
+	}
+
+	/*
+	 * GET_INFO should error out if the userspace info
+	 * has non-zero trailing bytes.
+	 */
+	info = &info_garbage.info;
+	memset(info, 0, sizeof(*info));
+	info_garbage.garbage = 0xdeadbeef;
+	info_len = sizeof(info_garbage);
+	info->btf = ptr_to_u64(user_btf);
+	info->btf_size = raw_btf_size;
+
+	err = bpf_obj_get_info_by_fd(btf_fd, info, &info_len);
+	if (CHECK(!err, "!err")) {
+		err = -1;
+		goto done;
+	}
+
+	/*
+	 * GET_INFO should succeed even if info_len is larger than
+	 * the kernel supports, as long as the trailing bytes are zero.
+	 * The info len supported by the kernel should also be returned
+	 * to userspace.
+	 */
+	info_garbage.garbage = 0;
+	err = bpf_obj_get_info_by_fd(btf_fd, info, &info_len);
+	if (CHECK(err || info_len != sizeof(*info),
+		  "err:%d errno:%d info_len:%u sizeof(*info):%lu",
+		  err, errno, info_len, sizeof(*info))) {
+		err = -1;
+		goto done;
+	}
+
+	fprintf(stderr, "OK");
+
+done:
+	if (*btf_log_buf && (err || args.always_log))
+		fprintf(stderr, "\n%s", btf_log_buf);
+
+	free(raw_btf);
+	free(user_btf);
+
+	if (btf_fd != -1)
+		close(btf_fd);
+
+	return err;
+}
+
+static int test_btf_id(unsigned int test_num)
+{
+	const struct btf_get_info_test *test = &get_info_tests[test_num - 1];
+	struct bpf_create_map_attr create_attr = {};
+	uint8_t *raw_btf = NULL, *user_btf[2] = {};
+	int btf_fd[2] = {-1, -1}, map_fd = -1;
+	struct bpf_map_info map_info = {};
+	struct bpf_btf_info info[2] = {};
+	unsigned int raw_btf_size;
+	uint32_t info_len;
+	int err, i, ret;
+
+	raw_btf = btf_raw_create(&hdr_tmpl,
+				 test->raw_types,
+				 test->str_sec,
+				 test->str_sec_size,
+				 &raw_btf_size);
+
+	if (!raw_btf)
+		return -1;
+
+	*btf_log_buf = '\0';
+
+	for (i = 0; i < 2; i++) {
+		user_btf[i] = malloc(raw_btf_size);
+		if (CHECK(!user_btf[i], "!user_btf[%d]", i)) {
+			err = -1;
+			goto done;
+		}
+		info[i].btf = ptr_to_u64(user_btf[i]);
+		info[i].btf_size = raw_btf_size;
+	}
+
+	btf_fd[0] = bpf_load_btf(raw_btf, raw_btf_size,
+				 btf_log_buf, BTF_LOG_BUF_SIZE,
+				 args.always_log);
+	if (CHECK(btf_fd[0] == -1, "errno:%d", errno)) {
+		err = -1;
+		goto done;
+	}
+
+	/* Test BPF_OBJ_GET_INFO_BY_FD on btf_id */
+	info_len = sizeof(info[0]);
+	err = bpf_obj_get_info_by_fd(btf_fd[0], &info[0], &info_len);
+	if (CHECK(err, "errno:%d", errno)) {
+		err = -1;
+		goto done;
+	}
+
+	btf_fd[1] = bpf_btf_get_fd_by_id(info[0].id);
+	if (CHECK(btf_fd[1] == -1, "errno:%d", errno)) {
+		err = -1;
+		goto done;
+	}
+
+	ret = 0;
+	err = bpf_obj_get_info_by_fd(btf_fd[1], &info[1], &info_len);
+	if (CHECK(err || info[0].id != info[1].id ||
+		  info[0].btf_size != info[1].btf_size ||
+		  (ret = memcmp(user_btf[0], user_btf[1], info[0].btf_size)),
+		  "err:%d errno:%d id0:%u id1:%u btf_size0:%u btf_size1:%u memcmp:%d",
+		  err, errno, info[0].id, info[1].id,
+		  info[0].btf_size, info[1].btf_size, ret)) {
+		err = -1;
+		goto done;
+	}
+
+	/* Test btf members in struct bpf_map_info */
+	create_attr.name = "test_btf_id";
+	create_attr.map_type = BPF_MAP_TYPE_ARRAY;
+	create_attr.key_size = sizeof(int);
+	create_attr.value_size = sizeof(unsigned int);
+	create_attr.max_entries = 4;
+	create_attr.btf_fd = btf_fd[0];
+	create_attr.btf_key_id = 1;
+	create_attr.btf_value_id = 2;
+
+	map_fd = bpf_create_map_xattr(&create_attr);
+	if (CHECK(map_fd == -1, "errno:%d", errno)) {
+		err = -1;
+		goto done;
+	}
+
+	info_len = sizeof(map_info);
+	err = bpf_obj_get_info_by_fd(map_fd, &map_info, &info_len);
+	if (CHECK(err || map_info.btf_id != info[0].id ||
+		  map_info.btf_key_id != 1 || map_info.btf_value_id != 2,
+		  "err:%d errno:%d info.id:%u btf_id:%u btf_key_id:%u btf_value_id:%u",
+		  err, errno, info[0].id, map_info.btf_id, map_info.btf_key_id,
+		  map_info.btf_value_id)) {
+		err = -1;
+		goto done;
+	}
+
+	for (i = 0; i < 2; i++) {
+		close(btf_fd[i]);
+		btf_fd[i] = -1;
+	}
+
+	/* Test BTF ID is removed from the kernel */
+	btf_fd[0] = bpf_btf_get_fd_by_id(map_info.btf_id);
+	if (CHECK(btf_fd[0] == -1, "errno:%d", errno)) {
+		err = -1;
+		goto done;
+	}
+	close(btf_fd[0]);
+	btf_fd[0] = -1;
+
+	/* The map holds the last ref to BTF and its btf_id */
+	close(map_fd);
+	map_fd = -1;
+	btf_fd[0] = bpf_btf_get_fd_by_id(map_info.btf_id);
+	if (CHECK(btf_fd[0] != -1, "BTF lingers")) {
+		err = -1;
+		goto done;
+	}
+
+	fprintf(stderr, "OK");
+
+done:
+	if (*btf_log_buf && (err || args.always_log))
+		fprintf(stderr, "\n%s", btf_log_buf);
+
+	free(raw_btf);
+	if (map_fd != -1)
+		close(map_fd);
+	for (i = 0; i < 2; i++) {
+		free(user_btf[i]);
+		if (btf_fd[i] != -1)
+			close(btf_fd[i]);
+	}
+
+	return err;
+}
+
 static int do_test_get_info(unsigned int test_num)
 {
 	const struct btf_get_info_test *test = &get_info_tests[test_num - 1];
 	unsigned int raw_btf_size, user_btf_size, expected_nbytes;
 	uint8_t *raw_btf = NULL, *user_btf = NULL;
-	int btf_fd = -1, err;
+	struct bpf_btf_info info = {};
+	int btf_fd = -1, err, ret;
+	uint32_t info_len;
 
-	fprintf(stderr, "BTF GET_INFO_BY_ID test[%u] (%s): ",
+	fprintf(stderr, "BTF GET_INFO test[%u] (%s): ",
 		test_num, test->descr);
 
+	if (test->special_test)
+		return test->special_test(test_num);
+
 	raw_btf = btf_raw_create(&hdr_tmpl,
 				 test->raw_types,
 				 test->str_sec,
@@ -1110,19 +1368,24 @@ static int do_test_get_info(unsigned int test_num)
 		goto done;
 	}
 
-	user_btf_size = (int)raw_btf_size + test->info_size_delta;
+	user_btf_size = (int)raw_btf_size + test->btf_size_delta;
 	expected_nbytes = min(raw_btf_size, user_btf_size);
 	if (raw_btf_size > expected_nbytes)
 		memset(user_btf + expected_nbytes, 0xff,
 		       raw_btf_size - expected_nbytes);
 
-	err = bpf_obj_get_info_by_fd(btf_fd, user_btf, &user_btf_size);
-	if (CHECK(err || user_btf_size != raw_btf_size ||
-		  memcmp(raw_btf, user_btf, expected_nbytes),
-		  "err:%d(errno:%d) raw_btf_size:%u user_btf_size:%u expected_nbytes:%u memcmp:%d",
-		  err, errno,
-		  raw_btf_size, user_btf_size, expected_nbytes,
-		  memcmp(raw_btf, user_btf, expected_nbytes))) {
+	info_len = sizeof(info);
+	info.btf = ptr_to_u64(user_btf);
+	info.btf_size = user_btf_size;
+
+	ret = 0;
+	err = bpf_obj_get_info_by_fd(btf_fd, &info, &info_len);
+	if (CHECK(err || !info.id || info_len != sizeof(info) ||
+		  info.btf_size != raw_btf_size ||
+		  (ret = memcmp(raw_btf, user_btf, expected_nbytes)),
+		  "err:%d errno:%d info.id:%u info_len:%u sizeof(info):%lu raw_btf_size:%u info.btf_size:%u expected_nbytes:%u memcmp:%d",
+		  err, errno, info.id, info_len, sizeof(info),
+		  raw_btf_size, info.btf_size, expected_nbytes, ret)) {
 		err = -1;
 		goto done;
 	}
-- 
2.9.5


* [PATCH bpf-next 0/6] Introduce BTF ID
From: Martin KaFai Lau @ 2018-05-04 21:49 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This series introduces BTF ID which is exposed through
the new BPF_BTF_GET_FD_BY_ID cmd, new "struct bpf_btf_info"
and new members in the "struct bpf_map_info".

Please see the individual patches for details.

Martin KaFai Lau (6):
  bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y
  bpf: btf: Introduce BTF ID
  bpf: btf: Add struct bpf_btf_info
  bpf: btf: Some test_btf clean up
  bpf: btf: Update tools/include/uapi/linux/btf.h with BTF ID
  bpf: btf: Tests for BPF_OBJ_GET_INFO_BY_FD and BPF_BTF_GET_FD_BY_ID

 include/linux/btf.h                    |   2 +
 include/uapi/linux/bpf.h               |  11 +
 kernel/bpf/btf.c                       | 136 ++++++++--
 kernel/bpf/syscall.c                   |  41 ++-
 tools/include/uapi/linux/bpf.h         |  11 +
 tools/lib/bpf/bpf.c                    |  10 +
 tools/lib/bpf/bpf.h                    |   1 +
 tools/testing/selftests/bpf/test_btf.c | 478 +++++++++++++++++++++++++--------
 8 files changed, 563 insertions(+), 127 deletions(-)

-- 
2.9.5


* [PATCH bpf-next 2/6] bpf: btf: Introduce BTF ID
From: Martin KaFai Lau @ 2018-05-04 21:49 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180504214955.1058805-1-kafai@fb.com>

This patch gives an ID to each loaded BTF.  The ID is allocated by
the idr like the existing prog-id and map-id.
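
As a rough illustration of what cyclic ID allocation means here (a
freed ID is not handed out again right away, the search continues from
the last allocated ID and wraps around), here is a toy userspace
sketch.  It is not the kernel idr: struct toy_idr, toy_alloc_cyclic(),
toy_remove(), toy_find() and the tiny TOY_MAX_ID are all hypothetical
names made up for this example, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ID 8	/* hypothetical tiny ID space: valid IDs are 1..7 */

struct toy_idr {
	void *slot[TOY_MAX_ID];	/* slot[0] is unused so 0 is never an ID */
	int next;		/* where the next search starts */
};

/* Return the new ID (>= 1), or -1 when every ID is taken. */
static int toy_alloc_cyclic(struct toy_idr *idr, void *ptr)
{
	int i, id;

	assert(idr && ptr);

	if (idr->next < 1 || idr->next >= TOY_MAX_ID)
		idr->next = 1;

	/* Probe each valid ID once, starting at 'next' and wrapping. */
	for (i = 0; i < TOY_MAX_ID - 1; i++) {
		id = 1 + (idr->next - 1 + i) % (TOY_MAX_ID - 1);
		if (!idr->slot[id]) {
			idr->slot[id] = ptr;
			idr->next = id + 1;
			return id;
		}
	}

	return -1;
}

/* Forget the object stored under 'id'. */
static void toy_remove(struct toy_idr *idr, int id)
{
	assert(idr && id >= 1 && id < TOY_MAX_ID);
	idr->slot[id] = NULL;
}

/* Look up an ID; NULL means "no such ID". */
static void *toy_find(const struct toy_idr *idr, int id)
{
	assert(idr);
	if (id < 1 || id >= TOY_MAX_ID)
		return NULL;
	return idr->slot[id];
}
```

In the patch the equivalent steps are btf_alloc_id(), btf_free_id() and
the idr_find() call in btf_get_fd_by_id().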

The btf_put(map->btf) is moved to __bpf_map_put() so that the
userspace can stop seeing the BTF ID ASAP when the last BTF
refcnt is gone.

It also makes BTF accessible from userspace through:
1. the new BPF_BTF_GET_FD_BY_ID command.  It is limited to CAP_SYS_ADMIN,
   which is in line with the BPF_BTF_LOAD cmd and the existing
   BPF_[MAP|PROG]_GET_FD_BY_ID cmds.
2. the new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"

Once the BTF ID handler is accessible from userspace, freeing a BTF
object has to go through an RCU grace period.  The BPF_BTF_GET_FD_BY_ID
cmd can then be done under rcu_read_lock() instead of taking the
spin_lock.
[Note: A similar rcu usage can be done to the existing
       bpf_prog_get_fd_by_id() in a follow up patch]

When processing the BPF_BTF_GET_FD_BY_ID cmd,
refcount_inc_not_zero() is needed because the BTF object
could already be on RCU death row (its refcnt has dropped to zero
and it is only waiting to be freed).  btf_get() is
removed since its usage is currently limited to btf.c
alone.  refcount_inc() is used directly instead.
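
For readers less familiar with the inc-not-zero pattern, a minimal
userspace sketch follows.  It is not the kernel code: struct obj,
obj_get_not_zero() and obj_put() are hypothetical names, and C11
atomics stand in for refcount_t.  The point is only that a lookup by ID
may find an object whose last reference is already gone, so it must
refuse to take a new reference in that case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct btf; only the refcount matters. */
struct obj {
	atomic_uint refcnt;
};

/* Take a reference only if the object is still live (refcnt > 0). */
static bool obj_get_not_zero(struct obj *o)
{
	unsigned int old;

	assert(o);

	old = atomic_load(&o->refcnt);
	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}

	/* Count already hit zero: the object is on its way to be freed. */
	return false;
}

/* Drop a reference; return true when the caller dropped the last one. */
static bool obj_put(struct obj *o)
{
	assert(o);
	return atomic_fetch_sub(&o->refcnt, 1) == 1;
}
```

A plain increment would be wrong in the lookup path because it could
turn a zero count back into one and hand out an object that is about to
be freed.  Holders that already own a reference (as in btf_get_by_fd())
can keep using an unconditional increment.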

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
---
 include/linux/btf.h      |   2 +
 include/uapi/linux/bpf.h |   5 +++
 kernel/bpf/btf.c         | 108 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/bpf/syscall.c     |  24 ++++++++++-
 4 files changed, 128 insertions(+), 11 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index a966dc6d61ee..e076c4697049 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -44,5 +44,7 @@ const struct btf_type *btf_type_id_size(const struct btf *btf,
 					u32 *ret_size);
 void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj,
 		       struct seq_file *m);
+int btf_get_fd_by_id(u32 id);
+u32 btf_id(const struct btf *btf);
 
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 93d5a4eeec2a..6106f23a9a8a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,6 +96,7 @@ enum bpf_cmd {
 	BPF_PROG_QUERY,
 	BPF_RAW_TRACEPOINT_OPEN,
 	BPF_BTF_LOAD,
+	BPF_BTF_GET_FD_BY_ID,
 };
 
 enum bpf_map_type {
@@ -344,6 +345,7 @@ union bpf_attr {
 			__u32		start_id;
 			__u32		prog_id;
 			__u32		map_id;
+			__u32		btf_id;
 		};
 		__u32		next_id;
 		__u32		open_flags;
@@ -2130,6 +2132,9 @@ struct bpf_map_info {
 	__u32 ifindex;
 	__u64 netns_dev;
 	__u64 netns_ino;
+	__u32 btf_id;
+	__u32 btf_key_id;
+	__u32 btf_value_id;
 } __attribute__((aligned(8)));
 
 /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index fa0dce0452e7..40950b6bf395 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -11,6 +11,7 @@
 #include <linux/file.h>
 #include <linux/uaccess.h>
 #include <linux/kernel.h>
+#include <linux/idr.h>
 #include <linux/bpf_verifier.h>
 #include <linux/btf.h>
 
@@ -179,6 +180,9 @@
 	     i < btf_type_vlen(struct_type);				\
 	     i++, member++)
 
+static DEFINE_IDR(btf_idr);
+static DEFINE_SPINLOCK(btf_idr_lock);
+
 struct btf {
 	union {
 		struct btf_header *hdr;
@@ -193,6 +197,8 @@ struct btf {
 	u32 types_size;
 	u32 data_size;
 	refcount_t refcnt;
+	u32 id;
+	struct rcu_head rcu;
 };
 
 enum verifier_phase {
@@ -598,6 +604,42 @@ static int btf_add_type(struct btf_verifier_env *env, struct btf_type *t)
 	return 0;
 }
 
+static int btf_alloc_id(struct btf *btf)
+{
+	int id;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_bh(&btf_idr_lock);
+	id = idr_alloc_cyclic(&btf_idr, btf, 1, INT_MAX, GFP_ATOMIC);
+	if (id > 0)
+		btf->id = id;
+	spin_unlock_bh(&btf_idr_lock);
+	idr_preload_end();
+
+	if (WARN_ON_ONCE(!id))
+		return -ENOSPC;
+
+	return id > 0 ? 0 : id;
+}
+
+static void btf_free_id(struct btf *btf)
+{
+	unsigned long flags;
+
+	/*
+	 * In map-in-map, calling map_delete_elem() on outer
+	 * map will call bpf_map_put on the inner map.
+	 * It will then eventually call btf_free_id()
+	 * on the inner map.  Some of the map_delete_elem()
+	 * implementations may have irqs disabled, so
+	 * we need to use the _irqsave() version instead
+	 * of the _bh() version.
+	 */
+	spin_lock_irqsave(&btf_idr_lock, flags);
+	idr_remove(&btf_idr, btf->id);
+	spin_unlock_irqrestore(&btf_idr_lock, flags);
+}
+
 static void btf_free(struct btf *btf)
 {
 	kvfree(btf->types);
@@ -607,15 +649,19 @@ static void btf_free(struct btf *btf)
 	kfree(btf);
 }
 
-static void btf_get(struct btf *btf)
+static void btf_free_rcu(struct rcu_head *rcu)
 {
-	refcount_inc(&btf->refcnt);
+	struct btf *btf = container_of(rcu, struct btf, rcu);
+
+	btf_free(btf);
 }
 
 void btf_put(struct btf *btf)
 {
-	if (btf && refcount_dec_and_test(&btf->refcnt))
-		btf_free(btf);
+	if (btf && refcount_dec_and_test(&btf->refcnt)) {
+		btf_free_id(btf);
+		call_rcu(&btf->rcu, btf_free_rcu);
+	}
 }
 
 static int env_resolve_init(struct btf_verifier_env *env)
@@ -2006,10 +2052,15 @@ const struct file_operations btf_fops = {
 	.release	= btf_release,
 };
 
+static int __btf_new_fd(struct btf *btf)
+{
+	return anon_inode_getfd("btf", &btf_fops, btf, O_RDONLY | O_CLOEXEC);
+}
+
 int btf_new_fd(const union bpf_attr *attr)
 {
 	struct btf *btf;
-	int fd;
+	int ret;
 
 	btf = btf_parse(u64_to_user_ptr(attr->btf),
 			attr->btf_size, attr->btf_log_level,
@@ -2018,12 +2069,23 @@ int btf_new_fd(const union bpf_attr *attr)
 	if (IS_ERR(btf))
 		return PTR_ERR(btf);
 
-	fd = anon_inode_getfd("btf", &btf_fops, btf,
-			      O_RDONLY | O_CLOEXEC);
-	if (fd < 0)
+	ret = btf_alloc_id(btf);
+	if (ret) {
+		btf_free(btf);
+		return ret;
+	}
+
+	/*
+	 * The BTF ID is published to the userspace.
+	 * All BTF free must go through call_rcu() from
+	 * now on (i.e. free by calling btf_put()).
+	 */
+
+	ret = __btf_new_fd(btf);
+	if (ret < 0)
 		btf_put(btf);
 
-	return fd;
+	return ret;
 }
 
 struct btf *btf_get_by_fd(int fd)
@@ -2042,7 +2104,7 @@ struct btf *btf_get_by_fd(int fd)
 	}
 
 	btf = f.file->private_data;
-	btf_get(btf);
+	refcount_inc(&btf->refcnt);
 	fdput(f);
 
 	return btf;
@@ -2062,3 +2124,29 @@ int btf_get_info_by_fd(const struct btf *btf,
 
 	return 0;
 }
+
+int btf_get_fd_by_id(u32 id)
+{
+	struct btf *btf;
+	int fd;
+
+	rcu_read_lock();
+	btf = idr_find(&btf_idr, id);
+	if (!btf || !refcount_inc_not_zero(&btf->refcnt))
+		btf = ERR_PTR(-ENOENT);
+	rcu_read_unlock();
+
+	if (IS_ERR(btf))
+		return PTR_ERR(btf);
+
+	fd = __btf_new_fd(btf);
+	if (fd < 0)
+		btf_put(btf);
+
+	return fd;
+}
+
+u32 btf_id(const struct btf *btf)
+{
+	return btf->id;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 263e13ede029..8b0a45d65454 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -252,7 +252,6 @@ static void bpf_map_free_deferred(struct work_struct *work)
 
 	bpf_map_uncharge_memlock(map);
 	security_bpf_map_free(map);
-	btf_put(map->btf);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
 }
@@ -273,6 +272,7 @@ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
 	if (atomic_dec_and_test(&map->refcnt)) {
 		/* bpf_map_free_id() must be called first */
 		bpf_map_free_id(map, do_idr_lock);
+		btf_put(map->btf);
 		INIT_WORK(&map->work, bpf_map_free_deferred);
 		schedule_work(&map->work);
 	}
@@ -2000,6 +2000,12 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
 	info.map_flags = map->map_flags;
 	memcpy(info.name, map->name, sizeof(map->name));
 
+	if (map->btf) {
+		info.btf_id = btf_id(map->btf);
+		info.btf_key_id = map->btf_key_id;
+		info.btf_value_id = map->btf_value_id;
+	}
+
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
 		if (err)
@@ -2057,6 +2063,19 @@ static int bpf_btf_load(const union bpf_attr *attr)
 	return btf_new_fd(attr);
 }
 
+#define BPF_BTF_GET_FD_BY_ID_LAST_FIELD btf_id
+
+static int bpf_btf_get_fd_by_id(const union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_BTF_GET_FD_BY_ID))
+		return -EINVAL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	return btf_get_fd_by_id(attr->btf_id);
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -2140,6 +2159,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_BTF_LOAD:
 		err = bpf_btf_load(&attr);
 		break;
+	case BPF_BTF_GET_FD_BY_ID:
+		err = bpf_btf_get_fd_by_id(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
-- 
2.9.5


* Re: [PATCH bpf-next 00/10] bpf: support offload of bpf_event_output()
From: Daniel Borkmann @ 2018-05-04 21:46 UTC (permalink / raw)
  To: Jakub Kicinski, alexei.starovoitov; +Cc: oss-drivers, netdev
In-Reply-To: <20180504013717.29317-1-jakub.kicinski@netronome.com>

On 05/04/2018 03:37 AM, Jakub Kicinski wrote:
> Hi!
> 
> This series centres on NFP offload of bpf_event_output().  The
> first patch allows perf event arrays to be used by offloaded
> programs.  Next patch makes the nfp driver keep track of such
> arrays to be able to filter FW events referring to maps.
> Perf event arrays are not device bound.  Having driver
> reimplement and manage the perf array seems brittle and unnecessary.
> 
> Patch 4 moves slightly the verifier step which replaces map fds
> with map pointers.  This is useful for nfp JIT since we can then
> easily replace host pointers with NFP table ids (patch 6).  This
> allows us to lift the limitation on map helpers having to be used
> with the same map pointer on all paths.  Second use of replacing
> fds with real host map pointers is that we can use the host map
> pointer as a key for FW events in perf event array offload.
> 
> Patch 5 adds perf event output offload support for the NFP.
> 
> There are some differences between bpf_event_output() offloaded
> and non-offloaded version.  The FW messages which carry events
> may get dropped and reordered relatively easily.  The return codes
> from the helper are also not guaranteed to match the host.  Users
> are warned about some of those discrepancies with a one time
> warning message to kernel logs.
> 
> bpftool gains an ability to dump perf ring events in a very simple
> format.  This was very useful for testing and simple debug, maybe
> it will be useful to others?
> 
> Last patch is a trivial comment fix.

Nice approach, applied to bpf-next, thanks Jakub!


* Re: [PATCH] cxgb4vf: fix t4vf_eth_xmit()'s return type
From: Casey Leedom @ 2018-05-04 21:41 UTC (permalink / raw)
  To: Luc Van Oostenryck, linux-kernel@vger.kernel.org; +Cc: netdev@vger.kernel.org
In-Reply-To: <20180424131902.5767-1-luc.vanoostenryck@gmail.com>

| From: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
| Sent: Tuesday, April 24, 2018 6:19:02 AM
| 
| The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
| which is a typedef for an enum type, but the implementation in this
| driver returns an 'int'.
| 
| Fix this by returning 'netdev_tx_t' in this driver too.

Looks good to me.

Casey


* Re: [PATCH net-next] net: core: rework skb_probe_transport_header()
From: kbuild test robot @ 2018-05-04 21:35 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: kbuild-all, netdev, David S. Miller, Eric Dumazet, Jason Wang
In-Reply-To: <7cbdf466f4a1bf44ddbb948428dc7bb0dad091a7.1525340013.git.pabeni@redhat.com>

Hi Paolo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Paolo-Abeni/net-core-rework-skb_probe_transport_header/20180504-041345
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> include/linux/skbuff.h:2360:32: sparse: Using plain integer as NULL pointer
   drivers/net/tun.c:2088:40: sparse: expression using sizeof(void)
   drivers/net/tun.c:2221:15: sparse: expression using sizeof(void)
   drivers/net/tun.c:2221:15: sparse: expression using sizeof(void)
   drivers/net/tun.c:2846:36: sparse: incorrect type in argument 2 (different address spaces) @@    expected struct tun_prog [noderef] <asn:4>**prog_p @@    got noderef] <asn:4>**prog_p @@
   drivers/net/tun.c:2846:36:    expected struct tun_prog [noderef] <asn:4>**prog_p
   drivers/net/tun.c:2846:36:    got struct tun_prog **prog_p
   drivers/net/tun.c:3142:42: sparse: incorrect type in argument 2 (different address spaces) @@    expected struct tun_prog **prog_p @@    got struct tun_prog [struct tun_prog **prog_p @@
   drivers/net/tun.c:3142:42:    expected struct tun_prog **prog_p
   drivers/net/tun.c:3142:42:    got struct tun_prog [noderef] <asn:4>**<noident>
   drivers/net/tun.c:3146:42: sparse: incorrect type in argument 2 (different address spaces) @@    expected struct tun_prog **prog_p @@    got struct tun_prog [struct tun_prog **prog_p @@
   drivers/net/tun.c:3146:42:    expected struct tun_prog **prog_p
   drivers/net/tun.c:3146:42:    got struct tun_prog [noderef] <asn:4>**<noident>
--
>> include/linux/skbuff.h:2360:32: sparse: Using plain integer as NULL pointer
   drivers/net/tap.c:879:15: sparse: expression using sizeof(void)
   drivers/net/tap.c:879:15: sparse: expression using sizeof(void)
--
   drivers/net/xen-netback/netback.c:175:21: sparse: expression using sizeof(void)
   drivers/net/xen-netback/netback.c:182:35: sparse: expression using sizeof(void)
   drivers/net/xen-netback/netback.c:182:35: sparse: expression using sizeof(void)
>> include/linux/skbuff.h:2360:32: sparse: Using plain integer as NULL pointer
   drivers/net/xen-netback/netback.c:1632:37: sparse: expression using sizeof(void)

vim +2360 include/linux/skbuff.h

  2349	
  2350	static inline void skb_probe_transport_header(struct sk_buff *skb,
  2351						      const int offset_hint)
  2352	{
  2353		struct flow_keys_basic keys;
  2354	
  2355		if (skb_transport_header_was_set(skb))
  2356			return;
  2357	
  2358		memset(&keys, 0, sizeof(keys));
  2359		if (__skb_flow_dissect(skb, &flow_keys_buf_dissector, &keys,
> 2360				       0, 0, 0, 0, 0))
  2361			skb_set_transport_header(skb, keys.control.thoff);
  2362		else
  2363			skb_set_transport_header(skb, offset_hint);
  2364	}
  2365	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


* Re: [PATCH bpf-next 09/10] tools: bpftool: add simple perf event output reader
From: Alexei Starovoitov @ 2018-05-04 21:25 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: daniel, oss-drivers, netdev
In-Reply-To: <20180504013717.29317-10-jakub.kicinski@netronome.com>

On Thu, May 03, 2018 at 06:37:16PM -0700, Jakub Kicinski wrote:
> Users of BPF sooner or later discover perf_event_output() helpers
> and BPF_MAP_TYPE_PERF_EVENT_ARRAY.  Dumping this array type is
> not possible, however, we can add simple reading of perf events.
> Create a new event_pipe subcommand for maps, this sub command
> will only work with BPF_MAP_TYPE_PERF_EVENT_ARRAY maps.
> 
> Parts of the code from samples/bpf/trace_output_user.c.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  .../bpf/bpftool/Documentation/bpftool-map.rst |  29 +-
>  tools/bpf/bpftool/Documentation/bpftool.rst   |   2 +-
>  tools/bpf/bpftool/Makefile                    |   7 +-
>  tools/bpf/bpftool/bash-completion/bpftool     |  36 +-
>  tools/bpf/bpftool/common.c                    |  19 +
>  tools/bpf/bpftool/main.h                      |   4 +
>  tools/bpf/bpftool/map.c                       |  19 +-
>  tools/bpf/bpftool/map_perf_ring.c             | 347 ++++++++++++++++++
>  8 files changed, 444 insertions(+), 19 deletions(-)
>  create mode 100644 tools/bpf/bpftool/map_perf_ring.c
> 
> diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
> index c3eef8c972cd..a6258bc8ec4f 100644
> --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
> +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
> @@ -22,12 +22,13 @@ MAP COMMANDS
>  =============
>  
>  |	**bpftool** **map { show | list }**   [*MAP*]
> -|	**bpftool** **map dump**    *MAP*
> -|	**bpftool** **map update**  *MAP*  **key** *DATA*   **value** *VALUE* [*UPDATE_FLAGS*]
> -|	**bpftool** **map lookup**  *MAP*  **key** *DATA*
> -|	**bpftool** **map getnext** *MAP* [**key** *DATA*]
> -|	**bpftool** **map delete**  *MAP*  **key** *DATA*
> -|	**bpftool** **map pin**     *MAP*  *FILE*
> +|	**bpftool** **map dump**       *MAP*
> +|	**bpftool** **map update**     *MAP*  **key** *DATA*   **value** *VALUE* [*UPDATE_FLAGS*]
> +|	**bpftool** **map lookup**     *MAP*  **key** *DATA*
> +|	**bpftool** **map getnext**    *MAP* [**key** *DATA*]
> +|	**bpftool** **map delete**     *MAP*  **key** *DATA*
> +|	**bpftool** **map pin**        *MAP*  *FILE*
> +|	**bpftool** **map event_pipe** *MAP* [**cpu** *N* **index** *M*]
>  |	**bpftool** **map help**
>  |
>  |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
> @@ -76,6 +77,22 @@ DESCRIPTION
>  
>  		  Note: *FILE* must be located in *bpffs* mount.
>  
> +	**bpftool** **map event_pipe** *MAP* [**cpu** *N* **index** *M*]
> +		  Read events from a BPF_MAP_TYPE_PERF_EVENT_ARRAY map.
> +
> +		  Install perf rings into a perf event array map and dump
> +		  output of any bpf_perf_event_output() call in the kernel.
> +		  By default read the number of CPUs on the system and
> +		  install perf ring for each CPU in the corresponding index
> +		  in the array.
> +
> +		  If **cpu** and **index** are specified, install perf ring
> +		  for given **cpu** at **index** in the array (single ring).
> +
> +		  Note that installing a perf ring into an array will silently
> +		  replace any existing ring.  Any other application will stop
> +		  receiving events if it installed its rings earlier.
> +
>  	**bpftool map help**
>  		  Print short help message.
>  
> diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
> index 20689a321ffe..564cb0d9692b 100644
> --- a/tools/bpf/bpftool/Documentation/bpftool.rst
> +++ b/tools/bpf/bpftool/Documentation/bpftool.rst
> @@ -23,7 +23,7 @@ SYNOPSIS
>  
>  	*MAP-COMMANDS* :=
>  	{ **show** | **list** | **dump** | **update** | **lookup** | **getnext** | **delete**
> -	| **pin** | **help** }
> +	| **pin** | **event_pipe** | **help** }
>  
>  	*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump xlated** | **pin**
>  	| **load** | **help** }
> diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
> index 4e69782c4a79..892dbf095bff 100644
> --- a/tools/bpf/bpftool/Makefile
> +++ b/tools/bpf/bpftool/Makefile
> @@ -39,7 +39,12 @@ CC = gcc
>  
>  CFLAGS += -O2
>  CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow -Wno-missing-field-initializers
> -CFLAGS += -DPACKAGE='"bpftool"' -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/
> +CFLAGS += -DPACKAGE='"bpftool"' -D__EXPORTED_HEADERS__ \
> +	-I$(srctree)/kernel/bpf/ \
> +	-I$(srctree)/tools/include \
> +	-I$(srctree)/tools/include/uapi \
> +	-I$(srctree)/tools/lib/bpf \
> +	-I$(srctree)/tools/perf
>  CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"'
>  LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
>  
> diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
> index 852d84a98acd..b301c9b315f1 100644
> --- a/tools/bpf/bpftool/bash-completion/bpftool
> +++ b/tools/bpf/bpftool/bash-completion/bpftool
> @@ -1,6 +1,6 @@
>  # bpftool(8) bash completion                               -*- shell-script -*-
>  #
> -# Copyright (C) 2017 Netronome Systems, Inc.
> +# Copyright (C) 2017-2018 Netronome Systems, Inc.
>  #
>  # This software is dual licensed under the GNU General License
>  # Version 2, June 1991 as shown in the file COPYING in the top-level
> @@ -79,6 +79,14 @@ _bpftool_get_map_ids()
>          command sed -n 's/.*"id": \(.*\),$/\1/p' )" -- "$cur" ) )
>  }
>  
> +_bpftool_get_perf_map_ids()
> +{
> +    COMPREPLY+=( $( compgen -W "$( bpftool -jp map  2>&1 | \
> +        command grep -C2 perf_event_array | \
> +        command sed -n 's/.*"id": \(.*\),$/\1/p' )" -- "$cur" ) )
> +}
> +
> +
>  _bpftool_get_prog_ids()
>  {
>      COMPREPLY+=( $( compgen -W "$( bpftool -jp prog 2>&1 | \
> @@ -359,10 +367,34 @@ _bpftool()
>                      fi
>                      return 0
>                      ;;
> +                event_pipe)
> +                    case $prev in
> +                        $command)
> +                            COMPREPLY=( $( compgen -W "$MAP_TYPE" -- "$cur" ) )
> +                            return 0
> +                            ;;
> +                        id)
> +                            _bpftool_get_perf_map_ids
> +                            return 0
> +                            ;;
> +                        cpu)
> +                            return 0
> +                            ;;
> +                        index)
> +                            return 0
> +                            ;;
> +                        *)
> +                            _bpftool_once_attr 'cpu'
> +                            _bpftool_once_attr 'index'
> +                            return 0
> +                            ;;
> +                    esac
> +                    ;;
>                  *)
>                      [[ $prev == $object ]] && \
>                          COMPREPLY=( $( compgen -W 'delete dump getnext help \
> -                            lookup pin show list update' -- "$cur" ) )
> +                            lookup pin event_pipe show list update' -- \
> +                            "$cur" ) )
>                      ;;
>              esac
>              ;;
> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index 9c620770c6ed..32f9e397a6c0 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -331,6 +331,16 @@ char *get_fdinfo(int fd, const char *key)
>  	return NULL;
>  }
>  
> +void print_data_json(uint8_t *data, size_t len)
> +{
> +	unsigned int i;
> +
> +	jsonw_start_array(json_wtr);
> +	for (i = 0; i < len; i++)
> +		jsonw_printf(json_wtr, "%d", data[i]);
> +	jsonw_end_array(json_wtr);
> +}
> +
>  void print_hex_data_json(uint8_t *data, size_t len)
>  {
>  	unsigned int i;
> @@ -421,6 +431,15 @@ void delete_pinned_obj_table(struct pinned_obj_table *tab)
>  	}
>  }
>  
> +unsigned int get_page_size(void)
> +{
> +	static int result;
> +
> +	if (!result)
> +		result = getpagesize();
> +	return result;
> +}
> +
>  unsigned int get_possible_cpus(void)
>  {
>  	static unsigned int result;
> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
> index cbf8985da362..6173cd997e7a 100644
> --- a/tools/bpf/bpftool/main.h
> +++ b/tools/bpf/bpftool/main.h
> @@ -117,14 +117,18 @@ int do_pin_fd(int fd, const char *name);
>  
>  int do_prog(int argc, char **arg);
>  int do_map(int argc, char **arg);
> +int do_event_pipe(int argc, char **argv);
>  int do_cgroup(int argc, char **arg);
>  
>  int prog_parse_fd(int *argc, char ***argv);
> +int map_parse_fd_and_info(int *argc, char ***argv, void *info, __u32 *info_len);
>  
>  void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
>  		       const char *arch);
> +void print_data_json(uint8_t *data, size_t len);
>  void print_hex_data_json(uint8_t *data, size_t len);
>  
> +unsigned int get_page_size(void);
>  unsigned int get_possible_cpus(void);
>  const char *ifindex_to_bfd_name_ns(__u32 ifindex, __u64 ns_dev, __u64 ns_ino);
>  
> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index 5efefde5f578..af6766e956ba 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -130,8 +130,7 @@ static int map_parse_fd(int *argc, char ***argv)
>  	return -1;
>  }
>  
> -static int
> -map_parse_fd_and_info(int *argc, char ***argv, void *info, __u32 *info_len)
> +int map_parse_fd_and_info(int *argc, char ***argv, void *info, __u32 *info_len)
>  {
>  	int err;
>  	int fd;
> @@ -817,12 +816,13 @@ static int do_help(int argc, char **argv)
>  
>  	fprintf(stderr,
>  		"Usage: %s %s { show | list }   [MAP]\n"
> -		"       %s %s dump    MAP\n"
> -		"       %s %s update  MAP  key DATA value VALUE [UPDATE_FLAGS]\n"
> -		"       %s %s lookup  MAP  key DATA\n"
> -		"       %s %s getnext MAP [key DATA]\n"
> -		"       %s %s delete  MAP  key DATA\n"
> -		"       %s %s pin     MAP  FILE\n"
> +		"       %s %s dump       MAP\n"
> +		"       %s %s update     MAP  key DATA value VALUE [UPDATE_FLAGS]\n"
> +		"       %s %s lookup     MAP  key DATA\n"
> +		"       %s %s getnext    MAP [key DATA]\n"
> +		"       %s %s delete     MAP  key DATA\n"
> +		"       %s %s pin        MAP  FILE\n"
> +		"       %s %s event_pipe MAP [cpu N index M]\n"
>  		"       %s %s help\n"
>  		"\n"
>  		"       MAP := { id MAP_ID | pinned FILE }\n"
> @@ -834,7 +834,7 @@ static int do_help(int argc, char **argv)
>  		"",
>  		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
>  		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
> -		bin_name, argv[-2], bin_name, argv[-2]);
> +		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
>  
>  	return 0;
>  }
> @@ -849,6 +849,7 @@ static const struct cmd cmds[] = {
>  	{ "getnext",	do_getnext },
>  	{ "delete",	do_delete },
>  	{ "pin",	do_pin },
> +	{ "event_pipe",	do_event_pipe },
>  	{ 0 }
>  };
>  
> diff --git a/tools/bpf/bpftool/map_perf_ring.c b/tools/bpf/bpftool/map_perf_ring.c
> new file mode 100644
> index 000000000000..c5a2ced8552d
> --- /dev/null
> +++ b/tools/bpf/bpftool/map_perf_ring.c
> @@ -0,0 +1,347 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2018 Netronome Systems, Inc. */
> +/* This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <libbpf.h>
> +#include <poll.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +#include <unistd.h>
> +#include <linux/bpf.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/syscall.h>
> +
> +#include <bpf.h>
> +#include <perf-sys.h>
> +
> +#include "main.h"
> +
> +#define MMAP_PAGE_CNT	16
> +
> +static bool stop;
> +
> +struct event_ring_info {
> +	int fd;
> +	int key;
> +	unsigned int cpu;
> +	void *mem;
> +};
> +
> +struct perf_event_sample {
> +	struct perf_event_header header;
> +	__u32 size;
> +	unsigned char data[];
> +};
> +
> +static void int_exit(int signo)
> +{
> +	fprintf(stderr, "Stopping...\n");
> +	stop = true;
> +}
> +
> +static void
> +print_bpf_output(struct event_ring_info *ring, struct perf_event_sample *e)
> +{
> +	struct {
> +		struct perf_event_header header;
> +		__u64 id;
> +		__u64 lost;
> +	} *lost = (void *)e;
> +	struct timespec ts;
> +
> +	if (clock_gettime(CLOCK_MONOTONIC, &ts)) {
> +		perror("Can't read clock for timestamp");
> +		return;
> +	}
> +
> +	if (json_output) {
> +		jsonw_start_object(json_wtr);
> +		jsonw_name(json_wtr, "timestamp");
> +		jsonw_uint(json_wtr, ts.tv_sec * 1000000000ull + ts.tv_nsec);
> +		jsonw_name(json_wtr, "type");
> +		jsonw_uint(json_wtr, e->header.type);
> +		jsonw_name(json_wtr, "cpu");
> +		jsonw_uint(json_wtr, ring->cpu);
> +		jsonw_name(json_wtr, "index");
> +		jsonw_uint(json_wtr, ring->key);
> +		if (e->header.type == PERF_RECORD_SAMPLE) {
> +			jsonw_name(json_wtr, "data");
> +			print_data_json(e->data, e->size);
> +		} else if (e->header.type == PERF_RECORD_LOST) {
> +			jsonw_name(json_wtr, "lost");
> +			jsonw_start_object(json_wtr);
> +			jsonw_name(json_wtr, "id");
> +			jsonw_uint(json_wtr, lost->id);
> +			jsonw_name(json_wtr, "count");
> +			jsonw_uint(json_wtr, lost->lost);
> +			jsonw_end_object(json_wtr);
> +		}
> +		jsonw_end_object(json_wtr);
> +	} else {
> +		if (e->header.type == PERF_RECORD_SAMPLE) {
> +			printf("== @%ld.%ld CPU: %d index: %d =====\n",
> +			       (long)ts.tv_sec, ts.tv_nsec,
> +			       ring->cpu, ring->key);
> +			fprint_hex(stdout, e->data, e->size, " ");
> +			printf("\n");
> +		} else if (e->header.type == PERF_RECORD_LOST) {
> +			printf("lost %lld events\n", lost->lost);
> +		} else {
> +			printf("unknown event type=%d size=%d\n",
> +			       e->header.type, e->header.size);
> +		}
> +	}
> +}
> +
> +static void
> +perf_event_read(struct event_ring_info *ring, void **buf, size_t *buf_len)
> +{
> +	volatile struct perf_event_mmap_page *header = ring->mem;
> +	__u64 buffer_size = MMAP_PAGE_CNT * get_page_size();
> +	__u64 data_tail = header->data_tail;
> +	__u64 data_head = header->data_head;
> +	void *base, *begin, *end;
> +
> +	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
> +	if (data_head == data_tail)
> +		return;

This function has been copied into several different places.
I think it's time to move it into a common library, such as libbpf.
It would be great if you could do that in a follow-up.

for the set:
Acked-by: Alexei Starovoitov <ast@kernel.org>
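
To make the suggestion above concrete, here is a rough sketch of what a
shared ring-reading helper could look like. It is not the bpftool code and
not an existing libbpf API: ring_consume, ring_copy, ring_count_cb and
struct rec_hdr are hypothetical names, and struct rec_hdr only mirrors the
layout of struct perf_event_header so that the sketch builds without kernel
headers. For simplicity it always copies each record into a scratch buffer
(the quoted code copies only when a record wraps), and it leaves out the
memory barriers and the data_tail write-back that a real consumer of the
mmap'ed perf page needs.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same layout as struct perf_event_header; redefined here (hypothetical
 * name) so the sketch does not depend on kernel headers. */
struct rec_hdr {
	uint32_t type;
	uint16_t misc;
	uint16_t size;	/* total record length, header included */
};

typedef void (*ring_rec_fn)(const struct rec_hdr *rec, void *priv);

/* Copy len bytes starting at logical position pos, wrapping at ring_size.
 * The caller guarantees len <= ring_size. */
static void ring_copy(void *dst, const uint8_t *ring, uint64_t ring_size,
		      uint64_t pos, size_t len)
{
	uint64_t off = pos % ring_size;
	size_t first = len;

	if (off + len > ring_size)
		first = (size_t)(ring_size - off);
	memcpy(dst, ring + off, first);
	memcpy((uint8_t *)dst + first, ring, len - first);
}

/* Walk all records between tail and head, hand each one to fn as a
 * contiguous copy, and return the new tail position. Stops early on a
 * malformed record or on allocation failure. */
uint64_t ring_consume(const uint8_t *ring, uint64_t ring_size,
		      uint64_t head, uint64_t tail,
		      ring_rec_fn fn, void *priv)
{
	uint8_t *buf = NULL;
	size_t buf_len = 0;

	while (tail != head) {
		struct rec_hdr hdr;
		uint8_t *tmp;

		if (head - tail < sizeof(hdr))
			break;
		ring_copy(&hdr, ring, ring_size, tail, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > head - tail ||
		    hdr.size > ring_size)
			break;	/* malformed or incomplete record */
		if (hdr.size > buf_len) {
			tmp = realloc(buf, hdr.size);
			if (!tmp)
				break;
			buf = tmp;
			buf_len = hdr.size;
		}
		ring_copy(buf, ring, ring_size, tail, hdr.size);
		fn((const struct rec_hdr *)buf, priv);
		tail += hdr.size;
	}
	free(buf);
	return tail;
}

/* Example callback: counts records; priv points to an unsigned int. */
void ring_count_cb(const struct rec_hdr *rec, void *priv)
{
	(void)rec;
	*(unsigned int *)priv += 1;
}
```

A caller would read data_head from the mmap'ed page, issue a read barrier,
call ring_consume() with the data area and the current tail, and then store
the returned value back to data_tail.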

^ permalink raw reply

* [PATCH] net: disable UDP punt on sockets in RCV_SHUTDOWN
From: Chintan Shah @ 2018-05-04 21:08 UTC (permalink / raw)
  To: davem, kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel
  Cc: chintsha, kamensky, takondra, xe-linux-external, enkechen

Consider a UDP application that opens multiple sockets with the same local
address/port combination (using the SO_REUSEPORT/SO_REUSEADDR socket
options) and connects one of them to a remote socket. If that connected
socket then calls shutdown(SHUT_RD), packets from the remote peer it
connected to are still queued to it instead of being delivered to the
other sockets, which are still in the normal state.

The UDP socket lookup does not take into account whether a socket has
shut down its read side. When the application calls shutdown(SHUT_RD),
the UDP socket's state changes (RCV_SHUTDOWN is set in sk_shutdown), but
the lookup ignores it.

The UDP socket lookup relies on the compute_score function, which checks
a socket's attributes against the incoming packet's headers and returns
a score based on the matches and mismatches. Check the socket's state
(sk->sk_shutdown) in compute_score as well, and return -1 (no match)
when RCV_SHUTDOWN is set.

Signed-off-by: Chintan Shah <chintsha@cisco.com>
CC: xe-linux-external@cisco.com
---
 net/ipv4/udp.c | 6 ++++++
 net/ipv6/udp.c | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0dfcd73..a5fe6d7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -402,6 +402,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
 #endif
 #endif

+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		return -1;
+
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
 	    ipv6_only_sock(sk))
@@ -483,6 +486,9 @@ static inline int compute_score2(struct sock *sk, struct net *net,
 #endif
 #endif

+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		return -1;
+
 	if (!net_eq(sock_net(sk), net) ||
 	    ipv6_only_sock(sk))
 		return -1;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d956cbb..2254b07 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -170,6 +170,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
 #endif
 #endif

+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		return -1;
+
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
 	    sk->sk_family != PF_INET6)
@@ -251,6 +254,9 @@ static inline int compute_score2(struct sock *sk, struct net *net,
 #endif
 #endif

+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		return -1;
+
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
 	    sk->sk_family != PF_INET6)
-- 
2.5.0
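
The effect of the check added above can be illustrated with a small
userspace model. This is only a sketch under simplifying assumptions, not
the kernel code: struct toy_sock, TOY_RCV_SHUTDOWN, toy_compute_score and
toy_lookup are hypothetical names, the score weights are arbitrary, and the
real compute_score takes more arguments and compares more fields.

```c
#include <stddef.h>

/* Hypothetical stand-in for the RCV_SHUTDOWN bit in sk->sk_shutdown. */
#define TOY_RCV_SHUTDOWN 1

/* Hypothetical, heavily reduced stand-in for struct sock. */
struct toy_sock {
	unsigned short port;		/* local port the socket is bound to */
	unsigned int   peer_addr;	/* connected peer, 0 if unconnected */
	unsigned char  shutdown;	/* TOY_RCV_SHUTDOWN after shutdown(SHUT_RD) */
};

/* Return -1 if the socket must not receive the packet, otherwise a score
 * that grows with the specificity of the match. */
int toy_compute_score(const struct toy_sock *sk,
		      unsigned short dport, unsigned int saddr)
{
	int score = 1;

	/* The check the patch adds: a socket that shut down its read
	 * side never matches. */
	if (sk->shutdown & TOY_RCV_SHUTDOWN)
		return -1;
	if (sk->port != dport)
		return -1;
	if (sk->peer_addr) {
		if (sk->peer_addr != saddr)
			return -1;
		score += 4;	/* connected sockets win over unconnected ones */
	}
	return score;
}

/* Pick the best-scoring socket, or NULL if none matches. */
const struct toy_sock *toy_lookup(const struct toy_sock *socks, size_t n,
				  unsigned short dport, unsigned int saddr)
{
	const struct toy_sock *best = NULL;
	int best_score = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int score = toy_compute_score(&socks[i], dport, saddr);

		if (score > best_score) {
			best = &socks[i];
			best_score = score;
		}
	}
	return best;
}
```

In this model, a connected socket outscores an unconnected one bound to the
same port until its shutdown flag is set; after that the lookup falls
through to the unconnected socket.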

^ permalink raw reply related
