* pull-request: bpf-next 2019-07-09
From: Daniel Borkmann @ 2019-07-09 0:13 UTC (permalink / raw)
To: davem; +Cc: daniel, ast, netdev, bpf
Hi David,
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Lots of libbpf improvements: i) addition of new APIs to attach BPF
programs to tracing entities such as {k,u}probes or tracepoints,
ii) improve specification of BTF-defined maps by eliminating the
need for data initialization for some of the members, iii) addition
of a high-level API for setting up and polling perf buffers for
BPF event output helpers, all from Andrii.
2) Add "prog run" subcommand to bpftool in order to test-run programs
through the kernel testing infrastructure of BPF, from Quentin.
3) Improve verifier for BPF sockaddr programs to support 8-byte stores
for user_ip6 and msg_src_ip6 members given clang tends to generate
such stores, from Stanislav.
4) Enable the new BPF JIT zero-extension optimization for further
riscv64 ALU ops, from Luke.
5) Fix a bpftool json JIT dump crash on powerpc, from Jiri.
6) Fix an AF_XDP race in generic XDP's receive path, from Ilya.
7) Various smaller fixes from Ilya, Yue and Arnd.
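As a rough illustration of item 1.ii, the sketch below shows how a
BTF-defined map can carry integer attributes in its type information
instead of in initialized data. The __uint and __type macro bodies are
assumed to mirror the ones added to the selftests' bpf_helpers.h in this
series; demo_map, DEMO_MAP_TYPE_HASH and DEMO_UINT_VALUE are names made
up for this example, and the file builds as ordinary host C rather than
as a BPF object.

```c
/* Assumed to mirror the selftests' helper macros: an integer attribute is
 * encoded as an array dimension, a type attribute as a pointee type. */
#define __uint(name, val) int (*name)[val]
#define __type(name, val) val *name

/* Hypothetical stand-in for a map type constant; illustration only. */
#define DEMO_MAP_TYPE_HASH 1

/* No member is ever initialized: every attribute lives in the type. */
struct {
	__uint(type, DEMO_MAP_TYPE_HASH);
	__uint(max_entries, 64);
	__type(key, int);
	__type(value, long);
} demo_map;

/* Hypothetical helper: recover the integer from the array dimension,
 * similar in spirit to a loader reading the BTF array type. */
#define DEMO_UINT_VALUE(member) (sizeof(*(member)) / sizeof(int))
```

Since sizeof does not evaluate its operand, the never-assigned pointers
in demo_map are not dereferenced; only their types matter.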
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
Thanks a lot!
----------------------------------------------------------------
The following changes since commit c4cde5804d512a2f8934017dbf7df642dfbdf2ad:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2019-07-04 12:48:21 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
for you to fetch changes up to bf0bdd1343efbbf65b4d53aef1fce14acbd79d50:
xdp: fix race on generic receive path (2019-07-09 01:43:26 +0200)
----------------------------------------------------------------
Andrii Nakryiko (19):
libbpf: make libbpf_strerror_r agnostic to sign of error
libbpf: introduce concept of bpf_link
libbpf: add ability to attach/detach BPF program to perf event
libbpf: add kprobe/uprobe attach API
libbpf: add tracepoint attach API
libbpf: add raw tracepoint attach API
selftests/bpf: switch test to new attach_perf_event API
selftests/bpf: add kprobe/uprobe selftests
selftests/bpf: convert existing tracepoint tests to new APIs
libbpf: capture value in BTF type info for BTF-defined map defs
selftests/bpf: add __uint and __type macro for BTF-defined maps
selftests/bpf: convert selftests using BTF-defined maps to new syntax
selftests/bpf: convert legacy BPF maps to BTF-defined ones
libbpf: add perf buffer API
libbpf: auto-set PERF_EVENT_ARRAY size to number of CPUs
selftests/bpf: test perf buffer API
tools/bpftool: switch map event_pipe to libbpf's perf_buffer
libbpf: add perf_buffer_ prefix to README
selftests/bpf: fix test_attach_probe map definition
Arnd Bergmann (1):
bpf: avoid unused variable warning in tcp_bpf_rtt()
Daniel Borkmann (4):
Merge branch 'bpf-libbpf-link-trace'
Merge branch 'bpf-libbpf-int-btf-map'
Merge branch 'bpf-libbpf-perf-rb-api'
Merge branch 'bpf-sockaddr-wide-store'
Ilya Leoshkevich (1):
selftests/bpf: fix test_reuseport_array on s390
Ilya Maximets (1):
xdp: fix race on generic receive path
Jiri Olsa (1):
tools: bpftool: Fix json dump crash on powerpc
Luke Nelson (1):
bpf, riscv: Enable zext optimization for more RV64G ALU ops
Quentin Monnet (2):
tools: bpftool: add "prog run" subcommand to test-run programs
tools: bpftool: add completion for bpftool prog "loadall"
Stanislav Fomichev (5):
selftests/bpf: fix test_align liveliness expectations
selftests/bpf: add test_tcp_rtt to .gitignore
bpf: allow wide (u64) aligned stores for some fields of bpf_sock_addr
bpf: sync bpf.h to tools/
selftests/bpf: add verifier tests for wide stores
YueHaibing (1):
bpf: cgroup: Fix build error without CONFIG_NET
arch/riscv/net/bpf_jit_comp.c | 16 +-
include/linux/filter.h | 6 +
include/net/tcp.h | 4 +-
include/net/xdp_sock.h | 2 +
include/uapi/linux/bpf.h | 6 +-
kernel/bpf/cgroup.c | 4 +
net/core/filter.c | 22 +-
net/xdp/xsk.c | 31 +-
tools/bpf/bpftool/Documentation/bpftool-prog.rst | 34 +
tools/bpf/bpftool/bash-completion/bpftool | 35 +-
tools/bpf/bpftool/jit_disasm.c | 11 +-
tools/bpf/bpftool/main.c | 29 +
tools/bpf/bpftool/main.h | 1 +
tools/bpf/bpftool/map_perf_ring.c | 201 ++---
tools/bpf/bpftool/prog.c | 348 ++++++++-
tools/include/linux/sizes.h | 48 ++
tools/include/uapi/linux/bpf.h | 6 +-
tools/lib/bpf/README.rst | 3 +-
tools/lib/bpf/libbpf.c | 822 ++++++++++++++++++++-
tools/lib/bpf/libbpf.h | 70 ++
tools/lib/bpf/libbpf.map | 12 +-
tools/lib/bpf/str_error.c | 2 +-
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/bpf_helpers.h | 3 +
.../selftests/bpf/prog_tests/attach_probe.c | 166 +++++
.../testing/selftests/bpf/prog_tests/perf_buffer.c | 100 +++
.../selftests/bpf/prog_tests/stacktrace_build_id.c | 55 +-
.../bpf/prog_tests/stacktrace_build_id_nmi.c | 31 +-
.../selftests/bpf/prog_tests/stacktrace_map.c | 43 +-
.../bpf/prog_tests/stacktrace_map_raw_tp.c | 15 +-
tools/testing/selftests/bpf/progs/bpf_flow.c | 28 +-
.../selftests/bpf/progs/get_cgroup_id_kern.c | 26 +-
tools/testing/selftests/bpf/progs/netcnt_prog.c | 20 +-
tools/testing/selftests/bpf/progs/pyperf.h | 90 +--
.../selftests/bpf/progs/socket_cookie_prog.c | 13 +-
.../selftests/bpf/progs/sockmap_verdict_prog.c | 48 +-
tools/testing/selftests/bpf/progs/strobemeta.h | 68 +-
.../selftests/bpf/progs/test_attach_probe.c | 52 ++
tools/testing/selftests/bpf/progs/test_btf_newkv.c | 13 +-
.../selftests/bpf/progs/test_get_stack_rawtp.c | 39 +-
.../testing/selftests/bpf/progs/test_global_data.c | 37 +-
tools/testing/selftests/bpf/progs/test_l4lb.c | 65 +-
.../selftests/bpf/progs/test_l4lb_noinline.c | 65 +-
.../testing/selftests/bpf/progs/test_map_in_map.c | 30 +-
tools/testing/selftests/bpf/progs/test_map_lock.c | 26 +-
tools/testing/selftests/bpf/progs/test_obj_id.c | 12 +-
.../testing/selftests/bpf/progs/test_perf_buffer.c | 25 +
.../bpf/progs/test_select_reuseport_kern.c | 67 +-
.../selftests/bpf/progs/test_send_signal_kern.c | 26 +-
.../selftests/bpf/progs/test_sock_fields_kern.c | 78 +-
tools/testing/selftests/bpf/progs/test_spin_lock.c | 36 +-
.../selftests/bpf/progs/test_stacktrace_build_id.c | 55 +-
.../selftests/bpf/progs/test_stacktrace_map.c | 52 +-
.../testing/selftests/bpf/progs/test_tcp_estats.c | 13 +-
.../testing/selftests/bpf/progs/test_tcpbpf_kern.c | 26 +-
.../selftests/bpf/progs/test_tcpnotify_kern.c | 28 +-
tools/testing/selftests/bpf/progs/test_xdp.c | 26 +-
tools/testing/selftests/bpf/progs/test_xdp_loop.c | 26 +-
.../selftests/bpf/progs/test_xdp_noinline.c | 81 +-
.../testing/selftests/bpf/progs/xdp_redirect_map.c | 12 +-
tools/testing/selftests/bpf/progs/xdping_kern.c | 12 +-
tools/testing/selftests/bpf/test_align.c | 16 +-
tools/testing/selftests/bpf/test_maps.c | 21 +-
tools/testing/selftests/bpf/test_queue_stack_map.h | 30 +-
tools/testing/selftests/bpf/test_sockmap_kern.h | 110 +--
tools/testing/selftests/bpf/test_verifier.c | 17 +-
tools/testing/selftests/bpf/verifier/wide_store.c | 36 +
67 files changed, 2490 insertions(+), 1062 deletions(-)
create mode 100644 tools/include/linux/sizes.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/attach_probe.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_buffer.c
create mode 100644 tools/testing/selftests/bpf/progs/test_attach_probe.c
create mode 100644 tools/testing/selftests/bpf/progs/test_perf_buffer.c
create mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
* Re: [PATCH bpf v2] xdp: fix race on generic receive path
From: Daniel Borkmann @ 2019-07-09 0:13 UTC (permalink / raw)
To: Ilya Maximets, netdev
Cc: linux-kernel, bpf, xdp-newbies, David S. Miller,
Björn Töpel, Magnus Karlsson, Jonathan Lemon,
Jakub Kicinski, Alexei Starovoitov
In-Reply-To: <20190703120916.19973-1-i.maximets@samsung.com>
On 07/03/2019 02:09 PM, Ilya Maximets wrote:
> Unlike driver mode, generic xdp receive could be triggered
> by different threads on different CPU cores at the same time
> leading to the fill and rx queue breakage. For example, this
> could happen while sending packets from two processes to the
> first interface of veth pair while the second part of it is
> open with AF_XDP socket.
>
> Need to take a lock for each generic receive to avoid race.
>
> Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Applied, thanks!
* Re: [PATCH bpf-next] selftests/bpf: fix test_reuseport_array on s390
From: Daniel Borkmann @ 2019-07-09 0:12 UTC (permalink / raw)
To: Ilya Leoshkevich, bpf, netdev
In-Reply-To: <20190703115034.53984-1-iii@linux.ibm.com>
On 07/03/2019 01:50 PM, Ilya Leoshkevich wrote:
> Fix endianness issue: passing a pointer to 64-bit fd as a 32-bit key
> does not work on big-endian architectures. So cast fd to 32-bits when
> necessary.
>
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Applied, thanks!
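The endianness problem described in the commit message can be reproduced
in plain C. This is only a sketch under assumptions: the function names
are invented for the example and it does not touch any BPF map API; it
just compares reading a 32-bit key through a pointer to a 64-bit value
with converting the value first.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	uint8_t first_byte;

	memcpy(&first_byte, &probe, 1);
	return first_byte == 1;
}

/* Problem pattern: treat the address of a 64-bit fd as a 32-bit key.
 * Only the first four bytes in memory are read. */
static uint32_t key_from_u64_pointer(const uint64_t *fd64)
{
	uint32_t key;

	memcpy(&key, fd64, sizeof(key));
	return key;
}

/* Fixed pattern: convert the value to 32 bits before using it as a key. */
static uint32_t key_from_cast(uint64_t fd64)
{
	return (uint32_t)fd64;
}

/* 1 if the pointer-based key equals the intended 32-bit fd, else 0. */
static int pointer_key_is_correct(uint64_t fd64)
{
	return key_from_u64_pointer(&fd64) == (uint32_t)fd64;
}
```

On a big-endian machine such as s390 the first four bytes of a small
64-bit value are all zero, so the pointer-based key comes out as 0, while
the cast yields the intended value on either byte order.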
* Re: [PATCH bpf-next v3 1/6] xsk: replace ndo_xsk_async_xmit with ndo_xsk_wakeup
From: Daniel Borkmann @ 2019-07-08 23:40 UTC (permalink / raw)
To: Magnus Karlsson, bjorn.topel, ast, netdev, brouer
Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz,
ilias.apalodimas, kiran.patil, axboe, maciej.fijalkowski,
maciejromanfijalkowski, intel-wired-lan
In-Reply-To: <1562244134-19069-2-git-send-email-magnus.karlsson@intel.com>
On 07/04/2019 02:42 PM, Magnus Karlsson wrote:
> This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. This new
> ndo provides the same functionality as before but with the addition of
> a new flags field that is used to specify if Rx, Tx or both should be
> woken up. The previous ndo only woke up Tx, as implied by the
> name. The i40e and ixgbe drivers (which are all the supported ones)
> are updated with this new interface.
>
> This new ndo will be used by the new need_wakeup functionality of XDP
> sockets that need to be able to wake up both Rx and Tx driver
> processing.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++--
> drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++---
> drivers/net/ethernet/intel/i40e/i40e_xsk.h | 2 +-
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 +++--
> drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h | 2 +-
> drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 4 ++--
> drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c | 2 +-
> drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h | 2 +-
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
> include/linux/netdevice.h | 14 ++++++++++++--
> net/xdp/xdp_umem.c | 3 +--
> net/xdp/xsk.c | 3 ++-
> 12 files changed, 32 insertions(+), 19 deletions(-)
Looks good, but given that the driver changes to support the AF_XDP
need_wakeup feature are quite trivial, is there a reason you updated mlx5
here but did not add the actual need_wakeup support for it, so that all
three in-tree drivers are supported?
Thanks,
Daniel
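For readers skimming the thread, the interface change under discussion
can be sketched roughly as below. Everything here is hypothetical: the
DEMO_WAKEUP_* flag names and values, the demo_ring_state struct and both
functions are invented for the illustration and are not the definitions
from the series or from any driver.

```c
#include <stdint.h>

/* Hypothetical flag values for this sketch; the series defines its own. */
#define DEMO_WAKEUP_RX (1u << 0)
#define DEMO_WAKEUP_TX (1u << 1)

/* Toy driver state: counts how often each direction was kicked. */
struct demo_ring_state {
	unsigned int rx_kicks;
	unsigned int tx_kicks;
};

/* Old-style callback: takes no flags, so it can only ever wake up Tx. */
static int demo_async_xmit(struct demo_ring_state *st)
{
	st->tx_kicks++;
	return 0;
}

/* New-style callback: the caller says which direction(s) to wake up. */
static int demo_wakeup(struct demo_ring_state *st, uint32_t flags)
{
	if (flags & ~(DEMO_WAKEUP_RX | DEMO_WAKEUP_TX))
		return -1;		/* unknown flag bit */
	if (flags & DEMO_WAKEUP_RX)
		st->rx_kicks++;
	if (flags & DEMO_WAKEUP_TX)
		st->tx_kicks++;
	return 0;
}
```

A driver that has no Rx wakeup support yet could accept the Rx flag and
do nothing for it, which is why a pure rename of the callback is a small
change per driver.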
* Re: [PATCH net-next v2 0/3] net: Multipath hashing on inner L3
From: David Miller @ 2019-07-08 23:37 UTC (permalink / raw)
To: ssuryaextr; +Cc: netdev, idosch, nikolay, dsahern
In-Reply-To: <20190706145519.13488-1-ssuryaextr@gmail.com>
From: Stephen Suryaputra <ssuryaextr@gmail.com>
Date: Sat, 6 Jul 2019 10:55:16 -0400
> This series extends commit 363887a2cdfe ("ipv4: Support multipath
> hashing on inner IP pkts for GRE tunnel") to include support when the
> outer L3 is IPv6 and to consider the case where the inner L3 is
> different version from the outer L3, such as IPv6 tunneled by IPv4 GRE
> or vice versa. It also includes kselftest scripts to test the use cases.
>
> v2: Clarify the commit messages in the commits in this series to use the
> term tunneled by IPv4 GRE or by IPv6 GRE so that it's clear which
> one is the inner and which one is the outer (per David Miller).
Series applied, thanks.
* Re: [PATCH] net: pasemi: fix an use-after-free in pasemi_mac_phy_init()
From: David Miller @ 2019-07-08 23:33 UTC (permalink / raw)
To: wen.yang99
Cc: linux-kernel, xue.zhihong, wang.yi59, cheng.shengyu, tglx, mcgrof,
mpe, netdev
In-Reply-To: <1562387021-951-1-git-send-email-wen.yang99@zte.com.cn>
From: Wen Yang <wen.yang99@zte.com.cn>
Date: Sat, 6 Jul 2019 12:23:41 +0800
> The phy_dn variable is still being used in of_phy_connect() after the
> of_node_put() call, which may result in use-after-free.
>
> Fixes: 1dd2d06c0459 ("net: Rework pasemi_mac driver to use of_mdio infrastructure")
> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Applied.
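The ordering issue fixed here can be shown with a toy reference count in
user space. This is a sketch under assumptions: demo_node and the demo_*
functions are invented stand-ins for the device-tree node and for
of_phy_connect()/of_node_put(), not the driver code.

```c
#include <stdlib.h>

/* Toy reference-counted node; stands in for a device-tree node. */
struct demo_node {
	int refcount;
	int phy_id;
};

/* Counts nodes released so far, so the example can be checked. */
static int demo_nodes_freed;

static struct demo_node *demo_node_get(int phy_id)
{
	struct demo_node *n = malloc(sizeof(*n));

	if (!n)
		return NULL;
	n->refcount = 1;
	n->phy_id = phy_id;
	return n;
}

/* Drops one reference and frees the node when the count hits zero. */
static void demo_node_put(struct demo_node *n)
{
	if (n && --n->refcount == 0) {
		free(n);
		demo_nodes_freed++;
	}
}

/* Stand-in for the consumer of the node. */
static int demo_connect(const struct demo_node *n)
{
	return n->phy_id;
}

/* Fixed ordering: use the node first, drop the reference afterwards. */
static int demo_phy_init(int phy_id)
{
	struct demo_node *n = demo_node_get(phy_id);
	int connected;

	if (!n)
		return -1;
	connected = demo_connect(n);	/* node is still alive here */
	demo_node_put(n);		/* the last use is behind us */
	return connected;
}
```

Swapping the two calls in demo_phy_init() would hand demo_connect() a
pointer to memory that may already have been freed, which is the pattern
the patch removes.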
* Re: [PATCH] net: axienet: fix a potential double free in axienet_probe()
From: David Miller @ 2019-07-08 23:32 UTC (permalink / raw)
To: wen.yang99
Cc: linux-kernel, xue.zhihong, wang.yi59, cheng.shengyu, anirudh,
John.Linn, michal.simek, hancock, netdev, linux-arm-kernel
In-Reply-To: <1562384321-46727-1-git-send-email-wen.yang99@zte.com.cn>
From: Wen Yang <wen.yang99@zte.com.cn>
Date: Sat, 6 Jul 2019 11:38:41 +0800
> There is a possible use-after-free issue in the axienet_probe():
>
> 1701: np = of_parse_phandle(pdev->dev.of_node, "axistream-connected", 0);
> 1702: if (np) {
> ...
> 1787: of_node_put(np); ---> released here
> 1788: lp->eth_irq = platform_get_irq(pdev, 0);
> 1789: } else {
> ...
> 1801: }
> 1802: if (IS_ERR(lp->dma_regs)) {
> ...
> 1805: of_node_put(np); ---> double released here
> 1806: goto free_netdev;
> 1807: }
>
> We solve this problem by removing the unnecessary of_node_put().
>
> Fixes: 28ef9ebdb64c ("net: axienet: make use of axistream-connected attribute optional")
> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Applied to net-next
* Re: [RFC PATCH net-next 0/6] tc-taprio offload for SJA1105 DSA
From: Vinicius Costa Gomes @ 2019-07-08 23:28 UTC (permalink / raw)
To: Vladimir Oltean, f.fainelli, vivien.didelot, andrew, davem,
vedang.patel, richardcochran
Cc: weifeng.voon, jiri, m-karicheri2, Jose.Abreu, ilias.apalodimas,
netdev, Vladimir Oltean
In-Reply-To: <20190707172921.17731-1-olteanv@gmail.com>
Hi Vladimir,
Vladimir Oltean <olteanv@gmail.com> writes:
> Using Vinicius Costa Gomes' configuration interface for 802.1Qbv (later
> resent by Voon Weifeng for the stmmac driver), I am submitting for
> review a draft implementation of this offload for a DSA switch.
>
> I don't want to insist too much on the hardware specifics of SJA1105
> which isn't otherwise very compliant to the IEEE spec.
>
> In order to be able to test with Vedang Patel's iproute2 patch for
> taprio offload (https://www.spinics.net/lists/netdev/msg573072.html)
> I had to actually revert the txtime-assist branch as it had changed the
> iproute2 interface.
Now that Vedang's work has been merged, I will send a rebased version of
the taprio offload series (also taking your feedback into account, see
below).
>
> In terms of impact for DSA drivers, I would like to point out that:
>
> - Maybe somebody should pre-populate qopt->cycle_time in case the user
> does not provide one. Otherwise each driver needs to iterate over the
> GCL once, just to set the cycle time (right now stmmac does as
> well).
Very fair, this should be very easy to do from taprio side.
>
> - Configuring the switch over SPI cannot apparently be done from this
> ndo_setup_tc callback because it runs in atomic context. I also have
> some downstream patches to offload tc clsact matchall with mirred
> action, but in that case it looks like the atomic context restriction
> does not apply.
>
> - I had to copy the struct tc_taprio_qopt_offload to driver private
> memory because a static config needs to be constructed every time a
> change takes place, and there are up to 4 switch ports that may take a
> TAS configuration. I have created a private
> tc_taprio_qopt_offload_copy() helper for this - I don't know whether
> it's of any help in the general case.
If everyone needs to do this, perhaps we can think of something else.
One first idea is that taprio builds this configuration and gives
ownership of it to the driver, but we would need to add _ref()/_unref()
(or similar) helpers to the qdisc/driver API.
>
> There is more to be done however. The TAS needs to be integrated with
> the PTP driver. This is because with a PTP clock source, the base time
> is written dynamically to the PTPSCHTM (PTP schedule time) register and
> must be a time in the future. Then the "real" base time of each port's
> TAS config can be offset by at most ~50 ms (the DELTA field from the
> Schedule Entry Points Table) relative to PTPSCHTM.
> Because base times in the past are completely ignored by this hardware,
> we need to decide if it's ok behaviorally for a driver to "roll" a past
> base time into the immediate future by incrementally adding the cycle
> time (so the phase doesn't change).
That's another good piece of information. My understanding from reading
section 8.6.9.1.1 from the IEEE 802.1Q-2018 spec, is that it's ok:
"""
CycleStartTime = (OperBaseTime + N*OperCycleTime)
where N is the smallest integer for which the relation:
CycleStartTime >= CurrentTime
would be TRUE.
"""
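A minimal sketch of that rule in C, under some assumptions: all times
are signed 64-bit nanosecond counts, N is never negative (a base time in
the future is used as is), and overflow is ignored. The function name
demo_cycle_start is made up for the example.

```c
#include <stdint.h>

/* Roll a past base time forward by whole cycles so that the result is the
 * first cycle start that is not before 'now'. The phase is preserved
 * because only multiples of cycle_time are added. */
static int64_t demo_cycle_start(int64_t base_time, int64_t cycle_time,
				int64_t now)
{
	int64_t n;

	if (base_time >= now || cycle_time <= 0)
		return base_time;	/* N = 0, or nothing sensible to add */

	/* Ceiling division: smallest n with base_time + n * cycle_time >= now. */
	n = (now - base_time + cycle_time - 1) / cycle_time;
	return base_time + n * cycle_time;
}
```

A driver that needs some margin to program the hardware could pass
"now plus the time it takes to apply the config" instead of the current
time; how large that margin has to be is hardware specific.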
> If it is, then decide by how long in
> the future it is ok to do so. Or alternatively, is it preferable if the
> driver errors out if the user-supplied base time is in the past and the
> hardware doesn't like it? But even then, there might be fringe cases
> when the base time becomes a past PTP time right as the driver tries to
> apply the config.
This fringe case is interesting. I don't know how to handle it well. The
first idea that comes to mind is for the driver to add an integer number
of cycles so there's enough time to apply the config. But this I think
will be different for every driver, no?
> Also applying a tc-taprio offload to a second SJA1105 switch port will
> inevitably need to roll the first port's (now past) base time into an
> equivalent future time.
> All of this is going to be complicated even further by the fact that
> resetting the switch (to apply the tc-taprio offload) makes it reset its
> PTP time.
This is going to be complicated indeed.
Thanks a lot,
--
Vinicius
* Re: [PATCH] sunrpc/cache: remove the exporting of cache_seq_next
From: J. Bruce Fields @ 2019-07-08 23:23 UTC (permalink / raw)
To: Denis Efremov
Cc: Trond Myklebust, Chuck Lever, Anna Schumaker, David S. Miller,
linux-nfs, netdev, linux-kernel
In-Reply-To: <20190708161423.31006-1-efremov@linux.com>
Makes sense, thanks; applying for 5.3.
--b.
On Mon, Jul 08, 2019 at 07:14:23PM +0300, Denis Efremov wrote:
> The function cache_seq_next is declared static and marked
> EXPORT_SYMBOL_GPL, which is at best an odd combination. Because the
> function is not used outside of the net/sunrpc/cache.c file it is
> defined in, this commit removes the EXPORT_SYMBOL_GPL() marking.
>
> Fixes: d48cf356a130 ("SUNRPC: Remove non-RCU protected lookup")
> Signed-off-by: Denis Efremov <efremov@linux.com>
> ---
> net/sunrpc/cache.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index 66fbb9d2fba7..6f1528f271ee 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -1375,7 +1375,6 @@ static void *cache_seq_next(struct seq_file *m, void *p, loff_t *pos)
> hlist_first_rcu(&cd->hash_table[hash])),
> struct cache_head, cache_list);
> }
> -EXPORT_SYMBOL_GPL(cache_seq_next);
>
> void *cache_seq_start_rcu(struct seq_file *m, loff_t *pos)
> __acquires(RCU)
> --
> 2.21.0
* Re: [ovs-dev] [PATCH net-next] net: openvswitch: do not update max_headroom if new headroom is equal to old headroom
From: Gregory Rose @ 2019-07-08 23:22 UTC (permalink / raw)
To: David Miller, ap420073; +Cc: dev, netdev, Pravin Shelar
In-Reply-To: <87bfb355-9ddf-c27b-c160-b3028a945a22@gmail.com>
On 7/8/2019 4:18 PM, Gregory Rose wrote:
> On 7/8/2019 4:08 PM, David Miller wrote:
>> From: Taehee Yoo <ap420073@gmail.com>
>> Date: Sat, 6 Jul 2019 01:08:09 +0900
>>
>>> When a vport is deleted, the maximum headroom size would be changed.
>>> If the vport which has the largest headroom is deleted,
>>> the new max_headroom would be set.
>>> But, if the new headroom size is equal to the old headroom size,
>>> updating routine is unnecessary.
>>>
>>> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
>> I'm not so sure about the logic here and I'd therefore like an OVS
>> expert
>> to review this.
>
> I'll review and test it and get back. Pravin may have input as well.
>
Err, adding Pravin.
- Greg
> Thanks,
>
> - Greg
>
>> Thanks.
* Re: linux-next: manual merge of the net-next tree with the sh tree
From: Stephen Rothwell @ 2019-07-08 23:22 UTC (permalink / raw)
To: David Miller, Networking, Yoshinori Sato
Cc: Linux Next Mailing List, Linux Kernel Mailing List,
Krzysztof Kozlowski, Jiri Pirko
In-Reply-To: <20190617114011.4159295e@canb.auug.org.au>
Hi all,
On Mon, 17 Jun 2019 11:40:11 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Today's linux-next merge of the net-next tree got conflicts in:
>
> arch/sh/configs/se7712_defconfig
> arch/sh/configs/se7721_defconfig
> arch/sh/configs/titan_defconfig
>
> between commit:
>
> 7c04efc8d2ef ("sh: configs: Remove useless UEVENT_HELPER_PATH")
>
> from the sh tree and commit:
>
> a51486266c3b ("net: sched: remove NET_CLS_IND config option")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> --
> Cheers,
> Stephen Rothwell
>
> diff --cc arch/sh/configs/se7712_defconfig
> index 6ac7d362e106,1e116529735f..000000000000
> --- a/arch/sh/configs/se7712_defconfig
> +++ b/arch/sh/configs/se7712_defconfig
> @@@ -63,7 -63,7 +63,6 @@@ CONFIG_NET_SCH_NETEM=
> CONFIG_NET_CLS_TCINDEX=y
> CONFIG_NET_CLS_ROUTE4=y
> CONFIG_NET_CLS_FW=y
> - CONFIG_NET_CLS_IND=y
> -CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
> CONFIG_MTD=y
> CONFIG_MTD_BLOCK=y
> CONFIG_MTD_CFI=y
> diff --cc arch/sh/configs/se7721_defconfig
> index ffd15acc2a04,c66e512719ab..000000000000
> --- a/arch/sh/configs/se7721_defconfig
> +++ b/arch/sh/configs/se7721_defconfig
> @@@ -62,7 -62,7 +62,6 @@@ CONFIG_NET_SCH_NETEM=
> CONFIG_NET_CLS_TCINDEX=y
> CONFIG_NET_CLS_ROUTE4=y
> CONFIG_NET_CLS_FW=y
> - CONFIG_NET_CLS_IND=y
> -CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
> CONFIG_MTD=y
> CONFIG_MTD_BLOCK=y
> CONFIG_MTD_CFI=y
> diff --cc arch/sh/configs/titan_defconfig
> index 1c1c78e74fbb,171ab05ce4fc..000000000000
> --- a/arch/sh/configs/titan_defconfig
> +++ b/arch/sh/configs/titan_defconfig
> @@@ -142,7 -142,7 +142,6 @@@ CONFIG_GACT_PROB=
> CONFIG_NET_ACT_MIRRED=m
> CONFIG_NET_ACT_IPT=m
> CONFIG_NET_ACT_PEDIT=m
> - CONFIG_NET_CLS_IND=y
> -CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
> CONFIG_FW_LOADER=m
> CONFIG_CONNECTOR=m
> CONFIG_MTD=m
I am still getting this conflict (the commit ids may have changed).
Just a reminder in case you think Linus may need to know.
--
Cheers,
Stephen Rothwell
* Re: [ovs-dev] [PATCH net-next] net: openvswitch: do not update max_headroom if new headroom is equal to old headroom
From: Gregory Rose @ 2019-07-08 23:18 UTC (permalink / raw)
To: David Miller, ap420073; +Cc: dev, netdev
In-Reply-To: <20190708.160804.2026506853635876959.davem@davemloft.net>
On 7/8/2019 4:08 PM, David Miller wrote:
> From: Taehee Yoo <ap420073@gmail.com>
> Date: Sat, 6 Jul 2019 01:08:09 +0900
>
>> When a vport is deleted, the maximum headroom size would be changed.
>> If the vport which has the largest headroom is deleted,
>> the new max_headroom would be set.
>> But, if the new headroom size is equal to the old headroom size,
>> updating routine is unnecessary.
>>
>> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> I'm not so sure about the logic here and I'd therefore like an OVS expert
> to review this.
I'll review and test it and get back. Pravin may have input as well.
Thanks,
- Greg
> Thanks.
* [PATCH] net/mlx5e: Return in default case statement in tx_post_resync_params
From: Nathan Chancellor @ 2019-07-08 23:11 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky
Cc: David S. Miller, Boris Pismenny, netdev, linux-rdma, linux-kernel,
clang-built-linux, Nathan Chancellor
clang warns:
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:251:2:
warning: variable 'rec_seq_sz' is used uninitialized whenever switch
default is taken [-Wsometimes-uninitialized]
default:
^~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:255:46: note:
uninitialized use occurs here
skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
^~~~~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:239:16: note:
initialize the variable 'rec_seq_sz' to silence this warning
u16 rec_seq_sz;
^
= 0
1 warning generated.
This case statement was clearly designed to be one that should not be
hit during runtime because of the WARN_ON statement so just return early
to prevent copying uninitialized memory up into rn_be.
Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
Link: https://github.com/ClangBuiltLinux/linux/issues/590
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index 3f5f4317a22b..5c08891806f0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -250,6 +250,7 @@ tx_post_resync_params(struct mlx5e_txqsq *sq,
}
default:
WARN_ON(1);
+ return;
}
skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
--
2.22.0
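The shape of the fix can be tried outside the kernel with a small
stand-alone function. This is a sketch under assumptions: the names
demo_rec_seq_matches, DEMO_CIPHER_A and DEMO_REC_SEQ_SIZE are invented,
and unlike the void driver function it reports the unknown case through
a return value.

```c
#include <stddef.h>
#include <string.h>

enum demo_cipher { DEMO_CIPHER_A = 1 };	/* hypothetical cipher id */

#define DEMO_REC_SEQ_SIZE 8		/* hypothetical sequence size */

/* Returns 1 if the sequences match, 0 if they differ and -1 for an
 * unknown cipher. */
static int demo_rec_seq_matches(int cipher, const unsigned char *rec_seq,
				const unsigned char *rn_be)
{
	size_t rec_seq_sz;

	switch (cipher) {
	case DEMO_CIPHER_A:
		rec_seq_sz = DEMO_REC_SEQ_SIZE;
		break;
	default:
		/* Bail out here so rec_seq_sz is never read uninitialized. */
		return -1;
	}

	return memcmp(rec_seq, rn_be, rec_seq_sz) == 0;
}
```

Without the return in the default branch, control would fall out of the
switch and reach memcmp() with an indeterminate length, which is what
the clang warning points at.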
* Re: [net-next] net: fib_rules: do not flow dissect local packets
From: David Miller @ 2019-07-08 23:12 UTC (permalink / raw)
To: ppenkov; +Cc: netdev, roopa, edumazet
In-Reply-To: <20190705184643.249884-1-ppenkov@google.com>
From: Petar Penkov <ppenkov@google.com>
Date: Fri, 5 Jul 2019 11:46:43 -0700
> Rules matching on loopback iif do not need early flow dissection as the
> packet originates from the host. Stop counting such rules in
> fib_rule_requires_fldissect.
>
> Signed-off-by: Petar Penkov <ppenkov@google.com>
Roopa, please review.
* Re: [PATCH v2 net-next] net: stmmac: enable clause 45 mdio support
From: David Miller @ 2019-07-08 23:09 UTC (permalink / raw)
To: weifeng.voon
Cc: mcoquelin.stm32, netdev, linux-kernel, joabreu, peppe.cavallaro,
andrew, f.fainelli, alexandre.torgue, biao.huang, boon.leong.ong,
hock.leong.kweh
In-Reply-To: <1562348007-12263-1-git-send-email-weifeng.voon@intel.com>
From: Voon Weifeng <weifeng.voon@intel.com>
Date: Sat, 6 Jul 2019 01:33:27 +0800
> From: Kweh Hock Leong <hock.leong.kweh@intel.com>
>
> DWMAC4 is capable of supporting clause 45 MDIO communication.
> This patch enables the feature in stmmac_mdio_write() and
> stmmac_mdio_read() by following the phy_write_mmd() and
> phy_read_mmd() mdiobus read/write implementation format.
>
> Reviewed-by: Li, Yifan <yifan2.li@intel.com>
> Signed-off-by: Kweh Hock Leong <hock.leong.kweh@intel.com>
> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Applied, thanks.
* Re: [PATCH net-next] net: openvswitch: do not update max_headroom if new headroom is equal to old headroom
From: David Miller @ 2019-07-08 23:08 UTC (permalink / raw)
To: ap420073; +Cc: pshelar, netdev, dev
In-Reply-To: <20190705160809.5202-1-ap420073@gmail.com>
From: Taehee Yoo <ap420073@gmail.com>
Date: Sat, 6 Jul 2019 01:08:09 +0900
> When a vport is deleted, the maximum headroom size would be changed.
> If the vport which has the largest headroom is deleted,
> the new max_headroom would be set.
> But, if the new headroom size is equal to the old headroom size,
> updating routine is unnecessary.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
I'm not so sure about the logic here and I'd therefore like an OVS expert
to review this.
Thanks.
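For whoever reviews this, the behaviour described in the commit message
can be sketched as below. This is not the patch: the demo_* names and
the plain array of per-port headrooms are assumptions made for the
example, and the real datapath code may differ in how it finds the new
maximum.

```c
#include <stddef.h>

struct demo_datapath {
	unsigned int max_headroom;
	unsigned int updates;	/* counts how often the update routine ran */
};

/* Largest headroom among the remaining ports; 0 if there are none. */
static unsigned int demo_max_headroom(const unsigned int *headrooms, size_t n)
{
	unsigned int max = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (headrooms[i] > max)
			max = headrooms[i];
	return max;
}

/* Stand-in for the routine that pushes a new headroom to every port. */
static void demo_update_headroom(struct demo_datapath *dp,
				 unsigned int new_headroom)
{
	dp->max_headroom = new_headroom;
	dp->updates++;
}

/* Called after a port was removed; headrooms[] lists the remaining ports. */
static void demo_port_deleted(struct demo_datapath *dp,
			      const unsigned int *headrooms, size_t n)
{
	unsigned int new_headroom = demo_max_headroom(headrooms, n);

	/* Skip the update when the maximum did not actually change. */
	if (new_headroom != dp->max_headroom)
		demo_update_headroom(dp, new_headroom);
}
```

The case the commit message targets is two ports sharing the largest
headroom: deleting one of them leaves the maximum unchanged, so the
update routine does not need to run.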
* Re: [PATCH v3 net-next 19/19] ionic: Add basic devlink interface
From: Shannon Nelson @ 2019-07-08 22:58 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev
In-Reply-To: <20190708200350.GG2282@nanopsycho.orion>
On 7/8/19 1:03 PM, Jiri Pirko wrote:
> Mon, Jul 08, 2019 at 09:58:09PM CEST, snelson@pensando.io wrote:
>> On 7/8/19 12:34 PM, Jiri Pirko wrote:
>>> Mon, Jul 08, 2019 at 09:25:32PM CEST, snelson@pensando.io wrote:
>>>>
>>>> +
>>>> +static const struct devlink_ops ionic_dl_ops = {
>>>> + .info_get = ionic_dl_info_get,
>>>> +};
>>>> +
>>>> +int ionic_devlink_register(struct ionic *ionic)
>>>> +{
>>>> + struct devlink *dl;
>>>> + struct ionic **ip;
>>>> + int err;
>>>> +
>>>> + dl = devlink_alloc(&ionic_dl_ops, sizeof(struct ionic *));
>>> Oups. Something is wrong with your flow. The devlink alloc is allocating
>>> the structure that holds private data (per-device data) for you. This is
>>> misuse :/
>>>
>>> You are missing one parent device struct apparently.
>>>
>>> Oh, I think I see something like it. The unused "struct ionic_devlink".
>> If I'm not mistaken, the alloc is only allocating enough for a pointer, not
>> the whole per device struct, and a few lines down from here the pointer to
>> the new devlink struct is assigned to ionic->dl. This was based on what I
>> found in the qed driver's qed_devlink_register(), and it all seems to work.
> I'm not saying your code won't work. What I say is that you should have
> a struct for device that would be allocated by devlink_alloc()
Is there a particular reason why? I appreciate that devlink_alloc() can
give you this device specific space, just as alloc_etherdev_mq() can,
but is there a specific reason why this should be used instead of
setting up simply a pointer to a space that has already been allocated?
There are several drivers that are using it the way I've set up here,
which happened to be the first examples I followed - are they doing
something different that makes this valid for them?
>
> The ionic struct should be associated with devlink_port. That you are
> missing too.
We don't support any of devlink_port features at this point, just the
simple device information.
sln
>
>
>> That unused struct ionic_devlink does need to go away, it was superfluous
>> after working out a better typecast off of devlink_priv().
>>
>> I'll remove the unused struct ionic_devlink, but I think the rest is okay.
>>
>> sln
>>
>>>
>>>> + if (!dl) {
>>>> + dev_warn(ionic->dev, "devlink_alloc failed");
>>>> + return -ENOMEM;
>>>> + }
>>>> +
>>>> + ip = (struct ionic **)devlink_priv(dl);
>>>> + *ip = ionic;
>>>> + ionic->dl = dl;
>>>> +
>>>> + err = devlink_register(dl, ionic->dev);
>>>> + if (err) {
>>>> + dev_warn(ionic->dev, "devlink_register failed: %d\n", err);
>>>> + goto err_dl_free;
>>>> + }
>>>> +
>>>> + return 0;
>>>> +
>>>> +err_dl_free:
>>>> + ionic->dl = NULL;
>>>> + devlink_free(dl);
>>>> + return err;
>>>> +}
>>>> +
>>>> +void ionic_devlink_unregister(struct ionic *ionic)
>>>> +{
>>>> + if (!ionic->dl)
>>>> + return;
>>>> +
>>>> + devlink_unregister(ionic->dl);
>>>> + devlink_free(ionic->dl);
>>>> +}
>>>> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_devlink.h b/drivers/net/ethernet/pensando/ionic/ionic_devlink.h
>>>> new file mode 100644
>>>> index 000000000000..35528884e29f
>>>> --- /dev/null
>>>> +++ b/drivers/net/ethernet/pensando/ionic/ionic_devlink.h
>>>> @@ -0,0 +1,12 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/* Copyright(c) 2017 - 2019 Pensando Systems, Inc */
>>>> +
>>>> +#ifndef _IONIC_DEVLINK_H_
>>>> +#define _IONIC_DEVLINK_H_
>>>> +
>>>> +#include <net/devlink.h>
>>>> +
>>>> +int ionic_devlink_register(struct ionic *ionic);
>>>> +void ionic_devlink_unregister(struct ionic *ionic);
>>>> +
>>>> +#endif /* _IONIC_DEVLINK_H_ */
>>>> --
>>>> 2.17.1
>>>>
* Re: [PATCH net-next] net: openvswitch: use netif_ovs_is_port() instead of opencode
From: David Miller @ 2019-07-08 22:53 UTC (permalink / raw)
To: ap420073; +Cc: pshelar, netdev, dev
In-Reply-To: <20190705160546.4847-1-ap420073@gmail.com>
From: Taehee Yoo <ap420073@gmail.com>
Date: Sat, 6 Jul 2019 01:05:46 +0900
> Use netif_ovs_is_port() function instead of open code.
> This patch doesn't change logic.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Applied.
* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Jakub Kicinski @ 2019-07-08 22:51 UTC (permalink / raw)
To: Ido Schimmel
Cc: David Miller, netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy,
pablo, pieter.jansenvanvuuren, andrew, f.fainelli, vivien.didelot,
idosch, Alexei Starovoitov
In-Reply-To: <20190708131908.GA13672@splinter>
On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:
> On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> > From: Ido Schimmel <idosch@idosch.org>
> > Date: Sun, 7 Jul 2019 10:58:17 +0300
> >
> > > Users have several ways to debug the kernel and understand why a packet
> > > was dropped. For example, using "drop monitor" and "perf". Both
> > > utilities trace kfree_skb(), which is the function called when a packet
> > > is freed as part of a failure. The information provided by these tools
> > > is invaluable when trying to understand the cause of a packet loss.
> > >
> > > In recent years, large portions of the kernel data path were offloaded
> > > to capable devices. Today, it is possible to perform L2 and L3
> > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > Different TC classifiers and actions are also offloaded to capable
> > > devices, at both ingress and egress.
> > >
> > > However, when the data path is offloaded it is not possible to achieve
> > > the same level of introspection as tools such as "perf" and "drop monitor"
> > > become irrelevant.
> > >
> > > This patchset aims to solve this by allowing users to monitor packets
> > > that the underlying device decided to drop along with relevant metadata
> > > such as the drop reason and ingress port.
> >
> > We are now going to have 5 or so ways to capture packets passing through
> > the system, this is nonsense.
> >
> > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > devlink thing.
> >
> > This is insanity, too many ways to do the same thing and therefore the
> > worst possible user experience.
> >
> > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > XDP perf events, and these taps there too.
> >
> > I mean really, think about it from the average user's perspective. To
> > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > listen on devlink but configure a special tap thing beforehand and then
> > if someone is using XDP I gotta setup another perf event buffer capture
> > thing too.
>
> Let me try to explain again because I probably wasn't clear enough. The
> devlink-trap mechanism is not doing the same thing as other solutions.
>
> The packets we are capturing in this patchset are packets that the
> kernel (the CPU) never saw up until now - they were silently dropped by
> the underlying device performing the packet forwarding instead of the
> CPU.
When you say silently dropped do you mean that mlxsw as of today
doesn't have any counters exposed for those events?
If we wanted to consolidate this into something existing we can either
(a) add similar traps in the kernel data path;
(b) make these traps an extension of statistics.
My knee-jerk reaction to seeing the patches was that it adds a new
place where device statistics are reported. Users who want to know why
things are dropped will not get detailed breakdown from ethtool -S which
for better or worse is the one stop shop for device stats today.
Having thought about it some more, however, I think that having a
forwarding "exception" object and hanging statistics off of it is a
better design, even if we need to deal with some duplication to get
there.
IOW having a way to "trap all packets which would increment a
statistic" (option (b) above) is probably a bad design.
As for (a) I wonder how many of those events have a corresponding event
in the kernel stack? If we could add corresponding trace points and
just feed those from the device driver, that'd obviously be a holy
grail. Not to mention that requiring trace points to be added to the
core would make Alexei happy:
http://vger.kernel.org/netconf2019_files/netconf2019_slides_ast.pdf#page=3
;)
That's my $.02, not very insightful.
> For each such packet we get valuable metadata from the underlying device
> such as the drop reason and the ingress port. With time, even more
> reasons and metadata could be provided (e.g., egress port, traffic
> class). Netlink provides a structured and extensible way to report the
> packet along with the metadata to interested users. The tc-sample action
> uses a similar concept.
>
> I would like to emphasize that these dropped packets are not injected to
> the kernel's receive path and therefore not subject to kfree_skb() and
> related infrastructure. There is no need to waste CPU cycles on packets
> we already know were dropped (and why). Further, hardware tail/early
> drops will not be dropped by the kernel, given its qdiscs are probably
> empty.
>
> Regarding the use of devlink, current ASICs can forward packets at
> 6.4Tb/s. We do not want to overwhelm the CPU with dropped packets and
> therefore we give users the ability to control - via devlink - the
> trapping of certain packets to the CPU and their reporting to user
> space. In the future, devlink-trap can be extended to support the
> configuration of the hardware policers of each trap.
* Re: [PATCH net-next V2] MAINTAINERS: Add page_pool maintainer entry
From: David Miller @ 2019-07-08 22:51 UTC (permalink / raw)
To: brouer
Cc: ilias.apalodimas, netdev, daniel, jakub.kicinski, john.fastabend,
ast
In-Reply-To: <156233140902.25371.7033961410347587264.stgit@carbon>
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 05 Jul 2019 14:57:55 +0200
> In this release cycle the number of NIC drivers using page_pool
> will likely reach 4 drivers. It is about time to add a maintainer
> entry. Add myself and Ilias.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> V2: Ilias also volunteered to co-maintain over IRC
Applied.
* Re: [PATCH net-next 0/2] net: mvpp2: Add classification based on the ETHER flow
From: David Miller @ 2019-07-08 22:50 UTC (permalink / raw)
To: maxime.chevallier
Cc: netdev, linux-kernel, linux-arm-kernel, antoine.tenart,
thomas.petazzoni, gregory.clement, miquel.raynal, nadavh, stefanc,
mw
In-Reply-To: <20190705120913.25013-1-maxime.chevallier@bootlin.com>
From: Maxime Chevallier <maxime.chevallier@bootlin.com>
Date: Fri, 5 Jul 2019 14:09:11 +0200
> Hello everyone,
>
> This series adds support for classification of the ETHER flow in the
> mvpp2 driver.
>
> The first patch allows detecting when a user specifies a flow_type that
> isn't supported by the driver, while the second adds support for this
> flow_type by adding the mapping between the ETHER_FLOW enum value and
> the relevant classifier flow entries.
Series applied, thanks.
* Re: [PATCH] net: sysctl: cleanup net_sysctl_init error exit paths
From: George G. Davis @ 2019-07-08 22:47 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20190517144345.GA16926@mam-gdavis-lt>
Hello David,
On Fri, May 17, 2019 at 10:43:45AM -0400, George G. Davis wrote:
> Hello David,
>
> On Thu, May 16, 2019 at 02:27:44PM -0700, David Miller wrote:
> > From: "George G. Davis" <george_davis@mentor.com>
> > Date: Thu, 16 May 2019 11:23:08 -0400
> >
> > > Unwind net_sysctl_init error exit goto spaghetti code
> > >
> > > Suggested-by: Joshua Frkuska <joshua_frkuska@mentor.com>
> > > Signed-off-by: George G. Davis <george_davis@mentor.com>
> >
> > Cleanups are not appropriate until the net-next tree opens back up.
> >
> > So please resubmit at that time.
>
> I fear that I may be distracted by other shiny objects by then but
> I'll make a reminder and try to resubmit during the next merge window.
Since the "Linux 5.2" kernel has been released [1], I'm guessing that the
net-next merge window is open now? If so, the patch remains unchanged
since my initial post. Please consider applying it, or let me know when to
resubmit once the net-next merge window is open again.
TIA!
>
> Thanks!
>
> >
> > Thank you.
>
> --
> Regards,
> George
--
Regards,
George
[1] https://lwn.net/Articles/792995/
* Re: [PATCH] selftests: txring_overwrite: fix incorrect test of mmap() return value
From: David Miller @ 2019-07-08 22:40 UTC (permalink / raw)
To: debrabander; +Cc: netdev
In-Reply-To: <1562326994-4569-1-git-send-email-debrabander@gmail.com>
From: Frank de Brabander <debrabander@gmail.com>
Date: Fri, 5 Jul 2019 13:43:14 +0200
> If mmap() fails it returns MAP_FAILED, which is defined as ((void *) -1).
> The current if-statement incorrectly tests if *ring is NULL.
>
> Signed-off-by: Frank de Brabander <debrabander@gmail.com>
Applied with fixes tag added and queued up for -stable, thanks.
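
As an illustration of the pattern the patch description refers to, here is a minimal user-space sketch of checking the mmap() return value against MAP_FAILED. The helper name map_anon() is hypothetical and is not taken from the selftest.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of anonymous memory. Returns 0 and stores the mapping
 * in *out on success, or a negative errno value on failure. */
static int map_anon(size_t len, void **out)
{
	void *p;

	assert(out != NULL);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* mmap() signals failure with MAP_FAILED, i.e. ((void *) -1),
	 * so a NULL check would not catch it. */
	if (p == MAP_FAILED)
		return -errno;

	*out = p;
	return 0;
}
```

A caller would test the return code of map_anon() and only touch *out when it is 0; a zero-length request is one simple way to exercise the failure path on Linux.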
* Re: [PATCH 1/1] tools/dtrace: initial implementation of DTrace
From: Kris Van Hees @ 2019-07-08 22:38 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
mhiramat, ast, daniel, Peter Zijlstra, Chris Mason
In-Reply-To: <20190708171537.GA11960@kernel.org>
On Mon, Jul 08, 2019 at 02:15:37PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Jul 03, 2019 at 08:14:30PM -0700, Kris Van Hees wrote:
> > This initial implementation of a tiny subset of DTrace functionality
> > provides the following options:
> >
> > dtrace [-lvV] [-b bufsz] -s script
> > -b set trace buffer size
> > -l list probes (only works with '-s script' for now)
> > -s enable or list probes for the specified BPF program
> > -V report DTrace API version
> >
> > The patch comprises quite a bit of code due to DTrace requiring a few
> > crucial components, even in its most basic form.
> >
> > The code is structured around the command line interface implemented in
> > dtrace.c. It provides option parsing and drives the three modes of
> > operation that are currently implemented:
> >
> > 1. Report DTrace API version information.
> > Report the version information and terminate.
> >
> > 2. List probes in BPF programs.
> > Initialize the list of probes that DTrace recognizes, load BPF
> > programs, parse all BPF ELF section names, resolve them into
> > known probes, and emit the probe names. Then terminate.
> >
> > 3. Load BPF programs and collect tracing data.
> > Initialize the list of probes that DTrace recognizes, load BPF
> > programs and attach them to their corresponding probes, set up
> > perf event output buffers, and start processing tracing data.
> >
> > This implementation makes extensive use of BPF (handled by dt_bpf.c) and
> > the perf event output ring buffer (handled by dt_buffer.c). DTrace-style
> > probe handling (dt_probe.c) offers an interface to probes that hides the
> > implementation details of the individual probe types by provider (dt_fbt.c
> > and dt_syscall.c). Probe lookup by name uses a hashtable implementation
> > (dt_hash.c). The dt_utils.c code populates a list of online CPU ids, so
> > we know what CPUs we can obtain tracing data from.
> >
> > Building the tool is trivial because its only dependency (libbpf) is in
> > the kernel tree under tools/lib/bpf. A simple 'make' in the tools/dtrace
> > directory suffices.
> >
> > The 'dtrace' executable needs to run as root because BPF programs cannot
> > be loaded by non-root users.
> >
> > Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
> > Reviewed-by: David Mc Lean <david.mclean@oracle.com>
> > Reviewed-by: Eugene Loh <eugene.loh@oracle.com>
> > ---
> > MAINTAINERS | 6 +
> > tools/dtrace/Makefile | 88 ++++++++++
> > tools/dtrace/bpf_sample.c | 145 ++++++++++++++++
> > tools/dtrace/dt_bpf.c | 188 +++++++++++++++++++++
> > tools/dtrace/dt_buffer.c | 331 +++++++++++++++++++++++++++++++++++++
> > tools/dtrace/dt_fbt.c | 201 ++++++++++++++++++++++
> > tools/dtrace/dt_hash.c | 211 +++++++++++++++++++++++
> > tools/dtrace/dt_probe.c | 230 ++++++++++++++++++++++++++
> > tools/dtrace/dt_syscall.c | 179 ++++++++++++++++++++
> > tools/dtrace/dt_utils.c | 132 +++++++++++++++
> > tools/dtrace/dtrace.c | 249 ++++++++++++++++++++++++++++
> > tools/dtrace/dtrace.h | 13 ++
> > tools/dtrace/dtrace_impl.h | 101 +++++++++++
> > 13 files changed, 2074 insertions(+)
> > create mode 100644 tools/dtrace/Makefile
> > create mode 100644 tools/dtrace/bpf_sample.c
> > create mode 100644 tools/dtrace/dt_bpf.c
> > create mode 100644 tools/dtrace/dt_buffer.c
> > create mode 100644 tools/dtrace/dt_fbt.c
> > create mode 100644 tools/dtrace/dt_hash.c
> > create mode 100644 tools/dtrace/dt_probe.c
> > create mode 100644 tools/dtrace/dt_syscall.c
> > create mode 100644 tools/dtrace/dt_utils.c
> > create mode 100644 tools/dtrace/dtrace.c
> > create mode 100644 tools/dtrace/dtrace.h
> > create mode 100644 tools/dtrace/dtrace_impl.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 606d1f80bc49..668468834865 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -5474,6 +5474,12 @@ W: https://linuxtv.org
> > S: Odd Fixes
> > F: drivers/media/pci/dt3155/
> >
> > +DTRACE
> > +M: Kris Van Hees <kris.van.hees@oracle.com>
> > +L: dtrace-devel@oss.oracle.com
> > +S: Maintained
> > +F: tools/dtrace/
> > +
> > DVB_USB_AF9015 MEDIA DRIVER
> > M: Antti Palosaari <crope@iki.fi>
> > L: linux-media@vger.kernel.org
> > diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
> > new file mode 100644
> > index 000000000000..99fd0f9dd1d6
> > --- /dev/null
> > +++ b/tools/dtrace/Makefile
> > @@ -0,0 +1,88 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +#
> > +# This Makefile is based on samples/bpf.
> > +#
> > +# Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> > +
> > +DT_VERSION := 2.0.0
> > +DT_GIT_VERSION := $(shell git rev-parse HEAD 2>/dev/null || \
> > + echo Unknown)
> > +
> > +DTRACE_PATH ?= $(abspath $(srctree)/$(src))
> > +TOOLS_PATH := $(DTRACE_PATH)/..
> > +SAMPLES_PATH := $(DTRACE_PATH)/../../samples
> > +
> > +hostprogs-y := dtrace
> > +
> > +LIBBPF := $(TOOLS_PATH)/lib/bpf/libbpf.a
> > +OBJS := dt_bpf.o dt_buffer.o dt_utils.o dt_probe.o \
> > + dt_hash.o \
> > + dt_fbt.o dt_syscall.o
> > +
> > +dtrace-objs := $(OBJS) dtrace.o
> > +
> > +always := $(hostprogs-y)
> > +always += bpf_sample.o
> > +
> > +KBUILD_HOSTCFLAGS += -DDT_VERSION=\"$(DT_VERSION)\"
> > +KBUILD_HOSTCFLAGS += -DDT_GIT_VERSION=\"$(DT_GIT_VERSION)\"
> > +KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib
> > +KBUILD_HOSTCFLAGS += -I$(srctree)/tools/perf
>
> Interesting, what are you using from tools/perf/? So that we can move to
> tools/{include,lib,arch}.
This is my mistake... an earlier version of the code (as I was developing it)
was using stuff from tools/perf, but that is no longer the case. Removing it.
> > +KBUILD_HOSTCFLAGS += -I$(srctree)/tools/include/uapi
> > +KBUILD_HOSTCFLAGS += -I$(srctree)/tools/include/
> > +KBUILD_HOSTCFLAGS += -I$(srctree)/usr/include
> > +
> > +KBUILD_HOSTLDLIBS := $(LIBBPF) -lelf
> > +
> > +LLC ?= llc
> > +CLANG ?= clang
> > +LLVM_OBJCOPY ?= llvm-objcopy
> > +
> > +ifdef CROSS_COMPILE
> > +HOSTCC = $(CROSS_COMPILE)gcc
> > +CLANG_ARCH_ARGS = -target $(ARCH)
> > +endif
> > +
> > +all:
> > + $(MAKE) -C ../../ $(CURDIR)/ DTRACE_PATH=$(CURDIR)
> > +
> > +clean:
> > + $(MAKE) -C ../../ M=$(CURDIR) clean
> > + @rm -f *~
> > +
> > +$(LIBBPF): FORCE
> > + $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(DTRACE_PATH)/../../ O=
> > +
> > +FORCE:
> > +
> > +.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
> > +
> > +verify_cmds: $(CLANG) $(LLC)
> > + @for TOOL in $^ ; do \
> > + if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
> > + echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
> > + exit 1; \
> > + else true; fi; \
> > + done
> > +
> > +verify_target_bpf: verify_cmds
> > + @if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
> > + echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
> > + echo " NOTICE: LLVM version >= 3.7.1 required" ;\
> > + exit 2; \
> > + else true; fi
> > +
> > +$(DTRACE_PATH)/*.c: verify_target_bpf $(LIBBPF)
> > +$(src)/*.c: verify_target_bpf $(LIBBPF)
> > +
> > +$(obj)/%.o: $(src)/%.c
> > + @echo " CLANG-bpf " $@
> > + $(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
> > + -I$(srctree)/tools/testing/selftests/bpf/ \
> > + -D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
> > + -D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
> > + -Wno-gnu-variable-sized-type-not-at-end \
> > + -Wno-address-of-packed-member -Wno-tautological-compare \
> > + -Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
> > + -I$(srctree)/samples/bpf/ -include asm_goto_workaround.h \
> > + -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf $(LLC_FLAGS) -filetype=obj -o $@
>
>
> We have the above in tools/perf/util/llvm-utils.c, perhaps we need to
> move it to some place in lib/ to share?
Yes, if there is a way to put things like this in a central location so we can
maintain a single copy that would be a good idea indeed.
> > diff --git a/tools/dtrace/bpf_sample.c b/tools/dtrace/bpf_sample.c
> > new file mode 100644
> > index 000000000000..49f350390b5f
> > --- /dev/null
> > +++ b/tools/dtrace/bpf_sample.c
> > @@ -0,0 +1,145 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * This sample DTrace BPF tracing program demonstrates how actions can be
> > + * associated with different probe types.
> > + *
> > + * The kprobe/ksys_write probe is a Function Boundary Tracing (FBT) entry probe
> > + * on the ksys_write(fd, buf, count) function in the kernel. Arguments to the
> > + * function can be retrieved from the CPU registers (struct pt_regs).
> > + *
> > + * The tracepoint/syscalls/sys_enter_write probe is a System Call entry probe
> > + * for the write(d, buf, count) system call. Arguments to the system call can
> > + * be retrieved from the tracepoint data passed to the BPF program as context
> > + * (struct syscall_data) when the probe fires.
> > + *
> > + * The BPF program associated with each probe prepares a DTrace BPF context
> > + * (struct dt_bpf_context) that stores the probe ID and up to 10 arguments.
> > + * Only 3 arguments are used in this sample. Then the programs call a shared
> > + * BPF function (bpf_action) that implements the actual action to be taken when
> > + * a probe fires. It prepares a data record to be stored in the tracing buffer
> > + * and submits it to the buffer. The data in the data record is obtained from
> > + * the DTrace BPF context.
> > + *
> > + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> > + */
> > +#include <uapi/linux/bpf.h>
> > +#include <linux/ptrace.h>
> > +#include <linux/version.h>
> > +#include <uapi/linux/unistd.h>
> > +#include "bpf_helpers.h"
> > +
> > +#include "dtrace.h"
> > +
> > +struct syscall_data {
> > + struct pt_regs *regs;
> > + long syscall_nr;
> > + long arg[6];
> > +};
> > +
> > +struct bpf_map_def SEC("maps") buffers = {
> > + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> > + .key_size = sizeof(u32),
> > + .value_size = sizeof(u32),
> > + .max_entries = NR_CPUS,
> > +};
> > +
> > +#if defined(__amd64)
> > +# define GET_REGS_ARG0(regs) ((regs)->di)
> > +# define GET_REGS_ARG1(regs) ((regs)->si)
> > +# define GET_REGS_ARG2(regs) ((regs)->dx)
> > +# define GET_REGS_ARG3(regs) ((regs)->cx)
> > +# define GET_REGS_ARG4(regs) ((regs)->r8)
> > +# define GET_REGS_ARG5(regs) ((regs)->r9)
> > +#else
> > +# warning Argument retrieval from pt_regs is not supported yet on this arch.
> > +# define GET_REGS_ARG0(regs) 0
> > +# define GET_REGS_ARG1(regs) 0
> > +# define GET_REGS_ARG2(regs) 0
> > +# define GET_REGS_ARG3(regs) 0
> > +# define GET_REGS_ARG4(regs) 0
> > +# define GET_REGS_ARG5(regs) 0
> > +#endif
>
> We have this in tools/testing/selftests/bpf/bpf_helpers.h, probably need
> to move to some other place in tools/include/ where this can be shared.
I should be using the ones in bpf_helpers (since I already include that
anyway), and yes, if we can move that to a general use location under
tools/include that would be a good idea.
Also, I just updated my code to use this and I added a PT_REGS_PARM6(x) for
all the listed archs because I need to be able to get to up to 6 parameters
rather than the supported 5. As far as I can see, all listed archs support
argument passing of at least 6 arguments so this should be no problem.
Any objections?
* Re: [PATCH bpf-next v3] virtio_net: add XDP meta data support
From: Daniel Borkmann @ 2019-07-08 22:38 UTC (permalink / raw)
To: Yuya Kusakabe, Jason Wang
Cc: ast, davem, hawk, jakub.kicinski, john.fastabend, kafai, mst,
netdev, songliubraving, yhs
In-Reply-To: <52e3fc0d-bdd7-83ee-58e6-488e2b91cc83@gmail.com>
On 07/02/2019 04:11 PM, Yuya Kusakabe wrote:
> On 7/2/19 5:33 PM, Jason Wang wrote:
>> On 2019/7/2 4:16 PM, Yuya Kusakabe wrote:
>>> This adds XDP meta data support to both receive_small() and
>>> receive_mergeable().
>>>
>>> Fixes: de8f3a83b0a0 ("bpf: add meta pointer for direct access")
>>> Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
>>> ---
>>> v3:
>>> - fix preserve the vnet header in receive_small().
>>> v2:
>>> - keep copy untouched in page_to_skb().
>>> - preserve the vnet header in receive_small().
>>> - fix indentation.
>>> ---
>>> drivers/net/virtio_net.c | 45 +++++++++++++++++++++++++++-------------
>>> 1 file changed, 31 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 4f3de0ac8b0b..03a1ae6fe267 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -371,7 +371,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>> struct receive_queue *rq,
>>> struct page *page, unsigned int offset,
>>> unsigned int len, unsigned int truesize,
>>> - bool hdr_valid)
>>> + bool hdr_valid, unsigned int metasize)
>>> {
>>> struct sk_buff *skb;
>>> struct virtio_net_hdr_mrg_rxbuf *hdr;
>>> @@ -393,7 +393,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>> else
>>> hdr_padded_len = sizeof(struct padded_vnet_hdr);
>>> - if (hdr_valid)
>>> + if (hdr_valid && !metasize)
>>> memcpy(hdr, p, hdr_len);
>>> len -= hdr_len;
>>> @@ -405,6 +405,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>> copy = skb_tailroom(skb);
>>> skb_put_data(skb, p, copy);
>>> + if (metasize) {
>>> + __skb_pull(skb, metasize);
>>> + skb_metadata_set(skb, metasize);
>>> + }
>>> +
>>> len -= copy;
>>> offset += copy;
>>> @@ -644,6 +649,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>> unsigned int delta = 0;
>>> struct page *xdp_page;
>>> int err;
>>> + unsigned int metasize = 0;
>>> len -= vi->hdr_len;
>>> stats->bytes += len;
>>> @@ -683,10 +689,13 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>> xdp.data_hard_start = buf + VIRTNET_RX_PAD + vi->hdr_len;
>>> xdp.data = xdp.data_hard_start + xdp_headroom;
>>> - xdp_set_data_meta_invalid(&xdp);
>>> xdp.data_end = xdp.data + len;
>>> + xdp.data_meta = xdp.data;
>>> xdp.rxq = &rq->xdp_rxq;
>>> orig_data = xdp.data;
>>> + /* Copy the vnet header to the front of data_hard_start to avoid
>>> + * overwriting by XDP meta data */
>>> + memcpy(xdp.data_hard_start - vi->hdr_len, xdp.data - vi->hdr_len, vi->hdr_len);
I'm not fully sure if I'm following this one correctly, probably just missing
something. Isn't the vnet header based on how we set up xdp.data_hard_start
earlier already in front of it? Wouldn't we copy invalid data from xdp.data -
vi->hdr_len into the vnet header at that point (given there can be up to 256
bytes of headroom between the two)? If it's relative to xdp.data and headroom
is >0, then BPF prog could otherwise mangle this; something doesn't add up to
me here. Could you clarify? Thx
>> What happens if we have a large metadata that occupies all headroom here?
>>
>> Thanks
>
> Do you mean a large "XDP" metadata? If so, I think we cannot use metadata that occupies all of the headroom. The size of the metadata is limited by bpf_xdp_adjust_meta() as below.
> bpf_xdp_adjust_meta() in net/core/filter.c:
> if (unlikely((metalen & (sizeof(__u32) - 1)) ||
> (metalen > 32)))
> return -EACCES;
>
> Thanks.
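
The length rule quoted above can be restated as a small stand-alone check. This is a user-space sketch that covers only the quoted condition: check_metalen() and XDP_META_MAX_LEN are hypothetical names, and the real helper performs further checks that are not shown in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the literal 32 in the quoted condition. */
#define XDP_META_MAX_LEN 32

static_assert(XDP_META_MAX_LEN % sizeof(uint32_t) == 0,
	      "maximum must itself be a multiple of 4");

/* Mirrors the quoted test: the metadata length must be a multiple of
 * sizeof(__u32) and must not exceed 32 bytes. */
static int check_metalen(size_t metalen)
{
	if ((metalen & (sizeof(uint32_t) - 1)) ||
	    metalen > XDP_META_MAX_LEN)
		return -EACCES;
	return 0;
}
```

Under this rule, lengths such as 0, 4 and 32 pass, while 6 (not 4-byte aligned) and 36 (too large) are rejected, which is why the metadata cannot grow to fill the whole headroom.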
>
>>
>>
>>> act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>> stats->xdp_packets++;
>>> @@ -695,9 +704,11 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>> /* Recalculate length in case bpf program changed it */
>>> delta = orig_data - xdp.data;
>>> len = xdp.data_end - xdp.data;
>>> + metasize = xdp.data - xdp.data_meta;
>>> break;
>>> case XDP_TX:
>>> stats->xdp_tx++;
>>> + xdp.data_meta = xdp.data;
>>> xdpf = convert_to_xdp_frame(&xdp);
>>> if (unlikely(!xdpf))
>>> goto err_xdp;
>>> @@ -736,10 +747,12 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>> skb_reserve(skb, headroom - delta);
>>> skb_put(skb, len);
>>> if (!delta) {
>>> - buf += header_offset;
>>> - memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len);
>>> + memcpy(skb_vnet_hdr(skb), buf + VIRTNET_RX_PAD, vi->hdr_len);
>>> } /* keep zeroed vnet hdr since packet was changed by bpf */
>>> + if (metasize)
>>> + skb_metadata_set(skb, metasize);
>>> +
>>> err:
>>> return skb;
>>> @@ -760,8 +773,8 @@ static struct sk_buff *receive_big(struct net_device *dev,
>>> struct virtnet_rq_stats *stats)
>>> {
>>> struct page *page = buf;
>>> - struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len,
>>> - PAGE_SIZE, true);
>>> + struct sk_buff *skb =
>>> + page_to_skb(vi, rq, page, 0, len, PAGE_SIZE, true, 0);
>>> stats->bytes += len - vi->hdr_len;
>>> if (unlikely(!skb))
>>> @@ -793,6 +806,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>> unsigned int truesize;
>>> unsigned int headroom = mergeable_ctx_to_headroom(ctx);
>>> int err;
>>> + unsigned int metasize = 0;
>>> head_skb = NULL;
>>> stats->bytes += len - vi->hdr_len;
>>> @@ -839,8 +853,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>> data = page_address(xdp_page) + offset;
>>> xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
>>> xdp.data = data + vi->hdr_len;
>>> - xdp_set_data_meta_invalid(&xdp);
>>> xdp.data_end = xdp.data + (len - vi->hdr_len);
>>> + xdp.data_meta = xdp.data;
>>> xdp.rxq = &rq->xdp_rxq;
>>> act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>> @@ -852,8 +866,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>> * adjustments. Note other cases do not build an
>>> * skb and avoid using offset
>>> */
>>> - offset = xdp.data -
>>> - page_address(xdp_page) - vi->hdr_len;
>>> + metasize = xdp.data - xdp.data_meta;
>>> + offset = xdp.data - page_address(xdp_page) -
>>> + vi->hdr_len - metasize;
>>> /* recalculate len if xdp.data or xdp.data_end were
>>> * adjusted
>>> @@ -863,14 +878,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>> if (unlikely(xdp_page != page)) {
>>> rcu_read_unlock();
>>> put_page(page);
>>> - head_skb = page_to_skb(vi, rq, xdp_page,
>>> - offset, len,
>>> - PAGE_SIZE, false);
>>> + head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>> + len, PAGE_SIZE, false,
>>> + metasize);
>>> return head_skb;
>>> }
>>> break;
>>> case XDP_TX:
>>> stats->xdp_tx++;
>>> + xdp.data_meta = xdp.data;
>>> xdpf = convert_to_xdp_frame(&xdp);
>>> if (unlikely(!xdpf))
>>> goto err_xdp;
>>> @@ -921,7 +937,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>> goto err_skb;
>>> }
>>> - head_skb = page_to_skb(vi, rq, page, offset, len, truesize, !xdp_prog);
>>> + head_skb = page_to_skb(vi, rq, page, offset, len, truesize, !xdp_prog,
>>> + metasize);
>>> curr_skb = head_skb;
>>> if (unlikely(!curr_skb))