Netdev List
* Re: [PATCH v2 bpf] selftests/bpf: fix bpf_target_sparc check
From: Andrii Nakryiko @ 2019-07-10 15:58 UTC
  To: Ilya Leoshkevich; +Cc: bpf, Networking
In-Reply-To: <20190710115654.44841-1-iii@linux.ibm.com>

On Wed, Jul 10, 2019 at 4:57 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
> bpf_helpers.h fails to compile on sparc: the code should be checking
> for defined(bpf_target_sparc), but checks simply for bpf_target_sparc.
>
> Also change #ifdef bpf_target_powerpc to #if defined() for consistency.
>
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> ---

Thanks!

Acked-by: Andrii Nakryiko <andriin@fb.com>

>
> v1->v2: bpf_target_powerpc change
>
>  tools/testing/selftests/bpf/bpf_helpers.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 5f6f9e7aba2a..0214797518ce 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -440,10 +440,10 @@ static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
>
>  #endif
>
> -#ifdef bpf_target_powerpc
> +#if defined(bpf_target_powerpc)

Oh, yeah, that mix of #ifdef and #if definitely threw me off. I prefer
consistency, so thanks for this update!

>  #define BPF_KPROBE_READ_RET_IP(ip, ctx)                ({ (ip) = (ctx)->link; })
>  #define BPF_KRETPROBE_READ_RET_IP              BPF_KPROBE_READ_RET_IP
> -#elif bpf_target_sparc
> +#elif defined(bpf_target_sparc)
>  #define BPF_KPROBE_READ_RET_IP(ip, ctx)                ({ (ip) = PT_REGS_RET(ctx); })
>  #define BPF_KRETPROBE_READ_RET_IP              BPF_KPROBE_READ_RET_IP
>  #else
> --
> 2.21.0
>
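
For readers wondering why the unpatched form fails only on sparc: if
bpf_target_sparc is defined with no value (an assumption here, since the
defining lines are not part of this hunk), "#elif bpf_target_sparc"
expands to a bare "#elif" with no expression, which the preprocessor
rejects. A minimal sketch of the difference, using made-up names
(target_a, target_b, VARIANT and ret_ip_variant are all hypothetical):

```c
/* Hypothetical stand-in for an arch macro that is defined with no value. */
#define target_b

/*
 * Writing "#elif target_b" below would expand to "#elif" followed by
 * nothing, which is a preprocessing error.  defined() only asks whether
 * the name exists, so it also works for macros with an empty body.
 */
#if defined(target_a)
# define VARIANT 1
#elif defined(target_b)
# define VARIANT 2
#else
# define VARIANT 0
#endif

/* Report which branch the preprocessor selected. */
int ret_ip_variant(void)
{
	return VARIANT;
}
```

Only target_b is defined in this sketch, so ret_ip_variant() returns 2.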


* Re: [PATCH v2 6/7] dt-bindings: net: realtek: Add property to configure LED mode
From: Rob Herring @ 2019-07-10 15:55 UTC
  To: Matthias Kaehlcke
  Cc: Florian Fainelli, David S . Miller, Mark Rutland, Andrew Lunn,
	Heiner Kallweit, netdev, devicetree, linux-kernel@vger.kernel.org,
	Douglas Anderson
In-Reply-To: <20190703232331.GL250418@google.com>

On Wed, Jul 3, 2019 at 5:23 PM Matthias Kaehlcke <mka@chromium.org> wrote:
>
> Hi Florian,
>
> On Wed, Jul 03, 2019 at 02:37:47PM -0700, Florian Fainelli wrote:
> > On 7/3/19 12:37 PM, Matthias Kaehlcke wrote:
> > > The LED behavior of some Realtek PHYs is configurable. Add the
> > > property 'realtek,led-modes' to specify the configuration of the
> > > LEDs.
> > >
> > > Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> > > ---
> > > Changes in v2:
> > > - patch added to the series
> > > ---
> > >  .../devicetree/bindings/net/realtek.txt         |  9 +++++++++
> > >  include/dt-bindings/net/realtek.h               | 17 +++++++++++++++++
> > >  2 files changed, 26 insertions(+)
> > >  create mode 100644 include/dt-bindings/net/realtek.h
> > >
> > > diff --git a/Documentation/devicetree/bindings/net/realtek.txt b/Documentation/devicetree/bindings/net/realtek.txt
> > > index 71d386c78269..40b0d6f9ee21 100644
> > > --- a/Documentation/devicetree/bindings/net/realtek.txt
> > > +++ b/Documentation/devicetree/bindings/net/realtek.txt
> > > @@ -9,6 +9,12 @@ Optional properties:
> > >
> > >     SSC is only available on some Realtek PHYs (e.g. RTL8211E).
> > >
> > > +- realtek,led-modes: LED mode configuration.
> > > +
> > > +   A 0..3 element vector, with each element configuring the operating
> > > +   mode of an LED. Omitted LEDs are turned off. Allowed values are
> > > +   defined in "include/dt-bindings/net/realtek.h".
> >
> > This should probably be made more general and we should define LED modes
> > that makes sense regardless of the PHY device, introduce a set of
> > generic functions for validating and then add new function pointer for
> > setting the LED configuration to the PHY driver. This would allow to be
> > more future proof where each PHY driver could expose standard LEDs class
> > devices to user-space, and it would also allow facilities like: ethtool
> > -p to plug into that.
> >
> > Right now, each driver invents its own way of configuring LEDs, that
> > does not scale, and there is not really a good reason for that other
> > than reviewing drivers in isolation and therefore making it harder to
> > extract the commonality. Yes, I realize that since you are the latest
> > person submitting something in that area, you are being selected :)

I agree.

> I see the merit of your proposal to come up with a generic mechanism
> to configure Ethernet LEDs, however I can't justify spending much of
> my work time on this. If it is deemed useful I'm happy to send another
> version of the current patchset that addresses the reviewer's comments,
> but if the implementation of a generic LED configuration interface is
> a requirement I will have to abandon at least the LED configuration
> part of this series.

Can you at least define a common binding for this? Maybe that's just
removing 'realtek'. While the kernel side can evolve to a common
infrastructure, the DT bindings can't.

Rob


* Re: [PATCH 00/12] treewide: Fix GENMASK misuses
From: Joe Perches @ 2019-07-10 15:45 UTC
  To: Russell King - ARM Linux admin, Johannes Berg
  Cc: Andrew Morton, Patrick Venture, Nancy Yuen, Benjamin Fair,
	Andrew Jeffery, openbmc, linux-kernel, linux-aspeed,
	linux-arm-kernel, linux-amlogic, netdev, linux-mediatek,
	linux-stm32, linux-wireless, linux-media, linux-iio, devel,
	alsa-devel, linux-mmc, dri-devel
In-Reply-To: <20190710094337.wf2lftxzfjq2etro@shell.armlinux.org.uk>

On Wed, 2019-07-10 at 10:43 +0100, Russell King - ARM Linux admin wrote:
> On Wed, Jul 10, 2019 at 11:17:31AM +0200, Johannes Berg wrote:
> > On Tue, 2019-07-09 at 22:04 -0700, Joe Perches wrote:
> > > These GENMASK uses are inverted argument order and the
> > > actual masks produced are incorrect.  Fix them.
> > > 
> > > Add checkpatch tests to help avoid more misuses too.
> > > 
> > > Joe Perches (12):
> > >   checkpatch: Add GENMASK tests
> > 
> > IMHO this doesn't make a lot of sense as a checkpatch test - just throw
> > in a BUILD_BUG_ON()?

I tried that.

It can't be done as it's used in declarations
and included in asm files and it uses the UL()
macro.

I also tried just making it do the right thing
whatever the argument order.

Oh well.

> My personal take on this is that GENMASK() is really not useful, it's
> just pure obfuscation and leads to exactly these kinds of mistakes.
> 
> Yes, I fully understand the argument that you can just specify the
> start and end bits, and it _in theory_ makes the code more readable.
> 
> However, the problem is when writing code.  GENMASK(a, b).  Is a the
> starting bit or ending bit?  Is b the number of bits?  It's confusing
> and causes mistakes resulting in incorrect code.  A BUILD_BUG_ON()
> can catch some of the cases, but not all of them.

It's a horrid little macro and I agree with Russell.

I also think if it existed at all it should have been
GENMASK(low, high) not GENMASK(high, low).
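
To make the failure mode concrete, here is a minimal standalone sketch.
The macro body is believed to match the kernel's definition at the time,
but it is written out from memory, so treat the exact form as an
assumption; the two wrapper functions are made up for illustration.

```c
#include <limits.h>

/* Local copies so the sketch builds outside the kernel tree. */
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define GENMASK(h, l) \
	(((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

/* Intended order: high bit first, low bit second. */
unsigned long mask_bits_7_to_4(void)
{
	return GENMASK(7, 4);
}

/*
 * Inverted order: the "bits at or above l" half and the "bits at or
 * below h" half no longer overlap, so the AND leaves nothing.
 */
unsigned long mask_inverted(void)
{
	return GENMASK(4, 7);
}
```

With this definition mask_bits_7_to_4() yields 0xf0 while mask_inverted()
yields 0, so a swapped call compiles cleanly and silently masks
everything out.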

I



* [PATCH V2 0/1] tools/dtrace: initial implementation of DTrace
From: Kris Van Hees @ 2019-07-10 15:42 UTC
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel, Peter Zijlstra, Chris Mason
In-Reply-To: <201907101537.x6AFboMR015946@aserv0122.oracle.com>

This initial implementation of a tiny subset of DTrace functionality
provides the following options:

	dtrace [-lvV] [-b bufsz] -s script
	    -b  set trace buffer size
	    -l  list probes (only works with '-s script' for now)
	    -s  enable or list probes for the specified BPF program
	    -V  report DTrace API version

The patch comprises quite a bit of code due to DTrace requiring a few
crucial components, even in its most basic form.

The code is structured around the command line interface implemented in
dtrace.c.  It provides option parsing and drives the three modes of
operation that are currently implemented:

1. Report DTrace API version information.
	Report the version information and terminate.

2. List probes in BPF programs.
	Initialize the list of probes that DTrace recognizes, load BPF
	programs, parse all BPF ELF section names, resolve them into
	known probes, and emit the probe names.  Then terminate.

3. Load BPF programs and collect tracing data.
	Initialize the list of probes that DTrace recognizes, load BPF
	programs and attach them to their corresponding probes, set up
	perf event output buffers, and start processing tracing data.

This implementation makes extensive use of BPF (handled by dt_bpf.c) and
the perf event output ring buffer (handled by dt_buffer.c).  DTrace-style
probe handling (dt_probe.c) offers an interface to probes that hides the
implementation details of the individual probe types by provider (dt_fbt.c
and dt_syscall.c).  Probe lookup by name uses a hashtable implementation
(dt_hash.c).  The dt_utils.c code populates a list of online CPU ids, so
we know what CPUs we can obtain tracing data from.
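
As an illustration of the section name resolution, the sketch below
maps a BPF ELF section name to a DTrace probe name following the
kprobe/kretprobe mapping documented in the dt_fbt.c header comment. It
is not the code from the patch; section_to_probe_name() is a
hypothetical helper, and other section types (such as tracepoints) are
simply rejected.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: turn a BPF section name into a DTrace probe name.
 * Returns 0 on success, or -1 if the section is not a kprobe/kretprobe
 * or the result does not fit in the output buffer.
 */
int section_to_probe_name(const char *sec, char *out, size_t len)
{
	const char	*fun, *prb;
	int		n;

	if (strncmp(sec, "kprobe/", 7) == 0) {
		fun = sec + 7;
		prb = "entry";
	} else if (strncmp(sec, "kretprobe/", 10) == 0) {
		fun = sec + 10;
		prb = "return";
	} else {
		return -1;
	}

	/* A section name without a function part cannot be resolved. */
	if (*fun == '\0')
		return -1;

	/* The section name carries no module, so vmlinux is assumed. */
	n = snprintf(out, len, "fbt:vmlinux:%s:%s", fun, prb);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Passing "kprobe/ksys_write" fills the buffer with
"fbt:vmlinux:ksys_write:entry".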

Building the tool is trivial because its only dependency (libbpf) is in
the kernel tree under tools/lib/bpf.  A simple 'make' in the tools/dtrace
directory suffices.

The 'dtrace' executable needs to run as root because BPF programs cannot
be loaded by non-root users.

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: David Mc Lean <david.mclean@oracle.com>
Reviewed-by: Eugene Loh <eugene.loh@oracle.com>
---
Changes in v2:
        - Use ring_buffer_read_head() and ring_buffer_write_tail() to
          avoid use of volatile.
        - Handle perf events that wrap around the ring buffer boundary.
        - Remove unnecessary PERF_EVENT_IOC_ENABLE.
        - Remove -I$(srctree)/tools/perf from KBUILD_HOSTCFLAGS since it
          is not actually used.
        - Use PT_REGS_PARM1(x), etc instead of my own macros.  Adding 
          PT_REGS_PARM6(x) in bpf_sample.c because we need to be able to
          support up to 6 arguments passed by registers.
---
 MAINTAINERS                |   6 +
 tools/dtrace/Makefile      |  87 ++++++++++
 tools/dtrace/bpf_sample.c  | 146 ++++++++++++++++
 tools/dtrace/dt_bpf.c      | 185 ++++++++++++++++++++
 tools/dtrace/dt_buffer.c   | 338 +++++++++++++++++++++++++++++++++++++
 tools/dtrace/dt_fbt.c      | 201 ++++++++++++++++++++++
 tools/dtrace/dt_hash.c     | 211 +++++++++++++++++++++++
 tools/dtrace/dt_probe.c    | 230 +++++++++++++++++++++++++
 tools/dtrace/dt_syscall.c  | 179 ++++++++++++++++++++
 tools/dtrace/dt_utils.c    | 132 +++++++++++++++
 tools/dtrace/dtrace.c      | 249 +++++++++++++++++++++++++++
 tools/dtrace/dtrace.h      |  13 ++
 tools/dtrace/dtrace_impl.h | 101 +++++++++++
 13 files changed, 2078 insertions(+)
 create mode 100644 tools/dtrace/Makefile
 create mode 100644 tools/dtrace/bpf_sample.c
 create mode 100644 tools/dtrace/dt_bpf.c
 create mode 100644 tools/dtrace/dt_buffer.c
 create mode 100644 tools/dtrace/dt_fbt.c
 create mode 100644 tools/dtrace/dt_hash.c
 create mode 100644 tools/dtrace/dt_probe.c
 create mode 100644 tools/dtrace/dt_syscall.c
 create mode 100644 tools/dtrace/dt_utils.c
 create mode 100644 tools/dtrace/dtrace.c
 create mode 100644 tools/dtrace/dtrace.h
 create mode 100644 tools/dtrace/dtrace_impl.h
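
The v2 item about perf events that wrap around the ring buffer boundary
comes down to a two-part copy into contiguous memory. The sketch below
shows the idea in isolation; ring_copy() and its parameters are
hypothetical and simplified, not the interface used in dt_buffer.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a record of 'len' bytes that starts at byte 'offset' of a ring
 * buffer of 'size' bytes into contiguous memory at 'dst'.  A record
 * that runs past the end of the ring continues at its start.
 * The caller must ensure offset < size and len <= size.
 */
void ring_copy(unsigned char *dst, const unsigned char *ring, size_t size,
	       size_t offset, size_t len)
{
	size_t	first = size - offset;	/* bytes left before the end */

	if (first > len)
		first = len;		/* record does not wrap */

	memcpy(dst, ring + offset, first);
	memcpy(dst + first, ring, len - first);
}
```

For an 8-byte ring holding "ABCDEFGH", copying 4 bytes from offset 6
produces "GHAB"; a record that does not wrap is copied by the first
memcpy() alone.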

diff --git a/MAINTAINERS b/MAINTAINERS
index cfa9ed89c031..410240732d55 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5485,6 +5485,12 @@ W:	https://linuxtv.org
 S:	Odd Fixes
 F:	drivers/media/pci/dt3155/
 
+DTRACE
+M:	Kris Van Hees <kris.van.hees@oracle.com>
+L:	dtrace-devel@oss.oracle.com
+S:	Maintained
+F:	tools/dtrace/
+
 DVB_USB_AF9015 MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
 L:	linux-media@vger.kernel.org
diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
new file mode 100644
index 000000000000..03ae498d1429
--- /dev/null
+++ b/tools/dtrace/Makefile
@@ -0,0 +1,87 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# This Makefile is based on samples/bpf.
+#
+# Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+
+DT_VERSION		:= 2.0.0
+DT_GIT_VERSION		:= $(shell git rev-parse HEAD 2>/dev/null || \
+				   echo Unknown)
+
+DTRACE_PATH		?= $(abspath $(srctree)/$(src))
+TOOLS_PATH		:= $(DTRACE_PATH)/..
+SAMPLES_PATH		:= $(DTRACE_PATH)/../../samples
+
+hostprogs-y		:= dtrace
+
+LIBBPF			:= $(TOOLS_PATH)/lib/bpf/libbpf.a
+OBJS			:= dt_bpf.o dt_buffer.o dt_utils.o dt_probe.o \
+			   dt_hash.o \
+			   dt_fbt.o dt_syscall.o
+
+dtrace-objs		:= $(OBJS) dtrace.o
+
+always			:= $(hostprogs-y)
+always			+= bpf_sample.o
+
+KBUILD_HOSTCFLAGS	+= -DDT_VERSION=\"$(DT_VERSION)\"
+KBUILD_HOSTCFLAGS	+= -DDT_GIT_VERSION=\"$(DT_GIT_VERSION)\"
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/lib
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/include/uapi
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/include/
+KBUILD_HOSTCFLAGS	+= -I$(srctree)/usr/include
+
+KBUILD_HOSTLDLIBS	:= $(LIBBPF) -lelf
+
+LLC			?= llc
+CLANG			?= clang
+LLVM_OBJCOPY		?= llvm-objcopy
+
+ifdef CROSS_COMPILE
+HOSTCC			= $(CROSS_COMPILE)gcc
+CLANG_ARCH_ARGS		= -target $(ARCH)
+endif
+
+all:
+	$(MAKE) -C ../../ $(CURDIR)/ DTRACE_PATH=$(CURDIR)
+
+clean:
+	$(MAKE) -C ../../ M=$(CURDIR) clean
+	@rm -f *~
+
+$(LIBBPF): FORCE
+	$(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(DTRACE_PATH)/../../ O=
+
+FORCE:
+
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
+
+verify_cmds: $(CLANG) $(LLC)
+	@for TOOL in $^ ; do \
+		if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
+			echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+			exit 1; \
+		else true; fi; \
+	done
+
+verify_target_bpf: verify_cmds
+	@if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+		echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+		echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+		exit 2; \
+	else true; fi
+
+$(DTRACE_PATH)/*.c: verify_target_bpf $(LIBBPF)
+$(src)/*.c: verify_target_bpf $(LIBBPF)
+
+$(obj)/%.o: $(src)/%.c
+	@echo "  CLANG-bpf " $@
+	$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
+		-I$(srctree)/tools/testing/selftests/bpf/ \
+		-D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
+		-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
+		-Wno-gnu-variable-sized-type-not-at-end \
+		-Wno-address-of-packed-member -Wno-tautological-compare \
+		-Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
+		-I$(srctree)/samples/bpf/ -include asm_goto_workaround.h \
+		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf $(LLC_FLAGS) -filetype=obj -o $@
diff --git a/tools/dtrace/bpf_sample.c b/tools/dtrace/bpf_sample.c
new file mode 100644
index 000000000000..9862f75f92d3
--- /dev/null
+++ b/tools/dtrace/bpf_sample.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This sample DTrace BPF tracing program demonstrates how actions can be
+ * associated with different probe types.
+ *
+ * The kprobe/ksys_write probe is a Function Boundary Tracing (FBT) entry probe
+ * on the ksys_write(fd, buf, count) function in the kernel.  Arguments to the
+ * function can be retrieved from the CPU registers (struct pt_regs).
+ *
+ * The tracepoint/syscalls/sys_enter_write probe is a System Call entry probe
+ * for the write(d, buf, count) system call.  Arguments to the system call can
+ * be retrieved from the tracepoint data passed to the BPF program as context
+ * (struct syscall_data) when the probe fires.
+ *
+ * The BPF program associated with each probe prepares a DTrace BPF context
+ * (struct dt_bpf_context) that stores the probe ID and up to 10 arguments.
+ * Only 3 arguments are used in this sample.  Then the programs call a shared
+ * BPF function (bpf_action) that implements the actual action to be taken when
+ * a probe fires.  It prepares a data record to be stored in the tracing buffer
+ * and submits it to the buffer.  The data in the data record is obtained from
+ * the DTrace BPF context.
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <uapi/linux/bpf.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/unistd.h>
+#include "bpf_helpers.h"
+
+#include "dtrace.h"
+
+struct syscall_data {
+	struct pt_regs *regs;
+	long syscall_nr;
+	long arg[6];
+};
+
+struct bpf_map_def SEC("maps") buffers = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = NR_CPUS,
+};
+
+#if defined(bpf_target_x86)
+# define PT_REGS_PARM6(x)	((x)->r9)
+#elif defined(bpf_target_s390x)
+# define PT_REGS_PARM6(x)	((x)->gprs[7])
+#elif defined(bpf_target_arm)
+# define PT_REGS_PARM6(x)	((x)->uregs[5])
+#elif defined(bpf_target_arm64)
+# define PT_REGS_PARM6(x)	((x)->regs[5])
+#elif defined(bpf_target_mips)
+# define PT_REGS_PARM6(x)	((x)->regs[9])
+#elif defined(bpf_target_powerpc)
+# define PT_REGS_PARM6(x)	((x)->gpr[8])
+#elif defined(bpf_target_sparc)
+# define PT_REGS_PARM6(x)	((x)->u_regs[UREG_I5])
+#else
+# error Argument retrieval from pt_regs is not supported yet on this arch.
+#endif
+
+/*
+ * We must pass a valid BPF context pointer because the bpf_perf_event_output()
+ * helper requires a BPF context pointer as first argument (and the verifier is
+ * validating that we pass a value that is known to be a context pointer).
+ *
+ * This BPF function implements the following D action:
+ * {
+ *	trace(curthread);
+ *	trace(arg0);
+ *	trace(arg1);
+ *	trace(arg2);
+ * }
+ *
+ * Expected output will look like:
+ *   CPU     ID
+ *    15  70423 0xffff8c0968bf8ec0 0x00000000000001 0x0055e019eb3f60 0x0000000000002c
+ *    15  18876 0xffff8c0968bf8ec0 0x00000000000001 0x0055e019eb3f60 0x0000000000002c
+ *    |   |     +-- curthread      +--> arg0 (fd)   +--> arg1 (buf)  +-- arg2 (count)
+ *    |   |
+ *    |   +--> probe ID
+ *    |
+ *    +--> CPU the probe fired on
+ */
+static noinline int bpf_action(void *bpf_ctx, struct dt_bpf_context *ctx)
+{
+	int			cpu = bpf_get_smp_processor_id();
+	struct data {
+		u32	probe_id;	/* mandatory */
+
+		u64	task;		/* first data item (current task) */
+		u64	arg0;		/* 2nd data item (arg0, fd) */
+		u64	arg1;		/* 3rd data item (arg1, buf) */
+		u64	arg2;		/* 4th data item (arg2, count) */
+	}			rec;
+
+	memset(&rec, 0, sizeof(rec));
+
+	rec.probe_id = ctx->probe_id;
+	rec.task = bpf_get_current_task();
+	rec.arg0 = ctx->argv[0];
+	rec.arg1 = ctx->argv[1];
+	rec.arg2 = ctx->argv[2];
+
+	bpf_perf_event_output(bpf_ctx, &buffers, cpu, &rec, sizeof(rec));
+
+	return 0;
+}
+
+SEC("kprobe/ksys_write")
+int bpf_kprobe(struct pt_regs *regs)
+{
+	struct dt_bpf_context	ctx;
+
+	memset(&ctx, 0, sizeof(ctx));
+
+	ctx.probe_id = 18876;
+	ctx.argv[0] = PT_REGS_PARM1(regs);
+	ctx.argv[1] = PT_REGS_PARM2(regs);
+	ctx.argv[2] = PT_REGS_PARM3(regs);
+	ctx.argv[3] = PT_REGS_PARM4(regs);
+	ctx.argv[4] = PT_REGS_PARM5(regs);
+	ctx.argv[5] = PT_REGS_PARM6(regs);
+
+	return bpf_action(regs, &ctx);
+}
+
+SEC("tracepoint/syscalls/sys_enter_write")
+int bpf_tp(struct syscall_data *scd)
+{
+	struct dt_bpf_context	ctx;
+
+	memset(&ctx, 0, sizeof(ctx));
+
+	ctx.probe_id = 70423;
+	ctx.argv[0] = scd->arg[0];
+	ctx.argv[1] = scd->arg[1];
+	ctx.argv[2] = scd->arg[2];
+
+	return bpf_action(scd, &ctx);
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/tools/dtrace/dt_bpf.c b/tools/dtrace/dt_bpf.c
new file mode 100644
index 000000000000..78c90de016c6
--- /dev/null
+++ b/tools/dtrace/dt_bpf.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file provides the interface for handling BPF.  It uses the bpf library
+ * to interact with BPF ELF object files.
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <errno.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <bpf/libbpf.h>
+#include <linux/kernel.h>
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+
+#include "dtrace_impl.h"
+
+/*
+ * Validate the output buffer map that is specified in the BPF ELF object.  It
+ * must match the following definition to be valid:
+ *
+ * struct bpf_map_def SEC("maps") buffers = {
+ *	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+ *	.key_size = sizeof(u32),
+ *	.value_size = sizeof(u32),
+ *	.max_entries = num,
+ * };
+ * where num is greater than dt_maxcpuid.
+ */
+static int is_valid_buffers(const struct bpf_map_def *mdef)
+{
+	return mdef->type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
+	       mdef->key_size == sizeof(u32) &&
+	       mdef->value_size == sizeof(u32) &&
+	       mdef->max_entries > dt_maxcpuid;
+}
+
+/*
+ * List the probes specified in the given BPF ELF object file.
+ */
+int dt_bpf_list_probes(const char *fn)
+{
+	struct bpf_object	*obj;
+	struct bpf_program	*prog;
+	int			rc, fd;
+
+	libbpf_set_print(NULL);
+
+	/*
+	 * Listing probes is done before the DTrace command line utility loads
+	 * the supplied programs.  We load them here without attaching them to
+	 * probes so that we can retrieve the ELF section names for each BPF
+	 * program.  The section name indicates the probe that the program is
+	 * associated with.
+	 */
+	rc = bpf_prog_load(fn, BPF_PROG_TYPE_UNSPEC, &obj, &fd);
+	if (rc)
+		return rc;
+
+	/*
+	 * Loop through the programs in the BPF ELF object, and try to resolve
+	 * the section names into probes.  Use the supplied callback function
+	 * to emit the probe description.
+	 */
+	for (prog = bpf_program__next(NULL, obj); prog != NULL;
+	     prog = bpf_program__next(prog, obj)) {
+		struct dt_probe	*probe;
+
+		probe = dt_probe_resolve_event(bpf_program__title(prog, false));
+
+		printf("%5d %10s %17s %33s %s\n", probe->id,
+		       probe->prv_name ? probe->prv_name : "",
+		       probe->mod_name ? probe->mod_name : "",
+		       probe->fun_name ? probe->fun_name : "",
+		       probe->prb_name ? probe->prb_name : "");
+	}
+
+
+	/* Done with the BPF ELF object.  */
+	bpf_object__close(obj);
+
+	return 0;
+}
+
+/*
+ * Load the given BPF ELF object file.
+ */
+int dt_bpf_load_file(const char *fn)
+{
+	struct bpf_object	*obj;
+	struct bpf_map		*map;
+	struct bpf_program	*prog;
+	int			rc, fd;
+
+	libbpf_set_print(NULL);
+
+	/* Load the BPF ELF object file. */
+	rc = bpf_prog_load(fn, BPF_PROG_TYPE_UNSPEC, &obj, &fd);
+	if (rc)
+		return rc;
+
+	/* Validate buffers map. */
+	map = bpf_object__find_map_by_name(obj, "buffers");
+	if (map && is_valid_buffers(bpf_map__def(map)))
+		dt_bufmap_fd = bpf_map__fd(map);
+	else
+		goto fail;
+
+	/*
+	 * Loop through the programs and resolve each into the matching probe.
+	 * Attach the program to the probe.
+	 */
+	for (prog = bpf_program__next(NULL, obj); prog != NULL;
+	     prog = bpf_program__next(prog, obj)) {
+		struct dt_probe	*probe;
+
+		probe = dt_probe_resolve_event(bpf_program__title(prog, false));
+		if (!probe)
+			return -ENOENT;
+		if (probe->prov && probe->prov->attach)
+			probe->prov->attach(bpf_program__title(prog, false),
+					    bpf_program__fd(prog));
+	}
+
+	return 0;
+
+fail:
+	bpf_object__close(obj);
+	return -EINVAL;
+}
+
+/*
+ * Store the (key, value) pair in the map referenced by the given fd.
+ */
+int dt_bpf_map_update(int fd, const void *key, const void *val)
+{
+	union bpf_attr	attr;
+
+	memset(&attr, 0, sizeof(attr));
+
+	attr.map_fd = fd;
+	attr.key = (u64)(unsigned long)key;
+	attr.value = (u64)(unsigned long)val;
+	attr.flags = 0;
+
+	return bpf(BPF_MAP_UPDATE_ELEM, &attr);
+}
+
+/*
+ * Attach a trace event and associate a BPF program with it.
+ */
+int dt_bpf_attach(int event_id, int bpf_fd)
+{
+	int			event_fd;
+	int			rc;
+	struct perf_event_attr	attr = {};
+
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+	attr.config = event_id;
+
+	/*
+	 * Register the event (based on its id), and obtain a fd.  It gets
+	 * created as an enabled probe, so we don't have to explicitly enable
+	 * it.
+	 */
+	event_fd = perf_event_open(&attr, -1, 0, -1, 0);
+	if (event_fd < 0) {
+		perror("sys_perf_event_open");
+		return -1;
+	}
+
+	/* Associate the BPF program with the event. */
+	rc = ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, bpf_fd);
+	if (rc < 0) {
+		perror("PERF_EVENT_IOC_SET_BPF");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/tools/dtrace/dt_buffer.c b/tools/dtrace/dt_buffer.c
new file mode 100644
index 000000000000..19bb7e4cfc92
--- /dev/null
+++ b/tools/dtrace/dt_buffer.c
@@ -0,0 +1,338 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file provides the tracing buffer handling for DTrace.  It makes use of
+ * the perf event output ring buffers that can be written to from BPF programs.
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <sys/epoll.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <linux/bpf.h>
+#include <linux/perf_event.h>
+#include <linux/ring_buffer.h>
+
+#include "dtrace_impl.h"
+
+/*
+ * Probe data is recorded in per-CPU perf ring buffers.
+ */
+struct dtrace_buffer {
+	int	cpu;			/* ID of CPU that uses this buffer */
+	int	fd;			/* fd of perf output buffer */
+	size_t	page_size;		/* size of each page in buffer */
+	size_t	data_size;		/* total buffer size */
+	u8	*base;			/* address of buffer */
+	u8	*endp;			/* address of end of buffer */
+	u8	*tmp;			/* temporary event buffer */
+	u32	tmp_len;		/* length of temporary event buffer */
+};
+
+static struct dtrace_buffer	*dt_buffers;
+
+/*
+ * File descriptor for the BPF map that holds the buffers for the online CPUs.
+ * The map is a bpf_array indexed by CPU id, and it stores a file descriptor as
+ * value (the fd for the perf_event that represents the CPU buffer).
+ */
+int				dt_bufmap_fd = -1;
+
+/*
+ * Create a perf_event buffer for the given DTrace buffer.  This will create
+ * a perf_event ring_buffer, mmap it, and enable the perf_event that owns the
+ * buffer.
+ */
+static int perf_buffer_open(struct dtrace_buffer *buf)
+{
+	int			pefd;
+	struct perf_event_attr	attr = {};
+
+	/*
+	 * Event configuration for BPF-generated output in perf_event ring
+	 * buffers.  The event is created in enabled state.
+	 */
+	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
+	attr.type = PERF_TYPE_SOFTWARE;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+	pefd = perf_event_open(&attr, -1, buf->cpu, -1, PERF_FLAG_FD_CLOEXEC);
+	if (pefd < 0) {
+		fprintf(stderr, "perf_event_open(cpu %d): %s\n", buf->cpu,
+			strerror(errno));
+		goto fail;
+	}
+
+	/*
+	 * We add buf->page_size to the buf->data_size, because perf maintains
+	 * a meta-data page at the beginning of the memory region.  That page
+	 * is used for reader/writer synchronization.
+	 */
+	buf->fd = pefd;
+	buf->base = mmap(NULL, buf->page_size + buf->data_size,
+			 PROT_READ | PROT_WRITE, MAP_SHARED, buf->fd, 0);
+	buf->endp = buf->base + buf->page_size + buf->data_size - 1;
+	if (!buf->base)
+		goto fail;
+
+	return 0;
+
+fail:
+	if (buf->base) {
+		munmap(buf->base, buf->page_size + buf->data_size);
+		buf->base = NULL;
+		buf->endp = NULL;
+	}
+	if (buf->fd) {
+		close(buf->fd);
+		buf->fd = -1;
+	}
+
+	return -1;
+}
+
+/*
+ * Close the given DTrace buffer.  This function disables the perf_event that
+ * owns the buffer, munmaps the memory space, and closes the perf buffer fd.
+ */
+static void perf_buffer_close(struct dtrace_buffer *buf)
+{
+	/*
+	 * If the perf buffer failed to open, there is no need to close it.
+	 */
+	if (buf->fd < 0)
+		return;
+
+	if (ioctl(buf->fd, PERF_EVENT_IOC_DISABLE, 0) < 0)
+		fprintf(stderr, "PERF_EVENT_IOC_DISABLE(cpu %d): %s\n",
+			buf->cpu, strerror(errno));
+
+	munmap(buf->base, buf->page_size + buf->data_size);
+
+	if (close(buf->fd))
+		fprintf(stderr, "perf buffer close(cpu %d): %s\n",
+			buf->cpu, strerror(errno));
+
+	buf->base = NULL;
+	buf->fd = -1;
+}
+
+/*
+ * Initialize the probe data buffers (one per online CPU).  Each buffer will
+ * contain the given number of pages (i.e. total size of each buffer will be
+ * num_pages * getpagesize()).  This function also sets up an event polling
+ * descriptor that monitors all CPU buffers at once.
+ */
+int dt_buffer_init(int num_pages)
+{
+	int	i;
+	int	epoll_fd;
+
+	if (dt_bufmap_fd < 0)
+		return -EINVAL;
+
+	/* Allocate the per-CPU buffer structs. */
+	dt_buffers = calloc(dt_numcpus, sizeof(struct dtrace_buffer));
+	if (dt_buffers == NULL)
+		return -ENOMEM;
+
+	/* Set up the event polling file descriptor. */
+	epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+	if (epoll_fd < 0) {
+		free(dt_buffers);
+		return -errno;
+	}
+
+	for (i = 0; i < dt_numcpus; i++) {
+		int			cpu = dt_cpuids[i];
+		struct epoll_event	ev;
+		struct dtrace_buffer	*buf = &dt_buffers[i];
+
+		buf->cpu = cpu;
+		buf->page_size = getpagesize();
+		buf->data_size = num_pages * buf->page_size;
+		buf->tmp = NULL;
+		buf->tmp_len = 0;
+
+		/* Try to create the perf buffer for this DTrace buffer. */
+		if (perf_buffer_open(buf) == -1)
+			continue;
+
+		/* Store the perf buffer fd in the buffer map. */
+		dt_bpf_map_update(dt_bufmap_fd, &cpu, &buf->fd);
+
+		/* Add the buffer to the event polling descriptor. */
+		ev.events = EPOLLIN;
+		ev.data.ptr = buf;
+		if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, buf->fd, &ev) == -1) {
+			fprintf(stderr, "EPOLL_CTL_ADD(cpu %d): %s\n",
+				buf->cpu, strerror(errno));
+			continue;
+		}
+	}
+
+	return epoll_fd;
+}
+
+/*
+ * Clean up the buffers.
+ */
+void dt_buffer_exit(int epoll_fd)
+{
+	int	i;
+
+	for (i = 0; i < dt_numcpus; i++)
+		perf_buffer_close(&dt_buffers[i]);
+
+	free(dt_buffers);
+	close(epoll_fd);
+}
+
+/*
+ * Process and output the probe data at the supplied address.
+ */
+static void output_event(int cpu, u64 *buf)
+{
+	u8				*data = (u8 *)buf;
+	struct perf_event_header	*hdr;
+
+	hdr = (struct perf_event_header *)data;
+	data += sizeof(struct perf_event_header);
+
+	if (hdr->type == PERF_RECORD_SAMPLE) {
+		u8		*ptr = data;
+		u32		i, size, probe_id;
+
+		/*
+		 * struct {
+		 *	struct perf_event_header	header;
+		 *	u32				size;
+		 *	u32				probe_id;
+		 *	u32				gap;
+		 *	u64				data[n];
+		 * }
+		 * and data points to the 'size' member at this point.
+		 */
+		if (ptr > (u8 *)buf + hdr->size) {
+			fprintf(stderr, "BAD: corrupted sample header\n");
+			return;
+		}
+
+		size = *(u32 *)data;
+		data += sizeof(size);
+		ptr += sizeof(size) + size;
+		if (ptr != (u8 *)buf + hdr->size) {
+			fprintf(stderr, "BAD: invalid sample size\n");
+			return;
+		}
+
+		probe_id = *(u32 *)data;
+		data += sizeof(probe_id);
+		size -= sizeof(probe_id);
+		data += sizeof(u32);		/* skip 32-bit gap */
+		size -= sizeof(u32);
+		buf = (u64 *)data;
+
+		printf("%3d %6d ", cpu, probe_id);
+		for (i = 0, size /= sizeof(u64); i < size; i++)
+			printf("%#016lx ", buf[i]);
+		printf("\n");
+	} else if (hdr->type == PERF_RECORD_LOST) {
+		u64	lost;
+
+		/*
+		 * struct {
+		 *	struct perf_event_header	header;
+		 *	u64				id;
+		 *	u64				lost;
+		 * }
+		 * and data points to the 'id' member at this point.
+		 */
+		lost = *(u64 *)(data + sizeof(u64));
+
+		printf("[%ld probes dropped]\n", lost);
+	} else
+		fprintf(stderr, "UNKNOWN: record type %d\n", hdr->type);
+}
+
+/*
+ * Process the available probe data in the given buffer.
+ */
+static void process_data(struct dtrace_buffer *buf)
+{
+	struct perf_event_mmap_page	*rb_page = (void *)buf->base;
+	struct perf_event_header	*hdr;
+	u8				*base;
+	u64				head, tail;
+
+	/* Set base to be the start of the buffer data. */
+	base = buf->base + buf->page_size;
+
+	for (;;) {
+		head = ring_buffer_read_head(rb_page);
+		tail = rb_page->data_tail;
+
+		if (tail == head)
+			break;
+
+		do {
+			u8	*event = base + tail % buf->data_size;
+			u32	len;
+
+			hdr = (struct perf_event_header *)event;
+			len = hdr->size;
+
+			/*
+			 * If the perf event data wraps around the boundary of
+			 * the buffer, we make a copy in contiguous memory.
+			 */
+			if (event + len > buf->endp) {
+				u8	*dst;
+				u32	num;
+
+				/* Increase buffer as needed. */
+				if (buf->tmp_len < len) {
+					buf->tmp = realloc(buf->tmp, len);
+					buf->tmp_len = len;
+				}
+
+				dst = buf->tmp;
+				num = buf->endp - event + 1;
+				memcpy(dst, event, num);
+				memcpy(dst + num, base, len - num);
+
+				event = dst;
+			}
+
+			output_event(buf->cpu, (u64 *)event);
+
+			tail += hdr->size;
+		} while (tail != head);
+
+		ring_buffer_write_tail(rb_page, tail);
+	}
+}
+
+/*
+ * Wait for data to become available in any of the buffers.
+ */
+int dt_buffer_poll(int epoll_fd, int timeout)
+{
+	struct epoll_event	events[dt_numcpus];
+	int			i, cnt;
+
+	cnt = epoll_wait(epoll_fd, events, dt_numcpus, timeout);
+	if (cnt < 0)
+		return -errno;
+
+	for (i = 0; i < cnt; i++)
+		process_data((struct dtrace_buffer *)events[i].data.ptr);
+
+	return cnt;
+}
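
For readers following the wrap-around handling in process_data(): the sketch below shows the same two-memcpy idea in isolation. It is an illustration only, not code from this patch. ring_copy() and its offset-based parameters are hypothetical, and it works with a ring size rather than an end pointer, so it makes no assumption about whether buf->endp is the last byte of the buffer or one past it.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): copy len bytes that start at
 * offset off in a ring of size bytes into dst, wrapping around to the start
 * of the ring when the data crosses the end.
 *
 * Assumes off < size and len <= size.
 */
static void ring_copy(unsigned char *dst, const unsigned char *ring,
		      size_t size, size_t off, size_t len)
{
	size_t first = size - off;	/* bytes available before the wrap */

	if (first > len)
		first = len;

	memcpy(dst, ring + off, first);		/* tail end of the ring */
	memcpy(dst + first, ring, len - first);	/* remainder from the start */
}
```

With an 8-byte ring holding "ABCDEFGH", copying 4 bytes from offset 6 gives "GHAB"; copying 4 bytes from offset 2 does not wrap and gives "CDEF".
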
diff --git a/tools/dtrace/dt_fbt.c b/tools/dtrace/dt_fbt.c
new file mode 100644
index 000000000000..fcf95243bf97
--- /dev/null
+++ b/tools/dtrace/dt_fbt.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The Function Boundary Tracing (FBT) provider for DTrace.
+ *
+ * FBT probes are exposed by the kernel as kprobes.  They are listed in the
+ * TRACEFS/available_filter_functions file.  Some kprobes are associated with
+ * a specific kernel module, while most are in the core kernel.
+ *
+ * Mapping from event name to DTrace probe name:
+ *
+ *      <name>					fbt:vmlinux:<name>:entry
+ *						fbt:vmlinux:<name>:return
+ *   or
+ *      <name> [<modname>]			fbt:<modname>:<name>:entry
+ *						fbt:<modname>:<name>:return
+ *
+ * Mapping from BPF section name to DTrace probe name:
+ *
+ *      kprobe/<name>				fbt:vmlinux:<name>:entry
+ *      kretprobe/<name>			fbt:vmlinux:<name>:return
+ *
+ * (Note that the BPF section does not carry information about the module that
+ *  the function is found in.  This means that the BPF section name cannot be
+ *  used to distinguish between functions with the same name occurring in
+ *  different modules.)
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include "dtrace_impl.h"
+
+#define KPROBE_EVENTS	TRACEFS "kprobe_events"
+#define PROBE_LIST	TRACEFS "available_filter_functions"
+
+static const char	provname[] = "fbt";
+static const char	modname[] = "vmlinux";
+
+/*
+ * Scan the PROBE_LIST file and add entry and return probes for every function
+ * that is listed.
+ */
+static int fbt_populate(void)
+{
+	FILE			*f;
+	char			buf[256];
+	char			*p;
+
+	f = fopen(PROBE_LIST, "r");
+	if (f == NULL)
+		return -1;
+
+	while (fgets(buf, sizeof(buf), f)) {
+		/*
+		 * Here buf is either "funcname\n" or "funcname [modname]\n".
+		 */
+		p = strchr(buf, '\n');
+		if (p) {
+			*p = '\0';
+			if (p > buf && *(--p) == ']')
+				*p = '\0';
+		} else {
+			/* If we didn't see a newline, the line was too long.
+			 * Report it, and continue until the end of the line.
+			 */
+			fprintf(stderr, "%s: Line too long: %s\n",
+				PROBE_LIST, buf);
+			while (fgets(buf, sizeof(buf), f) &&
+			       strchr(buf, '\n') == NULL)
+				;
+			continue;
+		}
+
+		/*
+		 * Now buf is either "funcname" or "funcname [modname".  If
+		 * there is no module name provided, we will use the default.
+		 */
+		p = strchr(buf, ' ');
+		if (p) {
+			*p++ = '\0';
+			if (*p == '[')
+				p++;
+		}
+
+		dt_probe_new(&dt_fbt, provname, p ? p : modname, buf, "entry");
+		dt_probe_new(&dt_fbt, provname, p ? p : modname, buf, "return");
+	}
+
+	fclose(f);
+
+	return 0;
+}
+
+#define ENTRY_PREFIX	"kprobe/"
+#define EXIT_PREFIX	"kretprobe/"
+
+/*
+ * Perform a probe lookup based on an event name (BPF ELF section name).
+ */
+static struct dt_probe *fbt_resolve_event(const char *name)
+{
+	const char	*prbname;
+	struct dt_probe	tmpl;
+	struct dt_probe	*probe;
+
+	if (!name)
+		return NULL;
+
+	if (strncmp(name, ENTRY_PREFIX, sizeof(ENTRY_PREFIX) - 1) == 0) {
+		name += sizeof(ENTRY_PREFIX) - 1;
+		prbname = "entry";
+	} else if (strncmp(name, EXIT_PREFIX, sizeof(EXIT_PREFIX) - 1) == 0) {
+		name += sizeof(EXIT_PREFIX) - 1;
+		prbname = "return";
+	} else
+		return NULL;
+
+	memset(&tmpl, 0, sizeof(tmpl));
+	tmpl.prv_name = provname;
+	tmpl.mod_name = modname;
+	tmpl.fun_name = name;
+	tmpl.prb_name = prbname;
+
+	probe = dt_probe_by_name(&tmpl);
+
+	return probe;
+}
+
+/*
+ * Attach the given BPF program (identified by its file descriptor) to the
+ * kprobe identified by the given section name.
+ */
+static int fbt_attach(const char *name, int bpf_fd)
+{
+	char    efn[256];
+	char    buf[256];
+	int	event_id, fd, rc;
+
+	name += sizeof(ENTRY_PREFIX) - 1;	/* skip "kprobe/" */
+	snprintf(buf, sizeof(buf), "p:%s %s\n", name, name);
+
+	/*
+	 * Register the kprobe with the tracing subsystem.  This will create
+	 * a tracepoint event.
+	 */
+	fd = open(KPROBE_EVENTS, O_WRONLY | O_APPEND);
+	if (fd < 0) {
+		perror(KPROBE_EVENTS);
+		return -1;
+	}
+	rc = write(fd, buf, strlen(buf));
+	if (rc < 0) {
+		perror(KPROBE_EVENTS);
+		close(fd);
+		return -1;
+	}
+	close(fd);
+
+	/*
+	 * Read the tracepoint event id for the kprobe we just registered.
+	 */
+	strcpy(efn, EVENTSFS);
+	strcat(efn, "kprobes/");
+	strcat(efn, name);
+	strcat(efn, "/id");
+
+	fd = open(efn, O_RDONLY);
+	if (fd < 0) {
+		perror(efn);
+		return -1;
+	}
+	rc = read(fd, buf, sizeof(buf));
+	if (rc < 0 || rc >= sizeof(buf)) {
+		perror(efn);
+		close(fd);
+		return -1;
+	}
+	close(fd);
+	buf[rc] = '\0';
+	event_id = atoi(buf);
+
+	/*
+	 * Attaching a BPF program (by file descriptor) to an event (by ID) is
+	 * a generic operation provided by the BPF interface code.
+	 */
+	return dt_bpf_attach(event_id, bpf_fd);
+}
+
+struct dt_provider	dt_fbt = {
+	.name		= "fbt",
+	.populate	= &fbt_populate,
+	.resolve_event	= &fbt_resolve_event,
+	.attach		= &fbt_attach,
+};
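
The line handling in fbt_populate() can be easier to follow as a standalone function. The sketch below is an illustration under the assumption stated in the code comment above, namely that each line is either "funcname" or "funcname [modname]" followed by a newline. split_probe_line() is a hypothetical name and does not exist in this patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): split one line of the probe
 * list in place.  On return, *fun points at the function name and *mod at the
 * module name, or NULL when the line names no module.
 */
static void split_probe_line(char *line, const char **fun, const char **mod)
{
	char	*p;

	line[strcspn(line, "\n")] = '\0';	/* drop the newline, if any */
	*fun = line;
	*mod = NULL;

	p = strchr(line, ' ');
	if (!p)
		return;				/* "funcname" only */

	*p++ = '\0';				/* terminate the function name */
	if (*p == '[')
		p++;
	p[strcspn(p, "]")] = '\0';		/* drop the closing bracket */
	*mod = p;
}
```

For "ext4_read [ext4]\n" this yields the function name "ext4_read" and the module name "ext4". For "vfs_read\n" the module is NULL, and a caller would fall back to a default such as "vmlinux".
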
diff --git a/tools/dtrace/dt_hash.c b/tools/dtrace/dt_hash.c
new file mode 100644
index 000000000000..b1f563bc0773
--- /dev/null
+++ b/tools/dtrace/dt_hash.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file provides a generic hashtable implementation for probes.
+ *
+ * The hashtable is created with 4 user-provided functions:
+ *	hval(probe)		- calculate a hash value for the given probe
+ *	cmp(probe1, probe2)	- compare two probes
+ *	add(head, probe)	- add a probe to a list of probes
+ *	del(head, probe)	- delete a probe from a list of probes
+ *
+ * Probes are hashed into a hashtable slot based on the return value of
+ * hval(probe).  Each hashtable slot holds a list of buckets, with each
+ * bucket storing probes that are equal under the cmp(probe1, probe2)
+ * function. Probes are added to the list of probes in a bucket using the
+ * add(head, probe) function, and they are deleted using a call to the
+ * del(head, probe) function.
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "dtrace_impl.h"
+
+/*
+ * Hashtable implementation for probes.
+ */
+struct dt_hbucket {
+	u32			hval;
+	struct dt_hbucket	*next;
+	struct dt_probe		*head;
+	int			nprobes;
+};
+
+struct dt_htab {
+	struct dt_hbucket	**tab;
+	int			size;
+	int			mask;
+	int			nbuckets;
+	dt_hval_fn		hval;		/* calculate hash value */
+	dt_cmp_fn		cmp;		/* compare 2 probes */
+	dt_add_fn		add;		/* add probe to list */
+	dt_del_fn		del;		/* delete probe from list */
+};
+
+/*
+ * Create a new (empty) hashtable.
+ */
+struct dt_htab *dt_htab_new(dt_hval_fn hval, dt_cmp_fn cmp, dt_add_fn add,
+			    dt_del_fn del)
+{
+	struct dt_htab	*htab = malloc(sizeof(struct dt_htab));
+
+	if (!htab)
+		return NULL;
+
+	htab->size = 1;
+	htab->mask = htab->size - 1;
+	htab->nbuckets = 0;
+	htab->hval = hval;
+	htab->cmp = cmp;
+	htab->add = add;
+	htab->del = del;
+
+	htab->tab = calloc(htab->size, sizeof(struct dt_hbucket *));
+	if (!htab->tab) {
+		free(htab);
+		return NULL;
+	}
+
+	return htab;
+}
+
+/*
+ * Resize the hashtable by doubling the number of slots.
+ */
+static int resize(struct dt_htab *htab)
+{
+	int			i;
+	int			osize = htab->size;
+	int			nsize = osize << 1;
+	int			nmask = nsize - 1;
+	struct dt_hbucket	**ntab;
+
+	ntab = calloc(nsize, sizeof(struct dt_hbucket *));
+	if (!ntab)
+		return -ENOMEM;
+
+	for (i = 0; i < osize; i++) {
+		struct dt_hbucket	*bucket, *next;
+
+		for (bucket = htab->tab[i]; bucket; bucket = next) {
+			int	idx	= bucket->hval & nmask;
+
+			next = bucket->next;
+			bucket->next = ntab[idx];
+			ntab[idx] = bucket;
+		}
+	}
+
+	free(htab->tab);
+	htab->tab = ntab;
+	htab->size = nsize;
+	htab->mask = nmask;
+
+	return 0;
+}
+
+/*
+ * Add a probe to the hashtable.  Resize if necessary, and allocate a new
+ * bucket if necessary.
+ */
+int dt_htab_add(struct dt_htab *htab, struct dt_probe *probe)
+{
+	u32			hval = htab->hval(probe);
+	int			idx;
+	struct dt_hbucket	*bucket;
+
+retry:
+	idx = hval & htab->mask;
+	for (bucket = htab->tab[idx]; bucket; bucket = bucket->next) {
+		if (htab->cmp(bucket->head, probe) == 0)
+			goto add;
+	}
+
+	if ((htab->nbuckets >> 1) > htab->size) {
+		int	err;
+
+		err = resize(htab);
+		if (err)
+			return err;
+
+		goto retry;
+	}
+
+	bucket = malloc(sizeof(struct dt_hbucket));
+	if (!bucket)
+		return -ENOMEM;
+
+	bucket->hval = hval;
+	bucket->next = htab->tab[idx];
+	bucket->head = NULL;
+	bucket->nprobes = 0;
+	htab->tab[idx] = bucket;
+	htab->nbuckets++;
+
+add:
+	bucket->head = htab->add(bucket->head, probe);
+	bucket->nprobes++;
+
+	return 0;
+}
+
+/*
+ * Find a probe in the hashtable.
+ */
+struct dt_probe *dt_htab_lookup(const struct dt_htab *htab,
+				const struct dt_probe *probe)
+{
+	u32			hval = htab->hval(probe);
+	int			idx = hval & htab->mask;
+	struct dt_hbucket	*bucket;
+
+	for (bucket = htab->tab[idx]; bucket; bucket = bucket->next) {
+		if (htab->cmp(bucket->head, probe) == 0)
+			return bucket->head;
+	}
+
+	return NULL;
+}
+
+/*
+ * Remove a probe from the hashtable.  If we are deleting the last probe in a
+ * bucket, get rid of the bucket.
+ */
+int dt_htab_del(struct dt_htab *htab, struct dt_probe *probe)
+{
+	u32			hval = htab->hval(probe);
+	int			idx = hval & htab->mask;
+	struct dt_hbucket	*bucket;
+	struct dt_probe		*head;
+
+	for (bucket = htab->tab[idx]; bucket; bucket = bucket->next) {
+		if (htab->cmp(bucket->head, probe) == 0)
+			break;
+	}
+
+	if (bucket == NULL)
+		return -ENOENT;
+
+	head = htab->del(bucket->head, probe);
+	if (!head) {
+		struct dt_hbucket	*b = htab->tab[idx];
+
+		if (bucket == b)
+			htab->tab[idx] = bucket->next;
+		else {
+			while (b->next != bucket)
+				b = b->next;
+
+			b->next = bucket->next;
+		}
+
+		htab->nbuckets--;
+		free(bucket);
+	} else
+		bucket->head = head;
+
+	return 0;
+}
diff --git a/tools/dtrace/dt_probe.c b/tools/dtrace/dt_probe.c
new file mode 100644
index 000000000000..0b6228eaff29
--- /dev/null
+++ b/tools/dtrace/dt_probe.c
@@ -0,0 +1,230 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file implements the interface to probes grouped by provider.
+ *
+ * Probes are named by a set of 4 identifiers:
+ *	- provider name
+ *	- module name
+ *	- function name
+ *	- probe name
+ *
+ * The Fully Qualified Name (FQN) is "provider:module:function:name".
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/kernel.h>
+
+#include "dtrace_impl.h"
+
+static struct dt_provider      *dt_providers[] = {
+							&dt_fbt,
+							&dt_syscall,
+						 };
+
+static struct dt_htab	*ht_byfqn;
+
+static u32		next_probe_id;
+
+/*
+ * Calculate a hash value based on a given string and an initial value.  The
+ * initial value is used to calculate compound hash values, e.g.
+ *
+ *	u32	hval;
+ *
+ *	hval = str2hval(str1, 0);
+ *	hval = str2hval(str2, hval);
+ */
+static u32 str2hval(const char *p, u32 hval)
+{
+	u32	g;
+
+	if (!p)
+		return hval;
+
+	while (*p) {
+		hval = (hval << 4) + *p++;
+		g = hval & 0xf0000000;
+		if (g != 0)
+			hval ^= g >> 24;
+
+		hval &= ~g;
+	}
+
+	return hval;
+}
+
+/*
+ * String compare function that can handle either or both strings being NULL.
+ */
+static int safe_strcmp(const char *p, const char *q)
+{
+	return (!p) ? (!q) ? 0
+			   : -1
+		    : (!q) ? 1
+			   : strcmp(p, q);
+}
+
+/*
+ * Calculate the hash value of a probe as the cumulative hash value of the
+ * FQN.
+ */
+static u32 fqn_hval(const struct dt_probe *probe)
+{
+	u32	hval = 0;
+
+	hval = str2hval(probe->prv_name, hval);
+	hval = str2hval(":", hval);
+	hval = str2hval(probe->mod_name, hval);
+	hval = str2hval(":", hval);
+	hval = str2hval(probe->fun_name, hval);
+	hval = str2hval(":", hval);
+	hval = str2hval(probe->prb_name, hval);
+
+	return hval;
+}
+
+/*
+ * Compare two probes based on the FQN.
+ */
+static int fqn_cmp(const struct dt_probe *p, const struct dt_probe *q)
+{
+	int	rc;
+
+	rc = safe_strcmp(p->prv_name, q->prv_name);
+	if (rc)
+		return rc;
+	rc = safe_strcmp(p->mod_name, q->mod_name);
+	if (rc)
+		return rc;
+	rc = safe_strcmp(p->fun_name, q->fun_name);
+	if (rc)
+		return rc;
+	rc = safe_strcmp(p->prb_name, q->prb_name);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
+/*
+ * Add the given probe 'new' to the doubly-linked probe list 'head'.  Probe
+ * 'new' becomes the new list head.
+ */
+static struct dt_probe *fqn_add(struct dt_probe *head, struct dt_probe *new)
+{
+	if (!head)
+		return new;
+
+	new->he_fqn.next = head;
+	head->he_fqn.prev = new;
+
+	return new;
+}
+
+/*
+ * Remove the given probe 'probe' from the doubly-linked probe list 'head'.
+ * If we are deleting the current head, the next probe in the list is returned
+ * as the new head.  If that value is NULL, the list is now empty.
+ */
+static struct dt_probe *fqn_del(struct dt_probe *head, struct dt_probe *probe)
+{
+	if (head == probe) {
+		if (!probe->he_fqn.next)
+			return NULL;
+
+		head = probe->he_fqn.next;
+		head->he_fqn.prev = NULL;
+		probe->he_fqn.next = NULL;
+
+		return head;
+	}
+
+	if (!probe->he_fqn.next) {
+		probe->he_fqn.prev->he_fqn.next = NULL;
+		probe->he_fqn.prev = NULL;
+
+		return head;
+	}
+
+	probe->he_fqn.prev->he_fqn.next = probe->he_fqn.next;
+	probe->he_fqn.next->he_fqn.prev = probe->he_fqn.prev;
+	probe->he_fqn.prev = probe->he_fqn.next = NULL;
+
+	return head;
+}
+
+/*
+ * Initialize the probe handling by populating the FQN hashtable with probes
+ * from all providers.
+ */
+int dt_probe_init(void)
+{
+	int	i;
+
+	ht_byfqn = dt_htab_new(fqn_hval, fqn_cmp, fqn_add, fqn_del);
+
+	for (i = 0; i < ARRAY_SIZE(dt_providers); i++) {
+		if (dt_providers[i]->populate() < 0)
+			return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * Allocate a new probe and add it to the FQN hashtable.
+ */
+int dt_probe_new(const struct dt_provider *prov, const char *pname,
+		 const char *mname, const char *fname, const char *name)
+{
+	struct dt_probe	*probe;
+
+	probe = malloc(sizeof(struct dt_probe));
+	if (!probe)
+		return -ENOMEM;
+
+	memset(probe, 0, sizeof(struct dt_probe));
+	probe->id = next_probe_id++;
+	probe->prov = prov;
+	probe->prv_name = pname ? strdup(pname) : NULL;
+	probe->mod_name = mname ? strdup(mname) : NULL;
+	probe->fun_name = fname ? strdup(fname) : NULL;
+	probe->prb_name = name ? strdup(name) : NULL;
+
+	dt_htab_add(ht_byfqn, probe);
+
+	return 0;
+}
+
+/*
+ * Perform a probe lookup based on FQN.
+ */
+struct dt_probe *dt_probe_by_name(const struct dt_probe *tmpl)
+{
+	return dt_htab_lookup(ht_byfqn, tmpl);
+}
+
+/*
+ * Resolve an event name (BPF ELF section name) into a probe.  We query each
+ * provider, and as soon as we get a hit, we return the result.
+ */
+struct dt_probe *dt_probe_resolve_event(const char *name)
+{
+	int		i;
+	struct dt_probe	*probe;
+
+	for (i = 0; i < ARRAY_SIZE(dt_providers); i++) {
+		if (!dt_providers[i]->resolve_event)
+			continue;
+		probe = dt_providers[i]->resolve_event(name);
+		if (probe)
+			return probe;
+	}
+
+	return NULL;
+}
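
A property of the compound hashing used by fqn_hval() that may be worth spelling out: since str2hval() carries all of its state in hval, hashing the four name components piecewise with ":" separators gives the same value as hashing the joined FQN string in one call. The sketch below reproduces str2hval() with uint32_t in place of u32 so that it builds on its own. fqn_hash() is a hypothetical stand-in for fqn_hval() that takes plain strings instead of a struct dt_probe.

```c
#include <stddef.h>
#include <stdint.h>

/* Same update step as str2hval() in dt_probe.c, with uint32_t for u32. */
static uint32_t str2hval(const char *p, uint32_t hval)
{
	uint32_t	g;

	if (!p)
		return hval;

	while (*p) {
		hval = (hval << 4) + *p++;
		g = hval & 0xf0000000;
		if (g != 0)
			hval ^= g >> 24;

		hval &= ~g;
	}

	return hval;
}

/*
 * Hypothetical stand-in for fqn_hval(): hash "prv:mod:fun:prb" piecewise,
 * feeding each intermediate value into the next call.
 */
static uint32_t fqn_hash(const char *prv, const char *mod,
			 const char *fun, const char *prb)
{
	uint32_t	hval = 0;

	hval = str2hval(prv, hval);
	hval = str2hval(":", hval);
	hval = str2hval(mod, hval);
	hval = str2hval(":", hval);
	hval = str2hval(fun, hval);
	hval = str2hval(":", hval);
	hval = str2hval(prb, hval);

	return hval;
}
```

So fqn_hash("fbt", "vmlinux", "vfs_read", "entry") equals str2hval("fbt:vmlinux:vfs_read:entry", 0), and a NULL component simply contributes nothing between its separators.
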
diff --git a/tools/dtrace/dt_syscall.c b/tools/dtrace/dt_syscall.c
new file mode 100644
index 000000000000..6695a4a1c701
--- /dev/null
+++ b/tools/dtrace/dt_syscall.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The syscall provider for DTrace.
+ *
+ * System call probes are exposed by the kernel as tracepoint events in the
+ * "syscalls" group.  Entry probe names start with "sys_enter_" and exit probes
+ * start with "sys_exit_".
+ *
+ * Mapping from event name to DTrace probe name:
+ *
+ *	syscalls:sys_enter_<name>		syscall:vmlinux:<name>:entry
+ *	syscalls:sys_exit_<name>		syscall:vmlinux:<name>:return
+ *
+ * Mapping from BPF section name to DTrace probe name:
+ *
+ *	tracepoint/syscalls/sys_enter_<name>	syscall:vmlinux:<name>:entry
+ *	tracepoint/syscalls/sys_exit_<name>	syscall:vmlinux:<name>:return
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <ctype.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include "dtrace_impl.h"
+
+static const char	provname[] = "syscall";
+static const char	modname[] = "vmlinux";
+
+#define PROBE_LIST	TRACEFS "available_events"
+
+#define PROV_PREFIX	"syscalls:"
+#define ENTRY_PREFIX	"sys_enter_"
+#define EXIT_PREFIX	"sys_exit_"
+
+/*
+ * Scan the PROBE_LIST file and add probes for any syscalls events.
+ */
+static int syscall_populate(void)
+{
+	FILE			*f;
+	char			buf[256];
+
+	f = fopen(PROBE_LIST, "r");
+	if (f == NULL)
+		return -1;
+
+	while (fgets(buf, sizeof(buf), f)) {
+		char	*p;
+
+		/* Here buf is "group:event". */
+		p = strchr(buf, '\n');
+		if (p)
+			*p = '\0';
+		else {
+			/*
+			 * If we didn't see a newline, the line was too long.
+			 * Report it, and continue until the end of the line.
+			 */
+			fprintf(stderr, "%s: Line too long: %s\n",
+				PROBE_LIST, buf);
+			while (fgets(buf, sizeof(buf), f) &&
+			       strchr(buf, '\n') == NULL)
+				;
+			continue;
+		}
+
+		/* We need "group:" to match "syscalls:". */
+		p = buf;
+		if (memcmp(p, PROV_PREFIX, sizeof(PROV_PREFIX) - 1) != 0)
+			continue;
+
+		p += sizeof(PROV_PREFIX) - 1;
+		/*
+		 * Now p will be just "event", and we are only interested in
+		 * events that match "sys_enter_*" or "sys_exit_*".
+		 */
+		if (!memcmp(p, ENTRY_PREFIX, sizeof(ENTRY_PREFIX) - 1)) {
+			p += sizeof(ENTRY_PREFIX) - 1;
+			dt_probe_new(&dt_syscall, provname, modname, p,
+				     "entry");
+		} else if (!memcmp(p, EXIT_PREFIX, sizeof(EXIT_PREFIX) - 1)) {
+			p += sizeof(EXIT_PREFIX) - 1;
+			dt_probe_new(&dt_syscall, provname, modname, p,
+				     "return");
+		}
+	}
+
+	fclose(f);
+
+	return 0;
+}
+
+#define EVENT_PREFIX	"tracepoint/syscalls/"
+
+/*
+ * Perform a probe lookup based on an event name (BPF ELF section name).
+ */
+static struct dt_probe *systrace_resolve_event(const char *name)
+{
+	const char	*prbname;
+	struct dt_probe	tmpl;
+	struct dt_probe	*probe;
+
+	if (!name)
+		return NULL;
+
+	/* Exclude anything that is not a syscalls tracepoint */
+	if (strncmp(name, EVENT_PREFIX, sizeof(EVENT_PREFIX) - 1) != 0)
+		return NULL;
+	name += sizeof(EVENT_PREFIX) - 1;
+
+	if (strncmp(name, ENTRY_PREFIX, sizeof(ENTRY_PREFIX) - 1) == 0) {
+		name += sizeof(ENTRY_PREFIX) - 1;
+		prbname = "entry";
+	} else if (strncmp(name, EXIT_PREFIX, sizeof(EXIT_PREFIX) - 1) == 0) {
+		name += sizeof(EXIT_PREFIX) - 1;
+		prbname = "return";
+	} else
+		return NULL;
+
+	memset(&tmpl, 0, sizeof(tmpl));
+	tmpl.prv_name = provname;
+	tmpl.mod_name = modname;
+	tmpl.fun_name = name;
+	tmpl.prb_name = prbname;
+
+	probe = dt_probe_by_name(&tmpl);
+
+	return probe;
+}
+
+#define SYSCALLSFS	EVENTSFS "syscalls/"
+
+/*
+ * Attach the given BPF program (identified by its file descriptor) to the
+ * event identified by the given section name.
+ */
+static int syscall_attach(const char *name, int bpf_fd)
+{
+	char    efn[256];
+	char    buf[256];
+	int	event_id, fd, rc;
+
+	name += sizeof(EVENT_PREFIX) - 1;
+	strcpy(efn, SYSCALLSFS);
+	strcat(efn, name);
+	strcat(efn, "/id");
+
+	fd = open(efn, O_RDONLY);
+	if (fd < 0) {
+		perror(efn);
+		return -1;
+	}
+	rc = read(fd, buf, sizeof(buf));
+	if (rc < 0 || rc >= sizeof(buf)) {
+		perror(efn);
+		close(fd);
+		return -1;
+	}
+	close(fd);
+	buf[rc] = '\0';
+	event_id = atoi(buf);
+
+	return dt_bpf_attach(event_id, bpf_fd);
+}
+
+struct dt_provider	dt_syscall = {
+	.name		= "syscall",
+	.populate	= &syscall_populate,
+	.resolve_event	= &systrace_resolve_event,
+	.attach		= &syscall_attach,
+};
diff --git a/tools/dtrace/dt_utils.c b/tools/dtrace/dt_utils.c
new file mode 100644
index 000000000000..55d51bae1d97
--- /dev/null
+++ b/tools/dtrace/dt_utils.c
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "dtrace_impl.h"
+
+#define BUF_SIZE	1024		/* max size for online cpu data */
+
+int	dt_numcpus;			/* number of online CPUs */
+int	dt_maxcpuid;			/* highest CPU id */
+int	*dt_cpuids;			/* list of CPU ids */
+
+/*
+ * Populate the online CPU id information from sysfs data.  We only do this
+ * once because we do not care about CPUs coming online after we started
+ * tracing.  If a CPU goes offline during tracing, we do not care either
+ * because that simply means that it won't be writing any new probe data into
+ * its buffer.
+ */
+void cpu_list_populate(void)
+{
+	char buf[BUF_SIZE];
+	int fd, cnt, start, end, i;
+	int *cpu;
+	char *p, *q;
+
+	fd = open("/sys/devices/system/cpu/online", O_RDONLY);
+	if (fd < 0)
+		goto fail;
+	cnt = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+	if (cnt <= 0)
+		goto fail;
+
+	/*
+	 * Terminate the string, dropping the trailing newline if there is one.
+	 */
+	cnt -= (buf[cnt - 1] == '\n');
+	buf[cnt] = 0;
+
+	/*
+	 * Count how many CPUs we have.
+	 */
+	dt_numcpus = 0;
+	p = buf;
+	do {
+		start = (int)strtol(p, &q, 10);
+		switch (*q) {
+		case '-':		/* range */
+			p = q + 1;
+			end = (int)strtol(p, &q, 10);
+			dt_numcpus += end - start + 1;
+			if (*q == 0) {	/* end of string */
+				p = q;
+				break;
+			}
+			if (*q != ',')
+				goto fail;
+			p = q + 1;
+			break;
+		case 0:			/* end of string */
+			dt_numcpus++;
+			p = q;
+			break;
+		case ',':	/* gap  */
+			dt_numcpus++;
+			p = q + 1;
+			break;
+		}
+	} while (*p != 0);
+
+	dt_cpuids = calloc(dt_numcpus,  sizeof(int));
+	cpu = dt_cpuids;
+
+	/*
+	 * Fill in the CPU ids.
+	 */
+	p = buf;
+	do {
+		start = (int)strtol(p, &q, 10);
+		switch (*q) {
+		case '-':		/* range */
+			p = q + 1;
+			end = (int)strtol(p, &q, 10);
+			for (i = start; i <= end; i++)
+				*cpu++ = i;
+			if (*q == 0) {	/* end of string */
+				p = q;
+				break;
+			}
+			if (*q != ',')
+				goto fail;
+			p = q + 1;
+			break;
+		case 0:			/* end of string */
+			*cpu++ = start;
+			p = q;
+			break;
+		case ',':	/* gap  */
+			*cpu++ = start;
+			p = q + 1;
+			break;
+		}
+	} while (*p != 0);
+
+	/* Record the highest CPU id of the set of online CPUs. */
+	dt_maxcpuid = *(cpu - 1);
+
+	return;
+fail:
+	if (dt_cpuids)
+		free(dt_cpuids);
+
+	dt_numcpus = 0;
+	dt_maxcpuid = 0;
+	dt_cpuids = NULL;
+}
+
+void cpu_list_free(void)
+{
+	free(dt_cpuids);
+	dt_numcpus = 0;
+	dt_maxcpuid = 0;
+	dt_cpuids = NULL;
+}
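
The two passes in cpu_list_populate() (count, then fill) both walk a CPU list string such as "0-3,5". As an illustration of that format, here is a minimal single-pass sketch that expands such a string into a caller-provided array. cpu_list_parse() and its bounded-array interface are hypothetical and are not part of this patch; the only assumption about the input is the comma-separated "N" or "N-M" syntax that the code above already handles.

```c
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): expand a CPU list string such
 * as "0-3,5" into ids[].  Returns the number of CPU ids stored, or -1 if the
 * string is malformed or would need more than max entries.
 */
static int cpu_list_parse(const char *s, int *ids, int max)
{
	int	n = 0;

	while (*s) {
		char	*q;
		long	start, end, i;

		start = strtol(s, &q, 10);
		if (q == s || start < 0)
			return -1;		/* no digits, or negative */

		end = start;
		if (*q == '-') {		/* range "start-end" */
			s = q + 1;
			end = strtol(s, &q, 10);
			if (q == s || end < start)
				return -1;
		}

		for (i = start; i <= end; i++) {
			if (n >= max)
				return -1;
			ids[n++] = (int)i;
		}

		if (*q == ',')			/* more entries follow */
			s = q + 1;
		else if (*q == '\0')		/* end of string */
			s = q;
		else
			return -1;
	}

	return n;
}
```

For "0-3,5" this stores 0, 1, 2, 3, 5 and returns 5. Because the ids are stored in the order they appear, the highest online CPU id of a sorted list is ids[n - 1].
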
diff --git a/tools/dtrace/dtrace.c b/tools/dtrace/dtrace.c
new file mode 100644
index 000000000000..36ad526c1cd4
--- /dev/null
+++ b/tools/dtrace/dtrace.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#include <errno.h>
+#include <libgen.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/log2.h>
+
+#include "dtrace_impl.h"
+
+#define DTRACE_BUFSIZE	32		/* default buffer size (in pages) */
+
+#define DMODE_VERS	0		/* display version information (-V) */
+#define DMODE_LIST	1		/* list probes (-l) */
+#define DMODE_EXEC	2		/* compile program and start tracing */
+
+#define E_SUCCESS	0
+#define E_ERROR		1
+#define E_USAGE		2
+
+#define NUM_PAGES(sz)	(((sz) + getpagesize() - 1) / getpagesize())
+
+static const char		*dtrace_options = "+b:ls:V";
+
+static char			*g_pname;
+static int			g_mode = DMODE_EXEC;
+
+static int usage(void)
+{
+	fprintf(stderr, "Usage: %s [-lV] [-b bufsz] -s script\n", g_pname);
+	fprintf(stderr,
+	"\t-b  set trace buffer size\n"
+	"\t-l  list probes matching specified criteria\n"
+	"\t-s  enable or list probes for the specified BPF program\n"
+	"\t-V  report DTrace API version\n");
+
+	return E_USAGE;
+}
+
+static u64 parse_size(const char *arg)
+{
+	long long	mul = 1;
+	long long	neg, val;
+	size_t		len;
+	char		*end;
+
+	if (!arg)
+		return -1;
+
+	len = strlen(arg);
+	if (!len)
+		return -1;
+
+	switch (arg[len - 1]) {
+	case 't':
+	case 'T':
+		mul *= 1024;
+		/* fall-through */
+	case 'g':
+	case 'G':
+		mul *= 1024;
+		/* fall-through */
+	case 'm':
+	case 'M':
+		mul *= 1024;
+		/* fall-through */
+	case 'k':
+	case 'K':
+		mul *= 1024;
+		/* fall-through */
+	default:
+		break;
+	}
+
+	neg = strtoll(arg, NULL, 0);
+	errno = 0;
+	val = strtoull(arg, &end, 0) * mul;
+
+	if ((mul > 1 && end != &arg[len - 1]) || (mul == 1 && *end != '\0') ||
+	    val < 0 || neg < 0 || errno != 0)
+		return -1;
+
+	return val;
+}
+
+int main(int argc, char *argv[])
+{
+	int	i;
+	int	modec = 0;
+	int	bufsize = DTRACE_BUFSIZE;
+	int	epoll_fd;
+	int	cnt;
+	char	**prgv;
+	int	prgc;
+
+	g_pname = basename(argv[0]);
+
+	if (argc == 1)
+		return usage();
+
+	prgc = 0;
+	prgv = calloc(argc, sizeof(char *));
+	if (!prgv) {
+		fprintf(stderr, "failed to allocate memory for arguments: %s\n",
+			strerror(errno));
+		return E_ERROR;
+	}
+
+	argv[0] = g_pname;			/* argv[0] for getopt errors */
+
+	for (optind = 1; optind < argc; optind++) {
+		int	opt;
+
+		while ((opt = getopt(argc, argv, dtrace_options)) != EOF) {
+			u64			val;
+
+			switch (opt) {
+			case 'b':
+				val = parse_size(optarg);
+				if (val == (u64)-1) {
+					fprintf(stderr, "invalid: -b %s\n",
+						optarg);
+					return E_ERROR;
+				}
+
+				/*
+				 * Bufsize needs to be a number of pages, and
+				 * must be a power of 2.  This is required by
+				 * the perf event buffer code.
+				 */
+				bufsize = roundup_pow_of_two(NUM_PAGES(val));
+				if ((u64)bufsize * getpagesize() > val)
+					fprintf(stderr,
+						"bufsize increased to %ld\n",
+						(u64)bufsize * getpagesize());
+
+				break;
+			case 'l':
+				g_mode = DMODE_LIST;
+				modec++;
+				break;
+			case 's':
+				prgv[prgc++] = optarg;
+				break;
+			case 'V':
+				g_mode = DMODE_VERS;
+				modec++;
+				break;
+			default:
+				if (strchr(dtrace_options, opt) == NULL)
+					return usage();
+			}
+		}
+
+		if (optind < argc) {
+			fprintf(stderr, "unknown option '%s'\n", argv[optind]);
+			return E_ERROR;
+		}
+	}
+
+	if (modec > 1) {
+		fprintf(stderr,
+			"only one of [-lV] can be specified at a time\n");
+		return E_USAGE;
+	}
+
+	/*
+	 * We handle requests for version information first because we do not
+	 * need probe information for it.
+	 */
+	if (g_mode == DMODE_VERS) {
+		printf("%s\n"
+		       "This is DTrace %s\n"
+		       "dtrace(1) version-control ID: %s\n",
+		       DT_VERS_STRING, DT_VERSION, DT_GIT_VERSION);
+
+		return E_SUCCESS;
+	}
+
+	/* Initialize probes. */
+	if (dt_probe_init() < 0) {
+		fprintf(stderr, "failed to initialize probes: %s\n",
+			strerror(errno));
+		return E_ERROR;
+	}
+
+	/*
+	 * We handle requests to list probes next.
+	 */
+	if (g_mode == DMODE_LIST) {
+		int	rc = 0;
+
+		printf("%5s %10s %17s %33s %s\n",
+		       "ID", "PROVIDER", "MODULE", "FUNCTION", "NAME");
+		for (i = 0; i < prgc; i++) {
+			rc = dt_bpf_list_probes(prgv[i]);
+			if (rc < 0)
+				fprintf(stderr, "failed to load %s: %s\n",
+					prgv[i], strerror(errno));
+		}
+
+		return rc ? E_ERROR : E_SUCCESS;
+	}
+
+	if (!prgc) {
+		fprintf(stderr, "missing BPF program(s)\n");
+		return E_ERROR;
+	}
+
+	/* Process the BPF program. */
+	for (i = 0; i < prgc; i++) {
+		int	err;
+
+		err = dt_bpf_load_file(prgv[i]);
+		if (err) {
+			errno = -err;
+			fprintf(stderr, "failed to load %s: %s\n",
+				prgv[i], strerror(errno));
+			return E_ERROR;
+		}
+	}
+
+	/* Get the list of online CPUs. */
+	cpu_list_populate();
+
+	/* Initialize buffers. */
+	epoll_fd = dt_buffer_init(bufsize);
+	if (epoll_fd < 0) {
+		errno = -epoll_fd;
+		fprintf(stderr, "failed to allocate buffers: %s\n",
+			strerror(errno));
+		return E_ERROR;
+	}
+
+	/* Process probe data. */
+	printf("%3s %6s\n", "CPU", "ID");
+	do {
+		cnt = dt_buffer_poll(epoll_fd, 100);
+	} while (cnt >= 0);
+
+	dt_buffer_exit(epoll_fd);
+
+	return E_SUCCESS;
+}
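
On the -b argument: parse_size() returns a u64, so a failure comes back to the caller as (u64)-1 rather than as a negative number. One alternative, sketched below purely as an illustration, is to return a status and pass the value through an out parameter. size_parse() is a hypothetical name, and its stricter rules (no '-' anywhere, explicit overflow check) are assumptions of this sketch, not behaviour taken from the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): parse a size with an optional
 * k/m/g/t suffix (powers of 1024).  Returns 0 and stores the result in *out,
 * or returns -1 on malformed input or overflow.
 */
static int size_parse(const char *arg, uint64_t *out)
{
	unsigned long long	val, mul = 1;
	char			*end;

	if (!arg || *arg == '\0' || strchr(arg, '-'))
		return -1;			/* empty or negative */

	errno = 0;
	val = strtoull(arg, &end, 0);
	if (errno != 0 || end == arg)
		return -1;			/* out of range, or no digits */

	switch (*end) {
	case 't': case 'T':
		mul <<= 10;
		/* fall through */
	case 'g': case 'G':
		mul <<= 10;
		/* fall through */
	case 'm': case 'M':
		mul <<= 10;
		/* fall through */
	case 'k': case 'K':
		mul <<= 10;
		end++;				/* consume the suffix */
		break;
	default:
		break;
	}

	if (*end != '\0' || val > UINT64_MAX / mul)
		return -1;			/* trailing junk, or overflow */

	*out = val * mul;
	return 0;
}
```

With this shape, "4k" yields 4096, "16M" yields 16777216, "0x10" yields 16, and inputs such as "-1" or "1kk" are rejected.
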
diff --git a/tools/dtrace/dtrace.h b/tools/dtrace/dtrace.h
new file mode 100644
index 000000000000..c79398432d17
--- /dev/null
+++ b/tools/dtrace/dtrace.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#ifndef _UAPI_LINUX_DTRACE_H
+#define _UAPI_LINUX_DTRACE_H
+
+struct dt_bpf_context {
+	u32		probe_id;
+	u64		argv[10];
+};
+
+#endif /* _UAPI_LINUX_DTRACE_H */
diff --git a/tools/dtrace/dtrace_impl.h b/tools/dtrace/dtrace_impl.h
new file mode 100644
index 000000000000..9aa51b4c4aee
--- /dev/null
+++ b/tools/dtrace/dtrace_impl.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ */
+#ifndef _DTRACE_H
+#define _DTRACE_H
+
+#include <unistd.h>
+#include <bpf/libbpf.h>
+#include <linux/types.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+
+#include "dtrace.h"
+
+#define DT_DEBUG
+
+#define DT_VERS_STRING	"Oracle D 2.0.0"
+
+#define TRACEFS		"/sys/kernel/debug/tracing/"
+#define EVENTSFS	TRACEFS "events/"
+
+extern int	dt_numcpus;
+extern int	dt_maxcpuid;
+extern int	*dt_cpuids;
+
+extern void cpu_list_populate(void);
+extern void cpu_list_free(void);
+
+struct dt_provider {
+	char		*name;
+	int		(*populate)(void);
+	struct dt_probe *(*resolve_event)(const char *name);
+	int		(*attach)(const char *name, int bpf_fd);
+};
+
+extern struct dt_provider	dt_fbt;
+extern struct dt_provider	dt_syscall;
+
+struct dt_hentry {
+	struct dt_probe		*next;
+	struct dt_probe		*prev;
+};
+
+struct dt_htab;
+
+typedef u32 (*dt_hval_fn)(const struct dt_probe *);
+typedef int (*dt_cmp_fn)(const struct dt_probe *, const struct dt_probe *);
+typedef struct dt_probe *(*dt_add_fn)(struct dt_probe *, struct dt_probe *);
+typedef struct dt_probe *(*dt_del_fn)(struct dt_probe *, struct dt_probe *);
+
+extern struct dt_htab *dt_htab_new(dt_hval_fn hval, dt_cmp_fn cmp,
+				   dt_add_fn add, dt_del_fn del);
+extern int dt_htab_add(struct dt_htab *htab, struct dt_probe *probe);
+extern struct dt_probe *dt_htab_lookup(const struct dt_htab *htab,
+				       const struct dt_probe *probe);
+extern int dt_htab_del(struct dt_htab *htab, struct dt_probe *probe);
+
+struct dt_probe {
+	u32				id;
+	int				event_fd;
+	const struct dt_provider	*prov;
+	const char			*prv_name;	/* provider name */
+	const char			*mod_name;	/* module name */
+	const char			*fun_name;	/* function name */
+	const char			*prb_name;	/* probe name */
+	struct dt_hentry		he_fqn;
+};
+
+typedef void (*dt_probe_fn)(const struct dt_probe *probe);
+
+extern int dt_probe_init(void);
+extern int dt_probe_new(const struct dt_provider *prov, const char *pname,
+			const char *mname, const char *fname, const char *name);
+extern struct dt_probe *dt_probe_by_name(const struct dt_probe *tmpl);
+extern struct dt_probe *dt_probe_resolve_event(const char *name);
+
+extern int dt_bpf_list_probes(const char *fn);
+extern int dt_bpf_load_file(const char *fn);
+extern int dt_bpf_map_update(int fd, const void *key, const void *val);
+extern int dt_bpf_attach(int event_id, int bpf_fd);
+
+extern int dt_bufmap_fd;
+
+extern int dt_buffer_init(int num_pages);
+extern int dt_buffer_poll(int epoll_fd, int timeout);
+extern void dt_buffer_exit(int epoll_fd);
+
+static inline int perf_event_open(struct perf_event_attr *attr, pid_t pid,
+				  int cpu, int group_fd, unsigned long flags)
+{
+	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+static inline int bpf(enum bpf_cmd cmd, union bpf_attr *attr)
+{
+	return syscall(__NR_bpf, cmd, attr, sizeof(union bpf_attr));
+}
+
+#endif /* _DTRACE_H */
-- 
2.20.1


^ permalink raw reply related

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
From: Eric Dumazet @ 2019-07-10 15:41 UTC (permalink / raw)
  To: Edward Cree, Paolo Abeni, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <677040f4-05d1-e664-d24a-5ee2d2edcdbd@solarflare.com>



On 7/10/19 4:52 PM, Edward Cree wrote:

> Hmm, I was caught out by the call to napi_poll() actually being a local
>  function pointer, not the static function of the same name.  How did a
>  shadow like that ever get allowed?
> But in that case I _really_ don't understand napi_busy_loop(); nothing
>  in it seems to ever flush GRO, so it's relying on either
>  (1) stuff getting flushed because the bucket runs out of space, or
>  (2) the next napi poll after busy_poll_stop() doing the flush.
> What am I missing, and where exactly in napi_busy_loop() should the
>  gro_normal_list() call go?

Please look at busy_poll_stop()


^ permalink raw reply

* [PATCH V2 0/1] tools/dtrace: initial implementation of DTrace
From: Kris Van Hees @ 2019-07-10 15:37 UTC (permalink / raw)
  To: netdev, bpf, dtrace-devel, linux-kernel
  Cc: rostedt, mhiramat, acme, ast, daniel, Peter Zijlstra, Chris Mason

This is version 2 of the patch, incorporating feedback from Peter Zijlstra and
Arnaldo Carvalho de Melo.

Changes in Makefile:
	- Remove -I$(srctree)/tools/perf from KBUILD_HOSTCFLAGS since it
	  is not actually used.

Changes in dt_bpf.c:
	- Remove unnecessary PERF_EVENT_IOC_ENABLE.

Changes in dt_buffer.c:
	- Use ring_buffer_read_head() and ring_buffer_write_tail() to
	  avoid use of volatile.
	- Handle perf events that wrap around the ring buffer boundary.
	- Remove unnecessary PERF_EVENT_IOC_ENABLE.

Changes in bpf_sample.c:
	- Use PT_REGS_PARM1(x), etc instead of my own macros.  Adding
	  PT_REGS_PARM6(x) in bpf_sample.c because we need to be able to
	  support up to 6 arguments passed by registers.

This patch is also available, applied to bpf-next, at the following URL:

	https://github.com/oracle/dtrace-linux-kernel/tree/dtrace-bpf

As suggested in feedback to my earlier patch submissions, this code takes an
approach to avoid kernel code changes as much as possible.  The current patch
does not involve any kernel code changes.  Further development of this code
will continue with this approach, incrementally adding features to this first
minimal implementation.  The goal is a fully featured and functional DTrace
implementation involving kernel changes only when strictly necessary.

The code presented here supports two very basic functions:

1. Listing probes that are used in BPF programs

   # dtrace -l -s bpf_sample.o
      ID   PROVIDER            MODULE                          FUNCTION NAME
   18876        fbt           vmlinux                        ksys_write entry
   70423    syscall           vmlinux                             write entry

2. Loading BPF tracing programs and collecting data that they generate

   # dtrace -s bpf_sample.o
   CPU     ID
    15  70423 0xffff8c0968bf8ec0 0x00000000000001 0x0055e019eb3f60 0x0000000000002c
    15  18876 0xffff8c0968bf8ec0 0x00000000000001 0x0055e019eb3f60 0x0000000000002c
   ...

Only kprobes and syscall tracepoints are supported since this is an initial
patch.  It does show the use of a generic BPF function to implement the actual
probe action, called from two distinct probe types.  Follow-up patches will
add more probe types, add more tracing features from the D language, add
support for D script compilation to BPF, etc.

The implementation makes use of libbpf for handling BPF ELF objects, and uses
the perf event output ring buffer (supported through BPF) to retrieve the
tracing data.  The next step in development will be adding support to libbpf
for programs using shared functions from a collection of functions included in
the BPF ELF object (as suggested by Alexei).  
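
To illustrate the wrap-around case mentioned in the dt_buffer.c changes above,
here is a minimal userspace sketch of copying a record out of a power-of-two
ring buffer when the record may cross the end of the buffer.  It is not the
dt_buffer.c code: ring_copy_out() and its parameters are hypothetical names
chosen for this example, and the sketch assumes the buffer size is a power of
two and the caller has already validated the record length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy 'len' bytes that start at logical offset 'tail' out of a ring buffer
 * of 'size' bytes.  'size' is assumed to be a power of two, so masking with
 * (size - 1) maps the ever-growing offset onto a position in the buffer.
 * If the record crosses the end of the buffer it is copied in two pieces.
 * Hypothetical helper, not taken from dt_buffer.c.
 */
static void ring_copy_out(const uint8_t *base, size_t size, uint64_t tail,
			  void *dst, size_t len)
{
	size_t off = (size_t)(tail & (size - 1));	/* position in the ring */
	size_t first = size - off;			/* bytes before the end */

	if (first >= len) {
		/* The record is contiguous: one copy is enough. */
		memcpy(dst, base + off, len);
	} else {
		/* The record wraps: copy up to the boundary, then the rest. */
		memcpy(dst, base + off, first);
		memcpy((uint8_t *)dst + first, base, len - first);
	}
}
```

Copying into a scratch buffer keeps the record parser simple, at the cost of
one extra copy for the (rare) records that straddle the boundary.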

The code is structured as follows:
 tools/dtrace/dtrace.c      = command line utility
 tools/dtrace/dt_bpf.c      = interface to libbpf
 tools/dtrace/dt_buffer.c   = perf event output buffer handling
 tools/dtrace/dt_fbt.c      = kprobes probe provider
 tools/dtrace/dt_syscall.c  = syscall tracepoint probe provider
 tools/dtrace/dt_probe.c    = generic probe and probe provider handling code
                              This implements a generic interface to the actual
                              probe providers (dt_fbt and dt_syscall).
 tools/dtrace/dt_hash.c     = general probe hashing implementation
 tools/dtrace/dt_utils.c    = support code (manage list of online CPUs)
 tools/dtrace/dtrace.h      = API header file (used by BPF program source code)
 tools/dtrace/dtrace_impl.h = implementation header file
 tools/dtrace/bpf_sample.c  = sample BPF program using two probe types

I included an entry for the MAINTAINERS file.  I offer to actively maintain
this code, and to keep advancing its development.

	Cheers,
	Kris Van Hees

^ permalink raw reply

* [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Stefano Garzarella @ 2019-07-10 15:37 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi; +Cc: virtualization, netdev

Hi,
as Jason suggested some months ago, I took a closer look at the virtio-net
driver to see whether we can reuse some parts of it in the virtio-vsock driver,
since we have similar challenges (mergeable buffers, page allocation, small
packets, etc.).

As a first step, I would use sk_buff in virtio-vsock so that the receive_*()
functions can be reused.
Then I would move receive_[small, big, mergeable]() and
add_recvbuf_[small, big, mergeable]() out of the virtio-net driver, so that
they can also be called from virtio-vsock. This needs some refactoring (e.g.
leaving the XDP part in the virtio-net driver), but I think it is feasible.

The idea is to create virtio-skb.[h,c] to hold these functions, together with a
new object that stores the attributes needed (e.g. hdr_len) and some state (e.g.
some fields of struct receive_queue). This is a sketch of the virtio-skb.h that
I have in mind:
    struct virtskb;

    struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
    struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
    struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);

    int virtskb_add_recvbuf_small(struct virtskb *vs, ...);
    int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
    int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);

For the Guest->Host path it should be easier, so maybe I can add a
"virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
of xmit_skb().

Let me know if you have better names in mind or if I should put these functions
somewhere else.

I would like to leave the control part completely separate, so, for example,
the two drivers will negotiate the features independently and they will call
the right virtskb_receive_*() function based on the negotiation.
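
As a rough illustration of "call the right function based on the negotiation",
the selection could look like the sketch below.  The enum and
virtskb_pick_mode() are hypothetical names, not an existing API.  The priority
order (mergeable first, then big, then small) follows the order virtio-net uses
when it chooses a receive path, and the two booleans stand in for whatever each
driver derives from its own feature negotiation.

```c
#include <stdbool.h>

/* Hypothetical receive-buffer modes for the proposed virtio-skb helper. */
enum virtskb_mode {
	VIRTSKB_SMALL,
	VIRTSKB_BIG,
	VIRTSKB_MERGEABLE,
};

/*
 * Pick the receive path from the result of feature negotiation.
 * Each driver would compute the two flags from its own feature bits and
 * then call the matching virtskb_receive_*() / virtskb_add_recvbuf_*().
 */
static enum virtskb_mode virtskb_pick_mode(bool mergeable_rx_bufs,
					   bool big_packets)
{
	if (mergeable_rx_bufs)
		return VIRTSKB_MERGEABLE;
	if (big_packets)
		return VIRTSKB_BIG;
	return VIRTSKB_SMALL;
}
```

Keeping the decision in the driver and only the buffer handling in the shared
code matches the goal of leaving the control part completely separate.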

I have already started working on it, but before going further and sending an
RFC patch, I would like to hear your opinion.
Do you think that makes sense?
Do you see any issue or a better solution?

Thanks in advance,
Stefano

^ permalink raw reply

* Re: [PATCH net] net: fix use-after-free in __netif_receive_skb_core
From: Edward Cree @ 2019-07-10 15:07 UTC (permalink / raw)
  To: Sabrina Dubroca, netdev; +Cc: Andreas Steinmetz
In-Reply-To: <e909b8fe24b9eac71de52c4f80f7f3f6e5770199.1562766613.git.sd@queasysnail.net>

On 10/07/2019 14:52, Sabrina Dubroca wrote:
> When __netif_receive_skb_core handles a shared skb, it can be
> reallocated in a few different places:
>  - the device's rx_handler
>  - vlan_do_receive
>  - skb_vlan_untag
>
> To deal with that, rx_handlers and vlan_do_receive get passed a
> reference to the skb, and skb_vlan_untag just returns the new
> skb. This was not a problem until commit 88eb1944e18c ("net: core:
> propagate SKB lists through packet_type lookup"), which moved the
> final handling of the skb via pt_prev out of
> __netif_receive_skb_core. After this commit, when the skb is
> reallocated by __netif_receive_skb_core, KASAN reports a
> use-after-free on the old skb:
>
> BUG: KASAN: use-after-free in __netif_receive_skb_one_core+0x15c/0x180
> Call Trace:
>  <IRQ>
>  __netif_receive_skb_one_core+0x15c/0x180
>  process_backlog+0x1b5/0x630
>  ? net_rx_action+0x247/0xd00
>  net_rx_action+0x3fa/0xd00
>  ? napi_complete_done+0x360/0x360
>  __do_softirq+0x257/0xa0b
>  do_softirq_own_stack+0x2a/0x40
>  </IRQ>
>  ? __dev_queue_xmit+0x12ba/0x3120
>  do_softirq+0x5d/0x60
>  [...]
>
> Allocated by task 505:
>  __kasan_kmalloc.constprop.0+0xd6/0x140
>  kmem_cache_alloc+0xd4/0x2e0
>  skb_clone+0x106/0x300
>  deliver_clone+0x3f/0xa0
>  maybe_deliver+0x1c0/0x2b0
>  br_flood+0xd4/0x320
>  br_dev_xmit+0xbc0/0x1080
>  dev_hard_start_xmit+0x139/0x750
>  __dev_queue_xmit+0x24eb/0x3120
>  packet_sendmsg+0x1bfa/0x50e0
>  [...]
>
> Freed by task 505:
>  __kasan_slab_free+0x138/0x1e0
>  kmem_cache_free+0xa2/0x2e0
>  macsec_handle_frame+0xa24/0x2e60
>  __netif_receive_skb_core+0xe2a/0x2c90
>  __netif_receive_skb_one_core+0x96/0x180
>  process_backlog+0x1b5/0x630
>  net_rx_action+0x3fa/0xd00
>  __do_softirq+0x257/0xa0b
>
> The solution is to pass a reference to the skb to
> __netif_receive_skb_core, as we already do with the rx_handlers, so
> that its callers use the new skb.
>
> Fixes: 88eb1944e18c ("net: core: propagate SKB lists through packet_type lookup")
> Reported-by: Andreas Steinmetz <ast@domdv.de>
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> ---
>  net/core/dev.c | 26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d6edd218babd..0bbf6d2a9c32 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4809,11 +4809,12 @@ static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
>  	return 0;
>  }
>  
> -static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
> +static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>  				    struct packet_type **ppt_prev)
>  {
>  	struct packet_type *ptype, *pt_prev;
>  	rx_handler_func_t *rx_handler;
> +	struct sk_buff *skb = *pskb;
Would it not be simpler just to change all users of skb to *pskb?
Then you avoid having to keep doing "*pskb = skb;" whenever skb changes
 (with concomitant risk of bugs if one gets missed).
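
A userspace analogy of the pattern under discussion, as a minimal sketch:
struct buf, handler_may_realloc() and core_receive() are made-up stand-ins for
the skb, an rx_handler that reallocates it, and __netif_receive_skb_core().
The point is only that when every access goes through the caller's pointer,
there is no local copy to write back.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for an skb; purely illustrative. */
struct buf {
	int len;
	char data[16];
};

/*
 * Stand-in for a handler that may replace the buffer.  It frees the old
 * buffer and stores the new pointer through 'pb', so the caller never
 * keeps a stale pointer.  Returns 0 on success, -1 on allocation failure.
 */
static int handler_may_realloc(struct buf **pb)
{
	struct buf *nb = malloc(sizeof(*nb));

	if (!nb)
		return -1;
	memcpy(nb, *pb, sizeof(*nb));
	nb->len += 1;		/* visible change made on the new copy */
	free(*pb);
	*pb = nb;
	return 0;
}

/*
 * Variant that uses *pb everywhere instead of a local 'buf' variable:
 * nothing needs to be synchronised back to the caller afterwards.
 */
static int core_receive(struct buf **pb)
{
	if (handler_may_realloc(pb) < 0)
		return -1;
	return (*pb)->len;	/* always reads the current buffer */
}
```

With a local copy, each reallocation site would need a matching "*pb = b;"
before returning, which is the bookkeeping the comment above is worried about.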

-Ed

^ permalink raw reply

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
From: Edward Cree @ 2019-07-10 14:52 UTC (permalink / raw)
  To: Paolo Abeni, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <c80a9e7846bf903728327a1ca2c3bdcc078057a2.camel@redhat.com>

On 10/07/2019 08:27, Paolo Abeni wrote:
> I'm toying with a patch similar to your 3/3 (most relevant difference
> being the lack of a limit to the batch size), on top of ixgbe (which
> sends all the pkts to the GRO engine), and I'm observing more
> controversial results (UDP only):
>
> * when a single rx queue is running, I see a just-above-noise
> performance delta
> * when multiple rx queues are running, I observe measurable regressions
> (note: I use small pkts, still well under line rate even with multiple
> rx queues)
>
> I'll try to test your patch in the following days.
I look forward to it.

> Side note: I think that in patch 3/3, it's necessary to add a call to
> gro_normal_list() also inside napi_busy_loop().
Hmm, I was caught out by the call to napi_poll() actually being a local
 function pointer, not the static function of the same name.  How did a
 shadow like that ever get allowed?
But in that case I _really_ don't understand napi_busy_loop(); nothing
 in it seems to ever flush GRO, so it's relying on either
 (1) stuff getting flushed because the bucket runs out of space, or
 (2) the next napi poll after busy_poll_stop() doing the flush.
What am I missing, and where exactly in napi_busy_loop() should the
 gro_normal_list() call go?

-Ed

^ permalink raw reply

* Re: [PATCH AUTOSEL 4.19 14/60] mwifiex: Abort at too short BSS descriptor element
From: Sasha Levin @ 2019-07-10 14:51 UTC (permalink / raw)
  To: Brian Norris
  Cc: Linux Kernel, stable, Takashi Iwai, Kalle Valo, linux-wireless,
	<netdev@vger.kernel.org>
In-Reply-To: <CA+ASDXPyGECiq9gZmFj8TU6Gmt2epQtuBqnGqRWad79DJT589w@mail.gmail.com>

On Fri, Jun 28, 2019 at 03:58:49PM -0700, Brian Norris wrote:
>On Wed, Jun 26, 2019 at 5:49 PM Sasha Levin <sashal@kernel.org> wrote:
>>
>> From: Takashi Iwai <tiwai@suse.de>
>>
>> [ Upstream commit 685c9b7750bfacd6fc1db50d86579980593b7869 ]
>>
>> Currently mwifiex_update_bss_desc_with_ie() implicitly assumes that
>> the source descriptor entries are large enough for each type
>> and performs copying without checking the source size.  This may lead
>> to an out-of-bounds read.
>>
>> Fix this by putting the source size check in appropriate places.
>>
>> Signed-off-by: Takashi Iwai <tiwai@suse.de>
>> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>
>For the record, this fixup is still aiming for 5.2, correcting some
>potential mistakes in this patch:
>
>63d7ef36103d mwifiex: Don't abort on small, spec-compliant vendor IEs
>
>So you might want to hold off a bit, and grab them both.

I see that 63d7ef36103d didn't make it into 5.2, so I'll just drop this
for now.
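
For context on the class of bug the quoted commit addresses, here is a generic
sketch of a length-checked copy from an id/length/payload element list.  It is
not the mwifiex code: ie_copy_checked() and its parameters are hypothetical,
and the minimum length is passed in by the caller because the right minimum
depends on the element type (which is what the follow-up fix for small vendor
IEs is about).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk a buffer of elements laid out as: 1 byte id, 1 byte length, payload.
 * Copy the payload of the first element with id 'want_id' into 'dst', but
 * only if it is at least 'min_len' bytes and fits in 'dst_len'.
 * Returns the number of bytes copied, 0 if the element is absent, and -1 if
 * the buffer is malformed or the element has an unacceptable size.
 */
static int ie_copy_checked(const uint8_t *buf, size_t buf_len, uint8_t want_id,
			   void *dst, size_t dst_len, size_t min_len)
{
	size_t pos = 0;

	while (buf_len - pos >= 2) {
		uint8_t id = buf[pos];
		size_t len = buf[pos + 1];

		/* The element must not claim more bytes than remain. */
		if (len > buf_len - pos - 2)
			return -1;

		if (id == want_id) {
			/* Check the source size before copying. */
			if (len < min_len || len > dst_len)
				return -1;
			memcpy(dst, buf + pos + 2, len);
			return (int)len;
		}
		pos += 2 + len;
	}
	return 0;
}
```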

--
Thanks,
Sasha

^ permalink raw reply

* KMSAN: uninit-value in smsc95xx_read_eeprom (2)
From: syzbot @ 2019-07-10 14:48 UTC (permalink / raw)
  To: UNGLinuxDriver, davem, glider, linux-kernel, linux-usb, netdev,
	steve.glendinning, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    fe36eb20 kmsan: rework SLUB hooks
git tree:       kmsan
console output: https://syzkaller.appspot.com/x/log.txt?x=1312be5ba00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=40511ad0c5945201
dashboard link: https://syzkaller.appspot.com/bug?extid=0dfe788c0e7be7c95931
compiler:       clang version 9.0.0 (/home/glider/llvm/clang 80fee25776c2fb61e74c1ecb1a523375c2500b69)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=143976f7a00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1218cfd8600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+0dfe788c0e7be7c95931@syzkaller.appspotmail.com

usb 1-1: New USB device found, idVendor=0424, idProduct=9908, bcdDevice=6a.5e
usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
usb 1-1: config 0 descriptor??
smsc95xx v1.0.6
==================================================================
BUG: KMSAN: uninit-value in smsc95xx_eeprom_confirm_not_busy drivers/net/usb/smsc95xx.c:326 [inline]
BUG: KMSAN: uninit-value in smsc95xx_read_eeprom+0x203/0x920 drivers/net/usb/smsc95xx.c:345
CPU: 1 PID: 695 Comm: kworker/1:2 Not tainted 5.2.0-rc4+ #11
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x191/0x1f0 lib/dump_stack.c:113
  kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109
  __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294
  smsc95xx_eeprom_confirm_not_busy drivers/net/usb/smsc95xx.c:326 [inline]
  smsc95xx_read_eeprom+0x203/0x920 drivers/net/usb/smsc95xx.c:345
  smsc95xx_init_mac_address drivers/net/usb/smsc95xx.c:914 [inline]
  smsc95xx_bind+0x467/0x1690 drivers/net/usb/smsc95xx.c:1286
  usbnet_probe+0x10d3/0x3950 drivers/net/usb/usbnet.c:1722
  usb_probe_interface+0xd19/0x1310 drivers/usb/core/driver.c:361
  really_probe+0x1344/0x1d90 drivers/base/dd.c:513
  driver_probe_device+0x1ba/0x510 drivers/base/dd.c:670
  __device_attach_driver+0x5b8/0x790 drivers/base/dd.c:777
  bus_for_each_drv+0x28e/0x3b0 drivers/base/bus.c:454
  __device_attach+0x489/0x750 drivers/base/dd.c:843
  device_initial_probe+0x4a/0x60 drivers/base/dd.c:890
  bus_probe_device+0x131/0x390 drivers/base/bus.c:514
  device_add+0x25b5/0x2df0 drivers/base/core.c:2111
  usb_set_configuration+0x309f/0x3710 drivers/usb/core/message.c:2027
  generic_probe+0xe7/0x280 drivers/usb/core/generic.c:210
  usb_probe_device+0x146/0x200 drivers/usb/core/driver.c:266
  really_probe+0x1344/0x1d90 drivers/base/dd.c:513
  driver_probe_device+0x1ba/0x510 drivers/base/dd.c:670
  __device_attach_driver+0x5b8/0x790 drivers/base/dd.c:777
  bus_for_each_drv+0x28e/0x3b0 drivers/base/bus.c:454
  __device_attach+0x489/0x750 drivers/base/dd.c:843
  device_initial_probe+0x4a/0x60 drivers/base/dd.c:890
  bus_probe_device+0x131/0x390 drivers/base/bus.c:514
  device_add+0x25b5/0x2df0 drivers/base/core.c:2111
  usb_new_device+0x23e5/0x2fb0 drivers/usb/core/hub.c:2534
  hub_port_connect drivers/usb/core/hub.c:5089 [inline]
  hub_port_connect_change drivers/usb/core/hub.c:5204 [inline]
  port_event drivers/usb/core/hub.c:5350 [inline]
  hub_event+0x5853/0x7320 drivers/usb/core/hub.c:5432
  process_one_work+0x1572/0x1f00 kernel/workqueue.c:2269
  worker_thread+0x111b/0x2460 kernel/workqueue.c:2415
  kthread+0x4b5/0x4f0 kernel/kthread.c:256
  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:355

Local variable description: ----buf.i.i86@smsc95xx_read_eeprom
Variable was created at:
  __smsc95xx_read_reg drivers/net/usb/smsc95xx.c:330 [inline]
  smsc95xx_read_reg drivers/net/usb/smsc95xx.c:144 [inline]
  smsc95xx_eeprom_confirm_not_busy drivers/net/usb/smsc95xx.c:320 [inline]
  smsc95xx_read_eeprom+0x109/0x920 drivers/net/usb/smsc95xx.c:345
  smsc95xx_init_mac_address drivers/net/usb/smsc95xx.c:914 [inline]
  smsc95xx_bind+0x467/0x1690 drivers/net/usb/smsc95xx.c:1286
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* [PATCH v6 4/4] net: macb: add support for high speed interface
From: Parshuram Thombare @ 2019-07-10 14:39 UTC (permalink / raw)
  To: andrew, nicolas.ferre, davem, f.fainelli
  Cc: linux, netdev, hkallweit1, linux-kernel, rafalc, piotrs, aniljoy,
	arthurm, stevenh, pthombar, mparab
In-Reply-To: <1562769391-31803-1-git-send-email-pthombar@cadence.com>

This patch adds support for the high-speed USXGMII PCS and 10G
speed in the Cadence ethernet controller driver.

Signed-off-by: Parshuram Thombare <pthombar@cadence.com>
---
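Note on the OFFSET/SIZE pairs added to macb.h below: each pair describes one
register bitfield, and the driver's bitfield macros build a mask from them.
The following userspace sketch shows the arithmetic for USX_CTRL_SPEED
(offset 14, size 3, as defined in this patch).  The helper names bf_mask(),
bf_insert() and bf_extract() are hypothetical and only mirror the convention;
they are not the macros from macb.h, and they assume size is less than 32.

```c
#include <stdint.h>

/* Values copied from the macb.h hunk in this patch. */
#define USX_CTRL_SPEED_OFFSET	14
#define USX_CTRL_SPEED_SIZE	3

/* Mask covering 'size' bits starting at bit 'offset' (size < 32 assumed). */
static uint32_t bf_mask(unsigned int offset, unsigned int size)
{
	return ((UINT32_C(1) << size) - 1) << offset;
}

/* Replace the field in 'old' with 'val', leaving all other bits alone. */
static uint32_t bf_insert(uint32_t old, unsigned int offset,
			  unsigned int size, uint32_t val)
{
	uint32_t mask = bf_mask(offset, size);

	return (old & ~mask) | ((val << offset) & mask);
}

/* Read the field back out of a register value. */
static uint32_t bf_extract(uint32_t reg, unsigned int offset,
			   unsigned int size)
{
	return (reg >> offset) & ((UINT32_C(1) << size) - 1);
}
```

For example, HS_MAC_SPEED_10000M is the fifth enumerator (value 4), so
inserting it into an all-zero USX_CONTROL value sets only bit 16.
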
 drivers/net/ethernet/cadence/macb.h      |  43 ++++++++
 drivers/net/ethernet/cadence/macb_main.c | 132 +++++++++++++++++++++--
 2 files changed, 165 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 3ed5bffb735b..e3ec224ffc2a 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -82,6 +82,7 @@
 #define GEM_USRIO		0x000c /* User IO */
 #define GEM_DMACFG		0x0010 /* DMA Configuration */
 #define GEM_JML			0x0048 /* Jumbo Max Length */
+#define GEM_HS_MAC_CONFIG	0x0050 /* GEM high speed config */
 #define GEM_HRB			0x0080 /* Hash Bottom */
 #define GEM_HRT			0x0084 /* Hash Top */
 #define GEM_SA1B		0x0088 /* Specific1 Bottom */
@@ -166,7 +167,13 @@
 #define GEM_DCFG6		0x0294 /* Design Config 6 */
 #define GEM_DCFG7		0x0298 /* Design Config 7 */
 #define GEM_DCFG8		0x029C /* Design Config 8 */
+#define GEM_DCFG9		0x02A0 /* Design Config 9 */
 #define GEM_DCFG10		0x02A4 /* Design Config 10 */
+#define GEM_DCFG11		0x02A8 /* Design Config 11 */
+#define GEM_DCFG12		0x02AC /* Design Config 12 */
+#define GEM_DCFG13		0x02B0 /* Design Config 13 */
+#define GEM_USX_CONTROL		0x0A80 /* USXGMII control register */
+#define GEM_USX_STATUS		0x0A88 /* USXGMII status register */
 
 #define GEM_TXBDCTRL	0x04cc /* TX Buffer Descriptor control register */
 #define GEM_RXBDCTRL	0x04d0 /* RX Buffer Descriptor control register */
@@ -274,6 +281,8 @@
 #define MACB_IRXFCS_SIZE	1
 
 /* GEM specific NCR bitfields. */
+#define GEM_ENABLE_HS_MAC_OFFSET	31
+#define GEM_ENABLE_HS_MAC_SIZE		1
 #define GEM_TWO_PT_FIVE_GIG_OFFSET	29
 #define GEM_TWO_PT_FIVE_GIG_SIZE	1
 
@@ -465,6 +474,10 @@
 #define MACB_REV_OFFSET				0
 #define MACB_REV_SIZE				16
 
+/* Bitfield in HS_MAC_CONFIG */
+#define GEM_HS_MAC_SPEED_OFFSET			0
+#define GEM_HS_MAC_SPEED_SIZE			3
+
 /* Bitfields in PCS_CONTROL. */
 #define GEM_PCS_CTRL_RST_OFFSET			15
 #define GEM_PCS_CTRL_RST_SIZE			1
@@ -510,6 +523,34 @@
 #define GEM_RXBD_RDBUFF_OFFSET			8
 #define GEM_RXBD_RDBUFF_SIZE			4
 
+/* Bitfields in DCFG12. */
+#define GEM_HIGH_SPEED_OFFSET			26
+#define GEM_HIGH_SPEED_SIZE			1
+
+/* Bitfields in USX_CONTROL. */
+#define GEM_USX_CTRL_SPEED_OFFSET		14
+#define GEM_USX_CTRL_SPEED_SIZE			3
+#define GEM_SERDES_RATE_OFFSET			12
+#define GEM_SERDES_RATE_SIZE			2
+#define GEM_RX_SCR_BYPASS_OFFSET		9
+#define GEM_RX_SCR_BYPASS_SIZE			1
+#define GEM_TX_SCR_BYPASS_OFFSET		8
+#define GEM_TX_SCR_BYPASS_SIZE			1
+#define GEM_RX_SYNC_RESET_OFFSET		2
+#define GEM_RX_SYNC_RESET_SIZE			1
+#define GEM_TX_EN_OFFSET			1
+#define GEM_TX_EN_SIZE				1
+#define GEM_SIGNAL_OK_OFFSET			0
+#define GEM_SIGNAL_OK_SIZE			1
+
+/* Bitfields in USX_STATUS. */
+#define GEM_USX_TX_FAULT_OFFSET			28
+#define GEM_USX_TX_FAULT_SIZE			1
+#define GEM_USX_RX_FAULT_OFFSET			27
+#define GEM_USX_RX_FAULT_SIZE			1
+#define GEM_USX_BLOCK_LOCK_OFFSET		0
+#define GEM_USX_BLOCK_LOCK_SIZE			1
+
 /* Bitfields in TISUBN */
 #define GEM_SUBNSINCR_OFFSET			0
 #define GEM_SUBNSINCRL_OFFSET			24
@@ -674,6 +715,7 @@
 #define MACB_CAPS_MACB_IS_GEM			BIT(31)
 #define MACB_CAPS_PCS				BIT(24)
 #define MACB_CAPS_MACB_IS_GEM_GXL		BIT(25)
+#define MACB_CAPS_HIGH_SPEED			BIT(26)
 
 #define MACB_GEM7010_IDNUM			0x009
 #define MACB_GEM7014_IDNU			0x107
@@ -753,6 +795,7 @@
 	})
 
 #define MACB_READ_NSR(bp)	macb_readl(bp, NSR)
+#define GEM_READ_USX_STATUS(bp)	gem_readl(bp, USX_STATUS)
 
 /* struct macb_dma_desc - Hardware DMA descriptor
  * @addr: DMA address of data buffer
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 792073d1b5c3..6551c03e7628 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -82,6 +82,20 @@ struct sifive_fu540_macb_mgmt {
 #define MACB_WOL_HAS_MAGIC_PACKET	(0x1 << 0)
 #define MACB_WOL_ENABLED		(0x1 << 1)
 
+enum {
+	HS_MAC_SPEED_100M,
+	HS_MAC_SPEED_1000M,
+	HS_MAC_SPEED_2500M,
+	HS_MAC_SPEED_5000M,
+	HS_MAC_SPEED_10000M,
+	HS_MAC_SPEED_25000M,
+};
+
+enum {
+	MACB_SERDES_RATE_5G,
+	MACB_SERDES_RATE_10G,
+};
+
 /* Graceful stop timeouts in us. We should allow up to
  * 1 frame time (10 Mbits/s, full-duplex, ignoring collisions)
  */
@@ -91,6 +105,8 @@ struct sifive_fu540_macb_mgmt {
 
 #define MACB_MDIO_TIMEOUT	1000000 /* in usecs */
 
+#define MACB_USX_BLOCK_LOCK_TIMEOUT	1000000 /* in usecs */
+
 /* DMA buffer descriptor might be different size
  * depends on hardware configuration:
  *
@@ -491,12 +507,32 @@ static void gem_phylink_validate(struct phylink_config *pl_config,
 		if (!macb_is_gem(bp))
 			goto empty_set;
 		break;
+	case PHY_INTERFACE_MODE_USXGMII:
+		if (!(bp->caps & MACB_CAPS_HIGH_SPEED &&
+		      bp->caps & MACB_CAPS_PCS))
+			goto empty_set;
+		break;
 	default:
 		break;
 	}
 
 	switch (state->interface) {
 	case PHY_INTERFACE_MODE_NA:
+	case PHY_INTERFACE_MODE_USXGMII:
+	case PHY_INTERFACE_MODE_10GKR:
+		if (bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE) {
+			phylink_set(mask, 10000baseCR_Full);
+			phylink_set(mask, 10000baseER_Full);
+			phylink_set(mask, 10000baseKR_Full);
+			phylink_set(mask, 10000baseLR_Full);
+			phylink_set(mask, 10000baseLRM_Full);
+			phylink_set(mask, 10000baseSR_Full);
+			phylink_set(mask, 10000baseT_Full);
+			phylink_set(mask, 5000baseT_Full);
+			phylink_set(mask, 2500baseX_Full);
+			phylink_set(mask, 1000baseX_Full);
+		}
+	/* fallthrough */
 	case PHY_INTERFACE_MODE_SGMII:
 	case PHY_INTERFACE_MODE_GMII:
 	case PHY_INTERFACE_MODE_RGMII:
@@ -532,6 +568,80 @@ static int gem_phylink_mac_link_state(struct phylink_config *pl_config,
 	return -EOPNOTSUPP;
 }
 
+static int macb_wait_for_usx_block_lock(struct macb *bp)
+{
+	u32 val;
+
+	return readx_poll_timeout(GEM_READ_USX_STATUS, bp, val,
+				  val & GEM_BIT(USX_BLOCK_LOCK),
+				  1, MACB_USX_BLOCK_LOCK_TIMEOUT);
+}
+
+static inline int gem_mac_usx_configure(struct macb *bp, int spd)
+{
+	u32 speed, config;
+
+	gem_writel(bp, NCFGR, GEM_BIT(PCSSEL) |
+		   (~GEM_BIT(SGMIIEN) & gem_readl(bp, NCFGR)));
+	gem_writel(bp, NCR, gem_readl(bp, NCR) |
+		   GEM_BIT(ENABLE_HS_MAC));
+	gem_writel(bp, NCFGR, gem_readl(bp, NCFGR) |
+		   MACB_BIT(FD));
+	config = gem_readl(bp, USX_CONTROL);
+	config = GEM_BFINS(SERDES_RATE, MACB_SERDES_RATE_10G, config);
+	config &= ~GEM_BIT(TX_SCR_BYPASS);
+	config &= ~GEM_BIT(RX_SCR_BYPASS);
+	gem_writel(bp, USX_CONTROL, config |
+		   GEM_BIT(TX_EN));
+	config = gem_readl(bp, USX_CONTROL);
+	gem_writel(bp, USX_CONTROL, config | GEM_BIT(SIGNAL_OK));
+	if (macb_wait_for_usx_block_lock(bp) < 0) {
+		netdev_warn(bp->dev, "USXGMII block lock failed");
+		return -ETIMEDOUT;
+	}
+
+	switch (spd) {
+	case SPEED_10000:
+		speed = HS_MAC_SPEED_10000M;
+		break;
+	case SPEED_5000:
+		speed = HS_MAC_SPEED_5000M;
+		break;
+	case SPEED_2500:
+		speed = HS_MAC_SPEED_2500M;
+		break;
+	case SPEED_1000:
+		speed = HS_MAC_SPEED_1000M;
+		break;
+	default:
+	case SPEED_100:
+		speed = HS_MAC_SPEED_100M;
+		break;
+	}
+
+	gem_writel(bp, HS_MAC_CONFIG, GEM_BFINS(HS_MAC_SPEED, speed,
+						gem_readl(bp, HS_MAC_CONFIG)));
+	gem_writel(bp, USX_CONTROL, GEM_BFINS(USX_CTRL_SPEED, speed,
+					      gem_readl(bp, USX_CONTROL)));
+	return 0;
+}
+
+static inline void gem_mac_configure(struct macb *bp, int speed)
+{
+	switch (speed) {
+	case SPEED_1000:
+		gem_writel(bp, NCFGR, GEM_BIT(GBE) |
+			   gem_readl(bp, NCFGR));
+		break;
+	case SPEED_100:
+		macb_writel(bp, NCFGR, MACB_BIT(SPD) |
+			    macb_readl(bp, NCFGR));
+		break;
+	default:
+		break;
+	}
+}
+
 static void gem_mac_config(struct phylink_config *pl_config, unsigned int mode,
 			   const struct phylink_link_state *state)
 {
@@ -574,18 +684,17 @@ static void gem_mac_config(struct phylink_config *pl_config, unsigned int mode,
 			reg &= ~GEM_BIT(GBE);
 		if (state->duplex)
 			reg |= MACB_BIT(FD);
+		macb_or_gem_writel(bp, NCFGR, reg);
 
-		switch (state->speed) {
-		case SPEED_1000:
-			reg |= GEM_BIT(GBE);
-			break;
-		case SPEED_100:
-			reg |= MACB_BIT(SPD);
-			break;
-		default:
-			break;
+		if (bp->phy_interface == PHY_INTERFACE_MODE_USXGMII) {
+			if (gem_mac_usx_configure(bp, state->speed) < 0) {
+				spin_unlock_irqrestore(&bp->lock, flags);
+				phylink_mac_change(bp->pl, false);
+				return;
+			}
+		} else {
+			gem_mac_configure(bp, state->speed);
 		}
-		macb_or_gem_writel(bp, NCFGR, reg);
 
 		bp->speed = state->speed;
 		bp->duplex = state->duplex;
@@ -3435,6 +3544,9 @@ static void macb_configure_caps(struct macb *bp,
 		default:
 			break;
 		}
+		dcfg = gem_readl(bp, DCFG12);
+		if (GEM_BFEXT(HIGH_SPEED, dcfg) == 1)
+			bp->caps |= MACB_CAPS_HIGH_SPEED;
 		dcfg = gem_readl(bp, DCFG2);
 		if ((dcfg & (GEM_BIT(RX_PKT_BUFF) | GEM_BIT(TX_PKT_BUFF))) == 0)
 			bp->caps |= MACB_CAPS_FIFO_MODE;
-- 
2.17.1


^ permalink raw reply related

* [PATCH v6 3/4] net: macb: add support for c45 PHY
From: Parshuram Thombare @ 2019-07-10 14:38 UTC (permalink / raw)
  To: andrew, nicolas.ferre, davem, f.fainelli
  Cc: linux, netdev, hkallweit1, linux-kernel, rafalc, piotrs, aniljoy,
	arthurm, stevenh, pthombar, mparab
In-Reply-To: <1562769391-31803-1-git-send-email-pthombar@cadence.com>

This patch modifies the MDIO read/write functions to support
communication with C45 PHYs.

Signed-off-by: Parshuram Thombare <pthombar@cadence.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
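Note on the regnum handling in the diff below: for a C45 access the MDIO core
packs a flag, the device address and the 16-bit register address into one
integer, and the driver splits it again with (regnum >> 16) & 0x1F and
regnum & 0xFFFF.  The sketch below shows that decoding in userspace.
SKETCH_MII_ADDR_C45 is a local stand-in that assumes the kernel's MII_ADDR_C45
flag is bit 30; struct c45_addr and the helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Local stand-in for the kernel's C45 flag in regnum (assumed to be
 * bit 30); redefined here so the sketch builds outside the kernel.
 */
#define SKETCH_MII_ADDR_C45	(UINT32_C(1) << 30)

/* Hypothetical container for the two parts of a C45 address. */
struct c45_addr {
	unsigned int devad;	/* 5-bit MMD/device address */
	unsigned int reg;	/* 16-bit register address */
};

/* Non-zero when regnum carries a C45 address rather than a C22 register. */
static int is_c45(uint32_t regnum)
{
	return (regnum & SKETCH_MII_ADDR_C45) != 0;
}

/* Split regnum the same way the read/write paths in the diff do. */
static struct c45_addr c45_decode(uint32_t regnum)
{
	struct c45_addr a = {
		.devad = (regnum >> 16) & 0x1F,
		.reg = regnum & 0xFFFF,
	};

	return a;
}
```

The two-step sequence in the diff (an address frame, a wait for idle, then the
read or write frame) follows from this split: the register address travels in
the data field of the first frame.
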
 drivers/net/ethernet/cadence/macb.h      | 15 ++++--
 drivers/net/ethernet/cadence/macb_main.c | 61 +++++++++++++++++++-----
 2 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 301fbcb0df4b..3ed5bffb735b 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -646,10 +646,17 @@
 #define GEM_CLK_DIV96				5
 
 /* Constants for MAN register */
-#define MACB_MAN_SOF				1
-#define MACB_MAN_WRITE				1
-#define MACB_MAN_READ				2
-#define MACB_MAN_CODE				2
+#define MACB_MAN_C22_SOF                        1
+#define MACB_MAN_C22_WRITE                      1
+#define MACB_MAN_C22_READ                       2
+#define MACB_MAN_C22_CODE                       2
+
+#define MACB_MAN_C45_SOF                        0
+#define MACB_MAN_C45_ADDR                       0
+#define MACB_MAN_C45_WRITE                      1
+#define MACB_MAN_C45_POST_READ_INCR             2
+#define MACB_MAN_C45_READ                       3
+#define MACB_MAN_C45_CODE                       2
 
 /* Capability mask bits */
 #define MACB_CAPS_ISR_CLEAR_ON_WRITE		BIT(0)
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 6485fcc0560b..792073d1b5c3 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -339,11 +339,30 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
 	if (status < 0)
 		goto mdio_read_exit;
 
-	macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_SOF)
-			      | MACB_BF(RW, MACB_MAN_READ)
-			      | MACB_BF(PHYA, mii_id)
-			      | MACB_BF(REGA, regnum)
-			      | MACB_BF(CODE, MACB_MAN_CODE)));
+	if (regnum & MII_ADDR_C45) {
+		macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_C45_SOF)
+			    | MACB_BF(RW, MACB_MAN_C45_ADDR)
+			    | MACB_BF(PHYA, mii_id)
+			    | MACB_BF(REGA, (regnum >> 16) & 0x1F)
+			    | MACB_BF(DATA, regnum & 0xFFFF)
+			    | MACB_BF(CODE, MACB_MAN_C45_CODE)));
+
+		status = macb_mdio_wait_for_idle(bp);
+		if (status < 0)
+			goto mdio_read_exit;
+
+		macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_C45_SOF)
+			    | MACB_BF(RW, MACB_MAN_C45_READ)
+			    | MACB_BF(PHYA, mii_id)
+			    | MACB_BF(REGA, (regnum >> 16) & 0x1F)
+			    | MACB_BF(CODE, MACB_MAN_C45_CODE)));
+	} else {
+		macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_C22_SOF)
+				| MACB_BF(RW, MACB_MAN_C22_READ)
+				| MACB_BF(PHYA, mii_id)
+				| MACB_BF(REGA, regnum)
+				| MACB_BF(CODE, MACB_MAN_C22_CODE)));
+	}
 
 	status = macb_mdio_wait_for_idle(bp);
 	if (status < 0)
@@ -372,12 +391,32 @@ static int macb_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
 	if (status < 0)
 		goto mdio_write_exit;
 
-	macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_SOF)
-			      | MACB_BF(RW, MACB_MAN_WRITE)
-			      | MACB_BF(PHYA, mii_id)
-			      | MACB_BF(REGA, regnum)
-			      | MACB_BF(CODE, MACB_MAN_CODE)
-			      | MACB_BF(DATA, value)));
+	if (regnum & MII_ADDR_C45) {
+		macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_C45_SOF)
+			    | MACB_BF(RW, MACB_MAN_C45_ADDR)
+			    | MACB_BF(PHYA, mii_id)
+			    | MACB_BF(REGA, (regnum >> 16) & 0x1F)
+			    | MACB_BF(DATA, regnum & 0xFFFF)
+			    | MACB_BF(CODE, MACB_MAN_C45_CODE)));
+
+		status = macb_mdio_wait_for_idle(bp);
+		if (status < 0)
+			goto mdio_write_exit;
+
+		macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_C45_SOF)
+			    | MACB_BF(RW, MACB_MAN_C45_WRITE)
+			    | MACB_BF(PHYA, mii_id)
+			    | MACB_BF(REGA, (regnum >> 16) & 0x1F)
+			    | MACB_BF(CODE, MACB_MAN_C45_CODE)
+			    | MACB_BF(DATA, value)));
+	} else {
+		macb_writel(bp, MAN, (MACB_BF(SOF, MACB_MAN_C22_SOF)
+				| MACB_BF(RW, MACB_MAN_C22_WRITE)
+				| MACB_BF(PHYA, mii_id)
+				| MACB_BF(REGA, regnum)
+				| MACB_BF(CODE, MACB_MAN_C22_CODE)
+				| MACB_BF(DATA, value)));
+	}
 
 	status = macb_mdio_wait_for_idle(bp);
 	if (status < 0)
-- 
2.17.1
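
For readers following the C45 path above: the driver relies on the Clause 45
address being folded into regnum, with a flag bit marking the access as C45,
the MMD (device) address in bits 16..20 and the register address in the low
16 bits. The sketch below is a userspace illustration only. The SKETCH_ and
c45_ names are hypothetical, and the flag position (bit 30) is assumed to
match the kernel's MII_ADDR_C45 at the time of this series.

```c
#include <stdint.h>

/* Assumed to mirror MII_ADDR_C45 (bit 30); redefined so this builds in userspace. */
#define SKETCH_MII_ADDR_C45	(1u << 30)

/* Hypothetical helper type, not part of the driver. */
struct c45_addr {
	int is_c45;	/* non-zero if regnum carries a Clause 45 address */
	uint8_t devad;	/* MMD (device) address, 5 bits */
	uint16_t reg;	/* register address inside the MMD, 16 bits */
};

/* Fold a device address and register address into one regnum value. */
static inline uint32_t c45_pack(uint8_t devad, uint16_t reg)
{
	return SKETCH_MII_ADDR_C45 | ((uint32_t)(devad & 0x1F) << 16) | reg;
}

/* Split regnum using the same shifts and masks as the patch. */
static inline struct c45_addr c45_unpack(uint32_t regnum)
{
	struct c45_addr a;

	a.is_c45 = (regnum & SKETCH_MII_ADDR_C45) != 0;
	a.devad = (regnum >> 16) & 0x1F;
	a.reg = regnum & 0xFFFF;
	return a;
}
```

In the patch, the first MAN write (the C45 address frame) carries the register
address in the DATA field and the device address in REGA; the second write
then performs the actual read or write against that latched address.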


^ permalink raw reply related

* [PATCH v6 2/4] net: macb: add support for sgmii MAC-PHY interface
From: Parshuram Thombare @ 2019-07-10 14:38 UTC (permalink / raw)
  To: andrew, nicolas.ferre, davem, f.fainelli
  Cc: linux, netdev, hkallweit1, linux-kernel, rafalc, piotrs, aniljoy,
	arthurm, stevenh, pthombar, mparab
In-Reply-To: <1562769391-31803-1-git-send-email-pthombar@cadence.com>

This patch adds support for the SGMII interface and
2.5Gbps MAC in the Cadence ethernet controller driver.

Signed-off-by: Parshuram Thombare <pthombar@cadence.com>
---
 drivers/net/ethernet/cadence/macb.h      | 54 ++++++++++++++++++------
 drivers/net/ethernet/cadence/macb_main.c | 42 +++++++++++++++++-
 2 files changed, 82 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index a4007057b35e..301fbcb0df4b 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -77,6 +77,7 @@
 #define MACB_RBQPH		0x04D4
 
 /* GEM register offsets. */
+#define GEM_NCR			0x0000 /* Network Control */
 #define GEM_NCFGR		0x0004 /* Network Config */
 #define GEM_USRIO		0x000c /* User IO */
 #define GEM_DMACFG		0x0010 /* DMA Configuration */
@@ -156,6 +157,7 @@
 #define GEM_PEFTN		0x01f4 /* PTP Peer Event Frame Tx Ns */
 #define GEM_PEFRSL		0x01f8 /* PTP Peer Event Frame Rx Sec Low */
 #define GEM_PEFRN		0x01fc /* PTP Peer Event Frame Rx Ns */
+#define GEM_PCS_CTRL		0x0200 /* PCS Control */
 #define GEM_DCFG1		0x0280 /* Design Config 1 */
 #define GEM_DCFG2		0x0284 /* Design Config 2 */
 #define GEM_DCFG3		0x0288 /* Design Config 3 */
@@ -271,6 +273,10 @@
 #define MACB_IRXFCS_OFFSET	19
 #define MACB_IRXFCS_SIZE	1
 
+/* GEM specific NCR bitfields. */
+#define GEM_TWO_PT_FIVE_GIG_OFFSET	29
+#define GEM_TWO_PT_FIVE_GIG_SIZE	1
+
 /* GEM specific NCFGR bitfields. */
 #define GEM_GBE_OFFSET		10 /* Gigabit mode enable */
 #define GEM_GBE_SIZE		1
@@ -323,6 +329,9 @@
 #define MACB_MDIO_SIZE		1
 #define MACB_IDLE_OFFSET	2 /* The PHY management logic is idle */
 #define MACB_IDLE_SIZE		1
+#define MACB_DUPLEX_OFFSET	3
+#define MACB_DUPLEX_SIZE	1
+
 
 /* Bitfields in TSR */
 #define MACB_UBR_OFFSET		0 /* Used bit read */
@@ -456,11 +465,17 @@
 #define MACB_REV_OFFSET				0
 #define MACB_REV_SIZE				16
 
+/* Bitfields in PCS_CONTROL. */
+#define GEM_PCS_CTRL_RST_OFFSET			15
+#define GEM_PCS_CTRL_RST_SIZE			1
+
 /* Bitfields in DCFG1. */
 #define GEM_IRQCOR_OFFSET			23
 #define GEM_IRQCOR_SIZE				1
 #define GEM_DBWDEF_OFFSET			25
 #define GEM_DBWDEF_SIZE				3
+#define GEM_NO_PCS_OFFSET			0
+#define GEM_NO_PCS_SIZE				1
 
 /* Bitfields in DCFG2. */
 #define GEM_RX_PKT_BUFF_OFFSET			20
@@ -637,19 +652,32 @@
 #define MACB_MAN_CODE				2
 
 /* Capability mask bits */
-#define MACB_CAPS_ISR_CLEAR_ON_WRITE		0x00000001
-#define MACB_CAPS_USRIO_HAS_CLKEN		0x00000002
-#define MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII	0x00000004
-#define MACB_CAPS_NO_GIGABIT_HALF		0x00000008
-#define MACB_CAPS_USRIO_DISABLED		0x00000010
-#define MACB_CAPS_JUMBO				0x00000020
-#define MACB_CAPS_GEM_HAS_PTP			0x00000040
-#define MACB_CAPS_BD_RD_PREFETCH		0x00000080
-#define MACB_CAPS_NEEDS_RSTONUBR		0x00000100
-#define MACB_CAPS_FIFO_MODE			0x10000000
-#define MACB_CAPS_GIGABIT_MODE_AVAILABLE	0x20000000
-#define MACB_CAPS_SG_DISABLED			0x40000000
-#define MACB_CAPS_MACB_IS_GEM			0x80000000
+#define MACB_CAPS_ISR_CLEAR_ON_WRITE		BIT(0)
+#define MACB_CAPS_USRIO_HAS_CLKEN		BIT(1)
+#define MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII	BIT(2)
+#define MACB_CAPS_NO_GIGABIT_HALF		BIT(3)
+#define MACB_CAPS_USRIO_DISABLED		BIT(4)
+#define MACB_CAPS_JUMBO				BIT(5)
+#define MACB_CAPS_GEM_HAS_PTP			BIT(6)
+#define MACB_CAPS_BD_RD_PREFETCH		BIT(7)
+#define MACB_CAPS_NEEDS_RSTONUBR		BIT(8)
+#define MACB_CAPS_FIFO_MODE			BIT(28)
+#define MACB_CAPS_GIGABIT_MODE_AVAILABLE	BIT(29)
+#define MACB_CAPS_SG_DISABLED			BIT(30)
+#define MACB_CAPS_MACB_IS_GEM			BIT(31)
+#define MACB_CAPS_PCS				BIT(24)
+#define MACB_CAPS_MACB_IS_GEM_GXL		BIT(25)
+
+#define MACB_GEM7010_IDNUM			0x009
+#define MACB_GEM7014_IDNUM			0x107
+#define MACB_GEM7014A_IDNUM			0x207
+#define MACB_GEM7016_IDNUM			0x10a
+#define MACB_GEM7017_IDNUM			0x00a
+#define MACB_GEM7017A_IDNUM			0x20a
+#define MACB_GEM7020_IDNUM			0x003
+#define MACB_GEM7021_IDNUM			0x00c
+#define MACB_GEM7021A_IDNUM			0x20c
+#define MACB_GEM7022_IDNUM			0x00b
 
 /* LSO settings */
 #define MACB_LSO_UFO_ENABLE			0x01
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index ce064eb9252a..6485fcc0560b 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -443,6 +443,10 @@ static void gem_phylink_validate(struct phylink_config *pl_config,
 	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
 
 	switch (state->interface) {
+	case PHY_INTERFACE_MODE_SGMII:
+		if (!(bp->caps & MACB_CAPS_PCS))
+			goto empty_set;
+		break;
 	case PHY_INTERFACE_MODE_GMII:
 	case PHY_INTERFACE_MODE_RGMII:
 		if (!macb_is_gem(bp))
@@ -453,6 +457,8 @@ static void gem_phylink_validate(struct phylink_config *pl_config,
 	}
 
 	switch (state->interface) {
+	case PHY_INTERFACE_MODE_NA:
+	case PHY_INTERFACE_MODE_SGMII:
 	case PHY_INTERFACE_MODE_GMII:
 	case PHY_INTERFACE_MODE_RGMII:
 		if (bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE) {
@@ -497,8 +503,26 @@ static void gem_mac_config(struct phylink_config *pl_config, unsigned int mode,
 
 	spin_lock_irqsave(&bp->lock, flags);
 
-	if (change_interface)
+	if (change_interface) {
 		bp->phy_interface = state->interface;
+		/* 2.5G mode not supported */
+		gem_writel(bp, NCR, ~GEM_BIT(TWO_PT_FIVE_GIG) &
+			   gem_readl(bp, NCR));
+
+		if (state->interface == PHY_INTERFACE_MODE_SGMII) {
+			gem_writel(bp, NCFGR, GEM_BIT(SGMIIEN) |
+				   GEM_BIT(PCSSEL) |
+				   gem_readl(bp, NCFGR));
+		} else {
+			/* Disable SGMII mode and PCS */
+			gem_writel(bp, NCFGR, ~(GEM_BIT(SGMIIEN) |
+				   GEM_BIT(PCSSEL)) &
+				   gem_readl(bp, NCFGR));
+			/* Reset PCS */
+			gem_writel(bp, PCS_CTRL, gem_readl(bp, PCS_CTRL) |
+				   GEM_BIT(PCS_CTRL_RST));
+		}
+	}
 
 	if (!phylink_autoneg_inband(mode) &&
 	    (bp->speed != state->speed ||
@@ -3356,6 +3380,22 @@ static void macb_configure_caps(struct macb *bp,
 		dcfg = gem_readl(bp, DCFG1);
 		if (GEM_BFEXT(IRQCOR, dcfg) == 0)
 			bp->caps |= MACB_CAPS_ISR_CLEAR_ON_WRITE;
+		if (GEM_BFEXT(NO_PCS, dcfg) == 0)
+			bp->caps |= MACB_CAPS_PCS;
+		switch (MACB_BFEXT(IDNUM, macb_readl(bp, MID))) {
+		case MACB_GEM7016_IDNUM:
+		case MACB_GEM7017_IDNUM:
+		case MACB_GEM7017A_IDNUM:
+		case MACB_GEM7020_IDNUM:
+		case MACB_GEM7021_IDNUM:
+		case MACB_GEM7021A_IDNUM:
+		case MACB_GEM7022_IDNUM:
+			bp->caps |= MACB_CAPS_USRIO_DISABLED;
+			bp->caps |= MACB_CAPS_MACB_IS_GEM_GXL;
+			break;
+		default:
+			break;
+		}
 		dcfg = gem_readl(bp, DCFG2);
 		if ((dcfg & (GEM_BIT(RX_PKT_BUFF) | GEM_BIT(TX_PKT_BUFF))) == 0)
 			bp->caps |= MACB_CAPS_FIFO_MODE;
-- 
2.17.1
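
The interface switch in gem_mac_config() above is a plain read-modify-write
on NCFGR: both SGMIIEN and PCSSEL are set for SGMII and cleared otherwise.
A minimal userspace sketch of that pattern follows. The SK_ names are
hypothetical and the bit offsets are placeholders for illustration; the
authoritative values are the GEM_*_OFFSET definitions in macb.h.

```c
#include <stdint.h>

/* Placeholder offsets for illustration only; see macb.h for the real ones. */
#define SK_SGMIIEN_OFFSET	27
#define SK_PCSSEL_OFFSET	11
#define SK_BIT(name)		(1u << SK_##name##_OFFSET)

/* Compute the NCFGR value to write back for the requested interface.
 * All bits other than SGMIIEN and PCSSEL are preserved.
 */
static inline uint32_t sk_ncfgr_for_sgmii(uint32_t ncfgr, int want_sgmii)
{
	uint32_t mask = SK_BIT(SGMIIEN) | SK_BIT(PCSSEL);

	if (want_sgmii)
		return ncfgr | mask;	/* enable SGMII and select the PCS */
	return ncfgr & ~mask;		/* disable SGMII mode and PCS */
}
```

Keeping the mask in one place makes it harder for the set and clear paths to
drift apart if another bit is added later.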


^ permalink raw reply related

* [PATCH v6 1/4] net: macb: add phylink support
From: Parshuram Thombare @ 2019-07-10 14:37 UTC (permalink / raw)
  To: andrew, nicolas.ferre, davem, f.fainelli
  Cc: linux, netdev, hkallweit1, linux-kernel, rafalc, piotrs, aniljoy,
	arthurm, stevenh, pthombar, mparab
In-Reply-To: <1562769391-31803-1-git-send-email-pthombar@cadence.com>

This patch replaces phylib APIs with phylink APIs.

Signed-off-by: Parshuram Thombare <pthombar@cadence.com>
---
 drivers/net/ethernet/cadence/Kconfig     |   2 +-
 drivers/net/ethernet/cadence/macb.h      |   3 +
 drivers/net/ethernet/cadence/macb_main.c | 332 +++++++++++++----------
 3 files changed, 187 insertions(+), 150 deletions(-)

diff --git a/drivers/net/ethernet/cadence/Kconfig b/drivers/net/ethernet/cadence/Kconfig
index f4b3bd85dfe3..53b50c24d9c9 100644
--- a/drivers/net/ethernet/cadence/Kconfig
+++ b/drivers/net/ethernet/cadence/Kconfig
@@ -22,7 +22,7 @@ if NET_VENDOR_CADENCE
 config MACB
 	tristate "Cadence MACB/GEM support"
 	depends on HAS_DMA && COMMON_CLK
-	select PHYLIB
+	select PHYLINK
 	---help---
 	  The Cadence MACB ethernet interface is found on many Atmel AT32 and
 	  AT91 parts.  This driver also supports the Cadence GEM (Gigabit
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 03983bd46eef..a4007057b35e 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -11,6 +11,7 @@
 #include <linux/ptp_clock_kernel.h>
 #include <linux/net_tstamp.h>
 #include <linux/interrupt.h>
+#include <linux/phylink.h>
 
 #if defined(CONFIG_ARCH_DMA_ADDR_T_64BIT) || defined(CONFIG_MACB_USE_HWSTAMP)
 #define MACB_EXT_DESC
@@ -1232,6 +1233,8 @@ struct macb {
 	u32	rx_intr_mask;
 
 	struct macb_pm_data pm_data;
+	struct phylink *pl;
+	struct phylink_config pl_config;
 };
 
 #ifdef CONFIG_MACB_USE_HWSTAMP
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 5ca17e62dc3e..ce064eb9252a 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -36,6 +36,7 @@
 #include <linux/tcp.h>
 #include <linux/iopoll.h>
 #include <linux/pm_runtime.h>
+#include <linux/phylink.h>
 #include "macb.h"
 
 /* This structure is only used for MACB on SiFive FU540 devices */
@@ -433,115 +434,160 @@ static void macb_set_tx_clk(struct clk *clk, int speed, struct net_device *dev)
 		netdev_err(dev, "adjusting tx_clk failed.\n");
 }
 
-static void macb_handle_link_change(struct net_device *dev)
+static void gem_phylink_validate(struct phylink_config *pl_config,
+				 unsigned long *supported,
+				 struct phylink_link_state *state)
 {
-	struct macb *bp = netdev_priv(dev);
-	struct phy_device *phydev = dev->phydev;
+	struct net_device *netdev = to_net_dev(pl_config->dev);
+	struct macb *bp = netdev_priv(netdev);
+	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
+
+	switch (state->interface) {
+	case PHY_INTERFACE_MODE_GMII:
+	case PHY_INTERFACE_MODE_RGMII:
+		if (!macb_is_gem(bp))
+			goto empty_set;
+		break;
+	default:
+		break;
+	}
+
+	switch (state->interface) {
+	case PHY_INTERFACE_MODE_GMII:
+	case PHY_INTERFACE_MODE_RGMII:
+		if (bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE) {
+			phylink_set(mask, 1000baseT_Full);
+			phylink_set(mask, 1000baseX_Full);
+			if (!(bp->caps & MACB_CAPS_NO_GIGABIT_HALF))
+				phylink_set(mask, 1000baseT_Half);
+		}
+	/* fallthrough */
+	case PHY_INTERFACE_MODE_MII:
+	case PHY_INTERFACE_MODE_RMII:
+		phylink_set(mask, 10baseT_Half);
+		phylink_set(mask, 10baseT_Full);
+		phylink_set(mask, 100baseT_Half);
+		phylink_set(mask, 100baseT_Full);
+		break;
+	default:
+		goto empty_set;
+	}
+
+	linkmode_and(supported, supported, mask);
+	linkmode_and(state->advertising, state->advertising, mask);
+	return;
+
+empty_set:
+	linkmode_zero(supported);
+}
+
+static int gem_phylink_mac_link_state(struct phylink_config *pl_config,
+				      struct phylink_link_state *state)
+{
+	return -EOPNOTSUPP;
+}
+
+static void gem_mac_config(struct phylink_config *pl_config, unsigned int mode,
+			   const struct phylink_link_state *state)
+{
+	struct net_device *netdev = to_net_dev(pl_config->dev);
+	struct macb *bp = netdev_priv(netdev);
+	bool change_interface = bp->phy_interface != state->interface;
 	unsigned long flags;
-	int status_change = 0;
 
 	spin_lock_irqsave(&bp->lock, flags);
 
-	if (phydev->link) {
-		if ((bp->speed != phydev->speed) ||
-		    (bp->duplex != phydev->duplex)) {
-			u32 reg;
+	if (change_interface)
+		bp->phy_interface = state->interface;
 
-			reg = macb_readl(bp, NCFGR);
-			reg &= ~(MACB_BIT(SPD) | MACB_BIT(FD));
-			if (macb_is_gem(bp))
-				reg &= ~GEM_BIT(GBE);
+	if (!phylink_autoneg_inband(mode) &&
+	    (bp->speed != state->speed ||
+	     bp->duplex != state->duplex)) {
+		u32 reg;
 
-			if (phydev->duplex)
-				reg |= MACB_BIT(FD);
-			if (phydev->speed == SPEED_100)
-				reg |= MACB_BIT(SPD);
-			if (phydev->speed == SPEED_1000 &&
-			    bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE)
-				reg |= GEM_BIT(GBE);
-
-			macb_or_gem_writel(bp, NCFGR, reg);
+		reg = macb_readl(bp, NCFGR);
+		reg &= ~(MACB_BIT(SPD) | MACB_BIT(FD));
+		if (macb_is_gem(bp))
+			reg &= ~GEM_BIT(GBE);
+		if (state->duplex)
+			reg |= MACB_BIT(FD);
 
-			bp->speed = phydev->speed;
-			bp->duplex = phydev->duplex;
-			status_change = 1;
+		switch (state->speed) {
+		case SPEED_1000:
+			reg |= GEM_BIT(GBE);
+			break;
+		case SPEED_100:
+			reg |= MACB_BIT(SPD);
+			break;
+		default:
+			break;
 		}
-	}
+		macb_or_gem_writel(bp, NCFGR, reg);
 
-	if (phydev->link != bp->link) {
-		if (!phydev->link) {
-			bp->speed = 0;
-			bp->duplex = -1;
-		}
-		bp->link = phydev->link;
+		bp->speed = state->speed;
+		bp->duplex = state->duplex;
 
-		status_change = 1;
+		if (state->link)
+			macb_set_tx_clk(bp->tx_clk, state->speed, netdev);
 	}
 
 	spin_unlock_irqrestore(&bp->lock, flags);
+}
 
-	if (status_change) {
-		if (phydev->link) {
-			/* Update the TX clock rate if and only if the link is
-			 * up and there has been a link change.
-			 */
-			macb_set_tx_clk(bp->tx_clk, phydev->speed, dev);
+static void gem_mac_link_up(struct phylink_config *pl_config, unsigned int mode,
+			    phy_interface_t interface, struct phy_device *phy)
+{
+	struct net_device *netdev = to_net_dev(pl_config->dev);
+	struct macb *bp = netdev_priv(netdev);
 
-			netif_carrier_on(dev);
-			netdev_info(dev, "link up (%d/%s)\n",
-				    phydev->speed,
-				    phydev->duplex == DUPLEX_FULL ?
-				    "Full" : "Half");
-		} else {
-			netif_carrier_off(dev);
-			netdev_info(dev, "link down\n");
-		}
-	}
+	bp->link = 1;
+	/* Enable TX and RX */
+	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(RE) | MACB_BIT(TE));
+}
+
+static void gem_mac_link_down(struct phylink_config *pl_config,
+			      unsigned int mode, phy_interface_t interface)
+{
+	struct net_device *netdev = to_net_dev(pl_config->dev);
+	struct macb *bp = netdev_priv(netdev);
+
+	bp->link = 0;
+	/* Disable TX and RX */
+	macb_writel(bp, NCR,
+		    macb_readl(bp, NCR) & ~(MACB_BIT(RE) | MACB_BIT(TE)));
 }
 
+static const struct phylink_mac_ops gem_phylink_ops = {
+	.validate = gem_phylink_validate,
+	.mac_link_state = gem_phylink_mac_link_state,
+	.mac_config = gem_mac_config,
+	.mac_link_up = gem_mac_link_up,
+	.mac_link_down = gem_mac_link_down,
+};
+
 /* based on au1000_eth. c*/
-static int macb_mii_probe(struct net_device *dev)
+static int macb_mii_probe(struct net_device *dev, phy_interface_t phy_mode)
 {
 	struct macb *bp = netdev_priv(dev);
 	struct phy_device *phydev;
 	struct device_node *np;
-	int ret, i;
+	int ret;
 
 	np = bp->pdev->dev.of_node;
 	ret = 0;
 
-	if (np) {
-		if (of_phy_is_fixed_link(np)) {
-			bp->phy_node = of_node_get(np);
-		} else {
-			bp->phy_node = of_parse_phandle(np, "phy-handle", 0);
-			/* fallback to standard phy registration if no
-			 * phy-handle was found nor any phy found during
-			 * dt phy registration
-			 */
-			if (!bp->phy_node && !phy_find_first(bp->mii_bus)) {
-				for (i = 0; i < PHY_MAX_ADDR; i++) {
-					phydev = mdiobus_scan(bp->mii_bus, i);
-					if (IS_ERR(phydev) &&
-					    PTR_ERR(phydev) != -ENODEV) {
-						ret = PTR_ERR(phydev);
-						break;
-					}
-				}
-
-				if (ret)
-					return -ENODEV;
-			}
-		}
+	bp->pl_config.dev = &dev->dev;
+	bp->pl_config.type = PHYLINK_NETDEV;
+	bp->pl = phylink_create(&bp->pl_config, of_fwnode_handle(np),
+				phy_mode, &gem_phylink_ops);
+	if (IS_ERR(bp->pl)) {
+		netdev_err(dev,
+			   "error creating PHYLINK: %ld\n", PTR_ERR(bp->pl));
+		return PTR_ERR(bp->pl);
 	}
 
-	if (bp->phy_node) {
-		phydev = of_phy_connect(dev, bp->phy_node,
-					&macb_handle_link_change, 0,
-					bp->phy_interface);
-		if (!phydev)
-			return -ENODEV;
-	} else {
+	ret = phylink_of_phy_connect(bp->pl, np, 0);
+	if (ret == -ENODEV && bp->mii_bus) {
 		phydev = phy_find_first(bp->mii_bus);
 		if (!phydev) {
 			netdev_err(dev, "no PHY found\n");
@@ -549,32 +595,22 @@ static int macb_mii_probe(struct net_device *dev)
 		}
 
 		/* attach the mac to the phy */
-		ret = phy_connect_direct(dev, phydev, &macb_handle_link_change,
-					 bp->phy_interface);
+		ret = phylink_connect_phy(bp->pl, phydev);
 		if (ret) {
 			netdev_err(dev, "Could not attach to PHY\n");
 			return ret;
 		}
 	}
 
-	/* mask with MAC supported features */
-	if (macb_is_gem(bp) && bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE)
-		phy_set_max_speed(phydev, SPEED_1000);
-	else
-		phy_set_max_speed(phydev, SPEED_100);
-
-	if (bp->caps & MACB_CAPS_NO_GIGABIT_HALF)
-		phy_remove_link_mode(phydev,
-				     ETHTOOL_LINK_MODE_1000baseT_Half_BIT);
-
 	bp->link = 0;
-	bp->speed = 0;
-	bp->duplex = -1;
+	bp->speed = SPEED_UNKNOWN;
+	bp->duplex = DUPLEX_UNKNOWN;
+	bp->phy_interface = PHY_INTERFACE_MODE_MAX;
 
-	return 0;
+	return ret;
 }
 
-static int macb_mii_init(struct macb *bp)
+static int macb_mii_init(struct macb *bp, phy_interface_t phy_mode)
 {
 	struct device_node *np;
 	int err = -ENXIO;
@@ -599,22 +635,12 @@ static int macb_mii_init(struct macb *bp)
 	dev_set_drvdata(&bp->dev->dev, bp->mii_bus);
 
 	np = bp->pdev->dev.of_node;
-	if (np && of_phy_is_fixed_link(np)) {
-		if (of_phy_register_fixed_link(np) < 0) {
-			dev_err(&bp->pdev->dev,
-				"broken fixed-link specification %pOF\n", np);
-			goto err_out_free_mdiobus;
-		}
-
-		err = mdiobus_register(bp->mii_bus);
-	} else {
-		err = of_mdiobus_register(bp->mii_bus, np);
-	}
+	err = of_mdiobus_register(bp->mii_bus, np);
 
 	if (err)
 		goto err_out_free_fixed_link;
 
-	err = macb_mii_probe(bp->dev);
+	err = macb_mii_probe(bp->dev, phy_mode);
 	if (err)
 		goto err_out_unregister_bus;
 
@@ -625,7 +651,6 @@ static int macb_mii_init(struct macb *bp)
 err_out_free_fixed_link:
 	if (np && of_phy_is_fixed_link(np))
 		of_phy_deregister_fixed_link(np);
-err_out_free_mdiobus:
 	of_node_put(bp->phy_node);
 	mdiobus_free(bp->mii_bus);
 err_out:
@@ -2418,12 +2443,6 @@ static int macb_open(struct net_device *dev)
 	/* carrier starts down */
 	netif_carrier_off(dev);
 
-	/* if the phy is not yet register, retry later*/
-	if (!dev->phydev) {
-		err = -EAGAIN;
-		goto pm_exit;
-	}
-
 	/* RX buffers initialization */
 	macb_init_rx_buffer_size(bp, bufsz);
 
@@ -2441,7 +2460,7 @@ static int macb_open(struct net_device *dev)
 	macb_init_hw(bp);
 
 	/* schedule a link state check */
-	phy_start(dev->phydev);
+	phylink_start(bp->pl);
 
 	netif_tx_start_all_queues(dev);
 
@@ -2468,8 +2487,7 @@ static int macb_close(struct net_device *dev)
 	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue)
 		napi_disable(&queue->napi);
 
-	if (dev->phydev)
-		phy_stop(dev->phydev);
+	phylink_stop(bp->pl);
 
 	spin_lock_irqsave(&bp->lock, flags);
 	macb_reset_hw(bp);
@@ -3158,6 +3176,23 @@ static int gem_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd)
 	return ret;
 }
 
+static int gem_ethtool_get_link_ksettings(struct net_device *netdev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct macb *bp = netdev_priv(netdev);
+
+	return phylink_ethtool_ksettings_get(bp->pl, cmd);
+}
+
+static int
+gem_ethtool_set_link_ksettings(struct net_device *netdev,
+			       const struct ethtool_link_ksettings *cmd)
+{
+	struct macb *bp = netdev_priv(netdev);
+
+	return phylink_ethtool_ksettings_set(bp->pl, cmd);
+}
+
 static const struct ethtool_ops macb_ethtool_ops = {
 	.get_regs_len		= macb_get_regs_len,
 	.get_regs		= macb_get_regs,
@@ -3165,8 +3200,8 @@ static const struct ethtool_ops macb_ethtool_ops = {
 	.get_ts_info		= ethtool_op_get_ts_info,
 	.get_wol		= macb_get_wol,
 	.set_wol		= macb_set_wol,
-	.get_link_ksettings     = phy_ethtool_get_link_ksettings,
-	.set_link_ksettings     = phy_ethtool_set_link_ksettings,
+	.get_link_ksettings     = gem_ethtool_get_link_ksettings,
+	.set_link_ksettings     = gem_ethtool_set_link_ksettings,
 	.get_ringparam		= macb_get_ringparam,
 	.set_ringparam		= macb_set_ringparam,
 };
@@ -3179,8 +3214,8 @@ static const struct ethtool_ops gem_ethtool_ops = {
 	.get_ethtool_stats	= gem_get_ethtool_stats,
 	.get_strings		= gem_get_ethtool_strings,
 	.get_sset_count		= gem_get_sset_count,
-	.get_link_ksettings     = phy_ethtool_get_link_ksettings,
-	.set_link_ksettings     = phy_ethtool_set_link_ksettings,
+	.get_link_ksettings     = gem_ethtool_get_link_ksettings,
+	.set_link_ksettings     = gem_ethtool_set_link_ksettings,
 	.get_ringparam		= macb_get_ringparam,
 	.set_ringparam		= macb_set_ringparam,
 	.get_rxnfc			= gem_get_rxnfc,
@@ -3189,17 +3224,13 @@ static const struct ethtool_ops gem_ethtool_ops = {
 
 static int macb_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
-	struct phy_device *phydev = dev->phydev;
 	struct macb *bp = netdev_priv(dev);
 
 	if (!netif_running(dev))
 		return -EINVAL;
 
-	if (!phydev)
-		return -ENODEV;
-
 	if (!bp->ptp_info)
-		return phy_mii_ioctl(phydev, rq, cmd);
+		return phylink_mii_ioctl(bp->pl, rq, cmd);
 
 	switch (cmd) {
 	case SIOCSHWTSTAMP:
@@ -3207,7 +3238,7 @@ static int macb_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	case SIOCGHWTSTAMP:
 		return bp->ptp_info->get_hwtst(dev, rq);
 	default:
-		return phy_mii_ioctl(phydev, rq, cmd);
+		return phylink_mii_ioctl(bp->pl, rq, cmd);
 	}
 }
 
@@ -3709,7 +3740,7 @@ static int at91ether_open(struct net_device *dev)
 			     MACB_BIT(HRESP));
 
 	/* schedule a link state check */
-	phy_start(dev->phydev);
+	phylink_start(lp->pl);
 
 	netif_start_queue(dev);
 
@@ -4182,13 +4213,12 @@ static int macb_probe(struct platform_device *pdev)
 	struct clk *tsu_clk = NULL;
 	unsigned int queue_mask, num_queues;
 	bool native_io;
-	struct phy_device *phydev;
 	struct net_device *dev;
 	struct resource *regs;
 	void __iomem *mem;
 	const char *mac;
 	struct macb *bp;
-	int err, val;
+	int err, val, phy_mode;
 
 	regs = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	mem = devm_ioremap_resource(&pdev->dev, regs);
@@ -4309,24 +4339,20 @@ static int macb_probe(struct platform_device *pdev)
 		macb_get_hwaddr(bp);
 	}
 
-	err = of_get_phy_mode(np);
-	if (err < 0)
+	phy_mode = of_get_phy_mode(np);
+	if (phy_mode < 0)
 		/* not found in DT, MII by default */
-		bp->phy_interface = PHY_INTERFACE_MODE_MII;
-	else
-		bp->phy_interface = err;
+		phy_mode = PHY_INTERFACE_MODE_MII;
 
 	/* IP specific init */
 	err = init(pdev);
 	if (err)
 		goto err_out_free_netdev;
 
-	err = macb_mii_init(bp);
+	err = macb_mii_init(bp, phy_mode);
 	if (err)
 		goto err_out_free_netdev;
 
-	phydev = dev->phydev;
-
 	netif_carrier_off(dev);
 
 	err = register_netdev(dev);
@@ -4338,8 +4364,6 @@ static int macb_probe(struct platform_device *pdev)
 	tasklet_init(&bp->hresp_err_tasklet, macb_hresp_error_task,
 		     (unsigned long)bp);
 
-	phy_attached_info(phydev);
-
 	netdev_info(dev, "Cadence %s rev 0x%08x at 0x%08lx irq %d (%pM)\n",
 		    macb_is_gem(bp) ? "GEM" : "MACB", macb_readl(bp, MID),
 		    dev->base_addr, dev->irq, dev->dev_addr);
@@ -4350,7 +4374,9 @@ static int macb_probe(struct platform_device *pdev)
 	return 0;
 
 err_out_unregister_mdio:
-	phy_disconnect(dev->phydev);
+	rtnl_lock();
+	phylink_disconnect_phy(bp->pl);
+	rtnl_unlock();
 	mdiobus_unregister(bp->mii_bus);
 	of_node_put(bp->phy_node);
 	if (np && of_phy_is_fixed_link(np))
@@ -4384,13 +4410,18 @@ static int macb_remove(struct platform_device *pdev)
 
 	if (dev) {
 		bp = netdev_priv(dev);
-		if (dev->phydev)
-			phy_disconnect(dev->phydev);
+		if (bp->pl) {
+			rtnl_lock();
+			phylink_disconnect_phy(bp->pl);
+			rtnl_unlock();
+		}
 		mdiobus_unregister(bp->mii_bus);
 		if (np && of_phy_is_fixed_link(np))
 			of_phy_deregister_fixed_link(np);
 		dev->phydev = NULL;
 		mdiobus_free(bp->mii_bus);
+		if (bp->pl)
+			phylink_destroy(bp->pl);
 
 		unregister_netdev(dev);
 		pm_runtime_disable(&pdev->dev);
@@ -4433,8 +4464,9 @@ static int __maybe_unused macb_suspend(struct device *dev)
 		for (q = 0, queue = bp->queues; q < bp->num_queues;
 		     ++q, ++queue)
 			napi_disable(&queue->napi);
-		phy_stop(netdev->phydev);
-		phy_suspend(netdev->phydev);
+		phylink_stop(bp->pl);
+		if (netdev->phydev)
+			phy_suspend(netdev->phydev);
 		spin_lock_irqsave(&bp->lock, flags);
 		macb_reset_hw(bp);
 		spin_unlock_irqrestore(&bp->lock, flags);
@@ -4482,9 +4514,11 @@ static int __maybe_unused macb_resume(struct device *dev)
 		for (q = 0, queue = bp->queues; q < bp->num_queues;
 		     ++q, ++queue)
 			napi_enable(&queue->napi);
-		phy_resume(netdev->phydev);
-		phy_init_hw(netdev->phydev);
-		phy_start(netdev->phydev);
+		if (netdev->phydev) {
+			phy_resume(netdev->phydev);
+			phy_init_hw(netdev->phydev);
+		}
+		phylink_start(bp->pl);
 	}
 
 	bp->macbgem_ops.mog_init_rings(bp);
-- 
2.17.1
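
The core of gem_phylink_validate() above is a mask computation: gigabit
modes are only offered on GMII/RGMII when the MAC is a GEM with gigabit
capability, 10/100 modes are offered on all four handled interfaces, and
anything else yields an empty set. The sketch below restates that logic with
plain bitmasks so it can be exercised in userspace. All SK_ names are
hypothetical stand-ins for the phylink link mode bits and the MACB_CAPS_*
flags; the 1000baseX mode from the patch is omitted for brevity.

```c
#include <stdint.h>

enum sk_iface { SK_IF_MII, SK_IF_RMII, SK_IF_GMII, SK_IF_RGMII, SK_IF_OTHER };

/* Stand-ins for ethtool link mode bits. */
#define SK_10_HALF	(1u << 0)
#define SK_10_FULL	(1u << 1)
#define SK_100_HALF	(1u << 2)
#define SK_100_FULL	(1u << 3)
#define SK_1000_HALF	(1u << 4)
#define SK_1000_FULL	(1u << 5)

/* Stand-ins for the driver capability flags. */
#define SK_CAP_GIGABIT		(1u << 0)
#define SK_CAP_NO_GIGABIT_HALF	(1u << 1)

/* Return the subset of 'supported' that the MAC can handle on 'iface'. */
static inline uint32_t sk_validate(enum sk_iface iface, int is_gem,
				   uint32_t caps, uint32_t supported)
{
	uint32_t mask = 0;

	switch (iface) {
	case SK_IF_GMII:
	case SK_IF_RGMII:
		if (!is_gem)
			return 0;	/* empty set: MACB cannot do (R)GMII */
		if (caps & SK_CAP_GIGABIT) {
			mask |= SK_1000_FULL;
			if (!(caps & SK_CAP_NO_GIGABIT_HALF))
				mask |= SK_1000_HALF;
		}
		/* fall through */
	case SK_IF_MII:
	case SK_IF_RMII:
		mask |= SK_10_HALF | SK_10_FULL | SK_100_HALF | SK_100_FULL;
		break;
	default:
		return 0;		/* empty set: unsupported interface */
	}

	return supported & mask;
}
```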


^ permalink raw reply related

* Re: [PATCH net-next v6 06/15] ethtool: netlink bitset handling
From: Michal Kubecek @ 2019-07-10 14:37 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, Johannes Berg,
	linux-kernel
In-Reply-To: <20190710125943.GC2291@nanopsycho>

On Wed, Jul 10, 2019 at 02:59:43PM +0200, Jiri Pirko wrote:
> Wed, Jul 10, 2019 at 02:38:03PM CEST, mkubecek@suse.cz wrote:
> >On Tue, Jul 09, 2019 at 04:18:17PM +0200, Jiri Pirko wrote:
> >> 
> >> I understand. So how about avoiding the bitfield altogether and just
> >> having an array of either bits or strings, or combinations?
> >> 
> >> ETHTOOL_CMD_SETTINGS_SET (U->K)
> >>     ETHTOOL_A_HEADER
> >>         ETHTOOL_A_DEV_NAME = "eth3"
> >>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
> >>        ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>            ETHTOOL_A_FLAG_NAME = "legacy-rx"
> >> 	   ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >> 
> >> or the same with index instead of string
> >> 
> >> ETHTOOL_CMD_SETTINGS_SET (U->K)
> >>     ETHTOOL_A_HEADER
> >>         ETHTOOL_A_DEV_NAME = "eth3"
> >>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
> >>         ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>             ETHTOOL_A_FLAG_INDEX = 0
> >>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >> 
> >> 
> >> For set you can combine both when you want to set multiple bits:
> >> 
> >> ETHTOOL_CMD_SETTINGS_SET (U->K)
> >>     ETHTOOL_A_HEADER
> >>         ETHTOOL_A_DEV_NAME = "eth3"
> >>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
> >>         ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>             ETHTOOL_A_FLAG_INDEX = 2
> >>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >>         ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>             ETHTOOL_A_FLAG_INDEX = 8
> >>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >>         ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
> >>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >> 
> >> 
> >> For get this might be a bit bigger message:
> >> 
> >> ETHTOOL_CMD_SETTINGS_GET_REPLY (K->U)
> >>     ETHTOOL_A_HEADER
> >>         ETHTOOL_A_DEV_NAME = "eth3"
> >>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
> >>         ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>             ETHTOOL_A_FLAG_INDEX = 0
> >>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
> >>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >>         ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>             ETHTOOL_A_FLAG_INDEX = 1
> >>             ETHTOOL_A_FLAG_NAME = "vf-ipsec"
> >>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >>         ETHTOOL_A_SETTINGS_PRIV_FLAG
> >>             ETHTOOL_A_FLAG_INDEX = 8
> >>             ETHTOOL_A_FLAG_NAME = "something-else"
> >>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> >
> >This is perfect for "one shot" applications but not so much for long
> >running ones, either "ethtool --monitor" or management or monitoring
> >daemons. Repeating the names in every notification message would be
> >a waste, it's much more convenient to load the strings only once and
> 
> Yeah, for those applications, the ETHTOOL_A_FLAG_NAME could be omitted
> 
> 
> >cache them. Even if we omit the names in notifications (and possibly the
> >GET replies if client opts for it), this format still takes 12-16 bytes
> >per bit.
> >
> >So the problem I'm trying to address is that there are two types of
> >clients with very different mode of work and different preferences.
> >
> >Looking at the bitset.c, I would rather say that most of the complexity
> >and ugliness comes from dealing with both unsigned long based bitmaps
> >and u32 based ones. Originally, there were functions working with
> >unsigned long based bitmaps and the variants with "32" suffix were
> >wrappers around them which converted u32 bitmaps to unsigned long ones
> >and back. This became a problem when kernel started issuing warnings
> >about variable length arrays as getting rid of them meant two kmalloc()
> >and two kfree() for each u32 bitmap operation, even if most of the
> >bitmaps are rather short in practice.
> >
> >Maybe the wrapper could do something like
> >
> >int ethnl_put_bitset32(const u32 *value, const u32 *mask,
> >		       unsigned int size,  ...)
> >{
> >	unsigned long fixed_value[2], fixed_mask[2];
> >	unsigned long *tmp_value = fixed_value;
> >	unsigned long *tmp_mask = fixed_mask;
> >
> >	if (size > sizeof(fixed_value) * BITS_PER_BYTE) {
> >		tmp_value = bitmap_alloc(size);
> >		if (!tmp_value)
> >			return -ENOMEM;
> >		tmp_mask = bitmap_alloc(size);
> >		if (!tmp_mask) {
> >			kfree(tmp_value);
> >			return -ENOMEM;
> >		}
> >	}
> >
> >	bitmap_from_arr32(tmp_value, value, size);
> >	bitmap_from_arr32(tmp_mask, mask, size);
> >	ret = ethnl_put_bitset(tmp_value, tmp_mask, size, ...);
> >}
> >
> >This way we would make bitset.c code cleaner while avoiding allocating
> >short bitmaps (which is the most common case). 
> 
> I'm primarily concerned about the uapi. Plus if the uapi approach is united
> for both index and string, we can omit this whole bitset abomination...

I'm afraid I don't understand this comment. Whatever the representation
of bitmaps (both simple bitmaps and value/mask pairs) is going to be, we
will need a function for parsing them (currently ethnl_update_bitset())
and a function for filling them into the message (currently
ethnl_put_bitset()), unless you are suggesting that we write a copy of
essentially the same parser and composer for each of the bitsets. There
are 15 of them already, plus 4 NLA_BITFIELD32 attributes which I'm
seriously considering replacing with arbitrary length bitsets as well
to make the UAPI as future proof as possible.

After all, what you suggested above is exactly the same structure as my
bitset in verbose form, except you omit size (which is a problem, as
discussed in other part of the thread) and put the contents of BITS
container directly under the main container.
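
To make the wrapper idea quoted earlier in this mail more concrete, here is
a rough userspace sketch of the "fixed on-stack buffer, heap only for long
bitmaps" pattern, including the cleanup the quoted fragment leaves out. It is
not the ethtool code: the sk_ names are hypothetical, calloc()/free() stand
in for bitmap_alloc()/kfree(), sk_from_arr32() stands in for
bitmap_from_arr32(), and the worker simply counts set bits where the real
code would call ethnl_put_bitset().

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SK_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define SK_LONGS(nbits)		(((nbits) + SK_BITS_PER_LONG - 1) / SK_BITS_PER_LONG)

/* Copy the first nbits bits from u32 words into unsigned long words. */
static void sk_from_arr32(unsigned long *dst, const uint32_t *src,
			  unsigned int nbits)
{
	unsigned int i;

	memset(dst, 0, SK_LONGS(nbits) * sizeof(*dst));
	for (i = 0; i < nbits; i++)
		if (src[i / 32] & (UINT32_C(1) << (i % 32)))
			dst[i / SK_BITS_PER_LONG] |= 1UL << (i % SK_BITS_PER_LONG);
}

/* Placeholder for the unsigned long based worker: counts set bits. */
static int sk_count_bits(const unsigned long *map, unsigned int nbits)
{
	unsigned int i;
	int count = 0;

	for (i = 0; i < nbits; i++)
		if (map[i / SK_BITS_PER_LONG] & (1UL << (i % SK_BITS_PER_LONG)))
			count++;
	return count;
}

/* u32 wrapper: short bitmaps use the stack, long ones fall back to the heap. */
int sk_count_bits32(const uint32_t *value, unsigned int nbits)
{
	unsigned long fixed[2];
	unsigned long *tmp = fixed;
	int ret;

	if (nbits > sizeof(fixed) * CHAR_BIT) {
		tmp = calloc(SK_LONGS(nbits), sizeof(*tmp));
		if (!tmp)
			return -1;	/* the kernel version would return -ENOMEM */
	}

	sk_from_arr32(tmp, value, nbits);
	ret = sk_count_bits(tmp, nbits);

	if (tmp != fixed)
		free(tmp);		/* only free what was heap allocated */
	return ret;
}
```

With two unsigned longs on the stack, bitmaps of up to 128 bits on 64-bit
builds (64 bits on 32-bit builds) avoid the allocator entirely.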

Michal

^ permalink raw reply

* [PATCH v6 0/5] net: macb: cover letter
From: Parshuram Thombare @ 2019-07-10 14:36 UTC (permalink / raw)
  To: andrew, nicolas.ferre, davem, f.fainelli
  Cc: linux, netdev, hkallweit1, linux-kernel, rafalc, piotrs, aniljoy,
	arthurm, stevenh, pthombar, mparab

Hello,

This is the 6th version of the patch set containing the following patches
for the Cadence ethernet controller driver.

1. 0001-net-macb-add-phylink-support.patch
   Replace phylib APIs with phylink APIs.
2. 0002-net-macb-add-support-for-sgmii-MAC-PHY-interface.patch
   This patch adds support for SGMII mode.
3. 0004-net-macb-add-support-for-c45-PHY.patch
   This patch adds support for C45 PHYs.
4. 0005-net-macb-add-support-for-high-speed-interface
   This patch adds support for 10G USXGMII PCS in fixed mode.

Changes in v2:
1. Dropped patch configuring TI PHY DP83867 from
   Cadence PCI wrapper driver.
2. Removed code registering emulated PHY for fixed mode. 
3. Code reformatting as per Andrew's and Florian's suggestions.

Changes in v3:
Based on Russell's suggestions
1. Configure MAC in mac_config only for non in-band modes
2. Handle dynamic phy_mode changes in mac_config
3. Move MAC configurations to mac_config
4. Removed seemingly redundant check for phylink handle
5. Removed code from mac_an_restart and mac_link_state;
   they now just return -EOPNOTSUPP

Changes in v4:
1. Removed PHY_INTERFACE_MODE_2500BASEX, PHY_INTERFACE_MODE_1000BASEX and
   2.5G PHY_INTERFACE_MODE_SGMII phy modes from supported modes

Changes in v5:
1. Code refactoring

Changes in v6:
1. Allow phylink to validate particular phy_mode support by hardware.
2. Remove device tree parameter and 5G serdes rate for USXGMII

Regards,
Parshuram Thombare

Parshuram Thombare (4):
  net: macb: add phylink support
  net: macb: add support for sgmii MAC-PHY interface
  net: macb: add support for c45 PHY
  net: macb: add support for high speed interface

 drivers/net/ethernet/cadence/Kconfig     |   2 +-
 drivers/net/ethernet/cadence/macb.h      | 115 ++++-
 drivers/net/ethernet/cadence/macb_main.c | 543 ++++++++++++++++-------
 3 files changed, 483 insertions(+), 177 deletions(-)

-- 
2.17.1


^ permalink raw reply

* Re: [PATCH] [net-next] davinci_cpdma: don't cast dma_addr_t to pointer
From: Ivan Khoronzhuk @ 2019-07-10 14:26 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Grygorii Strashko, Andrew Lunn, Ilias Apalodimas,
	linux-omap, netdev, linux-kernel
In-Reply-To: <20190710080106.24237-1-arnd@arndb.de>

On Wed, Jul 10, 2019 at 10:00:33AM +0200, Arnd Bergmann wrote:
>dma_addr_t may be 64-bit wide on 32-bit architectures, so it is not
>valid to cast between it and a pointer:
>
>drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit_si':
>drivers/net/ethernet/ti/davinci_cpdma.c:1047:12: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
>drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_idle_submit_mapped':
>drivers/net/ethernet/ti/davinci_cpdma.c:1114:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
>drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit_mapped':
>drivers/net/ethernet/ti/davinci_cpdma.c:1164:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
>
>Solve this by using two separate members in 'struct submit_info'.
>Since this avoids the use of the 'flag' member, the structure does
>not even grow in typical configurations.
>
>Fixes: 6670acacd59e ("net: ethernet: ti: davinci_cpdma: add dma mapped submit")
>Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Although "flags" could be used for something else (who knows), this looks OK.
Reviewed-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
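
For readers following along, a small user-space sketch of why the cast is
lossy and what "two separate members" looks like. The type
dma_addr_sketch_t, struct submit_info_sketch and both helper functions are
hypothetical stand-ins for illustration, not the definitions in
davinci_cpdma.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for dma_addr_t on a 32-bit system with 64-bit DMA addresses. */
typedef uint64_t dma_addr_sketch_t;

/*
 * Hypothetical submit descriptor: the CPU pointer and the DMA address live
 * in separate members, so neither is ever cast to the other's type.
 */
struct submit_info_sketch {
	void *data_virt;		/* buffer that still needs mapping */
	dma_addr_sketch_t data_dma;	/* already-mapped address, 0 if unused */
};

/* Describe a buffer that is already DMA-mapped. */
struct submit_info_sketch submit_mapped(dma_addr_sketch_t addr)
{
	struct submit_info_sketch si = { .data_virt = NULL, .data_dma = addr };

	return si;
}

/* What survives a round trip through a 32-bit pointer-sized integer. */
uint64_t through_32bit_pointer(dma_addr_sketch_t addr)
{
	return (uint32_t)addr;	/* upper 32 bits are silently lost */
}
```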

-- 
Regards,
Ivan Khoronzhuk

^ permalink raw reply

* Re: [net][PATCH 5/5] rds: avoid version downgrade to legitimate newer peer connections
From: Yanjun Zhu @ 2019-07-10 14:26 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, davem
In-Reply-To: <1562736764-31752-6-git-send-email-santosh.shilimkar@oracle.com>


On 2019/7/10 13:32, Santosh Shilimkar wrote:
> Connections with legitimate tos values can get into the usual connection
> race, which can result in a consumer reject. We don't want the tos value
> or protocol version to be demoted for such connections; otherwise the
> peers would end up with different tos values, which can result in
> no connection. For example, a peer-initiated connection with, say,
> tos 8 can get downgraded to tos 0 while the usual connection race is
> in progress, which is not desirable.
>
> This patch fixes the above issue, introduced by
> commit d021fabf525f ("rds: rdma: add consumer reject")
>
> Reported-by: Yanjun Zhu <yanjun.zhu@oracle.com>
> Tested-by: Yanjun Zhu <yanjun.zhu@oracle.com>

Thanks. I am OK with this.

Zhu Yanjun

> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> ---
>   net/rds/rdma_transport.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
> index 9db455d..ff74c4b 100644
> --- a/net/rds/rdma_transport.c
> +++ b/net/rds/rdma_transport.c
> @@ -117,8 +117,10 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
>   		     ((*err) <= RDS_RDMA_REJ_INCOMPAT))) {
>   			pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
>   				&conn->c_laddr, &conn->c_faddr);
> -			conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
> -			conn->c_tos = 0;
> +
> +			if (!conn->c_tos)
> +				conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
> +
>   			rds_conn_drop(conn);
>   		}
>   		rdsdebug("Connection rejected: %s\n",
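
The behaviour after this hunk can be sketched in isolation as follows.
This is only an illustration: struct conn_sketch, on_reject() and the two
version constants are made-up stand-ins (the real constants live in
net/rds/rds.h), and the rds_conn_drop() call is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up values, used only to tell the two versions apart. */
#define SKETCH_VERSION_COMPAT	1u
#define SKETCH_VERSION_CURRENT	2u

struct conn_sketch {
	unsigned int proposed_version;
	uint8_t tos;
};

/* Mirror of the patched reject path: demote only tos-0 connections. */
void on_reject(struct conn_sketch *conn)
{
	if (!conn->tos)
		conn->proposed_version = SKETCH_VERSION_COMPAT;
	/* tos is never reset here; dropping the connection is omitted */
}
```

A connection proposed with tos 8 therefore keeps both its tos and its
proposed version across a reject, while a tos 0 connection still falls
back to the compat version.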

^ permalink raw reply

* Re: [net][PATCH 4/5] rds: Return proper "tos" value to user-space
From: Yanjun Zhu @ 2019-07-10 14:25 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, davem
In-Reply-To: <1562736764-31752-5-git-send-email-santosh.shilimkar@oracle.com>


On 2019/7/10 13:32, Santosh Shilimkar wrote:
> From: Gerd Rausch <gerd.rausch@oracle.com>
>
> The proper "tos" value needs to be returned
> to user-space (sockopt RDS_INFO_CONNECTIONS).
>
> Fixes: 3eb450367d08 ("rds: add type of service(tos) infrastructure")
> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Thanks. I am OK with this.

Zhu Yanjun

> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> ---
>   net/rds/connection.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index 7ea134f..ed7f213 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -736,6 +736,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer)
>   	cinfo->next_rx_seq = cp->cp_next_rx_seq;
>   	cinfo->laddr = conn->c_laddr.s6_addr32[3];
>   	cinfo->faddr = conn->c_faddr.s6_addr32[3];
> +	cinfo->tos = conn->c_tos;
>   	strncpy(cinfo->transport, conn->c_trans->t_name,
>   		sizeof(cinfo->transport));
>   	cinfo->flags = 0;

^ permalink raw reply

* Re: [net][PATCH 3/5] rds: Accept peer connection reject messages due to incompatible version
From: Yanjun Zhu @ 2019-07-10 14:24 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, davem
In-Reply-To: <1562736764-31752-4-git-send-email-santosh.shilimkar@oracle.com>


On 2019/7/10 13:32, Santosh Shilimkar wrote:
> From: Gerd Rausch <gerd.rausch@oracle.com>
>
> Prior to
> commit d021fabf525ff ("rds: rdma: add consumer reject")
>
> function "rds_rdma_cm_event_handler_cmn" would always honor a rejected
> connection attempt by issuing a "rds_conn_drop".
>
> The commit mentioned above added a "break", eliminating
> the "fallthrough" case and made the "rds_conn_drop" rather conditional:
>
> Now it only happens if a "consumer defined" reject (i.e. "rdma_reject")
> carries an integer-value of "1" inside "private_data":
>
>    if (!conn)
>      break;
>      err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
>      if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
>        pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
>                &conn->c_laddr, &conn->c_faddr);
>                conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
>                rds_conn_drop(conn);
>      }
>      rdsdebug("Connection rejected: %s\n",
>               rdma_reject_msg(cm_id, event->status));
>      break;
>      /* FALLTHROUGH */
>
> A number of issues are worth mentioning here:
>     #1) Previous versions of the RDS code simply rejected a connection
>         by calling "rdma_reject(cm_id, NULL, 0);"
>         So the value of the payload in "private_data" will not be "1",
>         but "0".
>
>     #2) Now the code has become dependent on host byte order and sizing.
>         If one peer is big-endian, the other is little-endian,
>         or there's a difference in sizeof(int) (e.g. ILP64 vs LP64),
>         the *err check does not work as intended.
>
>     #3) There is no check for "len" to see if the data behind *err is even valid.
>         Luckily, it appears that the "rdma_reject(cm_id, NULL, 0)" will always
>         carry 148 bytes of zeroized payload.
>         But that should probably not be relied upon here.
>
>     #4) With the added "break;",
>         we might as well drop the misleading "/* FALLTHROUGH */" comment.
>
> This commit does _not_ address issue #2, as the sender would have to
> agree on a byte order as well.
>
> Here is the sequence of messages in this observed error-scenario:
>     Host-A is pre-QoS changes (excluding the commit mentioned above)
>     Host-B is post-QoS changes (including the commit mentioned above)
>
>     #1 Host-B
>        issues a connection request via function "rds_conn_path_transition"
>        connection state transitions to "RDS_CONN_CONNECTING"
>
>     #2 Host-A
>        rejects the incompatible connection request (from #1)
>        It does so by calling "rdma_reject(cm_id, NULL, 0);"
>
>     #3 Host-B
>        receives an "RDMA_CM_EVENT_REJECTED" event (from #2)
>        But since the code is changed in the way described above,
>        it won't drop the connection here, simply because "*err == 0".
>
>     #4 Host-A
>        issues a connection request
>
>     #5 Host-B
>        receives an "RDMA_CM_EVENT_CONNECT_REQUEST" event
>        and ends up calling "rds_ib_cm_handle_connect".
>        But since the state is already in "RDS_CONN_CONNECTING"
>        (as of #1) it will end up issuing a "rdma_reject" without
>        dropping the connection:
>           if (rds_conn_state(conn) == RDS_CONN_CONNECTING) {
>               /* Wait and see - our connect may still be succeeding */
>               rds_ib_stats_inc(s_ib_connect_raced);
>           }
>           goto out;
>
>     #6 Host-A
>        receives an "RDMA_CM_EVENT_REJECTED" event (from #5),
>        drops the connection and tries again (goto #4) until it gives up.
>
> Tested-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Thanks

Zhu Yanjun

> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> ---
>   net/rds/rdma_transport.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
> index 46bce83..9db455d 100644
> --- a/net/rds/rdma_transport.c
> +++ b/net/rds/rdma_transport.c
> @@ -112,7 +112,9 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
>   		if (!conn)
>   			break;
>   		err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
> -		if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
> +		if (!err ||
> +		    (err && len >= sizeof(*err) &&
> +		     ((*err) <= RDS_RDMA_REJ_INCOMPAT))) {
>   			pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
>   				&conn->c_laddr, &conn->c_faddr);
>   			conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
> @@ -122,7 +124,6 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
>   		rdsdebug("Connection rejected: %s\n",
>   			 rdma_reject_msg(cm_id, event->status));
>   		break;
> -		/* FALLTHROUGH */
>   	case RDMA_CM_EVENT_ADDR_ERROR:
>   	case RDMA_CM_EVENT_ROUTE_ERROR:
>   	case RDMA_CM_EVENT_CONNECT_ERROR:
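
The patched condition can be read as the following stand-alone check. It
is a sketch under assumptions: reject_should_drop() and
SKETCH_REJ_INCOMPAT are hypothetical names, the value 1 is taken from the
description above, and host byte order is still assumed (issue #2 is not
addressed here either).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_REJ_INCOMPAT 1	/* the description above gives this as 1 */

/*
 * Decide whether a reject should drop the connection:
 *  - no private data at all: drop
 *  - payload too short to hold an int: do not trust it
 *  - payload 0 (older peers) or 1 (incompatible version): drop
 */
bool reject_should_drop(const void *private_data, size_t len)
{
	int err;

	if (!private_data)
		return true;
	if (len < sizeof(err))
		return false;
	memcpy(&err, private_data, sizeof(err));	/* no alignment assumed */
	return err <= SKETCH_REJ_INCOMPAT;
}
```

With this shape, the zero-filled payload sent by "rdma_reject(cm_id, NULL, 0)"
from an older peer leads to a drop, which is what breaks the retry loop in
the sequence described above.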

^ permalink raw reply

* Re: [PATCH] [net-next] net/mlx5e: avoid uninitialized variable use
From: Tariq Toukan @ 2019-07-10 14:22 UTC (permalink / raw)
  To: Arnd Bergmann, Saeed Mahameed, Leon Romanovsky, David S. Miller
  Cc: Tariq Toukan, Eran Ben Elisha, Boris Pismenny,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, clang-built-linux@googlegroups.com
In-Reply-To: <20190710130638.1846846-1-arnd@arndb.de>



On 7/10/2019 4:06 PM, Arnd Bergmann wrote:
> clang points to a variable being used in an unexpected
> code path:
> 
> drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:251:2: warning: variable 'rec_seq_sz' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized]
>          default:
>          ^~~~~~~
> drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:255:46: note: uninitialized use occurs here
>          skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
>                                                      ^~~~~~~~~~
> 
> From looking at the function logic, it seems that there is no
> sensible way to continue here, so just return early and hope
> for the best.
> 
> Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> index 3f5f4317a22b..5c08891806f0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> @@ -250,6 +250,7 @@ tx_post_resync_params(struct mlx5e_txqsq *sq,
>   	}
>   	default:
>   		WARN_ON(1);
> +		return;
>   	}
>   
>   	skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
> 

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

Thanks!
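
For anyone skimming the thread, the pattern clang complains about can be
reduced to the sketch below. The enum, the function name compare_rec_seq()
and the size 8 are hypothetical; the real function returns void, whereas
this one returns a status so the early exit is observable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum cipher_sketch { CIPHER_KNOWN, CIPHER_UNKNOWN };

/*
 * Returns 1 if the sequences match, 0 if they differ, and -1 when the
 * cipher type is not handled.
 */
int compare_rec_seq(enum cipher_sketch type, const unsigned char *rec_seq,
		    const unsigned char *expected)
{
	size_t rec_seq_sz;

	switch (type) {
	case CIPHER_KNOWN:
		rec_seq_sz = 8;
		break;
	default:
		/* Without this return, rec_seq_sz would be read uninitialized. */
		return -1;
	}

	return !memcmp(rec_seq, expected, rec_seq_sz);
}
```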

^ permalink raw reply

* Re: [net][PATCH 1/5] rds: fix reordering with composite message notification
From: Yanjun Zhu @ 2019-07-10 14:23 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, davem
In-Reply-To: <1562736764-31752-2-git-send-email-santosh.shilimkar@oracle.com>


On 2019/7/10 13:32, Santosh Shilimkar wrote:
> RDS composite message (rdma + control) user notification needs to be
> triggered once the full message is delivered, and such a fix was
> added as part of commit 941f8d55f6d61 ("RDS: RDMA: Fix the composite
> message user notification"). But rds_send_remove_from_sock is missing
> the data part notify check, and hence at times the user doesn't get
> the notification, which isn't desirable.
>
> One way is to fix rds_send_remove_from_sock to check for that case,
> but considering the ordering complexity with the completion handler, and
> that rdma + control messages are always dispatched back to back in the
> same send context, just delaying the signaled completion on the rdma work
> request also gets the desired behaviour, i.e. notifying the application
> only after the RDMA + control message send completes. So this patch
> updates the earlier fix with this approach. The fix of delaying signaled
> completions of the rdma op till the control message send completes was
> done by Venkat Venkatsubra in the downstream kernel.
>
> Reviewed-and-tested-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Thanks. I am fine with this.

Zhu Yanjun

> Reviewed-by: Gerd Rausch <gerd.rausch@oracle.com>
> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> ---
>   net/rds/ib_send.c | 29 +++++++++++++----------------
>   net/rds/rdma.c    | 10 ----------
>   net/rds/rds.h     |  1 -
>   net/rds/send.c    |  4 +---
>   4 files changed, 14 insertions(+), 30 deletions(-)
>
> diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
> index 18f2341..dfe6237 100644
> --- a/net/rds/ib_send.c
> +++ b/net/rds/ib_send.c
> @@ -69,6 +69,16 @@ static void rds_ib_send_complete(struct rds_message *rm,
>   	complete(rm, notify_status);
>   }
>   
> +static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
> +				   struct rm_data_op *op,
> +				   int wc_status)
> +{
> +	if (op->op_nents)
> +		ib_dma_unmap_sg(ic->i_cm_id->device,
> +				op->op_sg, op->op_nents,
> +				DMA_TO_DEVICE);
> +}
> +
>   static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic,
>   				   struct rm_rdma_op *op,
>   				   int wc_status)
> @@ -129,21 +139,6 @@ static void rds_ib_send_unmap_atomic(struct rds_ib_connection *ic,
>   		rds_ib_stats_inc(s_ib_atomic_fadd);
>   }
>   
> -static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
> -				   struct rm_data_op *op,
> -				   int wc_status)
> -{
> -	struct rds_message *rm = container_of(op, struct rds_message, data);
> -
> -	if (op->op_nents)
> -		ib_dma_unmap_sg(ic->i_cm_id->device,
> -				op->op_sg, op->op_nents,
> -				DMA_TO_DEVICE);
> -
> -	if (rm->rdma.op_active && rm->data.op_notify)
> -		rds_ib_send_unmap_rdma(ic, &rm->rdma, wc_status);
> -}
> -
>   /*
>    * Unmap the resources associated with a struct send_work.
>    *
> @@ -902,7 +897,9 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rm_rdma_op *op)
>   		send->s_queued = jiffies;
>   		send->s_op = NULL;
>   
> -		nr_sig += rds_ib_set_wr_signal_state(ic, send, op->op_notify);
> +		if (!op->op_notify)
> +			nr_sig += rds_ib_set_wr_signal_state(ic, send,
> +							     op->op_notify);
>   
>   		send->s_wr.opcode = op->op_write ? IB_WR_RDMA_WRITE : IB_WR_RDMA_READ;
>   		send->s_rdma_wr.remote_addr = remote_addr;
> diff --git a/net/rds/rdma.c b/net/rds/rdma.c
> index b340ed4..916f5ec 100644
> --- a/net/rds/rdma.c
> +++ b/net/rds/rdma.c
> @@ -641,16 +641,6 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm,
>   		}
>   		op->op_notifier->n_user_token = args->user_token;
>   		op->op_notifier->n_status = RDS_RDMA_SUCCESS;
> -
> -		/* Enable rmda notification on data operation for composite
> -		 * rds messages and make sure notification is enabled only
> -		 * for the data operation which follows it so that application
> -		 * gets notified only after full message gets delivered.
> -		 */
> -		if (rm->data.op_sg) {
> -			rm->rdma.op_notify = 0;
> -			rm->data.op_notify = !!(args->flags & RDS_RDMA_NOTIFY_ME);
> -		}
>   	}
>   
>   	/* The cookie contains the R_Key of the remote memory region, and
> diff --git a/net/rds/rds.h b/net/rds/rds.h
> index 0d8f67c..f0066d1 100644
> --- a/net/rds/rds.h
> +++ b/net/rds/rds.h
> @@ -476,7 +476,6 @@ struct rds_message {
>   		} rdma;
>   		struct rm_data_op {
>   			unsigned int		op_active:1;
> -			unsigned int		op_notify:1;
>   			unsigned int		op_nents;
>   			unsigned int		op_count;
>   			unsigned int		op_dmasg;
> diff --git a/net/rds/send.c b/net/rds/send.c
> index 166dd57..031b1e9 100644
> --- a/net/rds/send.c
> +++ b/net/rds/send.c
> @@ -491,14 +491,12 @@ void rds_rdma_send_complete(struct rds_message *rm, int status)
>   	struct rm_rdma_op *ro;
>   	struct rds_notifier *notifier;
>   	unsigned long flags;
> -	unsigned int notify = 0;
>   
>   	spin_lock_irqsave(&rm->m_rs_lock, flags);
>   
> -	notify =  rm->rdma.op_notify | rm->data.op_notify;
>   	ro = &rm->rdma;
>   	if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags) &&
> -	    ro->op_active && notify && ro->op_notifier) {
> +	    ro->op_active && ro->op_notify && ro->op_notifier) {
>   		notifier = ro->op_notifier;
>   		rs = rm->m_rs;
>   		sock_hold(rds_rs_to_sk(rs));

^ permalink raw reply

* [PATCH nf-next] net/mlx5e: Fix kernel NULL pointer dereference
From: wenxu @ 2019-07-10 14:18 UTC (permalink / raw)
  To: pablo, davem; +Cc: netdev

From: wenxu <wenxu@ucloud.cn>

[ 3444.666552] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 3444.666631] #PF: supervisor read access in kernel mode
[ 3444.666701] #PF: error_code(0x0000) - not-present page
[ 3444.666769] PGD 8000000812dd7067 P4D 8000000812dd7067 PUD 8207cc067 PMD 0
[ 3444.666843] Oops: 0000 [#1] SMP PTI
[ 3444.666910] CPU: 17 PID: 27387 Comm: nft Kdump: loaded Tainted: G           O      5.2.0-rc6+ #1
[ 3444.666987] Hardware name: Huawei Technologies Co., Ltd. RH1288 V3/BC11HGSC0, BIOS 3.57 02/26/2017
[ 3444.667071] RIP: 0010:flow_block_cb_setup_simple+0x127/0x240
[ 3444.667141] Code: 02 48 89 43 08 31 c0 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 83 c4 10 b8 a1 ff ff ff 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <49> 8b 04 24 49 39 c4 75 0a eb 2f 48 8b 00 49 39 c4 74 27 4c 3b 68
[ 3444.668201] RSP: 0018:ffffc90007b7b888 EFLAGS: 00010246
[ 3444.668595] RAX: 0000000000000000 RBX: ffff8890439a9b40 RCX: ffff88904d5008c0
[ 3444.668992] RDX: ffffffffa0879850 RSI: 0000000000000000 RDI: ffffc90007b7b908
[ 3444.669389] RBP: ffffc90007b7b8c0 R08: ffff88904d5008c0 R09: 0000000000000001
[ 3444.669787] R10: ffff88885a797d00 R11: ffff8890439a9b00 R12: 0000000000000000
[ 3444.670186] R13: ffffffffa0879850 R14: ffffc90007b7b908 R15: ffffffff823a8480
[ 3444.670588] FS:  00007f357c2fa740(0000) GS:ffff88885fe40000(0000) knlGS:0000000000000000
[ 3444.671313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3444.671705] CR2: 0000000000000000 CR3: 00000001a1600002 CR4: 00000000001626e0
[ 3444.672103] Call Trace:
[ 3444.672505]  ? jump_label_update+0x5f/0xc0
[ 3444.672933]  mlx5e_rep_setup_tc+0x32/0x40 [mlx5_core]
[ 3444.673335]  nft_flow_offload_chain+0xd0/0x1d0 [nf_tables]
[ 3444.673729]  nft_flow_rule_offload_commit+0x91/0x11b [nf_tables]
[ 3444.674129]  nf_tables_commit+0x90/0xe30 [nf_tables]
[ 3444.674529]  nfnetlink_rcv_batch+0x3b9/0x750 [nfnetlink]

Initialize the driver_block_list parameter: pass &mlx5e_block_cb_list to
flow_block_cb_setup_simple() instead of NULL.

Fixes: 955bcb6ea0df ("drivers: net: use flow block API")
Signed-off-by: wenxu <wenxu@ucloud.cn>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 10ef90a..90c6de9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1182,7 +1182,7 @@ static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
 
 	switch (type) {
 	case TC_SETUP_BLOCK:
-		return flow_block_cb_setup_simple(type_data, NULL,
+		return flow_block_cb_setup_simple(type_data, &mlx5e_block_cb_list,
 						  mlx5e_rep_setup_tc_cb,
 						  priv, priv, true);
 	default:
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH RFC 4/4] selftests/bpf: Add test for ftrace-based BPF attach/detach
From: Joel Fernandes (Google) @ 2019-07-10 14:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Adrian Ratiu, Alexei Starovoitov, bpf,
	Brendan Gregg, connoro, Daniel Borkmann, duyuchao, Ingo Molnar,
	jeffv, Karim Yaghmour, kernel-team, linux-kselftest,
	Manali Shukla, Manjo Raja Rao, Martin KaFai Lau, Masami Hiramatsu,
	Matt Mullins, Michal Gregorczyk, Michal Gregorczyk,
	Mohammad Husain, namhyung, namhyung, netdev, paul.chaignon,
	primiano, Qais Yousef, Shuah Khan, Song Liu, Srinivas Ramana,
	Steven Rostedt, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190710141548.132193-1-joel@joelfernandes.org>

Here we add support for testing the attach and detach of a BPF program
to a tracepoint through tracefs.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../raw_tp_writable_test_ftrace_run.c         | 89 +++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c

diff --git a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c
new file mode 100644
index 000000000000..7b42e3a69b71
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <linux/nbd.h>
+
+void test_raw_tp_writable_test_ftrace_run(void)
+{
+	__u32 duration = 0;
+	char error[4096];
+	int ret;
+
+	const struct bpf_insn trace_program[] = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+		BPF_MOV64_IMM(BPF_REG_0, 42),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+
+	struct bpf_load_program_attr load_attr = {
+		.prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+		.license = "GPL v2",
+		.insns = trace_program,
+		.insns_cnt = sizeof(trace_program) / sizeof(struct bpf_insn),
+		.log_level = 2,
+	};
+
+	int bpf_fd = bpf_load_program_xattr(&load_attr, error, sizeof(error));
+
+	if (CHECK(bpf_fd < 0, "bpf_raw_tracepoint_writable loaded",
+		  "failed: %d errno %d\n", bpf_fd, errno))
+		return;
+
+	const struct bpf_insn skb_program[] = {
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+
+	struct bpf_load_program_attr skb_load_attr = {
+		.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
+		.license = "GPL v2",
+		.insns = skb_program,
+		.insns_cnt = sizeof(skb_program) / sizeof(struct bpf_insn),
+	};
+
+	int filter_fd =
+		bpf_load_program_xattr(&skb_load_attr, error, sizeof(error));
+	if (CHECK(filter_fd < 0, "test_program_loaded", "failed: %d errno %d\n",
+		  filter_fd, errno))
+		goto out_bpffd;
+
+	ret = bpf_raw_tracepoint_ftrace_attach("bpf_test_run",
+					       "bpf_test_finish",
+					       bpf_fd);
+	if (CHECK(ret < 0, "bpf_raw_tracepoint_ftrace_attach",
+		  "failed: %d errno %d\n", ret, errno))
+		goto out_filterfd;
+
+	char test_skb[128] = {
+		0,
+	};
+
+	__u32 prog_ret;
+	int err = bpf_prog_test_run(filter_fd, 1, test_skb, sizeof(test_skb), 0,
+				    0, &prog_ret, 0);
+	CHECK(err != 42, "test_run",
+	      "tracepoint did not modify return value\n");
+	CHECK(prog_ret != 0, "test_run_ret",
+	      "socket_filter did not return 0\n");
+
+	ret = bpf_raw_tracepoint_ftrace_detach("bpf_test_run",
+					       "bpf_test_finish",
+					       bpf_fd);
+	if (CHECK(ret < 0, "bpf_raw_tracepoint_ftrace_detach",
+		  "failed: %d errno %d\n", ret, errno))
+		goto out_filterfd;
+
+	err = bpf_prog_test_run(filter_fd, 1, test_skb, sizeof(test_skb), 0, 0,
+				&prog_ret, 0);
+	CHECK(err != 0, "test_run_notrace",
+	      "test_run failed with %d errno %d\n", err, errno);
+	CHECK(prog_ret != 0, "test_run_ret_notrace",
+	      "socket_filter did not return 0\n");
+
+out_filterfd:
+	close(filter_fd);
+out_bpffd:
+	close(bpf_fd);
+}
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related

