Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH REPOST v4 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2..58ed70f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1692,7 +1692,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
@@ -2453,7 +2453,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		 * know there are new descriptors to fetch.
 		 */
 		wmb();
-		writel(ring->next_to_use, ring->tail);
+		writel_relaxed(ring->next_to_use, ring->tail);
 
 		xdp_do_flush_map();
 	}
@@ -8078,7 +8078,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
 	ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10014,7 +10014,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	 * are new descriptors to fetch.
 	 */
 	wmb();
-	writel(ring->next_to_use, ring->tail);
+	writel_relaxed(ring->next_to_use, ring->tail);
 
 	return;
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH REPOST v4 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: sulrich, netdev, timur, linux-kernel, Sinan Kaya, intel-wired-lan,
	linux-arm-msm, linux-arm-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 ++++----
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..9455869 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -185,7 +185,7 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter *fdir_data,
 	/* Mark the data descriptor to be watched */
 	first->next_to_watch = tx_desc;
 
-	writel(tx_ring->next_to_use, tx_ring->tail);
+	writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
 	return 0;
 
 dma_fail:
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2258,7 +2258,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		 */
 		wmb();
 
-		writel(xdp_ring->next_to_use, xdp_ring->tail);
+		writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
 	}
 
 	rx_ring->skb = skb;
@@ -3286,7 +3286,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
 	/* notify HW of packet */
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..56eea20 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2243,7 +2243,7 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
 	/* notify HW of packet */
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4

^ permalink raw reply related

* [PATCH REPOST v4 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: sulrich, netdev, timur, Sinan Kaya, linux-arm-msm,
	linux-arm-kernel

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

repost:
- split into intel specific patches

Changes since v3:
- https://www.spinics.net/lists/arm-kernel/msg641851.html
- group patches together into subsystems net:...
- collect reviewed and tested bys
- scrub barrier()

Sinan Kaya (7):
  i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
  ixgbe: eliminate duplicate barriers on weakly-ordered archs
  igbvf: eliminate duplicate barriers on weakly-ordered archs
  igb: eliminate duplicate barriers on weakly-ordered archs
  ixgbevf: keep writel() closer to wmb()
  ixgbevf: eliminate duplicate barriers on weakly-ordered archs
  fm10k: Eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/intel/fm10k/fm10k_main.c     | 4 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c       | 8 ++++----
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c     | 4 ++--
 drivers/net/ethernet/intel/igb/igb_main.c         | 4 ++--
 drivers/net/ethernet/intel/igbvf/netdev.c         | 4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 8 ++++----
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 5 -----
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 8 files changed, 18 insertions(+), 23 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH v2 bpf-next 6/8] libbpf: add bpf_raw_tracepoint_open helper
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

add bpf_raw_tracepoint_open(const char *name, int prog_fd) api to libbpf

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h | 11 +++++++++++
 tools/lib/bpf/bpf.c            | 11 +++++++++++
 tools/lib/bpf/bpf.h            |  1 +
 3 files changed, 23 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d245c41213ac..58060bec999d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_cmd {
 	BPF_MAP_GET_FD_BY_ID,
 	BPF_OBJ_GET_INFO_BY_FD,
 	BPF_PROG_QUERY,
+	BPF_RAW_TRACEPOINT_OPEN,
 };
 
 enum bpf_map_type {
@@ -134,6 +135,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_SK_MSG,
+	BPF_PROG_TYPE_RAW_TRACEPOINT,
 };
 
 enum bpf_attach_type {
@@ -344,6 +346,11 @@ union bpf_attr {
 		__aligned_u64	prog_ids;
 		__u32		prog_cnt;
 	} query;
+
+	struct {
+		__u64 name;
+		__u32 prog_fd;
+	} raw_tracepoint;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
@@ -1151,4 +1158,8 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_raw_tracepoint_args {
+	__u64 args[0];
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 592a58a2b681..e0500055f1a6 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -428,6 +428,17 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
 	return err;
 }
 
+int bpf_raw_tracepoint_open(const char *name, int prog_fd)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.raw_tracepoint.name = ptr_to_u64(name);
+	attr.raw_tracepoint.prog_fd = prog_fd;
+
+	return sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
+}
+
 int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 {
 	struct sockaddr_nl sa;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 8d18fb73d7fb..ee59342c6f42 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -79,4 +79,5 @@ int bpf_map_get_fd_by_id(__u32 id);
 int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len);
 int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
 		   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);
+int bpf_raw_tracepoint_open(const char *name, int prog_fd);
 #endif
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 5/8] bpf: introduce BPF_RAW_TRACEPOINT
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.

>From bpf program point of view the access to the arguments look like:
struct bpf_raw_tracepoint_args {
       __u64 args[0];
};

int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
  // program can read args[N] where N depends on tracepoint
  // and statically verified at program load+attach time
}

kprobe+bpf infrastructure allows programs access function arguments.
This feature allows programs access raw tracepoint arguments.

Similar to proposed 'dynamic ftrace events' there are no abi guarantees
to what the tracepoints arguments are and what their meaning is.
The program needs to type cast args properly and use bpf_probe_read()
helper to access struct fields when argument is a pointer.

For every tracepoint __bpf_trace_##call function is prepared.
In assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0xffffffff81132080 <+0>:     mov    %ecx,%ecx
   0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
        TP_PROTO(const struct net_device *dev,
                 const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata since the probe funcs need to appear in kallsyms.
The alternative of having __bpf_trace_##call being global in kallsyms
could have been to keep them static and add another pointer to these
static functions to 'struct trace_event_class' and 'struct trace_event_call',
but keeping them global simplifies implementation and keeps it indepedent
from the tracing side.

Also such approach gives the lowest possible overhead
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since tracepoint+bpf are used at speeds of 1M+ events per second
this is very valuable optimization.

Since ftrace and perf side are not involved the new
BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
that returns anon_inode FD of 'bpf-raw-tracepoint' object.

The user space looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);
// receive anon_inode fd for given bpf_raw_tracepoint with prog attached
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of tracing daemon or cmdline tool that uses this feature
will automatically detach bpf program, unload it and
unregister tracepoint probe.

On the kernel side for_each_kernel_tracepoint() is used
to find a tracepoint with "xdp_exception" name
(that would be __tracepoint_xdp_exception record)

Then kallsyms_lookup_name() is used to find the addr
of __bpf_trace_xdp_exception() probe function.

And finally tracepoint_probe_register() is used to connect probe
with tracepoint.

Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
tracepoint mechanisms. perf_event_open() can be used in parallel
on the same tracepoint.
Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
Each with its own bpf program. The kernel will execute
all tracepoint probes and all attached bpf programs.

In the future bpf_raw_tracepoints can be extended with
query/introspection logic.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_types.h    |   1 +
 include/linux/trace_events.h |  57 +++++++++
 include/trace/bpf_probe.h    |  87 +++++++++++++
 include/trace/define_trace.h |   1 +
 include/uapi/linux/bpf.h     |  11 ++
 kernel/bpf/syscall.c         |  87 +++++++++++++
 kernel/trace/bpf_trace.c     | 283 +++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 527 insertions(+)
 create mode 100644 include/trace/bpf_probe.h

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5e2e8a49fb21..6d7243bfb0ff 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -19,6 +19,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
 BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
+BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 8a1442c4e513..46d76bbd5668 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -468,6 +468,8 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
+int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
 #else
 static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
@@ -487,6 +489,14 @@ perf_event_query_prog_array(struct perf_event *event, void __user *info)
 {
 	return -EOPNOTSUPP;
 }
+static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
+{
+	return -EOPNOTSUPP;
+}
+static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *p)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 enum {
@@ -546,6 +556,53 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
 void perf_trace_buf_update(void *record, u16 type);
 void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
 
+void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
+void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
+void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3);
+void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4);
+void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5);
+void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6);
+void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7);
+void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		    u64 arg8);
+void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		    u64 arg8, u64 arg9);
+void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10);
+void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11);
+void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12);
+void bpf_trace_run13(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+		     u64 arg13);
+void bpf_trace_run14(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+		     u64 arg13, u64 arg14);
+void bpf_trace_run15(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+		     u64 arg13, u64 arg14, u64 arg15);
+void bpf_trace_run16(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+		     u64 arg13, u64 arg14, u64 arg15, u64 arg16);
+void bpf_trace_run17(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+		     u64 arg13, u64 arg14, u64 arg15, u64 arg16, u64 arg17);
 void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 			       struct trace_event_call *call, u64 count,
 			       struct pt_regs *regs, struct hlist_head *head,
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
new file mode 100644
index 000000000000..f67876794de8
--- /dev/null
+++ b/include/trace/bpf_probe.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#undef TRACE_SYSTEM_VAR
+
+#ifdef CONFIG_BPF_EVENTS
+
+#undef __entry
+#define __entry entry
+
+#undef __get_dynamic_array
+#define __get_dynamic_array(field)	\
+		((void *)__entry + (__entry->__data_loc_##field & 0xffff))
+
+#undef __get_dynamic_array_len
+#define __get_dynamic_array_len(field)	\
+		((__entry->__data_loc_##field >> 16) & 0xffff)
+
+#undef __get_str
+#define __get_str(field) ((char *)__get_dynamic_array(field))
+
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
+#undef __perf_count
+#define __perf_count(c)	(c)
+
+#undef __perf_task
+#define __perf_task(t)	(t)
+
+/*
+ * cast any integer or pointer type to u64 without warnings
+ * on 32 and 64 bit archs
+ */
+#define __CAST_TO_U64(expr) \
+	(u64) __builtin_choose_expr(sizeof(long) < sizeof(expr), \
+				    (expr), \
+				    (long) expr)
+#define __CAST1(a,...) __CAST_TO_U64(a)
+#define __CAST2(a,...) __CAST_TO_U64(a), __CAST1(__VA_ARGS__)
+#define __CAST3(a,...) __CAST_TO_U64(a), __CAST2(__VA_ARGS__)
+#define __CAST4(a,...) __CAST_TO_U64(a), __CAST3(__VA_ARGS__)
+#define __CAST5(a,...) __CAST_TO_U64(a), __CAST4(__VA_ARGS__)
+#define __CAST6(a,...) __CAST_TO_U64(a), __CAST5(__VA_ARGS__)
+#define __CAST7(a,...) __CAST_TO_U64(a), __CAST6(__VA_ARGS__)
+#define __CAST8(a,...) __CAST_TO_U64(a), __CAST7(__VA_ARGS__)
+#define __CAST9(a,...) __CAST_TO_U64(a), __CAST8(__VA_ARGS__)
+#define __CAST10(a,...) __CAST_TO_U64(a), __CAST9(__VA_ARGS__)
+#define __CAST11(a,...) __CAST_TO_U64(a), __CAST10(__VA_ARGS__)
+#define __CAST12(a,...) __CAST_TO_U64(a), __CAST11(__VA_ARGS__)
+#define __CAST13(a,...) __CAST_TO_U64(a), __CAST12(__VA_ARGS__)
+#define __CAST14(a,...) __CAST_TO_U64(a), __CAST13(__VA_ARGS__)
+#define __CAST15(a,...) __CAST_TO_U64(a), __CAST14(__VA_ARGS__)
+#define __CAST16(a,...) __CAST_TO_U64(a), __CAST15(__VA_ARGS__)
+#define __CAST17(a,...) __CAST_TO_U64(a), __CAST16(__VA_ARGS__)
+
+#define CAST_TO_U64(...) __FN_COUNT(__CAST,##__VA_ARGS__)(__VA_ARGS__)
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+/* no 'static' here. The bpf probe functions are global */		\
+notrace void								\
+__bpf_trace_##call(void *__data, proto)					\
+{									\
+	struct bpf_prog *prog = __data;					\
+	\
+	__FN_COUNT(bpf_trace_run, args)(prog, CAST_TO_U64(args));	\
+}
+
+/*
+ * This part is compiled out, it is only here as a build time check
+ * to make sure that if the tracepoint handling changes, the
+ * bpf probe will fail to compile unless it too is updated.
+ */
+#undef DEFINE_EVENT
+#define DEFINE_EVENT(template, call, proto, args)			\
+static inline void bpf_test_probe_##call(void)				\
+{									\
+	check_trace_callback_type_##call(__bpf_trace_##template);	\
+}
+
+
+#undef DEFINE_EVENT_PRINT
+#define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
+	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
+
+#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
+#endif /* CONFIG_BPF_EVENTS */
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index c040eda95d41..3bbd3b88177f 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -95,6 +95,7 @@
 #ifdef TRACEPOINTS_ENABLED
 #include <trace/trace_events.h>
 #include <trace/perf.h>
+#include <trace/bpf_probe.h>
 #endif
 
 #undef TRACE_EVENT
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 18b7c510c511..1878201c2d77 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_cmd {
 	BPF_MAP_GET_FD_BY_ID,
 	BPF_OBJ_GET_INFO_BY_FD,
 	BPF_PROG_QUERY,
+	BPF_RAW_TRACEPOINT_OPEN,
 };
 
 enum bpf_map_type {
@@ -134,6 +135,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_SK_MSG,
+	BPF_PROG_TYPE_RAW_TRACEPOINT,
 };
 
 enum bpf_attach_type {
@@ -344,6 +346,11 @@ union bpf_attr {
 		__aligned_u64	prog_ids;
 		__u32		prog_cnt;
 	} query;
+
+	struct {
+		__u64 name;
+		__u32 prog_fd;
+	} raw_tracepoint;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
@@ -1152,4 +1159,8 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_raw_tracepoint_args {
+	__u64 args[0];
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3aeb4ea2a93a..96bc45a6e7d6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1311,6 +1311,90 @@ static int bpf_obj_get(const union bpf_attr *attr)
 				attr->file_flags);
 }
 
+struct bpf_raw_tracepoint {
+	struct tracepoint *tp;
+	struct bpf_prog *prog;
+};
+
+static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_raw_tracepoint *raw_tp = filp->private_data;
+
+	if (raw_tp->prog) {
+		bpf_probe_unregister(raw_tp->tp, raw_tp->prog);
+		bpf_prog_put(raw_tp->prog);
+	}
+	kfree(raw_tp);
+	return 0;
+}
+
+static const struct file_operations bpf_raw_tp_fops = {
+	.release	= bpf_raw_tracepoint_release,
+	.read		= bpf_dummy_read,
+	.write		= bpf_dummy_write,
+};
+
+static void *__find_tp(struct tracepoint *tp, void *priv)
+{
+	char *name = priv;
+
+	if (!strcmp(tp->name, name))
+		return tp;
+	return NULL;
+}
+
+#define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
+
+static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
+{
+	struct bpf_raw_tracepoint *raw_tp;
+	struct tracepoint *tp;
+	struct bpf_prog *prog;
+	char tp_name[128];
+	int tp_fd, err;
+
+	if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
+			      sizeof(tp_name) - 1) < 0)
+		return -EFAULT;
+	tp_name[sizeof(tp_name) - 1] = 0;
+
+	tp = for_each_kernel_tracepoint(__find_tp, tp_name);
+	if (!tp)
+		return -ENOENT;
+
+	raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
+	if (!raw_tp)
+		return -ENOMEM;
+	raw_tp->tp = tp;
+
+	prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
+				 BPF_PROG_TYPE_RAW_TRACEPOINT);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto out_free_tp;
+	}
+
+	err = bpf_probe_register(raw_tp->tp, prog);
+	if (err)
+		goto out_put_prog;
+
+	raw_tp->prog = prog;
+	tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
+				 O_CLOEXEC);
+	if (tp_fd < 0) {
+		bpf_probe_unregister(raw_tp->tp, prog);
+		err = tp_fd;
+		goto out_put_prog;
+	}
+	return tp_fd;
+
+out_put_prog:
+	bpf_prog_put(prog);
+out_free_tp:
+	kfree(raw_tp);
+	return err;
+}
+
 #ifdef CONFIG_CGROUP_BPF
 
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
@@ -1921,6 +2005,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET_INFO_BY_FD:
 		err = bpf_obj_get_info_by_fd(&attr, uattr);
 		break;
+	case BPF_RAW_TRACEPOINT_OPEN:
+		err = bpf_raw_tracepoint_open(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c634e093951f..19576d216880 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -723,6 +723,86 @@ const struct bpf_verifier_ops tracepoint_verifier_ops = {
 const struct bpf_prog_ops tracepoint_prog_ops = {
 };
 
+/*
+ * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
+ * to avoid potential recursive reuse issue when/if tracepoints are added
+ * inside bpf_*_event_output and/or bpf_get_stack_id
+ */
+static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
+BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
+	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+	perf_fetch_caller_regs(regs);
+	return ____bpf_perf_event_output(regs, map, flags, data, size);
+}
+
+static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = {
+	.func		= bpf_perf_event_output_raw_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
+BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
+	   struct bpf_map *, map, u64, flags)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+	perf_fetch_caller_regs(regs);
+	/* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */
+	return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
+			       flags, 0, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
+	.func		= bpf_get_stackid_raw_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *raw_tp_prog_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_perf_event_output:
+		return &bpf_perf_event_output_proto_raw_tp;
+	case BPF_FUNC_get_stackid:
+		return &bpf_get_stackid_proto_raw_tp;
+	default:
+		return tracing_func_proto(func_id);
+	}
+}
+
+static bool raw_tp_prog_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					struct bpf_insn_access_aux *info)
+{
+	/* largest tracepoint in the kernel has 17 args */
+	if (off < 0 || off >= sizeof(__u64) * 17)
+		return false;
+	if (type != BPF_READ)
+		return false;
+	if (off % size != 0)
+		return false;
+	return true;
+}
+
+const struct bpf_verifier_ops raw_tracepoint_verifier_ops = {
+	.get_func_proto  = raw_tp_prog_func_proto,
+	.is_valid_access = raw_tp_prog_is_valid_access,
+};
+
+const struct bpf_prog_ops raw_tracepoint_prog_ops = {
+};
+
 static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
 				    struct bpf_insn_access_aux *info)
 {
@@ -896,3 +976,206 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 
 	return ret;
 }
+
+static __always_inline
+void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
+{
+	rcu_read_lock();
+	preempt_disable();
+	(void) BPF_PROG_RUN(prog, args);
+	preempt_enable();
+	rcu_read_unlock();
+}
+
+#define EVAL1(FN, X) FN(X)
+#define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
+#define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
+#define EVAL4(FN, X, Y...) FN(X) EVAL3(FN, Y)
+#define EVAL5(FN, X, Y...) FN(X) EVAL4(FN, Y)
+#define EVAL6(FN, X, Y...) FN(X) EVAL5(FN, Y)
+
+#define COPY(X) args[X - 1] = arg##X;
+
+void bpf_trace_run1(struct bpf_prog *prog, u64 arg1)
+{
+	u64 args[1];
+
+	EVAL1(COPY, 1);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run1);
+void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2)
+{
+	u64 args[2];
+
+	EVAL2(COPY, 1, 2);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run2);
+void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3)
+{
+	u64 args[3];
+
+	EVAL3(COPY, 1, 2, 3);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run3);
+void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4)
+{
+	u64 args[4];
+
+	EVAL4(COPY, 1, 2, 3, 4);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run4);
+void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5)
+{
+	u64 args[5];
+
+	EVAL5(COPY, 1, 2, 3, 4, 5);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run5);
+void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6)
+{
+	u64 args[6];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run6);
+void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7)
+{
+	u64 args[7];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	EVAL1(COPY, 7);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run7);
+void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		    u64 arg8)
+{
+	u64 args[8];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	EVAL2(COPY, 7, 8);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run8);
+void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		    u64 arg8, u64 arg9)
+{
+	u64 args[9];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	EVAL3(COPY, 7, 8, 9);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run9);
+void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10)
+{
+	u64 args[10];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	EVAL4(COPY, 7, 8, 9, 10);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run10);
+void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11)
+{
+	u64 args[11];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	EVAL5(COPY, 7, 8, 9, 10, 11);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run11);
+void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12)
+{
+	u64 args[12];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	EVAL6(COPY, 7, 8, 9, 10, 11, 12);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run12);
+void bpf_trace_run17(struct bpf_prog *prog, u64 arg1, u64 arg2,
+		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+		     u64 arg13, u64 arg14, u64 arg15, u64 arg16, u64 arg17)
+{
+	u64 args[17];
+
+	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+	EVAL6(COPY, 7, 8, 9, 10, 11, 12);
+	EVAL5(COPY, 13, 14, 15, 16, 17);
+	__bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run17);
+
+static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	unsigned long addr;
+	char buf[128];
+
+	/*
+	 * check that program doesn't access arguments beyond what's
+	 * available in this tracepoint
+	 */
+	if (prog->aux->max_ctx_offset > tp->num_args * sizeof(u64))
+		return -EINVAL;
+
+	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
+	addr = kallsyms_lookup_name(buf);
+	if (!addr)
+		return -ENOENT;
+
+	return tracepoint_probe_register(tp, (void *)addr, prog);
+}
+
+int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	int err;
+
+	mutex_lock(&bpf_event_mutex);
+	err = __bpf_probe_register(tp, prog);
+	mutex_unlock(&bpf_event_mutex);
+	return err;
+}
+
+static int __bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	unsigned long addr;
+	char buf[128];
+
+	snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
+	addr = kallsyms_lookup_name(buf);
+	if (!addr)
+		return -ENOENT;
+
+	return tracepoint_probe_unregister(tp, (void *)addr, prog);
+}
+
+int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+{
+	int err;
+
+	mutex_lock(&bpf_event_mutex);
+	err = __bpf_probe_unregister(tp, prog);
+	mutex_unlock(&bpf_event_mutex);
+	return err;
+}
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 0/8] bpf, tracing: introduce bpf raw tracepoints
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api

From: Alexei Starovoitov <ast@kernel.org>

v1->v2:
- simplified api by combing bpf_raw_tp_open(name) + bpf_attach(prog_fd) into
  bpf_raw_tp_open(name, prog_fd) as suggested by Daniel.
  That simplifies bpf_detach as well which is now simple close() of fd.
- fixed memory leak in error path which was spotted by Daniel.
- fixed bpf_get_stackid(), bpf_perf_event_output() called from raw tracepoints
- added more tests
- fixed allyesconfig build caught by buildbot

v1:
This patch set is a different way to address the pressing need to access
task_struct pointers in sched tracepoints from bpf programs.

The first approach simply added these pointers to sched tracepoints:
https://lkml.org/lkml/2017/12/14/753
which Peter nacked.
Few options were discussed and eventually the discussion converged on
doing bpf specific tracepoint_probe_register() probe functions.
Details here:
https://lkml.org/lkml/2017/12/20/929

Patch 1 is kernel wide cleanup of pass-struct-by-value into
pass-struct-by-reference into tracepoints.

Patches 2 and 3 are minor cleanups to address allyesconfig build

Patch 4 minor prep work to expose number of arguments passed
into tracepoints.

Patch 5 introduces BPF_RAW_TRACEPOINT api.
the auto-cleanup and multiple concurrent users are must have
features of tracing api. For bpf raw tracepoints it looks like:
  // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
  prog_fd = bpf_prog_load(...);

  // receive anon_inode fd for given bpf_raw_tracepoint
  // and attach bpf program to it
  raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of tracing daemon or cmdline tool will automatically
detach bpf program, unload it and unregister tracepoint probe.
More details in patch 5.

Patch 6 - trivial support in libbpf
Patches 7, 8 - user space tests

samples/bpf/test_overhead performance on 1 cpu:

tracepoint    base  kprobe+bpf tracepoint+bpf raw_tracepoint+bpf
task_rename   1.1M   769K        947K            1.0M
urandom_read  789K   697K        750K            755K

Alexei Starovoitov (8):
  treewide: remove struct-pass-by-value from tracepoints arguments
  net/mediatek: disambiguate mt76 vs mt7601u trace events
  net/mac802154: disambiguate mac80215 vs mac802154 trace events
  tracepoint: compute num_args at build time
  bpf: introduce BPF_RAW_TRACEPOINT
  libbpf: add bpf_raw_tracepoint_open helper
  samples/bpf: raw tracepoint test
  selftests/bpf: test for bpf_get_stackid() from raw tracepoints

 arch/x86/xen/mmu_pv.c                         |  16 +-
 drivers/gpu/drm/i915/i915_trace.h             |  13 +-
 drivers/infiniband/hw/hfi1/file_ops.c         |   2 +-
 drivers/infiniband/hw/hfi1/trace_ctxts.h      |  12 +-
 drivers/net/wireless/mediatek/mt7601u/trace.h |   6 +-
 drivers/s390/cio/ioasm.c                      |  18 +-
 drivers/s390/cio/trace.h                      |  50 ++---
 fs/dax.c                                      |   2 +-
 include/linux/bpf_types.h                     |   1 +
 include/linux/trace_events.h                  |  57 ++++++
 include/linux/tracepoint-defs.h               |   1 +
 include/linux/tracepoint.h                    |  32 ++-
 include/trace/bpf_probe.h                     |  87 ++++++++
 include/trace/define_trace.h                  |  15 +-
 include/trace/events/f2fs.h                   |   2 +-
 include/trace/events/fs_dax.h                 |   6 +-
 include/trace/events/rcu.h                    |   4 +-
 include/trace/events/xen.h                    |  32 +--
 include/uapi/linux/bpf.h                      |  11 +
 kernel/bpf/syscall.c                          |  87 ++++++++
 kernel/rcu/tree.c                             |  10 +-
 kernel/trace/bpf_trace.c                      | 283 ++++++++++++++++++++++++++
 kernel/tracepoint.c                           |  27 ++-
 net/mac802154/trace.h                         |   8 +-
 net/wireless/trace.h                          |   2 +-
 samples/bpf/Makefile                          |   1 +
 samples/bpf/bpf_load.c                        |  14 ++
 samples/bpf/test_overhead_raw_tp_kern.c       |  17 ++
 samples/bpf/test_overhead_user.c              |  12 ++
 sound/firewire/amdtp-stream-trace.h           |   2 +-
 tools/include/uapi/linux/bpf.h                |  11 +
 tools/lib/bpf/bpf.c                           |  11 +
 tools/lib/bpf/bpf.h                           |   1 +
 tools/testing/selftests/bpf/test_progs.c      |  91 +++++++--
 34 files changed, 807 insertions(+), 137 deletions(-)
 create mode 100644 include/trace/bpf_probe.h
 create mode 100644 samples/bpf/test_overhead_raw_tp_kern.c

-- 
2.9.5

^ permalink raw reply

* [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

add fancy macro to compute number of arguments passed into tracepoint
at compile time and store it as part of 'struct tracepoint'.
The number is necessary to check safety of bpf program access that
is coming in subsequent patch.

for_each_tracepoint_range() api has no users inside the kernel.
Make it more useful with ability to stop for_each() loop depending
via callback return value.
In such form it's used in subsequent patch.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/tracepoint-defs.h |  1 +
 include/linux/tracepoint.h      | 32 +++++++++++++++++++++++---------
 include/trace/define_trace.h    | 14 +++++++-------
 kernel/tracepoint.c             | 27 ++++++++++++++++-----------
 4 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 64ed7064f1fa..39a283c61c51 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -33,6 +33,7 @@ struct tracepoint {
 	int (*regfunc)(void);
 	void (*unregfunc)(void);
 	struct tracepoint_func __rcu *funcs;
+	u32 num_args;
 };
 
 #endif
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index c94f466d57ef..b1676e53bb23 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -40,9 +40,19 @@ tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
 			       int prio);
 extern int
 tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
-extern void
-for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
-		void *priv);
+
+#ifdef CONFIG_TRACEPOINTS
+void *
+for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
+			   void *priv);
+#else
+static inline void *
+for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
+			   void *priv)
+{
+	return NULL;
+}
+#endif
 
 #ifdef CONFIG_MODULES
 struct tp_module {
@@ -225,23 +235,27 @@ extern void syscall_unregfunc(void);
 		return static_key_false(&__tracepoint_##name.key);	\
 	}
 
+#define ___FN_COUNT(fn,n0,n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12,n13,n14,n15,n16,n,...) fn##n
+#define __FN_COUNT(fn,...) ___FN_COUNT(fn,##__VA_ARGS__,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)
+#define __COUNT(...) __FN_COUNT(/**/,##__VA_ARGS__)
+
 /*
  * We have no guarantee that gcc and the linker won't up-align the tracepoint
  * structures, so we create an array of pointers that will be used for iteration
  * on the tracepoints.
  */
-#define DEFINE_TRACE_FN(name, reg, unreg)				 \
+#define DEFINE_TRACE_FN(name, reg, unreg, num_args)			 \
 	static const char __tpstrtab_##name[]				 \
 	__attribute__((section("__tracepoints_strings"))) = #name;	 \
 	struct tracepoint __tracepoint_##name				 \
 	__attribute__((section("__tracepoints"))) =			 \
-		{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
+		{ __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, num_args };\
 	static struct tracepoint * const __tracepoint_ptr_##name __used	 \
 	__attribute__((section("__tracepoints_ptrs"))) =		 \
 		&__tracepoint_##name;
 
-#define DEFINE_TRACE(name)						\
-	DEFINE_TRACE_FN(name, NULL, NULL);
+#define DEFINE_TRACE(name, num_args)					\
+	DEFINE_TRACE_FN(name, NULL, NULL, num_args);
 
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)				\
 	EXPORT_SYMBOL_GPL(__tracepoint_##name)
@@ -275,8 +289,8 @@ extern void syscall_unregfunc(void);
 		return false;						\
 	}
 
-#define DEFINE_TRACE_FN(name, reg, unreg)
-#define DEFINE_TRACE(name)
+#define DEFINE_TRACE_FN(name, reg, unreg, num_args)
+#define DEFINE_TRACE(name, num_args)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
 #define EXPORT_TRACEPOINT_SYMBOL(name)
 
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index d9e3d4aa3f6e..c040eda95d41 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -25,7 +25,7 @@
 
 #undef TRACE_EVENT
 #define TRACE_EVENT(name, proto, args, tstruct, assign, print)	\
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, __COUNT(args))
 
 #undef TRACE_EVENT_CONDITION
 #define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
@@ -39,24 +39,24 @@
 #undef TRACE_EVENT_FN
 #define TRACE_EVENT_FN(name, proto, args, tstruct,		\
 		assign, print, reg, unreg)			\
-	DEFINE_TRACE_FN(name, reg, unreg)
+	DEFINE_TRACE_FN(name, reg, unreg, __COUNT(args))
 
 #undef TRACE_EVENT_FN_COND
 #define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct,		\
 		assign, print, reg, unreg)			\
-	DEFINE_TRACE_FN(name, reg, unreg)
+	DEFINE_TRACE_FN(name, reg, unreg, __COUNT(args))
 
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args) \
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, __COUNT(args))
 
 #undef DEFINE_EVENT_FN
 #define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
-	DEFINE_TRACE_FN(name, reg, unreg)
+	DEFINE_TRACE_FN(name, reg, unreg, __COUNT(args))
 
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)	\
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, __COUNT(args))
 
 #undef DEFINE_EVENT_CONDITION
 #define DEFINE_EVENT_CONDITION(template, name, proto, args, cond) \
@@ -64,7 +64,7 @@
 
 #undef DECLARE_TRACE
 #define DECLARE_TRACE(name, proto, args)	\
-	DEFINE_TRACE(name)
+	DEFINE_TRACE(name, __COUNT(args))
 
 #undef TRACE_INCLUDE
 #undef __TRACE_INCLUDE
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 671b13457387..3f2dc5738c2b 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -502,17 +502,22 @@ static __init int init_tracepoints(void)
 __initcall(init_tracepoints);
 #endif /* CONFIG_MODULES */
 
-static void for_each_tracepoint_range(struct tracepoint * const *begin,
-		struct tracepoint * const *end,
-		void (*fct)(struct tracepoint *tp, void *priv),
-		void *priv)
+static void *for_each_tracepoint_range(struct tracepoint * const *begin,
+				       struct tracepoint * const *end,
+				       void *(*fct)(struct tracepoint *tp, void *priv),
+				       void *priv)
 {
 	struct tracepoint * const *iter;
+	void *ret;
 
 	if (!begin)
-		return;
-	for (iter = begin; iter < end; iter++)
-		fct(*iter, priv);
+		return NULL;
+	for (iter = begin; iter < end; iter++) {
+		ret = fct(*iter, priv);
+		if (ret)
+			return ret;
+	}
+	return NULL;
 }
 
 /**
@@ -520,11 +525,11 @@ static void for_each_tracepoint_range(struct tracepoint * const *begin,
  * @fct: callback
  * @priv: private data
  */
-void for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
-		void *priv)
+void *for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
+				 void *priv)
 {
-	for_each_tracepoint_range(__start___tracepoints_ptrs,
-		__stop___tracepoints_ptrs, fct, priv);
+	return for_each_tracepoint_range(__start___tracepoints_ptrs,
+					 __stop___tracepoints_ptrs, fct, priv);
 }
 EXPORT_SYMBOL_GPL(for_each_kernel_tracepoint);
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 7/8] samples/bpf: raw tracepoint test
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

add empty raw_tracepoint bpf program to test overhead similar
to kprobe and traditional tracepoint tests

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/Makefile                    |  1 +
 samples/bpf/bpf_load.c                  | 14 ++++++++++++++
 samples/bpf/test_overhead_raw_tp_kern.c | 17 +++++++++++++++++
 samples/bpf/test_overhead_user.c        | 12 ++++++++++++
 4 files changed, 44 insertions(+)
 create mode 100644 samples/bpf/test_overhead_raw_tp_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 2c2a587e0942..4d6a6edd4bf6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -119,6 +119,7 @@ always += offwaketime_kern.o
 always += spintest_kern.o
 always += map_perf_test_kern.o
 always += test_overhead_tp_kern.o
+always += test_overhead_raw_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index b1a310c3ae89..bebe4188b4b3 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -61,6 +61,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
 	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+	bool is_raw_tracepoint = strncmp(event, "raw_tracepoint/", 15) == 0;
 	bool is_xdp = strncmp(event, "xdp", 3) == 0;
 	bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
 	bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
@@ -85,6 +86,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_KPROBE;
 	} else if (is_tracepoint) {
 		prog_type = BPF_PROG_TYPE_TRACEPOINT;
+	} else if (is_raw_tracepoint) {
+		prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT;
 	} else if (is_xdp) {
 		prog_type = BPF_PROG_TYPE_XDP;
 	} else if (is_perf_event) {
@@ -131,6 +134,16 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		return populate_prog_array(event, fd);
 	}
 
+	if (is_raw_tracepoint) {
+		efd = bpf_raw_tracepoint_open(event + 15, fd);
+		if (efd < 0) {
+			printf("tracepoint %s %s\n", event + 15, strerror(errno));
+			return -1;
+		}
+		event_fd[prog_cnt - 1] = efd;
+		return 0;
+	}
+
 	if (is_kprobe || is_kretprobe) {
 		if (is_kprobe)
 			event += 7;
@@ -587,6 +600,7 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
 		    memcmp(shname, "tracepoint/", 11) == 0 ||
+		    memcmp(shname, "raw_tracepoint/", 15) == 0 ||
 		    memcmp(shname, "xdp", 3) == 0 ||
 		    memcmp(shname, "perf_event", 10) == 0 ||
 		    memcmp(shname, "socket", 6) == 0 ||
diff --git a/samples/bpf/test_overhead_raw_tp_kern.c b/samples/bpf/test_overhead_raw_tp_kern.c
new file mode 100644
index 000000000000..d2af8bc1c805
--- /dev/null
+++ b/samples/bpf/test_overhead_raw_tp_kern.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Facebook */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+SEC("raw_tracepoint/task_rename")
+int prog(struct bpf_raw_tracepoint_args *ctx)
+{
+	return 0;
+}
+
+SEC("raw_tracepoint/urandom_read")
+int prog2(struct bpf_raw_tracepoint_args *ctx)
+{
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_overhead_user.c b/samples/bpf/test_overhead_user.c
index d291167fd3c7..e1d35e07a10e 100644
--- a/samples/bpf/test_overhead_user.c
+++ b/samples/bpf/test_overhead_user.c
@@ -158,5 +158,17 @@ int main(int argc, char **argv)
 		unload_progs();
 	}
 
+	if (test_flags & 0xC0) {
+		snprintf(filename, sizeof(filename),
+			 "%s_raw_tp_kern.o", argv[0]);
+		if (load_bpf_file(filename)) {
+			printf("%s", bpf_log_buf);
+			return 1;
+		}
+		printf("w/RAW_TRACEPOINT\n");
+		run_perf_test(num_cpu, test_flags >> 6);
+		unload_progs();
+	}
+
 	return 0;
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 3/8] net/mac802154: disambiguate mac80215 vs mac802154 trace events
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

two trace events defined with the same name and both unused.
They conflict in allyesconfig build. Rename one of them.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 net/mac802154/trace.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/mac802154/trace.h b/net/mac802154/trace.h
index 2c8a43d3607f..df855c33daf2 100644
--- a/net/mac802154/trace.h
+++ b/net/mac802154/trace.h
@@ -33,7 +33,7 @@
 
 /* Tracing for driver callbacks */
 
-DECLARE_EVENT_CLASS(local_only_evt,
+DECLARE_EVENT_CLASS(local_only_evt4,
 	TP_PROTO(struct ieee802154_local *local),
 	TP_ARGS(local),
 	TP_STRUCT__entry(
@@ -45,7 +45,7 @@ DECLARE_EVENT_CLASS(local_only_evt,
 	TP_printk(LOCAL_PR_FMT, LOCAL_PR_ARG)
 );
 
-DEFINE_EVENT(local_only_evt, 802154_drv_return_void,
+DEFINE_EVENT(local_only_evt4, 802154_drv_return_void,
 	TP_PROTO(struct ieee802154_local *local),
 	TP_ARGS(local)
 );
@@ -65,12 +65,12 @@ TRACE_EVENT(802154_drv_return_int,
 		  __entry->ret)
 );
 
-DEFINE_EVENT(local_only_evt, 802154_drv_start,
+DEFINE_EVENT(local_only_evt4, 802154_drv_start,
 	TP_PROTO(struct ieee802154_local *local),
 	TP_ARGS(local)
 );
 
-DEFINE_EVENT(local_only_evt, 802154_drv_stop,
+DEFINE_EVENT(local_only_evt4, 802154_drv_stop,
 	TP_PROTO(struct ieee802154_local *local),
 	TP_ARGS(local)
 );
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 8/8] selftests/bpf: test for bpf_get_stackid() from raw tracepoints
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

similar to traditional traceopint test add bpf_get_stackid() test
from raw tracepoints
and reduce verbosity of existing stackmap test

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_progs.c | 91 ++++++++++++++++++++++++--------
 1 file changed, 70 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index e9df48b306df..faadbe233966 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -877,7 +877,7 @@ static void test_stacktrace_map()
 
 	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
 	if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
-		goto out;
+		return;
 
 	/* Get the ID for the sched/sched_switch tracepoint */
 	snprintf(buf, sizeof(buf),
@@ -888,8 +888,7 @@ static void test_stacktrace_map()
 
 	bytes = read(efd, buf, sizeof(buf));
 	close(efd);
-	if (CHECK(bytes <= 0 || bytes >= sizeof(buf),
-		  "read", "bytes %d errno %d\n", bytes, errno))
+	if (bytes <= 0 || bytes >= sizeof(buf))
 		goto close_prog;
 
 	/* Open the perf event and attach bpf progrram */
@@ -906,29 +905,24 @@ static void test_stacktrace_map()
 		goto close_prog;
 
 	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
-	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n",
-		  err, errno))
-		goto close_pmu;
+	if (err)
+		goto disable_pmu;
 
 	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
-	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n",
-		  err, errno))
+	if (err)
 		goto disable_pmu;
 
 	/* find map fds */
 	control_map_fd = bpf_find_map(__func__, obj, "control_map");
-	if (CHECK(control_map_fd < 0, "bpf_find_map control_map",
-		  "err %d errno %d\n", err, errno))
+	if (control_map_fd < 0)
 		goto disable_pmu;
 
 	stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
-	if (CHECK(stackid_hmap_fd < 0, "bpf_find_map stackid_hmap",
-		  "err %d errno %d\n", err, errno))
+	if (stackid_hmap_fd < 0)
 		goto disable_pmu;
 
 	stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
-	if (CHECK(stackmap_fd < 0, "bpf_find_map stackmap", "err %d errno %d\n",
-		  err, errno))
+	if (stackmap_fd < 0)
 		goto disable_pmu;
 
 	/* give some time for bpf program run */
@@ -945,24 +939,78 @@ static void test_stacktrace_map()
 	err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
 	if (CHECK(err, "compare_map_keys stackid_hmap vs. stackmap",
 		  "err %d errno %d\n", err, errno))
-		goto disable_pmu;
+		goto disable_pmu_noerr;
 
 	err = compare_map_keys(stackmap_fd, stackid_hmap_fd);
 	if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap",
 		  "err %d errno %d\n", err, errno))
-		; /* fall through */
+		goto disable_pmu_noerr;
 
+	goto disable_pmu_noerr;
 disable_pmu:
+	error_cnt++;
+disable_pmu_noerr:
 	ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);
-
-close_pmu:
 	close(pmu_fd);
-
 close_prog:
 	bpf_object__close(obj);
+}
 
-out:
-	return;
+static void test_stacktrace_map_raw_tp()
+{
+	int control_map_fd, stackid_hmap_fd, stackmap_fd;
+	const char *file = "./test_stacktrace_map.o";
+	int efd, err, prog_fd;
+	__u32 key, val, duration = 0;
+	struct bpf_object *obj;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_RAW_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "prog_load raw tp", "err %d errno %d\n", err, errno))
+		return;
+
+	efd = bpf_raw_tracepoint_open("sched_switch", prog_fd);
+	if (CHECK(efd < 0, "raw_tp_open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+
+	/* find map fds */
+	control_map_fd = bpf_find_map(__func__, obj, "control_map");
+	if (control_map_fd < 0)
+		goto close_prog;
+
+	stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
+	if (stackid_hmap_fd < 0)
+		goto close_prog;
+
+	stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
+	if (stackmap_fd < 0)
+		goto close_prog;
+
+	/* give some time for bpf program run */
+	sleep(1);
+
+	/* disable stack trace collection */
+	key = 0;
+	val = 1;
+	bpf_map_update_elem(control_map_fd, &key, &val, 0);
+
+	/* for every element in stackid_hmap, we can find a corresponding one
+	 * in stackmap, and vise versa.
+	 */
+	err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
+	if (CHECK(err, "compare_map_keys stackid_hmap vs. stackmap",
+		  "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	err = compare_map_keys(stackmap_fd, stackid_hmap_fd);
+	if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap",
+		  "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	goto close_prog_noerr;
+close_prog:
+	error_cnt++;
+close_prog_noerr:
+	bpf_object__close(obj);
 }
 
 static int extract_build_id(char *build_id, size_t size)
@@ -1138,6 +1186,7 @@ int main(void)
 	test_tp_attach_query();
 	test_stacktrace_map();
 	test_stacktrace_build_id();
+	test_stacktrace_map_raw_tp();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 1/8] treewide: remove struct-pass-by-value from tracepoints arguments
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

Fix all tracepoint arguments to pass structures (large and small) by reference
instead of by value.
Avoiding passing large structs by value is a good coding style.
Passing small structs sometimes is beneficial, but in all cases
it makes no difference vs readability of the code.
The subsequent patch enforces that all tracepoints args are either integers
or pointers and fit into 64-bit.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/xen/mmu_pv.c                    | 16 +++++-----
 drivers/gpu/drm/i915/i915_trace.h        | 13 +++++++--
 drivers/infiniband/hw/hfi1/file_ops.c    |  2 +-
 drivers/infiniband/hw/hfi1/trace_ctxts.h | 12 ++++----
 drivers/s390/cio/ioasm.c                 | 18 ++++++------
 drivers/s390/cio/trace.h                 | 50 ++++++++++++++++----------------
 fs/dax.c                                 |  2 +-
 include/trace/events/f2fs.h              |  2 +-
 include/trace/events/fs_dax.h            |  6 ++--
 include/trace/events/rcu.h               |  4 +--
 include/trace/events/xen.h               | 32 ++++++++++----------
 kernel/rcu/tree.c                        | 10 +++----
 net/wireless/trace.h                     |  2 +-
 sound/firewire/amdtp-stream-trace.h      |  2 +-
 14 files changed, 89 insertions(+), 82 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index aae88fec9941..b1a8061c3b28 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -218,7 +218,7 @@ static void xen_set_pmd_hyper(pmd_t *ptr, pmd_t val)
 
 static void xen_set_pmd(pmd_t *ptr, pmd_t val)
 {
-	trace_xen_mmu_set_pmd(ptr, val);
+	trace_xen_mmu_set_pmd(ptr, &val);
 
 	/* If page is not pinned, we can just update the entry
 	   directly */
@@ -277,14 +277,14 @@ static inline void __xen_set_pte(pte_t *ptep, pte_t pteval)
 
 static void xen_set_pte(pte_t *ptep, pte_t pteval)
 {
-	trace_xen_mmu_set_pte(ptep, pteval);
+	trace_xen_mmu_set_pte(ptep, &pteval);
 	__xen_set_pte(ptep, pteval);
 }
 
 static void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
 		    pte_t *ptep, pte_t pteval)
 {
-	trace_xen_mmu_set_pte_at(mm, addr, ptep, pteval);
+	trace_xen_mmu_set_pte_at(mm, addr, ptep, &pteval);
 	__xen_set_pte(ptep, pteval);
 }
 
@@ -292,7 +292,7 @@ pte_t xen_ptep_modify_prot_start(struct mm_struct *mm,
 				 unsigned long addr, pte_t *ptep)
 {
 	/* Just return the pte as-is.  We preserve the bits on commit */
-	trace_xen_mmu_ptep_modify_prot_start(mm, addr, ptep, *ptep);
+	trace_xen_mmu_ptep_modify_prot_start(mm, addr, ptep, ptep);
 	return *ptep;
 }
 
@@ -301,7 +301,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
 {
 	struct mmu_update u;
 
-	trace_xen_mmu_ptep_modify_prot_commit(mm, addr, ptep, pte);
+	trace_xen_mmu_ptep_modify_prot_commit(mm, addr, ptep, &pte);
 	xen_mc_batch();
 
 	u.ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD;
@@ -409,7 +409,7 @@ static void xen_set_pud_hyper(pud_t *ptr, pud_t val)
 
 static void xen_set_pud(pud_t *ptr, pud_t val)
 {
-	trace_xen_mmu_set_pud(ptr, val);
+	trace_xen_mmu_set_pud(ptr, &val);
 
 	/* If page is not pinned, we can just update the entry
 	   directly */
@@ -424,7 +424,7 @@ static void xen_set_pud(pud_t *ptr, pud_t val)
 #ifdef CONFIG_X86_PAE
 static void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
-	trace_xen_mmu_set_pte_atomic(ptep, pte);
+	trace_xen_mmu_set_pte_atomic(ptep, &pte);
 	set_64bit((u64 *)ptep, native_pte_val(pte));
 }
 
@@ -514,7 +514,7 @@ static void xen_set_p4d(p4d_t *ptr, p4d_t val)
 	pgd_t *user_ptr = xen_get_user_pgd((pgd_t *)ptr);
 	pgd_t pgd_val;
 
-	trace_xen_mmu_set_p4d(ptr, (p4d_t *)user_ptr, val);
+	trace_xen_mmu_set_p4d(ptr, (p4d_t *)user_ptr, &val);
 
 	/* If page is not pinned, we can just update the entry
 	   directly */
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index e1169c02eb2b..681da1f51911 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -849,8 +849,8 @@ TRACE_EVENT(i915_flip_complete,
 	    TP_printk("plane=%d, obj=%p", __entry->plane, __entry->obj)
 );
 
-TRACE_EVENT_CONDITION(i915_reg_rw,
-	TP_PROTO(bool write, i915_reg_t reg, u64 val, int len, bool trace),
+TRACE_EVENT_CONDITION(i915_reg_rw__,
+	TP_PROTO(bool write, u32 reg, u64 val, int len, bool trace),
 
 	TP_ARGS(write, reg, val, len, trace),
 
@@ -865,7 +865,7 @@ TRACE_EVENT_CONDITION(i915_reg_rw,
 
 	TP_fast_assign(
 		__entry->val = (u64)val;
-		__entry->reg = i915_mmio_reg_offset(reg);
+		__entry->reg = reg;
 		__entry->write = write;
 		__entry->len = len;
 		),
@@ -876,6 +876,13 @@ TRACE_EVENT_CONDITION(i915_reg_rw,
 		(u32)(__entry->val & 0xffffffff),
 		(u32)(__entry->val >> 32))
 );
+#if !defined(CREATE_TRACE_POINTS) && !defined(TRACE_HEADER_MULTI_READ)
+static inline void trace_i915_reg_rw(bool write, i915_reg_t reg, u64 val,
+				     int len, bool trace)
+{
+	trace_i915_reg_rw__(write, i915_mmio_reg_offset(reg), val, len, trace);
+}
+#endif
 
 TRACE_EVENT(intel_gpu_freq_change,
 	    TP_PROTO(u32 freq),
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 41fafebe3b0d..da4aa1a95b11 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1153,7 +1153,7 @@ static int get_ctxt_info(struct hfi1_filedata *fd, unsigned long arg, u32 len)
 	cinfo.sdma_ring_size = fd->cq->nentries;
 	cinfo.rcvegr_size = uctxt->egrbufs.rcvtid_size;
 
-	trace_hfi1_ctxt_info(uctxt->dd, uctxt->ctxt, fd->subctxt, cinfo);
+	trace_hfi1_ctxt_info(uctxt->dd, uctxt->ctxt, fd->subctxt, &cinfo);
 	if (copy_to_user((void __user *)arg, &cinfo, len))
 		return -EFAULT;
 
diff --git a/drivers/infiniband/hw/hfi1/trace_ctxts.h b/drivers/infiniband/hw/hfi1/trace_ctxts.h
index 4eb4cc798035..e00c8a7d559c 100644
--- a/drivers/infiniband/hw/hfi1/trace_ctxts.h
+++ b/drivers/infiniband/hw/hfi1/trace_ctxts.h
@@ -106,7 +106,7 @@ TRACE_EVENT(hfi1_uctxtdata,
 TRACE_EVENT(hfi1_ctxt_info,
 	    TP_PROTO(struct hfi1_devdata *dd, unsigned int ctxt,
 		     unsigned int subctxt,
-		     struct hfi1_ctxt_info cinfo),
+		     struct hfi1_ctxt_info *cinfo),
 	    TP_ARGS(dd, ctxt, subctxt, cinfo),
 	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
 			     __field(unsigned int, ctxt)
@@ -120,11 +120,11 @@ TRACE_EVENT(hfi1_ctxt_info,
 	    TP_fast_assign(DD_DEV_ASSIGN(dd);
 			    __entry->ctxt = ctxt;
 			    __entry->subctxt = subctxt;
-			    __entry->egrtids = cinfo.egrtids;
-			    __entry->rcvhdrq_cnt = cinfo.rcvhdrq_cnt;
-			    __entry->rcvhdrq_size = cinfo.rcvhdrq_entsize;
-			    __entry->sdma_ring_size = cinfo.sdma_ring_size;
-			    __entry->rcvegr_size = cinfo.rcvegr_size;
+			    __entry->egrtids = cinfo->egrtids;
+			    __entry->rcvhdrq_cnt = cinfo->rcvhdrq_cnt;
+			    __entry->rcvhdrq_size = cinfo->rcvhdrq_entsize;
+			    __entry->sdma_ring_size = cinfo->sdma_ring_size;
+			    __entry->rcvegr_size = cinfo->rcvegr_size;
 			    ),
 	    TP_printk("[%s] ctxt %u:%u " CINFO_FMT,
 		      __get_str(dev),
diff --git a/drivers/s390/cio/ioasm.c b/drivers/s390/cio/ioasm.c
index 4fa9ee1d09fa..0aecb6314e6f 100644
--- a/drivers/s390/cio/ioasm.c
+++ b/drivers/s390/cio/ioasm.c
@@ -35,7 +35,7 @@ int stsch(struct subchannel_id schid, struct schib *addr)
 	int ccode;
 
 	ccode = __stsch(schid, addr);
-	trace_s390_cio_stsch(schid, addr, ccode);
+	trace_s390_cio_stsch(&schid, addr, ccode);
 
 	return ccode;
 }
@@ -63,7 +63,7 @@ int msch(struct subchannel_id schid, struct schib *addr)
 	int ccode;
 
 	ccode = __msch(schid, addr);
-	trace_s390_cio_msch(schid, addr, ccode);
+	trace_s390_cio_msch(&schid, addr, ccode);
 
 	return ccode;
 }
@@ -88,7 +88,7 @@ int tsch(struct subchannel_id schid, struct irb *addr)
 	int ccode;
 
 	ccode = __tsch(schid, addr);
-	trace_s390_cio_tsch(schid, addr, ccode);
+	trace_s390_cio_tsch(&schid, addr, ccode);
 
 	return ccode;
 }
@@ -115,7 +115,7 @@ int ssch(struct subchannel_id schid, union orb *addr)
 	int ccode;
 
 	ccode = __ssch(schid, addr);
-	trace_s390_cio_ssch(schid, addr, ccode);
+	trace_s390_cio_ssch(&schid, addr, ccode);
 
 	return ccode;
 }
@@ -141,7 +141,7 @@ int csch(struct subchannel_id schid)
 	int ccode;
 
 	ccode = __csch(schid);
-	trace_s390_cio_csch(schid, ccode);
+	trace_s390_cio_csch(&schid, ccode);
 
 	return ccode;
 }
@@ -202,7 +202,7 @@ int rchp(struct chp_id chpid)
 	int ccode;
 
 	ccode = __rchp(chpid);
-	trace_s390_cio_rchp(chpid, ccode);
+	trace_s390_cio_rchp(&chpid, ccode);
 
 	return ccode;
 }
@@ -228,7 +228,7 @@ int rsch(struct subchannel_id schid)
 	int ccode;
 
 	ccode = __rsch(schid);
-	trace_s390_cio_rsch(schid, ccode);
+	trace_s390_cio_rsch(&schid, ccode);
 
 	return ccode;
 }
@@ -253,7 +253,7 @@ int hsch(struct subchannel_id schid)
 	int ccode;
 
 	ccode = __hsch(schid);
-	trace_s390_cio_hsch(schid, ccode);
+	trace_s390_cio_hsch(&schid, ccode);
 
 	return ccode;
 }
@@ -278,7 +278,7 @@ int xsch(struct subchannel_id schid)
 	int ccode;
 
 	ccode = __xsch(schid);
-	trace_s390_cio_xsch(schid, ccode);
+	trace_s390_cio_xsch(&schid, ccode);
 
 	return ccode;
 }
diff --git a/drivers/s390/cio/trace.h b/drivers/s390/cio/trace.h
index 1f8d1c1e566d..4aa6d1426106 100644
--- a/drivers/s390/cio/trace.h
+++ b/drivers/s390/cio/trace.h
@@ -22,7 +22,7 @@
 #include <linux/tracepoint.h>
 
 DECLARE_EVENT_CLASS(s390_class_schib,
-	TP_PROTO(struct subchannel_id schid, struct schib *schib, int cc),
+	TP_PROTO(struct subchannel_id *schid, struct schib *schib, int cc),
 	TP_ARGS(schid, schib, cc),
 	TP_STRUCT__entry(
 		__field(u8, cssid)
@@ -33,9 +33,9 @@ DECLARE_EVENT_CLASS(s390_class_schib,
 		__field(int, cc)
 	),
 	TP_fast_assign(
-		__entry->cssid = schid.cssid;
-		__entry->ssid = schid.ssid;
-		__entry->schno = schid.sch_no;
+		__entry->cssid = schid->cssid;
+		__entry->ssid = schid->ssid;
+		__entry->schno = schid->sch_no;
 		__entry->devno = schib->pmcw.dev;
 		__entry->schib = *schib;
 		__entry->cc = cc;
@@ -60,7 +60,7 @@ DECLARE_EVENT_CLASS(s390_class_schib,
  * @cc: Condition code
  */
 DEFINE_EVENT(s390_class_schib, s390_cio_stsch,
-	TP_PROTO(struct subchannel_id schid, struct schib *schib, int cc),
+	TP_PROTO(struct subchannel_id *schid, struct schib *schib, int cc),
 	TP_ARGS(schid, schib, cc)
 );
 
@@ -71,7 +71,7 @@ DEFINE_EVENT(s390_class_schib, s390_cio_stsch,
  * @cc: Condition code
  */
 DEFINE_EVENT(s390_class_schib, s390_cio_msch,
-	TP_PROTO(struct subchannel_id schid, struct schib *schib, int cc),
+	TP_PROTO(struct subchannel_id *schid, struct schib *schib, int cc),
 	TP_ARGS(schid, schib, cc)
 );
 
@@ -82,7 +82,7 @@ DEFINE_EVENT(s390_class_schib, s390_cio_msch,
  * @cc: Condition code
  */
 TRACE_EVENT(s390_cio_tsch,
-	TP_PROTO(struct subchannel_id schid, struct irb *irb, int cc),
+	TP_PROTO(struct subchannel_id *schid, struct irb *irb, int cc),
 	TP_ARGS(schid, irb, cc),
 	TP_STRUCT__entry(
 		__field(u8, cssid)
@@ -92,9 +92,9 @@ TRACE_EVENT(s390_cio_tsch,
 		__field(int, cc)
 	),
 	TP_fast_assign(
-		__entry->cssid = schid.cssid;
-		__entry->ssid = schid.ssid;
-		__entry->schno = schid.sch_no;
+		__entry->cssid = schid->cssid;
+		__entry->ssid = schid->ssid;
+		__entry->schno = schid->sch_no;
 		__entry->irb = *irb;
 		__entry->cc = cc;
 	),
@@ -151,7 +151,7 @@ TRACE_EVENT(s390_cio_tpi,
  * @cc: Condition code
  */
 TRACE_EVENT(s390_cio_ssch,
-	TP_PROTO(struct subchannel_id schid, union orb *orb, int cc),
+	TP_PROTO(struct subchannel_id *schid, union orb *orb, int cc),
 	TP_ARGS(schid, orb, cc),
 	TP_STRUCT__entry(
 		__field(u8, cssid)
@@ -161,9 +161,9 @@ TRACE_EVENT(s390_cio_ssch,
 		__field(int, cc)
 	),
 	TP_fast_assign(
-		__entry->cssid = schid.cssid;
-		__entry->ssid = schid.ssid;
-		__entry->schno = schid.sch_no;
+		__entry->cssid = schid->cssid;
+		__entry->ssid = schid->ssid;
+		__entry->schno = schid->sch_no;
 		__entry->orb = *orb;
 		__entry->cc = cc;
 	),
@@ -173,7 +173,7 @@ TRACE_EVENT(s390_cio_ssch,
 );
 
 DECLARE_EVENT_CLASS(s390_class_schid,
-	TP_PROTO(struct subchannel_id schid, int cc),
+	TP_PROTO(struct subchannel_id *schid, int cc),
 	TP_ARGS(schid, cc),
 	TP_STRUCT__entry(
 		__field(u8, cssid)
@@ -182,9 +182,9 @@ DECLARE_EVENT_CLASS(s390_class_schid,
 		__field(int, cc)
 	),
 	TP_fast_assign(
-		__entry->cssid = schid.cssid;
-		__entry->ssid = schid.ssid;
-		__entry->schno = schid.sch_no;
+		__entry->cssid = schid->cssid;
+		__entry->ssid = schid->ssid;
+		__entry->schno = schid->sch_no;
 		__entry->cc = cc;
 	),
 	TP_printk("schid=%x.%x.%04x cc=%d", __entry->cssid, __entry->ssid,
@@ -198,7 +198,7 @@ DECLARE_EVENT_CLASS(s390_class_schid,
  * @cc: Condition code
  */
 DEFINE_EVENT(s390_class_schid, s390_cio_csch,
-	TP_PROTO(struct subchannel_id schid, int cc),
+	TP_PROTO(struct subchannel_id *schid, int cc),
 	TP_ARGS(schid, cc)
 );
 
@@ -208,7 +208,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_csch,
  * @cc: Condition code
  */
 DEFINE_EVENT(s390_class_schid, s390_cio_hsch,
-	TP_PROTO(struct subchannel_id schid, int cc),
+	TP_PROTO(struct subchannel_id *schid, int cc),
 	TP_ARGS(schid, cc)
 );
 
@@ -218,7 +218,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_hsch,
  * @cc: Condition code
  */
 DEFINE_EVENT(s390_class_schid, s390_cio_xsch,
-	TP_PROTO(struct subchannel_id schid, int cc),
+	TP_PROTO(struct subchannel_id *schid, int cc),
 	TP_ARGS(schid, cc)
 );
 
@@ -228,7 +228,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_xsch,
  * @cc: Condition code
  */
 DEFINE_EVENT(s390_class_schid, s390_cio_rsch,
-	TP_PROTO(struct subchannel_id schid, int cc),
+	TP_PROTO(struct subchannel_id *schid, int cc),
 	TP_ARGS(schid, cc)
 );
 
@@ -238,7 +238,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_rsch,
  * @cc: Condition code
  */
 TRACE_EVENT(s390_cio_rchp,
-	TP_PROTO(struct chp_id chpid, int cc),
+	TP_PROTO(struct chp_id *chpid, int cc),
 	TP_ARGS(chpid, cc),
 	TP_STRUCT__entry(
 		__field(u8, cssid)
@@ -246,8 +246,8 @@ TRACE_EVENT(s390_cio_rchp,
 		__field(int, cc)
 	),
 	TP_fast_assign(
-		__entry->cssid = chpid.cssid;
-		__entry->id = chpid.id;
+		__entry->cssid = chpid->cssid;
+		__entry->id = chpid->id;
 		__entry->cc = cc;
 	),
 	TP_printk("chpid=%x.%02x cc=%d", __entry->cssid, __entry->id,
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..6d03ead8e788 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1429,7 +1429,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			goto finish_iomap;
 		}
 
-		trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
+		trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, &pfn, entry);
 		result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
 					    write);
 		break;
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 06c87f9f720c..795698925d20 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -491,7 +491,7 @@ DEFINE_EVENT(f2fs__truncate_node, f2fs_truncate_node,
 
 TRACE_EVENT(f2fs_truncate_partial_nodes,
 
-	TP_PROTO(struct inode *inode, nid_t nid[], int depth, int err),
+	TP_PROTO(struct inode *inode, nid_t *nid, int depth, int err),
 
 	TP_ARGS(inode, nid, depth, err),
 
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index 97b09fcf7e52..5a6a8285750f 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -104,7 +104,7 @@ DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole_fallback);
 
 DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
 	TP_PROTO(struct inode *inode, struct vm_fault *vmf,
-		long length, pfn_t pfn, void *radix_entry),
+		long length, pfn_t *pfn, void *radix_entry),
 	TP_ARGS(inode, vmf, length, pfn, radix_entry),
 	TP_STRUCT__entry(
 		__field(unsigned long, ino)
@@ -123,7 +123,7 @@ DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
 		__entry->address = vmf->address;
 		__entry->write = vmf->flags & FAULT_FLAG_WRITE;
 		__entry->length = length;
-		__entry->pfn_val = pfn.val;
+		__entry->pfn_val = pfn->val;
 		__entry->radix_entry = radix_entry;
 	),
 	TP_printk("dev %d:%d ino %#lx %s %s address %#lx length %#lx "
@@ -145,7 +145,7 @@ DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
 #define DEFINE_PMD_INSERT_MAPPING_EVENT(name) \
 DEFINE_EVENT(dax_pmd_insert_mapping_class, name, \
 	TP_PROTO(struct inode *inode, struct vm_fault *vmf, \
-		long length, pfn_t pfn, void *radix_entry), \
+		long length, pfn_t *pfn, void *radix_entry), \
 	TP_ARGS(inode, vmf, length, pfn, radix_entry))
 
 DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping);
diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 0b50fda80db0..4b463294306f 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -436,7 +436,7 @@ TRACE_EVENT(rcu_fqs,
  */
 TRACE_EVENT(rcu_dyntick,
 
-	TP_PROTO(const char *polarity, long oldnesting, long newnesting, atomic_t dynticks),
+	TP_PROTO(const char *polarity, long oldnesting, long newnesting, atomic_t *dynticks),
 
 	TP_ARGS(polarity, oldnesting, newnesting, dynticks),
 
@@ -451,7 +451,7 @@ TRACE_EVENT(rcu_dyntick,
 		__entry->polarity = polarity;
 		__entry->oldnesting = oldnesting;
 		__entry->newnesting = newnesting;
-		__entry->dynticks = atomic_read(&dynticks);
+		__entry->dynticks = atomic_read(dynticks);
 	),
 
 	TP_printk("%s %lx %lx %#3x", __entry->polarity,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 7dd8f34c37df..ea9e9014f0c5 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -128,14 +128,14 @@ TRACE_EVENT(xen_mc_extend_args,
 TRACE_DEFINE_SIZEOF(pteval_t);
 /* mmu */
 DECLARE_EVENT_CLASS(xen_mmu__set_pte,
-	    TP_PROTO(pte_t *ptep, pte_t pteval),
+	    TP_PROTO(pte_t *ptep, pte_t *pteval),
 	    TP_ARGS(ptep, pteval),
 	    TP_STRUCT__entry(
 		    __field(pte_t *, ptep)
 		    __field(pteval_t, pteval)
 		    ),
 	    TP_fast_assign(__entry->ptep = ptep;
-			   __entry->pteval = pteval.pte),
+			   __entry->pteval = pteval->pte),
 	    TP_printk("ptep %p pteval %0*llx (raw %0*llx)",
 		      __entry->ptep,
 		      (int)sizeof(pteval_t) * 2, (unsigned long long)pte_val(native_make_pte(__entry->pteval)),
@@ -144,14 +144,14 @@ DECLARE_EVENT_CLASS(xen_mmu__set_pte,
 
 #define DEFINE_XEN_MMU_SET_PTE(name)				\
 	DEFINE_EVENT(xen_mmu__set_pte, name,			\
-		     TP_PROTO(pte_t *ptep, pte_t pteval),	\
+		     TP_PROTO(pte_t *ptep, pte_t *pteval),	\
 		     TP_ARGS(ptep, pteval))
 
 DEFINE_XEN_MMU_SET_PTE(xen_mmu_set_pte);
 
 TRACE_EVENT(xen_mmu_set_pte_at,
 	    TP_PROTO(struct mm_struct *mm, unsigned long addr,
-		     pte_t *ptep, pte_t pteval),
+		     pte_t *ptep, pte_t *pteval),
 	    TP_ARGS(mm, addr, ptep, pteval),
 	    TP_STRUCT__entry(
 		    __field(struct mm_struct *, mm)
@@ -162,7 +162,7 @@ TRACE_EVENT(xen_mmu_set_pte_at,
 	    TP_fast_assign(__entry->mm = mm;
 			   __entry->addr = addr;
 			   __entry->ptep = ptep;
-			   __entry->pteval = pteval.pte),
+			   __entry->pteval = pteval->pte),
 	    TP_printk("mm %p addr %lx ptep %p pteval %0*llx (raw %0*llx)",
 		      __entry->mm, __entry->addr, __entry->ptep,
 		      (int)sizeof(pteval_t) * 2, (unsigned long long)pte_val(native_make_pte(__entry->pteval)),
@@ -172,14 +172,14 @@ TRACE_EVENT(xen_mmu_set_pte_at,
 TRACE_DEFINE_SIZEOF(pmdval_t);
 
 TRACE_EVENT(xen_mmu_set_pmd,
-	    TP_PROTO(pmd_t *pmdp, pmd_t pmdval),
+	    TP_PROTO(pmd_t *pmdp, pmd_t *pmdval),
 	    TP_ARGS(pmdp, pmdval),
 	    TP_STRUCT__entry(
 		    __field(pmd_t *, pmdp)
 		    __field(pmdval_t, pmdval)
 		    ),
 	    TP_fast_assign(__entry->pmdp = pmdp;
-			   __entry->pmdval = pmdval.pmd),
+			   __entry->pmdval = pmdval->pmd),
 	    TP_printk("pmdp %p pmdval %0*llx (raw %0*llx)",
 		      __entry->pmdp,
 		      (int)sizeof(pmdval_t) * 2, (unsigned long long)pmd_val(native_make_pmd(__entry->pmdval)),
@@ -220,14 +220,14 @@ TRACE_EVENT(xen_mmu_pmd_clear,
 TRACE_DEFINE_SIZEOF(pudval_t);
 
 TRACE_EVENT(xen_mmu_set_pud,
-	    TP_PROTO(pud_t *pudp, pud_t pudval),
+	    TP_PROTO(pud_t *pudp, pud_t *pudval),
 	    TP_ARGS(pudp, pudval),
 	    TP_STRUCT__entry(
 		    __field(pud_t *, pudp)
 		    __field(pudval_t, pudval)
 		    ),
 	    TP_fast_assign(__entry->pudp = pudp;
-			   __entry->pudval = native_pud_val(pudval)),
+			   __entry->pudval = native_pud_val(*pudval)),
 	    TP_printk("pudp %p pudval %0*llx (raw %0*llx)",
 		      __entry->pudp,
 		      (int)sizeof(pudval_t) * 2, (unsigned long long)pud_val(native_make_pud(__entry->pudval)),
@@ -237,7 +237,7 @@ TRACE_EVENT(xen_mmu_set_pud,
 TRACE_DEFINE_SIZEOF(p4dval_t);
 
 TRACE_EVENT(xen_mmu_set_p4d,
-	    TP_PROTO(p4d_t *p4dp, p4d_t *user_p4dp, p4d_t p4dval),
+	    TP_PROTO(p4d_t *p4dp, p4d_t *user_p4dp, p4d_t *p4dval),
 	    TP_ARGS(p4dp, user_p4dp, p4dval),
 	    TP_STRUCT__entry(
 		    __field(p4d_t *, p4dp)
@@ -246,7 +246,7 @@ TRACE_EVENT(xen_mmu_set_p4d,
 		    ),
 	    TP_fast_assign(__entry->p4dp = p4dp;
 			   __entry->user_p4dp = user_p4dp;
-			   __entry->p4dval = p4d_val(p4dval)),
+			   __entry->p4dval = p4d_val(*p4dval)),
 	    TP_printk("p4dp %p user_p4dp %p p4dval %0*llx (raw %0*llx)",
 		      __entry->p4dp, __entry->user_p4dp,
 		      (int)sizeof(p4dval_t) * 2, (unsigned long long)pgd_val(native_make_pgd(__entry->p4dval)),
@@ -255,14 +255,14 @@ TRACE_EVENT(xen_mmu_set_p4d,
 #else
 
 TRACE_EVENT(xen_mmu_set_pud,
-	    TP_PROTO(pud_t *pudp, pud_t pudval),
+	    TP_PROTO(pud_t *pudp, pud_t *pudval),
 	    TP_ARGS(pudp, pudval),
 	    TP_STRUCT__entry(
 		    __field(pud_t *, pudp)
 		    __field(pudval_t, pudval)
 		    ),
 	    TP_fast_assign(__entry->pudp = pudp;
-			   __entry->pudval = native_pud_val(pudval)),
+			   __entry->pudval = native_pud_val(*pudval)),
 	    TP_printk("pudp %p pudval %0*llx (raw %0*llx)",
 		      __entry->pudp,
 		      (int)sizeof(pudval_t) * 2, (unsigned long long)pgd_val(native_make_pgd(__entry->pudval)),
@@ -273,7 +273,7 @@ TRACE_EVENT(xen_mmu_set_pud,
 
 DECLARE_EVENT_CLASS(xen_mmu_ptep_modify_prot,
 	    TP_PROTO(struct mm_struct *mm, unsigned long addr,
-		     pte_t *ptep, pte_t pteval),
+		     pte_t *ptep, pte_t *pteval),
 	    TP_ARGS(mm, addr, ptep, pteval),
 	    TP_STRUCT__entry(
 		    __field(struct mm_struct *, mm)
@@ -284,7 +284,7 @@ DECLARE_EVENT_CLASS(xen_mmu_ptep_modify_prot,
 	    TP_fast_assign(__entry->mm = mm;
 			   __entry->addr = addr;
 			   __entry->ptep = ptep;
-			   __entry->pteval = pteval.pte),
+			   __entry->pteval = pteval->pte),
 	    TP_printk("mm %p addr %lx ptep %p pteval %0*llx (raw %0*llx)",
 		      __entry->mm, __entry->addr, __entry->ptep,
 		      (int)sizeof(pteval_t) * 2, (unsigned long long)pte_val(native_make_pte(__entry->pteval)),
@@ -293,7 +293,7 @@ DECLARE_EVENT_CLASS(xen_mmu_ptep_modify_prot,
 #define DEFINE_XEN_MMU_PTEP_MODIFY_PROT(name)				\
 	DEFINE_EVENT(xen_mmu_ptep_modify_prot, name,			\
 		     TP_PROTO(struct mm_struct *mm, unsigned long addr,	\
-			      pte_t *ptep, pte_t pteval),		\
+			      pte_t *ptep, pte_t *pteval),		\
 		     TP_ARGS(mm, addr, ptep, pteval))
 
 DEFINE_XEN_MMU_PTEP_MODIFY_PROT(xen_mmu_ptep_modify_prot_start);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 491bdf39f276..43c0f899f78c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -772,7 +772,7 @@ static void rcu_eqs_enter(bool user)
 	}
 
 	lockdep_assert_irqs_disabled();
-	trace_rcu_dyntick(TPS("Start"), rdtp->dynticks_nesting, 0, rdtp->dynticks);
+	trace_rcu_dyntick(TPS("Start"), rdtp->dynticks_nesting, 0, &rdtp->dynticks);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	for_each_rcu_flavor(rsp) {
 		rdp = this_cpu_ptr(rsp->rda);
@@ -848,14 +848,14 @@ void rcu_nmi_exit(void)
 	 * leave it in non-RCU-idle state.
 	 */
 	if (rdtp->dynticks_nmi_nesting != 1) {
-		trace_rcu_dyntick(TPS("--="), rdtp->dynticks_nmi_nesting, rdtp->dynticks_nmi_nesting - 2, rdtp->dynticks);
+		trace_rcu_dyntick(TPS("--="), rdtp->dynticks_nmi_nesting, rdtp->dynticks_nmi_nesting - 2, &rdtp->dynticks);
 		WRITE_ONCE(rdtp->dynticks_nmi_nesting, /* No store tearing. */
 			   rdtp->dynticks_nmi_nesting - 2);
 		return;
 	}
 
 	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
-	trace_rcu_dyntick(TPS("Startirq"), rdtp->dynticks_nmi_nesting, 0, rdtp->dynticks);
+	trace_rcu_dyntick(TPS("Startirq"), rdtp->dynticks_nmi_nesting, 0, &rdtp->dynticks);
 	WRITE_ONCE(rdtp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
 	rcu_dynticks_eqs_enter();
 }
@@ -930,7 +930,7 @@ static void rcu_eqs_exit(bool user)
 	rcu_dynticks_task_exit();
 	rcu_dynticks_eqs_exit();
 	rcu_cleanup_after_idle();
-	trace_rcu_dyntick(TPS("End"), rdtp->dynticks_nesting, 1, rdtp->dynticks);
+	trace_rcu_dyntick(TPS("End"), rdtp->dynticks_nesting, 1, &rdtp->dynticks);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	WRITE_ONCE(rdtp->dynticks_nesting, 1);
 	WRITE_ONCE(rdtp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
@@ -1004,7 +1004,7 @@ void rcu_nmi_enter(void)
 	}
 	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
 			  rdtp->dynticks_nmi_nesting,
-			  rdtp->dynticks_nmi_nesting + incby, rdtp->dynticks);
+			  rdtp->dynticks_nmi_nesting + incby, &rdtp->dynticks);
 	WRITE_ONCE(rdtp->dynticks_nmi_nesting, /* Prevent store tearing. */
 		   rdtp->dynticks_nmi_nesting + incby);
 	barrier();
diff --git a/net/wireless/trace.h b/net/wireless/trace.h
index 5152938b358d..018c81fa72fb 100644
--- a/net/wireless/trace.h
+++ b/net/wireless/trace.h
@@ -3137,7 +3137,7 @@ TRACE_EVENT(rdev_start_radar_detection,
 
 TRACE_EVENT(rdev_set_mcast_rate,
 	TP_PROTO(struct wiphy *wiphy, struct net_device *netdev,
-		 int mcast_rate[NUM_NL80211_BANDS]),
+		 int *mcast_rate),
 	TP_ARGS(wiphy, netdev, mcast_rate),
 	TP_STRUCT__entry(
 		WIPHY_ENTRY
diff --git a/sound/firewire/amdtp-stream-trace.h b/sound/firewire/amdtp-stream-trace.h
index ea0d486652c8..54cdd4ffa9ce 100644
--- a/sound/firewire/amdtp-stream-trace.h
+++ b/sound/firewire/amdtp-stream-trace.h
@@ -14,7 +14,7 @@
 #include <linux/tracepoint.h>
 
 TRACE_EVENT(in_packet,
-	TP_PROTO(const struct amdtp_stream *s, u32 cycles, u32 cip_header[2], unsigned int payload_length, unsigned int index),
+	TP_PROTO(const struct amdtp_stream *s, u32 cycles, u32 *cip_header, unsigned int payload_length, unsigned int index),
 	TP_ARGS(s, cycles, cip_header, payload_length, index),
 	TP_STRUCT__entry(
 		__field(unsigned int, second)
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 2/8] net/mediatek: disambiguate mt76 vs mt7601u trace events
From: Alexei Starovoitov @ 2018-03-21 18:54 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-1-ast@fb.com>

From: Alexei Starovoitov <ast@kernel.org>

two trace events defined with the same name and both unused.
They conflict in allyesconfig build. Rename one of them.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 drivers/net/wireless/mediatek/mt7601u/trace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt7601u/trace.h b/drivers/net/wireless/mediatek/mt7601u/trace.h
index 289897300ef0..82c8898b9076 100644
--- a/drivers/net/wireless/mediatek/mt7601u/trace.h
+++ b/drivers/net/wireless/mediatek/mt7601u/trace.h
@@ -34,7 +34,7 @@
 #define REG_PR_FMT	"%04x=%08x"
 #define REG_PR_ARG	__entry->reg, __entry->val
 
-DECLARE_EVENT_CLASS(dev_reg_evt,
+DECLARE_EVENT_CLASS(dev_reg_evtu,
 	TP_PROTO(struct mt7601u_dev *dev, u32 reg, u32 val),
 	TP_ARGS(dev, reg, val),
 	TP_STRUCT__entry(
@@ -51,12 +51,12 @@ DECLARE_EVENT_CLASS(dev_reg_evt,
 	)
 );
 
-DEFINE_EVENT(dev_reg_evt, reg_read,
+DEFINE_EVENT(dev_reg_evtu, reg_read,
 	TP_PROTO(struct mt7601u_dev *dev, u32 reg, u32 val),
 	TP_ARGS(dev, reg, val)
 );
 
-DEFINE_EVENT(dev_reg_evt, reg_write,
+DEFINE_EVENT(dev_reg_evtu, reg_write,
 	TP_PROTO(struct mt7601u_dev *dev, u32 reg, u32 val),
 	TP_ARGS(dev, reg, val)
 );
-- 
2.9.5

^ permalink raw reply related

* Re: [bug, bisected] pfifo_fast causes packet reordering
From: John Fastabend @ 2018-03-21 18:43 UTC (permalink / raw)
  To: Jakob Unterwurzacher, Dave Taht
  Cc: netdev, linux-kernel, David S. Miller, linux-can@vger.kernel.org,
	Martin Elshuber
In-Reply-To: <00cc2d41-6861-9a9c-603f-ba8013b2e2ce@theobroma-systems.com>

On 03/21/2018 03:01 AM, Jakob Unterwurzacher wrote:
> On 16.03.18 11:26, Jakob Unterwurzacher wrote:
>> On 15.03.18 23:30, John Fastabend wrote:
>>>> I have reproduced it using two USB network cards connected to each other. The test tool sends UDP packets containing a counter and listens on the other interface, it is available at
>>>> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
>>>
>>> Great thanks, can you also run this with taskset to bind to
>>> a single CPU,
>>>
>>>   # taskset 0x1 ./pifof_stress.py
>>>
>>> And let me know if you still see the OOO.
>>
>> Interesting. Looks like it depends on which core it runs on. CPU0 is clean, CPU1 is not.
> 
> So we are at v4.16-rc6 now - have you managed to reproduce this is or should I try to get the revert correct?

I have a theory on what is going on here. Because we now run without
locks we can have multiple qdisc_run() calls running in parallel. Possible
if we send packets using multiple cores.

   --- application ---
     cpu0    cpu1
        |     |
        |     |
    enqueue enqueue   
        |     |
      pfifo_fast
        |     |
    dequeue   dequeue
         \    /
        ndo_xmit

The skb->ooo_okay flag will keep the enqueue side packets in
order. So that is covered.

But on the dequeue side if two cores dequeue in parallel we will
race to ndo ops to ensure packets are in order. Rarely, I guess the
second dequeue could actually call ndo hook before first dequeued
packet. Because usually the dequeue happens on the same queue the
enqueue happened on we don't see this very often. But there seems
to be a case where the driver calls netif_tx_wake_queue() on a
different core (from the rx interrupt context). The wake queue call
then eventually runs the dequeue on a different core. So when taskset
is aligned with the interrupt everything is in-order when it is moved
to a different core we see the OOO.

Thats my theory at least. Are you able to test a patch if I generate
one to fix this?

FWIW the revert on this is trivial, but I think we can fix this
without too much work. Also, if you had a driver tx queue per core
this would not be an issue. 

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f..171f470 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -792,7 +792,6 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
        .dump           =       pfifo_fast_dump,
        .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
        .owner          =       THIS_MODULE,
-       .static_flags   =       TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
 };
 EXPORT_SYMBOL(pfifo_fast_ops);

> 
> Best regards,
> Jakob

^ permalink raw reply related

* Re: [PATCH 1/2] bpf: Remove struct bpf_verifier_env argument from print_bpf_insn
From: Jiri Olsa @ 2018-03-21 18:37 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, lkml, netdev
In-Reply-To: <d28f0a7f-3721-88f3-f4e9-97a73a8833e5@netronome.com>

On Wed, Mar 21, 2018 at 05:25:33PM +0000, Quentin Monnet wrote:
> 2018-03-21 16:02 UTC+0100 ~ Jiri Olsa <jolsa@kernel.org>
> > We use print_bpf_insn in user space (bpftool and soon perf),
> > so it'd be nice to keep it generic and strip it off the kernel
> > struct bpf_verifier_env argument.
> > 
> > This argument can be safely removed, because its users can
> > use the struct bpf_insn_cbs::private_data to pass it.
> > 
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  kernel/bpf/disasm.c   | 52 +++++++++++++++++++++++++--------------------------
> >  kernel/bpf/disasm.h   |  5 +----
> >  kernel/bpf/verifier.c |  6 +++---
> >  3 files changed, 30 insertions(+), 33 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index c6eff108aa99..9f27d3fa7259 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -202,8 +202,7 @@ EXPORT_SYMBOL_GPL(bpf_verifier_log_write);
> >   * generic for symbol export. The function was renamed, but not the calls in
> >   * the verifier to avoid complicating backports. Hence the alias below.
> >   */
> > -static __printf(2, 3) void verbose(struct bpf_verifier_env *env,
> > -				   const char *fmt, ...)
> > +static __printf(2, 3) void verbose(void *private_data, const char *fmt, ...)
> >  	__attribute__((alias("bpf_verifier_log_write")));
> 
> Just as a note, verbose() will be aliased to a function whose prototype
> differs (bpf_verifier_log_write() still expects a struct
> bpf_verifier_env as its first argument). I am not so familiar with
> function aliases, could this change be a concern?

yea, but as it was pointer for pointer switch I did not
see any problem with that.. I'll check more

jirka

^ permalink raw reply

* Re: [PATCH net-next] Documentation/networking: Add net DIM documentation
From: Tal Gilboa @ 2018-03-21 18:35 UTC (permalink / raw)
  To: Randy Dunlap, Marcelo Ricardo Leitner
  Cc: David S. Miller, netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <3d949ca4-45c4-e349-b3ff-946561b9ef17@infradead.org>

On 3/21/2018 8:33 PM, Randy Dunlap wrote:
> On 03/21/2018 11:20 AM, Marcelo Ricardo Leitner wrote:
>> On Wed, Mar 21, 2018 at 11:30:29AM +0200, Tal Gilboa wrote:
>> ...
>>> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt
>>> +moderation configuration of a channel in order to optimize packet processing.
>>> +The mechanism includes an algorithm which decides if and how to change
>>> +moderation parameters for a channel, usually by performing an analysis on
>>> +runtime data sampled from the system. Net DIM is such a mechanism. In each
>>> +iteration of the algorithm, it analyses a given sample of the data, compares it to
>>> +the previous sample and if required, is can decide to change some of the interrupt moderation
>>
>> Some lines are above 80 chars. Please format the text so they don't exceed it.
> 
> Agreed.
> 
> thanks,
> 
lines' length and spelling mistakes fixed. See V2.
Thanks for the feedback.

^ permalink raw reply

* [PATCH net-next V2] Documentation/networking: Add net DIM documentation
From: Tal Gilboa @ 2018-03-21 18:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan, Tal Gilboa

Net DIM is a generic algorithm, purposed for dynamically
optimizing network devices interrupt moderation. This
document describes how it works and how to use it.

Signed-off-by: Tal Gilboa <talgi@mellanox.com>
---
 Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 Documentation/networking/net_dim.txt

diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
new file mode 100644
index 0000000..9cb31c5
--- /dev/null
+++ b/Documentation/networking/net_dim.txt
@@ -0,0 +1,174 @@
+Net DIM - Generic Network Dynamic Interrupt Moderation
+======================================================
+
+Author:
+	Tal Gilboa <talgi@mellanox.com>
+
+
+Contents
+=========
+
+- Assumptions
+- Introduction
+- The Net DIM Algorithm
+- Registering a Network Device to DIM
+- Example
+
+Part 0: Assumptions
+======================
+
+This document assumes the reader has basic knowledge in network drivers
+and in general interrupt moderation.
+
+
+Part I: Introduction
+======================
+
+Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
+interrupt moderation configuration of a channel in order to optimize packet
+processing. The mechanism includes an algorithm which decides if and how to
+change moderation parameters for a channel, usually by performing an analysis on
+runtime data sampled from the system. Net DIM is such a mechanism. In each
+iteration of the algorithm, it analyses a given sample of the data, compares it
+to the previous sample and if required, it can decide to change some of the
+interrupt moderation configuration fields. The data sample is composed of data
+bandwidth, the number of packets and the number of events. The time between
+samples is also measured. Net DIM compares the current and the previous data and
+returns an adjusted interrupt moderation configuration object. In some cases,
+the algorithm might decide not to change anything. The configuration fields are
+the minimum duration (microseconds) allowed between events and the maximum
+number of wanted packets per event. The Net DIM algorithm ascribes importance to
+increase bandwidth over reducing interrupt rate.
+
+
+Part II: The Net DIM Algorithm
+===============================
+
+Each iteration of the Net DIM algorithm follows these steps:
+1. Calculates new data sample.
+2. Compares it to previous sample.
+3. Makes a decision - suggests interrupt moderation configuration fields.
+4. Applies a schedule work function, which applies suggested configuration.
+
+The first two steps are straightforward, both the new and the previous data are
+supplied by the driver registered to Net DIM. The previous data is the new data
+supplied to the previous iteration. The comparison step checks the difference
+between the new and previous data and decides on the result of the last step.
+A step would result as "better" if bandwidth increases and as "worse" if
+bandwidth reduces. If there is no change in bandwidth, the packet rate is
+compared in a similar fashion - increase == "better" and decrease == "worse".
+In case there is no change in the packet rate as well, the interrupt rate is
+compared. Here the algorithm tries to optimize for lower interrupt rate so an
+increase in the interrupt rate is considered "worse" and a decrease is
+considered "better". Step #2 has an optimization for avoiding false results: it
+only considers a difference between samples as valid if it is greater than a
+certain percentage. Also, since Net DIM does not measure anything by itself, it
+assumes the data provided by the driver is valid.
+
+Step #3 decides on the suggested configuration based on the result from step #2
+and the internal state of the algorithm. The states reflect the "direction" of
+the algorithm: is it going left (reducing moderation), right (increasing
+moderation) or standing still. Another optimization is that if a decision
+to stay still is made multiple times, the interval between iterations of the
+algorithm would increase in order to reduce calculation overhead. Also, after
+"parking" on one of the most left or most right decisions, the algorithm may
+decide to verify this decision by taking a step in the other direction. This is
+done in order to avoid getting stuck in a "deep sleep" scenario. Once a
+decision is made, an interrupt moderation configuration is selected from
+the predefined profiles.
+
+The last step is to notify the registered driver that it should apply the
+suggested configuration. This is done by scheduling a work function, defined by
+the Net DIM API and provided by the registered driver.
+
+As you can see, Net DIM itself does not actively interact with the system. It
+would have trouble making the correct decisions if the wrong data is supplied to
+it and it would be useless if the work function would not apply the suggested
+configuration. This does, however, allow the registered driver some room for
+manoeuvre as it may provide partial data or ignore the algorithm suggestion
+under some conditions.
+
+
+Part III: Registering a Network Device to DIM
+==============================================
+
+Net DIM API exposes the main function net_dim(struct net_dim *dim,
+struct net_dim_sample end_sample). This function is the entry point to the Net
+DIM algorithm and has to be called every time the driver would like to check if
+it should change interrupt moderation parameters. The driver should provide two
+data structures: struct net_dim and struct net_dim_sample. Struct net_dim
+describes the state of DIM for a specific object (RX queue, TX queue,
+other queues, etc.). This includes the current selected profile, previous data
+samples, the callback function provided by the driver and more.
+Struct net_dim_sample describes a data sample, which will be compared to the
+data sample stored in struct net_dim in order to decide on the algorithm's next
+step. The sample should include bytes, packets and interrupts, measured by
+the driver.
+
+In order to use Net DIM from a networking driver, the driver needs to call the
+main net_dim() function. The recommended method is to call net_dim() on each
+interrupt. Since Net DIM has a built-in moderation and it might decide to skip
+iterations under certain conditions, there is no need to moderate the net_dim()
+calls as well. As mentioned above, the driver needs to provide an object of type
+struct net_dim to the net_dim() function call. It is advised for each entity
+using Net DIM to hold a struct net_dim as part of its data structure and use it
+as the main Net DIM API object. The struct net_dim_sample should hold the latest
+bytes, packets and interrupts count. No need to perform any calculations, just
+include the raw data.
+
+The net_dim() call itself does not return anything. Instead Net DIM relies on
+the driver to provide a callback function, which is called when the algorithm
+decides to make a change in the interrupt moderation parameters. This callback
+will be scheduled and run in a separate thread in order not to add overhead to
+the data flow. After the work is done, Net DIM algorithm needs to be set to
+the proper state in order to move to the next iteration.
+
+
+Part IV: Example
+=================
+
+The following code demonstrates how to register a driver to Net DIM. The actual
+usage is not complete but it should make the outline of the usage clear.
+
+my_driver.c:
+
+#include <linux/net_dim.h>
+
+/* Callback for net DIM to schedule on a decision to change moderation */
+void my_driver_do_dim_work(struct work_struct *work)
+{
+	/* Get struct net_dim from struct work_struct */
+	struct net_dim *dim = container_of(work, struct net_dim,
+					   work);
+	/* Do interrupt moderation related stuff */
+	...
+
+	/* Signal net DIM work is done and it should move to next iteration */
+	dim->state = NET_DIM_START_MEASURE;
+}
+
+/* My driver's interrupt handler */
+int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
+{
+	...
+	/* A struct to hold current measured data */
+	struct net_dim_sample dim_sample;
+	...
+	/* Initiate data sample struct with current data */
+	net_dim_sample(my_entity->events,
+		       my_entity->packets,
+		       my_entity->bytes,
+		       &dim_sample);
+	/* Call net DIM */
+	net_dim(&my_entity->dim, dim_sample);
+	...
+}
+
+/* My entity's initialization function (my_entity was already allocated) */
+int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
+{
+	...
+	/* Initiate struct work_struct with my driver's callback function */
+	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
+	...
+}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next] Documentation/networking: Add net DIM documentation
From: Randy Dunlap @ 2018-03-21 18:33 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Tal Gilboa
  Cc: David S. Miller, netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <20180321182028.GC4831@localhost.localdomain>

On 03/21/2018 11:20 AM, Marcelo Ricardo Leitner wrote:
> On Wed, Mar 21, 2018 at 11:30:29AM +0200, Tal Gilboa wrote:
> ...
>> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt
>> +moderation configuration of a channel in order to optimize packet processing.
>> +The mechanism includes an algorithm which decides if and how to change
>> +moderation parameters for a channel, usually by performing an analysis on
>> +runtime data sampled from the system. Net DIM is such a mechanism. In each
>> +iteration of the algorithm, it analyses a given sample of the data, compares it to
>> +the previous sample and if required, is can decide to change some of the interrupt moderation
> 
> Some lines are above 80 chars. Please format the text so they don't exceed it.

Agreed.

thanks,
-- 
~Randy

^ permalink raw reply

* Re: [PATCH net-next] Documentation/networking: Add net DIM documentation
From: Marcelo Ricardo Leitner @ 2018-03-21 18:20 UTC (permalink / raw)
  To: Tal Gilboa; +Cc: David S. Miller, netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <1521624629-46666-1-git-send-email-talgi@mellanox.com>

On Wed, Mar 21, 2018 at 11:30:29AM +0200, Tal Gilboa wrote:
...
> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt
> +moderation configuration of a channel in order to optimize packet processing.
> +The mechanism includes an algorithm which decides if and how to change
> +moderation parameters for a channel, usually by performing an analysis on
> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> +iteration of the algorithm, it analyses a given sample of the data, compares it to
> +the previous sample and if required, is can decide to change some of the interrupt moderation

Some lines are above 80 chars. Please format the text so they don't exceed it.

  M.

^ permalink raw reply

* Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Linus Torvalds @ 2018-03-21 18:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, David Laight, Rahul Lakkireddy, x86@kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	mingo@redhat.com, hpa@zytor.com, davem@davemloft.net,
	akpm@linux-foundation.org, ganeshgr@chelsio.com,
	nirranjan@chelsio.com, indranil@chelsio.com, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, Eric Biggers
In-Reply-To: <20180321074634.dzpyjz3ia46snodh@gmail.com>

On Wed, Mar 21, 2018 at 12:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> So I added a bit of instrumentation and the current state of things is that on
> 64-bit x86 every single task has an initialized FPU, every task has the exact
> same, fully filled in xfeatures (XINUSE) value:

Bah. Your CPU is apparently some old crud that barely has AVX. Of
course all those features are enabled.

> Note that this is with an AVX (128-bit) supporting CPU:

That's weak even by modern standards.

I have MPX bounds and MPX CSR on my old Skylake.

And the real worry is things like AVX-512 etc, which is exactly when
things like "save and restore one ymm register" will quite likely
clear the upper bits of the zmm register.

And yes, we can have some statically patched code that takes that into
account, and saves the whole zmm register when AVX512 is on, but the
whole *point* of the dynamic XSAVES thing is actually that Intel wants
to be able enable new user-space features without having to wait for
OS support. Literally. That's why and how it was designed.

And saving a couple of zmm registers is actually pretty hard. They're
big. Do you want to allocate 128 bytes of stack space, preferably
64-byte aligned, for a save area? No. So now it needs to be some kind
of per-thread (or maybe per-CPU, if we're willing to continue to not
preempt) special save area too.

And even then, it doesn't solve the real worry of "maybe there will be
odd interactions with future extensions that we don't even know of".

All this to do a 32-byte PIO access, with absolutely zero data right
now on what the win is?

Yes, yes, I can find an Intel white-paper that talks about setting WC
and then using xmm and ymm instructions to write a single 64-byte
burst over PCIe, and I assume that is where the code and idea came
from. But I don't actually see any reason why a burst of 8 regular
quad-word bytes wouldn't cause a 64-byte burst write too.

So right now this is for _one_ odd rdma controller, with absolutely
_zero_ performance numbers, and a very high likelihood that it does
not matter in the least.

And if there are some atomicity concerns ("we need to guarantee a
single atomic access for race conditions with the hardware"), they are
probably bogus and misplaced anyway, since

 (a) we can't guarantee that AVX2 exists in the first place

 (b) we can't guarantee that %ymm register write will show up on any
bus as a single write transaction anyway

So as far as I can tell, there are basically *zero* upsides, and a lot
of potential downsides.

                Linus

^ permalink raw reply

* [PATCH net-next V1] Documentation/networking: Add net DIM documentation
From: Tal Gilboa @ 2018-03-21 17:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan, Tal Gilboa

Net DIM is a generic algorithm, purposed for dynamically
optimizing network devices interrupt moderation. This
document describes how it works and how to use it.

Signed-off-by: Tal Gilboa <talgi@mellanox.com>
---
 Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 Documentation/networking/net_dim.txt

diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
new file mode 100644
index 0000000..823748b
--- /dev/null
+++ b/Documentation/networking/net_dim.txt
@@ -0,0 +1,174 @@
+Net DIM - Generic Network Dynamic Interrupt Moderation
+======================================================
+
+Author:
+	Tal Gilboa <talgi@mellanox.com>
+
+
+Contents
+=========
+
+- Assumptions
+- Introduction
+- The Net DIM Algorithm
+- Registering a Network Device to DIM
+- Example
+
+Part 0: Assumptions
+======================
+
+This document assumes the reader has basic knowledge in network drivers
+and in general interrupt moderation.
+
+
+Part I: Introduction
+======================
+
+Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt
+moderation configuration of a channel in order to optimize packet processing.
+The mechanism includes an algorithm which decides if and how to change
+moderation parameters for a channel, usually by performing an analysis on
+runtime data sampled from the system. Net DIM is such a mechanism. In each
+iteration of the algorithm, it analyses a given sample of the data, compares it to
+the previous sample and if required, it can decide to change some of the interrupt moderation
+configuration fields. The data sample is composed of data bandwidth, the number of
+packets and the number of events. The time between samples is also measured. Net DIM
+compares the current and the previous data and returns an adjusted interrupt
+moderation configuration object. In some cases, the algorithm might decide not
+to change anything. The configuration fields are the minimum duration
+(microseconds) allowed between events and the maximum number of wanted packets
+per event. The Net DIM algorithm ascribes importance to increase bandwidth over
+reducing interrupt rate.
+
+
+Part II: The Net DIM Algorithm
+===============================
+
+Each iteration of the Net DIM algorithm follows these steps:
+1. Calculates new data sample.
+2. Compares it to previous sample.
+3. Makes a decision - suggests interrupt moderation configuration fields.
+4. Applies a schedule work function, which applies suggested configuration.
+
+The first two steps are straightforward, both the new and the previous data are
+supplied by the driver registered to Net DIM. The previous data is the new data
+supplied to the previous iteration. The comparison step checks the difference
+between the new and previous data and decides on the result of the last step. A step
+would result as "better" if bandwidth increases and as "worse" if bandwidth
+reduces. If there is no change in bandwidth, the packet rate is compared in a similar
+fashion - increase == "better" and decrease == "worse". In case there is no
+change in the packet rate as well, the interrupt rate is compared. Here the
+algorithm tries to optimize for lower interrupt rate so an increase in the
+interrupt rate is considered "worse" and a decrease is considered "better".
+Step #2 has an optimization for avoiding false results: it only considers a
+difference between samples as valid if it is greater than a certain percentage.
+Also, since Net DIM does not measure anything by itself, it assumes the data
+provided by the driver is valid.
+
+Step #3 decides on the suggested configuration based on the result from step #2
+and the internal state of the algorithm. The states reflect the "direction" of
+the algorithm: is it going left (reducing moderation), right (increasing
+moderation) or standing still. Another optimization is that if a decision
+to stay still is made multiple times, the interval between iterations of the
+algorithm would increase in order to reduce calculation overhead. Also, after
+"parking" on one of the most left or most right decisions, the algorithm may
+decide to verify this decision by taking a step in the other direction. This is
+done in order to avoid getting stuck in a "deep sleep" scenario. Once a
+decision is made, an interrupt moderation configuration is selected from
+the predefined profiles.
+
+The last step is to notify the registered driver that it should apply the suggested
+configuration. This is done by scheduling a work function, defined by the Net DIM
+API and provided by the registered driver.
+
+As you can see, Net DIM itself does not actively interact with the system. It
+would have trouble making the correct decisions if the wrong data is supplied to it
+and it would be useless if the work function would not apply the suggested
+configuration. This does, however, allow the registered driver some room for manoeuvre
+as it may provide partial data or ignore the algorithm suggestion under some
+conditions.
+
+
+Part III: Registering a Network Device to DIM
+==============================================
+
+Net DIM API exposes the main function net_dim(struct net_dim *dim,
+struct net_dim_sample end_sample). This function is the entry point to the Net
+DIM algorithm and has to be called every time the driver would like to check if
+it should change interrupt moderation parameters. The driver should provide two
+data structures: struct net_dim and struct net_dim_sample. Struct net_dim
+describes the state of DIM for a specific object (RX queue, TX queue,
+other queues, etc.). This includes the current selected profile, previous data
+samples, the callback function provided by the driver and more.
+Struct net_dim_sample describes a data sample, which will be compared to the
+data sample stored in struct net_dim in order to decide on the algorithm's next
+step. The sample should include bytes, packets and interrupts, measured by
+the driver.
+
+In order to use Net DIM from a networking driver, the driver needs to call the
+main net_dim() function. The recommended method is to call net_dim() on each
+interrupt. Since Net DIM has a built-in moderation and it might decide to skip
+iterations under certain conditions, there is no need to moderate the net_dim()
+calls as well. As mentioned above, the driver needs to provide an object of type
+struct net_dim to the net_dim() function call. It is advised for each entity
+using Net DIM to hold a struct net_dim as part of its data structure and use it
+as the main Net DIM API object. The struct net_dim_sample should hold the latest
+bytes, packets and interrupts count. No need to perform any calculations, just
+include the raw data.
+
+The net_dim() call itself does not return anything. Instead Net DIM relies on the
+driver to provide a callback function, which is called when the algorithm
+decides to make a change in the interrupt moderation parameters. This callback
+will be scheduled and run in a separate thread in order not to add overhead to
+the data flow. After the work is done, Net DIM algorithm needs to be set to
+the proper state in order to move to the next iteration.
+
+
+Part IV: Example
+=================
+
+The following code demonstrates how to register a driver to Net DIM. The actual
+usage is not complete but it should make the outline of the usage clear.
+
+my_driver.c:
+
+#include <linux/net_dim.h>
+
+/* Callback for net DIM to schedule on a decision to change moderation */
+void my_driver_do_dim_work(struct work_struct *work)
+{
+	/* Get struct net_dim from struct work_struct */
+	struct net_dim *dim = container_of(work, struct net_dim,
+					   work);
+	/* Do interrupt moderation related stuff */
+	...
+
+	/* Signal net DIM work is done and it should move to next iteration */
+	dim->state = NET_DIM_START_MEASURE;
+}
+
+/* My driver's interrupt handler */
+int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
+{
+	...
+	/* A struct to hold current measured data */
+	struct net_dim_sample dim_sample;
+	...
+	/* Initiate data sample struct with current data */
+	net_dim_sample(my_entity->events,
+		       my_entity->packets,
+		       my_entity->bytes,
+		       &dim_sample);
+	/* Call net DIM */
+	net_dim(&my_entity->dim, dim_sample);
+	...
+}
+
+/* My entity's initialization function (my_entity was already allocated) */
+int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
+{
+	...
+	/* Initiate struct work_struct with my driver's callback function */
+	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
+	...
+}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH][next] net: mvpp2: use correct index on array mvpp2_pools
From: Colin King @ 2018-03-21 17:31 UTC (permalink / raw)
  To: David S . Miller, Antoine Tenart, netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Array mvpp2_pools is being indexed by long_log_pool, however this
looks like a cut-n-paste bug and in fact should be short_log_pool.

Detected by CoverityScan, CID#1466113 ("Copy-paste error")

Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/marvell/mvpp2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 9bd35f2291d6..f8bc3d4a39ff 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4632,7 +4632,7 @@ static int mvpp2_swf_bm_pool_init(struct mvpp2_port *port)
 	if (!port->pool_short) {
 		port->pool_short =
 			mvpp2_bm_pool_use(port, short_log_pool,
-					  mvpp2_pools[long_log_pool].pkt_size);
+					  mvpp2_pools[short_log_pool].pkt_size);
 		if (!port->pool_short)
 			return -ENOMEM;
 
-- 
2.15.1

^ permalink raw reply related

* Re: [PATCHv2 2/2] bpftool: Adjust to new print_bpf_insn interface
From: Quentin Monnet @ 2018-03-21 17:30 UTC (permalink / raw)
  To: Jiri Olsa, Daniel Borkmann; +Cc: Jiri Olsa, Alexei Starovoitov, lkml, netdev
In-Reply-To: <20180321170019.GE2707@krava>

2018-03-21 18:00 UTC+0100 ~ Jiri Olsa <jolsa@redhat.com>
> On Wed, Mar 21, 2018 at 05:44:52PM +0100, Daniel Borkmann wrote:
> 
> SNIP
> 
>>>> Hi Jiri, this code has changed in the tree. It was moved to
>>>> tools/bpf/bpftool/xlated_dumper.c, and there is now a third function to
>>>> update: print_insn_for_graph(). Could you please rebase the patch?
>>>
>>> sure.. I was over perf tree, I'll check on bpf tree
>>
>> Just to be sure, it should be bpf-next. bpf is for fixes only.
> 
> v2 attached
> 
> thanks,
> jirka
> 
> 
> ---
> Change bpftool to skip the removed struct bpf_verifier_env
> argument in print_bpf_insn. It was passed as NULL anyway.
> 
> No functional change intended.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/bpf/bpftool/xlated_dumper.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 

Thank you!
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

^ permalink raw reply

* Re: [PATCH 1/2] bpf: Remove struct bpf_verifier_env argument from print_bpf_insn
From: Quentin Monnet @ 2018-03-21 17:25 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann; +Cc: lkml, netdev
In-Reply-To: <20180321150212.5586-1-jolsa@kernel.org>

2018-03-21 16:02 UTC+0100 ~ Jiri Olsa <jolsa@kernel.org>
> We use print_bpf_insn in user space (bpftool and soon perf),
> so it'd be nice to keep it generic and strip it off the kernel
> struct bpf_verifier_env argument.
> 
> This argument can be safely removed, because its users can
> use the struct bpf_insn_cbs::private_data to pass it.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  kernel/bpf/disasm.c   | 52 +++++++++++++++++++++++++--------------------------
>  kernel/bpf/disasm.h   |  5 +----
>  kernel/bpf/verifier.c |  6 +++---
>  3 files changed, 30 insertions(+), 33 deletions(-)
> 

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index c6eff108aa99..9f27d3fa7259 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -202,8 +202,7 @@ EXPORT_SYMBOL_GPL(bpf_verifier_log_write);
>   * generic for symbol export. The function was renamed, but not the calls in
>   * the verifier to avoid complicating backports. Hence the alias below.
>   */
> -static __printf(2, 3) void verbose(struct bpf_verifier_env *env,
> -				   const char *fmt, ...)
> +static __printf(2, 3) void verbose(void *private_data, const char *fmt, ...)
>  	__attribute__((alias("bpf_verifier_log_write")));

Just as a note, verbose() will be aliased to a function whose prototype
differs (bpf_verifier_log_write() still expects a struct
bpf_verifier_env as its first argument). I am not so familiar with
function aliases, could this change be a concern?

Other than this the patch seems good to me.
Quentin

>  
>  static bool type_is_pkt_pointer(enum bpf_reg_type type)
> @@ -4601,10 +4600,11 @@ static int do_check(struct bpf_verifier_env *env)
>  		if (env->log.level) {
>  			const struct bpf_insn_cbs cbs = {
>  				.cb_print	= verbose,
> +				.private_data	= env,
>  			};
>  
>  			verbose(env, "%d: ", insn_idx);
> -			print_bpf_insn(&cbs, env, insn, env->allow_ptr_leaks);
> +			print_bpf_insn(&cbs, insn, env->allow_ptr_leaks);
>  		}
>  
>  		if (bpf_prog_is_dev_bound(env->prog->aux)) {
> 

^ permalink raw reply

* Re: [PATCH net-next] Documentation/networking: Add net DIM documentation
From: Randy Dunlap @ 2018-03-21 17:07 UTC (permalink / raw)
  To: Tal Gilboa, David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <1521624629-46666-1-git-send-email-talgi@mellanox.com>

On 03/21/2018 02:30 AM, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.
> 
> Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> ---
>  Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
>  1 file changed, 174 insertions(+)
>  create mode 100644 Documentation/networking/net_dim.txt
> 
> diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
> new file mode 100644
> index 0000000..ef622c8
> --- /dev/null
> +++ b/Documentation/networking/net_dim.txt
> @@ -0,0 +1,174 @@
> +Net DIM - Generic Network Dynamic Interrupt Moderation
> +======================================================
> +
> +Author:
> +	Tal Gilboa <talgi@mellanox.com>
> +
> +
> +Contents
> +=========
> +
> +- Assumptions
> +- Introduction
> +- The Net DIM Algorithm
> +- Registering a Network Device to DIM
> +- Example
> +
> +Part 0: Assumptions
> +======================
> +
> +This document assumes the reader has basic knowledge in network drivers
> +and in general interrupt moderation.
> +
> +
> +Part I: Introduction
> +======================
> +
> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt
> +moderation configuration of a channel in order to optimize packet processing.
> +The mechanism includes an algorithm which decides if and how to change
> +moderation parameters for a channel, usually by performing an analysis on
> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> +iteration of the algorithm, it analyses a given sample of the data, compares it to
> +the previous sample and if required, is can decide to change some of the interrupt moderation

                                        it can

> +configuration fields. The data sample is composed of data bandwidth, the number of
> +packets and the number of events. The time between samples is also measured. Net DIM
> +compares the current and the previous data and returns an adjusted interrupt
> +moderation configuration object. In some cases, the algorithm might decide not
> +to change anything. The configuration fields are the minimum duration
> +(microseconds) allowed between events and the maximum number of wanted packets
> +per event. The Net DIM algorithm ascribes importance to increase bandwidth over
> +reducing interrupt rate.
> +
> +
> +Part II: The Net DIM Algorithm
> +===============================
> +
> +Each iteration of the Net DIM algorithm follows these steps:
> +1. Calculates new data sample.
> +2. Compares it to previous sample.
> +3. Makes a decision - suggests interrupt moderation configuration fields.
> +4. Applies a schedule work function, which applies suggested configuration.
> +
> +The first two steps are straight forward, both the new and the previous data are

                           straightforward;

> +supplied by the driver registered to Net DIM. The previous data is the new data
> +supplied to the previous iteration. The comparison step checks the difference
> +between the new and previous data and decides on the result of the last step. A step
> +would result as "better" if bandwidth increases and as "worse" if bandwidth
> +reduces. If there is no change in bandwidth, the packet rate is compared in a similar
> +fashion - increase == "better" and decrease == "worse". In case there is no
> +change in the packet rate as well, the interrupt rate is compared. Here the
> +algorithm tries to optimize for lower interrupt rate so an increase in the
> +interrupt rate is considered "worse" and a decrease is considered "better".
> +Step #2 has an optimization for avoiding false results, it only considers a

                                                  results:

> +difference between samples as valid if it is greater than a certain percentage.
> +Also, since Net DIM does not measure anything by itself, it assumes the data
> +provided by the driver is valid.
> +
> +Step #3 decides on the suggested configuration based on the result from step #2
> +and the internal state of the algorithm. The states reflect the "direction" of
> +the algorithm, is it going left (reducing moderation), right (increasing

       algorithm:

> +moderation) or standing still. Another optimization is that if a decision
> +to stay still is made multiple times, the interval between iterations of the
> +algorithm would increase in order to reduce calculation overhead. Also, after
> +"parking" on one of the most left or most right decisions, the algorithm may
> +decide to verify this decision by taking a step on the other direction. This is

                                              step in

> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
> +decision is made, an interrupt moderation configuration is selected from
> +the predefined profiles.
> +
> +The last step is to notify the registered driver that it should apply the suggested
> +configuration. This is done by scheduling a work function, defined by the Net DIM
> +API and provided by the registered driver.
> +
> +As you can see, Net DIM itself does not actively interact with the system. It
> +would have trouble making the correct decisions if the wrong data is supplied to it
> +and it would be useless if the work function would not apply the suggested
> +configuration. This does, however, allows the registered driver some room for manoeuvre

                                      allow

> +as it may provide partial data or ignore the algorithm suggestion under some
> +conditions.
> +
> +
> +Part III: Registering a Network Device to DIM
> +==============================================
> +
> +Net DIM API exposes the main function net_dim(struct net_dim *dim,
> +struct net_dim_sample end_sample). This function is the entry point to the Net
> +DIM algorithm and has to be called every time the driver would like to check if
> +it should change interrupt moderation parameters. The driver should provide two
> +data structures: struct net_dim and struct net_dim_sample. Struct net_dim
> +describes the state of DIM for a specific object (RX queue, TX queue,
> +other queues, etc.). This includes the current selected profile, previous data
> +samples, the callback function provided by the driver and more.
> +Struct net_dim_sample describes a data sample, which will be compared to the
> +data sample stored in struct net_dim in order to decide on the algorithm's next
> +step. The sample should include bytes, packets and interrupts, measured by
> +the driver.
> +
> +In order to use Net DIM from a networking driver, the driver needs to call the
> +main net_dim() function. The recommended method is to call net_dim() on each
> +interrupt. Since Net DIM has a built-in moderation and it might decide to skip
> +iterations under certain conditions, there is no need to moderate the net_dim()
> +calls as well. As mentioned above, the driver needs to provide an object of type
> +struct net_dim to the net_dim() function call. It is advised for each entity
> +using Net DIM to hold a struct net_dim as part of its data structure and use it
> +as the main Net DIM API object. The struct net_dim_sample should hold the latest
> +bytes, packets and interrupts count. No need to perform any calculations, just
> +include the raw data.
> +
> +The net_dim() call itself does not return anything. Instead Net DIM relies on the
> +driver to provide a callback function, which is called when the algorithm
> +decides to make a change in the interrupt moderation parameters. This callback
> +will be scheduled and ran in a separate thread in order not to add overhead to

                     and run

> +the data flow. After the work is done, Net DIM algorithm needs to be set to
> +the proper state in order to move to the next iteration.
> +
> +
> +Part IV: Example
> +=================
> +
> +The following code demonstrates how to register a driver to Net DIM. The actual
> +usage is not complete but it should make the outline if the usage clear.

                                                        of the

> +
> +my_driver.c:
> +
> +#include <linux/net_dim.h>
> +
> +/* Callback for net DIM to schedule on a decision to change moderation */
> +void my_driver_do_dim_work(struct work_struct *work)
> +{
> +	/* Get struct net_dim from struct work_struct */
> +	struct net_dim *dim = container_of(work, struct net_dim,
> +					   work);
> +	/* Do interrupt moderation related stuff */
> +	...
> +
> +	/* Signal net DIM work is done and it should move to next iteration */
> +	dim->state = NET_DIM_START_MEASURE;
> +}
> +
> +/* My driver's interrupt handler */
> +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* A struct to hold current measured data */
> +	struct net_dim_sample dim_sample;
> +	...
> +	/* Initiate data sample struct with current data */
> +	net_dim_sample(my_entity->events,
> +		       my_entity->packets,
> +		       my_entity->bytes,
> +		       &dim_sample);
> +	/* Call net DIM */
> +	net_dim(&my_entity->dim, dim_sample);
> +	...
> +}
> +
> +/* My entity's initialization function (my_entity was already allocated) */
> +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* Initiate struct work_struct with my driver's callback function */
> +	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
> +	...
> +}
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH] i40e: add support for XDP_REDIRECT
From: Björn Töpel @ 2018-03-21 17:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jeff Kirsher, intel-wired-lan, Björn Töpel,
	Karlsson, Magnus, Netdev, Duyck, Alexander H
In-Reply-To: <CAKgT0UftD7Me9MBzsvqqWW_BpbisZ+wbJV4vWTYh+3hMWqdk5g@mail.gmail.com>

2018-03-21 17:40 GMT+01:00 Alexander Duyck <alexander.duyck@gmail.com>:
> On Wed, Mar 21, 2018 at 9:15 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The driver now acts upon the XDP_REDIRECT return action. Two new ndos
>> are implemented, ndo_xdp_xmit and ndo_xdp_flush.
>>
>> XDP_REDIRECT action enables XDP program to redirect frames to other
>> netdevs. The target redirect/forward netdev might release the XDP data
>> page within the ndo_xdp_xmit function (triggered by xdp_do_redirect),
>> which meant that the i40e page count logic had to be tweaked.
>>
>> An example. i40e_clean_rx_irq is entered, and one rx_buffer is pulled
>> from the hardware descriptor ring. Say that the actual page refcount
>> is 1. XDP is enabled, and the redirect action is triggered. The target
>> netdev ndo_xdp_xmit decreases the page refcount, resulting in the page
>> being freed. The prior assumption was that the function owned the page
>> until i40e_put_rx_buffer was called, increasing the refcount again.
>>
>> Now, we don't allow a refcount less than 2. Another option would be
>> calling xdp_do_redirect *after* i40e_put_rx_buffer, but that would
>> have required new/more conditionals.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e_main.c |  2 +
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c | 83 +++++++++++++++++++++++------
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h |  2 +
>>  3 files changed, 72 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index 79ab52276d12..2fb4261b4fd9 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -11815,6 +11815,8 @@ static const struct net_device_ops i40e_netdev_ops = {
>>         .ndo_bridge_getlink     = i40e_ndo_bridge_getlink,
>>         .ndo_bridge_setlink     = i40e_ndo_bridge_setlink,
>>         .ndo_bpf                = i40e_xdp,
>> +       .ndo_xdp_xmit           = i40e_xdp_xmit,
>> +       .ndo_xdp_flush          = i40e_xdp_flush,
>>  };
>>
>>  /**
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> index c6972bd07e49..2106ae04d053 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> @@ -1589,8 +1589,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
>>         bi->page = page;
>>         bi->page_offset = i40e_rx_offset(rx_ring);
>>
>> -       /* initialize pagecnt_bias to 1 representing we fully own page */
>> -       bi->pagecnt_bias = 1;
>> +       page_ref_add(page, USHRT_MAX - 1);
>> +       bi->pagecnt_bias = USHRT_MAX;
>>
>>         return true;
>>  }
>> @@ -1956,8 +1956,8 @@ static bool i40e_can_reuse_rx_page(struct i40e_rx_buffer *rx_buffer)
>>          * the pagecnt_bias and page count so that we fully restock the
>>          * number of references the driver holds.
>>          */
>> -       if (unlikely(!pagecnt_bias)) {
>> -               page_ref_add(page, USHRT_MAX);
>> +       if (unlikely(pagecnt_bias == 1)) {
>> +               page_ref_add(page, USHRT_MAX - 1);
>>                 rx_buffer->pagecnt_bias = USHRT_MAX;
>>         }
>>
>
> I think I would be more comfortable with this if this patch was moved
> to a separate patch. I also assume similar changes are needed for
> ixgbe, are they not?
>

Yes, ixgbe need a similar change. I'll send a patch for that!

As for i40e, only when the the xdp_do_redirect path is added, the
refcount need to change. That said, I'll split the page reference
counting from the XDP_REDIRECT code in a V2.

>> @@ -2215,7 +2215,7 @@ static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
>>  static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>>                                     struct xdp_buff *xdp)
>>  {
>> -       int result = I40E_XDP_PASS;
>> +       int err, result = I40E_XDP_PASS;
>>         struct i40e_ring *xdp_ring;
>>         struct bpf_prog *xdp_prog;
>>         u32 act;
>> @@ -2234,6 +2234,10 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
>>                 xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
>>                 result = i40e_xmit_xdp_ring(xdp, xdp_ring);
>>                 break;
>> +       case XDP_REDIRECT:
>> +               err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
>> +               result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
>> +               break;
>>         default:
>>                 bpf_warn_invalid_xdp_action(act);
>>         case XDP_ABORTED:
>> @@ -2269,6 +2273,16 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
>>  #endif
>>  }
>>
>> +static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
>> +{
>> +       /* Force memory writes to complete before letting h/w
>> +        * know there are new descriptors to fetch.
>> +        */
>> +       wmb();
>> +
>> +       writel(xdp_ring->next_to_use, xdp_ring->tail);
>> +}
>> +
>
> I don't know if you have seen the other changes going by but this
> needs to use writel_relaxed, not just standard writel if you are
> following a wmb.
>

Hmm, other writel_relaxed changes for the i40e driver?

I thought writel only implied smp_wmb, but it's stronger iow. I'll
change to writel_relaxed, or just a writel w/o the fence.

>>  /**
>>   * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
>>   * @rx_ring: rx descriptor ring to transact packets on
>> @@ -2403,16 +2417,9 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
>>         }
>>
>>         if (xdp_xmit) {
>> -               struct i40e_ring *xdp_ring;
>> -
>> -               xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
>> -
>> -               /* Force memory writes to complete before letting h/w
>> -                * know there are new descriptors to fetch.
>> -                */
>> -               wmb();
>> -
>> -               writel(xdp_ring->next_to_use, xdp_ring->tail);
>> +               i40e_xdp_ring_update_tail(
>> +                       rx_ring->vsi->xdp_rings[rx_ring->queue_index]);
>> +               xdp_do_flush_map();
>
> I think I preferred this with the xdp_ring variable. It is more
> readable then this folded up setup.
>

Yes, indeed!

>>         }
>>
>>         rx_ring->skb = skb;
>> @@ -3660,3 +3667,49 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
>>
>>         return i40e_xmit_frame_ring(skb, tx_ring);
>>  }
>> +
>> +/**
>> + * i40e_xdp_xmit - Implements ndo_xdp_xmit
>> + * @dev: netdev
>> + * @xdp: XDP buffer
>> + *
>> + * Returns Zero if sent, else an error code
>> + **/
>> +int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
>> +{
>> +       struct i40e_netdev_priv *np = netdev_priv(dev);
>> +       unsigned int queue_index = smp_processor_id();
>> +       struct i40e_vsi *vsi = np->vsi;
>> +       int err;
>> +
>> +       if (test_bit(__I40E_VSI_DOWN, vsi->state))
>> +               return -EINVAL;
>> +
>> +       if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>> +               return -EINVAL;
>> +
>
> In order for this bit of logic to work you have to guarantee that you
> cannot have another ring running an XDP_TX using the same Tx queue. I
> don't know if XDP_REDIRECT is mutually exclusive with XDP_TX in the
> same bpf program or not. One thing you might look at doing is adding a
> test with a flag to verify that the processor IDs are all less than
> vsi->num_queue_pairs and if so set the flag, and based on that flag
> XDP_TX would use similar logic to what is used here to determine which
> queue to transmit on instead of just assuming it will use the Tx queue
> matched with the Rx queue. Then you could also use that flag in this
> test to determine if you support this mode or not.
>

XDP_REDIRECT and XDP_TX are mutually exclusive, so we're good here!

Finally, thanks for the quick review! Much appreciated!


Björn

>> +       err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
>> +       if (err != I40E_XDP_TX)
>> +               return -ENOMEM;
>> +
>> +       return 0;
>> +}
>> +
>> +/**
>> + * i40e_xdp_flush - Implements ndo_xdp_flush
>> + * @dev: netdev
>> + **/
>> +void i40e_xdp_flush(struct net_device *dev)
>> +{
>> +       struct i40e_netdev_priv *np = netdev_priv(dev);
>> +       unsigned int queue_index = smp_processor_id();
>> +       struct i40e_vsi *vsi = np->vsi;
>> +
>> +       if (test_bit(__I40E_VSI_DOWN, vsi->state))
>> +               return;
>> +
>> +       if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>> +               return;
>> +
>> +       i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
>> +}
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> index 2444f338bb0c..a97e59721393 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> @@ -510,6 +510,8 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
>>  void i40e_detect_recover_hung(struct i40e_vsi *vsi);
>>  int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
>>  bool __i40e_chk_linearize(struct sk_buff *skb);
>> +int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
>> +void i40e_xdp_flush(struct net_device *dev);
>>
>>  /**
>>   * i40e_get_head - Retrieve head from head writeback
>> --
>> 2.7.4
>>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox