Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Masami Hiramatsu @ 2018-11-03 13:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
	David S. Miller, Masami Hiramatsu, Jonathan Corbet,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Brendan Gregg,
	Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181102091658.1bc979a4@gandalf.local.home>

On Fri, 2 Nov 2018 09:16:58 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 2 Nov 2018 17:59:32 +1100
> Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
> > As an aside, I just tested with the frame unwinder and it isn't thrown
> > off-course by kretprobe_trampoline (though obviously the stack is still
> > wrong). So I think we just need to hook into the ORC unwinder to get it
> > to continue skipping up the stack, as well as add the rewriting code for
> > the stack traces (for all unwinders I guess -- though ideally we should
> 
> I agree that this is the right solution.
> 
> > do this without having to add the same code to every architecture).
> 
> True, and there's an art to consolidating the code between
> architectures.
> 
> I'm currently looking at function graph and seeing if I can consolidate
> it too. And I'm also trying to get multiple uses to hook into its
> infrastructure. I think I finally figured out a way to do so.

For supporting multiple users without any memory allocation, I think
each user should consume the shadow stack and store on it.
My old generic retstack implementation did that.

https://github.com/mhiramat/linux/commit/8804f76580cd863d555854b41b9c6df719f8087e

I hope this may give you any insites.
My idea is to generalize shadow stack, not func graph tracer, since
I don't like making kretprobe depends on func graph tracer, but only
the shadow stack.

> 
> The reason it is difficult, is that you need to maintain state between
> the entry of a function and the exit for each task and callback that is
> registered. Hence, it's a 3x tuple (function stack, task, callbacks).
> And this must be maintained with preemption. A task may sleep for
> minutes, and the state needs to be retained.

Would you mean preeempt_disable()? Anyway, we just need to increment index
atomically, don't we?

> The only state that must be retained is the function stack with the
> task, because if that gets out of sync, the system crashes. But the
> callback state can be removed.
> 
> Here's what is there now:
> 
>  When something is registered with the function graph tracer, every
>  task gets a shadowed stack. A hook is added to fork to add shadow
>  stacks to new tasks. Once a shadow stack is added to a task, that
>  shadow stack is never removed until the task exits.
> 
>  When the function is entered, the real return code is stored in the
>  shadow stack and the trampoline address is put in its place.
> 
>  On return, the trampoline is called, and it will pop off the return
>  code from the shadow stack and return to that.
> 
> The issue with multiple users, is that different users may want to
> trace different functions. On entry, the user could say it doesn't want
> to trace the current function, and the return part must not be called
> on exit. Keeping track of which user needs the return called is the
> tricky part.

So that I think only the "shadow stack" part should be generalized.

> 
> Here's what I plan on implementing:
> 
>  Along with a shadow stack, I was going to add a 4096 byte (one page)
>  array that holds 64 8 byte masks to every task as well. This will allow
>  64 simultaneous users (which is rather extreme). If we need to support
>  more, we could allocate another page for all tasks. The 8 byte mask
>  will represent each depth (allowing to do this for 64 function call
>  stack depth, which should also be enough).
> 
>  Each user will be assigned one of the masks. Each bit in the mask
>  represents the depth of the shadow stack. When a function is called,
>  each user registered with the function graph tracer will get called
>  (if they asked to be called for this function, via the ftrace_ops
>  hashes) and if they want to trace the function, then the bit is set in
>  the mask for that stack depth.
> 
>  When the function exits the function and we pop off the return code
>  from the shadow stack, we then look at all the bits set for the
>  corresponding users, and call their return callbacks, and ignore
>  anything that is not set.
> 

It sounds too complicated... why we don't just open the shadow stack for
each user? Of course it may requires a bit "repeat" unwind on the shadow
stack, but it is simple.

Thank you,

> When a user is unregistered, it the corresponding bits that represent
> it are cleared, and it the return callback will not be called. But the
> tasks being traced will still have their shadow stack to allow it to
> get back to normal.
> 
> I'll hopefully have a prototype ready by plumbers.
> 
> And this too will require each architecture to probably change. As a
> side project to this, I'm going to try to consolidate the function
> graph code among all the architectures as well. Not an easy task.
> 
> -- Steve


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH net] sctp: fix strchange_flags name for Stream Change Event
From: Xin Long @ 2018-11-03  5:59 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

As defined in rfc6525#section-6.1.3, SCTP_STREAM_CHANGE_DENIED
and SCTP_STREAM_CHANGE_FAILED should be used instead of
SCTP_ASSOC_CHANGE_DENIED and SCTP_ASSOC_CHANGE_FAILED.

To keep the compatibility, fix it by adding two macros.

Fixes: b444153fb5a6 ("sctp: add support for generating add stream change event notification")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/uapi/linux/sctp.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 34dd3d4..680ecc3 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -568,6 +568,8 @@ struct sctp_assoc_reset_event {
 
 #define SCTP_ASSOC_CHANGE_DENIED	0x0004
 #define SCTP_ASSOC_CHANGE_FAILED	0x0008
+#define SCTP_STREAM_CHANGE_DENIED	SCTP_ASSOC_CHANGE_DENIED
+#define SCTP_STREAM_CHANGE_FAILED	SCTP_ASSOC_CHANGE_FAILED
 struct sctp_stream_change_event {
 	__u16 strchange_type;
 	__u16 strchange_flags;
-- 
2.1.0

^ permalink raw reply related

* [PATCH net] sctp: define SCTP_SS_DEFAULT for Stream schedulers
From: Xin Long @ 2018-11-03  6:01 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

According to rfc8260#section-4.3.2, SCTP_SS_DEFAULT is required to
defined as SCTP_SS_FCFS or SCTP_SS_RR.

SCTP_SS_FCFS is used for SCTP_SS_DEFAULT's value in this patch.

Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/uapi/linux/sctp.h | 1 +
 net/sctp/outqueue.c       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 680ecc3..c81feb3 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1153,6 +1153,7 @@ struct sctp_add_streams {
 /* SCTP Stream schedulers */
 enum sctp_sched_type {
 	SCTP_SS_FCFS,
+	SCTP_SS_DEFAULT = SCTP_SS_FCFS,
 	SCTP_SS_PRIO,
 	SCTP_SS_RR,
 	SCTP_SS_MAX = SCTP_SS_RR
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 9cb854b..c37e1c2 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -212,7 +212,7 @@ void sctp_outq_init(struct sctp_association *asoc, struct sctp_outq *q)
 	INIT_LIST_HEAD(&q->retransmit);
 	INIT_LIST_HEAD(&q->sacked);
 	INIT_LIST_HEAD(&q->abandoned);
-	sctp_sched_set_sched(asoc, SCTP_SS_FCFS);
+	sctp_sched_set_sched(asoc, SCTP_SS_DEFAULT);
 }
 
 /* Free the outqueue structure and any related pending chunks.
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH net v7] net/ipv6: Add anycast addresses to a global hashtable
From: David Miller @ 2018-11-03  6:55 UTC (permalink / raw)
  To: 0xeffeff; +Cc: netdev, kuznet, yoshfuji
In-Reply-To: <20181102202357.23526-1-0xeffeff@gmail.com>

From: Jeff Barnhill <0xeffeff@gmail.com>
Date: Fri,  2 Nov 2018 20:23:57 +0000

> icmp6_send() function is expensive on systems with a large number of
> interfaces. Every time it’s called, it has to verify that the source
> address does not correspond to an existing anycast address by looping
> through every device and every anycast address on the device.  This can
> result in significant delays for a CPU when there are a large number of
> neighbors and ND timers are frequently timing out and calling
> neigh_invalidate().
> 
> Add anycast addresses to a global hashtable to allow quick searching for
> matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.
> 
> Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH v2 net] net: dsa: microchip: initialize mutex before use
From: David Miller @ 2018-11-03  6:57 UTC (permalink / raw)
  To: Tristram.Ha; +Cc: andrew, f.fainelli, pavel, UNGLinuxDriver, netdev
In-Reply-To: <1541211821-20334-1-git-send-email-Tristram.Ha@microchip.com>

From: <Tristram.Ha@microchip.com>
Date: Fri, 2 Nov 2018 19:23:41 -0700

> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Initialize mutex before use.  Avoid kernel complaint when
> CONFIG_DEBUG_LOCK_ALLOC is enabled.
> 
> Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
> Reviewed-by: Pavel Machek <pavel@ucw.cz>
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net 0/2] net: timeout fixes for GENET and SYSTEMPORT
From: David Miller @ 2018-11-03  7:04 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, jaedon.shin, opendmb
In-Reply-To: <20181101225538.18632-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu,  1 Nov 2018 15:55:36 -0700

> This patch series fixes occasional transmit timeout around the time
> the system goes into suspend. GENET and SYSTEMPORT have nearly the same
> logic in that regard and were both affected in the same way.

Series applied.

> Please queue up for stable, thanks!

Queued up.

^ permalink raw reply

* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Steven Rostedt @ 2018-11-03 17:30 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Josh Poimboeuf, Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
	David S. Miller, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Shuah Khan, Alexei Starovoitov, Daniel Borkmann,
	Brendan Gregg, Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181104013430.9d3e91b8ebbae7dcb6860ef1@kernel.org>

On Sun, 4 Nov 2018 01:34:30 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > I was thinking of a bitmask that represents the handlers, and use that
> > to map which handler gets called for which shadow entry for a
> > particular task.  
> 
> Hmm, I doubt that is too complicated and not scalable. I rather like to see
> the open shadow entry...

It can scale and not too complex (I already played a little with it).
But that said, I'm not committed to it, and using the shadow stack is
also an interesting idea.

> 
> entry: [[original_retaddr][function][modified_retaddr]]
> 
> So if there are many users on same function, the entries will be like this 
> 
> [[original_return_address][function][trampoline_A]]
> [[trampline_A][function][trampoline_B]]
> [[trampline_B][function][trampoline_C]]
> 
> And on the top of the stack, there is trampline_C instead of original_return_address.
> In this case, return to trampoline_C(), it jumps back to trampline_B() and then
> it jumps back to trampline_A(). And eventually it jumps back to
> original_return_address.

Where are trampolines A, B, and C made? Do we also need to dynamically
create them? If I register multiple function tracing ones, each one
will need its own trampoline?

> 
> This way, we don't need allocate another bitmap/pages for the shadow stack.
> We only need a shadow stack for each task.
> Also, unwinder can easily find the trampline_C from the shadow stack and restores
> original_return_address. (of course trampline_A,B,C must be registered so that
> search function can skip it.)

What I was thinking was to store a count and the functions to be called:


	[original_return_address]
	[function_A]
	[function_B]
	[function_C]
	[ 3 ]

Then the trampoline that processes the return codes for ftrace (and
kretprobes and everyone else) can simply do:

	count = pop_shadow_stack();
	for (i = 0; i < count; i++) {
		func = pop_shadow_stack();
		func(...);
	}
	return_address = pop_shadow_stack();

That way we only need to register a function to the return handler and
it will be called, without worrying about making trampolines. There
will just be a single trampoline that handles all the work.

-- Steve

^ permalink raw reply

* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Steven Rostedt @ 2018-11-03 17:33 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Josh Poimboeuf, Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
	David S. Miller, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Shuah Khan, Alexei Starovoitov, Daniel Borkmann,
	Brendan Gregg, Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181103133021.6676708c@vmware.local.home>

On Sat, 3 Nov 2018 13:30:21 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> What I was thinking was to store a count and the functions to be called:
> 
> 
> 	[original_return_address]
> 	[function_A]
> 	[function_B]
> 	[function_C]
> 	[ 3 ]
> 
> Then the trampoline that processes the return codes for ftrace (and
> kretprobes and everyone else) can simply do:
> 
> 	count = pop_shadow_stack();
> 	for (i = 0; i < count; i++) {
> 		func = pop_shadow_stack();
> 		func(...);
> 	}
> 	return_address = pop_shadow_stack();
> 
> That way we only need to register a function to the return handler and
> it will be called, without worrying about making trampolines. There
> will just be a single trampoline that handles all the work.

And since the most common case is a single function to call, instead of
using a count, we can take advantage that kernel functions are negative
numbers and do:

	[original_return_address]
	[function_A]

	----

	long count;

	count = pop_shadow_stack();
	if (count < 0) {
		func = (void *)count;
		func();
	} else {
		for (i = 0; i < count; i++) {
			[...]

The unwinder will just need to know how to handle all this :-)

-- Steve

^ permalink raw reply

* Re: [PATCH 4/5] b43: Use common cordic algorithm from kernel lib
From: Larry Finger @ 2018-11-03 17:37 UTC (permalink / raw)
  To: Priit Laes, linux-kernel
  Cc: netdev, linux-wireless, David S. Miller, Kalle Valo, b43-dev
In-Reply-To: <c208360e5cc314ac1ad81f907f4538cf2dcbd196.1541238842.git-series.plaes@plaes.org>

On 11/3/18 4:59 AM, Priit Laes wrote:
> Signed-off-by: Priit Laes <plaes@plaes.org>

Where is the commit message? The stuff in the cover letter (Patch 0/N) never 
makes it to the git repository. You must have a message in each of the 
individual patches.

NACK.

Larry

> ---
>   drivers/net/wireless/broadcom/b43/Kconfig  |  1 +
>   drivers/net/wireless/broadcom/b43/phy_lp.c | 13 +++++++------
>   drivers/net/wireless/broadcom/b43/phy_n.c  | 13 +++++++------
>   3 files changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/wireless/broadcom/b43/Kconfig b/drivers/net/wireless/broadcom/b43/Kconfig
> index fba8560..3e41457 100644
> --- a/drivers/net/wireless/broadcom/b43/Kconfig
> +++ b/drivers/net/wireless/broadcom/b43/Kconfig
> @@ -4,6 +4,7 @@ config B43
>   	select BCMA if B43_BCMA
>   	select SSB if B43_SSB
>   	select FW_LOADER
> +	select CORDIC
>   	---help---
>   	  b43 is a driver for the Broadcom 43xx series wireless devices.
>   
> diff --git a/drivers/net/wireless/broadcom/b43/phy_lp.c b/drivers/net/wireless/broadcom/b43/phy_lp.c
> index 6922cbb..1718e3b 100644
> --- a/drivers/net/wireless/broadcom/b43/phy_lp.c
> +++ b/drivers/net/wireless/broadcom/b43/phy_lp.c
> @@ -23,6 +23,7 @@
>   
>   */
>   
> +#include <linux/cordic.h>
>   #include <linux/slab.h>
>   
>   #include "b43.h"
> @@ -1780,9 +1781,9 @@ static void lpphy_start_tx_tone(struct b43_wldev *dev, s32 freq, u16 max)
>   {
>   	struct b43_phy_lp *lpphy = dev->phy.lp;
>   	u16 buf[64];
> -	int i, samples = 0, angle = 0;
> +	int i, samples = 0, theta = 0;
>   	int rotation = (((36 * freq) / 20) << 16) / 100;
> -	struct b43_c32 sample;
> +	struct cordic_iq sample;
>   
>   	lpphy->tx_tone_freq = freq;
>   
> @@ -1798,10 +1799,10 @@ static void lpphy_start_tx_tone(struct b43_wldev *dev, s32 freq, u16 max)
>   	}
>   
>   	for (i = 0; i < samples; i++) {
> -		sample = b43_cordic(angle);
> -		angle += rotation;
> -		buf[i] = CORDIC_CONVERT((sample.i * max) & 0xFF) << 8;
> -		buf[i] |= CORDIC_CONVERT((sample.q * max) & 0xFF);
> +		sample = cordic_calc_iq(theta);
> +		theta += rotation;
> +		buf[i] = CORDIC_FLOAT((sample.i * max) & 0xFF) << 8;
> +		buf[i] |= CORDIC_FLOAT((sample.q * max) & 0xFF);
>   	}
>   
>   	b43_lptab_write_bulk(dev, B43_LPTAB16(5, 0), samples, buf);
> diff --git a/drivers/net/wireless/broadcom/b43/phy_n.c b/drivers/net/wireless/broadcom/b43/phy_n.c
> index 44ab080..1f9378a 100644
> --- a/drivers/net/wireless/broadcom/b43/phy_n.c
> +++ b/drivers/net/wireless/broadcom/b43/phy_n.c
> @@ -23,6 +23,7 @@
>   
>   */
>   
> +#include <linux/cordic.h>
>   #include <linux/delay.h>
>   #include <linux/slab.h>
>   #include <linux/types.h>
> @@ -1513,7 +1514,7 @@ static void b43_radio_init2055(struct b43_wldev *dev)
>   
>   /* http://bcm-v4.sipsolutions.net/802.11/PHY/N/LoadSampleTable */
>   static int b43_nphy_load_samples(struct b43_wldev *dev,
> -					struct b43_c32 *samples, u16 len) {
> +					struct cordic_iq *samples, u16 len) {
>   	struct b43_phy_n *nphy = dev->phy.n;
>   	u16 i;
>   	u32 *data;
> @@ -1544,7 +1545,7 @@ static u16 b43_nphy_gen_load_samples(struct b43_wldev *dev, u32 freq, u16 max,
>   {
>   	int i;
>   	u16 bw, len, rot, angle;
> -	struct b43_c32 *samples;
> +	struct cordic_iq *samples;
>   
>   	bw = b43_is_40mhz(dev) ? 40 : 20;
>   	len = bw << 3;
> @@ -1561,7 +1562,7 @@ static u16 b43_nphy_gen_load_samples(struct b43_wldev *dev, u32 freq, u16 max,
>   		len = bw << 1;
>   	}
>   
> -	samples = kcalloc(len, sizeof(struct b43_c32), GFP_KERNEL);
> +	samples = kcalloc(len, sizeof(struct cordic_iq), GFP_KERNEL);
>   	if (!samples) {
>   		b43err(dev->wl, "allocation for samples generation failed\n");
>   		return 0;
> @@ -1570,10 +1571,10 @@ static u16 b43_nphy_gen_load_samples(struct b43_wldev *dev, u32 freq, u16 max,
>   	angle = 0;
>   
>   	for (i = 0; i < len; i++) {
> -		samples[i] = b43_cordic(angle);
> +		samples[i] = cordic_calc_iq(angle);
>   		angle += rot;
> -		samples[i].q = CORDIC_CONVERT(samples[i].q * max);
> -		samples[i].i = CORDIC_CONVERT(samples[i].i * max);
> +		samples[i].q = CORDIC_FLOAT(samples[i].q * max);
> +		samples[i].i = CORDIC_FLOAT(samples[i].i * max);
>   	}
>   
>   	i = b43_nphy_load_samples(dev, samples, len);
> 

^ permalink raw reply

* Re: [PATCH 5/6] dt-bindings: can: can-transceiver: Remove legacy binding documentation
From: Sergei Shtylyov @ 2018-11-03 20:12 UTC (permalink / raw)
  To: Faiz Abbas, linux-kernel, devicetree, netdev, linux-can
  Cc: wg, mkl, robh+dt, mark.rutland, kishon
In-Reply-To: <20181102192616.28291-6-faiz_abbas@ti.com>

On 11/2/2018 10:26 PM, Faiz Abbas wrote:

> With the transceiver node being implemented as a phy, remove the legacy
> dcoumentation. Don't remove the code implementing it to maintain dt

    Documentation.

> compatibility.
> 
> Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
[...]

MBR, Sergei

^ permalink raw reply

* Re: Help with the BPF verifier
From: Arnaldo Carvalho de Melo @ 2018-11-03 11:29 UTC (permalink / raw)
  To: Edward Cree
  Cc: Yonghong Song, Daniel Borkmann, Jiri Olsa, Martin Lau,
	Alexei Starovoitov, Linux Networking Development Mailing List
In-Reply-To: <51197ebf-a3d8-a3ae-0389-d7e4dae3e833@solarflare.com>

Em Fri, Nov 02, 2018 at 03:42:49PM +0000, Edward Cree escreveu:
> On 02/11/18 15:02, Arnaldo Carvalho de Melo wrote:
> > Yeah, didn't work as well: 
> 
> > And the -vv in 'perf trace' didn't seem to map to further details in the
> > output of the verifier debug:
> Yeah for log_level 2 you probably need to make source-level changes to either
>  perf or libbpf (I think the latter).  It's annoying that essentially no tools
>  plumb through an option for that, someone should fix them ;-)
> 
> > libbpf: -- BEGIN DUMP LOG ---
> > libbpf: 
> > 0: (bf) r6 = r1
> > 1: (bf) r1 = r10
> > 2: (07) r1 += -328
> > 3: (b7) r7 = 64
> > 4: (b7) r2 = 64
> > 5: (bf) r3 = r6
> > 6: (85) call bpf_probe_read#4
> > 7: (79) r1 = *(u64 *)(r10 -320)
> > 8: (15) if r1 == 0x101 goto pc+4
> >  R0=inv(id=0) R1=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> > 9: (55) if r1 != 0x2 goto pc+22
> >  R0=inv(id=0) R1=inv2 R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> > 10: (bf) r1 = r6
> > 11: (07) r1 += 16
> > 12: (05) goto pc+2
> > 15: (79) r3 = *(u64 *)(r1 +0)
> > dereference of modified ctx ptr R1 off=16 disallowed
> Aha, we at least got a different error message this time.
> And indeed llvm has done that optimisation, rather than the more obvious
> 11: r3 = *(u64 *)(r1 +16)
>  because it wants to have lots of reads share a single insn.  You may be able
>  to defeat that optimisation by adding compiler barriers, idk.  Maybe someone
>  with llvm knowledge can figure out how to stop it (ideally, llvm would know
>  when it's generating for bpf backend and not do that).  -O0?  ¯\_(ツ)_/¯

set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x41300
set env: CLANG_EXEC=/usr/local/bin/clang
unset env: CLANG_OPTIONS
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h 
set env: PERF_BPF_INC_OPTIONS=-I/home/acme/lib/perf/include/bpf
set env: WORKING_DIR=/lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
set env: CLANG_SOURCE=/home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c
llvm compiling command template: $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $PERF_BPF_INC_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf $CLANG_EMIT_LLVM -O2 -o - $LLVM_OPTIONS_PIPE
llvm compiling command : /usr/local/bin/clang -D__KERNEL__ -D__NR_CPUS__=4 -DLINUX_VERSION_CODE=0x41300  -I/home/acme/lib/perf/include/bpf  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h  -Wno-unused-value -Wno-pointer-sign -working-directory /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build -c /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c -target bpf  -O2 -o -

So it is using -O2, lets try with -O0...

So I added this to my ~/.perfconfig, i.e. the default clang command line
template replacing -O2 with -O0.

[root@seventh perf]# cat ~/.perfconfig 
[llvm]
	clang-bpf-cmd-template = "$CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $PERF_BPF_INC_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c \"$CLANG_SOURCE\" -target bpf $CLANG_EMIT_LLVM -O0 -o - $LLVM_OPTIONS_PIPE"
	# dump-obj = true
[root@seventh perf]# 

And got an explosion:

# trace -vv -e tools/perf/examples/bpf/augmented_raw_syscalls.c sleep 1
bpf: builtin compilation failed: -95, try external compiler
Kernel build dir is set to /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
set env: KBUILD_DIR=/lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
unset env: KBUILD_OPTS
include option is set to  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h 
set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x41300
set env: CLANG_EXEC=/usr/local/bin/clang
unset env: CLANG_OPTIONS
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h 
set env: PERF_BPF_INC_OPTIONS=-I/home/acme/lib/perf/include/bpf
set env: WORKING_DIR=/lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
set env: CLANG_SOURCE=/home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c
llvm compiling command template: $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $PERF_BPF_INC_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf $CLANG_EMIT_LLVM -O0 -o - $LLVM_OPTIONS_PIPE
llvm compiling command : /usr/local/bin/clang -D__KERNEL__ -D__NR_CPUS__=4 -DLINUX_VERSION_CODE=0x41300  -I/home/acme/lib/perf/include/bpf  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h  -Wno-unused-value -Wno-pointer-sign -working-directory /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build -c /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c -target bpf  -O0 -o - 
clang-7: /home/acme/git/llvm/lib/Target/BPF/MCTargetDesc/BPFAsmBackend.cpp:74: virtual void {anonymous}::BPFAsmBackend::applyFixup(const llvm::MCAssembler&, const llvm::MCFixup&, const llvm::MCValue&, llvm::MutableArrayRef<char>, uint64_t, bool, const llvm::MCSubtargetInfo*) const: Assertion `Value == 0' failed.
Stack dump:
0.	Program arguments: /usr/local/bin/clang-7 -cc1 -triple bpf -emit-obj -mrelax-all -disable-free -main-file-name augmented_raw_syscalls.c -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -dwarf-column-info -debugger-tuning=gdb -coverage-notes-file /home/acme/git/perf/-.gcno -nostdsysteminc -nobuiltininc -resource-dir /usr/local/lib/clang/8.0.0 -working-directory /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -include /home/acme/git/linux/include/linux/kconfig.h -D __KERNEL__ -D __NR_CPUS__=4 -D LINUX_VERSION_CODE=0x41300 -I /home/acme/lib/perf/include/bpf -I /home/acme/git/linux/arch/x86/include -I ./arch/x86/include/generated -I /home/acme/git/linux/include -I ./include -I /home/acme/git/linux/arch/x86/include/uapi -I ./arch/x86/include/generated/uapi -I /home/acme/git/linux/include/uapi -I ./include/generated/uapi -O0 -Wno-unused-value -Wno-pointer-sign -fdebug-compilation-dir /home/acme/git/perf -ferror-limit 19 -fmessage-length 203 -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -o - -x c /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c -faddrsig 
1.	<eof> parser at end of file
2.	Code generation
#0 0x000000000408ff98 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/usr/local/bin/clang-7+0x408ff98)
#1 0x000000000409002b (/usr/local/bin/clang-7+0x409002b)
#2 0x000000000408e059 llvm::sys::RunSignalHandlers() (/usr/local/bin/clang-7+0x408e059)
#3 0x000000000408fa5b (/usr/local/bin/clang-7+0x408fa5b)
#4 0x00007fcd33ab81c0 __restore_rt (/lib64/libpthread.so.0+0x121c0)
#5 0x00007fcd32817750 __GI_raise (/lib64/libc.so.6+0x34750)
#6 0x00007fcd32818d31 __GI_abort (/lib64/libc.so.6+0x35d31)
#7 0x00007fcd3281005a __assert_fail_base (/lib64/libc.so.6+0x2d05a)
#8 0x00007fcd328100d2 (/lib64/libc.so.6+0x2d0d2)
#9 0x0000000002602d13 (/usr/local/bin/clang-7+0x2602d13)
#10 0x0000000003bf6da7 llvm::MCAssembler::layout(llvm::MCAsmLayout&) (/usr/local/bin/clang-7+0x3bf6da7)
#11 0x0000000003bf6e2c llvm::MCAssembler::Finish() (/usr/local/bin/clang-7+0x3bf6e2c)
#12 0x0000000003c524e8 llvm::MCObjectStreamer::FinishImpl() (/usr/local/bin/clang-7+0x3c524e8)
#13 0x0000000003c2db4b llvm::MCELFStreamer::FinishImpl() (/usr/local/bin/clang-7+0x3c2db4b)
#14 0x0000000003c5ba43 llvm::MCStreamer::Finish() (/usr/local/bin/clang-7+0x3c5ba43)
#15 0x0000000004c281b9 llvm::AsmPrinter::doFinalization(llvm::Module&) (/usr/local/bin/clang-7+0x4c281b9)
#16 0x00000000038d637b llvm::FPPassManager::doFinalization(llvm::Module&) (/usr/local/bin/clang-7+0x38d637b)
#17 0x00000000038d683d (/usr/local/bin/clang-7+0x38d683d)
#18 0x00000000038d6d6b llvm::legacy::PassManagerImpl::run(llvm::Module&) (/usr/local/bin/clang-7+0x38d6d6b)
#19 0x00000000038d6f9b llvm::legacy::PassManager::run(llvm::Module&) (/usr/local/bin/clang-7+0x38d6f9b)
#20 0x000000000436b091 (/usr/local/bin/clang-7+0x436b091)
#21 0x000000000436df07 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) (/usr/local/bin/clang-7+0x436df07)
#22 0x0000000004f6ee11 (/usr/local/bin/clang-7+0x4f6ee11)
#23 0x00000000062af3a3 clang::ParseAST(clang::Sema&, bool, bool) (/usr/local/bin/clang-7+0x62af3a3)
#24 0x0000000004a73873 clang::ASTFrontendAction::ExecuteAction() (/usr/local/bin/clang-7+0x4a73873)
#25 0x0000000004f6ced8 clang::CodeGenAction::ExecuteAction() (/usr/local/bin/clang-7+0x4f6ced8)
#26 0x0000000004a7329d clang::FrontendAction::Execute() (/usr/local/bin/clang-7+0x4a7329d)
#27 0x0000000004a04ccf clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/usr/local/bin/clang-7+0x4a04ccf)
#28 0x0000000004bc512e clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/usr/local/bin/clang-7+0x4bc512e)
#29 0x0000000001d88a6c cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/usr/local/bin/clang-7+0x1d88a6c)
#30 0x0000000001d7e31e (/usr/local/bin/clang-7+0x1d7e31e)
#31 0x0000000001d7ef94 main (/usr/local/bin/clang-7+0x1d7ef94)
#32 0x00007fcd32803fea __libc_start_main (/lib64/libc.so.6+0x20fea)
#33 0x0000000001d7beea _start (/usr/local/bin/clang-7+0x1d7beea)
clang-7: error: unable to execute command: Aborted (core dumped)
clang-7: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 8.0.0 (http://llvm.org/git/clang.git 8587270a739ee30c926a76d5657e65e85b560f6e) (http://llvm.org/git/llvm.git 0566eefef9c3777bd780ec4cbb9efa764633b76c)
Target: bpf
Thread model: posix
InstalledDir: /usr/local/bin
clang-7: note: diagnostic msg: PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
clang-7: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-7: note: diagnostic msg: /tmp/augmented_raw_syscalls-7444d9.c
clang-7: note: diagnostic msg: /tmp/augmented_raw_syscalls-7444d9.sh
clang-7: note: diagnostic msg: 

********************
ERROR:	unable to compile tools/perf/examples/bpf/augmented_raw_syscalls.c
Hint:	Check error message shown above.
Hint:	You can also pre-compile it into .o using:
     		clang -target bpf -O2 -c tools/perf/examples/bpf/augmented_raw_syscalls.c
     	with proper -I and -D options.
event syntax error: 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
                     \___ Failed to load tools/perf/examples/bpf/augmented_raw_syscalls.c from source: Error when compiling BPF scriptlet

(add -v to see detail)
Run 'perf list' for a list of valid events

 Usage: perf trace [<options>] [<command>]
    or: perf trace [<options>] -- <command> [<options>]
    or: perf trace record [<options>] [<command>]
    or: perf trace record [<options>] -- <command> [<options>]

    -e, --event <event>   event/syscall selector. use 'perf list' to list available events
[root@seventh perf]# 

Trying with -O1...

> Alternatively, your prog looks short enough that maybe you could kick the C
>  habit and write it directly in eBPF asm, that way no-one is optimising things
>  behind your back.  (I realise this option won't appeal to everyone ;-)

Nah, if that is what it takes... Would be better with simple looking C
code tho :-)

> The reason the verifier disallows this, iirc, is because it needs to be able
>  to rewrite the offsets on ctx accesses (see convert_ctx_accesses()) in case
>  underlying kernel struct doesn't match the layout of the ctx ABI.  To do this
>  it needs the ctx offset to live entirely in the insn doing the access,
>  otherwise different paths could lead to the same insn accessing different ctx
>  offsets with different fixups needed — can't be done.
> 
> -Ed

^ permalink raw reply

* Re: Help with the BPF verifier
From: Arnaldo Carvalho de Melo @ 2018-11-03 11:32 UTC (permalink / raw)
  To: Edward Cree
  Cc: Yonghong Song, Daniel Borkmann, Jiri Olsa, Martin Lau,
	Alexei Starovoitov, Linux Networking Development Mailing List
In-Reply-To: <20181103112934.GH20495@kernel.org>

Em Sat, Nov 03, 2018 at 08:29:34AM -0300, Arnaldo Carvalho de Melo escreveu:
> PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
> Preprocessed source(s) and associated run script(s) are located at:
> clang-7: note: diagnostic msg: /tmp/augmented_raw_syscalls-7444d9.c
> clang-7: note: diagnostic msg: /tmp/augmented_raw_syscalls-7444d9.sh
> clang-7: note: diagnostic msg: 
> 
> ********************
> ERROR:	unable to compile tools/perf/examples/bpf/augmented_raw_syscalls.c
> Hint:	Check error message shown above.
> Hint:	You can also pre-compile it into .o using:
>      		clang -target bpf -O2 -c tools/perf/examples/bpf/augmented_raw_syscalls.c
>      	with proper -I and -D options.
> event syntax error: 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
>                      \___ Failed to load tools/perf/examples/bpf/augmented_raw_syscalls.c from source: Error when compiling BPF scriptlet
> 
> (add -v to see detail)
> Run 'perf list' for a list of valid events
> 
>  Usage: perf trace [<options>] [<command>]
>     or: perf trace [<options>] -- <command> [<options>]
>     or: perf trace record [<options>] [<command>]
>     or: perf trace record [<options>] -- <command> [<options>]
> 
>     -e, --event <event>   event/syscall selector. use 'perf list' to list available events
> [root@seventh perf]# 
> 
> Trying with -O1...

-O1 doesn't get clang confused, its just the verifier that doesn't like
the result, i.e. we're back to that optimization, that isn't disabled
with -O1

llvm compiling command : /usr/local/bin/clang -D__KERNEL__ -D__NR_CPUS__=4 -DLINUX_VERSION_CODE=0x41300  -I/home/acme/lib/perf/include/bpf  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h  -Wno-unused-value -Wno-pointer-sign -working-directory /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build -c /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c -target bpf  -O1 -o - 
libbpf: loading object 'tools/perf/examples/bpf/augmented_raw_syscalls.c' from buffer
libbpf: section(1) .strtab, size 168, link 0, flags 0, type=3
libbpf: skip section(1) .strtab
libbpf: section(2) .text, size 0, link 0, flags 6, type=1
libbpf: skip section(2) .text
libbpf: section(3) raw_syscalls:sys_enter, size 344, link 0, flags 6, type=1
libbpf: found program raw_syscalls:sys_enter
libbpf: section(4) .relraw_syscalls:sys_enter, size 16, link 10, flags 0, type=9
libbpf: section(5) raw_syscalls:sys_exit, size 16, link 0, flags 6, type=1
libbpf: found program raw_syscalls:sys_exit
libbpf: section(6) maps, size 56, link 0, flags 3, type=1
libbpf: section(7) license, size 4, link 0, flags 3, type=1
libbpf: license of tools/perf/examples/bpf/augmented_raw_syscalls.c is GPL
libbpf: section(8) version, size 4, link 0, flags 3, type=1
libbpf: kernel version of tools/perf/examples/bpf/augmented_raw_syscalls.c is 41300
libbpf: section(9) .llvm_addrsig, size 6, link 10, flags 80000000, type=1879002115
libbpf: skip section(9) .llvm_addrsig
libbpf: section(10) .symtab, size 240, link 1, flags 0, type=2
libbpf: maps in tools/perf/examples/bpf/augmented_raw_syscalls.c: 2 maps in 56 bytes
libbpf: map 0 is "__augmented_syscalls__"
libbpf: map 1 is "__bpf_stdout__"
libbpf: collecting relocating info for: 'raw_syscalls:sys_enter'
libbpf: relo for 4 value 28 name 124
libbpf: relocation: insn_idx=35
libbpf: relocation: find map 1 (__augmented_syscalls__) for insn 35
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe0000006000-fffffe0000007000
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe0000032000-fffffe0000033000
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe000005e000-fffffe000005f000
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe000008a000-fffffe000008b000
bpf: config program 'raw_syscalls:sys_enter'
bpf: config program 'raw_syscalls:sys_exit'
libbpf: create map __bpf_stdout__: fd=3
libbpf: create map __augmented_syscalls__: fd=4
libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf: 
0: (bf) r6 = r1
1: (bf) r1 = r10
2: (07) r1 += -328
3: (b7) r7 = 64
4: (b7) r2 = 64
5: (bf) r3 = r6
6: (85) call bpf_probe_read#4
7: (79) r1 = *(u64 *)(r10 -320)
8: (15) if r1 == 0x101 goto pc+4
 R0=inv(id=0) R1=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
9: (55) if r1 != 0x2 goto pc+22
 R0=inv(id=0) R1=inv2 R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
10: (bf) r1 = r6
11: (07) r1 += 16
12: (05) goto pc+2
15: (79) r3 = *(u64 *)(r1 +0)
dereference of modified ctx ptr R1 off=16 disallowed

libbpf: -- END LOG --
libbpf: failed to load program 'raw_syscalls:sys_enter'
libbpf: failed to load object 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
bpf: load objects failed: err=-4007: (Kernel verifier blocks program loading)
event syntax error: 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
                     \___ Kernel verifier blocks program loading

(add -v to see detail)
Run 'perf list' for a list of valid events

 Usage: perf trace [<options>] [<command>]
    or: perf trace [<options>] -- <command> [<options>]
    or: perf trace record [<options>] [<command>]
    or: perf trace record [<options>] -- <command> [<options>]

    -e, --event <event>   event/syscall selector. use 'perf list' to list available events
[root@seventh perf]# 

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-03 12:01 UTC (permalink / raw)
  To: Aaron Lu, Jesper Dangaard Brouer
  Cc: Saeed Mahameed, eric.dumazet@gmail.com, netdev@vger.kernel.org,
	Tariq Toukan, ilias.apalodimas@linaro.org, yoel@kviknet.dk,
	mgorman@techsingularity.net
In-Reply-To: <c5135749-4bf8-47b4-dec4-326324ab6e1d@itcare.pl>



W dniu 03.11.2018 o 01:16, Paweł Staszewski pisze:
>
>
> W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze:
>>
>>
>> W dniu 02.11.2018 o 15:20, Aaron Lu pisze:
>>> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
>>>> On Fri, 2 Nov 2018 13:23:56 +0800
>>>> Aaron Lu <aaron.lu@intel.com> wrote:
>>>>
>>>>> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
>>>>>> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
>>>>>>> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
>>>>>>> wrote:
>>>>>>> ... ...
>>>>>>>> Section copied out:
>>>>>>>>
>>>>>>>>    mlx5e_poll_tx_cq
>>>>>>>>    |
>>>>>>>>     --16.34%--napi_consume_skb
>>>>>>>>               |
>>>>>>>>               |--12.65%--__free_pages_ok
>>>>>>>>               |          |
>>>>>>>>               |           --11.86%--free_one_page
>>>>>>>>               |                     |
>>>>>>>>               |                     |--10.10%
>>>>>>>> --queued_spin_lock_slowpath
>>>>>>>>               |                     |
>>>>>>>>               | --0.65%--_raw_spin_lock
>>>>>>> This callchain looks like it is freeing higher order pages than 
>>>>>>> order
>>>>>>> 0:
>>>>>>> __free_pages_ok is only called for pages whose order are bigger 
>>>>>>> than
>>>>>>> 0.
>>>>>> mlx5 rx uses only order 0 pages, so i don't know where these high 
>>>>>> order
>>>>>> tx SKBs are coming from..
>>>>> Perhaps here:
>>>>> __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
>>>>> __napi_alloc_frag() will all call page_frag_alloc(), which will use
>>>>> __page_frag_cache_refill() to get an order 3 page if possible, or 
>>>>> fall
>>>>> back to an order 0 page if order 3 page is not available.
>>>>>
>>>>> I'm not sure if your workload will use the above code path though.
>>>> TL;DR: this is order-0 pages (code-walk trough proof below)
>>>>
>>>> To Aaron, the network stack *can* call __free_pages_ok() with order-0
>>>> pages, via:
>>>>
>>>> static void skb_free_head(struct sk_buff *skb)
>>>> {
>>>>     unsigned char *head = skb->head;
>>>>
>>>>     if (skb->head_frag)
>>>>         skb_free_frag(head);
>>>>     else
>>>>         kfree(head);
>>>> }
>>>>
>>>> static inline void skb_free_frag(void *addr)
>>>> {
>>>>     page_frag_free(addr);
>>>> }
>>>>
>>>> /*
>>>>   * Frees a page fragment allocated out of either a compound or 
>>>> order 0 page.
>>>>   */
>>>> void page_frag_free(void *addr)
>>>> {
>>>>     struct page *page = virt_to_head_page(addr);
>>>>
>>>>     if (unlikely(put_page_testzero(page)))
>>>>         __free_pages_ok(page, compound_order(page));
>>>> }
>>>> EXPORT_SYMBOL(page_frag_free);
>>> I think here is a problem - order 0 pages are freed directly to buddy,
>>> bypassing per-cpu-pages. This might be the reason lock contention
>>> appeared on free path. Can someone apply below diff and see if lock
>>> contention is gone?
>> Will test it tonight
>>
> Patch applied
> perf report:
> https://ufile.io/sytfh
>
>
>
> But i need to wait also with more traffic currently cpu's are sleeping

before patch:
                                 | |                     |          | 
|--13.55%--mlx5e_poll_tx_cq
                                   | |                     |          
|          |          |
                                   | |                     |          
|          | --10.32%--napi_consume_skb
                                   | |                     |          
|          |                     |
                                   | |                     |          
|          | |--8.52%--__free_pages_ok
                                   | |                     |          
|          | |          |
                                   | |                     |          
|          | |           --7.67%--free_one_page
                                   | |                     |          
|          | |                     |
                                   | |                     |          
|          | |                     |--6.05%--queued_spin_lock_slowpath
                                   | |                     |          
|          | |                     |
                                   | |                     |          
|          | |                      --0.64%--_raw_spin_lock
                                   | |                     |          
|          |                     |
                                   | |                     |          
|          | |--0.77%--skb_release_data
                                   | |                     |          
|          |                     |
                                   | |                     |          
|          | --0.72%--page_frag_free

after patch:
                       |          | |                     |          | 
|--3.75%--mlx5e_poll_tx_cq
                        |          | |                     |          
|          |          |
                        |          | |                     |          
|          | --1.53%--napi_consume_skb
                        |          | |                     |          
|          |                     |
                        |          | |                     |          
|          | --0.54%--skb_release_data
                        |          | |                     |          
|          |
                        |          | |                     |          | 
--3.09%--mlx5e_post_rx_wqes
                        |          | |                     |          
|                     |
                        |          | |                     |          | 
--1.21%--__page_pool_alloc_pages_slow
                        |          | |                     |          
|                                |
                        |          | |                     |          | 
--1.16%--__alloc_pages_nodemask
                        |          | |                     | 
|                                           |
                        |          | |                     | | 
--1.05%--get_page_from_freelist


currently waiting for more traffic also


>
>
>
>
>>
>>
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index e2ef1c17942f..65c0ae13215a 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
>>>   {
>>>       struct page *page = virt_to_head_page(addr);
>>>   -    if (unlikely(put_page_testzero(page)))
>>> -        __free_pages_ok(page, compound_order(page));
>>> +    if (unlikely(put_page_testzero(page))) {
>>> +        unsigned int order = compound_order(page);
>>> +
>>> +        if (order == 0)
>>> +            free_unref_page(page);
>>> +        else
>>> +            __free_pages_ok(page, order);
>>> +    }
>>>   }
>>>   EXPORT_SYMBOL(page_frag_free);
>>>> Notice for the mlx5 driver it support several RX-memory models, so it
>>>> can be hard to follow, but from the perf report output we can see that
>>>> is uses mlx5e_skb_from_cqe_linear, which use build_skb.
>>>>
>>>> --13.63%--mlx5e_skb_from_cqe_linear
>>>>            |
>>>>             --5.02%--build_skb
>>>>                       |
>>>>                        --1.85%--__build_skb
>>>>                                  |
>>>>                                   --1.00%--kmem_cache_alloc
>>>>
>>>> /* build_skb() is wrapper over __build_skb(), that specifically
>>>>   * takes care of skb->head and skb->pfmemalloc
>>>>   * This means that if @frag_size is not zero, then @data must be 
>>>> backed
>>>>   * by a page fragment, not kmalloc() or vmalloc()
>>>>   */
>>>> struct sk_buff *build_skb(void *data, unsigned int frag_size)
>>>> {
>>>>     struct sk_buff *skb = __build_skb(data, frag_size);
>>>>
>>>>     if (skb && frag_size) {
>>>>         skb->head_frag = 1;
>>>>         if (page_is_pfmemalloc(virt_to_head_page(data)))
>>>>             skb->pfmemalloc = 1;
>>>>     }
>>>>     return skb;
>>>> }
>>>> EXPORT_SYMBOL(build_skb);
>>>>
>>>> It still doesn't prove, that the @data is backed by by a order-0 page.
>>>> For the mlx5 driver is uses mlx5e_page_alloc_mapped ->
>>>> page_pool_dev_alloc_pages(), and I can see perf report using
>>>> __page_pool_alloc_pages_slow().
>>>>
>>>> The setup for page_pool in mlx5 uses order=0.
>>>>
>>>>     /* Create a page_pool and register it with rxq */
>>>>     pp_params.order     = 0;
>>>>     pp_params.flags     = 0; /* No-internal DMA mapping in 
>>>> page_pool */
>>>>     pp_params.pool_size = pool_size;
>>>>     pp_params.nid       = cpu_to_node(c->cpu);
>>>>     pp_params.dev       = c->pdev;
>>>>     pp_params.dma_dir   = rq->buff.map_dir;
>>>>
>>>>     /* page_pool can be used even when there is no rq->xdp_prog,
>>>>      * given page_pool does not handle DMA mapping there is no
>>>>      * required state to clear. And page_pool gracefully handle
>>>>      * elevated refcnt.
>>>>      */
>>>>     rq->page_pool = page_pool_create(&pp_params);
>>>>     if (IS_ERR(rq->page_pool)) {
>>>>         err = PTR_ERR(rq->page_pool);
>>>>         rq->page_pool = NULL;
>>>>         goto err_free;
>>>>     }
>>>>     err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
>>>>                      MEM_TYPE_PAGE_POOL, rq->page_pool);
>>> Thanks for the detailed analysis, I'll need more time to understand the
>>> whole picture :-)
>>>
>>
>>
>
>

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Jesper Dangaard Brouer @ 2018-11-03 12:53 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Saeed Mahameed, pstaszewski@itcare.pl, eric.dumazet@gmail.com,
	netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
	yoel@kviknet.dk, mgorman@techsingularity.net, brouer
In-Reply-To: <20181102142024.GA18343@intel.com>


On Fri, 2 Nov 2018 22:20:24 +0800 Aaron Lu <aaron.lu@intel.com> wrote:

> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
> > On Fri, 2 Nov 2018 13:23:56 +0800
> > Aaron Lu <aaron.lu@intel.com> wrote:
> >   
> > > On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:  
> > > > On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:    
> > > > > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > > > > wrote:
> > > > > ... ...    
> > > > > > Section copied out:
> > > > > > 
> > > > > >   mlx5e_poll_tx_cq
> > > > > >   |          
> > > > > >    --16.34%--napi_consume_skb
> > > > > >              |          
> > > > > >              |--12.65%--__free_pages_ok
> > > > > >              |          |          
> > > > > >              |           --11.86%--free_one_page
> > > > > >              |                     |          
> > > > > >              |                     |--10.10%
> > > > > > --queued_spin_lock_slowpath
> > > > > >              |                     |          
> > > > > >              |                      --0.65%--_raw_spin_lock    
> > > > > 
> > > > > This callchain looks like it is freeing higher order pages than order
> > > > > 0:
> > > > > __free_pages_ok is only called for pages whose order are bigger than
> > > > > 0.    
> > > > 
> > > > mlx5 rx uses only order 0 pages, so i don't know where these high order
> > > > tx SKBs are coming from..     
> > > 
> > > Perhaps here:
> > > __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
> > > __napi_alloc_frag() will all call page_frag_alloc(), which will use
> > > __page_frag_cache_refill() to get an order 3 page if possible, or fall
> > > back to an order 0 page if order 3 page is not available.
> > > 
> > > I'm not sure if your workload will use the above code path though.  
> > 
> > TL;DR: this is order-0 pages (code-walk trough proof below)
> > 
> > To Aaron, the network stack *can* call __free_pages_ok() with order-0
> > pages, via:
> > 
> > static void skb_free_head(struct sk_buff *skb)
> > {
> > 	unsigned char *head = skb->head;
> > 
> > 	if (skb->head_frag)
> > 		skb_free_frag(head);
> > 	else
> > 		kfree(head);
> > }
> > 
> > static inline void skb_free_frag(void *addr)
> > {
> > 	page_frag_free(addr);
> > }
> > 
> > /*
> >  * Frees a page fragment allocated out of either a compound or order 0 page.
> >  */
> > void page_frag_free(void *addr)
> > {
> > 	struct page *page = virt_to_head_page(addr);
> > 
> > 	if (unlikely(put_page_testzero(page)))
> > 		__free_pages_ok(page, compound_order(page));
> > }
> > EXPORT_SYMBOL(page_frag_free);  
> 
> I think here is a problem - order 0 pages are freed directly to buddy,
> bypassing per-cpu-pages. This might be the reason lock contention
> appeared on free path. 

OMG - you just found a significant issue with the network stacks
interaction with the page allocator!  This explains why I could not get
the PCP (Per-Cpu-Pages) system to have good performance, in my
performance networking benchmarks. As we are basically only using the
alloc side of PCP, and not the free side.
 We have spend years adding different driver level recycle tricks to
avoid this code path getting activated, exactly because it is rather
slow and problematic that we hit this zone->lock.

> Can someone apply below diff and see if lock contention is gone?

I have also applied and tested this patch, and yes the lock contention
is gone.  As mentioned is it rather difficult to hit this code path, as
the driver page recycle mechanism tries to hide/avoid it, but mlx5 +
page_pool + CPU-map recycling have a known weakness that bypass the
driver page recycle scheme (that I've not fixed yet).  I observed a 7%
speedup for this micro benchmark.

 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e2ef1c17942f..65c0ae13215a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
>  {
>  	struct page *page = virt_to_head_page(addr);
>  
> -	if (unlikely(put_page_testzero(page)))
> -		__free_pages_ok(page, compound_order(page));
> +	if (unlikely(put_page_testzero(page))) {
> +		unsigned int order = compound_order(page);
> +
> +		if (order == 0)
> +			free_unref_page(page);
> +		else
> +			__free_pages_ok(page, order);
> +	}
>  }
>  EXPORT_SYMBOL(page_frag_free);

Thank you Aaron for spotting this!!!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Jesper Dangaard Brouer @ 2018-11-03 12:58 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Aaron Lu, Saeed Mahameed, eric.dumazet@gmail.com,
	netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
	yoel@kviknet.dk, mgorman@techsingularity.net, brouer
In-Reply-To: <c5135749-4bf8-47b4-dec4-326324ab6e1d@itcare.pl>

On Sat, 3 Nov 2018 01:16:08 +0100
Paweł Staszewski <pstaszewski@itcare.pl> wrote:

> W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze:
> >
> >
> > W dniu 02.11.2018 o 15:20, Aaron Lu pisze:  
> >> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:  
> >>> On Fri, 2 Nov 2018 13:23:56 +0800
> >>> Aaron Lu <aaron.lu@intel.com> wrote:
> >>>  
> >>>> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:  
> >>>>> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:  
> >>>>>> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> >>>>>> wrote:
> >>>>>> ... ...  
[...]
> >>> TL;DR: this is order-0 pages (code-walk trough proof below)
> >>>
> >>> To Aaron, the network stack *can* call __free_pages_ok() with order-0
> >>> pages, via:
[...]
> >>  
> >> I think here is a problem - order 0 pages are freed directly to buddy,
> >> bypassing per-cpu-pages. This might be the reason lock contention
> >> appeared on free path. Can someone apply below diff and see if lock
> >> contention is gone?  
> >>
> > Will test it tonight
> >  
> Patch applied
> perf report:
> https://ufile.io/sytfh
> 
> 
> But i need to wait also with more traffic currently cpu's are sleeping
> 

Well, that would be the expected result, that the CPUs get more time to
sleep, if the lock contention is gone...

What is the measured bandwidth now?

Notice, you might still be limited by the PCIe bandwidth, but then your
CPUs might actually decide to sleep, as they are getting data fast
enough.


[...]
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index e2ef1c17942f..65c0ae13215a 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
> >>   {
> >>       struct page *page = virt_to_head_page(addr);
> >>   -    if (unlikely(put_page_testzero(page)))
> >> -        __free_pages_ok(page, compound_order(page));
> >> +    if (unlikely(put_page_testzero(page))) {
> >> +        unsigned int order = compound_order(page);
> >> +
> >> +        if (order == 0)
> >> +            free_unref_page(page);
> >> +        else
> >> +            __free_pages_ok(page, order);
> >> +    }



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Masami Hiramatsu @ 2018-11-03 13:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
	David S. Miller, Masami Hiramatsu, Jonathan Corbet,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Brendan Gregg,
	Christian Brauner, Aleksa Sarai, netd
In-Reply-To: <20181102121307.32e99414@gandalf.local.home>

On Fri, 2 Nov 2018 12:13:07 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 2 Nov 2018 10:43:26 -0500
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > > I'll hopefully have a prototype ready by plumbers.  
> > 
> > Why do we need multiple users?  It would be a lot simpler if we could
> > just enforce a single user per fgraphed/kretprobed function (and return
> > -EBUSY if it's already being traced/probed).
> 
> Because that means if function graph tracer is active, then you can't
> do a kretprobe, and vice versa. I'd really like to have it working for
> multiple users, then we could trace different graph functions and store
> them in different buffers. It would also allow for perf to use function
> graph tracer too.

Steve, how woul you allow multiple users on it? Something like this?

ret_trampoline_multiple(){
   list_for_each(handler, &shadow_entry[i].handlers, list)
	handler(shadow_entry[i]);
   restore_retval_and_jump_to(shadow_entry[i].orig);
}


> > > And this too will require each architecture to probably change. As a
> > > side project to this, I'm going to try to consolidate the function
> > > graph code among all the architectures as well. Not an easy task.  
> > 
> > Do you mean implementing HAVE_FUNCTION_GRAPH_RET_ADDR_PTR for all the
> > arches?  If so, I think have an old crusty patch which attempted to
> > that.  I could try to dig it up if you're interested.
> > 
> 
> I'd like to have that, but it still requires some work. But I'd just
> the truly architecture dependent code be in the architecture (basically
> the asm code), and have the ability to move most of the duplicate code
> out of the archs.

I will also do that for kretprobe handlers.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-03 15:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Aaron Lu, Saeed Mahameed, eric.dumazet@gmail.com,
	netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
	yoel@kviknet.dk, mgorman@techsingularity.net
In-Reply-To: <20181103135831.180628ab@redhat.com>



W dniu 03.11.2018 o 13:58, Jesper Dangaard Brouer pisze:
> On Sat, 3 Nov 2018 01:16:08 +0100
> Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>
>> W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze:
>>>
>>> W dniu 02.11.2018 o 15:20, Aaron Lu pisze:
>>>> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
>>>>> On Fri, 2 Nov 2018 13:23:56 +0800
>>>>> Aaron Lu <aaron.lu@intel.com> wrote:
>>>>>   
>>>>>> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
>>>>>>> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
>>>>>>>> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
>>>>>>>> wrote:
>>>>>>>> ... ...
> [...]
>>>>> TL;DR: this is order-0 pages (code-walk trough proof below)
>>>>>
>>>>> To Aaron, the network stack *can* call __free_pages_ok() with order-0
>>>>> pages, via:
> [...]
>>>>   
>>>> I think here is a problem - order 0 pages are freed directly to buddy,
>>>> bypassing per-cpu-pages. This might be the reason lock contention
>>>> appeared on free path. Can someone apply below diff and see if lock
>>>> contention is gone?
>>>>
>>> Will test it tonight
>>>   
>> Patch applied
>> perf report:
>> https://ufile.io/sytfh
>>
>>
>> But i need to wait also with more traffic currently cpu's are sleeping
>>
> Well, that would be the expected result, that the CPUs get more time to
> sleep, if the lock contention is gone...
>
> What is the measured bandwidth now?
30 RX /30 TX Gbit/s

>
> Notice, you might still be limited by the PCIe bandwidth, but then your
> CPUs might actually decide to sleep, as they are getting data fast
> enough.
Yes - i will replace network controller to two separate nic's in two 
separate x16 pcie
But after monday.

But i dont think i hit pcie limit there - it looks like pcie x16 gen3 
have 16GB/s RX and 16GB/s TX so bidirectional


>
> [...]
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index e2ef1c17942f..65c0ae13215a 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
>>>>    {
>>>>        struct page *page = virt_to_head_page(addr);
>>>>    -    if (unlikely(put_page_testzero(page)))
>>>> -        __free_pages_ok(page, compound_order(page));
>>>> +    if (unlikely(put_page_testzero(page))) {
>>>> +        unsigned int order = compound_order(page);
>>>> +
>>>> +        if (order == 0)
>>>> +            free_unref_page(page);
>>>> +        else
>>>> +            __free_pages_ok(page, order);
>>>> +    }
>
>

^ permalink raw reply

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Eric Dumazet @ 2018-11-04  0:40 UTC (permalink / raw)
  To: Yafang Shao, davem, edumazet; +Cc: netdev, linux-kernel
In-Reply-To: <1541264071-9905-1-git-send-email-laoar.shao@gmail.com>



On 11/03/2018 09:54 AM, Yafang Shao wrote:
> In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> and under this condition we don't need to update the snd_una.
> 
> Furthermore, tcp_ack_update_window() is only called in slow path,
> so introducing this check won't affect the fast path processing.
> 
> By the way, '&' is a little faster than '-', so I replaced after() with
> "flag & FLAG_SND_UNA_ADVANCED".
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  net/ipv4/tcp_input.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2868ef2..db5a6b7 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3376,7 +3376,8 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
>  		}
>  	}
>  
> -	tcp_snd_una_update(tp, ack);
> +	if (after(ack, tp->snd_una))
> +		tcp_snd_una_update(tp, ack);
>  

Adding this after() here is confusing, how ack could be before snd_una ?
That would be a serious bug.

I do not see why another conditional has any gain here.

You are trying to avoid very cheap operations :

    u32 delta = ack - tp->snd_una;

    tp->bytes_acked += delta;
    tp->snd_una = ack;

Maybe the real reason for this patch is not explained in the changelog ?

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-03 15:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Aaron Lu, Saeed Mahameed, eric.dumazet@gmail.com,
	netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
	yoel@kviknet.dk, mgorman@techsingularity.net
In-Reply-To: <9f35c3f5-e865-54db-73bb-960ded60c1cc@itcare.pl>



W dniu 03.11.2018 o 16:23, Paweł Staszewski pisze:
>
>
> W dniu 03.11.2018 o 13:58, Jesper Dangaard Brouer pisze:
>> On Sat, 3 Nov 2018 01:16:08 +0100
>> Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>>
>>> W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze:
>>>>
>>>> W dniu 02.11.2018 o 15:20, Aaron Lu pisze:
>>>>> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer 
>>>>> wrote:
>>>>>> On Fri, 2 Nov 2018 13:23:56 +0800
>>>>>> Aaron Lu <aaron.lu@intel.com> wrote:
>>>>>>> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
>>>>>>>> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
>>>>>>>>> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
>>>>>>>>> wrote:
>>>>>>>>> ... ...
>> [...]
>>>>>> TL;DR: this is order-0 pages (code-walk trough proof below)
>>>>>>
>>>>>> To Aaron, the network stack *can* call __free_pages_ok() with 
>>>>>> order-0
>>>>>> pages, via:
>> [...]
>>>>>   I think here is a problem - order 0 pages are freed directly to 
>>>>> buddy,
>>>>> bypassing per-cpu-pages. This might be the reason lock contention
>>>>> appeared on free path. Can someone apply below diff and see if lock
>>>>> contention is gone?
>>>>>
>>>> Will test it tonight
>>> Patch applied
>>> perf report:
>>> https://ufile.io/sytfh
>>>
>>>
>>> But i need to wait also with more traffic currently cpu's are sleeping
>>>
>> Well, that would be the expected result, that the CPUs get more time to
>> sleep, if the lock contention is gone...
>>
>> What is the measured bandwidth now?
> 30 RX /30 TX Gbit/s
>
>>
>> Notice, you might still be limited by the PCIe bandwidth, but then your
>> CPUs might actually decide to sleep, as they are getting data fast
>> enough.
> Yes - i will replace network controller to two separate nic's in two 
> separate x16 pcie
> But after monday.
>
> But i dont think i hit pcie limit there - it looks like pcie x16 gen3 
> have 16GB/s RX and 16GB/s TX so bidirectional
>
Was thinking that maybee memory limit - but also there is 4 channel DDR4 
2666MHz - so total bandwidth for memory is bigger (48GB/s) than needed 
for 100Gbit ethernet





>
>>
>> [...]
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index e2ef1c17942f..65c0ae13215a 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
>>>>>    {
>>>>>        struct page *page = virt_to_head_page(addr);
>>>>>    -    if (unlikely(put_page_testzero(page)))
>>>>> -        __free_pages_ok(page, compound_order(page));
>>>>> +    if (unlikely(put_page_testzero(page))) {
>>>>> +        unsigned int order = compound_order(page);
>>>>> +
>>>>> +        if (order == 0)
>>>>> +            free_unref_page(page);
>>>>> +        else
>>>>> +            __free_pages_ok(page, order);
>>>>> +    }
>>
>>
>
>

^ permalink raw reply

* SACK compression patch causing performance drop
From: Jean-Louis Dupond @ 2018-11-03 15:59 UTC (permalink / raw)
  To: netdev, edumazet

Hi All,

On recent kernels we noticed a way lower throughput to our SAN system 
than before.
While on pre 4.18 kernels we had 400-700MB/sec read speed, on 4.18+ we 
only had 70-120MB/sec.

The SAN is connected via iSCSI over a 10G network (ixgbe/X520 NICS if it 
matters).

After some debugging, I tried to bisect between 4.17 and 4.18 to see 
what commit caused the slowdown.
It showed that the addition of the SACK compression 
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e) 
was the cause.

And indeed, if I set net.ipv4.tcp_comp_sack_nr to 0 on 4.19 for example, 
the throughput is (almost) back to normal again.
So it seems like this change causes quite some performance issues.

Any ideas?

Thanks
Jean-Louis

^ permalink raw reply

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Yafang Shao @ 2018-11-04  1:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Eric Dumazet, netdev, LKML
In-Reply-To: <5f585304-b6f5-5f49-adcf-eaed471c0d76@gmail.com>

On Sun, Nov 4, 2018 at 8:40 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 11/03/2018 09:54 AM, Yafang Shao wrote:
> > In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> > and under this condition we don't need to update the snd_una.
> >
> > Furthermore, tcp_ack_update_window() is only called in slow path,
> > so introducing this check won't affect the fast path processing.
> >
> > By the way, '&' is a little faster than '-', so I replaced after() with
> > "flag & FLAG_SND_UNA_ADVANCED".
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  net/ipv4/tcp_input.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 2868ef2..db5a6b7 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3376,7 +3376,8 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
> >               }
> >       }
> >
> > -     tcp_snd_una_update(tp, ack);
> > +     if (after(ack, tp->snd_una))
> > +             tcp_snd_una_update(tp, ack);
> >
>
> Adding this after() here is confusing, how ack could be before snd_una ?
> That would be a serious bug.
>

ack can't be before snd_una, but it can be equal with snd_una.
Seems bellow change would be more specific,
    if (ack != tp->snd_una)
       tcp_snd_una_update(tp, ack);

> I do not see why another conditional has any gain here.
>
> You are trying to avoid very cheap operations :
>
>     u32 delta = ack - tp->snd_una;
>
>     tp->bytes_acked += delta;
>     tp->snd_una = ack;
>
> Maybe the real reason for this patch is not explained in the changelog ?

No additional reason. I just want to avoid these uneccessary operations.
Because I find that this uncessary operations always happen,
especially when head prediction fails and then the packet can't go to
fast path processing.

Thanks
Yafang

^ permalink raw reply

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Yafang Shao @ 2018-11-04  1:27 UTC (permalink / raw)
  To: joe; +Cc: David Miller, Eric Dumazet, netdev, LKML
In-Reply-To: <945a90d62aa45c4b53349ba0a104574759d40efe.camel@perches.com>

On Sun, Nov 4, 2018 at 1:04 AM Joe Perches <joe@perches.com> wrote:
>
> On Sun, 2018-11-04 at 00:54 +0800, Yafang Shao wrote:
> > In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> > and under this condition we don't need to update the snd_una.
> >
> > Furthermore, tcp_ack_update_window() is only called in slow path,
> > so introducing this check won't affect the fast path processing.
> []
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> []
> > @@ -3610,7 +3611,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> >       if (flag & FLAG_UPDATE_TS_RECENT)
> >               tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
> >
> > -     if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
> > +     if (!(flag & FLAG_SLOWPATH) && flag & FLAG_SND_UNA_ADVANCED) {
>
> stylistic nit:
>
> While the precedence is correct in any case,
> perhaps adding parentheses around
>         flag & FLAG_SND_UNA_ADVANCED
> would make it more obvious.
>

Sure. will change it.

Thanks
Yafang

^ permalink raw reply

* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Masami Hiramatsu @ 2018-11-03 16:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Josh Poimboeuf, Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
	David S. Miller, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Shuah Khan, Alexei Starovoitov, Daniel Borkmann,
	Brendan Gregg, Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181103091341.3d32683e@vmware.local.home>

On Sat, 3 Nov 2018 09:13:41 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sat, 3 Nov 2018 22:00:12 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > On Fri, 2 Nov 2018 12:13:07 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> 
> > > Because that means if function graph tracer is active, then you can't
> > > do a kretprobe, and vice versa. I'd really like to have it working for
> > > multiple users, then we could trace different graph functions and store
> > > them in different buffers. It would also allow for perf to use function
> > > graph tracer too.  
> > 
> > Steve, how woul you allow multiple users on it? Something like this?
> > 
> > ret_trampoline_multiple(){
> >    list_for_each(handler, &shadow_entry[i].handlers, list)
> > 	handler(shadow_entry[i]);
> >    restore_retval_and_jump_to(shadow_entry[i].orig);
> > }
> >
> 
> Something like that. But since it's not a single mapping between shadow
> entries and handlers, that is we have multiple tasks with multiple
> shadow entries and multiple handlers, we can't use a link list (two
> different parents per handler).

Yes, I understand it.

> I was thinking of a bitmask that represents the handlers, and use that
> to map which handler gets called for which shadow entry for a
> particular task.

Hmm, I doubt that is too complicated and not scalable. I rather like to see
the open shadow entry...

entry: [[original_retaddr][function][modified_retaddr]]

So if there are many users on same function, the entries will be like this 

[[original_return_address][function][trampoline_A]]
[[trampline_A][function][trampoline_B]]
[[trampline_B][function][trampoline_C]]

And on the top of the stack, there is trampline_C instead of original_return_address.
In this case, return to trampoline_C(), it jumps back to trampline_B() and then
it jumps back to trampline_A(). And eventually it jumps back to
original_return_address.

This way, we don't need allocate another bitmap/pages for the shadow stack.
We only need a shadow stack for each task.
Also, unwinder can easily find the trampline_C from the shadow stack and restores
original_return_address. (of course trampline_A,B,C must be registered so that
search function can skip it.)

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Yafang Shao @ 2018-11-03 16:54 UTC (permalink / raw)
  To: davem, edumazet; +Cc: netdev, linux-kernel, Yafang Shao

In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
and under this condition we don't need to update the snd_una.

Furthermore, tcp_ack_update_window() is only called in slow path,
so introducing this check won't affect the fast path processing.

By the way, '&' is a little faster than '-', so I replaced after() with
"flag & FLAG_SND_UNA_ADVANCED".

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 net/ipv4/tcp_input.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2868ef2..db5a6b7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3376,7 +3376,8 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
 		}
 	}
 
-	tcp_snd_una_update(tp, ack);
+	if (after(ack, tp->snd_una))
+		tcp_snd_una_update(tp, ack);
 
 	return flag;
 }
@@ -3610,7 +3611,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (flag & FLAG_UPDATE_TS_RECENT)
 		tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
 
-	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
+	if (!(flag & FLAG_SLOWPATH) && flag & FLAG_SND_UNA_ADVANCED) {
 		/* Window is constant, pure forward advance.
 		 * No more checks are required.
 		 * Note, we use the fact that SND.UNA>=SND.WL2.
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] tcp: do not update snd_una if it is same with ack_seq
From: Joe Perches @ 2018-11-03 17:04 UTC (permalink / raw)
  To: Yafang Shao, davem, edumazet; +Cc: netdev, linux-kernel
In-Reply-To: <1541264071-9905-1-git-send-email-laoar.shao@gmail.com>

On Sun, 2018-11-04 at 00:54 +0800, Yafang Shao wrote:
> In the slow path, TCP_SKB_SB(skb)->ack_seq may be same with tp->snd_una,
> and under this condition we don't need to update the snd_una.
> 
> Furthermore, tcp_ack_update_window() is only called in slow path,
> so introducing this check won't affect the fast path processing.
[]
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
[]
> @@ -3610,7 +3611,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  	if (flag & FLAG_UPDATE_TS_RECENT)
>  		tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
>  
> -	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
> +	if (!(flag & FLAG_SLOWPATH) && flag & FLAG_SND_UNA_ADVANCED) {

stylistic nit:

While the precedence is correct in any case,
perhaps adding parentheses around
	flag & FLAG_SND_UNA_ADVANCED
would make it more obvious.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox