Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net v2 2/2] ipv6: properly check return value in inet6_dump_all()
From: Alexey Kodanev @ 2018-11-02 16:11 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern, David Miller, Alexey Kodanev
In-Reply-To: <1541175065-25931-1-git-send-email-alexey.kodanev@oracle.com>

Make sure we call fib6_dump_end() if it happens that skb->len
is zero. rtnl_dump_all() can reset cb->args on the next loop
iteration there.

Fixes: 08e814c9e8eb ("net/ipv6: Bail early if user only wants cloned entries")
Fixes: ae677bbb4441 ("net: Don't return invalid table id error when dumping all families")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---

v2: a new patch in v2

 net/ipv6/ip6_fib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 1b8bc00..ae37861 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -591,7 +591,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 
 	/* fib entries are never clones */
 	if (arg.filter.flags & RTM_F_CLONED)
-		return skb->len;
+		goto out;
 
 	w = (void *)cb->args[2];
 	if (!w) {
@@ -621,7 +621,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 		tb = fib6_get_table(net, arg.filter.table_id);
 		if (!tb) {
 			if (arg.filter.dump_all_families)
-				return skb->len;
+				goto out;
 
 			NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not exist");
 			return -ENOENT;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] net/mlx5e: fix high stack usage
From: Saeed Mahameed @ 2018-11-03  0:52 UTC (permalink / raw)
  To: arnd@arndb.de
  Cc: linux-kernel@vger.kernel.org, Moshe Shemesh,
	linux-rdma@vger.kernel.org, Boris Pismenny, Tariq Toukan,
	akpm@linux-foundation.org, Eran Ben Elisha,
	netdev@vger.kernel.org, Ilya Lesokhin, eric.dumazet@gmail.com,
	davem@davemloft.net, leon@kernel.org, Kamal Heib
In-Reply-To: <CAK8P3a0ybuiu_dBfUDfMYUtKZ=yGdB7An0_wx_ZGhnS5rrD3_w@mail.gmail.com>

On Fri, 2018-11-02 at 23:15 +0100, Arnd Bergmann wrote:
> On 11/2/18, Saeed Mahameed <saeedm@mellanox.com> wrote:
> > On Fri, 2018-11-02 at 14:39 -0700, Eric Dumazet wrote:
> > > 
> > > On 11/02/2018 02:05 PM, Saeed Mahameed wrote:
> > > 
> > > > temp will be mem copied to priv->stats.sw at the end,
> > > > memcpy(&priv->stats.sw, &s, sizeof(s));
> > > > 
> > > > one other way to solve this as suggested by Andrew, is to get
> > > > rid
> > > > of
> > > > the temp var and make it point directly to priv->stats.sw
> > > > 
> > > 
> > > What about concurrency ?
> > > 
> > > This temp variable is there to make sure concurrent readers of
> > > stats
> > > might
> > > not see mangle data (because another 'reader' just did a memset()
> > > and
> > > is doing the folding)
> > > 
> > > 
> > > mlx5e_get_stats() can definitely be run at the same time by
> > > multiple
> > > threads.
> > > 
> > 
> > hmm, you are right, i was thinking that mlx5e_get_Stats will
> > trigger a
> > work to update stats and grab the state_lock, but for sw stats this
> > is
> > not the case it is done in place.
> > 
> > BTW memcpy itself is not thread safe.
> 
> Before commit 6c63efe4cfab ("net/mlx5e: Remove redundant
> active_channels
> indication"), there was a read_lock() in the function apparently
> intended to
> made it thread safe. This got removed with the comment
> 
> commit 6c63efe4cfabf230a8ed4b1d880249875ffdac13
> Author: Eran Ben Elisha <eranbe@mellanox.com>
> Date:   Tue May 29 11:06:31 2018 +0300
> 
>     net/mlx5e: Remove redundant active_channels indication
> 
>     Now, when all channels stats are saved regardless of the
> channel's state
>     {open, closed}, we can safely remove this indication and the
> stats spin
>     lock which protects it.
> 
>     Fixes: 76c3810bade3 ("net/mlx5e: Avoid reset netdev stats on
> configuration changes")
> 
> I don't really understand the reasoning, but maybe we can remove
> the memcpy() if the code is thread safe, or we need the lock back if
> it's not.
> 

this lock was needed for a whole different purpose, it wasn't meant to
synchronize between two reader threads, it was meant to synchronize
between driver restarts and the reader for loop which ran over the open
channels, while they could be going through a destruction process.

I think all we need is to maintain two priv->stats.sw copies and use
them as temp for each reader thread, when can only have two concurrent
readers (mlx5e_ethtool_get_ethtool_stats and ndo_get_stats64) .. 

>          Arnd

^ permalink raw reply

* Re: Help with the BPF verifier
From: Edward Cree @ 2018-11-02 15:42 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Yonghong Song, Daniel Borkmann, Jiri Olsa, Martin Lau,
	Alexei Starovoitov, Linux Networking Development Mailing List
In-Reply-To: <20181102150239.GG20495@kernel.org>

On 02/11/18 15:02, Arnaldo Carvalho de Melo wrote:
> Yeah, didn't work as well: 

> And the -vv in 'perf trace' didn't seem to map to further details in the
> output of the verifier debug:
Yeah for log_level 2 you probably need to make source-level changes to either
 perf or libbpf (I think the latter).  It's annoying that essentially no tools
 plumb through an option for that, someone should fix them ;-)

> libbpf: -- BEGIN DUMP LOG ---
> libbpf: 
> 0: (bf) r6 = r1
> 1: (bf) r1 = r10
> 2: (07) r1 += -328
> 3: (b7) r7 = 64
> 4: (b7) r2 = 64
> 5: (bf) r3 = r6
> 6: (85) call bpf_probe_read#4
> 7: (79) r1 = *(u64 *)(r10 -320)
> 8: (15) if r1 == 0x101 goto pc+4
>  R0=inv(id=0) R1=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> 9: (55) if r1 != 0x2 goto pc+22
>  R0=inv(id=0) R1=inv2 R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> 10: (bf) r1 = r6
> 11: (07) r1 += 16
> 12: (05) goto pc+2
> 15: (79) r3 = *(u64 *)(r1 +0)
> dereference of modified ctx ptr R1 off=16 disallowed
Aha, we at least got a different error message this time.
And indeed llvm has done that optimisation, rather than the more obvious
11: r3 = *(u64 *)(r1 +16)
 because it wants to have lots of reads share a single insn.  You may be able
 to defeat that optimisation by adding compiler barriers, idk.  Maybe someone
 with llvm knowledge can figure out how to stop it (ideally, llvm would know
 when it's generating for bpf backend and not do that).  -O0?  ¯\_(ツ)_/¯
Alternatively, your prog looks short enough that maybe you could kick the C
 habit and write it directly in eBPF asm, that way no-one is optimising things
 behind your back.  (I realise this option won't appeal to everyone ;-)
The reason the verifier disallows this, iirc, is because it needs to be able
 to rewrite the offsets on ctx accesses (see convert_ctx_accesses()) in case
 underlying kernel struct doesn't match the layout of the ctx ABI.  To do this
 it needs the ctx offset to live entirely in the insn doing the access,
 otherwise different paths could lead to the same insn accessing different ctx
 offsets with different fixups needed — can't be done.

-Ed

^ permalink raw reply

* Re: [PATCH] net/mlx5e: fix high stack usage
From: Saeed Mahameed @ 2018-11-03  0:37 UTC (permalink / raw)
  To: jgg@ziepe.ca
  Cc: linux-kernel@vger.kernel.org, Moshe Shemesh,
	linux-rdma@vger.kernel.org, Boris Pismenny, Tariq Toukan,
	akpm@linux-foundation.org, Eran Ben Elisha,
	netdev@vger.kernel.org, Ilya Lesokhin, eric.dumazet@gmail.com,
	davem@davemloft.net, arnd@arndb.de, leon@kernel.org, Kamal Heib
In-Reply-To: <20181102221552.GC17096@ziepe.ca>

On Fri, 2018-11-02 at 16:15 -0600, Jason Gunthorpe wrote:
> On Fri, Nov 02, 2018 at 10:07:02PM +0000, Saeed Mahameed wrote:
> > On Fri, 2018-11-02 at 14:39 -0700, Eric Dumazet wrote:
> > > 
> > > On 11/02/2018 02:05 PM, Saeed Mahameed wrote:
> > >  
> > > > temp will be mem copied to priv->stats.sw at the end,
> > > > memcpy(&priv->stats.sw, &s, sizeof(s)); 
> > > > 
> > > > one other way to solve this as suggested by Andrew, is to get
> > > > rid
> > > > of
> > > > the temp var and make it point directly to priv->stats.sw 
> > > > 
> > > 
> > > What about concurrency ?
> > > 
> > > This temp variable is there to make sure concurrent readers of
> > > stats
> > > might
> > > not see mangle data (because another 'reader' just did a memset()
> > > and
> > > is doing the folding)
> > > 
> > > 
> > > mlx5e_get_stats() can definitely be run at the same time by
> > > multiple
> > > threads.
> > > 
> > 
> > hmm, you are right, i was thinking that mlx5e_get_Stats will
> > trigger a
> > work to update stats and grab the state_lock, but for sw stats this
> > is
> > not the case it is done in place.
> 
> That was my guess when I saw this.. the confusing bit is why is there
> s and temp, why not just s?
> 

silly cut and paste from mlx4 !

> > BTW memcpy itself is not thread safe.
> 
> At least on 64 bit memcpy will do > 8 byte stores when copying so on
> most architectures it will cause individual new or old u64 to be
> returned and not a mess..
> 
> 32 bit will always make a mess.
> 
> If the stats don't update that often then kmalloc'ing a new buffer
> and
> RCU'ing it into view might be a reasonable alternative to this?
> 

kmalloc is not an option, stats update too often, priv->stats.sw is
already a temp, it is needed only to accumulate channels stats into it
and then report them to ethtool buffer or ndo_get_stats , both will
copy the priv->stats.sw to their own buffers, so there can only be two
use cases (two threads), maybe it is best if we maintain two priv-
>stats.sw one for each case and use it as a temp for each thread.

> Jason

^ permalink raw reply

* Re: [PATCH iproute2] Fix warning in tc-skbprio.8 manpage
From: Luca Boccassi @ 2018-11-02 15:28 UTC (permalink / raw)
  To: Edward Cree, netdev; +Cc: stephen, dsahern
In-Reply-To: <d6165e95-06bf-b835-21ae-f0a1f744bec6@solarflare.com>

[-- Attachment #1: Type: text/plain, Size: 1116 bytes --]

On Fri, 2018-11-02 at 15:19 +0000, Edward Cree wrote:
> On 02/11/18 10:57, Luca Boccassi wrote:
> > ". If" gets interpreted as a macro, so move the period to the
> > previous
> > line:
> > 
> >   33: warning: macro `If' not defined
> > 
> > Fixes: 141b55f8544e ("Add SKB Priority qdisc support in tc(8)")
> > 
> > Signed-off-by: Luca Boccassi <bluca@debian.org>
> > ---
> >  man/man8/tc-skbprio.8 | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/man/man8/tc-skbprio.8 b/man/man8/tc-skbprio.8
> > index 844bbf46..8b259839 100644
> > --- a/man/man8/tc-skbprio.8
> > +++ b/man/man8/tc-skbprio.8
> > @@ -29,8 +29,8 @@ While SKB Priority Queue is agnostic to how
> >  .B skb->priority
> >  is assigned. A typical use case is to copy
> >  the 6-bit DS field of IPv4 and IPv6 packets using
> > -.BR tc-skbedit (8)
> > -. If
> > +.BR tc-skbedit (8) .
> > +If
> 
> Won't that make the full-stop bold?
> Removing the space, i.e.
> .BR rc-skbedit (8).
> ought to work.
> -Ed

Yes you are right, fixed in v2, thanks.

-- 
Kind regards,
Luca Boccassi

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH iproute2 v2] Fix warning in tc-skbprio.8 manpage
From: Luca Boccassi @ 2018-11-02 15:27 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern, ecree
In-Reply-To: <20181102105741.25381-1-bluca@debian.org>

". If" gets interpreted as a macro, so move the period to the previous
line:

  33: warning: macro `If' not defined

Fixes: 141b55f8544e ("Add SKB Priority qdisc support in tc(8)")

Signed-off-by: Luca Boccassi <bluca@debian.org>
---
v2: remove extra space to avoid making the full-stop bold.

 man/man8/tc-skbprio.8 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-skbprio.8 b/man/man8/tc-skbprio.8
index 844bbf46..a0a316ba 100644
--- a/man/man8/tc-skbprio.8
+++ b/man/man8/tc-skbprio.8
@@ -29,8 +29,8 @@ While SKB Priority Queue is agnostic to how
 .B skb->priority
 is assigned. A typical use case is to copy
 the 6-bit DS field of IPv4 and IPv6 packets using
-.BR tc-skbedit (8)
-. If
+.BR tc-skbedit (8).
+If
 .B skb->priority
 is greater or equal to 64, the priority is assumed to be 63.
 Priorities less than 64 are taken at face value.
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH iproute2] Fix warning in tc-skbprio.8 manpage
From: Edward Cree @ 2018-11-02 15:19 UTC (permalink / raw)
  To: Luca Boccassi, netdev; +Cc: stephen, dsahern
In-Reply-To: <20181102105741.25381-1-bluca@debian.org>

On 02/11/18 10:57, Luca Boccassi wrote:
> ". If" gets interpreted as a macro, so move the period to the previous
> line:
>
>   33: warning: macro `If' not defined
>
> Fixes: 141b55f8544e ("Add SKB Priority qdisc support in tc(8)")
>
> Signed-off-by: Luca Boccassi <bluca@debian.org>
> ---
>  man/man8/tc-skbprio.8 | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/man/man8/tc-skbprio.8 b/man/man8/tc-skbprio.8
> index 844bbf46..8b259839 100644
> --- a/man/man8/tc-skbprio.8
> +++ b/man/man8/tc-skbprio.8
> @@ -29,8 +29,8 @@ While SKB Priority Queue is agnostic to how
>  .B skb->priority
>  is assigned. A typical use case is to copy
>  the 6-bit DS field of IPv4 and IPv6 packets using
> -.BR tc-skbedit (8)
> -. If
> +.BR tc-skbedit (8) .
> +If
Won't that make the full-stop bold?
Removing the space, i.e.
.BR rc-skbedit (8).
ought to work.
-Ed
>  .B skb->priority
>  is greater or equal to 64, the priority is assumed to be 63.
>  Priorities less than 64 are taken at face value.

^ permalink raw reply

* ethtool 4.19 released
From: John W. Linville @ 2018-11-02 15:14 UTC (permalink / raw)
  To: netdev

ethtool version 4.19 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-4.19.tar.xz

Release notes:

	* Feature: support combinations of FEC modes
	* Feature: better syntax for combinations of FEC modes
	* Fix: Fix uninitialized variable use at qsfp dump

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: Help with the BPF verifier
From: Arnaldo Carvalho de Melo @ 2018-11-02 15:02 UTC (permalink / raw)
  To: Edward Cree
  Cc: Yonghong Song, Daniel Borkmann, Jiri Olsa, Martin Lau,
	Alexei Starovoitov, Linux Networking Development Mailing List
In-Reply-To: <deb1d2a9-25cb-da1f-32b0-53a7b07eb4ad@solarflare.com>

Em Thu, Nov 01, 2018 at 08:05:07PM +0000, Edward Cree escreveu:
> On 01/11/18 18:52, Arnaldo Carvalho de Melo wrote:
> >  R0=inv(id=0) R1=inv6 R2=inv6 R3=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> > 15: (b7) r2 = 0
> > 16: (63) *(u32 *)(r10 -260) = r2
> > 17: (67) r1 <<= 32
> > 18: (77) r1 >>= 32
> > 19: (67) r1 <<= 3
> > 20: (bf) r2 = r6
> > 21: (0f) r2 += r1
> > 22: (79) r3 = *(u64 *)(r2 +16)
> > R2 invalid mem access 'inv'
> I wonder if you could run this with verifier log level 2?  (I'm not sure how
>  you would go about plumbing that through the perf tooling.)  It seems very
>  odd that it ends up with R2=inv, and I'm wondering whether R1 becomes unknown
>  during the shifts or whether the addition in insn 21 somehow produces the
>  unknown-ness.  (I know we used to have a thing[1] where doing ptr += K and
>  then also having an offset in the LDX produced an error about
>  ptr+const+const, but that seems to have been fixed at some point.)
> 
> Note however that even if we get past this, R1 at this point holds 6, so it
>  looks like the verifier is walking the impossible path where we're inside the
>  'if' even though filename_arg = 6.  This is a (slightly annoying) verifier
>  limitation, that it walks paths with impossible combinations of constraints
>  (we've previously had cases where assertions in the verifier would blow up
>  because of this, e.g. registers with max_val less than min_val).  So if the
>  check_ctx_access() is going to worry about whether you're off the end of the
>  array (I'm not sure what your program type is and thus which is_valid_access
>  callback is involved), then it'll complain about this.
> If filename_arg came from some external source you'd have a different
>  problem, because then it would have a totally unknown value, that on entering
>  the 'if' becomes "unknown but < 6", which is still too variable to have as
>  the offset of a ctx access.  Those have to be at a known constant offset, so
>  that we can determine the type of the returned value.
> 
> As a way to fix this, how about [UNTESTED!]:
>     const void *filename_arg = NULL;
>     /* ... */
>     switch (augmented_args.args.syscall_nr) {
>         case SYS_OPEN: filename_arg = args->args[0]; break;
>         case SYS_OPENAT: filename_arg = args->args[1]; break;
>     }
>     /* ... */
>     if (filename_arg) {
>         /* stuff */
>         blah = probe_read_str(/* ... */, filename_arg);
>     } else {
>         /* the other stuff */
>     }
> That way, you're only ever dealing in constant pointers (although judging by
>  an old thread I found[1] about ptr+const+const, the compiler might decide to
>  make some optimisations that end up looking like your existing code).

Yeah, didn't work as well:

SEC("raw_syscalls:sys_enter")
int sys_enter(struct syscall_enter_args *args)
{
	struct {
		struct syscall_enter_args args;
		struct augmented_filename filename;
	} augmented_args;
	unsigned int len = sizeof(augmented_args);
	const void *filename_arg = NULL;

	probe_read(&augmented_args.args, sizeof(augmented_args.args), args);

	switch (augmented_args.args.syscall_nr) {
	case SYS_OPEN:	 filename_arg = (const void *)args->args[0]; break;
	case SYS_OPENAT: filename_arg = (const void *)args->args[1]; break;
	}

	if (filename_arg != NULL) {
		augmented_args.filename.reserved = 0;
		augmented_args.filename.size = probe_read_str(&augmented_args.filename.value,
							      sizeof(augmented_args.filename.value),
							      filename_arg);
		if (augmented_args.filename.size < sizeof(augmented_args.filename.value)) {
			len -= sizeof(augmented_args.filename.value) - augmented_args.filename.size;
			len &= sizeof(augmented_args.filename.value) - 1;
		}
	} else {
		len = sizeof(augmented_args.args);
	}

	perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, &augmented_args, len);
	return 0;
}

And the -vv in 'perf trace' didn't seem to map to further details in the
output of the verifier debug:

# trace -vv -e tools/perf/examples/bpf/augmented_raw_syscalls.c sleep 1
bpf: builtin compilation failed: -95, try external compiler
Kernel build dir is set to /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
set env: KBUILD_DIR=/lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
unset env: KBUILD_OPTS
include option is set to  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h 
set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x41300
set env: CLANG_EXEC=/usr/local/bin/clang
unset env: CLANG_OPTIONS
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h 
set env: PERF_BPF_INC_OPTIONS=-I/home/acme/lib/perf/include/bpf
set env: WORKING_DIR=/lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
set env: CLANG_SOURCE=/home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c
llvm compiling command template: $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $PERF_BPF_INC_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf $CLANG_EMIT_LLVM -O2 -o - $LLVM_OPTIONS_PIPE
llvm compiling command : /usr/local/bin/clang -D__KERNEL__ -D__NR_CPUS__=4 -DLINUX_VERSION_CODE=0x41300  -I/home/acme/lib/perf/include/bpf  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h  -Wno-unused-value -Wno-pointer-sign -working-directory /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build -c /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c -target bpf  -O2 -o - 
libbpf: loading object 'tools/perf/examples/bpf/augmented_raw_syscalls.c' from buffer
libbpf: section(1) .strtab, size 168, link 0, flags 0, type=3
libbpf: skip section(1) .strtab
libbpf: section(2) .text, size 0, link 0, flags 6, type=1
libbpf: skip section(2) .text
libbpf: section(3) raw_syscalls:sys_enter, size 344, link 0, flags 6, type=1
libbpf: found program raw_syscalls:sys_enter
libbpf: section(4) .relraw_syscalls:sys_enter, size 16, link 10, flags 0, type=9
libbpf: section(5) raw_syscalls:sys_exit, size 16, link 0, flags 6, type=1
libbpf: found program raw_syscalls:sys_exit
libbpf: section(6) maps, size 56, link 0, flags 3, type=1
libbpf: section(7) license, size 4, link 0, flags 3, type=1
libbpf: license of tools/perf/examples/bpf/augmented_raw_syscalls.c is GPL
libbpf: section(8) version, size 4, link 0, flags 3, type=1
libbpf: kernel version of tools/perf/examples/bpf/augmented_raw_syscalls.c is 41300
libbpf: section(9) .llvm_addrsig, size 6, link 10, flags 80000000, type=1879002115
libbpf: skip section(9) .llvm_addrsig
libbpf: section(10) .symtab, size 240, link 1, flags 0, type=2
libbpf: maps in tools/perf/examples/bpf/augmented_raw_syscalls.c: 2 maps in 56 bytes
libbpf: map 0 is "__augmented_syscalls__"
libbpf: map 1 is "__bpf_stdout__"
libbpf: collecting relocating info for: 'raw_syscalls:sys_enter'
libbpf: relo for 4 value 28 name 124
libbpf: relocation: insn_idx=35
libbpf: relocation: find map 1 (__augmented_syscalls__) for insn 35
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe0000006000-fffffe0000007000
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe0000032000-fffffe0000033000
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe000005e000-fffffe000005f000
Added extra kernel map __entry_SYSCALL_64_trampoline fffffe000008a000-fffffe000008b000
bpf: config program 'raw_syscalls:sys_enter'
bpf: config program 'raw_syscalls:sys_exit'
libbpf: create map __bpf_stdout__: fd=3
libbpf: create map __augmented_syscalls__: fd=4
libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf: 
0: (bf) r6 = r1
1: (bf) r1 = r10
2: (07) r1 += -328
3: (b7) r7 = 64
4: (b7) r2 = 64
5: (bf) r3 = r6
6: (85) call bpf_probe_read#4
7: (79) r1 = *(u64 *)(r10 -320)
8: (15) if r1 == 0x101 goto pc+4
 R0=inv(id=0) R1=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
9: (55) if r1 != 0x2 goto pc+22
 R0=inv(id=0) R1=inv2 R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
10: (bf) r1 = r6
11: (07) r1 += 16
12: (05) goto pc+2
15: (79) r3 = *(u64 *)(r1 +0)
dereference of modified ctx ptr R1 off=16 disallowed

libbpf: -- END LOG --
libbpf: failed to load program 'raw_syscalls:sys_enter'
libbpf: failed to load object 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
bpf: load objects failed: err=-4007: (Kernel verifier blocks program loading)
event syntax error: 'tools/perf/examples/bpf/augmented_raw_syscalls.c'
                     \___ Kernel verifier blocks program loading

(add -v to see detail)
Run 'perf list' for a list of valid events

 Usage: perf trace [<options>] [<command>]
    or: perf trace [<options>] -- <command> [<options>]
    or: perf trace record [<options>] [<command>]
    or: perf trace record [<options>] -- <command> [<options>]

    -e, --event <event>   event/syscall selector. use 'perf list' to list available events
[root@seventh perf]# 


I'll check how to plumb that, but its a holiday down here in Brazil,
kids at home...


 
> As for what you want to do with the index coming from userspace, the verifier
>  will not like that at all, as mentioned above, so I think you'll need to do
>  something like:
>     switch (filename_arg_from_userspace) {
>         case 0: filename_arg = args->args[0]; break;
>         case 1: filename_arg = args->args[1]; break;
>         /* etc */
>         default: filename_arg = NULL;
>     }
>  thus ensuring that you only ever have ctx pointers with constant offsets.
> 
> -Ed
> 
> [1]: https://lists.iovisor.org/g/iovisor-dev/topic/21386327#1302

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Aaron Lu @ 2018-11-02 14:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, pstaszewski@itcare.pl, eric.dumazet@gmail.com,
	netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
	yoel@kviknet.dk, mgorman@techsingularity.net
In-Reply-To: <20181102124037.352b15de@redhat.com>

On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 2 Nov 2018 13:23:56 +0800
> Aaron Lu <aaron.lu@intel.com> wrote:
> 
> > On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
> > > On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:  
> > > > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > > > wrote:
> > > > ... ...  
> > > > > Section copied out:
> > > > > 
> > > > >   mlx5e_poll_tx_cq
> > > > >   |          
> > > > >    --16.34%--napi_consume_skb
> > > > >              |          
> > > > >              |--12.65%--__free_pages_ok
> > > > >              |          |          
> > > > >              |           --11.86%--free_one_page
> > > > >              |                     |          
> > > > >              |                     |--10.10%
> > > > > --queued_spin_lock_slowpath
> > > > >              |                     |          
> > > > >              |                      --0.65%--_raw_spin_lock  
> > > > 
> > > > This callchain looks like it is freeing higher order pages than order
> > > > 0:
> > > > __free_pages_ok is only called for pages whose order are bigger than
> > > > 0.  
> > > 
> > > mlx5 rx uses only order 0 pages, so i don't know where these high order
> > > tx SKBs are coming from..   
> > 
> > Perhaps here:
> > __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
> > __napi_alloc_frag() will all call page_frag_alloc(), which will use
> > __page_frag_cache_refill() to get an order 3 page if possible, or fall
> > back to an order 0 page if order 3 page is not available.
> > 
> > I'm not sure if your workload will use the above code path though.
> 
> TL;DR: this is order-0 pages (code-walk trough proof below)
> 
> To Aaron, the network stack *can* call __free_pages_ok() with order-0
> pages, via:
> 
> static void skb_free_head(struct sk_buff *skb)
> {
> 	unsigned char *head = skb->head;
> 
> 	if (skb->head_frag)
> 		skb_free_frag(head);
> 	else
> 		kfree(head);
> }
> 
> static inline void skb_free_frag(void *addr)
> {
> 	page_frag_free(addr);
> }
> 
> /*
>  * Frees a page fragment allocated out of either a compound or order 0 page.
>  */
> void page_frag_free(void *addr)
> {
> 	struct page *page = virt_to_head_page(addr);
> 
> 	if (unlikely(put_page_testzero(page)))
> 		__free_pages_ok(page, compound_order(page));
> }
> EXPORT_SYMBOL(page_frag_free);

I think here is a problem - order 0 pages are freed directly to buddy,
bypassing per-cpu-pages. This might be the reason lock contention
appeared on free path. Can someone apply below diff and see if lock
contention is gone?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c17942f..65c0ae13215a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+	if (unlikely(put_page_testzero(page))) {
+		unsigned int order = compound_order(page);
+
+		if (order == 0)
+			free_unref_page(page);
+		else
+			__free_pages_ok(page, order);
+	}
 }
 EXPORT_SYMBOL(page_frag_free);
 
> Notice for the mlx5 driver it support several RX-memory models, so it
> can be hard to follow, but from the perf report output we can see that
> is uses mlx5e_skb_from_cqe_linear, which use build_skb.
> 
> --13.63%--mlx5e_skb_from_cqe_linear
>           |          
>            --5.02%--build_skb
>                      |          
>                       --1.85%--__build_skb
>                                 |          
>                                  --1.00%--kmem_cache_alloc
> 
> /* build_skb() is wrapper over __build_skb(), that specifically
>  * takes care of skb->head and skb->pfmemalloc
>  * This means that if @frag_size is not zero, then @data must be backed
>  * by a page fragment, not kmalloc() or vmalloc()
>  */
> struct sk_buff *build_skb(void *data, unsigned int frag_size)
> {
> 	struct sk_buff *skb = __build_skb(data, frag_size);
> 
> 	if (skb && frag_size) {
> 		skb->head_frag = 1;
> 		if (page_is_pfmemalloc(virt_to_head_page(data)))
> 			skb->pfmemalloc = 1;
> 	}
> 	return skb;
> }
> EXPORT_SYMBOL(build_skb);
> 
> It still doesn't prove, that the @data is backed by by a order-0 page.
> For the mlx5 driver is uses mlx5e_page_alloc_mapped ->
> page_pool_dev_alloc_pages(), and I can see perf report using
> __page_pool_alloc_pages_slow().
> 
> The setup for page_pool in mlx5 uses order=0.
> 
> 	/* Create a page_pool and register it with rxq */
> 	pp_params.order     = 0;
> 	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
> 	pp_params.pool_size = pool_size;
> 	pp_params.nid       = cpu_to_node(c->cpu);
> 	pp_params.dev       = c->pdev;
> 	pp_params.dma_dir   = rq->buff.map_dir;
> 
> 	/* page_pool can be used even when there is no rq->xdp_prog,
> 	 * given page_pool does not handle DMA mapping there is no
> 	 * required state to clear. And page_pool gracefully handle
> 	 * elevated refcnt.
> 	 */
> 	rq->page_pool = page_pool_create(&pp_params);
> 	if (IS_ERR(rq->page_pool)) {
> 		err = PTR_ERR(rq->page_pool);
> 		rq->page_pool = NULL;
> 		goto err_free;
> 	}
> 	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
> 					 MEM_TYPE_PAGE_POOL, rq->page_pool);

Thanks for the detailed analysis, I'll need more time to understand the
whole picture :-)

^ permalink raw reply related

* Re: [RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection
From: Paolo Abeni @ 2018-11-02 13:44 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Willem de Bruijn,
	steffen.klassert, Subash Abhinov Kasiviswanathan
In-Reply-To: <CAF=yD-LbaAEF7XS8BT8i1uaZTrGpmAEYvf3Yw7DHFKUKMmGzNA@mail.gmail.com>

On Thu, 2018-11-01 at 17:01 -0400, Willem de Bruijn wrote:
> On Wed, Oct 31, 2018 at 5:57 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > @@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
> > >  void udpv6_encap_enable(void);
> > >  #endif
> > > 
> > > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > > +                                           struct sk_buff *skb)
> > > +{
> > > +     bool ipv4 = skb->protocol == htons(ETH_P_IP);
> > 
> > And this cause a compile warning when # CONFIG_IPV6 is not set, I will
> > fix in the next iteration (again thanks kbuildbot)
> 
> Can also just pass it as argument. 

Agreed. 

> This skb->protocol should work correctly
> with tunneled packets, but it wasn't as immediately obvious to me.
> 
> Also
> 
> +       if (unlikely(!segs))
> +               goto drop;
> 
> this should not happen. But if it could and the caller treats it the
> same as error (both now return NULL), then skb needs to be freed.

Right you are. Will do in the next iteration.

Thanks,

Paolo

^ permalink raw reply

* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Paul E. McKenney @ 2018-11-02 13:37 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, Trond Myklebust, mark.rutland@arm.com,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
	bfields@fieldses.org, linux-mips@linux-mips.org,
	linux@roeck-us.net, linux-nfs@vger.kernel.org,
	akpm@linux-foundation.org, will.deacon@arm.com,
	boqun.feng@gmail.com, paul.burton@mips.com,
	"anna.schumaker@netapp.com" <an
In-Reply-To: <7d1ecd21c4c249138dfdd42b9aaa1cea@AcuMS.aculab.com>

On Fri, Nov 02, 2018 at 10:56:31AM +0000, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 01 November 2018 17:02
> ...
> > And there is a push to define C++ signed arithmetic as 2s complement,
> > but there are still 1s complement systems with C compilers.  Just not
> > C++ compilers.  Legacy...
> 
> Hmmm... I've used C compilers for DSPs where signed integer arithmetic
> used the 'data registers' and would saturate, unsigned used the 'address
> registers' and wrapped.
> That was deliberate because it is much better to clip analogue values.

There are no C++ compilers for those DSPs, correct?  (Some of the
C++ standards commmittee members believe that they have fully checked,
but they might well have missed something.)

> Then there was the annoying cobol run time that didn't update the
> result variable if the result wouldn't fit.
> Took a while to notice that the sum of a list of values was even wrong!
> That would be perfectly valid for C - if unexpected.

Heh!  COBOL and FORTRAN also helped fund my first pass through university.

> > > But for us using -fno-strict-overflow which actually defines signed
> > > overflow
> 
> I wonder how much real code 'strict-overflow' gets rid of?
> IIRC gcc silently turns loops like:
> 	int i; for (i = 1; i != 0; i *= 2) ...
> into infinite ones.
> Which is never what is required.

The usual response is something like this:

	for (i = 1; i < n; i++)

where the compiler has no idea what range of values "n" might take on.
Can't say that I am convinced by that example, but at least we do have
-fno-strict-overflow.

							Thanx, Paul

^ permalink raw reply

* Re: WARNING in kmem_cache_create_usercopy
From: Dominique Martinet @ 2018-11-02 22:39 UTC (permalink / raw)
  To: syzbot
  Cc: davem, ericvh, linux-kernel, lucho, netdev, syzkaller-bugs,
	v9fs-developer
In-Reply-To: <0000000000009d4c780579b04ee4@google.com>

syzbot wrote on Fri, Nov 02, 2018:
> RIP: 0010:kmem_cache_create_usercopy+0xad/0x240 mm/slab_common.c:473
> Code: 44 89 f0 25 00 60 de 04 45 85 ed 89 45 cc 75 0b 8b 45 d0 85 c0
> 0f 85 8e 01 00 00 44 39 eb 72 0a 89 d8 44 29 e8 3b 45 d0 73 7e <0f>
> 0b c7 45 d0 00 00 00 00 4c 8b 45 10 44 89 fa 89 de 4c 89 e7 8b
> RSP: 0018:ffff8801bc23f5d0 EFLAGS: 00010213
> RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000006
> RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffffffff88b04b20
> RBP: ffff8801bc23f608 R08: fffffbfff1283a2d R09: fffffbfff1283a2c
> R10: ffff8801bc23f5c0 R11: ffffffff8941d167 R12: ffffffff88b04b20
> R13: 00000000fffffffd R14: 0000000000000000 R15: 0000000000000000
>  p9_client_create+0xa58/0x1769 net/9p/client.c:1054

No lower bound on msize, the reproducer gives a reply from the
pseudo-server with msize=8 and we happily take it, underflowing the
msize - 11 (hdr+4) argument to kmem_cache_create_usercopy...
This probably never worked anyway, but we now get a warning :)

We need to add a sane lower bound to msize as well as the current upper
bound set in the transport.
We have some header sizes in 9p.h for IO header overhead (P9_IOHDRSZ to
24 for example) but I think that's too low in practice; stuff like
readdir will require more than this to get a single entry... We can
request the server to ask for at least 4k?
9p would probably work with less (e.g. 1k; I'd rather not have to
figgure the minimum length we need to get each messages to work in its
minimal form) but honestly even with 4k the perforamnces will be
terrible, so tempted to go with that...

I'll send a patch imposing 4k next week unless someone else does first,
or replies indicate different preferences.

@Dmitry: semi-related, the C reproducer (
https://syzkaller.appspot.com/x/repro.c?x=1701f831400000 ) has a lot of
"readable" letters spelled out as "\x63..." or chars as 0x3d - it's fine
for generated code and that might be easier for the intermediate
representation syzkaller works with, but do you know something handy
that would help convert these to readable strings?
e.g. "\x63\x61\x63\x68\x65\x3d\xc0\x6d\x61\x70" could be written
"cache=\xc0map", and 0x3d as '=' (hm I guess the later would not always
make sense to convert so probably best left as is, but it gets annoying
pretty fast with longer strings)

-- 
Dominique

^ permalink raw reply

* Re: [RFC PATCH v3 01/10] udp: implement complete book-keeping for encap_needed
From: Paolo Abeni @ 2018-11-02 13:30 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Willem de Bruijn,
	steffen.klassert, Subash Abhinov Kasiviswanathan
In-Reply-To: <CAF=yD-KMU9+UvL-UNFnZth31O+NzGzdqFHNCq-QbSpZ4_qY86A@mail.gmail.com>

Hi,

On Thu, 2018-11-01 at 16:59 -0400, Willem de Bruijn wrote:
> On Tue, Oct 30, 2018 at 1:28 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > The *encap_needed static keys are enabled by UDP tunnels
> > and several UDP encapsulations type, but they are never
> > turned off. This can cause unneeded overall performance
> > degradation for systems where such features are used
> > transiently.
> > 
> > This patch introduces complete book-keeping for such keys,
> > decreasing the usage at socket destruction time, if needed,
> > and avoiding that the same socket could increase the key
> > usage multiple times.
> > 
> > rfc v2 - rfc v3:
> >  - use udp_tunnel_encap_enable() in setsockopt()
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > @@ -2447,7 +2452,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
> >                         /* FALLTHROUGH */
> >                 case UDP_ENCAP_L2TPINUDP:
> >                         up->encap_type = val;
> > -                       udp_encap_enable();
> > +                       udp_tunnel_encap_enable(sk->sk_socket);
> 
> this now also needs lock_sock?

Yep, you are right. I'll add it in the next iteration.

Thanks,

Paolo

^ permalink raw reply

* Re: [RFC PATCH v3 04/10] ip: factor out protocol delivery helper
From: Paolo Abeni @ 2018-11-02 13:30 UTC (permalink / raw)
  To: Subash Abhinov Kasiviswanathan
  Cc: netdev, David S. Miller, Willem de Bruijn, Steffen Klassert
In-Reply-To: <482631991107b8493fe7310c01d66b92@codeaurora.org>

On Thu, 2018-11-01 at 00:35 -0600, Subash Abhinov Kasiviswanathan
wrote:
> On 2018-10-30 11:24, Paolo Abeni wrote:
> > So that we can re-use it at the UDP lavel in a later patch
> > 
> 
> Hi Paolo
> 
> Minor queries -
> Should it be "level" instead of "lavel"? Similar comment for the ipv6
> patch as well.

Indeed. Will fix in the next iteration.

> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index 35a786c0aaa0..72250b4e466d 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
> >  	return false;
> >  }
> > 
> > -static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > struct sk_buff *skb)
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb,
> > int protocol)
> 
> Would it be better if this function was declared in include/net/ip.h &
> include/net/ipv6.h rather than in net/ipv4/udp.c & net/ipv6/udp.c as in
> the patch "udp: cope with UDP GRO packet misdirection"
> 
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 72593e1..3d7fdb4 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -717,4 +717,6 @@ static inline void ip_cmsg_recv(struct msghdr *msg, 
> struct sk_buff *skb)
>   int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
>                                  struct netlink_ext_ack *extack);
> 
> +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> proto);
> +
>   #endif /* _IP_H */
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 8296505..4d4d2cfe 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -1101,4 +1101,8 @@ int ipv6_sock_mc_join_ssm(struct sock *sk, int 
> ifindex,
>                            const struct in6_addr *addr, unsigned int 
> mode);
>   int ipv6_sock_mc_drop(struct sock *sk, int ifindex,
>                        const struct in6_addr *addr);
> +
> +void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> nexthdr,
> +                              bool have_final);
> +
>   #endif /* _NET_IPV6_H */
> 

Agreed, I will do in the next iteration. 

Thanks,

Paolo

^ permalink raw reply

* [PATCH iproute2] man: ss.8: break and indent long line
From: Luca Boccassi @ 2018-11-02 13:27 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern

Fixes groff warning:
  ss.8 92: warning [p 2, 2.8i]: can't break line

And makes the line also more readable.

Signed-off-by: Luca Boccassi <bluca@debian.org>
---
This line is the example of an output, so you might not want to break it.
Let me know if you prefer a different solution.

 man/man8/ss.8 | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 7a6572b1..699a1271 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -89,7 +89,13 @@ an uuid of the socket
 Show socket memory usage. The output format is:
 .RS
 .P
-skmem:(r<rmem_alloc>,rb<rcv_buf>,t<wmem_alloc>,tb<snd_buf>,f<fwd_alloc>,w<wmem_queued>,o<opt_mem>,bl<back_log>)
+skmem:(r<rmem_alloc>,rb<rcv_buf>,t<wmem_alloc>,tb<snd_buf>,f<fwd_alloc>,
+.br
+.RS
+.RS
+w<wmem_queued>,o<opt_mem>,bl<back_log>)
+.RE
+.RE
 .P
 .TP
 .B <rmem_alloc>
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH] net/mlx5e: fix high stack usage
From: Jason Gunthorpe @ 2018-11-02 22:15 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: eric.dumazet@gmail.com, davem@davemloft.net, arnd@arndb.de,
	leon@kernel.org, linux-kernel@vger.kernel.org, Moshe Shemesh,
	linux-rdma@vger.kernel.org, Boris Pismenny, Tariq Toukan,
	akpm@linux-foundation.org, Eran Ben Elisha,
	netdev@vger.kernel.org, Ilya Lesokhin, Kamal Heib
In-Reply-To: <4ebfc3aeb5335a1f671602f9a906f948c39a30da.camel@mellanox.com>

On Fri, Nov 02, 2018 at 10:07:02PM +0000, Saeed Mahameed wrote:
> On Fri, 2018-11-02 at 14:39 -0700, Eric Dumazet wrote:
> > 
> > On 11/02/2018 02:05 PM, Saeed Mahameed wrote:
> >  
> > > temp will be mem copied to priv->stats.sw at the end,
> > > memcpy(&priv->stats.sw, &s, sizeof(s)); 
> > > 
> > > one other way to solve this as suggested by Andrew, is to get rid
> > > of
> > > the temp var and make it point directly to priv->stats.sw 
> > > 
> > 
> > What about concurrency ?
> > 
> > This temp variable is there to make sure concurrent readers of stats
> > might
> > not see mangle data (because another 'reader' just did a memset() and
> > is doing the folding)
> > 
> > 
> > mlx5e_get_stats() can definitely be run at the same time by multiple
> > threads.
> > 
> 
> hmm, you are right, i was thinking that mlx5e_get_Stats will trigger a
> work to update stats and grab the state_lock, but for sw stats this is
> not the case it is done in place.

That was my guess when I saw this.. the confusing bit is why is there
s and temp, why not just s?

> BTW memcpy itself is not thread safe.

At least on 64 bit memcpy will do > 8 byte stores when copying so on
most architectures it will cause individual new or old u64 to be
returned and not a mess..

32 bit will always make a mess.

If the stats don't update that often then kmalloc'ing a new buffer and
RCU'ing it into view might be a reasonable alternative to this?

Jason

^ permalink raw reply

* Re: [PATCH] net/mlx5e: fix high stack usage
From: Arnd Bergmann @ 2018-11-02 22:15 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: eric.dumazet@gmail.com, davem@davemloft.net, leon@kernel.org,
	linux-kernel@vger.kernel.org, Moshe Shemesh,
	linux-rdma@vger.kernel.org, Boris Pismenny, Tariq Toukan,
	akpm@linux-foundation.org, Eran Ben Elisha,
	netdev@vger.kernel.org, Ilya Lesokhin, Kamal Heib
In-Reply-To: <4ebfc3aeb5335a1f671602f9a906f948c39a30da.camel@mellanox.com>

On 11/2/18, Saeed Mahameed <saeedm@mellanox.com> wrote:
> On Fri, 2018-11-02 at 14:39 -0700, Eric Dumazet wrote:
>>
>> On 11/02/2018 02:05 PM, Saeed Mahameed wrote:
>>
>> > temp will be mem copied to priv->stats.sw at the end,
>> > memcpy(&priv->stats.sw, &s, sizeof(s));
>> >
>> > one other way to solve this as suggested by Andrew, is to get rid
>> > of
>> > the temp var and make it point directly to priv->stats.sw
>> >
>>
>> What about concurrency ?
>>
>> This temp variable is there to make sure concurrent readers of stats
>> might
>> not see mangle data (because another 'reader' just did a memset() and
>> is doing the folding)
>>
>>
>> mlx5e_get_stats() can definitely be run at the same time by multiple
>> threads.
>>
>
> hmm, you are right, i was thinking that mlx5e_get_Stats will trigger a
> work to update stats and grab the state_lock, but for sw stats this is
> not the case it is done in place.
>
> BTW memcpy itself is not thread safe.

Before commit 6c63efe4cfab ("net/mlx5e: Remove redundant active_channels
indication"), there was a read_lock() in the function apparently intended to
made it thread safe. This got removed with the comment

commit 6c63efe4cfabf230a8ed4b1d880249875ffdac13
Author: Eran Ben Elisha <eranbe@mellanox.com>
Date:   Tue May 29 11:06:31 2018 +0300

    net/mlx5e: Remove redundant active_channels indication

    Now, when all channels stats are saved regardless of the channel's state
    {open, closed}, we can safely remove this indication and the stats spin
    lock which protects it.

    Fixes: 76c3810bade3 ("net/mlx5e: Avoid reset netdev stats on
configuration changes")

I don't really understand the reasoning, but maybe we can remove
the memcpy() if the code is thread safe, or we need the lock back if it's not.

         Arnd

^ permalink raw reply

* Re: [PATCH] net/mlx5e: fix high stack usage
From: Saeed Mahameed @ 2018-11-02 22:07 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, davem@davemloft.net, arnd@arndb.de,
	leon@kernel.org
  Cc: linux-kernel@vger.kernel.org, Moshe Shemesh,
	linux-rdma@vger.kernel.org, Boris Pismenny, Tariq Toukan,
	akpm@linux-foundation.org, Eran Ben Elisha,
	netdev@vger.kernel.org, Ilya Lesokhin, Kamal Heib
In-Reply-To: <2727f37b-1742-5532-317e-3be8a984266b@gmail.com>

On Fri, 2018-11-02 at 14:39 -0700, Eric Dumazet wrote:
> 
> On 11/02/2018 02:05 PM, Saeed Mahameed wrote:
>  
> > temp will be mem copied to priv->stats.sw at the end,
> > memcpy(&priv->stats.sw, &s, sizeof(s)); 
> > 
> > one other way to solve this as suggested by Andrew, is to get rid
> > of
> > the temp var and make it point directly to priv->stats.sw 
> > 
> 
> What about concurrency ?
> 
> This temp variable is there to make sure concurrent readers of stats
> might
> not see mangle data (because another 'reader' just did a memset() and
> is doing the folding)
> 
> 
> mlx5e_get_stats() can definitely be run at the same time by multiple
> threads.
> 

hmm, you are right, i was thinking that mlx5e_get_Stats will trigger a
work to update stats and grab the state_lock, but for sw stats this is
not the case it is done in place.

BTW memcpy itself is not thread safe.

^ permalink raw reply

* [PATCH iproute2 2/2] Pass CPPFLAGS to the compiler
From: Luca Boccassi @ 2018-11-02 12:35 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern
In-Reply-To: <20181102123544.13000-1-bluca@debian.org>

When building Debian packages pre-processor flags are passed via
CPPFLAGS, as the convention indicates. Specifically, the hardening
-D_FORTIFY_SOURCE=2 flag is used.
Pass CPPFLAGS to all calls of QUIET_CC together with CFLAGS.

Signed-off-by: Luca Boccassi <bluca@debian.org>
---
 configure     | 2 +-
 misc/Makefile | 8 ++++----
 tc/Makefile   | 8 ++++----
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/configure b/configure
index c5655978..5df6082b 100755
--- a/configure
+++ b/configure
@@ -436,4 +436,4 @@ check_cap
 
 echo >> $CONFIG
 echo "%.o: %.c" >> $CONFIG
-echo '	$(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) -c -o $@ $<' >> $CONFIG
+echo '	$(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(CPPFLAGS) -c -o $@ $<' >> $CONFIG
diff --git a/misc/Makefile b/misc/Makefile
index b2dd6b26..6a849af4 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -16,16 +16,16 @@ ss: $(SSOBJ)
 	$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 nstat: nstat.c
-	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LDLIBS) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o nstat nstat.c $(LDLIBS) -lm
 
 ifstat: ifstat.c
-	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LDLIBS) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LDLIBS) -lm
 
 rtacct: rtacct.c
-	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LDLIBS) -lm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LDLIBS) -lm
 
 arpd: arpd.c
-	$(QUIET_CC)$(CC) $(CFLAGS) -I$(DBM_INCLUDE) $(LDFLAGS) -o arpd arpd.c $(LDLIBS) -ldb
+	$(QUIET_CC)$(CC) $(CFLAGS) -I$(DBM_INCLUDE) $(CPPFLAGS) $(LDFLAGS) -o arpd arpd.c $(LDLIBS) -ldb
 
 ssfilter.c: ssfilter.y
 	$(QUIET_YACC)bison ssfilter.y -o ssfilter.c
diff --git a/tc/Makefile b/tc/Makefile
index 25a28284..f8010d3c 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -128,7 +128,7 @@ CFLAGS += -DYY_NO_INPUT
 MODDESTDIR := $(DESTDIR)$(LIBDIR)/tc
 
 %.so: %.c
-	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic $< -o $@
+	$(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic $< -o $@
 
 
 all: tc $(TCSO)
@@ -158,13 +158,13 @@ clean:
 	rm -f emp_ematch.yacc.*
 
 q_atm.so: q_atm.c
-	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic -o q_atm.so q_atm.c -latm
+	$(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic -o q_atm.so q_atm.c -latm
 
 m_xt.so: m_xt.c
-	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic -o m_xt.so m_xt.c $$($(PKG_CONFIG) xtables --cflags --libs)
+	$(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic -o m_xt.so m_xt.c $$($(PKG_CONFIG) xtables --cflags --libs)
 
 m_xt_old.so: m_xt_old.c
-	$(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic -o m_xt_old.so m_xt_old.c $$($(PKG_CONFIG) xtables --cflags --libs)
+	$(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic -o m_xt_old.so m_xt_old.c $$($(PKG_CONFIG) xtables --cflags --libs)
 
 em_ipset.o: CFLAGS += $$($(PKG_CONFIG) xtables --cflags)
 
-- 
2.19.1

^ permalink raw reply related

* [PATCH iproute2 1/2] testsuite: build generate_nlmsg with QUIET_CC
From: Luca Boccassi @ 2018-11-02 12:35 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern

Follow the standard pattern, and respect user's verbosity setting.

Signed-off-by: Luca Boccassi <bluca@debian.org>
---
 testsuite/tools/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/testsuite/tools/Makefile b/testsuite/tools/Makefile
index e1d9bfef..85df69ec 100644
--- a/testsuite/tools/Makefile
+++ b/testsuite/tools/Makefile
@@ -3,7 +3,7 @@ CFLAGS=
 include ../../config.mk
 
 generate_nlmsg: generate_nlmsg.c ../../lib/libnetlink.c
-	$(CC) $(CPPFLAGS) $(CFLAGS) $(EXTRA_CFLAGS) -I../../include -include../../include/uapi/linux/netlink.h -o $@ $^ -lmnl
+	$(QUIET_CC) $(CPPFLAGS) $(CFLAGS) $(EXTRA_CFLAGS) -I../../include -include../../include/uapi/linux/netlink.h -o $@ $^ -lmnl
 
 clean:
 	rm -f generate_nlmsg
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH] net/mlx5e: fix high stack usage
From: Eric Dumazet @ 2018-11-02 21:39 UTC (permalink / raw)
  To: Saeed Mahameed, davem@davemloft.net, arnd@arndb.de,
	leon@kernel.org
  Cc: linux-kernel@vger.kernel.org, Moshe Shemesh,
	linux-rdma@vger.kernel.org, Boris Pismenny, Tariq Toukan,
	akpm@linux-foundation.org, Eran Ben Elisha,
	netdev@vger.kernel.org, Ilya Lesokhin, Kamal Heib
In-Reply-To: <9f214f12ec89020ceb14c1aec25b3a0d968507aa.camel@mellanox.com>



On 11/02/2018 02:05 PM, Saeed Mahameed wrote:
 
> temp will be mem copied to priv->stats.sw at the end,
> memcpy(&priv->stats.sw, &s, sizeof(s)); 
> 
> one other way to solve this as suggested by Andrew, is to get rid of
> the temp var and make it point directly to priv->stats.sw 
> 

What about concurrency ?

This temp variable is there to make sure concurrent readers of stats might
not see mangle data (because another 'reader' just did a memset() and is doing the folding)


mlx5e_get_stats() can definitely be run at the same time by multiple threads.

^ permalink raw reply

* Re: [PATCH] net/mlx5e: fix high stack usage
From: Saeed Mahameed @ 2018-11-02 21:05 UTC (permalink / raw)
  To: davem@davemloft.net, arnd@arndb.de, leon@kernel.org
  Cc: linux-kernel@vger.kernel.org, Moshe Shemesh,
	linux-rdma@vger.kernel.org, Boris Pismenny, Tariq Toukan,
	akpm@linux-foundation.org, Eran Ben Elisha,
	netdev@vger.kernel.org, Ilya Lesokhin, Kamal Heib
In-Reply-To: <20181102153316.1492515-1-arnd@arndb.de>

On Fri, 2018-11-02 at 16:33 +0100, Arnd Bergmann wrote:
> A patch that looks harmless causes the stack usage of the
> mlx5e_grp_sw_update_stats()
> function to drastically increase with x86 gcc-4.9 and higher (tested
> up to 8.1):
> 
> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c: In function
> ‘mlx5e_grp_sw_update_stats’:
> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:216:1: warning:
> the frame size of 1276 bytes is larger than 500 bytes [-Wframe-
> larger-than=]
> 
> By splitting out the loop body into a non-inlined function, the stack
> size goes
> back down to under 500 bytes.
> 
> Fixes: 779d986d60de ("net/mlx5e: Do not ignore netdevice TX/RX queues
> number")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  .../ethernet/mellanox/mlx5/core/en_stats.c    | 168 +++++++++-------
> --
>  1 file changed, 86 insertions(+), 82 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
> b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
> index 1e55b9c27ffc..c270206f3475 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
> @@ -126,93 +126,97 @@ static int mlx5e_grp_sw_fill_stats(struct
> mlx5e_priv *priv, u64 *data, int idx)
>  	return idx;
>  }
>  
> +static noinline_for_stack void
> +mlx5e_grp_sw_collect_stat(struct mlx5e_priv *priv, struct
> mlx5e_sw_stats *s, int i)
> +{
> +	struct mlx5e_channel_stats *channel_stats = &priv-
> >channel_stats[i];
> +	struct mlx5e_xdpsq_stats *xdpsq_red_stats = &channel_stats-
> >xdpsq;
> +	struct mlx5e_xdpsq_stats *xdpsq_stats = &channel_stats-
> >rq_xdpsq;
> +	struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
> +	struct mlx5e_ch_stats *ch_stats = &channel_stats->ch;
> +	int j;
> +
> +	s->rx_packets	+= rq_stats->packets;
> +	s->rx_bytes	+= rq_stats->bytes;
> +	s->rx_lro_packets += rq_stats->lro_packets;
> +	s->rx_lro_bytes	+= rq_stats->lro_bytes;
> +	s->rx_ecn_mark	+= rq_stats->ecn_mark;
> +	s->rx_removed_vlan_packets += rq_stats->removed_vlan_packets;
> +	s->rx_csum_none	+= rq_stats->csum_none;
> +	s->rx_csum_complete += rq_stats->csum_complete;
> +	s->rx_csum_unnecessary += rq_stats->csum_unnecessary;
> +	s->rx_csum_unnecessary_inner += rq_stats-
> >csum_unnecessary_inner;
> +	s->rx_xdp_drop     += rq_stats->xdp_drop;
> +	s->rx_xdp_redirect += rq_stats->xdp_redirect;
> +	s->rx_xdp_tx_xmit  += xdpsq_stats->xmit;
> +	s->rx_xdp_tx_full  += xdpsq_stats->full;
> +	s->rx_xdp_tx_err   += xdpsq_stats->err;
> +	s->rx_xdp_tx_cqe   += xdpsq_stats->cqes;
> +	s->rx_wqe_err   += rq_stats->wqe_err;
> +	s->rx_mpwqe_filler_cqes    += rq_stats->mpwqe_filler_cqes;
> +	s->rx_mpwqe_filler_strides += rq_stats->mpwqe_filler_strides;
> +	s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
> +	s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
> +	s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
> +	s->rx_page_reuse  += rq_stats->page_reuse;
> +	s->rx_cache_reuse += rq_stats->cache_reuse;
> +	s->rx_cache_full  += rq_stats->cache_full;
> +	s->rx_cache_empty += rq_stats->cache_empty;
> +	s->rx_cache_busy  += rq_stats->cache_busy;
> +	s->rx_cache_waive += rq_stats->cache_waive;
> +	s->rx_congst_umr  += rq_stats->congst_umr;
> +	s->rx_arfs_err    += rq_stats->arfs_err;
> +	s->ch_events      += ch_stats->events;
> +	s->ch_poll        += ch_stats->poll;
> +	s->ch_arm         += ch_stats->arm;
> +	s->ch_aff_change  += ch_stats->aff_change;
> +	s->ch_eq_rearm    += ch_stats->eq_rearm;
> +	/* xdp redirect */
> +	s->tx_xdp_xmit    += xdpsq_red_stats->xmit;
> +	s->tx_xdp_full    += xdpsq_red_stats->full;
> +	s->tx_xdp_err     += xdpsq_red_stats->err;
> +	s->tx_xdp_cqes    += xdpsq_red_stats->cqes;
> +
> +	for (j = 0; j < priv->max_opened_tc; j++) {
> +		struct mlx5e_sq_stats *sq_stats = &channel_stats-
> >sq[j];
> +
> +		s->tx_packets		+= sq_stats->packets;
> +		s->tx_bytes		+= sq_stats->bytes;
> +		s->tx_tso_packets	+= sq_stats->tso_packets;
> +		s->tx_tso_bytes		+= sq_stats->tso_bytes;
> +		s->tx_tso_inner_packets	+= sq_stats-
> >tso_inner_packets;
> +		s->tx_tso_inner_bytes	+= sq_stats->tso_inner_bytes;
> +		s->tx_added_vlan_packets += sq_stats-
> >added_vlan_packets;
> +		s->tx_nop               += sq_stats->nop;
> +		s->tx_queue_stopped	+= sq_stats->stopped;
> +		s->tx_queue_wake	+= sq_stats->wake;
> +		s->tx_udp_seg_rem	+= sq_stats->udp_seg_rem;
> +		s->tx_queue_dropped	+= sq_stats->dropped;
> +		s->tx_cqe_err		+= sq_stats->cqe_err;
> +		s->tx_recover		+= sq_stats->recover;
> +		s->tx_xmit_more		+= sq_stats->xmit_more;
> +		s->tx_csum_partial_inner += sq_stats-
> >csum_partial_inner;
> +		s->tx_csum_none		+= sq_stats->csum_none;
> +		s->tx_csum_partial	+= sq_stats->csum_partial;
> +#ifdef CONFIG_MLX5_EN_TLS
> +		s->tx_tls_ooo		+= sq_stats->tls_ooo;
> +		s->tx_tls_resync_bytes	+= sq_stats-
> >tls_resync_bytes;
> +#endif
> +		s->tx_cqes		+= sq_stats->cqes;
> +	}
> +}
> +
>  void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
>  {
> -	struct mlx5e_sw_stats temp, *s = &temp;

temp will be mem copied to priv->stats.sw at the end,
memcpy(&priv->stats.sw, &s, sizeof(s)); 

one other way to solve this as suggested by Andrew, is to get rid of
the temp var and make it point directly to priv->stats.sw 


> +	struct mlx5e_sw_stats s;
>  	int i;
>  
> -	memset(s, 0, sizeof(*s));
> -
> -	for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev);
> i++) {
> -		struct mlx5e_channel_stats *channel_stats =
> -			&priv->channel_stats[i];
> -		struct mlx5e_xdpsq_stats *xdpsq_red_stats =
> &channel_stats->xdpsq;
> -		struct mlx5e_xdpsq_stats *xdpsq_stats = &channel_stats-
> >rq_xdpsq;
> -		struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
> -		struct mlx5e_ch_stats *ch_stats = &channel_stats->ch;
> -		int j;
> -
> -		s->rx_packets	+= rq_stats->packets;
> -		s->rx_bytes	+= rq_stats->bytes;
> -		s->rx_lro_packets += rq_stats->lro_packets;
> -		s->rx_lro_bytes	+= rq_stats->lro_bytes;
> -		s->rx_ecn_mark	+= rq_stats->ecn_mark;
> -		s->rx_removed_vlan_packets += rq_stats-
> >removed_vlan_packets;
> -		s->rx_csum_none	+= rq_stats->csum_none;
> -		s->rx_csum_complete += rq_stats->csum_complete;
> -		s->rx_csum_unnecessary += rq_stats->csum_unnecessary;
> -		s->rx_csum_unnecessary_inner += rq_stats-
> >csum_unnecessary_inner;
> -		s->rx_xdp_drop     += rq_stats->xdp_drop;
> -		s->rx_xdp_redirect += rq_stats->xdp_redirect;
> -		s->rx_xdp_tx_xmit  += xdpsq_stats->xmit;
> -		s->rx_xdp_tx_full  += xdpsq_stats->full;
> -		s->rx_xdp_tx_err   += xdpsq_stats->err;
> -		s->rx_xdp_tx_cqe   += xdpsq_stats->cqes;
> -		s->rx_wqe_err   += rq_stats->wqe_err;
> -		s->rx_mpwqe_filler_cqes    += rq_stats-
> >mpwqe_filler_cqes;
> -		s->rx_mpwqe_filler_strides += rq_stats-
> >mpwqe_filler_strides;
> -		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
> -		s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
> -		s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
> -		s->rx_page_reuse  += rq_stats->page_reuse;
> -		s->rx_cache_reuse += rq_stats->cache_reuse;
> -		s->rx_cache_full  += rq_stats->cache_full;
> -		s->rx_cache_empty += rq_stats->cache_empty;
> -		s->rx_cache_busy  += rq_stats->cache_busy;
> -		s->rx_cache_waive += rq_stats->cache_waive;
> -		s->rx_congst_umr  += rq_stats->congst_umr;
> -		s->rx_arfs_err    += rq_stats->arfs_err;
> -		s->ch_events      += ch_stats->events;
> -		s->ch_poll        += ch_stats->poll;
> -		s->ch_arm         += ch_stats->arm;
> -		s->ch_aff_change  += ch_stats->aff_change;
> -		s->ch_eq_rearm    += ch_stats->eq_rearm;
> -		/* xdp redirect */
> -		s->tx_xdp_xmit    += xdpsq_red_stats->xmit;
> -		s->tx_xdp_full    += xdpsq_red_stats->full;
> -		s->tx_xdp_err     += xdpsq_red_stats->err;
> -		s->tx_xdp_cqes    += xdpsq_red_stats->cqes;
> -
> -		for (j = 0; j < priv->max_opened_tc; j++) {
> -			struct mlx5e_sq_stats *sq_stats =
> &channel_stats->sq[j];
> -
> -			s->tx_packets		+= sq_stats->packets;
> -			s->tx_bytes		+= sq_stats->bytes;
> -			s->tx_tso_packets	+= sq_stats->tso_packets;
> -			s->tx_tso_bytes		+= sq_stats-
> >tso_bytes;
> -			s->tx_tso_inner_packets	+= sq_stats-
> >tso_inner_packets;
> -			s->tx_tso_inner_bytes	+= sq_stats-
> >tso_inner_bytes;
> -			s->tx_added_vlan_packets += sq_stats-
> >added_vlan_packets;
> -			s->tx_nop               += sq_stats->nop;
> -			s->tx_queue_stopped	+= sq_stats->stopped;
> -			s->tx_queue_wake	+= sq_stats->wake;
> -			s->tx_udp_seg_rem	+= sq_stats->udp_seg_rem;
> -			s->tx_queue_dropped	+= sq_stats->dropped;
> -			s->tx_cqe_err		+= sq_stats->cqe_err;
> -			s->tx_recover		+= sq_stats->recover;
> -			s->tx_xmit_more		+= sq_stats-
> >xmit_more;
> -			s->tx_csum_partial_inner += sq_stats-
> >csum_partial_inner;
> -			s->tx_csum_none		+= sq_stats-
> >csum_none;
> -			s->tx_csum_partial	+= sq_stats-
> >csum_partial;
> -#ifdef CONFIG_MLX5_EN_TLS
> -			s->tx_tls_ooo		+= sq_stats->tls_ooo;
> -			s->tx_tls_resync_bytes	+= sq_stats-
> >tls_resync_bytes;
> -#endif
> -			s->tx_cqes		+= sq_stats->cqes;
> -		}
> -	}
> +	memset(&s, 0, sizeof(s));
> +
> +	for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev);
> i++)
> +		mlx5e_grp_sw_collect_stat(priv, &s, i);
>  
> -	memcpy(&priv->stats.sw, s, sizeof(*s));
> +	memcpy(&priv->stats.sw, &s, sizeof(s));
>  }
>  
>  static const struct counter_desc q_stats_desc[] = {

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Jesper Dangaard Brouer @ 2018-11-02 11:40 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Saeed Mahameed, pstaszewski@itcare.pl, eric.dumazet@gmail.com,
	netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
	yoel@kviknet.dk, mgorman@techsingularity.net, brouer
In-Reply-To: <20181102052356.GA17587@intel.com>

On Fri, 2 Nov 2018 13:23:56 +0800
Aaron Lu <aaron.lu@intel.com> wrote:

> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
> > On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:  
> > > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > > wrote:
> > > ... ...  
> > > > Section copied out:
> > > > 
> > > >   mlx5e_poll_tx_cq
> > > >   |          
> > > >    --16.34%--napi_consume_skb
> > > >              |          
> > > >              |--12.65%--__free_pages_ok
> > > >              |          |          
> > > >              |           --11.86%--free_one_page
> > > >              |                     |          
> > > >              |                     |--10.10%
> > > > --queued_spin_lock_slowpath
> > > >              |                     |          
> > > >              |                      --0.65%--_raw_spin_lock  
> > > 
> > > This callchain looks like it is freeing higher order pages than order
> > > 0:
> > > __free_pages_ok is only called for pages whose order are bigger than
> > > 0.  
> > 
> > mlx5 rx uses only order 0 pages, so i don't know where these high order
> > tx SKBs are coming from..   
> 
> Perhaps here:
> __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
> __napi_alloc_frag() will all call page_frag_alloc(), which will use
> __page_frag_cache_refill() to get an order 3 page if possible, or fall
> back to an order 0 page if order 3 page is not available.
> 
> I'm not sure if your workload will use the above code path though.

TL;DR: this is order-0 pages (code-walk trough proof below)

To Aaron, the network stack *can* call __free_pages_ok() with order-0
pages, via:

static void skb_free_head(struct sk_buff *skb)
{
	unsigned char *head = skb->head;

	if (skb->head_frag)
		skb_free_frag(head);
	else
		kfree(head);
}

static inline void skb_free_frag(void *addr)
{
	page_frag_free(addr);
}

/*
 * Frees a page fragment allocated out of either a compound or order 0 page.
 */
void page_frag_free(void *addr)
{
	struct page *page = virt_to_head_page(addr);

	if (unlikely(put_page_testzero(page)))
		__free_pages_ok(page, compound_order(page));
}
EXPORT_SYMBOL(page_frag_free);


Notice for the mlx5 driver it support several RX-memory models, so it
can be hard to follow, but from the perf report output we can see that
is uses mlx5e_skb_from_cqe_linear, which use build_skb.

--13.63%--mlx5e_skb_from_cqe_linear
          |          
           --5.02%--build_skb
                     |          
                      --1.85%--__build_skb
                                |          
                                 --1.00%--kmem_cache_alloc

/* build_skb() is wrapper over __build_skb(), that specifically
 * takes care of skb->head and skb->pfmemalloc
 * This means that if @frag_size is not zero, then @data must be backed
 * by a page fragment, not kmalloc() or vmalloc()
 */
struct sk_buff *build_skb(void *data, unsigned int frag_size)
{
	struct sk_buff *skb = __build_skb(data, frag_size);

	if (skb && frag_size) {
		skb->head_frag = 1;
		if (page_is_pfmemalloc(virt_to_head_page(data)))
			skb->pfmemalloc = 1;
	}
	return skb;
}
EXPORT_SYMBOL(build_skb);

It still doesn't prove, that the @data is backed by by a order-0 page.
For the mlx5 driver is uses mlx5e_page_alloc_mapped ->
page_pool_dev_alloc_pages(), and I can see perf report using
__page_pool_alloc_pages_slow().

The setup for page_pool in mlx5 uses order=0.

	/* Create a page_pool and register it with rxq */
	pp_params.order     = 0;
	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
	pp_params.pool_size = pool_size;
	pp_params.nid       = cpu_to_node(c->cpu);
	pp_params.dev       = c->pdev;
	pp_params.dma_dir   = rq->buff.map_dir;

	/* page_pool can be used even when there is no rq->xdp_prog,
	 * given page_pool does not handle DMA mapping there is no
	 * required state to clear. And page_pool gracefully handle
	 * elevated refcnt.
	 */
	rq->page_pool = page_pool_create(&pp_params);
	if (IS_ERR(rq->page_pool)) {
		err = PTR_ERR(rq->page_pool);
		rq->page_pool = NULL;
		goto err_free;
	}
	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
					 MEM_TYPE_PAGE_POOL, rq->page_pool);

 
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* RE: [PATCH] arm64: dts: stratix10: fix multicast filtering
From: Koskinen, Aaro (Nokia - FI/Espoo) @ 2018-11-02 11:04 UTC (permalink / raw)
  To: Dinh Nguyen, Thor Thayer, Rob Herring, Mark Rutland
  Cc: devicetree@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20181102105355.18654-1-aaro.koskinen@nokia.com>

Hi,

Please don't apply this - looks like the company SMTP server mangled the From
header. I'll try to send v2 with a workaround.

A.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox