Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace
From: Jonathan Corbet @ 2019-07-10 20:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, rostedt,
	mhiramat, acme, ast, Peter Zijlstra, Chris Mason, brendan.d.gregg,
	davem
In-Reply-To: <c7f15d1d-1696-4d95-1729-4c4e97bdc43e@iogearbox.net>

On Wed, 10 Jul 2019 21:32:25 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Looks like you missed Brendan Gregg's prior feedback from v1 [0]. I haven't
> seen a strong compelling argument for why this needs to reside in the kernel
> tree given we also have all the other tracing tools and many of which also
> rely on BPF such as bcc, bpftrace, ply, systemtap, sysdig, lttng to just name
> a few.

So I'm just watching from the sidelines here, but I do feel the need to
point out that Kris appears to be trying to follow the previous feedback
he got from Alexei, where creating tools/dtrace is exactly what he was
told to do:

  https://lwn.net/ml/netdev/20190521175617.ipry6ue7o24a2e6n@ast-mbp.dhcp.thefacebook.com/

Now he's being told the exact opposite.  Not the best experience for
somebody who is trying to make the kernel better.

There are still people interested in DTrace out there.  How would you
recommend that Kris proceed at this point?

Thanks,

jon

^ permalink raw reply

* Re: [GIT PULL] Keys: Set 4 - Key ACLs for 5.3
From: Eric Biggers @ 2019-07-10 20:15 UTC (permalink / raw)
  To: Linus Torvalds, David Howells, James Morris, keyrings, Netdev,
	linux-nfs, CIFS, linux-afs, linux-fsdevel, linux-integrity,
	LSM List, Linux List Kernel Mailing
In-Reply-To: <20190710194620.GA83443@gmail.com>

On Wed, Jul 10, 2019 at 12:46:22PM -0700, Eric Biggers wrote:
> On Wed, Jul 10, 2019 at 11:35:07AM -0700, Linus Torvalds wrote:
> > On Fri, Jul 5, 2019 at 2:30 PM David Howells <dhowells@redhat.com> wrote:
> > >
> > > Here's my fourth block of keyrings changes for the next merge window.  They
> > > change the permissions model used by keys and keyrings to be based on an
> > > internal ACL by the following means:
> > 
> > It turns out that this is broken, and I'll probably have to revert the
> > merge entirely.
> > 
> > With this merge in place, I can't boot any of the machines that have
> > an encrypted disk setup. The boot just stops at
> > 
> >   systemd[1]: Started Forward Password Requests to Plymouth Directory Watch.
> >   systemd[1]: Reached target Paths.
> > 
> > and never gets any further. I never get the prompt for a passphrase
> > for the disk encryption.
> > 
> > Apparently not a lot of developers are using encrypted volumes for
> > their development machines.
> > 
> > I'm not sure if the only requirement is an encrypted volume, or if
> > this is also particular to a F30 install in case you need to be able
> > to reproduce. But considering that you have a redhat email address,
> > I'm sure you can find a F30 install somewhere with an encrypted disk.
> > 
> > David, if you can fix this quickly, I'll hold off on the revert of it
> > all, but I can wait only so long. I've stopped merging stuff since I
> > noticed my machines don't work (this merge window has not been
> > pleasant so far - in addition to this issue I had another entirely
> > unrelated boot failure which made bisecting this one even more fun).
> > 
> > So if I don't see a quick fix, I'll just revert in order to then
> > continue to do pull requests later today. Because I do not want to do
> > further pulls with something that I can't boot as a base.
> > 
> >                  Linus
> 
> This also broke 'keyctl new_session' and hence all the fscrypt tests
> (https://lkml.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/), and it
> also broke loading in-kernel X.509 certificates
> (https://lore.kernel.org/lkml/27671.1562384658@turing-police/T/#u).
> 
> I'm *guessing* these are all some underlying issue where keyrings aren't being
> given all the needed permissions anymore.
> 
> But just FYI, David had said he's on vacation with no laptop or email access for
> 2 weeks starting from Sunday (3 days ago).  So I don't think you can expect a
> quick fix from him.
> 
> I was planning to look into this to fix the fscrypt tests, but it might be a few
> days before I get to it.  And while I'm *guessing* it will be a simple fix, it
> might not be.  So I can't speak for David, but personally I'm fine with the
> commits being reverted for now.
> 
> I'm also unhappy that the new keyctl KEYCTL_GRANT_PERMISSION doesn't have any
> documentation or tests.  (Which seems to be a common problem with David's
> work...  None of the new mount syscalls in v5.2 have any tests, for example, and
> the man pages are still work-in-progress and last sent out for review a year
> ago, despite API changes that occurred before the syscalls were merged.)
> 

Also worth noting that the key ACL patches were only in linux-next for 9 days
before the pull request was sent.  The X.509 certificate loading bug (which
might be the same underlying bug) was reported on July 6 by someone testing
linux-next, but the pull request had already been sent on July 5.  I suspect
these bug(s) would have been fixed if they had been in linux-next for longer.

- Eric

^ permalink raw reply

* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: Jakub Kicinski @ 2019-07-10 20:04 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190710123417.2157a459@cakuba.netronome.com>

On Wed, 10 Jul 2019 12:34:17 -0700, Jakub Kicinski wrote:
> > > > +		if (sk->sk_prot->unhash)
> > > > +			sk->sk_prot->unhash(sk);
> > > > +	}
> > > > +
> > > > +	ctx = tls_get_ctx(sk);
> > > > +	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> > > > +		tls_sk_proto_cleanup(sk, ctx, timeo);

Do we still need to hook into unhash? With patch 6 in place perhaps we
can just do disconnect 🥺

cleanup is going to kick off TX but also:

	if (unlikely(sk->sk_write_pending) &&
	    !wait_on_pending_writer(sk, &timeo))
		tls_handle_open_record(sk, 0);

Are we guaranteed that sk_write_pending is 0?  Otherwise
wait_on_pending_writer is hiding yet another release_sock() :(

> > > > +	icsk->icsk_ulp_data = NULL;    
> > > 
> > > I think close only starts checking if ctx is NULL in patch 6.
> > > Looks like some chunks of ctx checking/clearing got spread to
> > > patch 1 and some to patch 6.    
> > 
> > Yeah, I thought the patches were easier to read this way but
> > maybe not. Could add something in the commit log.  
> 
> Ack! Let me try to get a full grip of patches 2 and 6 and come back 
> to this.
> 
> > > > +	tls_ctx_free_wq(ctx);
> > > > +
> > > > +	if (ctx->unhash)
> > > > +		ctx->unhash(sk);
> > > > +}
> > > > +
> > > >  static void tls_sk_proto_close(struct sock *sk, long timeout)
> > > >  {
> > > >  	struct tls_context *ctx = tls_get_ctx(sk);    


^ permalink raw reply

* Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace
From: Steven Rostedt @ 2019-07-10 19:47 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kris Van Hees, netdev, bpf, dtrace-devel, linux-kernel, mhiramat,
	acme, ast, Peter Zijlstra, Chris Mason, brendan.d.gregg, davem
In-Reply-To: <c7f15d1d-1696-4d95-1729-4c4e97bdc43e@iogearbox.net>

On Wed, 10 Jul 2019 21:32:25 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:


> Looks like you missed Brendan Gregg's prior feedback from v1 [0]. I haven't
> seen a strong compelling argument for why this needs to reside in the kernel
> tree given we also have all the other tracing tools and many of which also
> rely on BPF such as bcc, bpftrace, ply, systemtap, sysdig, lttng to just name
> a few. Given all the other tracers manage to live outside the kernel tree just
> fine, so can dtrace as well; it's _not_ special in this regard in any way. It
> will be tons of code in long term which is better off in its separate project,
> and if we add tools/dtrace/, other projects will come as well asking for kernel
> tree inclusion 'because tools/dtrace' is now there, too. While it totally makes
> sense to extend the missing kernel bits where needed, it doesn't make sense to
> have another big tracing project similar to perf in the tree. Therefore, I'm
> not applying this patch, sorry.

I agree with this.

Note, trace-cmd is very tied to ftrace just as much as perf is to the
code in tree. There was a window in time I had a choice to add it to
tools/ as well, but after careful consideration, I decided it's best
against it. The only thing being in tree gives you is marketing.
Otherwise, it makes it too coupled. I keep having to compile perf
separately, because a lot of perf distro packages appear to think that
it requires the same kernel version.

It also makes it easier to have your own release cycles, otherwise it
forces you to be on a 2 1/2 month cycle that the kernel is on. And it
forces you to have a clear separation between kernel and user space.

That said, I'm working to put together libraries that interact with all
the current tracers (perf, trace-cmd, lttng, bpftrace, etc) and call it
the "Unified Tracing Platform". The purpose is to allow any tool to be
able to take advantage of any of the supported tracers within the
running kernel. This will be one of the topics at the Tracing MC at
Linux Plumbers in September. I hope to see all of you there ;-)

-- Steve


^ permalink raw reply

* Re: [GIT PULL] Keys: Set 4 - Key ACLs for 5.3
From: Eric Biggers @ 2019-07-10 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, James Morris James Morris, keyrings, Netdev,
	linux-nfs, CIFS, linux-afs, linux-fsdevel, linux-integrity,
	LSM List, Linux List Kernel Mailing
In-Reply-To: <CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com>

On Wed, Jul 10, 2019 at 11:35:07AM -0700, Linus Torvalds wrote:
> On Fri, Jul 5, 2019 at 2:30 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Here's my fourth block of keyrings changes for the next merge window.  They
> > change the permissions model used by keys and keyrings to be based on an
> > internal ACL by the following means:
> 
> It turns out that this is broken, and I'll probably have to revert the
> merge entirely.
> 
> With this merge in place, I can't boot any of the machines that have
> an encrypted disk setup. The boot just stops at
> 
>   systemd[1]: Started Forward Password Requests to Plymouth Directory Watch.
>   systemd[1]: Reached target Paths.
> 
> and never gets any further. I never get the prompt for a passphrase
> for the disk encryption.
> 
> Apparently not a lot of developers are using encrypted volumes for
> their development machines.
> 
> I'm not sure if the only requirement is an encrypted volume, or if
> this is also particular to a F30 install in case you need to be able
> to reproduce. But considering that you have a redhat email address,
> I'm sure you can find a F30 install somewhere with an encrypted disk.
> 
> David, if you can fix this quickly, I'll hold off on the revert of it
> all, but I can wait only so long. I've stopped merging stuff since I
> noticed my machines don't work (this merge window has not been
> pleasant so far - in addition to this issue I had another entirely
> unrelated boot failure which made bisecting this one even more fun).
> 
> So if I don't see a quick fix, I'll just revert in order to then
> continue to do pull requests later today. Because I do not want to do
> further pulls with something that I can't boot as a base.
> 
>                  Linus

This also broke 'keyctl new_session' and hence all the fscrypt tests
(https://lkml.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/), and it
also broke loading in-kernel X.509 certificates
(https://lore.kernel.org/lkml/27671.1562384658@turing-police/T/#u).

I'm *guessing* these are all some underlying issue where keyrings aren't being
given all the needed permissions anymore.

But just FYI, David had said he's on vacation with no laptop or email access for
2 weeks starting from Sunday (3 days ago).  So I don't think you can expect a
quick fix from him.

I was planning to look into this to fix the fscrypt tests, but it might be a few
days before I get to it.  And while I'm *guessing* it will be a simple fix, it
might not be.  So I can't speak for David, but personally I'm fine with the
commits being reverted for now.

I'm also unhappy that the new keyctl KEYCTL_GRANT_PERMISSION doesn't have any
documentation or tests.  (Which seems to be a common problem with David's
work...  None of the new mount syscalls in v5.2 have any tests, for example, and
the man pages are still work-in-progress and last sent out for review a year
ago, despite API changes that occurred before the syscalls were merged.)

- Eric

^ permalink raw reply

* Re: [bpf PATCH v2 6/6] bpf: sockmap/tls, close can race with map free
From: Jakub Kicinski @ 2019-07-10 19:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <5d255ca6e5b0d_1b7a2aec940d65b4f6@john-XPS-13-9370.notmuch>

On Tue, 09 Jul 2019 20:33:58 -0700, John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Mon, 08 Jul 2019 19:15:18 +0000, John Fastabend wrote:  
> > > @@ -352,15 +354,18 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
> > >  	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
> > >  		goto skip_tx_cleanup;
> > >  
> > > -	sk->sk_prot = ctx->sk_proto;
> > >  	tls_sk_proto_cleanup(sk, ctx, timeo);
> > >  
> > >  skip_tx_cleanup:
> > > +	write_lock_bh(&sk->sk_callback_lock);
> > > +	icsk->icsk_ulp_data = NULL;  
> > 
> > Is ulp_data pointer now supposed to be updated under the
> > sk_callback_lock?  
> 
> Yes otherwise it can race with tls_update(). I didn't remove the
> ulp pointer null set from tcp_ulp.c though. Could be done in this
> patch or as a follow up.

Do we need to hold the lock in unhash, too, or is unhash called with
sk_callback_lock held?

> > > +	if (sk->sk_prot->close == tls_sk_proto_close)
> > > +		sk->sk_prot = ctx->sk_proto;
> > > +	write_unlock_bh(&sk->sk_callback_lock);
> > >  	release_sock(sk);
> > >  	if (ctx->rx_conf == TLS_SW)
> > >  		tls_sw_release_strp_rx(ctx);
> > > -	sk_proto_close(sk, timeout);
> > > -
> > > +	ctx->sk_proto_close(sk, timeout);
> > >  	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
> > >  	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
> > >  		tls_ctx_free(ctx);  


^ permalink raw reply

* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: Jakub Kicinski @ 2019-07-10 19:34 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <5d255dececd33_1b7a2aec940d65b45@john-XPS-13-9370.notmuch>

On Tue, 09 Jul 2019 20:39:24 -0700, John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Mon, 08 Jul 2019 19:14:05 +0000, John Fastabend wrote:  
> > > @@ -287,6 +313,27 @@ static void tls_sk_proto_cleanup(struct sock *sk,
> > >  #endif
> > >  }
> > >  
> > > +static void tls_sk_proto_unhash(struct sock *sk)
> > > +{
> > > +	struct inet_connection_sock *icsk = inet_csk(sk);
> > > +	long timeo = sock_sndtimeo(sk, 0);
> > > +	struct tls_context *ctx;
> > > +
> > > +	if (unlikely(!icsk->icsk_ulp_data)) {  
> > 
> > Is this for when sockmap is stacked on top of TLS and TLS got removed
> > without letting sockmap know?  
> 
> Right its a pattern I used on the sockmap side and put here. But
> I dropped the patch to let sockmap stack on top of TLS because
> it was more than a fix IMO. We could probably drop this check on
> the other hand its harmless.

I feel like this code is pretty complex I struggle to follow all the
paths, so perhaps it'd be better to drop stuff that's not necessary 
to have a clearer picture.

> > > +		if (sk->sk_prot->unhash)
> > > +			sk->sk_prot->unhash(sk);
> > > +	}
> > > +
> > > +	ctx = tls_get_ctx(sk);
> > > +	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> > > +		tls_sk_proto_cleanup(sk, ctx, timeo);
> > > +	icsk->icsk_ulp_data = NULL;  
> > 
> > I think close only starts checking if ctx is NULL in patch 6.
> > Looks like some chunks of ctx checking/clearing got spread to
> > patch 1 and some to patch 6.  
> 
> Yeah, I thought the patches were easier to read this way but
> maybe not. Could add something in the commit log.

Ack! Let me try to get a full grip of patches 2 and 6 and come back 
to this.

> > > +	tls_ctx_free_wq(ctx);
> > > +
> > > +	if (ctx->unhash)
> > > +		ctx->unhash(sk);
> > > +}
> > > +
> > >  static void tls_sk_proto_close(struct sock *sk, long timeout)
> > >  {
> > >  	struct tls_context *ctx = tls_get_ctx(sk);  

^ permalink raw reply

* Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace
From: Daniel Borkmann @ 2019-07-10 19:32 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: netdev, bpf, dtrace-devel, linux-kernel, rostedt, mhiramat, acme,
	ast, Peter Zijlstra, Chris Mason, brendan.d.gregg, davem
In-Reply-To: <20190710181227.GA9925@oracle.com>

Hello Kris,

On 07/10/2019 08:12 PM, Kris Van Hees wrote:
> This patch's subject should of course be [PATCH V2 1/1] rather than 0/1.
> Sorry about that.
> 
> On Wed, Jul 10, 2019 at 08:42:24AM -0700, Kris Van Hees wrote:
>> This initial implementation of a tiny subset of DTrace functionality
>> provides the following options:
>>
>> 	dtrace [-lvV] [-b bufsz] -s script
>> 	    -b  set trace buffer size
>> 	    -l  list probes (only works with '-s script' for now)
>> 	    -s  enable or list probes for the specified BPF program
>> 	    -V  report DTrace API version
>>
>> The patch comprises quite a bit of code due to DTrace requiring a few
>> crucial components, even in its most basic form.
>>
>> The code is structured around the command line interface implemented in
>> dtrace.c.  It provides option parsing and drives the three modes of
>> operation that are currently implemented:
>>
>> 1. Report DTrace API version information.
>> 	Report the version information and terminate.
>>
>> 2. List probes in BPF programs.
>> 	Initialize the list of probes that DTrace recognizes, load BPF
>> 	programs, parse all BPF ELF section names, resolve them into
>> 	known probes, and emit the probe names.  Then terminate.
>>
>> 3. Load BPF programs and collect tracing data.
>> 	Initialize the list of probes that DTrace recognizes, load BPF
>> 	programs and attach them to their corresponding probes, set up
>> 	perf event output buffers, and start processing tracing data.
>>
>> This implementation makes extensive use of BPF (handled by dt_bpf.c) and
>> the perf event output ring buffer (handled by dt_buffer.c).  DTrace-style
>> probe handling (dt_probe.c) offers an interface to probes that hides the
>> implementation details of the individual probe types by provider (dt_fbt.c
>> and dt_syscall.c).  Probe lookup by name uses a hashtable implementation
>> (dt_hash.c).  The dt_utils.c code populates a list of online CPU ids, so
>> we know what CPUs we can obtain tracing data from.
>>
>> Building the tool is trivial because its only dependency (libbpf) is in
>> the kernel tree under tools/lib/bpf.  A simple 'make' in the tools/dtrace
>> directory suffices.
>>
>> The 'dtrace' executable needs to run as root because BPF programs cannot
>> be loaded by non-root users.
>>
>> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
>> Reviewed-by: David Mc Lean <david.mclean@oracle.com>
>> Reviewed-by: Eugene Loh <eugene.loh@oracle.com>
>> ---
>> Changes in v2:
>>         - Use ring_buffer_read_head() and ring_buffer_write_tail() to
>>           avoid use of volatile.
>>         - Handle perf events that wrap around the ring buffer boundary.
>>         - Remove unnecessary PERF_EVENT_IOC_ENABLE.
>>         - Remove -I$(srctree)/tools/perf from KBUILD_HOSTCFLAGS since it
>>           is not actually used.
>>         - Use PT_REGS_PARM1(x), etc instead of my own macros.  Adding 
>>           PT_REGS_PARM6(x) in bpf_sample.c because we need to be able to
>>           support up to 6 arguments passed by registers.

Looks like you missed Brendan Gregg's prior feedback from v1 [0]. I haven't
seen a strong compelling argument for why this needs to reside in the kernel
tree given we also have all the other tracing tools and many of which also
rely on BPF such as bcc, bpftrace, ply, systemtap, sysdig, lttng to just name
a few. Given all the other tracers manage to live outside the kernel tree just
fine, so can dtrace as well; it's _not_ special in this regard in any way. It
will be tons of code in long term which is better off in its separate project,
and if we add tools/dtrace/, other projects will come as well asking for kernel
tree inclusion 'because tools/dtrace' is now there, too. While it totally makes
sense to extend the missing kernel bits where needed, it doesn't make sense to
have another big tracing project similar to perf in the tree. Therefore, I'm
not applying this patch, sorry.

Thanks,
Daniel

  [0] https://lore.kernel.org/bpf/CAE40pdeSfJBpbBHTmwz1xZ+MW02=kJ0krq1mN+EkjSLqf2GX_w@mail.gmail.com/

>> ---
>>  MAINTAINERS                |   6 +
>>  tools/dtrace/Makefile      |  87 ++++++++++
>>  tools/dtrace/bpf_sample.c  | 146 ++++++++++++++++
>>  tools/dtrace/dt_bpf.c      | 185 ++++++++++++++++++++
>>  tools/dtrace/dt_buffer.c   | 338 +++++++++++++++++++++++++++++++++++++
>>  tools/dtrace/dt_fbt.c      | 201 ++++++++++++++++++++++
>>  tools/dtrace/dt_hash.c     | 211 +++++++++++++++++++++++
>>  tools/dtrace/dt_probe.c    | 230 +++++++++++++++++++++++++
>>  tools/dtrace/dt_syscall.c  | 179 ++++++++++++++++++++
>>  tools/dtrace/dt_utils.c    | 132 +++++++++++++++
>>  tools/dtrace/dtrace.c      | 249 +++++++++++++++++++++++++++
>>  tools/dtrace/dtrace.h      |  13 ++
>>  tools/dtrace/dtrace_impl.h | 101 +++++++++++
>>  13 files changed, 2078 insertions(+)
>>  create mode 100644 tools/dtrace/Makefile
>>  create mode 100644 tools/dtrace/bpf_sample.c
>>  create mode 100644 tools/dtrace/dt_bpf.c
>>  create mode 100644 tools/dtrace/dt_buffer.c
>>  create mode 100644 tools/dtrace/dt_fbt.c
>>  create mode 100644 tools/dtrace/dt_hash.c
>>  create mode 100644 tools/dtrace/dt_probe.c
>>  create mode 100644 tools/dtrace/dt_syscall.c
>>  create mode 100644 tools/dtrace/dt_utils.c
>>  create mode 100644 tools/dtrace/dtrace.c
>>  create mode 100644 tools/dtrace/dtrace.h
>>  create mode 100644 tools/dtrace/dtrace_impl.h

^ permalink raw reply

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Eric Dumazet @ 2019-07-10 19:26 UTC (permalink / raw)
  To: Prout, Andrew - LLSC - MITLL, Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess
In-Reply-To: <e471350b70e244daa10043f06fbb3ebe@ll.mit.edu>



On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
> 
> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
> 
> The synthetic test was a pair of simple send/recv test programs under the following conditions:
> -The send socket was non-blocking
> -SO_SNDBUF set to 128KiB
> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
> -Load was on both systems: a while(1) program spinning on each CPU core
> -The receiver was on an older unaffected kernel
> 

SO_SNDBUF to 128KB does not permit to recover from heavy losses,
since skbs needs to be allocated for retransmits.

The bug we fixed allowed remote attackers to crash all linux hosts,

I am afraid we have to enforce the real SO_SNDBUF limit, finally.

Even a cushion of 128KB per socket is dangerous, for servers with millions of TCP sockets.

You will either have to set SO_SNDBUF to higher values, or let autotuning in place.
Or revert the patches and allow attackers hit you badly.


^ permalink raw reply

* Re: [PATCH v6 bpf-next 3/3] bpf: add tests for bpf_descendant_of
From: Andrii Nakryiko @ 2019-07-10 19:25 UTC (permalink / raw)
  To: Javier Honduvilla Coto; +Cc: Networking, Yonghong Song, Kernel Team, jonhaslam
In-Reply-To: <20190710180025.94726-4-javierhonduco@fb.com>

On Wed, Jul 10, 2019 at 11:31 AM Javier Honduvilla Coto
<javierhonduco@fb.com> wrote:
>
> Adding the following test cases:

FYI, bpf-next is closed, so this won't be able to go in for about 2 weeks.

>
> - bpf_descendant_of(current->pid) == 1
> - bpf_descendant_of(current->real_parent->pid) == 1
> - bpf_descendant_of(1) == 1
> - bpf_descendant_of(0) == 1
>
> - bpf_descendant_of(-1) == 0
> - bpf_descendant_of(current->children[0]->pid) == 0
>
> Signed-off-by: Javier Honduvilla Coto <javierhonduco@fb.com>
> ---
>  tools/testing/selftests/bpf/.gitignore        |   1 +
>  tools/testing/selftests/bpf/Makefile          |   2 +-
>  tools/testing/selftests/bpf/bpf_helpers.h     |   3 +
>  .../bpf/progs/test_descendant_of_kern.c       |  43 +++
>  .../selftests/bpf/test_descendant_of_user.c   | 266 ++++++++++++++++++
>  5 files changed, 314 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/test_descendant_of_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_descendant_of_user.c
>
> diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
> index 90f70d2c7c22..4b63d7105ba2 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -43,3 +43,4 @@ test_sockopt
>  test_sockopt_sk
>  test_sockopt_multi
>  test_tcp_rtt
> +test_descendant_of_user
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 2620406a53ec..b3dc1e26c41c 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -27,7 +27,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>         test_cgroup_storage test_select_reuseport test_section_names \
>         test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
>         test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> -       test_sockopt_multi test_tcp_rtt
> +       test_sockopt_multi test_tcp_rtt test_descendant_of_user

Could you please instead add this test as part of test_progs? See for
instance prog_tests/attach_probe.c for recently added test.

>
>  BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>  TEST_GEN_FILES = $(BPF_OBJ_FILES)
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 5a3d92c8bec8..7525783ffbc9 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -1,4 +1,6 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> +#include <sys/types.h>
> +
>  #ifndef __BPF_HELPERS_H
>  #define __BPF_HELPERS_H
>
> @@ -228,6 +230,7 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
>  static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
>         (void *)BPF_FUNC_sk_storage_delete;
>  static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
> +static int (*bpf_descendant_of)(pid_t pid) = (void *) BPF_FUNC_descendant_of;

Can you split bpf_helpers.h update into a separate commit?

>
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/tools/testing/selftests/bpf/progs/test_descendant_of_kern.c b/tools/testing/selftests/bpf/progs/test_descendant_of_kern.c
> new file mode 100644
> index 000000000000..802e01595527
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_descendant_of_kern.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +struct bpf_map_def SEC("maps") pidmap = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u32),
> +       .max_entries = 2,
> +};
> +
> +struct bpf_map_def SEC("maps") resultmap = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u32),
> +       .max_entries = 1,
> +};

Please update this to use new BTF-defined maps (see lots of recently
converted tests for example).

> +
> +SEC("tracepoint/syscalls/sys_enter_open")
> +int trace(void *ctx)
> +{
> +       __u32 pid = bpf_get_current_pid_tgid();
> +       __u32 current_key = 0, ancestor_key = 1, *expected_pid, *ancestor_pid;
> +       __u32 *val;
> +
> +       expected_pid = bpf_map_lookup_elem(&pidmap, &current_key);
> +       if (!expected_pid || *expected_pid != pid)
> +               return 0;
> +
> +       ancestor_pid = bpf_map_lookup_elem(&pidmap, &ancestor_key);
> +       if (!ancestor_pid)
> +               return 0;
> +
> +       val = bpf_map_lookup_elem(&resultmap, &current_key);
> +       if (val)
> +               *val = bpf_descendant_of(*ancestor_pid);
> +
> +       return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +__u32 _version SEC("version") = 1;
> diff --git a/tools/testing/selftests/bpf/test_descendant_of_user.c b/tools/testing/selftests/bpf/test_descendant_of_user.c
> new file mode 100644
> index 000000000000..f616c8c976a4
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_descendant_of_user.c
> @@ -0,0 +1,266 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <syscall.h>
> +#include <unistd.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#define CHECK(condition, tag, format...)                                       \
> +       ({                                                                     \
> +               int __ret = !!(condition);                                     \
> +               if (__ret) {                                                   \
> +                       printf("%s:FAIL:%s ", __func__, tag);                  \
> +                       printf(format);                                        \
> +               } else {                                                       \
> +                       printf("%s:PASS:%s\n", __func__, tag);                 \
> +               }                                                              \
> +               __ret;                                                         \
> +       })

You won't need this if done as part of test_progs.

> +
> +static int bpf_find_map(const char *test, struct bpf_object *obj,
> +                       const char *name)
> +{
> +       struct bpf_map *map;
> +
> +       map = bpf_object__find_map_by_name(obj, name);
> +       if (!map)
> +               return -1;
> +       return bpf_map__fd(map);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       const char *probe_name = "syscalls/sys_enter_open";
> +       const char *file = "test_descendant_of_kern.o";
> +       int err, bytes, efd, prog_fd, pmu_fd;
> +       int resultmap_fd, pidmap_fd;
> +       struct perf_event_attr attr = {};
> +       struct bpf_object *obj;
> +       __u32 descendant_of_result = 0;
> +       __u32 key = 0, pid;
> +       int exit_code = EXIT_FAILURE;
> +       char buf[256];
> +
> +       int child_pid, ancestor_pid, root_fd, nonexistant = -42;
> +       __u32 ancestor_key = 1;
> +       int pipefd[2];
> +       char marker[1];
> +
> +       err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
> +       if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> +               goto fail;
> +
> +       resultmap_fd = bpf_find_map(__func__, obj, "resultmap");
> +       if (CHECK(resultmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +                 resultmap_fd, errno))
> +               goto close_prog;
> +
> +       pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> +       if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n", pidmap_fd,
> +                 errno))
> +               goto close_prog;
> +
> +       pid = getpid();
> +       bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> +       bpf_map_update_elem(pidmap_fd, &ancestor_key, &pid, 0);
> +
> +       snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%s/id",
> +                probe_name);
> +       efd = open(buf, O_RDONLY, 0);
> +       if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> +               goto close_prog;
> +       bytes = read(efd, buf, sizeof(buf));
> +       close(efd);
> +       if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> +                 "bytes %d errno %d\n", bytes, errno))
> +               goto close_prog;
> +
> +       attr.config = strtol(buf, NULL, 0);
> +       attr.type = PERF_TYPE_TRACEPOINT;
> +       attr.sample_type = PERF_SAMPLE_RAW;
> +       attr.sample_period = 1;
> +       attr.wakeup_events = 1;
> +
> +       pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> +       if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> +                 errno))
> +               goto close_prog;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +       if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> +                 errno))
> +               goto close_pmu;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +       if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> +                 errno))
> +               goto close_pmu;

Can you please switch all this to new libbpf tracing APIs? see
prog_tests/attach_probe.c for examples.

> +
> +       // Test that descendant_of(current->pid) is true
> +       bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> +       bpf_map_update_elem(pidmap_fd, &ancestor_key, &pid, 0);
> +       bpf_map_update_elem(resultmap_fd, &key, &nonexistant, 0);
> +
> +       root_fd = open("/", O_RDONLY);
> +       if (CHECK(efd < 0, "open", "errno %d\n", errno))
> +               goto close_prog;
> +       close(root_fd);

I'd suggest to just use raw_tracepoint 'sys_enter', which would be
easy to trigger with just `usleep(1)`. Makes for quite simpler code.
See, e.g., tools/testing/selftests/bpf/prog_tests/get_stack_raw_tp.c

> +
> +       err = bpf_map_lookup_elem(resultmap_fd, &key, &descendant_of_result);
> +       if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> +               goto close_pmu;
> +       if (CHECK(descendant_of_result != 1,
> +                 "descendant_of is true with same pid", "%d == %d\n",
> +                 descendant_of_result, 1))
> +               goto close_pmu;
> +

<snip>

^ permalink raw reply

* Re: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"
From: David Ahern @ 2019-07-10 19:13 UTC (permalink / raw)
  To: Stefano Brivio, Jan Szewczyk
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Wei Wang,
	Martin KaFai Lau, Eric Dumazet
In-Reply-To: <20190710210954.530d72a5@elisabeth>

On 7/10/19 1:09 PM, Stefano Brivio wrote:
> Jan,
> 
> On Wed, 10 Jul 2019 12:59:41 +0000
> Jan Szewczyk <jan.szewczyk@ericsson.com> wrote:
> 
>> Hi!
>> I digged up a little further and maybe it's not a problem with MTU
>> itself. I checked every entry I get from RTM_GETROUTE netlink message
>> and after triggering "too big packet" by pinging ipv6address I get
>> exactly the same messages on 4.12 and 4.18, except that the one with
>> that pinged ipv6address is missing on 4.18 at all. What is weird -
>> it's visible when running "ip route get to ipv6address". Do you know
>> why there is a mismatch there?
> 
> If I understand you correctly, an implementation equivalent to 'ip -6
> route list show' (using the NLM_F_DUMP flag) won't show the so-called
> route exception, while 'ip -6 route get' shows it.
> 
> If that's the case: that was broken by commit 2b760fcf5cfb ("ipv6: hook
> up exception table to store dst cache") that landed in 4.15, and fixed
> by net-next commit 1e47b4837f3b ("ipv6: Dump route exceptions if
> requested"). For more details, see the log of this commit itself.
> 

ah, good point. My mind locked on RTM_GETROUTE as a specific route
request not a dump.

^ permalink raw reply

* Re: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"
From: Stefano Brivio @ 2019-07-10 19:09 UTC (permalink / raw)
  To: Jan Szewczyk
  Cc: David Ahern, davem@davemloft.net, netdev@vger.kernel.org,
	Wei Wang, Martin KaFai Lau, Eric Dumazet
In-Reply-To: <AM6PR07MB5639E2AEF438DD017246DF13F2F00@AM6PR07MB5639.eurprd07.prod.outlook.com>

Jan,

On Wed, 10 Jul 2019 12:59:41 +0000
Jan Szewczyk <jan.szewczyk@ericsson.com> wrote:

> Hi!
> I digged up a little further and maybe it's not a problem with MTU
> itself. I checked every entry I get from RTM_GETROUTE netlink message
> and after triggering "too big packet" by pinging ipv6address I get
> exactly the same messages on 4.12 and 4.18, except that the one with
> that pinged ipv6address is missing on 4.18 at all. What is weird -
> it's visible when running "ip route get to ipv6address". Do you know
> why there is a mismatch there?

If I understand you correctly, an implementation equivalent to 'ip -6
route list show' (using the NLM_F_DUMP flag) won't show the so-called
route exception, while 'ip -6 route get' shows it.

If that's the case: that was broken by commit 2b760fcf5cfb ("ipv6: hook
up exception table to store dst cache") that landed in 4.15, and fixed
by net-next commit 1e47b4837f3b ("ipv6: Dump route exceptions if
requested"). For more details, see the log of this commit itself.

-- 
Stefano

^ permalink raw reply

* Re: [PATCH bpf-next v9 0/2] bpf: Allow bpf_skb_event_output for more prog types
From: Andrii Nakryiko @ 2019-07-10 19:05 UTC (permalink / raw)
  To: Allan Zhang
  Cc: Networking, bpf, Song Liu, Daniel Borkmann, Alexei Starovoitov
In-Reply-To: <20190710181811.127374-1-allanzhang@google.com>

On Wed, Jul 10, 2019 at 11:18 AM Allan Zhang <allanzhang@google.com> wrote:
>
> Software event output is only enabled by a few prog types right now (TC,
> LWT out, XDP, sockops). Many other skb based prog types need
> bpf_skb_event_output to produce software event.
>
> More prog types are enabled to access bpf_skb_event_output in this
> patch.
>
> v9 changes:
> add "Acked-by" field.

Thanks! This looks good to me. Just FYI. Not sure if you followed, but
bpf-next is closed, so this can't go in until it's open. Maintainers
might ask you to resubmit at that time, if patches don't apply
cleanly.

>
> v8 changes:
> No actual change, just cc to netdev@vger.kernel.org and
> bpf@vger.kernel.org.
> v7 patches are acked by Song Liu.
>
> v7 changes:
> Reformat from hints by scripts/checkpatch.pl, including Song's comment
> on signed-off-by name to captical case in cover letter.
> 3 of hints are ignored:
> 1. new file mode.
> 2. SPDX-License-Identifier for event_output.c since all files under
>    this dir have no such line.
> 3. "Macros ... enclosed in parentheses" for macro in event_output.c
>    due to code's nature.
>
> Change patch 02 subject "bpf:..." to "selftests/bpf:..."
>
> v6 changes:
> Fix Signed-off-by, fix fixup map creation.
>
> v5 changes:
> Fix typos, reformat comments in event_output.c, move revision history to
> cover letter.
>
> v4 changes:
> Reformating log message.
>
> v3 changes:
> Reformating log message.
>
> v2 changes:
> Reformating log message.
>
> Allan Zhang (2):
>   bpf: Allow bpf_skb_event_output for a few prog types
>   selftests/bpf: Add selftests for bpf_perf_event_output
>
>  net/core/filter.c                             |  6 ++
>  tools/testing/selftests/bpf/test_verifier.c   | 12 ++-
>  .../selftests/bpf/verifier/event_output.c     | 94 +++++++++++++++++++
>  3 files changed, 111 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/event_output.c
>
> --
> 2.22.0.410.gd8fdbe21b5-goog
>

^ permalink raw reply

* [PATCH] net/mlx5e: Move priv variable into case statement in mlx5e_setup_tc
From: Nathan Chancellor @ 2019-07-10 19:05 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, David S. Miller
  Cc: netdev, linux-rdma, linux-kernel, clang-built-linux,
	Nathan Chancellor

There is an unused variable warning on arm64 defconfig when
CONFIG_MLX5_ESWITCH is unset:

drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3467:21: warning:
unused variable 'priv' [-Wunused-variable]
        struct mlx5e_priv *priv = netdev_priv(dev);
                           ^
1 warning generated.

Move it down into the case statement where it is used.

Fixes: 4e95bc268b91 ("net: flow_offload: add flow_block_cb_setup_simple()")
Link: https://github.com/ClangBuiltLinux/linux/issues/597
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6d0ae87c8ded..651eb714eb5b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3464,15 +3464,16 @@ static LIST_HEAD(mlx5e_block_cb_list);
 static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			  void *type_data)
 {
-	struct mlx5e_priv *priv = netdev_priv(dev);
-
 	switch (type) {
 #ifdef CONFIG_MLX5_ESWITCH
-	case TC_SETUP_BLOCK:
+	case TC_SETUP_BLOCK: {
+		struct mlx5e_priv *priv = netdev_priv(dev);
+
 		return flow_block_cb_setup_simple(type_data,
 						  &mlx5e_block_cb_list,
 						  mlx5e_setup_tc_block_cb,
 						  priv, priv, true);
+	}
 #endif
 	case TC_SETUP_QDISC_MQPRIO:
 		return mlx5e_setup_tc_mqprio(dev, type_data);
-- 
2.22.0


^ permalink raw reply related

* lockup in hacked 4.20.17+ kernel, maybe addrconf_verify_work related?
From: Ben Greear @ 2019-07-10 18:51 UTC (permalink / raw)
  To: netdev

Hello,

This is with my hacked-upon kernel, could be my mistake, etc.  But, curious
if anyone else has seen a similar deadlock?  I was running a complicated automated
wifi test, for what that is worth.

66707.212081] ath: regdomain 0x8348 dynamically updated by user
[67044.625948] INFO: task kworker/0:0:27387 blocked for more than 180 seconds.
[67044.631831]       Tainted: G        W  O      4.20.17+ #30
[67044.636180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.647424] kworker/0:0     D    0 27387      2 0x80000080
[67044.647447] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
[67044.647448] Call Trace:
[67044.647455]  ? __schedule+0x29e/0x880
[67044.647457]  schedule+0x2a/0x80
[67044.647459]  schedule_preempt_disabled+0xc/0x20
[67044.647460]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.647468]  addrconf_verify_work+0x5/0x10 [ipv6]
[67044.647472]  process_one_work+0x1f3/0x420
[67044.647473]  worker_thread+0x28/0x3c0
[67044.647475]  ? process_one_work+0x420/0x420
[67044.647476]  kthread+0x10b/0x130
[67044.647478]  ? kthread_create_on_node+0x60/0x60
[67044.647480]  ret_from_fork+0x1f/0x30
[67044.647491] INFO: task DNS Resolver #6:19810 blocked for more than 180 seconds.
[67044.656865]       Tainted: G        W  O      4.20.17+ #30
[67044.661364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.670861] DNS Resolver #6 D    0 19810  19213 0x00000080
[67044.670863] Call Trace:
[67044.670870]  ? __schedule+0x29e/0x880
[67044.670872]  schedule+0x2a/0x80
[67044.670874]  schedule_preempt_disabled+0xc/0x20
[67044.670876]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.670879]  ? netlink_lookup+0x111/0x160
[67044.670881]  __netlink_dump_start+0x4f/0x1d0
[67044.670883]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.670885]  rtnetlink_rcv_msg+0x25c/0x390
[67044.670886]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.670888]  ? rtnl_calcit.isra.31+0x110/0x110
[67044.670889]  netlink_rcv_skb+0x44/0x120
[67044.670891]  netlink_unicast+0x18b/0x220
[67044.670893]  netlink_sendmsg+0x1ff/0x3d0
[67044.670896]  sock_sendmsg+0x2b/0x40
[67044.670898]  __sys_sendto+0xe9/0x150
[67044.670901]  ? __audit_syscall_exit+0x216/0x280
[67044.670903]  __x64_sys_sendto+0x1f/0x30
[67044.670906]  do_syscall_64+0x4a/0xf0
[67044.670909]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.670911] RIP: 0033:0x7f94b119b6b0
[67044.670915] Code: Bad RIP value.
[67044.670915] RSP: 002b:00007f94817942e0 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
[67044.670917] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f94b119b6b0
[67044.670917] RDX: 0000000000000014 RSI: 00007f9481795420 RDI: 000000000000004d
[67044.670918] RBP: 00007f9481795420 R08: 00007f94817953c4 R09: 000000000000000c
[67044.670919] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000014
[67044.670919] R13: 00007f94817953c4 R14: 00007f947ff42208 R15: 000000000000004d
[67044.670933] INFO: task kworker/0:2:17500 blocked for more than 180 seconds.
[67044.678868]       Tainted: G        W  O      4.20.17+ #30
[67044.685860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.692746] kworker/0:2     D    0 17500      2 0x80000080
[67044.692787] Workqueue: events_power_efficient reg_check_chans_work [cfg80211]
[67044.692788] Call Trace:
[67044.692795]  ? __schedule+0x29e/0x880
[67044.692797]  schedule+0x2a/0x80
[67044.692799]  schedule_preempt_disabled+0xc/0x20
[67044.692800]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.692810]  reg_check_chans_work+0x28/0x350 [cfg80211]
[67044.692815]  process_one_work+0x1f3/0x420
[67044.692817]  worker_thread+0x28/0x3c0
[67044.692819]  ? process_one_work+0x420/0x420
[67044.692821]  kthread+0x10b/0x130
[67044.692822]  ? kthread_create_on_node+0x60/0x60
[67044.692825]  ret_from_fork+0x1f/0x30
[67044.692833] INFO: task iw:1488 blocked for more than 180 seconds.
[67044.700860]       Tainted: G        W  O      4.20.17+ #30
[67044.705216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.714862] iw              D    0  1488   1487 0x00000080
[67044.714864] Call Trace:
[67044.714871]  ? __schedule+0x29e/0x880
[67044.714874]  schedule+0x2a/0x80
[67044.714876]  schedule_preempt_disabled+0xc/0x20
[67044.714877]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.714915]  nl80211_dump_wiphy+0x1d/0x1b0 [cfg80211]
[67044.714919]  genl_lock_dumpit+0x23/0x40
[67044.714921]  netlink_dump+0x16d/0x360
[67044.714923]  __netlink_dump_start+0x168/0x1d0
[67044.714925]  genl_family_rcv_msg+0x25f/0x3a0
[67044.714927]  ? genl_lock_dumpit+0x40/0x40
[67044.714928]  ? genl_lock_done+0x40/0x40
[67044.714929]  ? genl_unlock+0x10/0x10
[67044.714931]  ? netlink_unicast+0x1ff/0x220
[67044.714932]  genl_rcv_msg+0x42/0x87
[67044.714934]  ? genl_family_rcv_msg+0x3a0/0x3a0
[67044.714935]  netlink_rcv_skb+0x44/0x120
[67044.714937]  genl_rcv+0x1f/0x30
[67044.714939]  netlink_unicast+0x18b/0x220
[67044.714940]  netlink_sendmsg+0x1ff/0x3d0
[67044.714944]  sock_sendmsg+0x2b/0x40
[67044.714946]  ___sys_sendmsg+0x28a/0x2f0
[67044.714947]  ? ___sys_recvmsg+0x156/0x1d0
[67044.714950]  ? __alloc_pages_nodemask+0x111/0x280
[67044.714954]  ? alloc_pages_vma+0x6f/0x1c0
[67044.714957]  ? page_add_new_anon_rmap+0x72/0xb0
[67044.714958]  ? __handle_mm_fault+0x7db/0x12c0
[67044.714961]  __sys_sendmsg+0x52/0xa0
[67044.714964]  do_syscall_64+0x4a/0xf0
[67044.714967]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.714969] RIP: 0033:0x7fa9c4af15a7
[67044.714972] Code: Bad RIP value.
[67044.714973] RSP: 002b:00007fffdd7ac818 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[67044.714974] RAX: ffffffffffffffda RBX: 00000000021ae990 RCX: 00007fa9c4af15a7
[67044.714975] RDX: 0000000000000000 RSI: 00007fffdd7ac8b0 RDI: 0000000000000008
[67044.714976] RBP: 00000000021b3d80 R08: 0000000000000004 R09: 00007fa9c4dabf20
[67044.714976] R10: 0000000000000170 R11: 0000000000000246 R12: 00000000021b3ec0
[67044.714977] R13: 00007fffdd7ac8b0 R14: 00000000021b3ec0 R15: 00007fffdd7acb18
[67044.714980] INFO: task sshd:1763 blocked for more than 180 seconds.
[67044.720810]       Tainted: G        W  O      4.20.17+ #30
[67044.725186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.732027] sshd            D    0  1763   1355 0x00000080
[67044.732029] Call Trace:
[67044.732038]  ? __schedule+0x29e/0x880
[67044.732040]  schedule+0x2a/0x80
[67044.732042]  schedule_preempt_disabled+0xc/0x20
[67044.732043]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.732046]  ? netlink_lookup+0x111/0x160
[67044.732048]  __netlink_dump_start+0x4f/0x1d0
[67044.732051]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732052]  rtnetlink_rcv_msg+0x25c/0x390
[67044.732054]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732055]  ? rtnl_calcit.isra.31+0x110/0x110
[67044.732057]  netlink_rcv_skb+0x44/0x120
[67044.732059]  netlink_unicast+0x18b/0x220
[67044.732060]  netlink_sendmsg+0x1ff/0x3d0
[67044.732064]  sock_sendmsg+0x2b/0x40
[67044.732066]  __sys_sendto+0xe9/0x150
[67044.732070]  ? __audit_syscall_exit+0x216/0x280
[67044.732071]  __x64_sys_sendto+0x1f/0x30
[67044.732075]  do_syscall_64+0x4a/0xf0
[67044.732077]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.732079] RIP: 0033:0x7f16e29c765a
[67044.732082] Code: Bad RIP value.
[67044.732083] RSP: 002b:00007ffe57e52e88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[67044.732084] RAX: ffffffffffffffda RBX: 00007ffe57e53f80 RCX: 00007f16e29c765a
[67044.732085] RDX: 0000000000000014 RSI: 00007ffe57e53f80 RDI: 0000000000000003
[67044.732085] RBP: 00007ffe57e53fd0 R08: 00007ffe57e53f24 R09: 000000000000000c
[67044.732086] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe57e53f24
[67044.732087] R13: 00007ffe57e54160 R14: 0000000000000000 R15: 0000000000000003

Thanks,
Ben
-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Prout, Andrew - LLSC - MITLL @ 2019-07-10 18:53 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess
In-Reply-To: <96791fd5-8d36-2e00-3fef-60b23bea05e5@gmail.com>

On 7/10/19 2:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
>> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>>
>>> Yes, this does the trick for my packetdrill-test.
>>>
>>> I wonder, is there a way we could end up in a situation where we can't
>>> retransmit anymore?
>>> For example, sk_wmem_queued has grown so much that the new test fails.
>>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>>> won't be able to do so. So we will never retransmit. And if no ACK
>>> comes back in to make some room we are stuck, no?
>> 
>> We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.
>> 
>> I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That lead me to this thread.
>> 
>> Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations, many of our workloads have been unstable since this patch went into production, but those are the two we clearly identified as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.
>> 
>> The TCPWqueueTooBig count is incrementing when the problem occurs.
>> 
>> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
>> run_timer_softirq
>> expire_timers
>> call_timer_fn
>> tcp_write_timer
>> tcp_write_timer_handler
>> tcp_retransmit_timer
>> tcp_retransmit_skb
>> __tcp_retransmit_skb
>> tcp_fragment
>> 
>> Andrew Prout
>> MIT Lincoln Laboratory Supercomputing Center
>> 
>
> What was the kernel version you used exactly ?
>
> This problem is supposed to be fixed in v4.14.131

Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.

The synthetic test was a pair of simple send/recv test programs under the following conditions:
-The send socket was non-blocking
-SO_SNDBUF set to 128KiB
-The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
-Load was on both systems: a while(1) program spinning on each CPU core
-The receiver was on an older unaffected kernel


^ permalink raw reply

* Re: [PATCH v6 0/5] net: macb: cover letter
From: David Miller @ 2019-07-10 18:47 UTC (permalink / raw)
  To: pthombar
  Cc: andrew, nicolas.ferre, f.fainelli, linux, netdev, hkallweit1,
	linux-kernel, rafalc, piotrs, aniljoy, arthurm, stevenh, mparab
In-Reply-To: <1562769391-31803-1-git-send-email-pthombar@cadence.com>


As announced clearly yesterday on this list, the net-next tree is closed.

Please resubmit these changes when the tree opens back up again.

Thank you.

^ permalink raw reply

* Re: [PATCH v2 bpf] selftests/bpf: fix bpf_target_sparc check
From: David Miller @ 2019-07-10 18:42 UTC (permalink / raw)
  To: iii; +Cc: bpf, netdev, andrii.nakryiko
In-Reply-To: <20190710115654.44841-1-iii@linux.ibm.com>

From: Ilya Leoshkevich <iii@linux.ibm.com>
Date: Wed, 10 Jul 2019 13:56:54 +0200

> bpf_helpers.h fails to compile on sparc: the code should be checking
> for defined(bpf_target_sparc), but checks simply for bpf_target_sparc.
> 
> Also change #ifdef bpf_target_powerpc to #if defined() for consistency.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Prout, Andrew - LLSC - MITLL @ 2019-07-10 18:23 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess
In-Reply-To: <CALMXkpZ4isoXpFp_5=nVUcWrt5TofYVhpdAjv7LkCH7RFW1tYw@mail.gmail.com>

On 6/17/19 8:19 PM, Christoph Paasch wrote:
> 
> Yes, this does the trick for my packetdrill-test.
> 
> I wonder, is there a way we could end up in a situation where we can't
> retransmit anymore?
> For example, sk_wmem_queued has grown so much that the new test fails.
> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
> won't be able to do so. So we will never retransmit. And if no ACK
> comes back in to make some room we are stuck, no?

We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.

I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That lead me to this thread.

Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations, many of our workloads have been unstable since this patch went into production, but those are the two we clearly identified as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.

The TCPWqueueTooBig count is incrementing when the problem occurs.

Using ftrace/trace-cmd on an affected process, it appears the call stack is:
run_timer_softirq
expire_timers
call_timer_fn
tcp_write_timer
tcp_write_timer_handler
tcp_retransmit_timer
tcp_retransmit_skb
__tcp_retransmit_skb
tcp_fragment

Andrew Prout
MIT Lincoln Laboratory Supercomputing Center


^ permalink raw reply

* Re: [GIT PULL] Keys: Set 4 - Key ACLs for 5.3
From: Linus Torvalds @ 2019-07-10 18:35 UTC (permalink / raw)
  To: David Howells
  Cc: James Morris James Morris, keyrings, Netdev, linux-nfs, CIFS,
	linux-afs, linux-fsdevel, linux-integrity, LSM List,
	Linux List Kernel Mailing
In-Reply-To: <28477.1562362239@warthog.procyon.org.uk>

On Fri, Jul 5, 2019 at 2:30 PM David Howells <dhowells@redhat.com> wrote:
>
> Here's my fourth block of keyrings changes for the next merge window.  They
> change the permissions model used by keys and keyrings to be based on an
> internal ACL by the following means:

It turns out that this is broken, and I'll probably have to revert the
merge entirely.

With this merge in place, I can't boot any of the machines that have
an encrypted disk setup. The boot just stops at

  systemd[1]: Started Forward Password Requests to Plymouth Directory Watch.
  systemd[1]: Reached target Paths.

and never gets any further. I never get the prompt for a passphrase
for the disk encryption.

Apparently not a lot of developers are using encrypted volumes for
their development machines.

I'm not sure if the only requirement is an encrypted volume, or if
this is also particular to a F30 install in case you need to be able
to reproduce. But considering that you have a redhat email address,
I'm sure you can find a F30 install somewhere with an encrypted disk.

David, if you can fix this quickly, I'll hold off on the revert of it
all, but I can wait only so long. I've stopped merging stuff since I
noticed my machines don't work (this merge window has not been
pleasant so far - in addition to this issue I had another entirely
unrelated boot failure which made bisecting this one even more fun).

So if I don't see a quick fix, I'll just revert in order to then
continue to do pull requests later today. Because I do not want to do
further pulls with something that I can't boot as a base.

                 Linus

^ permalink raw reply

* [PATCH net-next 2/2] nfp: flower: ensure ip protocol is specified for L4 matches
From: John Hurley @ 2019-07-10 18:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1562783430-7031-1-git-send-email-john.hurley@netronome.com>

Flower rules on the NFP firmware are able to match on an IP protocol
field. When parsing rules in the driver, unknown IP protocols are only
rejected when further matches are to be carried out on layer 4 fields, as
the firmware will not be able to extract such fields from packets.

L4 protocol dissectors such as FLOW_DISSECTOR_KEY_PORTS are only parsed if
an IP protocol is specified. This leaves a loophole whereby a rule that
attempts to match on transport layer information such as port numbers but
does not explicitly give an IP protocol type can be incorrectly offloaded
(in this case with wildcard port numbers matches).

Fix this by rejecting the offload of flows that attempt to match on L4
information, not only when matching on an unknown IP protocol type, but
also when the protocol is wildcarded.

Fixes: 2a04784594f6 ("nfp: flower: check L4 matches on unknown IP protocols")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 885f968..faa8ba0 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -386,18 +386,15 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 			key_layer |= NFP_FLOWER_LAYER_TP;
 			key_size += sizeof(struct nfp_flower_tp_ports);
 			break;
-		default:
-			/* Other ip proto - we need check the masks for the
-			 * remainder of the key to ensure we can offload.
-			 */
-			if (nfp_flower_check_higher_than_l3(flow)) {
-				NL_SET_ERR_MSG_MOD(extack, "unsupported offload: unknown IP protocol with L4 matches not supported");
-				return -EOPNOTSUPP;
-			}
-			break;
 		}
 	}
 
+	if (!(key_layer & NFP_FLOWER_LAYER_TP) &&
+	    nfp_flower_check_higher_than_l3(flow)) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: cannot match on L4 information without specified IP protocol type");
+		return -EOPNOTSUPP;
+	}
+
 	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_TCP)) {
 		struct flow_match_tcp tcp;
 		u32 tcp_flags;
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 1/2] nfp: flower: fix ethernet check on match fields
From: John Hurley @ 2019-07-10 18:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1562783430-7031-1-git-send-email-john.hurley@netronome.com>

NFP firmware does not explicitly match on an ethernet type field. Rather,
each rule has a bitmask of match fields that can be used to infer the
ethernet type.

Currently, if a flower rule contains an unknown ethernet type, a check is
carried out for matches on other fields of the packet. If matches on
layer 3 or 4 are found, then the offload is rejected as firmware will not
be able to extract these fields from a packet with an ethernet type it
does not currently understand.

However, if a rule contains an unknown ethernet type without any L3 (or
above) matches then this will effectively be offloaded as a rule with a
wildcarded ethertype. This can lead to misclassifications on the firmware.

Fix this issue by rejecting all flower rules that specify a match on an
unknown ethernet type.

Further ensure correct offloads by moving the 'L3 and above' check to any
rule that does not specify an ethernet type and rejecting rules with
further matches. This means that we can still offload rules with a
wildcarded ethertype if they only match on L2 fields but will prevent
rules which match on further fields that we cannot be sure if the firmware
will be able to extract.

Fixes: af9d842c1354 ("nfp: extend flower add flow offload")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 7e725fa..885f968 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -368,15 +368,12 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 			break;
 
 		default:
-			/* Other ethtype - we need check the masks for the
-			 * remainder of the key to ensure we can offload.
-			 */
-			if (nfp_flower_check_higher_than_mac(flow)) {
-				NL_SET_ERR_MSG_MOD(extack, "unsupported offload: non IPv4/IPv6 offload with L3/L4 matches not supported");
-				return -EOPNOTSUPP;
-			}
-			break;
+			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: match on given EtherType is not supported");
+			return -EOPNOTSUPP;
 		}
+	} else if (nfp_flower_check_higher_than_mac(flow)) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: cannot match above L2 without specified EtherType");
+		return -EOPNOTSUPP;
 	}
 
 	if (basic.mask && basic.mask->ip_proto) {
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 0/2] Fix bugs in NFP flower match offload
From: John Hurley @ 2019-07-10 18:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley

This patchset contains bug fixes for corner cases in the match fields of
flower offloads. The patches ensure that flows that should not be
supported are not (incorrectly) offloaded. These include rules that match
on layer 3 and/or 4 data without specified ethernet or ip protocol fields.

John Hurley (2):
  nfp: flower: fix ethernet check on match fields
  nfp: flower: ensure ip protocol is specified for L4 matches

 .../net/ethernet/netronome/nfp/flower/offload.c    | 28 +++++++++-------------
 1 file changed, 11 insertions(+), 17 deletions(-)

-- 
2.7.4


^ permalink raw reply

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Eric Dumazet @ 2019-07-10 18:28 UTC (permalink / raw)
  To: Prout, Andrew - LLSC - MITLL, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess
In-Reply-To: <63cd99ed3d0c440185ebec3ad12327fc@ll.mit.edu>



On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>
>> Yes, this does the trick for my packetdrill-test.
>>
>> I wonder, is there a way we could end up in a situation where we can't
>> retransmit anymore?
>> For example, sk_wmem_queued has grown so much that the new test fails.
>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>> won't be able to do so. So we will never retransmit. And if no ACK
>> comes back in to make some room we are stuck, no?
> 
> We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.
> 
> I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That lead me to this thread.
> 
> Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations, many of our workloads have been unstable since this patch went into production, but those are the two we clearly identified as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.
> 
> The TCPWqueueTooBig count is incrementing when the problem occurs.
> 
> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
> run_timer_softirq
> expire_timers
> call_timer_fn
> tcp_write_timer
> tcp_write_timer_handler
> tcp_retransmit_timer
> tcp_retransmit_skb
> __tcp_retransmit_skb
> tcp_fragment
> 
> Andrew Prout
> MIT Lincoln Laboratory Supercomputing Center
> 

What was the kernel version you used exactly ?

This problem is supposed to be fixed in v4.14.131


^ permalink raw reply

* [PATCH net-next] net/mlx5e: Provide cb_list pointer when setting up tc block on rep
From: Vlad Buslov @ 2019-07-10 18:25 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, pablo, saeedm, Vlad Buslov

Recent refactoring of tc block offloads infrastructure introduced new
flow_block_cb_setup_simple() method intended to be used as unified way for
all drivers to register offload callbacks. However, commit that actually
extended all users (drivers) with block cb list and provided it to
flow_block infra missed mlx5 en_rep. This leads to following NULL-pointer
dereference when creating Qdisc:

[  278.385175] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  278.393233] #PF: supervisor read access in kernel mode
[  278.399446] #PF: error_code(0x0000) - not-present page
[  278.405847] PGD 8000000850e73067 P4D 8000000850e73067 PUD 8620cd067 PMD 0
[  278.414141] Oops: 0000 [#1] SMP PTI
[  278.419019] CPU: 7 PID: 3369 Comm: tc Not tainted 5.2.0-rc6+ #492
[  278.426580] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[  278.435853] RIP: 0010:flow_block_cb_setup_simple+0xc4/0x190
[  278.442953] Code: 10 48 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 00 48 05 00 01 00 00 49 89 40 08 31 c0 c3 b8 a1 ff ff ff c3 f3 c3 <48> 8b 06 48 39 c6 75 0a eb 1a 48 8b 00 48 39 c6 74 12
 48 3b 50 28
[  278.464829] RSP: 0018:ffffaf07c3f97990 EFLAGS: 00010246
[  278.471648] RAX: 0000000000000000 RBX: ffff9b43ed4c7680 RCX: ffff9b43d5f80840
[  278.480408] RDX: ffffffffc0491650 RSI: 0000000000000000 RDI: ffffaf07c3f97998
[  278.489110] RBP: ffff9b43ddff9000 R08: ffff9b43d5f80840 R09: 0000000000000001
[  278.497838] R10: 0000000000000009 R11: 00000000000003ad R12: ffffaf07c3f97c08
[  278.506595] R13: ffff9b43d5f80000 R14: ffff9b43ed4c7680 R15: ffff9b43dfa20b40
[  278.515374] FS:  00007f796be1b400(0000) GS:ffff9b43ef840000(0000) knlGS:0000000000000000
[  278.525099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  278.532453] CR2: 0000000000000000 CR3: 0000000840398002 CR4: 00000000001606e0
[  278.541197] Call Trace:
[  278.545252]  tcf_block_offload_cmd.isra.52+0x7e/0xb0
[  278.551871]  tcf_block_get_ext+0x365/0x3e0
[  278.557569]  qdisc_create+0x15c/0x4e0
[  278.562859]  ? kmem_cache_alloc_trace+0x1a2/0x1c0
[  278.569235]  tc_modify_qdisc+0x1c8/0x780
[  278.574761]  rtnetlink_rcv_msg+0x291/0x340
[  278.580518]  ? _cond_resched+0x15/0x40
[  278.585856]  ? rtnl_calcit.isra.29+0x120/0x120
[  278.591868]  netlink_rcv_skb+0x4a/0x110
[  278.597198]  netlink_unicast+0x1a0/0x250
[  278.602601]  netlink_sendmsg+0x2c1/0x3c0
[  278.608022]  sock_sendmsg+0x5b/0x60
[  278.612969]  ___sys_sendmsg+0x289/0x310
[  278.618231]  ? do_wp_page+0x99/0x730
[  278.623216]  ? page_add_new_anon_rmap+0xbe/0x140
[  278.629298]  ? __handle_mm_fault+0xc84/0x1360
[  278.635113]  ? __sys_sendmsg+0x5e/0xa0
[  278.640285]  __sys_sendmsg+0x5e/0xa0
[  278.645239]  do_syscall_64+0x5b/0x1b0
[  278.650274]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  278.656697] RIP: 0033:0x7f796abdeb87
[  278.661628] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00 00 00 53
 48 89 f3 48
[  278.683248] RSP: 002b:00007ffde213ba48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  278.692245] RAX: ffffffffffffffda RBX: 000000005d261e6f RCX: 00007f796abdeb87
[  278.700862] RDX: 0000000000000000 RSI: 00007ffde213bab0 RDI: 0000000000000003
[  278.709527] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
[  278.718167] R10: 000000000000000c R11: 0000000000000246 R12: 0000000000000001
[  278.726743] R13: 000000000067b580 R14: 0000000000000000 R15: 0000000000000000
[  278.735302] Modules linked in: dummy vxlan ip6_udp_tunnel udp_tunnel sch_ingress nfsv3 nfs_acl nfs lockd grace fscache bridge stp llc sunrpc mlx5_ib ib_uverbs intel_rapl ib_core sb_edac x86_pkg_temp_
thermal intel_powerclamp coretemp kvm_intel kvm mlx5_core irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel igb ghash_clmulni_intel ses mei_me enclosure mlxfw ipmi_ssif intel_cstate iTCO_wdt ptp mei
pps_core iTCO_vendor_support pcspkr joydev intel_uncore i2c_i801 ipmi_si lpc_ich intel_rapl_perf ioatdma wmi dca pcc_cpufreq ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ast i2c_algo_bit drm_k
ms_helper ttm drm mpt3sas raid_class scsi_transport_sas
[  278.802263] CR2: 0000000000000000
[  278.807170] ---[ end trace b1f0a442a279e66f ]---

Extend en_rep with new static mlx5e_rep_block_cb_list list and pass it to
flow_block_cb_setup_simple() function instead of hardcoded NULL pointer.

Fixes: 955bcb6ea0df ("drivers: net: use flow block API")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 10ef90a7bddd..7245d287633d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1175,6 +1175,8 @@ static int mlx5e_rep_setup_tc_cb(enum tc_setup_type type, void *type_data,
 	}
 }
 
+static LIST_HEAD(mlx5e_rep_block_cb_list);
+
 static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			      void *type_data)
 {
@@ -1182,7 +1184,8 @@ static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
 
 	switch (type) {
 	case TC_SETUP_BLOCK:
-		return flow_block_cb_setup_simple(type_data, NULL,
+		return flow_block_cb_setup_simple(type_data,
+						  &mlx5e_rep_block_cb_list,
 						  mlx5e_rep_setup_tc_cb,
 						  priv, priv, true);
 	default:
-- 
2.21.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox