Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 2/9] rcu: Add support for consolidated-RCU reader checking (v3)
From: Paul E. McKenney @ 2019-07-17  0:07 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Alexey Kuznetsov, Bjorn Helgaas, Borislav Petkov,
	c0d1n61at3, David S. Miller, edumazet, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Jonathan Corbet,
	Josh Triplett, keescook, kernel-hardening, kernel-team,
	Lai Jiangshan, Len Brown, linux-acpi, linux-doc, linux-pci,
	linux-pm, Mathieu Desnoyers, neilb, netdev, Oleg Nesterov,
	Pavel Machek, peterz, Rafael J. Wysocki, Rasmus Villemoes, rcu,
	Steven Rostedt, Tejun Heo, Thomas Gleixner, will,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190716220205.GB172157@google.com>

On Tue, Jul 16, 2019 at 06:02:05PM -0400, Joel Fernandes wrote:
> On Tue, Jul 16, 2019 at 11:53:03AM -0700, Paul E. McKenney wrote:
> [snip]
> > > > A few more things below.
> > > > > ---
> > > > >  include/linux/rculist.h  | 28 ++++++++++++++++++++-----
> > > > >  include/linux/rcupdate.h |  7 +++++++
> > > > >  kernel/rcu/Kconfig.debug | 11 ++++++++++
> > > > >  kernel/rcu/update.c      | 44 ++++++++++++++++++++++++----------------
> > > > >  4 files changed, 67 insertions(+), 23 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> > > > > index e91ec9ddcd30..1048160625bb 100644
> > > > > --- a/include/linux/rculist.h
> > > > > +++ b/include/linux/rculist.h
> > > > > @@ -40,6 +40,20 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list)
> > > > >   */
> > > > >  #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
> > > > >  
> > > > > +/*
> > > > > + * Check during list traversal that we are within an RCU reader
> > > > > + */
> > > > > +
> > > > > +#ifdef CONFIG_PROVE_RCU_LIST
> > > > 
> > > > This new Kconfig option is OK temporarily, but unless there is reason to
> > > > fear malfunction that a few weeks of rcutorture, 0day, and -next won't
> > > > find, it would be better to just use CONFIG_PROVE_RCU.  The overall goal
> > > > is to reduce the number of RCU knobs rather than grow them, must though
> > > > history might lead one to believe otherwise.  :-/
> > > 
> > > If you want, we can try to drop this option and just use PROVE_RCU however I
> > > must say there may be several warnings that need to be fixed in a short
> > > period of time (even a few weeks may be too short) considering the 1000+
> > > uses of RCU lists.
> > Do many people other than me build with CONFIG_PROVE_RCU?  If so, then
> > that would be a good reason for a temporary CONFIG_PROVE_RCU_LIST,
> > as in going away in a release or two once the warnings get fixed.
> 
> PROVE_RCU is enabled by default with PROVE_LOCKING, so it is used quite
> heavilty.
> 
> > > But I don't mind dropping it and it may just accelerate the fixing up of all
> > > callers.
> > 
> > I will let you decide based on the above question.  But if you have
> > CONFIG_PROVE_RCU_LIST, as noted below, it needs to depend on RCU_EXPERT.
> 
> Ok, will make it depend. But yes for temporary purpose, I will leave it as a
> config and remove it later.

Very good, thank you!  Plus you got another ack.  ;-)

							Thanx, Paul

^ permalink raw reply

* Re: OOM triggered by SCTP
From: malc @ 2019-07-16 23:59 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: vyasevich, nhorman, marcelo.leitner, Linux SCTP, netdev,
	kernel-team
In-Reply-To: <CAJPywTL5aKYB40FsAFYFEuhErhgQpYZP5Q_ipMG9pDxqipcEDg@mail.gmail.com>

On Tue, Jul 16, 2019 at 10:49 PM Marek Majkowski <marek@cloudflare.com> wrote:
>
> Morning,
>
> My poor man's fuzzer found something interesting in SCTP. It seems
> like creating large number of SCTP sockets + some magic dance, upsets
> a memory subsystem related to SCTP. The sequence:
>
>  - create SCTP socket
>  - call setsockopts (SCTP_EVENTS)
>  - call bind(::1, port)
>  - call sendmsg(long buffer, MSG_CONFIRM, ::1, port)
>  - close SCTP socket
>  - repeat couple thousand times
>
> Full code:
> https://gist.github.com/majek/bd083dae769804d39134ce01f4f802bb#file-test_sctp-c
>
> I'm running it on virtme the simplest way:
> $ virtme-run --show-boot-console --rw --pwd --kimg bzImage --memory
> 512M --script-sh ./test_sctp
>
> Originally I was running it inside net namespace, and just having a
> localhost interface is sufficient to trigger the problem.
>
> Kernel is 5.2.1 (with KASAN and such, but that shouldn't be a factor).
> In some tests I saw a message that might indicate something funny
> hitting neighbor table:
>
> neighbour: ndisc_cache: neighbor table overflow!
>
> I'm not addr-decoding the stack trace, since it seems unrelated to the
> root cause.
>
> Cheers,
>     Marek

I _think_ this is an 'expected' peculiarity of SCTP on loopback - you
test_sctp.c ends up creating actual associations to itself on the same
socket (you can test safely by reducing the port range (say
30000-32000) and setting the for-loop-clause to 'run < 1')
You'll see a bunch of associations established like the following
(note that I(kernel) was dropping packets for this capture - even with
/only/ 2000 sockets used...)

$ tshark -r sctp.pcap -Y 'sctp.assoc_index==4'
  21 0.000409127          ::1 → ::1           SCTP INIT
  22 0.000436281          ::1 → ::1           SCTP INIT_ACK
  23 0.000442106          ::1 → ::1           SCTP COOKIE_ECHO
  24 0.000463007          ::1 → ::1           SCTP COOKIE_ACK DATA
(Message Fragment)
                                              presumably your close()
happens here and we enter SHUTDOWN-PENDING, where we wait for pending
data to be acknowledged, I'm not convinced that we shouldn't be
SACK'ing the data from the 'peer' at this point - but for whatever
reason, we aren't.
                                              We then run thru
path-max-retrans, and finally ABORT (the abort indication also shows
the PMR-exceeded indication in the 'Cause Information')
  25 0.000476083          ::1 → ::1           SCTP SACK
13619 3.017788109          ::1 → ::1           SCTP DATA (retransmission)
14022 3.222690889          ::1 → ::1           SCTP SACK
18922 21.938217449          ::1 → ::1           SCTP SACK
33476 69.831029904          ::1 → ::1           SCTP HEARTBEAT
33561 69.831310796          ::1 → ::1           SCTP HEARTBEAT_ACK
40816 94.102667600          ::1 → ::1           SCTP SACK
40910 95.942741287          ::1 → ::1           SCTP DATA (retransmission)
41039 96.152023010          ::1 → ::1           SCTP SACK
41100 100.182685237          ::1 → ::1           SCTP SACK
41212 108.230746764          ::1 → ::1           SCTP DATA (retransmission)
41345 108.439061392          ::1 → ::1           SCTP SACK
41407 116.422688507          ::1 → ::1           SCTP HEARTBEAT
41413 116.423183124          ::1 → ::1           SCTP HEARTBEAT_ACK
41494 124.823749255          ::1 → ::1           SCTP SACK
41576 126.663648718          ::1 → ::1           SCTP ABORT

With your entire 512M - you'd only have about 16KB for each of these
31K associations tops, I suspect that having a 64KB pending data chunk
(fragmented ULP msg) for each association for >= 90s is what is
exhausting memory here - although I'm sure Neil or Michael will be
along to correct me ;-)

What's interesting - as you reduce the payload size - we end up
bundling DATA from the 'initiator' side (in COOKIE ECHO) - and
everything works as expected... (the SACK here is for the bundled DATA
chunks TSN.

mlashley@duality /tmp $ tshark -r /tmp/sctp_index4_10K.pcap
   1 0.000000000          ::1 → ::1          SCTP INIT
   2 0.000014491          ::1 → ::1          SCTP INIT_ACK
   3 0.000024190          ::1 → ::1          SCTP COOKIE_ECHO DATA
   4 0.000034833          ::1 → ::1          SCTP COOKIE_ACK
   5 0.000040646          ::1 → ::1          SCTP SACK
   6 0.000050287          ::1 → ::1          SCTP ABORT

In short - the SCTP associations /can/ persist after user-space calls
close() whilst there is outstanding data (for path.max.retrans *
rto-with-doubling[due to T3-rtx expiry])

(My tests on 5.2.0 as it is what I had to hand...)

Cheers,
malc.

^ permalink raw reply

* general protection fault in tls_setsockopt
From: syzbot @ 2019-07-16 23:58 UTC (permalink / raw)
  To: ast, aviadye, bhole_prashant_q7, borisp, daniel, davejwatson,
	davem, john.fastabend, linux-kernel, linux-kselftest, netdev,
	shuah, songliubraving, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    a131c2bf Merge tag 'acpi-5.3-rc1-2' of git://git.kernel.or..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1603e9c0600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8bff73c5ba9e876
dashboard link: https://syzkaller.appspot.com/bug?extid=23d9570edec63669d890
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13560870600000

The bug was bisected to:

commit 7c85c448e7d74c4ddd759440a2141eab663567cf
Author: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Date:   Tue Oct 9 01:04:54 2018 +0000

     selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf  
prog

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=17000114600000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=14800114600000
console output: https://syzkaller.appspot.com/x/log.txt?x=10800114600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+23d9570edec63669d890@syzkaller.appspotmail.com
Fixes: 7c85c448e7d7 ("selftests/bpf: test_verifier, check  
bpf_map_lookup_elem access in bpf prog")

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9098 Comm: syz-executor.2 Not tainted 5.2.0+ #65
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:tls_setsockopt+0x87/0x8c0 net/tls/tls_main.c:592
Code: 80 3c 02 00 0f 85 a5 07 00 00 49 8b 84 24 b0 06 00 00 48 ba 00 00 00  
00 00 fc ff df 48 8d b8 88 00 00 00 48 89 f9 48 c1 e9 03 <80> 3c 11 00 0f  
85 85 07 00 00 41 89 d8 4c 89 f1 44 89 ea 44 89 fe
RSP: 0018:ffff88809b2e7d30 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000011
RDX: dffffc0000000000 RSI: ffffffff86186a84 RDI: 0000000000000088
RBP: ffff88809b2e7d80 R08: ffff888098896480 R09: ffff888098896d08
R10: ffffed1015d06c83 R11: ffff8880ae83641b R12: ffff888099d86d00
R13: 000000000000001f R14: 0000000020000340 R15: 0000000000000006
FS:  00007f29b1f6b700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffb5492d518 CR3: 00000000a49ea000 CR4: 00000000001406f0
Call Trace:
  sock_common_setsockopt+0x94/0xd0 net/core/sock.c:3130
  __sys_setsockopt+0x253/0x4b0 net/socket.c:2080
  __do_sys_setsockopt net/socket.c:2096 [inline]
  __se_sys_setsockopt net/socket.c:2093 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:2093
  do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459819
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f29b1f6ac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000459819
RDX: 000000000000001f RSI: 0000000000000006 RDI: 0000000000000005
RBP: 000000000075bf20 R08: 0000000000000004 R09: 0000000000000000
R10: 0000000020000340 R11: 0000000000000246 R12: 00007f29b1f6b6d4
R13: 00000000004c7cb9 R14: 00000000004dd878 R15: 00000000ffffffff
Modules linked in:
---[ end trace 1e30f09ab6b57d8c ]---
RIP: 0010:tls_setsockopt+0x87/0x8c0 net/tls/tls_main.c:592
Code: 80 3c 02 00 0f 85 a5 07 00 00 49 8b 84 24 b0 06 00 00 48 ba 00 00 00  
00 00 fc ff df 48 8d b8 88 00 00 00 48 89 f9 48 c1 e9 03 <80> 3c 11 00 0f  
85 85 07 00 00 41 89 d8 4c 89 f1 44 89 ea 44 89 fe
RSP: 0018:ffff88809b2e7d30 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000011
RDX: dffffc0000000000 RSI: ffffffff86186a84 RDI: 0000000000000088
RBP: ffff88809b2e7d80 R08: ffff888098896480 R09: ffff888098896d08
R10: ffffed1015d06c83 R11: ffff8880ae83641b R12: ffff888099d86d00
R13: 000000000000001f R14: 0000000020000340 R15: 0000000000000006
FS:  00007f29b1f6b700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000007126b4 CR3: 00000000a49ea000 CR4: 00000000001406f0


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Joel Fernandes @ 2019-07-16 23:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, Adrian Ratiu, Alexei Starovoitov, bpf,
	Brendan Gregg, connoro, Daniel Borkmann, duyuchao, Ingo Molnar,
	jeffv, Karim Yaghmour, kernel-team, linux-kselftest,
	Manali Shukla, Manjo Raja Rao, Martin KaFai Lau, Masami Hiramatsu,
	Matt Mullins, Michal Gregorczyk, Michal Gregorczyk,
	Mohammad Husain, namhyung, namhyung, netdev, paul.chaignon,
	primiano, Qais Yousef, Shuah Khan, Song Liu, Srinivas Ramana,
	Steven Rostedt, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190716224150.GC172157@google.com>

On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote:
> On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > > 
> > > I also thought about the pinning idea before, but we also want to add support
> > > for not just raw tracepoints, but also regular tracepoints (events if you
> > > will). I am hesitant to add a new BPF API just for creating regular
> > > tracepoints and then pinning those as well.
> > 
> > and they should be done through the pinning as well.
> 
> Hmm ok, I will give it some more thought.

I think I can make the new BPF API + pinning approach work, I will try to
work on something like this and post it soon.

Also, I had a question below if you don't mind taking a look:

thanks Alexei!

> > > I don't see why a new bpf node for a trace event is a bad idea, really.
> > 
> > See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> > tldr: text is racy, doesn't scale, poor security, etc.
> 
> Is it possible to use perf without CAP_SYS_ADMIN and control security at the
> per-event level? We are selective about who can access which event, using
> selinux. That's how our ftrace-based tracers work. Its fine grained per-event
> control. That's where I was going with the tracefs approach since we get that
> granularity using the file system.
> 
> Thanks.
> 

^ permalink raw reply

* Re: [PATCH bpf 1/2] selftests/bpf: fix test_verifier/test_maps make dependencies
From: Andrii Nakryiko @ 2019-07-16 23:37 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Ilya Leoshkevich,
	Stanislav Fomichev, Martin KaFai Lau
In-Reply-To: <20190716225735.GC14834@mini-arch>

On Tue, Jul 16, 2019 at 3:57 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 07/16, Andrii Nakryiko wrote:
> > On Tue, Jul 16, 2019 at 12:55 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> > >
> > > On 07/16, Andrii Nakryiko wrote:
> > > > e46fc22e60a4 ("selftests/bpf: make directory prerequisites order-only")
> > > > exposed existing problem in Makefile for test_verifier and test_maps tests:
> > > > their dependency on auto-generated header file with a list of all tests wasn't
> > > > recorded explicitly. This patch fixes these issues.
> > > Why adding it explicitly fixes it? At least for test_verifier, we have
> > > the following rule:
> > >
> > >         test_verifier.c: $(VERIFIER_TESTS_H)
> > >
> > > And there should be implicit/builtin test_verifier -> test_verifier.c
> > > dependency rule.
> > >
> > > Same for maps, I guess:
> > >
> > >         $(OUTPUT)/test_maps: map_tests/*.c
> > >         test_maps.c: $(MAP_TESTS_H)
> > >
> > > So why is it not working as is? What I'm I missing?
> >
> > I don't know exactly why it's not working, but it's clearly because of
> > that. It's the only difference between how test_progs are set up,
> > which didn't break, and test_maps/test_verifier, which did.
> >
> > Feel free to figure it out through a maze of Makefiles why it didn't
> > work as expected, but this definitely fixed a breakage (at least for
> > me).
> Agreed on not wasting time. I took a brief look (with make -qp) and I
> don't have any clue.

Oh, -qp is cool, noted :)

>
> By default implicit matching doesn't work:
> # makefile (from 'Makefile', line 261)
> /linux/tools/testing/selftests/bpf/test_maps: CFLAGS += $(TEST_MAPS_CFLAGS)
> /linux/tools/testing/selftests/bpf/test_maps: map_tests/sk_storage_map.c /linux/tools/testing/selftests/bpf/test_stub.o /linux/tools/testing/selftests/bpf/libbpf.a
> #  Implicit rule search has not been done.
> #  File is an intermediate prerequisite.
> #  Modification time never checked.
> #  File has not been updated.
> # variable set hash-table stats:
> # Load=1/32=3%, Rehash=0, Collisions=0/2=0%
>
> If I comment out the following line:
> $(TEST_GEN_PROGS): $(OUTPUT)/test_stub.o $(BPFOBJ)
>
> Then it works:
> # makefile (from 'Makefile', line 261)
> /linux/tools/testing/selftests/bpf/test_maps: CFLAGS += $(TEST_MAPS_CFLAGS)
> /linux/tools/testing/selftests/bpf/test_maps: test_maps.c map_tests/sk_storage_map.c
> #  Implicit rule search has been done.
> #  Implicit/static pattern stem: 'test_maps'
> #  File is an intermediate prerequisite.
> #  File does not exist.
> #  File has not been updated.
> # variable set hash-table stats:
> # Load=1/32=3%, Rehash=0, Collisions=0/2=0%
> #  recipe to execute (from '../lib.mk', line 138):
>         $(LINK.c) $^ $(LDLIBS) -o $@
>
> It's because "File is an intermediate prerequisite.", but I
> don't see how it's is a intermediate prerequisite for anything...

Well, it's "File is an intermediate prerequisite." in both cases, so I
don't know if that's it.
Makefiles is not my strong suit, honestly, and definitely not an area
of passion, so no idea

>
>
> One other optional suggestion I have to your second patch: maybe drop all
> those dependencies on the directories altogether? Why not do the following
> instead, for example (same for test_progs/test_maps)?

Some of those _TEST_DIR's are dependencies of multiple targets, so
you'd need to replicate that `mkdir -p $@` in multiple places, which
is annoying.

But I also don't think we need to worry about creating them, because
there is always at least one test (otherwise those tests are useless
anyways) in those directories, so we might as well just remove those
dependencies, I guess.

TBH, those explicit dependencies are good to specify anyways, so I
think this fix is good. Second patch just makes the layout of
dependencies similar, so it's easier to spot differences like this
one.

As usual, I'll let Alexei and Daniel decide which one to apply (or if
we need to take some other approach to fixing this).


>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 1296253b3422..c2d087ce6d4b 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -277,12 +277,9 @@ VERIFIER_TESTS_H := $(OUTPUT)/verifier/tests.h
>  test_verifier.c: $(VERIFIER_TESTS_H)
>  $(OUTPUT)/test_verifier: CFLAGS += $(TEST_VERIFIER_CFLAGS)
>
> -VERIFIER_TESTS_DIR = $(OUTPUT)/verifier
> -$(VERIFIER_TESTS_DIR):
> -       mkdir -p $@
> -
>  VERIFIER_TEST_FILES := $(wildcard verifier/*.c)
> -$(OUTPUT)/verifier/tests.h: $(VERIFIER_TEST_FILES) | $(VERIFIER_TESTS_DIR)
> +$(OUTPUT)/verifier/tests.h: $(VERIFIER_TEST_FILES)
> +       mkdir -p $(dir $@)
>         $(shell ( cd verifier/; \
>                   echo '/* Generated header, do not edit */'; \
>                   echo '#ifdef FILL_ARRAY'; \

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-16 23:30 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Serge E. Hallyn, Tycho Andersen, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190716220320.sotbfqplgdructg7@madcap2.tricolour.ca>

On Tue, Jul 16, 2019 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-07-15 17:04, Paul Moore wrote:
> > On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:

...

> > > If we can't trust ns_capable() then why are we passing on
> > > CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> > > by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> > > gained otherwise?  Can it be inserted by cotainer image?  I think the
> > > answer is "no".  Either we trust ns_capable() or we have audit
> > > namespaces (recommend based on user namespace) (or both).
> >
> > My thinking is that since ns_capable() checks the credentials with
> > respect to the current user namespace we can't rely on it to control
> > access since it would be possible for a privileged process running
> > inside an unprivileged container to manipulate the audit container ID
> > (containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
> > the container, while the container itself does not).
>
> What makes an unprivileged container unprivileged?  "root", or "CAP_*"?

My understanding is that when most people refer to an unprivileged
container they are referring to a container run without capabilities
or a container run by a user other than root.  I'm sure there are
better definitions out there, by folks much smarter than me on these
things, but that's my working definition.

> If CAP_AUDIT_CONTROL is granted, does "root" matter?

Our discussions here have been about capabilities, not UIDs.  The only
reason root might matter is that it generally has the full capability
set.

> Does it matter what user namespace it is in?

What likely matters is what check is called: capable() or
ns_capable().  Those can yield very different results.

> I understand that root is *gained* in an
> unprivileged user namespace, but capabilities are inherited or permitted
> and that process either has it or it doesn't and an unprivileged user
> namespace can't gain a capability that has been rescinded.  Different
> subsystems use the userid or capabilities or both to determine
> privileges.

Once again, I believe the important thing to focus on here is
capable() vs ns_capable().  We can't safely rely on ns_capable() for
the audit container ID policy since that is easily met inside the
container regardless of the process' creds which started the
container.

> In this case, is the userid relevant?

We don't do UID checks, we do capability checks, so yes, the UID is irrelevant.

> > > At this point I would say we are at an impasse unless we trust
> > > ns_capable() or we implement audit namespaces.
> >
> > I'm not sure how we can trust ns_capable(), but if you can think of a
> > way I would love to hear it.  I'm also not sure how namespacing audit
> > is helpful (see my above comments), but if you think it is please
> > explain.
>
> So if we are not namespacing, why do we not trust capabilities?

We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: OOM triggered by SCTP
From: Neil Horman @ 2019-07-16 23:20 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: vyasevich, marcelo.leitner, linux-sctp, netdev, kernel-team
In-Reply-To: <CAJPywTL5aKYB40FsAFYFEuhErhgQpYZP5Q_ipMG9pDxqipcEDg@mail.gmail.com>

On Tue, Jul 16, 2019 at 11:47:40PM +0200, Marek Majkowski wrote:
> Morning,
> 
> My poor man's fuzzer found something interesting in SCTP. It seems
> like creating large number of SCTP sockets + some magic dance, upsets
> a memory subsystem related to SCTP. The sequence:
> 
>  - create SCTP socket
>  - call setsockopts (SCTP_EVENTS)
>  - call bind(::1, port)
>  - call sendmsg(long buffer, MSG_CONFIRM, ::1, port)
>  - close SCTP socket
>  - repeat couple thousand times
> 
> Full code:
> https://gist.github.com/majek/bd083dae769804d39134ce01f4f802bb#file-test_sctp-c
> 
> I'm running it on virtme the simplest way:
> $ virtme-run --show-boot-console --rw --pwd --kimg bzImage --memory
> 512M --script-sh ./test_sctp
> 
> Originally I was running it inside net namespace, and just having a
> localhost interface is sufficient to trigger the problem.
> 
> Kernel is 5.2.1 (with KASAN and such, but that shouldn't be a factor).
> In some tests I saw a message that might indicate something funny
> hitting neighbor table:
> 
> neighbour: ndisc_cache: neighbor table overflow!
> 
> I'm not addr-decoding the stack trace, since it seems unrelated to the
> root cause.
> 
Why would you have to decode anything, the decoded stack trace should be
available in your demsg log.  Cant you just attach that here?

Neil

> Cheers,
>     Marek
> 

^ permalink raw reply

* Re: [PATCH bpf 1/2] selftests/bpf: fix test_verifier/test_maps make dependencies
From: Stanislav Fomichev @ 2019-07-16 22:57 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Ilya Leoshkevich,
	Stanislav Fomichev, Martin KaFai Lau
In-Reply-To: <CAEf4BzZ4XAdjasYq+JGFHnhwEV3G5UYWBuqKMK1yu1KRLn19MQ@mail.gmail.com>

On 07/16, Andrii Nakryiko wrote:
> On Tue, Jul 16, 2019 at 12:55 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> >
> > On 07/16, Andrii Nakryiko wrote:
> > > e46fc22e60a4 ("selftests/bpf: make directory prerequisites order-only")
> > > exposed existing problem in Makefile for test_verifier and test_maps tests:
> > > their dependency on auto-generated header file with a list of all tests wasn't
> > > recorded explicitly. This patch fixes these issues.
> > Why adding it explicitly fixes it? At least for test_verifier, we have
> > the following rule:
> >
> >         test_verifier.c: $(VERIFIER_TESTS_H)
> >
> > And there should be implicit/builtin test_verifier -> test_verifier.c
> > dependency rule.
> >
> > Same for maps, I guess:
> >
> >         $(OUTPUT)/test_maps: map_tests/*.c
> >         test_maps.c: $(MAP_TESTS_H)
> >
> > So why is it not working as is? What I'm I missing?
> 
> I don't know exactly why it's not working, but it's clearly because of
> that. It's the only difference between how test_progs are set up,
> which didn't break, and test_maps/test_verifier, which did.
> 
> Feel free to figure it out through a maze of Makefiles why it didn't
> work as expected, but this definitely fixed a breakage (at least for
> me).
Agreed on not wasting time. I took a brief look (with make -qp) and I
don't have any clue.

By default implicit matching doesn't work:
# makefile (from 'Makefile', line 261)
/linux/tools/testing/selftests/bpf/test_maps: CFLAGS += $(TEST_MAPS_CFLAGS)
/linux/tools/testing/selftests/bpf/test_maps: map_tests/sk_storage_map.c /linux/tools/testing/selftests/bpf/test_stub.o /linux/tools/testing/selftests/bpf/libbpf.a
#  Implicit rule search has not been done.
#  File is an intermediate prerequisite.
#  Modification time never checked.
#  File has not been updated.
# variable set hash-table stats:
# Load=1/32=3%, Rehash=0, Collisions=0/2=0%

If I comment out the following line:
$(TEST_GEN_PROGS): $(OUTPUT)/test_stub.o $(BPFOBJ)

Then it works:
# makefile (from 'Makefile', line 261)
/linux/tools/testing/selftests/bpf/test_maps: CFLAGS += $(TEST_MAPS_CFLAGS)
/linux/tools/testing/selftests/bpf/test_maps: test_maps.c map_tests/sk_storage_map.c
#  Implicit rule search has been done.
#  Implicit/static pattern stem: 'test_maps'
#  File is an intermediate prerequisite.
#  File does not exist.
#  File has not been updated.
# variable set hash-table stats:
# Load=1/32=3%, Rehash=0, Collisions=0/2=0%
#  recipe to execute (from '../lib.mk', line 138):
        $(LINK.c) $^ $(LDLIBS) -o $@

It's because "File is an intermediate prerequisite.", but I
don't see how it's is a intermediate prerequisite for anything...


One other optional suggestion I have to your second patch: maybe drop all
those dependencies on the directories altogether? Why not do the following
instead, for example (same for test_progs/test_maps)?

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 1296253b3422..c2d087ce6d4b 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -277,12 +277,9 @@ VERIFIER_TESTS_H := $(OUTPUT)/verifier/tests.h
 test_verifier.c: $(VERIFIER_TESTS_H)
 $(OUTPUT)/test_verifier: CFLAGS += $(TEST_VERIFIER_CFLAGS)
 
-VERIFIER_TESTS_DIR = $(OUTPUT)/verifier
-$(VERIFIER_TESTS_DIR):
-       mkdir -p $@
-
 VERIFIER_TEST_FILES := $(wildcard verifier/*.c)
-$(OUTPUT)/verifier/tests.h: $(VERIFIER_TEST_FILES) | $(VERIFIER_TESTS_DIR)
+$(OUTPUT)/verifier/tests.h: $(VERIFIER_TEST_FILES)
+       mkdir -p $(dir $@)
        $(shell ( cd verifier/; \
                  echo '/* Generated header, do not edit */'; \
                  echo '#ifdef FILL_ARRAY'; \

^ permalink raw reply related

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Joel Fernandes @ 2019-07-16 22:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexei Starovoitov, linux-kernel, Adrian Ratiu,
	Alexei Starovoitov, bpf, Brendan Gregg, connoro, Daniel Borkmann,
	duyuchao, Ingo Molnar, jeffv, Karim Yaghmour, kernel-team,
	linux-kselftest, Manali Shukla, Manjo Raja Rao, Martin KaFai Lau,
	Masami Hiramatsu, Matt Mullins, Michal Gregorczyk,
	Michal Gregorczyk, Mohammad Husain, namhyung, namhyung, netdev,
	paul.chaignon, primiano, Qais Yousef, Shuah Khan, Song Liu,
	Srinivas Ramana, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190716183117.77b3ed49@gandalf.local.home>

On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote:
> On Tue, 16 Jul 2019 17:30:50 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > I don't see why a new bpf node for a trace event is a bad idea, really.
> > tracefs is how we deal with trace events on Android. We do it in production
> > systems. This is a natural extension to that and fits with the security model
> > well.
> 
> What I would like to see is a way to have BPF inject data into the
> ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
> bit of a hack (especially since it hooks into trace_printk() which is
> only for debugging purposes). Have a dedicated bpf ftrace ring
> buffer event that can be triggered is what I am looking for. Then comes
> the issue of what ring buffer to place it in, as ftrace can have
> multiple ring buffer instances. But these instances are defined by the
> tracefs instances directory. Having a way to associate a bpf program to
> a specific event in a specific tracefs directory could allow for ways to
> trigger writing into the correct ftrace buffer.

But his problem is with doing the association of a BPF program with tracefs
itself. How would you attach a BPF program with tracefs without doing a text
based approach? His problem is with the text based approach per his last
email.

> But looking over the patches, I see what Alexei means that there's no
> overlap with ftrace and these patches except for the tracefs directory
> itself (which is part of the ftrace infrastructure). And the trace
> events are technically part of the ftrace infrastructure too. I see the
> tracefs interface being used, but I don't see how the bpf programs
> being added affect the ftrace ring buffer or other parts of ftrace. And
> I'm guessing that's what is confusing Alexei.

In a follow-up patch which I am still writing, I am using the trace ring
buffer as temporary storage since I am formatting the trace event into it.
This patch you are replying to is just for raw tracepoint and yes, I agree
this one does not use the ring buffer, but a future addition to it does. So
I don't think the association of this patch series with ftrace is going to be
an issue IMO.

thanks,

 - Joel



^ permalink raw reply

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Steven Rostedt @ 2019-07-16 22:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joel Fernandes, linux-kernel, Adrian Ratiu, Alexei Starovoitov,
	bpf, Brendan Gregg, connoro, Daniel Borkmann, duyuchao,
	Ingo Molnar, jeffv, Karim Yaghmour, kernel-team, linux-kselftest,
	Manali Shukla, Manjo Raja Rao, Martin KaFai Lau, Masami Hiramatsu,
	Matt Mullins, Michal Gregorczyk, Michal Gregorczyk,
	Mohammad Husain, namhyung, namhyung, netdev, paul.chaignon,
	primiano, Qais Yousef, Shuah Khan, Song Liu, Srinivas Ramana,
	Tamir Carmeli, Yonghong Song
In-Reply-To: <20190716222650.tk2coihjtsxszarf@ast-mbp.dhcp.thefacebook.com>

On Tue, 16 Jul 2019 15:26:52 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> I'm absolutely against text based apis.

I guess you don't use /proc ;-)

-- Steve

^ permalink raw reply

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Joel Fernandes @ 2019-07-16 22:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, Adrian Ratiu, Alexei Starovoitov, bpf,
	Brendan Gregg, connoro, Daniel Borkmann, duyuchao, Ingo Molnar,
	jeffv, Karim Yaghmour, kernel-team, linux-kselftest,
	Manali Shukla, Manjo Raja Rao, Martin KaFai Lau, Masami Hiramatsu,
	Matt Mullins, Michal Gregorczyk, Michal Gregorczyk,
	Mohammad Husain, namhyung, namhyung, netdev, paul.chaignon,
	primiano, Qais Yousef, Shuah Khan, Song Liu, Srinivas Ramana,
	Steven Rostedt, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190716222650.tk2coihjtsxszarf@ast-mbp.dhcp.thefacebook.com>

On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > 
> > I also thought about the pinning idea before, but we also want to add support
> > for not just raw tracepoints, but also regular tracepoints (events if you
> > will). I am hesitant to add a new BPF API just for creating regular
> > tracepoints and then pinning those as well.
> 
> and they should be done through the pinning as well.

Hmm ok, I will give it some more thought.

> > I don't see why a new bpf node for a trace event is a bad idea, really.
> 
> See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> tldr: text is racy, doesn't scale, poor security, etc.

Is it possible to use perf without CAP_SYS_ADMIN and control security at the
per-event level? We are selective about who can access which event, using
selinux. That's how our ftrace-based tracers work. Its fine grained per-event
control. That's where I was going with the tracefs approach since we get that
granularity using the file system.

Thanks.


^ permalink raw reply

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Steven Rostedt @ 2019-07-16 22:31 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Alexei Starovoitov, linux-kernel, Adrian Ratiu,
	Alexei Starovoitov, bpf, Brendan Gregg, connoro, Daniel Borkmann,
	duyuchao, Ingo Molnar, jeffv, Karim Yaghmour, kernel-team,
	linux-kselftest, Manali Shukla, Manjo Raja Rao, Martin KaFai Lau,
	Masami Hiramatsu, Matt Mullins, Michal Gregorczyk,
	Michal Gregorczyk, Mohammad Husain, namhyung, namhyung, netdev,
	paul.chaignon, primiano, Qais Yousef, Shuah Khan, Song Liu,
	Srinivas Ramana, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190716213050.GA161922@google.com>

On Tue, 16 Jul 2019 17:30:50 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:

> I don't see why a new bpf node for a trace event is a bad idea, really.
> tracefs is how we deal with trace events on Android. We do it in production
> systems. This is a natural extension to that and fits with the security model
> well.

What I would like to see is a way to have BPF inject data into the
ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
bit of a hack (especially since it hooks into trace_printk() which is
only for debugging purposes). Have a dedicated bpf ftrace ring
buffer event that can be triggered is what I am looking for. Then comes
the issue of what ring buffer to place it in, as ftrace can have
multiple ring buffer instances. But these instances are defined by the
tracefs instances directory. Having a way to associate a bpf program to
a specific event in a specific tracefs directory could allow for ways to
trigger writing into the correct ftrace buffer.

But looking over the patches, I see what Alexei means that there's no
overlap with ftrace and these patches except for the tracefs directory
itself (which is part of the ftrace infrastructure). And the trace
events are technically part of the ftrace infrastructure too. I see the
tracefs interface being used, but I don't see how the bpf programs
being added affect the ftrace ring buffer or other parts of ftrace. And
I'm guessing that's what is confusing Alexei.

-- Steve

^ permalink raw reply

* [PATCH net v3 7/7] net/rds: Initialize ic->i_fastreg_wrs upon allocation
From: Gerd Rausch @ 2019-07-16 22:29 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

Otherwise, if an IB connection is torn down before "rds_ib_setup_qp"
is called, the value of "ic->i_fastreg_wrs" is still at zero
(as it wasn't initialized by "rds_ib_setup_qp").
Consequently "rds_ib_conn_path_shutdown" will spin forever,
waiting for it to go back to "RDS_IB_DEFAULT_FR_WR",
which of course will never happen as there are no
outstanding work requests.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib_cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 1b6fd6c8b12b..4de0214da63c 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -527,7 +527,6 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	attr.qp_type = IB_QPT_RC;
 	attr.send_cq = ic->i_send_cq;
 	attr.recv_cq = ic->i_recv_cq;
-	atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
 	/*
 	 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -1139,6 +1138,7 @@ int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 	spin_lock_init(&ic->i_ack_lock);
 #endif
 	atomic_set(&ic->i_signaled_sends, 0);
+	atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
 	/*
 	 * rds_ib_conn_shutdown() waits for these to be emptied so they
-- 
2.22.0


^ permalink raw reply related

* [PATCH net v3 6/7] net/rds: Keep track of and wait for FRWR segments in use upon shutdown
From: Gerd Rausch @ 2019-07-16 22:29 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

Since "rds_ib_free_frmr" and "rds_ib_free_frmr_list" simply put
the FRMR memory segments on the "drop_list" or "free_list",
and it is the job of "rds_ib_flush_mr_pool" to reap those entries
by ultimately issuing a "IB_WR_LOCAL_INV" work-request,
we need to trigger and then wait for all those memory segments
attached to a particular connection to be fully released before
we can move on to release the QP, CQ, etc.

So we make "rds_ib_conn_path_shutdown" wait for one more
atomic_t called "i_fastreg_inuse_count" that keeps track of how
many FRWR memory segments are out there marked "FRMR_IS_INUSE"
(and also wake_up rds_ib_ring_empty_wait, as they go away).

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib.h      |  1 +
 net/rds/ib_cm.c   |  7 +++++++
 net/rds/ib_frmr.c | 43 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 66c03c7665b2..303c6ee8bdb7 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -156,6 +156,7 @@ struct rds_ib_connection {
 
 	/* To control the number of wrs from fastreg */
 	atomic_t		i_fastreg_wrs;
+	atomic_t		i_fastreg_inuse_count;
 
 	/* interrupt handling */
 	struct tasklet_struct	i_send_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8891822eba4f..1b6fd6c8b12b 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -40,6 +40,7 @@
 #include "rds_single_path.h"
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
 /*
  * Set the selected protocol version
@@ -993,6 +994,11 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
 				ic->i_cm_id, err);
 		}
 
+		/* kick off "flush_worker" for all pools in order to reap
+		 * all FRMR registrations that are still marked "FRMR_IS_INUSE"
+		 */
+		rds_ib_flush_mrs();
+
 		/*
 		 * We want to wait for tx and rx completion to finish
 		 * before we tear down the connection, but we have to be
@@ -1005,6 +1011,7 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
 		wait_event(rds_ib_ring_empty_wait,
 			   rds_ib_ring_empty(&ic->i_recv_ring) &&
 			   (atomic_read(&ic->i_signaled_sends) == 0) &&
+			   (atomic_read(&ic->i_fastreg_inuse_count) == 0) &&
 			   (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR));
 		tasklet_kill(&ic->i_send_tasklet);
 		tasklet_kill(&ic->i_recv_tasklet);
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index adaa8e99e5a9..06ecf9d2d4bf 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -32,6 +32,24 @@
 
 #include "ib_mr.h"
 
+static inline void
+rds_transition_frwr_state(struct rds_ib_mr *ibmr,
+			  enum rds_ib_fr_state old_state,
+			  enum rds_ib_fr_state new_state)
+{
+	if (cmpxchg(&ibmr->u.frmr.fr_state,
+		    old_state, new_state) == old_state &&
+	    old_state == FRMR_IS_INUSE) {
+		/* enforce order of ibmr->u.frmr.fr_state update
+		 * before decrementing i_fastreg_inuse_count
+		 */
+		smp_mb__before_atomic();
+		atomic_dec(&ibmr->ic->i_fastreg_inuse_count);
+		if (waitqueue_active(&rds_ib_ring_empty_wait))
+			wake_up(&rds_ib_ring_empty_wait);
+	}
+}
+
 static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
 					   int npages)
 {
@@ -118,13 +136,18 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
 	if (unlikely(ret != ibmr->sg_len))
 		return ret < 0 ? ret : -EINVAL;
 
+	if (cmpxchg(&frmr->fr_state,
+		    FRMR_IS_FREE, FRMR_IS_INUSE) != FRMR_IS_FREE)
+		return -EBUSY;
+
+	atomic_inc(&ibmr->ic->i_fastreg_inuse_count);
+
 	/* Perform a WR for the fast_reg_mr. Each individual page
 	 * in the sg list is added to the fast reg page list and placed
 	 * inside the fast_reg_mr WR.  The key used is a rolling 8bit
 	 * counter, which should guarantee uniqueness.
 	 */
 	ib_update_fast_reg_key(frmr->mr, ibmr->remap_count++);
-	frmr->fr_state = FRMR_IS_INUSE;
 	frmr->fr_reg = true;
 
 	memset(&reg_wr, 0, sizeof(reg_wr));
@@ -141,7 +164,8 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
 	ret = ib_post_send(ibmr->ic->i_cm_id->qp, &reg_wr.wr, NULL);
 	if (unlikely(ret)) {
 		/* Failure here can be because of -ENOMEM as well */
-		frmr->fr_state = FRMR_IS_STALE;
+		rds_transition_frwr_state(ibmr, FRMR_IS_INUSE, FRMR_IS_STALE);
+
 		atomic_inc(&ibmr->ic->i_fastreg_wrs);
 		if (printk_ratelimit())
 			pr_warn("RDS/IB: %s returned error(%d)\n",
@@ -268,8 +292,12 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
 
 	ret = ib_post_send(i_cm_id->qp, s_wr, NULL);
 	if (unlikely(ret)) {
-		frmr->fr_state = FRMR_IS_STALE;
+		rds_transition_frwr_state(ibmr, FRMR_IS_INUSE, FRMR_IS_STALE);
 		frmr->fr_inv = false;
+		/* enforce order of frmr->fr_inv update
+		 * before incrementing i_fastreg_wrs
+		 */
+		smp_mb__before_atomic();
 		atomic_inc(&ibmr->ic->i_fastreg_wrs);
 		pr_err("RDS/IB: %s returned error(%d)\n", __func__, ret);
 		goto out;
@@ -297,7 +325,7 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 	struct rds_ib_frmr *frmr = &ibmr->u.frmr;
 
 	if (wc->status != IB_WC_SUCCESS) {
-		frmr->fr_state = FRMR_IS_STALE;
+		rds_transition_frwr_state(ibmr, FRMR_IS_INUSE, FRMR_IS_STALE);
 		if (rds_conn_up(ic->conn))
 			rds_ib_conn_error(ic->conn,
 					  "frmr completion <%pI4,%pI4> status %u(%s), vendor_err 0x%x, disconnecting and reconnecting\n",
@@ -309,8 +337,7 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 	}
 
 	if (frmr->fr_inv) {
-		if (frmr->fr_state == FRMR_IS_INUSE)
-			frmr->fr_state = FRMR_IS_FREE;
+		rds_transition_frwr_state(ibmr, FRMR_IS_INUSE, FRMR_IS_FREE);
 		frmr->fr_inv = false;
 		wake_up(&frmr->fr_inv_done);
 	}
@@ -320,6 +347,10 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 		wake_up(&frmr->fr_reg_done);
 	}
 
+	/* enforce order of frmr->{fr_reg,fr_inv} update
+	 * before incrementing i_fastreg_wrs
+	 */
+	smp_mb__before_atomic();
 	atomic_inc(&ic->i_fastreg_wrs);
 }
 
-- 
2.22.0



^ permalink raw reply related

* [PATCH net v3 5/7] net/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been successful
From: Gerd Rausch @ 2019-07-16 22:29 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

Fix a bug where fr_state first goes to FRMR_IS_STALE, because of a failure
of operation IB_WR_LOCAL_INV, but then gets set back to "FRMR_IS_FREE"
uncoditionally, even though the operation failed.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib_frmr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 708c553c3da5..adaa8e99e5a9 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -309,7 +309,8 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 	}
 
 	if (frmr->fr_inv) {
-		frmr->fr_state = FRMR_IS_FREE;
+		if (frmr->fr_state == FRMR_IS_INUSE)
+			frmr->fr_state = FRMR_IS_FREE;
 		frmr->fr_inv = false;
 		wake_up(&frmr->fr_inv_done);
 	}
-- 
2.22.0



^ permalink raw reply related

* [PATCH net v3 4/7] net/rds: Fix NULL/ERR_PTR inconsistency
From: Gerd Rausch @ 2019-07-16 22:29 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

Make function "rds_ib_try_reuse_ibmr" return NULL in case
memory region could not be allocated, since callers
simply check if the return value is not NULL.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib_rdma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 6b047e63a769..c8c1e3ae8d84 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -450,7 +450,7 @@ struct rds_ib_mr *rds_ib_try_reuse_ibmr(struct rds_ib_mr_pool *pool)
 				rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
 			else
 				rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-			return ERR_PTR(-EAGAIN);
+			break;
 		}
 
 		/* We do have some empty MRs. Flush them out. */
@@ -464,7 +464,7 @@ struct rds_ib_mr *rds_ib_try_reuse_ibmr(struct rds_ib_mr_pool *pool)
 			return ibmr;
 	}
 
-	return ibmr;
+	return NULL;
 }
 
 static void rds_ib_mr_pool_flush_worker(struct work_struct *work)
-- 
2.22.0



^ permalink raw reply related

* [PATCH net v3 3/7] net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after posting IB_WR_LOCAL_INV
From: Gerd Rausch @ 2019-07-16 22:29 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

In order to:
1) avoid a silly bouncing between "clean_list" and "drop_list"
   triggered by function "rds_ib_reg_frmr" as it is releases frmr
   regions whose state is not "FRMR_IS_FREE" right away.

2) prevent an invalid access error in a race from a pending
   "IB_WR_LOCAL_INV" operation with a teardown ("dma_unmap_sg", "put_page")
   and de-registration ("ib_dereg_mr") of the corresponding
   memory region.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib_frmr.c | 65 +++++++++++++++++++++++++++--------------------
 net/rds/ib_mr.h   |  2 ++
 2 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 6038138d6e38..708c553c3da5 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -76,6 +76,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
 
 	frmr->fr_state = FRMR_IS_FREE;
 	init_waitqueue_head(&frmr->fr_inv_done);
+	init_waitqueue_head(&frmr->fr_reg_done);
 	return ibmr;
 
 out_no_cigar:
@@ -124,6 +125,7 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
 	 */
 	ib_update_fast_reg_key(frmr->mr, ibmr->remap_count++);
 	frmr->fr_state = FRMR_IS_INUSE;
+	frmr->fr_reg = true;
 
 	memset(&reg_wr, 0, sizeof(reg_wr));
 	reg_wr.wr.wr_id = (unsigned long)(void *)ibmr;
@@ -144,7 +146,17 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
 		if (printk_ratelimit())
 			pr_warn("RDS/IB: %s returned error(%d)\n",
 				__func__, ret);
+		goto out;
 	}
+
+	/* Wait for the registration to complete in order to prevent an invalid
+	 * access error resulting from a race between the memory region already
+	 * being accessed while registration is still pending.
+	 */
+	wait_event(frmr->fr_reg_done, !frmr->fr_reg);
+
+out:
+
 	return ret;
 }
 
@@ -262,6 +274,19 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
 		pr_err("RDS/IB: %s returned error(%d)\n", __func__, ret);
 		goto out;
 	}
+
+	/* Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition in order to
+	 * 1) avoid a silly bouncing between "clean_list" and "drop_list"
+	 *    triggered by function "rds_ib_reg_frmr" as it is releases frmr
+	 *    regions whose state is not "FRMR_IS_FREE" right away.
+	 * 2) prevents an invalid access error in a race
+	 *    from a pending "IB_WR_LOCAL_INV" operation
+	 *    with a teardown ("dma_unmap_sg", "put_page")
+	 *    and de-registration ("ib_dereg_mr") of the corresponding
+	 *    memory region.
+	 */
+	wait_event(frmr->fr_inv_done, frmr->fr_state != FRMR_IS_INUSE);
+
 out:
 	return ret;
 }
@@ -289,6 +314,11 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 		wake_up(&frmr->fr_inv_done);
 	}
 
+	if (frmr->fr_reg) {
+		frmr->fr_reg = false;
+		wake_up(&frmr->fr_reg_done);
+	}
+
 	atomic_inc(&ic->i_fastreg_wrs);
 }
 
@@ -297,14 +327,18 @@ void rds_ib_unreg_frmr(struct list_head *list, unsigned int *nfreed,
 {
 	struct rds_ib_mr *ibmr, *next;
 	struct rds_ib_frmr *frmr;
-	int ret = 0;
+	int ret = 0, ret2;
 	unsigned int freed = *nfreed;
 
 	/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
 	list_for_each_entry(ibmr, list, unmap_list) {
-		if (ibmr->sg_dma_len)
-			ret |= rds_ib_post_inv(ibmr);
+		if (ibmr->sg_dma_len) {
+			ret2 = rds_ib_post_inv(ibmr);
+			if (ret2 && !ret)
+				ret = ret2;
+		}
 	}
+
 	if (ret)
 		pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, ret);
 
@@ -347,31 +381,8 @@ struct rds_ib_mr *rds_ib_reg_frmr(struct rds_ib_device *rds_ibdev,
 	}
 
 	do {
-		if (ibmr) {
-			/* Memory regions make it onto the "clean_list" via
-			 * "rds_ib_flush_mr_pool", after the memory region has
-			 * been posted for invalidation via "rds_ib_post_inv".
-			 *
-			 * At that point in time, "fr_state" may still be
-			 * in state "FRMR_IS_INUSE", since the only place where
-			 * "fr_state" transitions to "FRMR_IS_FREE" is in
-			 * is in "rds_ib_mr_cqe_handler", which is
-			 * triggered by a tasklet.
-			 *
-			 * So we wait for "fr_inv_done" to trigger
-			 * and only put memory regions onto the drop_list
-			 * that failed (i.e. not marked "FRMR_IS_FREE").
-			 *
-			 * This avoids the problem of memory-regions bouncing
-			 * between "clean_list" and "drop_list" before they
-			 * even have a chance to be properly invalidated.
-			 */
-			frmr = &ibmr->u.frmr;
-			wait_event(frmr->fr_inv_done, frmr->fr_state != FRMR_IS_INUSE);
-			if (frmr->fr_state == FRMR_IS_FREE)
-				break;
+		if (ibmr)
 			rds_ib_free_frmr(ibmr, true);
-		}
 		ibmr = rds_ib_alloc_frmr(rds_ibdev, nents);
 		if (IS_ERR(ibmr))
 			return ibmr;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index ab26c20ed66f..9045a8c0edff 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -58,6 +58,8 @@ struct rds_ib_frmr {
 	enum rds_ib_fr_state	fr_state;
 	bool			fr_inv;
 	wait_queue_head_t	fr_inv_done;
+	bool			fr_reg;
+	wait_queue_head_t	fr_reg_done;
 	struct ib_send_wr	fr_wr;
 	unsigned int		dma_npages;
 	unsigned int		sg_byte_len;
-- 
2.22.0



^ permalink raw reply related

* [PATCH net v3 2/7] net/rds: Get rid of "wait_clean_list_grace" and add locking
From: Gerd Rausch @ 2019-07-16 22:28 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

Waiting for activity on the "clean_list" to quiesce is no substitute
for proper locking.

We can have multiple threads competing for "llist_del_first"
via "rds_ib_reuse_mr", and a single thread competing
for "llist_del_all" and "llist_del_first" via "rds_ib_flush_mr_pool".

Since "llist_del_first" depends on "list->first->next" not to change
in the midst of the operation, simply waiting for all current calls
to "rds_ib_reuse_mr" to quiesce across all CPUs is woefully inadequate:

By the time "wait_clean_list_grace" is done iterating over all CPUs to see
that there is no concurrent caller to "rds_ib_reuse_mr", a new caller may
have just shown up on the first CPU.

Furthermore, <linux/llist.h> explicitly calls out the need for locking:
 * Cases where locking is needed:
 * If we have multiple consumers with llist_del_first used in one consumer,
 * and llist_del_first or llist_del_all used in other consumers,
 * then a lock is needed.

Also, while at it, drop the unused "pool" parameter
from "list_to_llist_nodes".

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib_mr.h   |  1 +
 net/rds/ib_rdma.c | 56 +++++++++++++++--------------------------------
 2 files changed, 19 insertions(+), 38 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 42daccb7b5eb..ab26c20ed66f 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -98,6 +98,7 @@ struct rds_ib_mr_pool {
 	struct llist_head	free_list;	/* unused MRs */
 	struct llist_head	clean_list;	/* unused & unmapped MRs */
 	wait_queue_head_t	flush_wait;
+	spinlock_t		clean_lock;	/* "clean_list" concurrency */
 
 	atomic_t		free_pinned;	/* memory pinned by free MRs */
 	unsigned long		max_items;
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0b347f46b2f4..6b047e63a769 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -40,9 +40,6 @@
 
 struct workqueue_struct *rds_ib_mr_wq;
 
-static DEFINE_PER_CPU(unsigned long, clean_list_grace);
-#define CLEAN_LIST_BUSY_BIT 0
-
 static struct rds_ib_device *rds_ib_get_device(__be32 ipaddr)
 {
 	struct rds_ib_device *rds_ibdev;
@@ -195,12 +192,11 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool)
 {
 	struct rds_ib_mr *ibmr = NULL;
 	struct llist_node *ret;
-	unsigned long *flag;
+	unsigned long flags;
 
-	preempt_disable();
-	flag = this_cpu_ptr(&clean_list_grace);
-	set_bit(CLEAN_LIST_BUSY_BIT, flag);
+	spin_lock_irqsave(&pool->clean_lock, flags);
 	ret = llist_del_first(&pool->clean_list);
+	spin_unlock_irqrestore(&pool->clean_lock, flags);
 	if (ret) {
 		ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
 		if (pool->pool_type == RDS_IB_MR_8K_POOL)
@@ -209,23 +205,9 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool)
 			rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
 	}
 
-	clear_bit(CLEAN_LIST_BUSY_BIT, flag);
-	preempt_enable();
 	return ibmr;
 }
 
-static inline void wait_clean_list_grace(void)
-{
-	int cpu;
-	unsigned long *flag;
-
-	for_each_online_cpu(cpu) {
-		flag = &per_cpu(clean_list_grace, cpu);
-		while (test_bit(CLEAN_LIST_BUSY_BIT, flag))
-			cpu_relax();
-	}
-}
-
 void rds_ib_sync_mr(void *trans_private, int direction)
 {
 	struct rds_ib_mr *ibmr = trans_private;
@@ -324,8 +306,7 @@ static unsigned int llist_append_to_list(struct llist_head *llist,
  * of clusters.  Each cluster has linked llist nodes of
  * MR_CLUSTER_SIZE mrs that are ready for reuse.
  */
-static void list_to_llist_nodes(struct rds_ib_mr_pool *pool,
-				struct list_head *list,
+static void list_to_llist_nodes(struct list_head *list,
 				struct llist_node **nodes_head,
 				struct llist_node **nodes_tail)
 {
@@ -402,8 +383,13 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 	 */
 	dirty_to_clean = llist_append_to_list(&pool->drop_list, &unmap_list);
 	dirty_to_clean += llist_append_to_list(&pool->free_list, &unmap_list);
-	if (free_all)
+	if (free_all) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&pool->clean_lock, flags);
 		llist_append_to_list(&pool->clean_list, &unmap_list);
+		spin_unlock_irqrestore(&pool->clean_lock, flags);
+	}
 
 	free_goal = rds_ib_flush_goal(pool, free_all);
 
@@ -416,27 +402,20 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 		rds_ib_unreg_fmr(&unmap_list, &nfreed, &unpinned, free_goal);
 
 	if (!list_empty(&unmap_list)) {
-		/* we have to make sure that none of the things we're about
-		 * to put on the clean list would race with other cpus trying
-		 * to pull items off.  The llist would explode if we managed to
-		 * remove something from the clean list and then add it back again
-		 * while another CPU was spinning on that same item in llist_del_first.
-		 *
-		 * This is pretty unlikely, but just in case  wait for an llist grace period
-		 * here before adding anything back into the clean list.
-		 */
-		wait_clean_list_grace();
-
-		list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
+		unsigned long flags;
+
+		list_to_llist_nodes(&unmap_list, &clean_nodes, &clean_tail);
 		if (ibmr_ret) {
 			*ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
 			clean_nodes = clean_nodes->next;
 		}
 		/* more than one entry in llist nodes */
-		if (clean_nodes)
+		if (clean_nodes) {
+			spin_lock_irqsave(&pool->clean_lock, flags);
 			llist_add_batch(clean_nodes, clean_tail,
 					&pool->clean_list);
-
+			spin_unlock_irqrestore(&pool->clean_lock, flags);
+		}
 	}
 
 	atomic_sub(unpinned, &pool->free_pinned);
@@ -610,6 +589,7 @@ struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev,
 	init_llist_head(&pool->free_list);
 	init_llist_head(&pool->drop_list);
 	init_llist_head(&pool->clean_list);
+	spin_lock_init(&pool->clean_lock);
 	mutex_init(&pool->flush_lock);
 	init_waitqueue_head(&pool->flush_wait);
 	INIT_DELAYED_WORK(&pool->flush_worker, rds_ib_mr_pool_flush_worker);
-- 
2.22.0



^ permalink raw reply related

* [PATCH net v3 1/7] net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
From: Gerd Rausch @ 2019-07-16 22:28 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

In the context of FRMR (ib_frmr.c):

Memory regions make it onto the "clean_list" via "rds_ib_flush_mr_pool",
after the memory region has been posted for invalidation via
"rds_ib_post_inv".

At that point in time, "fr_state" may still be in state "FRMR_IS_INUSE",
since the only place where "fr_state" transitions to "FRMR_IS_FREE"
is in "rds_ib_mr_cqe_handler", which is triggered by a tasklet.

So in case we notice that "fr_state != FRMR_IS_FREE" (see below),
we wait for "fr_inv_done" to trigger with a maximum of 10msec.
Then we check again, and only put the memory region onto the drop_list
(via "rds_ib_free_frmr") in case the situation remains unchanged.

This avoids the problem of memory-regions bouncing between "clean_list"
and "drop_list" before they even have a chance to be properly invalidated.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
---
 net/rds/ib_frmr.c | 27 ++++++++++++++++++++++++++-
 net/rds/ib_mr.h   |  1 +
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 32ae26ed58a0..6038138d6e38 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -75,6 +75,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
 		pool->max_items_soft = pool->max_items;
 
 	frmr->fr_state = FRMR_IS_FREE;
+	init_waitqueue_head(&frmr->fr_inv_done);
 	return ibmr;
 
 out_no_cigar:
@@ -285,6 +286,7 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 	if (frmr->fr_inv) {
 		frmr->fr_state = FRMR_IS_FREE;
 		frmr->fr_inv = false;
+		wake_up(&frmr->fr_inv_done);
 	}
 
 	atomic_inc(&ic->i_fastreg_wrs);
@@ -345,8 +347,31 @@ struct rds_ib_mr *rds_ib_reg_frmr(struct rds_ib_device *rds_ibdev,
 	}
 
 	do {
-		if (ibmr)
+		if (ibmr) {
+			/* Memory regions make it onto the "clean_list" via
+			 * "rds_ib_flush_mr_pool", after the memory region has
+			 * been posted for invalidation via "rds_ib_post_inv".
+			 *
+			 * At that point in time, "fr_state" may still be
+			 * in state "FRMR_IS_INUSE", since the only place where
+			 * "fr_state" transitions to "FRMR_IS_FREE" is in
+			 * is in "rds_ib_mr_cqe_handler", which is
+			 * triggered by a tasklet.
+			 *
+			 * So we wait for "fr_inv_done" to trigger
+			 * and only put memory regions onto the drop_list
+			 * that failed (i.e. not marked "FRMR_IS_FREE").
+			 *
+			 * This avoids the problem of memory-regions bouncing
+			 * between "clean_list" and "drop_list" before they
+			 * even have a chance to be properly invalidated.
+			 */
+			frmr = &ibmr->u.frmr;
+			wait_event(frmr->fr_inv_done, frmr->fr_state != FRMR_IS_INUSE);
+			if (frmr->fr_state == FRMR_IS_FREE)
+				break;
 			rds_ib_free_frmr(ibmr, true);
+		}
 		ibmr = rds_ib_alloc_frmr(rds_ibdev, nents);
 		if (IS_ERR(ibmr))
 			return ibmr;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index 5da12c248431..42daccb7b5eb 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -57,6 +57,7 @@ struct rds_ib_frmr {
 	struct ib_mr		*mr;
 	enum rds_ib_fr_state	fr_state;
 	bool			fr_inv;
+	wait_queue_head_t	fr_inv_done;
 	struct ib_send_wr	fr_wr;
 	unsigned int		dma_npages;
 	unsigned int		sg_byte_len;
-- 
2.22.0



^ permalink raw reply related

* [PATCH net v3 0/7] net/rds: RDMA fixes
From: Gerd Rausch @ 2019-07-16 22:28 UTC (permalink / raw)
  To: Santosh Shilimkar, netdev, linux-rdma, rds-devel; +Cc: David Miller

A number of net/rds fixes necessary to make "rds_rdma.ko"
pass some basic Oracle internal tests.

Gerd Rausch (7):
  net/rds: Give fr_state a chance to transition to FRMR_IS_FREE
  net/rds: Get rid of "wait_clean_list_grace" and add locking
  net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after
    posting IB_WR_LOCAL_INV
  net/rds: Fix NULL/ERR_PTR inconsistency
  net/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been
    successful
  net/rds: Keep track of and wait for FRWR segments in use upon shutdown
  net/rds: Initialize ic->i_fastreg_wrs upon allocation

 net/rds/ib.h      |  1 +
 net/rds/ib_cm.c   |  9 ++++-
 net/rds/ib_frmr.c | 84 ++++++++++++++++++++++++++++++++++++++++++-----
 net/rds/ib_mr.h   |  4 +++
 net/rds/ib_rdma.c | 60 +++++++++++----------------------
 5 files changed, 109 insertions(+), 49 deletions(-)

-- 

Changes in submitted patch v3:
* Use "wait_event" instead of "wait_event_timeout" in order to
  not have a deadline for the HCA firmware to respond.

^ permalink raw reply

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Alexei Starovoitov @ 2019-07-16 22:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Adrian Ratiu, Alexei Starovoitov, bpf,
	Brendan Gregg, connoro, Daniel Borkmann, duyuchao, Ingo Molnar,
	jeffv, Karim Yaghmour, kernel-team, linux-kselftest,
	Manali Shukla, Manjo Raja Rao, Martin KaFai Lau, Masami Hiramatsu,
	Matt Mullins, Michal Gregorczyk, Michal Gregorczyk,
	Mohammad Husain, namhyung, namhyung, netdev, paul.chaignon,
	primiano, Qais Yousef, Shuah Khan, Song Liu, Srinivas Ramana,
	Steven Rostedt, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190716213050.GA161922@google.com>

On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> 
> I also thought about the pinning idea before, but we also want to add support
> for not just raw tracepoints, but also regular tracepoints (events if you
> will). I am hesitant to add a new BPF API just for creating regular
> tracepoints and then pinning those as well.

and they should be done through the pinning as well.

> I don't see why a new bpf node for a trace event is a bad idea, really.

See the patches for kprobe/uprobe FD-based api and the reasons behind it.
tldr: text is racy, doesn't scale, poor security, etc.

> tracefs is how we deal with trace events on Android. 

android has made mistakes in the past as well.

> This is a natural extension to that and fits with the security model
> well.

I think it's the opposite.
I'm absolutely against text based apis.


^ permalink raw reply

* [PATCH net] bonding: Force slave speed check after link state recovery for 802.3ad
From: Thomas Falcon @ 2019-07-16 22:25 UTC (permalink / raw)
  To: netdev
  Cc: bjking1, pradeep, Thomas Falcon, Jarod Wilson, Jay Vosburgh,
	Veaceslav Falico, Andy Gospodarek

The following scenario was encountered during testing of logical
partition mobility on pseries partitions with bonded ibmvnic
adapters in LACP mode.

1. Driver receives a signal that the device has been
   swapped, and it needs to reset to initialize the new
   device.

2. Driver reports loss of carrier and begins initialization.

3. Bonding driver receives NETDEV_CHANGE notifier and checks
   the slave's current speed and duplex settings. Because these
   are unknown at the time, the bond sets its link state to
   BOND_LINK_FAIL and handles the speed update, clearing
   AD_PORT_LACP_ENABLE.

4. Driver finishes recovery and reports that the carrier is on.

5. Bond receives a new notification and checks the speed again.
   The speeds are valid but miimon has not altered the link
   state yet.  AD_PORT_LACP_ENABLE remains off.

Because the slave's link state is still BOND_LINK_FAIL,
no further port checks are made when it recovers. Though
the slave devices are operational and have valid speed
and duplex settings, the bond will not send LACPDU's. The
simplest fix I can see is to force another speed check
in bond_miimon_commit. This way the bond will update
AD_PORT_LACP_ENABLE if needed when transitioning from
BOND_LINK_FAIL to BOND_LINK_UP.

CC: Jarod Wilson <jarod@redhat.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
---
 drivers/net/bonding/bond_main.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9b7016a..02fd782 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2196,6 +2196,15 @@ static void bond_miimon_commit(struct bonding *bond)
 	bond_for_each_slave(bond, slave, iter) {
 		switch (slave->new_link) {
 		case BOND_LINK_NOCHANGE:
+			/* For 802.3ad mode, check current slave speed and
+			 * duplex again in case its port was disabled after
+			 * invalid speed/duplex reporting but recovered before
+			 * link monitoring could make a decision on the actual
+			 * link status
+			 */
+			if (BOND_MODE(bond) == BOND_MODE_8023AD &&
+			    slave->link == BOND_LINK_UP)
+				bond_3ad_adapter_speed_duplex_changed(slave);
 			continue;
 
 		case BOND_LINK_UP:
-- 
1.8.3.1


^ permalink raw reply related

* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Jakub Kicinski @ 2019-07-16 22:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiong Wang, Andrii Nakryiko, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Andrii Nakryiko, bpf, Networking, oss-drivers,
	Yonghong Song
In-Reply-To: <20190716161701.mk5ye47aj2slkdjp@ast-mbp.dhcp.thefacebook.com>

On Tue, 16 Jul 2019 09:17:03 -0700, Alexei Starovoitov wrote:
> I don't think we have a test for such 'dead prog only due to verifier walk'
> situation. I wonder what happens :)

FWIW we do have verifier and BTF self tests for dead code removal
of entire subprogs! :)

^ permalink raw reply

* [PATCH 3/3] ipconfig: Handle CONFIG_CIFS_ROOT option
From: Paulo Alcantara (SUSE) @ 2019-07-16 22:04 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-cifs, samba-technical
  Cc: Paulo Alcantara (SUSE), David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI
In-Reply-To: <20190716220452.3382-1-paulo@paulo.ac>

The experimental root file system support in cifs.ko relies on
ipconfig to set up the network stack and then accessing the SMB share
that contains the rootfs files.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Paulo Alcantara (SUSE) <paulo@paulo.ac>
---
 net/ipv4/ipconfig.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 9bcca08efec9..32e20b758b68 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -1483,10 +1483,10 @@ static int __init ip_auto_config(void)
 	 * missing values.
 	 */
 	if (ic_myaddr == NONE ||
-#ifdef CONFIG_ROOT_NFS
+#if defined(CONFIG_ROOT_NFS) || defined(CONFIG_CIFS_ROOT)
 	    (root_server_addr == NONE &&
 	     ic_servaddr == NONE &&
-	     ROOT_DEV == Root_NFS) ||
+	     (ROOT_DEV == Root_NFS || ROOT_DEV == Root_CIFS)) ||
 #endif
 	    ic_first_dev->next) {
 #ifdef IPCONFIG_DYNAMIC
@@ -1513,6 +1513,12 @@ static int __init ip_auto_config(void)
 				goto try_try_again;
 			}
 #endif
+#ifdef CONFIG_CIFS_ROOT
+			if (ROOT_DEV == Root_CIFS) {
+				pr_err("IP-Config: Retrying forever (CIFS root)...\n");
+				goto try_try_again;
+			}
+#endif
 
 			if (--retries) {
 				pr_err("IP-Config: Reopening network devices...\n");
-- 
2.22.0


^ permalink raw reply related

* [PATCH 2/3] init: Support mounting root file systems over SMB
From: Paulo Alcantara (SUSE) @ 2019-07-16 22:04 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-cifs, samba-technical
  Cc: Paulo Alcantara (SUSE), Andrew Morton
In-Reply-To: <20190716220452.3382-1-paulo@paulo.ac>

Add a new virtual device named /dev/cifs (0xfe) to tell the kernel to
mount the root file system over the network by using SMB protocol.

cifs_root_data() will be responsible to retrieve the parsed
information of the new command-line option (cifsroot=) and then call
do_mount_root() with the appropriate mount options for cifs.ko.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paulo Alcantara (SUSE) <paulo@paulo.ac>
---
 init/do_mounts.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/init/do_mounts.c b/init/do_mounts.c
index 2d1ea3028454..28c5b583afef 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -212,6 +212,7 @@ static int match_dev_by_label(struct device *dev, const void *data)
  *	   a colon.
  *	9) PARTLABEL=<name> with name being the GPT partition label.
  *	   MSDOS partitions do not support labels!
+ *	10) /dev/cifs represents Root_CIFS (0xfe)
  *
  *	If name doesn't have fall into the categories above, we return (0,0).
  *	block_class is used to check if something is a disk name. If the disk
@@ -268,6 +269,9 @@ dev_t name_to_dev_t(const char *name)
 	res = Root_NFS;
 	if (strcmp(name, "nfs") == 0)
 		goto done;
+	res = Root_CIFS;
+	if (strcmp(name, "cifs") == 0)
+		goto done;
 	res = Root_RAM0;
 	if (strcmp(name, "ram") == 0)
 		goto done;
@@ -501,6 +505,42 @@ static int __init mount_nfs_root(void)
 }
 #endif
 
+#ifdef CONFIG_CIFS_ROOT
+
+extern int cifs_root_data(char **dev, char **opts);
+
+#define CIFSROOT_TIMEOUT_MIN	5
+#define CIFSROOT_TIMEOUT_MAX	30
+#define CIFSROOT_RETRY_MAX	5
+
+static int __init mount_cifs_root(void)
+{
+	char *root_dev, *root_data;
+	unsigned int timeout;
+	int try, err;
+
+	err = cifs_root_data(&root_dev, &root_data);
+	if (err != 0)
+		return 0;
+
+	timeout = CIFSROOT_TIMEOUT_MIN;
+	for (try = 1; ; try++) {
+		err = do_mount_root(root_dev, "cifs", root_mountflags,
+				    root_data);
+		if (err == 0)
+			return 1;
+		if (try > CIFSROOT_RETRY_MAX)
+			break;
+
+		ssleep(timeout);
+		timeout <<= 1;
+		if (timeout > CIFSROOT_TIMEOUT_MAX)
+			timeout = CIFSROOT_TIMEOUT_MAX;
+	}
+	return 0;
+}
+#endif
+
 #if defined(CONFIG_BLK_DEV_RAM) || defined(CONFIG_BLK_DEV_FD)
 void __init change_floppy(char *fmt, ...)
 {
@@ -542,6 +582,15 @@ void __init mount_root(void)
 		ROOT_DEV = Root_FD0;
 	}
 #endif
+#ifdef CONFIG_CIFS_ROOT
+	if (ROOT_DEV == Root_CIFS) {
+		if (mount_cifs_root())
+			return;
+
+		printk(KERN_ERR "VFS: Unable to mount root fs via SMB, trying floppy.\n");
+		ROOT_DEV = Root_FD0;
+	}
+#endif
 #ifdef CONFIG_BLK_DEV_FD
 	if (MAJOR(ROOT_DEV) == FLOPPY_MAJOR) {
 		/* rd_doload is 2 for a dual initrd/ramload setup */
-- 
2.22.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox