Netdev List
 help / color / mirror / Atom feed
* Re: lost connection to test machine (3)
From: Dmitry Vyukov @ 2017-12-27 18:22 UTC (permalink / raw)
  To: syzbot
  Cc: LKML, syzkaller-bugs, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David Miller, netfilter-devel, coreteam, netdev
In-Reply-To: <001a1143d40c2b55b10561566d26@google.com>

On Wed, Dec 27, 2017 at 7:18 PM, syzbot
<syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzkaller hit the following crash on
> beacbc68ac3e23821a681adb30b45dc55b17488d
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: <syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com>
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.

+netfilter maintainers

Here is cleaned reproducer:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/if.h>
#include <linux/netfilter_ipv4/ip_tables.h>

int main()
{
  int fd;

  fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
  struct ipt_replace opt = {};
  opt.num_counters = 1;
  opt.size = -1;
  setsockopt(fd, SOL_IP, 0x40, &opt, 0x4);
  return 0;
}


What happens there is that here:

struct xt_table_info *xt_alloc_table_info(unsigned int size)
{
    ...
    if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
        return NULL;

size = -1 and SMP_ALIGN(size) = 0, so this still tries to allocate
4GB+delta bytes.

I don't understand why this uses SMP_ALIGN since we add 2 pages on
top, it seems that we could just drop SMP_ALIGN and local SMP_ALIGN
definition altogether.



> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/001a1143d40c2b55b10561566d26%40google.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: WARNING in strp_data_ready
From: Dmitry Vyukov @ 2017-12-27 18:25 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, syzbot, David S. Miller, Eric Biggers, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, Tom Herbert,
	Cong Wang
In-Reply-To: <CACT4Y+YXhkZsDdst+H1VQZ9Tc0fMJhF3niXEJoK9TqcTUaC4iA@mail.gmail.com>

On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> <john.fastabend@gmail.com> wrote:
>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>> Hello,
>>>>
>>>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>> .config is attached
>>>> Raw console output is attached.
>>>> C reproducer is attached
>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>> for information about syzkaller reproducers
>>>>
>>>>
>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>
>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>> Call Trace:
>>>>  <IRQ>
>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>> RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>> RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>> RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>> RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>> R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>> R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>
>>> Looks like KCM is calling sk_data_ready() without first taking the
>>> sock lock.
>>>
>>> /* Called with lower sock held */
>>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>> {
>>>  [...]
>>>         if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>
>>> In this case kcm->sk is not the same lock the comment is referring to.
>>> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>
>>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>> I don't have anything better in mind immediately.
>>>
>> The sock locks are taken in reverse order in the send path so so
>> grabbing kcm sock lock with lower lock held to call sk_data_ready may
>> lead to deadlock like I think.
>>
>> It might be possible to change the order in the send path to do this.
>> Something like:
>>
>> trylock on lower socket lock
>> -if trylock fails
>>   - release kcm sock lock
>>   - lock lower sock
>>   - lock kcm sock
>> - call sendpage locked function
>>
>> I admit that dealing with two levels of socket locks in the data path
>> is quite a pain :-)
>
> up
>
> still happening and we've lost 50K+ test VMs on this

up

Still happens and number of crashes crossed 60K, can we do something
with this please?

^ permalink raw reply

* Re: Pravin Shelar
From: Ben Pfaff @ 2017-12-27 18:25 UTC (permalink / raw)
  To: Julia Lawall; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.20.1712271621370.14107@hadrien>

On Wed, Dec 27, 2017 at 04:22:55PM +0100, Julia Lawall wrote:
> The email address pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org listed for Pravin Shelar in
> MAINTAINERS (OPENVSWITCH section) seems to bounce.

Pravin has used a newer address recently, so I sent out a suggested
update (for OVS):
        https://patchwork.ozlabs.org/patch/853232/

^ permalink raw reply

* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
From: Cong Wang @ 2017-12-27 18:29 UTC (permalink / raw)
  To: John Fastabend; +Cc: Linux Kernel Network Developers, Jakub Kicinski
In-Reply-To: <419268d4-4078-098c-3c2e-5ce967feb5e5@gmail.com>

On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/22/2017 12:31 PM, Cong Wang wrote:
>> I understand why you had it, but it is just not safe. You don't want
>> to achieve performance gain by crashing system, right?
>
> huh? So my point is the patch you submit here is not a
> real fix but a work around. To peek the head of a consumer/producer ring
> without a lock, _should_ be fine. This _should_ work as well with
> consumer or producer operations happening at the same time. After some
> digging the issue is in the ptr_ring code.


The comments disagree with you:

/* Might be slightly faster than skb_array_empty below, but only safe if the
 * array is never resized. Also, callers invoking this in a loop must take care
 * to use a compiler barrier, for example cpu_relax().
 */

If the comments are right, you miss a barrier here too.


>
> The peek code (what empty check calls) is the following,
>
> static inline void *__ptr_ring_peek(struct ptr_ring *r)
> {
>         if (likely(r->size))
>                 return r->queue[r->consumer_head];
>         return NULL;
> }
>
> So what the splat is detecting is consumer head being 'out of bounds'.
> This happens because ptr_ring_discard_one increments the consumer_head
> and then checks to see if it overran the array size. If above peek
> happens after the increment, but before the size check we get the
> splat. There are two ways, as far as I can see, to fix this. First
> do the check before incrementing the consumer head. Or the easier
> fix,
>
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct
> ptr_ring *r,
>
>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size,
> gfp_t gfp)
>  {
> -       return kcalloc(size, sizeof(void *), gfp);
> +       return kcalloc(size + 1, sizeof(void *), gfp);
>  }
>
> With Jakub's help (Thanks!) I was able to reproduce the original splat
> and also verify the above removes it.
>
> To be clear "resizing" a skb_array only refers to changing the actual
> array size not adding/removing elements.

I never look into the implementation, just simply trust the comments.

At least the comments above __skb_array_empty() need to improve.


>
>>
>>>
>>> Although its not logical IMO to have both reset and dequeue running at
>>> the same time. Some skbs would get through others would get sent, sort
>>> of a mess. I don't see how it can be an issue. The never resized bit
>>> in the documentation is referring to resizing the ring size _not_ popping
>>> off elements of the ring. array_empty just reads the consumer head.
>>> The only ring resizing in pfifo fast should be at init and destroy where
>>> enqueue/dequeue should be disconnected by then. Although based on the
>>> trace I missed a case.
>>
>>
>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
>> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
>>
>
> Sorry not following.
>
>> And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
>> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
>> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
>> concurrently.
>
> Yes and this _should_ be perfectly fine for pfifo_fast. I'm wondering
> though if this API can be cleaned up. What are the paths that do a reset
> without a destroy.. Do we really need to have this pattern where reset
> is called then later destroy. Seems destroy could do the entire cleanup
> and this would simplify things. None of this has to do with the splat
> though.

I don't follow your point any more.

We are talking about ->reset() race with ->dequeue() which is the
cause of the bug, right?

If you expect ->reset() runs in parallel with ->dequeue() for pfifo_fast,
why did you even mention synchronize_net() from the beginning???
Also you changed the code too, to adjust rcu grace period.


>
>>
>>
>>>
>>> I think the right fix is to only call reset/destroy patterns after
>>> waiting a grace period and for all tx_action calls in-flight to
>>> complete. This is also better going forward for more complex qdiscs.
>>
>> But we don't even have rcu read lock in TX BH, do we?
>>
>> Also, people certainly don't like yet another synchronize_net()...
>>
>
> This needs a fix and is a _real_ bug, but removing __skb_array_empty()
> doesn't help solve this at all. Will work on a fix after the holiday
> break. The fix here is to ensure the destroy is not going to happen
> while tx_action is in-flight. Can be done with qdisc_run and checking
> correct bits in lockless case.

Sounds like you missed a lot of things with your "lockless" patches....
First qdisc rcu callback, second rcu read lock in TX BH...

My quick one-line fix is to amend this bug before you going deeper
in this rabbit hole.

^ permalink raw reply

* Re: [ovs-dev] Pravin Shelar
From: Joe Perches @ 2017-12-27 18:33 UTC (permalink / raw)
  To: Ben Pfaff, Julia Lawall, Pravin Shelar; +Cc: netdev, dev
In-Reply-To: <20171227182540.GT13883@ovn.org>

On Wed, 2017-12-27 at 10:25 -0800, Ben Pfaff wrote:
> On Wed, Dec 27, 2017 at 04:22:55PM +0100, Julia Lawall wrote:
> > The email address pshelar@nicira.com listed for Pravin Shelar in
> > MAINTAINERS (OPENVSWITCH section) seems to bounce.
> 
> Pravin has used a newer address recently, so I sent out a suggested
> update (for OVS):
>         https://patchwork.ozlabs.org/patch/853232/

As Pravin is still active with acks but not any authored patches in
the
last year, this should still be updated in the linux-kernel's
MAINTAINERS
file too.
---
diff --git a/MAINTAINERS b/MAINTAINERS
index
a6e86e20761e..5869e5f0b930 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@
-10137,7 +10137,7 @@ F:	drivers/irqchip/irq-ompic.c
 F:	dri
vers/irqchip/irq-or1k-*
 
 OPENVSWITCH
-M:	Pravin Shelar <pshelar@ni
cira.com>
+M:	Pravin Shelar <pshelar@ovn.org>
 L:	netdev@vge
r.kernel.org
 L:	dev@openvswitch.org
 W:	http://openvswitch.org

^ permalink raw reply

* Re: [PATCH V2 net-next 0/3] rds bug fixes
From: David Miller @ 2017-12-27 18:38 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: netdev, rds-devel, santosh.shilimkar
In-Reply-To: <cover.1513962765.git.sowmini.varadhan@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Fri, 22 Dec 2017 09:38:58 -0800

> Ran into pre-existing bugs when working on the fix for
>    https://www.spinics.net/lists/netdev/msg472849.html
> 
> The bugs fixed in this patchset are unrelated to the syzbot 
> failure (which I'm still testing and trying to reproduce) but 
> meanwhile, let's get these fixes out of the way.
> 
> V2: target net-next (rds:tcp patches have a dependancy on 
> changes that are in net-next, but not yet in net)

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH v2] sctp: Replace use of sockets_allocated with specified macro.
From: David Miller @ 2017-12-27 18:48 UTC (permalink / raw)
  To: xiangxia.m.yue; +Cc: netdev, eric.dumazet
In-Reply-To: <1513966520-5429-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date: Fri, 22 Dec 2017 10:15:20 -0800

> The patch(180d8cd942ce) replaces all uses of struct sock fields'
> memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem
> to accessor macros. But the sockets_allocated field of sctp sock is
> not replaced at all. Then replace it now for unifying the code.
> 
> Fixes: 180d8cd942ce ("foundations of per-cgroup memory pressure controlling.")
> Cc: Glauber Costa <glommer@parallels.com>
> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH 2/4] libbpf: add error reporting in XDP
From: Alexei Starovoitov @ 2017-12-27 18:57 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-3-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:27PM +0100, Eric Leblond wrote:
> Parse netlink ext attribute to get the error message returned by
> the card. Code is partially take from libnl.
> 
> Signed-off-by: Eric Leblond <eric@regit.org>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH 1/4] libbpf: add function to setup XDP
From: Alexei Starovoitov @ 2017-12-27 18:57 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-2-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:26PM +0100, Eric Leblond wrote:
> Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
> slightly modified to be library compliant.
> 
> Signed-off-by: Eric Leblond <eric@regit.org>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH 3/4] libbpf: break loop earlier
From: Alexei Starovoitov @ 2017-12-27 19:00 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-4-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:28PM +0100, Eric Leblond wrote:
> Get out of the loop when we have a match.
> 
> Signed-off-by: Eric Leblond <eric@regit.org>
> ---
>  tools/lib/bpf/libbpf.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 5fe8aaa2123e..d263748aa341 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -412,6 +412,7 @@ bpf_object__init_prog_names(struct bpf_object *obj)
>  					   prog->section_name);
>  				return -LIBBPF_ERRNO__LIBELF;
>  			}
> +			break;

why this is needed?
The top of the loop is:
 for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name;

so as soon as name is found the loop will exit.
I agree that the loop structure is non-standard is confusing,
but adding break here will make it even more so.
If 'break' is added then '&& !name' should be removed.

^ permalink raw reply

* Re: [PATCH 4/4] libbpf: add missing SPDX-License-Identifier
From: Alexei Starovoitov @ 2017-12-27 19:01 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-5-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:29PM +0100, Eric Leblond wrote:
> Signed-off-by: Eric Leblond <eric@regit.org>

thank you.
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v2 bpf-next 2/2] tools/bpftool: fix bpftool build with bintutils >= 2.8
From: Roman Gushchin @ 2017-12-27 19:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Quentin Monnet, netdev, linux-kernel, kernel-team, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20171227023204.eulgkbg7epj7nl76@ast-mbp>

On Tue, Dec 26, 2017 at 06:32:05PM -0800, Alexei Starovoitov wrote:
> On Fri, Dec 22, 2017 at 06:50:01PM +0000, Quentin Monnet wrote:
> > Hi Roman,
> > 
> > 2017-12-22 16:11 UTC+0000 ~ Roman Gushchin <guro@fb.com>
> > > Bpftool build is broken with binutils version 2.28 and later.
> > 
> > Could you check the binutils version? I believe it changed in 2.29
> > instead of 2.28. Could you update your commit log and subject
> > accordingly, please?

Yes, you're right. Thanks!

> > 
> > > The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
> > > in the binutils repo, which changed the disassembler() function
> > > signature.
> > > 
> > > Fix this by adding a new "feature" to the tools/build/features
> > > infrastructure and make it responsible for decision which
> > > disassembler() function signature to use.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
> > > Cc: Alexei Starovoitov <ast@kernel.org>
> > > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > > ---
> > >  tools/bpf/Makefile                                | 29 +++++++++++++++++++++++
> > >  tools/bpf/bpf_jit_disasm.c                        |  7 ++++++
> > >  tools/bpf/bpftool/Makefile                        | 24 +++++++++++++++++++
> > >  tools/bpf/bpftool/jit_disasm.c                    |  7 ++++++
> > >  tools/build/feature/Makefile                      |  4 ++++
> > >  tools/build/feature/test-disassembler-four-args.c | 15 ++++++++++++
> > >  6 files changed, 86 insertions(+)
> > >  create mode 100644 tools/build/feature/test-disassembler-four-args.c
> > > 
> > > diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
> > > index 07a6697466ef..c8ec0ae16bf0 100644
> > > --- a/tools/bpf/Makefile
> > > +++ b/tools/bpf/Makefile
> > > @@ -9,6 +9,35 @@ MAKE = make
> > >  CFLAGS += -Wall -O2
> > >  CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
> > >  
> > > +ifeq ($(srctree),)
> > > +srctree := $(patsubst %/,%,$(dir $(CURDIR)))
> > > +srctree := $(patsubst %/,%,$(dir $(srctree)))
> > > +endif
> > > +
> > > +FEATURE_USER = .bpf
> > > +FEATURE_TESTS = libbfd disassembler-four-args
> > > +FEATURE_DISPLAY = libbfd disassembler-four-args
> > 
> > Thanks for adding libbfd as I requested. However, you do not use it in
> > the Makefile to prevent compilation if the feature is not detected (see
> > "bpfdep" or "elfdep" in tools/lib/bpf/Makefile. Sorry, I should have
> > pointed it in my previous review.
> > 
> > But actually, I have another issue related to the libbfd feature: since
> > commit 280e7c48c3b8 ("perf tools: fix BFD detection on opensuse") it
> > requires libiberty so that libbfd is correctly detected, but libiberty
> > is not needed on all distros (at least Ubuntu can have libbfd without
> > libiberty). Typically, detection fails on my setup, although I do have
> > libbfd installed. So forcing libbfd feature here may eventually force
> > users to install libraries they do not need to compile bpftool, which is
> > not what we want.
> > 
> > I do not have a clean work around to suggest. Maybe have one
> > "libbfd-something" feature that tries to compile without libiberty, then
> > another one that tries with it, and compile the tools if at least one of
> > them succeeds. But it's probably for another patch series. In the
> > meantime, would you please simply remove libbfd detection here and
> > accept my apologies for suggesting to add it in the previous review?
> 
> I think since libbfd is already used by bpftool it's a good thing
> to add feature detection. Even if it's not perfect on some setups.

Agree, we can enhance it later.

> 
> Roman,
> I think you still need to do one more respin to address commit log nit?
> 

Sure, will send soon-ish.

Thanks!

^ permalink raw reply

* Re: WARNING in strp_data_ready
From: Tom Herbert @ 2017-12-27 19:09 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: John Fastabend, syzbot, David S. Miller, Eric Biggers, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, Tom Herbert,
	Cong Wang
In-Reply-To: <CACT4Y+bFsT4ZVvZSZvQdcKXb8n-79gy2tLB02jBm8wwp32YoAA@mail.gmail.com>

Did you try the patch I posted?


On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> <john.fastabend@gmail.com> wrote:
>>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>>> Hello,
>>>>>
>>>>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>>> .config is attached
>>>>> Raw console output is attached.
>>>>> C reproducer is attached
>>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>> for information about syzkaller reproducers
>>>>>
>>>>>
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>>
>>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>> Call Trace:
>>>>>  <IRQ>
>>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>> RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>>> RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>>> RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>>> RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>>> R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>>> R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>>
>>>> Looks like KCM is calling sk_data_ready() without first taking the
>>>> sock lock.
>>>>
>>>> /* Called with lower sock held */
>>>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>>> {
>>>>  [...]
>>>>         if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>>
>>>> In this case kcm->sk is not the same lock the comment is referring to.
>>>> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>>
>>>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>>> I don't have anything better in mind immediately.
>>>>
>>> The sock locks are taken in reverse order in the send path so so
>>> grabbing kcm sock lock with lower lock held to call sk_data_ready may
>>> lead to deadlock like I think.
>>>
>>> It might be possible to change the order in the send path to do this.
>>> Something like:
>>>
>>> trylock on lower socket lock
>>> -if trylock fails
>>>   - release kcm sock lock
>>>   - lock lower sock
>>>   - lock kcm sock
>>> - call sendpage locked function
>>>
>>> I admit that dealing with two levels of socket locks in the data path
>>> is quite a pain :-)
>>
>> up
>>
>> still happening and we've lost 50K+ test VMs on this
>
> up
>
> Still happens and number of crashes crossed 60K, can we do something
> with this please?

^ permalink raw reply

* [PATCH v3 bpf-next 1/2] tools/bpftool: use version from the kernel source tree
From: Roman Gushchin @ 2017-12-27 19:16 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel-team, Roman Gushchin, Alexei Starovoitov,
	Daniel Borkmann

Bpftool determines it's own version based on the kernel
version, which is picked from the linux/version.h header.

It's strange to use the version of the installed kernel
headers, and makes much more sense to use the version
of the actual source tree, where bpftool sources are.

Fix this by building kernelversion target and use
the resulting string as bpftool version.

Example:
before:

$ bpftool version
bpftool v4.14.6

after:
$ bpftool version
bpftool v4.15.0-rc3

$bpftool version --json
{"version":"4.15.0-rc3"}

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/bpftool/Makefile |  3 +++
 tools/bpf/bpftool/main.c   | 13 ++-----------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 3f17ad317512..f8f31a8d9269 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -23,6 +23,8 @@ endif
 
 LIBBPF = $(BPF_PATH)libbpf.a
 
+BPFTOOL_VERSION=$(shell make --no-print-directory -sC ../../.. kernelversion)
+
 $(LIBBPF): FORCE
 	$(Q)$(MAKE) -C $(BPF_DIR) OUTPUT=$(OUTPUT) $(OUTPUT)libbpf.a FEATURES_DUMP=$(FEATURE_DUMP_EXPORT)
 
@@ -38,6 +40,7 @@ CC = gcc
 CFLAGS += -O2
 CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow
 CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/
+CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"'
 LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
 
 INSTALL ?= install
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index ecd53ccf1239..3a0396d87c42 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -38,7 +38,6 @@
 #include <errno.h>
 #include <getopt.h>
 #include <linux/bpf.h>
-#include <linux/version.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -95,21 +94,13 @@ static int do_help(int argc, char **argv)
 
 static int do_version(int argc, char **argv)
 {
-	unsigned int version[3];
-
-	version[0] = LINUX_VERSION_CODE >> 16;
-	version[1] = LINUX_VERSION_CODE >> 8 & 0xf;
-	version[2] = LINUX_VERSION_CODE & 0xf;
-
 	if (json_output) {
 		jsonw_start_object(json_wtr);
 		jsonw_name(json_wtr, "version");
-		jsonw_printf(json_wtr, "\"%u.%u.%u\"",
-			     version[0], version[1], version[2]);
+		jsonw_printf(json_wtr, "\"%s\"", BPFTOOL_VERSION);
 		jsonw_end_object(json_wtr);
 	} else {
-		printf("%s v%u.%u.%u\n", bin_name,
-		       version[0], version[1], version[2]);
+		printf("%s v%s\n", bin_name, BPFTOOL_VERSION);
 	}
 	return 0;
 }
-- 
2.14.3

^ permalink raw reply related

* [PATCH v3 bpf-next 2/2] tools/bpftool: fix bpftool build with bintutils >= 2.9
From: Roman Gushchin @ 2017-12-27 19:16 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel-team, Roman Gushchin, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20171227191629.4920-1-guro@fb.com>

Bpftool build is broken with binutils version 2.29 and later.
The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
in the binutils repo, which changed the disassembler() function
signature.

Fix this by adding a new "feature" to the tools/build/features
infrastructure and make it responsible for decision which
disassembler() function signature to use.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/Makefile                                | 29 +++++++++++++++++++++++
 tools/bpf/bpf_jit_disasm.c                        |  7 ++++++
 tools/bpf/bpftool/Makefile                        | 24 +++++++++++++++++++
 tools/bpf/bpftool/jit_disasm.c                    |  7 ++++++
 tools/build/feature/Makefile                      |  4 ++++
 tools/build/feature/test-disassembler-four-args.c | 15 ++++++++++++
 6 files changed, 86 insertions(+)
 create mode 100644 tools/build/feature/test-disassembler-four-args.c

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index 07a6697466ef..c8ec0ae16bf0 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -9,6 +9,35 @@ MAKE = make
 CFLAGS += -Wall -O2
 CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
 
+ifeq ($(srctree),)
+srctree := $(patsubst %/,%,$(dir $(CURDIR)))
+srctree := $(patsubst %/,%,$(dir $(srctree)))
+endif
+
+FEATURE_USER = .bpf
+FEATURE_TESTS = libbfd disassembler-four-args
+FEATURE_DISPLAY = libbfd disassembler-four-args
+
+check_feat := 1
+NON_CHECK_FEAT_TARGETS := clean bpftool_clean
+ifdef MAKECMDGOALS
+ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),)
+  check_feat := 0
+endif
+endif
+
+ifeq ($(check_feat),1)
+ifeq ($(FEATURES_DUMP),)
+include $(srctree)/tools/build/Makefile.feature
+else
+include $(FEATURES_DUMP)
+endif
+endif
+
+ifeq ($(feature-disassembler-four-args), 1)
+CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
+endif
+
 %.yacc.c: %.y
 	$(YACC) -o $@ -d $<
 
diff --git a/tools/bpf/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
index 75bf526a0168..30044bc4f389 100644
--- a/tools/bpf/bpf_jit_disasm.c
+++ b/tools/bpf/bpf_jit_disasm.c
@@ -72,7 +72,14 @@ static void get_asm_insns(uint8_t *image, size_t len, int opcodes)
 
 	disassemble_init_for_target(&info);
 
+#ifdef DISASM_FOUR_ARGS_SIGNATURE
+	disassemble = disassembler(info.arch,
+				   bfd_big_endian(bfdf),
+				   info.mach,
+				   bfdf);
+#else
 	disassemble = disassembler(bfdf);
+#endif
 	assert(disassemble);
 
 	do {
diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index f8f31a8d9269..2237bc43f71c 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -46,6 +46,30 @@ LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
 INSTALL ?= install
 RM ?= rm -f
 
+FEATURE_USER = .bpftool
+FEATURE_TESTS = libbfd disassembler-four-args
+FEATURE_DISPLAY = libbfd disassembler-four-args
+
+check_feat := 1
+NON_CHECK_FEAT_TARGETS := clean uninstall doc doc-clean doc-install doc-uninstall
+ifdef MAKECMDGOALS
+ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),)
+  check_feat := 0
+endif
+endif
+
+ifeq ($(check_feat),1)
+ifeq ($(FEATURES_DUMP),)
+include $(srctree)/tools/build/Makefile.feature
+else
+include $(FEATURES_DUMP)
+endif
+endif
+
+ifeq ($(feature-disassembler-four-args), 1)
+CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
+endif
+
 include $(wildcard *.d)
 
 all: $(OUTPUT)bpftool
diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
index 1551d3918d4c..57d32e8a1391 100644
--- a/tools/bpf/bpftool/jit_disasm.c
+++ b/tools/bpf/bpftool/jit_disasm.c
@@ -107,7 +107,14 @@ void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes)
 
 	disassemble_init_for_target(&info);
 
+#ifdef DISASM_FOUR_ARGS_SIGNATURE
+	disassemble = disassembler(info.arch,
+				   bfd_big_endian(bfdf),
+				   info.mach,
+				   bfdf);
+#else
 	disassemble = disassembler(bfdf);
+#endif
 	assert(disassemble);
 
 	if (json_output)
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 96982640fbf8..17f2c73fff8b 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -13,6 +13,7 @@ FILES=                                          \
          test-hello.bin                         \
          test-libaudit.bin                      \
          test-libbfd.bin                        \
+         test-disassembler-four-args.bin        \
          test-liberty.bin                       \
          test-liberty-z.bin                     \
          test-cplus-demangle.bin                \
@@ -188,6 +189,9 @@ $(OUTPUT)test-libpython-version.bin:
 $(OUTPUT)test-libbfd.bin:
 	$(BUILD) -DPACKAGE='"perf"' -lbfd -lz -liberty -ldl
 
+$(OUTPUT)test-disassembler-four-args.bin:
+	$(BUILD) -lbfd -lopcodes
+
 $(OUTPUT)test-liberty.bin:
 	$(CC) $(CFLAGS) -Wall -Werror -o $@ test-libbfd.c -DPACKAGE='"perf"' $(LDFLAGS) -lbfd -ldl -liberty
 
diff --git a/tools/build/feature/test-disassembler-four-args.c b/tools/build/feature/test-disassembler-four-args.c
new file mode 100644
index 000000000000..45ce65cfddf0
--- /dev/null
+++ b/tools/build/feature/test-disassembler-four-args.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <bfd.h>
+#include <dis-asm.h>
+
+int main(void)
+{
+	bfd *abfd = bfd_openr(NULL, NULL);
+
+	disassembler(bfd_get_arch(abfd),
+		     bfd_big_endian(abfd),
+		     bfd_get_mach(abfd),
+		     abfd);
+
+	return 0;
+}
-- 
2.14.3

^ permalink raw reply related

* Re: WARNING in strp_data_ready
From: Dmitry Vyukov @ 2017-12-27 19:20 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, syzbot, David S. Miller, Eric Biggers, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, Tom Herbert,
	Cong Wang
In-Reply-To: <CALx6S37RTEzd5pABpULPfrsoe-Huj0GZtQpOUfG=tT9dn5wL_A@mail.gmail.com>

On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert <tom@herbertland.com> wrote:
> Did you try the patch I posted?

Hi Tom,

No. And I didn't know I need to. Why?
If you think the patch needs additional testing, you can ask syzbot to
test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
Otherwise proceed with committing it. Or what are we waiting for?

Thanks



> On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>> <john.fastabend@gmail.com> wrote:
>>>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>>>> .config is attached
>>>>>> Raw console output is attached.
>>>>>> C reproducer is attached
>>>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>>> for information about syzkaller reproducers
>>>>>>
>>>>>>
>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>>>
>>>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>>> Call Trace:
>>>>>>  <IRQ>
>>>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>> RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>>>> RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>>>> RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>>>> RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>>>> R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>>>> R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>>>
>>>>> Looks like KCM is calling sk_data_ready() without first taking the
>>>>> sock lock.
>>>>>
>>>>> /* Called with lower sock held */
>>>>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>>>> {
>>>>>  [...]
>>>>>         if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>>>
>>>>> In this case kcm->sk is not the same lock the comment is referring to.
>>>>> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>>>
>>>>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>>>> I don't have anything better in mind immediately.
>>>>>
>>>> The sock locks are taken in reverse order in the send path so so
>>>> grabbing kcm sock lock with lower lock held to call sk_data_ready may
>>>> lead to deadlock like I think.
>>>>
>>>> It might be possible to change the order in the send path to do this.
>>>> Something like:
>>>>
>>>> trylock on lower socket lock
>>>> -if trylock fails
>>>>   - release kcm sock lock
>>>>   - lock lower sock
>>>>   - lock kcm sock
>>>> - call sendpage locked function
>>>>
>>>> I admit that dealing with two levels of socket locks in the data path
>>>> is quite a pain :-)
>>>
>>> up
>>>
>>> still happening and we've lost 50K+ test VMs on this
>>
>> up
>>
>> Still happens and number of crashes crossed 60K, can we do something
>> with this please?

^ permalink raw reply

* Re: [PATCH net] phylink: ensure we report link down when LOS asserted
From: Florian Fainelli @ 2017-12-27 19:27 UTC (permalink / raw)
  To: Russell King, Andrew Lunn; +Cc: netdev
In-Reply-To: <E1eTyRE-0007ZC-Ap@rmk-PC.armlinux.org.uk>



On 12/26/2017 03:15 PM, Russell King wrote:
> Although we disable the netdev carrier, we fail to report in the kernel
> log that the link went down.  Fix this.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] sfp: fix sfp-bus oops when removing socket/upstream
From: Florian Fainelli @ 2017-12-27 19:29 UTC (permalink / raw)
  To: Russell King, Andrew Lunn; +Cc: netdev
In-Reply-To: <E1eTyRJ-0007ZJ-IS@rmk-PC.armlinux.org.uk>



On 12/26/2017 03:15 PM, Russell King wrote:
> When we remove a socket or upstream, and the other side isn't
> registered, we dereference a NULL pointer, causing a kernel oops.
> Fix this.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: ce0aa27ff3f6 ("sfp: add sfp-bus to bridge between network devices
and sfp cages")
-- 
Florian

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Andrew Lunn @ 2017-12-27 19:31 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Ahern, netdev, davem, arkadis, mlxsw, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, matanb, leonro,
	idosch, jakub.kicinski, ast, daniel, simon.horman,
	pieter.jansenvanvuuren, john.hurley, alexander.h.duyck, linville,
	gospo, steven.lin1, yuvalm, ogerlitz, roopa
In-Reply-To: <20171227131531.GE1997@nanopsycho>

> Hmm. That documents mainly sysfs. No mention of Netlink at all. But
> maybe I missed it. Also, that defines the interface as is. However we
> are talking about the data exchanged over the interface, not the
> interface itself. I don't see how ASIC/HW specific thing, like for
> example KVD in our case could be part of kernel ABI.

You need to be very careful here. As soon as somebody starts using it,
it might become an ABI. Or you need to clearly document it is not ABI,
there is no guarantee it will not disappear or change its meaning in
the next kernel, and it should be used with extreme caution.

Personally, if DSA drivers were to expose such settings, i would
consider them ABI, which people can rely on to remain stable.

	Andrew

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Roopa Prabhu @ 2017-12-27 19:43 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, netdev, David Miller, Arkadi Sharshevsky, mlxsw,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Michael Chan,
	ganeshgr, Saeed Mahameed, matanb, leonro, Ido Schimmel,
	jakub.kicinski, ast, Daniel Borkmann, Simon Horman,
	pieter.jansenvanvuuren, john.hurley, Alexander Duyck,
	John W. Linville, Andy Gospodarek <gospo@
In-Reply-To: <ae70d810-8277-899b-b2a9-6b2dbdd5eb21@cumulusnetworks.com>

On Wed, Dec 27, 2017 at 8:34 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 12/27/17 2:09 AM, Jiri Pirko wrote:
>> Wed, Dec 27, 2017 at 05:05:09AM CET, dsa@cumulusnetworks.com wrote:
>>> On 12/26/17 5:23 AM, Jiri Pirko wrote:
>>>> From: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>> Many of the ASIC's internal resources are limited and are shared between
>>>> several hardware procedures. For example, unified hash-based memory can
>>>> be used for many lookup purposes, like FDB and LPM. In many cases the user
>>>> can provide a partitioning scheme for such a resource in order to perform
>>>> fine tuning for his application. In such cases performing driver reload is
>>>> needed for the changes to take place, thus this patchset also adds support
>>>> for hot reload.
>>>>
>>>> Such an abstraction can be coupled with devlink's dpipe interface, which
>>>> models the ASIC's pipeline as a graph of match/action tables. By modeling
>>>> the hardware resource object, and by coupling it to several dpipe tables,
>>>> further visibility can be achieved in order to debug ASIC-wide issues.
>>>>
>>>> The proposed interface will provide the user the ability to understand the
>>>> limitations of the hardware, and receive notification regarding its occupancy.
>>>> Furthermore, monitoring the resource occupancy can be done in real-time and
>>>> can be useful in many cases.
>>>
>>> In the last RFC (not v1, but RFC) I asked for some kind of description
>>> for each resource, and you and Arkadi have pushed back. Let's walk
>>> through an example to see what I mean:
>>>
>>> $ devlink resource show pci/0000:03:00.0
>>> pci/0000:03:00.0:
>>>  name kvd size 245760 size_valid true
>>>  resources:
>>>    name linear size 98304 occ 0
>>>    name hash_double size 60416
>>>    name hash_single size 87040
>>>
>>> So this 2700 has 3 resources that can be managed -- some table or
>>> resource or something named 'kvd' with linear, hash_double and
>>> hash_single sub-resources. What are these names referring too? The above
>>> output gives no description, and 'kvd' is not an industry term. Further,
>>
>> This are internal resources specific to the ASIC. Would you like some
>> description to each or something like that?
>
> devlink has some nice self-documenting capabilities. What's missing here
> is a description of what the resource is used for in standard terms --
> ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a
> short list versus an exhaustive list of everything it is used for. e.g.,
> Why would a user decrease linear and increase hash_single or vice versa?


Arkadi, on what david says above, can the resource names and ids not
be driver specific, but moved up to the switchdev layer and just map
to fdb, host routes, nexthops table sizes etc ?.
Can these generic networking resources then in-turn be mapped to kvd
sizes by the driver ?

^ permalink raw reply

* Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode
From: Marcelo Ricardo Leitner @ 2017-12-27 19:56 UTC (permalink / raw)
  To: Chris Mi; +Cc: netdev, gerlitz.or, stephen, dsahern
In-Reply-To: <20171225084658.24076-4-chrism@mellanox.com>

On Mon, Dec 25, 2017 at 05:46:57PM +0900, Chris Mi wrote:
> @@ -267,6 +287,7 @@ int main(int argc, char **argv)
>  {
>  	int ret;
>  	char *batch_file = NULL;
> +	int batch_size = 1;
>  
>  	while (argc > 1) {
>  		if (argv[1][0] != '-')
> @@ -297,6 +318,14 @@ int main(int argc, char **argv)
>  			if (argc <= 1)
>  				usage();
>  			batch_file = argv[1];
> +		} else if (matches(argv[1], "-batchsize") == 0 ||
> +				matches(argv[1], "-bs") == 0) {
> +			argc--;	argv++;
> +			if (argc <= 1)
> +				usage();
> +			batch_size = atoi(argv[1]);
> +			if (batch_size > MSG_IOV_MAX)
> +				batch_size = MSG_IOV_MAX;

what about
if (batch_size < 1)
	batch_size = 1;

>  		} else if (matches(argv[1], "-netns") == 0) {
>  			NEXT_ARG();
>  			if (netns_switch(argv[1]))

^ permalink raw reply

* Re: WARNING in strp_data_ready
From: Dmitry Vyukov @ 2017-12-27 20:14 UTC (permalink / raw)
  To: Ozgur
  Cc: Tom Herbert, John Fastabend, syzbot, David S. Miller,
	Eric Biggers, LKML, Linux Kernel Network Developers,
	syzkaller-bugs@googlegroups.com, Tom Herbert, Cong Wang
In-Reply-To: <1164631514405294@web52j.yandex.ru>

On Wed, Dec 27, 2017 at 9:08 PM, Ozgur <ozgur@goosey.org> wrote:
>
>
> 27.12.2017, 22:21, "Dmitry Vyukov" <dvyukov@google.com>:
>> On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>  Did you try the patch I posted?
>>
>> Hi Tom,
>
> Hello Dmitry,
>
>> No. And I didn't know I need to. Why?
>> If you think the patch needs additional testing, you can ask syzbot to
>> test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
>> Otherwise proceed with committing it. Or what are we waiting for?
>>
>> Thanks
>
> I think we need to fixed patch for crash, in fact check to patch code and test solve the bug.
> How do test it because there is no patch in the following bug?

Hi Ozgur,

I am not sure I completely understand what you mean. But the
reproducer for this bug (which one can use for testing) is here:
https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
Tom also mentions there is some patch for this, but I don't know where
it is, it doesn't seem to be referenced from this thread.


> The fix patch should be for this net/kcm/kcmsock.c file and  lock functions must be added calling sk_data_ready ().
> Regards
>
> Ozgur
>
>>>  On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>>  On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>>>>  <john.fastabend@gmail.com> wrote:
>>>>>>>  On 10/24/2017 08:20 AM, syzbot wrote:
>>>>>>>>  Hello,
>>>>>>>>
>>>>>>>>  syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>>>>>  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>>>>>  compiler: gcc (GCC) 7.1.1 20170620
>>>>>>>>  .config is attached
>>>>>>>>  Raw console output is attached.
>>>>>>>>  C reproducer is attached
>>>>>>>>  syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>>>>>  for information about syzkaller reproducers
>>>>>>>>
>>>>>>>>  WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>>>>  WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>>>>  WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>>>>  Kernel panic - not syncing: panic_on_warn set ...
>>>>>>>>
>>>>>>>>  CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>>>>>  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>>>>>  Call Trace:
>>>>>>>>   <IRQ>
>>>>>>>>   __dump_stack lib/dump_stack.c:16 [inline]
>>>>>>>>   dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>>>>   panic+0x1e4/0x417 kernel/panic.c:181
>>>>>>>>   __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>>>>   report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>>>>   fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>>>>>   do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>>>>>   do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>>>>>   do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>>>>>   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>>>>>   invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>>>>>>  RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>>>>  RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>>>>  RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>>>>  RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>>>>>>  RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>>>>>>  RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>>>>>>  RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>>>>>>  R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>>>>>>  R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>>>>>   psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>>>>>
>>>>>>>  Looks like KCM is calling sk_data_ready() without first taking the
>>>>>>>  sock lock.
>>>>>>>
>>>>>>>  /* Called with lower sock held */
>>>>>>>  static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>>>>>>  {
>>>>>>>   [...]
>>>>>>>          if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>>>>>
>>>>>>>  In this case kcm->sk is not the same lock the comment is referring to.
>>>>>>>  And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>>>>>
>>>>>>>  @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>>>>>>  I don't have anything better in mind immediately.
>>>>>>  The sock locks are taken in reverse order in the send path so so
>>>>>>  grabbing kcm sock lock with lower lock held to call sk_data_ready may
>>>>>>  lead to deadlock like I think.
>>>>>>
>>>>>>  It might be possible to change the order in the send path to do this.
>>>>>>  Something like:
>>>>>>
>>>>>>  trylock on lower socket lock
>>>>>>  -if trylock fails
>>>>>>    - release kcm sock lock
>>>>>>    - lock lower sock
>>>>>>    - lock kcm sock
>>>>>>  - call sendpage locked function
>>>>>>
>>>>>>  I admit that dealing with two levels of socket locks in the data path
>>>>>>  is quite a pain :-)
>>>>>
>>>>>  up
>>>>>
>>>>>  still happening and we've lost 50K+ test VMs on this
>>>>
>>>>  up
>>>>
>>>>  Still happens and number of crashes crossed 60K, can we do something
>>>>  with this please?

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Arkadi Sharshevsky @ 2017-12-27 20:15 UTC (permalink / raw)
  To: David Ahern, Jiri Pirko
  Cc: netdev, davem, mlxsw, andrew, vivien.didelot, f.fainelli,
	michael.chan, ganeshgr, saeedm, matanb, leonro, idosch,
	jakub.kicinski, ast, daniel, simon.horman, pieter.jansenvanvuuren,
	john.hurley, alexander.h.duyck, linville, gospo, steven.lin1,
	yuvalm, ogerlitz, roopa
In-Reply-To: <ae70d810-8277-899b-b2a9-6b2dbdd5eb21@cumulusnetworks.com>



On 12/27/2017 06:34 PM, David Ahern wrote:
> On 12/27/17 2:09 AM, Jiri Pirko wrote:
>> Wed, Dec 27, 2017 at 05:05:09AM CET, dsa@cumulusnetworks.com wrote:
>>> On 12/26/17 5:23 AM, Jiri Pirko wrote:
>>>> From: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>> Many of the ASIC's internal resources are limited and are shared between
>>>> several hardware procedures. For example, unified hash-based memory can
>>>> be used for many lookup purposes, like FDB and LPM. In many cases the user
>>>> can provide a partitioning scheme for such a resource in order to perform
>>>> fine tuning for his application. In such cases performing driver reload is
>>>> needed for the changes to take place, thus this patchset also adds support
>>>> for hot reload.
>>>>
>>>> Such an abstraction can be coupled with devlink's dpipe interface, which
>>>> models the ASIC's pipeline as a graph of match/action tables. By modeling
>>>> the hardware resource object, and by coupling it to several dpipe tables,
>>>> further visibility can be achieved in order to debug ASIC-wide issues.
>>>>
>>>> The proposed interface will provide the user the ability to understand the
>>>> limitations of the hardware, and receive notification regarding its occupancy.
>>>> Furthermore, monitoring the resource occupancy can be done in real-time and
>>>> can be useful in many cases.
>>>
>>> In the last RFC (not v1, but RFC) I asked for some kind of description
>>> for each resource, and you and Arkadi have pushed back. Let's walk
>>> through an example to see what I mean:
>>>
>>> $ devlink resource show pci/0000:03:00.0
>>> pci/0000:03:00.0:
>>>  name kvd size 245760 size_valid true
>>>  resources:
>>>    name linear size 98304 occ 0
>>>    name hash_double size 60416
>>>    name hash_single size 87040
>>>
>>> So this 2700 has 3 resources that can be managed -- some table or
>>> resource or something named 'kvd' with linear, hash_double and
>>> hash_single sub-resources. What are these names referring too? The above
>>> output gives no description, and 'kvd' is not an industry term. Further,
>>
>> This are internal resources specific to the ASIC. Would you like some
>> description to each or something like that?
> 
> devlink has some nice self-documenting capabilities. What's missing here
> is a description of what the resource is used for in standard terms --
> ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a
> short list versus an exhaustive list of everything it is used for. e.g.,
> Why would a user decrease linear and increase hash_single or vice versa?
> 
>>
>>
>>> what are these sizes that a user can control? The output contains no
>>> units, no description, nothing. In short, the above output provides
>>> random numbers associated with random names.
>>
>> Units are now exposed from kernel, just this version of iproute2 patch
>> does not display it.
> 
> please provide an iproute2 patch that does so the full context if this
> patch set can be reviewed from a user perspective.
> 
>>
>>
>>>
>>> I can see dpipe tables exported by this device:
>>>
>>> $ devlink dpipe header show pci/0000:03:00.0
>>>
>>> pci/0000:03:00.0:
>>>  name mlxsw_meta
>>>  field:
>>>    name erif_port bitwidth 32 mapping_type ifindex
>>>    name l3_forward bitwidth 1
>>>    name l3_drop bitwidth 1
>>>    name adj_index bitwidth 32
>>>    name adj_size bitwidth 32
>>>    name adj_hash_index bitwidth 32
>>>
>>>  name ipv6
>>>  field:
>>>    name destination ip bitwidth 128
>>>
>>>  name ipv4
>>>  field:
>>>    name destination ip bitwidth 32
>>>
>>>  name ethernet
>>>  field:
>>>    name destination mac bitwidth 48
>>>
>>> but none mention 'kvd' or 'linear' or 'hash" and none of the other
>>> various devlink options:
>>>
>>> $ devlink
>>> Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }
>>> where  OBJECT := { dev | port | sb | monitor | dpipe }
>>>
>>> seem to related to resources.
>>>
>>> So how does a user know what they are controlling by this 'resource'
>>> option? Is the user expected to have a PRM or user guide on hand for the
>>> specific device model that is being configured?
>>
>> The relation of specific dpipe table to specific resource is exposed by
>> the kernel as well. Probably the iproute2 patch just does not display
>> it.
> 
> please provide an iproute2 patch that does so the full context if this
> patch set can be reviewed from a user perspective.
>

As Yuval stated you are using the wrong command here.
You are printing the headers not the tables. On each dpipe
table you can see the resource  it is using (the resource
path aka host table uses /kvd/hash_single for example).

This is already working. Just try it.

>>
>>
>>>
>>> Again, I have no objections to kvd, linear, hash, etc terms as they do
>>> relate to Mellanox products. But kvd/linear, for example, does correlate
>>> to industry standard concepts in some way. My request is that the
>>> resource listing guide the user in some way, stating what these
>>> resources mean.
>>
>> So the showed relation to dpipe table would be enougn or you would still
>> like to see some description? I don't like the description concept here
>> as the relations to dpipe table should tell user exactly what he needs
>> to know.
> 
> I believe it is useful to have a 1-line, short description that gives
> the user some memory jogger as to what the resource is used for. It does
> not have to be an exhaustive list, but the user should not have to do
> mental jumping jacks running a bunch of commands to understand the
> resources for vendor specific asics.
> 

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: David Ahern @ 2017-12-27 20:16 UTC (permalink / raw)
  To: Andrew Lunn, Jiri Pirko
  Cc: netdev, davem, arkadis, mlxsw, vivien.didelot, f.fainelli,
	michael.chan, ganeshgr, saeedm, matanb, leonro, idosch,
	jakub.kicinski, ast, daniel, simon.horman, pieter.jansenvanvuuren,
	john.hurley, alexander.h.duyck, linville, gospo, steven.lin1,
	yuvalm, ogerlitz, roopa
In-Reply-To: <20171227193110.GA5494@lunn.ch>

On 12/27/17 1:31 PM, Andrew Lunn wrote:
>> Hmm. That documents mainly sysfs. No mention of Netlink at all. But
>> maybe I missed it. Also, that defines the interface as is. However we
>> are talking about the data exchanged over the interface, not the
>> interface itself. I don't see how ASIC/HW specific thing, like for
>> example KVD in our case could be part of kernel ABI.
> 
> You need to be very careful here. As soon as somebody starts using it,
> it might become an ABI. Or you need to clearly document it is not ABI,
> there is no guarantee it will not disappear or change its meaning in
> the next kernel, and it should be used with extreme caution.
> 

+1

Once the names go in, people can write scripts that invoke devlink at
boot to partition resources. With the proposed patch set, the name
(e.g., kvd/linear) becomes part of the ABI.

^ permalink raw reply

* Re: [PATCHv3 0/2] capability controlled user-namespaces
From: Michael Kerrisk (man-pages) @ 2017-12-27 20:23 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: James Morris, LKML, Netdev, Kernel-hardening, Linux API,
	Kees Cook, Serge Hallyn, Eric W . Biederman, Eric Dumazet,
	David Miller, Mahesh Bandewar
In-Reply-To: <CAF2d9jit74_VCdD-pEFy3bJo2W1-0cDo0BOJC5beJiy8yFPCWg@mail.gmail.com>

Hello Mahesh,

On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार)
<maheshb@google.com> wrote:
> Hello James,
>
> Seems like I missed your name to be added into the review of this
> patch series. Would you be willing be pull this into the security
> tree? Serge Hallyn has already ACKed it.

We seem to have no formal documentation/specification of this feature.
I think that should be written up before this patch goes into
mainline...

Cheers,

Michael


>
> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar <mahesh@bandewar.net> wrote:
>> From: Mahesh Bandewar <maheshb@google.com>
>>
>> TL;DR version
>> -------------
>> Creating a sandbox environment with namespaces is challenging
>> considering what these sandboxed processes can engage into. e.g.
>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>> Current form of user-namespaces, however, if changed a bit can allow
>> us to create a sandbox environment without locking down user-
>> namespaces.
>>
>> Detailed version
>> ----------------
>>
>> Problem
>> -------
>> User-namespaces in the current form have increased the attack surface as
>> any process can acquire capabilities which are not available to them (by
>> default) by performing combination of clone()/unshare()/setns() syscalls.
>>
>>     #define _GNU_SOURCE
>>     #include <stdio.h>
>>     #include <sched.h>
>>     #include <netinet/in.h>
>>
>>     int main(int ac, char **av)
>>     {
>>         int sock = -1;
>>
>>         printf("Attempting to open RAW socket before unshare()...\n");
>>         sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>         if (sock < 0) {
>>             perror("socket() SOCK_RAW failed: ");
>>         } else {
>>             printf("Successfully opened RAW-Sock before unshare().\n");
>>             close(sock);
>>             sock = -1;
>>         }
>>
>>         if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
>>             perror("unshare() failed: ");
>>             return 1;
>>         }
>>
>>         printf("Attempting to open RAW socket after unshare()...\n");
>>         sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
>>         if (sock < 0) {
>>             perror("socket() SOCK_RAW failed: ");
>>         } else {
>>             printf("Successfully opened RAW-Sock after unshare().\n");
>>             close(sock);
>>             sock = -1;
>>         }
>>
>>         return 0;
>>     }
>>
>> The above example shows how easy it is to acquire NET_RAW capabilities
>> and once acquired, these processes could take benefit of above mentioned
>> or similar issues discovered/undiscovered with malicious intent. Note
>> that this is just an example and the problem/solution is not limited
>> to NET_RAW capability *only*.
>>
>> The easiest fix one can apply here is to lock-down user-namespaces which
>> many of the distros do (i.e. don't allow users to create user namespaces),
>> but unfortunately that prevents everyone from using them.
>>
>> Approach
>> --------
>> Introduce a notion of 'controlled' user-namespaces. Every process on
>> the host is allowed to create user-namespaces (governed by the limit
>> imposed by per-ns sysctl) however, mark user-namespaces created by
>> sandboxed processes as 'controlled'. Use this 'mark' at the time of
>> capability check in conjunction with a global capability whitelist.
>> If the capability is not whitelisted, processes that belong to
>> controlled user-namespaces will not be allowed.
>>
>> Once a user-ns is marked as 'controlled'; all its child user-
>> namespaces are marked as 'controlled' too.
>>
>> A global whitelist is list of capabilities governed by the
>> sysctl which is available to (privileged) user in init-ns to modify
>> while it's applicable to all controlled user-namespaces on the host.
>>
>> Marking user-namespaces controlled without modifying the whitelist is
>> equivalent of the current behavior. The default value of whitelist includes
>> all capabilities so that the compatibility is maintained. However it gives
>> admins fine-grained ability to control various capabilities system wide
>> without locking down user-namespaces.
>>
>> Please see individual patches in this series.
>>
>> Mahesh Bandewar (2):
>>   capability: introduce sysctl for controlled user-ns capability whitelist
>>   userns: control capabilities of some user namespaces
>>
>>  Documentation/sysctl/kernel.txt | 21 +++++++++++++++++
>>  include/linux/capability.h      |  7 ++++++
>>  include/linux/user_namespace.h  | 25 ++++++++++++++++++++
>>  kernel/capability.c             | 52 +++++++++++++++++++++++++++++++++++++++++
>>  kernel/sysctl.c                 |  5 ++++
>>  kernel/user_namespace.c         |  4 ++++
>>  security/commoncap.c            |  8 +++++++
>>  7 files changed, 122 insertions(+)
>>
>> --
>> 2.15.0.531.g2ccb3012c9-goog
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox