From: Daniel Axtens <dja@axtens.net>
To: Dmitry Vyukov <dvyukov@google.com>,
	Casey Schaufler <casey@schaufler-ca.com>,
	Alexander Potapenko <glider@google.com>,
	clang-built-linux <clang-built-linux@googlegroups.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	syzbot <syzbot+de8d933e7d153aa0c1bb@syzkaller.appspotmail.com>,
	kasan-dev <kasan-dev@googlegroups.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Subject: Re: INFO: rcu detected stall in sys_kill
Date: Fri, 10 Jan 2020 10:25:27 +1100
Message-ID: <87a76wnrfc.fsf@dja-thinkpad.axtens.net>
In-Reply-To: <CACT4Y+axj5M4p=mZkFb1MyBw0MK1c6nWb-fKQcYSnYB8n1Cb8Q@mail.gmail.com>

Dmitry Vyukov <dvyukov@google.com> writes:

> On Thu, Jan 9, 2020 at 11:05 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>> > > > > On 1/8/2020 2:25 AM, Tetsuo Handa wrote:
>> > > > > > On 2020/01/08 15:20, Dmitry Vyukov wrote:
>> > > > > >> I temporarily re-enabled smack instance and it produced another 50
>> > > > > >> stalls all over the kernel, and now keeps spewing a dozen every hour.
>> > > > >
>> > > > > Do I have to be using clang to test this? I'm setting up to work on this,
>> > > > > and don't want to waste time using my current tool chain if the problem
>> > > > > is clang specific.
>> > > >
>> > > > Humm, interesting. Initially I was going to say that most likely it's
>> > > > not clang-related. But the smack instance is actually the only one that
>> > > > uses clang as well (except for KMSAN of course). So maybe it's indeed
>> > > > clang-related rather than smack-related. Let me try to build a kernel
>> > > > with clang.
>> > >
>> > > +clang-built-linux, glider
>> > >
>> > > [clang-built linux has been severely broken since early Dec]
>> > >
>> > > Building kernel with clang I can immediately reproduce this locally:
>> > >
>> > > $ syz-manager
>> > > 2020/01/09 09:27:15 loading corpus...
>> > > 2020/01/09 09:27:17 serving http on http://0.0.0.0:50001
>> > > 2020/01/09 09:27:17 serving rpc on tcp://[::]:45851
>> > > 2020/01/09 09:27:17 booting test machines...
>> > > 2020/01/09 09:27:17 wait for the connection from test machine...
>> > > 2020/01/09 09:29:23 machine check:
>> > > 2020/01/09 09:29:23 syscalls                : 2961/3195
>> > > 2020/01/09 09:29:23 code coverage           : enabled
>> > > 2020/01/09 09:29:23 comparison tracing      : enabled
>> > > 2020/01/09 09:29:23 extra coverage          : enabled
>> > > 2020/01/09 09:29:23 setuid sandbox          : enabled
>> > > 2020/01/09 09:29:23 namespace sandbox       : enabled
>> > > 2020/01/09 09:29:23 Android sandbox         : /sys/fs/selinux/policy does not exist
>> > > 2020/01/09 09:29:23 fault injection         : enabled
>> > > 2020/01/09 09:29:23 leak checking           : CONFIG_DEBUG_KMEMLEAK is not enabled
>> > > 2020/01/09 09:29:23 net packet injection    : enabled
>> > > 2020/01/09 09:29:23 net device setup        : enabled
>> > > 2020/01/09 09:29:23 concurrency sanitizer   : /sys/kernel/debug/kcsan does not exist
>> > > 2020/01/09 09:29:23 devlink PCI setup       : PCI device 0000:00:10.0 is not available
>> > > 2020/01/09 09:29:27 corpus                  : 50226 (0 deleted)
>> > > 2020/01/09 09:29:27 VMs 20, executed 0, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:29:37 VMs 20, executed 45, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:29:47 VMs 20, executed 74, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:29:57 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:30:07 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:30:17 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:30:27 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:30:37 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:30:47 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:30:57 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:31:07 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:31:17 VMs 20, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:31:26 vm-10: crash: INFO: rcu detected stall in do_idle
>> > > 2020/01/09 09:31:27 VMs 13, executed 80, cover 0, crashes 0, repro 0
>> > > 2020/01/09 09:31:28 vm-1: crash: INFO: rcu detected stall in sys_futex
>> > > 2020/01/09 09:31:29 vm-4: crash: INFO: rcu detected stall in sys_futex
>> > > 2020/01/09 09:31:31 vm-0: crash: INFO: rcu detected stall in sys_getsockopt
>> > > 2020/01/09 09:31:33 vm-18: crash: INFO: rcu detected stall in sys_clone3
>> > > 2020/01/09 09:31:35 vm-3: crash: INFO: rcu detected stall in sys_futex
>> > > 2020/01/09 09:31:36 vm-8: crash: INFO: rcu detected stall in do_idle
>> > > 2020/01/09 09:31:37 VMs 7, executed 80, cover 0, crashes 6, repro 0
>> > > 2020/01/09 09:31:38 vm-19: crash: INFO: rcu detected stall in schedule_tail
>> > > 2020/01/09 09:31:40 vm-6: crash: INFO: rcu detected stall in schedule_tail
>> > > 2020/01/09 09:31:42 vm-2: crash: INFO: rcu detected stall in schedule_tail
>> > > 2020/01/09 09:31:44 vm-12: crash: INFO: rcu detected stall in sys_futex
>> > > 2020/01/09 09:31:46 vm-15: crash: INFO: rcu detected stall in sys_nanosleep
>> > > 2020/01/09 09:31:47 VMs 1, executed 80, cover 0, crashes 11, repro 0
>> > > 2020/01/09 09:31:48 vm-16: crash: INFO: rcu detected stall in sys_futex
>> > > 2020/01/09 09:31:50 vm-9: crash: INFO: rcu detected stall in schedule
>> > > 2020/01/09 09:31:52 vm-13: crash: INFO: rcu detected stall in schedule_tail
>> > > 2020/01/09 09:31:54 vm-11: crash: INFO: rcu detected stall in schedule_tail
>> > > 2020/01/09 09:31:56 vm-17: crash: INFO: rcu detected stall in sys_futex
>> > > 2020/01/09 09:31:57 VMs 0, executed 80, cover 0, crashes 16, repro 0
>> > > 2020/01/09 09:31:58 vm-7: crash: INFO: rcu detected stall in sys_futex
>> > > 2020/01/09 09:32:00 vm-5: crash: INFO: rcu detected stall in dput
>> > > 2020/01/09 09:32:02 vm-14: crash: INFO: rcu detected stall in sys_nanosleep
>> > >
>> > >
>> > > Then I switched LSM to selinux and I _still_ can reproduce this. So,
>> > > Casey, you may relax, this is not smack-specific :)
>> > >
>> > > Then I disabled CONFIG_KASAN_VMALLOC and CONFIG_VMAP_STACK and it
>> > > started working normally.
>> > >
>> > > So this is somehow related to both clang and KASAN/VMAP_STACK.
>> > >
>> > > The clang I used is:
>> > > https://storage.googleapis.com/syzkaller/clang-kmsan-362913.tar.gz
>> > > (the one we use on syzbot).
>> >
>> >
>> > Clustering the hangs shows that they all happen within a very limited section of the code:
>> >
>> >       1  free_thread_stack+0x124/0x590 kernel/fork.c:284
>> >       5  free_thread_stack+0x12e/0x590 kernel/fork.c:280
>> >      39  free_thread_stack+0x12e/0x590 kernel/fork.c:284
>> >       6  free_thread_stack+0x133/0x590 kernel/fork.c:280
>> >       5  free_thread_stack+0x13d/0x590 kernel/fork.c:280
>> >       2  free_thread_stack+0x141/0x590 kernel/fork.c:280
>> >       6  free_thread_stack+0x14c/0x590 kernel/fork.c:280
>> >       9  free_thread_stack+0x151/0x590 kernel/fork.c:280
>> >       3  free_thread_stack+0x15b/0x590 kernel/fork.c:280
>> >      67  free_thread_stack+0x168/0x590 kernel/fork.c:280
>> >       6  free_thread_stack+0x16d/0x590 kernel/fork.c:284
>> >       2  free_thread_stack+0x177/0x590 kernel/fork.c:284
>> >       1  free_thread_stack+0x182/0x590 kernel/fork.c:284
>> >       1  free_thread_stack+0x186/0x590 kernel/fork.c:284
>> >      16  free_thread_stack+0x18b/0x590 kernel/fork.c:284
>> >       4  free_thread_stack+0x195/0x590 kernel/fork.c:284
>> >
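
The clustering quoted above can be reproduced with a few lines of Python.
This is only a sketch: the frame lines are copied from the message with the
quote prefixes stripped, and the names REPORT, FRAME_RE and summarize are
made up for illustration (they are not part of syzkaller or any other tool).
It sums the hit counts per source line and reports the range of offsets
inside free_thread_stack.

```python
import re
from collections import Counter

# Frame lines as quoted in the report: hit count, symbol+offset/size, file:line.
REPORT = """\
      1  free_thread_stack+0x124/0x590 kernel/fork.c:284
      5  free_thread_stack+0x12e/0x590 kernel/fork.c:280
     39  free_thread_stack+0x12e/0x590 kernel/fork.c:284
      6  free_thread_stack+0x133/0x590 kernel/fork.c:280
      5  free_thread_stack+0x13d/0x590 kernel/fork.c:280
      2  free_thread_stack+0x141/0x590 kernel/fork.c:280
      6  free_thread_stack+0x14c/0x590 kernel/fork.c:280
      9  free_thread_stack+0x151/0x590 kernel/fork.c:280
      3  free_thread_stack+0x15b/0x590 kernel/fork.c:280
     67  free_thread_stack+0x168/0x590 kernel/fork.c:280
      6  free_thread_stack+0x16d/0x590 kernel/fork.c:284
      2  free_thread_stack+0x177/0x590 kernel/fork.c:284
      1  free_thread_stack+0x182/0x590 kernel/fork.c:284
      1  free_thread_stack+0x186/0x590 kernel/fork.c:284
     16  free_thread_stack+0x18b/0x590 kernel/fork.c:284
      4  free_thread_stack+0x195/0x590 kernel/fork.c:284
"""

# Hypothetical pattern for one frame line of the form shown above.
FRAME_RE = re.compile(
    r"^\s*(\d+)\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)\s+(\S+):(\d+)$"
)


def summarize(report):
    """Return (hits per (file, line), lowest offset, highest offset)."""
    by_line = Counter()
    offsets = []
    for raw in report.splitlines():
        m = FRAME_RE.match(raw)
        if m is None:
            continue  # skip anything that is not a frame line
        count, _func, off, _size, path, lineno = m.groups()
        by_line[(path, int(lineno))] += int(count)
        offsets.append(int(off, 16))
    return by_line, min(offsets), max(offsets)


by_line, lo, hi = summarize(REPORT)
for (path, lineno), hits in sorted(by_line.items()):
    print(f"{hits:4d}  {path}:{lineno}")
print(f"offsets span {lo:#x}..{hi:#x}")
```

On the quoted data this gives 103 hits on kernel/fork.c:280 and 70 on
kernel/fork.c:284, with every offset between 0x124 and 0x195 of a 0x590-byte
function.
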
>> > Here is the disassembly of the function:
>> > https://gist.githubusercontent.com/dvyukov/a283d1aaf2ef7874001d56525279ccbd/raw/ac2478bff6472bc473f57f91a75f827cd72bb6bf/gistfile1.txt
>> >
>> > But if I am not mistaken, the function only ever jumps down. So how
>> > can it loop?...
>>
>>
>> This is a miscompilation related to static branches.
>>
>> objdump shows:
>>
>> ffffffff814878f8: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
>>  ./arch/x86/include/asm/jump_label.h:25
>> asm_volatile_goto("1:"
>>
>> However, the actual instruction in memory at the time is:
>>
>>    0xffffffff814878f8 <+408>: jmpq   0xffffffff8148787f <free_thread_stack+287>
>>
>> Which jumps to a wrong location in free_thread_stack and makes it loop.
>>
>> The static branch is this:
>>
>> static inline bool memcg_kmem_enabled(void)
>> {
>>   return static_branch_unlikely(&memcg_kmem_enabled_key);
>> }
>>
>> static inline void memcg_kmem_uncharge(struct page *page, int order)
>> {
>>   if (memcg_kmem_enabled())
>>     __memcg_kmem_uncharge(page, order);
>> }
>>
>> I suspect it may have something to do with loop unrolling. It may jump
>> to the right location, but in the wrong unrolled iteration.
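
The arithmetic behind "jumps to a wrong location and makes it loop" can be
checked directly from the two addresses quoted above. The sketch below
assumes the patched instruction is the 5-byte "jmp rel32" form (opcode e9),
which is what would replace the 5-byte nopl shown by objdump; that encoding
is my assumption, not something the message states. All names in it are
illustrative only.

```python
import struct

# Addresses as quoted in the message.
INSN_ADDR = 0xFFFFFFFF814878F8    # patched site, <free_thread_stack+408>
TARGET_ADDR = 0xFFFFFFFF8148787F  # jump target,  <free_thread_stack+287>
JMP_REL32_LEN = 5                 # assumed: opcode e9 + 4 displacement bytes

# Stall offsets inside free_thread_stack, copied from the clustered report.
STALL_OFFSETS = [
    0x124, 0x12E, 0x133, 0x13D, 0x141, 0x14C, 0x151, 0x15B,
    0x168, 0x16D, 0x177, 0x182, 0x186, 0x18B, 0x195,
]


def rel32(insn_addr, target_addr):
    """Displacement of a jmp rel32, relative to the next instruction."""
    return target_addr - (insn_addr + JMP_REL32_LEN)


disp = rel32(INSN_ADDR, TARGET_ADDR)
encoded = b"\xe9" + struct.pack("<i", disp)  # little-endian signed 32-bit

# Recover the function start from the "+408" annotation and cross-check "+287".
func_start = INSN_ADDR - 408
target_offset = TARGET_ADDR - func_start
site_offset = INSN_ADDR - func_start

# Does every reported stall fall between the jump target and the jump site?
all_inside = all(target_offset <= off <= site_offset for off in STALL_OFFSETS)

print(f"displacement: {disp} bytes, encoding: {encoded.hex()}")
print(f"loop body: +{target_offset}..+{site_offset}, stalls inside: {all_inside}")
```

The displacement comes out negative (-126 bytes), so the patched jump goes
backwards, and all of the clustered stall offsets (0x124 to 0x195) lie
between +287 and +408. That is consistent with the CPU spinning in that
range, though it does not by itself say why the jump label was patched with
that target.
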
>
>
> Kernel built with clang version 10.0.0
> (https://github.com/llvm/llvm-project.git
> c2443155a0fb245c8f17f2c1c72b6ea391e86e81) works fine.
>
> Alex, please update clang on syzbot machines.

Wow, what a bug. Very happy to be off the hook for causing it, and
feeling a lot better about my inability to reproduce it with a GCC-built
kernel!

Regards,
Daniel
