All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Ogness <john.ogness@linutronix.de>
To: lkp@lists.01.org
Subject: Re: [printk] b031a684bf: INFO:rcu_tasks_detected_stalls_on_tasks
Date: Thu, 28 Jan 2021 16:48:29 +0106	[thread overview]
Message-ID: <877dnxm5ju.fsf@jogness.linutronix.de> (raw)
In-Reply-To: <YBLPhkHQ8cXFiY1X@alley>

[-- Attachment #1: Type: text/plain, Size: 5623 bytes --]

On 2021-01-28, Petr Mladek <pmladek@suse.com> wrote:
> If you are able to reproduce this, it might be interesting to see
> backtraces from all CPUs when RCU stall gets detected. Or even
> printing all processes, even the sleeping ones.

OK, I now have lkp running on my intel core i5 machine and it reliably
reproduces this. First let me say a word about lkp. The instructions in
the report email are a bit misleading. There are lots of steps that it
doesn't mention.

First I needed to install lkp and its dependencies. I did all of this as
root because lkp seems to get confused when sudo is involved.

# git clone https://github.com/intel/lkp-tests.git
# cd lkp-tests
# make install
# lkp install

Then I built the kernel and modules using a helper script to setup the
environment for me:

----- BEGIN /tmp/mk.sh -----
#!/bin/sh

export INSTALL_MOD_PATH=`pwd`
export HOSTCC=gcc-9
export CC=gcc-9
export ARCH=x86_64

exec make "$@"
----- END /tmp/mk.sh -----

# cd linux
# git checkout -b lkp b031a684bfd01d633c79d281bd0cf11c2f834ada
# cp /tmp/config-5.10.0-rc5-gb031a684bfd0 .config
# /tmp/mk.sh -j `nproc` olddefconfig prepare modules_prepare bzImage modules modules_install

Then I created the modules.cgz:

# find lib/modules | cpio -H newc -o | gzip -9c > modules.cgz

I put kernel and modules in /tmp/:

# cp arch/x86/boot/bzImage modules.cgz /tmp/

And ran lkp inside script(1) so I have a logfile:

# script
# lkp qemu -k /tmp/bzImage -m /tmp/modules.cgz /tmp/job-script

And reliably I see the first stall at about 925 seconds:

[  926.386441][   T22] INFO: rcu_tasks detected stalls on tasks:
[  926.387310][   T22] 0000000041eb5240: .. nvcsw: 0/0 holdout: 1 idle_cpu: -1/1
[  926.388196][   T22] task:dmesg           state:R  running task     stack:    0 pid: 1752 ppid:   511 flags:0x2002000
0
[  926.389509][   T22] Call Trace:
[  926.389962][   T22]  __schedule+0xa12/0xab5
[  926.390534][   T22]  ? firmware_map_remove+0xd1/0xd1
[  926.391190][   T22]  ? ksys_read+0x116/0x150
[  926.391773][   T22]  schedule+0x16c/0x1df
[  926.392317][   T22]  exit_to_user_mode_loop+0x28/0x84
[  926.392977][   T22]  exit_to_user_mode_prepare+0x1d/0x4a
[  926.393671][   T22]  syscall_exit_to_user_mode+0x41/0x4f
[  926.394357][   T22]  entry_INT80_compat+0x71/0x76
[  926.394982][   T22] RIP: 0023:0xf7f0fa02
[  926.395528][   T22] RSP: 002b:00000000ff850364 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  926.396590][   T22] RAX: ffffffffffffffe0 RBX: 0000000000000004 RCX: 00000000565c8234
[  926.397607][   T22] RDX: 0000000000001fff RSI: 00000000000001e0 RDI: 00000000565c8234
[  926.398617][   T22] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
[  926.399629][   T22] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  926.400643][   T22] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

But actually it is not the rcu stall that interests me so much. I am
more interested in some bizarre console output I am seeing. Here is a
snippet:

[  903.008486][  T356] [  778.732883] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.008496][  T356]
[  903.014733][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.030232][  T356] [  778.735197] [  655.203106] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.030242][  T356]
[  903.040809][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.044327][  T356] [  778.746898] [  655.214214] [  531.584847] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.044337][  T356]
[  903.048732][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.066039][  T356] [  778.751162] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.066050][  T356]
[  903.070123][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.077137][  T356] [  778.768744] [  655.227204] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.077147][  T356]
[  903.092792][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.095313][  T356] [  778.773836] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.095323][  T356]
[  903.106788][  T356] [  778.781780] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.106798][  T356]
[  903.121986][  T356] [  778.791183] [  655.230287] [  531.589973] [  407.116952] [  283.068815] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.121997][  T356]
[  903.134120][  T356] [  778.797950] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.134130][  T356]
[  903.148741][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.151626][  T356] [  778.805164] [  655.241173] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.151637][  T356]
[  903.160126][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.166936][  T356] [  778.818740] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.166946][  T356]
[  903.175785][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.185955][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.189448][  T356] [  778.825864] [  655.250559] [  531.607066] [  407.120936] tasks-torture:torture_onoff task: online 0 failed: errno -5

These multi-timestamp lines are odd. And they are almost exactly 124
seconds apart. What is going on there? The investigation continues...

John Ogness

WARNING: multiple messages have this Message-ID (diff)
From: John Ogness <john.ogness@linutronix.de>
To: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	kernel test robot <oliver.sang@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, lkp@intel.com, zhengjun.xing@linux.intel.com
Subject: Re: [printk]  b031a684bf: INFO:rcu_tasks_detected_stalls_on_tasks
Date: Thu, 28 Jan 2021 16:48:29 +0106	[thread overview]
Message-ID: <877dnxm5ju.fsf@jogness.linutronix.de> (raw)
In-Reply-To: <YBLPhkHQ8cXFiY1X@alley>

On 2021-01-28, Petr Mladek <pmladek@suse.com> wrote:
> If you are able to reproduce this, it might be interesting to see
> backtraces from all CPUs when RCU stall gets detected. Or even
> printing all processes, even the sleeping ones.

OK, I now have lkp running on my intel core i5 machine and it reliably
reproduces this. First let me say a word about lkp. The instructions in
the report email are a bit misleading. There are lots of steps that it
doesn't mention.

First I needed to install lkp and its dependencies. I did all of this as
root because lkp seems to get confused when sudo is involved.

# git clone https://github.com/intel/lkp-tests.git
# cd lkp-tests
# make install
# lkp install

Then I built the kernel and modules using a helper script to setup the
environment for me:

----- BEGIN /tmp/mk.sh -----
#!/bin/sh

export INSTALL_MOD_PATH=`pwd`
export HOSTCC=gcc-9
export CC=gcc-9
export ARCH=x86_64

exec make "$@"
----- END /tmp/mk.sh -----

# cd linux
# git checkout -b lkp b031a684bfd01d633c79d281bd0cf11c2f834ada
# cp /tmp/config-5.10.0-rc5-gb031a684bfd0 .config
# /tmp/mk.sh -j `nproc` olddefconfig prepare modules_prepare bzImage modules modules_install

Then I created the modules.cgz:

# find lib/modules | cpio -H newc -o | gzip -9c > modules.cgz

I put kernel and modules in /tmp/:

# cp arch/x86/boot/bzImage modules.cgz /tmp/

And ran lkp inside script(1) so I have a logfile:

# script
# lkp qemu -k /tmp/bzImage -m /tmp/modules.cgz /tmp/job-script

And reliably I see the first stall at about 925 seconds:

[  926.386441][   T22] INFO: rcu_tasks detected stalls on tasks:
[  926.387310][   T22] 0000000041eb5240: .. nvcsw: 0/0 holdout: 1 idle_cpu: -1/1
[  926.388196][   T22] task:dmesg           state:R  running task     stack:    0 pid: 1752 ppid:   511 flags:0x2002000
0
[  926.389509][   T22] Call Trace:
[  926.389962][   T22]  __schedule+0xa12/0xab5
[  926.390534][   T22]  ? firmware_map_remove+0xd1/0xd1
[  926.391190][   T22]  ? ksys_read+0x116/0x150
[  926.391773][   T22]  schedule+0x16c/0x1df
[  926.392317][   T22]  exit_to_user_mode_loop+0x28/0x84
[  926.392977][   T22]  exit_to_user_mode_prepare+0x1d/0x4a
[  926.393671][   T22]  syscall_exit_to_user_mode+0x41/0x4f
[  926.394357][   T22]  entry_INT80_compat+0x71/0x76
[  926.394982][   T22] RIP: 0023:0xf7f0fa02
[  926.395528][   T22] RSP: 002b:00000000ff850364 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  926.396590][   T22] RAX: ffffffffffffffe0 RBX: 0000000000000004 RCX: 00000000565c8234
[  926.397607][   T22] RDX: 0000000000001fff RSI: 00000000000001e0 RDI: 00000000565c8234
[  926.398617][   T22] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
[  926.399629][   T22] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  926.400643][   T22] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

But actually it is not the rcu stall that interests me so much. I am
more interested in some bizarre console output I am seeing. Here is a
snippet:

[  903.008486][  T356] [  778.732883] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.008496][  T356]
[  903.014733][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.030232][  T356] [  778.735197] [  655.203106] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.030242][  T356]
[  903.040809][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.044327][  T356] [  778.746898] [  655.214214] [  531.584847] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.044337][  T356]
[  903.048732][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.066039][  T356] [  778.751162] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.066050][  T356]
[  903.070123][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.077137][  T356] [  778.768744] [  655.227204] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.077147][  T356]
[  903.092792][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.095313][  T356] [  778.773836] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.095323][  T356]
[  903.106788][  T356] [  778.781780] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.106798][  T356]
[  903.121986][  T356] [  778.791183] [  655.230287] [  531.589973] [  407.116952] [  283.068815] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.121997][  T356]
[  903.134120][  T356] [  778.797950] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.134130][  T356]
[  903.148741][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.151626][  T356] [  778.805164] [  655.241173] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.151637][  T356]
[  903.160126][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.166936][  T356] [  778.818740] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.166946][  T356]
[  903.175785][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.185955][  T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[  903.189448][  T356] [  778.825864] [  655.250559] [  531.607066] [  407.120936] tasks-torture:torture_onoff task: online 0 failed: errno -5

These multi-timestamp lines are odd. And they are almost exactly 124
seconds apart. What is going on there? The investigation continues...

John Ogness

  reply	other threads:[~2021-01-28 15:42 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-22  8:13 [printk] b031a684bf: INFO:rcu_tasks_detected_stalls_on_tasks kernel test robot
2021-01-22  8:13 ` kernel test robot
2021-01-22 16:21 ` Petr Mladek
2021-01-22 16:21   ` Petr Mladek
2021-01-26  4:52   ` Sergey Senozhatsky
2021-01-26  4:52     ` Sergey Senozhatsky
2021-01-27 21:22     ` John Ogness
2021-01-27 21:22       ` John Ogness
2021-01-27 23:29       ` John Ogness
2021-01-27 23:29         ` John Ogness
2021-01-28  8:30       ` Sergey Senozhatsky
2021-01-28  8:30         ` Sergey Senozhatsky
2021-01-28 11:26         ` John Ogness
2021-01-28 11:26           ` John Ogness
2021-01-28 14:51           ` Petr Mladek
2021-01-28 14:51             ` Petr Mladek
2021-01-28 15:42             ` John Ogness [this message]
2021-01-28 15:42               ` John Ogness
2021-01-28 16:36               ` Petr Mladek
2021-01-28 16:36                 ` Petr Mladek
2021-01-28 17:36                 ` John Ogness
2021-01-28 17:36                   ` John Ogness
2021-01-28 21:38                   ` John Ogness
2021-01-28 21:38                     ` John Ogness
2021-01-23  2:44 ` Sergey Senozhatsky
2021-01-23  2:44   ` Sergey Senozhatsky
2021-02-02 17:13 ` John Ogness
2021-02-02 17:13   ` John Ogness
2021-02-04 21:32   ` John Ogness
2021-02-04 21:32     ` John Ogness
2021-02-05  9:32     ` Petr Mladek
2021-02-05  9:32       ` Petr Mladek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877dnxm5ju.fsf@jogness.linutronix.de \
    --to=john.ogness@linutronix.de \
    --cc=lkp@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.