From: Stephen Berman <stephen.berman@gmx.net>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org
Subject: Re: power-off delay/hang due to commit 6d25be57 (mainline)
Date: Tue, 09 Jun 2020 12:06:23 +0200 [thread overview]
Message-ID: <87y2owwo2o.fsf@rub.de> (raw)
In-Reply-To: <20200522164012.ynyvrjompv42jtmx@linutronix.de> (Sebastian Andrzej Siewior's message of "Fri, 22 May 2020 18:40:12 +0200")
[-- Attachment #1: Type: text/plain, Size: 4466 bytes --]
On Fri, 22 May 2020 18:40:12 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> Sorry for the late reply.
No problem, since as it turned out, I didn't have to time till now to
follow up on your latest suggestions. Details below.
> On 2020-05-14 23:39:40 [+0200], Stephen Berman wrote:
>> >> How will I know if that happens, is there a specific message in the tty?
>> >
>> > On the tty console where you see the "timing out command, waited"
>> > message, there should be something starting with
>> > |BUG: workqueue lockup - pool
>> >
>> > following with the pool information that got stuck. That code checks the
>> > workqueues every 30secs by default. So if you waited >= 60secs then
>> > system is not detecting a stall.
>>
>> As you can see in the photo, there was no message about a workqueue
>> lockup, only "task halt:5320 blocked for more than <XXX> seconds" every
>> two minutes. I suppose that comes from one of the other options I
>> enabled. Does it reveal anything about the problem?
>
>>From the picture, you are on your way to level 0, which would issue the
> final shutdown command, but you are not quite there yet.
>
> If you add a printk() to the reboot syscall, then I wouldn't expect you
> to see it. (something like that):
>
> diff --git a/kernel/reboot.c b/kernel/reboot.c
> index c4d472b7f1b42..19bc35bc0cda0 100644
> --- a/kernel/reboot.c
> +++ b/kernel/reboot.c
> @@ -314,6 +314,7 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
> char buffer[256];
> int ret = 0;
>
> + pr_err("%s(%d)CMD: %lx\n", __func__, __LINE__, cmd);
> /* We only trust the superuser with rebooting the system. */
> if (!ns_capable(pid_ns->user_ns, CAP_SYS_BOOT))
> return -EPERM;
>
> If you add "ignore_loglevel initcall_debug" to the command line then you
> should see the init callbacks of each driver. But there will be nothing
> on your shutdown (as I expect it).
>
> The "task X blocked for more than 120 secs" is part of the hung task
> detector. With the "ignore_loglevel" above you should be able to see the
> callchain of the task. I suspect that the task poked the cd-drive which
> isn't answering. So from detector's point of view, the task issued a
> system call which appears to hang an makes no progress.
>
>> > Could
>> > you please check if the stall-dector says something?
>>
>> Is that the message I repeated above or do you mean the workqueue?
>
> The hung message is not workqueue related. It is the task `halt' that
> makes no progress. There is not stall of the workqueue as far as the
> system can tell.
> The two boot options, I suggested above, may reveal additional
> information that are printed but suppressed due to the loglevel.
>
> My guess now is that maybe shutting down wifi also paused the AHCI
> controller which makes no progress now. So booting without cdrom/disk
> should not cause any problems.
>
> Could you please:
> - try booting with "ignore_loglevel initcall_debug" and see if
> additional information is printed on the console.
>
> - Remove cd / ATA-disk to check if anything else causes a stall. As by
> your report you only mentioned those two (and if I see it correctly,
> you rootFS is on nvme so removing the disk might be doable).
>
> - Could you remove the Wifi (just the driver, no the physical hw) to see
> if it makes any difference?
I recompiled kernel 5.6.4 with the printk() call you suggested, then
booted the kernel with "ignore_loglevel initcall_debug" (but leaving the
CDROM and wifi intact for now). After working as I normally do, I
called `shutdown -h now', again as usual. After the "Bringing down the
loopback interface" message there were these two messages:
reboot: __do_sys_reboot(317)CMD: 89abcdef
reboot: __do_sys_reboot(317)CMD: 4321fedc
Then nothing more for two minutes, then, as previously:
sr 5:0:0:0: tag#10 timing out command, waited 120 seconds.
Then I did a hard reboot.
This morning I detached the cables to the CDROM (but left the disk in
the box) and again booted 5.6.4 with "ignore_loglevel initcall_debug".
After working for a bit, I called `shutdown -h now', and now there were
quite a few more messages, but again the machine did not power off.
Here is the ouput, which I transcribed carefully, hopefully without
error (I omitted the fractional parts of the timestamps and some of the
usb1-portX messages):
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: no-cdrom-shutdown-5.6.4 --]
[-- Type: application/text/plain, Size: 4598 bytes --]
Sending all processes the TERM signal...
[1123.***] elogind-daemon[4318]: Received signal 15 [TERM]
Sending all processes the KILL signal...
Deactivating all swap files/partitions...
Unmounting all other currently mounted file systems...
[1123.***] EXT4-fs (nvme0n1p5):re-mounted. Opts: (null)
* Bringing down the loopback interface...
[1123.***] reboot: __do_sys_reboot(317)CMD: 89abcdef
[1123.***] reboot: __do_sys_reboot(317)CMD: 4321fedc
[1124.***] snd_hda_codec_generic hdaudioC0D2: shutdown
[1124.***] snd_hda_codec_realtek hdaudioC0D0: shutdown
[1124.***] snd_hda_intel 0000:00:1f.3: shutdown
[1124.***] sd 4:0:0:0: shutdown
[1124.***] sd 4:0:0:0: shutdown [sda] Synchronizing SCSI cache
[1124.***] sd 4:0:0:0: shutdown [sda] Stopping disk
[1125.***] i2c i2c-4: shutdown
[1125.***] alarmtimer alarmtimer.0.auto: shutdown
[1125.***] usb usb2-port10: shutdown
[1125.***] usb usb2-port9: shutdown
[1125.***] usb usb2-port8: shutdown
[1125.***] usb usb2-port7: shutdown
[1125.***] usb usb2-port6: shutdown
[1125.***] usb usb2-port5: shutdown
[1125.***] usb usb2-port4: shutdown
[1125.***] usb usb2-port3: shutdown
[1125.***] usb usb2-port2: shutdown
[1125.***] usb usb2-port1: shutdown
[1125.***] usb usb1-port16: shutdown
[1125.***] ...
[1125.***] usb usb1-port1: shutdown
[1125.***] i2c i2c-3: shutdown
[1125.***] i2c i2c-2: shutdown
[1125.***] i2c i2c-1: shutdown
[1125.***] i2c i2c-0: shutdown
[1125.***] serial8250 serial8250: shutdown
[1125.***] aer 0000:00:1b.4:pcie002: shutdown
[1125.***] aer 0000:00:1b.2:pcie002: shutdown
[1125.***] efi-framebuffer efi-framebuffer.0: shutdown
[1125.***] pcspkr pcspkr: shutdown
[1125.***] system 00:08: shutdown
[1125.***] system 00:07: shutdown
[1125.***] system 00:06: shutdown
[1125.***] system 00:05: shutdown
[1125.***] system 00:04: shutdown
[1125.***] rtc_cmos 00:03: shutdown
[1125.***] system 00:02: shutdown
[1125.***] serial 00:01: shutdown
[1125.***] system 00:00: shutdown
[1125.***] acpi-fan PNP0C0B:04: shutdown
[1125.***] acpi-fan PNP0C0B:03: shutdown
[1125.***] acpi-fan PNP0C0B:02: shutdown
[1125.***] acpi-fan PNP0C0B:01: shutdown
[1125.***] acpi-fan PNP0C0B:00: shutdown
[1125.***] nvme 0000:03:00.0: shutdown
[1125.***] iwlwifi 0000:02:00.0: shutdown
[1125.***] e1000e 0000:00:1f.0: shutdown
[1125.***] e1000e EEE TX LPI TIMER: 00000011
[1125.***] pci 0000:00:1f.5: shutdown
[1125.***] i801_smbus 0000:00:1f.4: shutdown
[1125.***] pcieport 0000:00:1d.0: shutdown
[1125.***] pcieport 0000:00:1c.0: shutdown
[1125.***] pcieport 0000:00:1b.4: shutdown
[1125.***] pcieport 0000:00:1b.2: shutdown
[1125.***] pcieport 0000:00:1b.0: shutdown
[1125.***] ahci 0000:00:17.0: shutdown
[1125.***] pci 0000:00:16.0: shutdown
[1125.***] pci 0000:00:14.2: shutdown
[1125.***] xhci_hcd 0000:00:14.0: shutdown
[1125.***] pci 0000:00:12.0: shutdown
[1125.***] i915 0000:00:02.0: shutdown
[1125.***] skl_uncore 0000:00:00.0: shutdown
[1125.***] ACPI: Preparing to enter system sleep state S5
[1352.***] INFO: task halt:5187 blocked for more than 122 seconds.
[1352.***] Not tainted 5.6.4 #4
[1352.***] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
[1352.***] halt D13856 5187 5186 0x00004000
[1352.***] Call Trace:
[1352.***] ? __schedule+0x272/0x5b0
[1352.***] schedule+0x45/0xb0
[1352.***] schedule_timeout+0x204/0x2f0
[1352.***] ? acpi_os_release_object+0x5/0x10
[1352.***] ? acpi_ut_update_object_reference+0x14e/0x1d2
[1352.***] wait_for_completion+0xa3/0x100
[1352.***] ? wake_up_q+0x90/0x90
[1352.***] flush_workqueue+0x130/0x420
[1352.***] kernel_power_off+0x3d/0x70
[1352.***] __do_sys_reboot+0x140/0x220
[1352.***] ? do_send_sig_info+0x58/0x90
[1352.***] ? kill_pid_info+0x34/0x60
[1352.***] ? kill_something_info+0x133/0x160
[1352.***] ? __x64_sys_kill+0x9a/0xc0
[1352.***] do_syscall_64+0x43/0x130
[1352.***] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1352.***] RIP: 0033:0x7f95dd6992c3
[1352.***] Code: Bad RIP value.
[1352.***] RSP: 002b:00007ffc4a3e4b88 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[1352.***] RAX: ffffffffffffffda RBX: 00007ffc4a3e4cd8 RCX: 00007f95dd6992c3
[1352.***] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
[1352.***] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
[1352.***] R10: 0000000000400968 R11: 0000000000000282 R12: 0000000000000000
[1352.***] R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000
[1475.***] INFO: task halt:5187 blocked for more than 245 seconds.
...
[-- Attachment #3: Type: text/plain, Size: 502 bytes --]
After the last message here, the "echo", Call Trace, RIP and register
messages were repeated exactly as above. At this point I did a hard
reboot. A web search found something similar to the above output
("System hangs (bad RIP value) when disk used in pool is removed"), but
didn't otherwise seem to be related to the problem I'm having.
Do you still want me to check whether removing the iwlwifi driver makes
a differece? And with the CDROM still detached, or does that not
matter?
Steve Berman
next prev parent reply other threads:[~2020-06-09 10:06 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-01 15:46 power-off delay/hang due to commit 6d25be57 (mainline) Stephen Berman
2020-05-06 21:57 ` Sebastian Andrzej Siewior
2020-05-08 21:30 ` Stephen Berman
2020-05-13 22:04 ` Sebastian Andrzej Siewior
2020-05-14 21:39 ` Stephen Berman
2020-05-22 16:40 ` Sebastian Andrzej Siewior
2020-06-09 10:06 ` Stephen Berman [this message]
2020-06-09 20:23 ` Sebastian Andrzej Siewior
2020-06-10 8:21 ` Stephen Berman
2020-06-10 10:25 ` Sebastian Andrzej Siewior
2020-06-10 22:49 ` Stephen Berman
2020-06-11 15:39 ` Stephen Berman
2020-06-12 11:01 ` Sebastian Andrzej Siewior
2020-06-14 12:12 ` Stephen Berman
2020-06-14 17:10 ` Sebastian Andrzej Siewior
2020-06-15 7:58 ` Stephen Berman
2020-06-15 14:51 ` Sebastian Andrzej Siewior
2020-06-15 15:41 ` Stephen Berman
2020-06-15 15:58 ` Sebastian Andrzej Siewior
2020-06-15 16:19 ` Stephen Berman
2020-06-15 16:32 ` Sebastian Andrzej Siewior
2020-06-16 7:14 ` Stephen Berman
2020-06-16 7:38 ` Sebastian Andrzej Siewior
2020-06-16 8:13 ` Stephen Berman
2020-06-16 15:55 ` Sebastian Andrzej Siewior
2020-06-16 20:28 ` Stephen Berman
2020-06-17 14:27 ` Sebastian Andrzej Siewior
2020-06-17 21:09 ` Stephen Berman
2020-06-24 20:11 ` Sebastian Andrzej Siewior
2020-06-24 21:49 ` Stephen Berman
2020-07-14 13:44 ` Sebastian Andrzej Siewior
2020-07-14 13:54 ` Rafael J. Wysocki
2020-07-14 14:11 ` Sebastian Andrzej Siewior
2020-07-14 15:53 ` Rafael J. Wysocki
2020-07-14 16:10 ` Sebastian Andrzej Siewior
2020-08-11 10:27 ` Sebastian Andrzej Siewior
2020-08-11 14:02 ` Rafael J. Wysocki
2020-07-19 10:07 ` Stephen Berman
2020-08-11 11:58 ` Stephen Berman
2020-08-11 13:29 ` Sebastian Andrzej Siewior
2020-08-11 14:34 ` Rafael J. Wysocki
2020-08-11 15:25 ` Sebastian Andrzej Siewior
2020-08-11 17:22 ` Stephen Berman
2020-08-11 18:49 ` Sebastian Andrzej Siewior
2020-10-06 21:49 ` Sebastian Andrzej Siewior
2020-10-07 16:18 ` Rafael J. Wysocki
2020-10-26 17:20 ` Sebastian Andrzej Siewior
2020-12-02 18:03 ` Sebastian Andrzej Siewior
2020-12-02 18:31 ` Rafael J. Wysocki
2020-12-02 19:13 ` Rafael J. Wysocki
2020-12-31 20:46 ` Rafael J. Wysocki
2021-01-02 11:03 ` Rafael J. Wysocki
2021-01-04 15:38 ` Stephen Berman
2021-01-24 13:49 ` Stephen Berman
2021-01-25 16:25 ` Rafael J. Wysocki
2020-06-20 19:08 ` [PATCH] SCSI: Disable CD-ROM poll on shutdown kernel test robot
2020-06-09 21:26 ` power-off delay/hang due to commit 6d25be57 (mainline) Stephen Berman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87y2owwo2o.fsf@rub.de \
--to=stephen.berman@gmx.net \
--cc=bigeasy@linutronix.de \
--cc=linux-kernel@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox