devicetree.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600
       [not found]   ` <20160223155331.GA448@swordfish>
@ 2016-02-24  1:19     ` Sergey Senozhatsky
  2016-02-24 11:46       ` Petr Mladek
  0 siblings, 1 reply; 7+ messages in thread
From: Sergey Senozhatsky @ 2016-02-24  1:19 UTC (permalink / raw)
  To: kernel test robot
  Cc: lkp, Rob Herring, Frank Rowand, Grant Likely, devicetree, LKML,
	Andrew Morton, Calvin Owens, Dave Jones, Kyle McMartin, Tejun Heo,
	Jan Kara, Petr Mladek, Stephen Rothwell, Sergey Senozhatsky,
	Sergey Senozhatsky


Hello,

Cc Rob, Frank, Grant

On (02/24/16 00:53), Sergey Senozhatsky wrote:
[..]
>  284 [    0.000000]  per task-struct memory footprint: 2112 bytes
>  285 [    0.000000]  per task-struct memory footprint: 2112 bytes
>  286 [    0.000000] ------------------------
>  287 [    0.000000] ------------------------
>  288 [    0.000000] | Locking API testsuite:
>  289 [    0.000000] | Locking API testsuite:
>  290 [    0.000000] ----------------------------------------------------------------------------
>  291 [    0.000000] ----------------------------------------------------------------------------
>  292 [    0.000000]                                  | spin |wlock |rlock |mutex | wsem | rsem |
>  293 [    0.000000]                                  | spin |wlock |rlock |mutex | wsem | rsem |
>  294 [    0.000000]   --------------------------------------------------------------------------
>  295 [    0.000000]   --------------------------------------------------------------------------
> 
> 
> looking at your Kernel command line
> 
> [    0.000000] Kernel command line: root=/dev/ram0 user=lkp job=/lkp/scheduled/vm-kbuild-yocto-i386-53/bisect_boot-1-yocto-minimal-i386.cgz-i386-randconfig-h1-02192137-34578dc67f38c02ccbe696e4099967884caa8e15-20160220-72722-vao2m5-0.yaml ARCH=i386 kconfig=i386-randconfig-h1-02192137 branch=linux-next/master commi     t=34578dc67f38c02ccbe696e4099967884caa8e15 BOOT_IMAGE=/pkg/linux/i386-randconfig-h1-02192137/gcc-5/34578dc67f38c02ccbe696e4099967884caa8e15/vmlinuz-4.5.0-rc4-00295-g34578dc max_uptime=600 RESULT_ROOT=/result/boot/1/vm-kbuild-yocto-i386/yocto-minimal-i386.cgz/i386-randconfig-h1-02192137/gcc-5/34578dc67f38c02ccbe69     6e4099967884caa8e15/9 LKP_SERVER=inn earlyprintk=ttyS0,115200 systemd.log_level=err debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 pan
 ic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 console=ttyS0,115200 console=tty0 vga=normal rw ip=::::vm-kbuild-yo     cto-i386-53::dhcp drbd.minor_count=8
> 
> 
> - earlyprintk=ttyS0,115200
> - console=ttyS0,115200
> - console=tty0
> 
> and I see "bootconsole [earlyser0] enabled" but no "bootconsole [earlyser0] disabled" message, which
> I'd expect to see...
> 


and you get the NMI watchdog softlockup because you have a whole bunch of

   "of_overlay_destroy: Could not find overlay #6"
   "### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6"

messages to print. seems that somehitng just pushes them in a loop.
there are too many of them:

** 16981217 printk messages dropped ** 
[   33.495591] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
** 16981217 printk messages dropped ** 
[   33.495591] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.495593] of_overlay_destroy: Could not find overlay #6
[   33.495593] of_overlay_destroy: Could not find overlay #6


 ** 16981217 printk messages dropped **



[   33.497583] of_overlay_destroy: Could not find overlay #6
[   33.497583] of_overlay_destroy: Could not find overlay #6
[   33.497584] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497584] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497586] of_overlay_destroy: Could not find overlay #6
[   33.497586] of_overlay_destroy: Could not find overlay #6
[   33.497587] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497587] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497589] of_overlay_destroy: Could not find overlay #6
[   33.497589] of_overlay_destroy: Could not find overlay #6
[   33.497589] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497589] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497591] of_overlay_destroy: Could not find overlay #6
[   33.497591] of_overlay_destroy: Could not find overlay #6
[   33.497592] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497592] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497594] of_overlay_destroy: Could not find overlay #6
[   33.497594] of_overlay_destroy: Could not find overlay #6
[   33.497595] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497595] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497596] of_overlay_destroy: Could not find overlay #6
[   33.497596] of_overlay_destroy: Could not find overlay #6
[   33.497597] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497597] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497599] of_overlay_destroy: Could not find overlay #6
[   33.497599] of_overlay_destroy: Could not find overlay #6
[   33.497600] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497600] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497601] of_overlay_destroy: Could not find overlay #6
[   33.497601] of_overlay_destroy: Could not find overlay #6
[   33.497602] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497602] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497604] of_overlay_destroy: Could not find overlay #6
[   33.497604] of_overlay_destroy: Could not find overlay #6
[   33.497605] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6
[   33.497605] ### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6




the source of "too many messages to printk" is this `while (1) loop'

static void of_unittest_destroy_tracked_overlays(void)
{
	int id, ret, defers;

	if (overlay_first_id < 0)
		return;

	/* try until no defers */
	do {
		defers = 0;
		/* remove in reverse order */
		for (id = MAX_UNITTEST_OVERLAYS - 1; id >= 0; id--) {
			if (!(overlay_id_bits[BIT_WORD(id)] & BIT_MASK(id)))
				continue;

			ret = of_overlay_destroy(id + overlay_first_id);
			if (ret != 0) {
				defers++;
				pr_warn("%s: overlay destroy failed for #%d\n",
					__func__, id + overlay_first_id);
				continue;
			}

			overlay_id_bits[BIT_WORD(id)] &= ~BIT_MASK(id);
		}
	} while (defers > 0);
}


I don't know what's the root cause of missing overlay id in idr,
but in this particular case the loop transform into /* well, "defers" should
overflow at some point, but I doubt we must count on it */

	do {
		ret = of_overlay_destroy()->idr_find()->pr_err("Could not find overlay"), return -ENODEV
		if (ret != 0) {
			pr_warn("overlay destroy failed for");
			continue;
		}

	} while (1);


the "while (1) printk();" pattern is known to be dangerous; we need to
fix printk().





so something like this perhaps.


From: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Subject: [PATCH] fix endless loop

---
 drivers/of/unittest.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 979b6e4..5058017 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1165,6 +1165,12 @@ static void of_unittest_destroy_tracked_overlays(void)
 				continue;
 
 			ret = of_overlay_destroy(id + overlay_first_id);
+			if (ret == -ENODEV) {
+				pr_warn("%s: no overlay to destroy for #%d\n",
+						__func__, id + overlay_first_id);
+				continue;
+			}
+
 			if (ret != 0) {
 				defers++;
 				pr_warn("%s: overlay destroy failed for #%d\n",
-- 
2.7.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600
  2016-02-24  1:19     ` [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600 Sergey Senozhatsky
@ 2016-02-24 11:46       ` Petr Mladek
       [not found]         ` <20160224114614.GY3305-KsEp0d+Q8qECVLCxKZUutA@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Petr Mladek @ 2016-02-24 11:46 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: kernel test robot, lkp, Rob Herring, Frank Rowand, Grant Likely,
	devicetree, LKML, Andrew Morton, Calvin Owens, Dave Jones,
	Kyle McMartin, Tejun Heo, Jan Kara, Stephen Rothwell,
	Sergey Senozhatsky

On Wed 2016-02-24 10:19:41, Sergey Senozhatsky wrote:
> 
> Hello,
> 
> Cc Rob, Frank, Grant
> 
> On (02/24/16 00:53), Sergey Senozhatsky wrote:
> [..]
> >  284 [    0.000000]  per task-struct memory footprint: 2112 bytes
> >  285 [    0.000000]  per task-struct memory footprint: 2112 bytes
> >  286 [    0.000000] ------------------------
> >  287 [    0.000000] ------------------------
> >  288 [    0.000000] | Locking API testsuite:
> >  289 [    0.000000] | Locking API testsuite:
> >  290 [    0.000000] ----------------------------------------------------------------------------
> >  291 [    0.000000] ----------------------------------------------------------------------------
> >  292 [    0.000000]                                  | spin |wlock |rlock |mutex | wsem | rsem |
> >  293 [    0.000000]                                  | spin |wlock |rlock |mutex | wsem | rsem |
> >  294 [    0.000000]   --------------------------------------------------------------------------
> >  295 [    0.000000]   --------------------------------------------------------------------------
> > 
> > 
> > looking at your Kernel command line
> > 
> > [    0.000000] Kernel command line: root=/dev/ram0 user=lkp job=/lkp/scheduled/vm-kbuild-yocto-i386-53/bisect_boot-1-yocto-minimal-i386.cgz-i386-randconfig-h1-02192137-34578dc67f38c02ccbe696e4099967884caa8e15-20160220-72722-vao2m5-0.yaml ARCH=i386 kconfig=i386-randconfig-h1-02192137 branch=linux-next/master commi     t=34578dc67f38c02ccbe696e4099967884caa8e15 BOOT_IMAGE=/pkg/linux/i386-randconfig-h1-02192137/gcc-5/34578dc67f38c02ccbe696e4099967884caa8e15/vmlinuz-4.5.0-rc4-00295-g34578dc max_uptime=600 RESULT_ROOT=/result/boot/1/vm-kbuild-yocto-i386/yocto-minimal-i386.cgz/i386-randconfig-h1-02192137/gcc-5/34578dc67f38c02ccbe69     6e4099967884caa8e15/9 LKP_SERVER=inn earlyprintk=ttyS0,115200 systemd.log_level=err debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 p
 anic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 console=ttyS0,115200 console=tty0 vga=normal rw ip=::::vm-kbuild-yo     cto-i386-53::dhcp drbd.minor_count=8
> > 
> > 
> > - earlyprintk=ttyS0,115200
> > - console=ttyS0,115200
> > - console=tty0
> > 
> > and I see "bootconsole [earlyser0] enabled" but no "bootconsole [earlyser0] disabled" message, which
> > I'd expect to see...
> > 
> 
> 
> and you get the NMI watchdog softlockup because you have a whole bunch of
> 
>    "of_overlay_destroy: Could not find overlay #6"
>    "### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6"
> 
> messages to print. seems that somehitng just pushes them in a loop.
> there are too many of them:

This sounds like a reasonable explanation. It seems that
of_unittest_destroy_tracked_overlays() really ended in an infinite
loop.

But I am still curious why the softlookup points to

[   33.497718] EIP is at vprintk_emit+0x1ea/0x600

Also there is on the stack

[   33.497741]  [<c068e712>] vprintk_default+0x32/0x40
[   33.497741]  [<c068e712>] vprintk_default+0x32/0x40
[   33.497744]  [<c06fdf6e>] printk+0x11/0x13
[   33.497744]  [<c06fdf6e>] printk+0x11/0x13
[   33.497748]  [<c0df5eec>] of_unittest_overlay+0x8d1/0x900
[   33.497748]  [<c0df5eec>] of_unittest_overlay+0x8d1/0x900
[   33.497750]  [<c0df6b1f>] of_unittest+0xc04/0xc2d
[   33.497750]  [<c0df6b1f>] of_unittest+0xc04/0xc2d

I would expect that the soft lookup happens in console_unlock()
called with IRQs disabled. It seems to me that of_unittest_overlay()
is called with IRQs enabled.

I want to be sure that the patch in printk() did not introduce
a deadlock that is visible only under a high printk load.

I guess that the softlookup was not visible before the
printk patch was applied. I wonder if the patch made something
worse. It was supposed to improve things but...

Do I miss something, please?

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600
       [not found]         ` <20160224114614.GY3305-KsEp0d+Q8qECVLCxKZUutA@public.gmane.org>
@ 2016-02-24 12:50           ` Sergey Senozhatsky
  2016-02-24 15:12             ` Sergey Senozhatsky
  0 siblings, 1 reply; 7+ messages in thread
From: Sergey Senozhatsky @ 2016-02-24 12:50 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, kernel test robot, lkp-JC7UmRfGjtg,
	Rob Herring, Frank Rowand, Grant Likely,
	devicetree-u79uwXL29TY76Z2rM5mHXA, LKML, Andrew Morton,
	Calvin Owens, Dave Jones, Kyle McMartin, Tejun Heo, Jan Kara,
	Stephen Rothwell, Sergey Senozhatsky

Hello,

On (02/24/16 12:46), Petr Mladek wrote:
[..]
> > and you get the NMI watchdog softlockup because you have a whole bunch of
> > 
> >    "of_overlay_destroy: Could not find overlay #6"
> >    "### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6"
> > 
> > messages to print. seems that somehitng just pushes them in a loop.
> > there are too many of them:
> 
> This sounds like a reasonable explanation. It seems that
> of_unittest_destroy_tracked_overlays() really ended in an infinite
> loop.
> 
> But I am still curious why the softlookup points to
> 
> [   33.497718] EIP is at vprintk_emit+0x1ea/0x600
> 
> Also there is on the stack
> 
> [   33.497741]  [<c068e712>] vprintk_default+0x32/0x40
> [   33.497741]  [<c068e712>] vprintk_default+0x32/0x40
> [   33.497744]  [<c06fdf6e>] printk+0x11/0x13
> [   33.497744]  [<c06fdf6e>] printk+0x11/0x13
> [   33.497748]  [<c0df5eec>] of_unittest_overlay+0x8d1/0x900
> [   33.497748]  [<c0df5eec>] of_unittest_overlay+0x8d1/0x900
> [   33.497750]  [<c0df6b1f>] of_unittest+0xc04/0xc2d
> [   33.497750]  [<c0df6b1f>] of_unittest+0xc04/0xc2d
> 
> I would expect that the soft lookup happens in console_unlock()
> called with IRQs disabled. It seems to me that of_unittest_overlay()
> is called with IRQs enabled.

watchdog has two different parts: hrtimer part (via IRQ) checks the
touch_ts and wakeups the updater smpboot kthread; and updater kthread,
that updates touch_ts. to get a lockup you can just keep the preemption
disabled for 20+ seconds, so updater kthread will not update touch_ts.
the next hrtimer irq will detect lockup. and the .config we have is

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

that can add up to the issue.

(well, at least watchdog was working this way the last time I saw it).


> I want to be sure that the patch in printk() did not introduce
> a deadlock that is visible only under a high printk load.

sure. I did additional intensive tests today, and saw no issues.
schematically, something like this:

	u64 start = local_lock() >> 31UL:

	{ preempt_disable(), local_irq_save() }
	while (1) {
		u64 now = local_clock() >> 31UL;
		if (now - start > TIMEOUT)
			goto out;
		pr_err(">>>>>\n");
	}
out:
	{ preempt_enable(), local_irq_restore() }


> I guess that the softlookup was not visible before the
> printk patch was applied. I wonder if the patch made something
> worse. It was supposed to improve things but...

I think the softlockup under the given conditions (endless printk and
no preemption) will show up regardless the patches being applied; but
I see you point, of course.

	-ss
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600
  2016-02-24 12:50           ` Sergey Senozhatsky
@ 2016-02-24 15:12             ` Sergey Senozhatsky
  2016-02-25  5:10               ` Sergey Senozhatsky
  0 siblings, 1 reply; 7+ messages in thread
From: Sergey Senozhatsky @ 2016-02-24 15:12 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Sergey Senozhatsky, kernel test robot, lkp,
	Rob Herring, Frank Rowand, Grant Likely, devicetree, LKML,
	Andrew Morton, Calvin Owens, Dave Jones, Kyle McMartin, Tejun Heo,
	Jan Kara, Stephen Rothwell

On (02/24/16 21:50), Sergey Senozhatsky wrote:
> On (02/24/16 12:46), Petr Mladek wrote:
> [..]
> > > and you get the NMI watchdog softlockup because you have a whole bunch of
> > > 
> > >    "of_overlay_destroy: Could not find overlay #6"
> > >    "### dt-test ### of_unittest_destroy_tracked_overlays: overlay destroy failed for #6"
> > > 
> > > messages to print. seems that somehitng just pushes them in a loop.
> > > there are too many of them:
> > 
> > This sounds like a reasonable explanation. It seems that
> > of_unittest_destroy_tracked_overlays() really ended in an infinite
> > loop.
> > 
> > But I am still curious why the softlookup points to
> > 
> > [   33.497718] EIP is at vprintk_emit+0x1ea/0x600
> > 
> > Also there is on the stack
> > 
> > [   33.497741]  [<c068e712>] vprintk_default+0x32/0x40
> > [   33.497741]  [<c068e712>] vprintk_default+0x32/0x40
> > [   33.497744]  [<c06fdf6e>] printk+0x11/0x13
> > [   33.497744]  [<c06fdf6e>] printk+0x11/0x13
> > [   33.497748]  [<c0df5eec>] of_unittest_overlay+0x8d1/0x900
> > [   33.497748]  [<c0df5eec>] of_unittest_overlay+0x8d1/0x900
> > [   33.497750]  [<c0df6b1f>] of_unittest+0xc04/0xc2d
> > [   33.497750]  [<c0df6b1f>] of_unittest+0xc04/0xc2d
> > 
> > I would expect that the soft lookup happens in console_unlock()
> > called with IRQs disabled. It seems to me that of_unittest_overlay()
> > is called with IRQs enabled.

hm... both of the logbuf_lock/irq reqions in vprintk_emit/console_unlock
are not modified by the patch set. there is, however, one thing that has
changed -- additional console_cont_flush() calls, which does spin_lock_irq
logbuf_lock and spin_unlock_irq logbuf_lock.

> > I want to be sure that the patch in printk() did not introduce
> > a deadlock that is visible only under a high printk load.

I'll do more tests, certainly.

	-ss

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600
  2016-02-24 15:12             ` Sergey Senozhatsky
@ 2016-02-25  5:10               ` Sergey Senozhatsky
  2016-03-02 10:30                 ` Petr Mladek
  0 siblings, 1 reply; 7+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25  5:10 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, kernel test robot,
	lkp-JC7UmRfGjtg, Rob Herring, Frank Rowand, Grant Likely,
	devicetree-u79uwXL29TY76Z2rM5mHXA, LKML, Andrew Morton,
	Calvin Owens, Dave Jones, Kyle McMartin, Tejun Heo, Jan Kara,
	Stephen Rothwell

Hello,

On (02/25/16 00:12), Sergey Senozhatsky wrote:
[..]
> > > I want to be sure that the patch in printk() did not introduce
> > > a deadlock that is visible only under a high printk load.
> 
> I'll do more tests, certainly.

No behavioral change between linux-next and linux-next modulo printk patches was observed.

next-20160224

git revert 8a30bf8ff5e07a589ad43363b06ea361ad723a0e 2822460a0ea97656ae71b2b4ddac087f79c5876a b6751672d3ac1d3d876fd5ea7ec579956eee4d6b


set PREEMPT_NONE

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y



add PANIC of softlockup and panic on hung_task

CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y			<<
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=11
CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y			<<
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=1
CONFIG_WQ_WATCHDOG=y
CONFIG_PANIC_ON_OOPS=y
CONFIG_PANIC_ON_OOPS_VALUE=1
CONFIG_PANIC_TIMEOUT=23



modify watchdog timeout (so it detects lockup sooner) and add
printk() flood -- the code does not even disable preemption,
just busy looping, we have PREEMPT_NONE kernel.


git diff

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 46055db..8759dbb 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1410,6 +1410,21 @@ static int __init zram_init(void)
 {
        int ret;
 
+       /*unsigned long flags;*/
+       u64 s, e;
+
+       /*preempt_disable();*/
+       s = local_clock() >> 31UL;
+       while (1) {
+               e = local_clock() >> 31UL;
+               if (e - s > 10)
+                       goto out;
+               pr_err(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
+       }
+out:
+       /*preempt_enable();*/
+
+
        ret = class_register(&zram_control_class);
        if (ret) {
                pr_err("Unable to register zram-control class\n");
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b3ace6e..8655364 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -54,7 +54,7 @@ static unsigned long __read_mostly watchdog_enabled = SOFT_WATCHDOG_ENABLED;
 int __read_mostly nmi_watchdog_enabled;
 int __read_mostly soft_watchdog_enabled;
 int __read_mostly watchdog_user_enabled;
-int __read_mostly watchdog_thresh = 10;
+int __read_mostly watchdog_thresh = 1;
 
 #ifdef CONFIG_SMP
 int __read_mostly sysctl_softlockup_all_cpu_backtrace;


---


The test (unpatched linux-next)


NMI watchdog: BUG: soft lockup - CPU#2 stuck for 3s! [modprobe:418]
Modules linked in: ...
irq event stamp: 40679
hardirqs last  enabled at (40679): [<ffffffff8108b771>] vprintk_emit+0x3f0/0x499
hardirqs last disabled at (40678): [<ffffffff8108b3d4>] vprintk_emit+0x53/0x499
softirqs last  enabled at (35360): [<ffffffff810475e9>] __do_softirq+0x2d8
softirqs last disabled at (35357): [<ffffffff810478ce>] irq_exit+0x41/0x95
CPU: 2 PID: 418 Comm: modprobe Tainted: G           O    4.5.0-rc5-next-20160224-dbg-00012-gfedb791-dirty #187
Hardware name: ...
task: ffff88013290ce00 ti: ffff880131f98000 task.ti: ffff880131f98000
RIP: 0010:[<ffffffff8108b2d8>]  [<ffffffff8108b2d8>] console_unlock+0x3ad/0x456
RSP: 0018:ffff880131f9bbc8  EFLAGS: 00000246
RAX: ffff88013290ce00 RBX: ffffffff82891fdc RCX: 00000000ffffffff
RDX: ffffffff812c99a8 RSI: 0000000000000001 RDI: ffffffff8108b2d5
RBP: ffff880131f9bc08 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffff00ff R11: 000000000000042b R12: 0000000000000000
R13: 0000000000000246 R14: 0000000000000003 R15: 0000000000000001
S:  00007fe6ed4fd700(0000) GS:ffff880137d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe6eb905000 CR3: 00000000bb7fc000 CR4: 00000000000006e0
Stack:
 0000000000000000 0000000000000029 000000000001a128 00000000ffffffff
 0000000000000003 0000000000000028 ffffffff82877602 0000000000000028
 ffff880131f9bc60 ffffffff8108b7cb 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8108b7cb>] vprintk_emit+0x44a/0x499
 [<ffffffffa01f0000>] ? 0xffffffffa01f0000
 [<ffffffff8108b931>] vprintk_default+0x1d/0x1f
 [<ffffffff810f1b21>] printk+0x48/0x50
 [<ffffffffa01f0032>] zram_init+0x32/0x1000 [zram]
 [<ffffffff81000402>] do_one_initcall+0xe7/0x16
 [<ffffffff810f1d5f>] ? do_init_module+0x27/0x1d9
 [<ffffffff81093619>] ? rcu_read_lock_sched_held+0x61/0x69
 [<ffffffff811330fa>] ? kmem_cache_alloc_trace+0x138/0x1d0
 [<ffffffff810f1d5f>] ? do_init_module+0x27/0x1d9
 [<ffffffff810f1d97>] do_init_module+0x5f/0x1d9
 [<ffffffff810b2b2c>] load_module+0x19da/0x1eb8
 [<ffffffff810830c4>] ? lock_acquire+0x10f/0x1a3
 [<ffffffff810b3109>] SyS_init_module+0xff/0x125
 [<ffffffff810b3109>] ? SyS_init_module+0xff/0x125
 [<ffffffff8150e525>] entry_SYSCALL_64_fastpath+0x18/0xa8
Code: 89 05 11 cc 80 01 4c 89 35 0e cc 80 01 88 d0 83 e2 fe 83 e0 1e 88 53 0f 0f b6 c0 89 05 f2 cb 80 01 eb 9e e8 9e 5a ff ff 41 55 9d <e9> ed fe ff ff 48 83 3d 3b cd 80 01 00 c7 05 3d cd 80 01 00 00

Kernel panic - not syncing: softlockup: hung tasks
CPU: 2 PID: 418 Comm: modprobe Tainted: G           O L  4.5.0-rc5-next-20160224-dbg-00012-gfedb791-dirty #187
Hardware name: ...
 0000000000000000 ffff880137d03d68 ffffffff8123334d ffffffff817ad7c7
 ffff880137d03df8 ffff880137d03de8 ffffffff810f124e ffffffff00000008
 ffff880137d03df8 ffff880137d03d90 0000000000000000 00000000ffffffff
Call Trace:
 <IRQ>  [<ffffffff8123334d>] dump_stack+0x67/0x90
 [<ffffffff810f124e>] panic+0xd1/0x212
 [<ffffffff8107d1c1>] ? down_trylock+0x2d/0x37
 [<ffffffff810c12c5>] watchdog_timer_fn+0x18e/0x1b0
 [<ffffffff8109cee9>] __hrtimer_run_queues+0x1ee/0x418
 [<ffffffff810c1137>] ? __touch_watchdog+0x1c/0x1c
 [<ffffffff8109d2af>] hrtimer_interrupt+0xa8/0x192
 [<ffffffff810354cd>] hpet_interrupt_handler+0x2a/0x31
 [<ffffffff8108ccd8>] handle_irq_event_percpu+0xd2/0x2e4
 [<ffffffff8108cf22>] handle_irq_event+0x38/0x56
 [<ffffffff8108fc63>] handle_edge_irq+0xd9/0xfb
 [<ffffffff810174ad>] handle_irq+0x1c/0x20
 [<ffffffff81510795>] do_IRQ+0x85/0x1
 [<ffffffff8150ee86>] common_interrupt+0x86/0x86
 <EOI>  [<ffffffff812c99a8>] ? vt_console_print+0x2f9/0x308
 [<ffffffff8108b2d5>] ? console_unlock+0x3aa/0x456
 [<ffffffff8108b2d8>] ? console_unlock+0x3ad/0x456
 [<ffffffff8108b7cb>] vprintk_emit+0x44a/0x499
 [<ffffffffa01f0000>] ? 0xffffffffa01f0000
 [<ffffffff8108b931>] vprintk_default+0x1d/0x1f
 [<ffffffff810f1b21>] printk+0x48/0x50
 [<ffffffffa01f0032>] zram_init+0x32/0x1000 [zram]
 [<ffffffff81000402>] do_one_initcall+0xe7/0x16c
 [<ffffffff810f1d5f>] ? do_init_module+0x27/0x1d9
 [<ffffffff81093619>] ? rcu_read_lock_sched_held+0x61/0x69
 [<ffffffff811330fa>] ? kmem_cache_alloc_trace+0x138/0x1d0
 [<ffffffff810f1d5f>] ? do_init_module+0x27/0x1d9
 [<ffffffff810f1d97>] do_init_module+0x5f/0x1d9
 [<ffffffff810b2b2c>] load_module+0x19da/0x1eb8
 [<ffffffff810830c4>] ? lock_acquire+0x10f/0x1a3
 [<ffffffff810b3109>] SyS_init_module+0xff/0x125
 [<ffffffff810b3109>] ? SyS_init_module+0xff/0x125
 [<ffffffff8150e525>] entry_SYSCALL_64_fastpath+0x18/0xa8



so in this test printk()->console_unlock() was performed on the same CPU.
but in general those can be different CPUs. one busy spinning doing printk(),
the other busy spinning doing console_unlock()->call_console_drivers().

	-ss
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600
  2016-02-25  5:10               ` Sergey Senozhatsky
@ 2016-03-02 10:30                 ` Petr Mladek
  2016-03-02 10:41                   ` Sergey Senozhatsky
  0 siblings, 1 reply; 7+ messages in thread
From: Petr Mladek @ 2016-03-02 10:30 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, kernel test robot, lkp, Rob Herring,
	Frank Rowand, Grant Likely, devicetree, LKML, Andrew Morton,
	Calvin Owens, Dave Jones, Kyle McMartin, Tejun Heo, Jan Kara,
	Stephen Rothwell

On Thu 2016-02-25 14:10:05, Sergey Senozhatsky wrote:
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_COUNT=y

I was curious why your patch actually did not help to avoid the
softlockup. The infinite printk loop was called in a safe context,
CONFIG_PREEMPT_COUNT=y, so it did set console_may_schedule = 1.

But it never got console_sem, never called console_unlock()
and never called cond_resched() there. So, it never got preempted
in the PREEMPT_NONE kernel. The console_sem was owned by
another process that was busy handling the flood of messages.

Note that the infinite cycle calling printk() might be interrupted anywhere
by the NMI watchdog. It explains why the original report pointed
at the beginning of the printk. It was not a deadlock.
I got the softlockup on random locations here.

Also it is not the classic softlockup in console_unlock().
The process handling the console actually could sleep if it
has console_may_schedule = 1. This is why console_unlock()
did not appeared in the softlockup backtrace.


Conclusion:

Sergey is correct and his patches are innocent here.

Sergey, if you send the fix for
of_unittest_destroy_tracked_overlays(), please add me into CC.


Best Regards,
Petr

PS: I am sorry for the noise and that it took me so long to shake my
head around this problem. Thanks a lot for patience.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600
  2016-03-02 10:30                 ` Petr Mladek
@ 2016-03-02 10:41                   ` Sergey Senozhatsky
  0 siblings, 0 replies; 7+ messages in thread
From: Sergey Senozhatsky @ 2016-03-02 10:41 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, kernel test robot, lkp,
	Rob Herring, Frank Rowand, Grant Likely, devicetree, LKML,
	Andrew Morton, Calvin Owens, Dave Jones, Kyle McMartin, Tejun Heo,
	Jan Kara, Stephen Rothwell

On (03/02/16 11:30), Petr Mladek wrote:
> On Thu 2016-02-25 14:10:05, Sergey Senozhatsky wrote:
> > CONFIG_PREEMPT_NONE=y
> > # CONFIG_PREEMPT_VOLUNTARY is not set
> > # CONFIG_PREEMPT is not set
> > CONFIG_PREEMPT_COUNT=y
> 
> I was curious why your patch actually did not help to avoid the
> softlockup. The infinite printk loop was called in a safe context,
> CONFIG_PREEMPT_COUNT=y, so it did set console_may_schedule = 1.
> 
> But it never got console_sem, never called console_unlock()
> and never called cond_resched() there. So, it never got preempted
> in the PREEMPT_NONE kernel. The console_sem was owned by
> another process that was busy handling the flood of messages.
> 
> Note that the infinite cycle calling printk() might be interrupted anywhere
> by the NMI watchdog. It explains why the original report pointed
> at the beginning of the printk. It was not a deadlock.
> I got the softlockup on random locations here.
> 
> Also it is not the classic softlockup in console_unlock().
> The process handling the console actually could sleep if it
> has console_may_schedule = 1. This is why console_unlock()
> did not appeared in the softlockup backtrace.
> 
> 
> Conclusion:
> 
> Sergey is correct and his patches are innocent here.

thank you, Petr!


> Sergey, if you send the fix for
> of_unittest_destroy_tracked_overlays(), please add me into CC.

of course, will try to send it out tonight.


> PS: I am sorry for the noise and that it took me so long to shake my
> head around this problem. Thanks a lot for patience.

no problem at all. thank you. the more we double check it - the better.

	-ss

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-03-02 10:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <87fuwk1c0o.fsf@yhuang-dev.intel.com>
     [not found] ` <20160223131538.GA2048@swordfish>
     [not found]   ` <20160223155331.GA448@swordfish>
2016-02-24  1:19     ` [lkp] [printk] 34578dc67f: EIP is at vprintk_emit+0x1ea/0x600 Sergey Senozhatsky
2016-02-24 11:46       ` Petr Mladek
     [not found]         ` <20160224114614.GY3305-KsEp0d+Q8qECVLCxKZUutA@public.gmane.org>
2016-02-24 12:50           ` Sergey Senozhatsky
2016-02-24 15:12             ` Sergey Senozhatsky
2016-02-25  5:10               ` Sergey Senozhatsky
2016-03-02 10:30                 ` Petr Mladek
2016-03-02 10:41                   ` Sergey Senozhatsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).