From: Gautham R Shenoy <ego@in.ibm.com>
To: Yi Yang <yi.y.yang@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	Oleg Nesterov <oleg@tv-sign.ru>,
	"Rafael J. Wysocki" <rjw@sisk.pl>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
Date: Tue, 4 Mar 2008 14:39:33 +0530
Message-ID: <20080304090933.GA8997@in.ibm.com>
In-Reply-To: <20080304052613.GA28632@in.ibm.com>

On Tue, Mar 04, 2008 at 10:56:13AM +0530, Gautham R Shenoy wrote:
> On Mon, Mar 03, 2008 at 10:45:04PM +0800, Yi Yang wrote:
> > On Mon, 2008-03-03 at 21:01 +0530, Gautham R Shenoy wrote:
> > > > This seems to be such an issue, but I tried changing it to follow this
> > > > rule and the issue is still there.
> > > > 
> > > > Why isn't the kernel thread [watchdog/1] reaped by its parent? its state
> > > > is TASK_RUNNING with high priority (R< means this), why it isn't done?
> > > > 
> > > > Anyone ever met such a problem? Your thought?
> > > 
> > > Hi Yi,
> > > 
> > > This is indeed strange. I am able to reproduce this problem on my 4-way
> > > box. From what I see in the past two runs, we're waiting in the
> > > cpu-hotplug callback path for the watchdog/1 thread to stop.
> > > 
> > > During cpu-offline, once the cpu goes offline, in the migration_call(), 
> > > we migrate any tasks associated with the offline cpus
> > > to some other cpu. This also means breaking affinity for tasks which were
> > > affined to the cpu which went down. So watchdog/1 has been migrated to
> > > some other cpu.
> > No, [watchdog/1] is just for CPU #1, if CPU #1 has been offline, it
> > should be killed but not migrated to other CPU because other CPU has
> > such a kthread.
> 
> Yes, it is killed once it gets a chance to run *after* cpu goes offline.
> The moment it runs on some other cpu, it will see the kthread_should_stop()
> because in the cpu-hotplug callback path we've issued a
> kthread_stop(watchdog/1).
> 
> Again, we can argue that we could issue a kthread_stop() 
> in CPU_DOWN_PREPARE, rather than in CPU_DEAD and restart 
> it in CPU_DOWN_FAILED if the cpu-hotplug operation does fail.
> 
> > 
> > Maybe migration_call was doing such a bad thing. :-)
> 
> Nope, from what I see migration_call() is not having any problems. It is
> behaving the way it is supposed to behave :)
> 
> The other observation I noted was the WARN_ON_ONCE() in hrtick() [1]
> that I am consistently hitting after the first cpu goes offline.
> 
> So at times, the callback thread is blocked on kthread_stop(k) in
> softlockup.c, while at other times it is blocked in
> cleanup_workqueue_threads() in workqueue.c.
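
To make the two orderings quoted above concrete, here is a rough userspace
sketch, not kernel code. It models a per-cpu thread that can only notice a
stop request when it actually gets to run, and a hotplug callback that stops
it either at CPU_DEAD (the behaviour described above) or at CPU_DOWN_PREPARE
with a restart on CPU_DOWN_FAILED (the alternative suggested above). All
toy_* names, the HP_* events and the can_run flag are hypothetical and exist
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the hotplug notifier events named above. */
enum hp_event { HP_DOWN_PREPARE, HP_DOWN_FAILED, HP_DEAD };

/* Toy model of a per-cpu kthread such as watchdog/1. */
struct toy_kthread {
	bool alive;        /* thread has not exited yet */
	bool should_stop;  /* set by toy_kthread_stop() */
	bool can_run;      /* assumption: scheduler can give it a cpu */
};

/* Thread body: it only notices should_stop when it actually runs. */
static void toy_thread_step(struct toy_kthread *t)
{
	if (t->alive && t->can_run && t->should_stop)
		t->alive = false;
}

/*
 * Model of a stop request: set the flag, give the thread one chance
 * to run. Returns true if it exited, false where a real caller would
 * still be blocked waiting for it.
 */
static bool toy_kthread_stop(struct toy_kthread *t)
{
	t->should_stop = true;
	toy_thread_step(t);
	return !t->alive;
}

/* Recreate the thread, as a DOWN_FAILED handler would have to. */
static void toy_kthread_start(struct toy_kthread *t)
{
	assert(!t->alive);  /* only restart a thread that was stopped */
	t->alive = true;
	t->should_stop = false;
}

/*
 * Callback sketch. stop_early selects the alternative ordering:
 * stop in DOWN_PREPARE, restart in DOWN_FAILED. Returns false if
 * the callback would block.
 */
static bool toy_watchdog_callback(struct toy_kthread *t, enum hp_event ev,
				  bool stop_early)
{
	switch (ev) {
	case HP_DOWN_PREPARE:
		return stop_early ? toy_kthread_stop(t) : true;
	case HP_DOWN_FAILED:
		if (stop_early)
			toy_kthread_start(t);
		return true;
	case HP_DEAD:
		return t->alive ? toy_kthread_stop(t) : true;
	}
	return true;
}
```

With can_run set (the cpu is still online at DOWN_PREPARE), the early stop
returns immediately in this model. With can_run clear, the stop at HP_DEAD
reports that the caller would stay blocked, which is the shape of the hang
seen in softlockup.c. Whether the real thread is in fact unable to run is
exactly the open question here, so treat the flag as an assumption.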

This is the hung_task_timeout message after a couple of cpu-offlines.

This is on 2.6.25-rc3.

INFO: task bash:4467 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       f3701dd0 00000046 f796aac0 f796aac0 f796abf8 cc434b80 00000000 f41ee940 
       0180b046 0000026e 00000016 00000000 00000008 f796b080 f796aac0 00000002 
       7fffffff 7fffffff f3701e1c f3701df8 c04e033a f3701e1c f3701dec c0139dec 
Call Trace:
 [<c04e033a>] schedule_timeout+0x16/0x8b
 [<c0139dec>] ? trace_hardirqs_on+0xe9/0x111
 [<c04e01c9>] wait_for_common+0xcf/0x12e
 [<c011a3f0>] ? default_wake_function+0x0/0xd
 [<c04e02aa>] wait_for_completion+0x12/0x14
 [<c012ccbb>] flush_cpu_workqueue+0x50/0x66
 [<c012cd28>] ? wq_barrier_func+0x0/0xd
 [<c012cd14>] cleanup_workqueue_thread+0x43/0x57
 [<c04c6f87>] workqueue_cpu_callback+0x8e/0xbd
 [<c04e3975>] notifier_call_chain+0x2b/0x4a
 [<c0132e9d>] __raw_notifier_call_chain+0xe/0x10
 [<c0132eab>] raw_notifier_call_chain+0xc/0xe
 [<c013e054>] _cpu_down+0x150/0x1ec
 [<c013e133>] cpu_down+0x23/0x30
 [<c02e3897>] store_online+0x27/0x5a
 [<c02e3870>] ? store_online+0x0/0x5a
 [<c02e09d8>] sysdev_store+0x20/0x25
 [<c0196d2d>] sysfs_write_file+0xad/0xdf
 [<c0196c80>] ? sysfs_write_file+0x0/0xdf
 [<c0163da9>] vfs_write+0x8c/0x108
 [<c0164333>] sys_write+0x3b/0x60
 [<c01049da>] sysenter_past_esp+0x5f/0xa5
 =======================
3 locks held by bash/4467:
 #0:  (&buffer->mutex){--..}, at: [<c0196ca5>] sysfs_write_file+0x25/0xdf
 #1:  (cpu_add_remove_lock){--..}, at: [<c013e10e>] cpu_maps_update_begin+0xf/0x11
 #2:  (cpu_hotplug_lock){----}, at: [<c013df5b>] _cpu_down+0x57/0x1ec

So it's not just an issue of the watchdog thread not being reaped.

I doubt it's due to some locking dependency since we have lockdep checks
in the workqueue code before we flush the cpu_workqueue.
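
For reference, the trace above shows the flusher sleeping in
wait_for_completion() under flush_cpu_workqueue(), with wq_barrier_func
also in the trace. The following is a small single-threaded userspace model
of that barrier-and-wait pattern, not the kernel implementation. The toy_*
names and the worker_runs flag are hypothetical. The model assumes the
flusher can only make progress once the worker thread has run everything
queued ahead of the barrier; whether the worker is really being starved here
is an assumption, not a diagnosis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_WQ_MAX 8

typedef void (*toy_work_fn)(void *arg);

struct toy_work { toy_work_fn fn; void *arg; };

/* Toy per-cpu workqueue: a FIFO plus a flag for "worker gets cpu time". */
struct toy_cwq {
	struct toy_work items[TOY_WQ_MAX];
	size_t head, tail;
	bool worker_runs;  /* assumption: false models a starved worker */
};

struct toy_completion { bool done; };

/* Barrier work item: its only job is to signal the flusher. */
static void toy_barrier_func(void *arg)
{
	((struct toy_completion *)arg)->done = true;
}

/* Example work item used to show ordinary work draining first. */
static void toy_count_work(void *arg)
{
	(*(int *)arg)++;
}

static bool toy_queue_work(struct toy_cwq *cwq, toy_work_fn fn, void *arg)
{
	if (cwq->tail == TOY_WQ_MAX)
		return false;
	cwq->items[cwq->tail].fn = fn;
	cwq->items[cwq->tail].arg = arg;
	cwq->tail++;
	return true;
}

/* Worker loop: drains the queue in order, but only if it gets to run. */
static void toy_run_workqueue(struct toy_cwq *cwq)
{
	if (!cwq->worker_runs)
		return;
	while (cwq->head < cwq->tail) {
		struct toy_work *w = &cwq->items[cwq->head++];
		assert(w->fn != NULL);
		w->fn(w->arg);
	}
}

/*
 * Flush: queue a barrier behind the pending work and wait for it.
 * Returns true if the flush completes, false where a real flusher
 * would stay asleep in its wait.
 */
static bool toy_flush_cpu_workqueue(struct toy_cwq *cwq)
{
	struct toy_completion barr = { .done = false };

	if (cwq->head == cwq->tail)
		return true;  /* nothing pending, nothing to wait for */
	if (!toy_queue_work(cwq, toy_barrier_func, &barr))
		return false; /* toy queue full: limitation of the model */
	toy_run_workqueue(cwq);
	if (!barr.done)
		cwq->tail--;  /* withdraw the barrier so the model can return */
	return barr.done;
}
```

In this model a flush of a queue whose worker runs returns true after the
pending work has executed, while a flush of a non-empty queue with
worker_runs cleared returns false, which corresponds to bash:4467 sitting in
wait_for_completion() above.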

--
Thanks and Regards
gautham
