* [Regression] Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
From: Rafael J. Wysocki @ 2008-09-20 23:24 UTC
To: Thomas Gleixner
Cc: Andrew Morton, Ingo Molnar, Kernel Testers, Linus Torvalds, LKML,
Pavel Machek
Hi,
Unfortunately resume from suspend to RAM is completely broken on my hp nx6325
because of
commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b
Author: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
Date: Tue Sep 16 11:32:50 2008 -0700
clockevents: make device shutdown robust
Signed-off-by: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
Reverting this commit makes things work again.
I know what this commit is for etc., but it obviously needs a replacement. :-(
Thanks,
Rafael
* Re: [Regression] Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
From: Rafael J. Wysocki @ 2008-09-20 23:32 UTC
To: Thomas Gleixner
Cc: Andrew Morton, Ingo Molnar, Kernel Testers, Linus Torvalds, LKML,
Pavel Machek
On Sunday, 21 of September 2008, Rafael J. Wysocki wrote:
> Hi,
>
> Unfortunately resume from suspend to RAM is completely broken on my hp nx6325
> because of
>
> commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b
> Author: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> Date: Tue Sep 16 11:32:50 2008 -0700
>
> clockevents: make device shutdown robust
>
> Signed-off-by: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
>
> Reverting this commit makes things work again.
One more thing, "broken" means that the box doesn't resume (the suspend part
seems to work correctly) and instead it seems to enter a neverending loop
that cannot be broken by any means except for the power button, so I think it
occurs with interrupts disabled.
Thanks,
Rafael
* Re: [Regression] Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
From: Rafael J. Wysocki @ 2008-09-21 19:03 UTC
To: Thomas Gleixner
Cc: Andrew Morton, Ingo Molnar, Kernel Testers, Linus Torvalds, LKML,
Pavel Machek
On Sunday, 21 of September 2008, Rafael J. Wysocki wrote:
> On Sunday, 21 of September 2008, Rafael J. Wysocki wrote:
> > Hi,
> >
> > Unfortunately resume from suspend to RAM is completely broken on my hp nx6325
> > because of
> >
> > commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b
> > Author: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> > Date: Tue Sep 16 11:32:50 2008 -0700
> >
> > clockevents: make device shutdown robust
> >
> > Signed-off-by: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> >
> > Reverting this commit makes things work again.
>
> One more thing, "broken" means that the box doesn't resume (the suspend part
> seems to work correctly) and instead it seems to enter a neverending loop
> that cannot be broken by any means except for the power button, so I think it
> occurs with interrupts disabled.
Update:
After some more debugging I verified that in fact the $subject commit breaks
CPU hotplugging (the 'online' part), so I should be able to get some more
information about what really happens.
This still will be difficult, because the box hangs solid (magic sysrq doesn't
work in this state) almost immediately after a (failing) attempt to 'online' the
previously 'offlined' CPU.
Thanks,
Rafael
* Re: [Regression] Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
From: Rafael J. Wysocki @ 2008-09-21 22:36 UTC
To: Thomas Gleixner
Cc: Andrew Morton, Ingo Molnar, Kernel Testers, Linus Torvalds, LKML,
Pavel Machek
On Sunday, 21 of September 2008, Rafael J. Wysocki wrote:
> On Sunday, 21 of September 2008, Rafael J. Wysocki wrote:
> > On Sunday, 21 of September 2008, Rafael J. Wysocki wrote:
> > > Hi,
> > >
> > > Unfortunately resume from suspend to RAM is completely broken on my hp nx6325
> > > because of
> > >
> > > commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b
> > > Author: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> > > Date: Tue Sep 16 11:32:50 2008 -0700
> > >
> > > clockevents: make device shutdown robust
> > >
> > > Signed-off-by: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> > >
> > > Reverting this commit makes things work again.
> >
> > One more thing, "broken" means that the box doesn't resume (the suspend part
> > seems to work correctly) and instead it seems to enter a neverending loop
> > that cannot be broken by any means except for the power button, so I think it
> > occurs with interrupts disabled.
>
> Update:
>
> After some more debugging I verified that in fact the $subject commit breaks
> CPU hotplugging (the 'online' part), so I should be able to get some more
> information about what really happens.
>
> This still will be difficult, because the box hangs solid (magic sysrq doesn't
> work in this state) almost immediately after a (failing) attempt to 'online' the
> previously 'offlined' CPU.
Update:
It turns out that as long as X is not used (i.e. I don't switch to it from the
console) the machine doesn't hang after the failing attempt to 'online' CPU1
(previously 'offlined'). This observation allowed me to get some more data.
Below is the trace of the process trying to 'online' CPU1 (obtained with
'echo t > /proc/sysrq-trigger'):
[ 385.548018] bash D 0000000000000000 0 3842 3796
[ 385.548018] ffff880074cdfb78 0000000000000082 ffff880074cdfac8 ffff88007500e3c0
[ 385.548018] ffffffff8073c900 ffffffff8073c900 ffffffff8073c900 ffffffff8073c900
[ 385.548018] ffffffff8073c900 ffffffff8073c900 ffffffff80738c80 ffffffff8073c900
[ 385.548018] Call Trace:
[ 385.548018] [<ffffffff80457078>] schedule_timeout+0x22/0xbf
[ 385.548018] [<ffffffff8022c678>] ? need_resched+0x1e/0x28
[ 385.548018] [<ffffffff80235150>] ? __cond_resched+0x34/0x3a
[ 385.548018] [<ffffffff80456edd>] wait_for_common+0xd5/0x13c
[ 385.548018] [<ffffffff802309e7>] ? default_wake_function+0x0/0xf
[ 385.548018] [<ffffffff80456fce>] wait_for_completion+0x18/0x1a
[ 385.548018] [<ffffffff80248d5f>] synchronize_rcu+0x32/0x3b
[ 385.548018] [<ffffffff80248de0>] ? wakeme_after_rcu+0x0/0x10
[ 385.548018] [<ffffffff8023262a>] partition_sched_domains+0xc3/0x1ee
[ 385.548018] [<ffffffff8026850f>] cpuset_track_online_cpus+0x26d/0x281
[ 385.548018] [<ffffffff80230a13>] ? wake_up_process+0x10/0x12
[ 385.548018] [<ffffffff8024e7e1>] notifier_call_chain+0x33/0x5b
[ 385.548018] [<ffffffff8024e848>] __raw_notifier_call_chain+0x9/0xb
[ 385.548018] [<ffffffff8024e859>] raw_notifier_call_chain+0xf/0x11
[ 385.548018] [<ffffffff804555f4>] _cpu_up+0xd9/0x114
[ 385.548018] [<ffffffff80455686>] cpu_up+0x57/0x67
[ 385.548018] [<ffffffff8044b8d0>] store_online+0x4d/0x75
[ 385.548018] [<ffffffff80399e53>] sysdev_store+0x1b/0x1d
[ 385.548018] [<ffffffff802ea1c2>] sysfs_write_file+0xdf/0x114
[ 385.548018] [<ffffffff802a3ae1>] vfs_write+0xa7/0xe1
[ 385.548018] [<ffffffff802a3bd5>] sys_write+0x47/0x6c
[ 385.548018] [<ffffffff8020c1ab>] system_call_fastpath+0x16/0x1b
where the synchronize_rcu() appears to be called from
detach_destroy_domains(). It seems to stay in this schedule_timeout() forever.
(Kernel .config is at
http://www.sisk.pl/kernel/debug/mainline/2.6.27-rc6-git6/nx6325-config)
I have repeated the following a couple of times:
- boot the machine
- log in as root
- run 'echo 0 > /sys/devices/system/cpu/cpu1/online'
- run 'echo 1 > /sys/devices/system/cpu/cpu1/online' (this never returns)
and each time I have got an identical (except for the time stamps) trace for
the bash process running the last command.
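The reproduction steps above could be scripted roughly as follows. This is only a sketch, not part of the original report: the function name `cpu_online_cycle` and the 10-second default deadline are assumptions chosen for illustration. It writes 0 to the given 'online' attribute, performs the 'online' write in a background subshell, and reports whether that write returned before the deadline.

```shell
#!/bin/sh
# Sketch: offline a CPU, then try to online it again, and report whether the
# 'online' write returns within a deadline. The function name and the default
# deadline are illustrative assumptions.

# cpu_online_cycle FILE [SECONDS]
#   FILE    - sysfs 'online' attribute, e.g. /sys/devices/system/cpu/cpu1/online
#   SECONDS - how long to wait for the 'online' write (default: 10)
cpu_online_cycle() {
    file=$1
    deadline=${2:-10}

    # Step 1: take the CPU offline.
    echo 0 > "$file" || return 2

    # Step 2: bring it back online in the background, so that a hanging
    # write does not block the calling shell.
    ( echo 1 > "$file" ) &
    writer=$!

    # Poll once per second until the writer is gone or the deadline passes.
    waited=0
    while kill -0 "$writer" 2>/dev/null; do
        if [ "$waited" -ge "$deadline" ]; then
            echo "online write (pid $writer) still blocked after ${deadline}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done

    echo "online write returned"
    return 0
}
```

As root, it would be called as `cpu_online_cycle /sys/devices/system/cpu/cpu1/online`. A writer stuck in uninterruptible sleep (state D, as in the trace above) cannot be killed, so the sketch only reports the hang and leaves the background process behind; its trace can then be collected with 'echo t > /proc/sysrq-trigger'.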
So, it looks like RCU is broken and I have no idea how to debug it further.
Please advise.
Thanks,
Rafael
* Re: [Regression] Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
From: Ingo Molnar @ 2008-09-22 8:41 UTC
To: Rafael J. Wysocki
Cc: Thomas Gleixner, Andrew Morton, Kernel Testers, Linus Torvalds,
LKML, Pavel Machek
* Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
> So, it looks like RCU is broken and I have no idea how to debug it
> further.
>
> Please advise.
i think RCU is not broken, RCU just happens to be the first entity
affected by C1E-enter magically killing the (previously working) lapic
timer clockevents device. We are working on a fix.
Ingo
* Re: [Regression] Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
From: Rafael J. Wysocki @ 2008-09-22 15:37 UTC
To: Ingo Molnar
Cc: Thomas Gleixner, Andrew Morton, Kernel Testers, Linus Torvalds,
LKML, Pavel Machek
On Monday, 22 of September 2008, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
>
> > So, it looks like RCU is broken and I have no idea how to debug it
> > further.
> >
> > Please advise.
>
> i think RCU is not broken, RCU just happens to be the first entity
> affected by C1E-enter magically killing the (previously working) lapic
> timer clockevents device.
Well, I should have said "RCU breaks as a result of the underlying issue". :-)
> We are working on a fix.
Thanks!
Rafael