[tracing, hang] dumping events gets stuck in synchronise

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [tracing, hang] dumping events gets stuck in synchronise_sched
@ 2010-08-17  7:37 Dave Chinner
  2010-08-17  8:37 ` Lai Jiangshan
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-17  7:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Steven Rostedt

Tracing folks,

I've got a machine stuck with a cpu spinning in a tight loop (the
new writeback/sync livelock avoidance code is, well, livelocking),
and I was trying to find out what triggered by using the writeback
trace events. Unfortunately, I can't dump the trace events because
it gets stuck here:

[  263.744094] cat           D 0000000000000000     0  3031   3001 0x00000000
[  263.744094]  ffff880117917af8 0000000000000082 0000000000000292 0000000000000010
[  263.744094]  ffff880100000000 0000000000013580 ffff88011f0d87f0 0000000000013580
[  263.744094]  ffff88011f0d8b58 ffff880117917fd8 ffff88011f0d8b60 ffff880117917fd8
[  263.744094] Call Trace:
[  263.744094]  [<ffffffff818019c5>] schedule_timeout+0x1d5/0x2a0
[  263.744094]  [<ffffffff813fc2d4>] ? do_raw_spin_lock+0x54/0x160
[  263.744094]  [<ffffffff813fc2d4>] ? do_raw_spin_lock+0x54/0x160
[  263.744094]  [<ffffffff818015ff>] wait_for_common+0xcf/0x170
[  263.744094]  [<ffffffff81078b40>] ? default_wake_function+0x0/0x20
[  263.744094]  [<ffffffff8180177d>] wait_for_completion+0x1d/0x20
[  263.744094]  [<ffffffff810d8045>] synchronize_sched+0x55/0x60
[  263.744094]  [<ffffffff8109afb0>] ? wakeme_after_rcu+0x0/0x20
[  263.744094]  [<ffffffff810e0a09>] ring_buffer_read_prepare_sync+0x9/0x10
[  263.744094]  [<ffffffff810e7efc>] tracing_open+0x2bc/0x470
[  263.744094]  [<ffffffff810e7c40>] ? tracing_open+0x0/0x470
[  263.744094]  [<ffffffff81143fed>] __dentry_open+0xed/0x340
[  263.744094]  [<ffffffff813c32af>] ? security_inode_permission+0x1f/0x30
[  263.744094]  [<ffffffff81144354>] nameidata_to_filp+0x54/0x70
[  263.744094]  [<ffffffff81151688>] do_last+0x368/0x5d0
[  263.744094]  [<ffffffff81153ae5>] do_filp_open+0x205/0x5e0
[  263.744094]  [<ffffffff813fc22e>] ? do_raw_spin_unlock+0x5e/0xb0
[  263.744094]  [<ffffffff8115eaaa>] ? alloc_fd+0xfa/0x140
[  263.744094]  [<ffffffff81143dc5>] do_sys_open+0x65/0x130
[  263.744094]  [<ffffffff81143ed0>] sys_open+0x20/0x30
[  263.744094]  [<ffffffff81036032>] system_call_fastpath+0x16/0x1b

If a CPU does not yield, then synchronize_sched() will never complete
and hence I can't get to whatever events that might lead me to the
cause of the hung CPU. A bit of a Catch-22, really.

Given that the trace events are there mainly for debugging, this
seems like a bit of an oversight - hanging a CPU in a tight loop is
not an uncommon event during code development....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [tracing, hang] dumping events gets stuck in synchronise_sched
  2010-08-17  7:37 [tracing, hang] dumping events gets stuck in synchronise_sched Dave Chinner
@ 2010-08-17  8:37 ` Lai Jiangshan
  2010-08-17 11:52   ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Lai Jiangshan @ 2010-08-17  8:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, Steven Rostedt

On 08/17/2010 03:37 PM, Dave Chinner wrote:
> Tracing folks,
> 
> I've got a machine stuck with a cpu spinning in a tight loop (the
> new writeback/sync livelock avoidance code is, well, livelocking),
> and I was trying to find out what triggered by using the writeback
> trace events. Unfortunately, I can't dump the trace events because
> it gets stuck here:

> 
> Given that the trace events are there mainly for debugging, this
> seems like a bit of an oversight - hanging a CPU in a tight loop is
> not an uncommon event during code development....
> 

You can try 'cat trace_pipe', if I did not miss you meaning.

Lai


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [tracing, hang] dumping events gets stuck in synchronise_sched
  2010-08-17  8:37 ` Lai Jiangshan
@ 2010-08-17 11:52   ` Dave Chinner
  2010-08-17 13:02     ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-17 11:52 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: linux-kernel, Steven Rostedt

On Tue, Aug 17, 2010 at 04:37:44PM +0800, Lai Jiangshan wrote:
> On 08/17/2010 03:37 PM, Dave Chinner wrote:
> > Tracing folks,
> > 
> > I've got a machine stuck with a cpu spinning in a tight loop (the
> > new writeback/sync livelock avoidance code is, well, livelocking),
> > and I was trying to find out what triggered by using the writeback
> > trace events. Unfortunately, I can't dump the trace events because
> > it gets stuck here:
> 
> > 
> > Given that the trace events are there mainly for debugging, this
> > seems like a bit of an oversight - hanging a CPU in a tight loop is
> > not an uncommon event during code development....
> > 
> 
> You can try 'cat trace_pipe', if I did not miss you meaning.

I'll try it, but I'm really after the static event list which is why
I'm using the trace file rather than trace_pipe. I want the history,
not new events as they happen. 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [tracing, hang] dumping events gets stuck in synchronise_sched
  2010-08-17 11:52   ` Dave Chinner
@ 2010-08-17 13:02     ` Steven Rostedt
  2010-08-17 22:40       ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2010-08-17 13:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lai Jiangshan, linux-kernel

On Tue, 2010-08-17 at 21:52 +1000, Dave Chinner wrote:
> On Tue, Aug 17, 2010 at 04:37:44PM +0800, Lai Jiangshan wrote:
> > On 08/17/2010 03:37 PM, Dave Chinner wrote:
> > > Tracing folks,
> > > 
> > > I've got a machine stuck with a cpu spinning in a tight loop (the
> > > new writeback/sync livelock avoidance code is, well, livelocking),
> > > and I was trying to find out what triggered by using the writeback
> > > trace events. Unfortunately, I can't dump the trace events because
> > > it gets stuck here:
> > 
> > > 
> > > Given that the trace events are there mainly for debugging, this
> > > seems like a bit of an oversight - hanging a CPU in a tight loop is
> > > not an uncommon event during code development....
> > > 
> > 
> > You can try 'cat trace_pipe', if I did not miss you meaning.
> 
> I'll try it, but I'm really after the static event list which is why
> I'm using the trace file rather than trace_pipe. I want the history,
> not new events as they happen. 

When the systems locks up, I assume you want to see why? The trace_pipe
should show that without locking the system.

You could also try downloading trace-cmd and running the tracer with
that. That will save all traces to a file while running the trace.

-- Steve



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [tracing, hang] dumping events gets stuck in synchronise_sched
  2010-08-17 13:02     ` Steven Rostedt
@ 2010-08-17 22:40       ` Dave Chinner
  2010-08-17 23:07         ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-17 22:40 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Lai Jiangshan, linux-kernel

On Tue, Aug 17, 2010 at 09:02:38AM -0400, Steven Rostedt wrote:
> On Tue, 2010-08-17 at 21:52 +1000, Dave Chinner wrote:
> > On Tue, Aug 17, 2010 at 04:37:44PM +0800, Lai Jiangshan wrote:
> > > On 08/17/2010 03:37 PM, Dave Chinner wrote:
> > > > Tracing folks,
> > > > 
> > > > I've got a machine stuck with a cpu spinning in a tight loop (the
> > > > new writeback/sync livelock avoidance code is, well, livelocking),
> > > > and I was trying to find out what triggered by using the writeback
> > > > trace events. Unfortunately, I can't dump the trace events because
> > > > it gets stuck here:
> > > 
> > > > 
> > > > Given that the trace events are there mainly for debugging, this
> > > > seems like a bit of an oversight - hanging a CPU in a tight loop is
> > > > not an uncommon event during code development....
> > > > 
> > > 
> > > You can try 'cat trace_pipe', if I did not miss you meaning.
> > 
> > I'll try it, but I'm really after the static event list which is why
> > I'm using the trace file rather than trace_pipe. I want the history,
> > not new events as they happen. 
> 
> When the systems locks up, I assume you want to see why? The trace_pipe
> should show that without locking the system.

Exactly.

> You could also try downloading trace-cmd and running the tracer with
> that. That will save all traces to a file while running the trace.

I don't have tens of GB available to store all the traces that an
xfstests test run generates. In general, I don't need the traces,
either, and when  I do the problem is usually in the current ring
buffer, which is why I typically dump the events after the fact.

If the trace file cannot be made to handle this type of use
robustly, then perhaps it should be removed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [tracing, hang] dumping events gets stuck in synchronise_sched
  2010-08-17 22:40       ` Dave Chinner
@ 2010-08-17 23:07         ` Steven Rostedt
  2010-08-18  0:55           ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2010-08-17 23:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lai Jiangshan, linux-kernel

On Wed, 2010-08-18 at 08:40 +1000, Dave Chinner wrote:

> > When the systems locks up, I assume you want to see why? The trace_pipe
> > should show that without locking the system.
> 
> Exactly.

So it did work?

> 
> > You could also try downloading trace-cmd and running the tracer with
> > that. That will save all traces to a file while running the trace.
> 
> I don't have tens of GB available to store all the traces that an
> xfstests test run generates. In general, I don't need the traces,
> either, and when  I do the problem is usually in the current ring
> buffer, which is why I typically dump the events after the fact.

Note, you can also use: trace-cmd extract

which will extract the data from the ring buffer after the fact.

The resulting file ("trace.dat" by default) would allow you to read it
via kernelshark, or use the trace-cmd filters on it as well.

> 
> If the trace file cannot be made to handle this type of use
> robustly, then perhaps it should be removed...

You mean because it uses synchronize_sched() it should be removed? Seems
that it was another kernel bug that caused the issue.

I find the trace file nice to use, to look at a file over and over
again. I could just cat trace_pipe into another file and look at that
too I suppose. But I don't find this reason as rational to remove it.

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [tracing, hang] dumping events gets stuck in synchronise_sched
  2010-08-17 23:07         ` Steven Rostedt
@ 2010-08-18  0:55           ` Dave Chinner
  2010-08-18  1:07             ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-18  0:55 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Lai Jiangshan, linux-kernel

On Tue, Aug 17, 2010 at 07:07:11PM -0400, Steven Rostedt wrote:
> On Wed, 2010-08-18 at 08:40 +1000, Dave Chinner wrote:
> 
> > > When the systems locks up, I assume you want to see why? The trace_pipe
> > > should show that without locking the system.
> > 
> > Exactly.
> 
> So it did work?

I haven't got back to the hanging kernel yet - I had to go back to
2.6.35 to triage and test a regression (which I couldn't with broken
writeback) and I'm still dealing with the fallout of that one...

> > If the trace file cannot be made to handle this type of use
> > robustly, then perhaps it should be removed...
> 
> You mean because it uses synchronize_sched() it should be removed? Seems
> that it was another kernel bug that caused the issue.

Right now I'm using the trace file for exactly what the docuemntation
says it should be used for, but it hangs. You're saying that the
interface is effectively redundant because there's a trace_pipe
interface that shouldn't hang and I should use that instead.

Debug interfaces should be reliable in the face of common problems -
a system burning a CPU in a tight loop is a pretty common problem.
That leads to three options: either fix the hang, document the fact
that the trace file is not reliable and that you should use other
interfaces, or remove the interface altogether.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [tracing, hang] dumping events gets stuck in synchronise_sched
  2010-08-18  0:55           ` Dave Chinner
@ 2010-08-18  1:07             ` Steven Rostedt
  0 siblings, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2010-08-18  1:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Lai Jiangshan, linux-kernel

On Wed, 2010-08-18 at 10:55 +1000, Dave Chinner wrote:
> On Tue, Aug 17, 2010 at 07:07:11PM -0400, Steven Rostedt wrote:

> > You mean because it uses synchronize_sched() it should be removed? Seems
> > that it was another kernel bug that caused the issue.
> 
> Right now I'm using the trace file for exactly what the docuemntation
> says it should be used for, but it hangs. You're saying that the
> interface is effectively redundant because there's a trace_pipe
> interface that shouldn't hang and I should use that instead.

trace_pipe consumes the data as trace does not.

> 
> Debug interfaces should be reliable in the face of common problems -
> a system burning a CPU in a tight loop is a pretty common problem.
> That leads to three options: either fix the hang, document the fact
> that the trace file is not reliable and that you should use other
> interfaces, or remove the interface altogether.

I could update the documentation to warn that the trace file uses the
common utility "synchronize_sched()" and that if you are debugging a
system that has a CPU out of commission, not to use it because,
synchronize_sched() will never return. If you need to debug this case,
then use trace_pipe which will consume all data, but is made for live
reads and does not use synchronize_sched().

But the tracer is used for more than just debugging. I use it for
analyzing system behavior, and not to debug anything, which I use the
trace file for all the time. I usually enable tracing, disable it, then
look at the trace file. For any live tracing, trace_pipe should be used.
If I had not documented that, I should go back and do so.

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-08-18  1:07 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-08-17  7:37 [tracing, hang] dumping events gets stuck in synchronise_sched Dave Chinner
2010-08-17  8:37 ` Lai Jiangshan
2010-08-17 11:52   ` Dave Chinner
2010-08-17 13:02     ` Steven Rostedt
2010-08-17 22:40       ` Dave Chinner
2010-08-17 23:07         ` Steven Rostedt
2010-08-18  0:55           ` Dave Chinner
2010-08-18  1:07             ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox