bad trace magic

Linux btrace development
 help / color / mirror / Atom feed

* bad trace magic
@ 2008-11-19 11:38 Martin Peschke
  2008-11-19 11:49 ` Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Martin Peschke @ 2008-11-19 11:38 UTC (permalink / raw)
  To: linux-btrace

Hi,

I run across blktrace complaints about broken traces once in a while.
I have pimped up the debugging output in verify_trace():

  --- bad trace magic ---
  magic    0x00000000
  sequence 0x00000000
  time     0x0000000000000000
  sector   0x0000000000000000
  bytes	   0x65617407		<--- begin of trace (trace magic)
  action   0x000005b0
  pid      0x0000029c
  device   0x7c185ac0
  cpu      0x00000000
  error    0x0036
  pdu_len  0x18c0

  00000000 00000000 00000000 00000000 00000000 00000000
  65617407 000005b0 0000029c 7c185ac0 00000000 003618c0

  bad trace in /sys/kernel/debug/block/sdbm/trace1

Traces appear to get out of place by some bytes. The offset varies.

I have got a couple of questions:

Does blktrace stop processing of a trace stream in this case, doesn't
it?

Next I am going to dump the trace which precedes a broken trace.
Any better ideas how to debug this one?

Thanks,
Martin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bad trace magic
  2008-11-19 11:38 bad trace magic Martin Peschke
@ 2008-11-19 11:49 ` Jens Axboe
  2008-11-19 12:43 ` Martin Peschke
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2008-11-19 11:49 UTC (permalink / raw)
  To: linux-btrace

On Wed, Nov 19 2008, Martin Peschke wrote:
> Hi,
> 
> I run across blktrace complaints about broken traces once in a while.
> I have pimped up the debugging output in verify_trace():
> 
>   --- bad trace magic ---
>   magic    0x00000000
>   sequence 0x00000000
>   time     0x0000000000000000
>   sector   0x0000000000000000
>   bytes	   0x65617407		<--- begin of trace (trace magic)
>   action   0x000005b0
>   pid      0x0000029c
>   device   0x7c185ac0
>   cpu      0x00000000
>   error    0x0036
>   pdu_len  0x18c0
> 
>   00000000 00000000 00000000 00000000 00000000 00000000
>   65617407 000005b0 0000029c 7c185ac0 00000000 003618c0
> 
>   bad trace in /sys/kernel/debug/block/sdbm/trace1
> 
> Traces appear to get out of place by some bytes. The offset varies.
> 
> 
> I have got a couple of questions:
> 
> Does blktrace stop processing of a trace stream in this case, doesn't
> it?
> 
> Next I am going to dump the trace which precedes a broken trace.
> Any better ideas how to debug this one?

I think Alan and I debugged something like this in the early days of
blktrace. Is your stream consistently off when this happens? Does the
sequence numbers jump?

The most likely culprit is a relay bug, and there's recently been some
fixing for that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bad trace magic
  2008-11-19 11:38 bad trace magic Martin Peschke
  2008-11-19 11:49 ` Jens Axboe
@ 2008-11-19 12:43 ` Martin Peschke
  2008-11-19 12:44 ` Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Martin Peschke @ 2008-11-19 12:43 UTC (permalink / raw)
  To: linux-btrace


On Wed, 2008-11-19 at 12:49 +0100, Jens Axboe wrote:
> On Wed, Nov 19 2008, Martin Peschke wrote:
> > Hi,
> > 
> > I run across blktrace complaints about broken traces once in a while.
> > I have pimped up the debugging output in verify_trace():
> > ...
> I think Alan and I debugged something like this in the early days of
> blktrace. Is your stream consistently off when this happens? Does the
> sequence numbers jump?

Well, consistently off? If it's off once, is there a next trace?

Sequence numbers appear to jump, from 183e8 to f1b1 in this case:

--- bad trace magic ---
magic    0x00000001
sequence 0x0001007a
time     0x000000000000fb40
sector   0x000000000005925e
bytes    0x65617407
action   0x0000f1b1
pid      0x00000773
device   0xec86baa8
cpu      0x00000000
error    0x007f
pdu_len  0x8430

00000001 0001007a 00000000 0000fb40 00000000 0005925e 65617407 0000f1b1
00000773 ec86baa8 00000000 007f8430

bad trace in /sys/kernel/debug/block/sdae/trace0

--- previous trace ---
magic    0x65617407
sequence 0x000183e8
time     0x00000815371024d2
sector   0x000000000052cdf0
bytes    0x00001000
action   0x09820008
pid      0x000019bf
device   0x041000e0
cpu      0x00000000
error    0x0000
pdu_len  0x0000

65617407 000183e8 00000815 371024d2 00000000 0052cdf0 00001000 09820008
000019bf 041000e0 00000000 00000000


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bad trace magic
  2008-11-19 11:38 bad trace magic Martin Peschke
  2008-11-19 11:49 ` Jens Axboe
  2008-11-19 12:43 ` Martin Peschke
@ 2008-11-19 12:44 ` Jens Axboe
  2008-11-19 15:00 ` Martin Peschke
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2008-11-19 12:44 UTC (permalink / raw)
  To: linux-btrace

On Wed, Nov 19 2008, Martin Peschke wrote:
> 
> On Wed, 2008-11-19 at 12:49 +0100, Jens Axboe wrote:
> > On Wed, Nov 19 2008, Martin Peschke wrote:
> > > Hi,
> > > 
> > > I run across blktrace complaints about broken traces once in a while.
> > > I have pimped up the debugging output in verify_trace():
> > > ...
> > I think Alan and I debugged something like this in the early days of
> > blktrace. Is your stream consistently off when this happens? Does the
> > sequence numbers jump?
> 
> Well, consistently off? If it's off once, is there a next trace?

How are you running blktrace? If you have the blktrace output files, you
can inspect what it would have done next. blkparse may not do more after
it detects a bad trace, though.
> 
> Sequence numbers appear to jump, from 183e8 to f1b1 in this case:
> 
> --- bad trace magic ---
> magic    0x00000001
> sequence 0x0001007a
> time     0x000000000000fb40
> sector   0x000000000005925e
> bytes    0x65617407
> action   0x0000f1b1
> pid      0x00000773
> device   0xec86baa8
> cpu      0x00000000
> error    0x007f
> pdu_len  0x8430
> 
> 00000001 0001007a 00000000 0000fb40 00000000 0005925e 65617407 0000f1b1
> 00000773 ec86baa8 00000000 007f8430

So that's interesting. What byte offset did this happen at? And what is
your subbuffer size for blktrace?

> 
> bad trace in /sys/kernel/debug/block/sdae/trace0
> 
> --- previous trace ---
> magic    0x65617407
> sequence 0x000183e8
> time     0x00000815371024d2
> sector   0x000000000052cdf0
> bytes    0x00001000
> action   0x09820008
> pid      0x000019bf
> device   0x041000e0
> cpu      0x00000000
> error    0x0000
> pdu_len  0x0000
> 
> 65617407 000183e8 00000815 371024d2 00000000 0052cdf0 00001000 09820008
> 000019bf 041000e0 00000000 00000000
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrace" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bad trace magic
  2008-11-19 11:38 bad trace magic Martin Peschke
                   ` (2 preceding siblings ...)
  2008-11-19 12:44 ` Jens Axboe
@ 2008-11-19 15:00 ` Martin Peschke
  2008-11-19 15:07 ` Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Martin Peschke @ 2008-11-19 15:00 UTC (permalink / raw)
  To: linux-btrace


On Wed, 2008-11-19 at 13:44 +0100, Jens Axboe wrote:
> How are you running blktrace?

blktrace -a issue -a drv_data -a complete -w 43200 -o
- /dev/sda ... /dev/sdz | blkiomon

> If you have the blktrace output files, you
> can inspect what it would have done next.

I am concerned about pdu_len being some random number then.
I don't think blktrace would find a valid trace after taking a random
number of bytes for a trace payload.

> > Sequence numbers appear to jump, from 183e8 to f1b1 in this case:
> > 
> > --- bad trace magic ---
> > magic    0x00000001
> > sequence 0x0001007a
> > time     0x000000000000fb40
> > sector   0x000000000005925e
> > bytes    0x65617407
> > action   0x0000f1b1
> > pid      0x00000773
> > device   0xec86baa8
> > cpu      0x00000000
> > error    0x007f
> > pdu_len  0x8430
> > 
> > 00000001 0001007a 00000000 0000fb40 00000000 0005925e 65617407 0000f1b1
> > 00000773 ec86baa8 00000000 007f8430
> 
> So that's interesting. What byte offset did this happen at? And what is
> your subbuffer size for blktrace?

see tip_subbuf output (from flush_subbuf_file()) below

--- bad trace magic ---
magic    0x00017000
sequence 0x40010011
time     0x0000346f04100080
sector   0x0000000100000018
bytes    0x00000001
action   0x00010080
pid      0x00000000
device   0x0000bbea
cpu      0x00000000
error    0x0014
pdu_len  0x8d98

00017000 40010011 0000346f 04100080 00000001 00000018
00000001 00010080 00000000 0000bbea 00000000 00148d98

bad trace in /sys/kernel/debug/block/sdy/trace1

ts->buf 0x2004c5027c0, ts->len 0x108, ts->max_len 0x80000,
offset 0x48, t 0x2004c502808

--- previous trace ---
magic    0x65617407
sequence 0x0001a232
time     0x00000e5bb87cd096
sector   0x0000000000843c08
bytes    0x00001000
action   0x40010011
pid      0x00003467
device   0x04100080
cpu      0x00000001
error    0x0000
pdu_len  0x0018

65617407 0001a232 00000e5b b87cd096 00000000 00843c08
00001000 40010011 00003467 04100080 00000001 00000018


--- bad trace magic ---
magic    0x00001000
sequence 0x09410007
time     0x00003469008000c0
sector   0x0000000000000000
bytes    0x65617407
action   0x0000fae2
pid      0x00000e44
device   0xbbefaa91
cpu      0x00000000
error    0x008f
pdu_len  0xd358

00001000 09410007 00003469 008000c0 00000000 00000000
65617407 0000fae2 00000e44 bbefaa91 00000000 008fd358

bad trace in /sys/kernel/debug/block/sdm/trace0

ts->buf 0x20060280910, ts->len 0xc00, ts->max_len 0x80000,
offset 0xf0, t 0x20060280a00

--- previous trace ---
magic    0x65617407
sequence 0x0000fadd
time     0x65617407000067d8
sector   0x00000d9ba13e3b74
bytes    0x00000000
action   0x00500030
pid      0x00001000
device   0x40010011
cpu      0x00003466
error    0x0080
pdu_len  0x00c0

65617407 0000fadd 65617407 000067d8 00000d9b a13e3b74
00000000 00500030 00001000 40010011 00003466 008000c0




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bad trace magic
  2008-11-19 11:38 bad trace magic Martin Peschke
                   ` (3 preceding siblings ...)
  2008-11-19 15:00 ` Martin Peschke
@ 2008-11-19 15:07 ` Jens Axboe
  2008-11-19 17:41 ` Martin Peschke
  2008-11-26  9:33 ` Martin Peschke
  6 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2008-11-19 15:07 UTC (permalink / raw)
  To: linux-btrace

On Wed, Nov 19 2008, Martin Peschke wrote:
> 
> On Wed, 2008-11-19 at 13:44 +0100, Jens Axboe wrote:
> > How are you running blktrace?
> 
> blktrace -a issue -a drv_data -a complete -w 43200 -o
> - /dev/sda ... /dev/sdz | blkiomon

Ah ok. How hard is this to reproduce? If you could try a non-pipe
approach, then it would be nice to see if it happens there as well.

> > If you have the blktrace output files, you
> > can inspect what it would have done next.
> 
> I am concerned about pdu_len being some random number then.
> I don't think blktrace would find a valid trace after taking a random
> number of bytes for a trace payload.

You are right, it's definitely a rocky road...

> > > Sequence numbers appear to jump, from 183e8 to f1b1 in this case:
> > > 
> > > --- bad trace magic ---
> > > magic    0x00000001
> > > sequence 0x0001007a
> > > time     0x000000000000fb40
> > > sector   0x000000000005925e
> > > bytes    0x65617407
> > > action   0x0000f1b1
> > > pid      0x00000773
> > > device   0xec86baa8
> > > cpu      0x00000000
> > > error    0x007f
> > > pdu_len  0x8430
> > > 
> > > 00000001 0001007a 00000000 0000fb40 00000000 0005925e 65617407 0000f1b1
> > > 00000773 ec86baa8 00000000 007f8430
> > 
> > So that's interesting. What byte offset did this happen at? And what is
> > your subbuffer size for blktrace?
> 
> see tip_subbuf output (from flush_subbuf_file()) below
> 
> --- bad trace magic ---
> magic    0x00017000
> sequence 0x40010011
> time     0x0000346f04100080
> sector   0x0000000100000018
> bytes    0x00000001
> action   0x00010080
> pid      0x00000000
> device   0x0000bbea
> cpu      0x00000000
> error    0x0014
> pdu_len  0x8d98
> 
> 00017000 40010011 0000346f 04100080 00000001 00000018
> 00000001 00010080 00000000 0000bbea 00000000 00148d98
> 
> bad trace in /sys/kernel/debug/block/sdy/trace1
> 
> ts->buf 0x2004c5027c0, ts->len 0x108, ts->max_len 0x80000,
> offset 0x48, t 0x2004c502808
> 
> --- previous trace ---
> magic    0x65617407
> sequence 0x0001a232
> time     0x00000e5bb87cd096
> sector   0x0000000000843c08
> bytes    0x00001000
> action   0x40010011
> pid      0x00003467
> device   0x04100080
> cpu      0x00000001
> error    0x0000
> pdu_len  0x0018
> 
> 65617407 0001a232 00000e5b b87cd096 00000000 00843c08
> 00001000 40010011 00003467 04100080 00000001 00000018
> 
> 
> --- bad trace magic ---
> magic    0x00001000
> sequence 0x09410007
> time     0x00003469008000c0
> sector   0x0000000000000000
> bytes    0x65617407
> action   0x0000fae2
> pid      0x00000e44
> device   0xbbefaa91
> cpu      0x00000000
> error    0x008f
> pdu_len  0xd358
> 
> 00001000 09410007 00003469 008000c0 00000000 00000000
> 65617407 0000fae2 00000e44 bbefaa91 00000000 008fd358
> 
> bad trace in /sys/kernel/debug/block/sdm/trace0
> 
> ts->buf 0x20060280910, ts->len 0xc00, ts->max_len 0x80000,
> offset 0xf0, t 0x20060280a00
> 
> --- previous trace ---
> magic    0x65617407
> sequence 0x0000fadd
> time     0x65617407000067d8
> sector   0x00000d9ba13e3b74
> bytes    0x00000000
> action   0x00500030
> pid      0x00001000
> device   0x40010011
> cpu      0x00003466
> error    0x0080
> pdu_len  0x00c0
> 
> 65617407 0000fadd 65617407 000067d8 00000d9b a13e3b74
> 00000000 00500030 00001000 40010011 00003466 008000c0

So you don't happen to have a stream of data we can inspect _after_ the
bad magic occured?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bad trace magic
  2008-11-19 11:38 bad trace magic Martin Peschke
                   ` (4 preceding siblings ...)
  2008-11-19 15:07 ` Jens Axboe
@ 2008-11-19 17:41 ` Martin Peschke
  2008-11-26  9:33 ` Martin Peschke
  6 siblings, 0 replies; 8+ messages in thread
From: Martin Peschke @ 2008-11-19 17:41 UTC (permalink / raw)
  To: linux-btrace


On Wed, 2008-11-19 at 16:07 +0100, Jens Axboe wrote:
> On Wed, Nov 19 2008, Martin Peschke wrote:
> > On Wed, 2008-11-19 at 13:44 +0100, Jens Axboe wrote:
> > > How are you running blktrace?
> > 
> > blktrace -a issue -a drv_data -a complete -w 43200 -o
> > - /dev/sda ... /dev/sdz | blkiomon
> 
> Ah ok. How hard is this to reproduce?

It's not too hard. I get several bad traces within about half an hour.

> If you could try a non-pipe approach, then it would be nice to see if
> it happens there as well.

I have tried this for about 90 minutes:

blktrace -a complete -a issue -a
drv_data /dev/sda   /dev/sdae  /dev/sdaj  /dev/sdao  /dev/sdat  /dev/sday
/dev/sdbc  /dev/sdbh  /dev/sdbm  /dev/sdbr  /dev/sdbw  /dev/sdca
/dev/sdg  /dev/sdl  /dev/sdq  /dev/sdv /dev/sdaa

No bad traces. Need to check whether verify_trace() was called at all...

For this test I had to reduce the number of devices traced due to
the oom-killer picking on blktrace.

Or did you have something else in mind?

> So you don't happen to have a stream of data we can inspect _after_
> the bad magic occured?

I was trying with the command line above.
Would be blktrace in network mode worth another try?


I have just run the my initial test case on a recent linux git tree.
Same issue, bad traces:

--- bad trace magic ---
magic    0x00000000
sequence 0x00000000
time     0x65617407000014d5    <--- first trace magic
sector   0x656174070000a75c    <--- another trace magic?!
bytes    0x0000011e
action   0x473dbce5
pid      0x00000000
device   0x0095b5e0
cpu      0x00004000
error    0x0181
pdu_len  0x0008

00000000 00000000 65617407 000014d5 65617407 0000a75c
0000011e 473dbce5 00000000 0095b5e0 00004000 01810008

bad trace in /sys/kernel/debug/block/sdaa/trace0
no previous trace
ts->buf 0x0x81fa0610, ts->len 0x958, ts->max_len 0x80000, offset 0x0, t
0x0x81fa0610

Then I ran blkrawverify /dev/sdaa - no issue found.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bad trace magic
  2008-11-19 11:38 bad trace magic Martin Peschke
                   ` (5 preceding siblings ...)
  2008-11-19 17:41 ` Martin Peschke
@ 2008-11-26  9:33 ` Martin Peschke
  6 siblings, 0 replies; 8+ messages in thread
From: Martin Peschke @ 2008-11-26  9:33 UTC (permalink / raw)
  To: linux-btrace


On Wed, 2008-11-19 at 16:07 +0100, Jens Axboe wrote:
> So you don't happen to have a stream of data we can inspect _after_ the
> bad magic occured?

Hi Jens,
any suggestions, a commmand line, for capturing that stream of data?

I have been looking into the kernel side of it, and blktrace/relay
appear to continue writing data when a bad trace has been detected in
user space.

Thanks,
Martin


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-11-26  9:33 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-19 11:38 bad trace magic Martin Peschke
2008-11-19 11:49 ` Jens Axboe
2008-11-19 12:43 ` Martin Peschke
2008-11-19 12:44 ` Jens Axboe
2008-11-19 15:00 ` Martin Peschke
2008-11-19 15:07 ` Jens Axboe
2008-11-19 17:41 ` Martin Peschke
2008-11-26  9:33 ` Martin Peschke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox