[Drbd-dev] DRBD-8: recent regression causing corruption and crashes

Distributed Replicated Block Device (DRBD) development
 help / color / mirror / Atom feed

* [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
@ 2006-08-11  2:31 Graham, Simon
  0 siblings, 0 replies; 3+ messages in thread
From: Graham, Simon @ 2006-08-11  2:31 UTC (permalink / raw)
  To: drbd-dev

It seems that there has been a recent regression (somewhere in the past
couple of weeks) that is causing me a couple of nasty and very
reproducible problems as follows:

1. I get errors during initial synchronization of a volume like this
that cause the resync to be aborted:

drbd15: tl_verify: failed to find req e51a4da0, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: short read expecting header on sock: r=-512
drbd15: tl_clear()

	and

drbd15: ASSERT( drbd_req_get_sector(i) == sector ) in
/sandbox/sgraham/sn/trunk/platform/drbd/8.0/drbd/drbd_main.c:308
drbd15: tl_verify: found req eaffff28 but it has wrong sector (ac030
versus 0)

	(Note that this trace is the enhanced trace I posted a few hours
ago). When I turn on the packet tracing I get some very
	odd messages, including lots of WriteAck's with 0 for the sector
(but everything else correct - SYNCER_ID for the block id and an 
      incrementing sequence number 0 like this:

Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1671)
Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1672)
Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1673)

	but I don't see the Ack's with non-SYNCER_ID block id that cause
the previous trace (I'm guessing this is just
	printk not writing all trace messages due to the rate). At first
I thought this was just a bug in the printing
	code but I'm damned if I can spot it!

2. I get panics with the following signature:- these look like they are
happening when a local write
    on the primary (which this node is) completes.

drbd15: tl_verify: failed to find req eab1ce48, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: drbd_pp_alloc interrupted!
drbd15: error receiving RSDataRequest, l: 24!
drbd15: tl_clear()
general protection fault: 6fa0 [#1]
Modules linked in: drbd ipmi_devintf ipmi_si ipmi_msghandler video
thermal processor fan button battery ac
CPU:    0
EIP:    0061:[<eafabf4f>]    Not tainted VLI
EFLAGS: 00010096   (2.6.16.13-xen0 #2) 
EIP is at 0xeafabf4f
eax: eafabf3f   ebx: 00000001   ecx: 00000001   edx: 00000003
esi: eafabf48   edi: bf48eafa   ebp: c0557cbc   esp: c0557c9c
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, threadinfo=c0556000 task=c04c6860)
Stack: <0>c0115a98 eafabf40 00000003 00000000 00000000 00000000 00000001
00000020 
       c0557ce0 c0115b03 eaf97634 00000003 00000001 00000000 00000000
00000001 
       eaf975e8 c0557cec c021c27f 00000000 c0557d20 c044ca27 eaf97624
eaf97624 
Call Trace:
 [<c010513a>] show_stack_log_lvl+0xaa/0xe0
 [<c010534e>] show_registers+0x18e/0x210
 [<c0105549>] die+0xd9/0x180
 [<c0105eb4>] do_general_protection+0xc4/0x100
 [<c0104d5f>] error_code+0x2b/0x30
 [<c0115b03>] __wake_up+0x33/0x60
 [<c021c27f>] __up+0x1f/0x30
 [<c044ca27>] __up_wakeup+0x7/0x10
 [<f127f107>] drbd_endio_write_pri+0xa7/0x230 [drbd]
 [<c015efaf>] bio_endio+0x4f/0x80
 [<c03cf569>] dec_pending+0x49/0x80
 [<c03cf64b>] clone_endio+0xab/0xd0
 [<c015efaf>] bio_endio+0x4f/0x80
 [<c020e260>] __end_that_request_first+0xc0/0x340
 [<c020e52f>] end_that_request_chunk+0x1f/0x30
 [<c0310b57>] scsi_end_request+0x37/0x110
 [<c0310ea4>] scsi_io_completion+0x104/0x530
 [<c037c4fa>] sd_rw_intr+0xca/0x290
 [<c030bdd5>] scsi_finish_command+0x85/0xa0
 [<c03118cb>] scsi_softirq_done+0xcb/0xf0
 [<c020e5cb>] blk_done_softirq+0x8b/0xb0
 [<c011e4c2>] __do_softirq+0x62/0xd0
 [<c011e578>] do_softirq+0x48/0x60
 [<c011e645>] irq_exit+0x35/0x40
 [<c01067a2>] do_IRQ+0x22/0x30
 [<c02dfb74>] evtchn_do_upcall+0x54/0xa0
 [<c0104d90>] hypervisor_callback+0x2c/0x34
 [<c0102b36>] cpu_idle+0x66/0x70
 [<c010202b>] _stext+0x2b/0x40
 [<c055886a>] start_kernel+0x1aa/0x1f0
 [<c010006f>] 0xc010006f
Code: 00 a0 fa ea 6c bf fa ea 25 e4 44 c0 34 76 f9 ea 00 a0 fa ea 30 20
1d eb 00 00 00 00 01 00 00 00 30 20 1d eb 48 bf fa ea 48 bf fa <ea> 34
76 f9 ea a0 6f ad ea e8 75 f9 ea 24 76 f9 ea a0 6f ad ea 
 <0>Kernel panic - not syncing: Fatal exception in interrupt

At the moment I'm somewhat stumped by this and would welcome any
thoughts on where to look...

Thanks
Simon

PS: Should have explained what we did before hand - the sequence is:

1. Both nodes issue drbdmeta commands to initialize the disk area - this
is done independently - that is, the systems
   are not connected at this point.

2. One node mounts the drbd disk and untars a blob of data, still not
connected to the other node

3, The systems are connected and rebooted and start resyncing
automatically, leading to the above problems.
   Note that we are also writing the volume on the SyncSource during the
resync.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
@ 2006-08-11 16:01 Graham, Simon
  2006-08-11 18:45 ` Lars Ellenberg
  0 siblings, 1 reply; 3+ messages in thread
From: Graham, Simon @ 2006-08-11 16:01 UTC (permalink / raw)
  To: Graham, Simon, drbd-dev

Quick update:

> -----Original Message-----
> 1. I get errors during initial synchronization of a volume like this
> that cause the resync to be aborted:
> 
> drbd15: tl_verify: failed to find req e51a4da0, sector 0 in list
> drbd15: Got a corrupt block_id/sector pair(2).
> drbd15: short read expecting header on sock: r=-512
> drbd15: tl_clear()
> 

FWIW, I have now confirmed that the WriteAck messages are being sent
with sector=0 (I used ethereal plus a drbd protocol dissector that I'm
working on -- more on that later when it's closer to being ready;
however, I checked the binary dump as well, and the node is definitely
sending lots of WriteAck's with sector==0 and block_id==SYNCER_ID) - for
example:

Frame 5706 (354 bytes on wire, 354 bytes captured)
Ethernet II, Src: Dell_4b:00:e2 (00:13:72:4b:00:e2), Dst: Dell_4a:ff:8e
(00:13:72:4a:ff:8e)
Internet Protocol, Src: 192.168.1.100 (192.168.1.100), Dst:
192.168.1.101 (192.168.1.101)
Transmission Control Protocol, Src Port: 7803 (7803), Dst Port: 42289
(42289), Seq: 129, Ack: 9, Len: 288
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000

> 2. I get panics with the following signature:- these look like they
are
> happening when a local write
>     on the primary (which this node is) completes.

The panic signature seems to change - for example, I just got one like
this in the receiver thread:

drbd15: ASSERT( drbd_req_get_sector(i) == sector ) in
/sandbox/sgraham/sn/trunk/platform/drbd/8.0/drbd/drbd_main.c:313
drbd15: tl_verify: found req e63d0240 but it has wrong sector (8 versus
0)
drbd15: tl_verify: failed to find req e63d02b0, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: drbd_pp_alloc interrupted!
drbd15: error receiving RSDataRequest, l: 24!
drbd15: tl_clear()
drbd15: in tl_clear_barrier:374: ap_pending_cnt = -1 < 0 !
drbd15: ap_pending_cnt = -1
Unable to handle kernel paging request at virtual address 7f10debb
 printing eip:
7f10debb
*pde = ma 00000000 pa fffff000
Oops: 0000 [#1]
Modules linked in: drbd ipmi_devintf ipmi_si ipmi_msghandler video
thermal processor fan button battery ac
CPU:    0
EIP:    0061:[<7f10debb>]    Not tainted VLI
EFLAGS: 00010082   (2.6.16.13-xen0 #2) 
EIP is at 0x7f10debb
eax: eae0833e   ebx: 7720debb   ecx: ffffffff   edx: 00000003
esi: 7eccc0b3   edi: df4deadb   ebp: f3c0c015   esp: eadb7ece
ds: 007b   es: 007b   ss: 0069
Process drbd15_receiver (pid: 4250, threadinfo=eadb6000 task=c17d4a70)
Stack: <0>b3b1eadb f3c0f128 f3c0debb 0200debb 00000000 00030000 00010000
00000000 
       00000000 00000000 7ef80000 7ef8eadb 7f10eadb 91c0eadb ff80ec82
91c0ffff 
       7f54ec82 b62eeadb 91c0f128 6620ec82 fff8c155 0001001f 02000000
00030000 
Call Trace:
 [<c010513a>] show_stack_log_lvl+0xaa/0xe0
 [<c010534e>] show_registers+0x18e/0x210
 [<c0105549>] die+0xd9/0x180
 [<c0112b7c>] do_page_fault+0x3cc/0x640
 [<c0104d5f>] error_code+0x2b/0x30
Code:  Bad EIP value.
 <0>Fatal exception: panic in 5 seconds

All suggestions gratefully accepted!
Simon

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
  2006-08-11 16:01 [Drbd-dev] DRBD-8: recent regression causing corruption and crashes Graham, Simon
@ 2006-08-11 18:45 ` Lars Ellenberg
  0 siblings, 0 replies; 3+ messages in thread
From: Lars Ellenberg @ 2006-08-11 18:45 UTC (permalink / raw)
  To: drbd-dev

/ 2006-08-11 12:01:23 -0400
\ Graham, Simon:
> Quick update:
> 

How exactly do you "test"?
Kernel and hardware?
(sorry, if you posted that earlier, just point me to it)

I triggered a full sync (drbdadm invalidate),
and while that was running, access the Primary(SyncSource)
(cp -av /somethinghuge/ /mnt/drbd-mount-point/)

> > 1. I get errors during initial synchronization of a volume like this
> > that cause the resync to be aborted:
> > 
> > drbd15: tl_verify: failed to find req e51a4da0, sector 0 in list

I don't see those here.

> DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000

I don't see these either.

> > 2. I get panics with the following signature:- these look like they
> are
> > happening when a local write
> >     on the primary (which this node is) completes.
> 
> The panic signature seems to change - for example, I just got one like
> this in the receiver thread:
> 
> drbd15: ASSERT( drbd_req_get_sector(i) == sector ) in
> /sandbox/sgraham/sn/trunk/platform/drbd/8.0/drbd/drbd_main.c:313
> drbd15: tl_verify: found req e63d0240 but it has wrong sector (8 versus
> 0)

nor these.

> drbd15: in tl_clear_barrier:374: ap_pending_cnt = -1 < 0 !

this is bad...

What I do see here is: "ap_pending > 0" still too often, when I
disconnect during resync + write activity, effectively blocking the
Primary's io subsystem.  seemingly we still got bugs in tl_clear :(
need to look into that further.

> Code:  Bad EIP value.
>  <0>Fatal exception: panic in 5 seconds

outch.

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2006-08-11 18:45 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-11 16:01 [Drbd-dev] DRBD-8: recent regression causing corruption and crashes Graham, Simon
2006-08-11 18:45 ` Lars Ellenberg
  -- strict thread matches above, loose matches on Subject: below --
2006-08-11  2:31 Graham, Simon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox