* [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
@ 2006-08-11 2:31 Graham, Simon
0 siblings, 0 replies; 3+ messages in thread
From: Graham, Simon @ 2006-08-11 2:31 UTC (permalink / raw)
To: drbd-dev
It seems that there has been a recent regression (somewhere in the past
couple of weeks) that is causing me a couple of nasty and very
reproducible problems as follows:
1. I get errors during initial synchronization of a volume like this
that cause the resync to be aborted:
drbd15: tl_verify: failed to find req e51a4da0, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: short read expecting header on sock: r=-512
drbd15: tl_clear()
and
drbd15: ASSERT( drbd_req_get_sector(i) == sector ) in
/sandbox/sgraham/sn/trunk/platform/drbd/8.0/drbd/drbd_main.c:308
drbd15: tl_verify: found req eaffff28 but it has wrong sector (ac030
versus 0)
(Note that this trace is the enhanced trace I posted a few hours
ago). When I turn on the packet tracing I get some very
odd messages, including lots of WriteAck's with 0 for the sector
(but everything else correct - SYNCER_ID for the block id and an
incrementing sequence number 0 like this:
Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1671)
Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1672)
Aug 10 10:58:24 ellwood kernel: drbd15: <<< WriteAck (sector 0, size
8000, id ffffffffffffffff, seq 1673)
but I don't see the Ack's with non-SYNCER_ID block id that cause
the previous trace (I'm guessing this is just
printk not writing all trace messages due to the rate). At first
I thought this was just a bug in the printing
code but I'm damned if I can spot it!
2. I get panics with the following signature:- these look like they are
happening when a local write
on the primary (which this node is) completes.
drbd15: tl_verify: failed to find req eab1ce48, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: drbd_pp_alloc interrupted!
drbd15: error receiving RSDataRequest, l: 24!
drbd15: tl_clear()
general protection fault: 6fa0 [#1]
Modules linked in: drbd ipmi_devintf ipmi_si ipmi_msghandler video
thermal processor fan button battery ac
CPU: 0
EIP: 0061:[<eafabf4f>] Not tainted VLI
EFLAGS: 00010096 (2.6.16.13-xen0 #2)
EIP is at 0xeafabf4f
eax: eafabf3f ebx: 00000001 ecx: 00000001 edx: 00000003
esi: eafabf48 edi: bf48eafa ebp: c0557cbc esp: c0557c9c
ds: 007b es: 007b ss: 0069
Process swapper (pid: 0, threadinfo=c0556000 task=c04c6860)
Stack: <0>c0115a98 eafabf40 00000003 00000000 00000000 00000000 00000001
00000020
c0557ce0 c0115b03 eaf97634 00000003 00000001 00000000 00000000
00000001
eaf975e8 c0557cec c021c27f 00000000 c0557d20 c044ca27 eaf97624
eaf97624
Call Trace:
[<c010513a>] show_stack_log_lvl+0xaa/0xe0
[<c010534e>] show_registers+0x18e/0x210
[<c0105549>] die+0xd9/0x180
[<c0105eb4>] do_general_protection+0xc4/0x100
[<c0104d5f>] error_code+0x2b/0x30
[<c0115b03>] __wake_up+0x33/0x60
[<c021c27f>] __up+0x1f/0x30
[<c044ca27>] __up_wakeup+0x7/0x10
[<f127f107>] drbd_endio_write_pri+0xa7/0x230 [drbd]
[<c015efaf>] bio_endio+0x4f/0x80
[<c03cf569>] dec_pending+0x49/0x80
[<c03cf64b>] clone_endio+0xab/0xd0
[<c015efaf>] bio_endio+0x4f/0x80
[<c020e260>] __end_that_request_first+0xc0/0x340
[<c020e52f>] end_that_request_chunk+0x1f/0x30
[<c0310b57>] scsi_end_request+0x37/0x110
[<c0310ea4>] scsi_io_completion+0x104/0x530
[<c037c4fa>] sd_rw_intr+0xca/0x290
[<c030bdd5>] scsi_finish_command+0x85/0xa0
[<c03118cb>] scsi_softirq_done+0xcb/0xf0
[<c020e5cb>] blk_done_softirq+0x8b/0xb0
[<c011e4c2>] __do_softirq+0x62/0xd0
[<c011e578>] do_softirq+0x48/0x60
[<c011e645>] irq_exit+0x35/0x40
[<c01067a2>] do_IRQ+0x22/0x30
[<c02dfb74>] evtchn_do_upcall+0x54/0xa0
[<c0104d90>] hypervisor_callback+0x2c/0x34
[<c0102b36>] cpu_idle+0x66/0x70
[<c010202b>] _stext+0x2b/0x40
[<c055886a>] start_kernel+0x1aa/0x1f0
[<c010006f>] 0xc010006f
Code: 00 a0 fa ea 6c bf fa ea 25 e4 44 c0 34 76 f9 ea 00 a0 fa ea 30 20
1d eb 00 00 00 00 01 00 00 00 30 20 1d eb 48 bf fa ea 48 bf fa <ea> 34
76 f9 ea a0 6f ad ea e8 75 f9 ea 24 76 f9 ea a0 6f ad ea
<0>Kernel panic - not syncing: Fatal exception in interrupt
At the moment I'm somewhat stumped by this and would welcome any
thoughts on where to look...
Thanks
Simon
PS: Should have explained what we did before hand - the sequence is:
1. Both nodes issue drbdmeta commands to initialize the disk area - this
is done independently - that is, the systems
are not connected at this point.
2. One node mounts the drbd disk and untars a blob of data, still not
connected to the other node
3, The systems are connected and rebooted and start resyncing
automatically, leading to the above problems.
Note that we are also writing the volume on the SyncSource during the
resync.
^ permalink raw reply [flat|nested] 3+ messages in thread
* RE: [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
@ 2006-08-11 16:01 Graham, Simon
2006-08-11 18:45 ` Lars Ellenberg
0 siblings, 1 reply; 3+ messages in thread
From: Graham, Simon @ 2006-08-11 16:01 UTC (permalink / raw)
To: Graham, Simon, drbd-dev
Quick update:
> -----Original Message-----
> 1. I get errors during initial synchronization of a volume like this
> that cause the resync to be aborted:
>
> drbd15: tl_verify: failed to find req e51a4da0, sector 0 in list
> drbd15: Got a corrupt block_id/sector pair(2).
> drbd15: short read expecting header on sock: r=-512
> drbd15: tl_clear()
>
FWIW, I have now confirmed that the WriteAck messages are being sent
with sector=0 (I used ethereal plus a drbd protocol dissector that I'm
working on -- more on that later when it's closer to being ready;
however, I checked the binary dump as well, and the node is definitely
sending lots of WriteAck's with sector==0 and block_id==SYNCER_ID) - for
example:
Frame 5706 (354 bytes on wire, 354 bytes captured)
Ethernet II, Src: Dell_4b:00:e2 (00:13:72:4b:00:e2), Dst: Dell_4a:ff:8e
(00:13:72:4a:ff:8e)
Internet Protocol, Src: 192.168.1.100 (192.168.1.100), Dst:
192.168.1.101 (192.168.1.101)
Transmission Control Protocol, Src Port: 7803 (7803), Dst Port: 42289
(42289), Seq: 129, Ack: 9, Len: 288
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
> 2. I get panics with the following signature:- these look like they
are
> happening when a local write
> on the primary (which this node is) completes.
The panic signature seems to change - for example, I just got one like
this in the receiver thread:
drbd15: ASSERT( drbd_req_get_sector(i) == sector ) in
/sandbox/sgraham/sn/trunk/platform/drbd/8.0/drbd/drbd_main.c:313
drbd15: tl_verify: found req e63d0240 but it has wrong sector (8 versus
0)
drbd15: tl_verify: failed to find req e63d02b0, sector 0 in list
drbd15: Got a corrupt block_id/sector pair(2).
drbd15: drbd_pp_alloc interrupted!
drbd15: error receiving RSDataRequest, l: 24!
drbd15: tl_clear()
drbd15: in tl_clear_barrier:374: ap_pending_cnt = -1 < 0 !
drbd15: ap_pending_cnt = -1
Unable to handle kernel paging request at virtual address 7f10debb
printing eip:
7f10debb
*pde = ma 00000000 pa fffff000
Oops: 0000 [#1]
Modules linked in: drbd ipmi_devintf ipmi_si ipmi_msghandler video
thermal processor fan button battery ac
CPU: 0
EIP: 0061:[<7f10debb>] Not tainted VLI
EFLAGS: 00010082 (2.6.16.13-xen0 #2)
EIP is at 0x7f10debb
eax: eae0833e ebx: 7720debb ecx: ffffffff edx: 00000003
esi: 7eccc0b3 edi: df4deadb ebp: f3c0c015 esp: eadb7ece
ds: 007b es: 007b ss: 0069
Process drbd15_receiver (pid: 4250, threadinfo=eadb6000 task=c17d4a70)
Stack: <0>b3b1eadb f3c0f128 f3c0debb 0200debb 00000000 00030000 00010000
00000000
00000000 00000000 7ef80000 7ef8eadb 7f10eadb 91c0eadb ff80ec82
91c0ffff
7f54ec82 b62eeadb 91c0f128 6620ec82 fff8c155 0001001f 02000000
00030000
Call Trace:
[<c010513a>] show_stack_log_lvl+0xaa/0xe0
[<c010534e>] show_registers+0x18e/0x210
[<c0105549>] die+0xd9/0x180
[<c0112b7c>] do_page_fault+0x3cc/0x640
[<c0104d5f>] error_code+0x2b/0x30
Code: Bad EIP value.
<0>Fatal exception: panic in 5 seconds
All suggestions gratefully accepted!
Simon
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
2006-08-11 16:01 [Drbd-dev] DRBD-8: recent regression causing corruption and crashes Graham, Simon
@ 2006-08-11 18:45 ` Lars Ellenberg
0 siblings, 0 replies; 3+ messages in thread
From: Lars Ellenberg @ 2006-08-11 18:45 UTC (permalink / raw)
To: drbd-dev
/ 2006-08-11 12:01:23 -0400
\ Graham, Simon:
> Quick update:
>
How exactly do you "test"?
Kernel and hardware?
(sorry, if you posted that earlier, just point me to it)
I triggered a full sync (drbdadm invalidate),
and while that was running, access the Primary(SyncSource)
(cp -av /somethinghuge/ /mnt/drbd-mount-point/)
> > 1. I get errors during initial synchronization of a volume like this
> > that cause the resync to be aborted:
> >
> > drbd15: tl_verify: failed to find req e51a4da0, sector 0 in list
I don't see those here.
> DRBD, Cmd: WriteAck, BlkId: SYNCER Sector: 0, AckLen: 8000
I don't see these either.
> > 2. I get panics with the following signature:- these look like they
> are
> > happening when a local write
> > on the primary (which this node is) completes.
>
> The panic signature seems to change - for example, I just got one like
> this in the receiver thread:
>
> drbd15: ASSERT( drbd_req_get_sector(i) == sector ) in
> /sandbox/sgraham/sn/trunk/platform/drbd/8.0/drbd/drbd_main.c:313
> drbd15: tl_verify: found req e63d0240 but it has wrong sector (8 versus
> 0)
nor these.
> drbd15: in tl_clear_barrier:374: ap_pending_cnt = -1 < 0 !
this is bad...
What I do see here is: "ap_pending > 0" still too often, when I
disconnect during resync + write activity, effectively blocking the
Primary's io subsystem. seemingly we still got bugs in tl_clear :(
need to look into that further.
> Code: Bad EIP value.
> <0>Fatal exception: panic in 5 seconds
outch.
--
: Lars Ellenberg Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com :
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2006-08-11 18:45 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-11 16:01 [Drbd-dev] DRBD-8: recent regression causing corruption and crashes Graham, Simon
2006-08-11 18:45 ` Lars Ellenberg
-- strict thread matches above, loose matches on Subject: below --
2006-08-11 2:31 Graham, Simon
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox