Distributed Replicated Block Device (DRBD) development
* [Drbd-dev] DRBD-8: crash due to incorrect thread termination
From: Graham, Simon @ 2006-09-14 22:49 UTC (permalink / raw)
  To: drbd-dev

I reported the issue with threads being stopped synchronously from the
wrong context a while ago, but now I've found an actual case that causes
a panic(): if BOTH sides are detached and you attach one side, the
other will crash, apparently because the code attempts to synchronously
stop the sender thread, fails, and then discards all the network connection
data while the sender thread is still attempting to use it...
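
To illustrate the ordering problem described above, here is a minimal
user-space sketch using POSIX threads. It is not DRBD code and not the
fix that was later committed; struct conn, sender_main() and
stop_sender_then_free() are hypothetical names chosen for this example.
It only shows the safe order: request the stop, wait until the thread
has really exited, and only then discard the data it was using.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the connection data the sender thread uses. */
struct conn {
    atomic_int stop;    /* set by the stopper to request termination */
    atomic_long ticks;  /* counter the sender thread keeps touching */
};

/* Sender loop: keeps using the connection data until asked to stop. */
static void *sender_main(void *arg)
{
    struct conn *c = arg;

    while (!atomic_load(&c->stop))
        atomic_fetch_add(&c->ticks, 1);
    return NULL;
}

/*
 * Start a sender, stop it, and free its data in the safe order.
 * Returns 0 on success, -1 on failure.
 */
int stop_sender_then_free(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    pthread_t tid;

    if (c == NULL)
        return -1;
    atomic_init(&c->stop, 0);
    atomic_init(&c->ticks, 0);

    if (pthread_create(&tid, NULL, sender_main, c) != 0) {
        free(c);
        return -1;
    }

    atomic_store(&c->stop, 1);          /* 1. ask the thread to stop */

    if (pthread_join(tid, NULL) != 0)   /* 2. wait until it has exited */
        return -1;  /* stop failed: leak c rather than free it under a live thread */

    free(c);                            /* 3. only now discard the data */
    return 0;
}
```

Calling stop_sender_then_free() from a test program should return 0. The
point of the sketch is the failure branch after pthread_join(): if the
stop cannot be confirmed, the data is deliberately left alone instead of
being freed underneath a thread that may still be running.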

Just thought I'd pass on an easy test case (since I don't know how to
fix this one ;-).

Simon

console log from side doing attach:

Sep 14 18:44:01 penn kernel: drbd0: Found 6 transactions (324 active extents) in activity log.
Sep 14 18:44:01 penn kernel: drbd0: max_segment_size ( = BIO size ) = 32768
Sep 14 18:44:01 penn kernel: drbd0: reading of bitmap took 1 jiffies
Sep 14 18:44:01 penn kernel: drbd0: recounting of set bits took additional 0 jiffies
Sep 14 18:44:01 penn kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map.
Sep 14 18:44:01 penn kernel: drbd0: data >>> ReportSizes (d 0MiB, u 0MiB, c 15007MiB, max bio 8000, q order 0)
Sep 14 18:44:01 penn kernel: drbd0: data >>> ReportState (s 28a { role( Secondary ) peer( Secondary ) conn( Connected ) disk( Attaching ) pdsk( Diskless )})
Sep 14 18:44:01 penn kernel: drbd0: sock was reset by peer

from side that crashes:

Sep 14 18:44:01 teller kernel: drbd0: some backing storage is needed
Sep 14 18:44:01 teller kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> StandAlone ) pdsk( Diskless -> DUnknown )
Sep 14 18:44:01 teller kernel: drbd0: drbd_thread_stop: drbd0_receiver [4885]: drbd0_receiver 1 -> 2; 1
Sep 14 18:44:01 teller kernel: drbd0: ASSERT( !wait ) in /sandbox/sgraham/sn/drbd-panic/platform/drbd/8.0/drbd/drbd_main.c:1100
Sep 14 18:44:01 teller kernel:  [<c01050a1>] show_trace+0x21/0x30
Sep 14 18:44:01 teller kernel:  [<c01051de>] dump_stack+0x1e/0x20
Sep 14 18:44:01 teller kernel:  [<f12aa6af>] _drbd_thread_stop+0x16f/0x1b0 [drbd]
Sep 14 18:44:01 teller kernel:  [<f12aa163>] after_state_ch+0x773/0x920 [drbd]
Sep 14 18:44:01 teller kernel:  [<f12a8753>] drbd_change_state+0xa3/0xc0 [drbd]
Sep 14 18:44:01 teller kernel:  [<f12a8798>] drbd_force_state+0x28/0x30 [drbd]
Sep 14 18:44:01 teller kernel:  [<f12a0265>] receive_sizes+0x4e5/0x7a0 [drbd]
Sep 14 18:44:01 teller kernel:  [<f12a1419>] drbdd+0x49/0x140 [drbd]
Sep 14 18:44:01 teller kernel:  [<f12a2135>] drbdd_init+0xe5/0x120 [drbd]
Sep 14 18:44:01 teller kernel:  [<f12aa37b>] drbd_thread_setup+0x6b/0xc0 [drbd]
Sep 14 18:44:01 teller kernel:  [<c0102d9d>] kernel_thread_helper+0x5/0x18
Sep 14 18:44:01 teller kernel: drbd0: drbd_thread_stop: drbd0_receiver [4885]: drbd0_worker 1 -> 2; 1
Sep 14 18:44:01 teller kernel: drbd0: worker terminated
Sep 14 18:44:01 teller kernel: drbd0: drbd_thread_stop: drbd0_receiver [4885]: drbd0_receiver 2 -> 2; 1
Sep 14 18:44:01 teller kernel: drbd0: drbd_thread_stop: drbd0_receiver [4885]: NULL 0 -> 2; 1
Sep 14 18:44:01 teller kernel: drbd0: drbd_bm_resize called with capacity == 0
Sep 14 18:44:01 teller kernel: drbd0: drbd_thread_stop: drbd0_receiver [4885]: drbd0_receiver 2 -> 2; 0
Sep 14 18:44:01 teller kernel: drbd0: error receiving ReportSizes, l: 32!
Sep 14 18:44:01 teller kernel: drbd0: drbd_thread_stop: drbd0_receiver [4885]: drbd0_asender 1 -> 2; 1
Sep 14 18:44:01 teller kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Sep 14 18:44:01 teller kernel:  printing eip:
Sep 14 18:44:01 teller kernel: c03dd67d
Sep 14 18:44:01 teller kernel: *pde = ma 00000000 pa fffff000
Sep 14 18:44:01 teller kernel: Oops: 0002 [#1]
Sep 14 18:44:01 teller kernel: Modules linked in: drbd ipmi_devintf ipmi_si ipmi_msghandler video thermal processor fan button battery ac e1000
Sep 14 18:44:01 teller kernel: CPU:    0
Sep 14 18:44:01 teller kernel: EIP:    0061:[<c03dd67d>]    Not tainted VLI
Sep 14 18:44:01 teller kernel: EFLAGS: 00010246   (2.6.16.13-xen0 #1)
Sep 14 18:44:01 teller kernel: EIP is at sk_wait_data+0x8d/0xc0
Sep 14 18:44:01 teller kernel: eax: 00000000   ebx: 00000000   ecx: 00000000   edx: eb2c6054
Sep 14 18:44:01 teller kernel: esi: eb2c6000   edi: eb131d88   ebp: eb131db4   esp: eb131d68
Sep 14 18:44:01 teller kernel: ds: 007b   es: 007b   ss: 0069
Sep 14 18:44:02 teller kernel: Process drbd0_asender (pid: 4887, threadinfo=eb130000 task=eb54f570)
Sep 14 18:44:02 teller kernel: Stack: <0>00000000 eb54f570 c012da30 eb131d94 eb131d94 eb131d9c c04161b0 eb2c6000
Sep 14 18:44:02 teller kernel:        00000000 eb54f570 c012da30 eb967518 eb967518 eb131db4 c04069fd eb2c6000
Sep 14 18:44:02 teller kernel:        00000000 00000000 eb2c6000 eb131e08 c0406f23 eb2c6000 eb131df4 eb967500
Sep 14 18:44:02 teller kernel: Call Trace:
Sep 14 18:44:02 teller kernel:  [<c010515a>] show_stack_log_lvl+0xaa/0xe0
Sep 14 18:44:02 teller kernel:  [<c010536e>] show_registers+0x18e/0x210
Sep 14 18:44:02 teller kernel:  [<c0105569>] die+0xd9/0x180
Sep 14 18:44:02 teller kernel:  [<c0112ccc>] do_page_fault+0x3cc/0x68e
Sep 14 18:44:02 teller kernel:  [<c0104d7f>] error_code+0x2b/0x30
Sep 14 18:44:02 teller kernel:  [<c0406f23>] tcp_recvmsg+0x393/0x720
Sep 14 18:44:02 teller kernel:  [<c03dddfd>] sock_common_recvmsg+0x4d/0x70
Sep 14 18:44:02 teller kernel:  [<c03da1db>] sock_recvmsg+0xcb/0x100
Sep 14 18:44:02 teller kernel:  [<f129cb05>] drbd_recv_short+0x85/0xc0 [drbd]
Sep 14 18:44:02 teller kernel:  [<f12a2abc>] drbd_asender+0x10c/0x496 [drbd]
Sep 14 18:44:02 teller kernel:  [<f12aa37b>] drbd_thread_setup+0x6b/0xc0 [drbd]
Sep 14 18:44:02 teller kernel:  [<c0102d9d>] kernel_thread_helper+0x5/0x18
Sep 14 18:44:02 teller kernel: Code: 00 0f ba 68 04 01 89 f0 8d 5e 54 e8 de 05 00 00 39 5e 54 74 2f 89 f0 e8 82 05 00 00 39 5e 54 0f 95 c0 0f b6 d8 8b 86 f0 00 00 00 <0f> ba 70 04 01 8b 46 38 89 fa e8 34 03 d5 ff 83 c4 40 89 d8 5b
Sep 14 18:44:02 teller kernel:  <0>Fatal exception: panic in 5 seconds


* Re: [Drbd-dev] DRBD-8: crash due to incorrect thread termination
From: Philipp Reisner @ 2006-09-15 13:19 UTC (permalink / raw)
  To: drbd-dev

On Friday, 15 September 2006 00:49, Graham, Simon wrote:
> I reported the issue with threads being stopped synchronously from the
> wrong context a while ago, but now I've found an actual case that causes
> a panic(): if BOTH sides are detached and you attach one side, the
> other will crash, apparently because the code attempts to synchronously
> stop the sender thread, fails, and then discards all the network connection
> data while the sender thread is still attempting to use it...
>
> Just thought I'd pass on an easy test case (since I don't know how to
> fix this one ;-).
>

Great, a test case is nice motivation to fix the issue! But I am
running out of time today, so probably on Monday...

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :


* Re: [Drbd-dev] DRBD-8: crash due to incorrect thread termination
From: Philipp Reisner @ 2006-09-18 12:01 UTC (permalink / raw)
  To: drbd-dev

On Friday, 15 September 2006 00:49, Graham, Simon wrote:
> I reported the issue with threads being stopped synchronously from the
> wrong context a while ago, but now I've found an actual case that causes
> a panic(): if BOTH sides are detached and you attach one side, the
> other will crash, apparently because the code attempts to synchronously
> stop the sender thread, fails, and then discards all the network connection
> data while the sender thread is still attempting to use it...
>
> Just thought I'd pass on an easy test case (since I don't know how to
> fix this one ;-).
>

Hi Simon,

I think I could reproduce a similar case: the nodes crashed while
connecting two diskless nodes.
I fixed the issue with this commit:
http://lists.linbit.com/pipermail/drbd-cvs/2006-September/001241.html

Could you please verify that the crash you experienced is also gone?

Thanks,
 Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :


* RE: [Drbd-dev] DRBD-8: crash due to incorrect thread termination
From: Graham, Simon @ 2006-09-18 13:51 UTC (permalink / raw)
  To: Philipp Reisner, drbd-dev

Just ran a quick test and yes, the problem is fixed - thanks!
Simon

> -----Original Message-----
> From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
> On Behalf Of Philipp Reisner
> Sent: Monday, September 18, 2006 8:01 AM
> To: drbd-dev@linbit.com
> Subject: Re: [Drbd-dev] DRBD-8: crash due to incorrect thread termination
>
> On Friday, 15 September 2006 00:49, Graham, Simon wrote:
> > I reported the issue with threads being stopped synchronously from the
> > wrong context a while ago, but now I've found an actual case that causes
> > a panic(): if BOTH sides are detached and you attach one side, the
> > other will crash, apparently because the code attempts to synchronously
> > stop the sender thread, fails, and then discards all the network connection
> > data while the sender thread is still attempting to use it...
> >
> > Just thought I'd pass on an easy test case (since I don't know how to
> > fix this one ;-).
> >
> 
> Hi Simon,
> 
> I think I could reproduce a similar case: the nodes crashed while
> connecting two diskless nodes.
> I fixed the issue with this commit:
> http://lists.linbit.com/pipermail/drbd-cvs/2006-September/001241.html
> 
> Could you please verify that the crash you experienced is also gone?
> 
> Thanks,
>  Philipp
> --
> : Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
> : LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
> : Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :
