Distributed Replicated Block Device (DRBD) development
 help / color / mirror / Atom feed
* RE: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io errorduring resync.
@ 2007-02-16 14:55 Montrose, Ernest
  2007-02-16 17:31 ` Lars Ellenberg
  0 siblings, 1 reply; 5+ messages in thread
From: Montrose, Ernest @ 2007-02-16 14:55 UTC (permalink / raw)
  To: Philipp Reisner, drbd-dev

Phil,
Thanks!

I think all these panics on I/O errors are all related to the same bug.

Your comments make me look at a different angle... Looking at the logs
around the failure
Shows a problem on repeated I/O errors...the state machine is somewhat
confused..It essentially
Goes from Uptodate->Failed which is fine...then from
Failed->Diskless...fine...then we go and
Wait for mdev->local_cnt to be false like you explained...
Then we get more I/O errors...and our problem starts...
We go from Diskless->failed..again.(This does not seem correct since we
just went from this state)
Then faile->diskless again
We get more I/O errors...(not good)
Mdev->bc is set to null eventually
We went and wait again for mdev->local_cnt to be False..(not good)
Now we die an awful ungodly death..:)

Here is the full log around the failure:
Feb 15 16:01:57 captain kernel: end_request: I/O error, dev sda, sector
17554615
Feb 15 16:01:57 captain kernel: drbd0: disk( UpToDate -> Failed )
Feb 15 16:01:57 captain kernel: drbd0: Local IO failed. Detaching...
Feb 15 16:01:57 captain kernel: drbd_io_error: EM--****** Handling an IO
error***mdev->bc is valid***********************
Feb 15 16:01:57 captain kernel: drbd0: disk( Failed -> Diskless )
Feb 15 16:01:57 captain kernel: drbd0: Notified peer that my disk is
broken.
Feb 15 16:01:57 captain kernel: after_state_ch: EM-- *******Waiting for
mdev->local_cnt to be FALSE ******
Feb 15 16:01:57 captain kernel: end_request: I/O error, dev sda, sector
17554623
Feb 15 16:01:57 captain kernel: drbd0: disk( Diskless -> Failed )
Feb 15 16:01:57 captain kernel: drbd0: Local IO failed. Detaching...
Feb 15 16:01:57 captain kernel: drbd_io_error: EM--****** Handling an IO
error***mdev->bc is valid***********************
Feb 15 16:01:57 captain kernel: drbd0: disk( Failed -> Diskless )
Feb 15 16:01:57 captain kernel: drbd0: Notified peer that my disk is
broken.
Feb 15 16:01:57 captain kernel: end_request: I/O error, dev sda, sector
17554719
Feb 15 16:01:57 captain kernel: drbd0: disk( Diskless -> Failed )
Feb 15 16:01:57 captain kernel: drbd0: Local IO failed. Detaching...
Feb 15 16:01:57 captain kernel: after_state_ch: EM-- *******Setting
mdev->bc to NULL after freeing it ******
Feb 15 16:01:57 captain kernel: after_state_ch: EM-- *******Waiting for
mdev->local_cnt to be FALSE ******
Feb 15 16:01:57 captain kernel: after_state_ch: EM-- *******Setting
mdev->bc to NULL after freeing it ******
Feb 15 16:01:57 captain kernel: drbd_io_error: EM--****** Handling an IO
error***mdev->bc is NOT valid***********************
Feb 15 16:01:57 captain kernel: Unable to handle kernel NULL pointer
dereference at virtual address 000000ac
Feb 15 16:01:57 captain kernel:  printing eip:
Feb 15 16:01:57 captain kernel: ee3caaee
Feb 15 16:01:57 captain kernel: 0592d000 -> *pde = 00000001:9449d001
Feb 15 16:01:57 captain kernel: 2b89d000 -> *pme = 00000000:00000000
Feb 15 16:01:57 captain kernel: Oops: 0000 [#1]
Feb 15 16:01:57 captain kernel: SMP
Feb 15 16:01:57 captain kernel: Modules linked in: iptable_filter
ip_tables x_tables drbd cn bridge ipv6 ipmi_devintf ipmi_si
ipmi_msghandler i2c_dev i2c_core binfmt_misc dm_mirror video thermal
processor fan container button battery ac hw_random shpchp pci_hotplug
e1000 piix ide_cd cdrom sg raid1 dm_mod ide_disk mptscsih mptsas mptspi
mptfc mptscsi mptbase sd_mod scsi_mod
Feb 15 16:01:57 captain kernel: CPU:    0
Feb 15 16:01:57 captain kernel: EIP:    0061:[<ee3caaee>]    Tainted: GF
VLI
Feb 15 16:01:57 captain kernel: EFLAGS: 00010202  (2.6.16.29-xen #1)
Feb 15 16:01:57 captain kernel: EIP is at drbd_io_error+0x5e/0x1f0
[drbd]
Feb 15 16:01:57 captain kernel: eax: 00000000  ebx: ec877400  ecx:
f55f7000  edx: 00000000
Feb 15 16:01:57 captain kernel: esi: 00000000  edi: eafd7f58  ebp:
eafd7efc  esp: eafd7ed8
Feb 15 16:01:57 captain kernel: ds: 007b  es: 007b  ss: 0069
Feb 15 16:01:57 captain kernel: Process drbd0_asender (pid: 5743,
threadinfo=eafd6000 task=c597fab0)
Feb 15 16:01:57 captain kernel: Stack: <0>ee3dc660 00000000 00000019
ec877400 00000000 00000000 ec877400 eab90b2c
Feb 15 16:01:57 captain kernel:        eafd7f58 eafd7f30 ee3bcce4
00000000 00000000 00000000 c597fab0 eb43c980
Feb 15 16:01:57 captain kernel:        00000001 005ee800 00000000
eab90b2c ec877400 eafd7f58 eafd7f6c ee3bab09
Feb 15 16:01:57 captain kernel: Call Trace:
Feb 15 16:01:57 captain kernel:  [<c0105401>]
show_stack_log_lvl+0xa1/0xe0
Feb 15 16:01:57 captain kernel:  [<c01055f1>] show_registers+0x181/0x200
Feb 15 16:01:57 captain kernel:  [<c0105810>] die+0x100/0x1a0
Feb 15 16:01:57 captain kernel:  [<c01156f6>] do_page_fault+0x3c6/0x8b1
Feb 15 16:01:57 captain kernel:  [<c0105067>] error_code+0x2b/0x30
Feb 15 16:01:57 captain kernel:  [<ee3bcce4>] e_end_block+0x244/0x250
[drbd]
Feb 15 16:01:57 captain kernel:  [<ee3bab09>]
drbd_process_done_ee+0xc9/0x160 [drbd]
Feb 15 16:01:57 captain kernel:  [<ee3c27bf>] drbd_asender+0x16f/0x5ad
[drbd]
Feb 15 16:01:57 captain kernel:  [<ee3ccf7c>]
drbd_thread_setup+0x8c/0x100 [drbd]
Feb 15 16:01:57 captain kernel:  [<c0102e95>]
kernel_thread_helper+0x5/0x10
Feb 15 16:01:57 captain kernel: Code: 85 c0 74 6d c7 04 24 04 c6 3d ee
e8 ad 69 d5 d1 f0 ff 83 80 02 00 00 0f b6 83 3d 02 00 00 d0 e8 83 e0 0f
83 f8 01 76 0c 8b 43 14 <8b> 80 ac 00 00 00 89 45 f0 f0 ff 8b 80 02 00
00 0f 94 c0 84 c0
Feb 15 16:01:57 captain kernel:  <0>Fatal exception: panic in 5 seconds

-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner@linbit.com] 
Sent: Friday, February 16, 2007 6:44 AM
To: drbd-dev@linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io
errorduring resync.

Am Donnerstag, 15. Februar 2007 16:44 schrieb Montrose, Ernest:
> Phil,
> I will try all these but I think I have some clues for you that may 
> lead you to a fix.
> I instrumented the driver and caused the crash. Essentially what I 
> understand to be happening Is that after_state_ch() is setting 
> mdev->bc to NULL and then
> drbd_io_error() is using it
> after in: drbd_io_error(){.......
>   If(inc_local_if_state(mdev,Failed )){
> 	eh = mdev->bc->dc.on_io_error; <-----we die here I
think.mdev->bc is 
> NULL ...
> }
> Mdev->bc was set to Null earlier in after_state_ch(){.....
> If(os.disk >Diskless && ns.disk == Diskless){ ....mdev->bc = NULL; ..
> }
>
> This is some sort of a race condition as this does not happen all the 
> times.

Ernest,

At first look I would say this is not possible. Because in
after_state_ch() it only gets freed when we already reached the Diskless
state.

In drbd_io_error() the access to mdev->bc is protected by the
inc_local() - dec_local() clasp. 

In case we are Diskless that inc_local_if...() fails.
And on the other side, before freeing it in after_state_ch() we wait
until the local_count really reached 0. 

PS: Was my explanation how to create that .s files too brief ?

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

^ permalink raw reply	[flat|nested] 5+ messages in thread
* RE: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io errorduring resync.
@ 2007-02-15 15:44 Montrose, Ernest
  2007-02-16 11:43 ` Philipp Reisner
  0 siblings, 1 reply; 5+ messages in thread
From: Montrose, Ernest @ 2007-02-15 15:44 UTC (permalink / raw)
  To: Philipp Reisner, drbd-dev

Phil,
I will try all these but I think I have some clues for you that may lead
you to a fix.
I instrumented the driver and caused the crash. Essentially what I
understand to be happening 
Is that after_state_ch() is setting mdev->bc to NULL and then
drbd_io_error() is using it
after in: drbd_io_error(){.......
  If(inc_local_if_state(mdev,Failed )){
	eh = mdev->bc->dc.on_io_error; <-----we die here I
think.mdev->bc is NULL
...
}
Mdev->bc was set to Null earlier in after_state_ch(){.....
If(os.disk >Diskless && ns.disk == Diskless){
....mdev->bc = NULL;
..
}

This is some sort of a race condition as this does not happen all the
times.  Below
Is the result of my instrumentation.  You can see that we behaved nicely
at first after
An I/O error..but latter when the same I/O error occurs...we die.:
Some more very telling data:

=======Start debug messages=======
Feb 15 08:50:16 captain kernel: sd 1:0:28:0: SCSI error: return code =
0x8000002
Feb 15 08:50:16 captain kernel: sda: Current: sense key: Medium Error
Feb 15 08:50:16 captain kernel:    Additional sense: Recovered data with
retries and/or circ applied
Feb 15 08:50:16 captain kernel: end_request: I/O error, dev sda, sector
19071159
Feb 15 08:50:16 captain kernel: drbd0: disk( Diskless -> Failed )
Feb 15 08:50:16 captain kernel: drbd0: Local IO failed. Detaching...
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO
error****************************************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO
error***mdev is valid***********************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO
error***mdev->bc is valid***********************
Feb 15 08:50:16 captain kernel: drbd0: disk( Failed -> Diskless )
Feb 15 08:50:16 captain kernel: drbd0: Notified peer that my disk is
broken.
Feb 15 08:50:16 captain kernel: sd 1:0:28:0: SCSI error: return code =
0x8000002
Feb 15 08:50:16 captain kernel: sda: Current: sense key: Medium Error
Feb 15 08:50:16 captain kernel:    Additional sense: Recovered data with
retries and/or circ applied
Feb 15 08:50:16 captain kernel: end_request: I/O error, dev sda, sector
19071167
Feb 15 08:50:16 captain kernel: drbd0: disk( Diskless -> Failed )
Feb 15 08:50:16 captain kernel: drbd0: Local IO failed. Detaching...
Feb 15 08:50:16 captain kernel: after_state_ch: EM-- *******Setting
mdev->bc to NULL after freeing it ******
Feb 15 08:50:16 captain last message repeated 2 times
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO
error****************************************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO
error***mdev is valid***********************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO
error***mdev->bc is NOT valid***********************
Feb 15 08:50:16 captain kernel: Unable to handle kernel NULL pointer
dereference at virtual address 000000ac

======End debug messages=======


I will prepare the other stuff and send them to later....Thanks!!!
EM--




-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
On Behalf Of Philipp Reisner
Sent: Thursday, February 15, 2007 10:28 AM
To: drbd-dev@linbit.com
Subject: Re: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io
errorduring resync.

Am Mittwoch, 14. Februar 2007 19:03 schrieb Montrose, Ernest:
> Hi all,
> We are overwelmed with panic's after io errors. Seem mdev->bc is null 
> due to some race condition.  Here is one instance:
>
> Two node cluster, node A and Node B. Syncsource is node A. While 
> syncing Reads are issued on Node B.  I/O errosrs start to occur on 
> node A,  Node A panics :
>
[...OOPS... ]

Hi Ernest,

I was not able to understand the cause of the oops on the first glance.

Could you provide the output of ksymoops when you feed this OOPS to it ?
( I am interested in the disassebled code)

AND 

I do this debugging by comparing it to the assembler output of the
compiler.
Please provide the .s files from the machine where you build your drbd
(with your compiler, kernel config and kernel source).

Remke DRBD with "make V=1"

The create the .s file:
Replaceing the "-c" option with "-gstabs+ -S" and the -o "foo.o" to -o
"foo.s" in the call of the compiler

Something like this:
(cd $KDIR ; gcc ... /some/path/foo.c )

Thanks,
Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
_______________________________________________
drbd-dev mailing list
drbd-dev@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-dev

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-02-19 14:13 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-02-16 14:55 [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io errorduring resync Montrose, Ernest
2007-02-16 17:31 ` Lars Ellenberg
2007-02-19 14:13   ` Philipp Reisner
  -- strict thread matches above, loose matches on Subject: below --
2007-02-15 15:44 Montrose, Ernest
2007-02-16 11:43 ` Philipp Reisner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox