Distributed Replicated Block Device (DRBD) development
* RE: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io error during resync.
@ 2007-02-15 16:07 Montrose, Ernest
  2007-02-16 11:44 ` Philipp Reisner
  0 siblings, 1 reply; 5+ messages in thread
From: Montrose, Ernest @ 2007-02-15 16:07 UTC (permalink / raw)
  To: Montrose, Ernest, Philipp Reisner, drbd-dev

Phil,
Oops!
OK... Actually, that one was a different panic, but there are some common
threads here.
mdev->bc is NULL, and it happens after an I/O error. I do not want to
confuse the issue, though.
I will try to collect the data you asked for, though I am not sure that
this case is as reproducible as the one I explained below. We have lots
of these panics on I/O errors. Sorry...

Thanks.

EM--
-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
On Behalf Of Montrose, Ernest
Sent: Thursday, February 15, 2007 10:45 AM
To: Philipp Reisner; drbd-dev@linbit.com
Subject: RE: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an
io error during resync.

Phil,
I will try all of these, but I think I have some clues for you that may
lead you to a fix.
I instrumented the driver and caused the crash. As I understand it,
after_state_ch() sets mdev->bc to NULL and drbd_io_error() then uses it
afterwards:

drbd_io_error(){.......
  if (inc_local_if_state(mdev, Failed)) {
	eh = mdev->bc->dc.on_io_error; <----- we die here, I think; mdev->bc is NULL ...
}

mdev->bc was set to NULL earlier, in after_state_ch():

after_state_ch(){.....
  if (os.disk > Diskless && ns.disk == Diskless) { .... mdev->bc = NULL; ..
}

This is some sort of race condition, as it does not happen every time.
Below is the result of my instrumentation. You can see that we behaved
nicely at first after an I/O error, but later, when the same I/O error
occurs, we die.
Some more very telling data:

=======Start debug messages=======
Feb 15 08:50:16 captain kernel: sd 1:0:28:0: SCSI error: return code = 0x8000002
Feb 15 08:50:16 captain kernel: sda: Current: sense key: Medium Error
Feb 15 08:50:16 captain kernel:    Additional sense: Recovered data with retries and/or circ applied
Feb 15 08:50:16 captain kernel: end_request: I/O error, dev sda, sector 19071159
Feb 15 08:50:16 captain kernel: drbd0: disk( Diskless -> Failed )
Feb 15 08:50:16 captain kernel: drbd0: Local IO failed. Detaching...
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO error****************************************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO error***mdev is valid***********************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO error***mdev->bc is valid***********************
Feb 15 08:50:16 captain kernel: drbd0: disk( Failed -> Diskless )
Feb 15 08:50:16 captain kernel: drbd0: Notified peer that my disk is broken.
Feb 15 08:50:16 captain kernel: sd 1:0:28:0: SCSI error: return code = 0x8000002
Feb 15 08:50:16 captain kernel: sda: Current: sense key: Medium Error
Feb 15 08:50:16 captain kernel:    Additional sense: Recovered data with retries and/or circ applied
Feb 15 08:50:16 captain kernel: end_request: I/O error, dev sda, sector 19071167
Feb 15 08:50:16 captain kernel: drbd0: disk( Diskless -> Failed )
Feb 15 08:50:16 captain kernel: drbd0: Local IO failed. Detaching...
Feb 15 08:50:16 captain kernel: after_state_ch: EM-- *******Setting mdev->bc to NULL after freeing it ******
Feb 15 08:50:16 captain last message repeated 2 times
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO error****************************************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO error***mdev is valid***********************
Feb 15 08:50:16 captain kernel: drbd_io_error: EM--****** Handling an IO error***mdev->bc is NOT valid***********************
Feb 15 08:50:16 captain kernel: Unable to handle kernel NULL pointer dereference at virtual address 000000ac

======End debug messages=======


I will prepare the other material and send it to you later. Thanks!
EM--




-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
On Behalf Of Philipp Reisner
Sent: Thursday, February 15, 2007 10:28 AM
To: drbd-dev@linbit.com
Subject: Re: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io
error during resync.

On Wednesday, 14 February 2007 at 19:03, Montrose, Ernest wrote:
> Hi all,
> We are overwhelmed with panics after I/O errors. It seems mdev->bc is
> NULL due to some race condition.  Here is one instance:
>
> Two-node cluster, node A and node B. The SyncSource is node A. While
> syncing, reads are issued on node B.  I/O errors start to occur on
> node A, and node A panics:
>
[...OOPS... ]

Hi Ernest,

I was not able to understand the cause of the oops at first glance.

Could you provide the output of ksymoops when you feed this oops to it?
(I am interested in the disassembled code.)

AND

I do this debugging by comparing it to the assembler output of the
compiler.
Please provide the .s files from the machine where you build your drbd
(with your compiler, kernel config and kernel source).

Remake DRBD with "make V=1".

Then create the .s file by replacing the "-c" option with "-gstabs+ -S"
and the -o "foo.o" with -o "foo.s" in the call of the compiler.

Something like this:
(cd $KDIR ; gcc ... /some/path/foo.c )

Thanks,
Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
_______________________________________________
drbd-dev mailing list
drbd-dev@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-dev

* RE: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io error during resync.
@ 2007-02-16 17:42 Montrose, Ernest
  0 siblings, 0 replies; 5+ messages in thread
From: Montrose, Ernest @ 2007-02-16 17:42 UTC (permalink / raw)
  To: Lars Ellenberg, drbd-dev

Lars,
I will apply the patch and see what happens.
As for your last comment: the I/O handler definitely keeps ticking, it
seems, regardless of the reference count.

Thanks!
EM-- 

-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
On Behalf Of Lars Ellenberg
Sent: Friday, February 16, 2007 12:32 PM
To: drbd-dev@linbit.com
Subject: Re: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an
io error during resync.

/ 2007-02-16 09:55:12 -0500
\ Montrose, Ernest:
> Phil,
> Thanks!
>
> I think all these panics on I/O errors are related to the same bug.
>
> Your comments made me look at this from a different angle. Looking at
> the logs around the failure shows a problem on repeated I/O errors:
> the state machine is somewhat confused. It essentially goes from
> UpToDate->Failed, which is fine, then from Failed->Diskless, also fine,
> then we go and wait for mdev->local_cnt to be false, like you explained.
> Then we get more I/O errors, and our problem starts.
> We go from Diskless->Failed again. (This does not seem correct, since
> we just came from this state.)

even though I dislike our overall state engine design, it may be enough
to do

--- drbd/drbd_main.c    (revision 2754)
+++ drbd/drbd_main.c    (working copy)
@@ -604,6 +604,11 @@
                dec_local(mdev);
        }

+       /* If we are Diskless, we can only go to Attaching. */
+       if ( (os.disk == Diskless) && (ns.disk != Attaching) ) {
+               ns.disk = Diskless;
+       }
+
        /* Early state sanitising. Dissalow the invalidate ioctl to connect  */
        if( (ns.conn == StartingSyncS || ns.conn == StartingSyncT) &&
                os.conn < Connected ) {


> Then Failed->Diskless again.
> We get more I/O errors (not good).
> mdev->bc is set to NULL eventually.
> We go and wait again for mdev->local_cnt to be false (not good). Now
> we die an awful ungodly death. :)
>
> Here is the full log around the failure:
> Feb 15 16:01:57 captain kernel: end_request: I/O error, dev sda, sector 17554615
> Feb 15 16:01:57 captain kernel: drbd0: disk( UpToDate -> Failed )
> Feb 15 16:01:57 captain kernel: drbd0: Local IO failed. Detaching...
> Feb 15 16:01:57 captain kernel: drbd_io_error: EM--****** Handling an IO error***mdev->bc is valid***********************
> Feb 15 16:01:57 captain kernel: drbd0: disk( Failed -> Diskless )
> Feb 15 16:01:57 captain kernel: drbd0: Notified peer that my disk is broken.
> Feb 15 16:01:57 captain kernel: after_state_ch: EM-- *******Waiting for mdev->local_cnt to be FALSE ******
> Feb 15 16:01:57 captain kernel: end_request: I/O error, dev sda, sector 17554623
> Feb 15 16:01:57 captain kernel: drbd0: disk( Diskless -> Failed )

right. this is not allowed.

but this also means that our reference counting of in-flight local
requests is not ok, since once local_cnt is zero, there should be no
more in-flight requests to the local disk that might trigger the end_io
handler.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :

* RE: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io error during resync.
@ 2007-02-16 20:52 Montrose, Ernest
  0 siblings, 0 replies; 5+ messages in thread
From: Montrose, Ernest @ 2007-02-16 20:52 UTC (permalink / raw)
  To: Lars Ellenberg, drbd-dev

Lars,
It appears that this patch will work around the panic.

Thanks!

EM-- 


* RE: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an io error during resync.
@ 2007-02-20 12:49 Montrose, Ernest
  0 siblings, 0 replies; 5+ messages in thread
From: Montrose, Ernest @ 2007-02-20 12:49 UTC (permalink / raw)
  To: Philipp Reisner, drbd-dev

Phil,
Thanks. I will test this and see whether we still get these nasty
oopses this time. :)

EM-- 

-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
On Behalf Of Philipp Reisner
Sent: Monday, February 19, 2007 9:13 AM
To: drbd-dev@linbit.com
Subject: Re: [Drbd-dev] DRBD8: Panic in drbd_bm_write_sect() after an
io error during resync.


Hi,

Just to document the end of this thread here.
This
http://lists.linbit.com/pipermail/drbd-cvs/2007-February/001468.html
is the link to the patch that made it into SVN.

-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
