* [Drbd-dev] Re: [DRBD-user] drbd_panic() in drbd_receiver.c
       [not found] <342BAC0A5467384983B586A6B0B37671031FB31E@EXNA.corp.stratus.com>
@ 2006-07-04 10:07 ` Philipp Reisner
  0 siblings, 0 replies; 3+ messages in thread
From: Philipp Reisner @ 2006-07-04 10:07 UTC (permalink / raw)
  To: drbd-dev; +Cc: drbd-user

On Monday, 3 July 2006 19:03, Graham, Simon wrote:
> I too have been looking into this -- I agree with Damian and think it's
> very important that DRBD never panic in cases like this if it is to be
> used in an HA system -- I think the final approach has to be one of
> fixing up underlying disk errors where possible and returning an error
> to the caller where it is not possible to fix up.
>
> In this specific case (NegDReply), it seems that it would be OK to
> simply remove the panic() and complete the original request with an EIO
> error or somesuch - this does mean adding a call to
> drbd_bio_endio(bio,0) in addition to removing the panic() though.
>
> Even if this is acceptable, there are a bunch of other places where
> panic is currently done that, I think, also need to be changed,
> including:
>
> 1. In drbd_set_state, if the node is now Primary and does not have
>    access to good data; I think this can simply be removed, since
>    drbd_fail_request_early already returns a failure to the caller in
>    this case.
>
> 2. Failure to write the bitmap to disk; not sure what the right answer
>    is here - any suggestions? (Perhaps force the disk to be
>    inconsistent in some manner that will require a complete resync?)
>
> 3. Failure to write meta data to disk; ditto above, only harder -- if
>    you can't write to the meta-data area, you can't store data that
>    indicates the contents are bad...
>
> 4. Received NegRSDReply -- during resync, the SyncTarget gets an error
>    from the SyncSource. In this specific case, it seems to me that a
>    possible solution is to leave the block in question set in the
>    bitmap, ensure that the state is never set consistent on the
>    current SyncTarget, and ensure that, no matter what happens, the
>    current SyncSource remains the best source of data. A potential
>    issue with this is that the SyncTarget will continue to attempt to
>    synchronize the block in question - since it's still set in the
>    bitmap, it will eventually be found again when the syncer wraps
>    round - maybe that's OK though (so long as there is some sort of
>    delay between attempts)?
>
> I am planning on implementing these, assuming there isn't any huge
> disagreement on the approach and assuming it isn't already in
> progress...
>
> Perhaps we should take this discussion to drbd-dev?
> Simon
>
> PS: Once the panics are gone, there is a second phase required which is
> to fix up underlying errors where possible -- for example, if the volume
> is consistent on both sides and a read on the primary fails, not only
> should the read be retried to the secondary but also the returned data
> should be rewritten on the primary -- for a class of errors, this will
> actually fix the problem as the disk will remap a bad block when the
> write is done; is anyone working on this?
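
The read-repair idea in the PS above can be sketched roughly as
follows. This is a user-space illustration only, not DRBD code: the
sketch_* names, the in-memory "disks" and the tiny block size are all
assumptions made for the example, and it assumes both replicas are
consistent, as the PS does. The point is the control flow: try the
primary, fall back to the secondary, rewrite the recovered data on the
primary, and hand -EIO back to the caller instead of panicking when
neither side can supply the block.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 8   /* tiny block size, for illustration only */
#define SKETCH_NUM_BLOCKS 4

/* Hypothetical in-memory stand-in for one replica's backing disk. */
struct sketch_disk {
    unsigned char data[SKETCH_NUM_BLOCKS][SKETCH_BLOCK_SIZE];
    int bad[SKETCH_NUM_BLOCKS];   /* 1 = reads of this block fail */
};

/* Read one block; fails with -EIO if the block is marked bad. */
int sketch_read(const struct sketch_disk *d, size_t blk, unsigned char *out)
{
    if (blk >= SKETCH_NUM_BLOCKS || d->bad[blk])
        return -EIO;
    memcpy(out, d->data[blk], SKETCH_BLOCK_SIZE);
    return 0;
}

/* Write one block; models a disk that remaps a bad sector on write. */
int sketch_write(struct sketch_disk *d, size_t blk, const unsigned char *in)
{
    if (blk >= SKETCH_NUM_BLOCKS)
        return -EIO;
    memcpy(d->data[blk], in, SKETCH_BLOCK_SIZE);
    d->bad[blk] = 0;
    return 0;
}

/*
 * Read with repair: try the primary, fall back to the secondary, and
 * write the recovered data back to the primary. Returns 0 on success
 * or -EIO if neither replica can supply the block; it never aborts.
 */
int sketch_read_repair(struct sketch_disk *primary,
                       const struct sketch_disk *secondary,
                       size_t blk, unsigned char *out)
{
    assert(primary != NULL && secondary != NULL && out != NULL);

    if (sketch_read(primary, blk, out) == 0)
        return 0;
    if (sketch_read(secondary, blk, out) != 0)
        return -EIO;                        /* report, do not panic */
    (void)sketch_write(primary, blk, out);  /* best-effort repair */
    return 0;
}
```

A caller would treat a 0 return as a normal successful read and
propagate -EIO upward otherwise. Whether a failed rewrite should also
be reported is a policy choice that this sketch deliberately ignores.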

Excellent ideas. In case you really start to work on this, please
base your work on the drbd-8.0 code, preferably the trunk.

PS: Moving this thread over to drbd-dev is a good idea.

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :


* RE: [Drbd-dev] Re: [DRBD-user] drbd_panic() in drbd_receiver.c
@ 2006-07-04 15:01 Graham, Simon
  2006-07-04 15:23 ` Lars Ellenberg
  0 siblings, 1 reply; 3+ messages in thread
From: Graham, Simon @ 2006-07-04 15:01 UTC (permalink / raw)
  To: drbd-dev

Thanks -- I'm actually starting with DRBD 0.7 just because it's _much_
easier for me to test (we have a complete build/install/test
infrastructure currently based on 0.7); however, I will push the
changes into the head of the trunk as soon as I can.

FWIW, I have the set-state, NegDReply and NegRSDReply stuff coded and
running; I'm using a known bad disk and no panics so far! The only
issue I have now is that I think I need to kick the resync processing
when a NegRSDReply is received -- /proc/drbd always shows the resync as
100% and stalled.

BTW: do you have any suggestions for handling the bitmap and meta-data
write failures?

Also - let me know if you think you would incorporate these changes into
the 0.7 branch - if so, I'll send patches (oh and let me know what the
'approved' mechanism for sending patches is please).

Simon


* Re: [Drbd-dev] Re: [DRBD-user] drbd_panic() in drbd_receiver.c
  2006-07-04 15:01 Graham, Simon
@ 2006-07-04 15:23 ` Lars Ellenberg
  0 siblings, 0 replies; 3+ messages in thread
From: Lars Ellenberg @ 2006-07-04 15:23 UTC (permalink / raw)
  To: drbd-dev, drbd-dev

/ 2006-07-04 11:01:32 -0400
\ Graham, Simon:
> Thanks -- I'm actually starting with DRBD 0.7 just because it's _much_
> easier for me to test (we have a complete build/install/test
> infrastructure currently based on using 0.7) however I will push the
> changes into the head of the trunk as soon as I can.
> 
> FWIW, I have the set-state, NegDReply and NegRSDReply stuff coded and
> running; I'm using a known bad disk and no panics so far!

great, send over a svn diff...

> -- the only
> issue I have now is that I think I need to kick the resync processing
> when a NegRSDReply is received -- /proc/drbd always shows the resync as
> 100% and stalled;

there are several internal dependencies and state changes that need to
be adjusted...
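
As a rough illustration of the retry-with-delay idea raised earlier in
the thread (leave the block set in the bitmap after a negative resync
reply, and let the syncer find it again when it wraps round), here is a
small user-space sketch. It does not model DRBD's actual state
machine; the sketch_* names, the tick-based clock and the delay value
are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BLOCKS      8
#define SKETCH_RETRY_DELAY 5   /* hypothetical ticks between attempts */

/* Hypothetical resync bookkeeping on the sync target. */
struct sketch_resync {
    int dirty[SKETCH_BLOCKS];               /* 1 = still out of sync */
    unsigned long next_try[SKETCH_BLOCKS];  /* earliest tick to retry */
    size_t cursor;                          /* where the next scan starts */
};

/*
 * Return the next dirty block that may be requested at tick `now`,
 * scanning from the cursor and wrapping round; -1 if none is eligible.
 */
long sketch_next_block(struct sketch_resync *rs, unsigned long now)
{
    for (size_t i = 0; i < SKETCH_BLOCKS; i++) {
        size_t blk = (rs->cursor + i) % SKETCH_BLOCKS;
        if (rs->dirty[blk] && rs->next_try[blk] <= now) {
            rs->cursor = (blk + 1) % SKETCH_BLOCKS;
            return (long)blk;
        }
    }
    return -1;
}

/*
 * Handle the reply for one block. On success the bit is cleared; on a
 * negative reply the bit stays set and the block is held back for
 * SKETCH_RETRY_DELAY ticks before it becomes eligible again.
 */
void sketch_on_reply(struct sketch_resync *rs, size_t blk, int ok,
                     unsigned long now)
{
    assert(blk < SKETCH_BLOCKS);
    if (ok)
        rs->dirty[blk] = 0;
    else
        rs->next_try[blk] = now + SKETCH_RETRY_DELAY;
}

/* The target may only be called consistent once no bit is left set. */
int sketch_is_consistent(const struct sketch_resync *rs)
{
    for (size_t i = 0; i < SKETCH_BLOCKS; i++)
        if (rs->dirty[i])
            return 0;
    return 1;
}
```

With this shape, a block that keeps failing never lets the target
report itself consistent, and the delay keeps the syncer from
hammering the same bad block on every pass.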

> BTW: do you have any suggestions for handling the bitmap and meta-data
> write failures?

difficult. we probably need to have several "drbd super blocks" in
drbd8, so we at least have a much higher chance to get important flags
on stable storage _somewhere_. I guess we don't want to have several
bitmaps, but some means to store the "meta data is not reliable anymore"
flag in several places. updates to these blocks have to be
transactional. this is not yet done, but it is on the todo list...
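
one possible shape for this, purely as a sketch: keep several small
copies of the flags, each with a generation counter and a checksum, so
that a torn or failed write is detectable and a reader can fall back to
the newest copy that still validates. nothing below is drbd code or a
planned on-disk format; the sketch_* names, the number of copies and
the toy checksum are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_COPIES             3
#define SKETCH_FLAG_MD_UNRELIABLE 0x1u   /* "meta data not reliable" */

/* Hypothetical miniature super block; several copies are kept. */
struct sketch_super {
    uint32_t generation;   /* bumped on every update */
    uint32_t flags;
    uint32_t checksum;     /* covers generation and flags */
};

/* Toy checksum for illustration; a real format would use a proper CRC. */
uint32_t sketch_checksum(const struct sketch_super *sb)
{
    return (sb->generation * 2654435761u) ^ sb->flags ^ 0x5A5A5A5Au;
}

int sketch_valid(const struct sketch_super *sb)
{
    return sb->checksum == sketch_checksum(sb);
}

/*
 * Pick the valid copy with the highest generation. Returns 0 and fills
 * *out on success, or -1 if no copy validates.
 */
int sketch_load(const struct sketch_super copies[SKETCH_COPIES],
                struct sketch_super *out)
{
    int found = -1;
    for (int i = 0; i < SKETCH_COPIES; i++) {
        if (!sketch_valid(&copies[i]))
            continue;
        if (found < 0 || copies[i].generation > copies[found].generation)
            found = i;
    }
    if (found < 0)
        return -1;
    *out = copies[found];
    return 0;
}

/*
 * Store new flags under the next generation in every copy that can be
 * written; write_ok[i] simulates whether copy i is writable. Returns
 * the number of copies actually updated.
 */
int sketch_update(struct sketch_super copies[SKETCH_COPIES],
                  const int write_ok[SKETCH_COPIES], uint32_t flags)
{
    struct sketch_super cur, next;
    int written = 0;

    next.generation = (sketch_load(copies, &cur) == 0)
                          ? cur.generation + 1 : 1;
    next.flags = flags;
    next.checksum = sketch_checksum(&next);

    for (int i = 0; i < SKETCH_COPIES; i++) {
        if (write_ok[i]) {
            copies[i] = next;
            written++;
        }
    }
    return written;
}
```

as long as one copy can still be written, the flag survives; a copy
whose checksum does not match is simply ignored on load. what to do
when zero copies can be written is exactly the open question above.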


> Also - let me know if you think you would incorporate these changes into
> the 0.7 branch

unlikely. not impossible, though.

> - if so, I'll send patches (oh and let me know what the
> 'approved' mechanism for sending patches is please).

svn diff

cheers,

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :

