From: Lars Ellenberg <lars.ellenberg@linbit.com>
To: drbd-dev@lists.linbit.com
Subject: Re: [Drbd-dev] I/O can hang on primary synctarget after an io error.
Date: Mon, 25 Feb 2008 22:06:58 +0100
Message-ID: <20080225210658.GB14695@mail.linbit.com>
In-Reply-To: <BD7042533C2F8943A6A4257A9E31C454F47ACB@EXNA.corp.stratus.com>
On Mon, Feb 25, 2008 at 03:31:01PM -0500, Montrose, Ernest wrote:
>
> Hi all,
>
> We are seeing an issue where I/O hangs on a volume that received an I/O
> error while it was the sync target during resync. Looking at the logs,
> it seems that we are skipping a dec_local(). My theory is that
> after_state_ch() is blocked forever waiting for local_cnt to reach 0 as
> we are becoming Diskless, so the worker makes no further progress,
> hence the hung I/O. Here are the relevant logs:
>
>
> Feb 13 03:48:55 node0 kernel: drbd5: Began resync as SyncTarget (will sync 1048508 KB [262127 bits set]).
> Feb 13 03:48:55 node0 kernel: drbd5: Writing meta data super block now.
> Feb 13 03:48:55 node0 kernel: drbd5: Creating new epoch in drbd_try_rs_begin_io
> Feb 13 03:48:55 node0 kernel: drbd5: ***Simulating Resync write failure
> Feb 13 03:48:56 node0 kernel: drbd5: Resync aborted.
> Feb 13 03:48:56 node0 kernel: drbd5: conn( SyncTarget -> Connected ) disk( Inconsistent -> Failed )
> Feb 13 03:48:56 node0 kernel: drbd5: Local IO failed. Detaching...
> Feb 13 03:48:56 node0 kernel: drbd5: disk( Failed -> Diskless )
> Feb 13 03:48:56 node0 kernel: drbd5: Notified peer that my disk is broken.
> Feb 13 03:48:56 node0 kernel: drbd5: Can not write resync data to local disk.
> Feb 13 03:54:57 node0 kernel: drbd5: drbd_nl_disk_conf: mdev->bc not NULL.
>
>
> Notice the last line of the log. Our test environment must have tried
> to do an "attach", and since local_cnt never dropped to 0, "bc" was
> never freed.
>
>
> But following the "Can not write resync data to local disk." message
> leads to drbd_endio_write_sec(), where we see this suspicious line:
>
> if (bio->bi_size) return 1;
It's not suspicious; it's standard procedure for partial completions.
That convention was even removed from the internal kernel API recently.
> We are supposed to do the dec_local() at the end of
> drbd_endio_write_sec(). I am guessing that's where the problem is, but
> I do not know why bi_size would be greater than 0. Is the fix simply to
> call dec_local() before returning?
If there is an imbalance in the local refcounting, it is elsewhere;
drbd_endio_write_sec() is correct, as far as I can see.
Do you have commit a009fc907a14f69026b32fbb48a4db6f1cdd5ecd
included in your code base?
--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :