From: Neil Brown <neilb@suse.de>
To: Jes Sorensen <Jes.Sorensen@redhat.com>,
Nate Dailey <nate.dailey@stratus.com>
Cc: linux-raid@vger.kernel.org, William.Kuzeja@stratus.com, xni@redhat.com
Subject: Re: [PATCH 0/2] raid1/10: Handle write errors correctly in narrow_write_error()
Date: Sat, 24 Oct 2015 16:31:11 +1100
Message-ID: <87k2qcygrk.fsf@notabene.neil.brown.name>
In-Reply-To: <wrfjoafpcvjs.fsf@redhat.com>
Jes Sorensen <Jes.Sorensen@redhat.com> writes:
> Nate Dailey <nate.dailey@stratus.com> writes:
>> Thank you!
>>
>> I confirmed that this patch prevents the bug.
>>
>> Nate
>
> Awesome, thanks Nate!
>
> Neil once you commit the final version of this patch, please let me
> know.
>
> Cheers,
> Jes
>
>>
>>
>>
>> On 10/22/2015 08:09 PM, Neil Brown wrote:
>>> Nate Dailey <nate.dailey@stratus.com> writes:
>>>
>>>> The problem is that we aren't getting true write (medium) errors.
>>>>
>>>> In this case we're testing device removals. The write errors happen because
>>>> the disk goes away. narrow_write_error() returns 1, the bitmap bit is cleared,
>>>> and then when the device is re-added the resync might not include the sectors
>>>> in that chunk (there's some luck involved; if other writes to that chunk
>>>> happen while the disk is removed, we're okay, so the bug is easier to hit with
>>>> smaller bitmap chunks).
>>>>
>>>>
>>> OK, that makes sense.
>>>
>>> The device removal will be noticed when the bad block log is written out.
>>> When a bad block is recorded we make sure to write that out promptly
>>> before bio_endio() gets called, but not before close_write() has called
>>> bitmap_end_write().
>>>
>>> So I guess we need to delay the close_write() call until the
>>> bad-block-log has been written.
>>>
>>> I think this patch should do it. Can you test?
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index c1ad0b075807..1a1c5160c930 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -2269,8 +2269,6 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>>>  			rdev_dec_pending(conf->mirrors[m].rdev,
>>>  					 conf->mddev);
>>>  		}
>>> -	if (test_bit(R1BIO_WriteError, &r1_bio->state))
>>> -		close_write(r1_bio);
>>>  	if (fail) {
>>>  		spin_lock_irq(&conf->device_lock);
>>>  		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
>>> @@ -2396,6 +2394,9 @@ static void raid1d(struct md_thread *thread)
>>>  				r1_bio = list_first_entry(&tmp, struct r1bio,
>>>  							  retry_list);
>>>  				list_del(&r1_bio->retry_list);
>>> +				if (mddev->degraded)
>>> +					set_bit(R1BIO_Degraded, &r1_bio->state);
>>> +				close_write(r1_bio);
>>>  				raid_end_bio_io(r1_bio);
>>>  			}
>>>  		}

I've just pushed out a version (for raid10 as well) in my 'for-linus'
branch. I'll submit to Linus later today after zero-day comes back with
no errors.

This version contains some extra code which is not needed, but makes the
change more obviously correct.

Thanks,
NeilBrown
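
Editorial sketch for readers following the thread: the ordering problem
discussed in the quoted messages can be modelled in a few lines of C. This
is a toy model only, not kernel code. Every name prefixed with toy_ is
hypothetical, and the assumption that the write-intent bit is kept whenever
the array already knows a device is faulty is a simplification of what
close_write() and bitmap_end_write() (as they are named in the thread) do.
It shows why clearing the bit before the bad-block log is flushed can lose
the dirty state, while flushing first preserves it.

```c
#include <stdbool.h>

/* Toy model of one bitmap chunk and one mirror. Hypothetical names. */
struct toy_array {
	bool bitmap_dirty;	/* write-intent bit for the chunk being written */
	bool device_present;	/* false once the disk has been pulled */
	bool device_faulty;	/* true once the array has noticed the removal */
};

/*
 * Stand-in for writing out the bad-block log. Writing it to a removed
 * device fails, and that failure is when the removal gets noticed.
 */
static void toy_flush_badblock_log(struct toy_array *a)
{
	if (!a->device_present)
		a->device_faulty = true;
}

/*
 * Stand-in for close_write() ending the bitmap write. Assumption: the bit
 * is cleared only if the array believes every mirror received the data.
 */
static void toy_close_write(struct toy_array *a)
{
	if (!a->device_faulty)
		a->bitmap_dirty = false;
}

enum toy_order { TOY_CLOSE_FIRST, TOY_FLUSH_FIRST };

/*
 * Simulate a write that failed because the disk was removed. Returns true
 * if the chunk is still marked dirty afterwards, meaning a later re-add
 * would resync it.
 */
static bool toy_write_to_removed_disk(enum toy_order order)
{
	struct toy_array a = {
		.bitmap_dirty = true,
		.device_present = false,
		.device_faulty = false,
	};

	if (order == TOY_CLOSE_FIRST) {
		toy_close_write(&a);		/* bit cleared too early */
		toy_flush_badblock_log(&a);	/* removal noticed too late */
	} else {
		toy_flush_badblock_log(&a);	/* removal noticed first */
		toy_close_write(&a);		/* bit stays set */
	}
	return a.bitmap_dirty;
}
```

In this model, toy_write_to_removed_disk(TOY_CLOSE_FIRST) leaves the chunk
clean, which corresponds to the missed resync Nate reported, while
toy_write_to_removed_disk(TOY_FLUSH_FIRST) leaves it dirty, which is the
ordering the patch aims for by moving close_write() into raid1d().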
Thread overview: 14+ messages
2015-10-20 16:09 [PATCH 0/2] raid1/10: Handle write errors correctly in narrow_write_error() Jes.Sorensen
2015-10-20 16:09 ` [PATCH 1/2] md/raid1: submit_bio_wait() returns 0 on success Jes.Sorensen
2015-10-20 16:09 ` [PATCH 2/2] md/raid10: " Jes.Sorensen
2015-10-20 20:29 ` [PATCH 0/2] raid1/10: Handle write errors correctly in narrow_write_error() Neil Brown
2015-10-20 23:12 ` Jes Sorensen
2015-10-22 15:59 ` Jes Sorensen
2015-10-22 16:01 ` [PATCH 1/2] md/raid1: Do not clear bitmap bit if submit_bio_wait() fails Jes.Sorensen
2015-10-22 16:01 ` [PATCH 2/2] md/raid10: " Jes.Sorensen
2015-10-22 21:36 ` [PATCH 0/2] raid1/10: Handle write errors correctly in narrow_write_error() Neil Brown
2015-10-22 22:37 ` Nate Dailey
2015-10-23 0:09 ` Neil Brown
2015-10-23 14:30 ` Nate Dailey
2015-10-23 18:02 ` Jes Sorensen
2015-10-24 5:31 ` Neil Brown [this message]