From: Jes Sorensen <Jes.Sorensen@redhat.com>
To: Neil Brown <neilb@suse.de>
Cc: linux-raid <linux-raid@vger.kernel.org>, Xiao Ni <xni@redhat.com>
Subject: Re: 4.1-rc6 radi5 OOPS
Date: Fri, 12 Jun 2015 17:52:58 -0400
Message-ID: <wrfj1thg4mrp.fsf@redhat.com>
In-Reply-To: <20150611164847.7cd87c13@home.neil.brown.name> (Neil Brown's message of "Thu, 11 Jun 2015 16:48:47 +1000")

Neil Brown <neilb@suse.de> writes:
> On Wed, 10 Jun 2015 12:27:35 -0400
> Jes Sorensen <Jes.Sorensen@redhat.com> wrote:
>
>> Neil Brown <neilb@suse.de> writes:
>> > On Wed, 10 Jun 2015 10:19:42 +1000 Neil Brown <neilb@suse.de> wrote:
>> >
>> >> So it looks like some sort of race.  I have other evidence of a race
>> >> with the resync/reshape thread starting/stopping.  If I track that
>> >> down it'll probably fix this issue too.
>> >
>> > I think I have found just such a race.  If you request a reshape just
>> > as a recovery completes, you can end up with two reshapes running.
>> > This causes confusion :-)
>> >
>> > Can you try this patch?  If I can remember how to reproduce my race
>> > I'll test it on that too.
>> >
>> > Thanks,
>> > NeilBrown
>> 
>> Hi Neil,
>> 
>> Thanks for the patch - I tried with this applied, but it still crashed
>> for me :( I had to mangle it manually, somehow it got modified in the
>> email.
>
> Very :-(
>
> I had high hopes for that patch.  I cannot find anything else that could lead
> to what you are seeing.  I wish I could reproduce it but it is probably highly
> sensitive to timing so some hardware shows it and others don't.
>
> It looks very much like two 'resync' threads are running at the same time.
> When one finishes, it sets ->reshape_progress to -1 (MaxSector), which trips up
> the other one.
>
> In the hang that I very rarely see, one thread (presumably) finishes and sets
> MD_RECOVERY_DONE, so the raid5d thread waits for the resync thread to
> complete, and that thread is waiting for raid5d to retire some stripe_heads.
>
> ... though the 'resync' thread is probably actually doing a 'reshape'...

Neil

Good news - albeit not guaranteed yet. I tried with the full patch that
you sent to Linus, and with that I haven't been able to reproduce the
problem so far. I'll try and do some more testing over the weekend.

The patch I manually applied only had two hunks in it, the one you
pushed to Linus looks a lot more complete :)

> Did you get a chance to bisect it?  I must admit that I doubt that would be
> useful.  It probably starts when "md_start_sync" was introduced and maybe made
> worse when some locking with mddev_lock was relaxed.
>
> The only way I can see a race is if MD_RECOVERY_DONE gets left set when a new
> thread is started.  md_check_recovery always clears it before starting a thread,
> but raid5_start_reshape doesn't - or didn't before the patch I gave you.
>
> It might make more sense to clear the bit in md_reap_sync_thread as below,
> but if the first patch didn't work, this one is unlikely to.
>
> Would you be able to test with the following patch?  There is a chance it might
> confirm whether two sync threads are running at the same time.

I can try with this patch applied too, but I won't get to it before next
week. It's been a week of unrelated MD issues.

Thanks a lot!

Cheers,
Jes
