linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: NeilBrown <neilb@suse.de>
To: "Hawrylewicz Czarnowski,
	Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
	"Neubauer, Wojciech" <Wojciech.Neubauer@intel.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"Ciechanowski, Ed" <ed.ciechanowski@intel.com>
Subject: Re: [PATCH] md: do not write resync checkpoint, if max_sector has been reached.
Date: Tue, 1 Feb 2011 12:07:52 +1100	[thread overview]
Message-ID: <20110201120752.00a9b534@notabene.brown> (raw)
In-Reply-To: <66C59AD0932712458090B447266D638C010C8D149A@irsmsx504.ger.corp.intel.com>

On Mon, 31 Jan 2011 18:48:08 +0000 "Hawrylewicz Czarnowski, Przemyslaw"
<przemyslaw.hawrylewicz.czarnowski@intel.com> wrote:

> > -----Original Message-----
> > From: NeilBrown [mailto:neilb@suse.de]
> > Sent: Monday, January 31, 2011 3:46 AM
> > To: Hawrylewicz Czarnowski, Przemyslaw
> > Cc: linux-raid@vger.kernel.org; Neubauer, Wojciech; Williams, Dan J;
> > Ciechanowski, Ed
> > Subject: Re: [PATCH] md: do not write resync checkpoint, if max_sector has
> > been reached.
> > 
> > On Thu, 27 Jan 2011 17:50:15 +0100 Przemyslaw Czarnowski
> > <przemyslaw.hawrylewicz.czarnowski@intel.com> wrote:
> > 
> > > If disk fails during resync, sync service of personality usually skips
> > the
> > > rest of not synchronized stripes. It finishes sync thread (md_do_sync())
> > > and wakes up the main raid thread. md_recovery_check() starts and
> > > unregisteres sync thread.
> > > In the meanwhile mdmon also services failed disk - removes and replaces
> > it
> > > with a new one (if it was available).
> > > If checkpoint is stored (with value of array's max_sector), next
> > > md_recovery_check() will restart resync. It finishes normally and
> > > activates ALL spares (including the one added recently) what is wrong.
> > > Another md_recovery_check() will not start recovery as all disks are in
> > > sync. If checkpoint is not stored, second resync does not start and
> > > recovery can proceed.
> > >
> > > Signed-off-by: Przemyslaw Czarnowski
> > <przemyslaw.hawrylewicz.czarnowski@intel.com>
> > > ---
> > >  drivers/md/md.c |    3 ++-
> > >  1 files changed, 2 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > index 3e40aad..6eda858 100644
> > > --- a/drivers/md/md.c
> > > +++ b/drivers/md/md.c
> > > @@ -6929,7 +6929,8 @@ void md_do_sync(mddev_t *mddev)
> > >  	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
> > >  	    mddev->curr_resync > 2) {
> > >  		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
> > > -			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> > > +			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> > > +			    mddev->curr_resync < max_sectors) {
> > >  				if (mddev->curr_resync >= mddev->recovery_cp) {
> > >  					printk(KERN_INFO
> > >  					       "md: checkpointing %s of %s.\n",
> > >
> > 
> > This is wrong.  If curr_resync has reached some value, then the array *is*
> > in-sync up to that point.
> > 
> > If a device fails then that often makes the array fully in-sync - because
> > there it no longer any room for inconsistency.
> > This is particularly true for RAID1.  If one drive in a 2-drive RAID1
> > fails,
> > then the array instantly becomes in-sync.
> > For RAID5, we should arguably fail the array at that point rather than
> > marking it in-sync, but that would probably cause more data loss than it
> > avoids, so we don't.
> > In any case - the array is now in-sync.
> Yes, I agree. But it is not the point here, in this bug.
> 
> > 
> > If a spare is added by mdmon at this time, then the array is not 'out of
> > sync', it is 'in need for recovery'.  'recovery' and 'resync' are different
> > things.
> I fully understand the difference between recovery and resync (and reshape). 
> 
> > 
> > md_check_recovery should run remove_and_add_spares are this point.  That
> And it does. 
> 
> > should return a non-zero value (because it found the spare that mdmon
> > added)
> But the return value is wrong (it is correct according to current configuration). Please let me explain once again what's going on.
> 
> The flow is as follows:
> 0. resync is in progress
> 1. one disk fails
> 2. md_error() wakes up raid thread
> 3. md_do_sync() gets skipped=1 from mddev->pers->sync_request() and some amount of skipped sectors/stripes - usually all remaining to resync. mddev->recovery_cp is set to last sector (max_sector in md_do_sync)
> 3a. md_check_recovery() sees MD_RECOVERY_INTR (clears it) and unregisters recovery thread (which actually does resync)
> 3b. mdmon unblocks array member
> 4. md_check_recovery checks if some action is required. 
> 4a. reshape is not taken into account as reshape_position==MaxSector
> 4b. recovery is not taken into account as mdmon did not add spare yet
> 4c. resync is started, as recovery_cp!=MaxSector (!)
> 5. md_do_sync exists normally (gets skipped=1 from mddev->pers->sync_request()) as checkpoint pointed at the last sector; it clears mddev->recovery_cp.
> 6. mdmon adds disk (via slot-store())
> 7. md_check_recovery() does cleanup after finished resync
> 7a. MD_RECOVERY_INTR is not set anymore, mddev->pers->spare_active() is started and ALL devices !In_sync available in array are set in_sync, and array degradation is cleared(!).
> 7b. remove_and_add_spares() does not see spares available (mddev->degraded==0) so recovery does not start.
> 

Thank you for this excellent problem description.

I think the mistake here is at step 6.
mdmon should not be trying to add a disk until the resync has completed.
In particular, mdmon shouldn't try, and md should not allow mdmon to succeed.

So slot_store should return -EBUSY if MD_RECOVERY_RUNNING is set

mdmon needs to 'know' when a sync/etc is happening, and should avoid
processing ->check_degraded if it is.

I have added the md fix to my for-next branch.
I might do the mdadm fix later.

Thanks,
NeilBrown



> > and  should then start a recovery pass which will ignore recovery_cp (which
> > is a really badly chosen variable name - it should be 'resync_cp', not
> > 'recovery_cp'.
> as you can see above, recovery_cp is not ignored (yes the name is confusing)
> 
> > 
> > So if you are experiencing a problem where mdmon adds a spare and appears
> > to
> > get recovered instantly, (which is what you seem to be saying) then the
> to be precise, recovery do not start at all...
> 
> > problem is else-where.
> 
> > If you can reproduce it, then it would help to put some tracing in
> > md_check_recovery, particularly reporting the return value of
> > remove_and_add_spares, and the value that is finally chosen for
> > mddev->recovery.
> if you want some logs, I have plenty:) But I think my description will suffice to understand the problem.
> 
> > 
> > Thanks,
> > NeilBrown
> 


  reply	other threads:[~2011-02-01  1:07 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-01-27 16:50 [PATCH] md: do not write resync checkpoint, if max_sector has been reached Przemyslaw Czarnowski
2011-01-31  2:45 ` NeilBrown
2011-01-31 18:48   ` Hawrylewicz Czarnowski, Przemyslaw
2011-02-01  1:07     ` NeilBrown [this message]
2011-02-01 23:45       ` Hawrylewicz Czarnowski, Przemyslaw

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110201120752.00a9b534@notabene.brown \
    --to=neilb@suse.de \
    --cc=Wojciech.Neubauer@intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=ed.ciechanowski@intel.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=przemyslaw.hawrylewicz.czarnowski@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).