From: Neil Brown <neilb@suse.de>
To: Song Liu <songliubraving@fb.com>, Shaohua Li <shli@fb.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Cc: Kernel Team <Kernel-team@fb.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>
Subject: RE: [PATCH 7/8] md: skip resync for raid array with journal
Date: Fri, 02 Oct 2015 09:50:40 +1000
Message-ID: <87zj02i1wv.fsf@notabene.neil.brown.name>
In-Reply-To: <C709E4D363AAB64590BFAC54D4C478AA0104B82A61@PRN-MBX02-4.TheFacebook.com>

Song Liu <songliubraving@fb.com> writes:

>> -----Original Message-----
>> From: Neil Brown [mailto:neilb@suse.de]
>> Sent: Tuesday, September 29, 2015 9:25 PM
>> To: Shaohua Li; linux-raid@vger.kernel.org
>> Cc: Kernel Team; Song Liu; hch@infradead.org; dan.j.williams@intel.com
>> Subject: Re: [PATCH 7/8] md: skip resync for raid array with journal
>> 
>> Shaohua Li <shli@fb.com> writes:
>> 
>> > If a raid array has a journal, the journal can guarantee
>> > consistency, so we can skip the resync after an unclean shutdown.
>> > The exceptions are array creation and user-initiated resync, for
>> > which we still do a full resync.
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> >  drivers/md/md.c | 4 ++++
>> >  drivers/md/md.h | 1 +
>> >  2 files changed, 5 insertions(+)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index b3f9eed..95824fb 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -1669,6 +1669,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>> >  			}
>> >  			set_bit(Journal, &rdev->flags);
>> >  			rdev->journal_tail = le64_to_cpu(sb->journal_tail);
>> > +			if (mddev->recovery_cp == MaxSector)
>> > +				set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
>> >  			break;
>> >  		default:
>> >  			rdev->saved_raid_disk = role;
>> > @@ -1711,6 +1713,8 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
>> >  	sb->events = cpu_to_le64(mddev->events);
>> >  	if (mddev->in_sync)
>> >  		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>> > +	else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
>> > +		sb->resync_offset = cpu_to_le64(MaxSector);
>> >  	else
>> >  		sb->resync_offset = cpu_to_le64(0);
>> >
>> > diff --git a/drivers/md/md.h b/drivers/md/md.h
>> > index 226f4ba..0288a0b 100644
>> > --- a/drivers/md/md.h
>> > +++ b/drivers/md/md.h
>> > @@ -236,6 +236,7 @@ struct mddev {
>> >  #define MD_STILL_CLOSED	4	/* If set, then array has not been opened since
>> >  				 * md_ioctl checked on it.
>> >  				 */
>> > +#define MD_JOURNAL_CLEAN 5	/* A raid with journal is already clean */
>> >
>> >  	int				suspended;
>> >  	atomic_t			active_io;
>> > --
>> > 1.8.1
>> 
>> This looks right as far as it goes, but I don't think it goes far enough.
>> The particular scenario that bothers me is if the array is started without the
>> journal being present.
>> I cannot see anything to prevent that - is there?
>
> I have sent mdadm patches that prevent the array from being assembled
> with a missing journal, for both --assemble and --incremental:
>
> http://marc.info/?l=linux-raid&m=144080445720123
> http://marc.info/?l=linux-raid&m=144080446120127 
>

That is probably good and useful, but the kernel cannot rely on mdadm to
protect it.  If the kernel is asked to start a journalled array when no
journal is present, it must do something that is safe.
That might be to refuse, to force a resync, or to somehow "know"
whether the journal is empty and force a resync only in that case.
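
To make the hazard concrete, the assembly-time logic that matters is
roughly the following (abridged from super_1_validate() in
drivers/md/md.c, for illustration only).  Nothing in this path checks
that the journal device which justified the "clean" claim is actually
present:

	/* super_1_validate(), abridged - illustration, not a patch */
	mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
	/* With MD_JOURNAL_CLEAN, resync_offset stays MaxSector across an
	 * unclean shutdown, so a later assembly without the journal sees
	 * "clean" and skips the resync. */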

This should be treated much like a "dirty/degraded" start.  Currently
the kernel refuses to do that unless a module parameter is set.  When
mdadm is asked to --force start a dirty/degraded array, it modifies the
metadata so that it doesn't appear to be degraded.  Something similar
might be best for the missing-journal case.
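
One possible shape for the kernel side, sketched purely as
illustration: the parameter start_missing_journal and the helper
md_journal_missing() are invented names (the existing analogue is
start_dirty_degraded / mddev->ok_start_degraded), and the check would
have to sit somewhere in the array-start path such as md_run().

	/* Hypothetical sketch, modelled on start_dirty_degraded.  The
	 * parameter and helper names are invented, not real md API. */
	static int start_missing_journal;
	module_param(start_missing_journal, int, S_IRUGO|S_IWUSR);

	/* True when the superblock carries MD_FEATURE_JOURNAL but no
	 * rdev with the Journal flag was found during assembly. */
	static bool md_journal_missing(struct mddev *mddev);

	/* In md_run(), before activating the personality: */
	if (md_journal_missing(mddev)) {
		if (!start_missing_journal) {
			pr_err("md: %s: journal missing, refusing to start array\n",
			       mdname(mddev));
			return -EINVAL;
		}
		/* Forced start: the journal may hold unreplayed writes,
		 * so do not trust "clean" - force a full resync. */
		clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
		mddev->recovery_cp = 0;
	}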

I *think* the current kernel code will allow the array to be started
without a resync even if the journal was not empty, and that is not
safe.

> We can force-start in both cases with --force or --run.
>
> Currently, mdadm doesn't clear the MD_FEATURE_JOURNAL bit, so we get a
> warning every time.
>
> I did miss one step: marking the missing journal device as faulty.
>
>> 
>> In that case we need to assume that the array is not in-sync, and we
>> need to clear MD_FEATURE_JOURNAL so that if the array is stopped and
>> assembled again, the stale journal doesn't get used.
>> 
>> One unfortunate side effect of that is that you couldn't stop the array cleanly
>> (leaving the journal effectively empty) and then restart with no journal and
>> no resync.  Is that a problem, I wonder?
>> 
>> I'm not sure what the best solution is here, but we need a clear
>> understanding of what happens if you try to assemble an array without the
>> journal where previously it had one, and I don't think the current code gets it
>> right.
>
> I just talked with Shaohua. We think we can probably do the following:
>
> 1. When assembling with the cache device missing, show a warning (and
> require the --force option). Run a resync if the previous shutdown was
> not clean (this is the default behavior).

Probably sensible.  But I think the current code never marks the array as
"not clean" when there is a journal.

>
> 2. When we get an I/O error on the journal device at run time, make
> the whole array read-only.

Yes, I think that is best.
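
Something along these lines, sketched only to pin the idea down (the
r5l_* names follow the raid5-cache patches in this series, but this
handler is invented rather than taken from the actual code):

	/* Hypothetical sketch: react to a failed journal write by
	 * failing the journal rdev and refusing further writes. */
	static void r5l_handle_log_error(struct r5l_log *log)
	{
		struct mddev *mddev = log->rdev->mddev;

		/* Record the failure against the journal device. */
		md_error(mddev, log->rdev);
		/* Without a working journal we can no longer guarantee
		 * consistency for new writes, so go read-only. */
		mddev->ro = 1;
		set_bit(MD_CHANGE_DEVS, &mddev->flags);
		md_wakeup_thread(mddev->thread);
	}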

>
> 3. Add a new operation that adds a new journal device to an array with
> a missing journal. We can limit this operation to inactive arrays only.

It would be great if we could add a journal to an active array...

>
> Would this plan cover all the cases?

Quite possibly.  I don't want to commit myself until I see the code :-)

Thanks,
NeilBrown


