From: Neil Brown <neilb@suse.de>
To: Song Liu <songliubraving@fb.com>, Shaohua Li <shli@fb.com>,
"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Cc: Kernel Team <Kernel-team@fb.com>,
"hch@infradead.org" <hch@infradead.org>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>
Subject: RE: [PATCH 7/8] md: skip resync for raid array with journal
Date: Fri, 02 Oct 2015 09:50:40 +1000 [thread overview]
Message-ID: <87zj02i1wv.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <C709E4D363AAB64590BFAC54D4C478AA0104B82A61@PRN-MBX02-4.TheFacebook.com>
[-- Attachment #1: Type: text/plain, Size: 5334 bytes --]
Song Liu <songliubraving@fb.com> writes:
>> -----Original Message-----
>> From: Neil Brown [mailto:neilb@suse.de]
>> Sent: Tuesday, September 29, 2015 9:25 PM
>> To: Shaohua Li; linux-raid@vger.kernel.org
>> Cc: Kernel Team; Song Liu; hch@infradead.org; dan.j.williams@intel.com
>> Subject: Re: [PATCH 7/8] md: skip resync for raid array with journal
>>
>> Shaohua Li <shli@fb.com> writes:
>>
>> > If a raid array has journal, the journal can guarantee the
>> > consistency, we can skip resync after a unclean shutdown. The
>> > exception is raid creation or user initiated resync, which we still do a raid
>> resync.
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> > drivers/md/md.c | 4 ++++
>> > drivers/md/md.h | 1 +
>> > 2 files changed, 5 insertions(+)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c index b3f9eed..95824fb
>> > 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -1669,6 +1669,8 @@ static int super_1_validate(struct mddev *mddev,
>> struct md_rdev *rdev)
>> > }
>> > set_bit(Journal, &rdev->flags);
>> > rdev->journal_tail = le64_to_cpu(sb->journal_tail);
>> > + if (mddev->recovery_cp == MaxSector)
>> > + set_bit(MD_JOURNAL_CLEAN, &mddev-
>> >flags);
>> > break;
>> > default:
>> > rdev->saved_raid_disk = role;
>> > @@ -1711,6 +1713,8 @@ static void super_1_sync(struct mddev *mddev,
>> struct md_rdev *rdev)
>> > sb->events = cpu_to_le64(mddev->events);
>> > if (mddev->in_sync)
>> > sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>> > + else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
>> > + sb->resync_offset = cpu_to_le64(MaxSector);
>> > else
>> > sb->resync_offset = cpu_to_le64(0);
>> >
>> > diff --git a/drivers/md/md.h b/drivers/md/md.h index 226f4ba..0288a0b
>> > 100644
>> > --- a/drivers/md/md.h
>> > +++ b/drivers/md/md.h
>> > @@ -236,6 +236,7 @@ struct mddev {
>> > #define MD_STILL_CLOSED 4 /* If set, then array has not been
>> opened since
>> > * md_ioctl checked on it.
>> > */
>> > +#define MD_JOURNAL_CLEAN 5 /* A raid with journal is already clean
>> */
>> >
>> > int suspended;
>> > atomic_t active_io;
>> > --
>> > 1.8.1
>>
>> This looks right as far as it goes, but I don't think it goes far enough.
>> The particular scenario that bothers me is if the array is started without the
>> journal being present.
>> I cannot see anything to prevent that - is there?
>
> I have sent mdadm patches that prevent the array to assemble with missing
> journal, for both --assemble and --incremental:
>
> http://marc.info/?l=linux-raid&m=144080445720123
> http://marc.info/?l=linux-raid&m=144080446120127
>
That is probably good and useful, but the kernel cannot rely on mdadm to
protect it. If the kernel is asked to start a journalled array when no
journal is present, it must do something that is safe.
That might be to refuse, it might be to force a resync, it might be to
somehow "know" if the journal is empty and only force a resync in that
case.
This should be treated much like a "dirty/degraded" start. Currently
the kernel refuses to do that unless a module parameter is set. When
mdadm is asked to --force start a dirty/degraded array, it modifies the
metadata so that it doesn't appear to be degraded. Something similar
might be best for the missing-journal case.
I *think* the current kernel code will allow the array to be started
without a resync even if the journal was not empty, and that is not
safe.
> We can force start in both cases with --force or --run.
>
> Currently, mdadm doesn't clear MD_FEATURE_JOURNAL bit, so we get warning
> every time.
>
> I did miss one step: to mark the missing journal as fault.
>
>>
>> In that case we need to assume that the array is not in-sync, and we need to
>> clear MD_FEATURE_JOURNAL so if it gets stopped and then assembled again
>> with the stale journal doesn't get used.
>>
>> One unfortunate side effect of that is that you couldn't stop the array cleanly
>> (leaving the journal effectively empty) and then restart with no journal and
>> no resync. Is that a problem I wonder?
>>
>> I'm not sure what the best solution is here, but we need a clear
>> understanding of what happens if you try to assemble an array without the
>> journal where previously it had one, and I don't think the current code gets it
>> right.
>
> I just talked with Shaohua. We think we can probably do the following:
>
> 1. When assemble with cache device missing, show the warning (with --force
> option). Run resync if previous shutdown is not clean (this is the default
> behavior.
Probably sensible. But I think the current code never marks the array as
"not clean" when there is a journal.
>
> 2. When we get I/O error on the journal device at run time, make the whole
> array as read only.
Yes, I think that is best.
>
> 3. Add new operation that adds new journal device to an array with missing
> journal. We can limit this option to inactive array only.
It would be great if we could add a journal to an active array...
>
> Would this follow cover all the cases?
Quite possibly. I don't want to commit myself until I see the code :-)
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]
next prev parent reply other threads:[~2015-10-01 23:50 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-02 20:49 [PATCH 0/8] raid5-cache fixes Shaohua Li
2015-09-02 20:49 ` [PATCH 1/8] md: fix feature map check Shaohua Li
2015-09-02 20:49 ` [PATCH 2/8] raid5: fix build error Shaohua Li
2015-09-02 20:49 ` [PATCH 3/8] raid5-cache: switching to state machine for log disk cache flush Shaohua Li
2015-09-02 20:49 ` [PATCH 4/8] raid5-cache: fix a user-after-free bug Shaohua Li
2015-09-02 20:49 ` [PATCH 5/8] raid5-cache: move functionality out of __r5l_set_io_unit_state Shaohua Li
2015-09-02 20:49 ` [PATCH 6/8] raid5-cache: optimize FLUSH IO with log enabled Shaohua Li
2015-09-02 20:49 ` [PATCH 7/8] md: skip resync for raid array with journal Shaohua Li
2015-09-30 4:24 ` Neil Brown
2015-10-01 17:47 ` Song Liu
2015-10-01 23:50 ` Neil Brown [this message]
2015-09-02 20:49 ` [PATCH 8/8] raid5-cache: add trim support for log Shaohua Li
2015-09-30 4:14 ` Neil Brown
2015-09-30 5:36 ` [PATCH 0/8] raid5-cache fixes Neil Brown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zj02i1wv.fsf@notabene.neil.brown.name \
--to=neilb@suse.de \
--cc=Kernel-team@fb.com \
--cc=dan.j.williams@intel.com \
--cc=hch@infradead.org \
--cc=linux-raid@vger.kernel.org \
--cc=shli@fb.com \
--cc=songliubraving@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).