From: NeilBrown <neilb@suse.com>
To: Song Liu <songliubraving@fb.com>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
Shaohua Li <shli@fb.com>, Kernel Team <Kernel-team@fb.com>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"hch@infradead.org" <hch@infradead.org>,
"liuzhengyuang521@gmail.com" <liuzhengyuang521@gmail.com>,
"liuzhengyuan@kylinos.cn" <liuzhengyuan@kylinos.cn>
Subject: Re: [PATCH v5 4/8] md/r5cache: write part of r5cache
Date: Wed, 19 Oct 2016 11:53:21 +1100 [thread overview]
Message-ID: <87r37dxipq.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <00D21118-73D2-42D1-8062-D468D0C134B4@fb.com>
On Fri, Oct 14 2016, Song Liu wrote:
>> On Oct 13, 2016, at 11:53 PM, NeilBrown <neilb@suse.com> wrote:
>>
>> On Thu, Oct 13 2016, Song Liu wrote:
>>>
>>> For RMW, the code allocates an extra page for each data block
>>> being updated. This is stored in r5dev->page and the old data
>>> is read into it. Then the prexor calculation subtracts ->page
>>> from the parity block, and the reconstruct calculation adds the
>>> ->orig_page data back into the parity block.
>>
>> What happens if the alloc_page() fails?
>
> That will be tough, but solvable.. We can
>     read old data to page
>     do prexor
>     read new data from journal device to page
>     do xor
>     do the rest of the work.
>
> Or we can force the code to rcw, which does not need an extra page.
> But rcw does not always work in degraded mode. So, this is a good
> reason not to do write-back in degraded mode...
Prohibiting write-back in degraded mode would not be enough to ensure
that you can always use rcw. The array can become degraded after you
make the decision to use caching, and before you need to read old data
for rmw.
I would suggest a small (2 entry?) mempool where each entry in the
mempool holds enough pages to complete an rmw. Only use the mempool if
an alloc_page() fails.
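Something along these lines, maybe (a very rough, untested sketch -- the
names r5c_rmw_pages_alloc/free and log->rmw_page_pool, and using
&conf->raid_disks as the pool_data, are all just made up for illustration):

static void *r5c_rmw_pages_alloc(gfp_t gfp, void *pool_data)
{
	int disks = *(int *)pool_data;
	struct page **pages;
	int i;

	/* one spare page per device: enough to complete one rmw */
	pages = kcalloc(disks, sizeof(*pages), gfp);
	if (!pages)
		return NULL;
	for (i = 0; i < disks; i++) {
		pages[i] = alloc_page(gfp);
		if (!pages[i])
			goto out_free;
	}
	return pages;

out_free:
	while (i--)
		__free_page(pages[i]);
	kfree(pages);
	return NULL;
}

static void r5c_rmw_pages_free(void *element, void *pool_data)
{
	int disks = *(int *)pool_data;
	struct page **pages = element;
	int i;

	for (i = 0; i < disks; i++)
		__free_page(pages[i]);
	kfree(pages);
}

	/* at log init time, with two pre-allocated entries */
	log->rmw_page_pool = mempool_create(2, r5c_rmw_pages_alloc,
					    r5c_rmw_pages_free,
					    &conf->raid_disks);

The rmw setup path could then fall back to
mempool_alloc(log->rmw_page_pool, GFP_NOIO) only when alloc_page()
fails, and give the element back with mempool_free() once the prexor
has consumed the old data.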
>>
>>> +
>>> +void r5c_handle_cached_data_endio(struct r5conf *conf,
>>> +	struct stripe_head *sh, int disks, struct bio_list *return_bi)
>>> +{
>>> +	int i;
>>> +
>>> +	for (i = sh->disks; i--; ) {
>>> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
>>> +		    sh->dev[i].written) {
>>
>> Is it possible for R5_InCache to be set, but 'written' to be NULL ???
>
> Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
> times before the parity calculation. When it comes here the second time, a dev written the
> first time will have R5_InCache set, but its ->written will be NULL.
OK, that makes sense.
So is it possible for 'written' to be set, but R5_InCache to be clear?
i.e. do we really need to test R5_InCache here?
>>>
>>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>>> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>> 	io = log->current_io;
>>>
>>> 	for (i = 0; i < sh->disks; i++) {
>>> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>>> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
>>> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>>> 			continue;
>>
>> If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
>> that were destined for the journal, then this would just be
>>
>> if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))
>>
>> which would make lots of sense... Just a thought.
>
> We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
> those places aware of the journal. I guess that is not ideal either?
Maybe...
We have so many state flags that I like to be very cautious about adding
more, and to make sure they have a very well defined meaning that
doesn't overlap with other flags too much.
The above code suggests that Wantwrite and Wantcache overlap to some
extent.
Could we discard Wantcache and just use Wantwrite combined with InCache?
Wantwrite means that the block needed to be written to the RAID.
If InCache isn't set, it also needs to be written to the cache (if the
cache is being used).
Then the above code would be
	if (!test_bit(R5_Wantwrite) || test_bit(R5_InCache))
		continue;
which means "if we don't want to write this, or if it is already in the
cache, then nothing to do here".
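Spelled out with the flag arguments, the loop in r5l_log_stripe() would
then be something like (sketch only):

	for (i = 0; i < sh->disks; i++) {
		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) ||
		    test_bit(R5_InCache, &sh->dev[i].flags))
			continue;
		/* ... queue this block for the journal as before ... */
	}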
Maybe.
>
>
>>
>>> }
>>>
>>> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>> +void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>
>> Why are you removing the 'static' here? You don't call it from any
>> other file.
>
> In the next patch, it is called from raid5.c.
So remove 'static' in the next patch please.
Thanks,
NeilBrown
Thread overview: 23+ messages
2016-10-13 5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
2016-10-13 5:49 ` [PATCH v5 1/8] md/r5cache: Check array size in r5l_init_log Song Liu
2016-10-13 5:49 ` [PATCH v5 2/8] md/r5cache: move some code to raid5.h Song Liu
2016-10-13 5:49 ` [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode Song Liu
2016-10-14 6:13 ` NeilBrown
2016-10-14 7:03 ` Song Liu
2016-10-13 5:49 ` [PATCH v5 4/8] md/r5cache: write part of r5cache Song Liu
2016-10-14 6:53 ` NeilBrown
2016-10-14 7:33 ` Song Liu
2016-10-19 0:53 ` NeilBrown [this message]
2016-10-21 20:42 ` Song Liu
2016-10-13 5:49 ` [PATCH v5 5/8] md/r5cache: reclaim support Song Liu
2016-10-19 2:03 ` NeilBrown
2016-10-21 21:04 ` Song Liu
2016-10-13 5:49 ` [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state Song Liu
2016-10-19 2:19 ` NeilBrown
2016-10-21 21:20 ` Song Liu
2016-10-13 5:49 ` [PATCH v5 7/8] md/r5cache: r5c recovery Song Liu
2016-10-19 2:32 ` NeilBrown
2016-10-20 3:03 ` NeilBrown
2016-10-21 21:12 ` Song Liu
2016-10-26 1:18 ` NeilBrown
2016-10-13 5:49 ` [PATCH v5 8/8] md/r5cache: handle SYNC and FUA Song Liu