From: NeilBrown <neilb@suse.com>
To: Shaohua Li <shli@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
Christoph Hellwig <hch@lst.de>,
linux-raid@vger.kernel.org
Subject: Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
Date: Tue, 22 Nov 2016 13:19:07 +1100 [thread overview]
Message-ID: <87k2bwcl44.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <20161122010220.dcq6brjhsliw4io6@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 3534 bytes --]
On Tue, Nov 22 2016, Shaohua Li wrote:
> On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote:
>> On Tue, Nov 22 2016, Shaohua Li wrote:
>>
>> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
>> >> There are 2 problems with using bi_phys_segments as a counter
>> >> 1/ we only use 16bits, which limits bios to 256M
>> >> 2/ it is poor form to reuse a field like this. It interferes
>> >> with other changes to bios.
>> >>
>> >> We need to clean up a few things before we can change the use the
>> >> counter which is now available inside a bio.
>> >>
>> >> I have only tested this lightly. More review and testing would be
>> >> appreciated.
>> >
>> > So without the accounting, we:
>> > - don't do bio completion trace
>>
>> Yes, but hopefully that will be added back to bio_endio() soon.
>>
>> > - call md_write_start/md_write_end excessively, which involves atomic operation.
>>
>> raid5_inc_bio_active_stripes() did an atomic operation. I don't think
>> there is a net increase in the number of atomic operations.
>
> That's different. md_write_start/end uses a global atomic.
> raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now.
Maybe.
Most md_write_start() calls are made in the context of
raid5_make_request().
We could
- call md_write_start() once at the start
- count how many times we want to call it in a variable local to
raid5_make_request()
- atomically add that to the counter at the end.
Similarly mode md_write_end() requests are in the context of raid5d. It
could maintain local counter and apply them all in a single update
before it sleeps.
It would be a little messy, but not too horrible I think.
>
>> >
>> > Not big problems. But we are actually reusing __bi_remaining, I'm wondering why
>> > we not explicitly reuse it. Eg, adds bio_dec_remaining_return() and uses it
>> > like raid5_dec_bi_active_stripes.
>>
>> Because using it exactly the same way that other places use it leads to
>> fewer surprises, now or later.
>> And I think that the effort to rearrange the code so that we could just
>> call bio_endio() brought real improvements in code clarity and
>> simplicity.
>
> Not the same way. The return_bi list and retry list fix are still good. We can
> replace the bio_endio in your patch with something like:
> if (bio_dec_remaining_return() == 0) {
> trace_block_bio_complete()
> md_write_end()
> bio_endio();
> }
> This will give us better control when to end io.
This isn't safe. The bio arriving at raid5_make_request() might already
have been split and could be chained. Then raid5 might never see
bio_dec_remaining_return() return zero.
For example, suppose there is a RAID0 make of some other device, and
this RAID5.
A write request arrives which crosses a chunk boundary.
raid0.c calls bio_split to split off a new bio that will fit in the other
device, leaving the original bio with a larger bi_sector which will get
mapped only into the raid5.
The split bio is chained into the original bio, elevating its
__bi_remaining count.
If the other device is particularly slow, or the RAID5 is particularly
fast, the RAID5 IO might complete before the split bio completes, so
raid5 will only see __bi_remaining go down to one, not zero.
When the split bio finally completes, it's bi_endio is
bio_chain_endio(), and that will call the final bio_endio() on the
original bio. md_write_end() would then never be called.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]
next prev parent reply other threads:[~2016-11-22 2:19 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-21 1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
2016-11-21 1:19 ` [md PATCH 5/5] md/raid5: use bio_inc_remaining() instead of repurposing " NeilBrown
2016-11-21 1:19 ` [md PATCH 4/5] md/raid5: call bio_endio() directly rather than queuing for later NeilBrown
2016-11-21 1:19 ` [md PATCH 2/5] md/raid5: use md_write_start to count stripes, not bios NeilBrown
2016-11-21 1:19 ` [md PATCH 1/5] md: optimize md_write_start() slightly NeilBrown
2016-11-21 1:19 ` [md PATCH 3/5] md/raid5: simplfy delaying of writes while metadata is updated NeilBrown
2016-11-21 2:32 ` [md PATCH 6/5] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
2016-11-21 14:01 ` [md PATCH 0/5] Stop using bi_phys_segments as a counter Christoph Hellwig
2016-11-21 23:43 ` Shaohua Li
2016-11-22 0:25 ` NeilBrown
2016-11-22 1:02 ` Shaohua Li
2016-11-22 2:19 ` NeilBrown [this message]
2016-11-22 8:01 ` Shaohua Li
2016-11-23 2:08 ` NeilBrown
2016-11-23 8:45 ` Christoph Hellwig
2016-11-24 0:31 ` NeilBrown
2017-02-06 8:56 ` Christoph Hellwig
2017-02-06 21:41 ` Shaohua Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87k2bwcl44.fsf@notabene.neil.brown.name \
--to=neilb@suse.com \
--cc=hch@lst.de \
--cc=khlebnikov@yandex-team.ru \
--cc=linux-raid@vger.kernel.org \
--cc=shli@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).