From: NeilBrown <neilb@suse.com>
To: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>, linux-raid@vger.kernel.org
Subject: Re: [RFC] using mempools for raid5-cache
Date: Fri, 11 Dec 2015 12:56:04 +1100
Message-ID: <87k2old9zf.fsf@notabene.neil.brown.name>
In-Reply-To: <20151211011049.GA3810341@devbig084.prn1.facebook.com>
On Fri, Dec 11 2015, Shaohua Li wrote:
> On Fri, Dec 11, 2015 at 11:09:58AM +1100, NeilBrown wrote:
>> On Fri, Dec 11 2015, Shaohua Li wrote:
>>
>> > On Wed, Dec 09, 2015 at 05:34:45PM +1100, NeilBrown wrote:
>> >> On Wed, Dec 09 2015, Shaohua Li wrote:
>> >>
>> >> > On Wed, Dec 09, 2015 at 11:36:30AM +1100, NeilBrown wrote:
>> >> >> On Thu, Dec 03 2015, Christoph Hellwig wrote:
>> >> >>
>> >> >> > Currently the raid5-cache code is heavily relying on GFP_NOFAIL allocations.
>> >> >> >
>> >> >> > I've looked into replacing these with mempools and biosets, and for the
>> >> >> > bio and the meta_page that's pretty trivial as they have short lifetimes
>> >> >> > and do make guaranteed progress. I'm massively struggling with the iounit
>> >> >> > allocation, though. These can live on for a long time over log I/O, cache
>> >> >> > flushing and last but not least RAID I/O, and every attempt at something
>> >> >> > mempool-like results in reproducible deadlocks. I wonder if we need to
>> >> >> > figure out some more efficient data structure to communicate the completion
>> >> >> > status that doesn't rely on these fairly long-lived allocations from
>> >> >> > the I/O path.
>> >> >>
>> >> >> Presumably the root cause of these deadlocks is that the raid5d thread
>> >> >> has called
>> >> >> handle_stripe -> ops_run_io -> r5l_write_stripe -> r5l_log_stripe
>> >> >> -> r5l_get_meta -> r5l_new_meta
>> >> >>
>> >> >> and r5l_new_meta is blocked on memory allocation, which won't complete
>> >> >> until some raid5 stripes get written out, which requires raid5d to do
>> >> >> something more useful than sitting and waiting.
>> >> >>
>> >> >> I suspect a good direction towards a solution would be to allow the
>> >> >> memory allocation to fail, to cleanly propagate that failure indication
>> >> >> up through r5l_log_stripe to r5l_write_stripe which falls back to adding
>> >> >> the stripe_head to ->no_space_stripes.
>> >> >>
>> >> >> Then we only release stripes from no_space_stripes when a memory
>> >> >> allocation might succeed.
>> >> >>
>> >> >> There are lots of missing details, and possibly we would need a separate
>> >> >> list rather than re-using no_space_stripes.
>> >> >> But the key idea is that raid5d should never block (except beneath
>> >> >> submit_bio on some other device) and when it cannot make progress
>> >> >> without blocking, it should queue the stripe_head for later handling.
>> >> >>
>> >> >> Does that make sense?
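
[ To make that fallback concrete, the shape I have in mind for
  r5l_write_stripe() is roughly the fragment below.  It assumes
  r5l_log_stripe() is changed to return an error instead of using
  __GFP_NOFAIL; the list and lock names are written from memory and are
  only illustrative, this is not a tested patch:

	if (r5l_log_stripe(log, sh, data_pages, parity_pages) < 0) {
		/* allocation failed: park the stripe instead of blocking */
		spin_lock(&log->no_space_stripes_lock);
		list_add_tail(&sh->log_list, &log->no_space_stripes);
		spin_unlock(&log->no_space_stripes_lock);
		return 0;	/* raid5d keeps going; retry once memory is freed */
	}
]
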
>> >> >
>> >> > It does remove the scary __GFP_NOFAIL, but the approach is essentially
>> >> > identical to a 'retry after allocation failure'. Why not just let the mm
>> >> > (with __GFP_NOFAIL) do the retry then?
>> >> >
>> >>
>> >> Because deadlocks.
>> >>
>> >> If raid5d is waiting for the mm to allocate memory, then it cannot
>> >> retire write requests which could free up memory.
>> >
>> > Ok, I missed the deadlock issue. Looks like raid5d can't sleep waiting
>> > for memory. That means the mempool for the metadata page/bio doesn't
>> > work either, as raid5d might block and never dispatch the previous IO.
>> > Your proposal looks reasonable.
>>
>> raid5d mustn't sleep waiting for something that only raid5d can provide.
>>
>> A bio allocated from a mempool will be returned to that mempool when the
>> IO request completes (if the end_io routine calls bio_put, which it
>> does).
>> So when waiting on a mempool for a bio, you are only waiting for some
>> underlying device to perform IO, you are not waiting for anything that
>> raid5d might do. So that is safe.
>>
>> Similarly the meta_page is freed in the end_io function, so it is safe
>> to wait for that.
>> Note that the important feature of a mempool is not so much that it
>> pre-allocates some memory. The important point is that when that
>> pre-allocated memory is freed it can *only* be used by the owner of the
>> mempool.
>> By contrast, memory allocated with NOFAIL might, when freed, be used by
>> any other caller anywhere in the kernel.
>> So waiting on a mempool means waiting for certain well defined events.
>> Waiting in NOFAIL could end up waiting for almost anything. So it is
>> much harder to reason about deadlocks.
>>
>> The bio and the meta_page contrast with the r5l_io_unit which isn't
>> freed by the first end_io, but remains until raid5d has done some more
>> work. So raid5d cannot wait for a r5l_io_unit to be freed.
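
[ Concretely, the distinction I am drawing looks something like this at
  init time.  The pool names and sizes below are illustrative; they are
  not what Christoph's patches actually use:

	/* a meta_page comes straight back from the end_io path,
	 * so raid5d may safely sleep waiting on this pool */
	log->meta_pool = mempool_create_page_pool(POOL_SIZE, 0);

	/* an io_unit is only freed after raid5d has done further work,
	 * so raid5d must never sleep in mempool_alloc() on this pool */
	log->io_pool = mempool_create_slab_pool(POOL_SIZE, io_unit_cache);
]
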
>
> I'm confused; 2 bios are already enough to avoid the deadlock, sorry. Can
> you take a look at Christoph's patches? I'll work on a patch for r5l_io_unit.
>
I don't understand what you are trying to say.
Yes, a bio pool of 2 bios is (I believe) enough to avoid a deadlock when
allocating a bio. It has no effect on any deadlock involved in allocating
an r5l_io_unit.
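
To spell out what a pool of 2 bios means, roughly (field names are
illustrative, not from any actual patch):

	log->bs = bioset_create(2, 0);
	...
	bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, log->bs);

bio_alloc_bioset() can sleep here, but only until a previously submitted
log bio completes and bio_put() returns it to the pool, so the wait
depends only on the underlying device, never on raid5d itself.
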
I have looked at Christoph's patches. What in particular did you hope I
would see?
NeilBrown
Thread overview: 16+ messages
2015-12-02 16:10 [RFC] using mempools for raid5-cache Christoph Hellwig
2015-12-02 16:10 ` [PATCH 1/2] raid5-cache: use a bio_set Christoph Hellwig
2015-12-03 5:01 ` Shaohua Li
2015-12-08 23:22 ` NeilBrown
2015-12-14 21:12 ` Christoph Hellwig
2015-12-02 16:10 ` [PATCH 2/2] raid5-cache: use a mempool for the metadata block Christoph Hellwig
2015-12-08 23:27 ` NeilBrown
2015-12-03 4:49 ` [RFC] using mempools for raid5-cache Shaohua Li
2015-12-09 0:36 ` NeilBrown
2015-12-09 1:28 ` Shaohua Li
2015-12-09 6:34 ` NeilBrown
2015-12-10 23:40 ` Shaohua Li
2015-12-11 0:09 ` NeilBrown
2015-12-11 1:10 ` Shaohua Li
2015-12-11 1:56 ` NeilBrown [this message]
2015-12-09 13:51 ` Wols Lists