linux-raid.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.com>
To: Coly Li <colyli@suse.de>, Shaohua Li <shli@kernel.org>
Cc: NeilBrown <neilb@suse.de>,
	linux-raid@vger.kernel.org, Shaohua Li <shli@fb.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Guoqing Jiang <gqjiang@suse.com>
Subject: Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
Date: Tue, 21 Feb 2017 11:29:59 +1100	[thread overview]
Message-ID: <878tp0s7w8.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <7178ADEA-6263-4FB0-95A5-96E5F71A9740@suse.de>


On Mon, Feb 20 2017, Coly Li wrote:

> Sent from my iPhone
>> On 20 Feb 2017, at 3:04 PM, Shaohua Li <shli@kernel.org> wrote:
>> 
>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>> 
>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>> 
>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>>>>> I know you are going to change this as Shaohua wants the splitting to
>>>>>> happen in a separate function, which I agree with, but there is 
>>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>>> request can wait for another request to the same device it can
>>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>>> raise_barrier() between one request and the next, then the next has
>>>>>> to wait for the resync request, which has to wait for the first
>>>>>> request. As the first request will be stuck in the queue in 
>>>>>> generic_make_request(), you get a deadlock.
>>>>> 
>>>>> For md raid1, can I understand the queue in generic_make_request() as
>>>>> bio_list_on_stack in this function? And can I understand the queue in
>>>>> the underlying device as data structures like plug->pending and
>>>>> conf->pending_bio_list?
>>>> 
>>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>>> is the only queue I am talking about.  I'm not referring to
>>>> plug->pending or conf->pending_bio_list at all.
>>>> 
>>>>> 
>>>>> I still don't get the point of the deadlock; let me try to explain why I
>>>>> don't see a possible deadlock. If a bio is split, the first part is
>>>>> processed by make_request_fn(), and then a resync comes and raises a
>>>>> barrier, there are 3 possible conditions:
>>>>> - the resync I/O tries to raise a barrier on the same bucket as the first
>>>>> regular bio. Then the resync task has to wait until the first bio drops
>>>>> its conf->nr_pending[idx]
>>>> 
>>>> Not quite.
>>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>>> to be zero.  We can assume this happens immediately.
>>>> Then the resync task will increment ->barrier[idx].
>>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>>> The processing of that first bio will have submitted bios to the
>>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>>> will not be processed until raid1_make_request() completes.
>>>> 
>>>> The loop in raid1_make_request() will then call make_request_fn() which
>>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>>> zero.
>>> 
>>> Thinking more carefully about this.. the 'idx' that the second bio will
>>> wait for will normally be different, so there won't be a deadlock after
>>> all.
>>> 
>>> However it is possible for hash_long() to produce the same idx for two
>>> consecutive barrier units, so there is still the possibility of a
>>> deadlock, though it isn't as likely as I thought at first.
>> 
>> I folded the fix for the function pointer issue Neil pointed out into Coly's
>> original patch, and also fixed a 'use-after-free' bug. For the deadlock issue,
>> I'll add the patch below; please check.
>> 
>> Thanks,
>> Shaohua
>> 
>
> Hmm, please hold, I am still thinking about it. With barrier buckets and
> hash_long(), I don't see a deadlock yet. For raid10 it might happen,
> but once we have barrier buckets on it, there will be no deadlock.
>
> My question is, this deadlock only happens when a big bio is split,
> the split small bios are contiguous, and the resync I/O visits the
> barrier buckets in sequential order too. If adjacent split regular
> bios or resync bios hit the same barrier bucket, it would be a very
> big failure of the hash design, and should have been found already.
> But no one has complained about it, so I can't convince myself the
> deadlock is real with I/O barrier buckets (this is what Neil is
> concerned about).

I think you are wrong about the design goal of a hash function.
When fed a sequence of inputs, with any stride (i.e. with any constant
difference between consecutive inputs), the output of the hash function
should appear to be random.
A random sequence can produce the same number twice in a row.
If the hash function produces a number from 0 to N-1, you would expect
two consecutive outputs to be the same about once every N inputs.
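
To put a number on that, here is a quick userspace sketch (my own
illustration; it uses a generic 64-bit mixing function rather than the
kernel's hash_long(), and BUCKET_BITS is an assumed value) which hashes
consecutive barrier-unit numbers and counts how often two adjacent units
land in the same bucket:

#include <stdio.h>
#include <stdint.h>

#define BUCKET_BITS 10
#define NBUCKETS (1u << BUCKET_BITS)

/* splitmix64 finalizer: good avalanche, so consecutive inputs give
 * random-looking outputs */
static unsigned int bucket(uint64_t x)
{
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
        x ^= x >> 27; x *= 0x94D049BB133111EBULL;
        x ^= x >> 31;
        return (unsigned int)(x & (NBUCKETS - 1));
}

int main(void)
{
        unsigned long inputs = 1UL << 22, same = 0;
        uint64_t unit;

        /* consecutive barrier-unit numbers, as a large split bio or a
         * sequential resync would generate them */
        for (unit = 1; unit < inputs; unit++)
                if (bucket(unit) == bucket(unit - 1))
                        same++;

        printf("%lu adjacent collisions in %lu inputs (expect about %lu)\n",
               same, inputs, inputs / NBUCKETS);
        return 0;
}

With 1024 buckets it reports roughly one adjacent collision per 1024
inputs, which is all it takes for two neighbouring pieces of a split bio
to end up in the same bucket.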

Even if there was no possibility of a deadlock from a resync request
happening between two bios, there are other possibilities.

It is not, in general, safe to call mempool_alloc() twice in a row,
without first ensuring that the first allocation will get freed by some
other thread.  raid1_write_request() allocates from r1bio_pool, and then
submits bios to the underlying device, which get queued on
bio_list_on_stack.  They will not be processed until after
raid1_make_request() completes, so when raid1_make_request loops around
and calls raid1_write_request() again, it will try to allocate another
r1bio from r1bio_pool, and this might end up waiting for the r1bio which
is trapped and cannot complete.

As r1bio_pool preallocates 256 entries, this is unlikely but not
impossible.  If 256 threads all attempt a write (or read) that crosses a
boundary, then they will consume all 256 preallocated entries, and want
more. If there is no free memory, they will block indefinitely.
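
Just to illustrate the shape of the problem, here is a made-up userspace
model (not raid1 code: a counting semaphore stands in for r1bio_pool, and
the pool size and thread count are arbitrary).  Every "writer" takes one
pool entry and then asks for a second before the first can possibly be
returned, which is the same pattern as raid1_write_request() allocating
again while the first r1bio is trapped on bio_list_on_stack:

/* Toy model only -- compile with: cc -pthread pool-deadlock.c */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define POOL_SIZE 4        /* stands in for the 256 preallocated r1bios */
#define NWRITERS  4        /* enough writers to exhaust the pool */

static sem_t pool;                     /* models mempool_alloc() blocking */
static pthread_barrier_t all_hold_one; /* forces the worst-case interleaving */

static void *writer(void *arg)
{
        (void)arg;
        sem_wait(&pool);  /* entry for the first part of the split bio; it
                           * cannot be returned until this function exits */
        pthread_barrier_wait(&all_hold_one);  /* every writer now holds one */
        sem_wait(&pool);  /* entry for the next part: the pool is empty and
                           * the entries that could refill it are held above */
        sem_post(&pool);
        sem_post(&pool);
        return NULL;
}

int main(void)
{
        pthread_t tid[NWRITERS];
        int i;

        sem_init(&pool, 0, POOL_SIZE);
        pthread_barrier_init(&all_hold_one, NULL, NWRITERS);
        for (i = 0; i < NWRITERS; i++)
                pthread_create(&tid[i], NULL, writer, NULL);
        for (i = 0; i < NWRITERS; i++)
                pthread_join(&tid[i], NULL);   /* never returns: deadlock */
        puts("not reached");
        return 0;
}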

bio_alloc_bioset() has punt_bios_to_rescuer() to attempt to work around
a deadlock very similar to this, but it is a very specific solution, it
wouldn't help raid1, and it is much more complex than just rearranging
the code.
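
One way the code could be rearranged (this is only a sketch of the idea,
not a patch; fs_bio_set is used purely for illustration) is to handle
just the first piece in the current call and resubmit the remainder as
an independent bio, so that each pass through ->make_request allocates
at most one r1bio and holds at most one barrier bucket:

        if (sectors < bio_sectors(bio)) {
                struct bio *split = bio_split(bio, sectors,
                                              GFP_NOIO, fs_bio_set);
                bio_chain(split, bio);
                /* the remainder goes back through generic_make_request()
                 * as a request in its own right ... */
                generic_make_request(bio);
                /* ... and this call carries on with only the first piece */
                bio = split;
        }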


>
> For the function pointer assignment, it is because I see a branch
> inside a loop. If I use a function pointer, I can avoid a redundant
> branch inside the loop. raid1_read_request() and raid1_write_request()
> are not simple functions; I don't know whether gcc will inline them or
> not, so I am about to check the disassembled code.

It is a long time since I studied how CPUs handle different sorts of
machine code, but I'm fairly sure that an indirect branch (i.e. a
branch through a function pointer) is harder to optimize than a
conditional branch.

But I think the readability of the code is more important, and having an
if-then-else in a loop is more familiar to most readers than using a
function pointer like this.
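
To be concrete, the comparison is roughly between these two shapes (a
sketch only; the real argument lists of raid1_read_request() and
raid1_write_request() may differ):

        /* conditional branch in the loop: obvious at the call site */
        if (bio_data_dir(split) == READ)
                raid1_read_request(mddev, split);
        else
                raid1_write_request(mddev, split);

        /* indirect branch through a function pointer: one branch "saved",
         * but the reader has to chase the pointer to see what is called */
        void (*make_request_fn)(struct mddev *mddev, struct bio *bio) =
                bio_data_dir(bio) == READ ?
                        raid1_read_request : raid1_write_request;
        /* ... later, inside the loop ... */
        make_request_fn(mddev, split);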

>
> The loop in raid1_make_request() is quite high level; I am not sure
> whether CPU branch prediction will work correctly, especially when it
> is a big DISCARD bio. Using a function pointer may drop a possible
> branch.
>
> So I need to check what we gain and lose when using a function pointer
> or not. If it is not urgent, please hold this patch for a while.
>
> The only thing I worry about in the patch below is: if a very big
> DISCARD bio comes, will the kernel stack tend to overflow?

How would a large request cause extra stack space to be used?

Thanks,
NeilBrown



Thread overview: 43+ messages
2017-02-15 16:35 [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window colyli
2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
2017-02-15 17:15   ` Coly Li
2017-02-16  2:25   ` Shaohua Li
2017-02-17 18:42     ` Coly Li
2017-02-16  7:04   ` NeilBrown
2017-02-17  7:56     ` Coly Li
2017-02-17 18:35       ` Coly Li
2017-02-16  2:22 ` [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
2017-02-16 17:05   ` Coly Li
2017-02-17 12:40     ` Coly Li
2017-02-16  7:04 ` NeilBrown
2017-02-17  6:56   ` Coly Li
2017-02-19 23:50     ` NeilBrown
2017-02-20  2:51       ` NeilBrown
2017-02-20  7:04         ` Shaohua Li
2017-02-20  8:07           ` Coly Li
2017-02-20  8:30             ` Coly Li
2017-02-20 18:14             ` Wols Lists
2017-02-21 11:30               ` Coly Li
2017-02-21 19:20                 ` Wols Lists
2017-02-21 20:16                   ` Coly Li
2017-02-21  0:29             ` NeilBrown [this message]
2017-02-21  9:45               ` Coly Li
2017-02-21 17:45                 ` Shaohua Li
2017-02-21 20:09                   ` Coly Li
2017-02-23  5:54                     ` Coly Li
2017-02-23 17:34                       ` Shaohua Li
2017-02-23 19:31                         ` Coly Li
2017-02-23 19:58                           ` Shaohua Li
2017-02-24 17:02                             ` Coly Li
2017-02-24 10:19                           ` 王金浦
2017-02-28 19:42                             ` Shaohua Li
2017-03-01 17:01                               ` 王金浦
2017-02-23 23:14                       ` NeilBrown
2017-02-24 17:06                         ` Coly Li
2017-02-24 17:17                           ` Shaohua Li
2017-02-24 18:57                             ` Coly Li
2017-02-24 19:02                               ` Shaohua Li
2017-02-24 19:19                                 ` Coly Li
2017-02-17 19:41   ` Shaohua Li
2017-02-18  2:40     ` Coly Li
2017-02-19 23:42     ` NeilBrown
