Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations

From: Ilya Dryomov <idryomov@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org,
	Sergey Jerusalimov <wintchester@gmail.com>,
	Jeff Layton <jlayton@redhat.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
Date: Thu, 30 Mar 2017 15:53:35 +0200
Message-ID: <CAOi1vP-MV0dX_vogf=+TTqaB8xzDEq2BZQYBM6Uf1quy_MYEQw@mail.gmail.com>
In-Reply-To: <20170330062500.GB1972@dhcp22.suse.cz>

On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> >> > [...]
>> >> >> > ceph_con_workfn
>> >> >> >   mutex_lock(&con->mutex)  # ceph_connection::mutex
>> >> >> >   try_write
>> >> >> >     ceph_tcp_connect
>> >> >> >       sock_create_kern
>> >> >> >         GFP_KERNEL allocation
>> >> >> >           allocator recurses into XFS, more I/O is issued
>> >> >
>> >> > One more note. So what happens if this is a GFP_NOIO request which
>> >> > cannot make any progress? Your IO thread is blocked on con->mutex
>> >> > as you write below but the above thread cannot proceed as well. So I am
>> >> > _really_ not sure this actually helps.
>> >>
>> >> This is not the only I/O worker.  A ceph cluster typically consists of
>> >> at least a few OSDs and can be as large as thousands of OSDs.  This is
>> >> the reason we are calling sock_create_kern() on the writeback path in
>> >> the first place: pre-opening thousands of sockets isn't feasible.
>> >
>> > Sorry for being dense here but what actually guarantees the forward
>> > progress? My current understanding is that the deadlock is caused by
>> > con->mutex being held while the allocation cannot make forward
>> > progress. I can imagine this would be possible if the other io flushers
>> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
>> > much difference. What am I missing?
>>
>> con->mutex is per-ceph_connection, osdc->request_mutex is global and is
>> the real problem here because we need both on the submit side, at least
>> in 3.18.  You are correct that even with GFP_NOIO this code may lock up
>> in theory, however I think it's very unlikely in practice.
>
> No, it would just make such a bug more obscure. The real problem seems
> to be that you rely on locks which cannot guarantee forward progress
> in the IO path. And that is a bug IMHO.
>
>> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> independent in newer kernels and should be able to free up memory for
>> those blocked on GFP_NOIO retries with their respective con->mutex
>> held.  Using GFP_KERNEL and thus allowing the recursion is just asking
>> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>
> You keep saying this but so far I haven't heard how the AA deadlock is
> possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> of time and that would cause you problems AFAIU.
>
>> I'm a little confused by this discussion because for me this patch was
>> a no-brainer...
>
> No, it is a brainer. Because recursion prevention should be carefully
> thought through. The lack of this approach is the reason we have
> thousands of GFP_NOFS uses all over the kernel without a clear or proper
> justification. Adding more on top doesn't help long term
> maintainability.
>
>> Locking aside, you said it was the stack trace in the changelog that
>> got your attention
>
> No, it is the use of the scope GFP_NOIO API without a proper
> explanation which caught my attention.
>
>> are you saying it's OK for a block
>> device to recurse back into the filesystem when doing I/O, potentially
>> generating more I/O?
>
> No, block device has to make a forward progress guarantee when
> allocating and so use mempools or other means to achieve the same.

OK, let me put this differently.  Do you agree that a block device
cannot make _any_ kind of progress guarantee if it does a GFP_KERNEL
allocation in the I/O path?

Thanks,

                Ilya
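
For readers unfamiliar with the scope GFP_NOIO API discussed above, here
is a minimal userspace model of the idea. It is an illustration only, not
the libceph patch and not kernel code. In the kernel, memalloc_noio_save()
sets a per-task flag and returns the previous state, the page allocator
then masks out the I/O and filesystem bits for every allocation made by
that task, and memalloc_noio_restore() puts the previous state back. All
model_* names and MODEL_* values below are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for kernel GFP bits; the values are made up. */
#define MODEL_GFP_IO     0x1u  /* reclaim may start block I/O */
#define MODEL_GFP_FS     0x2u  /* reclaim may call into the filesystem */
#define MODEL_GFP_NOIO   0x0u
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)

/* Models the per-task NOIO flag; a plain global is enough for this sketch. */
static bool model_task_noio;

/* Enter a NOIO scope and return the previous state so scopes can nest. */
unsigned int model_noio_save(void)
{
	unsigned int old = model_task_noio ? 1u : 0u;

	model_task_noio = true;
	return old;
}

/* Leave a NOIO scope by restoring the state returned by the matching save. */
void model_noio_restore(unsigned int old)
{
	model_task_noio = (old != 0);
}

/* The mask the allocator would actually honour for a requested mask. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_noio)
		return gfp & ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}

/*
 * Stand-in for a callee such as sock_create_kern() that hard-codes
 * GFP_KERNEL internally. It reports the mask its allocation ran with.
 */
unsigned int model_create_socket(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/* Stand-in for the connect path: wrap the callee in a NOIO scope. */
unsigned int model_connect(void)
{
	unsigned int noio_flag = model_noio_save();
	unsigned int used = model_create_socket();

	model_noio_restore(noio_flag);
	return used;
}
```

In this model, model_create_socket() called on its own runs with
MODEL_GFP_KERNEL, while the same call made through model_connect() runs
with MODEL_GFP_NOIO, so the allocation can no longer recurse into the
filesystem or issue more I/O. The sketch says nothing about the forward
progress question raised in the thread: an allocation inside the scope
can still stall while con->mutex is held.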
