From: Johannes Weiner <hannes@cmpxchg.org>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
Date: Tue, 3 Dec 2013 22:01:01 -0500 [thread overview]
Message-ID: <20131204030101.GV3556@cmpxchg.org> (raw)
In-Reply-To: <alpine.DEB.2.02.1312031531510.5946@chino.kir.corp.google.com>
On Tue, Dec 03, 2013 at 03:40:13PM -0800, David Rientjes wrote:
> On Tue, 3 Dec 2013, Johannes Weiner wrote:
>
> > > > Spin on which level? The whole point of this change was to not spin for
> > > > ever because the caller might sit on top of other locks which might
> > > > prevent somebody else to die although it has been killed.
> > >
> > > See my question about the non-memcg page allocator behavior below.
> >
> > No, please answer the question.
> >
>
> The question would be answered below, by having consistency in allocation
> and charging paths between both the page allocator and memcg.
>
> > > I'm not quite sure how significant of a point this is, though, because it
> > > depends on the caller doing the __GFP_NOFAIL allocations that allow the
> > > bypass. If you're doing
> > >
> > > for (i = 0; i < 1 << 20; i++)
> > > page[i] = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> >
> > Hyperbole serves no one.
> >
>
> Since this bypasses all charges to the root memcg in oom conditions as a
> result of your patch, how do you ensure the "leakage" is contained to a
> small amount of memory? Are we currently just trusting the users of
> __GFP_NOFAIL that they aren't allocating a large amount of memory?
Yes, as answered in my first reply to you:
---
> Ah, this is because of 3168ecbe1c04 ("mm: memcg: use proper memcg in limit
> bypass") which just bypasses all of these allocations and charges the root
> memcg. So if allocations want to bypass memcg isolation they just have to
> be __GFP_NOFAIL?
I don't think we have another option.
---
Is there a specific reason you keep repeating the same questions?
> > > I'm referring to the generic non-memcg page allocator behavior. Forget
> > > memcg for a moment. What is the behavior in the _page_allocator_ for
> > > GFP_NOFS | __GFP_NOFAIL? Do we spin forever if reclaim fails or do we
> > > bypas the per-zone min watermarks to allow it to allocate because "it
> > > needs to succeed, it may be holding filesystem locks"?
> > >
> > > It's already been acknowledged in this thread that no bypassing is done
> > > in the page allocator and it just spins. There's some handwaving saying
> > > that since the entire system is oom that there is a greater chance that
> > > memory will be freed by something else, but that's just handwaving and is
> > > certainly no guaranteed.
> >
> > Do you have another explanation of why this deadlock is not triggering
> > in the global case? It's pretty obvious that there is a deadlock that
> > can not be resolved unless some unrelated task intervenes, just read
> > __alloc_pages_slowpath().
> >
> > But we had a concrete bug report for memcg where there was no other
> > task to intervene. One was stuck in the OOM killer waiting for the
> > victim to exit, the victim was stuck on locks that the killer held.
> >
>
> I believe the page allocator would be susceptible to the same deadlock if
> nothing else on the system can reclaim memory and that belief comes from
> code inspection that shows __GFP_NOFAIL is not guaranteed to ever succeed
> in the page allocator as their charges now are (with your patch) in memcg.
> I do not have an example of such an incident.
Me neither.
> > > So, my question again: why not bypass the per-zone min watermarks in the
> > > page allocator?
> >
> > I don't even know what your argument is supposed to be. The fact that
> > we don't do it in the page allocator means that there can't be a bug
> > in memcg?
> >
>
> I'm asking if we should allow GFP_NOFS | __GFP_NOFAIL allocations in the
> page allocator to bypass per-zone min watermarks after reclaim has failed
> since the oom killer cannot be called in such a context so that the page
> allocator is not susceptible to the same deadlock without a complete
> depletion of memory reserves?
Yes, I think so.
> It's not an argument, it's a question. Relax.
Right.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Johannes Weiner <hannes@cmpxchg.org>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
Date: Tue, 3 Dec 2013 22:01:01 -0500 [thread overview]
Message-ID: <20131204030101.GV3556@cmpxchg.org> (raw)
In-Reply-To: <alpine.DEB.2.02.1312031531510.5946@chino.kir.corp.google.com>
On Tue, Dec 03, 2013 at 03:40:13PM -0800, David Rientjes wrote:
> On Tue, 3 Dec 2013, Johannes Weiner wrote:
>
> > > > Spin on which level? The whole point of this change was to not spin for
> > > > ever because the caller might sit on top of other locks which might
> > > > prevent somebody else to die although it has been killed.
> > >
> > > See my question about the non-memcg page allocator behavior below.
> >
> > No, please answer the question.
> >
>
> The question would be answered below, by having consistency in allocation
> and charging paths between both the page allocator and memcg.
>
> > > I'm not quite sure how significant of a point this is, though, because it
> > > depends on the caller doing the __GFP_NOFAIL allocations that allow the
> > > bypass. If you're doing
> > >
> > > for (i = 0; i < 1 << 20; i++)
> > > page[i] = alloc_page(GFP_NOFS | __GFP_NOFAIL);
> >
> > Hyperbole serves no one.
> >
>
> Since this bypasses all charges to the root memcg in oom conditions as a
> result of your patch, how do you ensure the "leakage" is contained to a
> small amount of memory? Are we currently just trusting the users of
> __GFP_NOFAIL that they aren't allocating a large amount of memory?
Yes, as answered in my first reply to you:
---
> Ah, this is because of 3168ecbe1c04 ("mm: memcg: use proper memcg in limit
> bypass") which just bypasses all of these allocations and charges the root
> memcg. So if allocations want to bypass memcg isolation they just have to
> be __GFP_NOFAIL?
I don't think we have another option.
---
Is there a specific reason you keep repeating the same questions?
> > > I'm referring to the generic non-memcg page allocator behavior. Forget
> > > memcg for a moment. What is the behavior in the _page_allocator_ for
> > > GFP_NOFS | __GFP_NOFAIL? Do we spin forever if reclaim fails or do we
> > > bypas the per-zone min watermarks to allow it to allocate because "it
> > > needs to succeed, it may be holding filesystem locks"?
> > >
> > > It's already been acknowledged in this thread that no bypassing is done
> > > in the page allocator and it just spins. There's some handwaving saying
> > > that since the entire system is oom that there is a greater chance that
> > > memory will be freed by something else, but that's just handwaving and is
> > > certainly no guaranteed.
> >
> > Do you have another explanation of why this deadlock is not triggering
> > in the global case? It's pretty obvious that there is a deadlock that
> > can not be resolved unless some unrelated task intervenes, just read
> > __alloc_pages_slowpath().
> >
> > But we had a concrete bug report for memcg where there was no other
> > task to intervene. One was stuck in the OOM killer waiting for the
> > victim to exit, the victim was stuck on locks that the killer held.
> >
>
> I believe the page allocator would be susceptible to the same deadlock if
> nothing else on the system can reclaim memory and that belief comes from
> code inspection that shows __GFP_NOFAIL is not guaranteed to ever succeed
> in the page allocator as their charges now are (with your patch) in memcg.
> I do not have an example of such an incident.
Me neither.
> > > So, my question again: why not bypass the per-zone min watermarks in the
> > > page allocator?
> >
> > I don't even know what your argument is supposed to be. The fact that
> > we don't do it in the page allocator means that there can't be a bug
> > in memcg?
> >
>
> I'm asking if we should allow GFP_NOFS | __GFP_NOFAIL allocations in the
> page allocator to bypass per-zone min watermarks after reclaim has failed
> since the oom killer cannot be called in such a context so that the page
> allocator is not susceptible to the same deadlock without a complete
> depletion of memory reserves?
Yes, I think so.
> It's not an argument, it's a question. Relax.
Right.
next prev parent reply other threads:[~2013-12-04 3:01 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-22 17:17 [patch] mm: memcg: do not declare OOM from __GFP_NOFAIL allocations Johannes Weiner
2013-11-22 17:17 ` Johannes Weiner
2013-11-22 17:17 ` Johannes Weiner
2013-11-27 1:01 ` David Rientjes
2013-11-27 1:01 ` David Rientjes
2013-11-27 3:33 ` David Rientjes
2013-11-27 3:33 ` David Rientjes
[not found] ` <alpine.DEB.2.02.1311261931210.5973-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2013-11-27 16:39 ` Johannes Weiner
2013-11-27 16:39 ` Johannes Weiner
2013-11-27 16:39 ` Johannes Weiner
[not found] ` <20131127163916.GB3556-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2013-11-27 21:38 ` David Rientjes
2013-11-27 21:38 ` David Rientjes
2013-11-27 21:38 ` David Rientjes
2013-11-27 22:53 ` Johannes Weiner
2013-11-27 22:53 ` Johannes Weiner
2013-11-27 23:34 ` David Rientjes
2013-11-27 23:34 ` David Rientjes
[not found] ` <alpine.DEB.2.02.1311271526080.22848-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2013-11-28 10:20 ` Michal Hocko
2013-11-28 10:20 ` Michal Hocko
2013-11-28 10:20 ` Michal Hocko
[not found] ` <20131128102049.GF2761-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-11-29 23:46 ` David Rientjes
2013-11-29 23:46 ` David Rientjes
2013-11-29 23:46 ` David Rientjes
[not found] ` <alpine.DEB.2.02.1311291543400.22413-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2013-12-02 13:22 ` Michal Hocko
2013-12-02 13:22 ` Michal Hocko
2013-12-02 13:22 ` Michal Hocko
2013-12-02 23:02 ` David Rientjes
2013-12-02 23:02 ` David Rientjes
[not found] ` <alpine.DEB.2.02.1312021452510.13465-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2013-12-03 22:25 ` Johannes Weiner
2013-12-03 22:25 ` Johannes Weiner
2013-12-03 22:25 ` Johannes Weiner
2013-12-03 23:40 ` David Rientjes
2013-12-03 23:40 ` David Rientjes
2013-12-04 3:01 ` Johannes Weiner [this message]
2013-12-04 3:01 ` Johannes Weiner
2013-12-04 4:34 ` Dave Chinner
2013-12-04 4:34 ` Dave Chinner
2013-12-04 5:25 ` Johannes Weiner
2013-12-04 5:25 ` Johannes Weiner
2013-12-04 6:10 ` David Rientjes
2013-12-04 6:10 ` David Rientjes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131204030101.GV3556@cmpxchg.org \
--to=hannes@cmpxchg.org \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
--cc=rientjes@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.