From: Feng Tang <feng.tang@intel.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	dave.hansen@intel.com, ying.huang@intel.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
Date: Thu, 5 Nov 2020 09:40:28 +0800
Message-ID: <20201105014028.GA86777@shbuild999.sh.intel.com>
In-Reply-To: <20201104085343.GA18718@dhcp22.suse.cz>

On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
 
> > > As I've said in reply to your second patch, I think we can make the oom
> > > killer behavior more sensible in these misconfigured cases, but I do not
> > > think we want to break the cpuset isolation for such a configuration.
> > 
> > Do you mean we skip the killing and just let the allocation fail? We've
> > checked the oom killer code first: when the oom happens, both the DRAM
> > node and the movable PMEM node have lots of free memory, and killing
> > processes won't improve the situation.
> 
> We already skip the oom killer and fail lowmem allocation requests.
> This is similar in some sense. Another option would be to kill the
> allocating context, which would potentially have fewer corner cases,
> because some allocation failures might be unexpected.

Yes, this can avoid the helpless oom killing of an innocent process
(which is under no memory pressure at all).
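
For reference, I think the lowmem special case you mention is this one
in __alloc_pages_may_oom() of mm/page_alloc.c (quoting v5.9 from
memory, so the exact lines may differ):

	/* The OOM killer will not help higher order allocs */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		goto out;

	/* The OOM killer does not needlessly kill tasks for lowmem */
	if (ac->highest_zoneidx < ZONE_NORMAL)
		goto out;

A similar bail-out for "the bound nodes have no eligible zone for this
request" would fail the allocation instead of killing an innocent task.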

And I think the important thing is to judge whether this usage (binding
a docker-like workload to a movable-zone-only node) is a valid case :)

Initially, I thought it invalid too, but later came to think it still
makes some sense for these 2 cases (a minimal sketch of such a binding
follows the list):
    * a user wants to bind his workload to one node (for most of its
      user space memory) to avoid cross-node traffic, and that node
      happens to be configured as movable
    * one small DRAM node plus one big PMEM node, where a memory-latency
      insensitive workload could be bound to the cheaper movable PMEM
      node
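
For illustration, here is a minimal sketch of a workload binding itself
to one node with a mempolicy; node 1 is just an assumed stand-in for
the movable PMEM node:

	#include <numaif.h>	/* set_mempolicy(), MPOL_BIND; link with -lnuma */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		/* Bind all future allocations of this process to node 1,
		 * assumed here to be the movable (PMEM) node. */
		unsigned long nodemask = 1UL << 1;

		if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
			perror("set_mempolicy");

		/* User space memory now comes from node 1 only; kernel
		 * allocations made on our behalf may still go elsewhere,
		 * which is the spilling discussed below. */
		char *buf = malloc(64UL << 20);
		if (buf)
			memset(buf, 0, 64UL << 20);
		free(buf);
		return 0;
	}

(docker itself sets this up via cpuset.mems, but the movable-zone
problem is the same either way.)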
 
> > (Following is copied from your comments on 2/2)
> > > This allows memory allocations to spill over to any other node which
> > > has Normal (or other lower) zones, and as such it breaks cpuset
> > > isolation. As I've pointed out in the reply to your cover letter, it
> > > seems that this is more of a misconfiguration than a bug.
> > 
> > For the usage case (a running docker container), the spilling is
> > already happening. I traced its memory allocation requests: many of
> > them are unmovable (kernel) ones, which got fallback to the normal
> > node naturally with current
> 
> Could you be more specific? This sounds like a bug. Allocations
> shouldn't spill over to a node which is not in the cpuset. There are a
> few exceptions, like IRQ context, but that shouldn't happen regularly.

I mean that when docker starts, it spawns many processes which obey the
memory binding they were given, and these make some kernel page
allocation requests which get successfully allocated outside the
binding, as in the following call stack:

	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
	[  567.044958] Call Trace:
	[  567.044972]  dump_stack+0x74/0x9a
	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
	[  567.044986]  alloc_pages_current+0x87/0xe0
	[  567.044991]  allocate_slab+0x2e5/0x4f0
	[  567.044996]  ___slab_alloc+0x380/0x5d0
	[  567.045021]  __slab_alloc+0x20/0x40
	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
	[  567.045041]  alloc_inode+0x22/0xa0
	[  567.045045]  new_inode_pseudo+0x12/0x60
	[  567.045049]  new_inode+0x17/0x30
	[  567.045052]  mqueue_get_inode+0x45/0x3b0
	[  567.045060]  mqueue_fill_super+0x41/0x70
	[  567.045067]  vfs_get_super+0x7f/0x100
	[  567.045074]  get_tree_keyed+0x1d/0x20
	[  567.045080]  mqueue_get_tree+0x1c/0x20
	[  567.045086]  vfs_get_tree+0x2a/0xc0
	[  567.045092]  fc_mount+0x13/0x50
	[  567.045099]  mq_create_mount+0x92/0xe0
	[  567.045102]  mq_init_ns+0x3b/0x50
	[  567.045106]  copy_ipcs+0x10a/0x1b0
	[  567.045113]  create_new_namespaces+0xa6/0x2b0
	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
	[  567.045124]  ksys_unshare+0x19f/0x360
	[  567.045129]  __x64_sys_unshare+0x12/0x20
	[  567.045135]  do_syscall_64+0x38/0x90
	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

For these, __alloc_pages_nodemask() will first try the process's target
nodemask (the movable node here); as that has no available zone for an
unmovable request, it then goes with a NULL nodemask and gets a page in
the slowpath.
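
The fallback I mean is this reset in __alloc_pages_slowpath() (again
quoting v5.9 mm/page_alloc.c from memory, so the exact condition may
differ):

	/*
	 * Reset the nodemask and zonelist iterators if memory policies can be
	 * ignored. These allocations are high priority and system rather than
	 * user oriented.
	 */
	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
		ac->nodemask = NULL;
		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
					ac->highest_zoneidx, ac->nodemask);
	}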

And the same path is taken for user space allocations as well, but
those get blocked by the cpuset node binding check.
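
That check is the one at the top of the zone loop in
get_page_from_freelist() (v5.9, also from memory):

	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
					ac->nodemask) {
		...
		if (cpusets_enabled() &&
			(alloc_flags & ALLOC_CPUSET) &&
			!__cpuset_zone_allowed(zone, gfp_mask))
				continue;

__cpuset_zone_allowed() only hard-fails __GFP_HARDWALL (user space)
requests; plain GFP_KERNEL ones may still be satisfied from the nearest
hardwall ancestor's mems, which matches what I traced above.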

Thanks,
Feng


