linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Han Pingtian <hanpt@linux.vnet.ibm.com>,
	Pekka Enberg <penberg@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Paul Mackerras <paulus@samba.org>,
	Anton Blanchard <anton@samba.org>, Matt Mackall <mpm@selenic.com>,
	Christoph Lameter <cl@linux.com>,
	linuxppc-dev@lists.ozlabs.org,
	Wanpeng Li <liwanp@linux.vnet.ibm.com>, Tejun Heo <tj@kernel.org>
Subject: Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
Date: Tue, 22 Jul 2014 14:43:11 -0700	[thread overview]
Message-ID: <20140722214311.GM4156@linux.vnet.ibm.com> (raw)
In-Reply-To: <alpine.DEB.2.02.1407211809140.9778@chino.kir.corp.google.com>

Hi David,

On 21.07.2014 [18:16:58 -0700], David Rientjes wrote:
> On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:
> 
> > Sorry for bringing up this old thread again, but I had a question for
> > you, David. node_to_mem_node(), which does seem like a useful API,
> > doesn't seem like it can just node_distance() solely, right? Because
> > that just tells us the relative cost (or so I think about it) of using
> > resources from that node. But we also need to know if that node itself
> > has memory, etc. So using the zonelists is required no matter what? And
> > upon memory hotplug (or unplug), the topology can change in a way that
> > affects things, so node online time isn't right either?
> > 
> 
> I think there's two use cases of interest:
> 
>  - allocating from a memoryless node where numa_node_id() is memoryless, 
>    and
> 
>  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> 
> I believe the first should have its own node_zonelist[0], whether it's 
> memoryless or not, that points to a list of zones that start with those 
> with the smallest distance.

Ok, and that would be used for falling back in the appropriate priority?

> I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> should point to the node with present memory that has the smallest
> distance.

And so would this, but with the caveat that we can fail here and don't
go further? Semantically, __GFP_THISNODE then means "as close as
physically possible ignoring run-time memory constraints". I say that
because obviously we might get off-node memory without memoryless nodes,
but that shouldn't be used to satisfy __GPF_THISNODE allocations.

> For sure node_zonelist[0] cannot be NULL since things like 
> first_online_pgdat() would break and it should be unnecessary to do 
> node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
> since the zonelists should already be defined properly.  All nodes, 
> regardless of whether they have memory or not, should probably end up 
> having a struct pglist_data unless there's a reason for another level of 
> indirection.

So I've re-tested Joonsoo's patch 2 and 3 from the series he sent, and
on powerpc now, things look really good. On a KVM instance with the
following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 14274 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

3.16.0-rc6 gives:

        Slab:            1039744 kB
	SReclaimable:      38976 kB
	SUnreclaim:      1000768 kB

Joonsoo's patches give:

        Slab:             366144 kB
	SReclaimable:      36928 kB
	SUnreclaim:       329216 kB

For reference, CONFIG_SLAB gives:

        Slab:             122496 kB
	SReclaimable:      14912 kB
	SUnreclaim:       107584 kB

At Tejun's request [adding him to Cc], I also partially reverted
81c98869faa5 ("kthread: ensure locality of task_struct allocations"): 

	Slab:             428864 kB
	SReclaimable:      44288 kB
	SUnreclaim:       384576 kB

This seems slightly worse, but I think it's because of the same
root-cause that I indicated in my RFC patch 2/2, quoting it here:

"    There is an issue currently where NUMA information is used on powerpc
    (and possibly ia64) before it has been read from the device-tree, which
    leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
    
    NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
    after start_secondary(), similar to ia64, which is invoked via
    smp_init().
    
    Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
    early_initcall()") made init_workqueues() be invoked via
    do_pre_smp_initcalls(), which is obviously before the secondary
    processors are online.
    ...
    Therefore, when init_workqueues() runs, it sees all CPUs as being on
    Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
    a high number of slab deactivations
    (http://www.spinics.net/lists/linux-mm/msg67489.html)."

Christoph/Tejun, do you see the issue I'm referring to? Is my analysis
correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we
have to be especially careful that users of cpu_to_{node,mem} and
related APIs run *after* correct values are stored for all used CPUs?

In any case, with Joonsoo's patches, we shouldn't see slab deactivations
*if* the NUMA topology information is stored correctly. The full
changelog and patch is at http://patchwork.ozlabs.org/patch/371266/.

Adding my patch on top of Joonsoo's and the revert, I get:

	Slab:             411776 kB
	SReclaimable:      40960 kB
	SUnreclaim:       370816 kB

So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so
much that we are close to OOM with small VM/LPAR sizes.

Thoughts?

I would like to push:

1) Joonsoo's patch to add get_numa_mem, renamed to node_to_mem_node(),
which is caching the result of local_memory_node() for each node.

2) Joonsoo's patch to use node_to_mem_node in __slab_alloc() and
get_partial() when memoryless nodes are encountered.

3) Partial revert of 81c98869faa5 ("kthread: ensure locality of
task_struct allocations") to remove a reference to cpu_to_mem() from the
kthread code. After this, the only references to cpu_to_mem() are in
headers, mm/slab.c, and kernel/profile.c (the last of which is because
of the use of alloc_pages_exact_node(), it seems).

4) Re-post of my patch to fix an ordering issue for the per-CPU NUMA
information on powerpc

I understand your concerns, I think, about Joonsoo's patches, but we're
hitting this pretty regularly in the field and it would be nice to have
something workable in the short-term, while I try and follow-up on these
more invasive ideas.

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-07-22 21:43 UTC|newest]

Thread overview: 124+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-01-07  2:21 [PATCH] slub: Don't throw away partial remote slabs if there is no local memory Anton Blanchard
2014-01-07  4:19 ` Wanpeng Li
2014-01-07  4:19 ` Wanpeng Li
2014-01-07  4:19 ` Wanpeng Li
2014-01-07  6:49 ` Andi Kleen
2014-01-08 14:03   ` Anton Blanchard
2014-01-07  7:41 ` Joonsoo Kim
2014-01-07  8:48   ` Wanpeng Li
2014-01-07  8:48   ` Wanpeng Li
2014-01-07  9:10     ` Joonsoo Kim
2014-01-07  9:21       ` Wanpeng Li
2014-01-07  9:21       ` Wanpeng Li
2014-01-07  9:21       ` Wanpeng Li
2014-01-07  9:31         ` Joonsoo Kim
2014-01-07  9:49           ` Wanpeng Li
2014-01-07  9:49           ` Wanpeng Li
2014-01-07  9:49           ` Wanpeng Li
2014-01-07  8:48   ` Wanpeng Li
2014-01-07  9:52   ` Wanpeng Li
2014-01-07  9:52   ` Wanpeng Li
2014-01-07  9:52   ` Wanpeng Li
2014-01-09  0:20     ` Joonsoo Kim
2014-01-20  9:10   ` Wanpeng Li
2014-01-20  9:10   ` Wanpeng Li
2014-01-20  9:10   ` Wanpeng Li
     [not found]   ` <52dce7fe.e5e6420a.5ff6.ffff84a0SMTPIN_ADDED_BROKEN@mx.google.com>
2014-01-20 22:13     ` Christoph Lameter
2014-01-21  2:20       ` Wanpeng Li
2014-01-21  2:20       ` Wanpeng Li
2014-01-21  2:20       ` Wanpeng Li
2014-01-24  3:09       ` Wanpeng Li
2014-01-24  3:14         ` Wanpeng Li
2014-01-24  3:14         ` Wanpeng Li
2014-01-24  3:14         ` Wanpeng Li
     [not found]         ` <52e1da8f.86f7440a.120f.25f3SMTPIN_ADDED_BROKEN@mx.google.com>
2014-01-24 15:50           ` Christoph Lameter
2014-01-24 21:03             ` David Rientjes
2014-01-24 22:19               ` Nishanth Aravamudan
2014-01-24 23:29               ` Nishanth Aravamudan
2014-01-24 23:49                 ` David Rientjes
2014-01-25  0:16                   ` Nishanth Aravamudan
2014-01-25  0:25                     ` David Rientjes
2014-01-25  1:10                       ` Nishanth Aravamudan
2014-01-27  5:58                         ` Joonsoo Kim
2014-01-28 18:29                           ` Nishanth Aravamudan
2014-01-29 15:54                             ` Christoph Lameter
2014-01-29 22:36                             ` Nishanth Aravamudan
2014-01-30 16:26                               ` Christoph Lameter
2014-02-03 23:00                             ` Nishanth Aravamudan
2014-02-04  3:38                               ` Christoph Lameter
2014-02-04  7:26                                 ` Nishanth Aravamudan
2014-02-04 20:39                                   ` Christoph Lameter
2014-02-05  0:13                                     ` Nishanth Aravamudan
2014-02-05 19:28                                       ` Christoph Lameter
2014-02-06  2:08                                         ` Nishanth Aravamudan
2014-02-06 17:25                                           ` Christoph Lameter
2014-01-27 16:18                         ` Christoph Lameter
2014-02-06  2:07                       ` Nishanth Aravamudan
2014-02-06  8:04                         ` Joonsoo Kim
     [not found]                           ` <20140206185955.GA7845@linux.vnet.ibm.com>
2014-02-06 19:28                             ` Nishanth Aravamudan
2014-02-07  8:03                               ` Joonsoo Kim
2014-02-06  8:07                         ` [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id() Joonsoo Kim
2014-02-06  8:07                           ` [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node Joonsoo Kim
2014-02-06  8:52                             ` David Rientjes
2014-02-06 10:29                               ` Joonsoo Kim
2014-02-06 19:11                                 ` Nishanth Aravamudan
2014-02-07  5:42                                   ` Joonsoo Kim
2014-02-06 20:52                                 ` David Rientjes
2014-02-07  5:48                                   ` Joonsoo Kim
2014-02-07 17:53                                     ` Christoph Lameter
2014-02-07 18:51                                       ` Christoph Lameter
2014-02-07 21:38                                         ` Nishanth Aravamudan
2014-02-10  1:15                                           ` Joonsoo Kim
2014-02-10  1:29                                         ` Joonsoo Kim
2014-02-11 18:45                                           ` Christoph Lameter
2014-02-10 19:13                                         ` Nishanth Aravamudan
2014-02-11  7:42                                           ` Joonsoo Kim
2014-02-12 22:16                                             ` Christoph Lameter
2014-02-13  3:53                                               ` Nishanth Aravamudan
2014-02-17  6:52                                               ` Joonsoo Kim
2014-02-18 16:38                                                 ` Christoph Lameter
2014-02-19 22:04                                                   ` David Rientjes
2014-02-20 16:02                                                     ` Christoph Lameter
2014-02-24  5:08                                                   ` Joonsoo Kim
2014-02-24 19:54                                                     ` Christoph Lameter
2014-03-13 16:51                                                       ` Nishanth Aravamudan
2014-02-18 17:22                                               ` Nishanth Aravamudan
2014-02-13  6:51                                             ` Nishanth Aravamudan
2014-02-17  7:00                                               ` Joonsoo Kim
2014-02-18 16:57                                                 ` Christoph Lameter
2014-02-18 17:28                                                   ` Nishanth Aravamudan
2014-02-18 19:58                                                     ` Christoph Lameter
2014-02-18 21:09                                                       ` Nishanth Aravamudan
2014-02-18 21:49                                                         ` Christoph Lameter
2014-02-18 22:22                                                           ` Nishanth Aravamudan
2014-02-19 16:11                                                             ` Christoph Lameter
2014-02-19 22:03                                                       ` David Rientjes
2014-02-08  9:57                                     ` David Rientjes
2014-02-10  1:09                                       ` Joonsoo Kim
2014-07-22  1:03                                         ` Nishanth Aravamudan
2014-07-22  1:16                                           ` David Rientjes
2014-07-22 21:43                                             ` Nishanth Aravamudan [this message]
2014-07-22 21:49                                               ` Tejun Heo
2014-07-22 23:47                                               ` Nishanth Aravamudan
2014-07-23  0:43                                               ` David Rientjes
2014-02-06  8:07                           ` [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node Joonsoo Kim
2014-02-06 17:30                             ` Christoph Lameter
2014-02-07  5:41                               ` Joonsoo Kim
2014-02-07 17:49                                 ` Christoph Lameter
2014-02-10  1:22                                   ` Joonsoo Kim
2014-02-06  8:37                           ` [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id() David Rientjes
2014-02-06 17:31                             ` Christoph Lameter
2014-02-06 17:26                           ` Christoph Lameter
2014-05-16 23:37                           ` Nishanth Aravamudan
2014-05-19  2:41                             ` Joonsoo Kim
2014-06-05  0:13                           ` [RESEND PATCH] " David Rientjes
2014-01-27 16:24                     ` [PATCH] slub: Don't throw away partial remote slabs if there is no local memory Christoph Lameter
2014-01-27 16:16                   ` Christoph Lameter
2014-01-24  3:09       ` Wanpeng Li
2014-01-24  3:09       ` Wanpeng Li
2014-01-07  9:42 ` David Laight
2014-01-08 14:14   ` Anton Blanchard
2014-01-07 10:28 ` Wanpeng Li
2014-01-07 10:28 ` Wanpeng Li
2014-01-07 10:28 ` Wanpeng Li
     [not found] ` <20140107041939.GA20916@hacker.(null)>
2014-01-08 14:17   ` Anton Blanchard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140722214311.GM4156@linux.vnet.ibm.com \
    --to=nacc@linux.vnet.ibm.com \
    --cc=anton@samba.org \
    --cc=cl@linux.com \
    --cc=hanpt@linux.vnet.ibm.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=liwanp@linux.vnet.ibm.com \
    --cc=mpm@selenic.com \
    --cc=paulus@samba.org \
    --cc=penberg@kernel.org \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).