* [PATCH] mm: rename and document alloc_pages_exact_node
@ 2015-07-21 13:55 Vlastimil Babka
2015-07-21 14:05 ` Christoph Lameter
From: Vlastimil Babka @ 2015-07-21 13:55 UTC
To: linux-mm
Cc: linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Vlastimil Babka, Mel Gorman, David Rientjes,
Greg Thelen, Aneesh Kumar K.V, Christoph Lameter, Pekka Enberg,
Joonsoo Kim, Naoya Horiguchi, Tony Luck, Fenghua Yu,
Arnd Bergmann, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman, Gleb Natapov, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Cliff Whickman, Robin Holt
The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is valid")
as an optimized variant of alloc_pages_node() that doesn't allow the node id
to be -1. Unfortunately the name of the function can easily suggest that the
allocation is restricted to the given node. In truth, the node is only
preferred, unless __GFP_THISNODE is among the gfp flags.
The misleading name has led to mistakes in the past, see 5265047ac301 ("mm,
thp: really limit transparent hugepage allocation to local node") and
b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
To prevent further mistakes, this patch renames the function to
alloc_pages_prefer_node() and documents it together with alloc_pages_node().
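For illustration (hypothetical call sites, not part of this patch), the same
function either prefers or requires the node, depending only on the gfp flags:

	/* Preference only: may fall back to other nodes under memory pressure. */
	page = alloc_pages_prefer_node(nid, GFP_KERNEL, order);

	/* Hard requirement: returns NULL rather than allocate on another node. */
	page = alloc_pages_prefer_node(nid, GFP_KERNEL | __GFP_THISNODE, order);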
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Cc: Robin Holt <robinmholt@gmail.com>
---
I hope the new name will be OK. Although it doesn't fully convey the
difference from alloc_pages_node(), renaming the latter would be a much
larger patch (some 60 callsites) and hopefully the new comments are enough to
prevent further confusion.
I'm also CC'ing the maintainers of the callsites so they can verify that the
callsites that don't pass __GFP_THISNODE are really not intended to restrict
allocation to the given node. I went through them myself and each looked like
it's better off if it can successfully allocate on a fallback node rather
than fail. I think DavidR checked them too, but it's better if the maintainers
can verify that. I'm not completely sure about all the usages in sl*b due to
the multiple layers through which gfp flags are being passed.
arch/ia64/hp/common/sba_iommu.c | 2 +-
arch/ia64/kernel/uncached.c | 2 +-
arch/ia64/sn/pci/pci_dma.c | 2 +-
arch/powerpc/platforms/cell/ras.c | 2 +-
arch/x86/kvm/vmx.c | 2 +-
drivers/misc/sgi-xp/xpc_uv.c | 2 +-
include/linux/gfp.h | 14 ++++++++++++--
kernel/profile.c | 8 ++++----
mm/filemap.c | 2 +-
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 ++--
mm/memory-failure.c | 2 +-
mm/mempolicy.c | 4 ++--
mm/migrate.c | 4 ++--
mm/page_alloc.c | 2 --
mm/slab.c | 2 +-
mm/slob.c | 4 ++--
mm/slub.c | 2 +-
18 files changed, 35 insertions(+), 27 deletions(-)
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 344387a..17762af 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1146,7 +1146,7 @@ sba_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
if (node == NUMA_NO_NODE)
node = numa_node_id();
- page = alloc_pages_exact_node(node, flags, get_order(size));
+ page = alloc_pages_prefer_node(node, flags, get_order(size));
if (unlikely(!page))
return NULL;
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 20e8a9b..1451961 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -97,7 +97,7 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
/* attempt to allocate a granule's worth of cached memory pages */
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
IA64_GRANULE_SHIFT-PAGE_SHIFT);
if (!page) {
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d0853e8..dcab82f 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -92,7 +92,7 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size,
*/
node = pcibus_to_node(pdev->bus);
if (likely(node >=0)) {
- struct page *p = alloc_pages_exact_node(node,
+ struct page *p = alloc_pages_prefer_node(node,
flags, get_order(size));
if (likely(p))
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index e865d74..646a310 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -123,7 +123,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
area->nid = nid;
area->order = order;
- area->pages = alloc_pages_exact_node(area->nid,
+ area->pages = alloc_pages_prefer_node(area->nid,
GFP_KERNEL|__GFP_THISNODE,
area->order);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2d73807..a8723a8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
struct page *pages;
struct vmcs *vmcs;
- pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
+ pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
if (!pages)
return NULL;
vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 95c8944..3ff0a24 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -239,7 +239,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
mq->mmr_blade = uv_cpu_to_blade_id(cpu);
nid = cpu_to_node(cpu);
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
pg_order);
if (page == NULL) {
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 15928f0..ff892d7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -300,6 +300,10 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}
+/*
+ * Allocate pages, preferring the node given as nid. When nid equals -1,
+ * prefer the current CPU's node.
+ */
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
@@ -310,7 +314,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
-static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+/*
+ * Allocate pages, preferring the node given as nid. Unlike alloc_pages_node(),
+ * nid must be a valid online node, there is no fallback to current CPU's node.
+ *
+ * In order to completely restrict allocation to the preferred node, include
+ * __GFP_THISNODE in the gfp_mask.
+ */
+static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
@@ -354,7 +365,6 @@ extern unsigned long get_zeroed_page(gfp_t gfp_mask);
void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
void free_pages_exact(void *virt, size_t size);
-/* This is different from alloc_pages_exact_node !!! */
void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
#define __get_free_page(gfp_mask) \
diff --git a/kernel/profile.c b/kernel/profile.c
index a7bcd28..6ee695e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -339,7 +339,7 @@ static int profile_cpu_callback(struct notifier_block *info,
node = cpu_to_mem(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -347,7 +347,7 @@ static int profile_cpu_callback(struct notifier_block *info,
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -543,14 +543,14 @@ static int create_hash_tables(void)
int node = cpu_to_mem(cpu);
struct page *page;
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
0);
if (!page)
goto out_cleanup;
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
0);
if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 6bf5e42..52e3b55 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -648,7 +648,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
do {
cpuset_mems_cookie = read_mems_allowed_begin();
n = cpuset_mem_spread_node();
- page = alloc_pages_exact_node(n, gfp, 0);
+ page = alloc_pages_prefer_node(n, gfp, 0);
} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 078832c..7130e97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2350,7 +2350,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
*/
up_read(&mm->mmap_sem);
- *hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER);
+ *hpage = alloc_pages_prefer_node(node, gfp, HPAGE_PMD_ORDER);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 271e443..4c6c0c1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1088,7 +1088,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
@@ -1250,7 +1250,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
else
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 501820c..132f043 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1503,7 +1503,7 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
return alloc_huge_page_node(page_hstate(compound_head(p)),
nid);
else
- return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+ return alloc_pages_prefer_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7477432..2e2a876 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -945,7 +945,7 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
else
- return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE |
+ return alloc_pages_prefer_node(node, GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE, 0);
}
@@ -1986,7 +1986,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(node, *nmask)) {
mpol_cond_put(pol);
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
gfp | __GFP_THISNODE, order);
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index f53838f..b51e106 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1187,7 +1187,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
return alloc_huge_page_node(page_hstate(compound_head(p)),
pm->node);
else
- return alloc_pages_exact_node(pm->node,
+ return alloc_pages_prefer_node(pm->node,
GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}
@@ -1553,7 +1553,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
int nid = (int) data;
struct page *newpage;
- newpage = alloc_pages_exact_node(nid,
+ newpage = alloc_pages_prefer_node(nid,
(GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE | __GFP_NOMEMALLOC |
__GFP_NORETRY | __GFP_NOWARN) &
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..e6eef56 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3062,8 +3062,6 @@ EXPORT_SYMBOL(alloc_pages_exact);
*
* Like alloc_pages_exact(), but try to allocate on node nid first before falling
* back.
- * Note this is not alloc_pages_exact_node() which allocates on a specific node,
- * but is not exact.
*/
void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
{
diff --git a/mm/slab.c b/mm/slab.c
index 7eb38dd..471fde8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1594,7 +1594,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
if (memcg_charge_slab(cachep, flags, cachep->gfporder))
return NULL;
- page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
+ page = alloc_pages_prefer_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
if (!page) {
memcg_uncharge_slab(cachep, cachep->gfporder);
slab_out_of_memory(cachep, flags, nodeid);
diff --git a/mm/slob.c b/mm/slob.c
index 4765f65..cfda5e2 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -45,7 +45,7 @@
* NUMA support in SLOB is fairly simplistic, pushing most of the real
* logic down to the page allocator, and simply doing the node accounting
* on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_exact_node() with the specified node id is used
+ * provided, alloc_pages_prefer_node() with the specified node id is used
* instead. The common case (or when the node id isn't explicitly provided)
* will default to the current node, as per numa_node_id().
*
@@ -193,7 +193,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
#ifdef CONFIG_NUMA
if (node != NUMA_NO_NODE)
- page = alloc_pages_exact_node(node, gfp, order);
+ page = alloc_pages_prefer_node(node, gfp, order);
else
#endif
page = alloc_pages(gfp, order);
diff --git a/mm/slub.c b/mm/slub.c
index 54c0876..8b10404 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1323,7 +1323,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
- page = alloc_pages_exact_node(node, flags, order);
+ page = alloc_pages_prefer_node(node, flags, order);
if (!page)
memcg_uncharge_slab(s, order);
--
2.4.5
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-21 13:55 [PATCH] mm: rename and document alloc_pages_exact_node Vlastimil Babka
@ 2015-07-21 14:05 ` Christoph Lameter
2015-07-21 14:14 ` Robin Holt
From: Christoph Lameter @ 2015-07-21 14:05 UTC
To: Vlastimil Babka
Cc: linux-mm, linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Mel Gorman, David Rientjes, Greg Thelen,
Aneesh Kumar K.V, Pekka Enberg, Joonsoo Kim, Naoya Horiguchi,
Tony Luck, Fenghua Yu, Arnd Bergmann, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, Gleb Natapov, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Cliff Whickman,
Robin Holt
On Tue, 21 Jul 2015, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node() that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
Yup. I complained about this when this was introduced. Glad to see this
fixed. Initially this was alloc_pages_node(), which just means that a node
is specified. The exact behavior of the allocation is determined by flags
such as GFP_THISNODE. I'd rather have that restored because otherwise we
get into weird code like the one below. And such an arrangement also
leaves the way open to add more flags in the future that may change the
allocation behavior.
> area->nid = nid;
> area->order = order;
> - area->pages = alloc_pages_exact_node(area->nid,
> + area->pages = alloc_pages_prefer_node(area->nid,
> GFP_KERNEL|__GFP_THISNODE,
> area->order);
This is not preferring a node but requiring allocation on that node.
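To illustrate the distinction (paraphrasing node_zonelist() from
include/linux/gfp.h of this era):

	static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
	{
		/* __GFP_THISNODE selects the node's no-fallback zonelist */
		return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
	}

The very same call is thus a preference or a hard requirement depending only
on the flags the caller passes.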
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-21 13:55 [PATCH] mm: rename and document alloc_pages_exact_node Vlastimil Babka
2015-07-21 14:05 ` Christoph Lameter
@ 2015-07-21 14:14 ` Robin Holt
2015-07-21 21:31 ` David Rientjes
From: Robin Holt @ 2015-07-21 14:14 UTC
To: Vlastimil Babka
Cc: linux-mm, LKML, Andrew Morton, linux-ia64, linuxppc-dev,
Christoph Lameter, Cliff Whickman
On Tue, Jul 21, 2015 at 8:55 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node() that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
>
...
> Cc: Robin Holt <robinmholt@gmail.com>
Acked-by: Robin Holt <robinmholt@gmail.com>
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-21 13:55 [PATCH] mm: rename and document alloc_pages_exact_node Vlastimil Babka
2015-07-21 14:05 ` Christoph Lameter
2015-07-21 14:14 ` Robin Holt
@ 2015-07-21 21:31 ` David Rientjes
2015-07-22 11:32 ` Vlastimil Babka
2015-07-22 1:23 ` Michael Ellerman
2015-07-22 11:31 ` Paolo Bonzini
From: David Rientjes @ 2015-07-21 21:31 UTC
To: Vlastimil Babka
Cc: linux-mm, linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Mel Gorman, Greg Thelen, Aneesh Kumar K.V,
Christoph Lameter, Pekka Enberg, Joonsoo Kim, Naoya Horiguchi,
Tony Luck, Fenghua Yu, Arnd Bergmann, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, Gleb Natapov, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Cliff Whickman,
Robin Holt
On Tue, 21 Jul 2015, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node() that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
>
> The misleading name has led to mistakes in the past, see 5265047ac301 ("mm,
> thp: really limit transparent hugepage allocation to local node") and
> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>
> To prevent further mistakes, this patch renames the function to
> alloc_pages_prefer_node() and documents it together with alloc_pages_node().
>
alloc_pages_exact_node(), as you said, connotes that the allocation will
take place on that node or will fail. So why not go beyond this patch and
actually make alloc_pages_exact_node() set __GFP_THISNODE and then call
into a new alloc_pages_prefer_node(), which would be the current
alloc_pages_exact_node() implementation, and then fix up the callers?
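A minimal sketch of that layering (using the names proposed here, untested):

	/* Prefer nid, but fall back to other nodes under pressure. */
	static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
						unsigned int order)
	{
		VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

		return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
	}

	/* Allocate on nid or fail. */
	static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
						unsigned int order)
	{
		return alloc_pages_prefer_node(nid, gfp_mask | __GFP_THISNODE, order);
	}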
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-21 21:31 ` David Rientjes
@ 2015-07-22 11:32 ` Vlastimil Babka
2015-07-22 21:52 ` David Rientjes
From: Vlastimil Babka @ 2015-07-22 11:32 UTC
To: David Rientjes
Cc: linux-mm, linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Mel Gorman, Greg Thelen, Aneesh Kumar K.V,
Christoph Lameter, Pekka Enberg, Joonsoo Kim, Naoya Horiguchi,
Tony Luck, Fenghua Yu, Arnd Bergmann, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, Gleb Natapov, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Cliff Whickman,
Robin Holt
On 07/21/2015 11:31 PM, David Rientjes wrote:
> On Tue, 21 Jul 2015, Vlastimil Babka wrote:
>
>> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
>> allocator: do not check NUMA node ID when the caller knows the node is valid")
>> as an optimized variant of alloc_pages_node() that doesn't allow the node id
>> to be -1. Unfortunately the name of the function can easily suggest that the
>> allocation is restricted to the given node. In truth, the node is only
>> preferred, unless __GFP_THISNODE is among the gfp flags.
>>
>> The misleading name has led to mistakes in the past, see 5265047ac301 ("mm,
>> thp: really limit transparent hugepage allocation to local node") and
>> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>>
>> To prevent further mistakes, this patch renames the function to
>> alloc_pages_prefer_node() and documents it together with alloc_pages_node().
>>
>
> alloc_pages_exact_node(), as you said, connotes that the allocation will
> take place on that node or will fail. So why not go beyond this patch and
> actually make alloc_pages_exact_node() set __GFP_THISNODE and then call
> into a new alloc_pages_prefer_node(), which would be the current
> alloc_pages_exact_node() implementation, and then fix up the callers?
OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and
alloc_pages_exact_node(). Isn't that a bit too much? The first two
differ only in a tiny detail:
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	/* Unknown node is current node */
	if (nid < 0)
		nid = numa_node_id();

	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
So _prefer_node is just a tiny optimization over the other one. Maybe it
should be called __alloc_pages_node() then? This would perhaps
discourage users outside of mm/arch code (where it may matter). The
savings of a skipped branch are likely dubious anyway... It would also be
nice if alloc_pages_node() could use __alloc_pages_node() internally, but
I'm not sure whether all callers are safe wrt the
VM_BUG_ON(!node_online(nid)) part.
So once alloc_pages_prefer_node() is demoted to __alloc_pages_node() or
removed outright, then maybe an alloc_pages_exact_node() that adds
__GFP_THISNODE on its own might be a useful wrapper. But I agree with
Christoph that it's a duplication of the gfp_flags functionality, and I
don't think there would be many users left anyway.
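Something like this, perhaps (just a sketch; it sidesteps the
VM_BUG_ON(!node_online(nid)) question above):

	/* Caller must pass a valid, online nid. */
	static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
	{
		VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

		return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
	}

	static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
	{
		/* Unknown node is current node */
		if (nid < 0)
			nid = numa_node_id();

		return __alloc_pages_node(nid, gfp_mask, order);
	}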
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-22 11:32 ` Vlastimil Babka
@ 2015-07-22 21:52 ` David Rientjes
2015-07-23 14:11 ` Christoph Lameter
From: David Rientjes @ 2015-07-22 21:52 UTC
To: Vlastimil Babka
Cc: linux-mm, linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Mel Gorman, Greg Thelen, Aneesh Kumar K.V,
Christoph Lameter, Pekka Enberg, Joonsoo Kim, Naoya Horiguchi,
Tony Luck, Fenghua Yu, Arnd Bergmann, Benjamin Herrenschmidt,
Paul Mackerras, Michael Ellerman, Gleb Natapov, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Cliff Whickman,
Robin Holt
On Wed, 22 Jul 2015, Vlastimil Babka wrote:
> > alloc_pages_exact_node(), as you said, connotes that the allocation will
> > take place on that node or will fail. So why not go beyond this patch and
> > actually make alloc_pages_exact_node() set __GFP_THISNODE and then call
> > into a new alloc_pages_prefer_node(), which would be the current
> > alloc_pages_exact_node() implementation, and then fix up the callers?
>
> OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and
> alloc_pages_exact_node(). Isn't that a bit too much? The first two
> differ only in a tiny detail:
>
> static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> 						unsigned int order)
> {
> 	/* Unknown node is current node */
> 	if (nid < 0)
> 		nid = numa_node_id();
>
> 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
> static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
> 						unsigned int order)
> {
> 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>
> 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
Eek, yeah, that does look bad. I'm not even sure the
	if (nid < 0)
		nid = numa_node_id();
is correct; I think this should be comparing to NUMA_NO_NODE rather than
all negative numbers, otherwise we silently ignore overflow and nobody
ever knows.
> So _prefer_node is just a tiny optimization over the other one. Maybe it
> should be called __alloc_pages_node() then? This would perhaps
> discourage users outside of mm/arch code (where it may matter). The
> savings of a skipped branch are likely dubious anyway... It would also be
> nice if alloc_pages_node() could use __alloc_pages_node() internally, but
> I'm not sure whether all callers are safe wrt the
> VM_BUG_ON(!node_online(nid)) part.
>
I'm not sure how large you want to make your patch :) In a perfect world
I would think that we wouldn't have an alloc_pages_prefer_node() at all
and everything would be converted to alloc_pages_node() which would do
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
and then alloc_pages_exact_node() would do
	return alloc_pages_node(nid, gfp_mask | __GFP_THISNODE, order);
and existing alloc_pages_exact_node() callers fixed up depending on
whether they set the bit or not.
The only possible downside would be existing users of
alloc_pages_node() that are calling it with an offline node. Since it's a
VM_BUG_ON() that would catch that, I think it should be changed to a
VM_WARN_ON() and eventually fixed up because it's nonsensical.
VM_BUG_ON() here should be avoided.
Or just go with a single alloc_pages_node() and rename __GFP_THISNODE to
__GFP_EXACT_NODE, which may accomplish the same thing :)
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-22 21:52 ` David Rientjes
@ 2015-07-23 14:11 ` Christoph Lameter
2015-07-23 20:27 ` David Rientjes
From: Christoph Lameter @ 2015-07-23 14:11 UTC
To: David Rientjes
Cc: Vlastimil Babka, linux-mm, linux-kernel, Andrew Morton,
linux-ia64, linuxppc-dev, cbe-oss-dev, kvm, Mel Gorman,
Greg Thelen, Aneesh Kumar K.V, Pekka Enberg, Joonsoo Kim,
Naoya Horiguchi, Tony Luck, Fenghua Yu, Arnd Bergmann,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Gleb Natapov, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Cliff Whickman, Robin Holt
On Wed, 22 Jul 2015, David Rientjes wrote:
> Eek, yeah, that does look bad. I'm not even sure the
>
> 	if (nid < 0)
> 		nid = numa_node_id();
>
> is correct; I think this should be comparing to NUMA_NO_NODE rather than
> all negative numbers, otherwise we silently ignore overflow and nobody
> ever knows.
Comparing to NUMA_NO_NODE would be better. Also, use numa_mem_id() instead
to better support memoryless nodes?
> The only possible downside would be existing users of
> alloc_pages_node() that are calling it with an offline node. Since it's a
> VM_BUG_ON() that would catch that, I think it should be changed to a
> VM_WARN_ON() and eventually fixed up because it's nonsensical.
> VM_BUG_ON() here should be avoided.
The offline node thing could be addressed by using numa_mem_id()?
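E.g. (a sketch, untested):

	if (nid == NUMA_NO_NODE || !node_online(nid))
		nid = numa_mem_id();	/* nearest node with memory for this cpu */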
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-23 14:11 ` Christoph Lameter
@ 2015-07-23 20:27 ` David Rientjes
2015-07-24 14:52 ` Vlastimil Babka
From: David Rientjes @ 2015-07-23 20:27 UTC
To: Christoph Lameter
Cc: Vlastimil Babka, linux-mm, linux-kernel, Andrew Morton,
linux-ia64, linuxppc-dev, cbe-oss-dev, kvm, Mel Gorman,
Greg Thelen, Aneesh Kumar K.V, Pekka Enberg, Joonsoo Kim,
Naoya Horiguchi, Tony Luck, Fenghua Yu, Arnd Bergmann,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Gleb Natapov, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Cliff Whickman, Robin Holt
On Thu, 23 Jul 2015, Christoph Lameter wrote:
> > The only possible downside would be existing users of
> > alloc_pages_node() that are calling it with an offline node. Since it's a
> > VM_BUG_ON() that would catch that, I think it should be changed to a
> > VM_WARN_ON() and eventually fixed up because it's nonsensical.
> > VM_BUG_ON() here should be avoided.
>
> The offline node thing could be addressed by using numa_mem_id()?
>
I was concerned about any callers that were passing an offline node, not
NUMA_NO_NODE, today. One of the alloc-node functions has a VM_BUG_ON()
for it, the other silently calls node_zonelist() on it.
I suppose the final alloc_pages_node() implementation could be
	if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid)))
		nid = numa_mem_id();

	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
though.
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-23 20:27 ` David Rientjes
@ 2015-07-24 14:52 ` Vlastimil Babka
From: Vlastimil Babka @ 2015-07-24 14:52 UTC
To: David Rientjes, Christoph Lameter
Cc: linux-mm, linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Mel Gorman, Greg Thelen, Aneesh Kumar K.V,
Pekka Enberg, Joonsoo Kim, Naoya Horiguchi, Tony Luck, Fenghua Yu,
Arnd Bergmann, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman, Gleb Natapov, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, H. Peter Anvin, Cliff Whickman, Robin Holt
On 07/23/2015 10:27 PM, David Rientjes wrote:
> On Thu, 23 Jul 2015, Christoph Lameter wrote:
>
>>> The only possible downside would be existing users of
>>> alloc_pages_node() that are calling it with an offline node. Since it's a
>>> VM_BUG_ON() that would catch that, I think it should be changed to a
>>> VM_WARN_ON() and eventually fixed up because it's nonsensical.
>>> VM_BUG_ON() here should be avoided.
>>
>> The offline node thing could be addressed by using numa_mem_id()?
>>
>
> I was concerned about any callers that were passing an offline node, not
> NUMA_NO_NODE, today. One of the alloc-node functions has a VM_BUG_ON()
> for it, the other silently calls node_zonelist() on it.
>
> I suppose the final alloc_pages_node() implementation could be
>
> 	if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid)))
> 		nid = numa_mem_id();
>
> 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>
> though.
I've posted v2 based on David's and Christoph's suggestions (thanks) but
to avoid spamming everyone until we agree on the final interface, it's
marked as RFC and excludes the arch people from CC:
http://marc.info/?l=linux-kernel&m=143774920902608&w=2
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-21 13:55 [PATCH] mm: rename and document alloc_pages_exact_node Vlastimil Babka
2015-07-21 21:31 ` David Rientjes
@ 2015-07-22 1:23 ` Michael Ellerman
2015-07-22 11:31 ` Paolo Bonzini
From: Michael Ellerman @ 2015-07-22 1:23 UTC
To: Vlastimil Babka
Cc: linux-mm, linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Mel Gorman, David Rientjes, Greg Thelen,
Aneesh Kumar K.V, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
Naoya Horiguchi, Tony Luck, Fenghua Yu, Arnd Bergmann,
Benjamin Herrenschmidt, Paul Mackerras, Gleb Natapov,
Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
Cliff Whickman, Robin Holt
On Tue, 2015-07-21 at 15:55 +0200, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node() that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
>
> The misleading name has led to mistakes in the past, see 5265047ac301 ("mm,
> thp: really limit transparent hugepage allocation to local node") and
> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>
> To prevent further mistakes, this patch renames the function to
> alloc_pages_prefer_node() and documents it together with alloc_pages_node().
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> I'm also CC'ing the maintainers of the callsites so they can verify that the
> callsites that don't pass __GFP_THISNODE are really not intended to restrict
> allocation to the given node. I went through them myself and each looked like
> it's better off if it can successfully allocate on a fallback node rather
> than fail. I think DavidR checked them too, but it's better if the maintainers
> can verify that. I'm not completely sure about all the usages in sl*b due to
> the multiple layers through which gfp flags are being passed.
> diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
> index e865d74..646a310 100644
> --- a/arch/powerpc/platforms/cell/ras.c
> +++ b/arch/powerpc/platforms/cell/ras.c
> @@ -123,7 +123,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
>
> area->nid = nid;
> area->order = order;
> - area->pages = alloc_pages_exact_node(area->nid,
> + area->pages = alloc_pages_prefer_node(area->nid,
> GFP_KERNEL|__GFP_THISNODE,
> area->order);
Yeah, that looks right to me.
This code enables some firmware memory calibration so I think it really does
want to get memory on the specified node, or else fail.
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
cheers
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-21 13:55 [PATCH] mm: rename and document alloc_pages_exact_node Vlastimil Babka
2015-07-22 1:23 ` Michael Ellerman
@ 2015-07-22 11:31 ` Paolo Bonzini
2015-07-22 21:44 ` David Rientjes
From: Paolo Bonzini @ 2015-07-22 11:31 UTC
To: Vlastimil Babka, linux-mm
Cc: linux-kernel, Andrew Morton, linux-ia64, linuxppc-dev,
cbe-oss-dev, kvm, Mel Gorman, David Rientjes, Greg Thelen,
Aneesh Kumar K.V, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
Naoya Horiguchi, Tony Luck, Fenghua Yu, Arnd Bergmann,
Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
Gleb Natapov, Thomas Gleixner, Ingo Molnar
On 21/07/2015 15:55, Vlastimil Babka wrote:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2d73807..a8723a8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
> struct page *pages;
> struct vmcs *vmcs;
>
> - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
> + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
> if (!pages)
> return NULL;
> vmcs = page_address(pages);
Even though there's a pretty strong preference for the "right" node,
things can work if the node is the wrong one. The order is always zero
in practice, so the allocation should succeed.
Paolo
* Re: [PATCH] mm: rename and document alloc_pages_exact_node
2015-07-22 11:31 ` Paolo Bonzini
@ 2015-07-22 21:44 ` David Rientjes
From: David Rientjes @ 2015-07-22 21:44 UTC
To: Paolo Bonzini
Cc: Vlastimil Babka, linux-mm, linux-kernel, Andrew Morton,
linux-ia64, linuxppc-dev, cbe-oss-dev, kvm, Mel Gorman,
Greg Thelen, Aneesh Kumar K.V, Christoph Lameter, Pekka Enberg,
Joonsoo Kim, Naoya Horiguchi, Tony Luck, Fenghua Yu,
Arnd Bergmann, Benjamin Herrenschmidt, Paul Mackerras,
Michael Ellerman, Gleb Natapov, Thomas Gleixner, Ingo Molnar
On Wed, 22 Jul 2015, Paolo Bonzini wrote:
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 2d73807..a8723a8 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
> > struct page *pages;
> > struct vmcs *vmcs;
> >
> > - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
> > + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
> > if (!pages)
> > return NULL;
> > vmcs = page_address(pages);
>
> Even though there's a pretty strong preference for the "right" node,
> things can work if the node is the wrong one. The order is always zero
> in practice, so the allocation should succeed.
>
Your code is fine both before and after the patch since __GFP_THISNODE
isn't set. The allocation will eventually succeed but, as you said, may
be from remote memory (and the success of allocating on node may be
influenced by the global setting of zone_reclaim_mode).