From: Michal Hocko <mhocko@suse.com>
To: David Hildenbrand <david@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Paul Mackerras <paulus@samba.org>,
Rashmica Gupta <rashmica.g@gmail.com>,
linuxppc-dev@lists.ozlabs.org,
Andrew Morton <akpm@linux-foundation.org>,
Mike Rapoport <rppt@kernel.org>,
Oscar Salvador <osalvador@suse.de>
Subject: Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
Date: Tue, 3 Nov 2020 10:23:09 +0100
Message-ID: <20201103092309.GD21990@dhcp22.suse.cz>
In-Reply-To: <20201029162718.29910-5-david@redhat.com>
On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
> Let's use alloc_contig_pages() for allocating memory and remove the
> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
> PG_offline, such that they will definitely not get touched - e.g.,
> when hibernating. When freeing memory, try to revert what we did.
>
> The original idea was discussed in:
> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>
> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
> architectures, whereby only single pages are unmapped from the linear
> mapping. Let's mimic what memory hot(un)plug would do with the linear
> mapping.
>
> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>
> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>
> sh-5.0# mount -t debugfs none /sys/kernel/debug/
> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
> 40000000
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>
> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
> pages are not movable. However, as we don't run with any memory
> hot(un)plug mechanism around, we could make an exception to
> increase the chance of allocations succeeding.
>
> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
> along PG_reserved in hibernation code will always return "true"
> on powerpc, resulting in the pages getting touched. It's too
> generic - e.g., indicates boot allocations.
>
> Note 3: For now, we keep using memory_block_size_bytes() as minimum
> granularity. I'm not able to come up with a better guess (most
> probably, doing it on a section basis could be possible).
>
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Rashmica Gupta <rashmica.g@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Thanks! This looks like a move in the right direction. I cannot really
judge implementation details because I am not familiar with the code.
I have only one tiny concern:
[...]
> -/* called with device_hotplug_lock held */
> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
> +static u64 memtrace_alloc_node(u32 nid, u64 size)
> {
> - const unsigned long start = PFN_PHYS(start_pfn);
> - const unsigned long size = PFN_PHYS(nr_pages);
> + const unsigned long nr_pages = PHYS_PFN(size);
> + unsigned long pfn, start_pfn;
> + struct page *page;
>
> - if (walk_memory_blocks(start, size, NULL, check_memblock_online))
> - return false;
> -
> - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
> - change_memblock_state);
> -
> - if (offline_pages(start_pfn, nr_pages)) {
> - walk_memory_blocks(start, size, (void *)MEM_ONLINE,
> - change_memblock_state);
> - return false;
> - }
> + /*
> + * Trace memory needs to be aligned to the size, which is guaranteed
> + * by alloc_contig_pages().
> + */
> + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
> + nid, NULL);
__GFP_THISNODE without other modifiers looks suspicious. I suspect you
want to enforce node locality and exclude movable zones by this. While
this works, it is an antipattern. I would rather use GFP_KERNEL |
__GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.
If for no other reason, we want to be able to work inside a normal
compaction context (compared to the effectively GFP_NOIO context which
the above implies). Also, this looks like a sleepable context.
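
To make the difference between the two masks concrete, here is a minimal
userspace sketch (not kernel code). The DEMO_* names and their bit values
are placeholders made up for illustration; the real definitions live in
include/linux/gfp.h and differ between kernel versions. Only two things
are taken from the kernel: GFP_KERNEL being composed of the reclaim, I/O
and filesystem bits, and the direct-reclaim test that
gfpflags_allow_blocking() performs.

```c
#include <stdbool.h>

/* Stand-in for the kernel's gfp_t; illustration only. */
typedef unsigned int demo_gfp_t;

/* Placeholder bit values, NOT the kernel's actual ones. */
#define DEMO_GFP_IO             0x01u  /* stands in for __GFP_IO */
#define DEMO_GFP_FS             0x02u  /* stands in for __GFP_FS */
#define DEMO_GFP_DIRECT_RECLAIM 0x04u  /* stands in for __GFP_DIRECT_RECLAIM */
#define DEMO_GFP_KSWAPD_RECLAIM 0x08u  /* stands in for __GFP_KSWAPD_RECLAIM */
#define DEMO_GFP_NOWARN         0x10u  /* stands in for __GFP_NOWARN */
#define DEMO_GFP_THISNODE       0x20u  /* stands in for __GFP_THISNODE */

/* Same composition as the kernel's __GFP_RECLAIM and GFP_KERNEL. */
#define DEMO_GFP_RECLAIM \
	(DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_KERNEL \
	(DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS)

/* The mask as written in the patch: node locality and no warning only. */
static demo_gfp_t demo_patch_mask(void)
{
	return DEMO_GFP_THISNODE | DEMO_GFP_NOWARN;
}

/* The mask suggested in this review. */
static demo_gfp_t demo_suggested_mask(void)
{
	return DEMO_GFP_KERNEL | DEMO_GFP_THISNODE | DEMO_GFP_NOWARN;
}

/* Mirrors the check done by gfpflags_allow_blocking(). */
static bool demo_may_sleep(demo_gfp_t gfp)
{
	return (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;
}

/* True if the mask permits both block I/O and filesystem callbacks. */
static bool demo_may_io_and_fs(demo_gfp_t gfp)
{
	const demo_gfp_t need = DEMO_GFP_IO | DEMO_GFP_FS;

	return (gfp & need) == need;
}
```

In this sketch, demo_patch_mask() carries none of the reclaim, I/O or
filesystem bits, while demo_suggested_mask() carries all of them and
still keeps the node-locality bit.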
--
Michal Hocko
SUSE Labs