* [PATCH v1 1/3] mm: fix uninitialized memmaps on a partially populated last section
From: David Hildenbrand @ 2019-12-09 17:48 UTC
To: linux-kernel
Cc: linux-mm, David Hildenbrand, stable, Naoya Horiguchi,
Pavel Tatashin, Andrew Morton, Steven Sistare, Michal Hocko,
Daniel Jordan, Bob Picco, Oscar Salvador

If max_pfn is not aligned to a section boundary, we can easily run into
BUGs. This can, e.g., be triggered on x86-64 under QEMU by specifying a
memory size that is not a multiple of 128MB (e.g., 4097MB, but also
4160MB). I was told that on real HW, we can easily have this scenario
(it is, in particular, one of the main reasons sub-section hotadd of
devmem was added).

The issue is that we have a valid memmap (pfn_valid()) for the
whole section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.
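
To make the arithmetic concrete, here is a minimal userspace sketch (not kernel code) of the affected range. It assumes 4 KiB pages and 128 MiB sections, i.e. 0x8000 pages per section as on x86-64. The helper names section_end_pfn() and tail_pfns_without_memory() are hypothetical and exist only for this illustration.

```c
/* Assumptions for this sketch: 4 KiB pages, 128 MiB sections (x86-64). */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27
#define PAGES_PER_SECTION	(1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))

/* max_pfn rounded up to the next section boundary (unchanged if aligned). */
unsigned long section_end_pfn(unsigned long max_pfn)
{
	return (max_pfn + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1);
}

/*
 * Number of pfns in [max_pfn, section end): they have a memmap
 * (the whole section is considered valid) but no memory behind them.
 */
unsigned long tail_pfns_without_memory(unsigned long max_pfn)
{
	return section_end_pfn(max_pfn) - max_pfn;
}
```

With an assumed, purely illustrative max_pfn of 0x144000, section_end_pfn() gives 0x148000, so 0x4000 pfns at the end of the last section have a memmap whose contents are undefined. A section-aligned max_pfn such as 0x140000 leaves no such tail.
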
E.g., doing a "cat /proc/kpageflags > /dev/null" results in
[ 303.218313] BUG: unable to handle page fault for address: fffffffffffffffe
[ 303.218899] #PF: supervisor read access in kernel mode
[ 303.219344] #PF: error_code(0x0000) - not-present page
[ 303.219787] PGD 12614067 P4D 12614067 PUD 12616067 PMD 0
[ 303.220266] Oops: 0000 [#1] SMP NOPTI
[ 303.220587] CPU: 0 PID: 424 Comm: cat Not tainted 5.4.0-next-20191128+ #17
[ 303.221169] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[ 303.222140] RIP: 0010:stable_page_flags+0x4d/0x410
[ 303.222554] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[ 303.224135] RSP: 0018:ffff9f5980187e58 EFLAGS: 00010202
[ 303.224576] RAX: fffffffffffffffe RBX: ffffda1285004000 RCX: ffff9f5980187dd4
[ 303.225178] RDX: 0000000000000001 RSI: ffffffff92662420 RDI: 0000000000000246
[ 303.225789] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[ 303.226405] R10: 0000000000000000 R11: 0000000000000000 R12: 00007f31d070e000
[ 303.227012] R13: 0000000000140100 R14: 00007f31d070e800 R15: ffffda1285004000
[ 303.227629] FS: 00007f31d08f6580(0000) GS:ffff90a6bba00000(0000) knlGS:0000000000000000
[ 303.228329] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 303.228820] CR2: fffffffffffffffe CR3: 00000001332a2000 CR4: 00000000000006f0
[ 303.229438] Call Trace:
[ 303.229654] kpageflags_read.cold+0x57/0xf0
[ 303.230016] proc_reg_read+0x3c/0x60
[ 303.230332] vfs_read+0xc2/0x170
[ 303.230614] ksys_read+0x65/0xe0
[ 303.230898] do_syscall_64+0x5c/0xa0
[ 303.231216] entry_SYSCALL_64_after_hwframe+0x49/0xbe

This patch fixes that by at least zeroing out that memmap (so e.g.,
page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.

After this patch, there are still problems to solve. E.g., not all of these
pages falling into a memory hole will actually get initialized later
and set PageReserved - they are only zeroed out - but at least the
immediate crashes are gone. A follow-up patch will take care of this.

Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Cc: <stable@vger.kernel.org> # v4.15+
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62dcd6b76c80..1eb2ce7c79e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6932,7 +6932,8 @@ static u64 zero_pfn_range(unsigned long spfn, unsigned long epfn)
  * This function also addresses a similar issue where struct pages are left
  * uninitialized because the physical address range is not covered by
  * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=.
+ * layout is manually configured via memmap=, or when the highest physical
+ * address (max_pfn) does not end on a section boundary.
  */
 void __init zero_resv_unavail(void)
 {
@@ -6950,7 +6951,16 @@ void __init zero_resv_unavail(void)
 		pgcnt += zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
 		next = end;
 	}
-	pgcnt += zero_pfn_range(PFN_DOWN(next), max_pfn);
+
+	/*
+	 * Early sections always have a fully populated memmap for the whole
+	 * section - see pfn_valid(). If the last section has holes at the
+	 * end and that section is marked "online", the memmap will be
+	 * considered initialized. Make sure that memmap has a well defined
+	 * state.
+	 */
+	pgcnt += zero_pfn_range(PFN_DOWN(next),
+				round_up(max_pfn, PAGES_PER_SECTION));
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
-- 
2.21.0
* Re: [PATCH v1 1/3] mm: fix uninitialized memmaps on a partially populated last section
From: Daniel Jordan @ 2019-12-09 21:15 UTC
To: David Hildenbrand
Cc: linux-kernel, linux-mm, stable, Naoya Horiguchi, Pavel Tatashin,
Andrew Morton, Steven Sistare, Michal Hocko, Daniel Jordan,
Bob Picco, Oscar Salvador
Hi David,
On Mon, Dec 09, 2019 at 06:48:34PM +0100, David Hildenbrand wrote:
> If max_pfn is not aligned to a section boundary, we can easily run into
> BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
> memory size that is not a multiple of 128MB (e.g., 4097MB, but also
> 4160MB). I was told that on real HW, we can easily have this scenario
> (esp., one of the main reasons sub-section hotadd of devmem was added).
>
> The issue is, that we have a valid memmap (pfn_valid()) for the
> whole section, and the whole section will be marked "online".
> pfn_to_online_page() will succeed, but the memmap contains garbage.
>
> E.g., doing a "cat /proc/kpageflags > /dev/null" results in
>
> [ 303.218313] BUG: unable to handle page fault for address: fffffffffffffffe
> [ 303.218899] #PF: supervisor read access in kernel mode
> [ 303.219344] #PF: error_code(0x0000) - not-present page
> [ 303.219787] PGD 12614067 P4D 12614067 PUD 12616067 PMD 0
> [ 303.220266] Oops: 0000 [#1] SMP NOPTI
> [ 303.220587] CPU: 0 PID: 424 Comm: cat Not tainted 5.4.0-next-20191128+ #17
I can't reproduce this on x86-64 qemu, next-20191128 or mainline, with either
memory size. What config are you using? How often are you hitting it?
It may not have anything to do with the config, and I may be getting lucky with
the garbage in my memory.
* Re: [PATCH v1 1/3] mm: fix uninitialized memmaps on a partially populated last section
From: David Hildenbrand @ 2019-12-10 10:11 UTC
To: Daniel Jordan
Cc: linux-kernel, linux-mm, stable, Naoya Horiguchi, Pavel Tatashin,
Andrew Morton, Steven Sistare, Michal Hocko, Bob Picco,
Oscar Salvador
On 09.12.19 22:15, Daniel Jordan wrote:
> Hi David,
>
> On Mon, Dec 09, 2019 at 06:48:34PM +0100, David Hildenbrand wrote:
>> If max_pfn is not aligned to a section boundary, we can easily run into
>> BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
>> memory size that is not a multiple of 128MB (e.g., 4097MB, but also
>> 4160MB). I was told that on real HW, we can easily have this scenario
>> (esp., one of the main reasons sub-section hotadd of devmem was added).
>>
>> The issue is, that we have a valid memmap (pfn_valid()) for the
>> whole section, and the whole section will be marked "online".
>> pfn_to_online_page() will succeed, but the memmap contains garbage.
>>
>> E.g., doing a "cat /proc/kpageflags > /dev/null" results in
>>
>> [ 303.218313] BUG: unable to handle page fault for address: fffffffffffffffe
>> [ 303.218899] #PF: supervisor read access in kernel mode
>> [ 303.219344] #PF: error_code(0x0000) - not-present page
>> [ 303.219787] PGD 12614067 P4D 12614067 PUD 12616067 PMD 0
>> [ 303.220266] Oops: 0000 [#1] SMP NOPTI
>> [ 303.220587] CPU: 0 PID: 424 Comm: cat Not tainted 5.4.0-next-20191128+ #17
>
Hi Daniel,
> I can't reproduce this on x86-64 qemu, next-20191128 or mainline, with either
> memory size. What config are you using? How often are you hitting it?
Thanks for verifying! Hah, there is one piece missing to reproduce via
"cat /proc/kpageflags > /dev/null" that I overlooked in my QEMU cmdline
(see below).

I can reproduce it reliably (QEMU with "-m 4160M") via:
[root@localhost ~]# uname -a
Linux localhost 5.5.0-rc1-next-20191209 #93 SMP Tue Dec 10 10:46:19 CET 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# ./page-types -r -a 0x144001
[ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[ 200.477500] #PF: supervisor read access in kernel mode
[ 200.478334] #PF: error_code(0x0000) - not-present page
[ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[ 200.479557] Oops: 0000 [#4] SMP NOPTI
[ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
[ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[ 200.488897] Call Trace:
[ 200.489115] kpageflags_read+0xe9/0x140
[ 200.489447] proc_reg_read+0x3c/0x60
[ 200.489755] vfs_read+0xc2/0x170
[ 200.490037] ksys_pread64+0x65/0xa0
[ 200.490352] do_syscall_64+0x5c/0xa0
[ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(tool located in tools/vm/page-types.c, see also patch #2)
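
For readers without the tool at hand, the access it performs can be approximated in a few lines. This is a sketch, not the page-types implementation: it relies only on /proc/kpageflags exposing one 64-bit flags word per pfn, indexed by pfn. The helper name read_kpageflags_entry() and its path parameter (there so the function can also be pointed at a regular file) are hypothetical.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Read the 64-bit flags word for one pfn from a kpageflags-style file.
 * Returns 0 on success, -1 on error or short read (e.g. pfn past the end).
 */
int read_kpageflags_entry(const char *path, uint64_t pfn, uint64_t *flags)
{
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;

	/* One u64 per pfn, so the entry lives at byte offset pfn * 8. */
	n = pread(fd, flags, sizeof(*flags), (off_t)(pfn * sizeof(*flags)));
	close(fd);

	return n == (ssize_t)sizeof(*flags) ? 0 : -1;
}
```

Called as root with "/proc/kpageflags" and pfn 0x144001, this should issue the same kind of pread() that shows up in the trace above (ksys_pread64 into kpageflags_read). On an affected kernel that is expected to oops, so only try it in a throwaway VM.
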
To reproduce via "cat /proc/kpageflags > /dev/null", you have to
hot/coldplug one DIMM, to move max_pfn beyond the garbage memmap
(see also patch #2). My QEMU cmdline with Fedora 31:
qemu-system-x86_64 \
    --enable-kvm \
    -m 4160M,slots=4,maxmem=8G \
    -hda Fedora-Cloud-Base-31-1.9.x86_64.qcow2 \
    -machine pc \
    -nographic \
    -nodefaults \
    -chardev stdio,id=serial,signal=off \
    -device isa-serial,chardev=serial \
    -object memory-backend-ram,id=mem0,size=1024M \
    -device pc-dimm,id=dimm0,memdev=mem0
[root@localhost ~]# uname -a
Linux localhost 5.3.7-301.fc31.x86_64 #1 SMP Mon Oct 21 19:18:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# cat /proc/kpageflags > /dev/null
[ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[ 111.517907] #PF: supervisor read access in kernel mode
[ 111.518333] #PF: error_code(0x0000) - not-present page
[ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
>
> It may not have anything to do with the config, and I may be getting lucky with
> the garbage in my memory.
>
Some things that might be relevant from my config.
# CONFIG_PAGE_POISONING is not set
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
The F31 default config should make it trigger.
Will update this patch description - thanks!
...
--
Thanks,
David / dhildenb
* Re: [PATCH v1 1/3] mm: fix uninitialized memmaps on a partially populated last section
From: Daniel Jordan @ 2019-12-10 22:18 UTC
To: David Hildenbrand
Cc: Daniel Jordan, linux-kernel, linux-mm, stable, Naoya Horiguchi,
Pavel Tatashin, Andrew Morton, Steven Sistare, Michal Hocko,
Bob Picco, Oscar Salvador
On Tue, Dec 10, 2019 at 11:11:03AM +0100, David Hildenbrand wrote:
> Some things that might be relevant from my config.
>
> # CONFIG_PAGE_POISONING is not set
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
> CONFIG_SPARSEMEM_EXTREME=y
> CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
> CONFIG_SPARSEMEM_VMEMMAP=y
> CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
> CONFIG_MEMORY_HOTPLUG=y
> CONFIG_MEMORY_HOTPLUG_SPARSE=y
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
Thanks for all that. After some poking around, turns out enabling DEBUG_VM
with its page poisoning let me hit it right away, which makes me wonder how
often someone would see this without it.
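
One way to read the faulting address in the traces, offered as an interpretation rather than something stated in this thread: if a memmap entry is all ones (as it would be after 0xff poisoning), a compound_head()-style lookup sees bit 0 set, treats the page as a tail page, and computes a head pointer of 0xfffffffffffffffe, which matches CR2 above. The sketch below models that in userspace with a hypothetical two-field struct fake_page. It is not the kernel's struct page, and on a boot without poisoning the memmap contents are arbitrary.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the first two words of a memmap entry. */
struct fake_page {
	uintptr_t flags;
	uintptr_t compound_head;	/* bit 0 set: tail page, rest is head pointer */
};

/* Models a compound_head()-style lookup without dereferencing the result. */
uintptr_t head_of(const struct fake_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return head - 1;	/* would be dereferenced as the head page */
	return (uintptr_t)page;		/* not a tail page: it is its own head */
}

/* Fill an entry with a byte pattern: 0xff models poison, 0 models the fix. */
void fill_page(struct fake_page *page, int byte)
{
	memset(page, byte, sizeof(*page));
}
```

After fill_page(&p, 0xff), head_of(&p) is UINTPTR_MAX - 1 (0xfffffffffffffffe on a 64-bit build). After fill_page(&p, 0) it is the address of p itself, which would explain why zeroing the memmap is enough to avoid this particular fault.
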
Anyway, fix looks good to me.
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>