* [PATCH v3 0/4] optimize memory hotplug
@ 2018-02-13 19:31 Pavel Tatashin
  2018-02-13 19:31 ` [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
  ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Pavel Tatashin @ 2018-02-13 19:31 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Changelog:
	v2 - v3
	- Fixed two issues found during testing
	- Addressed kbuild warning reports

	v1 - v2
	- Added struct page poisoning checking in order to verify that
	  struct pages are never accessed until initialized during memory
	  hotplug

This patchset:
- Improves hotplug performance by eliminating a number of struct page
  traversals during memory hotplug.
- Fixes issues with hotplugging where boundaries were not properly
  checked, and where, on x86, the block size was not properly aligned
  with the end of memory.
- Potentially improves boot performance by eliminating a condition from
  __init_single_page().
- Adds robustness by verifying that struct pages are correctly poisoned
  when flags are accessed.
The following experiments were performed on a Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T of RAM: booting in qemu with 960G of memory, time to
initialize struct pages:

no-kvm:
		TRY1		TRY2
BEFORE:		39.433668	39.39705
AFTER:		36.903781	36.989329

with-kvm:
BEFORE:		10.977447	11.103164
AFTER:		10.929072	10.751885

Hotplug 896G memory:
no-kvm:
		TRY1		TRY2
BEFORE:		848.740000	846.910000
AFTER:		783.070000	786.560000

with-kvm:
		TRY1		TRY2
BEFORE:		34.410000	33.57
AFTER:		29.810000	29.580000

Pavel Tatashin (4):
  mm/memory_hotplug: enforce block size aligned range check
  x86/mm/memory_hotplug: determine block size based on the end of boot
    memory
  mm: uninitialized struct page poisoning sanity checking
  mm/memory_hotplug: optimize memory hotplug

 arch/x86/mm/init_64.c          | 33 +++++++++++++++++++++++++++++----
 drivers/base/memory.c          | 38 +++++++++++++++++++++-----------------
 drivers/base/node.c            | 17 ++++++++++-------
 include/linux/memory_hotplug.h |  2 ++
 include/linux/mm.h             |  4 +++-
 include/linux/node.h           |  4 ++--
 include/linux/page-flags.h     | 22 +++++++++++++++++-----
 mm/memblock.c                  |  2 +-
 mm/memory_hotplug.c            | 36 ++++++++++--------------------------
 mm/page_alloc.c                | 28 ++++++++++------------------
 mm/sparse.c                    | 29 ++++++++++++++++++++++++++---
 11 files changed, 131 insertions(+), 84 deletions(-)

-- 
2.16.1
* [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-13 19:31 [PATCH v3 0/4] optimize memory hotplug Pavel Tatashin
@ 2018-02-13 19:31 ` Pavel Tatashin
  2018-02-15 11:34   ` Michal Hocko
  2018-02-13 19:31 ` [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Pavel Tatashin @ 2018-02-13 19:31 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Start qemu with the following arguments:
  -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

This boots the machine with 64G, and adds a device mem1 of 2G which can
be hotplugged later.

Also make sure that the config has the following turned on:
  CONFIG_MEMORY_HOTPLUG
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
  CONFIG_ACPI_HOTPLUG_MEMORY

From the qemu monitor, hotplug the memory:
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

The operation will fail with the following trace:

WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
RSP: 0018:ffffbe5086b53d98 EFLAGS: 00010246
RAX: ffff9acb3fff3180 RBX: ffff9acaf7646038 RCX: 0000000000000800
RDX: ffff9acb3fff3000 RSI: 0000000000000218 RDI: 00000000010c0000
RBP: 0000000001080000 R08: ffffe81f83000040 R09: 0000000001100000
R10: ffff9acb3fff6000 R11: 0000000000000246 R12: 0000000000080000
R13: 0000000000000000 R14: ffffbe5086b53f08 R15: ffff9acaf7506f20
FS:  00007fd7f20da8c0(0000) GS:ffff9acb3fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd7f20f2000 CR3: 0000000ff7ac2001 CR4: 00000000001606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 memory_subsys_online+0x44/0xa0
 device_online+0x51/0x80
 store_mem_state+0x5e/0xe0
 kernfs_fop_write+0xfa/0x170
 __vfs_write+0x2e/0x150
 ? __inode_security_revalidate+0x47/0x60
 ? selinux_file_permission+0xd5/0x130
 ? _cond_resched+0x10/0x20
 vfs_write+0xa8/0x1a0
 ? find_vma+0x54/0x60
 SyS_write+0x4d/0xb0
 do_syscall_64+0x5d/0x110
 entry_SYSCALL_64_after_hwframe+0x21/0x86
RIP: 0033:0x7fd7f0d3a840
RSP: 002b:00007fff5db77c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7f0d3a840
RDX: 0000000000000006 RSI: 00007fd7f20f2000 RDI: 0000000000000007
RBP: 00007fd7f20f2000 R08: 000055db265c4ab0 R09: 00007fd7f20da8c0
R10: 0000000000000006 R11: 0000000000000246 R12: 000055db265c49d0
R13: 0000000000000006 R14: 000055db265c5510 R15: 000000000000000b
Code: fe ff ff 07 00 77 24 48 89 f8 48 c1 e8 17 49 8b 14 c2 48 85 d2 74 14
40 0f b6 c6 49 81 c0 00 00 20 00 48 c1 e0 04 48 01 d0 75 93 <0f> ff 31 c0
c3 b8 01 00 00 00 c3 31 d2 48 c7 c7 b0 32 67 a6 31
---[ end trace 6203bc4f1a5d30e8 ]---

The problem is detected in: drivers/base/memory.c

static bool pages_correctly_reserved(unsigned long start_pfn)
205		if (WARN_ON_ONCE(!pfn_valid(pfn)))

This function loops through every section in the newly added memory block
and verifies that the first pfn is valid, meaning the section exists, has
a mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when a machine is booted with
more than 64G of memory, the block size is changed to 2G:

$ cat /sys/devices/system/memory/block_size_bytes
80000000

or
$ dmesg | grep "block size"
[    0.086469] x86/mm: Memory block size: 2048MB

During memory hotplug and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt

So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of its sections properly
populated.

In our case the start_pfn starts from the last_pfn (end of physical
memory).

$ dmesg | grep last_pfn
[    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

0x1040000 == 65G, and so is not 2G aligned!

The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.

With this fix, running the above sequence yields the following result:

(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 mm/memory_hotplug.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b2bd52ff7605..565048f496f7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1083,15 +1083,16 @@ int try_online_node(int nid)
 
 static int check_hotplug_memory_range(u64 start, u64 size)
 {
-	u64 start_pfn = PFN_DOWN(start);
+	unsigned long block_sz = memory_block_size_bytes();
+	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
 	u64 nr_pages = size >> PAGE_SHIFT;
+	u64 start_pfn = PFN_DOWN(start);
 
-	/* Memory range must be aligned with section */
-	if ((start_pfn & ~PAGE_SECTION_MASK) ||
-	    (nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
-		pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
-			(unsigned long long)start,
-			(unsigned long long)size);
+	/* memory range must be block size aligned */
+	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
+	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
+		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
+		       block_sz, start, size);
 		return -EINVAL;
 	}
-- 
2.16.1
* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check 2018-02-13 19:31 ` [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin @ 2018-02-15 11:34 ` Michal Hocko 2018-02-15 13:36 ` Pavel Tatashin 0 siblings, 1 reply; 19+ messages in thread From: Michal Hocko @ 2018-02-15 11:34 UTC (permalink / raw) To: Pavel Tatashin Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe On Tue 13-02-18 14:31:56, Pavel Tatashin wrote: > Start qemu with the following arguments: > -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G > > Which: boots machine with 64G, and adds a device mem1 with 2G which can be > hotplugged later. > > Also make sure that config has the following turned on: > CONFIG_MEMORY_HOTPLUG > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE > CONFIG_ACPI_HOTPLUG_MEMORY > > >From qemu monitor hotplug the memory (make sure config has > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > > The operation will fail with the following trace: > > WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205 > pages_correctly_reserved+0xe6/0x110 > Modules linked in: > CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014 > RIP: 0010:pages_correctly_reserved+0xe6/0x110 > RSP: 0018:ffffbe5086b53d98 EFLAGS: 00010246 > RAX: ffff9acb3fff3180 RBX: ffff9acaf7646038 RCX: 0000000000000800 > RDX: ffff9acb3fff3000 RSI: 0000000000000218 RDI: 00000000010c0000 > RBP: 0000000001080000 R08: ffffe81f83000040 R09: 0000000001100000 > R10: ffff9acb3fff6000 R11: 0000000000000246 R12: 0000000000080000 > R13: 0000000000000000 R14: ffffbe5086b53f08 R15: ffff9acaf7506f20 > FS: 00007fd7f20da8c0(0000) GS:ffff9acb3fc00000(0000) knlGS:000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > 
CR2: 00007fd7f20f2000 CR3: 0000000ff7ac2001 CR4: 00000000001606f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > memory_subsys_online+0x44/0xa0 > device_online+0x51/0x80 > store_mem_state+0x5e/0xe0 > kernfs_fop_write+0xfa/0x170 > __vfs_write+0x2e/0x150 > ? __inode_security_revalidate+0x47/0x60 > ? selinux_file_permission+0xd5/0x130 > ? _cond_resched+0x10/0x20 > vfs_write+0xa8/0x1a0 > ? find_vma+0x54/0x60 > SyS_write+0x4d/0xb0 > do_syscall_64+0x5d/0x110 > entry_SYSCALL_64_after_hwframe+0x21/0x86 > RIP: 0033:0x7fd7f0d3a840 > RSP: 002b:00007fff5db77c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7f0d3a840 > RDX: 0000000000000006 RSI: 00007fd7f20f2000 RDI: 0000000000000007 > RBP: 00007fd7f20f2000 R08: 000055db265c4ab0 R09: 00007fd7f20da8c0 > R10: 0000000000000006 R11: 0000000000000246 R12: 000055db265c49d0 > R13: 0000000000000006 R14: 000055db265c5510 R15: 000000000000000b > Code: fe ff ff 07 00 77 24 48 89 f8 48 c1 e8 17 49 8b 14 c2 48 85 d2 74 14 > 40 0f b6 c6 49 81 c0 00 00 20 00 48 c1 e0 04 48 01 d0 75 93 <0f> ff 31 c0 > c3 b8 01 00 00 00 c3 31 d2 48 c7 c7 b0 32 67 a6 31 > ---[ end trace 6203bc4f1a5d30e8 ]--- > > The problem is detected in: drivers/base/memory.c > static bool pages_correctly_reserved(unsigned long start_pfn) > 205 if (WARN_ON_ONCE(!pfn_valid(pfn))) > > This function loops through every section in the newly added memory block > and verifies that the first pfn is valid, meaning section exists, has > mapping (struct page array), and is online. 
> The block size on x86 is usually 128M, but when a machine is booted with
> more than 64G of memory, the block size is changed to 2G:
> $ cat /sys/devices/system/memory/block_size_bytes
> 80000000
>
> or
> $ dmesg | grep "block size"
> [    0.086469] x86/mm: Memory block size: 2048MB
>
> During memory hotplug and hotremove we verify that the range is section
> size aligned, but we actually must verify that it is block size aligned,
> because that is the proper unit for hotplug operations. See:
> Documentation/memory-hotplug.txt
>
> So, when the start_pfn of newly added memory is not block size aligned,
> we can get a memory block that has only part of its sections properly
> populated.
>
> In our case the start_pfn starts from the last_pfn (end of physical
> memory).
> $ dmesg | grep last_pfn
> [    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
>
> 0x1040000 == 65G, and so is not 2G aligned!
>
> The fix is to enforce that memory that is hotplugged and hotremoved is
> block size aligned.
>
> With this fix, running the above sequence yields the following result:
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
> size 0x80000000
> acpi PNP0C80:00: add_memory failed
> acpi PNP0C80:00: acpi_memory_enable_device() error
> acpi PNP0C80:00: Enumeration failure

The whole memblock != section_size thing sucks! It leads to corner cases
like the one you see. There is no real reason why we shouldn't be able to
online a 2G unaligned memory range. Failing for that purpose is just
wrong. The whole thing is just not well thought through and works only
for well configured cases.

Your patch doesn't address the underlying problem. On the other hand, it
is conceptually incorrect to check the memory section here because it is
not a hotplug unit, as you say, so I am OK with the patch regardless. It
deserves a big fat TODO to fix this properly at least. I am not sure why
we insist on the alignment in the first place. All we should care about
is a proper memory-section-based range. The code is crap and it assumes
pageblock-aligned starts at some places, but there shouldn't be anything
fundamental preventing a change.

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memory_hotplug.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b2bd52ff7605..565048f496f7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1083,15 +1083,16 @@ int try_online_node(int nid)
>  
>  static int check_hotplug_memory_range(u64 start, u64 size)
>  {
> -	u64 start_pfn = PFN_DOWN(start);
> +	unsigned long block_sz = memory_block_size_bytes();
> +	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
>  	u64 nr_pages = size >> PAGE_SHIFT;
> +	u64 start_pfn = PFN_DOWN(start);
>  
> -	/* Memory range must be aligned with section */
> -	if ((start_pfn & ~PAGE_SECTION_MASK) ||
> -	    (nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
> -		pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
> -			(unsigned long long)start,
> -			(unsigned long long)size);
> +	/* memory range must be block size aligned */
> +	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
> +	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
> +		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
> +		       block_sz, start, size);
>  		return -EINVAL;
>  	}
>
> -- 
> 2.16.1

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-15 11:34 ` Michal Hocko
@ 2018-02-15 13:36   ` Pavel Tatashin
  2018-02-15 14:40     ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Tatashin @ 2018-02-15 13:36 UTC (permalink / raw)
To: Michal Hocko
Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
    Linux Memory Management List, Linux Kernel Mailing List,
    Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao, Thomas Gleixner,
    mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe

Hi Michal,

Thank you very much for your reviews and for acking this patch.

> The whole memblock != section_size sucks! It leads to corner cases like
> you see. There is no real reason why we shouldn't be able to online a
> 2G unaligned memory range. Failing for that purpose is just wrong. The
> whole thing is just not well thought through and works only for well
> configured cases.

Hotplug operates over memory blocks, and it seems that conceptually
memory blocks are OK: their sizes are defined by the arch, and may
represent a pluggable dimm (on virtual machines it is a different story,
though). If we forced memory blocks to be equal to the section size, that
would force us to handle millions of memory devices in sysfs, which would
not scale well.

> Your patch doesn't address the underlying problem.

What is the underlying problem? The hotplug operation was allowed, but we
ended up with a half-populated memory block, which is broken. The patch
solves this problem by making sure that this is never the case for any
arch, no matter what block size is defined as the unit of hotplugging.

> On the other hand, it is incorrect to check memory section here
> conceptually because this is not a hotplug unit as you say so I am OK
> with the patch regardless. It deserves a big fat TODO to fix this
> properly at least. I am not sure why we insist on the alignment in the
> first place. All we should care about is the proper memory section
> based range. The code is crap and it assumes pageblock start aligned at
> some places but there shouldn't be anything fundamental to change that.

So, if I understand correctly, ideally you would like to redefine the
unit of memory hotplug to be equal to the section size?

>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you,
Pavel
* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-15 13:36 ` Pavel Tatashin
@ 2018-02-15 14:40   ` Michal Hocko
  2018-02-15 15:05     ` Pavel Tatashin
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2018-02-15 14:40 UTC (permalink / raw)
To: Pavel Tatashin
Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
    Linux Memory Management List, Linux Kernel Mailing List,
    Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao, Thomas Gleixner,
    mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe

On Thu 15-02-18 08:36:00, Pavel Tatashin wrote:
> Hi Michal,
>
> Thank you very much for your reviews and for acking this patch.
>
> > The whole memblock != section_size sucks! It leads to corner cases
> > like you see. There is no real reason why we shouldn't be able to
> > online a 2G unaligned memory range. Failing for that purpose is just
> > wrong. The whole thing is just not well thought through and works
> > only for well configured cases.
>
> Hotplug operates over memory blocks, and it seems that conceptually
> memory blocks are OK: their sizes are defined by the arch, and may
> represent a pluggable dimm (on virtual machines it is a different
> story, though). If we forced memory blocks to be equal to the section
> size, that would force us to handle millions of memory devices in
> sysfs, which would not scale well.

Yes, I am very well aware of the reason why the memblock is larger on
larger systems. I was merely ranting about the way it has been added to
the existing code.

> > Your patch doesn't address the underlying problem.
>
> What is the underlying problem? The hotplug operation was allowed, but
> we ended up with a half-populated memory block, which is broken. The
> patch solves this problem by making sure that this is never the case
> for any arch, no matter what block size is defined as the unit of
> hotplugging.

The underlying problem is that we require some alignment here. There
shouldn't be any reason to do so. The only requirement dictated by the
memory model is the size of the section.

> > On the other hand, it is incorrect to check memory section here
> > conceptually because this is not a hotplug unit as you say so I am OK
> > with the patch regardless. It deserves a big fat TODO to fix this
> > properly at least. I am not sure why we insist on the alignment in
> > the first place. All we should care about is the proper memory
> > section based range. The code is crap and it assumes pageblock start
> > aligned at some places but there shouldn't be anything fundamental to
> > change that.
>
> So, if I understand correctly, ideally you would like to redefine the
> unit of memory hotplug to be equal to the section size?

No, not really. I just think the alignment shouldn't really matter. Each
memory block should simply represent a hotpluggable entity with a well
defined pfn start and size (in multiples of the section size). This is
in fact what we do internally anyway. One problem might be that existing
userspace depends on the existing size restrictions, so we might not be
able to have variable block sizes. But the block size alignment should
be fixable.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-15 14:40 ` Michal Hocko
@ 2018-02-15 15:05   ` Pavel Tatashin
  0 siblings, 0 replies; 19+ messages in thread
From: Pavel Tatashin @ 2018-02-15 15:05 UTC (permalink / raw)
To: Michal Hocko
Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
    Linux Memory Management List, Linux Kernel Mailing List,
    Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao, Thomas Gleixner,
    mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe

> No, not really. I just think the alignment shouldn't really matter.
> Each memory block should simply represent a hotpluggable entity with a
> well defined pfn start and size (in multiples of the section size).
> This is in fact what we do internally anyway. One problem might be that
> existing userspace depends on the existing size restrictions, so we
> might not be able to have variable block sizes. But the block size
> alignment should be fixable.

Hi Michal,

I see what you mean, and I agree that Linux should simply honor
reasonable requests from HW/HV. On x86 qemu the hotpluggable entity is
128M, and on sun4v SPARC it is 256M; with the current scheme we would
still end up with a huge number of memory devices in sysfs if the block
size were fixed and equal to the minimum hotpluggable entity. Just as an
example, SPARC sun4v may have logical domains up to 32T; with 256M
granularity that is 131K files in /sys/devices/system/memory/!

But, if it is variable, I am not sure how to solve it. The whole
interface must be redefined. Even if we hotplugged a highly aligned
large chunk of memory and created only one memory device for it, we
should still have a way to remove just a small piece of that memory if
the underlying HV/HW requested it.

/sys/devices/system/memory/block_size_bytes
This would have to be moved into each memory block.

echo offline > /sys/devices/system/memory/memoryXXX/state
This would need to be redefined somehow to work on only part of the
block.

I am not really sure what a good solution would be without breaking
userspace.

Thank you,
Pavel
* [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory
  2018-02-13 19:31 [PATCH v3 0/4] optimize memory hotplug Pavel Tatashin
  2018-02-13 19:31 ` [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
@ 2018-02-13 19:31 ` Pavel Tatashin
  2018-02-15 11:37   ` Michal Hocko
  2018-02-13 19:31 ` [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin
  ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Pavel Tatashin @ 2018-02-13 19:31 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Memory sections are combined into "memory block" chunks. These chunks
are the units upon which memory can be added and removed.

On x86, new memory may be added after the end of the boot memory;
therefore, if the block size does not align with the end of boot memory,
memory hot-plugging/hot-removing can be broken.

Currently, whenever a machine is booted with more than 64G, the block
size is unconditionally increased to 2G from the base 128M in order to
reduce the number of memory devices in sysfs:
  /sys/devices/system/memory/memoryXXX

But we must use the largest allowed block size that aligns to the next
address in order to be able to hotplug the next block of memory.

So, when memory is larger than 64G, we check the end address and find
the largest block size that is still a power of two but smaller than or
equal to 2G.
Before the fix:
Run qemu with:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure

With the fix, memory is added successfully, as the block size is set to
1G, and therefore aligns with the start address 0x1040000000.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 1ab42c852069..f7dc80364397 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1326,14 +1326,39 @@ int kern_addr_valid(unsigned long addr)
 	return pfn_valid(pte_pfn(*pte));
 }
 
+/*
+ * Block size is the minimum quantum of memory which can be hot-plugged
+ * or hot-removed. It must be a power of two, and must be equal to or
+ * larger than MIN_MEMORY_BLOCK_SIZE.
+ */
+#define MAX_BLOCK_SIZE (2UL << 30)
+
+/* Amount of ram needed to start using large blocks */
+#define MEM_SIZE_FOR_LARGE_BLOCK (64UL << 30)
+
 static unsigned long probe_memory_block_size(void)
 {
-	unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
+	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
+	unsigned long bz;
 
-	/* if system is UV or has 64GB of RAM or more, use large blocks */
-	if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30)))
-		bz = 2UL << 30; /* 2GB */
+	/* If this is UV system, always set 2G block size */
+	if (is_uv_system()) {
+		bz = MAX_BLOCK_SIZE;
+		goto done;
+	}
 
+	/* Use regular block if RAM is smaller than MEM_SIZE_FOR_LARGE_BLOCK */
+	if (boot_mem_end < MEM_SIZE_FOR_LARGE_BLOCK) {
+		bz = MIN_MEMORY_BLOCK_SIZE;
+		goto done;
+	}
+
+	/* Find the largest allowed block size that aligns to memory end */
+	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
+		if (IS_ALIGNED(boot_mem_end, bz))
+			break;
+	}
+done:
 	pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20);
 
 	return bz;
-- 
2.16.1
* Re: [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory 2018-02-13 19:31 ` [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin @ 2018-02-15 11:37 ` Michal Hocko 2018-02-15 13:39 ` Pavel Tatashin 0 siblings, 1 reply; 19+ messages in thread From: Michal Hocko @ 2018-02-15 11:37 UTC (permalink / raw) To: Pavel Tatashin Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe On Tue 13-02-18 14:31:57, Pavel Tatashin wrote: > Memory sections are combined into "memory block" chunks. These chunks are > the units upon which memory can be added and removed. > > On x86, the new memory may be added after the end of the boot memory, > therefore, if block size does not align with end of boot memory, memory > hot-plugging/hot-removing can be broken. > > Currently, whenever machine is boot with more than 64G the block size > unconditionally increased to 2G from the base 128M in order to reduce > number of memory devices in sysfs: > /sys/devices/system/memory/memoryXXX > > But, we must use the largest allowed block size that aligns to the next > address to be able to hotplug the next block of memory. > > So, when memory is larger than 64G, we check the end address and find the > largest block size that is still power of two but smaller or equal to 2G. > > Before, the fix: > Run qemu with: > -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G > > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > Block size [0x80000000] unaligned hotplug range: start 0x1040000000, > size 0x80000000 > acpi PNP0C80:00: add_memory failed > acpi PNP0C80:00: acpi_memory_enable_device() error > acpi PNP0C80:00: Enumeration failure > > With the fix memory is added successfully, as the block size is set to 1G, > and therefore aligns with start address 0x1040000000. I dunno. 
If x86 maintainers are OK with this then why not, but I do not like how this is x86 specific. I would much rather address this by making the memblock user interface more sane. > Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> > --- > arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++---- > 1 file changed, 29 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c > index 1ab42c852069..f7dc80364397 100644 > --- a/arch/x86/mm/init_64.c > +++ b/arch/x86/mm/init_64.c > @@ -1326,14 +1326,39 @@ int kern_addr_valid(unsigned long addr) > return pfn_valid(pte_pfn(*pte)); > } > > +/* > + * Block size is the minimum quantum of memory which can be hot-plugged or > + * hot-removed. It must be power of two, and must be equal or larger than > + * MIN_MEMORY_BLOCK_SIZE. > + */ > +#define MAX_BLOCK_SIZE (2UL << 30) > + > +/* Amount of ram needed to start using large blocks */ > +#define MEM_SIZE_FOR_LARGE_BLOCK (64UL << 30) > + > static unsigned long probe_memory_block_size(void) > { > - unsigned long bz = MIN_MEMORY_BLOCK_SIZE; > + unsigned long boot_mem_end = max_pfn << PAGE_SHIFT; > + unsigned long bz; > > - /* if system is UV or has 64GB of RAM or more, use large blocks */ > - if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30))) > - bz = 2UL << 30; /* 2GB */ > + /* If this is UV system, always set 2G block size */ > + if (is_uv_system()) { > + bz = MAX_BLOCK_SIZE; > + goto done; > + } > > + /* Use regular block if RAM is smaller than MEM_SIZE_FOR_LARGE_BLOCK */ > + if (boot_mem_end < MEM_SIZE_FOR_LARGE_BLOCK) { > + bz = MIN_MEMORY_BLOCK_SIZE; > + goto done; > + } > + > + /* Find the largest allowed block size that aligns to memory end */ > + for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) { > + if (IS_ALIGNED(boot_mem_end, bz)) > + break; > + } > +done: > pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20); > > return bz; > -- > 2.16.1 > -- Michal Hocko SUSE Labs -- To unsubscribe, send a 
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory 2018-02-15 11:37 ` Michal Hocko @ 2018-02-15 13:39 ` Pavel Tatashin 2018-02-15 19:00 ` Ingo Molnar 0 siblings, 1 reply; 19+ messages in thread From: Pavel Tatashin @ 2018-02-15 13:39 UTC (permalink / raw) To: Michal Hocko Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman, Linux Memory Management List, Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe > I dunno. If x86 maintainers are OK with this then why not, but I do not > like how this is x86 specific. I would much rather address this by > making the memblock user interface more sane. > Hi Michal, Ingo Molnar reviewed this patch and okayed it. Thank you, Pavel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory 2018-02-15 13:39 ` Pavel Tatashin @ 2018-02-15 19:00 ` Ingo Molnar 0 siblings, 0 replies; 19+ messages in thread From: Ingo Molnar @ 2018-02-15 19:00 UTC (permalink / raw) To: Pavel Tatashin Cc: Michal Hocko, Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman, Linux Memory Management List, Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe * Pavel Tatashin <pasha.tatashin@oracle.com> wrote: > > I dunno. If x86 maintainers are OK with this then why not, but I do not > > like how this is x86 specific. I would much rather address this by > > making the memblock user interface more sane. > > > > Hi Michal, > > Ingo Molnar reviewed this patch, and Okayed it. But I'd not be against robustifying the whole generic interface against such misconfiguration either. But having the warning should be enough in practice, right? Thanks, Ingo ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking 2018-02-13 19:31 [PATCH v3 0/4] optimize memory hotplug Pavel Tatashin 2018-02-13 19:31 ` [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin 2018-02-13 19:31 ` [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin @ 2018-02-13 19:31 ` Pavel Tatashin 2018-02-15 11:53 ` Michal Hocko 2018-02-13 19:31 ` [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin 2018-02-13 21:53 ` [PATCH v3 0/4] " Andrew Morton 4 siblings, 1 reply; 19+ messages in thread From: Pavel Tatashin @ 2018-02-13 19:31 UTC (permalink / raw) To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe During boot we poison struct page memory in order to ensure that no one is accessing this memory until the struct pages are initialized in __init_single_page(). This patch adds more scrutiny to this checking, by making sure that flags do not equal to poison pattern when the are accessed. The pattern is all ones. Since, node id is also stored in struct page, and may be accessed quiet early we add the enforcement into page_to_nid() function as well. 
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> --- include/linux/mm.h | 4 +++- include/linux/page-flags.h | 22 +++++++++++++++++----- mm/memblock.c | 2 +- 3 files changed, 21 insertions(+), 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ad06d42adb1a..ad71136a6494 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -896,7 +896,9 @@ extern int page_to_nid(const struct page *page); #else static inline int page_to_nid(const struct page *page) { - return (page->flags >> NODES_PGSHIFT) & NODES_MASK; + struct page *p = (struct page *)page; + + return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK; } #endif diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 50c2b8786831..5d5493e1f7ba 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -156,9 +156,18 @@ static __always_inline int PageCompound(struct page *page) return test_bit(PG_head, &page->flags) || PageTail(page); } +#define PAGE_POISON_PATTERN ~0ul +static inline int PagePoisoned(const struct page *page) +{ + return page->flags == PAGE_POISON_PATTERN; +} + /* * Page flags policies wrt compound pages * + * PF_POISONED_CHECK + * check if this struct page poisoned/uninitialized + * * PF_ANY: * the page flag is relevant for small, head and tail pages. * @@ -176,17 +185,20 @@ static __always_inline int PageCompound(struct page *page) * PF_NO_COMPOUND: * the page flag is not relevant for compound pages. 
*/ -#define PF_ANY(page, enforce) page -#define PF_HEAD(page, enforce) compound_head(page) +#define PF_POISONED_CHECK(page) ({ \ + VM_BUG_ON_PGFLAGS(PagePoisoned(page), page); \ + page;}) +#define PF_ANY(page, enforce) PF_POISONED_CHECK(page) +#define PF_HEAD(page, enforce) PF_POISONED_CHECK(compound_head(page)) #define PF_ONLY_HEAD(page, enforce) ({ \ VM_BUG_ON_PGFLAGS(PageTail(page), page); \ - page;}) + PF_POISONED_CHECK(page);}) #define PF_NO_TAIL(page, enforce) ({ \ VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page); \ - compound_head(page);}) + PF_POISONED_CHECK(compound_head(page));}) #define PF_NO_COMPOUND(page, enforce) ({ \ VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page); \ - page;}) + PF_POISONED_CHECK(page);}) /* * Macros to create function definitions for page flags diff --git a/mm/memblock.c b/mm/memblock.c index 5a9ca2a1751b..d85c8754e0ce 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1373,7 +1373,7 @@ void * __init memblock_virt_alloc_try_nid_raw( min_addr, max_addr, nid); #ifdef CONFIG_DEBUG_VM if (ptr && size > 0) - memset(ptr, 0xff, size); + memset(ptr, PAGE_POISON_PATTERN, size); #endif return ptr; } -- 2.16.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking 2018-02-13 19:31 ` [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin @ 2018-02-15 11:53 ` Michal Hocko 2018-02-15 13:41 ` Pavel Tatashin 0 siblings, 1 reply; 19+ messages in thread From: Michal Hocko @ 2018-02-15 11:53 UTC (permalink / raw) To: Pavel Tatashin Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe On Tue 13-02-18 14:31:58, Pavel Tatashin wrote: > During boot we poison struct page memory in order to ensure that no one is > accessing this memory until the struct pages are initialized in > __init_single_page(). > > This patch adds more scrutiny to this checking, by making sure that flags > do not equal to poison pattern when the are accessed. The pattern is all s@the are@they are@ > ones. > > Since, node id is also stored in struct page, and may be accessed quiet s@quiet@quite@ > early we add the enforcement into page_to_nid() function as well. It would be worth adding that this applies only to NODE_NOT_IN_PAGE_FLAGS=n > Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Other than that it looks like a reasonable debugging feature. 
Acked-by: Michal Hocko <mhocko@suse.com> > --- > include/linux/mm.h | 4 +++- > include/linux/page-flags.h | 22 +++++++++++++++++----- > mm/memblock.c | 2 +- > 3 files changed, 21 insertions(+), 7 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ad06d42adb1a..ad71136a6494 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -896,7 +896,9 @@ extern int page_to_nid(const struct page *page); > #else > static inline int page_to_nid(const struct page *page) > { > - return (page->flags >> NODES_PGSHIFT) & NODES_MASK; > + struct page *p = (struct page *)page; > + > + return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK; > } > #endif > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 50c2b8786831..5d5493e1f7ba 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -156,9 +156,18 @@ static __always_inline int PageCompound(struct page *page) > return test_bit(PG_head, &page->flags) || PageTail(page); > } > > +#define PAGE_POISON_PATTERN ~0ul > +static inline int PagePoisoned(const struct page *page) > +{ > + return page->flags == PAGE_POISON_PATTERN; > +} > + > /* > * Page flags policies wrt compound pages > * > + * PF_POISONED_CHECK > + * check if this struct page poisoned/uninitialized > + * > * PF_ANY: > * the page flag is relevant for small, head and tail pages. > * > @@ -176,17 +185,20 @@ static __always_inline int PageCompound(struct page *page) > * PF_NO_COMPOUND: > * the page flag is not relevant for compound pages. 
> */ > -#define PF_ANY(page, enforce) page > -#define PF_HEAD(page, enforce) compound_head(page) > +#define PF_POISONED_CHECK(page) ({ \ > + VM_BUG_ON_PGFLAGS(PagePoisoned(page), page); \ > + page;}) > +#define PF_ANY(page, enforce) PF_POISONED_CHECK(page) > +#define PF_HEAD(page, enforce) PF_POISONED_CHECK(compound_head(page)) > #define PF_ONLY_HEAD(page, enforce) ({ \ > VM_BUG_ON_PGFLAGS(PageTail(page), page); \ > - page;}) > + PF_POISONED_CHECK(page);}) > #define PF_NO_TAIL(page, enforce) ({ \ > VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page); \ > - compound_head(page);}) > + PF_POISONED_CHECK(compound_head(page));}) > #define PF_NO_COMPOUND(page, enforce) ({ \ > VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page); \ > - page;}) > + PF_POISONED_CHECK(page);}) > > /* > * Macros to create function definitions for page flags > diff --git a/mm/memblock.c b/mm/memblock.c > index 5a9ca2a1751b..d85c8754e0ce 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1373,7 +1373,7 @@ void * __init memblock_virt_alloc_try_nid_raw( > min_addr, max_addr, nid); > #ifdef CONFIG_DEBUG_VM > if (ptr && size > 0) > - memset(ptr, 0xff, size); > + memset(ptr, PAGE_POISON_PATTERN, size); > #endif > return ptr; > } > -- > 2.16.1 > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking 2018-02-15 11:53 ` Michal Hocko @ 2018-02-15 13:41 ` Pavel Tatashin 0 siblings, 0 replies; 19+ messages in thread From: Pavel Tatashin @ 2018-02-15 13:41 UTC (permalink / raw) To: Michal Hocko Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman, Linux Memory Management List, Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe > Acked-by: Michal Hocko <mhocko@suse.com> Thank you, I will do the changes that you requested. Pavel ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug 2018-02-13 19:31 [PATCH v3 0/4] optimize memory hotplug Pavel Tatashin ` (2 preceding siblings ...) 2018-02-13 19:31 ` [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin @ 2018-02-13 19:31 ` Pavel Tatashin 2018-02-15 12:43 ` Michal Hocko 2018-02-13 21:53 ` [PATCH v3 0/4] " Andrew Morton 4 siblings, 1 reply; 19+ messages in thread From: Pavel Tatashin @ 2018-02-13 19:31 UTC (permalink / raw) To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe This patch was inspired by the discussion of this problem: http://lkml.kernel.org/r/20180130083006.GB1245@in.ibm.com Currently, during memory hotplugging we traverse struct pages several times: 1. memset(0) in sparse_add_one_section() 2. loop in __add_section() to set do: set_page_node(page, nid); and SetPageReserved(page); 3. loop in pages_correctly_reserved() to check that SetPageReserved is set. 4. loop in memmap_init_zone() to call __init_single_pfn() This patch removes loops 1, 2, and 3 and only leaves the loop 4, where all struct page fields are initialized in one go, the same as it is now done during boot. The benefits: - We improve the memory hotplug performance because we are not evicting cache several times and also reduce loop branching overheads. - Remove condition from hotpath in __init_single_pfn(), that was added in order to fix the problem that was reported by Bharata in the above email thread, thus also improve the performance during normal boot. - Make memory hotplug more similar to boot memory initialization path because we zero and initialize struct pages only in one function. 
- Simplifies memory hotplug strut page initialization code, and thus enables future improvements, such as multi-threading the initialization of struct pages in order to improve the hotplug performance even further on larger machines. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> --- drivers/base/memory.c | 38 +++++++++++++++++++++----------------- drivers/base/node.c | 17 ++++++++++------- include/linux/memory_hotplug.h | 2 ++ include/linux/node.h | 4 ++-- mm/memory_hotplug.c | 21 ++------------------- mm/page_alloc.c | 28 ++++++++++------------------ mm/sparse.c | 29 ++++++++++++++++++++++++++--- 7 files changed, 73 insertions(+), 66 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index fe4b24f05f6a..a14fb0cd424a 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -187,13 +187,14 @@ int memory_isolate_notify(unsigned long val, void *v) } /* - * The probe routines leave the pages reserved, just as the bootmem code does. - * Make sure they're still that way. + * The probe routines leave the pages uninitialized, just as the bootmem code + * does. Make sure we do not access them, but instead use only information from + * within sections. */ -static bool pages_correctly_reserved(unsigned long start_pfn) +static bool pages_correctly_probed(unsigned long start_pfn) { - int i, j; - struct page *page; + unsigned long section_nr = pfn_to_section_nr(start_pfn); + unsigned long section_nr_end = section_nr + sections_per_block; unsigned long pfn = start_pfn; /* @@ -201,21 +202,24 @@ static bool pages_correctly_reserved(unsigned long start_pfn) * SPARSEMEM_VMEMMAP. 
We lookup the page once per section * and assume memmap is contiguous within each section */ - for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) { + for (; section_nr < section_nr_end; section_nr++) { if (WARN_ON_ONCE(!pfn_valid(pfn))) return false; - page = pfn_to_page(pfn); - - for (j = 0; j < PAGES_PER_SECTION; j++) { - if (PageReserved(page + j)) - continue; - - printk(KERN_WARNING "section number %ld page number %d " - "not reserved, was it already online?\n", - pfn_to_section_nr(pfn), j); + if (!present_section_nr(section_nr)) { + pr_warn("section %ld pfn[%lx, %lx) not present", + section_nr, pfn, pfn + PAGES_PER_SECTION); + return false; + } else if (!valid_section_nr(section_nr)) { + pr_warn("section %ld pfn[%lx, %lx) no valid memmap", + section_nr, pfn, pfn + PAGES_PER_SECTION); + return false; + } else if (online_section_nr(section_nr)) { + pr_warn("section %ld pfn[%lx, %lx) is already online", + section_nr, pfn, pfn + PAGES_PER_SECTION); return false; } + pfn += PAGES_PER_SECTION; } return true; @@ -237,7 +241,7 @@ memory_block_action(unsigned long phys_index, unsigned long action, int online_t switch (action) { case MEM_ONLINE: - if (!pages_correctly_reserved(start_pfn)) + if (!pages_correctly_probed(start_pfn)) return -EBUSY; ret = online_pages(start_pfn, nr_pages, online_type); @@ -727,7 +731,7 @@ int register_new_memory(int nid, struct mem_section *section) } if (mem->section_count == sections_per_block) - ret = register_mem_sect_under_node(mem, nid); + ret = register_mem_sect_under_node(mem, nid, false); out: mutex_unlock(&mem_sysfs_mutex); return ret; diff --git a/drivers/base/node.c b/drivers/base/node.c index ee090ab9171c..f1c0c130ac88 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -397,7 +397,8 @@ static int __ref get_nid_for_pfn(unsigned long pfn) } /* register memory section under specified node if it spans that node */ -int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) +int 
register_mem_sect_under_node(struct memory_block *mem_blk, int nid, + bool check_nid) { int ret; unsigned long pfn, sect_start_pfn, sect_end_pfn; @@ -423,11 +424,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) continue; } - page_nid = get_nid_for_pfn(pfn); - if (page_nid < 0) - continue; - if (page_nid != nid) - continue; + if (check_nid) { + page_nid = get_nid_for_pfn(pfn); + if (page_nid < 0) + continue; + if (page_nid != nid) + continue; + } ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj, &mem_blk->dev.kobj, kobject_name(&mem_blk->dev.kobj)); @@ -502,7 +505,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) mem_blk = find_memory_block_hinted(mem_sect, mem_blk); - ret = register_mem_sect_under_node(mem_blk, nid); + ret = register_mem_sect_under_node(mem_blk, nid, true); if (!err) err = ret; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index aba5f86eb038..0f2c17bef9d6 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -234,6 +234,8 @@ void put_online_mems(void); void mem_hotplug_begin(void); void mem_hotplug_done(void); +int get_section_nid(unsigned long section_nr); + extern void set_zone_contiguous(struct zone *zone); extern void clear_zone_contiguous(struct zone *zone); diff --git a/include/linux/node.h b/include/linux/node.h index 4ece0fee0ffc..41f171861dcc 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -67,7 +67,7 @@ extern void unregister_one_node(int nid); extern int register_cpu_under_node(unsigned int cpu, unsigned int nid); extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid); extern int register_mem_sect_under_node(struct memory_block *mem_blk, - int nid); + int nid, bool check_nid); extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, unsigned long phys_index); @@ -97,7 +97,7 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) 
return 0; } static inline int register_mem_sect_under_node(struct memory_block *mem_blk, - int nid) + int nid, bool check_nid) { return 0; } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 565048f496f7..cf1041380ab7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -250,7 +250,6 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn, struct vmem_altmap *altmap, bool want_memblock) { int ret; - int i; if (pfn_valid(phys_start_pfn)) return -EEXIST; @@ -259,23 +258,6 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn, if (ret < 0) return ret; - /* - * Make all the pages reserved so that nobody will stumble over half - * initialized state. - * FIXME: We also have to associate it with a node because page_to_nid - * relies on having page with the proper node. - */ - for (i = 0; i < PAGES_PER_SECTION; i++) { - unsigned long pfn = phys_start_pfn + i; - struct page *page; - if (!pfn_valid(pfn)) - continue; - - page = pfn_to_page(pfn); - set_page_node(page, nid); - SetPageReserved(page); - } - if (!want_memblock) return 0; @@ -909,7 +891,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ int ret; struct memory_notify arg; - nid = pfn_to_nid(pfn); + nid = get_section_nid(pfn_to_section_nr(pfn)); + /* associate pfn range with the zone */ zone = move_pfn_range(online_type, nid, pfn, nr_pages); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 81e18ceef579..2667b35fca41 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1177,10 +1177,9 @@ static void free_one_page(struct zone *zone, } static void __meminit __init_single_page(struct page *page, unsigned long pfn, - unsigned long zone, int nid, bool zero) + unsigned long zone, int nid) { - if (zero) - mm_zero_struct_page(page); + mm_zero_struct_page(page); set_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); @@ -1194,12 +1193,6 @@ static void __meminit __init_single_page(struct page 
*page, unsigned long pfn, #endif } -static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, - int nid, bool zero) -{ - return __init_single_page(pfn_to_page(pfn), pfn, zone, nid, zero); -} - #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT static void __meminit init_reserved_page(unsigned long pfn) { @@ -1218,7 +1211,7 @@ static void __meminit init_reserved_page(unsigned long pfn) if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) break; } - __init_single_pfn(pfn, zid, nid, true); + __init_single_page(pfn_to_page(pfn), pfn, zid, nid); } #else static inline void init_reserved_page(unsigned long pfn) @@ -1535,7 +1528,7 @@ static unsigned long __init deferred_init_pages(int nid, int zid, } else { page++; } - __init_single_page(page, pfn, zid, nid, true); + __init_single_page(page, pfn, zid, nid); nr_pages++; } return (nr_pages); @@ -5328,6 +5321,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, pg_data_t *pgdat = NODE_DATA(nid); unsigned long pfn; unsigned long nr_initialised = 0; + struct page *page; #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP struct memblock_region *r = NULL, *tmp; #endif @@ -5389,6 +5383,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, #endif not_early: + page = pfn_to_page(pfn); + __init_single_page(page, pfn, zone, nid); + if (context == MEMMAP_HOTPLUG) + SetPageReserved(page); + /* * Mark the block movable so that blocks are reserved for * movable at startup. 
This will force kernel allocations @@ -5405,15 +5404,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, * because this is done early in sparse_add_one_section */ if (!(pfn & (pageblock_nr_pages - 1))) { - struct page *page = pfn_to_page(pfn); - - __init_single_page(page, pfn, zone, nid, - context != MEMMAP_HOTPLUG); set_pageblock_migratetype(page, MIGRATE_MOVABLE); cond_resched(); - } else { - __init_single_pfn(pfn, zone, nid, - context != MEMMAP_HOTPLUG); } } } diff --git a/mm/sparse.c b/mm/sparse.c index 7af5e7a92528..d7808307023b 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -30,11 +30,14 @@ struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] #endif EXPORT_SYMBOL(mem_section); -#ifdef NODE_NOT_IN_PAGE_FLAGS +#if defined(NODE_NOT_IN_PAGE_FLAGS) || defined(CONFIG_MEMORY_HOTPLUG) /* * If we did not store the node number in the page then we have to * do a lookup in the section_to_node_table in order to find which * node the page belongs to. + * + * We also use this data in case memory hotplugging is enabled to be + * able to determine nid while struct pages are not yet initialized. */ #if MAX_NUMNODES <= 256 static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned; @@ -42,17 +45,28 @@ static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned; static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned; #endif +#ifdef NODE_NOT_IN_PAGE_FLAGS int page_to_nid(const struct page *page) { return section_to_node_table[page_to_section(page)]; } EXPORT_SYMBOL(page_to_nid); +#endif /* NODE_NOT_IN_PAGE_FLAGS */ static void set_section_nid(unsigned long section_nr, int nid) { section_to_node_table[section_nr] = nid; } -#else /* !NODE_NOT_IN_PAGE_FLAGS */ + +/* Return NID for given section number */ +int get_section_nid(unsigned long section_nr) +{ + if (WARN_ON(section_nr >= NR_MEM_SECTIONS)) + return 0; + return section_to_node_table[section_nr]; +} +EXPORT_SYMBOL(get_section_nid); +#else /* ! 
(NODE_NOT_IN_PAGE_FLAGS || CONFIG_MEMORY_HOTPLUG) */ static inline void set_section_nid(unsigned long section_nr, int nid) { } @@ -816,7 +830,14 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, goto out; } - memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION); +#ifdef CONFIG_DEBUG_VM + /* + * poison uninitialized struct pages in order to catch invalid flags + * combinations. + */ + memset(memmap, PAGE_POISON_PATTERN, + sizeof(struct page) * PAGES_PER_SECTION); +#endif section_mark_present(ms); @@ -827,6 +848,8 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat, if (ret <= 0) { kfree(usemap); __kfree_section_memmap(memmap, altmap); + } else { + set_section_nid(section_nr, pgdat->node_id); } return ret; } -- 2.16.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug 2018-02-13 19:31 ` [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin @ 2018-02-15 12:43 ` Michal Hocko 2018-02-15 13:46 ` Pavel Tatashin 0 siblings, 1 reply; 19+ messages in thread From: Michal Hocko @ 2018-02-15 12:43 UTC (permalink / raw) To: Pavel Tatashin Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe On Tue 13-02-18 14:31:59, Pavel Tatashin wrote: [...] > @@ -201,21 +202,24 @@ static bool pages_correctly_reserved(unsigned long start_pfn) > * SPARSEMEM_VMEMMAP. We lookup the page once per section > * and assume memmap is contiguous within each section > */ > - for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) { > + for (; section_nr < section_nr_end; section_nr++) { > if (WARN_ON_ONCE(!pfn_valid(pfn))) > return false; > - page = pfn_to_page(pfn); > - > - for (j = 0; j < PAGES_PER_SECTION; j++) { > - if (PageReserved(page + j)) > - continue; > - > - printk(KERN_WARNING "section number %ld page number %d " > - "not reserved, was it already online?\n", > - pfn_to_section_nr(pfn), j); > > + if (!present_section_nr(section_nr)) { > + pr_warn("section %ld pfn[%lx, %lx) not present", > + section_nr, pfn, pfn + PAGES_PER_SECTION); > + return false; > + } else if (!valid_section_nr(section_nr)) { > + pr_warn("section %ld pfn[%lx, %lx) no valid memmap", > + section_nr, pfn, pfn + PAGES_PER_SECTION); > + return false; > + } else if (online_section_nr(section_nr)) { > + pr_warn("section %ld pfn[%lx, %lx) is already online", > + section_nr, pfn, pfn + PAGES_PER_SECTION); > return false; > } > + pfn += PAGES_PER_SECTION; > } This should be a separate patch IMHO. It is an optimization on its own. The original code tries to be sparse neutral but we do depend on sparse anyway. [...] 
> /* register memory section under specified node if it spans that node */ > -int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) > +int register_mem_sect_under_node(struct memory_block *mem_blk, int nid, > + bool check_nid) This check_nid begs for a documentation. When do we need to set it? I can see that register_new_memory path doesn't check node id. It is quite reasonable to expect that a new memblock doesn't span multiple numa nodes which can be the case for register_one_node but a word or two are really due. > { > int ret; > unsigned long pfn, sect_start_pfn, sect_end_pfn; > @@ -423,11 +424,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) > continue; > } > > - page_nid = get_nid_for_pfn(pfn); > - if (page_nid < 0) > - continue; > - if (page_nid != nid) > - continue; > + if (check_nid) { > + page_nid = get_nid_for_pfn(pfn); > + if (page_nid < 0) > + continue; > + if (page_nid != nid) > + continue; > + } > ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj, > &mem_blk->dev.kobj, > kobject_name(&mem_blk->dev.kobj)); > @@ -502,7 +505,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) > > mem_blk = find_memory_block_hinted(mem_sect, mem_blk); > > - ret = register_mem_sect_under_node(mem_blk, nid); > + ret = register_mem_sect_under_node(mem_blk, nid, true); > if (!err) > err = ret; > I would be tempted to split this into a separate patch as well. The review will be much easier. [...] 
> diff --git a/mm/sparse.c b/mm/sparse.c > index 7af5e7a92528..d7808307023b 100644 > --- a/mm/sparse.c > +++ b/mm/sparse.c > @@ -30,11 +30,14 @@ struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT] > #endif > EXPORT_SYMBOL(mem_section); > > -#ifdef NODE_NOT_IN_PAGE_FLAGS > +#if defined(NODE_NOT_IN_PAGE_FLAGS) || defined(CONFIG_MEMORY_HOTPLUG) > /* > * If we did not store the node number in the page then we have to > * do a lookup in the section_to_node_table in order to find which > * node the page belongs to. > + * > + * We also use this data in case memory hotplugging is enabled to be > + * able to determine nid while struct pages are not yet initialized. > */ > #if MAX_NUMNODES <= 256 > static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned; > @@ -42,17 +45,28 @@ static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned; > static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned; > #endif > > +#ifdef NODE_NOT_IN_PAGE_FLAGS > int page_to_nid(const struct page *page) > { > return section_to_node_table[page_to_section(page)]; > } > EXPORT_SYMBOL(page_to_nid); > +#endif /* NODE_NOT_IN_PAGE_FLAGS */ This is quite ugly. You allocate 256MB for small numa systems and 512MB for larger NUMAs unconditionally for MEMORY_HOTPLUG. I see you need it to safely replace page_to_nid by get_section_nid but this is just too high of the price. Please note that this shouldn't be really needed. At least not for onlining. We already _do_ know the node association with the pfn range. So we should be able to get the nid from memblock. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug 2018-02-15 12:43 ` Michal Hocko @ 2018-02-15 13:46 ` Pavel Tatashin 0 siblings, 0 replies; 19+ messages in thread From: Pavel Tatashin @ 2018-02-15 13:46 UTC (permalink / raw) To: Michal Hocko Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman, Linux Memory Management List, Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams, kirill.shutemov, bhe > This should be a separate patch IMHO. It is an optimization on its > own. The original code tries to be sparse neutral but we do depend on > sparse anyway. Yes, Mingo already asked me to split this patch. I've done just that and will send it out soon. > > [...] >> /* register memory section under specified node if it spans that node */ >> -int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) >> +int register_mem_sect_under_node(struct memory_block *mem_blk, int nid, >> + bool check_nid) > > This check_nid begs for a documentation. When do we need to set it? I > can see that register_new_memory path doesn't check node id. It is quite > reasonable to expect that a new memblock doesn't span multiple numa > nodes which can be the case for register_one_node but a word or two are > really due. OK, I will add a comment, and BTW, this is also going to be a separate patch for ease of review. 
>> {
>> 	int ret;
>> 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
>> @@ -423,11 +424,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>> 			continue;
>> 		}
>>
>> -		page_nid = get_nid_for_pfn(pfn);
>> -		if (page_nid < 0)
>> -			continue;
>> -		if (page_nid != nid)
>> -			continue;
>> +		if (check_nid) {
>> +			page_nid = get_nid_for_pfn(pfn);
>> +			if (page_nid < 0)
>> +				continue;
>> +			if (page_nid != nid)
>> +				continue;
>> +		}
>> 		ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
>> 					       &mem_blk->dev.kobj,
>> 					       kobject_name(&mem_blk->dev.kobj));
>> @@ -502,7 +505,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
>>
>> 	mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
>>
>> -	ret = register_mem_sect_under_node(mem_blk, nid);
>> +	ret = register_mem_sect_under_node(mem_blk, nid, true);
>> 	if (!err)
>> 		err = ret;
>>
>
> I would be tempted to split this into a separate patch as well. The
> review will be much easier.

Yes, but that would be the last patch in the series.

> This is quite ugly. You allocate 256MB for small numa systems and 512MB
> for larger NUMAs unconditionally for MEMORY_HOTPLUG. I see you need it
> to safely replace page_to_nid by get_section_nid, but this is just too
> high a price. Please note that this shouldn't really be needed. At
> least not for onlining. We already _do_ know the node association with
> the pfn range, so we should be able to get the nid from memblock.

OK, I will think of a different place to store the nid temporarily, or
of another way to get it.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH v3 0/4] optimize memory hotplug
  2018-02-13 19:31 [PATCH v3 0/4] optimize memory hotplug Pavel Tatashin
                   ` (3 preceding siblings ...)
  2018-02-13 19:31 ` [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
@ 2018-02-13 21:53 ` Andrew Morton
  2018-02-14  8:09   ` Ingo Molnar
  4 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2018-02-13 21:53 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Tue, 13 Feb 2018 14:31:55 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> This patchset:
> - Improves hotplug performance by eliminating a number of
>   struct page traverses during memory hotplug.
>
> - Fixes some issues with hotplugging, where boundaries
>   were not properly checked. And on x86 the block size was not
>   properly aligned with the end of memory.
>
> - Also, potentially improves boot performance by eliminating a
>   condition from __init_single_page().
>
> - Adds robustness by verifying that struct pages are correctly
>   poisoned when flags are accessed.

I'm now attempting to get a 100% review rate on MM patches, which is
why I started adding my Reviewed-by: when I do that thing.

I'm not familiar enough with this code to add my own Reviewed-by:, and
we'll need to figure out what to do in such cases. I shall be sending
out periodic review-status summaries.

If you're able to identify a suitable reviewer for this work and to
offer them beer, that would help. Let's see what happens as the weeks
unfold.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH v3 0/4] optimize memory hotplug
  2018-02-13 21:53 ` [PATCH v3 0/4] " Andrew Morton
@ 2018-02-14  8:09   ` Ingo Molnar
  2018-02-14 14:14     ` Pavel Tatashin
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2018-02-14 8:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Tatashin, steven.sistare, daniel.m.jordan, mgorman, mhocko,
	linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo,
	hpa, x86, dan.j.williams, kirill.shutemov, bhe

* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 13 Feb 2018 14:31:55 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:
>
> > This patchset:
> > - Improves hotplug performance by eliminating a number of
> >   struct page traverses during memory hotplug.
> >
> > - Fixes some issues with hotplugging, where boundaries
> >   were not properly checked. And on x86 the block size was not
> >   properly aligned with the end of memory.
> >
> > - Also, potentially improves boot performance by eliminating a
> >   condition from __init_single_page().
> >
> > - Adds robustness by verifying that struct pages are correctly
> >   poisoned when flags are accessed.
>
> I'm now attempting to get a 100% review rate on MM patches, which is
> why I started adding my Reviewed-by: when I do that thing.
>
> I'm not familiar enough with this code to add my own Reviewed-by:, and
> we'll need to figure out what to do in such cases. I shall be sending
> out periodic review-status summaries.
>
> If you're able to identify a suitable reviewer for this work and to
> offer them beer, that would help. Let's see what happens as the weeks
> unfold.

The largest patch, the fix in patch #2, looks good to me and fixes a
real bug. Patches #1 and #3 also look good to me (assuming the runtime
overhead added by patch #3 is OK with you):

  Reviewed-by: Ingo Molnar <mingo@kernel.org>

(I suspect patch #1 and patch #2 should also get a Cc: stable.)

Patch #4 is too large to review IMO: it should be split up into as many
patches as practically possible. That will also help bisectability,
should anything break.

Before applying these patches please fix the changelog and code comment
spelling.

But it's all good stuff AFAICS!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH v3 0/4] optimize memory hotplug
  2018-02-14  8:09 ` Ingo Molnar
@ 2018-02-14 14:14   ` Pavel Tatashin
  0 siblings, 0 replies; 19+ messages in thread
From: Pavel Tatashin @ 2018-02-14 14:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Steve Sistare, Daniel Jordan, Mel Gorman,
	Michal Hocko, Linux Memory Management List,
	Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka,
	Bharata B Rao, Thomas Gleixner, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Hi Ingo,

Thank you very much for your review. I will address the spelling
issues, and will also try to split patch #4.

Regarding the runtime concern for patch #3: the extra checking is only
performed when both of the following CONFIGs are enabled:

CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y

I do not expect either of these to ever be enabled on production
systems.

Thank you,
Pavel

On Wed, Feb 14, 2018 at 3:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Tue, 13 Feb 2018 14:31:55 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:
>>
>> > This patchset:
>> > - Improves hotplug performance by eliminating a number of
>> >   struct page traverses during memory hotplug.
>> >
>> > - Fixes some issues with hotplugging, where boundaries
>> >   were not properly checked. And on x86 the block size was not
>> >   properly aligned with the end of memory.
>> >
>> > - Also, potentially improves boot performance by eliminating a
>> >   condition from __init_single_page().
>> >
>> > - Adds robustness by verifying that struct pages are correctly
>> >   poisoned when flags are accessed.
>>
>> I'm now attempting to get a 100% review rate on MM patches, which is
>> why I started adding my Reviewed-by: when I do that thing.
>>
>> I'm not familiar enough with this code to add my own Reviewed-by:, and
>> we'll need to figure out what to do in such cases. I shall be sending
>> out periodic review-status summaries.
>>
>> If you're able to identify a suitable reviewer for this work and to
>> offer them beer, that would help. Let's see what happens as the weeks
>> unfold.
>
> The largest patch, the fix in patch #2, looks good to me and fixes a
> real bug. Patches #1 and #3 also look good to me (assuming the runtime
> overhead added by patch #3 is OK with you):
>
> Reviewed-by: Ingo Molnar <mingo@kernel.org>
>
> (I suspect patch #1 and patch #2 should also get a Cc: stable.)
>
> Patch #4 is too large to review IMO: it should be split up into as many
> patches as practically possible. That will also help bisectability,
> should anything break.
>
> Before applying these patches please fix the changelog and code comment
> spelling.
>
> But it's all good stuff AFAICS!
>
> Thanks,
>
> Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2018-02-15 19:00 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
2018-02-13 19:31 [PATCH v3 0/4] optimize memory hotplug Pavel Tatashin
2018-02-13 19:31 ` [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
2018-02-15 11:34   ` Michal Hocko
2018-02-15 13:36     ` Pavel Tatashin
2018-02-15 14:40       ` Michal Hocko
2018-02-15 15:05         ` Pavel Tatashin
2018-02-13 19:31 ` [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin
2018-02-15 11:37   ` Michal Hocko
2018-02-15 13:39     ` Pavel Tatashin
2018-02-15 19:00       ` Ingo Molnar
2018-02-13 19:31 ` [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin
2018-02-15 11:53   ` Michal Hocko
2018-02-15 13:41     ` Pavel Tatashin
2018-02-13 19:31 ` [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
2018-02-15 12:43   ` Michal Hocko
2018-02-15 13:46     ` Pavel Tatashin
2018-02-13 21:53 ` [PATCH v3 0/4] " Andrew Morton
2018-02-14  8:09   ` Ingo Molnar
2018-02-14 14:14     ` Pavel Tatashin