* collision between ZONE_MOVABLE and memblock allocations @ 2023-07-18 22:01 Ross Zwisler 2023-07-19 5:44 ` Mike Rapoport 2023-07-19 6:14 ` Michal Hocko 0 siblings, 2 replies; 23+ messages in thread From: Ross Zwisler @ 2023-07-18 22:01 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Michal Hocko, Vlastimil Babka, David Hildenbrand Hello, I've been trying to use the 'movablecore=' kernel command line option to create a ZONE_MOVABLE memory zone on my x86_64 systems, and have noticed that offlining the resulting ZONE_MOVABLE area consistently fails in my setups because that zone contains unmovable pages. My testing has been in an x86_64 QEMU VM with a single NUMA node and 4G, 8G or 16G of memory, all of which fail 100% of the time. Digging into it a bit, these unmovable pages are Reserved pages which were allocated in early boot as part of the memblock allocator. Many of these allocations are for data structures for the SPARSEMEM memory model, including 'struct mem_section' objects. 
These memblock allocations can be tracked by setting the 'memblock=debug' kernel command line parameter, and are marked as reserved in: memmap_init_reserved_pages() reserve_bootmem_region() With the command line params 'movablecore=256M memblock=debug' and a v6.5.0-rc2 kernel I get the following on my 4G system: # lsmem --split ZONES --output-all RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable Memory block size: 128M Total online memory: 4G Total offline memory: 0B And when I try to offline memory block 39, I get: # echo 0 > /sys/devices/system/memory/memory39/online bash: echo: write error: Device or resource busy with dmesg saying: [ 57.439849] page:0000000076a3e320 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ff00 [ 57.444073] flags: 0x1fffffc0001000(reserved|node=0|zone=3|lastcpupid=0x1fffff) [ 57.447301] page_type: 0xffffffff() [ 57.448754] raw: 001fffffc0001000 ffffdd6384ffc008 ffffdd6384ffc008 0000000000000000 [ 57.450383] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 [ 57.452011] page dumped because: unmovable page Looking back at the memblock allocations, I can see that the physical address for pfn:0x13ff00 was used in a memblock allocation: [ 0.395180] memblock_reserve: [0x000000013ff00000-0x000000013ffbffff] memblock_alloc_range_nid+0xe0/0x150 The full dmesg output can be found here: https://pastebin.com/cNztqa4u The 'movablecore=' command line parameter is handled in 'find_zone_movable_pfns_for_nodes()', which decides where ZONE_MOVABLE should start and end. Currently ZONE_MOVABLE is always located at the end of a NUMA node. 
The issue is that the memblock allocator and the processing of the movablecore= command line parameter don't know about one another, and in my x86_64 testing they both always use memory at the end of the NUMA node and have collisions. From several comments in the code I believe that this is a known issue: https://elixir.bootlin.com/linux/v6.5-rc2/source/mm/page_isolation.c#L59 /* * Both, bootmem allocations and memory holes are marked * PG_reserved and are unmovable. We can even have unmovable * allocations inside ZONE_MOVABLE, for example when * specifying "movablecore". */ https://elixir.bootlin.com/linux/v6.5-rc2/source/include/linux/mmzone.h#L765 * 2. memblock allocations: kernelcore/movablecore setups might create * situations where ZONE_MOVABLE contains unmovable allocations * after boot. Memory offlining and allocations fail early. We check for these unmovable pages by scanning for 'PageReserved()' in the area we are trying to offline, which happens in has_unmovable_pages(). Interestingly, the boot timing works out like this: 1. Allocate memblock areas to set up the SPARSEMEM model. [ 0.369990] Call Trace: [ 0.370404] <TASK> [ 0.370759] ? dump_stack_lvl+0x43/0x60 [ 0.371410] ? sparse_init_nid+0x2dc/0x560 [ 0.372116] ? sparse_init+0x346/0x450 [ 0.372755] ? paging_init+0xa/0x20 [ 0.373349] ? setup_arch+0xa6a/0xfc0 [ 0.373970] ? slab_is_available+0x5/0x20 [ 0.374651] ? start_kernel+0x5e/0x770 [ 0.375290] ? x86_64_start_reservations+0x14/0x30 [ 0.376109] ? x86_64_start_kernel+0x71/0x80 [ 0.376835] ? secondary_startup_64_no_verify+0x167/0x16b [ 0.377755] </TASK> 2. Process movablecore= kernel command line parameter and set up memory zones [ 0.489382] Call Trace: [ 0.489818] <TASK> [ 0.490187] ? dump_stack_lvl+0x43/0x60 [ 0.490873] ? free_area_init+0x115/0xc80 [ 0.491588] ? __printk_cpu_sync_put+0x5/0x30 [ 0.492354] ? dump_stack_lvl+0x48/0x60 [ 0.493002] ? sparse_init_nid+0x2dc/0x560 [ 0.493697] ? zone_sizes_init+0x60/0x80 [ 0.494361] ? 
setup_arch+0xa6a/0xfc0 [ 0.494981] ? slab_is_available+0x5/0x20 [ 0.495674] ? start_kernel+0x5e/0x770 [ 0.496312] ? x86_64_start_reservations+0x14/0x30 [ 0.497123] ? x86_64_start_kernel+0x71/0x80 [ 0.497847] ? secondary_startup_64_no_verify+0x167/0x16b [ 0.498768] </TASK> 3. Mark memblock areas as Reserved. [ 0.761136] Call Trace: [ 0.761534] <TASK> [ 0.761876] dump_stack_lvl+0x43/0x60 [ 0.762474] reserve_bootmem_region+0x1e/0x170 [ 0.763201] memblock_free_all+0xe3/0x250 [ 0.763862] ? swiotlb_init_io_tlb_mem.constprop.0+0x11a/0x130 [ 0.764812] ? swiotlb_init_remap+0x195/0x2c0 [ 0.765519] mem_init+0x19/0x1b0 [ 0.766047] mm_core_init+0x9c/0x3d0 [ 0.766630] start_kernel+0x264/0x770 [ 0.767229] x86_64_start_reservations+0x14/0x30 [ 0.767987] x86_64_start_kernel+0x71/0x80 [ 0.768666] secondary_startup_64_no_verify+0x167/0x16b [ 0.769534] </TASK> So, during ZONE_MOVABLE setup we currently can't do the same has_unmovable_pages() scan looking for PageReserved() to check for overlap because the pages have not yet been marked as Reserved. I do think that we need to fix this collision between ZONE_MOVABLE and memmap allocations, because this issue essentially makes the movablecore= kernel command line parameter useless in many cases, as the ZONE_MOVABLE region it creates will often actually be unmovable. Here are the options I currently see for resolution: 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from the beginning of the NUMA node instead of the end. This should fix my use case, but again is prone to breakage in other configurations (# of NUMA nodes, other architectures) where ZONE_MOVABLE and memblock allocations might overlap. I think that this should be relatively straightforward and low risk, though. 2. Make the code which processes the movablecore= command line option aware of the memblock allocations, and have it choose a region for ZONE_MOVABLE which does not have these allocations. 
This might be done by checking for PageReserved() as we do with offlining memory, though that will take some boot time reordering, or we'll have to figure out the overlap in another way. This may also result in us having two ZONE_NORMAL zones for a given NUMA node, with a ZONE_MOVABLE section in between them. I'm not sure if this is allowed? If we can get it working, this seems like the most correct solution to me, but also the most difficult and risky because it involves significant changes in the code for memory setup at early boot. Am I missing anything, are there other solutions we should consider, or do you have an opinion on which solution we should pursue? Thanks, - Ross ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-18 22:01 collision between ZONE_MOVABLE and memblock allocations Ross Zwisler @ 2023-07-19 5:44 ` Mike Rapoport 2023-07-19 22:26 ` Ross Zwisler 2023-07-19 6:14 ` Michal Hocko 1 sibling, 1 reply; 23+ messages in thread From: Mike Rapoport @ 2023-07-19 5:44 UTC (permalink / raw) To: Ross Zwisler Cc: linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Michal Hocko, Vlastimil Babka, David Hildenbrand Hi, On Tue, Jul 18, 2023 at 04:01:06PM -0600, Ross Zwisler wrote: > Hello, > > I've been trying to use the 'movablecore=' kernel command line option to create > a ZONE_MOVABLE memory zone on my x86_64 systems, and have noticed that > offlining the resulting ZONE_MOVABLE area consistently fails in my setups > because that zone contains unmovable pages. My testing has been in a x86_64 > QEMU VM with a single NUMA node and 4G, 8G or 16G of memory, all of which fail > 100% of the time. > > Digging into it a bit, these unmovable pages are Reserved pages which were > allocated in early boot as part of the memblock allocator. Many of these > allocations are for data structures for the SPARSEMEM memory model, including > 'struct mem_section' objects. 
These memblock allocations can be tracked by > setting the 'memblock=debug' kernel command line parameter, and are marked as > reserved in: > > memmap_init_reserved_pages() > reserve_bootmem_region() > > With the command line params 'movablecore=256M memblock=debug' and a v6.5.0-rc2 > kernel I get the following on my 4G system: > > # lsmem --split ZONES --output-all > RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES > 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None > 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 > 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal > 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable > > Memory block size: 128M > Total online memory: 4G > Total offline memory: 0B > > And when I try to offline memory block 39, I get: > > # echo 0 > /sys/devices/system/memory/memory39/online > bash: echo: write error: Device or resource busy > > with dmesg saying: > > [ 57.439849] page:0000000076a3e320 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ff00 > [ 57.444073] flags: 0x1fffffc0001000(reserved|node=0|zone=3|lastcpupid=0x1fffff) > [ 57.447301] page_type: 0xffffffff() > [ 57.448754] raw: 001fffffc0001000 ffffdd6384ffc008 ffffdd6384ffc008 0000000000000000 > [ 57.450383] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > [ 57.452011] page dumped because: unmovable page > > Looking back at the memblock allocations, I can see that the physical address for > pfn:0x13ff00 was used in a memblock allocation: > > [ 0.395180] memblock_reserve: [0x000000013ff00000-0x000000013ffbffff] memblock_alloc_range_nid+0xe0/0x150 > > The full dmesg output can be found here: https://pastebin.com/cNztqa4u > > The 'movablecore=' command line parameter is handled in > 'find_zone_movable_pfns_for_nodes()', which decides where ZONE_MOVABLE should > start and end. Currently ZONE_MOVABLE is always located at the end of a NUMA > node. 
> > The issue is that the memblock allocator and the processing of the movablecore= > command line parameter don't know about one another, and in my x86_64 testing > they both always use memory at the end of the NUMA node and have collisions. > > From several comments in the code I believe that this is a known issue: > > https://elixir.bootlin.com/linux/v6.5-rc2/source/mm/page_isolation.c#L59 > /* > * Both, bootmem allocations and memory holes are marked > * PG_reserved and are unmovable. We can even have unmovable > * allocations inside ZONE_MOVABLE, for example when > * specifying "movablecore". > */ > > https://elixir.bootlin.com/linux/v6.5-rc2/source/include/linux/mmzone.h#L765 > * 2. memblock allocations: kernelcore/movablecore setups might create > * situations where ZONE_MOVABLE contains unmovable allocations > * after boot. Memory offlining and allocations fail early. > > We check for these unmovable pages by scanning for 'PageReserved()' in the area > we are trying to offline, which happens in has_unmovable_pages(). > > Interestingly, the boot timing works out like this: > > 1. Allocate memblock areas to set up the SPARSEMEM model. > [ 0.369990] Call Trace: > [ 0.370404] <TASK> > [ 0.370759] ? dump_stack_lvl+0x43/0x60 > [ 0.371410] ? sparse_init_nid+0x2dc/0x560 > [ 0.372116] ? sparse_init+0x346/0x450 > [ 0.372755] ? paging_init+0xa/0x20 > [ 0.373349] ? setup_arch+0xa6a/0xfc0 > [ 0.373970] ? slab_is_available+0x5/0x20 > [ 0.374651] ? start_kernel+0x5e/0x770 > [ 0.375290] ? x86_64_start_reservations+0x14/0x30 > [ 0.376109] ? x86_64_start_kernel+0x71/0x80 > [ 0.376835] ? secondary_startup_64_no_verify+0x167/0x16b > [ 0.377755] </TASK> > > 2. Process movablecore= kernel command line parameter and set up memory zones > [ 0.489382] Call Trace: > [ 0.489818] <TASK> > [ 0.490187] ? dump_stack_lvl+0x43/0x60 > [ 0.490873] ? free_area_init+0x115/0xc80 > [ 0.491588] ? __printk_cpu_sync_put+0x5/0x30 > [ 0.492354] ? dump_stack_lvl+0x48/0x60 > [ 0.493002] ? 
sparse_init_nid+0x2dc/0x560 > [ 0.493697] ? zone_sizes_init+0x60/0x80 > [ 0.494361] ? setup_arch+0xa6a/0xfc0 > [ 0.494981] ? slab_is_available+0x5/0x20 > [ 0.495674] ? start_kernel+0x5e/0x770 > [ 0.496312] ? x86_64_start_reservations+0x14/0x30 > [ 0.497123] ? x86_64_start_kernel+0x71/0x80 > [ 0.497847] ? secondary_startup_64_no_verify+0x167/0x16b > [ 0.498768] </TASK> > > 3. Mark memblock areas as Reserved. > [ 0.761136] Call Trace: > [ 0.761534] <TASK> > [ 0.761876] dump_stack_lvl+0x43/0x60 > [ 0.762474] reserve_bootmem_region+0x1e/0x170 > [ 0.763201] memblock_free_all+0xe3/0x250 > [ 0.763862] ? swiotlb_init_io_tlb_mem.constprop.0+0x11a/0x130 > [ 0.764812] ? swiotlb_init_remap+0x195/0x2c0 > [ 0.765519] mem_init+0x19/0x1b0 > [ 0.766047] mm_core_init+0x9c/0x3d0 > [ 0.766630] start_kernel+0x264/0x770 > [ 0.767229] x86_64_start_reservations+0x14/0x30 > [ 0.767987] x86_64_start_kernel+0x71/0x80 > [ 0.768666] secondary_startup_64_no_verify+0x167/0x16b > [ 0.769534] </TASK> > > So, during ZONE_MOVABLE setup we currently can't do the same > has_unmovable_pages() scan looking for PageReserved() to check for overlap > because the pages have not yet been marked as Reserved. > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > allocations, because this issue essentially makes the movablecore= kernel > command line parameter useless in many cases, as the ZONE_MOVABLE region it > creates will often actually be unmovable. > > Here are the options I currently see for resolution: > > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from > the beginning of the NUMA node instead of the end. This should fix my use case, > but again is prone to breakage in other configurations (# of NUMA nodes, other > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I > think that this should be relatively straightforward and low risk, though. > > 2. 
Make the code which processes the movablecore= command line option aware of > the memblock allocations, and have it choose a region for ZONE_MOVABLE which > does not have these allocations. This might be done by checking for > PageReserved() as we do with offlining memory, though that will take some boot > time reordering, or we'll have to figure out the overlap in another way. This > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with > a ZONE_MOVABLE section in between them. I'm not sure if this is allowed? If > we can get it working, this seems like the most correct solution to me, but > also the most difficult and risky because it involves significant changes in > the code for memory setup at early boot. > > Am I missing anything are there other solutions we should consider, or do you > have an opinion on which solution we should pursue? I'd add 3. Switch memblock to use bottom up allocations. Historically memblock allocated memory from the top to avoid corrupting the kernel image and to avoid exhausting precious ZONE_DMA. I believe we can use bottom-up allocations with lower limit of memblock allocations set to 16M. 
With the hack below no memblock allocations will end up in ZONE_MOVABLE: diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 16babff771bd..5e940f057dd4 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1116,6 +1116,7 @@ void __init setup_arch(char **cmdline_p) memblock_set_current_limit(ISA_END_ADDRESS); e820__memblock_setup(); + memblock_set_bottom_up(true); /* * Needs to run after memblock setup because it needs the physical diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 2aadb2019b4f..ed1e14a2a62d 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -660,16 +660,6 @@ static int __init numa_init(int (*init_func)(void)) if (ret < 0) return ret; - /* - * We reset memblock back to the top-down direction - * here because if we configured ACPI_NUMA, we have - * parsed SRAT in init_func(). It is ok to have the - * reset here even if we did't configure ACPI_NUMA - * or acpi numa init fails and fallbacks to dummy - * numa init. - */ - memblock_set_bottom_up(false); - ret = numa_cleanup_meminfo(&numa_meminfo); if (ret < 0) return ret; diff --git a/mm/memblock.c b/mm/memblock.c index 3feafea06ab2..7ba040bf8da2 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1388,6 +1388,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, bool exact_nid) { enum memblock_flags flags = choose_memblock_flags(); + phys_addr_t min = SZ_16M; phys_addr_t found; if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. 
Use NUMA_NO_NODE instead\n")) @@ -1400,13 +1401,13 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, } again: - found = memblock_find_in_range_node(size, align, start, end, nid, + found = memblock_find_in_range_node(size, align, min, end, nid, flags); if (found && !memblock_reserve(found, size)) goto done; if (nid != NUMA_NO_NODE && !exact_nid) { - found = memblock_find_in_range_node(size, align, start, + found = memblock_find_in_range_node(size, align, min, end, NUMA_NO_NODE, flags); if (found && !memblock_reserve(found, size)) @@ -1420,6 +1421,11 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, goto again; } + if (start < min) { + min = start; + goto again; + } + return 0; done: > Thanks, > - Ross -- Sincerely yours, Mike. ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 5:44 ` Mike Rapoport @ 2023-07-19 22:26 ` Ross Zwisler 2023-07-21 11:20 ` Mike Rapoport 0 siblings, 1 reply; 23+ messages in thread From: Ross Zwisler @ 2023-07-19 22:26 UTC (permalink / raw) To: Mike Rapoport Cc: linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Michal Hocko, Vlastimil Babka, David Hildenbrand On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote: > 3. Switch memblock to use bottom up allocations. Historically memblock > allocated memory from the top to avoid corrupting the kernel image and to > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up > allocations with lower limit of memblock allocations set to 16M. > > With the hack below no memblock allocations will end up in ZONE_MOVABLE: Yep, I've confirmed that for my use cases at least this does the trick, thank you! I had thought about moving the memblock allocations, but had no idea it was (basically) already supported and thought it'd be much riskier than just adjusting where ZONE_MOVABLE lived. Is there a reason for this to not be a real option for users, maybe per a kernel config knob or something? I'm happy to explore other options in this thread, but this is doing the trick so far. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 22:26 ` Ross Zwisler @ 2023-07-21 11:20 ` Mike Rapoport 2023-07-26 7:49 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Mike Rapoport @ 2023-07-21 11:20 UTC (permalink / raw) To: Ross Zwisler Cc: linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Michal Hocko, Vlastimil Babka, David Hildenbrand On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote: > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote: > > 3. Switch memblock to use bottom up allocations. Historically memblock > > allocated memory from the top to avoid corrupting the kernel image and to > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up > > allocations with lower limit of memblock allocations set to 16M. > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE: > > Yep, I've confirmed that for my use cases at least this does the trick, thank > you! I had thought about moving the memblock allocations, but had no idea it > was (basically) already supported and thought it'd be much riskier than just > adjusting where ZONE_MOVABLE lived. > > Is there a reason for this to not be a real option for users, maybe per a > kernel config knob or something? I'm happy to explore other options in this > thread, but this is doing the trick so far. I think we can make x86 always use bottom up. To do this properly we'd need to set lower limit for memblock allocations to MAX_DMA32_PFN and allow fallback below it so that early allocations won't eat memory from ZONE_DMA32. Aside from x86 boot being fragile in general I don't see why this wouldn't work. -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-21 11:20 ` Mike Rapoport @ 2023-07-26 7:49 ` Michal Hocko 2023-07-26 10:48 ` Mike Rapoport 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2023-07-26 7:49 UTC (permalink / raw) To: Mike Rapoport Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Fri 21-07-23 14:20:09, Mike Rapoport wrote: > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote: > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote: > > > 3. Switch memblock to use bottom up allocations. Historically memblock > > > allocated memory from the top to avoid corrupting the kernel image and to > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up > > > allocations with lower limit of memblock allocations set to 16M. > > > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE: > > > > Yep, I've confirmed that for my use cases at least this does the trick, thank > > you! I had thought about moving the memblock allocations, but had no idea it > > was (basically) already supported and thought it'd be much riskier than just > > adjusting where ZONE_MOVABLE lived. > > > > Is there a reason for this to not be a real option for users, maybe per a > > kernel config knob or something? I'm happy to explore other options in this > > thread, but this is doing the trick so far. > > I think we can make x86 always use bottom up. > > To do this properly we'd need to set lower limit for memblock allocations > to MAX_DMA32_PFN and allow fallback below it so that early allocations > won't eat memory from ZONE_DMA32. > > Aside from x86 boot being fragile in general I don't see why this wouldn't > work. This would add a very subtle dependency of a functionality on the specific boot allocator behavior and that is bad for long term maintenance. 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-26 7:49 ` Michal Hocko @ 2023-07-26 10:48 ` Mike Rapoport 2023-07-26 12:57 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Mike Rapoport @ 2023-07-26 10:48 UTC (permalink / raw) To: Michal Hocko Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote: > On Fri 21-07-23 14:20:09, Mike Rapoport wrote: > > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote: > > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote: > > > > 3. Switch memblock to use bottom up allocations. Historically memblock > > > > allocated memory from the top to avoid corrupting the kernel image and to > > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up > > > > allocations with lower limit of memblock allocations set to 16M. > > > > > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE: > > > > > > Yep, I've confirmed that for my use cases at least this does the trick, thank > > > you! I had thought about moving the memblock allocations, but had no idea it > > > was (basically) already supported and thought it'd be much riskier than just > > > adjusting where ZONE_MOVABLE lived. > > > > > > Is there a reason for this to not be a real option for users, maybe per a > > > kernel config knob or something? I'm happy to explore other options in this > > > thread, but this is doing the trick so far. > > > > I think we can make x86 always use bottom up. > > > > To do this properly we'd need to set lower limit for memblock allocations > > to MAX_DMA32_PFN and allow fallback below it so that early allocations > > won't eat memory from ZONE_DMA32. > > > > Aside from x86 boot being fragile in general I don't see why this wouldn't > > work. 
> > This would add a very subtle depency of a functionality on the specific > boot allocator behavior and that is bad for long term maintenance. What do you mean by "specific boot allocator behavior"? Using a limit for allocations and then falling back to the entire available memory if allocation fails within the limits? > -- > Michal Hocko > SUSE Labs -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-26 10:48 ` Mike Rapoport @ 2023-07-26 12:57 ` Michal Hocko 2023-07-26 13:23 ` Mike Rapoport 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2023-07-26 12:57 UTC (permalink / raw) To: Mike Rapoport Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed 26-07-23 13:48:45, Mike Rapoport wrote: > On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote: > > On Fri 21-07-23 14:20:09, Mike Rapoport wrote: > > > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote: > > > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote: > > > > > 3. Switch memblock to use bottom up allocations. Historically memblock > > > > > allocated memory from the top to avoid corrupting the kernel image and to > > > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up > > > > > allocations with lower limit of memblock allocations set to 16M. > > > > > > > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE: > > > > > > > > Yep, I've confirmed that for my use cases at least this does the trick, thank > > > > you! I had thought about moving the memblock allocations, but had no idea it > > > > was (basically) already supported and thought it'd be much riskier than just > > > > adjusting where ZONE_MOVABLE lived. > > > > > > > > Is there a reason for this to not be a real option for users, maybe per a > > > > kernel config knob or something? I'm happy to explore other options in this > > > > thread, but this is doing the trick so far. > > > > > > I think we can make x86 always use bottom up. > > > > > > To do this properly we'd need to set lower limit for memblock allocations > > > to MAX_DMA32_PFN and allow fallback below it so that early allocations > > > won't eat memory from ZONE_DMA32. 
> > > > > > Aside from x86 boot being fragile in general I don't see why this wouldn't > > > work. > > > > This would add a very subtle depency of a functionality on the specific > > boot allocator behavior and that is bad for long term maintenance. > > What do you mean by "specific boot allocator behavior"? I mean that the expectation that the boot allocator starts from low addresses and functionality depending on that is too fragile. This has already caused some problems in the past IIRC. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-26 12:57 ` Michal Hocko @ 2023-07-26 13:23 ` Mike Rapoport 2023-07-26 14:23 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Mike Rapoport @ 2023-07-26 13:23 UTC (permalink / raw) To: Michal Hocko Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed, Jul 26, 2023 at 02:57:55PM +0200, Michal Hocko wrote: > On Wed 26-07-23 13:48:45, Mike Rapoport wrote: > > On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote: > > > On Fri 21-07-23 14:20:09, Mike Rapoport wrote: > > > > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote: > > > > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote: > > > > > > 3. Switch memblock to use bottom up allocations. Historically memblock > > > > > > allocated memory from the top to avoid corrupting the kernel image and to > > > > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up > > > > > > allocations with lower limit of memblock allocations set to 16M. > > > > > > > > > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE: > > > > > > > > > > Yep, I've confirmed that for my use cases at least this does the trick, thank > > > > > you! I had thought about moving the memblock allocations, but had no idea it > > > > > was (basically) already supported and thought it'd be much riskier than just > > > > > adjusting where ZONE_MOVABLE lived. > > > > > > > > > > Is there a reason for this to not be a real option for users, maybe per a > > > > > kernel config knob or something? I'm happy to explore other options in this > > > > > thread, but this is doing the trick so far. > > > > > > > > I think we can make x86 always use bottom up. 
> > > > > > > > To do this properly we'd need to set lower limit for memblock allocations > > > > to MAX_DMA32_PFN and allow fallback below it so that early allocations > > > > won't eat memory from ZONE_DMA32. > > > > > > > > Aside from x86 boot being fragile in general I don't see why this wouldn't > > > > work. > > > > > > This would add a very subtle depency of a functionality on the specific > > > boot allocator behavior and that is bad for long term maintenance. > > > > What do you mean by "specific boot allocator behavior"? > > I mean that the expectation that the boot allocator starts from low > addresses and functionality depending on that is too fragile. This has > already caused some problems in the past IIRC. Well, any change in x86 boot sequence may cause all sorts of problems :) We do some of the boot time allocations from low addresses when movable_node is enabled and that is entirely implicit and buried deep inside the code. What I'm suggesting is to switch the allocations to bottom-up once and for all with explicitly set lower limit and a defined semantics for a fallback. This might cause some bumps in the beginning, but I don't expect it to be a maintenance problem in the long run. And it will free higher memory from early allocations for all usecases, not just this one. > -- > Michal Hocko > SUSE Labs -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-26 13:23 ` Mike Rapoport @ 2023-07-26 14:23 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2023-07-26 14:23 UTC (permalink / raw) To: Mike Rapoport Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed 26-07-23 16:23:17, Mike Rapoport wrote: > On Wed, Jul 26, 2023 at 02:57:55PM +0200, Michal Hocko wrote: > > On Wed 26-07-23 13:48:45, Mike Rapoport wrote: > > > On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote: > > > > On Fri 21-07-23 14:20:09, Mike Rapoport wrote: > > > > > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote: > > > > > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote: > > > > > > > 3. Switch memblock to use bottom up allocations. Historically memblock > > > > > > > allocated memory from the top to avoid corrupting the kernel image and to > > > > > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up > > > > > > > allocations with lower limit of memblock allocations set to 16M. > > > > > > > > > > > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE: > > > > > > > > > > > > Yep, I've confirmed that for my use cases at least this does the trick, thank > > > > > > you! I had thought about moving the memblock allocations, but had no idea it > > > > > > was (basically) already supported and thought it'd be much riskier than just > > > > > > adjusting where ZONE_MOVABLE lived. > > > > > > > > > > > > Is there a reason for this to not be a real option for users, maybe per a > > > > > > kernel config knob or something? I'm happy to explore other options in this > > > > > > thread, but this is doing the trick so far. > > > > > > > > > > I think we can make x86 always use bottom up. 
> > > > > > > > > > To do this properly we'd need to set lower limit for memblock allocations > > > > > to MAX_DMA32_PFN and allow fallback below it so that early allocations > > > > > won't eat memory from ZONE_DMA32. > > > > > > > > > > Aside from x86 boot being fragile in general I don't see why this wouldn't > > > > > work. > > > > > > > > This would add a very subtle dependency of a functionality on the specific > > > > boot allocator behavior and that is bad for long term maintenance. > > > > > > What do you mean by "specific boot allocator behavior"? > > > > I mean that the expectation that the boot allocator starts from low > > addresses and functionality depending on that is too fragile. This has > > already caused some problems in the past IIRC. > > Well, any change in the x86 boot sequence may cause all sorts of problems :) > > We do some of the boot time allocations from low addresses when > movable_node is enabled and that is entirely implicit and buried deep > inside the code. > > What I'm suggesting is to switch the allocations to bottom-up once and for > all, with an explicitly set lower limit and defined semantics for a fallback. > > This might cause some bumps in the beginning, but I don't expect it to be a > maintenance problem in the long run. > > And it will free higher memory from early allocations for all use cases, not > just this one. Higher memory is usually not a problem AFAIK. It is lowmem that is the scarcer resource because some HW might be constrained in which physical address ranges are visible. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
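The collision the thread is debating can be seen with a toy model: carve the movable range out of the end of the node, as find_zone_movable_pfns_for_nodes() does, and walk a few early allocations top-down versus bottom-up. This is an illustrative sketch only, not kernel code; the addresses mirror the 4G VM example from the start of the thread and the allocation count and sizes are made up:

```shell
NODE_END=$((0x140000000))
MOVABLE_START=$((0x130000000))   # movablecore=256M -> last 256M of the node
LOWER_LIMIT=$((0x1000000))       # 16M floor suggested for bottom-up mode

simulate() {  # $1: up|down; prints how many of 4 x 1M allocations hit Movable
    dir=$1; hits=0; count=4; size=$((0x100000))
    if [ "$dir" = up ]; then cursor=$LOWER_LIMIT; else cursor=$NODE_END; fi
    i=0
    while [ $i -lt $count ]; do
        if [ "$dir" = up ]; then
            start=$cursor; cursor=$((cursor + size))
        else
            cursor=$((cursor - size)); start=$cursor
        fi
        # An allocation "collides" if it starts inside the movable range.
        [ "$start" -ge "$MOVABLE_START" ] && hits=$((hits + 1))
        i=$((i + 1))
    done
    echo "$hits"
}

simulate down   # 4: every top-down allocation lands in the Movable range
simulate up     # 0: bottom-up allocations stay well below it
```

Note that the first top-down allocation in this model starts at 0x13ff00000, the same address as the memblock_reserve reported in the dmesg at the top of the thread; bottom-up placement from the 16M floor never reaches the movable range.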
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-18 22:01 collision between ZONE_MOVABLE and memblock allocations Ross Zwisler 2023-07-19 5:44 ` Mike Rapoport @ 2023-07-19 6:14 ` Michal Hocko 2023-07-19 7:59 ` Mike Rapoport 2023-07-19 22:48 ` Ross Zwisler 1 sibling, 2 replies; 23+ messages in thread From: Michal Hocko @ 2023-07-19 6:14 UTC (permalink / raw) To: Ross Zwisler Cc: linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Tue 18-07-23 16:01:06, Ross Zwisler wrote: [...] > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > allocations, because this issue essentially makes the movablecore= kernel > command line parameter useless in many cases, as the ZONE_MOVABLE region it > creates will often actually be unmovable. movablecore is kinda hack and I would be more inclined to get rid of it rather than build more into it. Could you be more specific about your use case? > Here are the options I currently see for resolution: > > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from > the beginning of the NUMA node instead of the end. This should fix my use case, > but again is prone to breakage in other configurations (# of NUMA nodes, other > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I > think that this should be relatively straightforward and low risk, though. > > 2. Make the code which processes the movablecore= command line option aware of > the memblock allocations, and have it choose a region for ZONE_MOVABLE which > does not have these allocations. This might be done by checking for > PageReserved() as we do with offlining memory, though that will take some boot > time reordering, or we'll have to figure out the overlap in another way. This > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with > a ZONE_MOVABLE section in between them. 
I'm not sure if this is allowed? Yes, this is no problem. Zones are allowed to be sparse. > If > we can get it working, this seems like the most correct solution to me, but > also the most difficult and risky because it involves significant changes in > the code for memory setup at early boot. > > Am I missing anything, are there other solutions we should consider, or do you > have an opinion on which solution we should pursue? If this really needs to be addressed then 2) is certainly a more robust approach. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 6:14 ` Michal Hocko @ 2023-07-19 7:59 ` Mike Rapoport 2023-07-19 8:06 ` Michal Hocko 2023-07-19 22:48 ` Ross Zwisler 1 sibling, 1 reply; 23+ messages in thread From: Mike Rapoport @ 2023-07-19 7:59 UTC (permalink / raw) To: Michal Hocko Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > [...] > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > allocations, because this issue essentially makes the movablecore= kernel > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > creates will often actually be unmovable. > > movablecore is kinda hack and I would be more inclined to get rid of it > rather than build more into it. Could you be more specific about your > use case? > > > Here are the options I currently see for resolution: > > > > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from > > the beginning of the NUMA node instead of the end. This should fix my use case, > > but again is prone to breakage in other configurations (# of NUMA nodes, other > > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I > > think that this should be relatively straightforward and low risk, though. > > > > 2. Make the code which processes the movablecore= command line option aware of > > the memblock allocations, and have it choose a region for ZONE_MOVABLE which > > does not have these allocations. This might be done by checking for > > PageReserved() as we do with offlining memory, though that will take some boot > > time reordering, or we'll have to figure out the overlap in another way. This > > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with > > a ZONE_MOVABLE section in between them. 
I'm not sure if this is allowed? > > Yes, this is no problem. Zones are allowed to be sparse. The current initialization order is roughly * very early initialization with some memblock allocations * determine zone locations and sizes * initialize memory map - memblock_alloc(lots of memory) * lots of unrelated initializations that may allocate memory * release free pages from memblock to the buddy allocator With 2) we can make sure the memory map and early allocations won't be in the ZONE_MOVABLE, but we may still have reserved pages there. > > If > > we can get it working, this seems like the most correct solution to me, but > > also the most difficult and risky because it involves significant changes in > > the code for memory setup at early boot. > > > > Am I missing anything, are there other solutions we should consider, or do you > > have an opinion on which solution we should pursue? > > If this really needs to be addressed then 2) is certainly a more robust > approach. > -- > Michal Hocko > SUSE Labs -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 7:59 ` Mike Rapoport @ 2023-07-19 8:06 ` Michal Hocko 2023-07-19 8:14 ` David Hildenbrand 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2023-07-19 8:06 UTC (permalink / raw) To: Mike Rapoport Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed 19-07-23 10:59:52, Mike Rapoport wrote: > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > > [...] > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > > allocations, because this issue essentially makes the movablecore= kernel > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > > creates will often actually be unmovable. > > > > movablecore is kinda hack and I would be more inclined to get rid of it > > rather than build more into it. Could you be more specific about your > > use case? > > > > > Here are the options I currently see for resolution: > > > > > > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from > > > the beginning of the NUMA node instead of the end. This should fix my use case, > > > but again is prone to breakage in other configurations (# of NUMA nodes, other > > > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I > > > think that this should be relatively straightforward and low risk, though. > > > > > > 2. Make the code which processes the movablecore= command line option aware of > > > the memblock allocations, and have it choose a region for ZONE_MOVABLE which > > > does not have these allocations. This might be done by checking for > > > PageReserved() as we do with offlining memory, though that will take some boot > > > time reordering, or we'll have to figure out the overlap in another way. 
This > > > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with > > > a ZONE_MOVABLE section in between them. I'm not sure if this is allowed? > > > > Yes, this is no problem. Zones are allowed to be sparse. > > The current initialization order is roughly > > * very early initialization with some memblock allocations > * determine zone locations and sizes > * initialize memory map > - memblock_alloc(lots of memory) > * lots of unrelated initializations that may allocate memory > * release free pages from memblock to the buddy allocator > > With 2) we can make sure the memory map and early allocations won't be in > the ZONE_MOVABLE, but we may still have reserved pages there. Yes, this will always be fragile. If the specific placement of the movable memory is not important and the only thing that matters is the size and numa locality then an easier to maintain solution would be to simply offline enough memory blocks very early in the userspace bring-up and online them back as movable. If offlining fails just try another memory block. This doesn't require any kernel code change. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 8:06 ` Michal Hocko @ 2023-07-19 8:14 ` David Hildenbrand 2023-07-19 23:05 ` Ross Zwisler 0 siblings, 1 reply; 23+ messages in thread From: David Hildenbrand @ 2023-07-19 8:14 UTC (permalink / raw) To: Michal Hocko, Mike Rapoport Cc: Ross Zwisler, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka On 19.07.23 10:06, Michal Hocko wrote: > On Wed 19-07-23 10:59:52, Mike Rapoport wrote: >> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: >>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote: >>> [...] >>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap >>>> allocations, because this issue essentially makes the movablecore= kernel >>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it >>>> creates will often actually be unmovable. >>> >>> movablecore is kinda hack and I would be more inclined to get rid of it >>> rather than build more into it. Could you be more specific about your >>> use case? >>> >>>> Here are the options I currently see for resolution: >>>> >>>> 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from >>>> the beginning of the NUMA node instead of the end. This should fix my use case, >>>> but again is prone to breakage in other configurations (# of NUMA nodes, other >>>> architectures) where ZONE_MOVABLE and memblock allocations might overlap. I >>>> think that this should be relatively straightforward and low risk, though. >>>> >>>> 2. Make the code which processes the movablecore= command line option aware of >>>> the memblock allocations, and have it choose a region for ZONE_MOVABLE which >>>> does not have these allocations. This might be done by checking for >>>> PageReserved() as we do with offlining memory, though that will take some boot >>>> time reordering, or we'll have to figure out the overlap in another way. 
This >>>> may also result in us having two ZONE_NORMAL zones for a given NUMA node, with >>>> a ZONE_MOVABLE section in between them. I'm not sure if this is allowed? >>> >>> Yes, this is no problem. Zones are allowed to be sparse. >> >> The current initialization order is roughly >> >> * very early initialization with some memblock allocations >> * determine zone locations and sizes >> * initialize memory map >> - memblock_alloc(lots of memory) >> * lots of unrelated initializations that may allocate memory >> * release free pages from memblock to the buddy allocator >> >> With 2) we can make sure the memory map and early allocations won't be in >> the ZONE_MOVABLE, but we may still have reserved pages there. > > Yes, this will always be fragile. If the specific placement of the movable > memory is not important and the only thing that matters is the size and > numa locality then an easier to maintain solution would be to simply > offline enough memory blocks very early in the userspace bring-up and > online them back as movable. If offlining fails just try another > memory block. This doesn't require any kernel code change. As an alternative, we might use the "memmap=nn[KMG]!ss[KMG]" [1] parameter to mark some memory as protected. That memory can then be configured as a devdax device and onlined to ZONE_MOVABLE (dev/dax). [1] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 8:14 ` David Hildenbrand @ 2023-07-19 23:05 ` Ross Zwisler 2023-07-26 8:31 ` David Hildenbrand 0 siblings, 1 reply; 23+ messages in thread From: Ross Zwisler @ 2023-07-19 23:05 UTC (permalink / raw) To: David Hildenbrand Cc: Michal Hocko, Mike Rapoport, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka On Wed, Jul 19, 2023 at 10:14:59AM +0200, David Hildenbrand wrote: > On 19.07.23 10:06, Michal Hocko wrote: > > On Wed 19-07-23 10:59:52, Mike Rapoport wrote: > > > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > > > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > > > > [...] > > > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > > > > allocations, because this issue essentially makes the movablecore= kernel > > > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > > > > creates will often actually be unmovable. > > > > > > > > movablecore is kinda hack and I would be more inclined to get rid of it > > > > rather than build more into it. Could you be more specific about your > > > > use case? > > > > > > > > > Here are the options I currently see for resolution: > > > > > > > > > > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from > > > > > the beginning of the NUMA node instead of the end. This should fix my use case, > > > > > but again is prone to breakage in other configurations (# of NUMA nodes, other > > > > > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I > > > > > think that this should be relatively straightforward and low risk, though. > > > > > > > > > > 2. Make the code which processes the movablecore= command line option aware of > > > > > the memblock allocations, and have it choose a region for ZONE_MOVABLE which > > > > > does not have these allocations. 
This might be done by checking for > > > > > PageReserved() as we do with offlining memory, though that will take some boot > > > > > time reordering, or we'll have to figure out the overlap in another way. This > > > > > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with > > > > > a ZONE_MOVABLE section in between them. I'm not sure if this is allowed? > > > > > > > > Yes, this is no problem. Zones are allowed to be sparse. > > > > > > The current initialization order is roughly > > > > > > * very early initialization with some memblock allocations > > > * determine zone locations and sizes > > > * initialize memory map > > > - memblock_alloc(lots of memory) > > > * lots of unrelated initializations that may allocate memory > > > * release free pages from memblock to the buddy allocator > > > > > > With 2) we can make sure the memory map and early allocations won't be in > > > the ZONE_MOVABLE, but we may still have reserved pages there. > > > > Yes, this will always be fragile. If the specific placement of the movable > > memory is not important and the only thing that matters is the size and > > numa locality then an easier to maintain solution would be to simply > > offline enough memory blocks very early in the userspace bring-up and > > online them back as movable. If offlining fails just try another > > memory block. This doesn't require any kernel code change. > > As an alternative, we might use the "memmap=nn[KMG]!ss[KMG]" [1] parameter > to mark some memory as protected. > > That memory can then be configured as a devdax device and onlined to > ZONE_MOVABLE (dev/dax). 
> > [1] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap I've previously been reconfiguring devdax memory like this: ndctl create-namespace --reconfig=namespace0.0 -m devdax -f daxctl reconfigure-device --mode=system-ram dax0.0 Is this how you've been doing it, or is there something else I should consider? I just sent mail to Michal outlining my use case, hopefully it makes sense. I had thought about using 'memmap=' in the first kernel and the worry was that I'd have to support many different machines with different memory configurations, and have to hard-code memory offsets and lengths for the various memmap= kernel command line parameters. If I can make ZONE_MOVABLE work that's preferable because the kernel will choose the correct usermem-only region for me, and then I can just use that region for the crash kernel and 3rd kernel boots. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 23:05 ` Ross Zwisler @ 2023-07-26 8:31 ` David Hildenbrand 0 siblings, 0 replies; 23+ messages in thread From: David Hildenbrand @ 2023-07-26 8:31 UTC (permalink / raw) To: Ross Zwisler Cc: Michal Hocko, Mike Rapoport, linux-kernel, linux-mm, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka >> As an alternative, we might use the "memmap=nn[KMG]!ss[KMG]" [1] parameter >> to mark some memory as protected. >> >> That memory can then be configured as a devdax device and onlined to >> ZONE_MOVABLE (dev/dax). >> >> [1] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap > > I've previously been reconfiguring devdax memory like this: > > ndctl create-namespace --reconfig=namespace0.0 -m devdax -f > daxctl reconfigure-device --mode=system-ram dax0.0 > > Is this how you've been doing it, or is there something else I should > consider? No, exactly like that. > > I just sent mail to Michal outlining my use case, hopefully it makes sense. Yes, thanks for sharing, I'll dig deeper into that next. > > I had thought about using 'memmap=' in the first kernel and the worry was that > I'd have to support many different machines with different memory > configurations, and have to hard-code memory offsets and lengths for the > various memmap= kernel command line parameters. Indeed. > If I can make ZONE_MOVABLE > work that's preferable because the kernel will choose the correct usermem-only > region for me, and then I can just use that region for the crash kernel and > 3rd kernel boots. It really sounds like you might be better off using CMA or alloc_contig_pages(). The latter is unreliable, though, and the memory cannot be used for other purposes once alloc_contig_pages() succeeded. See arch/powerpc/platforms/powernv/memtrace.c for one user that needs to set aside a lot of memory to store HW traces. 
-- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 6:14 ` Michal Hocko 2023-07-19 7:59 ` Mike Rapoport @ 2023-07-19 22:48 ` Ross Zwisler 2023-07-20 7:49 ` Michal Hocko ` (2 more replies) 1 sibling, 3 replies; 23+ messages in thread From: Ross Zwisler @ 2023-07-19 22:48 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > [...] > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > allocations, because this issue essentially makes the movablecore= kernel > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > creates will often actually be unmovable. > > movablecore is kinda hack and I would be more inclined to get rid of it > rather than build more into it. Could you be more specific about your > use case? The problem that I'm trying to solve is that I'd like to be able to get kernel core dumps off machines (chromebooks) so that we can debug crashes. Because the memory used by the crash kernel ("crashkernel=" kernel command line option) is consumed the entire time the machine is booted, there is a strong motivation to keep the crash kernel as small and as simple as possible. To this end I'm trying to get away without SSD drivers, not having to worry about encryption on the SSDs, etc. So, the rough plan right now is: 1) During boot set aside some memory that won't contain kernel allocations. I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE region (or whatever non-kernel region) will be set aside as PMEM in the crash kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line parameter passed to the crash kernel. 
So, in my sample 4G VM system, I see: # lsmem --split ZONES --output-all RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable Memory block size: 128M Total online memory: 4G Total offline memory: 0B so I'll pass "memmap=256M!0x130000000" to the crash kernel. 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set aside only contains user data, which we don't want to store anyway. We make a filesystem in there, and create a kernel crash dump using 'makedumpfile': mkfs.ext4 /dev/pmem0 mount /dev/pmem0 /mnt makedumpfile -c -d 31 /proc/vmcore /mnt/kdump We then set up the next full kernel boot to also have this same PMEM region, using the same memmap kernel parameter. We reboot back into a full kernel. 3) The next full kernel will be a normal boot with a full networking stack, SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out the kdump and either store it somewhere persistent or upload it somewhere. We can then unmount the PMEM and reconfigure it back to system ram so that the live system isn't missing memory. ndctl create-namespace --reconfig=namespace0.0 -m devdax -f daxctl reconfigure-device --mode=system-ram dax0.0 This is the flow I'm trying to support, and have mostly working in a VM, except up until now makedumpfile would crash because all the memblock structures it needed were in the PMEM area that I had just wiped out by making a new filesystem. :) Do you see any blockers that would make this infeasible? 
For the non-kernel memory, is the ZONE_MOVABLE path that I'm currently pursuing the best option, or would we be better off with your suggestion elsewhere in this thread: > If the specific placement of the movable memory is not important and the only > thing that matters is the size and numa locality then an easier to maintain > solution would be to simply offline enough memory blocks very early in the > userspace bring-up and online them back as movable. If offlining fails just > try another memory block. This doesn't require any kernel code change. If this 2nd way is preferred, can you point me to how I can offline the memory blocks & then get them back later in boot? Thanks for the help! - Ross ^ permalink raw reply [flat|nested] 23+ messages in thread
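Ross derives memmap=256M!0x130000000 by hand from the lsmem RANGE column. The arithmetic is easy to get wrong because the RANGE end is inclusive, so here is a small helper sketch; memmap_arg is a hypothetical name, not part of any tool mentioned in this thread:

```shell
# memmap_arg START END_INCLUSIVE
# Prints a memmap=nn[KMG]!ss kernel parameter built from an inclusive
# physical address range such as the RANGE column printed by lsmem.
memmap_arg() {
    start=$(($1)); size=$(( $2 - $1 + 1 ))
    if   [ $((size % (1 << 30))) -eq 0 ]; then
        printf 'memmap=%dG!0x%x\n' $((size >> 30)) "$start"
    elif [ $((size % (1 << 20))) -eq 0 ]; then
        printf 'memmap=%dM!0x%x\n' $((size >> 20)) "$start"
    elif [ $((size % (1 << 10))) -eq 0 ]; then
        printf 'memmap=%dK!0x%x\n' $((size >> 10)) "$start"
    else
        printf 'memmap=%d!0x%x\n' "$size" "$start"
    fi
}

# The Movable blocks 38-39 from the 4G VM example:
memmap_arg 0x130000000 0x13FFFFFFF   # memmap=256M!0x130000000
```

The same string is then passed on the command line of both the crash kernel and the following full kernel boot, per the flow described above.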
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 22:48 ` Ross Zwisler @ 2023-07-20 7:49 ` Michal Hocko 2023-07-20 12:13 ` Michal Hocko 2023-07-26 8:44 ` David Hildenbrand 2 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2023-07-20 7:49 UTC (permalink / raw) To: Ross Zwisler Cc: linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand, Jiri Bohac [CC Jiri Bohac] On Wed 19-07-23 16:48:21, Ross Zwisler wrote: > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > > [...] > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > > allocations, because this issue essentially makes the movablecore= kernel > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > > creates will often actually be unmovable. > > > > movablecore is kinda hack and I would be more inclined to get rid of it > > rather than build more into it. Could you be more specific about your > > use case? > > The problem that I'm trying to solve is that I'd like to be able to get kernel > core dumps off machines (chromebooks) so that we can debug crashes. Because > the memory used by the crash kernel ("crashkernel=" kernel command line > option) is consumed the entire time the machine is booted, there is a strong > motivation to keep the crash kernel as small and as simple as possible. To > this end I'm trying to get away without SSD drivers, not having to worry about > encryption on the SSDs, etc. This is something Jiri is also looking into. > So, the rough plan right now is: > > 1) During boot set aside some memory that won't contain kernel allocations. > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. 
> > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE > region (or whatever non-kernel region) will be set aside as PMEM in the crash > kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line > parameter passed to the crash kernel. > > So, in my sample 4G VM system, I see: > > # lsmem --split ZONES --output-all > RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES > 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None > 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 > 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal > 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable > > Memory block size: 128M > Total online memory: 4G > Total offline memory: 0B > > so I'll pass "memmap=256M!0x130000000" to the crash kernel. > > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set > aside only contains user data, which we don't want to store anyway. We make a > filesystem in there, and create a kernel crash dump using 'makedumpfile': > > mkfs.ext4 /dev/pmem0 > mount /dev/pmem0 /mnt > makedumpfile -c -d 31 /proc/vmcore /mnt/kdump > > We then set up the next full kernel boot to also have this same PMEM region, > using the same memmap kernel parameter. We reboot back into a full kernel. > > 3) The next full kernel will be a normal boot with a full networking stack, > SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out > the kdump and either store it somewhere persistent or upload it somewhere. We > can then unmount the PMEM and reconfigure it back to system ram so that the > live system isn't missing memory. 
> > ndctl create-namespace --reconfig=namespace0.0 -m devdax -f > daxctl reconfigure-device --mode=system-ram dax0.0 > > This is the flow I'm trying to support, and have mostly working in a VM, > except up until now makedumpfile would crash because all the memblock > structures it needed were in the PMEM area that I had just wiped out by making > a new filesystem. :) > > Do you see any blockers that would make this infeasible? > > For the non-kernel memory, is the ZONE_MOVABLE path that I'm currently > pursuing the best option, or would we be better off with your suggestion > elsewhere in this thread: The main problem I would see with this approach is that the small Movable zone you set aside would be easily consumed and reclaimed. That could generate some unexpected performance artifacts. We used to see those with small zones or large differences in zone sizes in the past. But functionally this should work, or I do not see any fundamental problems at least. Jiri is looking at this from a slightly different angle. Very broadly, he would like to have a dedicated CMA pool and reuse that for the kernel memory (dropping anything sitting there) when crashing. GFP_MOVABLE allocations can use CMA pools. > > If the specific placement of the movable memory is not important and the only > > thing that matters is the size and numa locality then an easier to maintain > > solution would be to simply offline enough memory blocks very early in the > > userspace bring-up and online them back as movable. If offlining fails just > > try another > > memory block. This doesn't require any kernel code change. > > If this 2nd way is preferred, can you point me to how I can offline the memory > blocks & then get them back later in boot? 
/bin/echo offline > /sys/devices/system/memory/memory$NUM/state && \ echo online_movable > /sys/devices/system/memory/memory$NUM/state more in Documentation/admin-guide/mm/memory-hotplug.rst -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
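The two echo commands above can be wrapped in a loop that keeps trying blocks until one offlines cleanly, per the "if offlining fails just try another" suggestion earlier in the thread. A sketch only; the SYSFS variable is an assumption added here so the logic can be exercised against a fake sysfs tree, and on a real system it defaults to /sys and the writes require root:

```shell
#!/bin/sh
# Offline the first memory block that will go offline, then bring it
# back as ZONE_MOVABLE. Prints the block name on success.
SYSFS="${SYSFS:-/sys}"

offline_then_online_movable() {
    for dir in "$SYSFS"/devices/system/memory/memory*; do
        state="$dir/state"
        [ -f "$state" ] || continue
        # Offlining fails (EBUSY) while the block still holds unmovable
        # pages; in that case move on and try the next block.
        if echo offline > "$state" 2>/dev/null; then
            echo online_movable > "$state"
            echo "${dir##*/}"
            return 0
        fi
    done
    return 1
}
```

On real hardware the write of "offline" is exactly the step that can fail with "Device or resource busy", as in the offlining attempt at the top of the thread, which is why the loop simply moves on to the next block.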
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 22:48 ` Ross Zwisler 2023-07-20 7:49 ` Michal Hocko @ 2023-07-20 12:13 ` Michal Hocko 2023-07-24 16:56 ` Ross Zwisler 2023-07-26 8:44 ` David Hildenbrand 2 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2023-07-20 12:13 UTC (permalink / raw) To: Ross Zwisler Cc: linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Wed 19-07-23 16:48:21, Ross Zwisler wrote: > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > > [...] > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > > allocations, because this issue essentially makes the movablecore= kernel > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > > creates will often actually be unmovable. > > > > movablecore is kinda hack and I would be more inclined to get rid of it > > rather than build more into it. Could you be more specific about your > > use case? > > The problem that I'm trying to solve is that I'd like to be able to get kernel > core dumps off machines (chromebooks) so that we can debug crashes. Because > the memory used by the crash kernel ("crashkernel=" kernel command line > option) is consumed the entire time the machine is booted, there is a strong > motivation to keep the crash kernel as small and as simple as possible. To > this end I'm trying to get away without SSD drivers, not having to worry about > encryption on the SSDs, etc. > > So, the rough plan right now is: > > 1) During boot set aside some memory that won't contain kernel allocations. > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. > > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE > region (or whatever non-kernel region) will be set aside as PMEM in the crash > kernel. 
This is done with the memmap=nn[KMG]!ss[KMG] kernel command line > parameter passed to the crash kernel. > > So, in my sample 4G VM system, I see: > > # lsmem --split ZONES --output-all > RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES > 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None > 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 > 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal > 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable > > Memory block size: 128M > Total online memory: 4G > Total offline memory: 0B > > so I'll pass "memmap=256M!0x130000000" to the crash kernel. > > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set > aside only contains user data, which we don't want to store anyway. We make a > filesystem in there, and create a kernel crash dump using 'makedumpfile': > > mkfs.ext4 /dev/pmem0 > mount /dev/pmem0 /mnt > makedumpfile -c -d 31 /proc/vmcore /mnt/kdump > > We then set up the next full kernel boot to also have this same PMEM region, > using the same memmap kernel parameter. We reboot back into a full kernel. Btw. How do you ensure that the address range doesn't get reinitialized by POST? Do you rely on kexec boot here? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
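The memmap=256M!0x130000000 value quoted above follows mechanically from the Movable row of the lsmem output: the size is end - start + 1, and the '!' syntax marks the range as protected (PMEM-style) memory in the crash kernel's e820 map. A small helper (my illustration, not an existing tool) shows the arithmetic:

```python
def memmap_param(start_hex: str, end_hex: str) -> str:
    """Build a memmap=nn[KMG]!ss kernel parameter from an inclusive
    physical address range as printed by 'lsmem --split ZONES'."""
    start = int(start_hex, 16)
    size = int(end_hex, 16) - start + 1          # inclusive range
    assert size % (1 << 20) == 0, "expected a MiB-aligned range"
    return f"memmap={size >> 20}M!{start:#x}"

# The Movable range from the thread's 4G VM:
print(memmap_param("0x0000000130000000", "0x000000013fffffff"))
# -> memmap=256M!0x130000000
```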
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-20 12:13 ` Michal Hocko @ 2023-07-24 16:56 ` Ross Zwisler 0 siblings, 0 replies; 23+ messages in thread From: Ross Zwisler @ 2023-07-24 16:56 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka, David Hildenbrand On Thu, Jul 20, 2023 at 02:13:25PM +0200, Michal Hocko wrote: > On Wed 19-07-23 16:48:21, Ross Zwisler wrote: > > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > > > [...] > > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > > > allocations, because this issue essentially makes the movablecore= kernel > > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > > > creates will often actually be unmovable. > > > > > > movablecore is kinda hack and I would be more inclined to get rid of it > > > rather than build more into it. Could you be more specific about your > > > use case? > > > > The problem that I'm trying to solve is that I'd like to be able to get kernel > > core dumps off machines (chromebooks) so that we can debug crashes. Because > > the memory used by the crash kernel ("crashkernel=" kernel command line > > option) is consumed the entire time the machine is booted, there is a strong > > motivation to keep the crash kernel as small and as simple as possible. To > > this end I'm trying to get away without SSD drivers, not having to worry about > > encryption on the SSDs, etc. > > > > So, the rough plan right now is: > > > > 1) During boot set aside some memory that won't contain kernel allocations. > > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. > > > > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE > > region (or whatever non-kernel region) will be set aside as PMEM in the crash > > kernel. 
This is done with the memmap=nn[KMG]!ss[KMG] kernel command line > > parameter passed to the crash kernel. > > > > So, in my sample 4G VM system, I see: > > > > # lsmem --split ZONES --output-all > > RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES > > 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None > > 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 > > 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal > > 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable > > > > Memory block size: 128M > > Total online memory: 4G > > Total offline memory: 0B > > > > so I'll pass "memmap=256M!0x130000000" to the crash kernel. > > > > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set > > aside only contains user data, which we don't want to store anyway. We make a > > filesystem in there, and create a kernel crash dump using 'makedumpfile': > > > > mkfs.ext4 /dev/pmem0 > > mount /dev/pmem0 /mnt > > makedumpfile -c -d 31 /proc/vmcore /mnt/kdump > > > > We then set up the next full kernel boot to also have this same PMEM region, > > using the same memmap kernel parameter. We reboot back into a full kernel. > > Btw. How do you ensure that the address range doesn't get reinitialized > by POST? Do you rely on kexec boot here? I've been working under the assumption that I do need to do a full reboot (not just another kexec boot) so that the devices in the system (NICs, disks, etc) are all reinitialized and don't carry over bad state from the crash. I do know about the 'reset_devices' kernel command line parameter, but wasn't sure that would be enough. From looking around it seems like this is very driver + device dependent, so maybe I just need to test more. In any case, you're right, if we do a full reboot and go through POST, it's system dependent on whether BIOS/UEFI/Coreboot/etc will zero memory, and if it does this feature won't work unless we kexec to the 3rd kernel. 
I've also heard concerns around whether a full reboot will cause the memory controller to reinitialize and potentially cause memory bit flips or similar, though I haven't yet seen this myself. Has anyone seen such bit flips / memory corruption due to system reboot, or is this a non-issue in your experience? Lots to figure out, thanks for the help. :) ^ permalink raw reply [flat|nested] 23+ messages in thread
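Pulling step (2) of the quoted plan together, the crash-kernel capture could look like the sketch below. The device and paths are the thread's examples, the makedumpfile flags are as Ross gives them (-c compresses, -d 31 omits free, zero and userspace pages), and the DRY_RUN guard is mine, since the real sequence only makes sense inside the crash kernel.

```shell
#!/bin/sh
# Sketch of the capture step inside the crash kernel: format the PMEM
# region created by memmap=nn!ss, mount it, and dump /proc/vmcore there.
set -eu
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else eval "$*"; fi
}

run "mkfs.ext4 /dev/pmem0"
run "mount /dev/pmem0 /mnt"
run "makedumpfile -c -d 31 /proc/vmcore /mnt/kdump"
run "umount /mnt"
run "reboot -f"           # back into the full kernel
```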
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-19 22:48 ` Ross Zwisler 2023-07-20 7:49 ` Michal Hocko 2023-07-20 12:13 ` Michal Hocko @ 2023-07-26 8:44 ` David Hildenbrand 2023-07-26 13:08 ` David Hildenbrand 2023-07-27 8:18 ` Michal Hocko 2 siblings, 2 replies; 23+ messages in thread From: David Hildenbrand @ 2023-07-26 8:44 UTC (permalink / raw) To: Ross Zwisler, Michal Hocko Cc: linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka On 20.07.23 00:48, Ross Zwisler wrote: > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: >> On Tue 18-07-23 16:01:06, Ross Zwisler wrote: >> [...] >>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap >>> allocations, because this issue essentially makes the movablecore= kernel >>> command line parameter useless in many cases, as the ZONE_MOVABLE region it >>> creates will often actually be unmovable. >> >> movablecore is kinda hack and I would be more inclined to get rid of it >> rather than build more into it. Could you be more specific about your >> use case? > > The problem that I'm trying to solve is that I'd like to be able to get kernel > core dumps off machines (chromebooks) so that we can debug crashes. Because > the memory used by the crash kernel ("crashkernel=" kernel command line > option) is consumed the entire time the machine is booted, there is a strong > motivation to keep the crash kernel as small and as simple as possible. To > this end I'm trying to get away without SSD drivers, not having to worry about > encryption on the SSDs, etc. Okay, so you intend to keep the crashkernel area as small as possible. > > So, the rough plan right now is: > > 1) During boot set aside some memory that won't contain kernel allocations. > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. 
> > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE > region (or whatever non-kernel region) will be set aside as PMEM in the crash > kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line > parameter passed to the crash kernel. > > So, in my sample 4G VM system, I see: > > # lsmem --split ZONES --output-all > RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES > 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None > 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 > 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal > 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable > > Memory block size: 128M > Total online memory: 4G > Total offline memory: 0B > > so I'll pass "memmap=256M!0x130000000" to the crash kernel. > > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set > aside only contains user data, which we don't want to store anyway. I raised that in different context already, but such assumptions are not 100% future proof IMHO. For example, we might at one point be able to make user page tables movable and place them on there. But yes, most kernel data structures (which you care about) will probably never be movable and never end up on these regions. > We make a > filesystem in there, and create a kernel crash dump using 'makedumpfile': > > mkfs.ext4 /dev/pmem0 > mount /dev/pmem0 /mnt > makedumpfile -c -d 31 /proc/vmcore /mnt/kdump > > We then set up the next full kernel boot to also have this same PMEM region, > using the same memmap kernel parameter. We reboot back into a full kernel. > > 3) The next full kernel will be a normal boot with a full networking stack, > SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out > the kdump and either store it somewhere persistent or upload it somewhere. We > can then unmount the PMEM and reconfigure it back to system ram so that the > live system isn't missing memory. 
> > ndctl create-namespace --reconfig=namespace0.0 -m devdax -f > daxctl reconfigure-device --mode=system-ram dax0.0 > > This is the flow I'm trying to support, and have mostly working in a VM, > except up until now makedumpfile would crash because all the memblock > structures it needed were in the PMEM area that I had just wiped out by making > a new filesystem. :) Thinking out loud (and remembering that some architectures relocate the crashkernel during kexec, if I am not wrong), maybe the following would also work and make your setup eventually easier: 1) Don't reserve a crashkernel area in the traditional way, instead reserve that area using CMA. It can be used for MOVABLE allocations. 2) Let kexec load the crashkernel+initrd into ordinary memory only (consuming as much as you would need there). 3) On kexec, relocate the crashkernel+initrd into the CMA area (overwriting any movable data in there) 4) In makedumpfile, don't dump any memory that falls into the crashkernel area. It might already have been overwritten by the second kernel Maybe that would allow you to make the crashkernel+initrd slightly bigger (to include SSD drivers etc.) and have a bigger crashkernel area, because while the crashkernel is armed it will only consume the crashkernel+initrd size and not the overall crashkernel area size. If that makes any sense :) -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
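And step (3), the post-reboot reclaim, in the same sketch style: the ndctl/daxctl invocations are exactly the ones quoted above, the destination directory is a hypothetical placeholder of mine, and DRY_RUN=1 prints rather than executes.

```shell
#!/bin/sh
# Sketch of step 3 in the full kernel: copy the dump off the PMEM
# region, then reconfigure it back into ordinary system RAM.
set -eu
DRY_RUN="${DRY_RUN:-1}"
DEST="${DEST:-/var/crash}"   # hypothetical place to keep the dump

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else eval "$*"; fi
}

run "mount /dev/pmem0 /mnt"
run "cp /mnt/kdump $DEST/"
run "umount /mnt"
run "ndctl create-namespace --reconfig=namespace0.0 -m devdax -f"
run "daxctl reconfigure-device --mode=system-ram dax0.0"
```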
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-26 8:44 ` David Hildenbrand @ 2023-07-26 13:08 ` David Hildenbrand 2023-07-27 8:18 ` Michal Hocko 1 sibling, 0 replies; 23+ messages in thread From: David Hildenbrand @ 2023-07-26 13:08 UTC (permalink / raw) To: Ross Zwisler, Michal Hocko Cc: linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka On 26.07.23 10:44, David Hildenbrand wrote: > On 20.07.23 00:48, Ross Zwisler wrote: >> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: >>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote: >>> [...] >>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap >>>> allocations, because this issue essentially makes the movablecore= kernel >>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it >>>> creates will often actually be unmovable. >>> >>> movablecore is kinda hack and I would be more inclined to get rid of it >>> rather than build more into it. Could you be more specific about your >>> use case? >> >> The problem that I'm trying to solve is that I'd like to be able to get kernel >> core dumps off machines (chromebooks) so that we can debug crashes. Because >> the memory used by the crash kernel ("crashkernel=" kernel command line >> option) is consumed the entire time the machine is booted, there is a strong >> motivation to keep the crash kernel as small and as simple as possible. To >> this end I'm trying to get away without SSD drivers, not having to worry about >> encryption on the SSDs, etc. > > Okay, so you intend to keep the crashkernel area as small as possible. > >> >> So, the rough plan right now is: >> > 1) During boot set aside some memory that won't contain kernel > allocations. >> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. 
>> >> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE >> region (or whatever non-kernel region) will be set aside as PMEM in the crash >> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line >> parameter passed to the crash kernel. >> >> So, in my sample 4G VM system, I see: >> >> # lsmem --split ZONES --output-all >> RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES >> 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None >> 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 >> 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal >> 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable >> >> Memory block size: 128M >> Total online memory: 4G >> Total offline memory: 0B >> >> so I'll pass "memmap=256M!0x130000000" to the crash kernel. >> >> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set >> aside only contains user data, which we don't want to store anyway. > > I raised that in different context already, but such assumptions are not > 100% future proof IMHO. For example, we might at one point be able to > make user page tables movable and place them on there. > > But yes, most kernel data structures (which you care about) will > probably never be movable and never end up on these regions. > >> We make a >> filesystem in there, and create a kernel crash dump using 'makedumpfile': >> >> mkfs.ext4 /dev/pmem0 >> mount /dev/pmem0 /mnt >> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump >> >> We then set up the next full kernel boot to also have this same PMEM region, >> using the same memmap kernel parameter. We reboot back into a full kernel. >> >> 3) The next full kernel will be a normal boot with a full networking stack, >> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out >> the kdump and either store it somewhere persistent or upload it somewhere. 
We >> can then unmount the PMEM and reconfigure it back to system ram so that the >> live system isn't missing memory. >> >> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f >> daxctl reconfigure-device --mode=system-ram dax0.0 >> >> This is the flow I'm trying to support, and have mostly working in a VM, >> except up until now makedumpfile would crash because all the memblock >> structures it needed were in the PMEM area that I had just wiped out by making >> a new filesystem. :) > > > Thinking out loud (and remembering that some architectures relocate the > crashkernel during kexec, if I am not wrong), maybe the following would > also work and make your setup eventually easier: > > 1) Don't reserve a crashkernel area in the traditional way, instead > reserve that area using CMA. It can be used for MOVABLE allocations. > > 2) Let kexec load the crashkernel+initrd into ordinary memory only > (consuming as much as you would need there). Oh, I realized that one might be able to place the kernel+initrd directly in the area by allocating via CMA. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-26 8:44 ` David Hildenbrand 2023-07-26 13:08 ` David Hildenbrand @ 2023-07-27 8:18 ` Michal Hocko 2023-07-27 9:41 ` David Hildenbrand 1 sibling, 1 reply; 23+ messages in thread From: Michal Hocko @ 2023-07-27 8:18 UTC (permalink / raw) To: David Hildenbrand Cc: Ross Zwisler, linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka On Wed 26-07-23 10:44:21, David Hildenbrand wrote: > On 20.07.23 00:48, Ross Zwisler wrote: > > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: > > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote: > > > [...] > > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap > > > > allocations, because this issue essentially makes the movablecore= kernel > > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it > > > > creates will often actually be unmovable. > > > > > > movablecore is kinda hack and I would be more inclined to get rid of it > > > rather than build more into it. Could you be more specific about your > > > use case? > > > > The problem that I'm trying to solve is that I'd like to be able to get kernel > > core dumps off machines (chromebooks) so that we can debug crashes. Because > > the memory used by the crash kernel ("crashkernel=" kernel command line > > option) is consumed the entire time the machine is booted, there is a strong > > motivation to keep the crash kernel as small and as simple as possible. To > > this end I'm trying to get away without SSD drivers, not having to worry about > > encryption on the SSDs, etc. > > Okay, so you intend to keep the crashkernel area as small as possible. > > > > > So, the rough plan right now is: > > > 1) During boot set aside some memory that won't contain kernel > allocations. > > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. 
> > > > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE > > region (or whatever non-kernel region) will be set aside as PMEM in the crash > > kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line > > parameter passed to the crash kernel. > > > > So, in my sample 4G VM system, I see: > > > > # lsmem --split ZONES --output-all > > RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES > > 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None > > 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 > > 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal > > 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable > > Memory block size: 128M > > Total online memory: 4G > > Total offline memory: 0B > > > > so I'll pass "memmap=256M!0x130000000" to the crash kernel. > > > > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set > > aside only contains user data, which we don't want to store anyway. > > I raised that in different context already, but such assumptions are not > 100% future proof IMHO. For example, we might at one point be able to make > user page tables movable and place them on there. > > But yes, most kernel data structures (which you care about) will probably > never be movable and never end up on these regions. > > > We make a > > filesystem in there, and create a kernel crash dump using 'makedumpfile': > > > > mkfs.ext4 /dev/pmem0 > > mount /dev/pmem0 /mnt > > makedumpfile -c -d 31 /proc/vmcore /mnt/kdump > > > > We then set up the next full kernel boot to also have this same PMEM region, > > using the same memmap kernel parameter. We reboot back into a full kernel. > > > > 3) The next full kernel will be a normal boot with a full networking stack, > > SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out > > the kdump and either store it somewhere persistent or upload it somewhere. 
We > > can then unmount the PMEM and reconfigure it back to system ram so that the > > live system isn't missing memory. > > > > ndctl create-namespace --reconfig=namespace0.0 -m devdax -f > > daxctl reconfigure-device --mode=system-ram dax0.0 > > > > This is the flow I'm trying to support, and have mostly working in a VM, > > except up until now makedumpfile would crash because all the memblock > > structures it needed were in the PMEM area that I had just wiped out by making > > a new filesystem. :) > > > Thinking out loud (and remembering that some architectures relocate the > crashkernel during kexec, if I am not wrong), maybe the following would also > work and make your setup eventually easier: > > 1) Don't reserve a crashkernel area in the traditional way, instead reserve > that area using CMA. It can be used for MOVABLE allocations. > > 2) Let kexec load the crashkernel+initrd into ordinary memory only > (consuming as much as you would need there). > > 3) On kexec, relocate the crashkernel+initrd into the CMA area (overwriting > any movable data in there) > > 4) In makedumpfile, don't dump any memory that falls into the crashkernel > area. It might already have been overwritten by the second kernel This is more or less what Jiri is looking into. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: collision between ZONE_MOVABLE and memblock allocations 2023-07-27 8:18 ` Michal Hocko @ 2023-07-27 9:41 ` David Hildenbrand 0 siblings, 0 replies; 23+ messages in thread From: David Hildenbrand @ 2023-07-27 9:41 UTC (permalink / raw) To: Michal Hocko Cc: Ross Zwisler, linux-kernel, linux-mm, Mike Rapoport, Andrew Morton, Matthew Wilcox, Mel Gorman, Vlastimil Babka On 27.07.23 10:18, Michal Hocko wrote: > On Wed 26-07-23 10:44:21, David Hildenbrand wrote: >> On 20.07.23 00:48, Ross Zwisler wrote: >>> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote: >>>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote: >>>> [...] >>>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap >>>>> allocations, because this issue essentially makes the movablecore= kernel >>>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it >>>>> creates will often actually be unmovable. >>>> >>>> movablecore is kinda hack and I would be more inclined to get rid of it >>>> rather than build more into it. Could you be more specific about your >>>> use case? >>> >>> The problem that I'm trying to solve is that I'd like to be able to get kernel >>> core dumps off machines (chromebooks) so that we can debug crashes. Because >>> the memory used by the crash kernel ("crashkernel=" kernel command line >>> option) is consumed the entire time the machine is booted, there is a strong >>> motivation to keep the crash kernel as small and as simple as possible. To >>> this end I'm trying to get away without SSD drivers, not having to worry about >>> encryption on the SSDs, etc. >> >> Okay, so you intend to keep the crashkernel area as small as possible. >> >>> >>> So, the rough plan right now is: >>> > 1) During boot set aside some memory that won't contain kernel >> allocations. >>> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways. 
>>> >>> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE >>> region (or whatever non-kernel region) will be set aside as PMEM in the crash >>> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line >>> parameter passed to the crash kernel. >>> >>> So, in my sample 4G VM system, I see: >>> >>> # lsmem --split ZONES --output-all >>> RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES >>> 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None >>> 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32 >>> 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal >>> 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable >>> Memory block size: 128M >>> Total online memory: 4G >>> Total offline memory: 0B >>> >>> so I'll pass "memmap=256M!0x130000000" to the crash kernel. >>> >>> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set >>> aside only contains user data, which we don't want to store anyway. >> >> I raised that in different context already, but such assumptions are not >> 100% future proof IMHO. For example, we might at one point be able to make >> user page tables movable and place them on there. >> >> But yes, most kernel data structures (which you care about) will probably >> never be movable and never end up on these regions. >> >>> We make a >>> filesystem in there, and create a kernel crash dump using 'makedumpfile': >>> >>> mkfs.ext4 /dev/pmem0 >>> mount /dev/pmem0 /mnt >>> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump >>> >>> We then set up the next full kernel boot to also have this same PMEM region, >>> using the same memmap kernel parameter. We reboot back into a full kernel. >>> >>> 3) The next full kernel will be a normal boot with a full networking stack, >>> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out >>> the kdump and either store it somewhere persistent or upload it somewhere. 
We >>> can then unmount the PMEM and reconfigure it back to system ram so that the >>> live system isn't missing memory. >>> >>> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f >>> daxctl reconfigure-device --mode=system-ram dax0.0 >>> >>> This is the flow I'm trying to support, and have mostly working in a VM, >>> except up until now makedumpfile would crash because all the memblock >>> structures it needed were in the PMEM area that I had just wiped out by making >>> a new filesystem. :) >> >> >> Thinking out loud (and remembering that some architectures relocate the >> crashkernel during kexec, if I am not wrong), maybe the following would also >> work and make your setup eventually easier: >> >> 1) Don't reserve a crashkernel area in the traditional way, instead reserve >> that area using CMA. It can be used for MOVABLE allocations. >> >> 2) Let kexec load the crashkernel+initrd into ordinary memory only >> (consuming as much as you would need there). >> >> 3) On kexec, relocate the crashkernel+initrd into the CMA area (overwriting >> any movable data in there) >> >> 4) In makedumpfile, don't dump any memory that falls into the crashkernel >> area. It might already have been overwritten by the second kernel > > This is more or less what Jiri is looking into. > Ah, very nice. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2023-07-27 9:42 UTC | newest]

Thread overview: 23+ messages -- links below jump to the message on this page --
2023-07-18 22:01 collision between ZONE_MOVABLE and memblock allocations Ross Zwisler
2023-07-19  5:44 ` Mike Rapoport
2023-07-19 22:26 ` Ross Zwisler
2023-07-21 11:20 ` Mike Rapoport
2023-07-26  7:49 ` Michal Hocko
2023-07-26 10:48 ` Mike Rapoport
2023-07-26 12:57 ` Michal Hocko
2023-07-26 13:23 ` Mike Rapoport
2023-07-26 14:23 ` Michal Hocko
2023-07-19  6:14 ` Michal Hocko
2023-07-19  7:59 ` Mike Rapoport
2023-07-19  8:06 ` Michal Hocko
2023-07-19  8:14 ` David Hildenbrand
2023-07-19 23:05 ` Ross Zwisler
2023-07-26  8:31 ` David Hildenbrand
2023-07-19 22:48 ` Ross Zwisler
2023-07-20  7:49 ` Michal Hocko
2023-07-20 12:13 ` Michal Hocko
2023-07-24 16:56 ` Ross Zwisler
2023-07-26  8:44 ` David Hildenbrand
2023-07-26 13:08 ` David Hildenbrand
2023-07-27  8:18 ` Michal Hocko
2023-07-27  9:41 ` David Hildenbrand